RolfRolles
November 23rd, 2009, 19:41
BitBlaze has just released their TEMU extension of QEMU as open source, which is a whole-system dynamic taint analysis platform.  If you know what that means, just be happy and go ahead and download it here:  http://bitblaze.cs.berkeley.edu/release/index.html.
For those of you who don't, if you've spent much time reverse engineering, you're probably intuitively familiar with the concept. Let's say you want to figure out how an application (e.g. a daemon) processes input (e.g. network input). You probably begin by setting an execution breakpoint on the APIs responsible for reading said input (e.g. recv()), and then running the process. When the breakpoint fires, you set a data breakpoint on the buffer into which the data is copied. If the data gets copied again, you set a new data breakpoint on the destination buffer. If some copy of the data is overwritten, you delete the breakpoint on that copy. If the data is manipulated in any other way, say by a portion of it being copied into a register and perhaps having arithmetic operations performed upon it, you make a note that the register is input-dependent. Continuing in this fashion, one can obtain a complete listing of how the application manipulates its input. From here, you may want to inspect the security properties of the code involved, e.g. ensuring that some portion of the input does not overflow a stack buffer, that memory allocated based on the input is not subject to integer overflows, etc.
Dynamic taint analysis piggybacks upon the existing capabilities of whole-system dynamic translators in order to automate this process on a whole-system basis. Basically, the user of a dynamic taint analyzer marks certain sources of input as tainted, and the system automatically propagates the taint throughout the system (e.g. from the network driver the whole way down into the user-level application). For the original paper on dynamic taint analysis, see here: http://bitblaze.cs.berkeley.edu/papers/taintcheck.pdf; an extended version of that paper is available here: http://bitblaze.cs.berkeley.edu/papers/taintcheck-full.pdf. TEMU in particular is rather sophisticated: according to this summary http://bitblaze.cs.berkeley.edu/papers/bitblaze_iciss08.pdf (which you should read), it's able to track taint throughout the file system as well, so if tainted memory happens to be swapped out to disk or written to a file, and then accessed again later, TEMU will behave correctly.
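To make the bookkeeping concrete, here is a minimal sketch of byte-level taint propagation on a toy machine. It is only an illustration of the idea, not TEMU's implementation or API: the machine_t structure and every function name below are hypothetical, and a real system hooks the emulator's instruction translation rather than calling helpers like these. Each memory byte and register carries a shadow flag, which plays the role of the data breakpoints and register notes in the manual process described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MEM_SIZE 256
#define NUM_REGS 4

/* Toy machine state (hypothetical, not TEMU's data structures):
   every memory byte and every register has a shadow taint flag. */
typedef struct {
    uint8_t  mem[MEM_SIZE];
    uint8_t  mem_taint[MEM_SIZE];  /* 1 = derived from marked input */
    uint32_t reg[NUM_REGS];
    uint8_t  reg_taint[NUM_REGS];
} machine_t;

void machine_init(machine_t *m) { memset(m, 0, sizeof *m); }

/* Taint source: models input (e.g. from recv()) landing at addr. */
void input_arrives(machine_t *m, size_t addr, const uint8_t *data, size_t n)
{
    assert(addr <= MEM_SIZE && n <= MEM_SIZE - addr);
    memcpy(&m->mem[addr], data, n);
    memset(&m->mem_taint[addr], 1, n);
}

/* Memory-to-memory copy: the taint follows the data byte for byte. */
void mem_copy(machine_t *m, size_t dst, size_t src, size_t n)
{
    assert(dst <= MEM_SIZE && n <= MEM_SIZE - dst);
    assert(src <= MEM_SIZE && n <= MEM_SIZE - src);
    memmove(&m->mem[dst], &m->mem[src], n);
    memmove(&m->mem_taint[dst], &m->mem_taint[src], n);
}

/* Storing a constant overwrites the data, so the taint is cleared. */
void store_const(machine_t *m, size_t addr, uint8_t value)
{
    assert(addr < MEM_SIZE);
    m->mem[addr] = value;
    m->mem_taint[addr] = 0;
}

/* Loading a byte into a register carries the byte's taint along. */
void load_byte(machine_t *m, int r, size_t addr)
{
    assert(r >= 0 && r < NUM_REGS && addr < MEM_SIZE);
    m->reg[r] = m->mem[addr];
    m->reg_taint[r] = m->mem_taint[addr];
}

/* Arithmetic: the result is tainted if either operand is tainted. */
void add_regs(machine_t *m, int dst, int a, int b)
{
    assert(dst >= 0 && dst < NUM_REGS);
    assert(a >= 0 && a < NUM_REGS && b >= 0 && b < NUM_REGS);
    m->reg[dst] = m->reg[a] + m->reg[b];
    m->reg_taint[dst] = (uint8_t)(m->reg_taint[a] | m->reg_taint[b]);
}
```

Calling input_arrives() on a buffer, then mem_copy(), load_byte() and add_regs(), leaves the shadow flags set on exactly the bytes and registers you would have been watching by hand.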
Like all reverse engineering tools, there are a few limitations; this paper http://bitblaze.cs.berkeley.edu/papers/influence_plas09.pdf has a survey of them. Basically, there exists a fundamental question of when taint should be propagated. For instance, if a tainted value is used as the index into an array of data, should the result be considered tainted? Answering "yes" in all cases leads to noise in the system; answering "no" in all cases leads to missed opportunities for tracking legitimately interesting taint scenarios, e.g. translation of keyboard scan codes. Another example is control-dependent taint propagation; consider the following code:
while (input[i])
{
    switch (input[i])
    {
        case 0: output[i] = 0; break;
        case 1: output[i] = 1; break;
        /* ... */
    }
    ++i;
}
The output bytes do not exhibit a direct data dependency on the input bytes, and so taint is not propagated by default. This paper http://bitblaze.cs.berkeley.edu/papers/panorama.pdf describes how TEMU can be used to taint individual instructions to propagate taint in circumstances where the default would be not to do so. I didn't see anything in TEMU's user manual describing how to do this manually, so this type of modification might involve some programming.
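Both policy questions (tainted array indices and control dependencies) can be sketched as switches on a propagation policy. This is my own illustration under stated assumptions, not how TEMU or the Panorama paper implements it: tbyte_t, policy_t, lookup() and copy_by_switch() are hypothetical names, and the case values are arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A byte together with its taint flag (hypothetical representation). */
typedef struct { uint8_t value; uint8_t tainted; } tbyte_t;

/* Policy knobs (hypothetical): which indirect flows propagate taint. */
typedef struct {
    int taint_through_index;    /* tainted array index taints the result  */
    int taint_through_control;  /* tainted branch condition taints stores */
} policy_t;

/* Table lookup: the table holds untainted constants, so the result is
   tainted only if the policy propagates taint through the index. */
tbyte_t lookup(const uint8_t *table, size_t table_len, tbyte_t idx,
               const policy_t *p)
{
    tbyte_t out;
    assert(idx.value < table_len);
    out.value = table[idx.value];
    out.tainted = (uint8_t)(p->taint_through_index && idx.tainted);
    return out;
}

/* Same shape as the switch loop above: every output byte is a constant
   selected by a branch on the input byte, so there is no direct data
   flow from input to output. Returns the number of bytes written. */
size_t copy_by_switch(const tbyte_t *input, tbyte_t *output, size_t cap,
                      const policy_t *p)
{
    size_t i = 0;
    while (i < cap && input[i].value != 0) {
        uint8_t v;
        switch (input[i].value) {
        case 1:  v = 1;    break;
        case 2:  v = 2;    break;
        default: v = 0xFF; break;
        }
        output[i].value = v;
        /* Only a control-dependence policy marks this store as tainted. */
        output[i].tainted =
            (uint8_t)(p->taint_through_control && input[i].tainted);
        ++i;
    }
    return i;
}
```

With both knobs off, the output of copy_by_switch() comes out untainted even though it is a byte-for-byte copy of tainted input, which is exactly the blind spot described above; turning both knobs on recovers the taint at the cost of more noise elsewhere.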
Dynamic taint analysis is not merely interesting in isolation. A few months ago, BitBlaze also released their VinE static analysis platform. VinE can work upon instruction traces provided by TEMU in order to provide various additional advanced analyses. One such analysis is mixed concrete and symbolic execution, which is able to answer questions beginning with "how must I modify the input in order to" and ending with things like "take the other side of this branch" or "cause this memory allocation to be subject to an integer overflow". This is how tools such as Microsoft's SAGE white-box fuzzer work.
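As an illustration of the integer overflow question, here is the kind of allocation-size computation such a query would target. This is a hand-written sketch, not VinE output: ELEM_SIZE and all function names are hypothetical, and smallest_overflowing_count() simply computes by hand the answer that a constraint solver would be asked to find.

```c
#include <stdint.h>

#define ELEM_SIZE 16u  /* hypothetical per-element size in bytes */

/* Vulnerable pattern: the 32-bit multiplication silently wraps, so a
   large input-controlled count yields a small allocation size. */
uint32_t alloc_size_unchecked(uint32_t count)
{
    return (uint32_t)(count * ELEM_SIZE);
}

/* Checked variant: returns 1 and stores the size, or returns 0 if the
   multiplication would overflow 32 bits. */
int alloc_size_checked(uint32_t count, uint32_t *size)
{
    if (count > UINT32_MAX / ELEM_SIZE)
        return 0;
    *size = count * ELEM_SIZE;
    return 1;
}

/* Smallest count for which the unchecked computation wraps. */
uint32_t smallest_overflowing_count(void)
{
    return UINT32_MAX / ELEM_SIZE + 1u;
}
```

If count comes from tainted input, the trace tells you which input bytes feed the multiplication, and the solver's job is to find values for those bytes that push count past the bound.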
In summary, TEMU is a powerful system by itself and also in combination with VinE, and I have no doubt that its release will alter the landscape of manual reverse engineering permanently, particularly vulnerability analysis. If it becomes popular, which I assume it will, I imagine that malware authors will begin applying countermeasures such as the snippet supplied above; vulnerability analysis will most likely not become subject to these concerns.
What are you waiting for? Install TEMU, play with it, and write blog entries, articles, and security conference presentations based on it. I'm sure contributing useful patches upstream would also be appreciated. Thank your friends in academia for advancing the state of the art among practitioners of reverse engineering.



