
View Full Version : A Guide to Decompiler Design - Part 0


LLXX
October 15th, 2007, 22:38
Certainly many of you have wondered if "perfect" decompilation of, for example, a Win32 PE compiled with C is possible.

By "perfect" decompilation, I mean a high-level source code which will compile to produce a program or fragment thereof which reproduces the behavior of the original low-level Asm equivalently. The resulting compiled output need not be the exact instructions found in the original code; it only suffices that they produce the same functionality. Similarly, names comments and other essentially superfluous information in the original source code can be completely ignored since they do not affect the functioning of the code.

I am not saying that 100% equivalent decompilation may be possible; I am saying that it IS. In the upcoming series of articles I'll be posting here, I will gradually develop a decompiler which will be able to decompile almost any arbitrary sequence of 32-bit x86 Asm to its C equivalent.

Here are several examples of its output: http://www.woodmann.com/forum/showpost.php?p=66595&postcount=3
http://www.woodmann.com/forum/showthread.php?t=10155

(I've been keeping this project in relative silence for the moment, as [1] I haven't had much time to work on it and [2] the decompiler still can't get it quite right all the time and I don't want to release a half-baked job like the half a dozen others out there that claim to have a "decompiler". But after reading my series of articles you *should* be able to construct your own, with a decent amount of programming knowledge.)

When I said "arbitrary" code, I meant that the output of any compiled language can be decompiled into C. For example, take the following standard Fibonacci-calculation function in Asm:
Code:
:fib | push edx | push ecx
xor eax eax | cdq | inc eax
:fibloop
add eax edx | xchg eax edx
loop :fibloop
pop ecx | pop edx | ret

...and the specification:
Code:
int:eax fib(int count:ecx)

The decompiler should produce
Code:
int fib(int count:ecx) {
    int a, b; // a:eax b:edx
    int xchg_temp;
    b = 0;
    a = 1;
    while(--count) {
        a += b;
        xchg_temp = a;
        a = b;
        b = xchg_temp;
    }
    return a;
}

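To actually feed this through a C compiler, the register annotations have to be stripped. A minimal sketch of that compilable form (assuming the annotations are simply deleted and nothing else changes):
Code:
/* Hypothetical compilable form: register annotations stripped. */
int fib(int count) {
    int a, b;      /* formerly a:eax, b:edx */
    int xchg_temp; /* stands in for the missing exchange operator */
    b = 0;
    a = 1;
    while(--count) {
        a += b;
        xchg_temp = a; /* swap a and b */
        a = b;
        b = xchg_temp;
    }
    return a;
}
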
Note that an extra variable xchg_temp has been introduced, because C does not have an exchange operator; otherwise the decompiler has faithfully reproduced the semantics of the original code. Compiling the fragment (with the register specifications removed, as sketched above -- I may modify a C compiler to ignore or even use them at some future date...) in VC98 yields the following Asm code:
Code:
00401009 56 push esi
0040100A 8B742408 mov esi,[esp][0008]
0040100E 6A01 push 01
00401010 33C9 xor ecx,ecx
00401012 58 pop eax
00401013 4E dec esi
00401014 7409 je 0040101F
00401016 8D1408 lea edx,[eax][ecx]
00401019 8BC1 mov eax,ecx
0040101B 8BCA mov ecx,edx
0040101D EBF4 jmps 00401013
0040101F 5E pop esi
00401020 C3 retn

Certainly quite different from the original Asm version, but except for the use of a stack parameter instead of a register parameter, equivalent nonetheless. Once the calling convention is specified, the decompiler gives the following output for the above (I've chosen to preserve register names):
Code:
int fib(int count) {
    int _esi = count;
    int _ecx = 0;
    int _eax = 1;
    int _edx;
    while(--_esi) {
        _edx = _eax + _ecx;
        _eax = _ecx;
        _ecx = _edx;
    }
    return _eax;
}

After recompiling, we get something a little different again:
Code:
00401009 8B542404 mov edx,[esp][0004]
0040100D 33C9 xor ecx,ecx
0040100F 4A dec edx
00401010 6A01 push 01
00401012 85D2 test edx,edx
00401014 58 pop eax
00401015 740C je 00401023
00401017 56 push esi
00401018 8D3408 lea esi,[eax][ecx]
0040101B 8BC1 mov eax,ecx
0040101D 4A dec edx
0040101E 8BCE mov ecx,esi
00401020 75F6 jne 00401018
00401022 5E pop esi
00401023 C3 retn

...but the decompiler gives the following output:
Code:
int fib(int count) {
    int _edx = count;
    int _ecx = 0;
    int _eax = 1;

    while(--_edx) {
        int _esi = _eax + _ecx;
        _eax = _ecx;
        _ecx = _esi;
    }
    return _eax;
}

Again slightly different, both in register usage and in the choice of opening up ESI inside the loop body (and larger this time, because for some reason the compiler decided to use a dual-test loop), but after the decompiler performs some statement-sequencing rearrangement and loop determination, the end result is equivalent.

I won't continue the compile -> decompile -> compile loop any further (you'll find that the final C fragment above has reached equilibrium, i.e. it produces the same last Asm listing, which then decompiles into the same C fragment, and so on, ad infinitum), but this is ultimately the result of the decompiler whose design I shall discuss in the following postings.

habituallurker
November 5th, 2007, 20:57
Quote:
[Originally Posted by naides]I was particularly enticed by the construct

_loop: Compile -> decompile -> goto _loop

You stated this cycle will eventually reach an equilibrium, which is intuitively plausible...

Is there a mathematical lemma that would formalize your statement?
If this were valid math, then yes, this would be a trivial consequence of the descending chain condition combined with the Knaster-Tarski theorem. In general, you cannot make a statement like this (at least not a statement with any validity behind it) without having a formal, provably correct model of how the compiler works (and taking into account that "fully optimizing" compilers do not and furthermore can never exist).

LLXX's example where the code gets bigger (i.e. Compile(Decompile(X)) is an "extensive function") shows why this cannot be proven. If that were not true (i.e. if Compile(Decompile(X)) were a "reductive function"), then yes, you could say that iterating Compile(Decompile(X)) would eventually reach a fixed point (i.e. Compile(Decompile(X)) = X). There is a sophisticated branch of mathematics dedicated to investigating claims such as this. I suggest LLXX look into order theory -- specifically, lattice theory and the theory of continuous/monotonic functions thereupon. This stuff is the bread and butter behind denotational semantics, which has been dealing with these types of intellectual problems for nearly half a century.
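
To sketch the kind of statement order theory does license in the hypothetical "reductive" case (this formalization is mine, not a claim from the thread):
Code:
% Let f = Compile o Decompile, and let |x| be the size of program x.
% Assume (hypothetically) that f is reductive: |f(x)| <= |x| for all x.
% Then |x| >= |f(x)| >= |f^2(x)| >= ... is a non-increasing sequence
% in the well-founded order (N, <=), so by the descending chain
% condition the sizes are eventually constant.
% Since only finitely many programs have size <= |x| and f is
% deterministic, the orbit must repeat by pigeonhole:
\exists\, m < n : \quad f^m(x) = f^n(x)
% i.e. the iteration enters a cycle; a true fixed point f(y) = y is
% the special case n = m + 1. With the extensive behavior LLXX
% observed, this argument does not apply.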

For reference, added by dELTA:
Quote:
[Originally Posted by LLXX]
Quote:
could it be demonstrated that such a statement -- that equilibrium is always reached -- is always true?
Let's prove the two other cases by contradiction. First, assume that a certain code sequence exists which gets shorter every time it is decompiled and then compiled. If such a sequence performs any operation at all, it must eventually bottom out at a single byte to still be a valid decompilation; if it performs no useful operation at all, e.g. a sequence of alignment NOPs or, more subtly, (inc eax | dec eax | inc eax | dec eax ...), then a valid decompiler must reduce it to zero bytes. In both cases, equilibrium is reached.

Now consider the slightly more difficult case: suppose there exists a sequence of machine-language bytes that would expand without bound through repeated decompilation and compilation. Although some expansion did occur in the example (nearly 2.7 times, to be exact), it quickly reached equilibrium, due to the optimisations performed by the compiler and the analysis performed by the (assumedly perfect) decompiler, which together put a lower bound on the entropy of a given set of semantic actions.

If the code can neither shrink below zero size nor grow without bound, then it must reach an equilibrium size (or a set of equilibrium sizes).
Quote:
In a related vein: could a Compile -> Decompile loop machine, provided you instill some heuristic rules/code-optimization protocol, be sufficient or useful to remove code obfuscation?
Absolutely. Flow-control analysis trivially untangles "spaghetti code" by removing jumps and non-returning calls and by concatenating SESE blocks together; data-flow analysis eliminates useless code (e.g. the common obfuscatory technique of doing unused calculations in registers and then wiping the result by loading a new value into them); and semantic analysis concatenates and simplifies operations (e.g. "mov eax hC728893D | sub eax hC6E8793C | dec eax" becomes "mov eax h00401000"). The main principle behind the decompilation process is to figure out what is being done, not how it is being done.
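
To make that last example concrete, here is a minimal sketch of the semantic-analysis idea, folding a straight-line run of constant operations on one register into a single mov. This is my illustration, not code from the thread; the Op/Insn representation is invented:
Code:
/* Hypothetical sketch: folding a straight-line sequence of
 * constant operations on one register into a single MOV. */
#include <stdint.h>
#include <stdio.h>

typedef enum { OP_MOV, OP_ADD, OP_SUB, OP_DEC, OP_INC } Op;
typedef struct { Op op; uint32_t imm; } Insn;  /* all acting on eax */

/* Symbolically execute the sequence; returns the final constant. */
static uint32_t fold(const Insn *code, int n) {
    uint32_t eax = 0;
    for (int i = 0; i < n; i++) {
        switch (code[i].op) {
            case OP_MOV: eax  = code[i].imm; break;
            case OP_ADD: eax += code[i].imm; break;
            case OP_SUB: eax -= code[i].imm; break;
            case OP_DEC: eax -= 1;           break;
            case OP_INC: eax += 1;           break;
        }
    }
    return eax;
}

int main(void) {
    /* mov eax hC728893D | sub eax hC6E8793C | dec eax */
    Insn seq[] = { {OP_MOV, 0xC728893D}, {OP_SUB, 0xC6E8793C}, {OP_DEC, 0} };
    printf("mov eax h%08X\n", (unsigned)fold(seq, 3));
    return 0;
}
Running it prints "mov eax h00401000", matching the simplification above.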

Woodmann
November 5th, 2007, 21:35
Errrrrrrrrrrrrrrrrrrr..............

Ya, ummm.... are you sure you're in the right place?

Perhaps you should be hangin out in the crypto forum.

Woodmann

habituallurker
November 6th, 2007, 00:10
I sort of am in the wrong place. When you view this link http://www.woodmann.com/forum/blog.php?b=15 it has four comments underneath it, but when you click "Post or View Comments" it takes you into the forum topic for that entry (i.e. this page), which doesn't have those comments. I was responding to the comments in the link above.

dELTA
November 6th, 2007, 04:11
Hey habituallurker, and welcome to the board! You are very welcome to stick around with your seemingly quite advanced math skills, and maybe you could even lure Mike to hang around more often, I'm sure you two would have lots of fun together.

Sorry for the blog comment glitch, we have integrated the blogs and blog comments with the forums now, but before we did that some comments were made directly to the blogs. This is no longer possible though, so it won't happen again.

Btw, I included LLXX's last comment from the original blog comments in your post above, for reference.

Looking forward to more advanced math ramblings from you.

mike
November 8th, 2007, 20:14
Heh. His math may not be over my head, but it's certainly to one side of it!

I'm working on the Google Caja project, an object-capability-secure subset of JavaScript. Think "mash up anything securely."

Some folks on the board may be interested in my recent blog post "Category theory for the Java programmer". It's aimed at programmers (not computer scientists or mathematicians) who have heard of category theory and want to know what it's about.

http://reperiendi.wordpress.com

JMI
November 8th, 2007, 21:02
Sounds like another Blog we should make a part of our offering of outside Blogs!

Regards,

Silver
November 9th, 2007, 12:31
Mike, your blog is very interesting, although I wish I'd stayed awake in math class all those years ago as much of it is over my head. I have enough problems with 3d spatial maths for DirectX, your stuff is way more complex!

Maximus
November 9th, 2007, 13:45
EEEK!!!

You already crossed your brain with hyperplanes, shattered it into hulls (strictly convex!), and projected it with hyperlines?

Now, just to camouflage a useless post made for increasing the post counter, I'll give a half-useless web reference for reversers...

http://www.gamedev.net/reference/articles/abrash/abrash.pdf


LLXX
November 10th, 2007, 05:21
Quote:
[Originally Posted by dELTA;70071]Sorry for the blog comment glitch, we have integrated the blogs and blog comments with the forums now, but before we did that some comments were made directly to the blogs. This is no longer possible though, so it won't happen again.
Would it be relatively easy to pull those last few comments into their corresponding threads here with a bit of database manipulation?
Quote:
LLXX's example where the code gets bigger (i.e. Compile(Decompile(X)) is an "extensive function") shows why this cannot be proven. If that were not true (i.e. if Compile(Decompile(X)) were a "reductive function"), then yes, you could say that iterating Compile(Decompile(X)) would eventually reach a fixed point (i.e. Compile(Decompile(X)) = X). There is a sophisticated branch of mathematics dedicated to investigating claims such as this. I suggest LLXX look into order theory -- specifically, lattice theory and the theory of continuous/monotonic functions thereupon. This stuff is the bread and butter behind denotational semantics, which has been dealing with these types of intellectual problems for nearly half a century.
Is it possible for a certain code sequence to grow without bound every time it's decompiled and compiled? Of course the compiler cannot optimize the code perfectly, but given that my decompiler reduces the initial literal register-register expressions to the simplest C expression that still performs the same operation, is there an upper limit on how "inefficient" and lengthy the input can be before the decompiler fails to simplify it fully? Since the algorithm I use is iterative, it will continue to simplify the expression until it can be simplified no further.

Take the example of a very inefficient compiler which expands addition by a constant into repeated increments, so that the partial expression "+2" in its input results in, e.g., "inc eax | inc eax" in the output, "+4" in "inc eax | inc eax | inc eax | inc eax", etc. The decompiler will simplify the first into the proper "+2", and similarly for the second sequence; recompiling these results in the same sequences as above. Now suppose the compiler inserts "inc eax | dec eax", which does nothing at all, after every "inc eax" instruction that it outputs. Thus "+2" becomes "inc eax | inc eax | dec eax | inc eax | inc eax | dec eax", which the decompiler will also simplify to "+2", and since we are assuming the compiler is deterministic, compiling that will result in the same output as the decompiler's input.
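
A minimal sketch of that iterative simplification, as I imagine it (this is my illustration, not LLXX's actual decompiler; the token representation is invented): repeatedly cancel adjacent inc/dec pairs until no rule fires, then collapse the surviving run into a single "+k":
Code:
/* Hypothetical sketch of the iterative peephole idea. */
#include <stdio.h>

#define MAXTOK 64

int main(void) {
    /* +1 = inc eax, -1 = dec eax:
       "inc | inc | dec | inc | inc | dec"  (the obfuscated "+2") */
    int tok[MAXTOK] = { +1, +1, -1, +1, +1, -1 };
    int n = 6, changed = 1;

    while (changed) {               /* iterate until a fixed point */
        changed = 0;
        for (int i = 0; i + 1 < n; i++) {
            if (tok[i] + tok[i + 1] == 0) {   /* inc/dec pair cancels */
                for (int j = i; j + 2 < n; j++) tok[j] = tok[j + 2];
                n -= 2;
                changed = 1;
                break;              /* rescan from the start */
            }
        }
    }

    int net = 0;                    /* collapse the survivors to "+k" */
    for (int i = 0; i < n; i++) net += tok[i];
    printf("simplified to: eax += %d\n", net);
    return 0;
}
On the obfuscated input it prints "eax += 2", the same fixed point the decompiler would reach.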

In other words, could you give an example of a code sequence, in a high-level language such as C, which would expand every time it was decompiled and recompiled?

(I _am_ working on a part 1 but there are a lot of other things I have to do which are presently more important than these decompiler construction articles.)

janus68
November 10th, 2007, 16:06
If every basic block is represented as an HLIL tree, then every redundant instruction sequence, such as the same operator repeated across two arithmetic or boolean expressions, or a subtraction result compared with zero, could be reduced to its "canonical" representation by a relatively simple recursive function. Both processes, the compiler and the decompiler, attempt to produce more simplified output, so the cycle moves one way, toward simplification. I can't see how ANY code sequence could expand when it gets decompiled/compiled.
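
A minimal sketch of that recursive canonicalization idea (my illustration; janus68 posted no code, and the node kinds and rules are invented): simplify children bottom-up, fold constant subtrees, and rewrite "(x - y) == 0" into "x == y":
Code:
/* Hypothetical sketch: bottom-up canonicalization of an HLIL
 * expression tree. */
#include <stdio.h>
#include <stdlib.h>

typedef enum { CONST, VAR, ADD, SUB, EQ } Kind;

typedef struct Node {
    Kind kind;
    int value;              /* CONST: literal, VAR: variable id */
    struct Node *l, *r;
} Node;

static Node *mk(Kind k, int v, Node *l, Node *r) {
    Node *n = malloc(sizeof *n);
    n->kind = k; n->value = v; n->l = l; n->r = r;
    return n;
}

/* Recursively simplify children first, then apply local rules. */
static Node *canon(Node *n) {
    if (n->kind == CONST || n->kind == VAR) return n;
    n->l = canon(n->l);
    n->r = canon(n->r);
    /* rule: fold constant operands */
    if (n->l->kind == CONST && n->r->kind == CONST) {
        int v = n->kind == ADD ? n->l->value + n->r->value
              : n->kind == SUB ? n->l->value - n->r->value
              :                  n->l->value == n->r->value;
        return mk(CONST, v, NULL, NULL);
    }
    /* rule: (x - y) == 0  becomes  x == y */
    if (n->kind == EQ && n->l->kind == SUB &&
        n->r->kind == CONST && n->r->value == 0)
        return mk(EQ, 0, n->l->l, n->l->r);
    return n;
}

int main(void) {
    /* (x - y) == 0, with x = var 0, y = var 1 */
    Node *e = mk(EQ, 0,
                 mk(SUB, 0, mk(VAR, 0, NULL, NULL),
                            mk(VAR, 1, NULL, NULL)),
                 mk(CONST, 0, NULL, NULL));
    e = canon(e);
    printf("canonical form: %s\n",
           e->kind == EQ && e->l->kind == VAR ? "x == y" : "?");
    return 0;
}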