LLXX
October 15th, 2007, 22:38
Certainly many of you have wondered if "perfect" decompilation of, for example, a Win32 PE compiled with C is possible.
By "perfect" decompilation, I mean a high-level source code which will compile to produce a program or fragment thereof which reproduces the behavior of the original low-level Asm equivalently. The resulting compiled output need not be the exact instructions found in the original code; it only suffices that they produce the same functionality. Similarly, names comments and other essentially superfluous information in the original source code can be completely ignored since they do not affect the functioning of the code.
I am not saying that 100% equivalent decompilation may be possible; I am saying that it IS. In the upcoming series of articles I'll be posting here, I will gradually develop a decompiler which will be able to decompile almost any arbitrary sequence of 32-bit x86 Asm to its C equivalent.
Here are several examples of its output: http://www.woodmann.com/forum/showpost.php?p=66595&postcount=3
http://www.woodmann.com/forum/showthread.php?t=10155
(I've been keeping this project in relative silence for the moment, as [1] I haven't had much time to work on it and [2] the decompiler still can't get it quite right all the time and I don't want to release a half-baked job like the half a dozen others out there that claim to have a "decompiler". But after reading my series of articles you *should* be able to construct your own, with a decent amount of programming knowledge.)
When I stated "arbitrary" code, I meant that the output of any compiled language could be decompiled into C; for example, given the following standard example Asm Fibonacci sequence calculation function:
Code:
...and the specification:
Code:
The decompiler should produce
Code:
Note that an extra variable xchg_temp has been introduced due to the fact that C does not have an exchange operator, but otherwise the decompiler has faithfully reproduced the semantics of the original code. Compiling the above fragment (after minor modifications to remove register specifications etc. -- I may modify a C compiler to ignore or even use them at some future date...) in VC98 yields the following Asm code:
Code:
Certainly quite different from the original Asm version, but except for the use of a stack parameter instead of register paramters, equivalent nonetheless. The decompiler gives the following output for the above after specifying the calling conventions (I've chosen to preserve register names):
Code:
After recompiling, we get something a little different again:
Code:
...but the decompiler gives the following output:
Code:
Again slightly different (and larger this time because for some reason the compiler decided to use a dual-test loop) both in terms of register usage and the choice of opening up ESI inside the loop body, but after the decompiler performs some statement sequencing rearrangement and loop determination, the end result is equivalent.
I won't continue with the compile -> decompile -> compile loop any further (you'll find that the final C fragment above reaches equilibrium, i.e. it produces the same last Asm listing, which then decompiles into the same C fragment, and so on, ad infinitum), but this is ultimately the result of the decompiler which I shall discuss the design of in the following postings.
By "perfect" decompilation, I mean a high-level source code which will compile to produce a program or fragment thereof which reproduces the behavior of the original low-level Asm equivalently. The resulting compiled output need not be the exact instructions found in the original code; it only suffices that they produce the same functionality. Similarly, names comments and other essentially superfluous information in the original source code can be completely ignored since they do not affect the functioning of the code.
I am not saying that 100% equivalent decompilation may be possible; I am saying that it IS. In the upcoming series of articles I'll be posting here, I will gradually develop a decompiler which will be able to decompile almost any arbitrary sequence of 32-bit x86 Asm to its C equivalent.
Here are several examples of its output: http://www.woodmann.com/forum/showpost.php?p=66595&postcount=3
http://www.woodmann.com/forum/showthread.php?t=10155
(I've been keeping this project in relative silence for the moment, as [1] I haven't had much time to work on it and [2] the decompiler still can't get it quite right all the time and I don't want to release a half-baked job like the half a dozen others out there that claim to have a "decompiler". But after reading my series of articles you *should* be able to construct your own, with a decent amount of programming knowledge.)
When I stated "arbitrary" code, I meant that the output of any compiled language could be decompiled into C; for example, given the following standard example Asm Fibonacci sequence calculation function:
Code:
Code:
:fib | push edx | push ecx
xor eax eax | cdq | inc eax
:fibloop
add eax edx | xchg edx
loop :fibloop
pop ecx | pop edx | ret
...and the specification:
Code:
Code:
int:eax fib(int count:ecx)
The decompiler should produce
Code:
Code:
int fib(int count:ecx) {
int a, b; // a:eax b:edx
int xchg_temp;
b = 0;
a = 1;
while(--count) {
a += b;
xchg_temp = a;
a = b;
b = xchg_temp;
}
return a;
}
Note that an extra variable xchg_temp has been introduced due to the fact that C does not have an exchange operator, but otherwise the decompiler has faithfully reproduced the semantics of the original code. Compiling the above fragment (after minor modifications to remove register specifications etc. -- I may modify a C compiler to ignore or even use them at some future date...) in VC98 yields the following Asm code:
Code:
Code:
00401009 56 push esi
0040100A 8B742408 mov esi,[esp][0008]
0040100E 6A01 push 01
00401010 33C9 xor ecx,ecx
00401012 58 pop eax
00401013 4E dec esi
00401014 7409 je 0040101F
00401016 8D1408 lea edx,[eax][ecx]
00401019 8BC1 mov eax,ecx
0040101B 8BCA mov ecx,edx
0040101D EBF4 jmps 00401013
0040101F 5E pop esi
00401020 C3 retn
Certainly quite different from the original Asm version, but except for the use of a stack parameter instead of register paramters, equivalent nonetheless. The decompiler gives the following output for the above after specifying the calling conventions (I've chosen to preserve register names):
Code:
Code:
int fib(int count) {
int _esi = count;
int _ecx = 0;
int _eax = 1;
int _edx;
while(--_esi) {
_edx = _eax + _ecx;
_eax = _ecx;
_ecx = _edx;
}
return _eax;
}
After recompiling, we get something a little different again:
Code:
Code:
00401009 8B542404 mov edx,[esp][0004]
0040100D 33C9 xor ecx,ecx
0040100F 4A dec edx
00401010 6A01 push 01
00401012 85D2 test edx,edx
00401014 58 pop eax
00401015 740C je 00401023
00401017 56 push esi
00401018 8D3408 lea esi,[eax][ecx]
0040101B 8BC1 mov eax,ecx
0040101D 4A dec edx
0040101E 8BCE mov ecx,esi
00401020 75F6 jne 00401018
00401022 5E pop esi
00401023 C3 retn
...but the decompiler gives the following output:
Code:
Code:
int fib(int count) {
int _edx = count;
int _ecx = 0;
int _eax = 1;
while(--_edx) {
int _esi = _eax + _ecx;
_eax = _ecx;
_ecx = _esi;
}
return _eax;
}
Again slightly different (and larger this time because for some reason the compiler decided to use a dual-test loop) both in terms of register usage and the choice of opening up ESI inside the loop body, but after the decompiler performs some statement sequencing rearrangement and loop determination, the end result is equivalent.
I won't continue with the compile -> decompile -> compile loop any further (you'll find that the final C fragment above reaches equilibrium, i.e. it produces the same last Asm listing, which then decompiles into the same C fragment, and so on, ad infinitum), but this is ultimately the result of the decompiler which I shall discuss the design of in the following postings.