Localised Code Analysis And The Art Of Nanomite Filtering [Archive]

Admiral

September 11th, 2005, 09:36

If you don't care for Armadillo, you may want to skip to the second-from-last paragraph.

I'm coding a Nanomite recovery tool (for Armadillo) and so far it's going excellently. I've been able to recover the conditional jump length, type and destination for every potential Nanomite (i.e. each 0xCC byte in the code section) by means of brute-forcing Armadillo into processing each 0xCC as an INT3 exception sixty-four times (once for each combination of the relevant flags) and hooking GetThreadContext & SetThreadContext. I do this for compatibility - so that the tool should work with all existing versions of Armadillo and possibly some future ones.
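As a rough illustration of the flag brute-forcing described above, the sketch below recovers a jump type from a taken/not-taken truth table. It assumes the sixty-four runs correspond to the six status flags (CF, PF, AF, ZF, SF, OF); `identify_jump` and the `was_taken` oracle are hypothetical names standing in for whatever the hooked Get/SetThreadContext calls report, and jumps that depend on ECX are not covered.

```python
from itertools import product

FLAGS = ("CF", "PF", "AF", "ZF", "SF", "OF")  # assumption: the six status flags

# The 16 x86 condition codes as predicates over a dict of flag values.
CONDITIONS = {
    "JO":  lambda f: f["OF"],
    "JNO": lambda f: not f["OF"],
    "JB":  lambda f: f["CF"],
    "JAE": lambda f: not f["CF"],
    "JE":  lambda f: f["ZF"],
    "JNE": lambda f: not f["ZF"],
    "JBE": lambda f: f["CF"] or f["ZF"],
    "JA":  lambda f: not (f["CF"] or f["ZF"]),
    "JS":  lambda f: f["SF"],
    "JNS": lambda f: not f["SF"],
    "JP":  lambda f: f["PF"],
    "JNP": lambda f: not f["PF"],
    "JL":  lambda f: f["SF"] != f["OF"],
    "JGE": lambda f: f["SF"] == f["OF"],
    "JLE": lambda f: f["ZF"] or f["SF"] != f["OF"],
    "JG":  lambda f: not f["ZF"] and f["SF"] == f["OF"],
}


def all_flag_states():
    """Yield every combination of the six flags (2**6 = 64 states)."""
    for bits in product((False, True), repeat=len(FLAGS)):
        yield dict(zip(FLAGS, bits))


def identify_jump(was_taken):
    """Match an observed truth table against the known condition codes.

    was_taken(flags) is a hypothetical oracle: it returns True when the
    packer redirected EIP to the jump destination for that flag state.
    """
    observed = [bool(was_taken(f)) for f in all_flag_states()]
    if all(observed):
        return "JMP"          # taken regardless of flags
    if not any(observed):
        return None           # never taken: no jump recovered here
    for name, cond in CONDITIONS.items():
        if [bool(cond(f)) for f in all_flag_states()] == observed:
            return name
    return None               # pattern matches no flag-only condition
```

For example, an oracle that behaves like "ZF set, or SF differs from OF" is identified as `JLE`.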

Now I'm faced with the task of patching the jumps back into the child process's code ready for dumping. But here's the problem: Armadillo prepares a table with an entry not only for each 0xCC that replaces a stolen jump, but for every other 0xCC as well. So I end up with a table that has an entry for each genuine nanomite, but also lots of entries pertaining to 0xCCs that appear mid-instruction, and so will never cause an exception.
Hence patching them all in would fix all the Nanomites but screw up some instructions that never needed repairing at all.

As I see it, I have two major options, each with a few variations:

I could emulate the master process and patch the jumps in as the INT3s occur. This could be done easily using a loader, but that kind of defeats the object of unpacking it.
I could alternatively inject some code (and redirect the entry point) to install an exception handler to deal with the INT3s as they occur within the process itself. This is also a bit messy. I'd rather not have to do either of these.

The other idea is to try and analyse the dead-listing containing the nanomites, and determine which 0xCCs need fixing and which don't. I'm guessing this is trickier than it sounds.
I could disassemble the entire code section to find out which occur at the beginning of an instruction and which don't. In theory this would solve all my problems, but I don't exactly fancy coding the next IDA just so I can patch a few nanomites.

What I really want to do is to be able to analyse localised chunks of code (surrounding a 0xCC, naturally) and determine whether the byte would execute as an INT3 or if it's part of another instruction. I know OllyDbg can do this on the fly without batting an eyelid, but I'm uncertain that this can be done with 100% accuracy, and am hence reluctant to give it a go.

If anyone has an idea how such an algorithm should go, or if you have a different idea, I'd love to hear from you.

Admiral

q137

September 11th, 2005, 16:31

Why not write a loader/patcher? The loader would patch any nanomites it found while running the process. Once you believe you have all the nanomites patched, you would just start the son process by itself. If you later run into an error, you would start the loader/patcher again and go to the part of the program that executed the missed nanomite, and the loader would apply the patch.

q137

LLXX

September 12th, 2005, 02:10

You might be disappointed to know that something like this has already been made. It is, not surprisingly, called the Armadillo Nanomites Recoverer, and produced by the Tsrh group. It uses the tables found in the 'dillo itself to correctly patch the jump instructions; even though the erroneous entries are present in the table, the tool appears to function correctly. (The documentation is in either Russian or Chinese; I have no understanding of either, so I don't know how it works, nor do I want to reverse it to find out.)

The other alternative, analysing the code flow to differentiate between CCs that were supposed to be Jcc instructions and CCs that are not (e.g. a call 4010cc), is a rather difficult task, requiring flow analysis not unlike that of IDA (maybe even better than IDA, as even it doesn't recognise all possible branches of flow, requiring the user to intervene). This is done by building a multiway tree of the execution path, tracing through the tree (depth first or breadth first, either way works) and replacing each CC byte that occurs at the start of an instruction with its correct Jcc, then adding the branch of the Jcc to the tree. Continue until all the branches have been searched. This method clearly needs a large amount of processing power (in effect, you are executing every single byte of code that is contained within the program) but ensures that every CC that could possibly be encountered as an instruction is resolved correctly.
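A minimal sketch of that traversal, using a worklist instead of an explicit tree. The `decode` function is a toy that knows only a handful of 32-bit opcodes, and the table layout (offset mapped to jump length and destination) is an assumption made for illustration; a real tool would need a full x86 decoder and would also have to follow calls and indirect branches.

```python
def decode(code, off):
    """Toy decoder: returns (length, branch_target_or_None, falls_through)."""
    op = code[off]
    if op in (0x90, 0x50):                 # nop, push eax
        return 1, None, True
    if op in (0xC3, 0xCC):                 # ret, or a genuine int3: flow stops
        return 1, None, False
    if op == 0xB8:                         # mov eax, imm32
        return 5, None, True
    if op == 0xEB or 0x70 <= op <= 0x7F:   # jmp rel8 / jcc rel8
        rel = int.from_bytes(code[off + 1:off + 2], "little", signed=True)
        return 2, off + 2 + rel, op != 0xEB
    raise ValueError(f"opcode {op:#04x} not handled by this toy decoder")


def find_real_nanomites(code, entry, table):
    """Trace reachable instructions from `entry`.

    `table` maps every candidate 0xCC offset to (jump_length, destination).
    Only candidates reached at an instruction boundary are reported.
    """
    real, seen, work = set(), set(), [entry]
    while work:
        off = work.pop()
        if off in seen or not 0 <= off < len(code):
            continue
        seen.add(off)
        if code[off] == 0xCC and off in table:
            length, dest = table[off]
            real.add(off)
            work += [off + length, dest]   # both arms of the stolen Jcc
            continue
        length, target, falls = decode(code, off)
        if target is not None:
            work.append(target)
        if falls:
            work.append(off + length)
    return real


# mov eax, 0xCC | <nanomite, 2 bytes> | nop | ret
CODE = bytes.fromhex("B8 CC 00 00 00 CC 90 90 C3")
TABLE = {1: (2, 4), 5: (2, 8)}             # the entry at offset 1 is an operand byte
REAL = find_real_nanomites(CODE, 0, TABLE)
```

Here only offset 5 is reported; the CC at offset 1 is never reached as an instruction start, so its table entry can be left unpatched.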

Your idea of "localised code analysis", unfortunately, requires much more analysis intelligence than even IDA itself has. Even a reverser (the person, not the program) would have some difficulty with this, as x86 instructions are of various lengths and finding the boundaries between them is very difficult. Just by looking at, e.g., ±127 bytes either side of the CC itself, one can make educated guesses at the instruction boundaries, using heuristics, e.g. a Jcc is likely to be following some sort of CMP instruction, but that is no guarantee that the CC is not a part of a CMP AL, CC! x86 instructions are meant to be read forwards, not backwards. For example, here is a stream of bytes with a CC in it.

30 57 60 a8 cc 7f 60 13 c7 88 04 4b 70 12

Although the above is actually 16-bit code, it serves its exemplary purpose. Disassembling starting from the first byte yields:

xor [bx+60] dl | test al cc | jg ... | adc ax di | mov [si] al | dec bx | jo ...

From the second byte...

push di | pusha | test al cc | jg ... | adc ax di | mov [si] al | dec bx | jo ...

From the fifth byte...

int 3 | jg ... | adc ax di | mov [si] al | dec bx | jo ...

In this example, it "synchronised" itself after a few bytes, but some sequences of instructions tend to desynchronise further and further, making it difficult to ascertain the exact instruction boundaries. The only way to have near 100% accuracy is to track the flow starting from the entry point, as that is guaranteed to be the start of a logical instruction sequence. Determining whether or not an arbitrary code sequence is aligned on the right instruction boundaries is highly difficult even for the power of the reverser's mind. Identifying meaningful instruction sequences is similarly difficult.
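The self-synchronisation in the byte stream above can be reproduced with a small sketch. The length decoder below is a toy: it only understands the 16-bit opcodes that occur in this particular stream, and `insn_len` and `boundaries` are names invented for the illustration.

```python
STREAM = bytes.fromhex("30 57 60 a8 cc 7f 60 13 c7 88 04 4b 70 12")


def modrm_size16(modrm):
    """Size of a 16-bit ModR/M byte plus any displacement that follows it."""
    mod, rm = modrm >> 6, modrm & 7
    if mod == 3:
        return 1                      # register operand, no displacement
    if mod == 0:
        return 3 if rm == 6 else 1    # rm == 6 means a bare disp16
    return 2 if mod == 1 else 3       # disp8 or disp16


def insn_len(code, off):
    """Instruction length for the few opcodes used in STREAM (toy decoder)."""
    op = code[off]
    if op in (0x13, 0x30, 0x88):      # adc / xor / mov with a ModR/M byte
        return 1 + modrm_size16(code[off + 1])
    if op == 0xA8 or 0x70 <= op <= 0x7F:   # test al, imm8 / jcc rel8
        return 2
    if op in (0x4B, 0x57, 0x60, 0xCC):     # dec bx, push di, pusha, int 3
        return 1
    raise ValueError(f"opcode {op:#04x} not handled by this toy decoder")


def boundaries(code, start):
    """Offsets at which instructions begin when decoding linearly from start."""
    out, off = [], start
    while off < len(code):
        out.append(off)
        off += insn_len(code, off)
    return out
```

Decoding from offsets 0 and 1 never lands on the CC at offset 4, while decoding from offset 4 treats it as `int 3`; all three decodings agree from offset 5 onwards.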

Ricardo Narvaja

September 12th, 2005, 05:59

I think the better idea is to make a loader-debugger. The loader debugs the victim, catches the INT3 exceptions and keeps a stored table of the jumps; when it detects an INT3 it replaces it with the correct instruction, and on exit it saves all the changes to the victim. With this you can have the victim about 90% repaired in little time. If you want to get rid of the loader you can repair the few that remain manually; if not, it runs perfectly with the loader.

This is the method I use to repair nanomites in version 3.78 and newer. It is a good method; the difference is that I repair the fake nanomites manually, but if that can be automated it will be quicker.

Keep up the good work
Ricardo

blabberer

September 12th, 2005, 09:41

Quote:

I know OllyDbg can do this on the fly without batting an eyelid, but I'm uncertain that this can be done with 100% accuracy, and am hence reluctant to give it a go.

Well then, start doing it, and the accumulated wisdom available here and elsewhere
(the OllyDbg forum, for example) could then squash the problems that appear along the way (provided you ask appropriately, in appropriate places).

Admiral

September 12th, 2005, 13:17

Thanks for all the replies.

To those who suggested I get a loader to fix the Nanomites for me:
As I said in my post, this would probably be the easiest and most reliable solution, but it is still a little messy. If I keep the loader at all times, then it will act like a Debug-Blocker itself. Alternatively, if I redump the slave process after I believe most of the Nanomites have been fixed, then there's always a possibility that some will be left and there is no telling how many dumps will be necessary to entirely remove the Nanomites.
I'd be happy with this if I were unpacking for my own needs, but ultimately this Nanomite tool will be a part of my ArmInline suite. I'm not convinced that, in the light of ArmInline's 'one click photo fix' setup for repairing Code-Splicing and Import-Shuffling, this 'do some clicking and hope for the best' method would quite cut it.
Nevertheless, beggars can't be choosers, so I may have to stop moaning and get coding.

To LLXX:

I'm not at all disappointed to hear that such a tool already exists. If anything, it gives me more confidence that this style of tool is a plausible solution.
I have never tried 'Armadillo Nanomite Recoverer', but I was under the impression that it no longer works with the new Armadillo. From poking through the WaitForDebugEvent code, it seems that Armadillo 4.1 and later have a much more prudent algorithm working behind the scenes, that keeps the important values well-hidden from the reverser.
Perhaps I have been lied to and this tool does still work, but eventually SRT will get wise to it and change the format of the tables they use to store the Nanomite data.
However, this shouldn't affect ArmInline, as it is designed to spy on Get/SetThreadContext and spoof WaitForDebugEvent calls whilst letting Armadillo do all the work in decoding the Nanomites, independent of how the tables are stored.
Anyway, I'll stop bigging up my algorithm now.

So it turns out there is no panacea. I guess I shouldn't be surprised.
I think I'll rule out the conditional branching analysis method. Like I say, I don't fancy coding the next IDA. And besides, no user wants to sit by for ten minutes while their tool works out the target's life story to the trivial end of fixing the Nanomites.

I'll have a go at seeing how successfully a localised disassembler will work (on chunks of code whose 'solution' I already know) when it relies on the opcodes synchronising themselves. I'll report back with my success rate.

Nobody commented on my other proposed solution:
I won't claim to be an expert at SEH, but if nobody can find a flaw in the theory, this may be my best shot: Inject some code at the (O)EP to install an exception handler to do much the same as what the loader would do. This way I can keep it all in one process and should maintain 100% accuracy. Perhaps I'm overlooking something though.

I'll keep you posted
Admiral

doug

September 12th, 2005, 16:20

Quote:

[Originally Posted by Admiral]
Nobody commented on my other proposed solution:
I won't claim to be an expert at SEH, but if nobody can find a flaw in the theory, this may be my best shot: Inject some code at the (O)EP to install an exception handler to do much the same as what the loader would do. This way I can keep it all in one process and should maintain 100% accuracy. Perhaps I'm overlooking something though.

Except that SEH are chained, and the way you do it (before the OEP), means that an application can always override your exception handler. A lot of applications come with their own "fault handler" & bug report interface like windows has. I'm thinking that this might break your scheme if an application installs its own SEH system that intercepts & "handles" INT3 exceptions (most likely by terminating) before you get the chance to.

Admiral

September 12th, 2005, 16:25

Ah. Well I guess that rules that out then.

LLXX

September 12th, 2005, 20:27

For your testing purposes... here is a fragment of a randomly selected PE in my WINDOWS\SYSTEM directory (it was a DLL) containing a CC in its code section. Notice that it does look like a nanomite in some alignments, and not in other alignments. (I know that this file is definitely not packed at all, but just by examining the fragment you cannot tell.) Please excuse the crude syntax of my linear dasm (I wrote it a long time ago for 16-bit code, and modified it to handle 32-bit code, though not in the prettiest way)... "r" means "relative offset", the "h" prefix is "hexadecimal", and "d" means "dword" (operand width).

Code:



shr al | pop esi | dec edi | mov esi eax | mov ecx esi | call rh77e | push [esi+4]

lea ecx [ebp-h10] | call rhbb7 | lea eax [ebp+hffcffecc] | push eax | call [h4091d0]

lea esi [ebx+h5e0] | add esp hc | inc d[eax]

Code:



call hcfffde8f | mov esi eax | mov ecx esi | ... (synchronised)

Code:



pop esi | dec edi | mov esi eax | mov ecx esi | ... (synchronised)

Code:



jle rh07 | add [eax] al | push [esi+4] ... (synchronised)

Code:



add bh bh | jbe rh04 | lea ecx [ebp-h10] ... (synchronised)

Code:



add al h8d | dec ebp | lock | call rhbb7 | ... (synchronised)

Code:



mov bh hb (mov bh, 0bh) | add [eax] al | ... (synchronised)

Code:



or eax [eax] | add [ebp+hcffecc85] cl | call [eax-1] | adc eax h4091d0 ...

Code:



int 3 | dec bh | call [eax-1] | adc eax h4091d0 | ... (sure looks like a nanomite!)

So it seems that for the most part, x86 opcodes do synchronise eventually, but as you have mentioned that a solution that works most of the time is not acceptable, this clearly is one of those. You, a reverser with a Brain, will find some instruction sequences more sensical and "normal" than others; however, a tool that contains such intelligence would be even more difficult to implement than a traditional tree-based flow analyser. It would have to make a decision: "This CC is a nanomite, it has to be patched", or "This CC is part of an instruction, and must be left alone". We look at the code and make the decision based on "Does this code make sense in this alignment where the CC occurs at the start of a boundary?" or the opposite, "Does this code make sense when the CC is part of an instruction's operand?" It is clear that deciding what code "makes sense" is quite difficult for a machine.

nikolatesla20

September 13th, 2005, 12:18

One solution I found a while ago to this problem is to simply rip the nano tables and create your own "NanoWrapper" that is just a debugger that is wrapped around the program, and contains the nanomite tables. When a 0xCC is triggered, the wrapper debugger does the job of what Arma would normally do.

In this way you can then just create a generalized wrapper exe, and then the tool rips the nano tables from the target, attaches the wrapper exe (could put the unpacked nano'd program in as a resource to be extracted and run at run time) and then runs the "child" and emulates the nano functionality.

I did this very thing a while ago with GetRight when it had nanos, although I wrote it manually: I just ripped the 4 tables and put them in C arrays inside a small debugger exe, which then loaded the nano'd exe and ran it. The debugger took care of the jumps just as real Armadillo would. All I would have needed to add after that was an automatic table ripper or some tool to figure out what the jumps should be, etc. Seems like you may already have that part.

Anyway, doing it this way you don't patch any code so you don't have to worry about valid entries or not. (In fact the reason I DID do it this way was to avoid having to worry about that very thing).

-nt20

Admiral

September 13th, 2005, 13:21

Hi nikola.
That sounds like a tidy way to keep everything in one file, but I still don't like the idea of having two processes (I guess I'm not easily pleased) since unpacking is generally done to allow further reversing. However, you've brought up a couple of ideas I think I'll take advantage of.

I've written a neat (external) loader to do what most people are suggesting, which works fine. I've coded it to report whenever a nanomite is repaired and the results are a little disappointing. It seems that virtually every feature (in my current target) is bursting with yet more nanomites, and so the loader-patcher-saver method would require you to test-drive the app very thoroughly before you could be confident that the loader may be discarded.

So here's my new plan:

Create some 'virus' like code containing all the necessary tables to wrap itself around the unpacked exe (as in nikola's example). This code has two tasks to perform before it passes control to the OEP.
The first thing it does is to install a vectored exception filter (it will be the only one in the chain at this point) to deal with Nanomites in the way you'd expect. Because of the way VEH works, this will trigger before the SEH chain gets touched.
The second thing it does is to 'hook' the AddVectoredExceptionHandler function to ensure that no vectored handlers get placed in front of the wrapper's. If an attempt is made to do so, the wrapper will intercept the call to AddVectoredExceptionHandler, remove its own handler, allow the target to install the new one, then finally place its handler back at the head of the VEH chain where it belongs. When the function returns, the target will be none the wiser.

This should ensure that any INT3 exceptions get handled by the wrapper regardless of how many SEH and VEH filters the target installs. It's also a 100% accurate, one-file, one-process solution. So now's your chance to ruin my day by spotting the logic holes.
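The remove, forward, re-insert dance can be modelled in a few lines. This is a toy model only, not Windows code: `VehChain` stands in for the process-wide vectored handler list, its `add(first, handler)` loosely mirrors the two parameters of AddVectoredExceptionHandler, and `install_wrapper` is a hypothetical name for the injected stub's setup routine.

```python
class VehChain:
    """Toy stand-in for the process-wide vectored handler list."""

    def __init__(self):
        self.handlers = []

    def add(self, first, handler):
        # first=True puts the handler at the head, otherwise at the tail.
        if first:
            self.handlers.insert(0, handler)
        else:
            self.handlers.append(handler)
        return handler

    def remove(self, handler):
        self.handlers.remove(handler)


def install_wrapper(chain, wrapper_handler):
    """Install the wrapper first, then hook add() so it stays first."""
    chain.add(True, wrapper_handler)
    original_add = chain.add            # keep the unhooked entry point

    def hooked_add(first, handler):
        chain.remove(wrapper_handler)             # step aside
        result = original_add(first, handler)     # let the target install
        original_add(True, wrapper_handler)       # return to the head
        return result

    chain.add = hooked_add              # the "hook"


chain = VehChain()
install_wrapper(chain, "wrapper")
chain.add(True, "target_first")
chain.add(False, "target_last")
ORDER = list(chain.handlers)
```

After the two calls made by the "target", the wrapper is still the first handler in the list and the target's own handlers keep the relative order they asked for.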

laola

September 13th, 2005, 21:41

At first sight, it really sounds like a great idea. You are really investing a lot of time into this; I like people having that "mad spirit" while chewing on a particular problem.

However, what happens if the application runs into one of its own (due to bad programming or whatever) 0xCCs that does not belong to Armadillo? Will your solution allow it to quit gracefully? You should make sure to keep maximum compatibility here. I am very keen on seeing a working version of your proposal, once it is implemented into your ArmInline suite.

Admiral

September 13th, 2005, 22:32

Well... Unless the exception is an INT3 and it occurs exactly at a place where a Nanomite resides (or resided), my VEH filter should (upon failing to find a matching jump) pass the exception on to the rest of the VEH chain, which in turn will pass onto the SEH chain. So in theory, any old INT3 (which is generally a nail in the coffin of a compiled app anyway) will go through all of the usual filters and hence will be handled as gracefully as the target program is capable of handling it.
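The decision the filter makes can be sketched as follows. The three constants carry their usual Windows values, but everything else is an assumption for illustration: the table layout (address mapped to length, condition and destination), the dict standing in for the thread context, and the name `nanomite_filter`.

```python
EXCEPTION_BREAKPOINT = 0x80000003
EXCEPTION_CONTINUE_EXECUTION = -1
EXCEPTION_CONTINUE_SEARCH = 0

ZF = 1 << 6  # zero flag bit in EFLAGS


def nanomite_filter(exception_code, address, context, table):
    """Handle an INT3 only if it sits on a known nanomite.

    table maps address -> (jump_length, condition, destination), where
    condition(eflags) says whether the stolen jump would be taken.
    Anything else is passed on untouched to the rest of the chain.
    """
    if exception_code != EXCEPTION_BREAKPOINT:
        return EXCEPTION_CONTINUE_SEARCH
    entry = table.get(address)
    if entry is None:
        return EXCEPTION_CONTINUE_SEARCH      # an ordinary INT3, not ours
    length, condition, destination = entry
    if condition(context["EFlags"]):
        context["Eip"] = destination          # jump taken
    else:
        context["Eip"] = address + length     # fall through past the jump
    return EXCEPTION_CONTINUE_EXECUTION


# Hypothetical table: a 2-byte JE at 0x401000 that targets 0x401020.
TABLE = {0x401000: (2, lambda eflags: bool(eflags & ZF), 0x401020)}
```

With ZF set the context resumes at 0x401020, with ZF clear at 0x401002, and a breakpoint at any address missing from the table is left for the target's own handlers.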
So I can't see how this method could cause many incompatibilities. Certainly no more than Armadillo's existing Nanomite handling.

I'm sure this task will prove to be more difficult than the blueprints suggest, but I'm filled with determination, and I aim to publish any successes I have on this very forum. I'll let you all know if I have any luck.

Thanks for the enthusiasm, laola.
Admiral

LLXX

September 14th, 2005, 02:26

In effect you're emulating the Armadillo, except without the extra process and obfuscated antidebugging code, but using exceptions instead. This is an elegant solution, when it is only needed that the unpacked program runs perfectly, but don't forget that the target will be rather difficult to reverse further, as it is still filled with INT3s.

I have an interesting idea that doesn't involve trying to analyze every single branch of execution, but could potentially nearly restore all of them during the usage of the unpacked program itself - whenever an INT3 occurs, the VEH will handle patching the original instruction back, as well as write that change back to the file on disk. In this case, one shall see a lot of disk activity during the first run, as most of the nanomites in the common flow paths will be patched. As the program is used, more are discovered and corrected, until eventually most of them have been "virginised" and then it becomes easier to reverse further, as now most of the common paths are corrected (the occasional INT3 may still be encountered, but definitely not as many as before).

It is indeed difficult, very difficult, to fully restore the virginity of a program which has been severely raped by the Armadillo!

nikolatesla20

September 14th, 2005, 03:52

It may be true that the app still contains the int3's and it makes it harder to "reverse" , however, I've found most authors rely on arma's protection anyway and don't add their own on top of it. So once you can remove arma you are already set most of the time.

-nt20

Admiral

September 14th, 2005, 08:10

LLXX,
I think I'll do just that. Only I think it'd be easier on the drive heads if I cut out a few unnecessary disk writes: I'll code it to save all the changes in one go, right after the call to OEP returns.
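A sketch of the batched write-back, assuming each recovered nanomite is known as (address, length, condition code, destination) and that only the two standard Jcc encodings occur: the 2-byte short form (70+cc rel8) and the 6-byte near form (0F 80+cc rel32). `encode_jcc` and `apply_patches` are hypothetical helper names, and unconditional jumps are left out.

```python
import struct


def encode_jcc(address, length, cc, destination):
    """Encode a conditional jump located at `address`.

    cc is the 4-bit x86 condition code (e.g. 4 = JE, 5 = JNE).
    The displacement is relative to the end of the instruction.
    """
    rel = destination - (address + length)
    if length == 2:
        return bytes([0x70 | cc]) + struct.pack("<b", rel)
    if length == 6:
        return bytes([0x0F, 0x80 | cc]) + struct.pack("<i", rel)
    raise ValueError("only 2-byte and 6-byte Jcc forms are handled here")


def apply_patches(image, base, patches):
    """Write every recorded jump into the in-memory image in one pass.

    image is a bytearray holding the code section, base is its load
    address, patches is a list of (address, length, cc, destination).
    """
    for address, length, cc, destination in patches:
        off = address - base
        image[off:off + length] = encode_jcc(address, length, cc, destination)
    return image


# Patches collected while the program ran, flushed together at exit.
image = bytearray(b"\xCC\x90" + b"\x90" * 14)
apply_patches(image, 0x401000, [(0x401000, 2, 0x4, 0x401010)])
```

Collecting the tuples in a list while the handler runs and calling `apply_patches` once at exit keeps the disk write to a single pass.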

MiK3_d4_kNiF3

September 15th, 2005, 08:32

If you want to use a SEH you should create it at the OEP, then copy the stack over itself (without the handler). Then your handler should not be overwritten by the program anymore.

If you want to find out if a nanomite really is a nanomite you may try a Length Disassembler Engine. With this you could parse the code section of your executable to find out where an opcode starts and where it ends. Maybe this could help you.

Sorry for my broken english

fighter_81

September 16th, 2005, 03:48

Some time ago, to fix nanos I used a SEH while the app was unpacked. When it reaches an int3, I check which SEH was installed; that of course can't resolve nanos because the program doesn't have a father any more. At this point I make a per-thread handler and install it between the GetThreadContext and the SetThreadContext, where there is a lot of obfuscating code. Then I let Armadillo calculate the new eip and copy this value into the eip where the int3 occurred. I hope this can help you a little. Sorry for my bad English, but I am Italian.
Regards,Fighter_81

upb

September 30th, 2005, 19:57

about the seh thing....
there's one more level up from the VEH, it's
KiUserExceptionDispatcher implemented in ntdll.dll. This is where stuff gets reported into ring3.
Disassemble it and you'll learn a lot about how seh works

Or you can just hook it

user

October 12th, 2005, 19:23

Quote:

[Originally Posted by Admiral]I think I'll rule out the conditional branching analysis method. Like I say, I don't fancy coding the next IDA. And besides, no user wants to sit by for ten minutes while their tool works out the target's life story to the trivial end of fixing the Nanomites.

I'll have a go at seeing how successfully a localised disassembler will work (on chunks of code whose 'solution' I already know) when it relies on the opcodes synchronising themselves. I'll report back with my success rate.

i think some of you treat this 'synchronization' as a magic event that 'eventually' takes place, but if you think about it, you will realize that it must happen after at most 15 bytes (the maximum instruction length of IA-32). that is, any particular sequence of bytes can be disassembled in at most 15 different ways; in practice that'll be even fewer, as no real application has such long instructions. for your case it means that if you want to find the 'true' boundary of an instruction around address X, you have to go back to at most X-15-15 and disassemble from there (and X-15-14, X-15-13, etc), then whatever disassembly comes out most often is the right one. this method will require some heuristics in function prologues only, where you can't trust the control flow that much beforehand.
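A sketch of that voting scheme, demonstrated on the 16-bit byte stream LLXX posted earlier in the thread. The length decoder is a toy that covers only the opcodes in that stream, `boundary_votes` is a hypothetical name, and the 2 x 15 byte window follows the description above.

```python
STREAM = bytes.fromhex("30 57 60 a8 cc 7f 60 13 c7 88 04 4b 70 12")


def insn_len(code, off):
    """Toy 16-bit length decoder for the opcodes that occur in STREAM."""
    op = code[off]
    if op in (0x13, 0x30, 0x88):               # opcode + ModR/M (+ disp)
        mod, rm = code[off + 1] >> 6, code[off + 1] & 7
        disp = {0: 2 if rm == 6 else 0, 1: 1, 2: 2, 3: 0}[mod]
        return 2 + disp
    if op == 0xA8 or 0x70 <= op <= 0x7F:       # test al, imm8 / jcc rel8
        return 2
    if op in (0x4B, 0x57, 0x60, 0xCC):         # one-byte instructions
        return 1
    raise ValueError(f"opcode {op:#04x} not handled by this toy decoder")


def boundary_votes(code, x, length_of, max_len=15):
    """Decode from every start in [x - 2*max_len, x) and count agreement.

    Returns (hits, total): how many decodable start points produce an
    instruction boundary exactly at x, out of all decodable start points.
    """
    hits = total = 0
    for start in range(max(0, x - 2 * max_len), x):
        off = start
        try:
            while off < x:
                off += length_of(code, off)
        except (ValueError, IndexError):
            continue                            # undecodable: no vote
        total += 1
        hits += (off == x)
    return hits, total


CC_VOTES = boundary_votes(STREAM, 4, insn_len)    # the CC byte
NEXT_VOTES = boundary_votes(STREAM, 5, insn_len)  # the byte after it
```

On this stream no start point votes for the CC at offset 4 being an instruction start, while every start point agrees that offset 5 is one, which matches the `test al cc` reading.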

QuickeneR

October 13th, 2005, 04:00

I remember using static analysis (Ida) to find and fix the nanomites. It was based on locating the tables in the dildo, not on looping through CC bytes. Did not run into difficult problems, except having to study all the stuff required to code a framework for this. Of course this was rather long ago (more than a year), may be the things are not so easy these days.

Admiral

October 13th, 2005, 06:17

Quote:

[Originally Posted by user]...then whatever disassembly comes out most often is the right one...

I don't doubt this method would probably work, but it is nevertheless a probabilistic one. I'm not entirely sure of the efficacy of this method but I believe one could contrive a code example which fails to be correctly analysed. Just how often such snippets will occur in compiled code is for anyone to guess, but I'd rather keep my hit-rate as high as possible.

As it happens, my current VEH method is working just fine, so the pursuit of this thread is of little importance to me now. However, I'd still like to hear if anybody has anything insightful to say.

upb: I wasn't aware of this user-exception dispatcher (like I say, I'm no EH guru), but would I be correct in asserting that Copymem-II doesn't work with ring3 applications and so there is no need to concern myself with this function?

nikolatesla20

October 13th, 2005, 07:32

CopyMem-II works with EXEs but not DLLs, and of course it works fine in ring3. It's basically just a debugger. Armadillo launches as a debugger, then unpacks the protected program and runs it. That's why you'll see 2 instances of the program in task manager in a CopyMem-II program. Not only is there the ability to now add nanomites (int 3 bytes) so the debugger gets called, but also the entire memory of the text section of the protected program is set to non-access, so when the program goes to access it the debugger also catches that exception and maps out 0x1000 bytes (1 page) so the program can continue. This is the reason it's called "CopyMem": you can't just dump the child with LordPE since the memory is no-access. Of course you then simply inject a DLL which scans the protected program's VA space 0x1000 bytes at a time and saves it to disk, and all is good.

Then you just have to fix the nanomites (if present). That's what this thread is all about.

-nt20

Admiral

October 13th, 2005, 10:54

Sure, I've been following my thread.
Sorry if I didn't make my question clear. I was just trying to get a free answer without having to do any work.

I was just wondering if any exe files would be able to use ring3 to bypass the VEH/SEH chains.
I've done a little research and from what I gather KiUserExceptionDispatcher is simply the hub of the 'exception pump', as it were. It makes no attempts to handle the exceptions directly, hence hooking it will not have any advantages over holding onto the first place in the VEH chain.
In theory, an exe could itself hook KiUserExceptionDispatcher, effectively short-circuiting my injected VEH. But then this couldn't be any better avoided by hooking KiUserExceptionDispatcher myself, as my hook (installed at OEP) would promptly be overwritten by the target's hook.

On a separate note, does anybody know what the Zw prefix stands for? I understand that Rtl pertains to runtime-libraries, Ki are the kernel-internal functions and Zw correspond to file-access. So why the odd-one-out?

Regards
Admiral

bilbo

October 13th, 2005, 11:41

Quote:

[Originally Posted by The NT Insider, Vol 10, Issue 4]Zw is entirely random and the developers chose it specifically because it could never mean anything

Retrieve that article for other details (from Google cache...)
Regards, bilbo

doug

October 13th, 2005, 14:27

Quote:

[Originally Posted by Admiral]I was just wondering if any exe files would be able to use ring3 to bypass the VEH/SEH chains.

Did you mean ring0? or did you mean something along the lines of an API that disables VEH or SEH chains? Other than removing a VEH or adding another after, I don't think the mechanism can be safely disabled.

Quote:

hence hooking it will not have any advantages over holding onto the first place in the VEH chain.

Except that the VEH chain is XP+ only. Hooking KiUserExceptionDispatcher gives you compatibility on windows 2000 as well. But indeed on XP, if you make sure that you are always at the head of the chain, VEH is fine.

LLXX

October 13th, 2005, 16:20

Quote:

[Originally Posted by QuickeneR]It was based on locating the tables in the dildo, not on looping through CC bytes.

I'm not normally observant of spelling errors but this one really caught my attention. Just for clarification it's spelled 'dillo.

Quote:

[Originally Posted by Admiral]On a separate note, does anybody know what the Zw prefix stands for? I understand that Rtl pertains to runtime-libraries, Ki are the kernel-internal functions and Zw correspond to file-access. So why the odd-one-out?

Actually, if you rotate your head 180° and look, it becomes "MZ" which is the initials of one of the original programmers at M$, Mark Zbikowski. (He designed the DOS EXE format which bears his name). Interesting isn't it?