Armadillo, Nanomites and vectored exception-handling


Ring3 Circus
December 11th, 2007, 05:42
Let me tell you about a problem I ran into a couple of years ago, and the solution I ended up with. If you’ve ever heard of ArmInline, then this is the story behind its Nanomites tool.

The Background

If you’re not already aware, Armadillo is a commercial anti-cracking software scheme for Windows: you buy a license, throw your exe (or DLL) at it, and you end up with a new, protected, file. This new program does just what the old one did, but it’s far harder to reverse-engineer. The goal is to remove the protection so that we can have our wicked way with the program.

Among other things, Armadillo employs a system known as Debug Blocker. Briefly put, this causes the program to create two instances whenever it is run - we call them the ‘parent’ and ‘child’ processes. The parent acts as a user-mode debugger, nannying the child (which does all the real work) to make sure that no bad guys can get too close. This system was fairly easy to defeat - all you needed to do was detach the parent process’s debugger at an appropriate moment and attach your own.

So to prevent this happening, the developers of Armadillo invented what they call Nanomites. When the protector is installed on the program, user-marked parts of the code section are scanned for jump instructions (JZ, JNZ, JBE and so on), and a database is created containing the address, type and offset of each. These jump instructions are patched over with ‘INT 3’s (user-mode breakpoint interrupt) and the database is put in the hands of the debugger. The idea is that the child process will raise a debug-break exception whenever one of these instructions fires, whence the parent steps in, grabs the thread context, looks up the appropriate jump in the database and sets the child process on its merry way.
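
As a rough illustration of the lookup the parent performs, here is a minimal Python sketch. The `Nanomite` record, its field names, the sample address and the `resume_address` helper are all hypothetical; the real table is encrypted and its layout is not described in this post. Only three jump types are modelled, and the table is keyed by the address of the INT 3 itself, which is an assumption made for simplicity.

```python
from dataclasses import dataclass

# EFLAGS bit masks (architectural x86 values).
CF = 1 << 0   # carry flag
ZF = 1 << 6   # zero flag

# Each jump type is a predicate over the EFLAGS value seen at the breakpoint.
CONDITIONS = {
    "JZ":  lambda f: bool(f & ZF),
    "JNZ": lambda f: not f & ZF,
    "JBE": lambda f: bool(f & (CF | ZF)),
}

@dataclass(frozen=True)
class Nanomite:
    """Hypothetical table entry describing one replaced conditional jump."""
    kind: str     # which jump the INT 3 replaced
    length: int   # size in bytes of the original instruction
    offset: int   # signed displacement from the end of that instruction

def resume_address(table, address, eflags):
    """Return where the child should continue after an INT 3 at `address`."""
    entry = table[address]
    fall_through = address + entry.length
    if CONDITIONS[entry.kind](eflags):
        return fall_through + entry.offset   # jump taken
    return fall_through                      # jump not taken

# Made-up example: a 2-byte JZ at 0x401000 that skips 0x10 bytes ahead.
table = {0x401000: Nanomite("JZ", 2, 0x10)}
print(hex(resume_address(table, 0x401000, ZF)))  # ZF set: jump taken
print(hex(resume_address(table, 0x401000, 0)))   # ZF clear: falls through
```

The point of the design is visible even in this toy: without `table`, the 0xCC byte in the child says nothing about where execution should go next.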

This works very well. If the Nanomite-enabled code regions are chosen carefully then performance is virtually unaffected, and any attempt to sever the child-parent bond results in an immediate and unrecoverable crash. Even worse for the would-be cracker, the information needed to recover the code to a working state is locked up in this database, which is encrypted several times over and accessed only by heavily obfuscated, anti-debug-ridden routines. Reverse-engineering this would be a royal pain.

Getting the table

Many successful efforts had been made to reverse this encryption process and produce a working Nanomite table, but with each offence from the crackers came a counter-offence from the developers and pretty soon there were several variants of the Nanomite system floating around. It was time for a unified approach. Being lazy as I am, I insisted on making the computer do as much of the work as possible. So the plan was this:

Write a program to debug the parent process. That is, debug the debugger. With this level of control, it would be reasonably easy to fool the parent into processing Nanomites at our will. Three function hooks need to be created in the parent process:


WaitForDebugEvent - This is the primary source of information for any debugger. With a hook in here, we could forge any conceivable exception and let the parent attempt to handle it.
GetThreadContext - When alerted of an INT 3 exception, the parent calls this to find out where the Nanomite was struck. Another hook and we can feign a Nanomite hit at an arbitrary address.
SetThreadContext - After ploughing through that obfuscated code, the parent will have decided where execution should continue from, and enforces its will by setting the thread context. This last inside-element will help us determine the details of any given Nanomite.

From here the algorithm writes itself. We find all instances of the byte 0xCC (INT 3) in the code section, spoof an INT 3 exception at each of these points and watch how the parent responds. By setting the EFlags register to take different values for the same Nanomite address, we can determine under which circumstances the jump occurs and hence exactly which conditional jump is being emulated. A few switch-statements later and we have a complete Nanomite table, without having to step through a single instruction of Armadillo’s code.
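
The probing idea can be sketched in Python along these lines. This is not ArmInline's code: `find_int3_candidates`, `classify`, `fake_parent` and the `_SECRET` table are hypothetical names, and `fake_parent` merely stands in for the real round trip through the three hooks. The sketch also assumes the fall-through address is always 2 or 6 bytes past the INT 3 (short or near conditional jump), which leaves a rare ambiguous case unresolved.

```python
from itertools import product

# EFLAGS bits that conditional jumps depend on.
CF, PF, ZF, SF, OF = 1 << 0, 1 << 2, 1 << 6, 1 << 7, 1 << 11

# The sixteen x86 condition codes as predicates over EFLAGS.
JCC = {
    "JO":  lambda f: bool(f & OF),
    "JNO": lambda f: not f & OF,
    "JB":  lambda f: bool(f & CF),
    "JAE": lambda f: not f & CF,
    "JZ":  lambda f: bool(f & ZF),
    "JNZ": lambda f: not f & ZF,
    "JBE": lambda f: bool(f & (CF | ZF)),
    "JA":  lambda f: not f & (CF | ZF),
    "JS":  lambda f: bool(f & SF),
    "JNS": lambda f: not f & SF,
    "JP":  lambda f: bool(f & PF),
    "JNP": lambda f: not f & PF,
    "JL":  lambda f: bool(f & SF) != bool(f & OF),
    "JGE": lambda f: bool(f & SF) == bool(f & OF),
    "JLE": lambda f: bool(f & ZF) or bool(f & SF) != bool(f & OF),
    "JG":  lambda f: not f & ZF and bool(f & SF) == bool(f & OF),
}

# All 32 combinations of the five flags above.
FLAG_COMBOS = [sum(bits) for bits in product(*[(0, b) for b in (CF, PF, ZF, SF, OF)])]

def find_int3_candidates(code, base):
    """Return the address of every 0xCC byte in the code section."""
    return [base + i for i, byte in enumerate(code) if byte == 0xCC]

def classify(probe, address):
    """Identify the jump at `address` by probing with every flag combination.

    `probe(address, eflags)` returns the address the parent chose to resume at.
    """
    results = {f: probe(address, f) for f in FLAG_COMBOS}
    destinations = set(results.values())
    if len(destinations) == 1:
        return "JMP", destinations.pop()   # same destination whatever the flags
    # Assumption: the fall-through is 2 (short) or 6 (near) bytes past the INT 3.
    fall = next((address + n for n in (2, 6) if address + n in destinations), None)
    if fall is None or len(destinations) != 2:
        return None                        # does not look like a conditional jump
    taken = {f for f, dest in results.items() if dest != fall}
    for name, cond in JCC.items():
        if {f for f in FLAG_COMBOS if cond(f)} == taken:
            return name, (destinations - {fall}).pop()
    return None

# Stand-in for the hooked parent: a secret table we pretend not to know.
_SECRET = {0x401003: ("JBE", 2, 0x401020), 0x401009: ("JG", 6, 0x400F00)}

def fake_parent(address, eflags):
    name, length, target = _SECRET[address]
    return target if JCC[name](eflags) else address + length

code = bytes.fromhex("909090CC9090909090CC")   # made-up section with two INT 3s
for addr in find_int3_candidates(code, 0x401000):
    print(hex(addr), classify(fake_parent, addr))
```

Because each of the sixteen conditions is taken on a different subset of the 32 flag combinations, the set of "taken" probes identifies the jump uniquely.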

The Real Problem

After all that work, we can just assemble all the jumps from the database into place and dump the process. That’ll be sure to remove all the Nanomites, right? Well, yes, but it turns out that something far nastier happens in the process. See, when Armadillo creates the table in the first place, it doesn’t just store the addresses of the jumps but also creates some false entries at addresses that happen to legitimately contain a 0xCC byte. This means that a completely unrelated ‘CALL DWORD PTR DS:[0043CC7A]’, for instance, will produce a false entry in the table. This entry will never be needed, as the 0xCC is in the middle of an instruction and can’t trigger an exception under normal circumstances, but those clever developers have put us in a real dilly of a pickle.
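
To see why a byte scan cannot tell the difference, it helps to look at how that example instruction is encoded. The short Python sketch below builds the bytes of an indirect call through 0x0043CC7A; the helper name `encode_call_indirect` is made up for illustration.

```python
import struct

def encode_call_indirect(pointer):
    """Encode CALL DWORD PTR [pointer]: opcode FF, ModRM 15, then a 32-bit address."""
    return b"\xFF\x15" + struct.pack("<I", pointer)

insn = encode_call_indirect(0x0043CC7A)
print(insn.hex(" "))       # ff 15 7a cc 43 00
print(insn.index(0xCC))    # 3
```

The 0xCC sits at offset 3, inside the little-endian address operand, so the CPU never executes it as an INT 3, yet a scan for 0xCC bytes reports it just the same.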

There is simply no sure-fire way to weed out the ‘false Nanomites’ from the real ones. Without defeating the object of our plight and writing a purpose-built debugger to do exactly what we didn’t want the parent process doing, how can we fix this?

The Solution

It took a little bit of brainstorming, but this is where vectored exception-handling comes to the rescue. This little-used feature of the Win32 API allows for installation of a process-wide exception-handler that doesn’t depend on stack-frames. Vectored handlers are of limited use in the real world, but just perfect for our needs for the sole reason that the VEH chain is triggered before the SEH chain.

Suppose that we’ve managed to dump and patch the program (and fixed the imports, encrypted pages, code-splicing) so that it runs without the parent. Suppose further that the original program didn’t use any VEH. Then everything works great until a Nanomite triggers: a debug-break fires, promptly falls through all the structured exception-handlers and the process crashes and burns. But if we had a VEH installed, we’d be given a chance to deal with it.

So by adding a new section to the exe containing the Nanomite table along with some code, we can save the day:

Redirect the entry-point to our code, which installs the VEH and jumps straight to the original entry-point.
Have the VEH handle only INT 3 exceptions, searching the database and patching in the appropriate jump instruction when necessary.
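
The patching step in the second item can be sketched as follows. This is a Python model rather than the handler's actual code: `assemble_jump` and `patch` are hypothetical helpers, and the sketch assumes the table records how many bytes the original instruction occupied so that the re-assembled jump fits exactly.

```python
import struct

# Condition-code nibble shared by the short (7x) and near (0F 8x) encodings.
CC = {"JO": 0, "JNO": 1, "JB": 2, "JAE": 3, "JZ": 4, "JNZ": 5, "JBE": 6, "JA": 7,
      "JS": 8, "JNS": 9, "JP": 10, "JNP": 11, "JL": 12, "JGE": 13, "JLE": 14, "JG": 15}

def assemble_jump(kind, address, target, length):
    """Encode the jump `kind` at `address` so that it occupies `length` bytes.

    length 2 -> short form (rel8); 5 -> JMP rel32; 6 -> Jcc rel32.
    """
    rel = target - (address + length)   # displacement counts from the next instruction
    if length == 2:
        if not -128 <= rel <= 127:
            raise ValueError("target out of range for a short jump")
        opcode = b"\xEB" if kind == "JMP" else bytes([0x70 | CC[kind]])
        return opcode + struct.pack("<b", rel)
    if length == 5 and kind == "JMP":
        return b"\xE9" + struct.pack("<i", rel)
    if length == 6 and kind != "JMP":
        return bytes([0x0F, 0x80 | CC[kind]]) + struct.pack("<i", rel)
    raise ValueError("unsupported kind/length combination")

def patch(image, base, address, kind, target, length):
    """Overwrite the INT 3 (and the bytes after it) with the recovered jump."""
    start = address - base
    image[start:start + length] = assemble_jump(kind, address, target, length)

# Made-up image: a short backward JMP at 0x401000 with displacement -128.
image = bytearray(b"\xCC\x90" * 4)
patch(image, 0x401000, 0x401000, "JMP", 0x400F82, 2)
print(image[:2].hex())   # eb80
```

Keeping the encoded length equal to the original length matters: writing a longer form over a short jump would overwrite the instructions that follow it.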

That nearly takes care of everything. The only remaining problem is for programs that use VEHs of their own. It’s unlikely that anybody would implement their own exception handler to deal with breakpoints, but it’s conceivable that a catch-all handler could ruin our best-laid plans. So the last piece of the puzzle is to hook RtlAddVectoredExceptionHandler, telling it to remove our handler before installing the client’s, then put ours back afterwards. In this way, the Nanomite-handler is guaranteed to be the first exception-handler on the scene (be it structured or vectored), and existing functionality is unaffected.
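
The ordering trick can be modelled with a toy handler list in Python. Everything here is a simulation under stated assumptions: `VehChain`, `hooked_add` and the two handlers are invented names, and the model only mirrors the documented behaviour that a nonzero First argument puts a handler at the head of the list and zero puts it at the tail.

```python
EXCEPTION_CONTINUE_EXECUTION = -1   # handler dealt with the exception
EXCEPTION_CONTINUE_SEARCH = 0       # pass it on to the next handler
EXCEPTION_BREAKPOINT = 0x80000003

class VehChain:
    """Toy model of the process-wide vectored handler list."""
    def __init__(self):
        self.handlers = []

    def add(self, first, handler):
        # Nonzero `first` inserts at the head of the list, zero at the tail.
        if first:
            self.handlers.insert(0, handler)
        else:
            self.handlers.append(handler)
        return handler

    def remove(self, handler):
        self.handlers.remove(handler)

    def dispatch(self, code):
        """Return the name of the handler that claimed the exception, if any."""
        for handler in self.handlers:
            if handler(code) == EXCEPTION_CONTINUE_EXECUTION:
                return handler.__name__
        return None   # nobody claimed it: the SEH chain would be tried next

def nanomite_handler(code):
    # Only breakpoints are ours; everything else is passed along untouched.
    if code == EXCEPTION_BREAKPOINT:
        return EXCEPTION_CONTINUE_EXECUTION
    return EXCEPTION_CONTINUE_SEARCH

def client_catch_all(code):
    return EXCEPTION_CONTINUE_EXECUTION   # swallows everything, INT 3 included

def hooked_add(chain, first, handler):
    """Model of the hook: step aside, let the client register, step back in front."""
    chain.remove(nanomite_handler)
    result = chain.add(first, handler)
    chain.add(1, nanomite_handler)
    return result

chain = VehChain()
chain.add(1, nanomite_handler)
hooked_add(chain, 1, client_catch_all)    # client asks to be first
print(chain.dispatch(EXCEPTION_BREAKPOINT))
```

Even though the client asked to be first, the Nanomite handler still sees breakpoints before the catch-all does, while every other exception code reaches the client's handler as before.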



http://www.ring3circus.com/rce/armadillo-nanomites-and-vectored-exception-handling/

evlncrn8
December 11th, 2007, 14:30
Quote:
There is simply no sure-fire way to weed out the ‘false Nanomites’ from the real ones.

sure there is, what about checking if the 0xCC appears on an aligned va?

janus68
December 11th, 2007, 15:07
Few months ago I found a small bug (?) in the VEH (rwb32.bin), certainly where nanomite type 2 (jmp) is handled. The unpacked app seriously crashed, and after some debugging I found that short jumps back (EB 80) are wrongly parsed in the VEH and assembled as long jumps 7 bytes long, so the original code gets destroyed.
My quick, dirty patch of rwb32.bin was this:
Code:

000004C7: 2BC8 sub ecx,eax
000004C9: 8B450C mov eax,[ebp][0C]
000004CC: 8908 mov [eax],ecx
000004CE: 81F982000000 cmp ecx,000000082
000004D4: 7D35 jge 00000050B --
000004D6: 8BD9 mov ebx,ecx
000004D8: F7DB neg ebx
000004DA: 81FB81000000 cmp ebx,000000081
000004E0: 7D29 jge 00000050B --
000004E2: 83E902 sub ecx,002 ;
000004E5: 8908 mov [eax],ecx
000004E7: 90 nop

After this patch the app working fine.

Admiral
December 11th, 2007, 16:01
Quote:
[Originally Posted by evlncrn8;70886]sure there is, what about checking if the 0xCC appears on an aligned va?

Mmm... It's quite good but certainly not 'sure-fire'. Disregarding obfuscated code and SMC, which will obviously cause problems, there are still some legitimate cases where this fails. In particular, the padding placed between functions in some debug-builds and a certain few sequences where the opcodes don't align themselves in time after a restart ('cause it's not always possible to know where to start disassembling from).

Quote:
[Originally Posted by janus68]Few months ago I found a small bug ... After this patch the app working fine.
Huh. Good find
I've seen a few complaints dotted around but never thought to investigate. The main reason was that maintaining that extra code was a real nightmare, as I made the mistake of assembling it all manually in OllyDbg. You don't have to tell me - I know it was stupid.

I'd be tempted to fix this if I hadn't already declared 0.97 as 'final' all those months ago :P

Admiral

evlncrn8
December 12th, 2007, 05:58
Quote:
[Originally Posted by Admiral;70889]Mmm... It's quite good but certainly not 'sure-fire'. Disregarding obfuscated code and SMC, which will obviously cause problems, there are still some legitimate cases where this fails. In particular, the padding placed between functions in some debug-builds and a certain few sequences where the opcodes don't align themselves in time after a restart ('cause it's not always possible to know where to start disassembling from)


and how many programs out there have you seen with obfuscated code / smc before the protection was applied?... im sure its very very little... the alignment check is a good general test before you begin to go deeper... even in debug builds, procs are usually aligned... and yep in dillo etc, there are short conditionals handled as well as the long ones....

Maximus
December 12th, 2007, 06:17
mmh... from a theoretical point of view, procedures and loops should always be aligned. Sometimes you may find 'nops' in code (like xchg eax,eax, mov ebx,ebx) that are added for improving speed. While this is not true for most of the P4 family (trace cache), it is almost true for the rest. Aligned instructions 'run faster'. Aligned jump locations too.

So, it is often reasonable to expect aligned procedures, especially in speed-optimized code.

Admiral
December 12th, 2007, 09:48
I'm sure you're both right, but I think we're missing the point of the project, here. I wasn't trying to get 99.9% of the Nanomites correct (and as you know, there are often tens of thousands in a program), but to produce a flawless method that works in every conceivable case. This way, the bug reports would be kept to a minimum. Or even zero, provided the analysis and implementation are correct.

I did try the disassembly method you're talking about and got disappointing results. Admittedly I didn't persist for very long, but that's because the design is fundamentally flawed (in an otherwise 100% effective procedure). If you disagree then you're very welcome to write your own patcher. Indeed, this is why I made sure I documented the table-dump format.

Admiral

dELTA
December 12th, 2007, 17:27
Very nice work Admiral (and I also can't stand 99.9% solutions, so I'm absolutely on your side).

JMI
December 13th, 2007, 00:44
I think a solution which works in "every single case" would, indeed, be a "good thing."

Regards,