
Funny coded malware


ZaiRoN
July 3rd, 2008, 14:00
Some days ago I had the opportunity to check one of the latest MSN malware samples. I think there’s often something interesting inside a malware, no matter what it does, and this is a perfect example!

The malware is able to infect right-handed people only! I’m not kidding...
Among all the Windows settings there’s one made for left-handed people. The option I’m referring to is located under the Mouse control panel, labelled “Switch primary and secondary buttons”. It lets you exchange the functions performed by the right and left mouse buttons. I don’t know if this setting is useful or not; most of the left-handed friends I have still use the mouse like a right-handed person. Maybe they don’t even know the option exists. Anyway, look at this code:

http://zairon.files.wordpress.com/2008/06/left_hand_check.jpg?w=460&h=495

It’s a simple query on a registry value named SwapMouseButtons.
result_value is sent back to the caller, and the caller checks it. If the value is equal to 0x30 (right-handed) the malware goes on running the rest of the code, but if the value is 0x31 (left-handed) the malware ends immediately. All the nasty things performed by the malware are executed after this check, which means a left-handed user won’t get infected!
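Just to make the logic concrete, here’s a minimal C sketch of this kind of check. The registry path (HKCU\Control Panel\Mouse), the RegQueryValueEx call and the function names are my assumptions, not taken from the sample, which reads the value through its own routine:

Code:
#include <windows.h>
#include <stdio.h>

/* Hypothetical reimplementation of the check described above.  The value
   "SwapMouseButtons" lives under HKCU\Control Panel\Mouse and is stored as a
   string: "0" (0x30) = right-handed, "1" (0x31) = left-handed. */
static int is_left_handed(void)
{
    HKEY hKey;
    char value[8] = {0};
    DWORD size = sizeof(value);

    if (RegOpenKeyExA(HKEY_CURRENT_USER, "Control Panel\\Mouse",
                      0, KEY_QUERY_VALUE, &hKey) != ERROR_SUCCESS)
        return 0;                         /* assume right-handed on failure */

    RegQueryValueExA(hKey, "SwapMouseButtons", NULL, NULL,
                     (LPBYTE)value, &size);
    RegCloseKey(hKey);

    return value[0] == 0x31;              /* '1': buttons are swapped */
}

int main(void)
{
    if (is_left_handed())
        return 0;                         /* left-handed user: exit immediately */
    printf("right-handed user: the nasty part would run from here\n");
    return 0;
}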

I’ve seen some malware using the SwapMouseButton function in the past, but never something like this. I bet the author is left-handed and wrote the check just to make sure he avoids a possible infection… I can’t think of anything else. Quite funny!!!



The malware is not really interesting per se, but it has something I’ve never noticed before. It’s not a cool and dangerous new technique, just a coding behaviour. Look at the graph overview:

http://zairon.files.wordpress.com/2008/06/long_diagram.jpg?w=110&h=350

The image represents the content of a malware procedure. Nothing strange per se, except that it contains 657 instructions, too many for a simple malware. It’s a big routine and I was surprised at first, because you can do a lot of things with so many instructions. I started analysing the code: nothing is passed to the routine and nothing is returned to the original caller. I thought it had to be an important part of the malware, but I was disappointed by the real content of the routine. After a few seconds I realized what was really going on: 657 lines of code for doing something that normally would require around 50 lines…
The function contains a block of 17 instructions repeated 38 times. When I’m facing things like that I always have a little discussion with my brain. The questions are:
- why do you need to repeat each block 38 times?
- can’t you just use a while statement?
- is this a sort of anti-disassembling trick?
- can you produce such a procedure by setting some specific compiler options?

The repeated block contains the instructions below:
Code:
00402175 push 9 ; Length of the string to decrypt
00402177 push offset ntdll_dll ; String to decrypt
0040217C push offset aM4l0x123456789 ; key: "M4L0X123456789"
00402181 call sub_401050 ; decrypt "ntdll.dll"
00402186 add esp, 0Ch
00402189 mov edi, eax
0040218B mov edx, offset ntdll_dll
00402190 or ecx, 0FFFFFFFFh
00402193 xor eax, eax
00402195 repne scasb
00402197 not ecx
00402199 sub edi, ecx
0040219B mov esi, edi
0040219D mov eax, ecx
0040219F mov edi, edx
004021A1 shr ecx, 2
004021A4 rep movsd
004021A6 mov ecx, eax
004021A8 and ecx, 3
004021AB rep movsb

It’s only a decryption routine, nothing more. The string is decrypted by the “call sub_401050”; the rest of the code (the repne scasb plus rep movsd/movsb sequence, an inlined strcpy) simply moves the decrypted string into the right buffer.
Ok, let’s try answering the initial questions.

According to some PE scanners the exe file was produced by Microsoft Visual C++ 6.0 SPx.
It’s possible to code the big procedure just using a loop (while, for, do-while) containing the snippet above. I don’t think the author used one of these statements because, as far as I know, it’s not possible to tell the compiler to explode a cycle into a sequence of blocks. At this point I see two options:
- he wrote the same block 38 times
- he defined a macro with the block’s instructions and repeated the macro 38 times
I wouldn’t code something like that myself, but the macro option seems the most probable choice (a sketch of what it might look like follows below).
Is it an anti-disassembling trick? My answer is no, because such code is really easy to read. You don’t have to deal with variables used inside a for/while; to understand what’s going on you only have to compare three or four blocks.
I don’t have a valid answer to the doubt I had at first…
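For what it’s worth, the macro variant could look something like the sketch below. The decryptor prototype is guessed from the three arguments pushed before “call sub_401050”, and the buffer names and lengths are only examples:

Code:
#include <string.h>

/* Guessed prototype: key, encrypted string, length; it seems to return a
   pointer to the decrypted text (eax is used as the copy source). */
char *sub_401050(const char *key, char *encrypted, int length);

/* One 17-instruction block wrapped in a macro: decrypt, then the inlined
   strcpy that moves the result back into the buffer. */
#define DECRYPT_BLOCK(buf, len) \
    strcpy((buf), sub_401050("M4L0X123456789", (buf), (len)))

extern char ntdll_dll[];        /* encrypted "ntdll.dll" inside the sample */
extern char other_string[];     /* ...and so on, one buffer per block */

void decrypt_all(void)
{
    DECRYPT_BLOCK(ntdll_dll, 9);
    DECRYPT_BLOCK(other_string, 12);
    /* ... the same macro pasted 38 times produces exactly the long,
       loop-free procedure shown in the graph overview ... */
}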

Trying to find out some more info I studied the rest of the code. I was quite surprised to see another funny diagram.

http://zairon.files.wordpress.com/2008/06/pyramid_diagram.jpg?w=105&h=345

This time the image represents the content of the procedure used to retrieve the addresses of the API functions. Again, no while/for/do-while statement. The rectangle in the upper part of the image is a sequence of calls to GetProcAddress, and the code below it is just a sequence of checks on the addresses obtained by GetProcAddress.
It’s a series of:

address = GetProcAddress(hDLL, "function_name");

followed by a series of:

if (!address) goto _error;

Apart from the non-use of a loop, there’s something more this time, something that I think reveals an unusual coding style: the author checks errors at the end of the procedure. I always prefer to check return values as soon as I can; it’s not a rule, but it’s something that helps you avoid oversights and potential errors… The procedure also has a little bug/oversight at the end: the author forgot to close an opened handle. Just a coincidence?
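In C the style looks roughly like the sketch below; the API names are just examples, not the ones the sample actually resolves:

Code:
#include <windows.h>

/* "Resolve everything first, check everything at the end" style. */
static int resolve_apis(void)
{
    HMODULE hMod = GetModuleHandleA("kernel32.dll");
    FARPROC pCreateFileA, pWriteFile, pCloseHandle;

    if (!hMod)
        return 0;

    /* the long rectangle at the top of the graph:
       one GetProcAddress after another */
    pCreateFileA = GetProcAddress(hMod, "CreateFileA");
    pWriteFile   = GetProcAddress(hMod, "WriteFile");
    pCloseHandle = GetProcAddress(hMod, "CloseHandle");
    /* ... many more ... */

    /* the cascade below it: all the checks postponed to the end */
    if (!pCreateFileA) goto _error;
    if (!pWriteFile)   goto _error;
    if (!pCloseHandle) goto _error;

    return 1;

_error:
    return 0;   /* (and this is where the sample forgets to close its handle) */
}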
Anyway, two procedures without a single loop. It seems the author avoided loops by choice. In case you still have some doubts, here’s another cool picture for you:

http://zairon.files.wordpress.com/2008/06/triangle_diagram.jpg?w=110&h=210

The routine inside the picture contains the code used to check whether the APIs are patched or not. The check is done by comparing the first byte with 0xE8 and 0xE9 (call and jump). If the functions are not patched the malware goes on, otherwise it ends. As you can see, no loops are used.
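A minimal sketch of that check, with an arbitrary API picked as an example:

Code:
#include <windows.h>

/* If the first byte of an API is 0xE8 (call) or 0xE9 (jmp),
   someone has patched/detoured it. */
static int api_is_patched(FARPROC api)
{
    unsigned char first = *(unsigned char *)api;
    return first == 0xE8 || first == 0xE9;
}

int main(void)
{
    FARPROC p = GetProcAddress(GetModuleHandleA("kernel32.dll"), "CreateFileA");
    if (p && api_is_patched(p))
        return 1;   /* hooked: the malware stops here */
    return 0;       /* clean: keep going */
}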

In summary: it’s not jungle code, it’s not anti-disassembly code and it’s not a specific compiler setting. I think it’s only a personal choice, but I would really like to know why the author used this particular style.
Do you have any suggestions?


Beyond the coding style, the malware has some more strange things. As pointed out by *asaperlo*, the code contains a bugged RC4 implementation.
It also has a virtual machine check. The idea is pretty simple: the malware checks the nick of the current user. If the nick is “sandbox” or “vmware”, you are under a virtual machine… (see the sketch below).
The malware also spawns another one (it’s encrypted inside the file); it might be material for another post.
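The whole VM check boils down to something like this; whether the comparison is case-insensitive is my assumption:

Code:
#include <windows.h>
#include <string.h>

/* The whole "virtual machine" detection: just look at the user name. */
static int running_in_vm(void)
{
    char user[256] = {0};
    DWORD size = sizeof(user);

    if (!GetUserNameA(user, &size))
        return 0;

    return _stricmp(user, "sandbox") == 0 ||
           _stricmp(user, "vmware")  == 0;
}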

That’s a funny coded malware for sure!

dELTA
July 4th, 2008, 06:32
Loop unwinding is indeed a well-known code and compiler optimization technique, often used in situations with extreme timing requirements (e.g. crypto code):

http://en.wikipedia.org/wiki/Loop_unwinding

Don't know for sure about available options for it in this specific compiler though.
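For anyone who hasn't met the term before, a trivial C illustration (nothing to do with the sample itself):

Code:
/* Loop unrolling in a nutshell: the same copy written both ways. */
void copy_plain(int *dst, const int *src)
{
    int i;
    for (i = 0; i < 4; i++)
        dst[i] = src[i];
}

void copy_unrolled(int *dst, const int *src)
{
    dst[0] = src[0];    /* loop body repeated: no counter, no branch */
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}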

ZaiRoN
July 4th, 2008, 07:18
Thx Delta, I've been told about some specific settings able to unroll loops. I'm reading some articles about code optimization, but the doubt remains, because I think you can only increase the speed of the first loop (the decryption block); you won't get a significant improvement in performance for the last two loops. Why did he need to use such optimizations?
Anyway, I need to read something more about optimization stuff...

dELTA
July 5th, 2008, 18:53
My guess would be that this compiler option (if that is the case now) was unintentionally left on by the author, not for the sake of any necessary, or even intended, optimizations.

Maximus
July 5th, 2008, 19:34
Loop unwinding is a good technique, but this cannot be the case here.
No compiler would do loop unwinding of this kind, because it would just overflow the L1 cache, the trace buffer, etc.

So, it is not possible (imho) that such code can be generated by a compiler with loop unrolling on Win32.

enhzflep
July 5th, 2008, 23:55
Could you expand on that a little Maximus?
By my calculations, the code in the loop is 0x38 bytes long. Unroll that loop 38 times and you're looking at 2128 bytes. I thought the L1 was 16,384 bytes in size - if so, I'm wondering how this code would overflow it..

As for the trace buffer, isn't it around 12K uops long - 12,000 micro-instructions? Would the instructions in the above code really work out to be 5 uops each on average?

Maximus
July 6th, 2008, 20:25
Indeed, you are absolutely right with your sizes & counting (and most of the uops are fewer than 4 per instruction); the code would comfortably fit in the trace buffer. But unrolling a loop with a call in it has very little meaning (even if IA-32 somehow optimizes calls/returns).
It would be better to unroll the loop less and inline the function, even if the called one has a loop in it (the CPU can even predict its end if it "loops" little).

Personally, I prefer to unroll loops only if they access memory, and only for the purpose of counting the size of the memory they access, in order to place prefetch(es) at the start of the loop (and in the middle if needed) to hide the memory latency of the next iteration (so aligning loops to 64-byte-boundary data).
When this is not possible, I even prefer doing a conditional check for the right prefetch, since by unrolling the loop the check time might become marginal (or just place it, as it does not 'eat' much CPU time, and test).
This is especially useful if memory is not accessed progressively but backwards, which usually makes your CPU choke and your customer unhappy.
edit---
Said simply, you need to have enough instructions executed to 'hide' the lookup, transfer and caching of prefetched data, so that while your CPU eats clocks, the memory is made ready on a cache line for the next iteration. The time lost on a jmp/jz is absolutely nothing (obfuscation apart) compared to data not being available when you need it.
In this case, you see no memory optimization, and the code optimization is 'ruined' by a call.
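Roughly, the kind of thing I mean (illustrative only; names and the stride are made up):

Code:
#include <xmmintrin.h>      /* _mm_prefetch */

/* Unroll over one 64-byte cache line (16 ints) and ask for the next line at
   the top of each pass, so the data is already cached when the next pass
   needs it. */
long long sum_with_prefetch(const int *data, int count /* multiple of 16 */)
{
    long long sum = 0;
    int i, j;

    for (i = 0; i < count; i += 16) {
        _mm_prefetch((const char *)&data[i + 16], _MM_HINT_T0);
        for (j = 0; j < 16; j++)    /* the unrolled body, kept as a loop here for brevity */
            sum += data[i + j];
    }
    return sum;
}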

enhzflep
July 7th, 2008, 20:49
Ahhhhhh! Gotcha.

Thanks for that Maximus.

begemott
September 4th, 2008, 08:50
Hello,

Looong time ago, when I was still a developer (and protectionist + reverser), I found the works of Todd Veldhuizen on template metaprogramming (it was 1995, I think). As far as I know Todd is the originator of compile-time programming. I started to think about the possibility of using these techniques in software protections.
One of my ideas was to use template metaprogramming to unroll (some of) the loops in the program in order to make it difficult to trace the executable with a debugger.
Here is how the unrolling could be done:

#include <iostream>
#include <tchar.h>

template< int i >
class LoopClass
{
public:
    static inline void Unroll()
    {
        // loop body here, for instance cout
        std::cout << i << "\n";
        LoopClass<i-1>::Unroll();
    }
};

template<>
class LoopClass<0>
{
public:
    static inline void Unroll()
    {
        // loop body here, for instance cout
        std::cout << 0 << "\n";
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    LoopClass<100>::Unroll();   // print the numbers from 100 down to 0

    return 0;
}

If you compile and disassemble this program you will see a similar picture.
I suppose the malware uses such a technique.
Note that I don't claim this is a useful anti-debugging technique,
but I still believe that exploring template metaprogramming from a reversing point of view makes sense.

Regards!
Begemott

Maximus
September 11th, 2008, 11:22
For the curious - I was roaming in the XP ntdll for... meh, my business, and when doing a mouse-over I saw this strange incipit:

Code:

.text:7C91035A public wcslen
.text:7C91035A wcslen proc near ; CODE XREF: RtlInitUnicodeStringEx+14p
.text:7C91035A ; RtlCreateUnicodeString+Bp ...
.text:7C91035A
.text:7C91035A arg_0 = dword ptr 8
.text:7C91035A
.text:7C91035A 000 8B FF mov edi, edi
.text:7C91035C 000 55 push ebp
.text:7C91035D 004 8B EC mov ebp, esp
.text:7C91035F 004 8B 45 08 mov eax, [ebp+arg_0]
.text:7C910362 004 90 nop
.text:7C910363 004 90 nop
.text:7C910364 004 90 nop
.text:7C910365 004 90 nop
.text:7C910366 004 90 nop
.text:7C910367 004 90 nop
.text:7C910368 004 90 nop
.text:7C910369 004 90 nop
.text:7C91036A 004 90 nop
.text:7C91036B 004 90 nop
.text:7C91036C 004 90 nop
.text:7C91036D 004 90 nop
.text:7C91036E 004 90 nop
.text:7C91036F 004 90 nop
.text:7C910370
.text:7C910370 loc_7C910370: ; CODE XREF: wcslen+1Ej
.text:7C910370 004 66 8B 08 mov cx, [eax]
.text:7C910373 004 40 inc eax
.text:7C910374 004 40 inc eax
.text:7C910375 004 66 85 C9 test cx, cx
.text:7C910378 004 75 F6 jnz short loc_7C910370


If we do some simple hex math, we see that the nops are used to align the counting loop to a paragraph: the 14 nops push the loop head from 7C910362h to 7C910370h, a 16-byte boundary. So is it faster? Yes and no... Sadly, this loop is slightly slower on certain Intel processors, because the two incs are actually translated as two same-register adds plus a uop to mask the overflow.
---
I was forgetting: on older processors (but it's still valid) a nop string was/is used when trying to fill all the parallel execution units offered by a processor. While this is not a concern nowadays thanks to out-of-order rescheduling, when writing aggressively optimized loops a manual aid to the OOO engine surely never hurts...

zhzhtst
September 17th, 2008, 08:42
When you reverse WinObjEx written by Mark Russinovich, you will find something like this.