
Static Disassembly - Best way forward


live_dont_exist
September 22nd, 2011, 01:13
Hey Guys,
I've been doing some malware analysis over the last few months. I've read quite a bit, stepped through a lot of Lena's stuff, done quite a few crackmes and challenges, analyzed quite a few malware samples, blogged about what I learned, etc. So I'm sort of getting the hang of things. But now I feel that I am stuck again... what's new, eh? So here's my problem.

When confronted with a new piece of malware, I tend to do as much dynamic analysis as I can on it to try and understand what happens. I then load the binary in IDA and study the static disassembly. Almost every time, I fail here, as things get too complicated too fast. So then I load the same malware in a debugger and step through it, using IDA just to rename functions, e.g. sub_4012C9 becomes start_of_malware.

This is more fruitful, and I do move forward, but invariably I am able to proceed to the point of understanding around 7 or 8 functions and their purpose... these functions seem to be very similar to the knowledge I gained while doing dynamic analysis. So you can say, I have confirmed what I learnt while doing dynamic analysis...by doing static analysis on sections of the code.

However, as I understand it, this is in no way complete, as there are vast parts of the disassembly with plenty of functions still 'un-analyzed'. And since the malware doesn't trigger them directly (something needs to happen), I'm left with no other choice but to read each function individually and try to understand why it's present. I try doing this, but I get stuck... There have been articles which tell me to change the entry point to the function I want to trigger, but surely they won't run on their own in most cases? There'd be stack arguments and parameters that have to be passed?

No, it's not the assembly which is the problem. I can "understand" the instructions, as in I can understand MOV AL,1, but why is it there? What is its purpose? All that seems a bit fuzzy. I just read a very old Win95 tutorial that told me to rename "variables" as well as "functions" and I'll try that, but I just thought I'd ask here too.

So what do you guys advise, to try and learn stuff better? Is the way I am doing it the only way? All suggestions are appreciated

Thnx
Arvind

ioactivity
September 22nd, 2011, 02:25
From my point of view, it's more of a philosophical problem you may have, not a technical one.

I may be wrong on this, as I'm an entirely self-taught person, but you may just need to train your brain to help you subconsciously understand the big picture in the listing. You can do that only through experience.

You say you understand the mnemonics, but fail at understanding the algorithm; that's perfectly OK. It's the same as in school, when you had to figure out a mathematical problem which could be pretty hard to solve even if you knew the basic algebra syntax (+, -, /, etc). But when you get the hang of it, it becomes easier, not because you understand the syntax better, but because the brain will hint at more solutions, and one of them, after verification by your consciousness, can be the solution you seek.

The answer could be that it's the ability to read the source code and interpret it in your mind. When I'm looking at a dead listing, I'm imagining the higher-level language's constructs. There are also some patterns that compilers use to generate code; recognizing these patterns can help you understand the higher abstraction. For example, if I traced into the code used to invoke a virtual function of a class derived from an abstract parent class without knowing the pattern, I'd be lost for a long time. But since I know the pattern, it's easier for me to understand what structures are used, what the tables are doing, why some pointers are loaded into some registers, etc. I also look for what could be a side effect of an expression. I read somewhere that a good technique is to pretend you're writing the code yourself: this way you can think of what you would need, only to see that the following instructions are doing a similar thing. I think the key is to be able to recognize the patterns, which is, after all, what the human brain is designed to do.

That said, I'm no expert in reverse engineering; I did a lot of it a few years ago, then had a long break, only to see that my disassembly-reading skills have degraded to the point that I can feel the difference in how comfortable it is to read assembly code. Of course this degradation is itself a learning experience as well, so I'm not ranting about it.

live_dont_exist
September 22nd, 2011, 02:45
Thanks ioactivity. That's an interesting post. Train the brain. So if I break down your post, I should:

a) Keep working (obviously)
b) Try and convert as many functions into high level code as possible
c) Keep identifying patterns, so b) becomes easier to do

One thing I wasn't clear on, though, was that you mentioned I should think of how I would write stuff. Now, it's primarily malware I'm analyzing and I really don't know what I'm looking at, so it's kind of tough to think along those lines. As in, if I knew 'This function connects to malware.xxx.com', I could look at rewriting pseudo high-level code from the assembly... that's cool. But if I have no clue at all, how do I do it? I hope I was clear.

All the same, some good points.. thank you.

Kayaker
September 22nd, 2011, 02:59
Hi

ioactivity makes some good points. It is really all about Zen after all


But Yeah, sometimes you just end up spinning your wheels in the mud wondering what a particular function does if it's not actually called during normal execution. One thing that might be useful for at least some functions, but I haven't tried as yet, is the IDA Appcall feature.

http://www.hexblog.com/?p=112
http://www.hexblog.com/?p=113

Here's another example that uses IDAPython rather than Appcall to call a self-contained function

Calculating API hashes with IDA Pro
http://www.hexblog.com/?p=193



As for naming variables - YES, always, even if it's to something like "wtf_is_this_var".

I find it also helps to increase the number of XREFS from the default to a _much_ larger number. That way you can more easily see if one of your defined functions pops up as a reference to another function. You can create a /cfg/idauser.cfg file to set your own default options for all new disassemblies, i.e.

Code:

// /cfg/idauser.cfg
SHOW_XREFS = 60          // Show 60 cross-references (the rest is accessible by Ctrl-X)
MAX_NAMES_LENGTH = 50    // Maximal length of new names (you may specify values up to 511)
OPCODE_BYTES = 6         // Show 6 opcode bytes next to each instruction



Of course one of the most useful things is being able to resolve APIs and decrypt strings. At that point you can almost read the disassembly like a textbook (at least it helps). That's where things like calculating API hashes or coming up with an IDC script to decrypt strings come in useful, IF the malware makes it easy enough for you to do that; not all are that accommodating.

Actually, the Lenny Zeltser malware challenge that you brought up in an earlier thread is a good example of that. It uses ROL 7 for an API hash, and there just happens to be a WinDbg extension which can deal with that in a standalone manner (one of the MSEC Crash Analyzer Debugger Extension commands). Several APIs are called using a fancy-ass MapViewOfSection technique, but big deal, all you have to do is plug the hash into the WinDbg extension and update your static disassembly with the API name.
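As a rough illustration of what resolving such hashes offline can look like, here is a minimal Python sketch. It assumes a 32-bit accumulator that is rotated left by 7 and then XORed with each byte of the export name; the real sample may add instead of XOR, or hash the terminating NUL as well, so `rol7_hash` and the example name list are hypothetical and would need adjusting to match the routine in the listing.

```python
def rol32(value, count):
    """Rotate a 32-bit value left by count bits."""
    count %= 32
    value &= 0xFFFFFFFF
    return ((value << count) | (value >> (32 - count))) & 0xFFFFFFFF


def rol7_hash(name):
    """Hypothetical ROL-7 hash: rotate the accumulator, then XOR in each byte."""
    acc = 0
    for byte in name.encode("ascii"):
        acc = rol32(acc, 7) ^ byte
    return acc


def build_lookup(api_names):
    """Map hash -> API name so constants seen in the listing can be resolved."""
    return {rol7_hash(name): name for name in api_names}


# Example names only; a real table would be built from a DLL's export list.
lookup = build_lookup(["LoadLibraryA", "GetProcAddress", "MapViewOfFile"])
```

With a table like this, a constant pushed before the resolver call can be looked up and the call site renamed by hand in IDA.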

So too, the string decryption routine, while effective, can be resolved with an IDC script. At this point the malware is pretty much laid open, and even if a function isn't called during normal tracing, you can guess what it does.

And then other times, you just spin your wheels in the mud...

Cheers,
Kayaker

live_dont_exist
September 22nd, 2011, 03:12
Thanks Kayaker. Will do variable names and Xrefs in IDA..cool. I'm sure that will help a little...

Now the rest of the stuff, Python inside IDA etc., is OTT for me at the moment. I've read the IDA Pro book but only up to Chapter 7 or 8, where it teaches me how to navigate around IDA. And then it all started glazing over.

My point being: I get what you're saying about IDA scripting, but I feel I've not yet reached even the level where I can say, hey, this needs a script, and then start learning what I need to write that script. For example, there was this challenge on Osix.net, level 5, to find a serial with a specific hash ONLY. I knew that doing it manually was stupid and that I must code, so I wrote a simple Perl script and I was done. So in that case I knew that I must script; I could visualize the problem and at least 'know' what I needed to do. Here, it's not always that straightforward...

So after reading that, would you say I just need to keep spinning my wheels in the mud, 'till' I get to the level I need to get to.. to script.. or is there a middle path?

Arvind

blabberer
September 22nd, 2011, 03:58
grow dynamically static is stagnant

well jokes apart like io and kayaker posted you need to keep spinning the wheel in the mud till your fingers start making pots

and once you become potter (not harry) every bit of mud will look like a pot

btw ollydbg too has the ability to label and comment and there were utilities (ollysync i think ) that can transfer olly labels and comments back to ida for a better picture

live_dont_exist
September 22nd, 2011, 04:23
All right then blabberer.. will keep spinning them and keep coming here when they don't take any shape... let alone a pot

Maximus
September 22nd, 2011, 08:42
If I can give a suggestion, try to use Olly on one monitor and IDA on another.

Take note of some of the procs/addresses you enter while doing dynamic analysis, then go to those addresses in IDA (creating code there if needed), and start your IDA analysis from there.
You will speed up your analysis this way by examining only the 'taken' ways.
Sometimes you fight with code that is decrypted at runtime, so there you basically have 2 approaches:

1) grab the encryption scheme, write your IDA script and decrypt.
2) do a partial dump right after the interesting code has been decrypted, and check it there.

Plan B, if you're a decent coder:
...grab the executed instructions from an Olly trace, and write a simple IDA plugin that lets you view/follow that code more easily.
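The first half of Plan B, turning a raw trace into a list of functions that were actually entered, can be sketched outside IDA. This is only an illustration: it assumes you have exported the function start addresses from IDA and the executed addresses from the trace as plain integers, and it attributes each address to the nearest function start at or below it, ignoring gaps between functions. The name `functions_hit` and the sample addresses are made up.

```python
from bisect import bisect_right
from collections import Counter


def functions_hit(function_starts, trace_addresses):
    """Count executed instructions per function.

    Each traced address is credited to the closest function start that is
    less than or equal to it (a simplification: function ends are ignored).
    """
    starts = sorted(function_starts)
    hits = Counter()
    for address in trace_addresses:
        index = bisect_right(starts, address) - 1
        if index >= 0:  # skip addresses below the first known function
            hits[starts[index]] += 1
    return hits


# Made-up addresses standing in for an IDA function list and a debugger trace.
starts = [0x401000, 0x401100, 0x4012C9]
trace = [0x401005, 0x401010, 0x4012D0, 0x400FFF]
for start, count in sorted(functions_hit(starts, trace).items()):
    print(f"sub_{start:X}: {count} traced instructions")
```

Functions that never show up in the result are the ones the run did not reach, which is a reasonable list to set aside at first.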

live_dont_exist
September 22nd, 2011, 08:49
Thanks Maximus... No luxury of 2 monitors just yet. Maybe sometime in the future .. But I hear what you say.

When you say 'taken ways' though, do you mean all the code paths that are visible to me during dynamic analysis, while stepping through the malware as it runs? So, for example, if there are 20 subroutines and the malware uses only 8 of them, you're saying, as I understand it, to focus on those 8 to start with, and that'll speed things up. Right?

Will keep your suggestions for encrypted code in mind..when I hit that type and have to work.

I code ok but primarily in Perl and Ruby. Time to learn Python then. Both Immunity and IDA seem to have Python support.

Thanks though!!

blabberer
September 22nd, 2011, 23:37
Maximus We Are In Virtual World
no need for two monitors set up a vm say vpc2007
run ollydbg in host
run ida in vm

run kernel debugger that debugs ida debugging your debugee in host and dance through all

ioactivity
September 23rd, 2011, 01:19
Personally I also find a multi-monitor setup more productive. It's very convenient if I can open some docs on one monitor, code on the second monitor, let go of the mouse and keyboard, and just do some thinking without clicking anything.

@live_dont_exist: By "pretend you wrote it yourself" I didn't mean to try and think about writing the whole procedure, because, as you said, without interpreting it it's impossible to say what's inside. It's a more subtle approach, also based on pattern matching. For example, if you see an OUT instruction to port 0CF8h/0CFAh, and some bit shifting and manipulation slightly above it which uses some local stack variables, it could mean that you are looking at some write_to_pci() procedure. This, in turn, could mean that you're looking at a PCI device enumeration sequence. So what would you need to write such a procedure? Some local variables holding the PCI bus, device and function numbers, some loops to increment these variables, and some read/write functions accessing the ports. Suddenly it becomes clearer why the code has a 0x07 or 0x20 in some of the CMP instructions: PCI function numbers run from 0 to 7 and there are 32 device slots per bus, so these CMPs are the loop conditions which decide whether the loop continues or breaks. After some time it may look like you reversed a whole big procedure without actually interpreting the instructions one by one, just by labeling the local variables in IDA and seeing how they interact. You can also figure out the return value of this function (or the structure it fills out), and see how it is used in the caller. Then it may be possible to figure out other functions based on the information you got from the PCI enumeration function, allowing you to describe more structures, return values and local variables. But then again, situations like this aren't very common, and sometimes it's just plain more effective to use a debugger to get the value at some memory address. But it's always good to look at IDA, even if you have a debugger running nearby.
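To make the "what would I need to write this" exercise concrete, here is a small Python sketch of the enumeration loop described above. It assumes the common layout where a 32-bit address (enable bit, bus, device, function, register) is written to port 0CF8h; the function names are invented for illustration, and nothing here touches real hardware.

```python
def pci_config_address(bus, device, function, register=0):
    """Build the 32-bit value a driver would write to port 0xCF8.

    Layout assumed here: bit 31 enable, bits 23-16 bus, bits 15-11 device,
    bits 10-8 function, bits 7-2 register (dword aligned).
    """
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 7):
        raise ValueError("bus/device/function out of range")
    return (0x80000000 | (bus << 16) | (device << 11)
            | (function << 8) | (register & 0xFC))


def enumerate_addresses(bus_count=1):
    """Yield (bus, device, function, address) for every slot, as a scan loop would."""
    for bus in range(bus_count):
        for device in range(0x20):      # the loop bound behind a CMP with 0x20
            for function in range(8):   # functions 0..7, the CMP with 0x07
                yield bus, device, function, pci_config_address(bus, device, function)


for bus, device, function, address in list(enumerate_addresses())[:3]:
    print(f"{bus:02X}:{device:02X}.{function} -> {address:#010x}")
```

The shifts by 16, 11 and 8 are the "bit shift and manipulation" you would expect to find just above the OUT instruction, and the two inner loop bounds are where the 0x20 and 0x07 constants come from.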

This probably won't help you much, at least practically, because there are probably different patterns used in regular malware (unless you reverse rootkits), but I hope you get the point

live_dont_exist
September 23rd, 2011, 01:29
@blabberer: Yes, understood. I'll look at multiple VMs at least as 2 monitors are kind of out for the moment.

@ioactivity: Got that yes..think logical..not linear is what you're saying... will keep it going

Thnx
Arvind

live_dont_exist
September 25th, 2011, 06:34
I felt this thread was very useful, and that there will be many newbies like me who are struggling with 'information overload', so I wrote a small blog post about it. It basically just summarizes things I have learnt so far, stuff from this thread, and stuff that I learnt after I implemented a few of the suggestions here. I do think this will help many as much as it helped me.

Here is the link to my blog article - http://ardsec.blogspot.com/2011/09/reverse-engineering-know-your-tools.html

p.s.. If I'm not allowed to 'advertise' stuff I write, please let me know and I will not paste such links in future and keep the discussion only inside the forum. I'll advertise it elsewhere

Kayaker
September 25th, 2011, 11:31
Quote:
p.s.. If I'm not allowed to 'advertise' stuff I write, please let me know and I will not paste such links in future and keep the discussion only inside the forum. I'll advertise it elsewhere


Informative reversing blogs are always welcome Arvind. Link it in your Signature if you want.

K.

live_dont_exist
March 27th, 2012, 14:22
Bumping an oldish thread..sorry but there didn't seem to be much sense starting a new one.

So I have come a little way learning to do static reversing...now today I took up a malware which has a DLL and a kernel driver (all from dynamic analysis). Some very superficial analysis of the DLL in Olly and IDA reveals that there are 7090 functions (IDA Functions Menu). Some are imports from other DLLs so I can ignore those, but that still leaves me with say..6000 possible functions.

So I can sit and manually use the Debug - Call DLL Export in Olly and struggle my way through all the functions...going mad in the process ... but want to know ... is there a better way to do it?

Forget DLL...what about "large" EXEs which have a huge number of functions? My blog below seems okay for smaller EXEs with under 100 functions. More than that? How do you do it comprehensively?

Thanks
Arvind

Aimless
March 27th, 2012, 21:50
I think you are all over the place live, so you need to focus first. You need to ask yourself: "What do I want to do the most?"

If the answer is learn static disassembly, the best way to do it is to write code, build it with different compilers, disassemble the results and see what they look like in assembly. In time, your eyes will learn to 'understand' what each function does, just by looking at the instructions and their sequencing.

However, if you want to understand the purpose of each and every function in a binary, my question is why? Mostly, what you want to understand is the purpose of each function that is used along a particular path. By particular path, I mean doing something (e.g. for mIRC to connect to the IRC servers, one path is taken; for mIRC to open a channel, a separate function path is taken, etc).

The point here is, you need to understand what "effect" of the binary you want to study then understand the functions used to achieve that "effect" and then understand each of those functions. Trying to analyze EVERY function will make you old before your time.

If you want to understand this, you need what is known as a "coverage" profiler. There are many that work at the IDA level (one of them is freeware, excellent and uses IDAPython). You can also go to hexblog (Ilfak's blog) and download the colorizer that colors the executed instructions/functions. It's an easy way to see which functions are being executed for a particular "effect".
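One simple way to use coverage data for this, sketched here in Python as an assumption and not as how any particular profiler works, is to record two runs: a baseline where the "effect" is not triggered and one where it is, then compare the sets of functions hit. The function name `split_by_effect` and the sample names are hypothetical.

```python
def split_by_effect(all_functions, baseline_run, effect_run):
    """Group functions by when they were executed.

    baseline_run / effect_run are collections of function names (or start
    addresses) that a coverage tool reported for each run.
    """
    baseline, effect = set(baseline_run), set(effect_run)
    return {
        "effect_only": sorted(effect - baseline),   # best candidates to study
        "common": sorted(effect & baseline),        # startup/runtime plumbing
        "never_hit": sorted(set(all_functions) - baseline - effect),
    }


# Made-up function names standing in for real coverage output.
report = split_by_effect(
    all_functions=["sub_401000", "sub_401100", "sub_4012C9", "sub_401400"],
    baseline_run=["sub_401000"],
    effect_run=["sub_401000", "sub_4012C9"],
)
print(report["effect_only"])
```

The "effect_only" group is usually small, which keeps the analysis focused on the path that matters.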

While "analyze each function to death" sounds awesome, usually you need to tone down the optimism a bit. After all, you are not really interested in disassembling the entire binary just for the sake of it.

If, however, you DO need to analyze the whole binary, I'd suggest waiting for a rainy day, a pack of cigarettes in hand, a bottle of scotch and to go at it. Maybe you won't analyze the whole binary but by the time you are through with the bottle you won't mind either. :P (thanks, Dark Spyrit for the idea, heh!)

Have Phun

live_dont_exist
March 28th, 2012, 02:08
Thanks aimless. My primary aim is to study "malware" and reverse it so I learn how malware operates. Static disassembly as I understand is a part of it. By "static disassembly" all I meant was reverse code enough to study and know every SINGLE thing that the malware is doing. That's all for now

Now say for example with that scope in mind...I get a piece of malware which has 7000 functions. Now I run it and track it etc etc ... I will get say the "50 functions" which are triggered by the malware but will miss all the rest. Could not some of these remaining be important? I keep worrying on how to identify this...from what I understand you're saying, don't worry about those as they will take a lot of time. But is that the fully correct procedure?

This problem becomes worse if I have a DLL, because now I have a piece of malware which just drops a DLL and I have to find out what the DLL does. I can't "run" a DLL, right? Or find out which functions in the DLL are really important unless I look at each one. Is this correct? Or do I just change the DLL's Characteristics field to make it an EXE and run it like an EXE? That somehow did not make sense... please correct me if I am wrong.

The coverage profiler seems very interesting...I will surely take a look at it.

But overall...I hope that makes my problem clearer. Do see if you/anyone can help

Thnx
Arvind
p.s. I don't smoke or drink, so the last suggestion is invalid too

bilbo
March 29th, 2012, 00:10
Quote:
I get a piece of malware which has 7000 functions

I do not think there is a programmer who ever writes as many as 7000 functions in his application/malware!

So I think there are two cases:

(a) most of the functions are brought-in by the linker and belong to some runtime/framework (e.g. MFC): your skill is in recognizing them (maybe from the position in memory, maybe manually applying some IDA signature; e.g. Delphi signatures are not automatically applied), but you obviously must not reverse engineer the whole framework!

(b) some kind of automatic obfuscator has been applied, containing many fake "CALL"s, and your job is to deobfuscate the code before analyzing it

Quote:
I have a piece of malware which just drops a DLL and I have to find out what the DLL does

Before looking at the inner functions of the DLL, I would try to understand what the purpose of the DLL is.
(a) Is it injected into all the applications running on the computer? In that case starting from the EntryPoint is enough (no other functions are exported). You can replace the first byte with a CC (INT 3), and when the exception pops up, attach a debugger, restore the original byte and go on.
(b) Or maybe the DLL is a plugin (e.g. for Explorer): you can find this out by looking at the exported functions, if there are any...
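For case (a), the CC patch can be sketched in Python using only the standard PE header layout. This is a minimal illustration that works on a byte string held in memory; the function names are invented, error handling is deliberately thin, and it assumes the entry point lies inside a section with raw data on disk.

```python
import struct


def entry_point_file_offset(image):
    """Translate AddressOfEntryPoint (an RVA) into a file offset."""
    if image[:2] != b"MZ":
        raise ValueError("missing MZ header")
    pe = struct.unpack_from("<I", image, 0x3C)[0]          # e_lfanew
    if image[pe:pe + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    section_count = struct.unpack_from("<H", image, pe + 6)[0]
    optional_size = struct.unpack_from("<H", image, pe + 20)[0]
    entry_rva = struct.unpack_from("<I", image, pe + 24 + 16)[0]
    table = pe + 24 + optional_size                        # first section header
    for index in range(section_count):
        header = table + 40 * index
        vsize, vaddr, raw_size, raw_ptr = struct.unpack_from("<IIII", image, header + 8)
        if vaddr <= entry_rva < vaddr + max(vsize, raw_size):
            return raw_ptr + (entry_rva - vaddr)
    raise ValueError("entry point is not inside any section")


def patch_entry_with_int3(image):
    """Return (patched image, original byte) with 0xCC at the entry point."""
    offset = entry_point_file_offset(image)
    if offset >= len(image):
        raise ValueError("entry point has no bytes on disk")
    patched = bytearray(image)
    original = patched[offset]
    patched[offset] = 0xCC
    return bytes(patched), original
```

Keep the returned original byte: it has to be written back at the entry point in the debugger before execution continues.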

Best regards, bilbo

blabberer
March 29th, 2012, 10:27
ollydbgs hittrace offers single colorization of instructions executed during trace and can profile them

as to 7000 functions 95 % of them may be some never used code path
excessive exception handlers that were written to handle some rarely-happening edge cases
if it is oop, destructors, garbage collectors and what not that are probably of no use at all to analyse
and as already said you develop an instinct for what is useful and what is appendicitis when you see the code often in its guts

train yourself to focus on the main 5 % code that is useful leave the rest of the crap

if you are into maths think what a convergent series would be like

1/2 + 1/4 + 1/8 + 1/16 + 1/32 + ... tends to 1 at infinity, so you can safely leave the infinity and assume you got 96.875 % right just by concentrating on the first five terms of the series

live_dont_exist
March 29th, 2012, 12:58
Thanks bilbo,blabberer.

I relooked at the main 'installer malware' and that has some 344 'Functions' as per the IDA Functions Window. I am planning to now learn to use Code Coverage tools mentioned by aimless and Olly DBG HitTrace mentioned by blabberer. Once I am clear on the EXE, I will start studying the DLL.

Now here I looked at the DLL in IDA and looked at the 'Functions' menu...and taking into consideration just the sub_<ADDRESS> functions... there were around 1000 such functions. I'll again try and identify "used" functions and study those. So largely, I am just going to make the DLL an EXE and apply the same methods. The other interesting thing about the DLL is that there seem to be no exports at all...so I doubt it is a DLL at all in the first place.

IDA FLIRT signatures say it is this - Using FLIRT signature: Microsoft VisualC 2-8/net runtime

Arvind
p.s...yesterday I had accidentally goofed and was looking at a System DLL which had 7000 lines .. sorry

bilbo
March 30th, 2012, 00:10
Quote:
I'll again try and identify "used" functions and study those

Just concentrate on the initial functions: the high-address ones belong to imported libraries (the Microsoft runtime in your case)... And as Maximus suggested, try to use Olly at the same time (dynamic analysis: stepping into the code), to find out which functions are actually used.

Quote:
The other interesting thing about the DLL is that there seem to be no exports at all...so I doubt it is a DLL at all in the first place

Read case (a) that I wrote before again: it is an injected DLL, with just one exported function, its entry point!

Quote:
I am just going to make the DLL an EXE and apply the same methods

Simply write an application which loads the DLL for starting the dynamic analysis. In Perl something like
Code:
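use strict;
use warnings;
use Win32;

# Loading the DLL runs its entry point inside this Perl process
my $handle = Win32::LoadLibrary('target_dll_path')
    or die "LoadLibrary failed: " . Win32::FormatMessage(Win32::GetLastError());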


Best regards, bilbo

P.S. If it is dangerous malware, I'd work inside a virtual machine when performing dynamic analysis, to avoid damage to your computer.