Log in

View Full Version : heuristic scanning


nikolatesla20
September 19th, 2003, 10:11
I've searched the net, and haven't really found much info on how to actually implement heuristic scanning...

Um, the reason - I want to build an advanced scanner for unpacking purposes.

Does anyone know of some good sites by chance? Is heuristic scanning based off of genetic algorithms, or neural network programming, etc?.


-nt20

volodya
September 19th, 2003, 11:15
There are several opportunities as to the implementation.
Lets say you have some sort of signature of the packer. Now you can do the following:
For exact match:
1) Right from the EntryPoint - byte-by-byte comparison - extremely slow and stupid
2) Using binary tree algorithm or Boyer-Moor fast search though the WHOLE file
3) Using syntax analyzer after the event has been triggered and decide if it is true hit or the false one
For heuristic scan:
1) Use some of bioinformatics ideas (e.g. BLAST engine)
2) Train NN. But you have to provide it with the FIXED-LENGTH(!!!) input all the time. E.g. you have to have all the signatures of the SAME number of bytes. And you have to have single binary output NN for each and every packer. The speed of processing will be fast but the idea is not that easy to implement...
3) Your genetic algorithms... Mmm... Never thought about it!
Anyway, we will try to implement sth like this in the PE Sniffer new version (ships with PE Tools).

nikolatesla20
September 19th, 2003, 11:44
Hi volodya,

Thanks for the answers, nice to see some good ideas going into PESniffer.

The whole reason for my query is I have a basic OEP detection tool, but I wish to improve upon it, to make it more robust , i.e, reliable and accurate.

Most PE tools out there right now only scan for packer detection, my tool will scan a program in memory for it's OEP code, to aid in unpacking tasks. It works quite well already but there will be some small things that will need to be addressed.

what I'd like to do is create this program and a method by which I can simply feed a whole directory of EXE's to the program for analysis, and it would gather an analysis of all the EXE entrypoints code, and "learn" or build a best-match type of database for when it scans an unknown target in memory (or a program that has been dumped to disk)

So in actuality, it isn't signatures of packers I am looking for, it's signatures of programs written with common compilers. (this should be much easier). I was actually thinking a set of regular expressions might work just as well, although they require some manual labor at creating..

ALso, "stolen bytes" packers are not a problem, I'm taking the signatures from around 32 to 50 bytes down from entry (on normal programs that aren't packed, I grab the entrypoint signatures)

So basically doing a type of FLIRT like IDA does..having byte sequences with wildcard entries. What is your opinion of this idea? Just as a side note, I've created a very basic tool already and used it a couple of times, and it's worked

-niko

volodya
September 19th, 2003, 14:13
Idea is pretty interesting, but not that simple. First of all, IDA uses not only byte-sequenses but CRC16 and the control byte in addition. And you should take into account relocatable bytes as well. The problem will arise when playing with dll.
Therefore some really good heuristic algorithm needed. The problem is not to search for OEP in memory, it is pretty trivial, the problem as I see it is to CORRECTLY detect the packer. Let me read about BLAST engine a little bit more and we discuss it a little bit wider.

nikolatesla20
September 19th, 2003, 14:59
regarding relocatable bytes:

In some targets, yes, you can't get around these. I would simply comment these out as wildcards, hence the scan engine would simply ignore them. I'm only looking for a sequence of some type. Most compilers I've seen use the exact same startup code every time they compile except for maybe one or 2 bytes or so.

Also, I was only picking segments of code that didn't contain relocatable offsets, for example, making sure there were only reg based offsets in the whole block if there were offsets at all.

Yes, I read about IDA's flirt, I'm just not sure you'd have to go to the same level when only searching for OEP since entrypoint code is usually simpler, and doesn't contain duplicates. At least, in most of my tests, I haven't found any. (for example, scanning the whole file only will return with 1 result, the OEP area)

Yes, I see perhaps you are correct in detecting OEP in memory is rather trivial..but I think it's a good tool for a user to quickly find
OEP without needing to do anything else special.

I would however like to say hats off to the authors of PEScan and PETools, very nice work. Keep it up, we need people like u.


-nt20

volodya
September 19th, 2003, 15:26
First of all - thanks for the kind words.
As to the startup code. Consider the following. Suppose one recompiled crt0.c with some asm shit like:

_asm
{
xchg eax,eax;
mov eax, ebx;
nop;
mov ebx, eax;
xchg ebx, ebx;
}

you got the idea. Tell me - will you search engine be able to recognise such shit? BTW, there are several utilities which are able to do sth like this automatically. Consider, for example, Stealth PE 1.01/Hide PE 1.0 - you may download them from wasm or from hxxp://bgcorp.narod.ru/download.htm.

nikolatesla20
September 19th, 2003, 16:19
Great, I'll take a look at them when I have time, thank you

-niko

SiNTAX
September 19th, 2003, 16:32
Don't know if it's of any use to you, so I'll just post it and let you figure it out

Description of Scan engine of ClamAV (open source anti virus):
hxxp://clamav.elektrapro.com/doc/html/node41.html

Clandestiny
September 19th, 2003, 18:53
Hiya,

This is an interesting topic... You mentioned not being able to find a lot of information on heuristics... I would like to suggest an interesting avenue for research... Have you considered researching heuristic techniques as they relate to anti-viral cleaning? Incidentally, I'm considering exploring this topic for a project in a graduate course I'm taking and have just started some research into it. In doing this research, I came upon the realization that the obstacles faced in attempting to design a program to heurstically clean infected viral files bears a striking similiarity to the obstacles faced in attempting to design a heuristic OEP locator. Essentially, it is the same problem, albeit with a different objective (ie. ridding an .exe of a "protector" as compared to ridding an .exe of a "virus". Indeed, when you think about it, "protectors" behave similarly to many virii, albeit without the feature of "replication". They append themselves to PE files, redirect the OEP of the original program, encrypt and decrypt themselves dynamically during run-time, and ultimately often return control to their target so that it runs without alerting the user that there's a problem. A good heuristic viral cleaner then, would essentially be a very good generic unpacker and it would necessitate solving the same difficulties of locating the OEP, which may or may not be obsfucted by the virus. For example, no doubt, a good virus would attempt anti-anti techniques similar to the "stolen bytes" trick used in some packers.

As I understand it, heuristics, involves defining a solution based on probablities associated with a "rule-set" and it is this feature which makes it "generic" as opposed to non-heuristic methods which would attempt to identify the OEP with a specific "byte sequence". Each rule would be assigned a certain weight and the sum of the weights at the end of the analsyis would define a probability which may or may not cross a minimum threshold value. Consider the simple case of determining weather or not a program is "packed"... One may look it in a PE editor and observe that the entry point of an .exe is located in the last section in the file. From experience, one would know that this could indicate that the program might be "packed", but it's not enough to make a 100% definate conclusion. One may then note that the section names are not consistent with those generated by common compilers. This would lend a little more weight to the hypothesis that the program is "packed". Finally one might load it up in a debugger and note the absence of common APIs called during the startup of a normal windows program. Once again, this lends more weight. Finally, one may begin tracing and note the presence of self-modifying code and conclude that the program is almost certainly packed. Taking this thought process, we have some rules that we tend to apply when determining weather a program is packed.

1. Location of the entry point at end of PE file
2. non standard section names
3. absence of common APIS called during windows startup
4. presence of SMC

Clearly, some of the rules would hold more weight than others. The presence of SMC would probably hold a very high weight as opposed to the simple presence of non standard section names. Taken by itself, the weight of a single rule may not be enough to determine that a program is packed (or infected with a virus as the case may be), but taken together we may conclude with a high probablity that the program is packed (or infected).

So, as I understand it, the goal is to attempt to generically define common behaviors of how encryptors return control to their targets and then attempt to define a set of rules that will be used to locate potential OEPs. The OEP with the highest weight would then be the one returned as the highest *probable* location of the entry point. Of course, since you are doing an OEP locator and not a full unpacker, if you found more than 1 potential OEP, you could return them all sorted in order of likelihood. Using the signatures of common compilers is a great idea to use for the design of these rules.

Like I said, I just started researching this topic so I don't have any definitive answers. Your post just happened to catch my eye so I thought I would suggest the heuristic AV cleaners as an avenue of research.

Here are a couple of links I found so far:

vx.netlux.org/lib/static/vdat/tumisc72.htm
vx.netlux.org/lib/static/vdat/epheurs1.htm
www.extremetech.com/article2/0,3973,367051,00.asp
www.extremetech.com/article2/0,3973,1166167,00.asp
www.extremetech.com/article2/0,3973,1166168,00.asp


Who would've thought that "crackers" and AV people are working on the same side of the fence,

Cheers,
Clandestiny

nikolatesla20
September 19th, 2003, 19:38
Thank you very much for the links Clandestiny!

Actually yes I was researching all the information about AV scanners, but as you might have found they guard their secrets closely

I will use your links as further research, and glad this topic is of interest to some!

I would have to say the only downside is that many of the higher level authorties on virus scanning think the heiuristics doesn't work at all. (yet)

-nt20

UrgeOverKill
September 23rd, 2003, 07:49
I do believe the Mac Afee already has this built into their products already and its a function you can use automatically.

v0kram
September 26th, 2003, 10:57
Hi,

Isn't that exactly what GenOEP does ?? It scans for pattern bytes of the compilers and tries to guess at the OEP, though it's not very heuristic.

It supports most possible packers, but the new ones with the byte ripping technique can avoid the detection as they rip most of the common bytes.

I doubt whether looking much below the OEP for the common bytes would help in most cases, though again I maybe wrong.

Also, if the bytes are ripped, their is no way to confirm your detection...

nikolatesla20
September 26th, 2003, 11:26
Quote:
[Originally Posted by v0kram]Hi,

Isn't that exactly what GenOEP does ?? It scans for pattern bytes of the compilers and tries to guess at the OEP, though it's not very heuristic.

It supports most possible packers, but the new ones with the byte ripping technique can avoid the detection as they rip most of the common bytes.

I doubt whether looking much below the OEP for the common bytes would help in most cases, though again I maybe wrong.

Also, if the bytes are ripped, their is no way to confirm your detection...


Any unpacker I've worked on only rips at maximum the first 14 or so bytes. Compiler OEP code is much longer than that - all you need to do is a pattern match on the byte sequences farther down. It works - my test program covers VC++ EXE and DLL's, but I can't do delphi yet because I need a better pattern match algo - I.E., I need to get regular expressions to work on binary data. For now I haven't yet figured out how to get regex++ library to work on binary data (data with embedded nulls) properly yet.

GenOEP? I've never heard of it. But that's what the whole point of this thread was, to see if there was already a tool out there to find OEP in a program running in memory.

Just as an example, even on packed programs, DeDe usually always finds the OEP of a Delphi program running in memory.

Packers can't rip too much code from OEP since they would also have to start making too many assumptions. But using a regular expresssion you should be able to easily detect common sequences, in a manner similar to IDA (and yes, you could add checksums too I suppose)

Regarding confirmation: We're not looking for a positive ID of the compiler here (although it still is handy), we are only looking for a byte sequence that is a possible OEP. In my tests so far, with my current pattern matches, you'll never get more than one match. So basically it's a strong pointer to the OEP AREA. NOtice I said AREA. You may still have to find the OEP manually from there, but it puts you right on top of it. Besides, I think many would agree with me, and I"ve done this as well, you can usually ignore ripped OEP bytes and simply place the program's entry point on beginning of the detected area, and the program will still run fine. I can verify for sure that it works on ACProtected apps. In that packer's case- Ripped bytes? who cares, you don't need em.

-niko20

v0kram
September 26th, 2003, 12:20
Hey,

GenOEP comes with PEiD as a plugin...

As for detecting Delphi OEP, its got a bit of different method, GenOEP, PEiD and DeDe all use the same method to both detect Delphi versions and in the former case identify and point to the OEP (Area).

Like you said, getting an approximate area is a good area as it helps out a lot, and yeah, I'd like to see some nice tools for heuristic scanning as well...