Entyzer v0.1 [Advanced Entropy Analyzer]


tHE mUTABLE
April 8th, 2010, 20:11
[Entyzer+ v0.1 Alpha Build:080410]
Mohammed Fadel Mokbel
http://www.themutable.com

Description: Entyzer+ is an Advanced Entropy Analyzer.
- Calculates the Entropy, Redundancy, A. Mean and StdDev for any file.
- Calculates the Entropy and Redundancy for a specific range.
- Generates an HTML Graph Visualization.
- Calculates the Entropy, Redundancy and StdDev. for each section of an ELF binary file.

- Description: Entropy Analyzer

+ Syntax: Entyzer -f <filename>

- To get the Entropy, Redundancy, A. Mean and StdDev. for any file.

+ Syntax: Entyzer -f <filename> -range <start address> <end address>

- To get the Entropy and Redundancy for a specific range.

+ Syntax: Entyzer -f <filename> -graph <IsValue> <Color Template>

- To generate an HTML graphical visualization of the supplied file.

- IsValue takes either 0 or 1. 1 for having the frequency of each character displayed, 0 otherwise.

- Color Template takes a value between 1 and 7 for different templates:

1:= Gray I, 2:= Gray II, 3:= Tan, 4:= Olive Green, 5:= Blue,

6:= Green + Green + Yellow, 7:= Orange + Orange + Yellow

+ Syntax: Entyzer -elf <filename>

- To get the Entropy, Redundancy and StdDev. for each section of an ELF binary file.
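
For readers curious about the math behind these numbers, here is a minimal Python sketch of the usual formulas computed over a file's byte histogram (my own illustration, not Entyzer's source; in particular, the redundancy definition 1 - H/8 is an assumption):

Code:
# Illustration of the kind of quantities Entyzer reports; NOT Entyzer's code.
# Shannon entropy (bits/byte), redundancy, arithmetic mean and standard
# deviation, all computed over the file's byte values. Assumes a non-empty file.
import math
import sys
from collections import Counter

def analyze(path):
    data = open(path, "rb").read()
    n = len(data)
    counts = Counter(data)                    # frequency of each byte value
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    redundancy = 1.0 - entropy / 8.0          # one common definition: unused fraction of 8 bits
    mean = sum(data) / n                      # arithmetic mean of byte values
    stddev = math.sqrt(sum((b - mean) ** 2 for b in data) / n)
    return entropy, redundancy, mean, stddev

if __name__ == "__main__":
    e, r, m, s = analyze(sys.argv[1])
    print(f"Entropy:    {e:.5f} bits/byte")
    print(f"Redundancy: {r:.5%}")
    print(f"A. Mean:    {m:.3f}")
    print(f"StdDev:     {s:.3f}")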

Stay tuned for the paper.

All your suggestions, comments and feedback are welcome.

dELTA
May 6th, 2010, 04:56
Thanks (and sorry for late reply).

CRCETL:
http://www.woodmann.com/collaborative/tools/Entyzer

tHE mUTABLE
December 22nd, 2010, 22:47
Entyzer+ - Revision History
===========================

Note {Non Functional Changes}
----

[?] A new build was released on December 26, 2010, with static linking only.


Version 0.2 {Fermions Build:221210}
-----------

[?] Released on (December 22, 2010).
[?] Major update with lots of new fine-grained options.
[+] Added PE file format parsing for reporting the Entropy and
other statistical information.
[+] Added block selection option wherever applicable.
[+] Added XML report generation for reporting general and Entropy
information, percentage and frequency of every hex value.
[+] Added fine-grained options (-select) for parsing ELF binaries.
[+] Added "Symbiotic Differential Comparison Algorithm" (-SDCAlg)
[+] Added Kullback-Leibler Divergence (KLD) measure. The implementation
also reports the Resistor Average (RA) distance, which symmetrizes KLD.
[+] Added various mathematical hex transformation options (-h:hex).
[+] Added simple Encryption/Decryption module.
[+] Added the capability to generate an unsigned C/C++ hex char
byte array.
[+] Added icon to the executable file.


Version 0.1 {Alpha Build:080410}
-----------

[?] First Public Release (April 08, 2010).
[?] Supported reporting the entropy for every section of an ELF
binary file, along with other statistical analysis, and HTML
graph output using a matrix representation.


General Info.
-------------

- This tool is part of the paper "An Unobtrusive Entropy Based Compiler
Optimization Comparator".
- Please refer to the paper for more information about Entropy, the Matrix
Graphical Representation, the Symbiotic Differential Comparison Algorithm,
the Kullback-Leibler Divergence measure and the Resistor Average distance.

----------------------------------------------------------------------------------

_____________________________________________________________________

[Entyzer+ v0.2 - Fermions Build:221210]
[Advanced Entropy Analyzer]
<All Rights Reserved (C) 2010>
_____________________________________________________________________

- Description: Entropy Analyzer+ with Hex Editing Capabilities (-h:hex)

+ Syntax: Entyzer -f <filename> [ -b <start_offset> <size> ]

- To get the Entropy, Redundancy, A. Mean and StdDev. for any file
or for a specific block.

+ Syntax: Entyzer -f <filename> -graph <IsValue> <Color Template>

- To generate an HTML graphical visualization of the supplied file.

- IsValue takes either 0 or 1. 1 for having the frequency of each
character displayed, 0 otherwise.

- Color Template takes a value between 1 and 7 for different templates:
1:= Gray I, 2:= Gray II, 3:= Tan, 4:= Olive Green, 5:= Blue,
6:= Green + Green + Yellow, 7:= Orange + Orange + Yellow
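
To give a feel for the matrix-style visualization, here is a very rough Python sketch of the general idea (a grid of shaded cells, one per byte); Entyzer's actual -graph output, color templates and layout will certainly differ:

Code:
# Rough sketch of a matrix-style HTML byte map; NOT Entyzer's -graph format.
import sys

def html_matrix(path, cols=64, show_values=False):
    data = open(path, "rb").read()
    cells = []
    for i, b in enumerate(data):
        shade = 255 - b                       # darker cell = higher byte value
        label = f"{b:02X}" if show_values else "&#160;"
        cells.append(f'<td style="background:rgb({shade},{shade},{shade})">{label}</td>')
        if (i + 1) % cols == 0:               # start a new row every `cols` bytes
            cells.append("</tr>\n<tr>")
    return "<table><tr>" + "".join(cells) + "</tr></table>"

if __name__ == "__main__":
    sys.stdout.write(html_matrix(sys.argv[1]))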

+ Syntax: Entyzer -f <filename> -xml

- To generate an XML report: general and Entropy information, percentage
and frequency of every hex value.

+ Syntax: Entyzer -pe <filename>

- To get the Entropy, Redundancy and StdDev. for every section of a
PE binary file.

+ Syntax: Entyzer -elf -section -<option> <filename>

- <option> = list, To list all the section names of an ELF binary file.
<option> = all, To get the Entropy, Redundancy and StdDev.
for every section of an ELF binary file.

<option> = select, Option select is followed by a <section_name>.
To get the Entropy, Redundancy and StdDev. for a selected
section of an ELF binary file. (e.g. section_name = .text)

+ Syntax: Entyzer -elf -SDCAlg <filename * 5>

- To apply the Symbiotic Differential Comparison Algorithm on a reference
ELF binary file and four files compiled at increasing optimization
levels. Only the .text section is considered.

+ Syntax: Entyzer -elf -section -select <section_name> -KLD <filename * 2>

- To apply the Kullback-Leibler Divergence (KLD) measure on two ELF files
for a selected section. The implementation also reports the Resistor
Average (RA) distance, which symmetrizes KLD.

+ Syntax: Entyzer -f -KLD <filename * 2>

- To apply KLD and RA on any two files.
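
For the curious, here is a minimal Python sketch of these two measures over 256-bin byte distributions (my illustration of the textbook definitions, not Entyzer's code; the additive smoothing for zero-count bytes is my assumption, since a zero probability in the denominator has to be avoided somehow):

Code:
# Sketch of KLD and the Resistor Average distance; NOT Entyzer's code.
import math
import sys
from collections import Counter

def byte_dist(path, alpha=1e-9):
    # Byte-value distribution with additive smoothing (my assumption)
    # so that no bin has probability zero.
    data = open(path, "rb").read()
    counts = Counter(data)
    total = len(data) + 256 * alpha
    return [(counts.get(b, 0) + alpha) / total for b in range(256)]

def kld(p, q):
    # Kullback-Leibler divergence D(P||Q) in bits; note it is asymmetric.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def resistor_average(p, q):
    # Symmetrization: 1/RA = 1/D(P||Q) + 1/D(Q||P), i.e. the two
    # divergences combine like parallel resistors.
    d_pq, d_qp = kld(p, q), kld(q, p)
    return 0.0 if d_pq + d_qp == 0 else (d_pq * d_qp) / (d_pq + d_qp)

if __name__ == "__main__":
    p, q = byte_dist(sys.argv[1]), byte_dist(sys.argv[2])
    print(f"D(P||Q) = {kld(p, q):.6f} bits")
    print(f"D(Q||P) = {kld(q, p):.6f} bits")
    print(f"RA      = {resistor_average(p, q):.6f} bits")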

[?] To list the hex transformation options, use the sub-option -h:hex

+ Syntax: Entyzer -f <filename> -hext: <operation> <operand>

[ -b <start_offset> <end_offset> ]

- To apply various mathematical hex transformations (operations) on
a specific file. All the operations work at the byte level. If the
block (-b) option is specified, the transformation operates only on
the range given by the start offset (SO) and end offset (EO);
otherwise the whole file is taken. <operand> accepts a decimal value
between 0 and 255.

- The <operation> can take any of the following transformations:

+ {mod, neg, div, mult, sub, add} (neg takes no operand)

+ Binary operations: {xor, or, and, inv} (inv takes no operand)

+ {sleft, sright, rotl, rotr} => Shift/Rotate Left/Right

# ex. [... -hext: xor 4 -b 10 20]

+ {rand} (Randomize takes two operand values: Min and Max)

+ {t1e} (The (t1e) encryption/decryption template module)

# Takes 3 operand values: 'x', 'y' and 'z'
# t1e := {add x, xor y, sub z} - t1d := {add z, xor y, sub x}
# ex. To encrypt: [... -hext: t1e x y z]
# To decrypt: [... -hext: t1e z y x]
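
As a sanity check of the encrypt/decrypt relationship above (decryption is just t1e with the operands reversed), here is a toy byte-level Python sketch; arithmetic is modulo 256:

Code:
# Toy sketch of the t1e template described above; NOT Entyzer's code.
def t1e(data: bytes, x: int, y: int, z: int) -> bytes:
    # t1e := {add x, xor y, sub z}, applied per byte, modulo 256
    return bytes(((((b + x) & 0xFF) ^ y) - z) & 0xFF for b in data)

plain = b"Entyzer"
cipher = t1e(plain, 0x11, 0x22, 0x33)          # encrypt: -hext: t1e x y z
assert t1e(cipher, 0x33, 0x22, 0x11) == plain  # decrypt: -hext: t1e z y x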

+ Syntax: Entyzer -f <filename> -cpp [ -b <start_offset> <end_offset> ]

- To generate an unsigned C/C++ hex char byte array.
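
For illustration, output along these lines can be produced with a few lines of Python (the array name and the exact formatting Entyzer emits may differ):

Code:
# Rough sketch of -cpp-style output; Entyzer's exact formatting may differ.
import sys

def to_cpp_array(path, name="data", per_line=12):
    buf = open(path, "rb").read()
    hexed = [f"0x{b:02X}" for b in buf]
    rows = [", ".join(hexed[i:i + per_line]) for i in range(0, len(hexed), per_line)]
    return (f"unsigned char {name}[{len(buf)}] = {{\n    "
            + ",\n    ".join(rows) + "\n};")

if __name__ == "__main__":
    print(to_cpp_array(sys.argv[1]))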


[----------------------------------------------------]

+ Entyzer.exe Signature:

- 32-Bit: MD5 DBF13E1D00D396DD4A8A2A27C28191CE
- 64-Bit: MD5 8170A5D78173993EC00ED33ADDB33BE4

+ Libraries used:

- ELFIO library by Serge Lamikhov
- MD5 Library by Benjamin Grüdelbach

[----------------------------------------------------]

The paper "An Unobtrusive Entropy Based Compiler Optimization Comparator" is available at:
http://themutable.com/Pubs/Mokbel_CASCON_10_V0.5.pdf

Silkut
December 23rd, 2010, 06:21
I updated the entry concerning your tool on the CRCETL.

http://www.woodmann.com/collaborative/tools/Entyzer

Cheers tHE mUTABLE

Silkut
December 23rd, 2010, 09:23
I tested your tool on Win7 64-bit, fully patched, and it's spitting an error.

MSVCP100.dll is missing. I guess it's from the Visual C++ redistributable package =/

tHE mUTABLE
December 23rd, 2010, 12:07
Quote:
[Originally Posted by Silkut;88667]I tested your tool on Win7 64-bit, fully patched, and it's spitting an error.

MSVCP100.dll is missing. I guess it's from the Visual C++ redistributable package =/


Thanks Silkut! Could you please try placing MSVCP100.dll in the same folder as Entyzer and see if the error disappears?

GamingMasteR
December 24th, 2010, 10:46
Hello Mohammed,

It would be good to compile with the /MT option; I think the increase in size will be trivial.

tHE mUTABLE
December 26th, 2010, 11:55
Quote:
[Originally Posted by GamingMasteR;88678]Hello Mohammed,

It would be good to compile with the /MT option; I think the increase in size will be trivial.


Thanks! A new build with static linking is up (the CRCETL version has been updated as well)!

Silkut
December 27th, 2010, 03:46
Yay,

Sorry for not trying with the DLL, family stuff you know...
Works great now, thanks for the fix!

niaren
December 27th, 2010, 17:09
First of all, congratulations! It is a very nice tool, and the description speaks for itself.

I took a quick look at the paper linked in the tool's description. I'm curious to know what type of statistical model you apply and whether or not you make inference based on that model (Kullback-Leibler divergence is sometimes used for this). It appears that with your black-box approach it is very convenient (and practical) to treat the code section as a pile of bytes, with the result that the statistical model is very simple. Considering that code contains a lot of structure, it is interesting how much you can infer about the binary from the simple model you have chosen. As such, your method could just as well be used to examine other data, such as images, audio and/or video that undergo severe compression.


Anyway, I have a few questions.

- If a compiler performs optimizations (speed-wise) by unrolling, would you expect the entropy of the underlying random variable to increase, decrease or stay the same?

The reason I'm asking is that, as far as I understand, you assume that compiler optimizations lead to increased entropy. Sorry, I just found that on page 2, end of paragraph 2, you write "..., a loop unrolling will yield bigger entropy..". Intuitively, I would have thought it would decrease. Can you give an explanation of this statement?

In the description of table 2 it is written "We notice how the entropy value is increasing as the level of optimization increases. That's not only due to the increase in file size but....". You seem to assume that an increase in file size implies higher entropy. From a theoretical point of view ( yawn ) I would say that they have nothing to do with each other, and what you are seeing is maybe caused by the 'small sample-size effect'. If you consider a Gaussian random variable, its entropy stays the same no matter how many samples you draw from it. However, if you try to estimate its entropy, you will have fluctuating results (high variance) in your estimates when your sample size is small.
The same comment goes for the statements "...less instructions and the entropy value decreased to reflect this alteration..." and "...hence the decrease in the entropy value.." in the paragraph below table 3.
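
To make the small sample-size effect concrete, here is a toy numpy sketch (using a uniform byte source instead of a Gaussian; the true entropy is exactly 8 bits/byte, yet the plug-in estimate fluctuates and is biased low for small n):

Code:
# Toy experiment: the source entropy is fixed, but histogram-based
# ("plug-in") estimates fluctuate and are biased low for small samples.
import numpy as np

rng = np.random.default_rng(0)

def plugin_entropy(sample):
    counts = np.bincount(sample, minlength=256)
    p = counts[counts > 0] / len(sample)
    return float(-(p * np.log2(p)).sum())

for n in (100, 1000, 10000, 1000000):
    est = [plugin_entropy(rng.integers(0, 256, n)) for _ in range(20)]
    print(f"n={n:>8}: mean={np.mean(est):.4f} bits  std={np.std(est):.4f}")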

Does your tool in any way take into consideration the size of the code section?

Hope to try out the tool soon

tHE mUTABLE
December 29th, 2010, 15:10
@niaren. Thank you for your interest and your insightful comments!

Quote:
As such, your method could just as well be used to examine other data, such as images, audio and/or video that undergo severe compression.


Yes, it can be used to examine different data types as well because of the high level of abstraction the model embodies.

Quote:
If a compiler performs optimizations (speed-wise) by unrolling, would you expect the entropy of the underlying random variable to increase, decrease or stay the same?


Perhaps I should have elaborated more on this specific statement (something to do when I update the paper!). Actually, this case is more complicated than it seems. For a perfectly homogeneous unrolling, the Entropy would stay the same. However, that is not always the case, since the level of noise in the actual transformation exhibits different code fluctuations from one optimization level to another; hence, in our case, the Entropy increases (non-homogeneous). On the other hand, it is also possible for the Entropy to decrease in cases where we have an almost perfectly homogeneous distribution (with very few repetitive anomalies), considering that the 'reference' is the perfectly homogeneous distribution.

Please make sure to go over the paper completely, since that might answer some or all of your questions! I have mentioned that "The entropy value is bounded to the actual optimization properties", and I've emphasized this point in many other places in the paper.

Quote:
In the description of table 2 it is written "We notice how the entropy value is increasing as the level of optimization increases. That's not only due to the increase in file size but....". You seem to assume that an increase in file size implies higher entropy. From a theoretical point of view ( yawn ) I would say that they have nothing to do with each other, and what you are seeing is maybe caused by the 'small sample-size effect'. If you consider a Gaussian random variable, its entropy stays the same no matter how many samples you draw from it. However, if you try to estimate its entropy, you will have fluctuating results (high variance) in your estimates when your sample size is small.


Sure, the file size plays a crucial role in the entropy; it is an inherent mathematical characteristic of the equation (it depends on the distribution). The correlation is obvious, since the size appears in the denominator. I'm not generalizing this observation, since it is different for every distribution.

The binaries of the benchmarks can reach up to 11MB, so we're definitely not talking about a "small" sample size. Note that the probability distribution here is discrete, while a Gaussian is continuous!

Quote:
The same comment goes for the statements "...less instructions and the entropy value decreased to reflect this alteration..." and "...hence the decrease in the entropy value.." in the paragraph below table 3.


It is true by definition (it depends on the homogeneity of the distribution).

Quote:
Does your tool in any way take into consideration the size of the code section?


Well, the analysis is based solely on the code section. As for the tool, you can choose whatever section you want to get the Entropy for (via the '-select' sub-option), or get the Entropy for the file as a whole.

I hope that answers your questions.

niaren
January 1st, 2011, 15:03
Thanks for the explanation
I'm not sure I understand it, or at least not every detail of it. Confused on a higher level, I would say

Quote:
[Originally Posted by tHE mUTABLE;88740]
However, that is not always the case, since the level of noise in the actual transformation exhibits different code fluctuations from one optimization level to another; hence, in our case, the Entropy increases (non-homogeneous). On the other hand, it is also possible for the Entropy to decrease in cases where we have an almost perfectly homogeneous distribution (with very few repetitive anomalies), considering that the 'reference' is the perfectly homogeneous distribution.


I don't understand your coupling between homogeneous/non-homogeneous distributions and the increase/decrease in entropy. Maybe it is because I haven't heard of a homogeneous distribution before. I looked here
http://en.wikipedia.org/wiki/Homogeneous_distribution
but I still have a hard time seeing the relevance to your results.
Is a Gaussian distribution homogeneous?
Is a mixture distribution of two Gaussians homogeneous?
but most importantly what does it mean for the output of your tool?


However, thinking twice about what you're actually doing, I now see that it makes sense, at least to me, to think of your tool as a classifier. Your 'classifier' is a little special in that the feature extraction stage reduces the (high-dimensional) input to only one scalar, and as such the actual classification stage is reduced to thresholding of that scalar.
Out of curiosity I made a small experiment. I took an exe file, 7z.exe (http://www.7-zip.org/), and made a UPX-packed version as well. Then the contents of those two files, from offset 0 to EOF, were interpreted as a time series, x(t), where each sample is 32 bits, treated as a Q31 fixed-point number. Then scatter plots were made of x(t) vs. x(t-1). In the image below, the left plot shows the scatter plot for the raw 7z.exe and the right plot shows the scatter plot for the UPX-packed version.


http://i52.tinypic.com/35lgjo0.jpg


As another experiment the below plot shows the autocorrelation function for lags 1 to 1000. Blue line is for raw exe and red line is for packed version.

http://i52.tinypic.com/29vh4du.jpg
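
For reference, the two experiments can be reproduced with something like the following numpy/matplotlib sketch (my preprocessing may differ in details from what produced the plots above):

Code:
# Sketch of the experiments above; preprocessing details may differ.
import numpy as np
import matplotlib.pyplot as plt

def q31_series(path):
    raw = open(path, "rb").read()
    n = len(raw) // 4 * 4                       # trim to whole 32-bit samples
    ints = np.frombuffer(raw[:n], dtype="<i4")  # little-endian int32
    return ints.astype(np.float64) / 2**31      # Q31 -> [-1, 1)

x = q31_series("7z.exe")                        # same for the UPX-packed copy

# scatter plot of x(t) vs x(t-1)
plt.scatter(x[1:], x[:-1], s=1)
plt.xlabel("x(t)"); plt.ylabel("x(t-1)")
plt.show()

# autocorrelation for lags 1..1000
x0 = x - x.mean()
acf = [float(np.dot(x0[k:], x0[:-k]) / np.dot(x0, x0)) for k in range(1, 1001)]
plt.plot(range(1, 1001), acf)
plt.xlabel("lag"); plt.ylabel("autocorrelation")
plt.show()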

Have you tried other 'features' than the entropy one?

tHE mUTABLE
January 3rd, 2011, 23:32
In the context of the paper, the word "homogeneous" as defined mathematically (in the wiki link you provided) has nothing to do with what I was referring to. I simply meant uniformity in the structure of the distribution ("composed of similar or identical parts or elements", as the dictionary defines it).

Quote:
However, thinking twice about what you're actually doing, I now see that it makes sense, at least to me, to think of your tool as a classifier. Your 'classifier' is a little special in that the feature extraction stage reduces the (high-dimensional) input to only one scalar, and as such the actual classification stage is reduced to thresholding of that scalar.


You perfectly nailed it!

Quote:
Have you tried other 'features' than the entropy one?


Well, there are two other methods mentioned in the paper: the "Symbiotic Differential Comparison (SDC) Algorithm" and "The Complete Juxtaposition of All Optimization Levels Using Kullback-Leibler Divergence (Relative Entropy)".

Nonetheless, there are still other mathematical formulations which could be used, based on clustering and classification. However, because the feature is a one-dimensional scalar (the byte representation only), it becomes very hard to draw any meaningful generalizations without making lots of exceptions. I've tried!

If I were to take the semantic meaning of the distribution, that is, the assembly instructions, or to work on the actual generated listing instead of the byte distribution, then a lot could be done. We've brainstormed some potential ideas in this direction!

tHE mUTABLE
July 3rd, 2011, 16:44
Entyzer+ - Revision History
===========================

Version 0.3 {Geometry Build:030711}
-----------

[?] Released on (July 03, 2011).
[?] Major update with 7 new generic features related to distance metrics.
[+] Added 'Rolling XOR' (... -hext: -rxor) for performing xor with
various key sizes.
[+] Added various mathematical distance metrics (-h:stat); a few of
them are sketched after this changelog (check 'Help.html' for
more info.):
[+] Simpson's Index
[+] Canberra's Distance
[+] Sorensen's Distance
[+] Minkowski's Distance of Order, Lambda = 3
[+] Manhattan's Distance, Lambda = 1
[+] Pearson's Test-Statistic (Chi-Square Test).
[!] Fixed a miscalculation of the covered size in the Entropy range option.
[!] Fixed a bug in the '-hext' '-b' option: when End Offset < Start
Offset, an uncaught exception was thrown. An error message is now
issued instead!
[?] Other minor architectural improvements and feature clarifications.
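
As mentioned in the changelog, here is a plain-Python sketch of a few of the new distance metrics, computed on 256-bin byte-frequency vectors (my illustration of the standard definitions; Entyzer's exact definitions and normalization may differ, see 'Help.html'):

Code:
# Sketch of some of the v0.3 distance metrics on byte-frequency vectors;
# standard textbook definitions, not necessarily Entyzer's exact ones.
from collections import Counter

def byte_freq(path):
    data = open(path, "rb").read()
    c = Counter(data)
    return [c.get(b, 0) / len(data) for b in range(256)]

def manhattan(p, q):                 # Minkowski with Lambda = 1
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, lam=3):          # Minkowski's Distance of order Lambda
    return sum(abs(a - b) ** lam for a, b in zip(p, q)) ** (1 / lam)

def canberra(p, q):                  # skips bins where both entries are zero
    return sum(abs(a - b) / (a + b) for a, b in zip(p, q) if a + b > 0)

def sorensen(p, q):                  # a.k.a. Bray-Curtis dissimilarity
    return sum(abs(a - b) for a, b in zip(p, q)) / sum(a + b for a, b in zip(p, q))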
