Control Flow Deobfuscation Part 1 [Archive] - RCE Messageboard's Regroupment

fr33ke

January 13th, 2008, 09:50

Control Flow Obfuscation is one of the biggest protection mechanisms used in commercial protections for intermediate languages like IL (.NET) and Java bytecode. It is also used a bit in some native protectors, mostly to hide their own code, but I won't go into that. I'll use pseudocode for the examples.

The basic idea behind control flow obfuscation is that spaghetti code is hard to read. So the protectors transform normal code into spaghetti code, with lots of spurious branches. You can repair this by hand, but as functions get bigger this becomes a huge PITA. So we must write a program to do it for us.

I'll assume we have access to the bytecode in the program. Also we'll make use of some primitive operations on this bytecode. These problems are straightforward to tackle, all you need is the documentation and some (much

) time.

To reorganize the function we will use a two-step approach:

Make a Control Flow Graph from the bytecode.
Make bytecode from the CFG.

You might think "That doesn't do anything!", but the trick is that we read in bytecode that is obfuscated and write out bytecode that is simple to read. Both form the same control flow graph, so the meaning is not changed.

English example:

Obfuscated: "The bike, that is owned by John, who has blue eyes, needs to be repaired."
Graph:
http://img259.imageshack.us/img259/3913/graph1zo4.png
Simplified: "John has blue eyes. His bike needs to be repaired."

Of course this isn't a perfect analogy but I hope you see what I mean.

Building the CFG
Let's see what we want from the CFG:

Have vertices with some amount of bytecode
Have edges that show the connection between vertices.

E.g. if vertex 1 branches to vertex 2 we want an edge from 1 to 2. Note that this edge has a direction: it doesn't go from 2 to 1.
We need a label on the edge that shows the type of connection. For instance if vertex 1 branches to vertex 2 if some condition is true, and to vertex 3 if it is false, we would have this graph:
http://img166.imageshack.us/img166/4302/graph2iy9.png

Every vertex has one entry point and one exit point.

(if this is all gibberish for you, please read up on graphs before continuing)

The last point ensures our CFG is a proper Directed Graph.
In practice you'll want to keep the edge information with the vertex it comes from, but this changes nothing to the basic ideas.

With this info, we can make a simple recursive function to construct the CFG:

Code:

already_done_vertices = empty

cfg_of_function = makecfg(entry_point_of_function)



function makecfg(current_instruction):

    vertex myvertex



    if current_instruction is in already_done_vertices

        myvertex = the appropriate vertex in already_done_vertices



    else

        myvertex.bytecode   = bytecode of current_instruction

        myvertex.branchtype = branch type of current_instruction

        myvertex.edges    = empty

        add myvertex to already_done_vertices

        for each target of current_instruction

            add makecfg(target) to myvertex.edges



    return myvertex

Note that the bytecode of a branch is nothing. Its only significance is in the branchtype and the targets. Most instructions have a branch type of "always" with a target that is the next instruction. The branch type relates to the targets, f.i. "top_of_stack != 0" may go to the first target if true, or the second target if false.

Of course in a real implementation you want to keep the number of vertices and edges low, so you lump together as many instructions as you can without breaking the rules, splitting them when necessary.

Well, that's it for now, in part 2 (and maybe 3) I'll show how to make bytecode from the CFG.