fr33ke
January 13th, 2008, 09:50
Control Flow Obfuscation is one of the biggest protection mechanisms used in commercial protections for intermediate languages like IL (.NET) and Java bytecode. It is also used a bit in some native protectors, mostly to hide their own code, but I won't go into that. I'll use pseudocode for the examples.
The basic idea behind control flow obfuscation is that spaghetti code is hard to read. So the protectors transform normal code into spaghetti code, with lots of spurious branches. You can repair this by hand, but as functions get bigger this becomes a huge PITA. So we must write a program to do it for us.
I'll assume we have access to the bytecode in the program. Also we'll make use of some primitive operations on this bytecode. These problems are straightforward to tackle, all you need is the documentation and some (much
) time.
To reorganize the function we will use a two-step approach:
English example:
Building the CFG
Let's see what we want from the CFG:
The last point ensures our CFG is a proper Directed Graph.
In practice you'll want to keep the edge information with the vertex it comes from, but this changes nothing to the basic ideas.
With this info, we can make a simple recursive function to construct the CFG:
Note that the bytecode of a branch is nothing. Its only significance is in the branchtype and the targets. Most instructions have a branch type of "always" with a target that is the next instruction. The branch type relates to the targets, f.i. "top_of_stack != 0" may go to the first target if true, or the second target if false.
Of course in a real implementation you want to keep the number of vertices and edges low, so you lump together as many instructions as you can without breaking the rules, splitting them when necessary.
Well, that's it for now, in part 2 (and maybe 3) I'll show how to make bytecode from the CFG.
The basic idea behind control flow obfuscation is that spaghetti code is hard to read. So the protectors transform normal code into spaghetti code, with lots of spurious branches. You can repair this by hand, but as functions get bigger this becomes a huge PITA. So we must write a program to do it for us.
I'll assume we have access to the bytecode in the program. Also we'll make use of some primitive operations on this bytecode. These problems are straightforward to tackle, all you need is the documentation and some (much

To reorganize the function we will use a two-step approach:
You might think "That doesn't do anything!", but the trick is that we read in bytecode that is obfuscated and write out bytecode that is simple to read. Both form the same control flow graph, so the meaning is not changed.
Make a Control Flow Graph from the bytecode.
Make bytecode from the CFG.
English example:
Of course this isn't a perfect analogy but I hope you see what I mean.
Obfuscated: "The bike, that is owned by John, who has blue eyes, needs to be repaired."
Graph:
http://img259.imageshack.us/img259/3913/graph1zo4.png
Simplified: "John has blue eyes. His bike needs to be repaired."
Building the CFG
Let's see what we want from the CFG:
(if this is all gibberish for you, please read up on graphs before continuing)
Have vertices with some amount of bytecode
Have edges that show the connection between vertices.
E.g. if vertex 1 branches to vertex 2 we want an edge from 1 to 2. Note that this edge has a direction: it doesn't go from 2 to 1.
We need a label on the edge that shows the type of connection. For instance if vertex 1 branches to vertex 2 if some condition is true, and to vertex 3 if it is false, we would have this graph:
http://img166.imageshack.us/img166/4302/graph2iy9.png
Every vertex has one entry point and one exit point.
The last point ensures our CFG is a proper Directed Graph.
In practice you'll want to keep the edge information with the vertex it comes from, but this changes nothing to the basic ideas.
With this info, we can make a simple recursive function to construct the CFG:
Code:
already_done_vertices = empty
cfg_of_function = makecfg(entry_point_of_function)
function makecfg(current_instruction):
vertex myvertex
if current_instruction is in already_done_vertices
myvertex = the appropriate vertex in already_done_vertices
else
myvertex.bytecode = bytecode of current_instruction
myvertex.branchtype = branch type of current_instruction
myvertex.edges = empty
add myvertex to already_done_vertices
for each target of current_instruction
add makecfg(target) to myvertex.edges
return myvertex
Note that the bytecode of a branch is nothing. Its only significance is in the branchtype and the targets. Most instructions have a branch type of "always" with a target that is the next instruction. The branch type relates to the targets, f.i. "top_of_stack != 0" may go to the first target if true, or the second target if false.
Of course in a real implementation you want to keep the number of vertices and edges low, so you lump together as many instructions as you can without breaking the rules, splitting them when necessary.
Well, that's it for now, in part 2 (and maybe 3) I'll show how to make bytecode from the CFG.