fr33ke
June 5th, 2008, 06:42
Now we have:
Interestingly, we could also make a obfuscator in this way. All we have to do is put the vertices in a random order.
So to deobfuscate we can't just use any ordering. We have to use one that is 'intuitive' and 'natural'. Unfortunately these words mean nothing to a computer; we'll have to specify them more carefully.
Let's start simple. In which order would you place these vertices if you wanted to deobfuscate?
http://img176.imageshack.us/img176/7269/firstmh5.png
I think we can agree that 1-2-3 is he right order.
3-1-2 is bad because we have to start with 1: this tells everyone that the function starts there.
1-3-2 is bad too, but why? For one it creates more branches than necessary, but I think this is not the reason it is hard to read. I think the reason is that we go backwards from 2 to 3. Most of us like to read in one direction: from top to bottom or from left to right.
Another one:
http://img127.imageshack.us/img127/246/secondco2.png
The right order here is 1-2-4-3 or 1-4-2-3.
1-2-3-4 is wrong because 4 then goes backwards to 3.
We can say: the order is right iff for all edges, the start of the edge comes before the end of the edge. This ordering is known as a topological ordering. There are various ways to compute it; it is important to remember that one is the reverse postorder of a depth-first traversal.
This might all sound like gibberish to you, so feel free to take a pause and look up some of the underlined words. Here's a little pseudocode example for the reverse postorder:
Well, so far so good. But what to think of this one:
http://img136.imageshack.us/img136/3333/thirdnw5.png
I'd say that the best ordering is 1-2-3-5-4. Some compilers order this as 1-5-2-3-4. Even 1-3-5-2-4 isn't unreasonable. But according to our ordering rules, they are all wrong.
If your graph has a cycle, it isn't possible to get a perfect ordering. We can't have 2 before 3, 3 before 5 and 5 before 2 at the same time, so we'll have to go backwards at least one time.
Something like 1-4-2-3-5 is even more wrong though. Now we have two backward edges, 3 to 4 and 5 to 2.
One way to solve the problem is like this:
http://img378.imageshack.us/img378/820/fourthfl1.png
We contract the vertices of the strongly connected component (a generalization of cycles) into one "supervertex". Now we order that and get 1-I-4. Then we order I: first we pick one vertex to start, I like to get the one that has most edges to it from outside I, but it doesn't matter much. In this example we pick 2. Then we order the rest (and if we find sub-SCC's in I we recurse). So here we get 3-5. Then the final order is 1-2-3-5-4.
Now thinking that out is one thing, implementing it is another. The most obvious problem is how to find the SCC's. Luckily some very smart people already figured it out for us. We'll use Tarjan's strongly connected components algorithm. It is a bit hard to understand, but http://www.ics.uci.edu/~eppstein/161/960220.html explains it very clearly.
So, are we done now? Is the following algorithm enough:
/ this is the clever part
Now both Tarjan's algorithm and the reverse postorder are based on depth first search. So maybe it's possible to combine them? It turns out we can indeed do this and save ourself a lot of work. The only modification we need to do is:
So the final algorithm is:
I omitted the choose_first function because it isn't important.
Now we have every part, putting it together is simple:
Well that's it mostly. I wanted to show how to do this so maybe someone doesn't have to spend weeks figuring this stuff out and because I think
is a very nice insight.
P.S. Sorry, my tool to do this for .NET is currently private because I'm not interested in an arms race. The ideas behind it are more important anyway
So to deobfuscate we only need one more thing:
A way to decompose the bytecode of a function into its CFG ("http://www.woodmann.com/forum/blog.php?b=47")
A way to make bytecode from the CFG if we know the order of the vertices ("http://www.woodmann.com/forum/blog.php?b=83")
A way to order the vertices
Interestingly, we could also make a obfuscator in this way. All we have to do is put the vertices in a random order.
So to deobfuscate we can't just use any ordering. We have to use one that is 'intuitive' and 'natural'. Unfortunately these words mean nothing to a computer; we'll have to specify them more carefully.
Let's start simple. In which order would you place these vertices if you wanted to deobfuscate?
http://img176.imageshack.us/img176/7269/firstmh5.png
I think we can agree that 1-2-3 is he right order.
3-1-2 is bad because we have to start with 1: this tells everyone that the function starts there.
1-3-2 is bad too, but why? For one it creates more branches than necessary, but I think this is not the reason it is hard to read. I think the reason is that we go backwards from 2 to 3. Most of us like to read in one direction: from top to bottom or from left to right.
Another one:
http://img127.imageshack.us/img127/246/secondco2.png
The right order here is 1-2-4-3 or 1-4-2-3.
1-2-3-4 is wrong because 4 then goes backwards to 3.
We can say: the order is right iff for all edges, the start of the edge comes before the end of the edge. This ordering is known as a topological ordering. There are various ways to compute it; it is important to remember that one is the reverse postorder of a depth-first traversal.
This might all sound like gibberish to you, so feel free to take a pause and look up some of the underlined words. Here's a little pseudocode example for the reverse postorder:
Code:
reverse_postorder(vertex):
order = []
for each child in vertex.edges
order = reverse_postorder(child) + order
order = vertex + order
return order
Well, so far so good. But what to think of this one:
http://img136.imageshack.us/img136/3333/thirdnw5.png
I'd say that the best ordering is 1-2-3-5-4. Some compilers order this as 1-5-2-3-4. Even 1-3-5-2-4 isn't unreasonable. But according to our ordering rules, they are all wrong.
If your graph has a cycle, it isn't possible to get a perfect ordering. We can't have 2 before 3, 3 before 5 and 5 before 2 at the same time, so we'll have to go backwards at least one time.
Something like 1-4-2-3-5 is even more wrong though. Now we have two backward edges, 3 to 4 and 5 to 2.
One way to solve the problem is like this:
http://img378.imageshack.us/img378/820/fourthfl1.png
We contract the vertices of the strongly connected component (a generalization of cycles) into one "supervertex". Now we order that and get 1-I-4. Then we order I: first we pick one vertex to start, I like to get the one that has most edges to it from outside I, but it doesn't matter much. In this example we pick 2. Then we order the rest (and if we find sub-SCC's in I we recurse). So here we get 3-5. Then the final order is 1-2-3-5-4.
Now thinking that out is one thing, implementing it is another. The most obvious problem is how to find the SCC's. Luckily some very smart people already figured it out for us. We'll use Tarjan's strongly connected components algorithm. It is a bit hard to understand, but http://www.ics.uci.edu/~eppstein/161/960220.html explains it very clearly.
So, are we done now? Is the following algorithm enough:
Yes, in a certain sense. It works fine, but it's quite a lot to code (especially in C) and it's pretty inefficient.
Find all SCC's
Collapse them into supervertices
Order the resulting graph
Apply this algorithm to all supervertices in the result
/ this is the clever part
Now both Tarjan's algorithm and the reverse postorder are based on depth first search. So maybe it's possible to combine them? It turns out we can indeed do this and save ourself a lot of work. The only modification we need to do is:
Whenever we find an SCC, we check if it is trivial (one vertex). If so, we add it to the order. Else we order the SCC and put the result in the order.
So the final algorithm is:
Code:
get_order(first_vertex, vertices_to_consider):
/* Define some variables */
order = []
stack = empty_stack
cur_dfsnum = 0
index = []
low = []
/* Nested function, has access to above variables (closure) */
visit(cur_vertex):
index[cur_vertex] = cur_dfsnum
low[cur_vertex] = cur_dfsnum
cur_dfsnum++
push cur_vertex on stack
for each child in cur_vertex.edges where child in vertices_to_consider
if index[child] == "To be done"
visit(child)
low[cur_vertex] = min(low[cur_vertex], low[child])
else if index[child] == "Done"
/* Do nothing */
else
low[cur_vertex] = min(low[cur_vertex], index[child])
if low[cur_vertex] == index[cur_vertex]
/* we found an SCC */
scc = []
do
popped = pop from stack
scc += popped
index[popped] = "Done"
until popped == current_vertex
if scc.length == 1
order = cur_vertex + order
else
order = get_order(choose_first(scc), scc) + order
/* visit() ends */
for vertex in vertices_to_consider:
index[vertex] = "To be done"
/* Special-case the start vertex to prevent infinite recursion */
index[first_vertex] = "Done"
for each child in first_vertex.edges where child in vertices_to_consider
if index[child] == "To be done"
visit(child)
order = first_vertex + order
return order
I omitted the choose_first function because it isn't important.
Now we have every part, putting it together is simple:
Code:
de_flow_obfuscate(bytecode):
already_done_vertices = empty
cfg = makecfg(bytecode)
order = get_order(cfg, already_done_vertices)
return rebuild_from_order(cfg, order)
Well that's it mostly. I wanted to show how to do this so maybe someone doesn't have to spend weeks figuring this stuff out and because I think
Code:
reverse postorder + Tarjan's strongly connected components algorithm = approximate topological ordering
is a very nice insight.
P.S. Sorry, my tool to do this for .NET is currently private because I'm not interested in an arms race. The ideas behind it are more important anyway
