Well, I'm an idiot... [Archive] - RCE Messageboard's Regroupment

View Full Version : Well, I'm an idiot...

RITZ

June 28th, 2006, 18:57

A tool I'm making makes a lot of hooks. My basic hooking technique wrote a jump to the hook procedure over the beginning of the function, but the jump being 5 bytes (at least 2 autonomous instructions to write), created a problem; since I needed to unhook by writing the old code back in every hook procedure so that I could call the old function, every so often the unaware process in which the hooks were made would call the function at the same time as I unhooked it, and I'd get data corruption and an eventual exception just on these rare events. So, I tried everything you can think of to try and synchronize the memory access, but it just turned out to be impossible when the other threads of the process don't cooperate.

So - what I decided to do was to save the original function code and tack a jump back to the rest of the function on the end, then I could instead of jumping straight to the hook procedure, check an enabled variable, and jump either to the hook, or to the original function in the buffer, and let that jump on the end jump back to the rest of the function and I'd only need to change the enabled variable to unhook and call the original function, no thread synchronization problems. But that created another stupid problem... and I can't believe I didn't realize it ahead of time. If the original function code has relative addresses and I run it from the buffer it goes all wrong.

Ok, so that was stupid. Can anyone give me some advice on how to solve this problem? Thanks

BTW, here is that hooking class:

Code:



#pragma once



#include <windows.h>

#include <tlhelp32.h>

#include <stdio.h>

extern "C"

{

	#include "xde.h"

}



namespace RITZ

{

	template<typename FUNCTIONTYPE>

	class HOOK

	{

	public:

		DWORD enabled;



		// The original hook isn't thread safe. I've been unable to find a perfect solution.

		// There's still theoretically a slim chance that even with us setting the thread priority high

		// that the function could be called at the same time and since we require more than one

		// instruction to write the hooking stub in, there is a possibility for data corruption

		// if another thread tries to read the code simultaneously due to calling the function.

		// This CAN however be bypassed if all other threads in the process are suspended.

		// But unless you are creating the process with THREAD_SUSPEND,

		// getting all threads to suspend smoothly is a whole set of problems on its own.

		void Insert(FUNCTIONTYPE from, FUNCTIONTYPE to, HMODULE hmod)

		{

			const int patch_pos_enabled = 2; // Were to write the hook procedure address operand in to the patch

			const int patch_pos_hook = 10; // Were to write the hook procedure address operand in to the patch

			const int patch_pos_orig = 15; // And the original code operand



			static BYTE patch[] = {

			0x83, 0x3D, 0,0,0,0, 0x00, // CMP DWORD PTR DS:[enabled],0 ; Compare 'enable' with 0

			0x74, 0x05, // JE SHORT [Disabled code] ; Jump to disabled code if 'enabled' is 0

			// Enabled code...

			0xE9, 0,0,0,0, // Run the hook procedure

			// Disabled code...

		//	0xFF, 0x25, 0,0,0,0 // Run the original code

			0xE9, 0,0,0,0 // Run the original code

			};



			// Trace to find how much we need to grab so that we have only whole instructions...

			while(m_overwrite_len < sizeof(patch))

			{

				xde_instr ins;

				xde_disasm((unsigned char*)to+m_overwrite_len, &ins);

				m_overwrite_len += ins.len;

			}

system("pause";

			BYTE jumpcode[] = {0xE9, 0,0,0,0}; // Offset jump code

			// Save the code we are about to overwrite, with a jump to the rest of the code tacked on

			ProtectedMemcpy(from, m_original, m_overwrite_len); // Save original code

			*((DWORD*)(jumpcode+1)) = (DWORD)from - (DWORD)&m_original - 5; // Write jump operand

			ProtectedMemcpy(jumpcode, m_original+m_overwrite_len, sizeof(jumpcode)); // Concatenate the jump



			*((DWORD*)(patch+patch_pos_enabled)) = (DWORD)&enabled;

			*((DWORD*)(patch+patch_pos_hook)) = (DWORD)to - (DWORD)from - patch_pos_hook - sizeof(DWORD); // Write hook procedure offset operand in

			*((DWORD*)(patch+patch_pos_orig)) = (DWORD)&m_original - (DWORD)from - patch_pos_orig - sizeof(DWORD); // Write original code offset operand in

			ProtectedMemcpy((void*)&patch, from, sizeof(patch));



			orig = from;

			hook = to;

			m_hmod = hmod;

		}



		void Insert(char *from_name, FUNCTIONTYPE to, HMODULE hmod)

		{

			FARPROC procaddr = GetProcAddress(hmod, from_name);

			if(procaddr)

			{

				Insert((FUNCTIONTYPE)procaddr, to, hmod);

			}

		}



		void Remove()

		{

			ProtectedMemcpy(m_original, orig, m_overwrite_len);

		}



		void Enable()

		{

			enabled = 1;

		}



		void Disable()

		{

			enabled = 0;

		}



		void Reinsert()

		{

			Insert((FUNCTIONTYPE)m_baseaddr, m_to, m_hmod);

		}



		FUNCTIONTYPE orig; // For calling after unhooking.

		FUNCTIONTYPE hook;



	private:

		BYTE m_original[100];

		DWORD m_overwrite_len;

		HMODULE m_hmod;



		void ProtectedMemcpy(void *from, void *to, DWORD size)

		{

			DWORD oldprot_from;

			DWORD oldprot_to;



			VirtualProtect(from, size, PAGE_EXECUTE_READWRITE, &oldprot_from);

			VirtualProtect(to, size, PAGE_EXECUTE_READWRITE, &oldprot_to);



			DWORD dummy;



			int oldprior = GetThreadPriority(GetCurrentThread());

			SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);

			BOOL test = WriteProcessMemory(GetCurrentProcess(), to, from, size, &dummy);

			SetThreadPriority(GetCurrentThread(), oldprior);



			VirtualProtect(from, size, oldprot_from, &oldprot_from);

			VirtualProtect(to, size, oldprot_from, &oldprot_to);

		}



	public:

		HOOK()

		{

			enabled = 1;

		}



		HOOK(FUNCTIONTYPE from, FUNCTIONTYPE to, HMODULE hmod)

		{

			Insert(from, to, hmod);

			HOOK();

		}



		~HOOK()

		{

			Remove();

		}

	};

}

It works great if the first 19 bytes of the function doesn't do anything relative.

LLXX

June 29th, 2006, 00:50

Basically, what you need to do is prevent (or minimise) the chances of a task switch occurring between your two critical instructions.

This may seem a bit strange, and is indeed a rather unconventional method, but it should work, given the design of Window's scheduler.

Insert a sleep(0) call immediately before your critical instructions. This causes your thread to relinquish the rest of its timeslice to other threads in the system. Once sleep() returns, your thread will be executing at the start of a new timeslice, which allows more than enough time to execute the two critical instructions without being interrupted by another task switch.

RITZ

June 29th, 2006, 03:51

Yes, I came up with solutions like that (mfence, FlushInstructionCache, SwitchToThread), that works all the time but they were not real solutions. I could just set the thread priority to high temporarily (see ProtectedMemcpy) and it would never crash, but that technically just reduces the chances of it crashing because there is no thread priority that guarantees only that thread will execute, and I'm looking to make it impossible. Plus they're slow...

0xf001

June 29th, 2006, 04:26

heya,

just a thought - (i don't have much time

): try to look into some virii tutorials / code, they face similar problematics when jumping to the "OEP". else i think with a bit of logic that could be solved.

regards, 0xf001

LLXX

June 29th, 2006, 04:38

Quote:

[Originally Posted by RITZ]Yes, I came up with solutions like that (mfence, FlushInstructionCache, SwitchToThread), that works all the time but they were not real solutions. I could just set the thread priority to high temporarily (see ProtectedMemcpy) and it would never crash, but that technically just reduces the chances of it crashing because there is no thread priority that guarantees only that thread will execute, and I'm looking to make it impossible. Plus they're slow...

You don't need to gain complete control of the CPU. You only need to ensure that a task switch does not occur between the two critical instructions. As I said above, making sure that those two instructions are the first to execute at the start of a timeslice is going to ensure that a task switch will never occur between them.

dELTA

June 29th, 2006, 05:02

I don't like probabilistic solutions either.

A more robust way would be to insert the 5 byte jump, then at the jump location (your injected buffer) insert first the hook yes/no test, and if yes, execute your hook, if no, emulate the instructions that was destroyed by the 5 byte patch and then jump back into the real function.

The only weakness of this method is if the 5 first bytes contain an opcode that uses relative addressing, which of course brings us back to your original problem but with much lower probability. It is quite rare that the first 5 bytes of a function contains relative opcodes, but if you want to cover this case too (which we of course do in order to work 100%

) you can scan these 5 bytes for any relative opcode, and recalculate its offset in your emulated version of these bytes.

And yes, you might as well have done the offset recalculation for your original 19 bytes, but only having to do it for the first five bytes is cleaner, less prone for problems and less prone for needing to do it at all, and after all, the "relative opcode problem" didn't occur until you expanded your patch size from 5 to 19 bytes.

tom324

June 29th, 2006, 06:54

MS Detours 2.1 is available for download. It has support for begin/commit detour transactions and seems to be thread safe. Check its source code...

Tom

Maximus

June 29th, 2006, 11:13

1. use critical sections
2. lock cmpxchg8b on p1+
3. r0:cli/sti, kernel locks

@LLXX: unfortunately, your approach might not work fine on XT or dual core, as there is always a chance the other processor can do that.

0xf001

June 29th, 2006, 11:53

hi,

i am just thinking how about making the hook handler more intelligent, what about this: dont write the bytes back to the original location, instead write them somewhere else and redirect to there.
before re-redirection to that block you can then ie count how many calls are made to the function, let them wait until the other "instance" finished (not required of course) and blah ... any fancy things you like to do in your handler.
at least all time then your handler is called. you need to take care of instruction sizes to align to the assembled code for determining a "good" end (jmp back) of that code block. i think thats easier to do (insn size table) and almost bullet proof, or?

regards, 0xf001
ps: i think you are definately not an idiot

Maximus

June 29th, 2006, 12:55

Dont work if you have a short jump/cond there. 5 bytes are enough. It works fine for every wxpsp2 on, as m$ made all apis stub-able, but it is not guaranteed on prior versions, and custom dlls...

0xf001

June 29th, 2006, 13:29

true, I wouldn't expect a such condition in the first few bytes of a functions prologue, especially in a function (not just a random call destination, as its about hooking), so maybe I am the idiot

the point i wanted to make is to find a way to avoid all that weird mem syncing, thread stuff

you can check for such jumps, too etc, which is a generic way without dealing with OS internals - was the idea, outlined very basic - to just give the idea.

i didn't talk about how many bytes, and agree 5 are enough, I did not name any number so i wonder what you mean by that ... also long jumps would be a "problem"

further i did not see anything abt specific m$ api-hooks specifically which is why i wanted to think in a generic way.

regards, 0xf001

Maximus

June 29th, 2006, 14:35

well... using some well known disassembler you can:
1) copy 32 bytes.
2) do {disasm(instruction)} while (disasmed_bytes<5)
3) do { disasm_list(i)==short jcc?replace_with(jcc+32+i*5),place_abs_jump_at(+32+i*5) }.
4) place a jmp to the stub with jcc "emulation" in the api.

As long as you don't have specific code (i.e. poly,push/ret, heading backloop!) against it, should work fine.

edit---
forgot the "4"

0xf001

June 29th, 2006, 14:52

heya,

1) i won't use this disassembler, its so expensive

2) i'd use libdisasm or any other, especially if i want to _code_ a program - this is the real reason

3) that is all so "manual" uuh

4) this is getting off topic please forget my post

regards, 0xf001

Maximus

June 29th, 2006, 17:55

hehe, I have the solution

(frushing out the ace from the uh... don't know the name...)
...sigh I have to reopen the page to fetch the link...
http://www.cybertech.net/~sh0ksh0k/projects/

cheers,
Maximus

edit----
forgot to say, disable the AV if you compile the sources, it won't like them...

0xf001

June 29th, 2006, 19:14

awesome stuff

, thanks Maximus for sharing

!!

... and ssssssssssourcecode

RITZ I would be interested in how you come along / which strategy you are going to use, or more ..... what tool is it you are coding?

cheers, 0xf001

RITZ

June 29th, 2006, 23:20

Wow, quite the active forum... I take a nap and come back to all these responses.

@dELTA: Yes I considered that. It would be great if a normal function did the same thing in every 5 bytes. But sadly, the only thing that can be expected is push ebp/mov ebp,esp which is only 3 bytes. So that would still leave 1 instruction afterwards with the exact same problem. (see http://www.woodmann.com/forum/showthread.php?t=9228).

@Maximus: 1) Critical sections will not help because the other threads of the process don't cooperate. They would need to EnterCriticalSection and LeaveCriticalSection to synchronize access to the function. I'm injecting a DLL and the process has no idea about my hooks. 2) I tried lock cmpxchg8b a long time ago, but it didn't work! After scouring my Intel manuals and finding it, I came up with this:

Code:



		void ProtectedMemcpy(void *from, void *to, DWORD size)

		{

			DWORD oldprot_from;

			DWORD oldprot_to;



			VirtualProtect(from, size, PAGE_EXECUTE_READWRITE, &oldprot_from);

			VirtualProtect(to, size, PAGE_EXECUTE_READWRITE, &oldprot_to);





			__int64 buf = *((__int64*)to);

			void *bufptr = &buf;



			__asm

			{

				// Zero out edx:eax and ecx:ebx so that if the data at 'to'

				// is zero and edx:eax isn't set it will still be correct.

				xor edx, edx

				xor eax, eax

				xor ebx, ebx

				xor ecx, ecx



				mov esi, to // Load to address into register for cmpxchg8b.

				lock cmpxchg8b qword ptr ds:[esi] // Read current data at 'to' into edx:eax



				// Fill the buffer 'buf' with edx:eax.

				mov dword ptr ds:[bufptr], eax

				mov dword ptr ds:[bufptr+4], edx



				// Save eax:edx.

				push eax

				push edx

			}



			memcpy(bufptr, from, size); // Write our new data up to 'size' into the buffer.



			__asm

			{

				// We save eax:edx so that it matches the data at 'to' and copies ecx:ebx for us.





				// Load the buffer with out modified data into ecx:ebx for writing.

				mov ecx, dword ptr ds:[bufptr]

				mov ebx, dword ptr ds:[bufptr+4]



				mov esi, to // Load to address into register for cmpxchg8b.



				// Write the buffer with modified data to 'to'

				//address using the same method as we read it with.

				lock cmpxchg8b qword ptr ds:[esi]	

			}



			VirtualProtect(from, size, oldprot_from, &oldprot_from);

			VirtualProtect(to, size, oldprot_from, &oldprot_to);

		}

For some reason I wasn't able to get it to work. I think I figured out why but I don't remember. It was one of my earlier attempts before giving up and coming back to it a month or two later now.

I'm kind of looking for an elegant solution that doesn't involve fixing all the addresses to work in the new relative location. Keep in mind what that would require... I'd have to go through the entire function and change even jumps that might go back to the code that I overwrote to go back to the buffer instead.

@0xf001: The tool is a packet sniffer that can do everything. Edit real time packets and view the call stack that led to the Winsock API. It works by hooking pretty much every Winsock API and simulating a lot of crap in an injected DLL then displaying the data through the program with IPC. I've been working on it in my spare time for about a year and a half. And it's always had the problem that if you attach to something like Firefox and dowload a bunch of stuff it will eventually crash because of the hooking problem since in every hook I have to disable it, call the old function, then re-enable it in order to preserve everything.

RITZ

June 30th, 2006, 02:09

Omg... I went back to the cmpxchg8b thing and discovered that I was mixing up my endians.

Code:



void SafeMemcpy(void *from, void *to, DWORD size)

{

	DWORD oldprot_from;

	DWORD oldprot_to;



	VirtualProtect(from, size, PAGE_EXECUTE_READWRITE, &oldprot_from);

	VirtualProtect(to, size, PAGE_EXECUTE_READWRITE, &oldprot_to);



	__int64 todata = *((__int64*)to);

	__int64 fromdata = *((__int64*)from);



	__asm

	{

		mov edx, dword ptr ds:[todata+4]

		mov eax, dword ptr ds:[todata]



		mov ecx, dword ptr ds:[fromdata+4]

		mov ebx, dword ptr ds:[fromdata]



		mov esi, to // Load to address into register for cmpxchg8b.

		lock cmpxchg8b qword ptr ds:[esi] // Write

	}



	VirtualProtect(from, size, oldprot_from, &oldprot_from);

	VirtualProtect(to, size, oldprot_from, &oldprot_to);

}

Works perfectly! I think also at the time I wrote the previous attempt I was thinking I'd need to use cmpxchg8b to read the data too, without thinking about what actually causes data corruption. You can have a bazillion threads reading the same data without any problem. It's when one tries to write to the data and change its meaning while another thread is in the middle of reading it that causes data corruption.

I'm gonna scrounge up my old simple 5 byte jump version of the hook class and see if this works right now. Good thing I saved it before I started to implement the more complex method.

---

[EDIT]

Alright! It works great, and I've tested it rigorously and it seems to truly be 100% thread safe.

Here is the final hook class, all clean and neat:

Code:



#pragma once



#include <windows.h>

#include <stdio.h>



namespace RITZ

{

	template<typename FUNCTIONTYPE>

	class HOOK

	{

	public:

		const static int PATCH_LEN = 5;



		void Hook(FUNCTIONTYPE from, FUNCTIONTYPE to, HMODULE hmod = GetCurrentModule())

		{

			SafeMemcpy(overwritten, from, PATCH_LEN); // Save original function code



			static BYTE jumpcode[] = {0xE9, 0,0,0,0}; // Offset jump code

			*((DWORD*)(jumpcode+1)) = (DWORD)to - (DWORD)from - PATCH_LEN; // Write in jump operand



			SafeMemcpy(from, jumpcode, PATCH_LEN); // Insert hook



			orig = from;

			hook = to;

			this->hmod = hmod;

			hooked = true;

		}



		void Hook(char *from_name, FUNCTIONTYPE to, HMODULE hmod = GetCurrentModule())

		{

			FARPROC procaddr = GetProcAddress(hmod, from_name);

			if(procaddr)

			{

				Hook((FUNCTIONTYPE)procaddr, to, hmod);

			}

		}



		void Unhook()

		{

			SafeMemcpy(orig, overwritten, PATCH_LEN);

			hooked = false;

		}



		void Rehook()

		{

			Hook(orig, hook, this->hmod);

		}



		FUNCTIONTYPE orig; // For calling after unhooking.

		FUNCTIONTYPE hook;



	private:

		BYTE overwritten[PATCH_LEN];

		HMODULE hmod;

		bool hooked;

		

		void SafeMemcpy(void *to, void *from, DWORD size)

		{

			DWORD oldprot_from;

			DWORD oldprot_to;



			VirtualProtect(from, 8, PAGE_EXECUTE_READWRITE, &oldprot_from);

			VirtualProtect(to, 8, PAGE_EXECUTE_READWRITE, &oldprot_to);



			__int64 todata = *((__int64*)to);

			__int64 fromdata = *((__int64*)from);



			// So that we don't overwrite more than 'size',

			// copy the data that we are supposed to keep to the 'fromdata' buffer.

			memcpy(((BYTE*)&fromdata)+size, ((BYTE*)&todata)+size, 8-size);



			__asm

			{

				mov edx, dword ptr ds:[todata+4]

				mov eax, dword ptr ds:[todata]



				mov ecx, dword ptr ds:[fromdata+4]

				mov ebx, dword ptr ds:[fromdata]



				mov esi, to // Load to address into register for cmpxchg8b.

				lock cmpxchg8b qword ptr ds:[esi] // Write

			}



			VirtualProtect(from, 8, oldprot_from, &oldprot_from);

			VirtualProtect(to, 8, oldprot_from, &oldprot_to);

		}



		static HMODULE GetCurrentModule()

		{

			MEMORY_BASIC_INFORMATION mbi;

			static int dummy;

			VirtualQuery(&dummy, &mbi, sizeof(mbi));



			return (HMODULE)mbi.AllocationBase;

		}



	public:

		HOOK()

		{

			hooked = false;

		}



		HOOK(FUNCTIONTYPE from, FUNCTIONTYPE to, HMODULE hmod)

		{

			Hook(from, to, hmod);

		}



		~HOOK()

		{

			if(hooked)

				Unhook();

		}

	};

}

And here is an example of its usage:

Code:



#include "hook.h"

#include <iostream>

#include <windows.h>

using namespace std;



typedef void (*BAR)();

RITZ::HOOK<BAR> hook;



void bar()

{

	cout << "bar" << endl;



}



void foo()

{

	cout << "foo" << endl;

}





DWORD WINAPI Hookinator(LPVOID p)

{



	for(;

	{

		hook.Unhook();

		hook.Rehook();

	}

}



int main()

{

	hook.Hook(foo, bar);

	CreateThread(0, 0, Hookinator, 0, 0, 0);

	for(;

	{

		foo();

	}

}

Notice how the Hookinator thread does not cause a crash.

It did before after a few seconds.

...

My excitement has been subsided by the reminder that I'm a PHP programmer, and I have to get back to making a CMS for the company I work for. -.- Psht... life.

dELTA

June 30th, 2006, 10:49

Hehe, we're glad you got it working.

Just one thing about your comment regarding the critical sections though: of course the other threads will cooperate, since you have patched the EnterCriticalSection into the beginning of the function that they are all calling (i.e. in the beginning of your hook stub code).

Using this method you would never be able to remove this hook stub though, that is correct (i.e. even after having completed the "unhooking", each function call would still result in a pair of calls to EnterCriticalSection/LeaveCriticalSection, before execting the original code).

RITZ

June 30th, 2006, 10:56

If I put EnterCriticalSection/LeaveCriticalSection in the function that would require overwriting function code exactly the same, so removing the hook would then require removing the EnterCriticalSection/RemoveCriticalSection code too which would create the exact same problem with a redundant added step. The whole problem was being able to remove the hook synchronously.

The way that critical sections would work normally would be if the threads were my own, then I could enter a critical section every time I call the function from another thread, and then enter the same critical section when I'm going to apply or remove the hook.

...

Another thought just came to mind. I could lose information with this hooking method... When I unhook the function so that I can call the original, if another thread was to call the function at that exact time I wouldn't be able to detect it. Meh, I might have to try another solution; putting an int3 over the push ebp at the beginning of a function then handling the exception to then enter a critical section, and then run the push ebp, and then run the rest of the code. But I didn't want to try this before because throwing exceptions means the victim process could easily just catch the exceptions itself to prevent it.

dELTA

June 30th, 2006, 11:23

As I said, you can never remove the small hook stub containing the CriticalSection commands with this method, but you could "enable/disable" your custom hook code in a safe way whenever you want, even if you cannot actually remove the hook stub from memory (although this shouldn't really be important i most cases, would it?).

And yeah, the interrupt/exception method fails if the program has any whatsoever exception handlers installed itself, which is very likely in most cases. Already thought of and discarded that idea before.

RITZ

June 30th, 2006, 12:03

You're overwriting code to put in the critical section. How are you going to get that code back? Jump to it in a saved buffer? And then you get the exact same original problem of relative addresses, and there was no point for the critical section at all. It does nothing but overwrites more code.

dELTA

July 1st, 2006, 05:51

Again, with this method there will be a "hook stub" left in the code, even after the hook is disabled, but this won't be an issue. This hook stub will consist of:

1.
The original jump to your buffer.

In this buffer there will be:

2.
A call to EnterCriticalSection.

3.
A condition check regarding executing hook or original code (which will always choose the original code after you have deactivated the hook).

4.
A call to LeaveCriticalSection.

5.
An emulation of the first 5 bytes of the function.

6.
A jump back to the original function.

This buffer can be dynamically allocated, allowing you to even unload your hook DLL without any problems after the hook is disabled this way.

Also, first of all, this was mainly a reply to your reply to Maximus, saying that his suggestion about critical sections wouldn't work because other threads wouldn't cooperate.

Secondly, I'm still working under your originally stated assumption, that "a 5 byte hook overwrite will cause no problems with relative operands, but a bigger hook overwrite will", and just like Maximus also says, 5 byte hooks will currently not overwrite any relative opcodes for any Windows API.

RITZ

July 3rd, 2006, 17:29

I don't think you understand. The offset operands is the ONLY PROBLEM. I was already putting the overwritten data in a dynamically allocated buffer before you guys ever said anything. There is NO NEED for critical sections if it doesn't solve the only problem there is. What is the point of the critical section? As I said multiple times, it does nothing but overwrite more code.