Generating Virtual Machine Code [Archive] - RCE Messageboard's Regroupment

View Full Version : Generating Virtual Machine Code

b3n

March 28th, 2007, 01:52

hello,

this might seem to be a stupid question, but its going round in my head for a couple of days and i couldnt find an answer to it so far. I'm currently trying to get a deeper understanding of virtual machines and therefore i have read quite a few articles and tutorials. The question i cant answer myself is how the machine code for a virtual machine is generated. If i wrote a virtual machine with an own instruction set, do i also have to write a compiler for it? I was thinking about writing a small virtual machine to gain a deeper understanding of the topic but i couldnt figure out how i can generate code i could run inside this machine. Is there any "easy" way (meaning without having to write a compiler) to write small programs?

cheers,
b3n

Maximus

March 28th, 2007, 06:54

there is always an easy way... just think to it

...use alpha encoding for your vm instruction set and fields.
You end up writing 'asm' programs with the reverser's favourite tool -notepad.

i.e.

LDA169
STA
LDX12
TXS
...

(ah, implements NOP as CR/LF..)
----
technically, you always need to write a compiler. However, you can 'force' the instruction set to be complaint with a human-compatible way of writing, so effectively 'removing' the interpretation step by uniforming it to a human-compatible one.
This also reminds me that most of the VMs around still does not use minimally the full VM power, as they are just used as 'polymorphic' machines or so.

Silver

March 28th, 2007, 13:03

Quote:

If i wrote a virtual machine with an own instruction set, do i also have to write a compiler for it?

Yes. You should look into lexers and parsers like Flex and Bison. In a nutshell, you use your lexer to define language key words (for, while, if etc can be defined as goat, apple, icecream or whatever). The lexer picks these key words out of code. You then join these key words into structures using a parser, for example you'd say "the keyword IF is followed by an evaluation then a keyword THEN then a list of statements then a keyword ENDIF".

When you've coded your lexer and parser definitions, flex and bison generate C source based on your rules. This source becomes the backbone of your compiler. You use it to create a tree of source code. Your compiler can then traverse the tree and generate bytecode for each node.

Here's a practical example for you. We start off with our lexer definition, we'll do a standard C-like "for" keyword:

Code:

"for"			{return FOR;}

"("			{return OPEN_BRACKET;}

..etc for ( ) + - = / * and every other single language "item" you need.

This says "look for the text string 'for' and return the token "FOR", look for a "(" and return token OPEN_BRACKET etc etc. The token is actually an ENUM member - the enum is generated by the lexer, so the above line of lexer script will generate this type of C code:

Code:

enum tokentype {FOR = 199};

You then #include the header your lexer generates into your compiler code. Next you write your parser syntax. Here's the for:

Code:

for_statement

		: FOR OPEN_BRACKET declaration_statement END_STATEMENT expression STATEMENT  expression CLOSE_PAR statement

			{

				$$ = new Node(FOR_STMT, $3, $5, $7, $9);

			}

This is incredibly hard to explain and you'll need to spend some time learning how parsers work. But essentially a parser takes lexer tokens (the first FOR, plus OPEN_BRACKET etc which I haven't shown you) and searches for this sequence of tokens in supplied text. When it finds a matching sequence, it uses the templates in the curly brackets to generate C source for you, in this case return a new instance of a C++ class called "Node" and pass in some params to the constructor.

Remember that all this is pseudo-C language. When you run your lexer and parser files through flex/bison, actual C++ code is generated for you to include in your app.

So right now if we wrote a new text file containing this:

Code:

for(a = 1; a < 10; a++)

{

}

The code generated by the lexer would recognise the "for" keyword, the "(" bracket, the "=" as assignment operator etc etc etc... It would pass each one of these to the code generated by the parser in a BOTTOM UP fashion - this is called LALR(1).

The parser then looks at what it just got from the lexer - "hey, what's a FOR token followed by an OPEN_BRACKET token followed by a...... etc" and try and match it to one of the parser rules. It'll find the rule we coded (above) and execute the code in the braces that generates a new instance of the node class.

You then code your own "Node" class. Essentially this is a self-entrant class that is used to create a tree structure of your code. ....

Actually, damn. I just tried to explain this 3 times and unfortunately a forum post does not lend itself to showing LALR(1) parsing.

So, skipping the tree bit, you end up with a structure in memory whereby you can easily break down each bit of code into individual instructions. Then all it's a matter of doing is assigning values to each opcode. So we give our "for" statement a value of 0x01.

When you've done it for all your language statements you'll just have a long list of opcodes that oddly looks exactly like how machine code does! That's because it is! All you need to do now is code your virtual stack machine, which is a whole other post...

You've got a lot of studying ahead of you to understand this, but trust me, it's very worth it when it makes sense!

fr33ke

March 28th, 2007, 14:49

I have never done it but my idea would be to add some macros to your favorite assembler.

Example in NASM syntax:

Code:

%macro LOAD 1

db 0x31           ;opcode for LOAD

dd %1             ;arg1

%endmacro

Unless your making it terribly complicated it should be enough.

Kayaker

March 28th, 2007, 17:30

You guys are good

b3n

March 28th, 2007, 18:20

Quote:

[Originally Posted by Maximus;64643]
This also reminds me that most of the VMs around still does not use minimally the full VM power, as they are just used as 'polymorphic' machines or so.

what would you say is the "full VM power"? I saw it more from a polymorphic point of view too.

b3n

March 28th, 2007, 18:29

thanks for the long explanation Silver, i think if i code a more complex machine i would go for your approach but since its just very basic ill give fr33ke's suggestion a go first, because it seems a little more straightforward to me.

Maximus

March 28th, 2007, 18:36

Well, at old times you would have used "yacc" to generate an LR(1) parser for you. Dunno if it's still around or not.
However, I suggest you to implement a simple LL(1), which is more suited for this.
Take any uni compiler's theory book -it works fine. Things didnt change -at very best they are infested -ahem, enriched- with UML and such instad of old good flowcharts

Long story to explain what is the truly unused VM power...
However, just change the byte-sequence of your instruction set and you are done without any further coding.

edit-----
mmh... I've re-read the Silver's long post- yeah, I was talking about that too, I should read long posts in whole at first^^
Said this, look for yacc and lex tools. With 'em, you end up defining the generative grammar and you get out a full C coded parser out by magic. Wow, googling led me to flex and bison -things has moved, meanwhile, nice

...which means I should again read better, as Silver talked about them at the start of his post eheh

OHPen

March 29th, 2007, 05:24

Hi,

i do not agree with maximus and silver that you need a compiler for a virtual machine. Maybe we have different definitions of a compiler...
I think that an interpreter would be enough, if we consider the code maximus mentioned above to be metacode which is then interpreted by the virtual machine.

Maybe im wrong, if so corrections are welcome

Cheers PAPiLLiON

Silver

March 29th, 2007, 09:58

Quote:

Fair enough, fr33ke's method is much simpler. Be aware though - if you do decide you want to expand your language you'll have to recode from scratch using a proper lexer/parser, as things will get quite unmanageable otherwise. You'll need some form of lexer anyway, unless your language structure is fixed and very simple.

If you do decide to use my (lol, yes, I "own" LALR(1)

) method, good luck. Feel free to ask me questions as you code, I'm sure other people would benefit from your experience too.

Incidentally what Maximus said about forcing the instruction set to be human-language compatible is very important - it's probably not clear just how important that sentence is. Remember that if you're coding a VM and a compiler, the instruction set is whatever you say it is.

Compiler & VM bible: Compilers Principles Techniques Tools 2nd Ed
# Publisher: Addison Wesley; 2 edition (August 31, 2006)
# Language: English
# ISBN-10: 0321486811
# ISBN-13: 978-0321486813

Quote:

which means I should again read better, as Silver talked about them at the start of his post eheh

Please search before posting

Quote:

I think that an interpreter would be enough

I think you'd still need some kind of lexer and parser though, otherwise you won't be able to "convert" the plain text source code into tokens, operators etc you can act on.