Quote:
If I wrote a virtual machine with its own instruction set, do I also have to write a compiler for it?
Yes. You should look into lexers and parsers like Flex and Bison. In a nutshell, you use your lexer to define the language's keywords (for, while, if etc. could just as well be spelled goat, apple, icecream or whatever). The lexer picks these keywords out of the source text and turns them into tokens. You then join those tokens into structures using a parser; for example you'd say "the token IF is followed by an expression, then a token THEN, then a list of statements, then a token ENDIF".
When you've coded your lexer and parser definitions, flex and bison generate C source based on your rules. That generated source becomes the backbone of your compiler: you use it to build a tree of the source code, and your compiler then traverses the tree and generates bytecode for each node.
Here's a practical example for you. We start off with our lexer definition, we'll do a standard C-like "for" keyword:
Code:
"for" {return FOR;}
"(" {return OPEN_BRACKET;}
..etc for ( ) + - = / * and every other single language "item" you need.
This says "look for the text string 'for' and return the token FOR; look for a '(' and return the token OPEN_BRACKET" etc etc. The token is actually an enum member. Strictly speaking, the enum is generated by Bison (from the %token declarations in your parser file) rather than by the lexer itself, and it looks something like this:
Code:
enum yytokentype {FOR = 258, OPEN_BRACKET = 259};
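Under the hood, the generated scanner is just doing pattern matching over the input and returning those enum values. Here's a hand-rolled sketch of the same idea (the token values and helper function are invented for illustration, they're not what flex actually emits):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Illustrative token values; real Bison-generated tokens start at 258.
enum TokenType { FOR = 258, OPEN_BRACKET, IDENTIFIER, UNKNOWN };

// A toy version of what the flex-generated yylex() does: skip whitespace,
// match the longest known pattern at the current position, return its token.
TokenType next_token(const std::string& src, size_t& pos) {
    while (pos < src.size() && std::isspace((unsigned char)src[pos])) ++pos;
    if (src.compare(pos, 3, "for") == 0 &&
        (pos + 3 == src.size() || !std::isalnum((unsigned char)src[pos + 3]))) {
        pos += 3;
        return FOR;            // matched the "for" keyword
    }
    if (pos < src.size() && src[pos] == '(') { ++pos; return OPEN_BRACKET; }
    if (pos < src.size() && std::isalpha((unsigned char)src[pos])) {
        while (pos < src.size() && std::isalnum((unsigned char)src[pos])) ++pos;
        return IDENTIFIER;     // any other word is an identifier
    }
    ++pos;
    return UNKNOWN;
}
```

Flex generates a much faster table-driven version of this from your rules, but the contract is the same: call it repeatedly, get back one token value per lexeme.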
You then #include the header Bison generates (it contains that enum) into both your lexer and your compiler code. Next you write your parser syntax. Here's the for:
Code:
for_statement
: FOR OPEN_BRACKET expression SEMICOLON expression SEMICOLON expression CLOSE_BRACKET statement
{
$$ = new Node(FOR_STMT, $3, $5, $7, $9);
}
;
This is incredibly hard to explain and you'll need to spend some time learning how parsers work. But essentially a parser takes lexer tokens (the FOR and OPEN_BRACKET from above, plus SEMICOLON and the rest which I haven't shown you) and searches for this sequence of tokens in the supplied text. When it finds a matching sequence, it runs the action code in the curly brackets, in this case creating a new instance of a C++ class called "Node" and passing some params to the constructor ($3, $5 and so on refer to the matched pieces by position in the rule).
Remember that the lexer and parser files aren't real C themselves. When you run them through flex/bison, actual C/C++ code is generated for you to include in your app.
So right now if we wrote a new text file containing this:
Code:
for(a = 1; a < 10; a++)
{
}
The code generated by the lexer would recognise the "for" keyword, the "(" bracket, the "=" as assignment operator etc etc etc... It would feed each one of these tokens to the code generated by the parser, which builds up matches in a BOTTOM UP fashion; Bison's particular flavour of bottom-up parsing is called LALR(1).
The parser then looks at what it just got from the lexer ("hey, what's a FOR token followed by an OPEN_BRACKET token followed by a..." etc) and tries to match it to one of the parser rules. It'll find the rule we coded above and execute the code in the braces that creates a new instance of the Node class.
You then code your own "Node" class. Essentially this is a self-referential class (each node holds pointers to child nodes) that is used to create a tree structure of your code. ....
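To make that concrete, here's one minimal way such a Node class could look. The FOR_STMT tag and the four-child layout mirror the parser action above; everything else (the other tags, the vector of children) is just an assumption of this sketch:

```cpp
#include <cassert>
#include <vector>

// Illustrative node tags; a real compiler has one per grammar construct.
enum NodeType { FOR_STMT, EXPR, STMT_LIST };

// Self-referential: each Node owns pointers to child Nodes, so nesting
// statements inside statements naturally builds a tree.
struct Node {
    NodeType type;
    std::vector<Node*> children;

    Node(NodeType t, Node* a = nullptr, Node* b = nullptr,
         Node* c = nullptr, Node* d = nullptr) : type(t) {
        for (Node* n : {a, b, c, d})
            if (n) children.push_back(n);   // skip unused slots
    }
    ~Node() { for (Node* n : children) delete n; }
};
```

With this in place, the parser action `$$ = new Node(FOR_STMT, $3, $5, $7, $9);` hangs the init expression, the condition, the increment and the loop body under a single FOR_STMT node.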
Actually, damn. I just tried to explain this 3 times and unfortunately a forum post does not lend itself to showing LALR(1) parsing.
So, skipping the tree bit, you end up with a structure in memory where you can easily break each bit of code down into individual instructions. Then it's just a matter of assigning a value to each opcode. So we give our "for" statement a value of 0x01.
When you've done it for all your language statements you'll have a long list of opcodes that looks oddly like machine code! That's because it is! All you need to do now is code your virtual stack machine, which is a whole other post...
You've got a lot of studying ahead of you to understand this, but trust me, it's very worth it when it makes sense!