LLXX
October 8th, 2005, 04:11
Even though this board is mostly concerned with Reversing I noticed this forum had the name "Programming" in it, so might as well ask this question here to see if any experienced coder can understand it.
We are creating an x86 assembler. Not just another assembler, but one written in a distinct way - a simple, tiny assembler written with 16-bit x86 Asm, one that can assemble itself. Why code an assembler? This assembler is designed specifically for the creation of small, efficient COM programs for 16-bit. (Not exactly demoscene, but competition-related) It has distinctive syntax features that allow rapid coding, e.g. more than one instruction per line and simplified syntax.
This assembler is also to be written in such a way that certain hardcoded limits found with existing assemblers (e.g. length of one source line, length of symbols) are nonexistent - it should allow lines of infinite length, and symbols that can be as long as the symbol table memory (64k at the moment).
The difficulty is in the assembler's symbol table. It is only to store the ASCII name and a word that contains its value. An implementation that I have considered and partially implemented is a double-headed linked list. Recently it occurred to me that binary trees might provide better performance, due to their logarithmic search-depths. However, it seems that in all the implementations of binary trees designed to store strings that I've found on the Internet, rely at some point on a fixed-length buffer and in essence a series of strcmps to decide where in the tree the new node is to be added.
Adding entries to the binary tree isn't too difficult - the new node itself can be used as the buffer for the strcmps. However, searching is where the real problem is. All the example binary-tree code on the Internet seems to show fixed-length data being stored in the nodes. These nodes are not going to be fixed length - they consist of:
- 1 word pointer to Less
- 1 word pointer to More
- 1 word symbol's value
- ASCIIZ string name
As you can see, the Less and More values can be used either as pointers to previous and next entries in case of the linked list, or as Left and Right in the binary tree method. Either way the dynamic storage requirements are the same. It is only the size and efficiency of the algorithm which we are concerned with.
I do not want to have to read the entire input string to be found into a buffer and then do strcmps (or CMPSBs, actually). In the double-doubly-linked-list (two doubly-linked lists) method, the lookeahead-free algorithm was simple - initially, set two pointers, Low and High, to the start and end of the list, and as successive characters of the string are read in, move them toward each other based on character matches until they converge on one entry. I've thought about the converging pointers scheme in a binary tree, but it doesn't seem to work since it's a tree and not a sorted list.
Does such a successive-match algorithm for searching binary trees containing arbitrarily long strings exist? If it doesn't, I'll just revert back to the doubly-linked-list method, in which the insertion code is less than 200 bytes and the search code less than 100 bytes
If it does, but the algorithm is very complex or requires recursion (stack space is limited and I'm not too fond of storing the search string on the stack, lest it overflow and cause a bad crash), I'll still stick with the linked list. Otherwise, it'll be implemented.
I've also considered multiway trees. They are excellent for searching (scasb the list of branches, follow new branch, repeat until leaf) but not easy to add to. It is for this reason that the instruction table of our assembler will be implemented using multiway trees (most critical component, must have the best performance).
+ Yes, I know hash tables are another alternative. But they take up way too much space (the hash table needs to be sparse for optimal performance anyway) and are after all basically indexed linked lists. I don't consider them.
Any comments, suggestions, etc. are welcome.
We are creating an x86 assembler. Not just another assembler, but one written in a distinct way - a simple, tiny assembler written with 16-bit x86 Asm, one that can assemble itself. Why code an assembler? This assembler is designed specifically for the creation of small, efficient COM programs for 16-bit. (Not exactly demoscene, but competition-related) It has distinctive syntax features that allow rapid coding, e.g. more than one instruction per line and simplified syntax.
This assembler is also to be written in such a way that certain hardcoded limits found with existing assemblers (e.g. length of one source line, length of symbols) are nonexistent - it should allow lines of infinite length, and symbols that can be as long as the symbol table memory (64k at the moment).
The difficulty is in the assembler's symbol table. It is only to store the ASCII name and a word that contains its value. An implementation that I have considered and partially implemented is a double-headed linked list. Recently it occurred to me that binary trees might provide better performance, due to their logarithmic search-depths. However, it seems that in all the implementations of binary trees designed to store strings that I've found on the Internet, rely at some point on a fixed-length buffer and in essence a series of strcmps to decide where in the tree the new node is to be added.
Adding entries to the binary tree isn't too difficult - the new node itself can be used as the buffer for the strcmps. However, searching is where the real problem is. All the example binary-tree code on the Internet seems to show fixed-length data being stored in the nodes. These nodes are not going to be fixed length - they consist of:
- 1 word pointer to Less
- 1 word pointer to More
- 1 word symbol's value
- ASCIIZ string name
As you can see, the Less and More values can be used either as pointers to previous and next entries in case of the linked list, or as Left and Right in the binary tree method. Either way the dynamic storage requirements are the same. It is only the size and efficiency of the algorithm which we are concerned with.
I do not want to have to read the entire input string to be found into a buffer and then do strcmps (or CMPSBs, actually). In the double-doubly-linked-list (two doubly-linked lists) method, the lookeahead-free algorithm was simple - initially, set two pointers, Low and High, to the start and end of the list, and as successive characters of the string are read in, move them toward each other based on character matches until they converge on one entry. I've thought about the converging pointers scheme in a binary tree, but it doesn't seem to work since it's a tree and not a sorted list.
Does such a successive-match algorithm for searching binary trees containing arbitrarily long strings exist? If it doesn't, I'll just revert back to the doubly-linked-list method, in which the insertion code is less than 200 bytes and the search code less than 100 bytes

I've also considered multiway trees. They are excellent for searching (scasb the list of branches, follow new branch, repeat until leaf) but not easy to add to. It is for this reason that the instruction table of our assembler will be implemented using multiway trees (most critical component, must have the best performance).
+ Yes, I know hash tables are another alternative. But they take up way too much space (the hash table needs to be sparse for optimal performance anyway) and are after all basically indexed linked lists. I don't consider them.
Any comments, suggestions, etc. are welcome.