View Full Version : LHA encoding/decoding


WaxfordSqueers
June 30th, 2009, 22:27
I'm trying to follow the decoding in a BIOS decompression utility and have confirmed the encoding is LHA (lh5). I could not follow the logic of the decoding so I decided to encode the same ROM module that I decoded independently. I'm using LHA under softice in DOS mode (in a VM) and I have confirmed that LHA does in fact produce exactly the same compressed module (from the uncompressed module) that is found in the ROM file. The only difference is that LHA adds a short header, which is missing in the ROM file.

I'm scratching my head over the different compression algorithms used in LHA, which apparently combines Huffman, Ziv-Lempel and LZSS coding, and was written by a Japanese author. Can someone shed some light on this to get me started? First of all, with Huffman, apparently a tree is constructed that represents the probabilities of characters/words appearing in a file. Are the values in the tree independent of the data it is compressing? That is, does the Huffman-type decoding create a tree and match data bytes to the probabilities in the tree?
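
[Editor's note: for reference, here is a minimal sketch of classic Huffman code construction. This is the textbook algorithm, not LHA's actual implementation; it shows that the code table is derived entirely from the byte frequencies of the input, so different inputs produce different trees.]

```python
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict:
    """Build a Huffman code table from the byte frequencies in `data`."""
    freq = Counter(data)
    # Heap entries: (frequency, tiebreaker, {symbol: code_so_far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    if len(heap) == 1:  # degenerate single-symbol input
        return {s: "0" for s in heap[0][2]}
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        # Prefix one more bit onto every code in each subtree, then merge
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes_a = huffman_codes(b"aaaabbc")    # skewed frequencies
codes_b = huffman_codes(b"abcabcabc")  # near-uniform frequencies
# The resulting code lengths differ because the distributions differ.
```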

In LHA, they seem to construct a tree out of nothing. They initialize a 64k section of memory and add a 1 in position 0x0. Then they operate on the table, deriving values somehow. I have seen that done with an SHA1 algorithm as well. What is it exactly they are doing and is the binary tree created independently of the data?

When the file to be compressed is loaded by LHA, it cuts out a chunk that is 0x2000 in size. It takes the first byte and processes it against a value in a table. The second and ensuing bytes are processed before the compressed byte is formed. I'm wondering if the app needs to read all the data first before it does the compression, and if the compression is done in chunks of 0x2000?

I have not confirmed yet that the table is the one created from an empty 64k chunk of memory. The 0x2000 is significant because it is referenced independently by the ROM utility during decompression. The decompression is done in units of 0x2000 bytes. Also, the decompressor seeds the initialization with the values 0x10 and 0x8.
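
[Editor's note: the lh5 method uses an 8 KB (0x2000) sliding dictionary for its LZSS stage, which would explain why both LHA and the ROM utility work in 0x2000-byte units. Below is a toy sketch of LZSS-style decoding over such a window. The token format is invented for illustration; real lh5 Huffman-codes its literals and (length, distance) fields.]

```python
WINDOW = 0x2000  # lh5 dictionary (sliding-window) size: 8 KB

def lzss_decode(tokens):
    """Decode a list of ("lit", byte) / ("copy", distance, length) tokens."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:  # ("copy", distance, length): re-use earlier output
            _, dist, length = tok
            assert 1 <= dist <= WINDOW, "distance must fit in the window"
            start = len(out) - dist
            for i in range(length):
                # Copy byte-by-byte so a match may overlap its own output
                out.append(out[start + i])
    return bytes(out)

# "abc" as literals, then a self-overlapping copy of 6 bytes at distance 3
data = lzss_decode([("lit", 0x61), ("lit", 0x62), ("lit", 0x63),
                    ("copy", 3, 6)])
```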

There are a lot of references in the LHA decompression to values like 0x61, also 0x72 and 0x77. At first, I thought that 0x61 was related to the ASCII letter 'a'. I'm curious about that because my ROM file cannot be processed by the ROM utility because of data it contains that is out of bounds. The same data in a ROM file that can be deciphered begins with a subtraction of 0x41 from the first data byte and that is followed by DEC 1's and JMPs. My initial data byte is way too high and the subtraction of 0x41 runs out of DEC 1's, leaving the app in an endless loop.

I mention that only because 0x41 is also the ASCII character 'A', and I have seen that method used in apps to determine whether code is ASCII or not. That's why I wondered about the 0x61 that LHA uses.
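
[Editor's note: as an aside on the ASCII angle, 0x41 is 'A' and 0x61 is 'a', and the range-check-and-subtract trick the poster describes usually looks like the sketch below. Whether LHA's 0x61 has anything to do with ASCII is only a guess from the constants; the helper name here is invented.]

```python
def letter_index(b: int) -> int:
    """Map an ASCII letter to 0..25 by subtracting the base of its range."""
    if 0x41 <= b <= 0x5A:   # 'A'..'Z'
        return b - 0x41
    if 0x61 <= b <= 0x7A:   # 'a'..'z'
        return b - 0x61
    raise ValueError("not an ASCII letter: 0x%02X" % b)
```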

disavowed
June 30th, 2009, 22:56
Quote:
[Originally Posted by WaxfordSqueers;81426]First of all, with Huffman, apparently a tree is constructed that represents the probabilities of characters/words appearing in a file. Are the values in the tree independent of the data it is compressing?


No. For Huffman Encoding, the values in the tree are dependent on the data.

WaxfordSqueers
July 1st, 2009, 20:38
Quote:
[Originally Posted by disavowed;81428]No. For Huffman Encoding, the values in the tree are dependent on the data.
I delayed getting back to you, disavowed, because I was working on the code. It's tough to do with softice 2.8 since it's very primitive and you seem to be limited to one data window.

Anyway, Huffman is part of LHA but it seems to be used in stage 2. I think LZSS is used in stage 1. All I've seen so far is a 64k table initialized to zeros and seeded with a 0x1 in the first position. Then the entire table is scanned for a value which obviously is not there yet. Eventually, however, by scanning for values I don't understand, it starts to fill the table, at first with values like 2, 5 and 13.

At that point, it hasn't even read the file that is to be compressed. If, as you say, the Huffman table is dependent on the data, this must be another dictionary, or table, being built. That might also explain what is bothering me in the decompression in the ROM utility. There seems to be a stage-one decompression followed by a stage-two pass that uses a table loaded into the app's address space.

By that, I mean those addresses in IDA are filled with question marks and I have confirmed there is no data in them early in the app's boot. In fact, they don't get filled with data till a ROM file is loaded for processing. I did a BPM on the addresses and identified them being filled, but I took a break to try understanding the compression routine first.