Log in

View Full Version : Base-85


LLXX
March 13th, 2006, 06:32
Not a reversing question, but a programming one. ("Advanced reversing and programming" so I believe it goes here.)

I've written a base-85 encoder and decoder (http://en.wikipedia.org/wiki/ascii85) for a project. However, one of them does not seem to be working correctly and only for some cases of input. I've manually done the arithmetic, and it seems that this may be a cause of a flaw in how the official Ascii85 specification handles partial blocks (base85 encodes 4 bytes blocks into 5 bytes), by truncating the output. One particular example I encountered was while trying to encode the three-letter string "art":

61 62 74 ; original string

61627400 ; pad with zero and turn into a dord, as per specification

01254ca8 ; quotient after division by 85 (55h) remainder: 38h
00037359 ; quotient after division by 85 (55h) remainder: 1bh
00000a64 ; quotient after division by 85 (55h) remainder: 25h
0000001f ; quotient after division by 85 (55h) remainder: 19h

The "digits" of the base-85 number are then 1f,19,25,1b,38 but according to the specification, since I'm encoding a partial block, only the first 4 are used and the last one is dropped, resulting in 1f,19,25,1b. However, when this is attempted to decode, it is padded with nulls to become 1f,19,25,1b,00, and the decoding process:

1f*55 + 19 = a64
a64*55 + 25 = 37359
37359*55 + 1b = 1254ca8
1254ca8*55 + 0 = 616273c8
Output: 61 62 73

Notice the original string was 61 62 74, while this is 61 62 73. Noticing the c8 at the end, perhaps it is necessary to round the output before taking the first 3 bytes? This is *not* mentioned at all in the official specification (which only details the encoding process, and only says that "decoding is the inverse of encoding" :thinking. I don't understand why a lossless encoding such as this would just drop the last byte of the output - information is being lost there. I think it's not impossible for this case to occur also with 1 and 2-byte partial blocks as well.

From searching on the Internet for the sourcecodes of other base85 decoders, I've found three variations:

1. Ignore the extra bytes completely when outputting partial blocks
2. Add a power of 85 corresponding to the length of the partial block (length=3,add 85; length=2,add 85²; length=1,add 85^3)
3. Add 128 to the byte preceding the last one to be output

I'm supposing that only one of those is correct. Currently my decoder is following (1) and it doesn't look like it's working, so it's either (2) or (3).

Another possibility is that my encoder is wrong, as incrementing the last byte output may solve the problem (but introduce another one in the process )

Any thoughts or comments on this?

Maximus
March 13th, 2006, 08:22
pad the 0 from the other side...

Admiral
March 13th, 2006, 12:26
I'm guessing you typoed the original string:

61 62 74 - abt
61 72 74 - art

That's not important though.
Like Maximus said, you're padding from the wrong side. But that's only because you're working (I assume) on a little-endian machine. Things'd probably work if you cast into a dword directly (as in without modifying the data).
Other than that, all you're really doing is long-dividing, the same way as you'd do for base-10 numbers or polynomials. For clarity, here's what should be happening (I'll stick with "abt":

Encode:

"abt" = {61, 62, 74}
{61, 62, 74, 00} - Pad to 4-boundary
00746261 - Cast to long (note that the byte-order is now reversed)

746261 \ 55 = 15E85 _ 38
\ 55 = 41F _ 3A
\ 55 = 0C _ 23

If you keep the whole factorisation on one line, it looks like:

746261 = 55(15E85) + 38
= 55(55(41F) + 3A) + 38

= 55(55(55(0C) + 23) + 3A) + 38

= C.55³ + 23.55² + 3A.55 + 38

So in radix notation:

746261h [16] = 0Ch 23h 3Ah 38h [85]

The decoding process goes just as the bold line above suggests:

0C * 55 + 23 = 41F
41F * 55 + 3A = 15E85
15E85 * 55 + 38 = 746261

Casting back to a char array we get

00746261 = {61, 62, 74, 00} = "abt\0"

I hope this helps.

Regards
Admiral

Maximus
March 13th, 2006, 15:57
yep, 2^24<85^4 thats why you can use 4 digits for 3 bytes. the wrong padding transformed the 2^24 number in 2^32, like padding 10 on right instead of left - 10-->010 (left) 10-->100 (right)

LLXX
March 13th, 2006, 17:37
I am working on a little-endian machine. An Intel 286, to be exact.

However, the specification indicates that every 4 bytes of input are to be treated as a dord in big-endian, i.e. MSB first. It also indicates that padding is to be performed at the right ("appends 4-n zero bytes" so I believe that 61627400 is the correct value to hold in the register.

If the sequence of 4 bytes in the input was e.g. 61 62 74 73 then 61627473 would be held in the register before dividing, and all 5 bytes would be output (correctly).

Here's the relavant part of the specification that I scanned+OCR'd+annotated:
http://img343.imageshack.us/img343/7060/851wr.gif

Maximus
March 13th, 2006, 18:24
from wiki:
Quote:

When encoding, the binary data is divided into groups of four bytes, which are treated as a 32-bit number
[...]
At the end of the data, the last group can have fewer than four bytes. Virtual zero bytes are appended to the data

LLX:
Quote:
61627400 ; pad with zero and turn into a dword, as per specification"


No. They are 4-byte numbers, not byte strings. you pad not the string, you pad the 4-byte number, and this means adding '0' in front of it, not as a multiplicator. the 4byte is a number, as from specs.

0xf001
March 13th, 2006, 19:05
hi,

i agree, its to be seen from mathematical point of view - if you want to get the right number out, you may not modify it. "padding" 00 in front of it does not change the value.
some 2 cents hehe
edit:

there are sources in the wiki to compare - so whats the output if you pad "left"? that would answer the question
almost grabbing compiler ...

Quote:
that every 4 bytes of input are to be treated as a dord in big-endian


doesn't that allready say all? hehe
big endian -> value changes if you shift them left as you do by padding right

regards,

--
0xf001

LLXX
March 13th, 2006, 19:49
This is exactly the opposite of what the example in binaryencoding.pdf ("binaryencoding.pdf filetypedf" in Google will find it) states.

Also, when I see the phrase "zero bytes are appended to the data", I interpret it as being zero bytes added to the end of the stream of input bytes.

+ it seems I'm not the only one that's encountered this problem:
http://ghostscript.com/pipermail/bug-gs/2002-January/000953.html

0xf001
March 13th, 2006, 20:12
ok,

just out of curiosity - does the calculation work if you just try to padd them on the other side? I think its quick to try. I admit I did not read all links etc, but at least I want to suggest to try the other padding as it made some sense to me.
And to the padding: the question is just: where is "the end" ? Stream vs integer thematics etc ... so I tohught there could be a chance there

regards, 0xf001

Woodmann
March 14th, 2006, 00:14
Quote:
This is exactly the opposite of what the example in binaryencoding.pdf ("binaryencoding.pdf filetypedf" in Google will find it) states.

Also, when I see the phrase "zero bytes are appended to the data", I interpret it as being zero bytes added to the end of the stream of input bytes.

+ it seems I'm not the only one that's encountered this problem:
http://ghostscript.com/pipermail/bug-gs/2002-January/000953.html
__________________
[ ~Litana L.X. Xahanien~ ]



I read this a few times and it seems this person has assumed the same thing as you. Perhaps the use of the word "appends" is the problem.

http://www.m-w.com/dictionary/append ("http://www.m-w.com/dictionary/append")

To further investigate:

http://www.m-w.com/dictionary/appendix ("http://www.m-w.com/dictionary/appendix")

It seems we have the incorrect usage of the word, thus causing confusion to where to make the need changes.

Append does in deed mean to add to the end but, this is wrong.
Whomever wrote this had the wrong syntax.

Pad the front and it will not detract from the final result. It see's the 0 and ignores it.

Woodmann

LLXX
March 14th, 2006, 04:58
My implementation involves filling a register with up to 4 bytes, and then repeatedly dividing that register by 85 and outputting the remainders in reverse order (since the base-85 digits are to be output MSB first).

The original way of filling was to, after the register was zeroed, to fill it in the following order (MSB is on the left, LSB is on the right):

1 2 3 4

Thus, if e.g. 61 62 74 ("abt" was read and following by EOF, the register would fill as follows:

61000000
61620000
61627400
61627400 (eof)

Doing it this way, the output of the encoder on the sample text of the Wikipedia page
Quote:
Man is distinguished, not only by his reason, but by this singular passion from other animals, which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable generation of knowledge, exceeds the short vehemence of any carnal pleasure.

agrees with the output there,
Quote:
9jqo^BlbD-BleB1DJ+*+F(f,q/0JhKF<GL>Cj@.4Gp$d7F!,L7@<6@)/0JDEF<G%<+EV:2F!,
O<DJ+*.@<*K0@<6L(Df-\0Ec5e;DffZ(EZee.Bl.9pF"AGXBPCsi+DGm>@3BB/F*&OCAfu2/AKY
i(DIb:@FD,*)+C]U=@3BN#EcYf8ATD3s@q?d$AftVqCh[NqF<G:8+EV:.+Cf>-FD5W8ARlolDIa
l(DId<j@<?3r@:F%a+D58'ATD4$Bl@l3De:,-DJs`8ARoFb/0JMK@qB4^F!,R<AKZ&-DfTqBG%G
>uD.RTpAKYo'+CT/5+Cei#DII?(E,9)oF*2M7/c~>

But the decoder decodes the last characters /c to 2d '-' instead of 2e '.' as should be.

I modified the encoder to pad from the other side, by shifting bytes in from the right instead:

0 0 0 1
0 0 1 2
0 1 2 3
1 2 3 4

This produces the identical results when 4 bytes are read in, but on EoF:

00000000 (initialised to 0)
00000061 (read in first byte)
00006162 (shift over and read second)
00616274 (shift again and read third)
00616274 (EoF, now it's padded on the left)

However, the output of the encoder now becomes:
Quote:
9jqo^BlbD-BleB1DJ+*+F(f,q/0JhKF<GL>Cj@.4Gp$d7F!,L7@<6@)/0JDEF<G%<+EV:2F!,
O<DJ+*.@<*K0@<6L(Df-\0Ec5e;DffZ(EZee.Bl.9pF"AGXBPCsi+DGm>@3BB/F*&OCAfu2/AKY
i(DIb:@FD,*)+C]U=@3BN#EcYf8ATD3s@q?d$AftVqCh[NqF<G:8+EV:.+Cf>-FD5W8ARlolDIa
l(DId<j@<?3r@:F%a+D58'ATD4$Bl@l3De:,-DJs`8ARoFb/0JMK@qB4^F!,R<AKZ&-DfTqBG%G
>uD.RTpAKYo'+CT/5+Cei#DII?(E,9)oF*2M7!!~>
Those two are zeroes, which decode to a zero byte.

I tested a few other encoders I found on the 'net, and so far all the ones I've found output /c for the final two bytes, so they're padding from the right.

I'm now wondering if the fault is in the encoder or the decoder, since my original implementation's output agrees with every other encoder I could find. It's the decoder's output that's in error.

If you'd like I can post the source for the encoder or decoder or both.

Admiral
March 15th, 2006, 20:17
Hi again.

I didn't believe you at first, but after reading up on the format (on Wikipedia) it seems that you're correct about padding from the right. Sorry about that :P

However, I still think this (although standard) is 'the wrong way' to solve the problem. The agreed method pads the whole string (from the right) to have a multiple of four length before splitting and encoding it. This would seem fine but it causes a small problem with the final remainder. An example:
Code:
"123\0" = {0x31, 0x32, 9x33, 0x00}

31 32 33 00 [16/256]

Factorising:

942AC3 _ 41 (65d)
1BE3E _ 2D (45d)
53F _ 53 (83d)
0F _ 44 (68d)
0 _ 0F (15f)

So in base 85 we get:

0F 44 53 2D 41 [16/85]
15 68 83 45 65 [10/85]
48 101 116 78 98 [10+33/85]
0 e t N b

Because there are only three important characters in the source (r256) string, it would seem fair that we can express it using only four characters with our new radix. This is true, but it loses our least-significant digit by rounding down. (So I say this is the wrong way to convert to base 85 because this issue wouldn't exist if we padded from the left).
So now when we output the string:

"<~0etN~>", the 'b' has disappeared, and it takes a unit away with it.

The result of this, which can be seen if you repeatedly encode and decode is that the last character of any string with non 4x length will lose one ASCII value per recode:

"123" -> "122" -> "121" -> "120" -> "12/" -> "12." -> "12-"
(but "1234" -> "1234"

Okay, so the problem isn't huge and the solution is simply to add one to the last value in your resulting array, but I still think our left-padding method is cleaner

Regards
Admiral

LLXX
March 16th, 2006, 02:39
Indeed, I think it is the wrong way as well. If I was designing base85 I'd make the dords little-endian, avoiding the problem altogether.

I have what seems to be a working but inelegant solution - rounding partial blocks before outputting them, by adding 128 to the first non-output byte. I wrote a tester program to cycle through all 256, 65536, and 16777216 of the 1, 2, and 3-byte partial block values, encoding and decoding and comparing the values obtained. All of the values encoded and decoded properly with my rounding scheme. Now that I have a working base-85 encoder and decoder I decided to optimise the code to see how fast and tiny I could get it to be. Performance statistics follow.

The 3-byte 16M checking process took approximately 20 seconds on my 4GHz CPU, which is approximately 84,0000 blocks per second being encoded and decoded, or approximately 4800 clocks per block encoded + decoded. This was prior to optimising the code.

A file of 1,9211,3544 bytes was encoded in 60 seconds, for an encoding speed of 3.2Mb/s, or approximately 1200 machine clocks per byte encoded.

The encoded version of the file, which was 2,4014,1932 bytes, was decoded in 40 seconds, for a decoding speed of 6.0Mb/s, or approximately 666 machine clocks per byte encoded.

No errors were found in the encoding or decoding as the decoded version was identical to the original file. The decoding was expected to be faster than the encoding, since it involves multiplication which is faster than the division used in the encoding process.

The filesize of the encoder is 209 bytes (haven't optimised it yet).
The filesize of the decoder is 260 bytes (optimised down from 491).
A base85 encoder+decoder I found on the Internet was 40960 bytes, another was 103781

I'll post my completed encoder + decoder here once I finish optimising the encoder

Maximus
March 21st, 2006, 11:45
LLXX, might you post the last bytes of the sentence code, so that I can give a look without coding it all?

Note that specs say that ending !!! must not be included in output at end. There is a big flaw in a85 format, note:
Quote:

At the end of the data, the last group can have fewer than four bytes. Virtual zero bytes are appended to the data, and after encoding, if there was one byte, only two characters are output; if there were two bytes, three characters are output; and if there were three bytes, four characters are output. The "z" case does not apply here. This way, the length of the original data can be determined by a reader of the encoded data.

Padding from right instead from left in BE will never give you a 3/4 char output, as rightpad will mul the number, thus increasing its significant digits.

LLXX
March 22nd, 2006, 05:56
I have completed the Base85 encoder and decoder, and have tested them to make sure they work properly. At current, approximately 660 clocks are spent for each byte decoded, while encoding is slightly slower at 700 clocks per byte.

The encoder and decoder work as filters, so use them like:
Code:
encode85 < infile.bin > outfile.85
decode85 < infile.85 > outfile.bin


The encoder and decoder along with source (converted to MASM syntax) is attached. (Quite a coincidence that the decoder's source is 3072 bytes which compressed to exactly 1024...) The encoder is quite straightforward, but the decoder is a little more complicated since it has to handle the special partial-byte rounding case (around line 89...) and output of partial bytes.

- Freeware public domain - use modify distributes freely