Tuesday 6 February 2007

Number Compression

Number compression.

The numeric representation of values (references), require a simple, compact way of storing numbers of arbitrary large size. We could choose to use 64 bit integers, but that would be wasteful of space.

Instead I (may) have invented a new compression scheme for variably sized integers which I have named Rhizome Encoding.

It was inspired by Huffman coding, and this is nearly a special case of that coding, constrained to only have codewords which are an integer number of bytes long.

(I did try the Huffman encoding of various integer distributions, but I was spending far too long debugging C code with all sorts of fiddly-bit offsets, especially incrementing the value of bytes which had non-byte-aligned offsets. This is coding is much simpler to implement and does the same job more quickly and almost as well.)

The basic idea is that it is a prefix-free code for variable length (positive) integers. The first byte contains zero or more high bits set, denoting how many bytes long the number is. (This is system is very convenient with big-endian architecture, little-endians numbers would either need to be converted to big-endians, or bit-shifted. My C code would benefit from someone with knowledge of x86 assembly.)

i.e. 00000000 is null and is skipped with no effect nor any returned number.
0xxxxxxx returns 1, 0xxxxxxx times.
10xxxxxx returns 00xxxxxx.
110xxxxx xxxxxxxx returns 000xxxxx xxxxxxxx.
1110xxxx xxxxxxxx xxxxxxxx returns 0000xxxx xxxxxxxx xxxxxxxx.

11111111 110xxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx returns xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx (truncated to the lower 64 bits, until we get wider architectures).

There *is* some wasted number space here, as 10000110 represents decimal 6, as does 11000000 00000110, as does 11100000 00000000 00000110, etc. Fixing this would reduce Rhizome sizes by 1% at the cost of increasing code complexity and processor usage.

The zero-bit-prefixed codewords are special. They denote 0xxxxxxx repeats of the number 1.

The sequence 3, 1, 1, 1, 1, 1, 1, 258, 1, 1 would be stored as:

10000011 00000110 11000001 00000010 00000010
number: 3 , return 1 x6 , two-byte number: 258, return 1 x2


If we now choose to interpret the returned values as increases (from a base of zero), the example becomes:

The sequence 3, 4, 5, 6, 7, 8, 9, 267, 268, 269 would be stored as:

10000011 00000110 11000001 00000010 00000010
add: 3 , increment: x6, add two-byte number: 258 , increment: x2

No comments: