Tuesday, 6 February 2007

Number Compression - Rhizome Encoding

Remembering that the major use of this encoding is for Rhizomes, think how many times you have seen data that looks like this?

id, field1, field2, field3, field4
0 A C F H
1 A C F I
2 A C F I
3 A C G H
4 A C F I
5 A D F J
6 A D F K
7 A D F H
8 A D G H
9 A E G H
10 B C G H
11 B C G H
12 B C G I
13 B C G I
14 B E G I
15 B E G J


Field 1 takes 4 bytes to encode, 2 for A and 2 for B
Field 2 takes 9 bytes to encode, 4 bytes for C, 2 bytes for D, 3 bytes for E
Field 3 takes 7 bytes to encode, 4 bytes for F, 3 bytes for G
Field 4 takes 12 bytes to encode, 4 bytes for H, 5 bytes for I, 2 bytes for J, 1 byte for K

For a total of 32 bytes to index the whole table with Rhizomes.

Examples:

Encoding of A: Add 0, increment x9.
Encoding of I: Add 1, increment x1, add 2, add 8, increment x3.
I will now demonstrate how Rhizomes can speed searches.

To find records where field3==G and field4== H:

Take the Rhizomes for G and H, Gs and Hs.

These begin at 3 and 0, respectively. (Record 3 is the first record where field3==G, record 0 is the first record where field4==H.)

If the two numbers are equal, this field is a match. Record or use the fact, then increase both Rhizomes.

If the two numbers are different (i.e. not the same record) then get the next number from the lower Rhizome.

Rhizome G H
3 0 get next number from H
3 3 *match*, increase both
8 7 get next number from H
8 8 *match*, increase both
9 9 *match*, increase both
10 10 *match*, increase both
11 11 *match*, increase both
12 eof end of H Rhizome -> end of results

No comments: