Tuesday 6 February 2007

Rhizomes - Efficiency of Comparisons

Using symbols, rather than storing values, leads to savings in processing when a query has conditions. (Especially complex conditions.)

For example:

If a column has 10^9 records with 10^6 distinct values, then performing complex comparisons on 10^6 values is 10^3 times faster than doing likewise on 10^9 values.

Complex comparisons include regular-expression matches and multi-column comparisons.


Also, values can be grouped and ordered arbitrarily. We could sort a 10^7-record subset (with 10^4 distinct values) of 10^9 records, by any ordering, much more quickly than sorting a traditional table.
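As a toy illustration of the first claim (the tiny in-memory structure below is an assumption for the example, not part of any implementation): a regular-expression condition is evaluated once per distinct value, and the matching record IDs are then read straight out of those values' Rhizomes.

rhizomes = {
  "Notebook" => [0, 4, 9],       # value => record ids holding that value
  "Bookcase" => [2, 7],
  "Lamp"     => [1, 3, 5, 6, 8],
}

condition = ->(value) { value =~ /book/i }

matching_ids = []
rhizomes.each do |value, ids|
  matching_ids.concat(ids) if condition.call(value)   # one test per distinct value, not per record
end
matching_ids.sort!
# matching_ids == [0, 2, 4, 7, 9]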

Rhizomes - Further Thoughts

The use of the concept of tables is really only an aid in communicating with an SQL-literate audience.

There is no type system here, except what is created by grouping values.

Also, a record could be made up of indexes from several ‘tables’:

{table1.name => “Kevin”,
table2.dob => ’01/01/1981’,
table1.address => “111 Some Street”}

Record IDs are unique within the database, not within any ‘table’.

I haven’t yet thought what this might mean, or whether it’s useful outside of the Rhizome-Tesseract design.

This is one part (maybe a third?) of my Rhizome-Tesseract design. Rhizome-Tesseract is a radical departure from relational databases, very web 2.0 and duck-typed; a natural fit for languages such as Ruby and Python.

Rhizomes - CRUD

So far I have deliberately concentrated on Aggregate Read performance of databases, as that is where I experience most day-to-day database performance problems.

Create Record:

The creation of a new record must add one or more bytes (or increase the value of the last byte) on each of the Rhizomes associated with that record’s values (one Rhizome per ‘column’ of data being written). The tails of these Rhizomes can, in the worst case, be considered randomly distributed across the system’s drives.

Thus each write of an n-column record will cause at least n random writes. This is comparable with a traditional database, assuming that half the columns are indexed.

Any Cache Rhizomes whose criteria apply to this record must also be modified.

As a rule of thumb, consider a Create operation on a Rhizome-based database to be 1-2 times as resource-intensive as on a traditional database.
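A minimal sketch of that write path, just to show where the ~n tail writes and the Cache Rhizome updates come from (the Rhizome class and the predicate-keyed cache hash below are stand-ins of my own, not the real structures):

class Rhizome
  def initialize
    @gaps = []          # later serialised with Rhizome Encoding
    @last_id = 0
  end

  def append(record_id)
    @gaps << record_id - @last_id   # tail write: the gap since the last matching record
    @last_id = record_id
  end
end

def create_record(record_id, values, rhizomes, cache_rhizomes)
  values.each do |field, value|
    (rhizomes[[field, value]] ||= Rhizome.new).append(record_id)   # ~n random tail writes
  end
  cache_rhizomes.each do |predicate, rhizome|
    rhizome.append(record_id) if predicate.call(values)            # plus any matching caches
  end
end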


Read Record:

A read of a particular record is equally as fast with Rhizomes as with a traditional database. (There is a tree of values for each Rhizome, analogous to a traditional database’s indexes.) If the read conditions are represented by a Cached Rhizome, performance will be orders of magnitude faster.


Read Multiple Records:

As with the example before, aggregating is typically 10^5 times faster with Rhizomes (on the first request) and 10^8 times faster subsequently. Similar performance can be gained using data-cubes on a traditional database, but these require setup, updating, and tuning.


Update Record:

Rhizomes, because of their linear nature, *cannot* be updated. Any updates must be performed as deletes and creates.


Delete Record:

Deleting a record means removing its entry from each Rhizome in which it occurs. To delete the entry from a Rhizome, the following (or preceding) addition has to be made equal to the sum of itself and the deleted addition, and the freed bytes padded with nulls.

add:3, increment:x10, add:4, add:8, add:2, …

Deleting the record indicated by the add:4 requires changing the Rhizome to:

add:3, increment:x10, add:12, some nulls, add:2, …
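At the token level (ignoring the exact byte layout), the delete is just folding one addition into the next and nulling the freed bytes. A minimal Ruby sketch of the example above, with names of my own choosing:

tokens = [[:add, 3], [:increment, 10], [:add, 4], [:add, 8], [:add, 2]]

# Delete the record reached by add:4 by folding it into the following add:8.
tokens[2] = [:add, tokens[2][1] + tokens[3][1]]   # becomes add:12
tokens[3] = [:null]                               # freed byte(s) become null padding, skipped on read

# tokens == [[:add, 3], [:increment, 10], [:add, 12], [:null], [:add, 2]]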

Deleting is therefore inefficient and Rhizomes will benefit from having their nulls removed when they become too ‘holey’ (>40%?). This is efficiently done spindle-to-spindle on a multi-spindle system.

Performance of deletions is comparable to traditional databases, which also have to perform several random writes to remove index entries.

Records

So far I have made no mention of retrieving whole records. I have assumed that finding which record IDs match a particular query is the main aim here.

I propose (though I have neither implemented nor tested this) that Cache Rhizomes be created with Multipart Encoding (nothing to do with MIME).

Formerly, we assumed that each number retrieved from a Rhizome was the increment to the next record whose index matched the Rhizome’s value (or condition).

If we instead interpret interleaved numbers as (integer references to) the values, then we have something very close to a clustered index. Again, the challenge here is for the Rhizome Manager to efficiently manage which Clustered Cache Rhizomes to keep or cull.

Records can be stored (as references to symbols) in a very efficient manner. The implementation is trivial.

Rhizomes of Rhizomes

Q: How do I get to the 1,000,000,042nd number in a Rhizome?

A: *Only* by reading 1,000,000,041 numbers.

Therefore I propose a system of Rhizomes of Rhizomes.

These contain the offsets of selected numbers.

In this case Rhizome A has 10^9 entries.

If I make a new Rhizome B, which indexes Rhizome A every 1,000 (or so*) entries, I can get there by reading between a 1,000th and a 500th of the data it would otherwise have taken.

If I then make a new Rhizome C, which indexes Rhizome B, it comes down to between a 1,000,000th and a 500,000th of the data.



* The 1,000th number might not start on a byte. (It could be in the middle of an instruction to increment.) Therefore it may not have an offset. So we might store the offsets of the 1,004th, the 2,000th, the 3,005th and the 4,000th numbers instead.
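A rough sketch of such a skip index (all names are mine). It makes one pass over a Rhizome's bytes, using the Rhizome Encoding layout described in the 'Number Compression' post below, and records a checkpoint roughly every 1,000 entries, but only at codeword boundaries, as per the footnote above (for brevity it only handles codewords whose length prefix fits in the first byte):

Checkpoint = Struct.new(:entries_seen, :byte_offset, :record_id)

def build_skip_index(bytes, stride = 1_000)
  index = []
  entries = 0
  record_id = 0
  next_mark = stride
  i = 0
  while i < bytes.length
    if entries >= next_mark
      index << Checkpoint.new(entries, i, record_id)   # i is a codeword boundary here
      next_mark += stride until next_mark > entries
    end
    b = bytes[i]
    if b.zero?                       # null padding
      i += 1
    elsif b < 0x80                   # run byte: b increments of 1
      entries += b
      record_id += b
      i += 1
    else                             # length-prefixed add codeword
      n = 1
      n += 1 while (b & (0x80 >> n)) != 0
      v = b & (0xFF >> (n + 1))
      (1...n).each { |j| v = (v << 8) | bytes[i + j] }
      entries += 1
      record_id += v
      i += n
    end
  end
  index
end

A read that wants the Nth entry then seeks to the checkpoint with the largest entries_seen not exceeding N and decodes forward from there; Rhizome C is just the same structure built over Rhizome B's checkpoints.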

Rhizomes - Theoretical Performance

Let’s consider a real-world example. We have a table of 10^9 rows. Value G occurs 10^6 times, value H occurs 10^6 times. If G and H are statistically independent, we should have around 10^3 matches.

As values of G or H occur only every 1,000 records or so, we’ll assume a *worst case* scenario and require Rhizomes of (two bytes) * (10^6 entries), or 2 megabytes, for each value.

Reading through 2x 2 megabytes of data (ideally from separate spindles) is trivially fast, as *streamed* data from contiguous blocks on hard disks is typically 50 MB/s, even on a £30 SATA drive.

Time needed: 100ms, worst case, including lookups, etc.

Contrast this with a traditional index-based search, which would require up to 10^6 index node reads. Random transfer of 1 KB blocks from a SATA disk with 10 ms seeks is only about 100 KB/s. This is going to take a long time… hence data cubes were invented.

Time needed: 10^4 seconds, worst case.
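A back-of-envelope check of those two figures, using the drive numbers assumed above rather than any measurement:

stream_rate_bytes_per_s = 50.0 * 1_000_000     # assumed ~50 MB/s sequential
rhizome_bytes           = 2 * 2_000_000        # two 2-megabyte Rhizomes
streaming_seconds       = rhizome_bytes / stream_rate_bytes_per_s
# => 0.08, comfortably inside the 100 ms quoted above

random_reads            = 1_000_000            # worst-case index node reads
seconds_per_random_read = 0.010                # ~10 ms per seek
random_seconds          = random_reads * seconds_per_random_read
# => 10000.0, i.e. the 10^4 seconds quoted above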

Cache Rhizomes.

When the query is run, we should create a new Cache Rhizome of the results. There will be only 10^3 results in the ‘field3==G, field4==H’ Cache Rhizome.

If we need to run the same query again, it will complete in a 1,000th of the time of the original, *even if the data and result have changed*.

If anyone queries ‘field3==G, field4==H, field1==A’ we can use this Cache Rhizome to reduce the time needed for this query by 3 orders of magnitude.


Rhizomes on disk.

Ideally this will use a multi-spindle system. Rhizomes should be located on the fastest, most accessible parts of the disk. (Perhaps centred around the 1/3 capacity mark?)

Cache Rhizomes, as they already shrink query times by orders of magnitude, can live near the slow centre of disks.

A Rhizome management algorithm will need to deal with both the placement of Rhizomes on disk to allow for speed and growth, and with the need to handle the eradication of the less-frequently-accessed Cached Rhizomes. It will also have to handle the spindle-to-spindle movement of Rhizomes if they start fragmenting.


Cache Rhizome Size

As Cache Rhizomes contain only entries for particular combinations of values, they will remain small, compared with ordinary Rhizomes. (Though they will be numerous.)

Number Compression - Rhizome Encoding

Remembering that the major use of this encoding is for Rhizomes, think how many times you have seen data that looks like this:

id  field1  field2  field3  field4
 0     A       C       F       H
 1     A       C       F       I
 2     A       C       F       I
 3     A       C       G       H
 4     A       C       F       I
 5     A       D       F       J
 6     A       D       F       K
 7     A       D       F       H
 8     A       D       G       H
 9     A       E       G       H
10     B       C       G       H
11     B       C       G       H
12     B       C       G       I
13     B       C       G       I
14     B       E       G       I
15     B       E       G       J

Field 1 takes 4 bytes to encode, 2 for A and 2 for B
Field 2 takes 9 bytes to encode, 4 bytes for C, 2 bytes for D, 3 bytes for E
Field 3 takes 7 bytes to encode, 4 bytes for F, 3 bytes for G
Field 4 takes 12 bytes to encode, 4 bytes for H, 5 bytes for I, 2 bytes for J, 1 byte for K

For a total of 32 bytes to index the whole table with Rhizomes.

Examples:

Encoding of A: Add 0, increment x9.
Encoding of I: Add 1, increment x1, add 2, add 8, increment x2.
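To make those examples mechanical, here is a small Ruby sketch (the function name is mine) that derives the add/increment token stream for one value from the sorted record IDs in which it occurs. While every add stays below 64 and every increment run below 128, each token costs exactly one byte under Rhizome Encoding, which is where the byte counts above come from.

def tokenise(record_ids)
  tokens = []
  prev = nil
  record_ids.sort.each do |id|
    if prev && id == prev + 1                  # consecutive record: an increment
      if tokens.last[0] == :increment
        tokens.last[1] += 1
      else
        tokens << [:increment, 1]
      end
    else                                       # otherwise: add the gap (from 0 at the start)
      tokens << [:add, id - (prev || 0)]
    end
    prev = id
  end
  tokens
end

tokenise([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])   # A => [[:add, 0], [:increment, 9]]
tokenise([1, 2, 4, 12, 13, 14])            # I => [[:add, 1], [:increment, 1], [:add, 2], [:add, 8], [:increment, 2]]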
I will now demonstrate how Rhizomes can speed searches.

To find records where field3==G and field4==H:

Take the two Rhizomes, one for G and one for H.

These begin at 3 and 0, respectively. (Record 3 is the first record where field3==G, record 0 is the first record where field4==H.)

If the two numbers are equal, this record is a match. Record or use the fact, then increase both Rhizomes.

If the two numbers are different (i.e. not the same record) then get the next number from the lower Rhizome.

Rhizome:   G     H
           3     0     get next number from H
           3     3     *match*, increase both
           8     7     get next number from H
           8     8     *match*, increase both
           9     9     *match*, increase both
          10    10     *match*, increase both
          11    11     *match*, increase both
          12   eof     end of H Rhizome -> end of results
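A sketch of that merge in Ruby, working on already-decoded record IDs rather than on the raw Rhizome bytes (names are mine; both inputs are in ascending record-id order, which Rhizomes are by construction):

def intersect(g_ids, h_ids)
  matches = []
  gi = hi = 0
  while gi < g_ids.length && hi < h_ids.length
    g, h = g_ids[gi], h_ids[hi]
    if g == h
      matches << g          # *match*, increase both
      gi += 1
      hi += 1
    elsif g > h
      hi += 1               # get next number from H
    else
      gi += 1               # get next number from G
    end
  end
  matches                   # either stream reaching eof ends the results
end

g = [3, 8, 9, 10, 11, 12, 13, 14, 15]   # records where field3 == G
h = [0, 3, 7, 8, 9, 10, 11]             # records where field4 == H
intersect(g, h)                          # => [3, 8, 9, 10, 11]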

Number Compression

The numeric representation of values (references) requires a simple, compact way of storing numbers of arbitrarily large size. We could use 64-bit integers, but that would waste space.

Instead I (may) have invented a new compression scheme for variably sized integers which I have named Rhizome Encoding.

It was inspired by Huffman coding, and is nearly a special case of it, constrained to codewords that are a whole number of bytes long.

(I did try Huffman encoding of various integer distributions, but I was spending far too long debugging C code with all sorts of fiddly bit offsets, especially incrementing the value of bytes that had non-byte-aligned offsets. This coding is much simpler to implement and does the same job more quickly and almost as well.)

The basic idea is that it is a prefix-free code for variable-length (positive) integers. The first byte contains zero or more high bits set, denoting how many bytes long the number is. (This system is very convenient on a big-endian architecture; little-endian numbers would either need to be converted to big-endian, or bit-shifted. My C code would benefit from someone with knowledge of x86 assembly.)

i.e. 00000000 is null and is skipped with no effect nor any returned number.
0xxxxxxx returns 1, 0xxxxxxx times.
10xxxxxx returns 00xxxxxx.
110xxxxx xxxxxxxx returns 000xxxxx xxxxxxxx.
1110xxxx xxxxxxxx xxxxxxxx returns 0000xxxx xxxxxxxx xxxxxxxx.

11111111 110xxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx returns xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx (truncated to the lower 64 bits, until we get wider architectures).

There *is* some wasted number space here, as 10000110 represents decimal 6, as does 11000000 00000110, as does 11100000 00000000 00000110, etc. Fixing this would reduce Rhizome sizes by 1% at the cost of increasing code complexity and processor usage.

The zero-bit-prefixed codewords are special. They denote 0xxxxxxx repeats of the number 1.

The sequence 3, 1, 1, 1, 1, 1, 1, 258, 1, 1 would be stored as:

10000011 00000110 11000001 00000010 00000010
number: 3 , return 1 x6 , two-byte number: 258, return 1 x2


If we now choose to interpret the returned values as increases (from a base of zero), the example becomes:

The sequence 3, 4, 5, 6, 7, 8, 9, 267, 268, 269 would be stored as:

10000011 00000110 11000001 00000010 00000010
add: 3 , increment: x6, add two-byte number: 258 , increment: x2
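To make the scheme concrete, here is a hedged Ruby sketch of an encoder and decoder for it (the module and method names are mine, and for brevity it only handles codewords whose length prefix fits within the first byte, i.e. values up to 2^55):

module RhizomeCodec
  module_function

  # One length-prefixed codeword for a non-negative integer.
  def encode_number(v)
    n = 1
    n += 1 until v < (1 << (7 * n - 1))                 # smallest codeword that fits v
    first = ((0xFF << (8 - n)) & 0xFF) | (v >> (8 * (n - 1)))
    rest  = (n - 1).downto(1).map { |i| (v >> (8 * (i - 1))) & 0xFF }
    [first, *rest]
  end

  # Encode a sequence, turning runs of the number 1 into single 0xxxxxxx bytes.
  def encode(numbers)
    bytes = []
    run = 0
    flush = lambda do
      while run > 0
        bytes << [run, 127].min                         # "return 1, x times"
        run -= 127
      end
      run = 0
    end
    numbers.each do |v|
      if v == 1
        run += 1
      else
        flush.call
        bytes.concat(encode_number(v))
      end
    end
    flush.call
    bytes
  end

  # Decode back to the raw number sequence, skipping 0x00 padding.
  def decode(bytes)
    out = []
    i = 0
    while i < bytes.length
      b = bytes[i]
      if b.zero?
        i += 1
      elsif b < 0x80
        b.times { out << 1 }
        i += 1
      else
        n = 1
        n += 1 while (b & (0x80 >> n)) != 0             # count the leading ones
        v = b & (0xFF >> (n + 1))
        (1...n).each { |j| v = (v << 8) | bytes[i + j] }
        out << v
        i += n
      end
    end
    out
  end
end

bytes = RhizomeCodec.encode([3, 1, 1, 1, 1, 1, 1, 258, 1, 1])
bytes.map { |b| format("%08b", b) }
# => ["10000011", "00000110", "11000001", "00000010", "00000010"]

sum = 0
RhizomeCodec.decode(bytes).map { |v| sum += v }
# => [3, 4, 5, 6, 7, 8, 9, 267, 268, 269]   (the increases-from-zero reading)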

Rhizome design - backported to traditional databases.

The Problem:

When dealing with tables of greater than 10^7 rows, aggregate queries can become slow. For example:

SELECT SUM(value) total, COUNT(*) records FROM sales WHERE branch = 'Oxford Street' AND transaction_date BETWEEN '01/01/2005' AND '31/01/2005' AND value > 10.00;


Traditional Solution:

The traditional solution is to make a ‘data cube’ from the table. This is either done nightly, or on-the-fly. This has the disadvantage that either a user or a process has to decide what should be totalled and what the dimensions should be.


Rhizome solution.

Let’s turn the problem on its head. We’ll forget about records and tables soon. Take the following example:

INSERT INTO sales(id, transaction_date, value, branch, paid, product, quantity) VALUES (123, ‘03/01/2005’, £123.45, “Oxford Street”, £123.45, “Widget”, 2);

INSERT INTO sales(id, transaction_date, value, branch, paid, product, quantity) VALUES (124, ‘04/01/2005’, £12.34, “Oxford Street”, £12.34, “FooBar”, 2);

INSERT INTO sales(id, transaction_date, value, branch, paid, product, quantity) VALUES (125, ‘04/01/2005’, £6.78, “Bond Street”, £6.78, “FooBar”, 1);

Can also be represented by: (in Ruby array/hash syntax)

[{ “sales.id” => 123,
“sales.transaction_date” => ‘03/01/2005’,
“sales.value” => £123.45,
“sales.branch” => “Oxford Street”,
“sales.paid” => £123.45,
“sales.product” => “Widget”,
“sales.quantity” => 2 },

{ “sales.id” => 124,
“sales.transaction_date” => ‘04/01/2005’,
“sales.value” => £12.34,
“sales.branch” => “Oxford Street”,
“sales.paid” => £12.34,
“sales.product” => “FooBar”,
“sales.quantity” => 2 },

{ “sales.id” => 125,
“sales.transaction_date” => ‘04/01/2005’,
“sales.value” => £6.78,
“sales.branch” => “Bond Street”,
“sales.paid” => £6.78,
“sales.product” => “FooBar”,
“sales.quantity” => 1 }]


This can be rearranged:

[{“sales.transaction_date” => {‘03/01/2005’ => [123], ‘04/01/2005’ => [124, 125]}},
{“sales.value” => {£123.45 =>[123], £12.34 => [124], £6.78 => [125]}},
{“sales.branch” => {“Oxford Street” =>[123, 124], “Bond Street” =>[125]}},
{“sales.paid” => {£123.45 =>[123], £12.34 => [124], £6.78 => [125]}},
{“sales.product” => {“Widget” =>[123], “FooBar” =>[124, 125]}},
{“sales.quantity” => {2 =>[123, 124], 1=>[125]}}]


Substituting references to symbols yields:

Symbols (indexed from 0) = [“sales.id”, “sales.transaction_date”, ‘03/01/2005’, “sales.value”, £123.45, “sales.branch”, “Oxford Street”, “sales.paid”, “sales.product”, “Widget”, “sales.quantity”, 2, ‘04/01/2005’, £12.34, “FooBar”, “Bond Street”, £6.78, 1]

[{1 => {2 => [123], 12 => [124, 125]}},
{3 => {4 =>[123], 13 => [124], 16 => [125]}},
{5 => {6 =>[123, 124], 15 =>[125]}},
{7 => {4 =>[123], 13 => [124], 16 => [125]}},
{8 => {9 =>[123], 14 =>[124, 125]}},
{10 => {11 =>[123, 124], 17=>[125]}}]
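As a rough Ruby sketch of that inversion plus the symbol interning (invert_rows and the other names are mine, and record IDs are taken from the "sales.id" column rather than being interned themselves):

def invert_rows(rows, id_key = "sales.id")
  symbols = []                      # position in this array = symbol reference
  intern = lambda do |val|
    idx = symbols.index(val)
    unless idx
      symbols << val
      idx = symbols.length - 1
    end
    idx
  end

  rhizomes = Hash.new { |h, k| h[k] = Hash.new { |h2, k2| h2[k2] = [] } }
  rows.each do |row|
    id = row[id_key]
    row.each do |field, value|
      next if field == id_key
      rhizomes[intern.call(field)][intern.call(value)] << id
    end
  end
  [symbols, rhizomes]
end

rows = [
  { "sales.id" => 123, "sales.branch" => "Oxford Street", "sales.product" => "Widget" },
  { "sales.id" => 124, "sales.branch" => "Oxford Street", "sales.product" => "FooBar" },
]
symbols, rhizomes = invert_rows(rows)
# symbols  == ["sales.branch", "Oxford Street", "sales.product", "Widget", "FooBar"]
# rhizomes == {0 => {1 => [123, 124]}, 2 => {3 => [123], 4 => [124]}}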


To make explanation simpler, I will give the above elements names:

[{index => {value => Rhizome, value => Rhizome, …}}, …]


With a larger dataset it would look more like:

[{1 => {2 => [123,126,127,128,132,133,134,142,149],
12 => [124, 125,129,135,140,141],
130 =>[131,136,137,138,139,143,148],
144=>[145,146,147,150]}},



As the data set grows (based on experience of business data, rather than academic data) the following two changes happen:

a) The number of field symbols becomes small relative to the number of value symbols. Thus there is no point in storing the field symbols separately, as they will not affect the enumeration of value symbols.

b) The number of value symbols becomes small relative to the size of the Rhizomes.


For example: in a company with over 10^7 historic invoices, there are 7x10^4 *unique* invoice values. Basic number-space reasoning shows that if the smallest unit of measure is a penny (1/100th of a £) and the maximum invoice value is £900, then there *could only be* 90,000 unique values at most. The invoice value space is unlikely to be fully populated, so the number of unique values will be even fewer.

Storing each occurrence of a value in a record as the value itself, rather than as a reference, becomes hideously inefficient in tables with over 10^7 rows.

The sorting of symbols can be done in many ways – alphanumeric, numeric, date order, etc. Symbols can be grouped into sets, depending on how they can be evaluated (date set, integer set, numeric set, complex numeric set, etc.).

The manipulation and ordering of these symbols is trivial with current programming practices. Sets defined by ranges, etc., can be easily created and queried. This is the basis of any Rhizome type/collation system.
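A small illustration of that last point (the symbol table below is made up): a range or type condition only has to be evaluated against the distinct symbols, and each set of symbols can carry its own ordering.

symbols = ["sales.value", 123.45, 12.34, 6.78, "sales.branch", "Oxford Street", "Bond Street"]

# References to every numeric symbol greater than 10.00, i.e. the Rhizomes
# a "value > 10.00" condition would have to merge.
numeric_refs = []
symbols.each_with_index do |sym, i|
  numeric_refs << i if sym.is_a?(Numeric) && sym > 10.00
end
# numeric_refs == [1, 2]

# The numeric set can be ordered numerically, independently of the strings.
symbols.grep(Numeric).sort   # => [6.78, 12.34, 123.45]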

Monday 5 February 2007

Foreword

The purpose of this blog is to gain peer-review for my non-traditional design of database.

The Rhizome-Tesseract (RhiTess for short) design could (theoretically) wipe the floor with other RDBMS approaches under the following circumstances:
  • One or more tables with more than 10^7 rows.
  • A storage medium (disk), where sequential reading is at least 10^2 times faster than reading random blocks.
  • A dataset size of greater than 10 times physical RAM.

The project currently has a few working modules (written in C++ and Ruby), but is not yet suitable for even an alpha release. I suffer from a lack of free time.

I believe the advantages of the RhiTess design include:
  • Select queries (with aggregate fields) on tables with more than 10^7 rows are performed 10^4 - 10^7 times faster than with traditional indices. e.g. SELECT sum(value) FROM sales WHERE sale_date BETWEEN '01/01/06' AND '31/12/06' AND value > 10.00
  • Select queries with multiple conditions can be 10^2 times faster. e.g. SELECT * FROM products WHERE description LIKE "%book%" AND weight > 0.2
  • Other database operations are of similar speed to traditional designs (0.5-2 times as fast).
  • Disk space required for storing the data and indices is typically 10-40% that of a traditional RDBMS.

The design, to be described in the following blog entries, may well already be in use. If so, please let me know.

The blog will take the form of a brain-dump, rather than a well-written paper. This is useful to provide prior-art against any subsequent software-patent claims.

If there is sufficient interest in the blog, I will work it into a coherent paper.

I may also be making major mistakes in my performance calculations. My understanding of current databases is limited - it is not my field. I am familiar with the workings of C-ISAM/Informix SE and MySQL MyISAM tables. I am also aware of clustered indices and datacubes. There is much else in DB technology of which I am ignorant. Please forgive, and inform, me if my design is naive and misguided.

Of course, the principal reason for writing this blog is to encourage the guys from Google to offer me a job ;-)