Tuesday 6 February 2007

Rhizomes - Theortical Performance

Rhizomes - Theortical Performance

Let’s consider a real-world example. We have a table of 10^9 rows. Value G occurs 10^6 times, Values H occurs 10^6 times. If G and H are statistically independent, we should have around 10^3 matches.

As values of G or H occur only every 1,000 records or so, we’ll assume a *worst case* scenario and require Rhizomes of (two bytes) * (10^6 entries), or 2 megabytes, for each value.

Reading through 2x 2 megabytes of data (ideally from separate spindles) is trivially fast, as *streamed* data from contiguous blocks on hard disks is typically 50Mb/s, even on a £30 SATA drive.

Time needed: 100ms, worst case, including lookups, etc.

Contrast this with a traditional index-based search, which would require up to 10^6 index node reads. Random transfer of 1k blocks from a 10ms SATA disk is only 100k/s. This is going to take a long time… hence data cubes were invented.

Time needed: 10^4 seconds, worst case.

Cache Rhizomes.

Why the query is run, we should create a new Cache Rhizome of the results. There will be only 10^3 results in the ‘field3==G field4==H’ Cache Rhizome.

If we need to run the same query again, it will complete in 1,000th of the time of the original, *even if the data and result have changed*.

If anyone queries ‘field3==G, field4==H, field1==A’ we can use this Cache Rhizome to reduce the time needed for this query by 3 orders of magnitude.


Rhizomes on disk.

Ideally this will use a multi-spindle system. Rhizomes should be located on the fastest, most accessible parts of the disk. (Perhaps centred around the 1/3 capacity mark?)

Cache Rhizomes, as they already shrink query times by orders of magnitude, can live near the slow centre of disks.

A Rhizome management algorithm will need to deal with both the placement of Rhizomes on disk to allow for speed and growth, and with the need to handle the eradication of the less-frequently-accessed Cached Rhizomes. It will also have to handle the spindle-to-spindle movement of Rhizomes if they start fragmenting.


Cache Rhizome Size

As Cache Rhizomes contain only entries for particular combinations of values, they will remain small, compared with ordinary Rhizomes. (Though they will be numerous.)

No comments: