Tuesday, 6 February 2007

Rhizome design - backported to traditional databases.

The Problem:

When dealing with tables of greater than 10^7 rows, aggregate queries can become slow. For example:

SELECT SUM(value) total, COUNT(*) records FROM sales WHERE branch=”Oxford Street” AND transaction_date BETWEEN 01/01/2005 AND 31/01/2005 AND value > 10.00;


Traditional Solution:

The traditional solution is to make a ‘data cube’ from the table. This is either done nightly, or on-the-fly. This has the disadvantage that either a user or a process has to decide what should be totalled and what the dimensions should be.


Rhizome solution.

Let’s turn the problem on its head. We’ll forget about records and tables soon. Take the following example:

INSERT INTO sales(id, transaction_date, value, branch, paid, product, quantity) VALUES (123, ‘03/01/2005’, £123.45, “Oxford Street”, £123.45, “Widget”, 2);

INSERT INTO sales(id, transaction_date, value, branch, paid, product, quantity) VALUES (124, ‘04/01/2005’, £12.34, “Oxford Street”, £12.34, “FooBar”, 2);

INSERT INTO sales(id, transaction_date, value, branch, paid, product, quantity) VALUES (125, ‘04/01/2005’, £6.78, “Bond Street”, £6.78, “FooBar”, 1);

Can also be represented by: (in Ruby array/hash syntax)

[{ “sales.id” => 123,
“sales.transaction_date” => ‘03/01/2005’,
“sales.value” => £123.45,
“sales.branch” => “Oxford Street”,
“sales.paid” => £123.45,
“sales.product” => “Widget”,
“sales.quantity” => 2 },

{ “sales.id” => 124,
“sales.transaction_date” => ‘04/01/2005’,
“sales.value” => £12.34,
“sales.branch” => “Oxford Street”,
“sales.paid” => £12.34,
“sales.product” => “FooBar”,
“sales.quantity” => 2 },

{ “sales.id” => 124,
“sales.transaction_date” => ‘04/01/2005’,
“sales.value” => £6.78,
“sales.branch” => “Bond Street”,
“sales.paid” => £6.78,
“sales.product” => “FooBar”,
“sales.quantity” => 1 }]


This can be rearranged:

[{“sales.transaction_date” => {‘03/01/2005’ => [123], ‘04/01/2005’ => [124, 125]}},
{“sales.value” => {£123.45 =>[123], £12.34 => [124], £6.78 => [125]}},
{“sales.branch” => {“Oxford Street” =>[123, 124], “Bond Street” =>[125]}},
{“sales.paid” => {£123.45 =>[123], £12.34 => [124], £6.78 => [125]}},
{“sales.product” => {“Widget” =>[123], “FooBar” =>[124, 125]}},
{“sales.quantity” => {2 =>[123, 124], 1=>[125]}}]


Substituting references to symbols yields:

Symbols = [“sales.id”, “sales.transaction_date”, ‘03/01/2005’, “sales.value”, £123.45, 5“sales.branch”, “Oxford Street”, “sales.paid”, “sales.product”, “Widget”, 10“sales.quantity”, 2, ‘04/01/2005’, £12.34, “FooBar”, “Bond Street”, £6.78, 1]

[{1 => {2 => [123], 12 => [124, 125]}},
{3 => {4 =>[123], 13 => [124], 16 => [125]}},
{5 => {6 =>[123, 124], 15 =>[125]}},
{7 => {4 =>[123], 13 => [124], 16 => [125]}},
{8 => {9 =>[123], 14 =>[124, 125]}},
{10 => {11 =>[123, 124], 17=>[125]}}]


To make explanation simpler, I will give the above elements names:

[{index => {value => Rhizome, value => Rhizome, …}}


With larger dataset it would look more like:

[{1 => {2 => [123,126,127,128,132,133,134,142,149],
12 => [124, 125,129,135,140,141],
130 =>[131,136,137,138,139,143,148],
144=>[145,146,147,150]}},



As the data set grows (based on experience of business data, rather than academic data) the following two changes happen:

a)The number of field symbols becomes small with regard to the number of value symbols. Thus there is no point in storing the field symbols separately, as they will not impact on the enumeration of value symbols.

b)The number of value symbols becomes small with regard to the size of Rhizomes.


For example: in a company with over 10^7 historic invoices, there are 7x10^4 *unique* invoice values. Basic number-space reasoning shows that if the smallest unit of measure is a penny (1/100th of a £), then if the maximum invoice value is £900, then there *could only be* 90,000 unique values at most. The invoice value space is unlikely to be fully populated, so the number of unique values will be even fewer.

The storage of the existence of any value in a record, as the value itself, rather than a reference, becomes hideously inefficient in tables with over a 10^7 rows.

The sorting of symbols can be done in many ways – alphanumeric, numeric, date order, etc. Symbols can be grouped into sets, depending on how they can be evaluated (date set, integer set, numeric set, complex numeric set, etc.).

The manipulation and ordering of these symbols is trivial with current programming practises. Sets defined by ranges, etc., can be easily created and queried. This is the basis of any Rhizome type/collation system.

No comments: