Thursday, April 14, 2011

Data Grid Pattern - Snowflake data schema

Evil of distributed joins

Traditionally, data-grid-based solutions use denormalized data models. A denormalized model reduces the number of lookups in storage and so achieves low response times. Traditional normalized data models are a poor fit for distributed key/value storage. The main argument against them is that a typical data access operation requires multiple joins. If your data is partitioned, joins become prohibitively expensive: you have to join each partition on the left side with each partition on the right side, so the cost of a join grows quadratically with the number of partitions.
Even if you can narrow the record set before the actual join, your data may still be located on different servers. Each join between partitioned tables forces data to move across the network, adding network latency to the response time. The more joins, the more latency accumulates, and you will exceed your allowed response time very soon.
In other words, in a grid you generally can use neither joins nor a normalized data model (though there may always be exceptions).
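To see why partitioned joins get expensive, consider a naive distributed equi-join with no co-location: every left-side partition must exchange data with every right-side partition. A minimal sketch (hypothetical names; it counts partition-pair exchanges rather than making real network calls):

```java
import java.util.List;

// Illustrates why a join across two partitioned tables scales quadratically:
// without co-location, every left partition must be matched against every
// right partition, so the number of cross-node exchanges is pLeft * pRight.
public class DistributedJoinCost {

    /** Number of partition-pair exchanges a naive distributed join performs. */
    public static int exchanges(int leftPartitions, int rightPartitions) {
        int count = 0;
        for (int l = 0; l < leftPartitions; l++) {
            for (int r = 0; r < rightPartitions; r++) {
                count++; // one network exchange per partition pair
            }
        }
        return count;
    }

    public static void main(String[] args) {
        for (int p : List.of(2, 4, 8, 16)) {
            System.out.println(p + " partitions per side -> "
                    + exchanges(p, p) + " exchanges");
        }
    }
}
```

Doubling the number of partitions quadruples the number of exchanges, which is exactly why joins do not survive partitioning.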

Denormalization is evil

But the denormalization approach is far from perfect; it has serious downsides. Data redundancy is a side effect of denormalization. Redundancy increases memory consumption and the cost of updates. Losing consistency due to redundancy is another foe. It is a constant battle.
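The update cost of redundancy is easy to see in code: once a dimension attribute has been copied into every fact record, changing it means touching every copy (a toy sketch; all names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Shows the update cost of redundancy: in a denormalized model a customer's
// city is copied into every order, so renaming the city is O(number of
// orders); in a normalized model it would be one update to the customer row.
public class DenormalizedUpdate {

    static class Order {
        final long id;
        String customerCity; // redundant copy of a customer attribute

        Order(long id, String customerCity) {
            this.id = id;
            this.customerCity = customerCity;
        }
    }

    /** Returns how many records had to be rewritten. */
    public static int renameCity(List<Order> orders, String from, String to) {
        int touched = 0;
        for (Order o : orders) {
            if (o.customerCity.equals(from)) {
                o.customerCity = to; // every redundant copy must be updated
                touched++;
            }
        }
        return touched;
    }

    public static void main(String[] args) {
        List<Order> orders = new ArrayList<>();
        for (long i = 0; i < 1000; i++) orders.add(new Order(i, "Leningrad"));
        System.out.println("records touched: "
                + renameCity(orders, "Leningrad", "St. Petersburg"));
    }
}
```

Miss one copy during such an update and you have an inconsistency — hence the "constant battle".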

Data warehousing experience

The data warehousing industry has similar challenges to solve: huge amounts of data that also have to live in distributed storage. And it has had an answer to this challenge for some time!
Analytic databases usually use a "snowflake" data model. A snowflake schema has a single fact table (at the center of the schema) and several dimension tables. The fact table is huge; the dimension tables are relatively small.

A couple of examples illustrate the snowflake schema: one from the financial industry and one from retail (schema diagrams omitted here).
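The shape of such a schema can be sketched in code as a large fact record that holds only foreign keys into small dimension tables (a hypothetical retail example; the names are illustrative, not taken from any particular product):

```java
import java.util.Map;

// A hypothetical retail snowflake: one huge fact table (sales) and several
// small dimension tables (product, store). Fact rows carry only dimension
// keys plus measures; descriptive attributes live in the dimensions.
public class SnowflakeSchema {

    // Dimension rows: small, descriptive, few of them.
    record Product(int id, String name, String category) {}
    record Store(int id, String city) {}

    // Fact row: huge table, holds foreign keys + measures only.
    record Sale(long id, int productId, int storeId, int quantity, double amount) {}

    public static void main(String[] args) {
        Map<Integer, Product> products = Map.of(1, new Product(1, "Espresso", "Coffee"));
        Map<Integer, Store> stores = Map.of(10, new Store(10, "Berlin"));
        Sale sale = new Sale(1001L, 1, 10, 2, 7.80);

        // Resolving a fact row against its dimensions is just two key lookups:
        System.out.println(products.get(sale.productId()).name()
                + " sold in " + stores.get(sale.storeId()).city());
    }
}
```

Because dimension rows are few and small, replicating them everywhere is cheap; the fact rows are the only part that truly needs partitioning.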
The snowflake data model lets us implement joins across distributed storage in a smart way. While we still want the fact data partitioned across the cluster, we can keep a full copy of each dimension table on every node that executes queries. In other words, we can use a replication strategy for the small dimension tables while keeping the large fact data partitioned.
Now we can partially execute a query using the dimensions available in the local memory of the process, and then issue a single query to the fact table. As long as you do not need to join the fact table with itself or with another fact table, query execution requires just one network round trip (and if you really do need joins between facts, wait for the next article, "Map/Reduce in data grid").
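The flow above can be sketched with plain maps standing in for the grid (everything here is a simplified stand-in: replicated dimensions become local `Map`s, and the single remote call to the partitioned fact storage is simulated by one method call):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of "join locally, fetch remotely once": dimension tables are
// replicated, so the dimension part of the query runs entirely in local
// memory; only one call goes out to the partitioned fact storage.
public class SnowflakeQuery {

    // Replicated dimension table: productId -> category (full local copy).
    static final Map<Integer, String> PRODUCT_CATEGORY =
            Map.of(1, "Coffee", 2, "Coffee", 3, "Tea");

    // Partitioned fact "table": saleId -> productId (normally remote).
    static final Map<Long, Integer> SALE_PRODUCT =
            Map.of(100L, 1, 101L, 3, 102L, 2);

    /** Sales in a category: local dimension scan, then ONE fact query. */
    public static List<Long> salesInCategory(String category) {
        // Step 1: resolve the dimension predicate locally (no network).
        List<Integer> productIds = PRODUCT_CATEGORY.entrySet().stream()
                .filter(e -> e.getValue().equals(category))
                .map(Map.Entry::getKey)
                .toList();
        // Step 2: single round trip to the partitioned fact storage.
        return queryFacts(productIds);
    }

    /** Stand-in for the one network call to the fact partitions. */
    static List<Long> queryFacts(List<Integer> productIds) {
        List<Long> result = new ArrayList<>();
        SALE_PRODUCT.forEach((saleId, productId) -> {
            if (productIds.contains(productId)) result.add(saleId);
        });
        return result;
    }

    public static void main(String[] args) {
        System.out.println("Coffee sales: " + salesInCategory("Coffee"));
    }
}
```

The key point is that however many dimension tables the predicate touches, the network is hit exactly once — in step 2.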

How can we utilize a snowflake schema with grid middleware?

First we have to decide how we are going to store the dimension tables in memory. We can use the in-memory data grid itself, or an in-memory database such as Hypersonic or H2. The grid supports replication of dimension data out of the box and handles Java objects naturally, but an in-memory relational DB offers better query support (e.g. Oracle Coherence does not support any kind of join, even within replicated storage).
The next question is cluster topology: we may replicate dimension data to all nodes or only to a few of them (query nodes). These query nodes can be either grid peers or remote grid clients.
Functional separation between query nodes and data nodes (the ones storing partitions of the fact table) is probably the most practical option, because the two roles have different memory/CPU profiles. This way you can tune JVM memory options and scale each tier independently.
Unfortunately, this is just a pattern, not a built-in feature of any data grid product I am aware of. You still have to do a lot of non-trivial engineering to make it work, but the work is worth the effort. This pattern lets you use an IMDG in cases where denormalization cannot solve the problem, opening new horizons for the technology.

1 comment:

  1. Hi Alexis, MS SQL Server PDW edition (MS APS) provides the ability to mark tables as "distributed" (facts) and "replicated" (dimensions).