Monday, October 31, 2011

Data Grid Pattern - Time series index for managing versioned data

Many critical applications take an append-only approach to transactional data. In other words, they never update data records; instead they insert new records with a greater timestamp (or sequence number, or whatever kind of version they use to identify the latest record). A common challenge with this data model is how to retrieve the appropriate version of a record (e.g. the latest version, or the version as of a certain moment in time). A simple request for the latest version of series 'A' translates into a query as complex as
select * from versions where series='A' and version = (select max(version) from versions where series='A')
This query is already too complex for a data grid (and even an RDBMS would not be too happy with it).
Coincidentally, Ben Stopford has recently published a great article about this problem, outlining a lot of important aspects. I do not want to repeat him here; I suggest you read his article now, then continue with mine, which complements his two approaches with a third one, using a custom index in Coherence.

Using custom index for accessing versioned data

In this approach, each version is stored as a separate entry in the cache.
The entry key is a composite key, consisting of the logical key (series key) and some additional field that makes the version key unique (e.g. transaction ID, sequence number etc). The value contains the actual business data (payload) and the timestamp we use in our queries (technically, the timestamp could be part of the composite key).
The series key should also be an affinity key: all versions belonging to the same series should physically reside on one Coherence node (alternatively, the affinity key can be a part of the series key, which also satisfies this requirement).
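A composite key of this shape could look roughly like the sketch below. It is plain Java with no Coherence dependency, and the class and field names (VersionKey, seriesKey, sequence) are hypothetical, chosen only for illustration. In a real deployment the key would typically implement com.tangosol.net.cache.KeyAssociation, whose getAssociatedKey() method has the same signature as the one shown here; the interface is left out so that the snippet compiles on its own.

```java
import java.util.Objects;

/** Hypothetical composite key: series key plus a sequence number that makes each version unique. */
final class VersionKey {
    private final String seriesKey; // logical key shared by all versions of one record
    private final long sequence;    // e.g. transaction ID or sequence number

    VersionKey(String seriesKey, long sequence) {
        this.seriesKey = Objects.requireNonNull(seriesKey, "seriesKey");
        this.sequence = sequence;
    }

    String getSeriesKey() { return seriesKey; }

    long getSequence() { return sequence; }

    /**
     * Affinity value: every version of a series reports the same associated key,
     * so all of them would be placed in the same partition.
     * (Assumption: in Coherence this would be KeyAssociation.getAssociatedKey().)
     */
    Object getAssociatedKey() { return seriesKey; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof VersionKey)) return false;
        VersionKey other = (VersionKey) o;
        return sequence == other.sequence && seriesKey.equals(other.seriesKey);
    }

    @Override
    public int hashCode() { return Objects.hash(seriesKey, sequence); }

    @Override
    public String toString() { return seriesKey + "#" + sequence; }
}
```

Two keys such as new VersionKey("A", 1) and new VersionKey("A", 2) are distinct cache keys, yet both return "A" as their associated key, which is what keeps a whole series on one node.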
Normally, if you want to find a certain version by series key and timestamp, you have to aggregate over all versions of that series. In the two approaches described by Ben Stopford, the latest version is separated from all other versions (using a separate cache in approach 1, using a marker in approach 2). This solves the problem of finding the latest version, but doesn't help if we need to find the version as of a certain moment in time.
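To make the cost of the aggregation concrete, here is a minimal sketch of the naive lookup in plain Java. It is not Coherence code: VersionRecord and NaiveVersionLookup are hypothetical names, and an in-memory collection stands in for the cache. The point is only that every version has to be inspected to answer a single as-of question.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical flattened view of one cache entry. */
final class VersionRecord {
    final String series;
    final long timestamp;
    final String payload;

    VersionRecord(String series, long timestamp, String payload) {
        this.series = series;
        this.timestamp = timestamp;
        this.payload = payload;
    }
}

final class NaiveVersionLookup {
    /**
     * Scans all records and returns the version of the given series with the
     * greatest timestamp not exceeding asOf, or null if there is none.
     */
    static VersionRecord findAsOf(Iterable<VersionRecord> all, String series, long asOf) {
        VersionRecord best = null;
        for (VersionRecord r : all) {
            if (r.series.equals(series)
                    && r.timestamp <= asOf
                    && (best == null || r.timestamp > best.timestamp)) {
                best = r;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<VersionRecord> all = Arrays.asList(
                new VersionRecord("A", 10, "a1"),
                new VersionRecord("B", 15, "b1"),
                new VersionRecord("A", 20, "a2"));
        // Latest version is just an as-of lookup with the largest possible timestamp.
        System.out.println(findAsOf(all, "A", Long.MAX_VALUE).payload);
    }
}
```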

Time series index structure

Normal Coherence indexes cannot help us much, due to the complexity of the query, but it is possible to create a custom index tailored specifically for this task.
The time series index is similar to a traditional inverted index, but instead of storing a set of entry references for each series key, it stores a nested index that orders the versions of that series by timestamp. Using this structure you can find the latest version, or the version as of a certain moment in time, without any aggregation, just by an index lookup.
This index goes beyond the standard Coherence index API, so it requires a complementary custom filter implementation.
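The nested structure can be sketched in plain Java as a map from series key to a sorted map of timestamps. This is only an illustration of the data structure, not the GridKit implementation and not the Coherence index API; the class name TimeSeriesIndexSketch and its methods are hypothetical. It also assumes that timestamps are unique within a series (a later put with the same timestamp overwrites the earlier one).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/**
 * Hypothetical sketch: series key -> (timestamp -> entry key).
 * S is the series key type, K is the type of the cache entry key.
 */
final class TimeSeriesIndexSketch<S, K> {
    private final Map<S, NavigableMap<Long, K>> bySeries = new HashMap<>();

    /** Called when a version is inserted into the cache. */
    void put(S series, long timestamp, K entryKey) {
        bySeries.computeIfAbsent(series, s -> new TreeMap<>()).put(timestamp, entryKey);
    }

    /** Called when a version is removed from the cache. */
    void remove(S series, long timestamp) {
        NavigableMap<Long, K> versions = bySeries.get(series);
        if (versions != null) {
            versions.remove(timestamp);
            if (versions.isEmpty()) {
                bySeries.remove(series);
            }
        }
    }

    /** Entry key of the latest version of the series, or null if the series is unknown. */
    K latest(S series) {
        NavigableMap<Long, K> versions = bySeries.get(series);
        return versions == null ? null : valueOf(versions.lastEntry());
    }

    /** Entry key of the version that was current at the given timestamp, or null. */
    K asOf(S series, long timestamp) {
        NavigableMap<Long, K> versions = bySeries.get(series);
        // floorEntry returns the entry with the greatest timestamp <= the argument.
        return versions == null ? null : valueOf(versions.floorEntry(timestamp));
    }

    private static <V> V valueOf(Map.Entry<Long, V> entry) {
        return entry == null ? null : entry.getValue();
    }
}
```

Both lookups are a hash lookup followed by a single tree navigation, so their cost grows logarithmically with the number of versions in one series rather than linearly with the number of versions overall. Inserting a new version touches only the nested map of its own series.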

PROs and CONs

Below are the PROs and CONs of this approach, compared to the approaches from Ben's article.

PRO

  • Inserting a new version doesn't require modifying any other version. In particular, you do not need the hack of accessing the backing map directly, and you do not create extra replication traffic.
  • The time series index works efficiently for any point in time, not only for the latest version.
  • It can be used with any kind of cache (even with continuous queries).

CON

  • Though using the custom index is straightforward, troubleshooting can be very tricky unless you understand the index mechanics very well.

Source code

The time series index implementation is available in the GridKit project.
