Most modern in-memory data grid products have grown out of
distributed caching. For a traditional cache, loss of data is not a big deal:
missing data can always be recovered from the backend storage via read-through.
The main focus of a distributed cache is data coherence between nodes (preventing
stale reads, etc.).
Advanced patterns like all-in-memory and proactive caching
may provide considerable benefits over a traditional read-through cache. But
a simple fallback to read-through as a data recovery strategy may not be an
option for these advanced cache usages. Read-through has two prerequisites:
- data should be accessed by primary key,
- for a given primary key, you should know its master data source.
The second prerequisite may also be sacrificed,
e.g. by using multiple backends (the cache acting as an aggregator for a data set
scattered across multiple databases). You can find more details in my previous
article.
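To make these prerequisites concrete, here is a minimal read-through loader sketch for Oracle Coherence. The OrderDao class and its findById method are hypothetical stand-ins for a real backend accessor; error handling is omitted.

    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Map;

    import com.tangosol.net.cache.CacheLoader;

    // Minimal read-through sketch: the cache asks for a value by primary key,
    // and the loader must know which backend owns that key.
    public class OrderCacheLoader implements CacheLoader {

        // Hypothetical stand-in for the real backend (e.g. a JDBC DAO).
        static class OrderDao {
            Object findById(Object key) {
                return null; // a real implementation would query the master database by primary key
            }
        }

        private final OrderDao dao = new OrderDao();

        public Object load(Object key) {
            // The key is the primary key of the record in the master data source.
            return dao.findById(key);
        }

        public Map loadAll(Collection keys) {
            Map result = new HashMap();
            for (Object key : keys) {
                Object value = load(key);
                if (value != null) {
                    result.put(key, value);
                }
            }
            return result;
        }
    }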
Data loss imminent
Please note that loss of data in a modern
in-memory data grid is an exceptional event. Data is usually protected by
multiple replicas, and the grid can tolerate a server failure. But it is still possible,
and you cannot ignore this aspect, just as you cannot ignore e.g. backing up your database.
Despite all the reliability provided by data grid technology, data may
be lost, which means it will be lost eventually. The next
question is: what is your desired strategy to cope with an incomplete data set?
It depends on the type of application.
- For some applications, it is ok to return incomplete results during the recovery window.
- For other applications, an incomplete response is worse than no response. The application should guarantee that every response is complete, and if it cannot provide a complete response (e.g. part of the data set is missing from the cache), it should raise an error.
The first strategy is rather simple:
you should monitor your grid and automatically trigger a recovery procedure if a disastrous
event is detected.
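As a sketch of what such monitoring could look like in Coherence, a partition listener can react to partition loss events. Registration via the cache configuration and the triggerRecovery hook are assumptions of this example, not a prescribed setup.

    import com.tangosol.net.partition.PartitionEvent;
    import com.tangosol.net.partition.PartitionListener;
    import com.tangosol.net.partition.PartitionSet;

    // Reacts to the loss of all replicas of one or more partitions.
    // The listener is registered via the <partition-listener> element
    // of the distributed scheme in the cache configuration.
    public class RecoveryTriggeringListener implements PartitionListener {

        public void onPartitionEvent(PartitionEvent evt) {
            if (evt.getId() == PartitionEvent.PARTITION_LOST) {
                // All replicas of these partitions are gone.
                triggerRecovery(evt.getPartitionSet());
            }
        }

        // Hypothetical hook into your reload procedure.
        private void triggerRecovery(PartitionSet lostPartitions) {
            System.err.println("Data loss detected in partitions: " + lostPartitions);
        }
    }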
The second strategy is more
tricky to implement. Monitoring alone is not enough: there would be a gap between the data loss event
and the reaction of the monitoring system (which could e.g. switch the service to offline
mode for the duration of the recovery process). Some applications are totally intolerant to
inconsistent data. The data, for example, could be used in a complex batch of financial
risk calculations (running for a few hours on a large HPC grid), where a single inconsistent
piece of input could invalidate the whole batch of work.
For such applications we need a solution better than
monitoring.
Canary keys to detect missing data
We must guarantee that the result of each query is consistent
(i.e. all data that has to be processed has been processed; often this means the
whole data set). Here we have a paradox at hand: we must check the presence of certain key/value pairs
in the cache (a data grid is a key/value storage), but we cannot know the keys of these pairs or even their
total number.
The solution to this paradox lies in the basic approach to data
distribution used by data grids. I have used the technique described in this article
with the Oracle Coherence grid, but the idea can also be applied to other
products using a similar type of DHT. Oracle Coherence uses a partitioned
distributed hash table (DHT). In practice this means that key/value pairs are
not distributed individually; instead, whole partitions are assigned to particular nodes.
It also means that you cannot lose an individual key/value pair, only a whole
partition at once (if you lose all replicas of that partition at the same time).
The number of partitions is fixed for the lifetime of the grid (changing that number
requires a rebuild of the DHT).
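For reference, a short sketch of how the partition layout can be inspected through the Coherence API; the cache and key names are hypothetical.

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;
    import com.tangosol.net.PartitionedService;

    public class PartitionInfo {
        public static void main(String[] args) {
            NamedCache cache = CacheFactory.getCache("my-cache"); // hypothetical cache name
            PartitionedService service = (PartitionedService) cache.getCacheService();

            // The total number of partitions is fixed by the cache configuration
            // (the partition-count element of the distributed scheme).
            System.out.println("Partition count: " + service.getPartitionCount());

            // Every key deterministically maps to exactly one partition.
            int partition = service.getKeyPartitioningStrategy().getKeyPartition("some-key");
            System.out.println("Partition of 'some-key': " + partition);
        }
    }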
How can this help ensure data completeness?
We do not need to care about the presence of individual key/value
pairs; instead we can check the presence of all partitions (and we know their exact
IDs and total number). But how can we check the presence of a partition (technically
a partition cannot be missing, it will just be empty)? Also, we should combine the data
completeness check with the data query in the same operation, otherwise we
will always have a gap of uncertainty.
Canary keys are a trick to solve this problem. Canary keys
are synthetic keys; you put exactly one into each partition. Every partition
should have a canary key. So if your grid is configured to have N partitions,
it should contain exactly N canary keys. If the number of canary keys is less than
N, it means that a portion of data has been lost (a poor canary has perished) and has
not been recovered yet. Of course, your data loading/recovery procedure should put the canary
keys back into the cache once the data is restored.
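Below is a minimal sketch of how canary keys could be planted and counted. The key prefix, the marker value and the brute-force key generation are assumptions of this example, not a fixed convention.

    import java.util.HashMap;
    import java.util.Map;

    import com.tangosol.net.NamedCache;
    import com.tangosol.net.PartitionedService;
    import com.tangosol.net.partition.KeyPartitioningStrategy;

    public class Canaries {

        static final String PREFIX = "~canary-"; // hypothetical synthetic-key prefix

        // Computes one deterministic canary key per partition by generating
        // candidate keys until every partition is covered.
        public static Map<Integer, Object> canaryKeys(NamedCache cache) {
            PartitionedService service = (PartitionedService) cache.getCacheService();
            KeyPartitioningStrategy strategy = service.getKeyPartitioningStrategy();
            int partitions = service.getPartitionCount();

            Map<Integer, Object> keys = new HashMap<Integer, Object>();
            for (int i = 0; keys.size() < partitions; i++) {
                Object key = PREFIX + i;
                int partition = strategy.getKeyPartition(key);
                if (!keys.containsKey(partition)) {
                    keys.put(partition, key);
                }
            }
            return keys;
        }

        // Called by the initial load / recovery procedure once data is in place.
        public static void plant(NamedCache cache) {
            for (Object key : canaryKeys(cache).values()) {
                cache.put(key, Boolean.TRUE);
            }
        }

        // True if every partition still holds its canary, i.e. no partition has been lost.
        public static boolean allAlive(NamedCache cache) {
            Map<Integer, Object> expected = canaryKeys(cache);
            return cache.getAll(expected.values()).size() == expected.size();
        }
    }

Because the key set is derived deterministically from the partition count and the partitioning strategy, the loader and the checker arrive at the same canary keys without sharing any state.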
It is also possible (though quite awkward, to be honest) to integrate
the canary key check with the data query in a single request. In each query you
should select the canary keys along with the actual data you need to retrieve. Once
the query is executed, you should check the presence of all canaries and then strip them
from the result set. If all of them are there, you can be sure that your result set
is complete and no SLA would be broken. If some canaries are missing, you
should trigger data recovery and/or raise an error indicating that the request
cannot be completed until recovery is done.
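One possible shape of such a combined request is sketched below, using a single bulk read that carries both the data keys and the canary keys; the method name and the exception choice are assumptions of this example.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    import com.tangosol.net.NamedCache;

    public class CheckedQuery {

        // Bulk read that piggybacks the canary check on the same getAll() call.
        public static Map<Object, Object> getAllChecked(NamedCache cache,
                                                        Set<Object> dataKeys,
                                                        Set<Object> canaryKeys) {
            Set<Object> request = new HashSet<Object>(dataKeys);
            request.addAll(canaryKeys);

            Map<Object, Object> result = new HashMap<Object, Object>(cache.getAll(request));

            // Verify that every canary came back before trusting the data.
            for (Object canary : canaryKeys) {
                if (!result.containsKey(canary)) {
                    throw new IllegalStateException("Possible data loss, canary missing: " + canary);
                }
            }
            // Strip the canaries from the result before handing it to the caller.
            result.keySet().removeAll(canaryKeys);
            return result;
        }
    }

This sketch only covers key-based bulk reads; for filter-based queries the same idea applies, but the canaries have to be fetched alongside the query results within the same operation.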
Conclusion
The technique described in this article is rather advanced. Most
applications do not require such rigorous consistency checks for every
operation. But a few do, and for them this approach may be useful (though the implementation
may have its quirks).
At a high level, the canary keys technique demonstrates how understanding
the core principles of a DHT can help
solve a challenging task. Understanding low-level grid operations, their guarantees
and limitations, is a cornerstone of engineering complex data processing
solutions that use a distributed, ACID-less data grid as storage.
See also
An open source implementation of the canary keys framework: http://code.google.com/p/gridkit/wiki/DataLossListener.