Most of modern in-memory-data-grid products have grown from distributed caching. For traditional cache, loss of data is not a big deal, missing data could always be recovered from backend storage via read through. Main focus of distributed cache is data coherence between nodes (preventing stale reads, etc).
Advanced patterns like all-in-memory and proactive caching may provide considerable benefits over traditional read through cache. But simple fall back to read-through as a strategy for data recovery may not be an option for these advanced cache usages. Read-through have two prerequisite to be implemented:
- data should be accessed by primary key,
- for given primary key, you should known its master data source.
Second read-through prerequisite also may be sacrificed, i.e. by using multiple backends (cache acting as aggregator for data set scattered across multiple databases). You can more details in my previous article.
Data loss imminent
Through all reliability provided by data grid technology data may be lost and it means they will be lost eventually. Next question, what is your desired strategy to cope with incomplete data set?
It depends on type of application.
- For some applications, it is ok to have incomplete results from application during recovery window.
- For some application, incomplete response is worst than no response. Application should guaranty that every response is complete and if it cannot provide complete response (e.g. part of data set is missing in cache) it should raise an error.
First strategy is rather simple, you should monitor your grid and automatically trigger recovery procedure if disastrous event is detected.
Second type of strategy is more tricky in implementation. Just monitoring is not an option. There would be a gap between data loss event and reaction of monitoring system (which e.g. can switch service to offline mode for duration recovery process). Some application are totally intolerant to inconsistent data. This data, for example, could be used in complex batch of financial risk calculations (running for few hours in large HPC grid) and single inconsistent piece of input data could invalidate whole batch of work.
We need a solution better than monitoring for such kind of applications.
Canary keys to detect missing data
We must guaranty that result of each query is consistent (i.e. all data that has to be processed has been processed, often this means - whole dataset). Here we have a paradox at hands: we must check presence of certain key/value pairs (data grid is a key/value storage) in cache, but we cannot know keys of these pairs or even their total number.
Solution to this paradox lies in base approach of data distribution used by data grids. Technique described in this article has been used by me with Oracle Coherence grid, but idea can also be applied to other products using similar type of DHT. Oracle Coherence is using partitioned distributed hash table (DHT). In practice this means that key/value pairs are not distributed individually, but whole partition is assigned to particular node. It also means, that you cannot lose individual key/value pair but only whole partition at once (if you lose all replicas of that partition at the same time). Number of partitions is fixed for live time of grid (changing that number requires rebuild of DHT).
How this could be helpful to ensure data completeness?
We may not be caring about presence of individual key/value pairs, instead we may check presence of all partitions (and we know their exact IDs and total number). But how we can check presence of partition (technically partition cannot be missing, it will be just empty)? Also we should join data completeness check with queering of data in the same operation otherwise we will always have a gap of uncertainty.
Canary keys is a trick to solve this problem. Canary keys are synthetic keys, you put just on key to each partition. Every partition should have a canary key. So if your grid is configured to have N partitions, it should contain exactly N canary keys. If number of canary keys is less than N, that means portion of data has been lost (poor canary has perished) and is not recovered yet. Of cause your data loading/recovery procedure should put canary keys back in cache once data is restored.
It is also possible (though quite awkward to be honest) to integrate canary keys check in single request with data queering. In each query you should select canary keys along with actual data you need to retrieve. Once query is executed you should check presence of all canaries and then strip them from result set. If all of them are there, you can be sure that your result set in complete and no SLA would be broken. If some canaries are missing, you should trigger data recovery and/or raise an error indicating that request cannot be completed until recovery is done.
Technique described in this article is very advanced. Most applications do not require such rigorous consistency checks for every operations. But few do, and for them this approach may be useful (still implementation may have its quirks).
On high level canary keys technique demonstrate how understanding of core principles of DHT can help solving challenging task. Understanding of low level grid operations, their guaranties and limitation is a cornerstone in engineering of complex data processing solution using distributed ACID-less data grid as a storage.
Open source implementation of canary keys framework - http://code.google.com/p/gridkit/wiki/DataLossListener.