Data grid patterns series
Imagine you have a large ecommerce web application and want to do some real-time analysis over user actions. You have a stream of simple events (e.g. clicks), and you need to do real-time aggregation by various dimensions: by user, by product, etc. At large scale this task is quite challenging: the number of writes is enormous, the different dimensions make sharding difficult, and the business wants this analytics to happen as close to real time as possible (say, a few seconds of delay). With or without a data grid this task will remain challenging, but data grid technology has strong advantages for this problem.
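A minimal single-node sketch of the aggregation part, assuming a hypothetical `ClickAggregator` with one counter map per dimension (in a data grid, each map would be a distributed, partitioned map rather than a local one):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: per-dimension counters updated on every click event.
// A data grid would replace each ConcurrentHashMap with a partitioned map.
public class ClickAggregator {
    private final Map<String, Long> byUser = new ConcurrentHashMap<>();
    private final Map<String, Long> byProduct = new ConcurrentHashMap<>();

    public void onClick(String userId, String productId) {
        // merge() is atomic per key, so concurrent writers stay consistent
        byUser.merge(userId, 1L, Long::sum);
        byProduct.merge(productId, 1L, Long::sum);
    }

    public long clicksByUser(String userId) {
        return byUser.getOrDefault(userId, 0L);
    }

    public long clicksByProduct(String productId) {
        return byProduct.getOrDefault(productId, 0L);
    }
}
```

The hard part at scale is exactly what the sketch hides: one event touches several dimension maps, so a single click fans out into multiple writes that may land on different shards.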
Using shared memory for inter-process communication is a very popular pattern in the UNIX world. Many popular RDBMSes (both open source and commercial) are implemented as sets of OS processes communicating via shared memory. Designing a system as a set of functionally specialized processes cooperating with each other helps keep the system transparent and maintainable; at the same time, using shared memory as the IPC mechanism keeps communication overhead very low. Shared memory is very efficient compared to other forms of IPC: several processes can exchange or share information via shared memory without OS involvement (i.e. no syscalls, no context-switching overhead).
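The mechanics can be illustrated even from Java via memory-mapped files. In this sketch, two independent mappings of the same file see each other's writes without copying; in a real system the two mappings would live in different OS processes:

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch of shared-memory style communication: writes made through one
// mapping of a file are visible through another mapping of the same file.
public class SharedMemoryDemo {
    public static int roundTrip(int value) throws Exception {
        File f = File.createTempFile("shm-demo", ".bin");
        f.deleteOnExit();
        try (RandomAccessFile writer = new RandomAccessFile(f, "rw");
             RandomAccessFile reader = new RandomAccessFile(f, "r")) {
            MappedByteBuffer out = writer.getChannel()
                .map(FileChannel.MapMode.READ_WRITE, 0, 4);
            MappedByteBuffer in = reader.getChannel()
                .map(FileChannel.MapMode.READ_ONLY, 0, 4);
            out.putInt(0, value);   // "producer" side writes
            return in.getInt(0);    // "consumer" side reads the same page
        }
    }
}
```

Both buffers are backed by the same pages of the OS page cache, which is what makes the exchange zero-copy.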
Traditionally, data-grid-based solutions use denormalized data models. A denormalized model reduces the number of data lookups in storage and so achieves low response times. Traditional normalized data models are a poor fit for distributed key/value storage. The main argument against them is that a typical data access operation requires multiple joins. If your data is partitioned, joins become prohibitively expensive: you have to join each partition on the left side with each partition on the right side, so the cost of a join grows quadratically with the number of partitions.
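The quadratic growth is easy to see with a toy cost model (illustrative numbers, not a benchmark): joining two tables that are each split into `p` partitions requires, in the worst case, pairing every left partition with every right partition.

```java
// Toy cost model for a distributed join: with p partitions on each side
// and no co-location, every left partition must be matched against every
// right partition, giving p * p partition pairs to process.
public class JoinCost {
    public static long partitionPairs(long p) {
        return p * p;
    }
}
```

Doubling the cluster from 8 to 16 partitions quadruples the pair count from 64 to 256, which is exactly why denormalizing into single-lookup records pays off on partitioned storage.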
The classic and most widely used approach to caching is the read-through pattern. Look up the cache, then try to load from the primary data source if the entry is missing: that is how it works. This pattern is easy to implement, but it has a few unpleasant limitations. Proactive caching is an alternative to the classical read-through approach that allows better utilization of the memory resources available on modern hardware.
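For reference, a minimal sketch of the read-through pattern itself (the class and method names here are illustrative, not from any particular product):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal read-through cache: look in the cache first, call the backend
// loader only on a miss, and remember the loaded value for next time.
public class ReadThroughCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> backendLoader;
    private int backendLoads = 0; // counts expensive loads, for illustration

    public ReadThroughCache(Function<K, V> backendLoader) {
        this.backendLoader = backendLoader;
    }

    public V get(K key) {
        return cache.computeIfAbsent(key, k -> {
            backendLoads++;           // cache miss: go to the primary store
            return backendLoader.apply(k);
        });
    }

    public int backendLoads() { return backendLoads; }
}
```

The first request for a key pays the backend round-trip; every subsequent request is served from memory. The limitation is equally visible: the cache only ever holds what has already been asked for, which is the gap proactive caching aims to close.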
Many critical applications use an append-only approach for dealing with transactional data. In other words, they never update data records; instead they insert new records with a greater timestamp. A common challenge for such a data model is how to retrieve the appropriate version of a record.
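One way to sketch the retrieval problem, assuming a hypothetical `VersionedStore` where versions of a single record are keyed by timestamp: an "as of" read wants the latest version at or before a given time, which maps naturally onto a floor lookup in a sorted map.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of append-only versioning for one logical record: every write
// inserts a new entry keyed by timestamp; reads never see in-place updates.
public class VersionedStore<V> {
    private final NavigableMap<Long, V> versions = new TreeMap<>();

    public void append(long timestamp, V value) {
        versions.put(timestamp, value); // insert only, never overwrite logic
    }

    // Latest version whose timestamp <= asOf, or null if none exists yet
    public V getAsOf(long asOf) {
        Map.Entry<Long, V> entry = versions.floorEntry(asOf);
        return entry == null ? null : entry.getValue();
    }
}
```

In a distributed key/value store there is no sorted map per record for free, so supporting this floor-style lookup efficiently is exactly the challenge the paragraph above refers to.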
Most modern in-memory data grid products have grown out of distributed caching. For a traditional cache, loss of data is not a big deal: missing data can always be recovered from the backend storage via read-through. Advanced patterns like all-in-memory and proactive caching may provide considerable benefits over a traditional read-through cache, but a simple fallback to read-through as a data recovery strategy may not be an option for these advanced cache usages.