Monday, November 8, 2010

Data Grid Pattern - Network shared memory

Using shared memory for inter-process communications is a very popular pattern in UNIX world. Many popular RDBMS (both open source and commercial) are implemented as sets of OS processes communicating via shared memory. Designing a system as a set of functionally specialized processes cooperating with each other helps to keep system more transparent and maintainable; while using shared memory as IPC keeps communication overhead is very low. Shared memory is very efficient compared to other forms of IPC. Several processes can exchange or share information with each other via shared memory without involvement of OS (e.i. no syscalls, no context switching overheads).
Unfortunately shared memory is useful only if all communicating processes are hosted on same server (e.i. connected to same memory circuits). While we cannot use real shared memory when building a distributed system, idea of building system as a set of cooperating specialized processes still remains attractive.
Key point of shared memory is what all process can access and modify same data; all processes have consistent view and can do atomic operations on data (e.g. compare-and-set). Data grid technologies allow us to achieve same features but in network environment. Modern data grid products (e.g. Oracle Coherence, GemStone GemFire, etc) provide us key features for shared-memory-like communication:
  • share data with strong consistency guaranties across processes in cluster;
  • atomic operations over single key in shared storage.
Still, data grids remain a very different technology. They have with different tradeoffs compared to shared memory. It is impossible to take a PostgresSQL and make it clustered using Oracle Coherence. But we can reuse and extend architectural approach and develop a “network shared memory” pattern as a way to build distributed system.

“Network shared memory” pattern

First you need separate data and processes in your mind. Next you should identify operations with data which are happening in your system. You may have operations such as serving request using data, updating, importing data, exporting data, transforming data, etc. Ideally you should have separate process dedicated for each operation (though it is not always possible due to various reasons, so be holistic). Once you finished with analyzing your data and designing processes, you can start to work on physical model to store your data in grid. Remember, data for grid should be always modeled with access pattern in mind. Just coping data model from RDBMS may produce disastrous results (data modeling for grid is another large topic).

 Why having multiple processes is better?

Why having multiple processes is better compared e.g. with multiple threads within a single process?
Below are few answers:
  • You can distributed your processes between servers (e.g. for optimizing resource utilization)
  • You can start/stop processes independently
  • You can upgrade processes independently
  • You have better failure isolation (e.g. memory leak in one process will not bring down whole system)
  • You can use less heap per JVM, and have individual memory options for different processes
Sure there are some drawbacks also. A few of them:
  • You will have more JVMs running and thus more overhead of JVM itself.
  • All processes in the end have to communicate over network and it is not as fast as using memory shared between threads.

 Advanced usages of “Network shared memory”

Data grids have high availability built in. We can leverage this feature to achieve high availability for our processes. First pattern is a “hot standby”. We may have several instances of same process running (e.g. on different boxes) but only one of them active. In case of active process going down, grid will detect process failure and we can promote one of standbys to be new active. Sounds simple, but this simple approach requires “death detection”, “peer discovery” and “distributed consensus”. Believe me, all of these is not an easy task to implement. Fortunately, data grid already have all of this implemented, you just need to use its API. Data grid also can be used to store internal state of process, this way you can implement failover even for stateful processes.
A next evolutionary step of this approach is a load balancing between processes. Data grid can help coordinating processes by sharing routing table for request, state for stateful operations and/or using distributed locks for controlling access to resources.

Data grid is more than just distributed data store

While data grid technology is primarily designed for working with larger data sets, advanced features of modern data grid products may bring benefits to your system even if all your data fit a memory of single server. Their high availability and distributed coordination features may be invaluable for designing modular distributed solutions.