Monday, November 8, 2010

Data Grid Pattern - Network shared memory

Using shared memory for inter-process communication is a very popular pattern in the UNIX world. Many popular RDBMSes (both open source and commercial) are implemented as sets of OS processes communicating via shared memory. Designing a system as a set of functionally specialized, cooperating processes helps keep the system transparent and maintainable, while using shared memory for IPC keeps communication overhead very low. Shared memory is very efficient compared to other forms of IPC: several processes can exchange or share information with each other without involving the OS (i.e. no syscalls, no context-switching overhead).
Unfortunately, shared memory is useful only if all communicating processes are hosted on the same server (i.e. connected to the same memory circuits). While we cannot use real shared memory when building a distributed system, the idea of building a system as a set of cooperating specialized processes remains attractive.
The key point of shared memory is that all processes can access and modify the same data; all processes have a consistent view and can perform atomic operations on the data (e.g. compare-and-set). Data grid technologies let us achieve the same features in a network environment. Modern data grid products (e.g. Oracle Coherence, GemStone GemFire, etc.) provide the key features needed for shared-memory-like communication:
  • sharing data with strong consistency guarantees across processes in a cluster;
  • atomic operations over a single key in the shared storage.
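To make the per-key atomic operations concrete, here is a minimal sketch in plain Java. A local ConcurrentHashMap stands in for the distributed cache; the class name and methods are illustrative, not any grid's actual API. The assumption is that the grid exposes compare-and-set-style primitives per key (most products do, e.g. via entry processors), so the same idioms carry over.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch: the per-key atomic operations a data grid offers,
// illustrated with a local ConcurrentHashMap standing in for
// the distributed cache.
public class AtomicGridOps {
    private final ConcurrentMap<String, Integer> cache = new ConcurrentHashMap<>();

    // Atomic compare-and-set: succeeds only if the current value matches.
    public boolean compareAndSet(String key, int expect, int update) {
        return cache.replace(key, expect, update);
    }

    // Atomic increment built from a CAS retry loop -- the same idiom
    // works against any store that supports replace(key, old, new).
    public int increment(String key) {
        while (true) {
            Integer current = cache.putIfAbsent(key, 1);
            if (current == null) return 1;
            if (cache.replace(key, current, current + 1)) return current + 1;
        }
    }

    public Integer get(String key) { return cache.get(key); }
}
```

The CAS retry loop is the important part: several processes can run it concurrently against the same key and the counter stays correct without any external locking.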
Still, data grids remain a very different technology, with different tradeoffs compared to shared memory. It is impossible to take PostgreSQL and make it clustered using Oracle Coherence. But we can reuse and extend the architectural approach and develop a “network shared memory” pattern as a way to build distributed systems.

“Network shared memory” pattern

First you need to separate data and processes in your mind. Next you should identify the operations on data that happen in your system. You may have operations such as serving requests using data, updating data, importing data, exporting data, transforming data, etc. Ideally you should have a separate process dedicated to each operation (though this is not always possible for various reasons, so be pragmatic). Once you have finished analyzing your data and designing your processes, you can start to work on a physical model for storing your data in the grid. Remember, data for the grid should always be modeled with the access pattern in mind. Just copying a data model from an RDBMS may produce disastrous results (data modeling for the grid is another large topic).

 Why is having multiple processes better?

Why is having multiple processes better than, e.g., having multiple threads within a single process?
Below are a few answers:
  • You can distribute your processes between servers (e.g. to optimize resource utilization)
  • You can start/stop processes independently
  • You can upgrade processes independently
  • You have better failure isolation (e.g. a memory leak in one process will not bring down the whole system)
  • You can use a smaller heap per JVM and set individual memory options for different processes
Sure, there are some drawbacks as well. A few of them:
  • You will have more JVMs running, and thus more JVM overhead.
  • All processes ultimately have to communicate over the network, which is not as fast as memory shared between threads.

 Advanced uses of “network shared memory”

Data grids have high availability built in. We can leverage this feature to achieve high availability for our processes. The first pattern is “hot standby”. We may have several instances of the same process running (e.g. on different boxes), but only one of them active. If the active process goes down, the grid detects the failure and we can promote one of the standbys to be the new active instance. Sounds simple, but this simple approach requires “death detection”, “peer discovery”, and “distributed consensus”. Believe me, none of these is easy to implement. Fortunately, a data grid already has all of this implemented; you just need to use its API. A data grid can also be used to store the internal state of a process, so you can implement failover even for stateful processes.
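The hot-standby promotion logic can be sketched in a few lines. This is an illustration only: a ConcurrentHashMap stands in for the shared grid cache, the class and key names are made up, and the "death detection" part is assumed to be delivered by the grid as a callback rather than implemented here.

```java
import java.util.concurrent.ConcurrentMap;

// Sketch of "hot standby" promotion on top of a grid's atomic primitives.
// The ConcurrentMap stands in for the shared grid cache; in a real
// deployment the grid itself would detect the death of the lease holder.
public class HotStandby {
    static final String LEADER_KEY = "leader";
    private final ConcurrentMap<String, String> grid;
    private final String processId;

    public HotStandby(ConcurrentMap<String, String> grid, String processId) {
        this.grid = grid;
        this.processId = processId;
    }

    // Try to become the active instance; only one contender can win,
    // because putIfAbsent is atomic across the cluster.
    public boolean tryPromote() {
        String winner = grid.putIfAbsent(LEADER_KEY, processId);
        return winner == null || winner.equals(processId);
    }

    // Called when the grid reports the active instance has died:
    // release the lease so a standby can promote itself.
    public void releaseOnFailure(String deadProcessId) {
        grid.remove(LEADER_KEY, deadProcessId);
    }

    public boolean isActive() {
        return processId.equals(grid.get(LEADER_KEY));
    }
}
```

The consensus problem mentioned above is exactly what the atomic `putIfAbsent` hides: all contending standbys race on one key, and the grid guarantees a single winner.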
The next evolutionary step of this approach is load balancing between processes. A data grid can help coordinate processes by sharing a routing table for requests, sharing state for stateful operations, and/or using distributed locks to control access to resources.
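A shared routing table can be sketched the same way. Again a ConcurrentHashMap stands in for the grid cache, and the partitioning scheme (hash of the request key modulo a fixed partition count) is an illustrative assumption, not a prescribed design.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch: load balancing via a routing table shared through the grid.
// Each worker claims partitions with an atomic putIfAbsent; requests
// are then routed by looking up the partition's owner in the table.
public class RoutingTable {
    private final ConcurrentMap<Integer, String> owners = new ConcurrentHashMap<>();
    private final int partitions;

    public RoutingTable(int partitions) {
        this.partitions = partitions;
    }

    // A worker claims every partition not yet owned by someone else;
    // putIfAbsent guarantees each partition gets exactly one owner.
    public void claimFree(String workerId) {
        for (int p = 0; p < partitions; p++) {
            owners.putIfAbsent(p, workerId);
        }
    }

    // Route a request key to the worker owning its partition.
    public String ownerOf(String requestKey) {
        int partition = Math.floorMod(requestKey.hashCode(), partitions);
        return owners.get(partition);
    }
}
```

Because the table lives in the grid, every process sees the same ownership map, and a failed worker's partitions can be reclaimed by survivors using the same atomic primitives.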

Data grid is more than just distributed data store

While data grid technology is primarily designed for working with large data sets, the advanced features of modern data grid products may benefit your system even if all your data fits in the memory of a single server. Their high availability and distributed coordination features may be invaluable for designing modular distributed solutions.

Wednesday, October 13, 2010

Data Grid Pattern - Data flow mediator

I want to start a series of articles about data-grid-oriented architectural patterns. The first pattern I want to present is the “Data flow mediator”. Let me start with an example.
Imagine you have a large ecommerce web application and want to do some real-time analysis of user actions. You have a stream of simple events, let's say clicks, and you need to do real-time aggregation by various dimensions: by user, by product, etc. At large scale this task is quite challenging: the number of writes is enormous, the multiple dimensions make sharding difficult, and the business wants the analytics as close to real time as possible (say, a few seconds of delay). With or without a data grid such a task remains challenging, but data grid technology has strong advantages here.
Let me now introduce the “data flow mediator” pattern.
In this pattern, the data grid is used as a buffer between the systems that produce events (clicks) and the systems that consume information (real-time analysis modules).
From the producer's point of view:
  • the grid provides high, scalable write throughput;
  • the grid provides a reasonable balance between durability, performance, and cost: data can be stored in memory only, protected by multiple redundant copies.
From the consumers' (real-time analysis modules) point of view:
  • advanced data grids (e.g. Coherence, GemFire, etc.) provide the required querying/aggregation tools (implementing efficient queries by multiple dimensions in the grid is still an art, but it is doable);
  • high, scalable read throughput: different analysis modules may share the same “mediator”.
From the architect's point of view:
  • the mediator decouples data producers from data consumers, localizing the impact of changes to each component;
  • the data grid is self-managing. Imagine managing a DB with 50 shards + HA replication + dynamically adding/removing servers to the cluster, and you will treasure this feature of the data grid.
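The mediator's two sides can be sketched in a few lines of Java. A ConcurrentHashMap stands in for the partitioned grid cache, and the dimensions (by user, by product) follow the click example above; class and method names are illustrative.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of the "data flow mediator": producers write click events into
// shared caches keyed by dimension; consumers read aggregates from them.
// A ConcurrentHashMap stands in for each partitioned grid cache.
public class ClickMediator {
    private final ConcurrentMap<String, Long> byUser = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, Long> byProduct = new ConcurrentHashMap<>();

    // Producer side: one cheap, atomic write per dimension.
    public void recordClick(String userId, String productId) {
        byUser.merge(userId, 1L, Long::sum);
        byProduct.merge(productId, 1L, Long::sum);
    }

    // Consumer side: analysis modules query the shared aggregates.
    public long clicksForUser(String userId) {
        return byUser.getOrDefault(userId, 0L);
    }

    public long clicksForProduct(String productId) {
        return byProduct.getOrDefault(productId, 0L);
    }
}
```

Note that producers and consumers never talk to each other directly; both only touch the shared caches, which is exactly the decoupling the pattern is after.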
I have demonstrated this pattern with an ecommerce example, but there are similar use cases in finance and telecom. The key prerequisites for this pattern are:
  • a large number of small updates;
  • data loss is not fatal (either we can tolerate it or we can restore the data from somewhere else);
  • a large number of read queries;
  • queries are reasonably simple, but more complicated than a get by primary key;
  • low response time requirements for both reads and writes;
  • scale beyond what a single RDBMS can handle.
I hope this article has helped you better understand data grid technology and its use cases.

Monday, October 11, 2010

Coherence, magic trick with cache index

A technical article related to Oracle Coherence.

Oracle Coherence has the ability to query data in a cache not only by key, but also by value (or value attributes). Of course, querying by secondary attributes is not as efficient as querying by primary key, but it can still be very useful. Oracle Coherence also supports indexes, which improve the performance of value-based queries significantly.
But sometimes an index may not behave exactly as you expect.

Full text of article is available at GridDynamics blog - http://blog.griddynamics.com/2010/10/coherence-magic-trick-with-cache-index.html