Wednesday, October 13, 2010

Data Grid Pattern - Data flow mediator

I want to start a series of articles about data-grid-oriented architectural patterns. The first pattern I want to present is the “Data flow mediator”. Let me start with an example.
Imagine you have a large ecommerce web application and want to do some real-time analysis of user actions. You have a stream of simple events, let’s just say clicks, and you need to aggregate them in real time by various dimensions: by user, by product, etc. At large scale this task is quite challenging: the number of writes is enormous, the different dimensions make sharding difficult, and the business wants these analytics as close to real time as possible (say, a delay of a few seconds). With or without a data grid the task will remain challenging, but data grid technology has strong advantages for it.
Let me now introduce the “data flow mediator” pattern.
In this pattern, a data grid is used as a buffer between the systems that produce events (clicks) and the systems that consume information (real-time analysis modules).
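To make the shape of the pattern concrete, here is a minimal single-process sketch in Python. It is not how any real grid works internally: the names `Click`, `Mediator`, `publish`, `recent` and `evict` are hypothetical, and a plain in-memory deque stands in for the distributed grid. It only shows that producers append events without knowing who reads them, while consumers query a recent time window.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Click:
    user: str
    product: str
    ts: float  # event time in seconds


class Mediator:
    """In-process stand-in for the grid: a buffer of recent events."""

    def __init__(self, retention_seconds: float):
        self.retention = retention_seconds
        self._events = deque()  # assumed to arrive in timestamp order

    def publish(self, click: Click) -> None:
        # Producer side: a cheap append, with no knowledge of consumers.
        self._events.append(click)

    def evict(self, now: float) -> None:
        # Drop events that are older than the retention window.
        while self._events and self._events[0].ts < now - self.retention:
            self._events.popleft()

    def recent(self, now: float, window: float) -> list:
        # Consumer side: read the events of the last `window` seconds.
        return [e for e in self._events if now - window <= e.ts <= now]
```

Any number of analysis modules can call `recent` on the same mediator, and none of them has to be known to the code that calls `publish`.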
From the producers’ point of view:
  • The grid provides high and scalable write throughput,
  • The grid provides a reasonable balance between durability, performance and cost. In a grid we can store data in memory only, protected by multiple redundant copies.
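The idea of memory-only storage protected by redundant copies can be sketched as follows. This is a toy model under stated assumptions, not the algorithm of Coherence or GemFire: `Node`, `ReplicatedStore` and `kill` are hypothetical names, keys are placed by a simple hash, and there is no rebalancing after a failure (real grids restore the missing copies on the surviving nodes).

```python
import zlib


class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}      # memory only, nothing is written to disk
        self.alive = True


class ReplicatedStore:
    def __init__(self, node_names, backups=1):
        self.nodes = [Node(n) for n in node_names]
        self.backups = backups

    def _owners(self, key):
        # Stable hash, so a key always maps to the same primary and backups.
        start = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.backups + 1)]

    def put(self, key, value):
        # Write the primary copy and every backup copy.
        for node in self._owners(key):
            if node.alive:
                node.data[key] = value

    def get(self, key):
        # Read from the first surviving owner that holds the key.
        for node in self._owners(key):
            if node.alive and key in node.data:
                return node.data[key]
        return None

    def kill(self, name):
        # Simulate a crash: the node's in-memory contents are gone.
        for node in self.nodes:
            if node.name == name:
                node.alive = False
                node.data.clear()
```

With `backups=1`, losing one owner of a key leaves the value readable; losing both owners loses the value, which is the durability trade-off the pattern accepts.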
From the consumers’ (real-time analysis modules) point of view:
  • Advanced data grids (e.g. Coherence, GemFire, etc.) provide the required querying/aggregation tools (implementing efficient queries by multiple dimensions in a grid is still an art, but it is doable),
  • High and scalable read throughput. Different analysis modules may share the same “mediator”.
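One common way to answer queries by several dimensions is to maintain an index per dimension at write time. The sketch below illustrates that idea only; it does not use the query API of any real grid, and `ClickIndex`, `count_by` and `count_where` are hypothetical names. Events are assumed to be plain dictionaries that contain every indexed dimension.

```python
from collections import Counter, defaultdict


class ClickIndex:
    def __init__(self, dimensions):
        self.dimensions = tuple(dimensions)
        # dimension name -> Counter mapping value -> number of clicks
        self._counts = {d: Counter() for d in self.dimensions}
        # (dimension, value) -> set of event ids, for combined filters
        self._ids = defaultdict(set)
        self._next_id = 0

    def add(self, event: dict) -> None:
        # Update every per-dimension index when the event is written.
        event_id = self._next_id
        self._next_id += 1
        for d in self.dimensions:
            self._counts[d][event[d]] += 1
            self._ids[(d, event[d])].add(event_id)

    def count_by(self, dimension):
        # Aggregation by a single dimension is a lookup, not a scan.
        return dict(self._counts[dimension])

    def count_where(self, **filters):
        # Intersect the id sets of all filters, e.g. user AND product.
        sets = [self._ids.get(item, set()) for item in filters.items()]
        return len(set.intersection(*sets)) if sets else self._next_id
```

The cost is extra work and memory on every write, which is part of why efficient multi-dimensional querying in a grid takes some care.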
From the architect’s point of view:
  • The mediator decouples data producers from data consumers, thus localizing the impact of changes to each component,
  • The data grid is self-managing. Imagine managing a DB with 50 shards + HA replication + dynamically adding/removing servers in the cluster, and you will treasure this feature of a data grid.
I have demonstrated this pattern with an ecommerce example, but there are similar use cases in finance and telecom. The key prerequisites for this pattern are:
  • Large number of small updates,
  • Data loss is not fatal (either we can tolerate it or restore data from somewhere else),
  • Large number of read queries,
  • Queries are reasonably simple but more complicated than just a get by primary key,
  • Low response time requirements for both read and write,
  • The scale is beyond what a single RDBMS can handle.
I hope this article helped you better understand data grid technology and its use cases.

1 comment:

  1. Thanks, Alexey, very useful blog!
    (there are some minor problems with english, mainly typos like "some there else")