Thursday, January 28, 2010

Data storages and read vs write controversy

While working with various high-loaded systems recently, I noticed a paradoxical contradiction in the data models. Let me explain through the following comparisons. First let's think about normalized vs. denormalized data models.


Normalized Denormalized

Read

Bad

- Queries become complex
- Joins and nested selects are slow

Good

- Fast queries
- No joins
- Queries are mostly single index lookup
(assuming schema
is tailored for application need)

Write

Good

- Consistency is easier to keep
- No self contraction by schema
- Less rows to update

Bad

- Potential inconsistency in data
- More rows to update
(data may be
duplicated in several places)
- Complex update procedures

Let's continue with the next comparison -- single vs. multiple copies of data.


Single copy Multiple copies

Read

Bad

- Single copy is bottleneck

Good

- Operations can be balanced between copies

Write

Good

- No need to keep copies in sync
- Single place to update

Bad

- Update should performed at every copy
- Synchronization and consistency issues
- Transactional updates become distributed transactions

In general, the patterns are simple. If we think of our data as a set of facts, we anticipate each fact will become several data records (either though database replicas or denormalization of the model). For write access, dealing with each fact as a single record is much more efficient. The impact of this difference between what is and what should be can mitigated by using the optimal storage technology for a given application.


The picture above is very simple (or even simplistic) but it provides some insight for selecting the proper data storage for your applications (or even specific parts of your application). If you have any comments about this, please reply to this post.

Key/value and document-oriented storages

These are data-storage techniques that don't support joins and they usually break the 1st normal form. Due to limited query support, the schema is often prepared in such a way that application queries become simple index lookups. This can be a very effective approach, but the flip side of this is usually a duplication of the entity's attributes across several tables (or their analogs). This makes updating of data more expensive. These types of storages also tending to use asynchronous disk operations which is further undermine their value as system of record storage. Lack of ACID properties is also of no help.

Search indexes

Search indexes such as Lucene and Sphynx provide excellent query performance. They are using data structures designed for information retrieval at cost of expensive updates. Search engines also require denormalized data model, further complicating writes.

RDBMS

Relational databases have a strong reliance on normalization of the data model. It is extremely difficult for an RDBMS to be effective for both read and write operations. While an RDBMS will never be as fast or as simple as a key/value hash table because the write operations will always be quite expensive (due to indexes and consistency checks), they can be a good middle ground between key/value hash tables and MQ-based storage.

MQ

It may be surprising to some people that I have included message queues in the same discussion with databases. But persistent queues or publish/subscribe systems with quarantined delivery are similar to databases. Submitting a message is like a write operation and receiving a message is like a read. The important thing about MQs is that they can be very efficient for write operations. Read operations in MQ environments are very limited, but experience has shown that "write-only" data storage can fill an important niche.

Convergence of paradigms

In the early days of relational databases, many people were skeptical about their future. The main argument against them was performance. But RDBMS technology has survived. Vendors have made the indexes faster and the query optimizers smarter. Overall, the performance and reliability have improved significantly. While maintaining a strong position in their niche, they have slowly assimilated key features of competing technologies to expand their appeal.

Materialized views are actually a smart way to get the benefits of denormalized data while keeping the schema normalized when it comes to updating. Message queues are also becoming a part of RDBMS offerings (e.g., Oracle RDBMS has queues at its core and PostgreSQL has production-ready built-in queues used by Skype).

We are living in an interesting time. Networking has led to the exponential growth in the amount of data created by, and use for, applications. Data-storage technologies have to adapt and evolve quickly, and it's fascinating to watch this evolution. Good luck with your data storage!

No comments:

Post a Comment