Tuesday, August 30, 2011

Lucene in the grid, query performance

Apache Lucene is a very powerful open source full text search engine. Though its primary focus is full text search and relevance, the search machinery implemented in Lucene can be used far beyond classical full text search. Oracle Coherence 3.6 introduced an API for custom indexes. This API allows custom indexing engines to be used transparently inside the data grid (via the normal Coherence query API), letting the grid itself handle partitioning and keep the index up to date.
In this article I would like to compare Lucene with the built-in Coherence indexes, and explain when you may want to use Lucene even if the same functionality is available using built-in Coherence indexes.
You can find more information about the Coherence / Lucene integration at the project home page.
Both Lucene and Coherence use an inverted index to execute queries efficiently. Unlike relational databases, which tend to use a single index per select clause (you can read more here), both Lucene and Coherence will try to use all available indexes matching the query criteria. Using multiple indexes means extensive binary operations over sets of candidate keys. Coherence manipulates these sets using hash tables, while Lucene uses bitmaps. This is the main factor explaining the performance difference between Lucene and Coherence native indexes on certain types of queries.
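As a rough illustration of that difference, here is a standalone sketch (not code from either engine) contrasting the two ways of intersecting candidate key sets. If keys can be mapped to dense ordinals, a bitmap AND processes 64 candidates per machine word, while a hash set intersection pays a hash lookup per element:

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: intersecting two candidate key sets the
// "hash table" way vs. the "bitmap" way.
public class SetIntersection {

    // Hash-table style: retain only the keys present in both sets.
    static Set<Integer> intersectHash(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new HashSet<>(a);
        result.retainAll(b); // one hash lookup per element of a
        return result;
    }

    // Bitmap style: with keys mapped to dense ordinals, a bitwise
    // AND handles 64 candidates per CPU instruction.
    static BitSet intersectBitmap(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> a = new HashSet<>();
        Set<Integer> b = new HashSet<>();
        BitSet ba = new BitSet(), bb = new BitSet();
        for (int i = 0; i < 100_000; i += 2) { a.add(i); ba.set(i); }
        for (int i = 0; i < 100_000; i += 3) { b.add(i); bb.set(i); }
        // both approaches agree: multiples of 6 survive
        System.out.println(intersectHash(a, b).size());            // 16667
        System.out.println(intersectBitmap(ba, bb).cardinality()); // 16667
    }
}
```

Both approaches produce the same result; the difference is purely in how the work scales with the size of the candidate sets.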

Test data set

For testing I'm using a set of 100k objects with a number of randomly generated attributes. The attributes have the following cardinality: A - 1, B - 10, C - 50, D - 1000, E - 10000, H - 50000. While 100k is not many, it is a reasonable number, because data in a Coherence grid is partitioned and so is the index itself. Each storage node indexes just its own content, and 100k objects per node is a reasonable data set size.
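A data set like this could be produced along the following lines. This is a hypothetical sketch, not the original benchmark code; only the attribute names and cardinalities are taken from the list above:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical generator for one test object: each attribute takes
// one of 'cardinality' distinct values, chosen uniformly at random.
public class TestDataGenerator {

    static final Map<String, Integer> CARDINALITY = Map.of(
            "A", 1, "B", 10, "C", 50, "D", 1000, "E", 10000, "H", 50000);

    static Map<String, String> randomObject(Random rnd) {
        Map<String, String> attrs = new HashMap<>();
        for (Map.Entry<String, Integer> e : CARDINALITY.entrySet()) {
            // value is the attribute name plus an index below its cardinality
            attrs.put(e.getKey(), e.getKey() + rnd.nextInt(e.getValue()));
        }
        return attrs;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        System.out.println(randomObject(rnd)); // e.g. {A=A0, B=B7, ...}
    }
}
```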

Single criterion query

Lucene and Coherence are in the same situation here. A single criterion produces just one set of candidate keys, so no set manipulation is needed. Coherence can take the key set from the index and return it as is, while Lucene has to reconstruct keys from documents (which involves base64 decoding and takes time). So here Lucene has no advantage in speed, but it doesn't lag behind too much either. The diagram also shows a separate bar for a sorted index in Coherence, but its performance is roughly the same as the hash index.

Range query

Coherence's BetweenFilter implementation uses the index extremely inefficiently. I have implemented my own RangeFilter using standard Coherence indexes. For my data set RangeFilter beats BetweenFilter by a factor of 1000 (on attribute A). Further in this article I always mean RangeFilter when speaking about range queries.
Range query execution is conceptually similar in Coherence and Lucene. Both have to scan a range of the term set, find the matching terms and union the key sets for these terms. So performance is again roughly on par; Lucene suffers some penalty for reconstructing each matching key from base64.
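The shared idea can be sketched with a sorted term dictionary, here a plain TreeMap (an illustration of the concept, not either engine's actual code): locate the sub-range of terms, then union the key sets posted under each matching term.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Range evaluation over a sorted term dictionary.
public class RangeQuery {

    final TreeMap<Integer, Set<Integer>> index = new TreeMap<>();

    void add(int key, int term) {
        index.computeIfAbsent(term, t -> new HashSet<>()).add(key);
    }

    Set<Integer> between(int from, int to) {
        Set<Integer> result = new HashSet<>();
        // subMap yields the matching terms without scanning the whole dictionary
        SortedMap<Integer, Set<Integer>> range = index.subMap(from, true, to, true);
        for (Set<Integer> keys : range.values()) {
            result.addAll(keys); // union of per-term key sets
        }
        return result;
    }

    public static void main(String[] args) {
        RangeQuery q = new RangeQuery();
        for (int key = 0; key < 100; key++) {
            q.add(key, key % 10); // term = key modulo 10
        }
        System.out.println(q.between(3, 5).size()); // 30 keys with term 3, 4 or 5
    }
}
```

The cost is dominated by the union step, which is why a wide range (many matching terms) is expensive for both engines.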

Multiple criteria query

So far, there was not much reason for a performance difference between Coherence native indexes and Lucene. But for queries using multiple criteria we may expect to see a considerable difference. I have measured a few complex queries involving several attributes and sometimes a range or multi term match (InFilter for Coherence).

Queries with two criteria

The top 5 queries have 2 criteria, one of which is the E attribute, which has low selectivity. Coherence filter execution time depends on the order of criteria (if the most selective criterion is put first, execution time is shorter). Lucene is resilient to the order of criteria.
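The reason order matters for hash-set intersection is that the cost of each step is proportional to the size of the current candidate set, so starting from the most selective (smallest) set keeps the intermediate sets small. A simple hand-optimization, sketched below (an illustration, not Coherence internals), is to sort the per-criterion key sets by size before intersecting:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Intersect per-criterion candidate key sets, most selective first.
public class OrderedIntersection {

    static Set<Integer> intersectSmallestFirst(List<Set<Integer>> sets) {
        List<Set<Integer>> ordered = new ArrayList<>(sets);
        ordered.sort(Comparator.comparingInt(Set::size)); // smallest set first
        Set<Integer> result = new HashSet<>(ordered.get(0));
        for (int i = 1; i < ordered.size(); i++) {
            result.retainAll(ordered.get(i)); // cost ~ current result size
            if (result.isEmpty()) break;      // nothing can match any more
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> big = new HashSet<>(List.of(1, 2, 3, 4, 5));
        Set<Integer> small = new HashSet<>(List.of(2, 3, 9));
        System.out.println(intersectSmallestFirst(List.of(big, small))); // {2, 3} in some order
    }
}
```

The result is the same regardless of the order of the input list, which is exactly the property Lucene's query evaluation gives you for free.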

Three criteria query

This query (E1 = x & E2 = y & E3 = z) has only low selectivity criteria, so Lucene beats Coherence on this query by almost 5 times.

Four criteria query

Same trend as with the 2 criteria queries: if the most selective criterion comes first, Coherence and Lucene are on par; otherwise Lucene is a few times faster.

Query with range criterion

This query emulates a case where we need to limit the result set by a range of attribute values (e.g. a timestamp range). The query uses a 10k long range of attribute A values, so it is hard work for both indexing engines to union 10k sets. Lucene does this job 3 times faster.

Query with multi term match criterion

This query includes a multi term match criterion (InFilter for Coherence). Lucene is consistently faster, by a factor of about 5. This query does not have a single good selective criterion, so Coherence lags behind.
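Conceptually, a multi term match ("IN" semantics) turns one criterion into a union of the key sets of all listed terms, which is then intersected with the other criteria. The sketch below illustrates the union step (again a standalone illustration, not engine code); with no selective criterion the engine must build and combine large sets, which is where compact bitmap representations pay off.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Multi term match: the criterion result is the union of the key
// sets posted under each of the listed terms.
public class MultiTermMatch {

    final Map<String, Set<Integer>> index = new HashMap<>();

    void add(int key, String term) {
        index.computeIfAbsent(term, t -> new HashSet<>()).add(key);
    }

    Set<Integer> in(Set<String> terms) {
        Set<Integer> result = new HashSet<>();
        for (String term : terms) {
            result.addAll(index.getOrDefault(term, Set.of()));
        }
        return result;
    }

    public static void main(String[] args) {
        MultiTermMatch m = new MultiTermMatch();
        m.add(1, "x"); m.add(2, "y"); m.add(3, "z");
        System.out.println(m.in(Set.of("x", "y"))); // keys 1 and 2
    }
}
```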

Query with low cardinality attribute

Attribute H has only 2 possible values, which makes filter processing a lot harder: each value matches roughly half of the keys. Coherence performance ranges from poor to very bad depending on the order of criteria. Lucene is fast as usual.


Conclusions

Coherence is sensitive to the order of criteria in a query: it shows better results if the most selective criterion comes first in the filter. Lucene is resilient to the order of criteria. Coherence index performance suffers considerably as we add criteria, especially if each attribute value matches many entries (e.g. a boolean attribute). Lucene has to reconstruct the entry key for each entry of the final result set, so it slows down as the result set of the query grows. Lucene handles range queries better than Coherence.
These tests show that for many non trivial queries Lucene delivers better performance than Coherence built-in indexes. For certain types of queries Coherence may require hand optimization, while Lucene does not. All of the above makes Lucene very attractive for ad hoc querying of datasets, and the ability to execute full text search and wildcard queries complements this purpose nicely.
And by the way, the Lucene index consumes a few times less memory than the built-in Coherence index (it is more expensive to update, though). But it would take another article to elaborate on these aspects.

Wednesday, August 24, 2011

TechTalk: Garbage collection in Java

Today I was speaking about Java garbage collection and tuning the JVM for short pauses on large heaps at one of the major web/social players in Russia.

Below is the slide deck (in English)

Saturday, August 13, 2011

Announce: ReflectionPofSerializer 1.2 is available

ReflectionPofSerializer has been lying around without significant changes for about a year and a half. Though I'm actively using it across projects, it was just working. But recently I decided to make a few important functional improvements and extend the capabilities of ReflectionPofSerializer.

Advanced support for collections

One of the complaints I was hearing about ReflectionPofSerializer was poor support for collections. Actually these complaints should have been addressed to the POF protocol itself; ReflectionPofSerializer was just using it as is. POF has a bad habit of substituting collections and maps with its own implementations, which may not necessarily be compatible with application needs (i.e. you have a good chance of a HashSet being replaced by a list, and a TreeMap by a HashMap). If you are writing a POF serializer by hand you can easily fix this, but for automatic serialization it may be a serious issue.
How does the new version of ReflectionPofSerializer handle this problem?
If you have a problem with collection classes being converted during serialization and this is breaking your application, you can force Coherence to use ReflectionPofSerializer for those specific collection classes. This requires you to add the standard (or your custom) collection classes to the POF config.
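As a sketch, a pof-config.xml registration for one collection class could look like the fragment below. The type ID here is an arbitrary placeholder, and the serializer package name is a guess; use the exact class name from your GridKit distribution:

```xml
<pof-config>
  <user-type-list>
    <!-- include default Coherence POF types -->
    <include>coherence-pof-config.xml</include>
    <!-- force ReflectionPofSerializer for a specific collection class -->
    <user-type>
      <type-id>1001</type-id> <!-- placeholder; user type IDs must be >= 1000 -->
      <class-name>java.util.TreeMap</class-name>
      <serializer>
        <!-- package name may differ; take it from the GridKit jar -->
        <class-name>org.gridkit.coherence.utils.pof.ReflectionPofSerializer</class-name>
      </serializer>
    </user-type>
  </user-type-list>
</pof-config>
```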

That is it. ReflectionPofSerializer now knows how to properly serialize/deserialize collection classes.

Using POF without POF config

Using POF drastically improves Coherence performance for most applications. Unfortunately, you have to register all your classes in pof-config.xml and provide serializers. Even though you can avoid writing serialization/deserialization code, you still have to maintain pof-config.xml (and it may contain hundreds of classes). Could we just generate pof-config.xml at run time, on demand? Well, all nodes should use the same pof-config.xml (or at least the same class-to-ID mapping), so this is problematic. But wait, we already have Coherence, which can easily keep such a mapping in sync across all nodes! That is the idea behind AutoPofSerializer.
AutoPofSerializer manages a shared class-to-ID mapping (using a Coherence cache) and automatically adds new classes to that mapping. It also automatically falls back to ReflectionPofSerializer if an object does not implement the PortableObject interface.
So now you just need to configure AutoPofSerializer for a cache and it will just work without any need for pof-config.xml (actually, it may look at pof-config.xml, but if a class is not there, it will be added automatically).
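In Coherence 3.6 a serializer is configured per service in the cache configuration, so a setup along these lines might be expected (a hypothetical fragment; the package name of AutoPofSerializer is assumed and should be taken from the GridKit distribution):

```xml
<distributed-scheme>
  <scheme-name>auto-pof-scheme</scheme-name>
  <service-name>DistributedCache</service-name>
  <serializer>
    <!-- placeholder package name; check the GridKit jar -->
    <class-name>org.gridkit.coherence.utils.pof.AutoPofSerializer</class-name>
  </serializer>
  <backing-map-scheme>
    <local-scheme/>
  </backing-map-scheme>
  <autostart>true</autostart>
</distributed-scheme>
```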
AutoPofSerializer is still experimental and I would probably not recommend using it in production. But it is very useful for development, prototyping and evaluating POF performance. You can verify that your code solves the problem, and only then do the boilerplate work of configuring POF.

See more information on the ReflectionPofSerializer page at GridKit.