Monday, November 21, 2011

Using co-segments to dynamically adapt Lucene for frequent queries

Apache Lucene is a magnificent library for working with inverted indexes. While full text search is its primary use case, you may often find applications for Lucene in other areas too. In particular, in data mining, it can be useful for categorizing large volumes of textual information (think of finding trends in news feeds). Data mining is a traditional domain for map/reduce style distributed processing, but if you want to mine data interactively, changing rules on the fly, map/reduce is not an option. Inverted indexes are much better suited for the interactive use case, and Lucene is excellent in that role.

Query complexity problem

Typically "patterns" to be mined (e.g. positive words about certain movie or description matching certain camera model) are represented by fairly complex Lucene queries having large numbers of subqueries.  Another problem is having terms with low selectivity in index which may be participate as a part of complex pattern  (for full text search you usually can just drop them out of index).
These two factors make mining job very hungry for CPU.

Reducing CPU load by adding "synthetic" terms and query rewriting

Different patterns usually share a lot of common subqueries. These subqueries may themselves be quite computationally expensive (e.g. phrase and span queries). Instead of evaluating each subquery every time a pattern query runs, it is possible to precalculate them. First, "synthetic" terms are added to the index (I call them hints for short). These hints mark the documents matching a particular subquery. Then every query to be executed is rewritten with the "synthetic" terms (hints) so that Lucene can use that additional information.
Such an optimization may increase query throughput per CPU by a substantial factor. In my practice, for complex queries, CPU utilization can be reduced by 3-10 times.
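To make the idea concrete, here is a minimal sketch (the field and term names such as "__hint" are made up, and the Lucene 3.x query API is assumed). As discussed in the query rewriting section below, this naive BooleanQuery form of the rewrite is not what ultimately delivers the speedup, but it shows what a hint is and what the rewritten query H & Q looks like.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class HintRewriteSketch {

    // An expensive subquery shared by many patterns, e.g. the phrase "great camera".
    // This is what gets precalculated and marked in the index by a synthetic term.
    static Query expensiveSubquery() {
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("text", "great"));
        phrase.add(new Term("text", "camera"));
        return phrase;
    }

    // Conceptual rewrite Q -> H & Q. The hint term ("__hint"/"great_camera") marks every
    // document matching the precalculated subquery; adding it as a MUST clause is valid
    // when that subquery is a mandatory part of the pattern query.
    static Query rewriteWithHint(Query patternQuery) {
        TermQuery hint = new TermQuery(new Term("__hint", "great_camera"));
        BooleanQuery rewritten = new BooleanQuery();
        rewritten.add(hint, Occur.MUST);         // cheap: a single posting list lookup
        rewritten.add(patternQuery, Occur.MUST); // original query kept for correctness
        return rewritten;
    }
}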
Applying this technique is not so easy, though. There are a few problems to be solved:
  • matching documents against a set of subqueries to generate synthetic terms,
  • adding synthetic terms to an existing index,
  • rewriting queries to benefit from the precalculated information.
All of them are tricky. Normally, to add a new attribute to documents you have to rebuild the whole Lucene index, which is a heavy and resource-intensive task. This is a very bad option here: it is critical to be able to add new subqueries interactively, as users explore new patterns. The technique does not make sense unless additional synthetic terms can be added to the index relatively cheaply, on the fly.

Generating co-segment

A straightforward approach to generating a co-segment for a number of subquery terms would be to test these subqueries against each document in the index. Lucene has MemoryIndex, which is quite handy for that: it allows you to create a single-document index in memory and test any query against it. So far so good, but then you realize that you have to load and parse every document, and it turns out to be prohibitively slow (just 2-5 times faster than rebuilding the whole index).
There is a much, much better way. We can query the main index, get the IDs of all documents matching the subquery and encode this information into a co-segment.
Querying the index is blazing fast (compared to scanning through documents).
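A minimal sketch of this approach, assuming the Lucene 3.x Collector API (the HintCollector class is mine, not part of Lucene): run the subquery against the main index and record the matching document IDs in a bit set, which is then encoded into the co-segment.

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.util.OpenBitSet;

// Collects the IDs of all documents matching a subquery into a bit set;
// the bit set later becomes the posting list of the synthetic term (hint).
public class HintCollector extends Collector {

    private final OpenBitSet bits;
    private int docBase;

    public HintCollector(int maxDoc) {
        this.bits = new OpenBitSet(maxDoc);
    }

    public void setScorer(Scorer scorer) {
        // scores are irrelevant, only matching doc IDs are needed
    }

    public void collect(int doc) {
        bits.set(docBase + doc);
    }

    public void setNextReader(IndexReader reader, int docBase) {
        this.docBase = docBase;
    }

    public boolean acceptsDocsOutOfOrder() {
        return true;
    }

    public OpenBitSet getMatchingDocs() {
        return bits;
    }

    // Usage: the resulting bit set is what gets encoded into the co-segment.
    public static OpenBitSet matchingDocs(IndexSearcher searcher, Query subquery) throws IOException {
        HintCollector collector = new HintCollector(searcher.maxDoc());
        searcher.search(subquery, collector);
        return collector.getMatchingDocs();
    }
}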

Adding co-segments to existing index

A Lucene index is organized into segments, which are stored in the file system. Each segment (a specific group of files) is an inverted index for a subset of documents, and the whole index may be divided into several segments. Inverted means that data in the file is sorted by term. So if all our synthetic terms sort "greater" than the normal terms, we could just append them at the end of the existing index file (instead of rebuilding the whole file). On second thought, we do not need to do anything with the real files at all. Making Lucene think that there are a few more terms in the segment is enough (read: implement your own Directory that merges data behind the scenes, making Lucene "see" the segment plus all its co-segments as a single file).

Rewriting queries

Once we have "synthetic" terms or hints in index, we have to rewrite query somehow. Queries can be very sophisticated (and they are, otherwise we wouldn't need all these tricks).
Simple idea is just to add hint clause and additional top level MUST condition. E.g. query Q will be written to H & Q, there H is condition for "synthetic" terms. This way we do not need to care of internal Q of and ensure correctness of rewrite.
Unfortunately, using BooleanQuery to join original query with hint clause will not produce any speed improvements. Actually the way queries are executed in Lucene is making whole idea of "hinting" very non-trivial for implementation.
Solution was writing own HintAwareQuery to wrap around "pattern" query (vanilla Lucene query). This wrapper do all dirty work. First it analyzes existing hints and chooses ones to be used. Second, it optimizes execution of nested query making parts of search index, masked-out by chosen set of hints, "invisible" to query execution.
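From the application's point of view, usage looks roughly like this (a sketch; HintAwareQuery is the wrapper described above, and its one-argument constructor here is an assumption):

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class HintAwareSearchExample {

    // The application builds its ordinary (possibly huge) pattern query as usual
    // and only wraps the top-level query; the wrapper picks applicable hints and
    // masks out parts of the index on its own.
    static TopDocs searchWithHints(IndexSearcher searcher, Query patternQuery) throws IOException {
        Query optimized = new HintAwareQuery(patternQuery); // one-argument constructor assumed
        return searcher.search(optimized, 100);
    }
}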

Conclusion

So, what was achieved?
  • Hints can be added to / removed from a "live" index in a matter of seconds,
  • It is transparent to the application - just wrap everything into HintAwareQuery,
  • An order of magnitude speedup for complex queries.
Thanks to Lucene's flexibility, which made such an optimization possible at all!

More ideas

So far, hints are created manually, using request statistics from the application's search service. An interesting idea would be to automate this process: let the search service itself profile requests and create hints based on its own statistics.
Another idea is to use the index masking technique to optimize queries without hints - e.g. MUST clauses of a top-level BooleanQuery could be used instead of "synthetic" hints if they are simple enough (e.g. as simple as a TermQuery). Such a trick could bring a comparable boost for a vanilla BooleanQuery without any need for precalculation at all.

Friday, November 4, 2011

Coherence SIG: Advanced usage of indexes

Another presentation of mine for the Coherence SIG, this time from London.
The main theme of the presentation was the internal mechanics of indexes in Coherence: how indexes are stored, how queries are executed, how to create custom filters and indexes - all of these topics were covered.

Wednesday, November 2, 2011

Java GC, HotSpot's CMS promotion buffers

Recently, I unfairly blamed promotion local allocation buffers (PLABs) for fragmentation of the old space when using the concurrent mark sweep garbage collector. I was very wrong. In this article, I'm going to explain in detail how PLABs really work.

PLABs

PLAB stands for promotion local allocation buffer. PLABs are used during young collection. Young collection in CMS (and in all other garbage collectors in the HotSpot JVM) is a stop-the-world copying collection. CMS may use multiple threads for young collection, and each of these threads may need to allocate space for objects being copied either into the survivor space or into the old space. PLABs are required to avoid contention between threads on the shared data structures managing free memory. Each thread has one PLAB for the survivor space and one for the old space. Free memory in the survivor space is contiguous, and so are survivor PLABs, which are simply contiguous blocks. On the other hand, free memory in the old space (under the CMS collector) is fragmented and managed via a sophisticated dictionary of free chunks ...

Free list space (FLS)

The CMS collector cannot compact the old space (actually it can, but compaction involves a long stop-the-world pause, often referred to as a GC freeze). The memory manager operates with lists of free chunks to manage the fragmented free space. As a countermeasure against fragmentation, chunks of free space are grouped by size. If available, a free chunk of the exact required size will be used to serve an allocation request. If chunks of a given size are exhausted, the memory manager will split a larger chunk into several smaller ones to satisfy demand. Consecutive free chunks can also be coalesced to create larger ones (coalescing is done along with sweeping during the concurrent GC cycle). This splitting/coalescing logic is controlled by complex heuristics and per-size chunk demand statistics.
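As an illustration only (a toy sketch in Java, not HotSpot's actual C++ code, and with coalescing left out): free chunks are binned by size, an exact-size chunk is preferred, and a larger chunk is split when the exact bin is empty.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

// Toy model of an indexed free list: bins of free chunks keyed by chunk size (in heap words).
public class IndexedFreeListSketch {

    private final TreeMap<Integer, Deque<Long>> binsBySize = new TreeMap<Integer, Deque<Long>>();

    // Return a chunk of free space to the bin of its size.
    public void free(long address, int size) {
        Deque<Long> bin = binsBySize.get(size);
        if (bin == null) {
            bin = new ArrayDeque<Long>();
            binsBySize.put(size, bin);
        }
        bin.addLast(address);
    }

    // Allocate a chunk: exact fit first, otherwise split the smallest larger chunk.
    public long allocate(int size) {
        Deque<Long> exact = binsBySize.get(size);
        if (exact != null && !exact.isEmpty()) {
            long address = exact.removeFirst();
            if (exact.isEmpty()) {
                binsBySize.remove(size);
            }
            return address;                           // exact fit, no extra fragmentation
        }
        Map.Entry<Integer, Deque<Long>> larger = binsBySize.ceilingEntry(size + 1);
        if (larger == null) {
            throw new OutOfMemoryError("no free chunk large enough");
        }
        long address = larger.getValue().removeFirst();
        if (larger.getValue().isEmpty()) {
            binsBySize.remove(larger.getKey());
        }
        free(address + size, larger.getKey() - size); // leftover goes back as a smaller chunk
        return address;
    }
}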

Old space PLABs

Naturally, old space PLABs mimic the structure of the indexed free list space. Each thread preallocates a certain number of chunks of each size below 257 heap words (larger chunks are allocated from the global space). The number of chunks of each size to be preallocated is controlled by statistics. The following JVM flag enables verbose reporting of old space PLAB sizing (too verbose for production, though).
-XX:+PrintOldPLAB
At the beginning of each young collection, we will see the following lines in the GC log:
6.347: [ParNew ...
...
0[10]: 722/5239/897
0[12]: 846/5922/987
0[14]: 666/5100/850
...
1[12]: 229/3296/987
1[14]: 2/2621/850
1[16]: 69/1812/564
1[18]: 247/1160/290
...
[10]: 905
[12]: 1002
[14]: 865
[16]: 567
...
The first lines are statistics from each scavenger (young collection) thread, in the following format:
<tid>[<chunk size>]: <num_retire>/<num_blocks>/<blocks_to_claim>
tid - GC thread ID,
chunk size - chunk size in heap words,
num_retire - number of free chunks in PLAB at the end of young GC,
num_blocks - number of chunks allocated from FLS to PLAB during young GC,
blocks_to_claim - desired number of blocks to refill PLAB.
The next few lines show the estimated number of chunks (per size) to be preallocated (per GC thread) at the beginning of the next young collection.
[<chunk size>]: <blocks_to_claim>

Calculating desired blocks to claim

The initial number of blocks (chunks) per chunk size is configured via the -XX:CMSParPromoteBlocksToClaim=<n> JVM command line option (-XX:OldPLABSize=<n> is an alias for this option if the CMS GC is used). Unless old PLAB resizing is disabled via the -XX:-ResizeOldPLAB option, the desired PLAB size will be adjusted after each young GC.
The ideal desired number per chunk size is calculated by the following formula:
blocks_to_claim(ideal) = MIN(CMSOldPLABMax, MAX(CMSOldPLABMin, num_blocks / (ParallelGCThreads * CMSOldPLABNumRefills)))
but the effective value is exponentially smoothed over time:
blocks_to_claim(next) = (1 - w) * blocks_to_claim(prev) + w * blocks_to_claim(ideal)
where w is configured via -XX:OldPLABWeight (0.5 by default).
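Expressed as code (a back-of-the-envelope sketch of the two formulas above, not the actual HotSpot implementation; flag values are passed in as plain parameters):

// Computes the smoothed desired number of blocks to claim for one chunk size.
public class OldPlabSizingSketch {

    static double desiredBlocksToClaim(long numBlocks,           // chunks of this size allocated from FLS during last young GC
                                       int parallelGCThreads,    // -XX:ParallelGCThreads
                                       int cmsOldPLABNumRefills, // -XX:CMSOldPLABNumRefills
                                       int cmsOldPLABMin,        // -XX:CMSOldPLABMin
                                       int cmsOldPLABMax,        // -XX:CMSOldPLABMax
                                       double oldPLABWeight,     // -XX:OldPLABWeight, 0.5 by default
                                       double previous) {        // blocks_to_claim from the previous young GC
        double ideal = Math.min(cmsOldPLABMax,
                       Math.max(cmsOldPLABMin,
                                numBlocks / (double) (parallelGCThreads * cmsOldPLABNumRefills)));
        // exponential smoothing of the ideal value over successive young collections
        return (1 - oldPLABWeight) * previous + oldPLABWeight * ideal;
    }
}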

On-the-fly PLAB resizing

During young collection, if the chunk list of a certain size gets exhausted, the thread will refill it from the global free space pool (allocating the same number of chunks as at the beginning of the collection). Normally a thread will have to refill a chunk list a few times during collection (-XX:CMSOldPLABNumRefills sets the desired number of refills). However, if the initial estimate was too small, the GC thread will refill its chunk lists too often (a refill requires the global memory manager lock, so it may be slow). If on-the-fly PLAB resizing is enabled, the JVM will try to detect such conditions and resize the PLAB in the middle of a young collection.
-XX:+CMSOldPLABResizeQuicker will enable on-the-fly PLAB resizing (disabled by default).
A few more options offer additional tuning:
-XX:CMSOldPLABToleranceFactor=4 tolerance of the phase-change detector for on-the-fly PLAB resizing during a scavenge.
-XX:CMSOldPLABReactivityFactor=2 gain in the feedback loop for on-the-fly PLAB resizing during a scavenge.
-XX:CMSOldPLABReactivityCeiling=10 clamping of the gain in the feedback loop for on-the-fly PLAB resizing during a scavenge.

Conclusion

I have spent some time digging through the OpenJDK code to make sure I'm getting this right now. It was educational. This article has brought up and explained a few more arcane JVM options, though I doubt I will ever use them in practice. The problem with heap fragmentation is that you have to run an application for a really long time before fragmentation manifests itself. Most of the options above require a trial-and-error path (even though -XX:+PrintOldPLAB might give you some insight into your application). It is much easier just to give the damn JVM a little more memory (hey, RAM is cheap nowadays) than to spend a day tuning arcane options.
Anyway, I hope it was as educational for you as it was for me.

See also