Monday, November 21, 2011

Using co-segments to dynamically adapt Lucene for frequent queries

Apache Lucene is magnificent library for working with inverted indexes. While full text search is its primary use case, you may often find application of Lucene in other areas too. In particular, in data mining, it can be useful for categorizing large volumes of textual information (think about finding trends in news feeds). Data mining is traditional domain for map/reduce style of distributed processing, but it you want to mine data interactively,  changing rules on fly, map/reduce is not an option. Inverted indexes are more suited for interactive use case and Lucene is excellent in that role.

Query complexity problem

Typically "patterns" to be mined (e.g. positive words about certain movie or description matching certain camera model) are represented by fairly complex Lucene queries having large numbers of subqueries.  Another problem is having terms with low selectivity in index which may be participate as a part of complex pattern  (for full text search you usually can just drop them out of index).
These two factors make mining job very hungry for CPU.

Reducing CPU load by adding "synthetic" terms and query rewriting

Different patterns usually share a lot of common subqueries. These subqueries themselves may be quite computation expensive (e.g. phrase and span queries). Instead of evaluating each subquery each time for any pattern query, it is possible to precalculate them. First, "synthetic" terms, are added to index (I call them hints for short). These hints mark documents matching particular subquery. Then all queries to be executed should be rewritten with "synthetic" terms (hints) to let Lucene use that additional information.
Such optimization may increase query throughput per CPU by substantial factor. In my practice, for complex queries, CPU utilization can be reduced down by 3-10 times.
Though applying this technique not so easy. There few problems to be solved:
  • matching documents against set of queries to generate synthetic terms,
  • adding synthetic terms to existing index,
  •  rewrite queries to benefit from precalculated information.
All of them are tricky. Normally to add new attribute to document you have to rebuild whole Lucene index which is heavy and resource intensive task. This is a very bad solution. It is critical to add new subqueries interactively along with users exploring new patterns. Using this technique does not make sense if additional synthetic terms cannot be added to index relatively cheaply on the fly.

Generating co-segment

Straight forward approach to generate co-segment for a number of subquery terms would be - testing these subquery for each document in index. Lucene has MemoryIndex which is quite handy for that. It allows to create single-document index in memory and test any queries against it. So far so good, but then you realize that you have to load and parse each of documents, it turns out to be prohibitively slow (just 2-5 times faster, than rebuild whole index).
There is a much, much better way. We may query main index, get all document ID matching subquery and encode this information into co-segement.
Querying index is blazing fast (compared to scanning  through documents).

Adding co-segments to existing index

Lucene index organized into segments, which are stored in file system. Each specific group of files (segment) is an inverted index for subset of documents. Whole index may be divided into several segments. Inverted - means data in file is sorted by terms. This means that if all our synthetic term would be "greater" than normal terms, we could just append them at end of existing index file (instead of rebuilding whole file).  After second though, we do not need to do anything with real files at all. Making Lucene think that there are few more terms in segment is enough (read - implement own Directory merging data behind scenes making Lucene to "see" segment + all co-segments as single file).

Rewriting queries

Once we have "synthetic" terms or hints in index, we have to rewrite query somehow. Queries can be very sophisticated (and they are, otherwise we wouldn't need all these tricks).
Simple idea is just to add hint clause and additional top level MUST condition. E.g. query Q will be written to H & Q, there H is condition for "synthetic" terms. This way we do not need to care of internal Q of and ensure correctness of rewrite.
Unfortunately, using BooleanQuery to join original query with hint clause will not produce any speed improvements. Actually the way queries are executed in Lucene is making whole idea of "hinting" very non-trivial for implementation.
Solution was writing own HintAwareQuery to wrap around "pattern" query (vanilla Lucene query). This wrapper do all dirty work. First it analyzes existing hints and chooses ones to be used. Second, it optimizes execution of nested query making parts of search index, masked-out by chosen set of hints, "invisible" to query execution.


So, what was achieved?
  • Hints can be added/removed to "live" index in matter of seconds,
  • Transparent from application- just wrap everything into  HintAwareQuery,
  • Order of magnitude speed up for complex queries.
Thanks to Lucene flexibility, which made such optimization ever possible!

More ideas

So far, hints are created manually using requests statistics from application search service. Interesting idea would be to automate this process, let search service itself profile requests and create hints using own statistics.
Another idea is using index masking technique for optimizing queries without hints - e.g. if MUST clauses of top level BooleanQuery could be used instead of "synthetic" hints if they are simple enough (e.g. as simple as TermQuery). Such trick could bring comparable bust for vanilla  BooleanQuery without any need of precalculation at all.


  1. it sounds like filter cache in Solr, or new CachingWrapperFilter(new QueryWrapperFilter(query)) in Lucene.

  2. Described technique is a way to workaround lack of query execution planning in Lucene.
    Lucene does not differentiate "fast" (term querues) and "slow" (e.g. span queries) parts of composite query and may start testing "slow" query on document which are not going to pass "anyway".

    Imagine you want to apply span query to range of documents selected by range term query. There is a high chance that Lucene will try to apply span to whole index content.

    Technique in this article, hides away part of index which cannot produce matches, thus preventing Lucene from doing useless work.

    Only resemblance with Solr query cache are "cache" and "filter" words.