Apache Lucene is magnificent library for working with inverted indexes. While full text search is its primary use case, you may often find application of Lucene in other areas too. In particular, in data mining, it can be useful for categorizing large volumes of textual information (think about finding trends in news feeds). Data mining is traditional domain for map/reduce style of distributed processing, but it you want to mine data interactively, changing rules on fly, map/reduce is not an option. Inverted indexes are more suited for interactive use case and Lucene is excellent in that role.
Query complexity problem
Typically "patterns" to be mined (e.g. positive words about certain movie or description matching certain camera model) are represented by fairly complex Lucene queries having large numbers of subqueries. Another problem is having terms with low selectivity in index which may be participate as a part of complex pattern (for full text search you usually can just drop them out of index).
These two factors make mining job very hungry for CPU.
Reducing CPU load by adding "synthetic" terms and query rewriting
Different patterns usually share a lot of common subqueries. These subqueries themselves may be quite computation expensive (e.g. phrase and span queries). Instead of evaluating each subquery each time for any pattern query, it is possible to precalculate them. First, "synthetic" terms, are added to index (I call them hints for short). These hints mark documents matching particular subquery. Then all queries to be executed should be rewritten with "synthetic" terms (hints) to let Lucene use that additional information.
Such optimization may increase query throughput per CPU by substantial factor. In my practice, for complex queries, CPU utilization can be reduced down by 3-10 times.
Though applying this technique not so easy. There few problems to be solved:
- matching documents against set of queries to generate synthetic terms,
- adding synthetic terms to existing index,
- rewrite queries to benefit from precalculated information.
All of them are tricky. Normally to add new attribute to document you have to rebuild whole Lucene index – which is heavy and resource intensive task. This is a very bad solution. It is critical to add new subqueries interactively along with users exploring new patterns. Using this technique does not make sense if additional synthetic terms cannot be added to index relatively cheaply on the fly.
Straight forward approach to generate co-segment for a number of subquery terms would be - testing these subquery for each document in index. Lucene haswhich is quite handy for that. It allows to create single-document index in memory and test any queries against it. So far so good, but then you realize that you have to load and parse each of documents, it turns out to be prohibitively slow (just 2-5 times faster, than rebuild whole index).
There is a much, much better way. We may query main index, get all document ID matching subquery and encode this information into co-segement.
Querying index is blazing fast (compared to scanning through documents).
Adding co-segments to existing index
Lucene index organized into segments, which are stored in file system. Each specific group of files (segment) is an inverted index for subset of documents. Whole index may be divided into several segments. Inverted - means data in file is sorted by terms. This means that if all our synthetic term would be "greater" than normal terms, we could just append them at end of existing index file (instead of rebuilding whole file). After second though, we do not need to do anything with real files at all. Making Lucene think that there are few more terms in segment is enough (read - implement own merging data behind scenes making Lucene to "see" segment + all co-segments as single file).
Once we have "synthetic" terms or hints in index, we have to rewrite query somehow. Queries can be very sophisticated (and they are, otherwise we wouldn't need all these tricks).
Simple idea is just to add hint clause and additional top level MUST condition. E.g. query Q will be written to H & Q, there H is condition for "synthetic" terms. This way we do not need to care of internal Q of and ensure correctness of rewrite.
Unfortunately, usingto join original query with hint clause will not produce any speed improvements. Actually the way queries are executed in Lucene is making whole idea of "hinting" very non-trivial for implementation.
Solution was writing ownto wrap around "pattern" query (vanilla Lucene query). This wrapper do all dirty work. First it analyzes existing hints and chooses ones to be used. Second, it optimizes execution of nested query making parts of search index, masked-out by chosen set of hints, "invisible" to query execution.
So, what was achieved?
- Hints can be added/removed to "live" index in matter of seconds,
- Transparent from application- just wrap everything into ,
- Order of magnitude speed up for complex queries.
Thanks to Lucene flexibility, which made such optimization ever possible!
So far, hints are created manually using requests statistics from application search service. Interesting idea would be to automate this process, let search service itself profile requests and create hints using own statistics.
Another idea is using index masking technique for optimizing queries without hints - e.g. if MUST clauses of top level without any need of precalculation at all.could be used instead of "synthetic" hints if they are simple enough (e.g. as simple as ). Such trick could bring comparable bust for vanilla