Apache Lucene is a magnificent library for working with
inverted indexes. While full-text search is its primary use case, you may often
find applications for Lucene in other areas too. In particular, in data mining,
it can be useful for categorizing large volumes of textual information (think
of finding trends in news feeds). Data mining is a traditional domain for the
map/reduce style of distributed processing, but if you want to mine data
interactively, changing the rules on the fly,
map/reduce is not an option. Inverted indexes are better suited for the
interactive use case, and Lucene is excellent in that role.
Query complexity problem
Typically "patterns" to be mined (e.g. positive
words about certain movie or description matching certain camera model) are
represented by fairly complex Lucene queries having large numbers of
subqueries. Another problem is having terms
with low selectivity in index which may be participate as a part of complex pattern
(for full text search you usually can
just drop them out of index).
These two factors make mining job very hungry for CPU.
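For a sense of scale, here is a toy pattern query of the kind described (the
field names and terms are made up, and a real pattern query would combine many
more subqueries; note that in Lucene 9+ the span queries live in
org.apache.lucene.queries.spans instead):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// "battery" within 3 positions of "life", in order
SpanQuery batteryLife = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("body", "battery")),
        new SpanTermQuery(new Term("body", "life"))
}, 3, true);

Query pattern = new BooleanQuery.Builder()
        .add(new PhraseQuery("body", "image", "quality"), Occur.MUST)
        .add(batteryLife, Occur.SHOULD)
        .add(new TermQuery(new Term("body", "excellent")), Occur.SHOULD)
        .build();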
Reducing CPU load by adding "synthetic" terms and query rewriting
Different patterns usually share a lot of common subqueries.
These subqueries may themselves be quite expensive to compute (e.g. phrase and
span queries). Instead of evaluating each subquery again for every pattern
query, it is possible to precalculate them. First, "synthetic" terms
are added to the index (I call them hints
for short). These hints mark the
documents matching a particular subquery. Then all queries to be executed are
rewritten using the "synthetic" terms (hints), to let Lucene use that additional information.
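To make the idea concrete, here is a sketch (the field and term names are made
up): an expensive phrase subquery is evaluated once, the matching documents are
marked with a synthetic term, and from then on the phrase can be replaced by a
cheap term lookup.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// An expensive subquery that many patterns share...
Query expensive = new PhraseQuery("body", "picture", "quality");
// ...is precalculated once; every document matching it is marked with a
// synthetic term ("hint"), so at query time the phrase becomes a cheap
// single-term lookup.
Query hint = new TermQuery(new Term("hint", "q42"));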
Such an optimization may increase query throughput per CPU by a
substantial factor. In my practice, for complex queries, CPU utilization can be
reduced by a factor of 3-10.
Applying this technique is not so easy, though. There are a few
problems to be solved:
- matching documents against a set of subqueries to generate synthetic terms,
- adding synthetic terms to an existing index,
- rewriting queries to benefit from the precalculated information.
All of them are tricky. Normally, to add a new attribute to a
document you have to rebuild the whole Lucene index, which is a heavy,
resource-intensive task and a very bad solution here. It is critical to be able
to add new subqueries interactively, as users explore new patterns. Using this
technique does not make sense unless additional synthetic terms can be added to
the index relatively cheaply, on the fly.
Generating co-segment
The straightforward approach to generating a co-segment (the extra chunk of
index data that holds the synthetic terms for a segment) would be to test each
subquery against every document in the index. Lucene has MemoryIndex, which is
quite handy for that: it lets you create a single-document index in memory and
test any query against it. So far so good, but then you realize that you have
to load and parse every document, and it turns out to be prohibitively slow
(just 2-5 times faster than rebuilding the whole index).
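A minimal sketch of this per-document approach (documentText and subquery are
assumed to come from the surrounding code; the field name is made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;

// Build a throwaway single-document index and probe it with the subquery.
MemoryIndex oneDoc = new MemoryIndex();
oneDoc.addField("body", documentText, new StandardAnalyzer());
// MemoryIndex.search returns a score; > 0 means the document matches.
boolean matches = oneDoc.search(subquery) > 0.0f;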
There is a much, much better way. We can query the main index,
get the IDs of all documents matching the subquery, and encode this information
into the co-segment.
Querying the index is blazing fast compared to scanning through the documents.
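A sketch of collecting the matching document IDs into a bitset, which then
becomes, in effect, the posting list of the future synthetic term (collector
API as in recent Lucene versions; searcher and subquery are assumed):

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;
import org.apache.lucene.util.FixedBitSet;

// One bit per document in the index; set bits mark subquery matches.
FixedBitSet matches = new FixedBitSet(searcher.getIndexReader().maxDoc());
searcher.search(subquery, new SimpleCollector() {
    private int docBase;

    @Override
    protected void doSetNextReader(LeafReaderContext context) {
        docBase = context.docBase; // translate per-segment IDs to global ones
    }

    @Override
    public void collect(int doc) {
        matches.set(docBase + doc);
    }

    @Override
    public ScoreMode scoreMode() {
        return ScoreMode.COMPLETE_NO_SCORES; // matches only, no scoring needed
    }
});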
Adding co-segments to existing index
A Lucene index is organized into segments, which are stored in the
file system. Each group of files (a segment) is an inverted index for a
subset of the documents; the whole index may be divided into several segments.
"Inverted" means the data in a file is sorted by term. So if all our synthetic
terms were "greater" than the normal terms, we could just append them
at the end of the existing index file (instead of rebuilding the whole file).
On second thought, we do not need to do
anything with the real files at all. Making Lucene think that there are a few
more terms in a segment is enough: implement your own Directory that merges the
data behind the scenes, making Lucene "see" a segment plus all its co-segments
as a single file.
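A structural skeleton of such a Directory wrapper (hypothetical:
CoSegmentDirectory, coSegmentFor and spliceInputs are illustrative names, not
Lucene API or the original implementation, and the actual splicing logic, the
hard part, is only stubbed here):

import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

public class CoSegmentDirectory extends FilterDirectory {

    public CoSegmentDirectory(Directory base) {
        super(base);
    }

    @Override
    public IndexInput openInput(String name, IOContext context) throws IOException {
        IndexInput base = super.openInput(name, context);
        String coSegment = coSegmentFor(name);
        if (coSegment == null) {
            return base; // no synthetic terms for this file
        }
        // Present "segment file + co-segment file" as one logical file, so the
        // term dictionary appears to already contain the synthetic terms.
        return spliceInputs(base, in.openInput(coSegment, context));
    }

    private String coSegmentFor(String name) {
        return null; // placeholder: map a segment file to its co-segment, if any
    }

    private IndexInput spliceInputs(IndexInput base, IndexInput coSegment) {
        // Placeholder: an IndexInput view spanning both byte ranges; lengths
        // and file pointers must account for the appended co-segment data.
        return base;
    }
}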
Rewriting queries
Once we have "synthetic" terms, or hints, in the index, we have to rewrite the
queries somehow. Queries can be very sophisticated (and they are, otherwise we
wouldn't need all these tricks).
A simple idea is just to add the hint clause as an additional top-level MUST
condition, i.e. a query Q is rewritten as H AND Q, where H is the condition
over the "synthetic" terms. This way we do not need to care about the internals
of Q, and the correctness of the rewrite is easy to ensure.
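The naive rewrite would look like this (the hint field and term are made up for
the example, and originalQuery stands for Q):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// H AND Q: the hint term plus the original pattern query, both required.
Query hint = new TermQuery(new Term("hint", "q42"));
Query rewritten = new BooleanQuery.Builder()
        .add(hint, Occur.MUST)
        .add(originalQuery, Occur.MUST)
        .build();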
Unfortunately, joining the original query with the hint clause via a
BooleanQuery, as above, will not produce any speed improvement. The way queries
are actually executed in Lucene makes the whole idea of
"hinting" very non-trivial to implement.
The solution was to write my own HintAwareQuery
that wraps the "pattern" query (a vanilla Lucene query). This wrapper
does all the dirty work. First, it analyzes the existing hints and chooses the
ones to be used. Second, it optimizes execution of the nested query by making
the parts of the search index that are masked out by the chosen set of hints
"invisible" to query execution.
Conclusion
So, what was achieved?
- Hints can be added to or removed from a "live" index in a matter of seconds,
- The optimization is transparent to the application: just wrap every query
into a HintAwareQuery,
- An order-of-magnitude speedup for complex queries.
Thanks to Lucene's flexibility, which made such an optimization
possible at all!
More ideas
So far, hints are created manually, using request statistics
from the application's search service. An interesting idea would be to automate
this process: let the search service itself profile requests and create hints
based on its own statistics.
Another idea is to use the index-masking technique to optimize
queries without hints: the MUST clauses of a top-level BooleanQuery could be
used instead of "synthetic" hints if they are simple enough (e.g. as
simple as a TermQuery). Such a trick could
bring a comparable boost for vanilla BooleanQuery, without any
need for precalculation at all.