HellOnline

Icon

Eran's blog

Writing a Lucene Based Search Engine (pt. 3)

Part 3: Implementing parallel remote search

Initially I’d hoped to make use of Lucene’s support for parallel and remote (RMI) based search. With promising class names like RemoteSearchable and ParallelMultiSearcher things were looking well and, indeed, my first attempts at implementing remote search seemed to work well enough.

Search queries were sent over RPC and Hits objects (Lucene’s container for searchs results) were sent back. I expanded on this theme by using Lucene’s ParallelMultiSearcher class which uses several threads to query several Searchables in parallel. Pretty soon, however, I came across two problems when testing this setup:

  1. This setup is not very robust. When a search slave fails, it is pretty much impossible to get ParallelMultiSearcher to let you know which slave failed. This makes recovery difficult or at least inefficient.
  2. Hits objects use caching to improve performance. This means that one must maintain a connection to an open IndexReader if one wants to access the contents of a Hits object. This can be very wasteful over RPC and tends to break very easily especially in a system which has to reload indexes often.

In my solution I tried to address both these problems and in addition make SearchSlave easier to control and monitor.

Searching

Step 1: I defined a new interface for remote search, dubbed Controllable. This interface mimics Lucene’s Searchable interface but adds a few additional methods. Both Controllable and Searchable extend java’s Remote interface (the interface that allows remote access over RPC) but Controllable adds a few methods lacking in Searchable that make remote control of search slaves easier.

  • Methods like ping() and status() allow for remote monitoring of slaves. These methods are usually accessed by the Servlet to verify the status of remote search slaves.
  • Methods like close() and reload() allow for remote control of search slaves. These are used by the new class ControlSlave to shut down slaves or to have a slave reload its index.
  • The rest of the methods are just copied over from Lucene’s Searchable, meant to be a minimal set of functions necessary for remote search.

Step 2: I created a modified version of ParallelMultiSearcher, called PRMSearcher (for Parallel, Remote, Multisearch) that is aware of the need to monitor remote search slaves and exposes the collection of remote searchers to its owner. This allows for monitoring individual slaves and recovery of an individual slave in case one should fail.

Step 3: I created the SimpleHit class and its corresponding collection SimpleHits. This is a version of Hits that does not employ caching. Yes, this probably means a hit on my performance as all hits must be read from the index but it also saves access over the network to get hit contents and makes the whole process less prone to failure. It also allows me to reload the IndexReader as often as I want without worrying about open Hits objects breaking.

Indexing

Making search parallel took some work on the indexing side as well. I opted to go with a partitioned design where the index is partitioned into several non-overlapping partitions. This allows me to run several search slaves in parallel on different machines and should, in theory at least, allow for close to linear scaling in size of index with constant performance. Another advantage of this solution is its relative simplicity. The next step up from that would improve robustness by having some overlap between partitions so that the entire index is still available if one search slave happens to go down. This solution, however, would require more complex handling of the incoming search results which is already a possible bottle-neck. For now, simple is good.

The IndexMaster in the initial design ran as part of the web application. Since the application is designed to run on several servers, some configuration control was needed to make sure that only one instance of the application would ever write to the index. This instance was dubbed the Index Master. Communication between the application and the Index Master is done by creating Search Jobs.

Search Jobs are simple database entries that let the Index Master know that new content is ready to be indexed. Later those same entries can be used as a log to track performance of the indexing process. The Index Master periodically checks for new search jobs which it then performs as a batch. Batch indexing can be a huge gain in performance on Lucene. Based on the afore-mentioned advice from Doug Cutting the Index Master performs a checkpoint on the index every so often, causing the index to be copied to a new directory from which the various search slaves can remote copy the relevant partition of the index.

Partitioning is done in a very simple manner. An IndexBalancer object is both a collection of indexes and a mechanism for deciding the index partition into which a specific piece of information should go. I started out with a random balancer which worked pretty well but soon switched to a more deterministic approach based on the modulus of a hash of the object’s ID. This makes accessing objects by ID more efficient, a necessary operation when trying to update or delete an object in the index.

One of the problems in this design is the multiplicity of asynchronous processes. By decoupling the indexing process from the main application, it becomes easier to control and recover from a failure but it is also much harder to debug as some processes are time dependent and hard to predict. I ended up creating a few method calls that allow direct access into the bowels of the indexing process just to make testing more efficient.

Next: Rethinking indexing.

Advertisements

Filed under: Projects, Search

11 Responses - Comments are closed.

  1. Benjamin Yu says:

    This series of post had exactly the information I had been searching for. I want to setup a scalable lucene search solution for a project I’m working on.

    I just want to ask if the code be made available or committed into the lucene project? I’d like to know whether or not I need to start an implementation based on your posts, or if I can wait expectantly for your publication. 🙂

    Thanks,
    Ben

  2. limbo says:

    Hi Ben,

    Thanks for your interest. Unfortunately, at this point I cannot promise that code will be available as this is not my own project. If I see that there is interest I will try to get permission to open-source this code and publish it for all to mock 😉

  3. Wenjie says:

    Hello,
    I was very excited when I came across you blog the other day, because I am doing a similar project.

    One question for you. You said “Search queries were sent over RPC and Hits objects (Lucene’s container for searchs results) were sent back.” How can you send back Hits objects, which are not Serializable, over RMI? When I tried to do it, I got errors “unmarshaledException”.

    Thanks
    Wenjie

  4. limbo says:

    @Wenjie,

    Look at the test source files for a sample implementation, TestRemoteSearchable is a good starting point.

    I guess that the objects actually sent over the wire are TopDocs, not Hits, as the remoting is done at the lower layers of Lucene’s Searchable interface.

  5. Wenjie says:

    Thanks Limbo! I figured it out.

  6. James says:

    Hi,

    This sounds like great work, and addresses issues that we are getting ready to have to tackle. I was wondering if you ever got permission to open-source this code. It would be really cool if it could be included in Lucene 1.9. Barring that, maybe you could email me and we could talk about what we would need to do to use this code — I hate to reinvent the wheel.

    Thanks,
    James

  7. […] teup on parallel searching in Lucene Eran has written a very thoughtful blog on Writing a Lucene based Search Engine where he discusses some of the pitfalls of doing remote search using Lucene […]

  8. Kunthar says:

    You’re tallking about abilities, mind storm etc.
    What about code? Why not you start to write all Lucene staff from the zero point?
    Those all smarty junks just a blah blahs to me unless i see some good project on sourceforge.

  9. limbo says:

    @Kunthar,

    no idea what you’re talking about bud. sorry.

  10. dana says:

    Hi

    Its been a few years, I would like to ask if things have changed that you can publish, or at least let others take a peek remote searching code. Has it been updated for 2.4.0? I am finding it hard to run searching in a shared WAS environment with memory contraints so I would like to use remote searching.. This would be perfect fit if things have changed or you can elaborate further on the code..

    Thanks,

    I just hate reinventing the wheel 🙂

    /d

  11. limbo says:

    Sorry dana, I’ve not kept any of that code. Check out Apache Solr, they may have solved the problems you’re dealing with.

    http://lucene.apache.org/solr/

    Eran.

%d bloggers like this: