Eran's blog

Writing a Lucene Based Search Engine (pt. 2)

Part 2: Architecture and Design

There are a couple of concise yet very helpful posts by Doug Cutting, the primary developer of Lucene and Nutch, that guided me towards my chosen design. Also helpful was this article by Otis Gospodnetic, a Lucene developer and co-author of “Lucene in Action.”

Based on these posts, additional research, and some discussion with Ryan, I came up with the following initial design. Note that at this time, indexing the full content of a URL was outside the scope of the project; instead we indexed a short description, which made the indexing process much simpler but (as it turned out) severely hurt the search process. An improved design will follow; for now, here’s what I had:

  1. Indices are partitioned, allowing us to parallelize both indexing and searching.
  2. Index search is done by remote Search Slaves.
    1. SearchSlaves are the only ones to read and search a Lucene index.
    2. Communication between master and slave is done over RMI.
  3. Indexing is done by the IndexMaster.
    1. IndexMaster is the only one to modify a Lucene index.
    2. The application creates Index Jobs to generate work for the IndexMaster.
  4. Synchronization is done using a combination of Java based timers, cron jobs and shell scripts.
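The master/slave contract from point 2 can be captured in a single RMI remote interface. A minimal sketch, assuming a hypothetical `RemoteSearcher` name and a query expressed as a plain string:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.List;

// Hypothetical RMI contract between the SearchMaster and each SearchSlave.
// Every remote method must declare RemoteException, per the RMI rules.
public interface RemoteSearcher extends Remote {
    // Run the finalized query against this slave's index partition
    // and return the matching result identifiers.
    List<String> search(String query) throws RemoteException;
}
```

A slave process would implement this interface and bind it in an RMI registry; the master would look up one stub per partition.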

Figure 1 contains a sketch of the logical architecture overlaid on a potential physical deployment.

Web requests are processed by the web application, which uses the SearchMaster class as an interface to the search engine. The SearchMaster performs all the query analysis and contacts the SearchSlave processes over RMI with the finalized query. Each SearchSlave searches its own index partition and sends back whatever results it found. Those results are collected by the SearchMaster and returned to the application for further processing and presentation.
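The fan-out-and-collect step can be sketched with plain executors. Here `SlaveStub` is a hypothetical local stand-in for the RMI stubs, and results are simple strings rather than Lucene hits:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the SearchMaster fan-out: one parallel call per partition,
// results merged in partition order. Names are illustrative only.
public class SearchMaster {
    // Stand-in for the RMI stub of one SearchSlave.
    public interface SlaveStub {
        List<String> search(String query) throws Exception;
    }

    private final List<SlaveStub> slaves;
    private final ExecutorService pool;

    public SearchMaster(List<SlaveStub> slaves) {
        this.slaves = slaves;
        this.pool = Executors.newFixedThreadPool(Math.max(1, slaves.size()));
    }

    // Query every partition in parallel and collect the merged results.
    public List<String> search(String query) throws Exception {
        List<Callable<List<String>>> calls = new ArrayList<>();
        for (SlaveStub s : slaves) {
            calls.add(() -> s.search(query));
        }
        List<String> merged = new ArrayList<>();
        for (Future<List<String>> f : pool.invokeAll(calls)) {
            merged.addAll(f.get()); // propagates any slave failure to the caller
        }
        return merged;
    }

    public void shutdown() { pool.shutdown(); }
}
```

A real implementation would also re-rank the merged hits by score before handing them to the application.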

Indexing is done in two parts. Requests to add information to the index are handled by the application. However, the application’s part in this process is very simple: all it does is create an IndexJob object in the database, requesting that the object be indexed by the IndexMaster when it is ready. The IndexMaster runs periodically, reading open IndexJobs from the database and funneling them to Indexer objects. Indexers are the only objects that actually modify Lucene indices. Note that since Lucene does not support an ‘update’ operation, modifying existing data requires a delete followed by an insert.
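The delete-then-insert update pattern can be sketched as follows. The `InMemoryIndex` below is a hypothetical stand-in for Lucene's writer/reader pair, just to show the shape of the operation:

```java
import java.util.*;

// Sketch of the Indexer's update path. Since there is no in-place update,
// modifying a document means deleting the old copy by its key and adding
// the new one. InMemoryIndex is an illustrative stand-in, not Lucene.
public class Indexer {
    public static class InMemoryIndex {
        private final Map<String, String> docs = new HashMap<>();
        public void delete(String key)           { docs.remove(key); }
        public void add(String key, String body) { docs.put(key, body); }
        public String get(String key)            { return docs.get(key); }
        public int size()                        { return docs.size(); }
    }

    // 'Update' = delete the stale document, then insert the fresh one.
    public static void update(InMemoryIndex index, String key, String body) {
        index.delete(key); // a no-op if the document is new
        index.add(key, body);
    }
}
```

With Lucene itself, the delete would go through the reader (by a key `Term`) and the insert through the writer, which is why serializing all modifications in one IndexMaster keeps things simple.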

Every so often (measured in time or in number of updates), the IndexMaster checkpoints the index, copying it to a new directory. Every minute, a cron job checks for new directories and copies them over to the search slaves, keeping the slaves’ indices fresh and up to date. Similarly, indices are optimized after a configurable number of checkpoint operations.
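The checkpoint step itself is just a directory copy. A minimal sketch, with hypothetical directory names, of snapshotting the live index into a versioned directory that the cron job can later ship to the slaves:

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch of the IndexMaster's checkpoint: copy every file of the live
// index directory into a new versioned snapshot directory.
public class Checkpointer {
    public static Path checkpoint(Path liveIndex, Path snapshotRoot, long version)
            throws IOException {
        Path target = snapshotRoot.resolve("index-" + version);
        Files.createDirectories(target);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(liveIndex)) {
            for (Path f : files) {
                Files.copy(f, target.resolve(f.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
        return target;
    }
}
```

Copying to a fresh directory (rather than syncing in place) means the slaves never observe a half-written index.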

Figure 1
[Figure 1]

Next: Implementing parallel remote search


Filed under: Projects, Search
