Eran's blog

Writing a Lucene Based Search Engine (pt. 1)

Part 1: Introduction

I feel that there’s a lack of long-winded, purely anecdotal, mildly helpful texts about developing a high-volume search tool. There are many informative posts about very specific issues, but none that tackle the entire process from beginning to end and aim at a production-level search tool rather than a toy. Add recent trends like tagging, blogs and RSS feeds, and relevant information becomes even harder to find. I’ve recently started working on just such a project, and I think it might be useful to me and to others if I document some of my (mis)adventures along the road to (as yet unseen) success.

First, a short statement of the project’s goals: a rapidly updating search engine capable of indexing and retrieving hundreds of thousands or even millions of JetPaks (see my previous post for a description). The system must be stable, robust and easy to expand. It must provide JetPak-level access control (private, open, shared, etc.), spam and mature-content filtering, and support for tags (eventually at the single-resource level). I18n and support for different file formats are highly desirable.

Choosing Lucene was an almost immediate decision. There might be better solutions (I’m not really familiar with any), but I’ve had some positive experience with Lucene in the past and the existing product already contained some support for it, so it was an easy choice. On the whole, I like Lucene: it has many useful features, but it also has some problems that I will (for the most part) discuss later. The one general problem I’d like to mention right now is extending Lucene. Lucene contains many internal private classes and final classes that cannot be extended. This means that in many cases one must either replicate code or import the entire Lucene source tree into one’s project and create new classes as part of Lucene’s package.
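To make the extension problem concrete, here is a minimal Java sketch of the usual workaround when a class is final: wrap it and delegate instead of subclassing. Every name in it (LibraryNormalizer, TagNormalizer, FinalClassWrapperDemo) is hypothetical and stands in for library code; none of them are actual Lucene classes.

```java
import java.util.Locale;

class FinalClassWrapperDemo {
    // Hypothetical stand-in for a library class declared final (not a real Lucene class).
    static final class LibraryNormalizer {
        String normalize(String text) {
            return text.trim().toLowerCase(Locale.ROOT);
        }
    }

    // "class TagNormalizer extends LibraryNormalizer" would not compile,
    // so the wrapper holds an instance and delegates to it instead.
    static class TagNormalizer {
        private final LibraryNormalizer delegate = new LibraryNormalizer();

        String normalize(String text) {
            // Reuse the library behavior, then add our own step:
            // replace each run of whitespace with a single dash.
            return delegate.normalize(text).replaceAll("\\s+", "-");
        }
    }

    public static void main(String[] args) {
        System.out.println(new TagNormalizer().normalize("  Search Engine  "));
    }
}
```

Delegation like this only helps when the rest of the library is willing to accept the wrapper type. When a method insists on the final class itself, or the class you need is not visible outside its package, you are back to copying code or placing your own classes inside the library's package.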

I’ve also taken a look at Nutch. It is a very nice package, but it seems designed to be used as a complete solution rather than as a library. Nutch is relatively easy to extend (using its plugin architecture), but as it stands it is designed with a very clear goal in mind: create an open source version of Google. Every part of Nutch seems designed to reach that specific goal, so getting it to do other, simpler tasks might be difficult. When I tried to use parts of Nutch as a library, I got somewhat annoyed at the inclusion of preset (non-log4j) logging and configuration interfaces, which stopped me from simply dropping it into my project. Instead, I found myself copying and replicating some of its code and ideas into my own classes.

Next: Architecture and Design.


Filed under: Projects, Search
