Eran's blog

Writing a Lucene Based Search Engine (pt. 4)

Part 4: Adding advanced textual analysis

In a simple search engine that has no link analysis, the more you do with the document text, the better, or at least that’s my opinion. I’ve started experimenting with linguistic/semantic analysis of the document text in order to improve both the search results and the quality of the search result summaries. These experiments live in a library I called Linger.

Linger is a library and an XML-based format that adds a semantic/linguistic layer of analysis to the indexing and search process. The results of the analysis are expressed in a simple XML format so that the information can be easily stored and retrieved. The library performs sentence boundary and named entity analysis using LingPipe from alias-i. It is designed to work as a pipeline, with each step adding more semantic markup to the document, so it is simple to add more steps to the process or change existing ones without affecting the rest.
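
To make the pipeline idea concrete, here is a minimal sketch in Python. It is not Linger's code: the tag names (doc, s, ne), the function names, the regex sentence splitter and the tiny list of known names are all assumptions made for illustration, standing in for the real format and for the work LingPipe does.

```python
import re
import xml.etree.ElementTree as ET

# Toy gazetteer standing in for LingPipe's named entity detection (assumption).
KNOWN_ENTITIES = ["Apache Lucene", "LingPipe"]


def mark_sentences(doc):
    """Step 1: move the raw text into one <s> element per sentence."""
    text, doc.text = doc.text or "", None
    # Naive splitter: a sentence runs up to the next '.', '!' or '?'.
    for sentence in re.findall(r"\S[^.!?]*[.!?]*", text):
        ET.SubElement(doc, "s").text = sentence.strip()
    return doc


def mark_entities(doc):
    """Step 2: wrap known names inside each sentence in <ne>."""
    pattern = re.compile("|".join(re.escape(name) for name in KNOWN_ENTITIES))
    for s in list(doc.iter("s")):
        text, s.text = s.text or "", ""
        last, pos = None, 0
        for m in pattern.finditer(text):
            gap = text[pos:m.start()]
            if last is None:
                s.text = gap       # text before the first entity
            else:
                last.tail = gap    # text between two entities
            last = ET.SubElement(s, "ne")
            last.text = m.group()
            pos = m.end()
        if last is None:
            s.text = text[pos:]    # no entity found, keep the sentence as is
        else:
            last.tail = text[pos:]
    return doc


# Each step takes the document and returns it with more markup,
# so steps can be added or swapped without touching the others.
PIPELINE = [mark_sentences, mark_entities]


def analyze(text):
    doc = ET.Element("doc")
    doc.text = text
    for step in PIPELINE:
        doc = step(doc)
    return ET.tostring(doc, encoding="unicode")


print(analyze("Eran uses Apache Lucene. LingPipe finds entities."))
```

Because every step has the same shape (document in, document out), adding a step is just a matter of appending another function to PIPELINE.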

Analysis process
Semantic analysis is performed during indexing and results in an XML document. This document is tokenized using a Linger-aware tokenizer that produces a stream of Lucene tokens. During tokenization, sentence boundaries are ignored, but named entities end up being indexed twice: once for every token the entity contains, and once more as a single token holding the complete entity. This should result in more significant matches on tokens belonging to named entities.
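
The double indexing of named entities can be sketched roughly as follows, in Python rather than as a real Lucene tokenizer. The tag names (doc, s, ne), the lowercasing, and the choice to emit the complete entity after its parts are assumptions for illustration only.

```python
import re
import xml.etree.ElementTree as ET

WORD = re.compile(r"\w+")


def words(text):
    """Split a piece of text into lowercase word terms."""
    return [w.lower() for w in WORD.findall(text or "")]


def index_terms(linger_xml):
    """Return the terms to index for an analyzed document."""
    terms = []

    def walk(node):
        if node.tag == "ne":
            parts = words("".join(node.itertext()))
            terms.extend(parts)            # once for every contained token
            terms.append(" ".join(parts))  # once for the complete entity
        else:
            # <s> and any other element: sentence boundaries are ignored,
            # only the text matters here.
            terms.extend(words(node.text))
            for child in node:
                walk(child)
                terms.extend(words(child.tail))

    walk(ET.fromstring(linger_xml))
    return terms


doc = "<doc><s>Eran works with <ne>Apache Lucene</ne> daily.</s><s>It is fast.</s></doc>"
print(index_terms(doc))
# ['eran', 'works', 'with', 'apache', 'lucene', 'apache lucene', 'daily', 'it', 'is', 'fast']
```

A query for "lucene" still matches the individual token, while a query for the whole name can also match the combined term.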

The entire document is also stored in its post-analysis (Linger) form in the Lucene index. During search, this stored form is retrieved and used to generate the context-sensitive summary. Since the semantic information is stored as XML, it is easy to recover. The new summarizer algorithm uses the sentence boundary information to create more meaningful summaries (ones that start, and where possible stop, at sentence boundaries).
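
As a rough illustration of a sentence-aware summary (not the actual summarizer), the sketch below reads the sentences back from the stored XML, starts at the first sentence that contains a query term, and keeps adding whole sentences while they fit. The tag names, the function names and the max_chars limit are assumptions.

```python
import re
import xml.etree.ElementTree as ET


def stored_sentences(linger_xml):
    """Recover the sentences from the stored, analyzed document."""
    root = ET.fromstring(linger_xml)
    return ["".join(s.itertext()).strip() for s in root.iter("s")]


def summarize(linger_xml, query_terms, max_chars=160):
    """Build a summary that starts and stops at sentence boundaries."""
    sentences = stored_sentences(linger_xml)
    if not sentences:
        return ""
    wanted = {t.lower() for t in query_terms}

    def hits(sentence):
        return bool(wanted & {w.lower() for w in re.findall(r"\w+", sentence)})

    # Start at the first sentence that mentions a query term,
    # falling back to the first sentence of the document.
    start = next((i for i, s in enumerate(sentences) if hits(s)), 0)
    summary = sentences[start]  # always kept whole, even if it is long
    for sentence in sentences[start + 1:]:
        if len(summary) + 1 + len(sentence) > max_chars:
            break
        summary += " " + sentence
    return summary


doc = ("<doc><s>Lucene is a search library.</s>"
       "<s>Linger adds <ne>semantic markup</ne>.</s>"
       "<s>Summaries start at sentence boundaries.</s></doc>")
print(summarize(doc, ["linger"], max_chars=60))
# Linger adds semantic markup.
```

With a larger max_chars the following sentence is appended as well, but a sentence is never cut in the middle.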

Tokenization of the Linger format is made to resemble Lucene’s standard analysis tokens. LingerHandler is a Lucene Tokenizer and produces Lucene Tokens, so it can be used in the indexing process in place of Lucene’s standard tokenizer and analyzer. LingerHandler adds two new token types to match the two types of semantic markup: the End of Sentence token and the Named Entity token. As stated before, Named Entity tokens are duplicated. The duplicate token (the one holding the full named entity) has a length of 0, so it can be ignored when trying to recreate the original text from the tokens.
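
The zero-length trick is easier to see with a small model of the token stream. The Token class below is a plain Python stand-in, not Lucene's Token class, and the field names, the type labels and the offsets given to the End of Sentence token are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    start: int           # offset of the first character in the original text
    end: int             # offset just past the last character
    kind: str = "word"   # "word", "named_entity" or "end_of_sentence"

    @property
    def length(self):
        return self.end - self.start


def rebuild(tokens):
    """Recreate the original text, skipping zero-length duplicate tokens."""
    out, pos = [], 0
    for t in tokens:
        if t.length == 0:
            continue                      # the full named entity duplicate
        out.append(" " * (t.start - pos))  # restore the gap between tokens
        out.append(t.text)
        pos = t.end
    return "".join(out)


tokens = [
    Token("Apache", 0, 6),
    Token("Lucene", 7, 13),
    Token("Apache Lucene", 13, 13, "named_entity"),  # duplicate, length 0
    Token("rocks", 14, 19),
    Token(".", 19, 20, "end_of_sentence"),
]
print(rebuild(tokens))
# Apache Lucene rocks.
```

The duplicate is still available to the index as a term, but anything that walks the offsets to rebuild the text can skip it with a single length check.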

All in all, this experiment turned out pretty well. The new summaries make much more sense than the old ones (which would start in the middle of a sentence and end in the middle of the next one), and Named Entity detection works pretty well to enhance the search results. However, I did not get to do much performance testing on any of the new algorithms involved, so it is hard to quantify the effect they would have on overall performance. There is definitely more work to be done on the named entity detection (a learning algorithm would be nice) and maybe some matching enhancements to the query processor. Tokenizing and reconstructing the Linger document was not as easy as I thought it would be. I’ve started considering external markup, since it might make things easier.


Filed under: Projects, Search

2 Responses - Comments are closed.

  1. Very nice. Let us know if you need help training new entity recognition models for LingPipe. The main bottleneck for most folks is creating the training data. Given that, LingPipe’s easy to train over a new domain.

  2. […] Lucy – Lucene’s little sister Summarization and NLP Eran at hellonline.com has done some interesting experiments using Lucene and Lingpipe from Alias-i.  Approaches like t […]
