Eran's blog

Social Networks for Social Search

There are 3 main reasons to add someone as a contact on flickr:

  1. They just added you and you feel obliged to reciprocate.
  2. They’re actually someone you know and you want to keep up with them and/or see photos from shared events.
  3. You like their photos.

Discounting #1 as an illness that plagues all YASNs, #2 and #3 are all about finding photos that are relevant to you. It’s not quite search yet because flickr doesn’t let you search within your social network, but it’s close.

Other YASNs are like that as well. In fact, I’d go as far as saying that social networks are about 2 things: hooking up and social search. By hooking up I mean pure social connection, not necessarily of the sexual kind, and by social search I mean using your social network to find or discover information that is relevant to you. Sadly, most of them are lacking in tools to do that effectively, but we’ll get there some day. I hope.

Why am I writing this? Well, I’ve been thinking of social search on and off for the last year or two. The model I’ve come up with can be best described as multi-layered social search. Different layers of your social network (layers that may be completely orthogonal to one another) are used as filters for different topics. You can have your music friends, your Java architects, your gaming geeks… each group would be used to filter different sources of information. Get news about music and live shows from people whose music taste you appreciate. Get recommendations about new games from Tycho and Gabe and so on.

Most social networks won’t let you do that yet. Yahoo’s My Web 2.0 is making a good start with their labels and I’d like to see more of that. Del.icio.us lets you find supposed experts on certain topics with their ‘active users.’ A combination of both features would be useful. Once we’ve combined those two features (segmentation and expert discovery), a place where I can bookmark, discuss, discover and connect with like-minded people would be another welcome addition.

But for now, we have a different solution. This solution (partial and lacking in many ways) comes in the form of niche YASNs that keep showing up. Use flickr to find photos, upcoming for events, boompa for car information, tribe for like-minded people, last.fm for music, netflix for movies and so on and so forth. We don’t have the Grand Unified Social Network yet but we can pretend we do by using 20 different YASNs. If we’re lucky, at least some of those will be open (as in open standards and APIs) for integration in the future.

By the way, this idea integrates quite seamlessly with the theory of Object Mediated Social Networks which I’ve talked about before. When people connect (discuss, analyze, link, etc.) over objects (movies, bands, jobs) they create filters. Those filters can then be used for search. Social search.

Filed under: Search, Tagging, The Net

Writing a Lucene Based Search Engine (pt. 4)

Part 4: Adding advanced textual analysis

In a simple search engine that has no link analysis, the more you do with the document text the better, or at least that’s my opinion. I’ve started doing some experimentation with linguistic/semantic analysis of the document text in order to improve search results and search result summary quality. These experiments were contained in a library I called Linger.

Linger is a library and an XML-based format that adds a semantic/linguistic layer of analysis to the indexing and search process. The results of the analysis are expressed in a simple XML format so that the information may be easily stored and retrieved. The library performs sentence boundary detection and named entity analysis using LingPipe from alias-i. It is designed to work as a pipeline, each step adding more semantic markup to the document, so it is simple to add more steps to the process or change existing ones without affecting the rest.

Analysis process
Semantic analysis is performed during indexing and results in an XML document. This document is tokenized using a Linger-aware tokenizer that produces a stream of Lucene tokens. During tokenization, sentence boundaries are ignored but named entities end up being indexed twice: a named entity is indexed once for every token contained in it and once as the complete complex token. This should result in more significant matches on tokens belonging to named entities.

The entire document is also stored in its post-analysis (Linger) form in the Lucene index. During search, this stored form is retrieved and used to generate the context-sensitive summary. Since the semantic information is stored in XML form it is very easy to acquire it again. The new summarizer algorithm uses the sentence boundary information to create more meaningful summaries (ones that start, and maybe stop, at sentence boundaries).
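As a rough sketch of that summarizer idea (the class, method and offsets here are illustrative, not the actual Linger code): the summary window is snapped back to the nearest sentence start before the best match, and cut at the last sentence boundary that fits.

```java
import java.util.List;

// Illustrative sketch: snap a raw summary window to sentence boundaries,
// assuming the analysis step stored sentence-start offsets for the document.
public class BoundarySummarizer {
    // Given the offset of the best matching term and the sentence-start
    // offsets (ascending), return a summary that begins at a sentence
    // boundary and, if possible, ends at one.
    public static String summarize(String text, int matchOffset,
                                   List<Integer> sentenceStarts, int maxLen) {
        int start = 0;
        for (int s : sentenceStarts) {
            if (s <= matchOffset) start = s; else break;
        }
        int limit = Math.min(text.length(), start + maxLen);
        int end = limit; // fall back to a hard cut if no boundary fits
        for (int s : sentenceStarts) {
            if (s > start && s <= limit) end = s; // last boundary in window
        }
        return text.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String text = "First sentence. Second sentence about Lucene. Third sentence.";
        // Sentence starts at offsets 0, 16 and 46; best match at offset 30.
        System.out.println(summarize(text, 30, List.of(0, 16, 46), 35));
    }
}
```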

Tokenization of the Linger format is made to resemble Lucene’s standard analysis tokens. LingerHandler is a Lucene Tokenizer and produces Lucene Tokens, so it can be used as part of the indexing process in place of Lucene’s standard tokenizer and analyzer. LingerHandler adds two new types of tokens to match the two types of semantic markup: the End of Sentence token and the Named Entity token. As stated before, Named Entity tokens are duplicated. The duplicate token (the token with the full named entity) has a length of 0 so it can be ignored when trying to recreate the original text from tokens.
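The duplication scheme can be sketched like this, using a stand-in Token type rather than Lucene’s actual Token class (the real LingerHandler implements Lucene’s Tokenizer; everything here is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for the scheme described above: each named entity is
// emitted once per contained word, plus once as a complete entity token with
// length 0, so full-entity matches are possible without disturbing text
// reconstruction.
public class NamedEntityTokens {
    // Simplified token: text plus a length used when reconstructing the
    // original text (length 0 marks the duplicate full-entity token).
    public record Token(String text, int length) {}

    public static List<Token> tokenizeEntity(String entity) {
        List<Token> out = new ArrayList<>();
        for (String word : entity.split("\\s+")) {
            out.add(new Token(word, word.length())); // normal word token
        }
        out.add(new Token(entity, 0)); // duplicate full-entity token
        return out;
    }

    public static void main(String[] args) {
        for (Token t : tokenizeEntity("Doug Cutting")) {
            System.out.println(t.text() + " (len=" + t.length() + ")");
        }
    }
}
```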

All in all, this experiment turned out pretty well. The new summaries make much more sense than the old ones (which would start in the middle of a sentence and end in the middle of the next one) and Named Entity detection works pretty well to enhance the search results. However, I did not get to do much performance testing on any of the new algorithms involved so it is hard to quantify the effect they would have on overall performance. There is definitely some more work to be done on the named entity detection (a learning algorithm would be nice) and maybe some matching enhancements to the query processor. Tokenizing and reconstructing the Linger document was not as easy as I thought it would be. I’ve started considering external markup; it might make things easier.

Filed under: Projects, Search

Advanced Lucene (ApacheCon 2005)

Ryan pointed me to this fine resource on advanced Lucene topics. Grant Ingersoll of CNLP recently presented at ApacheCon 2005, and discussed several very interesting topics, most accompanied by code (!).

Relevance feedback is a technique used to augment the user’s query with terms from the best matching documents. Grant shows how to use Lucene’s Term Vectors to do this. Relevance feedback has been on my todo list for a few weeks now; it’s great to have a good starting point.
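A minimal sketch of the core idea (not the Term Vector implementation from the talk — this stand-in just counts terms in raw document text): pull the most frequent new terms out of the top-ranked documents and append them to the query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of relevance feedback: expand the original query with the most
// frequent terms from the best matching documents. The real approach uses
// Lucene's stored Term Vectors instead of re-tokenizing text.
public class RelevanceFeedback {
    public static List<String> expandQuery(List<String> query,
                                           List<String> topDocs, int extra) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : topDocs) {
            for (String term : doc.toLowerCase().split("\\W+")) {
                if (!term.isEmpty() && !query.contains(term)) {
                    counts.merge(term, 1, Integer::sum); // count new terms only
                }
            }
        }
        List<String> expanded = new ArrayList<>(query);
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
              .limit(extra)
              .forEach(e -> expanded.add(e.getKey()));
        return expanded;
    }

    public static void main(String[] args) {
        List<String> q = List.of("lucene");
        List<String> docs = List.of(
            "Lucene index search index", "search index performance");
        System.out.println(expandQuery(q, docs, 2));
    }
}
```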

Span queries provide information about where a match took place within a document. The presentation explains how to use them for phrase matching (I wonder if that could replace named entity detection completely) and how to further use them for question answering. Looks like you could also use span queries as the basis for an efficient result summarizer.

Grant then goes on to discuss some of the work done at CNLP, using Lucene in various scenarios and domains. Very interesting stuff. I’ve yet to digest most of it but I’m sure I will soon, go check it out!

Filed under: Search

Writing a Lucene Based Search Engine (pt. 3)

Part 3: Implementing parallel remote search

Initially I’d hoped to make use of Lucene’s support for parallel and remote (RMI) based search. With promising class names like RemoteSearchable and ParallelMultiSearcher things were looking good and, indeed, my first attempts at implementing remote search seemed to work well enough.

Search queries were sent over RMI and Hits objects (Lucene’s container for search results) were sent back. I expanded on this theme by using Lucene’s ParallelMultiSearcher class, which uses several threads to query several Searchables in parallel. Pretty soon, however, I came across two problems when testing this setup:

  1. This setup is not very robust. When a search slave fails, it is pretty much impossible to get ParallelMultiSearcher to tell you which slave failed. This makes recovery difficult, or at least inefficient.
  2. Hits objects use caching to improve performance. This means that one must maintain a connection to an open IndexReader to access the contents of a Hits object. This can be very wasteful over RMI and tends to break very easily, especially in a system which has to reload indexes often.

In my solution I tried to address both these problems and in addition make SearchSlave easier to control and monitor.


Step 1: I defined a new interface for remote search, dubbed Controllable. This interface mimics Lucene’s Searchable interface but adds a few additional methods. Both Controllable and Searchable extend Java’s Remote interface (the interface that allows remote access over RMI), but Controllable adds a few methods lacking in Searchable that make remote control of search slaves easier.

  • Methods like ping() and status() allow for remote monitoring of slaves. These methods are usually accessed by the Servlet to verify the status of remote search slaves.
  • Methods like close() and reload() allow for remote control of search slaves. These are used by the new class ControlSlave to shut down slaves or to have a slave reload its index.
  • The rest of the methods are just copied over from Lucene’s Searchable, meant to be a minimal set of functions necessary for remote search.
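In code, the Controllable idea looks roughly like this. The search methods mirroring Lucene’s Searchable are elided, and the dummy implementation exists only to show the contract (the real slaves would be exported RMI objects):

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

// Sketch of the Controllable interface described above: a Searchable-like
// remote interface plus monitoring and control methods. Illustrative only.
interface Controllable extends Remote {
    boolean ping() throws RemoteException;   // liveness check for the Servlet
    String status() throws RemoteException;  // e.g. index version / doc count
    void reload() throws RemoteException;    // have the slave re-open its index
    void close() throws RemoteException;     // shut the slave down
    // ... plus the minimal search methods copied over from Searchable
}

// Trivial in-process implementation, just to demonstrate the contract.
class DummySlave implements Controllable {
    private boolean open = true;
    public boolean ping() { return open; }
    public String status() { return open ? "OK, index loaded" : "closed"; }
    public void reload() { /* would re-open the IndexReader here */ }
    public void close() { open = false; }
}
```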

Step 2: I created a modified version of ParallelMultiSearcher, called PRMSearcher (for Parallel, Remote, Multisearch) that is aware of the need to monitor remote search slaves and exposes the collection of remote searchers to its owner. This allows for monitoring individual slaves and recovery of an individual slave in case one should fail.

Step 3: I created the SimpleHit class and its corresponding collection SimpleHits. This is a version of Hits that does not employ caching. Yes, this probably means a hit on my performance as all hits must be read from the index but it also saves access over the network to get hit contents and makes the whole process less prone to failure. It also allows me to reload the IndexReader as often as I want without worrying about open Hits objects breaking.
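A sketch of the non-caching hit classes (field names and structure are illustrative, not the actual code): hits become plain serializable value objects, fully read from the index at search time, so they survive the trip over RMI and any index reload.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Sketch of the SimpleHit/SimpleHits idea: a hit is a plain value object
// with its contents already read from the index, so no IndexReader needs to
// stay open and the objects can safely cross the wire.
public class SimpleHits {
    public record SimpleHit(int docId, float score, String title, String url)
            implements Serializable {}

    private final List<SimpleHit> hits = new ArrayList<>();

    public void add(SimpleHit hit) { hits.add(hit); }
    public int length() { return hits.size(); }
    public SimpleHit hit(int i) { return hits.get(i); } // no index access here

    public static void main(String[] args) {
        SimpleHits hits = new SimpleHits();
        hits.add(new SimpleHit(7, 0.91f, "Lucene tips", "http://example.org"));
        System.out.println(hits.hit(0).title());
    }
}
```

The trade-off stated above is visible in the design: every field is materialized up front, which costs index reads per hit but removes the hidden dependency on an open IndexReader.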


Making search parallel took some work on the indexing side as well. I opted to go with a partitioned design where the index is partitioned into several non-overlapping partitions. This allows me to run several search slaves in parallel on different machines and should, in theory at least, allow for close to linear scaling of index size at constant performance. Another advantage of this solution is its relative simplicity. The next step up from that would improve robustness by having some overlap between partitions so that the entire index is still available if one search slave happens to go down. This solution, however, would require more complex handling of the incoming search results, which is already a possible bottleneck. For now, simple is good.

The IndexMaster in the initial design ran as part of the web application. Since the application is designed to run on several servers, some configuration control was needed to make sure that only one instance of the application would ever write to the index. This instance was dubbed the Index Master. Communication between the application and the Index Master is done by creating Search Jobs.

Search Jobs are simple database entries that let the Index Master know that new content is ready to be indexed. Later, those same entries can be used as a log to track the performance of the indexing process. The Index Master periodically checks for new search jobs, which it then performs as a batch. Batch indexing can yield a huge performance gain with Lucene. Based on the aforementioned advice from Doug Cutting, the Index Master performs a checkpoint on the index every so often, causing the index to be copied to a new directory from which the various search slaves can remote-copy the relevant partition of the index.

Partitioning is done in a very simple manner. An IndexBalancer object is both a collection of indexes and a mechanism for deciding the index partition into which a specific piece of information should go. I started out with a random balancer which worked pretty well but soon switched to a more deterministic approach based on the modulus of a hash of the object’s ID. This makes accessing objects by ID more efficient, a necessary operation when trying to update or delete an object in the index.
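The deterministic balancer amounts to a one-liner; a minimal sketch (class and method names are illustrative):

```java
// Sketch of the deterministic balancer described above: the partition for an
// object is the modulus of a hash of its ID, so updates and deletes can go
// straight to the right index partition instead of searching all of them.
public class IndexBalancer {
    private final int partitions;

    public IndexBalancer(int partitions) { this.partitions = partitions; }

    // floorMod guards against negative hashCode values.
    public int partitionFor(String objectId) {
        return Math.floorMod(objectId.hashCode(), partitions);
    }

    public static void main(String[] args) {
        IndexBalancer balancer = new IndexBalancer(4);
        // The same ID always maps to the same partition.
        System.out.println(balancer.partitionFor("jetpak-42")
                == balancer.partitionFor("jetpak-42"));
    }
}
```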

One of the problems in this design is the multiplicity of asynchronous processes. By decoupling the indexing process from the main application, it becomes easier to control and recover from a failure but it is also much harder to debug as some processes are time dependent and hard to predict. I ended up creating a few method calls that allow direct access into the bowels of the indexing process just to make testing more efficient.

Next: Rethinking indexing.

Filed under: Projects, Search

Riya to Launch Alpha Tomorrow?

I hear rumors that Riya will be sending out those anticipated alpha invites tomorrow…

Update: Apparently those last rumors were wrong. The new rumors say Monday…

Update: They’re here! They’re here! Pictures are uploading even as I type…

Filed under: Search, The Net


Wink

Wink is a social search site that’s got an interesting take on search. You could say it’s a mash-up of a search engine, a bookmark tagging service and a wiki.

The search page is divided into 3 areas. At first you’ll probably notice the “favorite results” area. These results were recommended (i.e. tagged and rated) by Wink users. Eventually this would become one of Wink’s strongest features. Currently, it’s just lacking in content.

Following the favorites are Google’s search results. You’re given the option to tag and/or rate links in both result sets. Tag a link and it’s saved as a tagged bookmark, accessible through your personal page. It would have been nice to use other search engines here but, frankly, who uses anything but Google anyway?

You’ll also notice the concept box, immediately under the search bar. This is an attempt at disambiguating search terms by providing the user with a list of articles from Wikipedia. It’s a pretty interesting idea, as these concepts can be edited by Wink users in the familiar MediaWiki interface, helping Wink keep up with the times and helping users narrow down their search even in areas they’re not deeply knowledgeable in. The concepts also come with related links (also wiki-fied), letting you surf through similar or connected concepts.

Occasionally, you’ll see a list of users show up between search results. These are users who created search sets that matched your search, more on search sets later. If you click on any of the user names here, you’ll get to their search page where you can see their favorite links, sets, etc.

Does that sound like a busy page? We’re barely even scratching the surface.

So you’ve tagged a few links, rated others, now what? Click on “My Page” and check your main Wink page where you can organize tags and manage your search sets. This page tries to cram as much information as possible into limited space while maintaining an “airy” design. The result is that you only get to see about 3 of your links/sets at a time and you must page through your tags with about 20 tags per page. Navigating through your tags is made easier by dynamically filtering tags as you type.

This page also lets you see related users (similarly to the search results) as well as search sets created by other users which you added to your list of favorites. Sometime in the future, you’ll also see your search history at the bottom of the page. Adding sets (or tags) from other users is a good way to stalk^H^H^H^H^H keep up with someone whom you consider a good source on that topic.

Clicking on a tag pulls up related links and search sets, both from your collection and from Wink’s favorite links. As usual, you can tag just about anything in view, even tags. Clicking the edit button brings up the Search Set editor, where you can create (or edit) your search sets. There is a strange duality on Wink between tags and sets; they appear to be two sides of the same coin. For now I’ll just treat tags as sets, meaning you can add a link to a tag or you can tag a link with a set, achieving pretty much identical results. Confusing? There’s more! You can add sets to other sets. This means you can (relatively) easily organize your links in a hierarchy but still access them through tags. That’s kinda neat.

All in all, Wink has some interesting ideas but most are hidden behind a limited UI and unclear concepts. I don’t see any way to export my content from Wink but they do make it very easy to import data from del.icio.us. Yeah, I know, they’re in beta… 🙂

Filed under: Search, Tagging, The Net

Writing a Lucene Based Search Engine (pt. 2)

Part 2: Architecture and Design

There are a couple of concise, yet very helpful, posts by Doug Cutting, the primary developer of Lucene and Nutch, that guided me towards my chosen design. Also helpful was this article by Otis Gospodnetic, a Lucene developer and co-author of “Lucene in Action.”

Based on these posts, additional research and some discussion with Ryan, I came up with the following initial design. Note that at this time, indexing the full content of a URL was outside the scope of the project; instead we index a short description, making the indexing process much simpler but severely (as it turned out) hurting the search process. An improved design will follow; for now, here’s what I had:

  1. Indices are partitioned; this allows us to parallelize both indexing and searching.
  2. Index search is done by remote Search Slaves.
    1. SearchSlaves are the only ones to read and search a Lucene index.
    2. Communication between master and slave is done over RMI.
  3. Indexing is done by the IndexMaster.
    1. IndexMaster is the only one to modify a Lucene index.
    2. The application creates Index Jobs to generate work for the IndexMaster.
  4. Synchronization is done using a combination of Java based timers, cron jobs and shell scripts.

Figure 1 contains a sketch of the logical architecture overlaid on a potential physical deployment.

Web requests are processed by the web application, which uses the SearchMaster class as an interface to the search engine. The Search Master performs all the query analysis and contacts the SearchSlave processes over RMI with the finalized query. Each SearchSlave searches its own index partition and sends back whatever results it found. Those results are collected by the SearchMaster and returned to the application for further processing and presentation.
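The collection step can be sketched as a simple top-k merge of the score-ranked partial results each slave returns (ScoredDoc and the class names here are stand-ins, not the actual hit objects):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the SearchMaster's collection step: each slave searches its own
// partition and returns score-ranked results; the master merges them into a
// single list ordered by score and keeps the top k.
public class ResultMerger {
    public record ScoredDoc(String id, float score) {}

    public static List<ScoredDoc> mergeTopK(List<List<ScoredDoc>> perSlave, int k) {
        List<ScoredDoc> all = new ArrayList<>();
        for (List<ScoredDoc> slaveResults : perSlave) all.addAll(slaveResults);
        all.sort(Comparator.comparingDouble(ScoredDoc::score).reversed());
        return all.subList(0, Math.min(k, all.size()));
    }

    public static void main(String[] args) {
        List<ScoredDoc> slave1 = List.of(new ScoredDoc("a", 0.9f), new ScoredDoc("b", 0.4f));
        List<ScoredDoc> slave2 = List.of(new ScoredDoc("c", 0.7f));
        for (ScoredDoc d : mergeTopK(List.of(slave1, slave2), 2)) {
            System.out.println(d.id() + " " + d.score());
        }
    }
}
```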

Indexing is done in two parts. Requests to add information to the index are handled by the application. However, the application’s part in this process is very simple: all it does is create an IndexJob object in the database, requesting that the object be indexed by the IndexMaster when it is ready. The IndexMaster runs periodically, reading open IndexJobs from the database and funneling them to Indexer objects. Indexers are the only objects that actually modify Lucene indexes. Note that since Lucene does not support an ‘update’ operation, a delete and an insert are required to modify existing data.
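The update-as-delete-plus-insert pattern can be sketched like this; the in-memory map stands in for the real IndexReader/IndexWriter pair, so only the shape of the operation is real:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the pattern described above: since Lucene (at the time) had no
// update operation, an Indexer modifies an existing document by deleting the
// old version and adding the new one. The map is a stand-in for the index.
public class Indexer {
    private final Map<String, String> index = new LinkedHashMap<>();

    public void add(String id, String content) { index.put(id, content); }

    public void delete(String id) { index.remove(id); }

    // "Update" = delete the old document, then insert the new version.
    public void update(String id, String newContent) {
        delete(id);
        add(id, newContent);
    }

    public String get(String id) { return index.get(id); }

    public static void main(String[] args) {
        Indexer indexer = new Indexer();
        indexer.add("doc1", "old content");
        indexer.update("doc1", "new content");
        System.out.println(indexer.get("doc1"));
    }
}
```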

Every so often (measured in time or in number of updates), the IndexMaster checkpoints the index, copying it to a new directory. Every minute a cron job checks for new directories and copies them over to the search slaves. This keeps the indexes fresh and up to date. Similarly, indexes are optimized after a configurable number of checkpoint operations.

Figure 1
[Figure 1]

Next: Implementing parallel remote search

Filed under: Projects, Search

Writing a Lucene Based Search Engine (pt. 1)

Part 1: Introduction

I feel that there’s a lack of long-winded, purely anecdotal, mildly helpful texts about developing a high-volume search tool. There are many informational posts about very specific issues but none that tackle the entire process from beginning to end and are aimed at writing a production-level search tool instead of just a toy. Add to that recent trends like tagging, blogs and RSS feeds, and relevant information becomes even harder to find. I’ve recently started working on just such a project and I think it might be useful to me and others if I document some of my (mis)adventures along the road to (as of yet unseen) success.

First, a short statement of the project’s goals: a rapidly updating search engine capable of indexing and retrieving hundreds of thousands or even millions of JetPaks (see my previous post for a description). The system must be stable, robust and easy to expand. It must provide JetPak-level access control (private, open, shared, etc.), spam and mature content filtering and support for tags (eventually at the single-resource level). I18n and support for different file formats are highly desirable.

Using Lucene was almost an immediate decision. There might be better solutions (I’m not really familiar with any) but I’ve had some positive previous experience with Lucene, and the existing product already contained some support for it, so it was an easy choice. As a whole, I like Lucene; it has many useful features but also some problems that I will (for the most part) discuss later. The one general problem I’d like to mention right now is the difficulty of extending Lucene. Lucene contains many internal private classes and final classes that cannot be extended. This means that in many cases one must replicate code or import the entire Lucene source tree into one’s project and create new classes as part of Lucene’s package.

I’ve also taken a look at Nutch. It is a very nice package but seems designed to be used as a complete solution rather than as a library. Nutch is relatively easy to extend (using its plugin architecture) but, as it stands, it is designed with a very clear goal in mind – create an open source version of Google. Every part of Nutch seems designed to reach that specific goal, so getting it to do other, simpler tasks might be difficult. When I tried to use parts of Nutch as a library, I got somewhat annoyed at the inclusion of preset (non-log4j) logging and configuration interfaces that stopped me from simply using it as a library inside my project. Instead, I found myself copying/replicating some of the code and ideas into my own classes.

Next: Architecture and Design.

Filed under: Projects, Search

JetEye Launched!

I’ve been working with this small start-up in downtown SF for the last month or so and we’ve just launched our beta. The site is called JetEye and what it offers is a new way to search, collect and share digital information. Using packages (called JetPaks) you can collect links, images, notes and other JetPaks, add tags and share information with the rest of the world.

JetPaks are collections of resources that offer more flexibility than a single bookmark but are more structured than a Web page. A JetPak can contain many related resources of different types, expanding the bookmark concept, but can still be quickly scanned and digested unlike a blog post or a Web page. As information stores, JetPaks contain just the right amount of data to be useful without becoming cluttered and unwieldy.

Vannevar Bush talked about sharing associative trails of information; JetPaks allow you to do just that – as you search, collect the links in a JetPak, annotate them or add select quotes and images. Attach this JetPak to your research notes when you’re done so that, sometime in the future, you or anyone else can recreate the trail of thought you followed on the way to your final conclusion.

JetPaks allow us to re-index the web. By collecting related resources into coherent packages we’re creating a new type of folksonomy, one that can better capture our own associations. Delicious and flickr let you browse the associative space of related tags, bookmarks and images as defined by whatever algorithms they designed. By allowing you to create packages of information, JetEye lets you create your own associative space based on your own connections. Every JetPak is a statement by a person, saying these items are all related to each other, they are interesting and they are even more interesting when viewed together in a single context.

You can use JetPaks to track distributed conversations, publish research notes, share specific information on any topic or even as a presentation aid. Taken as a whole JetPaks create a fascinating view of the web, re-indexed, re-contextualized and re-mixed by users. I can’t wait to see what people make of this.

Filed under: Search, Tagging, The Net

Google Personalized Search Launched

Posted by Sep Kamvar on Google Blog:

With the launch of Personalized Search, you can use that search history you’ve been building to get better results. You probably won’t notice much difference at first, but as your search history grows, your personalized results will gradually improve.

This whole concept of personalization has been a big part of our lives since some of the team was in grad school at Stanford. We shared an office, which happened to be the same one Sergey had used before, and we were pretty familiar with the research he and Larry had done. Related to their work, we thought building a scalable system for personalizing search results presented an interesting challenge. We’ve still got a long way to go, but we’re excited to release this first step…

Seems like I’ll have to let Google track my search history now, if only to see how well this new feature works.

PS. for further information:
Personalized Search Help
Google Groups announcement

Filed under: Search