Eran's blog

What is Web 2.0?

Is the question on everybody’s blog these days. Everyone seems to have their own opinion (and so do I), but I want to ask you this instead: Why do we care??? Leave the definitions to future historians and go out there and do something. If it’s useful and people like it, they’ll use it even if it doesn’t stand up to the latest random collection of Web2.0 definitions. If it sucks, all the buzzword compliance in the world won’t save you.

Filed under: The Net

Web2.0 is Coming Up

But here’s a couple of better things to do:

Web 1.0 Summit
Date: October 5, 2005
Location: House of Shields

32 New Montgomery St. (Google Maps)
San Francisco, California 94105

Sponsored by 43 Folders and the year 1998. We will meet to discuss line breaks, spacer gifs, and the ability to launch links in a new browser window. There will be beer.

And a couple days later, Chris Heuer is organizing Web2.1:

a BarCamp styled BrainJam for the rest of us who are in the trenches of this next wave. It is time for us to seize the reins of the evolution ourselves – it is time for us to let those big money interests have the Web 2.0 – it is time we launched Web 2.1.

…I propose we put together a Web2point1 Conference for next Friday October 7, 2005 somewhere here in San Francisco. We can charge $2.80 for people to attend – providing 1000x the take away value for 1/1000th of the cost – a factor of a million times better than Web 2.0.

Go check out Chris’s post and see if you can help out.

via: The Bay Area is Talking and Miss Rogue

Update: Chris sent out an email letting people know that the Web 2.1 WebJam site is live!

Filed under: Events, The Net


Coming up tomorrow, Tag Tuesday #3:

And next month, building on the success of Bar Camp, it’s Tag Camp! Tag Camp promises to be (and I quote) [a] welcoming event for geeks to camp out for a couple days, get wired on Halloween candy and think really fast. It’s like Tag Tuesday only on a weekend, with sleeping bags and luxurious showers. Yes, that’s right showers with little fishies on the shower curtain too!

Filed under: Tagging

Writing a Lucene Based Search Engine (pt. 2)

Part 2: Architecture and Design

There are a couple of concise, yet very helpful, posts by Doug Cutting, the primary developer of Lucene and Nutch, that guided me towards my chosen design. Also helpful was this article by Otis Gospodnetic, a Lucene developer and co-author of “Lucene in Action.”

Based on these posts, additional research and some discussion with Ryan, I came up with the following initial design. Note that at this time, indexing the full content of a URL was outside the scope of the project. Instead we index a short description, which makes the indexing process much simpler but (as it turned out) severely hurts the search process. An improved design will follow; for now, here’s what I had:

  1. Indices are partitioned; this allows us to parallelize both indexing and searching.
  2. Index search is done by remote SearchSlaves.
    1. SearchSlaves are the only ones to read and search a Lucene index.
    2. Communication between master and slave is done with RMI.
  3. Indexing is done by the IndexMaster.
    1. IndexMaster is the only one to modify a Lucene index.
    2. The application creates IndexJobs to generate work for the IndexMaster.
  4. Synchronization is done using a combination of Java-based timers, cron jobs and shell scripts.

Figure 1 contains a sketch of the logical architecture overlaid on a potential physical deployment.

Web requests are processed by the web application, which uses the SearchMaster class as an interface to the search engine. The SearchMaster performs all the query analysis and contacts the SearchSlave processes over RMI with the finalized query. Each SearchSlave searches its own index partition and sends back whatever results it found. Those results are collected by the SearchMaster and returned to the application for further processing and presentation.
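To make the fan-out concrete, here is a minimal sketch of the SearchMaster side. Everything in it is an assumption for illustration: the `Hit` class, the `searchAll` method and the in-process `SearchSlave` interface are hypothetical names, and a plain Java interface stands in for the RMI stubs. It also assumes that scores from different partitions can be compared directly, which may not hold for real Lucene scores.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical names throughout; in the design above SearchSlave would be reached over RMI.
class SearchFanOutSketch {

    /** One scored result from a single index partition. */
    static final class Hit {
        final String docId;
        final float score;

        Hit(String docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Stand-in for a remote slave; each one searches only its own partition. */
    interface SearchSlave {
        List<Hit> search(String query, int maxHits);
    }

    /** Sends the finalized query to every slave in parallel and merges the hits by score. */
    static List<Hit> searchAll(List<SearchSlave> slaves, String query, int maxHits) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, slaves.size()));
        try {
            List<Future<List<Hit>>> pending = new ArrayList<>();
            for (SearchSlave slave : slaves) {
                Callable<List<Hit>> task = () -> slave.search(query, maxHits);
                pending.add(pool.submit(task));
            }
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> future : pending) {
                try {
                    merged.addAll(future.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for a slave", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("slave search failed", e.getCause());
                }
            }
            // Highest score first, then keep only the requested number of hits.
            merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            return new ArrayList<>(merged.subList(0, Math.min(maxHits, merged.size())));
        } finally {
            pool.shutdown();
        }
    }
}
```

Each slave is asked for up to `maxHits` results because, in the worst case, all of the best hits live in a single partition.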

Indexing is done in two parts. Requests to add information to the index are handled by the application. However, the application’s part in this process is very simple: all it does is create an IndexJob object in the database, requesting that the object be indexed by the IndexMaster when it is ready. The IndexMaster runs periodically, reading open IndexJobs from the database and funneling those to Indexer objects. Indexers are the only objects that actually modify Lucene indexes. Note that since Lucene does not support an ‘update’ operation, delete and insert are required to modify existing data.
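The job flow and the delete-then-insert rule can be sketched without Lucene at all. In the sketch below, all names are hypothetical: a `Queue` plays the database table of open IndexJobs, and a plain list plays the index, chosen because (like the index described above) it offers add and delete but no in-place update.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-ins: the queue is the table of open IndexJobs,
// the list is an index that only supports add and delete.
class IndexJobSketch {

    /** A request, created by the application, to (re)index one item. */
    static final class IndexJob {
        final String docId;
        final String description;

        IndexJob(String docId, String description) {
            this.docId = docId;
            this.description = description;
        }
    }

    /** The only class that modifies the "index". */
    static final class Indexer {
        private final List<IndexJob> documents = new ArrayList<>();

        void index(IndexJob job) {
            // No update operation: delete any earlier copy, then insert the new one.
            documents.removeIf(doc -> doc.docId.equals(job.docId));
            documents.add(job);
        }

        int size() {
            return documents.size();
        }

        String descriptionOf(String docId) {
            for (IndexJob doc : documents) {
                if (doc.docId.equals(docId)) {
                    return doc.description;
                }
            }
            return null;
        }
    }

    /** One periodic IndexMaster run: drain the open jobs and hand each to the Indexer. */
    static int runOnce(Queue<IndexJob> openJobs, Indexer indexer) {
        int processed = 0;
        IndexJob job;
        while ((job = openJobs.poll()) != null) {
            indexer.index(job);
            processed++;
        }
        return processed;
    }
}
```

Skipping the delete step would leave two copies of the same item in the index after a re-index, which is the failure this pattern guards against.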

Every so often (measured in time or in number of updates), the IndexMaster checkpoints the index, copying it to a new directory. Every minute a cron job checks for new directories and copies them over to the search slaves. This keeps the indexes fresh and up to date. Similarly, indexes are optimized after a configurable number of checkpoint operations.
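The checkpoint step itself is just a directory copy. A minimal sketch, assuming a hypothetical `checkpoint-<n>` naming scheme and a hypothetical `checkpoint` helper (the post does not say how the directories are named or copied):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper; the directory layout and naming are assumptions.
class CheckpointSketch {

    /** Copies everything under liveIndex into checkpointRoot/checkpoint-<sequence>. */
    static Path checkpoint(Path liveIndex, Path checkpointRoot, int sequence) {
        Path target = checkpointRoot.resolve("checkpoint-" + sequence);
        // Files.walk visits a directory before its contents, so parents exist before files are copied.
        try (Stream<Path> entries = Files.walk(liveIndex)) {
            entries.forEach(source -> {
                Path dest = target.resolve(liveIndex.relativize(source));
                try {
                    if (Files.isDirectory(source)) {
                        Files.createDirectories(dest);
                    } else {
                        Files.copy(source, dest);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }
}
```

Because each checkpoint lands in a fresh directory, a job that only looks for new directories never sees a half-modified live index, although it could still catch a checkpoint that is mid-copy unless the copy is staged and renamed into place.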

Figure 1

Next: Implementing parallel remote search

Filed under: Projects, Search

Walled Garden Gone? Not Quite Yet

Robert Young writes (and Jeremy Zawodny and Mark Pincus seem to agree):

It won’t be the corporation that locks its customers into a walled garden any more; instead, it will be the people themselves who create their own high switching costs. For instance, if you are an eBay seller, your switching cost is not so much the relationship you’ve created with eBay itself and the store you set up, it’s the reputation and trust you spent years building with fellow members of the community. Similarly, if you are a member of MySpace, it’s not the web-page and blog you spent time constructing, it’s your social network of cyber-friends you’ve cultivated and accumulated over time.

At the end, the lesson is one of a paradox. As the power shifts increasingly towards community, the corporation loses its grip on the traditional means of control. Yet, by letting go of control, the corporation creates an environment where the community willingly creates its own switching costs.

To some degree I agree with Mr. Young: it is hard to leave behind the reputation you created in a community and move to some other, maybe better, community. However, in the second paragraph, he claims that those limits are now self-imposed by users as corporations are letting go of control. Is that completely true? If corporations really let go, I would own my own data; I would be able to “take” my social network and my reputation with me wherever I go, or share them with other services. If that were true, the barriers to moving to a new community would be much lower or disappear altogether. The eBay merchant would be able to import his reputation and reviews to (say) Amazon’s zShops, with references to eBay users linking directly to eBay; any MySpace user would be able to start using tribe with her entire list of contacts still in place: her friends who are already on tribe would appear as tribe friends, and those who aren’t would appear as contacts on other networks.

A truly open social network would cooperate with other such social networks or at least offer an API to programmatically access information like contacts, reputation and other user-created content. As far as I know, none of the major players currently offers that (Tribe allows you to publish contacts and listings, but in a very limited way). When open identity services, social networks and networks of trust are widely available and used by sites like Yahoo or MySpace, I’ll agree that corporations are letting go of control. For now, the walls are still there; they just look a little different.

via: Mark Pincus

Filed under: The Net

On the Impossibility of Simplifying Identity

How does one create a simple identity aggregation tool? By the act of creating the tool you create another “identity fragment”, thereby aggravating the situation.

Filed under: The Net

Writing a Lucene Based Search Engine (pt. 1)

Part 1: Introduction

I feel that there’s a lack of long-winded, purely anecdotal, mildly helpful texts about developing a high-volume search tool. There are many informational posts about very specific issues, but none that tackle the entire process from beginning to end and are aimed at writing a production-level search tool instead of just a toy. Add to that recent trends like tagging, blogs and RSS feeds, and relevant information becomes even harder to find. I’ve recently started working on just such a project and I think it might be useful to me and others if I document some of my (mis)adventures along the road to (as yet unseen) success.

First, a short statement of the project’s goals: a rapidly updating search engine capable of indexing and retrieving hundreds of thousands or even millions of JetPaks (see my previous post for a description). The system must be stable, robust and easy to expand. It must provide JetPak-level access control (private, open, shared, etc.), spam and mature content filtering, and support for tags (eventually at the single resource level). I18n and support for different file formats are highly desirable.

Using Lucene was almost an immediate decision. There might be better solutions (I’m not really familiar with any), but I’ve had some positive previous experience with Lucene and the existing product already contained some support for it, so it was an easy choice. As a whole, I like Lucene: it has many useful features, but it also has some problems that I will (for the most part) discuss later. The one general problem I’d like to mention right now is attempting to extend Lucene. Lucene contains many internal private classes and final classes that cannot be extended. This means that in many cases one must replicate code, or import the entire Lucene source tree into one’s project and create new classes as part of Lucene’s package.

I’ve also taken a look at Nutch. It is a very nice package but seems designed to be used as a complete solution rather than as a library. Nutch is relatively easy to extend (using its plugin architecture), but as it is right now, it is designed with a very clear goal in mind: create an open source version of Google. Every part of Nutch seems designed to reach that specific goal, so getting it to do other, simpler, tasks might be difficult. When I tried to use parts of Nutch as a library, I got somewhat annoyed at the inclusion of preset (non-log4j) logging and configuration interfaces that stopped me from simply using it as a library inside my project. Instead, I found myself copying/replicating some of the code and ideas into my own classes.

Next: Architecture and Design.

Filed under: Projects, Search