Eran's blog

The Bubble 2.0 Snark Group

We’ve been working on this for months, meeting in secret in dark rooms, scheming, planning, fighting with CSS and finally, it’s alive!

The Bubble 2.0 Snark Group (B2SG) is a group of elite bloggers, snarks, ranters, malcontents and professional beer drinkers dedicated to desecrating everything Web2.0 and much that isn’t. B2SG members are as varied as they are snarky:

  • Supr.c.ilio.us: The Blog founders Eran and Ryan together with their rag-tag team of guest bloggers are on a cruise around the world, sailing on the R.S.S snark. From their floating platform they make sure none of us lack for snark in our diets.
  • RSSing Down Under is written by our Australian Bureau Chief, Ben Barren. Ben combines news from the Silicon Valley with the latest Hollywood buzz.
  • Geek Entertainment TV adds a splash of video to the B2SG. Irina Slutsky and Eddie Codel report live and in color from deep inside the bubble. Citizen journalism at its best!

Check it out!

Via: Supr.c.ilio.us: The Blog

Filed under: The Net

Tag Tuesday, Again!

I think it’s the fourth one…

Tag Tuesday is coming to Mountain View! After three meetings in San Francisco we have decided to give Silicon Valley a try to extend our reach. Our next meeting will take place next Tuesday, November 29, from 6:30-8:30 p.m. at AOL’s campus in Mountain View.

Speakers will be Edwin Aoki of AOL and Kevin Burton of TailRank. Seems like this one might be worth the drive…

More details at the Tag Tuesday blog.

Filed under: Events, Tagging

Really? Simple Sharing?

I hate specs, they bore me. I’d rather see a few examples and keep a handy reference for any questions not covered by the examples. I’ve decided, however, to make an exception for Microsoft’s SSE and read the actual spec. All I could see was a simplified version control system that, for some unknown reason, is published with RSS and OPML. I quickly turned to Ray Ozzie’s post to see if maybe he’s got some motivation for me.

It seems that Ozzie is really excited about SSE because it will let him better coordinate schedules and contacts with his wife’s staff – heartwarming, really! Personally I think that would work just as well using a phone, but I can see how a protocol would scale better. So, assuming we agree that a protocol is desirable for this new-found problem of synching our PIMs, that leaves the question: why RSS and OPML?

Ozzie seems to have two reasons for that: the existing support for those standards and the simplicity of RSS. Reason one is not good enough; it’s what leads you to backwards-compatibility hell. This is even worse when the proposed enhancements break reason two. Synchronization isn’t simple and RSS + SX isn’t simple either.

So if we do still want this subscription model and RSS + SX is not the solution, what is? And what do we stand to gain from a new solution? The answer is actually very simple: it’s XHTML (with microformats for seasoning).

We already know that anything OPML can do, XOXO can do as well, so that’s easy. As for RSS, well, it’s about time we got rid of this hack anyway. There’s no reason for RSS to exist when it can be replaced by XHTML, banishing that ugly XML link from our blogs and maintaining style at the same time. The gain here is obvious: feeds become readable or even (how’s this for magic?) disappear completely – the content is the feed.

Calendar data? Easy. Contacts? Easy. Just use hCalendar and hCard to represent those. What did you just get? Is this a feed of your events that also serves as your calendar? Amazing! And importing contacts into your favorite PIM application is as easy as applying an XSLT? Magic! Synchronization can be done as per Microsoft’s schema or, since we’re just dealing with XHTML here, using any existing solution. How about DAV?

That’s what I call simple.

Filed under: MicroFormats, The Net

Interviewed on Gabbr.com

My 15GB of fame continue to fade away…

Based on my writings over at Supr.c.ilio.us: The Blog, I was asked to do a short e-mail interview with Dr. Del of Gabbr.com. The result can be seen here. I think I managed to offer a balanced perspective of the Web 2.0 world as I see it now, from my comfortable perch, right in the middle of it all:

The market right now is rife with opportunity. There are many problems waiting to be solved, some of which can be solved by small teams of dedicated people. Many of those solutions come with a real revenue opportunity beside them (I’m talking about more than Adsense ads here) and if done properly can be the base of a very sustainable business.

Some Web 2.0 companies go after those problems, offering real solutions and have well laid plans on how to sustain themselves (see Skype and Flock for example). Others are just here for the quick flip, the hype, the investor money, etc. I do not think it’s fair to generalize and say that all Web 2.0 companies are based on hype, just as not all Web 2.0 bloggers are only here to fan the flames and generate some more circular hype.

Filed under: The Net

Pre Approved: The Hack2.0 Workgroup

I humbly accept the invitation to join this illustrious group of ne’er-do-wells, hax0rs and paparazzi chasers.

The Hack 2.0 Workgroup (or H2.0 for short) is a network of sub-premium weblogs that hack content exclusively about the new generation of the Web. Combined, these hacks reach a large readership of influential technology and media professionals.

Thank you for welcoming this humble blog into your lines!

Filed under: The Net

Quoted on Wired.com

It’s almost as if I know what I’m talking about…

Another tester, San Francisco software engineer Eran Globen, said Riya’s system is smart.

“The training part of the program, when it gets things right, is very helpful,” said Globen. Having photos of individuals “from the same event grouped together … makes tagging much faster.”

Read the rest here.

PS. Where’s the link-love??

Filed under: General

Hacking Memeorandum, More on Humans vs. Computers

If you look at Memeorandum right now, you’ll notice that our discussion on hacking Memeorandum is still up there and seems to be picking up speed. Currently Pete Cashmore’s post, More proof that algorithms don’t work, is leading the pack. Pete says:

The serious point here is that once someone figures out how an algorithm works, they will use that knowledge to their own advantage – if Memeorandum ever goes mainstream, it will be targeted by spammers and lose much of its usefulness to the community.
In my post Humans vs Algorithms, I suggested that we need to put human minds in the loop if we are to keep the spam out of search engines and news sites.

There is a long and interesting discussion in the comments on that post about the quality of filtering of computers vs. humans. But looking at Memeorandum right now and seeing the words “Hacking Memeorandum” repeated several times in large bold letters I cannot help but think of another difference between humans and computers.

It is well known that humans are bad at creating random data. Ask a thousand people to choose a number between 1 and 100 and hardly any of them will choose ‘55’ because it doesn’t appear to be random. To a human, the number ‘55’ has an obvious pattern and so cannot be random. To a computer, on the other hand, 55 is just another sequence of bits as meaningless as any other. This happens because we, as humans, have context and bias: we keep examining everything around us in a specific context and in the light of our own bias.

In a similar manner to the above, human editors would probably not allow a post titled “Hacking Memeorandum” to show up on Memeorandum. They’re likely to let their bias get in the way and decide to remove the “offending” post (offensive and even subversive in the context of maintaining a site). A good example of that is the lack of content about subverting/hacking/spamming Wikipedia found on Wikipedia.

A computer algorithm, on the other hand, only has whatever bias it was designed with. In a properly designed algorithm this bias would be content-neutral and would judge all content based on the same criteria. This is what allows posts like “hacking Memeorandum” to show up on Memeorandum and these pages to show up on Google.

As an aside, I’d like to commend Gabe on not using his control of Memeorandum to kill this fascinating conversation that spawned from a somewhat subversive idea.

Filed under: The Net

Exploring Memeorandum

A while ago, I met Gabe Rivera at Slide’s launch party. I really wanted to have a serious discussion with him about the algorithms behind Memeorandum. I’m a geek like that; I like to know how things work. We never got to have that discussion, so I really don’t know how much Gabe would have divulged, but tonight, instead of that discussion, we’re having a Memeorandum exploration session.

Tara Hunt, aka Miss Rogue, has some ideas about the best ways to get on Memeorandum:

However, it is actually trivially simple. I’ve put together a wee step-by-step (There may or may not be a screencast in the future) ‘how to’:

  1. Get quoted saying something quite controversial
  2. Squawk about the quote
  3. Get a whole bunch of attention for squawking about that quote
  4. Wish you had just shut up in the first place ’cause nobody but you would have probably even noticed it

See? Simple. 😉

Alex Barnett seems to agree with Tara.

On the other hand, over at Supr.c.ilio.us: The Blog, Assaf has a couple of things to say about this, basically claiming that:

The best way to get on memeorandum is to talk about memeorandum.

As for me, I’m on the fence. Still waiting to have that conversation with Gabe. Maybe over at Long-Tail camp?

Update: Alex Barnett informs us that it works! And while you’re here, check out Alex’s screencast.

Filed under: The Net

Writing a Lucene Based Search Engine (pt. 3)

Part 3: Implementing parallel remote search

Initially I’d hoped to make use of Lucene’s support for parallel and remote (RMI-based) search. With promising class names like RemoteSearchable and ParallelMultiSearcher things were looking good and, indeed, my first attempts at implementing remote search seemed to work well enough.

Search queries were sent over RPC and Hits objects (Lucene’s container for search results) were sent back. I expanded on this theme by using Lucene’s ParallelMultiSearcher class, which uses several threads to query several Searchables in parallel. Pretty soon, however, I came across two problems when testing this setup:

  1. This setup is not very robust. When a search slave fails, it is pretty much impossible to get ParallelMultiSearcher to let you know which slave failed. This makes recovery difficult or at least inefficient.
  2. Hits objects use caching to improve performance. This means that one must maintain a connection to an open IndexReader if one wants to access the contents of a Hits object. This can be very wasteful over RPC and tends to break very easily, especially in a system which has to reload indexes often.

In my solution I tried to address both these problems and in addition make SearchSlave easier to control and monitor.


Step 1: I defined a new interface for remote search, dubbed Controllable. This interface mimics Lucene’s Searchable interface, and both extend Java’s Remote interface (the interface that allows remote access over RPC), but Controllable adds a few methods lacking in Searchable that make remote control of search slaves easier.

  • Methods like ping() and status() allow for remote monitoring of slaves. These methods are usually accessed by the Servlet to verify the status of remote search slaves.
  • Methods like close() and reload() allow for remote control of search slaves. These are used by the new class ControlSlave to shut down slaves or to have a slave reload its index.
  • The rest of the methods are just copied over from Lucene’s Searchable, meant to be a minimal set of functions necessary for remote search.
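
A minimal sketch of what such an interface might look like. Only the method names ping(), status(), close() and reload() come from the description above; the signatures, the return types and the InMemorySlave class are assumptions made for illustration, and maxDoc() stands in for the whole group of search methods copied from Searchable.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

// Hypothetical remote-control interface for a search slave.
// Signatures and return types are assumptions, not the original code.
interface Controllable extends Remote {
    // Monitoring: used by the servlet to check on a slave.
    boolean ping() throws RemoteException;
    String status() throws RemoteException;

    // Control: used to reload the index or shut the slave down.
    void reload() throws RemoteException;
    void close() throws RemoteException;

    // Stand-in for the search methods copied over from Lucene's Searchable.
    int maxDoc() throws RemoteException;
}

// In-memory implementation (hypothetical) so the contract can be
// exercised without setting up an RMI registry.
class InMemorySlave implements Controllable {
    private boolean open = true;
    private int generation = 0; // counts index reloads
    private final int docs;

    InMemorySlave(int docs) { this.docs = docs; }

    public boolean ping() { return open; }

    public String status() {
        return open ? "OK generation=" + generation : "CLOSED";
    }

    public void reload() { generation++; }

    public void close() { open = false; }

    public int maxDoc() { return docs; }
}
```

Every method on the interface declares RemoteException because RMI requires it of remote methods; the local stand-in is free to drop the clause.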

Step 2: I created a modified version of ParallelMultiSearcher, called PRMSearcher (for Parallel, Remote, Multisearch) that is aware of the need to monitor remote search slaves and exposes the collection of remote searchers to its owner. This allows for monitoring individual slaves and recovery of an individual slave in case one should fail.
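
The idea can be sketched without Lucene at all. The code below is not PRMSearcher itself; SlaveSearch, FanOutResult and FanOutSearcher are hypothetical names, and queries and hits are plain strings. It only shows the part that matters here: query every slave in parallel, keep each slave addressable by name, and report exactly which ones failed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for one remote search slave.
interface SlaveSearch {
    List<String> search(String query) throws Exception;
}

// Merged hits plus the names of the slaves that failed.
class FanOutResult {
    final List<String> hits = new ArrayList<>();
    final List<String> failedSlaves = new ArrayList<>();
}

class FanOutSearcher {
    // Slaves are kept by name so the owner can monitor or replace one.
    private final Map<String, SlaveSearch> slaves;

    FanOutSearcher(Map<String, SlaveSearch> slaves) {
        this.slaves = new LinkedHashMap<>(slaves);
    }

    FanOutResult search(String query) throws InterruptedException {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, slaves.size()));
        try {
            // Submit one task per slave so all queries run concurrently.
            Map<String, Future<List<String>>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, SlaveSearch> e : slaves.entrySet()) {
                final SlaveSearch slave = e.getValue();
                Callable<List<String>> task = () -> slave.search(query);
                pending.put(e.getKey(), pool.submit(task));
            }
            // Collect results; a failure is recorded against its slave
            // instead of aborting the whole search.
            FanOutResult result = new FanOutResult();
            for (Map.Entry<String, Future<List<String>>> e : pending.entrySet()) {
                try {
                    result.hits.addAll(e.getValue().get());
                } catch (ExecutionException failure) {
                    result.failedSlaves.add(e.getKey());
                }
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

With the failed slave names in hand, the caller can decide whether to serve partial results, retry, or restart just that slave.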

Step 3: I created the SimpleHit class and its corresponding collection SimpleHits. This is a version of Hits that does not employ caching. Yes, this probably means a hit on my performance as all hits must be read from the index but it also saves access over the network to get hit contents and makes the whole process less prone to failure. It also allows me to reload the IndexReader as often as I want without worrying about open Hits objects breaking.
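
Roughly, a cache-free hit is just a serializable value object that carries everything the caller needs. The field layout below (a score plus a map of stored field values) and the method names are assumptions for illustration, not the original classes.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: one hit, fully materialized on the slave so that
// reading it never goes back to an IndexReader.
class SimpleHit implements Serializable {
    private static final long serialVersionUID = 1L;

    final float score;
    final Map<String, String> fields; // stored field name -> value

    SimpleHit(float score, Map<String, String> fields) {
        this.score = score;
        // Defensive copy; the hit is immutable once created.
        this.fields = Collections.unmodifiableMap(new LinkedHashMap<>(fields));
    }

    String get(String name) { return fields.get(name); }
}

// Hypothetical sketch: an ordered collection of SimpleHit objects that is
// sent back over the wire in one piece.
class SimpleHits implements Serializable {
    private static final long serialVersionUID = 1L;

    private final ArrayList<SimpleHit> hits = new ArrayList<>();

    void add(SimpleHit hit) { hits.add(hit); }

    int length() { return hits.size(); }

    SimpleHit hit(int i) { return hits.get(i); }
}
```

Because the whole result is copied out of the index before it leaves the slave, the slave can close or reload its IndexReader immediately afterwards.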


Making search parallel took some work on the indexing side as well. I opted to go with a partitioned design where the index is split into several non-overlapping partitions. This allows me to run several search slaves in parallel on different machines and should, in theory at least, allow for close to linear scaling in size of index with constant performance. Another advantage of this solution is its relative simplicity. The next step up from that would improve robustness by having some overlap between partitions so that the entire index is still available if one search slave happens to go down. This solution, however, would require more complex handling of the incoming search results, which is already a possible bottleneck. For now, simple is good.

The IndexMaster in the initial design ran as part of the web application. Since the application is designed to run on several servers, some configuration control was needed to make sure that only one instance of the application would ever write to the index. This instance was dubbed the Index Master. Communication between the application and the Index Master is done by creating Search Jobs.

Search Jobs are simple database entries that let the Index Master know that new content is ready to be indexed. Later those same entries can be used as a log to track performance of the indexing process. The Index Master periodically checks for new search jobs, which it then performs as a batch. Batch indexing can be a huge gain in performance on Lucene. Based on the aforementioned advice from Doug Cutting, the Index Master performs a checkpoint on the index every so often, causing the index to be copied to a new directory from which the various search slaves can remote-copy the relevant partition of the index.
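
The poll-then-batch loop can be sketched as follows. Everything here is a stand-in: an in-memory queue replaces the database table of search jobs, a list replaces the Lucene index, and a counter replaces the checkpoint copy. BatchIndexer and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of the Index Master's polling step.
class BatchIndexer {
    // Stand-in for the search-jobs table: ids of content waiting to be indexed.
    private final ConcurrentLinkedQueue<String> pendingJobs =
            new ConcurrentLinkedQueue<>();
    // Stand-in for the index itself.
    private final List<String> indexed = new ArrayList<>();

    private final int batchesPerCheckpoint;
    private int batchesSinceCheckpoint = 0;
    private int checkpoints = 0;

    BatchIndexer(int batchesPerCheckpoint) {
        if (batchesPerCheckpoint <= 0) {
            throw new IllegalArgumentException("batchesPerCheckpoint must be positive");
        }
        this.batchesPerCheckpoint = batchesPerCheckpoint;
    }

    // Called by the application when new content is ready.
    void submit(String contentId) { pendingJobs.add(contentId); }

    // One polling pass: drain every pending job and index them together.
    // Returns the number of jobs handled in this pass.
    int pollOnce() {
        List<String> batch = new ArrayList<>();
        String job;
        while ((job = pendingJobs.poll()) != null) {
            batch.add(job);
        }
        if (batch.isEmpty()) {
            return 0; // nothing to do, and no progress toward a checkpoint
        }
        indexed.addAll(batch); // stand-in for one batched index write
        batchesSinceCheckpoint++;
        if (batchesSinceCheckpoint >= batchesPerCheckpoint) {
            checkpoints++; // stand-in for copying the index to a new directory
            batchesSinceCheckpoint = 0;
        }
        return batch.size();
    }

    int checkpoints() { return checkpoints; }

    int indexedCount() { return indexed.size(); }
}
```

In the real system pollOnce() would be driven by a timer, and the checkpoint interval is a tuning knob: checkpoint too often and the copies dominate, too rarely and the slaves serve stale partitions.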

Partitioning is done in a very simple manner. An IndexBalancer object is both a collection of indexes and a mechanism for deciding the index partition into which a specific piece of information should go. I started out with a random balancer which worked pretty well but soon switched to a more deterministic approach based on the modulus of a hash of the object’s ID. This makes accessing objects by ID more efficient, a necessary operation when trying to update or delete an object in the index.
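
As a minimal sketch of the deterministic approach, assuming object IDs are strings and using String.hashCode() as the hash (the actual hash function is not specified above; HashPartitioner is a hypothetical name):

```java
// Hypothetical sketch: pick an index partition from an object's ID.
class HashPartitioner {
    private final int partitions;

    HashPartitioner(int partitions) {
        if (partitions <= 0) {
            throw new IllegalArgumentException("partitions must be positive");
        }
        this.partitions = partitions;
    }

    // The same ID always maps to the same partition, so an update or delete
    // only has to touch one partition instead of searching all of them.
    int partitionFor(String id) {
        // floorMod keeps the result in [0, partitions) even when
        // hashCode() is negative; plain % would not.
        return Math.floorMod(id.hashCode(), partitions);
    }
}
```

One trade-off worth noting: changing the number of partitions changes where most IDs land, so repartitioning means re-indexing.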

One of the problems in this design is the multiplicity of asynchronous processes. By decoupling the indexing process from the main application, it becomes easier to control and recover from a failure but it is also much harder to debug as some processes are time dependent and hard to predict. I ended up creating a few method calls that allow direct access into the bowels of the indexing process just to make testing more efficient.

Next: Rethinking indexing.

Filed under: Projects, Search