HellOnline

Eran's blog

Updates: cite-rel, distributed social anything

Time for a couple of updates.

cite-rel

I’ve been doing some research on distributed conversations and markup in order to bring cite-rel to a draft status. You can see the markup examples I collected, the formats used in other places and the current state of the discussion, all on the wiki. Please add whatever additional information or thoughts you have on the subject to the appropriate wiki page.

Distributed Social Anything

In case you missed the blog post, there’s now a whole trac wiki devoted to the project. I’ll be pursuing this project throughout this semester at USF as my Master’s Project. I’m planning to keep the project as open and transparent as possible. I’ll try to keep the wiki up-to-date and I would appreciate any feedback from interested parties. I only ask that you create an account on the wiki before modifying anything; this will allow me to keep track of who I’m talking to.

Oh, the code-name for this is Project Nirvana. Don’t ask why.

Filed under: MicroFormats, Projects

Writing a Lucene Based Search Engine (pt. 4)

Part4: Adding advanced textual analysis

In a simple search engine that has no link analysis the more you do with the document text, the better, or at least that’s my opinion. I’ve started doing some experimentation with linguistic/semantic analysis of the document text in order to improve search results and search result summary quality. These experiments were contained in a library I called Linger.

Linger is a library and an XML-based format that adds a semantic/linguistic layer of analysis to the indexing and search process. The results of the analysis are expressed using a simple XML format so that the information may be easily stored and retrieved. The library performs sentence boundary and named entity analysis using LingPipe from alias-i. It is designed to work as a pipeline, each step adding more semantic markup to the document, so it is simple to add more steps to the process or change existing ones without affecting the rest.

Analysis process
Semantic analysis is performed during indexing and results in an XML document. This document is tokenized using a Linger-aware tokenizer that produces a stream of Lucene tokens. For the tokenization process, sentence boundaries are ignored but named entities end up being indexed twice. A named entity is indexed once for every token contained in it and once more as a single complex token holding the complete entity. This should result in more significant matches on tokens belonging to named entities.

The entire document is also stored in its post analysis (linger) form in the Lucene index. During search this stored form is retrieved and is used to generate the context sensitive summary. Since the semantic information is stored in XML form it is very easy to acquire it again. The new summarizer algorithm uses the sentence boundary information to create more meaningful summaries (ones that start and maybe stop at sentence boundaries).
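The idea of sentence-aligned summaries can be sketched roughly as follows. This is not the actual Linger summarizer: the `<doc>` and `<s>` element names, the `summarize` function and the scoring rule (count of query terms per sentence) are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET


def summarize(linger_xml, query_terms, max_sentences=2):
    """Build a summary out of whole sentences that mention the query terms.

    Assumes a hypothetical markup where each sentence is wrapped in <s>.
    """
    root = ET.fromstring(linger_xml)
    sentences = ["".join(s.itertext()).strip() for s in root.iter("s")]
    terms = {t.lower() for t in query_terms}

    def score(sentence):
        # Number of distinct query terms that appear in the sentence.
        words = {w.strip(".,;:!?\"'()").lower() for w in sentence.split()}
        return len(words & terms)

    # Rank sentences by score, keep the best ones, then restore document order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(i for i in ranked[:max_sentences]
                    if score(sentences[i]) > 0)
    return " ... ".join(sentences[i] for i in chosen)


doc = ("<doc><s>Lucene is a search library.</s> "
       "<s>The weather was nice.</s> "
       "<s>Linger adds markup for Lucene.</s></doc>")
print(summarize(doc, ["lucene"]))
```

Because every fragment is a complete sentence, the summary never starts or stops mid-sentence, which is the property the stored sentence boundaries are there to provide.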

Tokenization of the Linger format is made to resemble Lucene’s standard analysis tokens. LingerHandler is a Lucene Tokenizer and produces Lucene Tokens, thus it can be used as part of the indexing process instead of Lucene’s standard tokenizer and analyzer. LingerHandler adds two new types of tokens to match the two types of semantic markup: the End of Sentence token and the Named Entity token. As stated before, Named Entity tokens are duplicated. The duplicate token (the token with the full named entity) has a length of 0 so it can be ignored when trying to recreate the original text from tokens.
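To make the duplication concrete, here is a small Python sketch of the same idea. It is not LingerHandler (which is a Lucene Tokenizer written in Java) and the real Linger format is not shown in this post, so the `<doc>`, `<s>` and `<ne>` element names, the `Token` class and the `tokenize` function are hypothetical stand-ins.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    kind: str    # "word", "named_entity" or "end_of_sentence"
    start: int   # offset into the plain text
    length: int  # 0 for tokens that contribute no text of their own


def tokenize(linger_xml):
    """Turn hypothetical Linger-style XML into a flat token stream."""
    tokens = []
    offset = 0  # running position in the reconstructed plain text

    def emit_words(text):
        nonlocal offset
        text = text or ""
        for m in re.finditer(r"\w+", text):
            word = m.group()
            tokens.append(Token(word.lower(), "word",
                                offset + m.start(), len(word)))
        offset += len(text)

    def walk(elem):
        emit_words(elem.text)
        for child in elem:
            if child.tag == "ne":
                # Extra token for the complete entity, with length 0 so it
                # can be skipped when rebuilding the original text.
                full = "".join(child.itertext())
                tokens.append(Token(full.lower(), "named_entity", offset, 0))
            walk(child)  # the entity's individual words are still emitted
            if child.tag == "s":
                tokens.append(Token("", "end_of_sentence", offset, 0))
            emit_words(child.tail)

    walk(ET.fromstring(linger_xml))
    return tokens


doc = ("<doc><s>I visited <ne>San Francisco</ne> today.</s>"
       "<s>It rained.</s></doc>")
for token in tokenize(doc):
    print(token)
```

In this sketch "San Francisco" shows up three times in the stream: once as a zero-length `named_entity` token and once for each of its two words. Dropping every zero-length token leaves exactly the words of the original text.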

Summary
All in all, this experiment turned out pretty well. The new summaries make much more sense than the old ones (which would start in the middle of a sentence and end in the middle of the next one) and Named Entity detection works pretty well to enhance the search results. However, I did not get to do much performance testing on any of the new algorithms involved so it is hard to quantify the effect they would have on overall performance. There is definitely some more work to be done on the named entity detection (a learning algorithm would be nice) and maybe some matching enhancements to the query processor. Tokenizing and reconstructing the linger document was not as easy as I thought it would be. I’ve started considering external markup, since it might make things easier.

Filed under: Projects, Search

Using a DNS-like model for Distributed Conversations

One of the goals of cite-rel is to enable tracking of distributed conversations by aggregators (like technorati or memeorandum) over multiple blogs. Using a simple microformat like cite-rel to solve the problem has the advantage of a very low cost of entry. Any user can employ cite-rel, and any blog software, indeed any tool that publishes HTML, can support the format. The downside is that it requires a third party (the aggregator) and a possibly large amount of work from that aggregator. It is possible, however, to build a different solution to the problem that would not require any third party and would only require analyzing the conversations each participant is a part of.

DNS is a distributed publishing mechanism. Each DNS server is in charge of only a small subset of the entire domain system but using recursive queries each server can serve information about every public domain. Recursive queries work as a distributed search mechanism that leads your DNS server to the servers with the authority to answer the query. Your DNS server then caches the reply for a limited time so that repeated queries for the same domain would be served faster. We can employ a similar solution with blogs.

To implement such a solution we require two elements:

  1. Support for recursive queries.
  2. A search mechanism.

The search mechanism can be based on existing web technologies – pingbacks. Blog software already supports sending pingbacks and tracking them, which allows blogs to store references to all replying posts. Further, when posting a reply to a post on a different blog, the blog software can keep track of the original post. With these two mechanisms in place we can completely reconstruct the entire thread, even though each blog only stores links to directly connected posts.

Recursive queries for thread information will come in two different flavors.

  1. A request for an entire thread from a specific post. This type of request will be recursively redirected to the blog hosting the original post, unless it is already there.
  2. A recursive request for all replies starting with a specific post in a thread. This request will recursively propagate down using pingback information to all blogs that published replies to this post.

Using these two simple requests any blog can give access to full threads for every post published on it. If we add a simple caching mechanism, the performance of the system should improve dramatically without using too much space.
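Here is a rough in-memory sketch of the two request types plus a time-limited cache. Everything in it is an assumption for illustration: the `Blog` class, the `blogname/post-id` URL scheme, the `network` dictionary that stands in for HTTP requests between blogs, and the 300 second cache lifetime. A direct method call also plays the role of the pingback.

```python
import time


class Blog:
    def __init__(self, name, network, cache_ttl=300.0):
        self.name = name
        self.network = network    # name -> Blog; stands in for HTTP
        self.parent = {}          # post URL -> URL of the post it replies to
        self.replies = {}         # post URL -> reply URLs learned via pingback
        self.cache = {}           # root URL -> (expiry time, thread)
        self.cache_ttl = cache_ttl
        network[name] = self

    def _host(self, url):
        # Hypothetical URL scheme: "<blog name>/<post id>".
        return self.network[url.split("/")[0]]

    def publish(self, url, in_reply_to=None):
        self.parent[url] = in_reply_to
        self.replies.setdefault(url, [])
        if in_reply_to is not None:
            # The "pingback": tell the blog we reply to about this post.
            target = self._host(in_reply_to)
            target.replies.setdefault(in_reply_to, []).append(url)

    def replies_from(self, url):
        """Request 2: this post and, recursively, every reply below it."""
        children = [self._host(r).replies_from(r)
                    for r in self.replies.get(url, [])]
        return {"url": url, "replies": children}

    def full_thread(self, url):
        """Request 1: redirect up to the original post, then expand down."""
        parent = self.parent.get(url)
        if parent is not None:
            return self._host(parent).full_thread(parent)
        expiry, thread = self.cache.get(url, (0.0, None))
        if thread is None or time.monotonic() >= expiry:
            thread = self.replies_from(url)
            self.cache[url] = (time.monotonic() + self.cache_ttl, thread)
        return thread


net = {}
a, b, c = Blog("a", net), Blog("b", net), Blog("c", net)
a.publish("a/1")
b.publish("b/1", in_reply_to="a/1")
c.publish("c/1", in_reply_to="b/1")
a.publish("a/2", in_reply_to="a/1")
print(c.full_thread("c/1"))
```

Asking blog `c` for the thread of its own reply walks up through `b` to `a`, which then fans out downward. As with DNS, a cached thread may be stale until its lifetime runs out, so the lifetime is a trade-off between freshness and load.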

Filed under: Aggregation, MicroFormats, Projects

Distributed Social Anything

Following are generally unstructured thoughts and plans for a possible project. I’ve been thinking about something in this vein for a while but have never put those thoughts into a more permanent form, so here goes. This post serves mostly as scratch paper for my ideas so feel free to skip it if you don’t like long, raw, technical posts.

For lack of a better name, I’m calling this Distributed Social Anything. The most concise description I can come up with is distributed Tribe.net. Completely distributed (and then aggregated for convenience :).

  • All content published and owned by the users.
  • All content is accessible by any would-be aggregator and formatted according to open standards (mostly microformats).
  • Based on existing tools and technologies. The main publishing tool is a blog.
  • Compatible with current tools. The requirements to participate are few, users of most blog hosting services should be able to participate.

Features and concepts:

  • Identity is defined by a URL. Currently the entities in the system are users and groups. Both will have a canonical URL that contains at least XFN data. This XFN data (slightly expanded) defines the standard social network for users but also group membership.
  • Reciprocal XFN links might be required for some of the relations defined later. This is optional and left to aggregators to decide.
  • Group membership is published in XFN. This might require reciprocal links between users and groups.
  • Users can publish information about themselves using XFN and hCard. Use rel=”me” to link to additional shards of identity.
  • Users publish content on their blog. This content is later aggregated by groups to create a coherent group view.
  • Channels are feeds of blog posts that belong to a specific set. Channels are defined by tags or categories. Each group has at least one channel. Posts marked with that channel’s ID will be part of that group’s discussions.
  • Discussions are annotated using citeRel. Group aggregators might display those in a threaded format.
  • Displaying previous versions of posts (in case of editing) with a diff view would be nice
  • XFN links should be aggregated and searchable (similar to rubhub.com). A service that offers search in the XFN space would be very nice
  • Group aggregators should also be aware of rich data (events, listings, etc.)
  • The group site might be able to highlight specific types of rich data (images, bookmarks, etc.) and/or offer access to it using feeds/API
  • We need administrative control of the group – membership, post moderation, rules, access-control, etc.
  • API for the group aggregator
  • Note: group aggregators can collect content from many sources, not just blogs (e.g. flickr, delicious)
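As a rough idea of what an aggregator would do with the XFN data, the sketch below pulls `rel` values out of the links on a page using only Python’s standard library. The `RelLinkParser` and `extract_relations` names and the example.org URLs are made up for illustration. Standard XFN values such as `friend`, `met` and `me` come out as-is; a group membership value would be collected the same way once one is defined.

```python
from html.parser import HTMLParser


class RelLinkParser(HTMLParser):
    """Collects (href, rel values) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # rel holds a space-separated list of relationship values.
        rels = set((attrs.get("rel") or "").split())
        if rels and attrs.get("href"):
            self.links.append((attrs["href"], rels))


def extract_relations(html):
    parser = RelLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


page = (
    '<a href="http://example.org/alice" rel="friend met">Alice</a> '
    '<a href="http://example.org/me-elsewhere" rel="me">My other site</a> '
    '<a href="http://example.org/unrelated">No relationship</a>'
)
for href, rels in extract_relations(page):
    print(href, sorted(rels))
```

An aggregator that wants reciprocal links would run the same extraction on the linked page and check that it points back.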

Existing support:

  • WordPress supports feeds for categories. Also posts can belong to more than one category. Free channels!
  • RubHub does some XFN search but does not seem to be open source 😦

To do:

  • Express group membership using XFN (rel=”memberof” ?)
  • Finalize citeRel.
  • Expand and improve on structured blogging.
  • A format for publishing group information.
  • Possibly replicate and improve on RubHub.
  • The group aggregator service.

Filed under: Aggregation, MicroFormats, Projects, Tagging

Smells Like Teen Spirit

A friend recently wrote the following in an Email:

Certainly seems like the virgin web is growing up quickly.

Unfortunately, I see the exact opposite trend. To me, being mature, or a “grown up,” means owning up to your actions and taking responsibility for what you do and what you say. What I’ve been seeing lately is the proliferation of poorly conceived, high-school level thuggery. The list of completely pointless, mean, crass and hateful blogs grows on an almost daily basis. I’m afraid to think that Supr.c.ilio.us: The Blog might have been an inspiration to some of those people, as what they do is diametrically opposed to the vision Ryan and I had when we started blogging on Supr.c.ilio.us.

When we created Supr.c.ilio.us (the tagging site), our intent was to show people, through satire, the silliness behind creating more and more tagging sites, sites that offered little more than the opportunity to tag yet another type of item and existed only because of the rising popularity of del.icio.us. When we expanded that to create Web 2 or Not, we were trying to make people see that obsessing over the definition of Web 2.0 is not nearly as important as actually contributing, creating new sites and coming up with new solutions to problems.

Supr.c.ilio.us: The Blog followed in those footsteps. Trying to get people to take themselves a little less seriously and not fall to their own hype. Oh, one more thing, everything we wrote on Supr.c.ilio.us carries our names and our faces. We’ve done nothing in secret because we’ve got nothing to be ashamed of and nothing to hide.

It could be just a matter of taste, because some people seem to find those blogs funny. I guess it’s the same kind of reaction that makes people laugh when the school bully gives the scrawny geek a wedgie in the middle of the cafeteria. I never found that funny and I find it kinda sad when it’s happening again now, when most of the participants are supposedly “adults.”

I’ll close with another quote, this time from Jack of All Blogs:

*Douche sucks. Blogebrity sucks. They are both fags. Fucking fags. I blog about shit in such a middle school way. They blog about shit in such a preschool way. Get a life both of you.

What can I say, at least he’s honest.

Filed under: The Net