August 22, 2004

Blogs, pubsub and Web Collections

I've been looking at publish/subscribe and Web-based notifications for quite some time. A few years ago I built searchalert.net in order to learn firsthand what it might take to do Web-scale notifications. Of course, I haven't done anything near large scale, but it still gives me a chance to write software that does something fun, while my day jobs drift farther from directly coding cool stuff.

It was somewhat annoying when Google started to offer free email notifications for their search results - that's pretty much what searchalert.net was doing. As a weekend project, I never expected searchalert to become a company, but that event put an end to the whole notifications-as-business idea.

Recently I've started looking at what pubsub.com is doing - their recent performance numbers sounded intriguing (something like 2.4 million 'matches' per second).

In order to understand what all these have in common, I've tried to apply REST architectural concepts - specifically, resource modelling - and from there I'll try to predict where these will go next and who else will get into the game.

So let's start with pubsub.com - they provide subscription and notification technology for notifying users about web logs, newsgroups and EDGAR filings. Now let's compare with searchalert.net - they provide subscription and notification technology for notifying users about web search results. And lastly, Google - they provide subscription and notification technology for notifying users about web search results. (Notice that searchalert.net and Google are annoyingly similar.)

Even though pubsub.com talks about publish/subscribe technology, there isn't any 'publish' in their technology. This isn't a bad thing; the Web is full of publishing technologies to choose from.

Let's break things down into subscriptions and notifications. In the area of subscribing, pubsub.com supports subscriptions over three sets of data - blogs, newsgroups and EDGAR financial filings. Both searchalert.net and Google support subscriptions over two sets of data - general web search results and web search results focused on 'news'.

Subscription management varies across these three providers. pubsub.com will list your subscriptions and give you edit and delete capabilities. searchalert.net will also list your subscriptions and provide edit and delete capabilities. Google has no way of listing or modifying subscriptions, but you can cancel one from a link within the email notification itself.
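
None of the three exposes exactly this, but as a minimal sketch of what 'list, edit, delete' looks like when subscriptions are plain Web resources (the /subscriptions collection and its host are made up):

```python
import urllib.request

BASE = "http://example.com/subscriptions"  # hypothetical endpoint

def call(method, url, body=None):
    # One tiny helper covers every verb we need.
    req = urllib.request.Request(url, data=body, method=method)
    with urllib.request.urlopen(req) as resp:
        return resp.read()

listing = call("GET", BASE)                       # list your subscriptions
call("PUT", BASE + "/42", b"terms=pubsub+atom")   # edit one
call("DELETE", BASE + "/42")                      # cancel one
```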

For notifications, pubsub.com supports Jabber-based IM and an RSS or Atom feed hosted on their site. searchalert.net supports daily or immediate notifications via email as well as Web notifications - sending XML to a Web address of your choice (this includes Atom-formatted XML to your blog, Weblogger API calls, etc.). Searchalert also has search results in RSS, but it isn't a public feature; the load of scaling it would be too much at this point. Google supports daily email notifications.
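
To make 'sending XML to a Web address of your choice' concrete, here's a rough sketch of that style of notification. The endpoint URL and the entry contents are invented; a real service would send whatever format the subscriber registered for:

```python
import urllib.request

ATOM_ENTRY = """<?xml version="1.0" encoding="utf-8"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>New result for 'publish subscribe'</title>
  <link href="http://example.com/some-new-result"/>
  <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
  <updated>2004-08-22T12:00:00Z</updated>
  <summary>A new item matched your saved search.</summary>
</entry>
"""

def notify(subscriber_url):
    # POST the Atom entry to the address the subscriber gave us.
    req = urllib.request.Request(
        subscriber_url,
        data=ATOM_ENTRY.encode("utf-8"),
        headers={"Content-Type": "application/atom+xml"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 2xx means the subscriber accepted it

notify("http://example.com/blog/atom-endpoint")  # hypothetical address
```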

So how do REST and resource modelling fit into all this, and how RESTful are these different approaches?

In each of these systems, a user provides search terms and the system sends notifications about the search results - basically a 'saved search'. There are two resources - the search and the collection of search results. It's the collection of search results that is key: the notification system pays attention to this resource, and items added to this collection generate a notification. Theoretically, items removed or re-sequenced could also generate notifications. Essentially, each of these systems is a large search index. The resource that is the collection of search results has several representations. For Google, an HTML representation is their first choice (and how they became rich and famous). For pubsub.com, it's Atom-flavored XML.
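
A minimal sketch of that model, with a toy run_search() standing in for the real index - notifications fall out of diffing successive states of the result collection:

```python
def run_search(terms):
    # Toy index: in real life this is Google, pubsub.com, etc.
    return [("item-1", "First match"), ("item-2", "Newly added match")]

def poll(terms, previous_ids):
    results = run_search(terms)
    current_ids = {item_id for item_id, _ in results}
    added = current_ids - previous_ids
    # Items removed or re-sequenced could trigger notifications too,
    # but additions are what these services actually report.
    for item_id, title in results:
        if item_id in added:
            print("notify subscriber: new item:", title)
    return current_ids

seen = poll("publish subscribe", {"item-1"})
# -> notify subscriber: new item: Newly added match
```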

The resource that is the search results is interesting, because it is so similar to what a blog is - a collection of items. An RSS or Atom feed is essentially a format for a list of items. (Simple HTML with unordered lists and list items could do the same, but where is the glory in that?) Blogs are generated manually by an author or editor; feeds can be generated automatically - from a search index, for example. Search results are lists of items - that's what pubsub.com produces for blogs and Google produces for the Web and for news. And it's what Amazon does with popular products. All of these are just Web resources that are search results generated from a large search index.
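
To underline the point, a small sketch that serializes any list of items as an Atom feed - the titles, URLs and timestamps are placeholders, and a real feed would need proper ids and dates:

```python
from xml.sax.saxutils import escape

def as_atom(feed_title, items):
    entries = "".join(
        '  <entry>\n'
        f'    <title>{escape(title)}</title>\n'
        f'    <link href="{escape(url)}"/>\n'
        f'    <id>{escape(url)}</id>\n'
        '    <updated>2004-08-22T12:00:00Z</updated>\n'
        '  </entry>\n'
        for title, url in items
    )
    return (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        '<feed xmlns="http://www.w3.org/2005/Atom">\n'
        f'  <title>{escape(feed_title)}</title>\n'
        '  <updated>2004-08-22T12:00:00Z</updated>\n'
        f'{entries}'
        '</feed>\n'
    )

# The same function serves a hand-authored blog and a machine-made
# result set alike - a feed is just a list of items.
print(as_atom("Results for 'pubsub'",
              [("Some matching post", "http://example.com/post/1")]))
```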

I think pubsub.com is in a losing game, because creating, hosting and serving up very large search indices is the core business of Google, Amazon and others. There are companies such as Technorati that do this just for blogs. It would be straightforward for Technorati or Google to provide search results in RSS and Atom. Instant competition.

So how RESTful are these approaches?

Unfortunately, I don't have the time to do a full analysis, but here's what I've found so far:

  • Google supports creating a resource for search results in one step - merely put the search terms in the URI. Both RESTful and useful.
  • Technorati supports creating a resource for search results in one step - merely put the search terms in the URI. Both RESTful and useful. They should add a 'view as Atom' button (easy) or a 'tell me when this changes' button (harder) and they'd be in the pub/sub business too.
  • searchalert.net and pubsub.com require a two-step process - submit the search terms and a magic URI is created. RESTful but not the most useful; see the first sketch after this list. (Shame on me for doing it this way... I see another weekend project coming up.)
  • pubsub.com provides an XML representation of search results, but they also use client-side stylesheets to display the results as HTML in a browser. Nicely RESTful and extra credit for using client-side processing in a standards-compliant way.
  • Google's email notifications have a URI to cancel your subscription, and merely visiting the page cancels the subscription. Very not RESTful and double plus ungood - a utility that automatically pre-fetches pages referenced in your email would auto-cancel a lot of things. The second sketch below shows a safer shape.
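
To make the one-step/two-step distinction concrete, a small sketch (Google's q parameter is real; the two-step endpoint is made up):

```python
from urllib.parse import urlencode

terms = "publish subscribe"

# One step (Google/Technorati style): the search terms *are* the
# address, so any client can mint the resource's URI itself.
one_step = "http://www.google.com/search?" + urlencode({"q": terms})
print(one_step)  # http://www.google.com/search?q=publish+subscribe

# Two steps (searchalert.net/pubsub.com style): POST the terms, then
# get handed back an opaque 'magic' URI only the server can mint.
# Sketched as comments, since the endpoint is hypothetical:
#   response  = http_post("http://example.com/subscriptions", {"terms": terms})
#   magic_uri = response.headers["Location"]
```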

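And here is the safer shape for cancellation, sketched with Flask-style routing (the routes and storage call are hypothetical): the emailed link is a safe GET that only shows a confirmation form, and only an unsafe POST actually cancels - so a prefetcher that follows the link changes nothing.

```python
from flask import Flask

app = Flask(__name__)

@app.route("/unsubscribe/<token>", methods=["GET"])
def confirm(token):
    # Safe: merely visiting changes nothing server-side.
    return (f'<form method="post" action="/unsubscribe/{token}">'
            '<button>Really cancel this subscription</button></form>')

@app.route("/unsubscribe/<token>", methods=["POST"])
def cancel(token):
    # Unsafe method, so prefetchers and link checkers won't trigger it.
    delete_subscription(token)
    return "Subscription cancelled."

def delete_subscription(token):
    pass  # placeholder: remove the subscription record
```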

1 comment:

Unknown said...

Thanks for the long post on PubSub, etc... I'm glad that we've built something interesting enough to motivate such a detailed analysis! However, I think you've missed a few important points about what we do.

You wrote: "Even though pubsub.com talks about publish/subscribe technology, there isn't any 'publish' in their technology." The "publish" stuff is certainly in our technology. What we haven't done yet is expose it very widely. Currently, it is only used by our internal processes and the systems of a few select publishers. In the future, we'll be exposing it. Currently, we support (internally) publishing via REST, SOAP, XML-RPC, or over XMPP. External publishers access these publishing interfaces indirectly by posting things in their blogs or by pinging using XML-RPC...

On REST: "REST" architectures are, unfortunately, not very useful when it comes to providing notification -- which is one of the key features of PubSub.com. We've learned this the hard way by implementing a REST API and then seeing almost everyone who used it complain that they couldn't get notifications through firewalls, or that they could only get notifications sent to servers on the open net but not to their desktop client programs. That was a major motivator behind our adoption of the XMPP XML streaming protocol. Since it relies on persistent connections, it allows us to pass through firewalls and reach the desktop. Using XMPP, we can deliver notifications to the desktop using an IETF standard protocol. Using REST, we can't.

There is a big difference between the files that you retrieve from PubSub and those that you get from Technorati or Feedster or Google (if Google supported results in RSS/Atom). The difference is that the PubSub files are static files that are updated when data arrives, rather than generated on the fly in response to a query. The PubSub approach is massively more scalable, since the amount of work that we need to do to respond to any request is tiny -- all we do is feed the file to you. This is different from other systems that do a tremendous amount of work whenever they get polled by a client -- they execute the search again. Basically, by constantly working on behalf of the customer, we avoid any need for "burst" processing. The result is a much, much smoother and more predictable resource consumption profile.

Thanks for taking the time to think about what we do, I hope the comments are useful.

bob wyman
CTO, PubSub Concepts, Inc.
http://bobwyman.pubsub.com/