Blog search engine frenzy

Exciting times in blog search country:

Question: can someone make a meta-blog-search-service (like an online Copernic for blogs) that:

  • searches all “who-links-to-me” blog search engines
  • filters out the doubles
  • creates a weighted combined ranking (a referrer featured on 5 search engines ranks higher than a site only listed once)
  • can use a date parameter (“only show posts younger than 1 month”)
  • gives us some sexy graphics like Blogpulse does
  • offers its results via RSS
  • has a viral component, like e.g. a counter I can display on my blog – something like Feedburner’s Awareness API (which always makes me think of Jean-Claude Van Damme’s legendary “AWARE” theory)
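The dedupe, weighted-ranking and date-filter bullets above could be sketched roughly like this. The engine names, data shapes and tie-breaking rule are my own invented assumptions, not any real service’s API:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def combined_ranking(engine_results, max_age_days=30):
    """Merge per-engine result lists, drop duplicate URLs, and rank a
    referrer higher the more engines list it; ties are broken by the
    best rank it got on any single engine.

    engine_results: dict mapping engine name -> ordered list of
    (url, post_date) tuples, best result first.
    """
    cutoff = datetime.now() - timedelta(days=max_age_days)
    scores = defaultdict(lambda: {"engines": 0, "best_rank": float("inf")})
    for engine, results in engine_results.items():
        for rank, (url, posted) in enumerate(results, start=1):
            if posted < cutoff:
                continue  # date parameter: only posts younger than max_age_days
            entry = scores[url]
            entry["engines"] += 1  # a URL found on 5 engines outranks one listed once
            entry["best_rank"] = min(entry["best_rank"], rank)
    return sorted(scores,
                  key=lambda u: (-scores[u]["engines"], scores[u]["best_rank"]))
```

Feeding that list into an RSS template would cover the “offers its results via RSS” bullet as well.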

Anyone aware of a service like this? Please leave a comment!

Pareto doesn’t do search

I’m gonna talk about a post that is 6 months old, I know, but I recently re-read it and wanted to link it to Technorati’s recent traffic troubles.
Joe Kraus from JotSpot (and previously Excite) wrote an excellent article called “The Long Tail of Software. Millions of Markets of Dozens.” I’ll concentrate on the following segments:

(about Excite) That means if you wrote each of the millions of queries on a slip of paper, put them all in a fish bowl and grabbed one at random, there was a high likelihood that this query was asked only once during the day. Of ten-plus million queries a day, the average search was nearly unique. The most interesting statistic however, was that while the top 10 searches were thousands of times more popular than the average search, these top-10 searches represented only 3% of our total volume. 97% of our traffic came from the “long tail”, queries asked a little over once a day.

Now apply that information to Technorati: it has been struggling for a while with the “Cosmos” function, which shows which blogs link to a specific page. As Dave Sifry says:

However, Cosmos search (or URL search) is still being worked on, and is often timing out under the increased load. Unfortunately this is also one of the searches that bloggers find most compelling, as it helps you to know who is linking to your blog, and it is the very first type of search that Technorati made available, so it is near and dear to our hearts. Everyone here also uses it every day, so it really sucks when it isn’t working right.

Even if “search is hard”, the average web user is spoiled by Google: a random search across 8 billion objects comes back in less than a second. That is the benchmark. Technorati has no problem doing that for its tag search, but that is way easier. There are 2 million different tags, but I would expect more than 50% of traffic to come from a limited group of, say, 10,000 tags. Just put a cluster of reverse proxy caches in front of your tag search servers, keeping copies of each page for 15 minutes, and the number of search results that actually need to be generated drops dramatically.
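The caching arithmetic above is easy to model. Here is a toy in-process version of the reverse-proxy idea (a real deployment would use Squid or similar; the class and field names are mine):

```python
import time

class TTLCache:
    """Serve a cached copy of a tag-search page for up to `ttl` seconds
    before asking the expensive search backend again."""

    def __init__(self, backend, ttl=15 * 60):
        self.backend = backend   # the expensive search function
        self.ttl = ttl           # 15 minutes, as suggested above
        self.store = {}          # tag -> (timestamp, cached page)
        self.backend_calls = 0   # how much real search work we did

    def get(self, tag):
        now = time.time()
        hit = self.store.get(tag)
        if hit and now - hit[0] < self.ttl:
            return hit[1]        # cache hit: zero backend work
        self.backend_calls += 1
        page = self.backend(tag)
        self.store[tag] = (now, page)
        return page
```

If more than half the tag traffic really hits a group of 10,000 popular tags, each of those tags costs the backend at most four searches an hour, no matter how many readers request it.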
But search results on a random combination of words, several million times a day, within 17 million sites and 1.5 billion links in your database: “the average search is nearly unique”, and that’s hard. You need an expensive Google-like architecture to cope with that. Unfortunately for Technorati, competitors like Icerocket, Feedster, Pubsub and Blogpulse are capable of doing it, either because they have less traffic or a better architecture. So some high-profile bloggers like Kottke and Calacanis are jumping ship.

Technorati has already burned some credit, but it could survive if it can perform to Google standards within weeks, not months. Otherwise it will not be the weapon of choice for bloggers’ vanity searches. And that’s how this blogging thing got started in the first place.

Google Desktop: buggy stuff

I regret having installed the Google Desktop Beta. I thought they would have ironed out the biggest bugs, but my first experience is not reassuring.

  • I installed it on my P4 2.8GHz with about 100GB of data (much of it CD and DVD copies – large files that need no indexing).
  • The indexing process has been running for more than 5 days. Every now and then the progress bar hangs (e.g. it stays at 27% after 12 hours) although the process keeps running. Reboot necessary.
  • the GoogleDesktopIndex.exe process runs continuously at 50% CPU, which makes the computer slower, and often at 100%, at which point the only option left is to reboot.
  • the “Sidebar” is very buggy: it remains “Loading” forever, cannot be stopped and does not update. I think it’s the “Images” component. When it crashes: reboot required.
  • they’ve used ActiveX everywhere. Why not Python with py2exe? Then the Sidebar would run on every platform instead of just on Windows, and it would be easier and safer to write plugins.
  • the Google AdSense plugin sounds like a good idea, but I could not get it working: the authentication failed every time.

So I remain unconvinced about the whole desktop search thing. Bye bye, Google Desktop.

What would be a good idea in the meantime: a publishers’ API for (read-only) access to Google AdSense statistics. How about an RSS feed (an ‘AdStatFeed’)? A nice simple MRTG-like graph? A web-based version of CSVAdStats? How hard can that be?
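To show how little is involved, here is a sketch of what that hypothetical ‘AdStatFeed’ could look like: daily stats rows wrapped in a minimal RSS 2.0 document. The field names are invented and Google offers no such feed – this only demonstrates the idea:

```python
from xml.etree.ElementTree import Element, SubElement, tostring

def adstat_feed(rows):
    """Wrap daily ad-statistics rows (date, impressions, clicks,
    earnings) in a minimal RSS 2.0 document, one <item> per day."""
    rss = Element("rss", version="2.0")
    channel = SubElement(rss, "channel")
    SubElement(channel, "title").text = "AdStatFeed"
    SubElement(channel, "description").text = "Daily ad statistics"
    for date, impressions, clicks, earnings in rows:
        item = SubElement(channel, "item")
        SubElement(item, "title").text = f"{date}: ${earnings:.2f}"
        SubElement(item, "description").text = (
            f"{impressions} impressions, {clicks} clicks"
        )
    return tostring(rss, encoding="unicode")
```

Any feed reader could then poll your stats, and an MRTG-style graph is just a matter of plotting the same rows.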