Idea: hosted classification service

Yesterday evening I watched “How to Replace Yourself with a Very Small Shell Script”, a talk by Hilary Mason.


In short: she uses some scripts to process incoming mail and send outgoing reminders. The part that really interested me is the one where she uses classification, probably naive Bayes, to extract topics from the tweets of her friends.

That made me think about Paul Graham’s famous spam essay (2002), which boosted the development of Bayesian spam filters for email. A Bayesian spam filter will, in very broad terms, analyze the words in a message, compare them to words typically used in a ‘spam’ or ‘ham’ collection, and come back with either a binary classification (spam/ham) or a spamminess score. The first time I read that article must have been back in 2003 or 2004. I recall installing one of the early versions of POPFile, a spam filter written in Perl. It worked as a POP3 proxy and did a pretty good job. POP3 made sense, because at that time, the only spam we had was email spam. Now there’s blog spam, comment spam, trackback spam, Twitter spam …
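To make the "compare the words against a spam and a ham collection" idea a bit more concrete, here is a minimal sketch of a word-count-based naive Bayes classifier. It is a generic textbook variant (multinomial naive Bayes with add-one smoothing), not the exact formula from Graham's essay and not what POPFile does internally. The class name, the method names and the tiny training messages are all made up for illustration.

```python
import math
import re
from collections import Counter


class NaiveBayesClassifier:
    """Toy multinomial naive Bayes classifier (hypothetical names, illustration only)."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of words seen under that label
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocabulary = set()      # every distinct word seen during training

    @staticmethod
    def tokenize(text):
        # Very naive tokenizer: lowercase runs of letters, digits and apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        tokens = self.tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def log_scores(self, text):
        # Log of (prior * product of word likelihoods) for each label.
        tokens = self.tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for token in tokens:
                # Add-one (Laplace) smoothing so unseen words do not zero out the score.
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def classify(self, text):
        # Categorical answer: the label with the highest score.
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

    def spamminess(self, text):
        # Normalized score between 0 and 1 for the label "spam".
        scores = self.log_scores(text)
        top = max(scores.values())
        weights = {label: math.exp(s - top) for label, s in scores.items()}
        return weights.get("spam", 0.0) / sum(weights.values())


clf = NaiveBayesClassifier()
clf.train("cheap pills buy now", "spam")
clf.train("win money now cheap offer", "spam")
clf.train("meeting notes for the project tomorrow", "ham")
clf.train("lunch tomorrow with the project team", "ham")

print(clf.classify("buy cheap pills"))
print(clf.spamminess("buy cheap pills"))
```

Nothing in the sketch is specific to spam: the labels are just strings, so the same code could separate tweets by topic, as long as someone supplies training examples for each label.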

But these are the cloud days, right? If you think about it, Akismet (WordPress) and Mollom (Drupal) offer cloud-based spam filtering. Before them, Postini (now part of Google) offered hosted spam filtering for email. But would it be possible to offer document classification as a fully generic web service, not tied to spam at all? Imagine the service

Sounds like something Google would offer? Well, they do, in some way: Now if someone would develop a nice and easy interface around it …
