Monthly Archive for August, 2005

Detecting Blogspot splogs the Bayesian way


SHORT HISTORY OF SPLOGS

SPLOG SOLUTIONS
There are several strategies to make life harder for sploggers, while not making it too hard for regular bloggers:

prevent the (automatic) creation of a splog site 

CAPTCHA SYSTEMS: useless. In the end, one can always outsource creation of blogs to ‘real people’ in India at less than $1/blog 

EMAIL CONFIRMATION: can be too easily automated 

NO MORE FREE BLOGS: charge a fee (per month, like Typepad) or a kind of deposit (e.g. $50 - returned when you stop blogging, or confiscated when you run a splog)

THOROUGH ID CHECK: so you can trace back splogs to the actual user. Not realistic, I’m afraid. Anonymity can sometimes be a positive thing. 

prevent (automatic) posting of splog items 

NO MORE API: no posting via email, REST, SOAP, XML-RPC, Atom, … That’s throwing out the baby with the bathwater. 

CAPTCHA SYSTEMS: require human interaction for each post. That is fine for web-based editors, but what about the APIs above?

CHECK CONTENT: process the content of each post and try to decide if it’s splog-like. See the chapter below on Bayesian Detection 

detect splogs when they are created 

AUTOMATED DETECTION: using systems similar to the ones now employed for email spam detection. See the chapter below on Bayesian Detection. 

GROUP EFFORT: projects like splogreporter.com, Blogger’s “flag as objectionable”, Google Adsense abuse report. Maybe an initiative such as “Razor” for splogs? 

prevent the benefits of splog usage 

PAGERANK 0: when in doubt on whether a blog is a splog, reset the PageRank to 0 

REL NOFOLLOW: make all links worthless for search engines. This is an easier decision to take than deleting the whole blog (because if you delete a blog that’s not a splog …)

REDIRECT SCRIPT: replace all links to www.example.com with e.g. www.blogger.com/r?url=www.example.com, so that the redirects work but do not transfer Google juice. The ‘old’ way of doing rel=nofollow.

EDUCATION: there are way too many gullible people around who actually buy bogus stuff on the web. No one should leave high school without knowing what SPAM is. You do not buy V*AGRA on the web!!
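
The two link-devaluing options above (REL NOFOLLOW and REDIRECT SCRIPT) could be sketched roughly as follows. This is only an illustration under assumptions: the redirect prefix is the made-up www.blogger.com/r?url= example from the text, the function names are mine, and a regular expression is a crude stand-in for a real HTML parser.

```python
import re
from urllib.parse import quote

# Hypothetical redirect endpoint, taken from the example in the text.
REDIRECT_PREFIX = "http://www.blogger.com/r?url="

HREF_RE = re.compile(r'href="(https?://[^"]+)"', re.IGNORECASE)
# Matches <a ...> tags that do not yet carry a rel attribute.
A_WITHOUT_REL_RE = re.compile(r'<a\b(?![^>]*\brel=)([^>]*)>', re.IGNORECASE)


def redirect_links(html: str) -> str:
    """Point every absolute link at the redirect script instead of its target."""
    def repl(match):
        target = quote(match.group(1), safe="")  # URL-encode the original target
        return 'href="' + REDIRECT_PREFIX + target + '"'
    return HREF_RE.sub(repl, html)


def nofollow_links(html: str) -> str:
    """Add rel="nofollow" to every <a> tag that has no rel attribute."""
    return A_WITHOUT_REL_RE.sub(r'<a rel="nofollow"\1>', html)


if __name__ == "__main__":
    post = '<p>Buy <a href="http://www.example.com">here</a></p>'
    print(redirect_links(post))
    print(nofollow_links(post))
```

Either rewrite leaves the link clickable for a human reader, which is the point: the blog stays online while the doubt is being resolved.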

BAYESIAN DETECTION OF SPLOGS
Kailash Nadh has already started a project, but let’s take it one step further. I’d like to go back to the beginning of email spam fighting: most of the efforts started after an article by Paul Graham: “A plan for spam”. He proposed Bayesian Filtering as a better method for detecting spam than using whitelists/blacklists. In his follow-up article, he says:

I don’t think it’s a good idea to treat spam filtering as a straight text classification problem. You can use text classification techniques, but solutions can and should reflect the fact that the text is email, and spam in particular. Email is not just text; it has structure.

He then suggests treating the typical email fields: sender, destination, subject, date differently from the email body.

Let’s try to pinpoint the specifics of a blog (they are not the same as emails):

  • a blog has a title, url, meta tags, and HTML content (which does not appear in the RSS/Atom feed) 
  • a blog post has a title, url, date, metatags and HTML content (which does appear in the feed)
  • the HTML of the blog or of each post can contain links (<a href="http://www.example.com">)

So intuitively I suspect the indicators of a splog will be (using car-deals-and-more.blogspot.com as example):

Keyword stuffing 

some words will appear over and over again, with small variations. In our example: over 3% of all words are “car”, 1.7% are “loan” and 1.09% are “credit”. Keywords in the URLs or in the domain name are an even stronger indication of a splog.

Nonsense content 

some sites use really dumb search engine scraping to generate content, which fills the posts with script URLs and lists of keywords without any grammatical construction.

Repetitive linking 

since a splog will be started to promote one product or site, the links will typically point to the same site and maybe to other ‘joined’ splogs. (in our example: always to www.car-financing.online-auto-center.info, a 41-character(?!) domain name) 

Post frequency 

Since recency and frequency are so important for a search engine, the splog will have many posts/day. If the owner is an idiot, the posts will all appear at exactly 2 minutes past the hour, every hour; presumably most will use something more random. One might do an RFM analysis on this.
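
To make the first and third indicators a bit more concrete, here is a minimal sketch of how one might measure keyword density and how concentrated the outgoing links are. The function names and the very naive tokenizer are my own assumptions, not part of any existing tool.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def keyword_density(text, top=3):
    """Return the `top` most frequent words with their share of all words."""
    words = re.findall(r"[a-z0-9']+", text.lower())  # naive tokenizer (assumption)
    if not words:
        return []
    counts = Counter(words)
    return [(word, count / len(words)) for word, count in counts.most_common(top)]


def top_link_share(urls):
    """Return the most linked-to domain and the fraction of links pointing to it."""
    domains = [urlparse(url).netloc.lower() for url in urls]
    domains = [d for d in domains if d]  # skip relative or malformed links
    if not domains:
        return "", 0.0
    domain, count = Counter(domains).most_common(1)[0]
    return domain, count / len(domains)


if __name__ == "__main__":
    print(keyword_density("car loan car credit car", top=2))
    print(top_link_share(["http://a.example/x", "http://a.example/y", "http://b.example/"]))
```

A regular blog would normally show a low share for its top word (stop words aside) and links spread over many domains; what counts as "too high" is something the Bayesian model would have to learn from examples.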

So one would have to adapt the Bayesian model as follows:
- add BLOGTITLE*whatever for each word in the blog title
- add BLOGPAGE*whatever for each word in the blog URL or domain (separated by “.”, “-” or “_”)
- add POSTTITLE*whatever for each word in the post title
- add POSTPAGE*whatever for each word in the post page URL (separated by “.”, “-” or “_”)
- add LINKURL*http://…/path/to/page.html for each link
- add LINKDOMAIN*http://www.example.com for each link
- calculate a POSTFREQ (posts/day stat) and POSTFREQDEV (standard deviation)
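
As a rough sketch of the token scheme above: the prefixes (BLOGTITLE*, BLOGPAGE*, …) come from the list, everything else (function names, the tokenizer, also splitting URL paths on “/”, using the population standard deviation) is an assumption for illustration.

```python
import re
from statistics import mean, pstdev
from urllib.parse import urlparse


def text_words(text):
    """Lower-cased words of a title (naive tokenizer, assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def url_words(url):
    """Words of a URL's host and path, split on '.', '-', '_' (and '/', my addition)."""
    parts = urlparse(url)
    return [w for w in re.split(r"[./_\-]+", (parts.netloc + parts.path).lower()) if w]


def link_tokens(href):
    """One LINKURL and one LINKDOMAIN token per link."""
    parts = urlparse(href)
    return ["LINKURL*" + href, "LINKDOMAIN*" + parts.scheme + "://" + parts.netloc]


def blog_tokens(blog_title, blog_url, post_title, post_url, links):
    """Build the prefixed tokens that would be fed to the Bayesian filter."""
    tokens = ["BLOGTITLE*" + w for w in text_words(blog_title)]
    tokens += ["BLOGPAGE*" + w for w in url_words(blog_url)]
    tokens += ["POSTTITLE*" + w for w in text_words(post_title)]
    tokens += ["POSTPAGE*" + w for w in url_words(post_url)]
    for href in links:
        tokens += link_tokens(href)
    return tokens


def post_frequency(posts_per_day):
    """POSTFREQ (mean posts/day) and POSTFREQDEV (standard deviation)."""
    return mean(posts_per_day), pstdev(posts_per_day)


if __name__ == "__main__":
    print(blog_tokens("Car Deals", "http://car-deals-and-more.blogspot.com",
                      "Cheap loan", "http://car-deals-and-more.blogspot.com/2005/08/cheap-loan.html",
                      ["http://www.example.com/path/to/page.html"]))
    print(post_frequency([24, 24, 24]))
```

The prefixes keep “car” in a blog title and “car” in a domain name as separate pieces of evidence, which is the same trick Graham describes for email headers.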

The Bayesian Filtering would then place a ‘splog probability’ on each blog and a company like Blogger could set up a weekly scan of each blog and do the following:
SPLOGPROB > 99%: automatically disable all outgoing links, send warning to owner, if no positive reply after 7 days: delete blog
SPLOGPROB > 95%: automatically disable all outgoing links, send warning to owner, test again after 2 days
SPLOGPROB > 90%: add rel="nofollow" to each link - add captcha to posting - next check in 1 week
SPLOGPROB > 80%: add captcha to posting - next check in 1 week
SPLOGPROB > 70%: do nothing - next check in 1 week
SPLOGPROB < 70%: do nothing - next check in 1 month
I’ll probably refine this model in the future.
All remarks welcome!
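
For what it’s worth, the escalation table above boils down to a simple lookup. A minimal sketch, assuming the probability arrives as a number between 0 and 1 and counting “1 month” as 30 days (both my assumptions, as is the function name):

```python
def splog_action(prob):
    """Map a splog probability (0.0 - 1.0) to (action, days until next check)."""
    if prob > 0.99:
        return "disable links, warn owner, delete if no positive reply", 7
    if prob > 0.95:
        return "disable links, warn owner", 2
    if prob > 0.90:
        return "add rel=nofollow to links, add captcha to posting", 7
    if prob > 0.80:
        return "add captcha to posting", 7
    if prob > 0.70:
        return "do nothing", 7
    # The table does not say what happens at exactly 70%; treated here as the lowest tier.
    return "do nothing", 30


if __name__ == "__main__":
    for p in (0.995, 0.96, 0.85, 0.5):
        print(p, splog_action(p))
```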


Flickr/Yahoo experimenting with new ad format



Flickr has been placing contextual advertising on their “tag” search pages (example: the “brussels” tag) for a while. They seemed to use a mixture of Google Adsense and Yahoo Publisher text-based ads. For people who are not that familiar with contextual ad units: there are

  • image based ads, that consist entirely of 1 graphic
  • text-based ads: a title, a 5-10 word description and a link, for one specific product - offered by both Google and Yahoo. The ads should be more or less relevant to the content of the page or the site. You can have at most 4 ads in 1 ad unit, either vertical (“skyscraper”) or horizontal (“banner”).
  • the more recent Google “link units”, a collection of one-line topic links that lead to a page full of advertisements. The advantage: let’s say you do a post on digital photography: the contextual analysis picks up the “digital camera”, and might only have space for 1 ad, so shows you an ad for buying a camera on Amazon. But with the topical link units, it can first check whether you are looking to buy a camera, or already have one and are more interested in memory cards, online image hosting or printing services. So while taking up less space, they make it possible to filter out the interested prospects and direct them to more relevant ads.

But this weekend I saw a new kind of ad format popping up on the Flickr site: let’s call them “image-enhanced topical link units” from the Yahoo Publisher Network (the left image is a screenshot). It’s probably in a test phase, since you see a headphone picture for “web cameras”, a GSM for “cameras” and another mobile phone for “home theatre” systems, and the link between those is not that obvious. But the line of thinking is logical: use the images to catch people’s attention (worth a thousand words, aren’t they) and then use the topics to filter out the interested customers. Let’s see if Google follows.


Google Desktop: buggy stuff


I regret having installed the Google Desktop Beta. I thought they would have ironed out the biggest bugs, but my first experience is not reassuring.

  • I installed it on my P4 2.8GHz with about 100GB of data (lots of it CD and DVD copies - so large files that need no indexing).
  • The indexing process has been running for more than 5 days. Every now and then the progress bar hangs (e.g. stays at 27% after 12 hours of running) although the process keeps running. Reboot necessary.
  • the GoogleDesktopIndex.exe process runs continuously at 50% CPU, which makes the computer slower, but also often at 100%, at which point the only option left is to reboot.
  • the “Sidebar” is very buggy, it remains “Loading” forever, cannot be stopped and does not update. I think it’s the “Images” component. When it crashes: reboot required.
  • they’ve used ActiveX everywhere. Why not Python with py2exe? Then the Sidebar would run on every platform instead of just on Windows, and it would be easier and safer to write plugins.
  • the Google Adsense plugin sounds like a good idea, but I did not get it working, the authentication failed every time.

So I remain to be convinced about the Desktop searching thing. Bye Bye Google Desktop.

What would be a good idea in the meantime: a publishers’ API for (read-only) access to the Google Adsense statistics. How about an RSS feed (an ‘AdStatFeed’)? A nice simple MRTG-like graph? A web-based version of CSVAdStats? How hard can that be?


Automated initial image tagging: Ojos Inc


What meta-data do we have for the average digital picture we take:

MINIMAL:

a filename, typically autogenerated by the camera (e.g. “DSC0009”) or chosen at the moment of import (e.g. “Trip to Portugal 001” or “Aug2005_001”)
a filedate, which probably corresponds to the date the picture landed on the hard disk
EXIF information: date of image capture, camera brand and model, aperture, … (maybe in the future also geo-location from a built-in GPS)

ADDED BY HUMAN HAND

a title and a description: in free text
a group/set/album name: typically less than 10 words
tags or labels: the ideal search criteria, typically added by the owner
geo coordinates: the new craze on Pixagogo (who then also add the city name as a tag), so the pictures can be mapped on Google Maps

If the human-added metadata is missing, there is hardly a way to find the picture through Google Images or Flickr. What if there could be software that analyzes a picture and automatically adds relevant metadata to it?

Munjal Shah, onetime cofounder of the auction services firm Andale, finally let slip on his new blog what he’s been working on since leaving last year (…) In other words, his startup, tentatively named Ojos (Spanish for “eyes”), is creating a new way to search and organize photos.
(…) he revealed the key technologies behind Ojos: face and text recognition. (…) The other key: You can assign tags, or keywords, to one photo and the service will automatically append that tag to other photos of the same people.
blogs.businessweek.com

I wonder if it could also be used to recognize familiar archetypes/icons like: a house, a sunset, an iPod, a Ferrari…

On his own blog, Shah writes:

I think Flickr’s tag based system is just super (in fact I love it), but I wanted all of my photos on there, I wanted them all tagged, and I didn’t want to spend hundreds of hours doing it. So being the lazy engineers that we are, we thought maybe we can at least auto-tag some of the faces and names.
on munjal.typepad.com

Ho John Lee states on his blog that the technology should be offered as a web service, not as yet another photo storage site. He has a point, and I can also see it working in a technology licensing model: let Flickr or Pixagogo run it locally and let them pay per million pictures treated. Anyway, it will be interesting to see where this company goes.
(via John)


Web feeds are like RSS, only different


There recently has been some commotion over the fact that Microsoft is introducing RSS support in the new Internet Explorer 7 (which is great), but they call them “web feeds”. Oh! My! God! They are so evil!

Actually, Microsoft has a point. Currently RSS is being used as a format to deliver all kinds of different stuff: blog posts, podcasts, images, videos, search results, weather reports, stock quotes, … While they all use RSS as underlying format, they are not all the same ‘kind’ of information. I think it makes sense to distinguish between the technology/standard format “RSS” and the usages it enjoys.

So you could have:

  • Web feeds: feeds from blogs or other web sites
  • Podcast or Audiofeeds: been around for over a year now, started out as RSS 2.0 + MP3 enclosures, but now also implies Apple iTunes and Yahoo Media extensions.
  • Photofeeds: a bit like the podcasting for images, supported by the likes of Flickr, Smugmug and Pixagogo
  • Videofeeds for ‘vlogs’: the logical successor of podcasts, but here the main issue will be: formats! Whereas MP3 works on almost every machine, there is no such universal format for video. MPG (MPEG-2)? WMV (Windows Media)? MOV (QuickTime)?
  • Search feeds: the result of a search operation as an RSS feed, from MSN, Technorati, Feedster or Blogdigger
  • Stock feeds: would contain Index, Change, Day’s Range and Year’s Range in specific extensions
  • Weather feeds: would contain expected temperature, humidity, precipitation so your home automation system can open windows or turn on the heater

In this logic the term ‘feed’ would be synonymous with ‘based upon RSS’, and this means: a ‘channel’ with one or more ‘items’, each one with at least a date, a unique ID and a title. (RSS is the cornerstone of the ‘reverse chronological’ movement)
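
To illustrate what that shared skeleton looks like, here is a small sketch that reads the title, unique ID and date of each item from an RSS 2.0 document, using only Python’s standard library. The sample feed and the function name are made up for the example; a real feed reader would also have to deal with missing elements, namespaces and broken XML.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed: one channel, one item with title, guid and pubDate.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://www.example.com/</link>
    <description>Sample channel</description>
    <item>
      <title>First post</title>
      <guid>http://www.example.com/2005/08/first</guid>
      <pubDate>Mon, 01 Aug 2005 10:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return a list of dicts with the title, unique ID and date of every item."""
    channel = ET.fromstring(rss_text).find("channel")
    return [
        {
            "title": item.findtext("title"),
            "guid": item.findtext("guid"),
            "date": item.findtext("pubDate"),
        }
        for item in channel.findall("item")
    ]


if __name__ == "__main__":
    for entry in read_items(SAMPLE_FEED):
        print(entry)
```

Whether the item carries a blog post, an MP3 or a stock quote, this part stays the same; only the extensions differ.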

Some of the arguments against

“Everyone calls it RSS”

I give you: Firefox’s Live Bookmarks

“But RSS has this whole ‘brand’ recognition thing going for it!”

Not really. Maybe for us web geeks, but not for Mr/Mrs Average Webuser. It is true that people like Dave Winer and Steve Gillmor have invested a lot of effort in evangelizing the usage of RSS, and that’s very good. The standard will not vanish just because Microsoft picks a sexier name; on the contrary.

“RSS is not per-se a difficult/unsexy name”

Of course people can remember acronyms, but only if they can visualize what they stand for:

  • a DVD is like a CD: same size, slightly different colour. They contain movies.
  • VHS cassettes are black and chunky. They contained movies in the previous century.
  • A GSM is like a phone with no wires (unless you’re charging it).
  • An SUV is a really big car to go shopping with

But try to explain to a non-technical person the differences between HTML and HTTP, CSS and PHP?

“Everyone uses the [XML] or [RSS] button on their site to indicate their feed.”

Glad you mentioned that. Why exactly would an orange [XML] button mean RSS? Isn’t Atom also XML? Aren’t KML, SOAP, … also XML? I would love to see the [XML] buttons disappear. Indicate what is important: a videofeed, OK, but is it QuickTime, DivX or Windows Media?

Dave says we should stick with RSS

Dave’s contribution to the popularity of RSS is quite considerable and he surely is entitled to his opinion. But I don’t agree on the naming issue.

“It’s not good to have multiple terms to refer to the same thing”

Correct. But it’s not because 2 systems use RSS as delivery format that they are the same thing. RSS is no longer just a way to syndicate blog postings, it’s become a building block, a bit like HTML and CSS. Personally I am more bothered in the case of folksonomy: tags = keywords = labels. That’s confusing!

“Microsoft should also support Atom.”

Atom is a comatose patient that is being kept alive by Google/Blogger. Once Blogger starts using RSS, they will have to pull the plug.


Photofeed: image podcasting

As I said in a previous blog post: it’s not logical that there is no picture podcasting yet, while the content, the devices and the technology are all there. That’s why I decided to lend the ‘loosely coupled’ movement a hand: I just set up a new project:
PHOTOFEED - IMAGE PODCASTING.

It introduces the concept of a Photofeed (an RSS 2.0 feed with image enclosures - the picture counterpart of a podcast feed) and also features a service to display photofeeds in any web site: Photoroll. I invented the term ‘photofeed’ (‘photocast’ was an earlier option, but it’s too limiting)

(Update: especially for the visitors from scripting.com)
What’s so great about a photofeed? Well, since there is an image URL specified separately and attached to each feed item, a photofeed consumer application can ‘do stuff’ with that image. So you could display the image in whatever layout you want on your site (that’s my Photoroll), you could have a photofeed screensaver, print them, make sepia thumbnails, save them to your iPod photo or PDA, …
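
As a minimal sketch of such a consumer: the snippet below picks the image enclosures out of an RSS 2.0 feed with Python’s standard library, after which you can do whatever you want with the URLs. The sample feed, the example.com URLs and the function name are made up; only the enclosure element with its url, length and type attributes comes from RSS 2.0 itself.

```python
import xml.etree.ElementTree as ET

# Made-up photofeed: one item with an image enclosure, one with an audio enclosure.
SAMPLE_PHOTOFEED = """<rss version="2.0">
  <channel>
    <title>Sunsets</title>
    <link>http://www.example.com/</link>
    <description>Sample photofeed</description>
    <item>
      <title>Sunset 1</title>
      <enclosure url="http://www.example.com/photos/sunset1.jpg" length="123456" type="image/jpeg"/>
    </item>
    <item>
      <title>Not a photo</title>
      <enclosure url="http://www.example.com/audio/ep1.mp3" length="999" type="audio/mpeg"/>
    </item>
  </channel>
</rss>"""


def image_enclosures(rss_text):
    """Return (item title, image URL) for every item with an image enclosure."""
    photos = []
    for item in ET.fromstring(rss_text).iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is not None and enclosure.get("type", "").startswith("image/"):
            photos.append((item.findtext("title", default=""), enclosure.get("url")))
    return photos


if __name__ == "__main__":
    for title, url in image_enclosures(SAMPLE_PHOTOFEED):
        print(title, url)
```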




Who already delivers photofeeds? For now, there’s Fotothing, Pixagogo and Flickr, but I hope soon other photo sharing sites will follow. They have one for each of their tags/labels, so you can have an ever changing feed of ‘sunset’ images and use it for whatever you want. If you want to make your own photofeed, consider using the Feedburner SmartCast for images, which they kindly developed upon my request (doing a ‘Hackathon’: great idea!).

What is my purpose with this? Well, I want to introduce the concept so people start playing with it and come up with new and unthought-of applications. Do you have the “Hey, I could use this to …” feeling? An original hack that does funky stuff with a photofeed? An idea for a way to add ‘fitting’ pictures to an existing text-only RSS feed? Geo-photo-feeds? Some social-software remix project? Let me know, leave a comment here or on the Photofeed Blog. Just picture it!

Inspiration and support came from people who are maybe not aware of it: Joris from Pixagogo, Eric from Feedburner, Alan from Feed2JS, Lucas from Webjay and Erwin from DopplerRadio.


Mini-backstage @ Pukkelpop

I’m gonna be responsible for managing a mini-backstage at the Pukkelpop festival this Thursday, Friday, Saturday. Mini-backstage means: the small hide-out right next to the more distant stages, where artists take their preparatory beers and coffees right before they perform. Since I did not choose my own schedule, I’m going to take care of a bunch of people I’ve mostly never heard of:

Thursday - ‘Club’ stage

Adam Green - Blood Brothers - The Departure

Friday - ‘Wablief’ stage

Millionaire - El Guapo Stuntteam - Vandal X

Saturday - ‘Chateau’ stage

Vincent Gallo - The Dresden Dolls - Whitey

But with some luck, my shift will be over when Jamie Lidell plays on the Chateau stage on Thursday. Might be a bit tight, but “that’s the use of figuring it all out” (#5 on the “Multiply” CD - addictive)!


New Google hack: Pixagogo Maps

Pixagogo has just released a feature to show images that are geo-tagged on a map (via Google Maps).

Create your own Photo Maps in 4 easy steps.
1. Enter Photo Map Title & Description
2. Upload your Photos
3. Add Geocodes & Label your Photos
4. Share your Photo Map
Try it out here.

(via pixagogo.typepad.com)
