Monthly Archive for August, 2005

Detecting Blogspot splogs the Bayesian way


SHORT HISTORY OF SPLOGS

SPLOG SOLUTIONS
There are several strategies to make life harder for sploggers, while not making it too hard for regular bloggers:

prevent the (automatic) creation of a splog site 

CAPTCHA SYSTEMS: useless. In the end, one can always outsource creation of blogs to ‘real people’ in India at less than $1/blog 

EMAIL CONFIRMATION: can be too easily automated 

NO MORE FREE BLOGS: charge a fee (per month, like Typepad) or a kind of deposit (e.g. $50 – returned when you stop blogging, or confiscated when you run a splog) 

THOROUGH ID CHECK: so you can trace back splogs to the actual user. Not realistic, I’m afraid. Anonymity can sometimes be a positive thing. 

prevent (automatic) posting of splog items 

NO MORE API: no posting via email, REST, SOAP, XML-RPC, Atom, … That’s throwing out the baby with the bathwater. 

CAPTCHA SYSTEMS: require human interaction for each post. That is OK for web-based editors, but what about the APIs above? 

CHECK CONTENT: process the content of each post and try to decide if it’s splog-like. See the chapter below on Bayesian Detection 

detect splogs when they are created 

AUTOMATED DETECTION: using systems similar to the ones now employed for email spam detection. See the chapter below on Bayesian Detection. 

GROUP EFFORT: projects like splogreporter.com, Blogger’s “flag as objectionable”, Google Adsense abuse report. Maybe an initiative such as “Razor” for splogs? 

prevent the benefits of splog usage 

PAGERANK 0: when in doubt on whether a blog is a splog, reset the PageRank to 0 

REL NOFOLLOW: make all links worthless for search engines. This is an easier decision to take than deleting the whole blog (because if you delete a blog that’s not a splog …). 

REDIRECT SCRIPT: replace all links to www.example.com with e.g. www.blogger.com/r?url=www.example.com, so that the redirects work but do not transfer Google juice. The ‘old’ way of doing rel=nofollow. 

EDUCATION: there are way too many gullible people around who actually buy bogus stuff on the web. No one should leave high school and not know what SPAM is. You do not buy V*AGRA on the web!!

BAYESIAN DETECTION OF SPLOGS
Kailash Nadh already started with a project, but let’s take it one step further. I’d like to go back to the beginning of email spam fighting: most of the efforts started after an article by Paul Graham: “A plan for spam”. He proposed Bayesian Filtering as a better method for detecting spam than using whitelists/blacklists. In his follow-up article, he says:

I don’t think it’s a good idea to treat spam filtering as a straight text classification problem. You can use text classification techniques, but solutions can and should reflect the fact that the text is email, and spam in particular. Email is not just text; it has structure.

He then suggests treating the typical email fields: sender, destination, subject, date differently from the email body.

Let’s try to pinpoint the specifics of a blog (they are not the same as emails):

  • a blog has a title, url, meta tags, and HTML content (which does not appear in the RSS/Atom feed) 
  • a blog post has a title, url, date, meta tags and HTML content (which does appear in the feed)
  • the HTML of the blog or of each post can contain links (<a href="http://www.example.com">)

So intuitively I suspect the indicators of a splog will be (using car-deals-and-more.blogspot.com as example):

Keyword stuffing 

some words will appear over and over again, with small variations. In our example: over 3% of all words are “car”, 1.7% are “loan” and 1.09% are “credit”. Keywords in the URLs or in the domain name are an even stronger indication of splogs. 

Nonsense content 

some sites use really dumb search engine scraping to generate content, which fills the posts with script URLs and lists of keywords without any grammatical construction. 

Repetitive linking 

since a splog will be started to promote one product or site, the links will typically point to the same site and maybe to other ‘joined’ splogs. (in our example: always to www.car-financing.online-auto-center.info, a 41-character(?!) domain name) 

Post frequency 

Since recency and frequency are so important for a search engine, the splog will have many posts/day. If the owner is an idiot, the posts will appear at exactly 2 minutes past every hour, but presumably the schedule will be something more random. One might do an RFM analysis on this.
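
To make two of these indicators concrete, here is a minimal Python sketch that measures the share of the most frequent words in a text and the regularity of post timestamps. The function names `keyword_shares` and `posting_stats` are made up for this illustration, and "posts per day" is crudely taken as the number of posts divided by the time between the first and the last post.

```python
import re
from collections import Counter
from datetime import datetime
from statistics import pstdev

def keyword_shares(text, top=3):
    """Return the `top` most frequent words with their share of all words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return []
    counts = Counter(words)
    return [(word, count / len(words)) for word, count in counts.most_common(top)]

def posting_stats(timestamps):
    """Return (posts per day, standard deviation of the gaps in minutes)."""
    ts = sorted(timestamps)
    if len(ts) < 2:
        return 0.0, 0.0
    # Gaps between consecutive posts, in minutes
    gaps = [(b - a).total_seconds() / 60 for a, b in zip(ts, ts[1:])]
    span_days = (ts[-1] - ts[0]).total_seconds() / 86400
    per_day = len(ts) / span_days if span_days > 0 else float(len(ts))
    return per_day, pstdev(gaps)

# Hypothetical data: four posts, each at 2 minutes past the hour
posts = [datetime(2005, 8, 1, hour, 2) for hour in range(4)]
print(keyword_shares("car loan car credit car", top=1))
print(posting_stats(posts))
```

A very high share for a handful of words, or a gap deviation close to zero, would only be a hint; where to draw the line is something the data would have to tell.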

So one would have to adapt the Bayesian model as follows:
- add BLOGTITLE*whatever for each word in the blog title
- add BLOGPAGE*whatever for each word in the blog URL or domain (separated by “.”, “-” or “_”)
- add POSTTITLE*whatever for each word in the post title
- add POSTPAGE*whatever for each word in the post page URL (separated by “.”, “-” or “_”)
- add LINKURL*http://…/path/to/page.html for each link
- add LINKDOMAIN*http://www.example.com for each link
- calculate a POSTFREQ (posts/day stat) and POSTFREQDEV (standard deviation)
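
The token scheme above could be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the function names `url_words` and `blog_tokens` are invented here, URLs are also split on “/” (which the list above does not mention), and the post URL in the example is hypothetical.

```python
import re
from urllib.parse import urlparse

def text_words(text):
    """Split free text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def url_words(url):
    """Split host and path of a URL on '.', '-', '_' (and '/', an assumption)."""
    parsed = urlparse(url)
    parts = re.split(r"[./_\-]+", (parsed.netloc + parsed.path).lower())
    return [part for part in parts if part]

def blog_tokens(blog_title, blog_url, post_title, post_url, links):
    """Build the prefixed tokens that would be fed to the Bayesian filter."""
    tokens = []
    tokens += ["BLOGTITLE*" + w for w in text_words(blog_title)]
    tokens += ["BLOGPAGE*" + w for w in url_words(blog_url)]
    tokens += ["POSTTITLE*" + w for w in text_words(post_title)]
    tokens += ["POSTPAGE*" + w for w in url_words(post_url)]
    for link in links:
        parsed = urlparse(link)
        tokens.append("LINKURL*" + link)
        tokens.append("LINKDOMAIN*" + parsed.scheme + "://" + parsed.netloc)
    return tokens

tokens = blog_tokens(
    "Car Deals",
    "http://car-deals-and-more.blogspot.com/",
    "Cheap car loan",
    "http://car-deals-and-more.blogspot.com/2005/08/cheap-car-loan.html",
    ["http://www.example.com/path/to/page.html"],
)
print(tokens)
```

Because of the prefixes, the filter can learn that “car” in a domain name says something different from “car” in a post title.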

The Bayesian Filtering would then place a ‘splog probability’ on each blog and a company like Blogger could set up a weekly scan of each blog and do the following:
SPLOGPROB > 99%: automatically disable all outgoing links, send warning to owner, if no positive reply after 7 days: delete blog
SPLOGPROB > 95%: automatically disable all outgoing links, send warning to owner, test again after 2 days
SPLOGPROB > 90%: add rel="nofollow" to each link – add captcha to posting – next control in 1 week
SPLOGPROB > 80%: add captcha to posting – next control in 1 week
SPLOGPROB > 70%: do nothing – next control in 1 week
SPLOGPROB < 70%: do nothing – next control in 1 month
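
As a rough sketch, the thresholds above could be written as a simple lookup in Python. The function name `splog_action` is made up, “1 month” is approximated as 30 days, and a probability of exactly 70% (which the list does not cover) is treated as the lowest tier; all of these are assumptions.

```python
def splog_action(prob):
    """Map a splog probability (0.0 to 1.0) to (action, days until next control)."""
    if prob > 0.99:
        return ("disable links, warn owner, delete if no positive reply", 7)
    if prob > 0.95:
        return ("disable links, warn owner", 2)
    if prob > 0.90:
        return ("add rel=nofollow to links, add captcha to posting", 7)
    if prob > 0.80:
        return ("add captcha to posting", 7)
    if prob > 0.70:
        return ("do nothing", 7)
    # Below the lowest threshold: check again in about a month (assumed 30 days)
    return ("do nothing", 30)

for p in (0.995, 0.97, 0.85, 0.5):
    print(p, splog_action(p))
```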
I’ll probably refine this model in the future.
All remarks welcome!


Flickr/Yahoo experimenting with new ad format



Flickr has been placing contextual advertising on their “tag” search pages (example: the “brussels” tag) for a while. They seemed to use a mixture of Google Adsense and Yahoo Publisher text-based ads. For people who are not that familiar with contextual ad units: there are

  • image based ads, that consist entirely of 1 graphic
  • text-based ads: a title, a 5-10 word description and a link, for one specific product – offered by both Google and Yahoo. The ads should be more or less relevant to the content of the page or the site. You can have at most 4 ads in 1 ad unit, either vertical (“skyscraper”) or horizontal (“banner”).
  • the more recent Google “link units”, a collection of one-line topic links that lead to a page full of advertisements. The advantage: let’s say you do a post on digital photography. The contextual analysis picks up “digital camera”, but might only have space for 1 ad, so it shows you an ad for buying a camera on Amazon. With the topical link units, it can first check whether you are looking to buy a camera, or already have one and are more interested in memory cards, online image hosting or printing services. So while taking up less space, they make it possible to filter out the interested prospects and direct them to more relevant ads.

But this weekend I saw a new kind of ad format popping up on the Flickr site: let’s call them “image-enhanced topical link units” from the Yahoo Publisher Network (the left image is a screenshot). It’s probably in a test phase, since you see a headphone picture for “web cameras”, a GSM for “cameras” and another mobile phone for “home theatre” systems, and the link between those is not that obvious. But the line of thinking is logical: use the images to catch people’s attention (worth a thousand words, aren’t they) and then use the topics to filter out the interested customers. Let’s see if Google follows.

Technorati: Google

Google Desktop: buggy stuff


I regret having installed the Google Desktop Beta. I thought they would have ironed out the biggest bugs, but my first experience is not reassuring.

  • I installed it on my P4 2.8GHz with about 100GB of data (lots of it CD and DVD copies, so large files that need no indexing).
  • The indexing process has been running for more than 5 days. Every now and then the progress bar hangs (e.g. stays at 27% after 12 hours of running) although the process keeps running. Reboot necessary.
  • the GoogleDesktopIndex.exe process runs continuously at 50% CPU, which makes the computer slower, but also often at 100%, at which point the only option left is to reboot.
  • the “Sidebar” is very buggy, it remains “Loading” forever, cannot be stopped and does not update. I think it’s the “Images” component. When it crashes: reboot required.
  • they’ve used ActiveX everywhere. Why not Python with py2exe? Then the Sidebar would run on every platform instead of just on Windows, and it would be easier and safer to write plugins.
  • the Google Adsense plugin sounds like a good idea, but I did not get it working, the authentication failed every time.

So I remain to be convinced about the Desktop searching thing. Bye Bye Google Desktop.

What would be a good idea in the meantime: a publishers’ API for (read-only) access to the Google Adsense statistics. How about an RSS feed (an ‘AdStatFeed’)? A nice simple MRTG-like graph? A web-based version of CSVAdStats? How hard can that be?


Automated initial image tagging: Ojos Inc


What meta-data do we have for the average digital picture we take:

MINIMAL:

a filename, typically autogenerated by the camera (e.g. “DSC0009”) or chosen at the moment of import (e.g. “Trip to Portugal 001” or “Aug2005_001”)
a filedate, which probably corresponds to the date the picture landed on the hard disk
EXIF information: date of image capture, camera brand and model, aperture, … (maybe in the future also geo-location from a built-in GPS)

ADDED BY HUMAN HAND

a title and a description: in free text
a group/set/album name: typically less than 10 words
tags or labels: the ideal search criteria, typically added by the owner
geo coordinates: the new craze on Pixagogo (who then also add the city name as a tag), so the pictures can be mapped on Google Maps

If the human-added metadata is missing, there is hardly a way to find the picture through Google Images or Flickr. What if there could be software that analyzes a picture and automatically adds relevant metadata to it?

Munjal Shah, onetime cofounder of the auction services firm Andale, finally let slip on his new blog what he’s been working on since leaving last year (…) In other words, his startup, tentatively named Ojos (Spanish for “eyes”), is creating a new way to search and organize photos.
(…) he revealed the key technologies behind Ojos: face and text recognition. (…) The other key: You can assign tags, or keywords, to one photo and the service will automatically append that tag to other photos of the same people.
blogs.businessweek.com

I wonder if it could also be used to recognize familiar archetypes/icons like: a house, a sunset, an iPod, a Ferrari…

On his own blog, Shah writes:

I think Flickr’s tag based system is just super (in fact I love it), but I wanted all of my photos on there, I wanted them all tagged, and I didn’t want to spend hundreds of hours doing it. So being the lazy engineers that we are, we thought maybe we can at least auto-tag some of the faces and names.
on munjal.typepad.com

Ho John Lee states on his blog that the technology should be offered as a web service, not as yet another photo storage site. He has a point, and I can also see it working in a technology licensing model: let Flickr or Pixagogo run it locally and let them pay per million pictures processed. Anyway, it will be interesting to see where this company goes.
(via John)
