Detecting Blogspot splogs the Bayesian way


SHORT HISTORY OF SPLOGS

SPLOG SOLUTIONS
There are several strategies to make life harder for sploggers without making it too hard for regular bloggers:

prevent the (automatic) creation of a splog site 

CAPTCHA SYSTEMS: useless. In the end, one can always outsource creation of blogs to ‘real people’ in India at less than $1/blog 

EMAIL CONFIRMATION: can be too easily automated 

NO MORE FREE BLOGS: charge a fee (per month, like Typepad) or some kind of deposit (e.g. $50, returned when you stop blogging or confiscated when you run a splog) 

THOROUGH ID CHECK: so you can trace back splogs to the actual user. Not realistic, I’m afraid. Anonymity can sometimes be a positive thing. 

prevent (automatic) posting of splog items 

NO MORE API: no posting via email, REST, SOAP, XML-RPC, Atom, … That’s throwing out the baby with the bathwater. 

CAPTCHA SYSTEMS: require human interaction for each post. This is OK for web-based editors, but what about the APIs above? 

CHECK CONTENT: process the content of each post and try to decide if it’s splog-like. See the chapter below on Bayesian Detection 

detect splogs when they are created 

AUTOMATED DETECTION: using systems similar to the ones now employed for email spam detection. See the chapter below on Bayesian Detection. 

GROUP EFFORT: projects like splogreporter.com, Blogger’s “flag as objectionable”, Google Adsense abuse report. Maybe an initiative such as “Razor” for splogs? 

prevent the benefits of splog usage 

PAGERANK 0: when in doubt on whether a blog is a splog, reset the PageRank to 0 

REL NOFOLLOW: make all links worthless for search engines. This is an easier decision to take than deleting the whole blog (because if you delete a blog that is not a splog …). 

REDIRECT SCRIPT: replace all links to www.example.com with e.g. www.blogger.com/r?url=www.example.com, so that the links still work but do not transfer Google juice. This is the ‘old’ way of doing rel=nofollow. 

EDUCATION: there are far too many gullible people around who actually buy bogus stuff on the web. No one should leave high school without knowing what SPAM is. You do not buy V*AGRA on the web!!

BAYESIAN DETECTION OF SPLOGS
Kailash Nadh has already started a project, but let’s take it one step further. I’d like to go back to the beginning of email spam fighting: most of the efforts started after an article by Paul Graham, “A plan for spam”. He proposed Bayesian filtering as a better method for detecting spam than whitelists and blacklists. In his follow-up article, he says:

I don’t think it’s a good idea to treat spam filtering as a straight text classification problem. You can use text classification techniques, but solutions can and should reflect the fact that the text is email, and spam in particular. Email is not just text; it has structure.

He then suggests treating the typical email fields: sender, destination, subject, date differently from the email body.
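
As a reminder of how such a filter turns per-token evidence into one score, here is a minimal sketch of the naive Bayes combining rule used in this family of filters. The function name `combine` and the neutral 0.5 returned for empty input are assumptions for illustration. The clamping to [0.01, 0.99] keeps a single token from forcing the result to exactly 0 or 1.

```python
from math import prod


def combine(token_probs):
    """Combine per-token spam probabilities into a single probability.

    Uses the naive Bayes rule: prod(p) / (prod(p) + prod(1 - p)).
    """
    if not token_probs:
        return 0.5  # assumption: no evidence means "undecided"
    # Clamp so that one extreme token cannot cause a division by zero.
    clamped = [min(0.99, max(0.01, p)) for p in token_probs]
    spam = prod(clamped)
    ham = prod(1.0 - p for p in clamped)
    return spam / (spam + ham)
```

Two strongly spammy tokens reinforce each other, while one spammy and one equally innocent token cancel out to an undecided score.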

Let’s try to pinpoint the specifics of a blog (they are not the same as emails):

  • a blog has a title, url, meta tags, and HTML content (which does not appear in the RSS/Atom feed) 
  • a blog post has a title, url, date, meta tags and HTML content (which does appear in the feed)
  • the HTML of the blog or of each post can contain links (<a href="http://www.example.com">)

So intuitively I suspect the indicators of a splog will be (using car-deals-and-more.blogspot.com as example):

Keyword stuffing 

Some words will appear over and over again, with small variations. In our example, over 3% of all words are “car”, 1.7% are “loan” and 1.09% are “credit”. Keywords in the URLs or in the domain name are an even stronger indication of a splog. 

Nonsense content 

Some sites use really dumb search engine scraping to generate content, which fills the posts with script URLs and lists of keywords without any grammatical construction. 

Repetitive linking 

since a splog will be started to promote one product or site, the links will typically point to the same site and maybe to other ‘joined’ splogs. (in our example: always to www.car-financing.online-auto-center.info, a 41-character(?!) domain name) 

Post frequency 

Since recency and frequency are so important for a search engine, the splog will have many posts per day. If the owner is careless, they will appear on a fixed schedule (e.g. every hour at 2 minutes past the hour), but presumably most will use something more random. One might do an RFM analysis on this.
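
Two of these indicators, keyword stuffing and repetitive linking, are easy to measure directly. The sketch below is only an illustration: the function names `keyword_shares` and `top_link_domain_share` are made up here, and the tokenizer (lowercase runs of letters and digits) is a simplifying assumption.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def keyword_shares(text, top=3):
    """Return the `top` most frequent words with their share of all words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return []
    counts = Counter(words)
    return [(word, n / len(words)) for word, n in counts.most_common(top)]


def top_link_domain_share(urls):
    """Return the most linked-to domain and the fraction of links pointing at it."""
    domains = [urlparse(u).netloc.lower() for u in urls]
    domains = [d for d in domains if d]  # skip relative or malformed links
    if not domains:
        return None, 0.0
    domain, n = Counter(domains).most_common(1)[0]
    return domain, n / len(domains)
```

A regular blog would typically show low shares for any single word and links spread over many domains, while a splog should stand out on both measures. What counts as "too high" would have to be learned from real data.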

So one would have to adapt the Bayesian model as follows:
- add BLOGTITLE*whatever for each word in the blog title
- add BLOGPAGE*whatever for each word in the blog URL or domain (separated by “.”, “-” or “_”)
- add POSTTITLE*whatever for each word in the post title
- add POSTPAGE*whatever for each word in the post page URL (separated by “.”, “-” or “_”)
- add LINKURL*http://…/path/to/page.html for each link
- add LINKDOMAIN*http://www.example.com for each link
- calculate a POSTFREQ (posts/day stat) and POSTFREQDEV (standard deviation)
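
The token scheme above could be sketched roughly as follows. The function names, the argument layout and the tokenizer are assumptions made for illustration. URLs are split on every non-alphanumeric character, which is a superset of the “.”, “-” and “_” separators listed. POSTFREQDEV is computed here as the population standard deviation of the daily post counts between the first and last post, which is one possible reading of the stat.

```python
import re
from collections import Counter
from datetime import timedelta
from statistics import mean, pstdev
from urllib.parse import urlparse


def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def url_words(url):
    parsed = urlparse(url)
    # Without a scheme, urlparse puts the host in .path, so use both parts.
    return words(parsed.netloc + parsed.path)


def blog_tokens(blog_title, blog_url, post_title, post_url, link_urls):
    """Build field-prefixed tokens for one post of one blog."""
    tokens = ["BLOGTITLE*" + w for w in words(blog_title)]
    tokens += ["BLOGPAGE*" + w for w in url_words(blog_url)]
    tokens += ["POSTTITLE*" + w for w in words(post_title)]
    tokens += ["POSTPAGE*" + w for w in url_words(post_url)]
    for link in link_urls:
        parsed = urlparse(link)
        tokens.append("LINKURL*" + link)
        tokens.append("LINKDOMAIN*" + parsed.scheme + "://" + parsed.netloc)
    return tokens


def post_frequency(post_days):
    """Return (POSTFREQ, POSTFREQDEV) from a list of datetime.date values."""
    if not post_days:
        return 0.0, 0.0
    counts = Counter(post_days)
    first, last = min(counts), max(counts)
    n_days = (last - first).days + 1
    # Days without posts count as 0 so that bursts show up in the deviation.
    per_day = [counts.get(first + timedelta(days=i), 0) for i in range(n_days)]
    return mean(per_day), pstdev(per_day)
```

The prefixes keep “car” in a blog title separate from “car” in a post URL, so the filter can learn a different weight for each position.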

The Bayesian Filtering would then place a ‘splog probability’ on each blog and a company like Blogger could set up a weekly scan of each blog and do the following:
SPLOGPROB > 99%: automatically disable all outgoing links, send warning to owner, if no positive reply after 7 days: delete blog
SPLOGPROB > 95%: automatically disable all outgoing links, send warning to owner, test again after 2 days
SPLOGPROB > 90%: add rel="nofollow" to each link – add captcha to posting – next control in 1 week
SPLOGPROB > 80%: add captcha to posting – next control in 1 week
SPLOGPROB > 70%: do nothing – next control in 1 week
SPLOGPROB < 70%: do nothing – next control in 1 month
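
As a sketch, the tiers above map naturally onto a small lookup table. The names `Action`, `TIERS` and `splog_action` are hypothetical. Two details are assumptions because the list does not settle them: a score of exactly 70% falls into the lowest tier, and “1 month” is taken as 30 days.

```python
from typing import NamedTuple


class Action(NamedTuple):
    disable_links: bool
    warn_owner: bool
    nofollow: bool
    captcha: bool
    recheck_days: int
    delete_if_no_reply: bool = False


# Checked from the highest threshold down; the first match wins.
TIERS = [
    (0.99, Action(True, True, False, False, 7, delete_if_no_reply=True)),
    (0.95, Action(True, True, False, False, 2)),
    (0.90, Action(False, False, True, True, 7)),
    (0.80, Action(False, False, False, True, 7)),
    (0.70, Action(False, False, False, False, 7)),
]
# Assumption: "1 month" means 30 days, and exactly 70% lands here.
DEFAULT = Action(False, False, False, False, 30)


def splog_action(prob):
    """Map a splog probability between 0 and 1 to the tiered response."""
    for threshold, action in TIERS:
        if prob > threshold:
            return action
    return DEFAULT
```

Keeping the thresholds in a table rather than in nested conditions makes it easy to tune them once real false-positive rates are known.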
I’ll probably refine this model in the future.
All remarks welcome!
