
SHORT HISTORY OF SPLOGS
- Mar 7: Matthew Haughey from Metafilter reports on Blogspot spam blogs.
- Apr 12: Blogger introduces a CAPTCHA system to prevent automated creation of blogs.
- Jun 30: Geektronica and PSFK discover loads of Blogspot spam blogs.
- Aug 9: David Sifry talks about Technorati’s efforts to cope with spam and fake blogs, and announces the Web 2.0 Spam Squashing Summit.
- Aug 9: Matthew Mullenweg has to block 80% of traffic to his Ping-o-Matic service due to spamblogs.
- Aug 15: Mark Cuban writes his complaint about spam blogs, coining the term ‘splogs’ along the way. He suggests that banning all Blogger/Blogspot sites might be a possible solution, which is rather like using a sledgehammer to kill a fly.
- Aug 17: the ‘citizen watch’ splog site splogreporter.com is launched by Frank J. Gruber and asks users to report splogs. The goal is to *sell* splog lists to search engines.
- Aug 17: Blogger launches its “flag as objectionable” button on all Blogspot sites.
- Aug 25: Scott Johnson from Feedster has a conversation with a not-too-bright blog spammer and gets him banned from search engines.
- Aug 29: Google Blogoscoped does an elementary count of ‘random’ Blogspot sites and counts 42% as splogs.
SPLOG SOLUTIONS
There are several strategies to make life harder for sploggers without making it too hard for regular bloggers:
- prevent the (automatic) creation of a splog site
  - CAPTCHA SYSTEMS: useless. In the end, one can always outsource the creation of blogs to ‘real people’ in India at less than $1/blog
  - EMAIL CONFIRMATION: can be too easily automated
  - NO MORE FREE BLOGS: ask a fee (per month, like Typepad) or a kind of deposit (e.g. $50, returned when you stop blogging or confiscated when you run a splog)
  - THOROUGH ID CHECK: so you can trace back splogs to the actual user. Not realistic, I’m afraid. Anonymity can sometimes be a positive thing.
- prevent (automatic) posting of splog items
  - NO MORE API: no posting via email, REST, SOAP, XML-RPC, Atom, … That’s throwing out the baby with the bathwater.
  - CAPTCHA SYSTEMS: require human interaction for each post. That is OK for web-based editors, but what about the APIs above?
  - CHECK CONTENT: process the content of each post and try to decide if it’s splog-like. See the chapter below on Bayesian Detection.
- detect splogs when they are created
  - AUTOMATED DETECTION: using systems similar to the ones now employed for email spam detection. See the chapter below on Bayesian Detection.
  - GROUP EFFORT: projects like splogreporter.com, Blogger’s “flag as objectionable”, Google Adsense abuse report. Maybe an initiative such as “Razor” for splogs?
- prevent the benefits of splog usage
  - PAGERANK 0: when in doubt on whether a blog is a splog, reset the PageRank to 0
  - REL NOFOLLOW: make all links worthless for search engines. This is an easier decision to take than deleting the whole blog (because what if you delete a blog that’s not a splog …)
  - REDIRECT SCRIPT: replace all links to www.example.com with e.g. www.blogger.com/r?url=www.example.com, so that the redirects work but do not transfer Google juice. The ‘old’ way of doing rel=nofollow.
  - EDUCATION: there are way too many gullible people around who actually buy bogus stuff on the web. No one should leave high school without knowing what SPAM is. You do not buy V*AGRA on the web!!
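To make the REL NOFOLLOW and REDIRECT SCRIPT options a bit more concrete, here is a rough sketch in Python. It assumes links show up as plain `<a>` tags with a double-quoted href; the function names and the regex are my own, and the redirector address is just the illustrative www.blogger.com/r?url=… from the list above, not a real service.

```python
import re
from urllib.parse import quote

# Matches an opening <a ...> tag and captures the double-quoted href value.
A_TAG = re.compile(r'<a\b([^>]*?)href="([^"]*)"([^>]*)>', re.IGNORECASE)

def add_nofollow(html: str) -> str:
    """Add rel="nofollow" to every <a> tag that has no rel attribute yet."""
    def fix(match):
        tag = match.group(0)
        if re.search(r'\brel\s*=', tag, re.IGNORECASE):
            return tag  # simplification: existing rel attributes are left alone
        return tag[:-1] + ' rel="nofollow">'
    return A_TAG.sub(fix, html)

def redirect_links(html: str, redirector: str = "http://www.blogger.com/r?url=") -> str:
    """Point every link at a (hypothetical) redirect script instead of its target."""
    def fix(match):
        before, url, after = match.groups()
        return f'<a{before}href="{redirector}{quote(url, safe="")}"{after}>'
    return A_TAG.sub(fix, html)
```

A real implementation would want a proper HTML parser rather than a regex, but the idea is the same: the visitor still gets to the target page, while the link itself carries no weight for search engines.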
BAYESIAN DETECTION OF SPLOGS
Kailash Nadh has already started a project, but let’s take it one step further. I’d like to go back to the beginning of email spam fighting: most of the efforts started after an article by Paul Graham, “A Plan for Spam”. He proposed Bayesian filtering as a better method for detecting spam than using whitelists/blacklists. In his follow-up article, he says:
I don’t think it’s a good idea to treat spam filtering as a straight text classification problem. You can use text classification techniques, but solutions can and should reflect the fact that the text is email, and spam in particular. Email is not just text; it has structure.
He then suggests treating the typical email fields (sender, destination, subject, date) differently from the email body.
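For those who have not read it: the heart of “A Plan for Spam” is combining the spam probabilities of individual tokens into one overall score. A minimal sketch in Python, assuming the per-token probabilities have already been learned from a spam and a non-spam corpus; the function name and the clamping bounds as parameters are my own choices.

```python
from math import prod

def combined_probability(token_probs, low=0.01, high=0.99):
    """Combine per-token spam probabilities into one score:
    P = prod(p) / (prod(p) + prod(1 - p))."""
    # Clamp so a single 0.0 or 1.0 cannot decide the outcome or divide by zero.
    probs = [min(high, max(low, p)) for p in token_probs]
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)
```

Two tokens that are each 90% ‘spammy’ combine to roughly 98.8%, which is why a handful of strong indicators is enough to push a score close to certainty.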
Let’s try to pinpoint the specifics of a blog (they are not the same as emails):
- a blog has a title, url, meta tags, and HTML content (which does not appear in the RSS/Atom feed)
- a blog post has a title, url, date, metatags and HTML content (which does appear in the feed)
- the HTML of the blog or of each post can contain links (<a href="http://www.example.com">)
So intuitively I suspect the indicators of a splog will be (using car-deals-and-more.blogspot.com as an example):
- Keyword stuffing
  - some words will appear over and over again, with small variations. In our example: over 3% of all words are “car”, 1.7% are “loan” and 1.09% are “credit”. Keywords in the URLs or in the domain name are an even stronger indication of a splog.
- Nonsense content
  - some sites use really dumb search engine scraping to generate content, which fills the posts with script URLs and lists of keywords without any grammatical construction.
- Repetitive linking
  - since a splog is started to promote one product or site, the links will typically point to the same site and maybe to other ‘joined’ splogs (in our example: always to www.car-financing.online-auto-center.info, a 41-character(?!) domain name).
- Post frequency
  - since recency and frequency are so important for a search engine, the splog will have many posts per day. If the owner is an idiot, the posts will appear at exactly the same time every hour (say, 2 minutes past the hour), but presumably most will use something more random. One might do an RFM analysis on this.
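The first and third indicators above are easy to measure. Here is a rough sketch in Python of how one might compute keyword density and the share of links going to a single host; the function names and the optional `ignore` set (for common words like “the”) are my own assumptions, not part of any existing tool.

```python
import re
from collections import Counter
from urllib.parse import urlparse

def keyword_density(text, top=3, ignore=frozenset()):
    """Return the `top` most frequent words with their share of all words."""
    words = [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in ignore]
    if not words:
        return []
    counts = Counter(words)
    return [(word, count / len(words)) for word, count in counts.most_common(top)]

def dominant_link_share(urls):
    """Return the most linked-to host and the fraction of links pointing to it."""
    hosts = [urlparse(u).netloc.lower() for u in urls]
    hosts = [h for h in hosts if h]  # skip relative or malformed links
    if not hosts:
        return None, 0.0
    host, count = Counter(hosts).most_common(1)[0]
    return host, count / len(hosts)
```

On a normal blog the top words are mostly things like “the” and “and”, and links go all over the place. A blog where “car” tops the list and nearly every link hits the same host deserves a closer look.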
So one would have to adapt the Bayesian model as follows:
- add BLOGTITLE*whatever for each word in the blog title
- add BLOGPAGE*whatever for each word in the blog URL or domain (separated by “.”, “-” or “_”)
- add POSTTITLE*whatever for each word in the post title
- add POSTPAGE*whatever for each word in the post page URL (separated by “.”, “-” or “_”)
- add LINKURL*http://…/path/to/page.html for each link
- add LINKDOMAIN*http://www.example.com for each link
- calculate a POSTFREQ (posts/day stat) and POSTFREQDEV (standard deviation)
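The token scheme above could be sketched as follows in Python. All function names are mine. I also split URLs on “/” in addition to the three separators listed, and I read POSTFREQDEV as the standard deviation of the number of posts per calendar day; both are assumptions on my part.

```python
import re
from collections import Counter
from datetime import timedelta
from statistics import mean, pstdev
from urllib.parse import urlparse

def split_words(text):
    """Lower-case the text and split it on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def url_words(url):
    """Drop the scheme, then split on '.', '-', '_', '/' and other punctuation."""
    return split_words(re.sub(r"^[a-z][a-z0-9+.-]*://", "", url.lower()))

def blog_tokens(blog_title, blog_url, post_title, post_url, links):
    """Build the prefixed tokens that would be fed to the Bayesian filter."""
    tokens = [f"BLOGTITLE*{w}" for w in split_words(blog_title)]
    tokens += [f"BLOGPAGE*{w}" for w in url_words(blog_url)]
    tokens += [f"POSTTITLE*{w}" for w in split_words(post_title)]
    tokens += [f"POSTPAGE*{w}" for w in url_words(post_url)]
    for link in links:
        parts = urlparse(link)
        tokens.append(f"LINKURL*{link}")
        tokens.append(f"LINKDOMAIN*{parts.scheme}://{parts.netloc}")
    return tokens

def post_frequency(post_times):
    """Return (POSTFREQ, POSTFREQDEV): mean and std. deviation of posts per day."""
    if not post_times:
        return 0.0, 0.0
    per_day = Counter(t.date() for t in post_times)
    first, last = min(per_day), max(per_day)
    days = (last - first).days + 1
    # Days without any post count as 0, so gaps between bursts are visible.
    counts = [per_day.get(first + timedelta(days=d), 0) for d in range(days)]
    return mean(counts), pstdev(counts)
```

The prefixes mean that “car” in a blog title and “car” in a post URL are learned as separate tokens, which is the same trick Graham describes for email headers.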
The Bayesian filtering would then place a ‘splog probability’ on each blog, and a company like Blogger could set up a weekly scan of each blog and do the following:
SPLOGPROB > 99%: automatically disable all outgoing links, send a warning to the owner; if no positive reply after 7 days: delete the blog
SPLOGPROB > 95%: automatically disable all outgoing links, send a warning to the owner, test again after 2 days
SPLOGPROB > 90%: add rel="nofollow" to each link - add captcha to posting - next check in 1 week
SPLOGPROB > 80%: add captcha to posting - next check in 1 week
SPLOGPROB > 70%: do nothing - next check in 1 week
SPLOGPROB < 70%: do nothing - next check in 1 month
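The escalation table above translates almost directly into code. A minimal sketch in Python: the names are mine, “1 month” is taken as 30 days, and a score of exactly 70% (which the table does not cover) falls into the lowest tier.

```python
# (threshold, actions, days until the next check), highest threshold first.
POLICY = [
    (0.99, ["disable outgoing links", "warn owner", "delete blog if no positive reply"], 7),
    (0.95, ["disable outgoing links", "warn owner"], 2),
    (0.90, ["add rel=nofollow to links", "add captcha to posting"], 7),
    (0.80, ["add captcha to posting"], 7),
    (0.70, [], 7),
]

def splog_action(prob):
    """Map a splog probability (0.0 to 1.0) to (actions, days until next check)."""
    for threshold, actions, days in POLICY:
        if prob > threshold:
            return actions, days
    return [], 30  # below the lowest threshold: do nothing, look again in a month

actions, next_check = splog_action(0.96)
print(actions, next_check)
```

Keeping the thresholds in one table makes it easy to tune them once real false-positive numbers come in.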
I’ll probably refine this model in the future.
All remarks welcome!













