
Archive for the 'spam' Category


Detecting Blogspot splogs the Bayesian way


SHORT HISTORY OF SPLOGS

SPLOG SOLUTIONS
There are several strategies to make life harder for sploggers, while not making it too hard for regular bloggers:

prevent the (automatic) creation of a splog site 

CAPTCHA SYSTEMS: useless. In the end, one can always outsource creation of blogs to ‘real people’ in India at less than $1/blog 

EMAIL CONFIRMATION: can be too easily automated 

NO MORE FREE BLOGS: ask a fee (per month, like Typepad) or a kind of deposit (e.g. $50 – returned when you stop blogging, or confiscated when you run a splog) 

THOROUGH ID CHECK: so you can trace back splogs to the actual user. Not realistic, I’m afraid. Anonymity can sometimes be a positive thing. 

prevent (automatic) posting of splog items 

NO MORE API: no posting via email, REST, SOAP, XML-RPC, Atom, … That’s throwing out the baby with the bathwater. 

CAPTCHA SYSTEMS: require human interaction for each post. Is OK for web-based editors, but what with the APIs above? 

CHECK CONTENT: process the content of each post and try to decide if it’s splog-like. See the chapter below on Bayesian Detection 

detect splogs when they are created 

AUTOMATED DETECTION: using systems similar to the ones now employed for email spam detection. See the chapter below on Bayesian Detection. 

GROUP EFFORT: projects like splogreporter.com, Blogger’s “flag as objectionable”, Google Adsense abuse report. Maybe an initiative such as “Razor” for splogs? 

prevent the benefits of splog usage 

PAGERANK 0: when in doubt on whether a blog is a splog, reset the PageRank to 0 

REL NOFOLLOW: make all links worthless for search engines. This is an easier decision to take than deleting the whole blog (because if you delete a blog that’s not a splog …).

REDIRECT SCRIPT: replace all links to www.example.com with e.g. www.blogger.com/r?url=www.example.com, so that the redirects work but do not transfer Google juice. The ‘old’ way of doing rel=nofollow. 

EDUCATION: there are way too many gullible people around who actually buy bogus stuff on the web. No one should leave high school without knowing what SPAM is. You do not buy V*AGRA on the web!!

BAYESIAN DETECTION OF SPLOGS
Kailash Nadh already started with a project, but let’s take it one step further. I’d like to go back to the beginning of email spam fighting: most of the efforts started after an article by Paul Graham: “A plan for spam”. He proposed Bayesian Filtering as a better method for detecting spam than using whitelists/blacklists. In his follow-up article, he says:

I don’t think it’s a good idea to treat spam filtering as a straight text classification problem. You can use text classification techniques, but solutions can and should reflect the fact that the text is email, and spam in particular. Email is not just text; it has structure.

He then suggests treating the typical email fields: sender, destination, subject, date differently from the email body.
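As a reminder of what the filter does once it has a probability per token, here is a minimal sketch of the combination step as I understand it from “A plan for spam”. The function name and the clamping bounds as parameters are my own choices for illustration; the per-token probabilities are assumed to come from a trained corpus that is not shown here.

```python
from math import prod


def combined_probability(token_probs, low=0.01, high=0.99):
    """Combine per-token spam probabilities into one overall probability.

    Each probability is clamped to [low, high] so that a single token
    can never force the result to exactly 0 or 1.
    """
    clamped = [min(high, max(low, p)) for p in token_probs]
    spam = prod(clamped)
    ham = prod(1.0 - p for p in clamped)
    return spam / (spam + ham)
```

A handful of strongly spammy tokens pushes the result close to 1, while neutral tokens (probability 0.5) leave it unchanged.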

Let’s try to pinpoint the specifics of a blog (they are not the same as emails):

  • a blog has a title, url, meta tags, and HTML content (which does not appear in the RSS/Atom feed) 
  • a blog post has a title, url, date, metatags and HTML content (which does appear in the feed)
  • the HTML of the blog or of each post can contain links (<a href="http://www.example.com">)

So intuitively I suspect the indicators of a splog will be (using car-deals-and-more.blogspot.com as example):

Keyword stuffing 

some words will appear over and over again, with small variations. In our example: over 3% of all words are “car”, 1.7% are “loan” and 1.09% are “credit”. Keywords in the URLs or in the domain name are an even stronger indication of splogs. 

Nonsense content 

some sites use really dumb search engine scraping to generate content, which fills the posts with script URLs and lists of keywords without any grammatical construction. 

Repetitive linking 

since a splog will be started to promote one product or site, the links will typically point to the same site and maybe to other ‘joined’ splogs. (in our example: always to www.car-financing.online-auto-center.info, a 41-character(?!) domain name) 

Post frequency 

Since recency and frequency are so important for a search engine, the splog will have many posts/day. If the owner is an idiot, the posts will appear at exactly 2 minutes past every hour, but presumably the timing will be something more random. One might do an RFM (recency, frequency, monetary value) analysis on this.
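Two of these indicators, keyword stuffing and repetitive linking, are easy to measure. The following is only a rough sketch: the function names are made up for illustration, and the word splitting is deliberately naive (lowercase letters, digits and apostrophes).

```python
import re
from collections import Counter
from urllib.parse import urlparse


def keyword_shares(text, top=3):
    """Return the `top` most frequent words with their share of all words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return []
    counts = Counter(words)
    return [(word, n / len(words)) for word, n in counts.most_common(top)]


def dominant_link_domain(urls):
    """Return (host, share) for the most linked-to host, or None without links."""
    hosts = [urlparse(u).netloc.lower() for u in urls]
    hosts = [h for h in hosts if h]  # skip relative or malformed links
    if not hosts:
        return None
    host, n = Counter(hosts).most_common(1)[0]
    return host, n / len(hosts)
```

A blog where one word takes several percent of all text, or where nearly all links point to the same host, would deserve a closer look.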

So one would have to adapt the Bayesian model as follows:
- add BLOGTITLE*whatever for each word in the blog title
- add BLOGPAGE*whatever for each word in the blog URL or domain (separated by “.”, “-” or “_”)
- add POSTTITLE*whatever for each word in the post title
- add POSTPAGE*whatever for each word in the post page URL (separated by “.”, “-” or “_”)
- add LINKURL*http://…/path/to/page.html for each link
- add LINKDOMAIN*http://www.example.com for each link
- calculate a POSTFREQ (posts/day stat) and POSTFREQDEV (standard deviation)
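A minimal sketch of how these tokens and statistics could be generated. The function names are hypothetical. Two assumptions of mine go beyond the list above: URLs are also split on “/”, and the posts/day statistics include days without posts between the first and the last post.

```python
import re
from collections import Counter
from datetime import timedelta
from statistics import mean, pstdev
from urllib.parse import urlparse


def _words(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def _url_words(url):
    # Drop the scheme, then split host and path on ".", "-", "_" and "/".
    parts = urlparse(url)
    return [w for w in re.split(r"[.\-_/]+", (parts.netloc + parts.path).lower()) if w]


def blog_tokens(blog_title, blog_url, post_title, post_url, links):
    """Build the field-prefixed tokens for one post of one blog."""
    tokens = ["BLOGTITLE*" + w for w in _words(blog_title)]
    tokens += ["BLOGPAGE*" + w for w in _url_words(blog_url)]
    tokens += ["POSTTITLE*" + w for w in _words(post_title)]
    tokens += ["POSTPAGE*" + w for w in _url_words(post_url)]
    for link in links:
        parts = urlparse(link)
        tokens.append("LINKURL*" + link)
        tokens.append("LINKDOMAIN*" + parts.scheme + "://" + parts.netloc)
    return tokens


def post_frequency(post_days):
    """Return (POSTFREQ, POSTFREQDEV) from a list of datetime.date, one per post."""
    if not post_days:
        return 0.0, 0.0
    counts = Counter(post_days)
    first, last = min(counts), max(counts)
    n_days = (last - first).days + 1
    daily = [counts.get(first + timedelta(days=i), 0) for i in range(n_days)]
    return mean(daily), pstdev(daily)
```

The prefixes keep “car” in a blog title apart from “car” in a post URL, so the filter can learn a separate probability for each.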

The Bayesian Filtering would then place a ’splog probability’ on each blog and a company like Blogger could set up a weekly scan of each blog and do the following:
SPLOGPROB > 99%: automatically disable all outgoing links, send warning to owner, if no positive reply after 7 days: delete blog
SPLOGPROB > 95%: automatically disable all outgoing links, send warning to owner, test again after 2 days
SPLOGPROB > 90%: add rel="nofollow" to each link – add captcha to posting – next control in 1 week
SPLOGPROB > 80%: add captcha to posting – next control in 1 week
SPLOGPROB > 70%: do nothing – next control in 1 week
SPLOGPROB < 70%: do nothing – next control in 1 month
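The table above can be written down as a simple lookup. This is just a sketch: the names are made up, “1 month” is taken to be 30 days, and a probability of exactly 70% (which the table leaves open) falls into the lowest band.

```python
# (lower bound, action, days until next check), highest band first.
BANDS = [
    (0.99, "disable outgoing links, warn owner, delete blog if no positive reply", 7),
    (0.95, "disable outgoing links, warn owner", 2),
    (0.90, "add rel=nofollow to links, add captcha to posting", 7),
    (0.80, "add captcha to posting", 7),
    (0.70, "do nothing", 7),
]


def action_for(splog_prob):
    """Map a splog probability (0.0 to 1.0) to (action, days until next check)."""
    for lower_bound, action, recheck_days in BANDS:
        if splog_prob > lower_bound:
            return action, recheck_days
    return "do nothing", 30  # assumption: "1 month" taken as 30 days
```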
I’ll probably refine this model in the future.
All remarks welcome!



Amy Cross spamming Technorati

If you look through the posts under the Technorati tag ‘Adsense’ you find most posts are from a blog “Adsense For profit” or “holy grail of adsense dot com” (No, I won’t link to it, I don’t like their kind).

When you click the link of the post, which typically links to a http://www.holygrail__.com/ archives/ $number, you are redirected to the site of a hosting provider spagack dot info. When you click the name of the blog, you don’t see a blog, but a classic badly designed, keyword infested, get-rich-quick spam page for a handbook on Adsense advertising ($80).

Why would Technorati include a spam page, you say? Well, the web site is actually built on WordPress blog software, but with a heavily modified template: in the HTML there are some <div> tricks to push the blog-generated content out of sight and throw the spam page in your face. When you check the actual contents of the blog feed: it’s some kind of automatic re-post of Adsense related articles, but with the links (anchor tags) modified to <ab href="http://...">...</ab>, so they do not get picked up by crawlers like Google and Technorati.

Who is responsible for this stuff? The DNS registration shows it to be a certain Amy Cross from Texas, who’s also behind the spagack hosting site. The address she uses is a mailbox in McCamey, TX 79752, which I first expected to be a phony mailbox just like they used in the DRA scam, but there actually is an Amy Cross registered in McCamey. Amy has already been outed as a spammer on www.blogherald.com (Apr 2005). She actually responded to it:

I guess I’ll just leave it alone. I can learn to handle any insult from any small minded nit as long as it gives me the exposure and reach that you have given me.

I guess she’ll be delighted with this post then.

Other blog posts on the issue of Technorati spam: johnaugust.com


Old-style Nigerian scam: via fax


Amazing: I just got my first Nigerian (419) scam via FAX! In these days of practically free email sending, you have to admire someone who goes the extra mile and pays for sending faxes. A handwritten letter would have made me feel even more special, but it’s a start.

A Mister Victor Abbor (vicabbor2@yahoo.com or vicabbor@sify.com) from Lagos has picked me as a candidate for transferring a substantial amount of money ($17.6 mio), formerly belonging to Mr. Akbar Ali, who unfortunately died in the Benin plane crash on Dec 25, 2003. I have to pose as a relative of Mr. Ali (I do have a slight tan from one week in Portugal, so I should pass easily for an African). In return, he would only take a meager 65% and leave me with the remaining 35% (what, no taxes?) which amounts to … $6.16 mio, or € 4.7 mio. A day’s work for a day’s pay.

Victor refers to articles on CNN.com: Benin crash, and Pravda.ru: Bloody Christmas. Obviously, the fact he knows these articles should be enough proof that he is indeed capable of touching the money. Funnily enough, just after asking my personal details, he signs the letter as Paul Udo. I’m confused now. Is Udo Abbor’s secretary or something? While “Victor Abbor” has no Google hits yet, “Paul Udo” does have prior activity: on nigerian419fraud.freeserve.co.uk he appears to have promised 30% of 32 mio to someone else. I suddenly don’t feel that special anymore.

The fax is supposedly sent from +234-7594683 (+234 is Nigeria’s country prefix)… Nah, I don’t think I’m gonna do it. If anyone is interested in doing business with Victor/Paul, just let me know. Just browse through The Spamletters first to get a feel of how to do business the Nigerian way.

Googlistics: messing with the big “G”


He probably also first thought it was an April Fools’ joke:
Matt Mullenweg from WordPress was discovered to have used his PageRank 8 site (WordPress is a popular open-source blogging software) for hosting lots of irrelevant content, in order to get high scores in Google rankings and (let a customer of his) make money on Google Adsense.

The content in articles is essentially advertising by a third party that we host for a flat fee. I’m not sure if we’re going to continue it much longer, but we’re committed to this month at least, it was basically an experiment. However around the beginning of Feburary donations were going down as expenses were ramping up, so it seemed like a good way to cover everything. The adsense on those pages is not ours and I have no idea what they get on it, we just get a flat fee. The money is used just like donations but more specifically it’s been going to the business/trademark expenses so it’s not entirely out of my pocket anymore.
(from wordpress.org)

Andy Baio (Waxy) broke the news on March 30th, at a moment when Matt was on holiday (and off-line), so he only replied on April 1st, about a thousand angry emails later. His defense is that it was an interesting idea, badly implemented, not followed up and never evaluated. Since Matt does not have the profile of a cash-hungry opportunist, and he’s explaining this to an audience of people who understand these reasons (reads like an IT project management what-not-to-do list), the storm will probably blow over.

Normally this is the kind of situation where one would say: “SEO? Leave that to the professionals!”. But the fact is that here in Belgium, some of the companies that claim to be SEO specialists, use dirty tricks all over. Hidden links, bot cloaking, keyword spamming, <noscript> tricks, the whole shebang. It’s like they read the Google SEO warning page as a guideline. “Hey look! We could put ourselves and other customers on every client’s doorway pages. Neat!”.

Joris just posted another example (Immoweb.be) on his SEO blog. And again the so-called “SEO professional” fooling around is Extenseo, just as it was for Automagazine and Actel. As one can see on their unprotected Javascript hosting site, they recently added VW/MyWay to their customers, so we can expect those homepages to be featured in the Hoe Het Niet Moet (What Not To Do) series soon!


Dave Winer’s problem and solution

Dave Winer seems to be very excited about something but he can’t say yet what it is:

Last night I got an email from someone I’ve been wanting to hear from for a long time. There’s a problem on the Internet, a big one, that only one entity can solve. The email outlined the solution and asked what I thought of it, and asked me not to say what it is publicly. I can live with that. I just want to mark this moment. A milestone. Real cooperation. I immediately implemented the feature on one of my sites. The same message was sent to a bunch of other people by the same person. I hope they did the same. When this is announced users everywhere will smile
(from archive.scripting.com)

and a day later:

Watch this space for an interesting announcement.
(from www.bloggercon.org)

First I thought it would be related to RSS. Maybe RSS and Atom are merging into 1 standard (but then what does he mean with ‘implemented the feature on one of my sites’?) or Blogger (Google) will now support RSS as well as Atom feeds (which would basically mean Atom dies)?

But speculation in the blogosphere tends to go in the direction of Google taking into account the rel="nofollow" attribute of a link, so bloggers can make a distinction between links that Google should follow (and transfer Pagerank to) or not. A promising solution for comment spam.
(via poorbuthappy.com / gorissen.info / phaedo.cx)

Comment spam is a problem I almost never encounter. Most of my sites are created with Blogger, and they use a redirector script for outgoing links in comments:
http://www.blogger.com/r?http%3A%2F%2Fwww.example.com
(cf help.blogger.com)
Neither WordPress nor SixApart (Movable Type) mention it in their comment-spam combat guide. If every blog software used this trick, it would make the comment-spamming tactic less attractive!
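For blog software that wants to copy the trick, building such a redirector link is a one-liner. A small sketch, where the prefix is simply taken from the Blogger example above and the function names are my own:

```python
from urllib.parse import quote, unquote

# Prefix taken from the Blogger example above.
REDIRECTOR = "http://www.blogger.com/r?"


def redirect_url(target):
    """Wrap an outgoing link so that it passes through the redirector."""
    return REDIRECTOR + quote(target, safe="")


def original_url(redirected):
    """Recover the target from a redirector link built by redirect_url."""
    if not redirected.startswith(REDIRECTOR):
        raise ValueError("not a redirector link")
    return unquote(redirected[len(REDIRECTOR):])
```

The link still works for the reader, but it points at the redirector rather than at the spammer's site. Whether the redirector page itself passes any ranking on to the target is up to the host, and is not shown here.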

Don’t unsubscribe from spam


Brian McWilliams, author of Spam Kings, has published an article “Remove me” on salon.com about his recent undercover job within the spammer community to check whether these people really take “Unsubscribe me” requests into account. He poses as an affiliate (someone who sends spam on behalf of some company and gets commissions for each sale) to a company selling fake Rolexes.

When I signed on to BlackMarketMoney.com for the first time, I saw a page where my sales stats would be displayed. A preferences section included a form where I could specify account numbers for my commission payments. There were also pages with suggested ad copy and graphics files, as well as an updated list of the various domains we affiliates were supposed to advertise in our spams.

But what really caught my eye was a note at the site that insisted all affiliate spams include an “unsubscribe link.” Two huge archives were also available for download, containing lists of “remove” addresses. The October list held around 202,000 e-mails, while the November list had over 282,000 addresses. Sales affiliates were instructed to scrub their mailing lists to remove these names.
(from salon.com)

Even though the affiliates are given all the information necessary to remove the addresses of people, reality turns out to work differently. I’ll let you read the story – and his conversations with the people who unsubscribed – on the site, but his conclusion remains unchanged:
Do NOT use the “unsubscribe/remove” option in spam mails!

“Domain Registry of America” scam

UPDATE: I received a cease-and-desist letter from DRoA in March 2006 about this post.

Just received a letter in the post from ‘Domain Registry of America’ (DRoA), urging me to pay them for renewing my domain name. The paper, with a London address on the back, looks like a bill and tries to scare the reader with “Your registration will expire on May 10, 2005. Act today!”


The issue is: it’s a scam. There is no need for me to renew now, and certainly not with DRoA. I know exactly who manages my domains and I am quite happy with them. But I can imagine this trick works quite well with people who have no clue how DNS works, or in accounting departments of companies. These guys have been fooling people since at least 2002, as the Domain Registry of America, of Canada, of Europe and of Australia.
Who’s behind it?

  • The letter I received has a UK address: 56 Gloucester Rd, Suite 526, SW7 4UB London. This happens to be the same address as Mail Boxes Etc, so it’s probably just a mail box.
  • The letter was shipped from “Jamaica, NY”, which seems to be the USPS post office in John F Kennedy Airport. The address on the DRoA site says “2316 Delaware Avenue #266 – Buffalo, New York”, which apparently is again a “Mail Boxes Etc” location.
  • A legal document of the Federal Trade Commission (Dec 2003) places them in Ontario, Canada. I found a “PO Box 4577, Markham, Ontario, L3R 5M7” address on their Canadian site. Again no ‘real’ address.
  • A search in Canada’s WHOIS comes up with: droc.ca was registered in Aug 2001 by a Mr Pearl Bitton. A search in Canada’s 411 phone directory gives us one Mr. P. Bitton in Thornhill, Ontario – about 15km away from Markham. I can’t be sure it’s him, of course.
  • The FTC final judgment of Dec 2003 mentions a Daniel Klemann, President of DRoA. Mr Klemann must be seeing a lot of court rooms, because he was already convicted in Canada in June 2002. So this man has been told twice by a judge that he should stop his practices, but still continues. What a hero. There is a D. Klemann living in Markham and in Toronto.
  • His partner in crime, James Tetaka, equally popular in courtrooms, was convicted for similar facts in May of 2004 and described as a Toronto-area man.
  • The company under which they operated is “1473253 Ontario Inc”, which is run by Peter Kuryliw, also from Toronto.
  • So they first ran a scam as Yellowbusiness.ca, then as Internet Registry of Canada (IRC), and now they are basically active all over the world. Global scumbags.

Other mentions of the scam:
asa.org.uk
compudave.blogspot.com
coofercat.com
domainavenue.com
domainregistrationtips.net
drbacchus.com
seowebsitepromotion.com
theregister.co.uk
webhostingtalk.com
wellho.net

Comment spam on my wwwcoder blog


Well, I knew it was gonna happen at some point: I got comment spammed on my winAdmin blog. Apparently someone has a script running that looks for blog posts that contain a link to another page and then posts the contents of the link’s destination as a comment. The purpose being, of course, to include a link to crappy medication sites as the homepage of the comment author (and boost their Google PR in that way). The content of the comment looks (and in most cases is) relevant to the blog post, so any Bayesian filters might mistake it for an on-topic comment.

Unfortunately, much as I love the Bayesian hammer, comment spam isn’t a nail. (…) There’s just nothing for a Bayesian filter to get a handle on: the only thing that’s spammy is the URL, so a Bayesian filter will actually be worse than just a blacklist, since you’ll inevitably get false positives trying to use a Bayesian filter on something where the actual text is completely insignificant.
(from rawbrick.net)

This intelligent job is apparently outsourced to India. Bahut Shoukriah!
Tracing route to dsl.Har.074.31.101.203.touchtelindia.net [203.101.31.74]
over a maximum of 30 hops:
(...)
17 428 ms 440 ms 438 ms 61.95.250.5
18 427 ms 422 ms 424 ms 203.101.83.198
19 * * 436 ms brasdel.touchtelindia.net [202.56.215.10]
20 504 ms 492 ms 490 ms dsl.Har.074.31.101.203.touchtelindia.net [203.101.31.74]

The traffic on this particular blog is limited, so I just switched off comments. For my other blogs, I use Blogger, and their redirection trick seems to make comment spamming useless (fingers crossed).