Archive for the 'spam' Category


Spam economics: government role

The Belgian Minister of Economy, Marc Verwilghen, recently announced the efforts the Belgian government would take to restore trust in the Internet as a way of doing business. This includes a directory of trustworthy online shops (e.g. in the travel business), but also some efforts to reduce spam. On the site spamsquad.be the following 4 basic rules are described to avoid spam: 1) don’t leave your email address, 2) don’t answer dubious emails, 3) camouflage your email address and 4) protect your computer.

As I said before, I think they forget one important detail, the main reason spam exists: “Don’t be an ass”.

  • Don’t buy your V*agra from people who can’t even spell the drug’s name right
  • If you buy the product that would make you “SCARE PEOPLE WITH YOUR HUGE C*CK”, how exactly is that going to help you?
  • Don’t give money to a perfect stranger who needs you to help recover X million from his father, the recently deceased president of some banana republic.
  • Don’t buy an economics degree on-line, you’re only proving you’re not worth it.

SPAM ECONOMICS
If you try to distill the spammer’s logic into a simple formula, you get this:

P$ = [N * (I% * S% * W% * B% * M$)] – (N * E$) – (L% * C% * R$)
where
P$ = profit, bottom-line

N = number of emails sent (can be millions!)
I% = % of addresses that are valid/correct
S% = % of addresses that are not intercepted by anti-spam software
W% = % of emails to cause the receiver to go visit the website
B% = % of site visitors that actually buy the product
M$ = margin per product sold

E$ = cost of sending 1 email

L% = risk of having legal action taken against you
C% = risk of getting convicted when you’re in court
R$ = average fine you would have to pay
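
To get a feel for how the parameters interact, the formula can be written as a small function. This is only a sketch: the function and argument names are my own, percentages are passed as fractions between 0 and 1, and the input values are invented for illustration, not measured data.

```python
def spam_profit(n, valid, delivered, visit, buy, margin,
                cost_per_email, p_legal, p_convicted, fine):
    """P$ = N*(I%*S%*W%*B%*M$) - N*E$ - L%*C%*R$, rates given as fractions."""
    revenue = n * valid * delivered * visit * buy * margin
    sending_cost = n * cost_per_email
    expected_fine = p_legal * p_convicted * fine
    return revenue - sending_cost - expected_fine


# Made-up inputs, purely to show how the terms combine.
example = spam_profit(
    n=5_000_000,             # N: emails sent
    valid=0.5,               # I%: valid addresses
    delivered=0.2,           # S%: not intercepted by anti-spam software
    visit=0.01,              # W%: receivers who visit the site
    buy=0.01,                # B%: visitors who buy
    margin=20.0,             # M$: margin per product sold
    cost_per_email=0.00001,  # E$: cost of sending 1 email
    p_legal=0.01,            # L%: risk of legal action
    p_convicted=0.5,         # C%: risk of conviction
    fine=10_000.0,           # R$: average fine
)
print(round(example, 2))
```

With these invented numbers the revenue term is 1000, the sending cost 50 and the expected fine 50. Because W% and B% multiply the whole revenue term, halving either one halves the revenue while the costs stay the same.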

The parameters I%, S% and E$ are defined by technology, and government should not meddle with that. Spam detection technology is a very active line of research, and new products and services are coming out all the time. Yahoo, Microsoft, IETF, … are trying to reshape email so that sending email to 5 million addresses isn’t so darn easy, but again, these issues are technical; we don’t need any minister to tell us about or buy us a solution. L%, C% and R$, on the other hand, are very much things that should be dealt with on a national level: law-making and law enforcement. But I doubt that many of the big spammers are Belgian, so there is little the Belgian government can do about that.

SPAM-EDUTAINMENT
The main focus of this country should be on reducing W% (website conversion) and B% (buyer conversion), the ‘naïveté’ parameters, and the weapon of choice there is education. The Belgian federal agency Fedict has already done a fine job by launching peeceefobie.be, a consumer-oriented portal on PC security with some good advice on spam mail (in Dutch). But to reach Average Joe and Jane, they should use TV and radio. I would like to see an entertaining program on Internet security that teaches people the PC security basics and that has humorous sketches like In De Gloria. I would like to hear a program on Internet crime in the Sample Minds style.

If someone drives a gasoline car and fills it with diesel, he will be made fun of, because you’re supposed to know these things when you have a car. The same should happen with someone who lost money in an on-line scam. Invest $500 and get $50,000 from a dyslexic Russian dude who won’t disclose anything but a Hotmail address? Come on, you fell for that one?


Detecting Blogspot splogs the Bayesian way


SHORT HISTORY OF SPLOGS

SPLOG SOLUTIONS
There are several strategies to make life harder for sploggers, while not making it too hard for regular bloggers:

Prevent the (automatic) creation of a splog site:

  • CAPTCHA SYSTEMS: useless. In the end, one can always outsource the creation of blogs to ‘real people’ in India at less than $1/blog.
  • EMAIL CONFIRMATION: can be too easily automated.
  • NO MORE FREE BLOGS: ask a fee (per month, like Typepad) or a kind of deposit (e.g. $50, returned when you stop blogging or confiscated when you run a splog).
  • THOROUGH ID CHECK: so you can trace splogs back to the actual user. Not realistic, I’m afraid. Anonymity can sometimes be a positive thing.

Prevent the (automatic) posting of splog items:

  • NO MORE API: no posting via email, REST, SOAP, XML-RPC, Atom, … That’s throwing out the baby with the bathwater.
  • CAPTCHA SYSTEMS: require human interaction for each post. That is OK for web-based editors, but what about the APIs above?
  • CHECK CONTENT: process the content of each post and try to decide if it’s splog-like. See the chapter below on Bayesian Detection.

Detect splogs when they are created:

  • AUTOMATED DETECTION: using systems similar to the ones now employed for email spam detection. See the chapter below on Bayesian Detection.
  • GROUP EFFORT: projects like splogreporter.com, Blogger’s “flag as objectionable”, Google Adsense abuse report. Maybe an initiative such as “Razor” for splogs?

Prevent the benefits of splog usage:

  • PAGERANK 0: when in doubt on whether a blog is a splog, reset the PageRank to 0.
  • REL NOFOLLOW: make all links worthless for search engines. This is an easier decision to take than deleting the whole blog (because if you delete a blog that’s not a splog …).
  • REDIRECT SCRIPT: replace all links to www.example.com with e.g. www.blogger.com/r?url=www.example.com, so that the redirects work but do not transfer Google juice. The ‘old’ way of doing rel=nofollow.
  • EDUCATION: there are far too many gullible people around who actually buy bogus stuff on the web. No one should leave high school without knowing what SPAM is. You do not buy V*AGRA on the web!!
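
The REL NOFOLLOW and REDIRECT SCRIPT options above both come down to rewriting the links in a post before it is published. Below is a rough sketch of the two rewrites. The redirect prefix is the hypothetical `www.blogger.com/r?url=` endpoint from the example (with an `http://` scheme added as my own assumption), and the regular expression only handles plain double-quoted `href` attributes, so a real implementation would want a proper HTML parser.

```python
import re
from urllib.parse import quote

# Hypothetical redirect endpoint, taken from the example in the text.
REDIRECT_PREFIX = "http://www.blogger.com/r?url="

# Matches only simple double-quoted absolute links; a deliberate simplification.
HREF = re.compile(r'href="(https?://[^"]+)"')


def via_redirect(html):
    """Point every absolute link at the redirect script instead of the target."""
    def repl(match):
        target = quote(match.group(1), safe="")  # percent-encode the whole URL
        return 'href="' + REDIRECT_PREFIX + target + '"'
    return HREF.sub(repl, html)


def add_nofollow(html):
    """Append rel="nofollow" right after every matched href attribute."""
    return HREF.sub(lambda m: m.group(0) + ' rel="nofollow"', html)


post = '<a href="http://www.example.com/">cheap car loans</a>'
print(via_redirect(post))
print(add_nofollow(post))
```

Neither function checks whether a link was already rewritten or already carries a `rel` attribute, so each should be applied only once per post.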

BAYESIAN DETECTION OF SPLOGS
Kailash Nadh has already started a project on this, but let’s take it one step further. I’d like to go back to the beginning of email spam fighting: most of the efforts started after an article by Paul Graham: “A plan for spam”. He proposed Bayesian Filtering as a better method for detecting spam than using whitelists/blacklists. In his follow-up article, he says:

I don’t think it’s a good idea to treat spam filtering as a straight text classification problem. You can use text classification techniques, but solutions can and should reflect the fact that the text is email, and spam in particular. Email is not just text; it has structure.

He then suggests treating the typical email fields: sender, destination, subject, date differently from the email body.
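
For readers who have not seen it, the core of this kind of Bayesian filtering can be sketched in a few lines: estimate a spam probability per token from how often it appears in known-bad and known-good samples, then combine the token probabilities. This is a simplified sketch in the spirit of that approach, not a faithful copy of Graham’s code. The function names, the clamping bounds and the 0.4 default for unseen tokens are assumptions chosen for illustration.

```python
import math


def token_spam_prob(bad_count, good_count, n_bad, n_good):
    """Rough per-token spam probability, clamped away from 0 and 1."""
    if bad_count + good_count == 0:
        return 0.4  # assumption: unseen tokens lean slightly innocent
    b = min(1.0, bad_count / max(1, n_bad))    # share of bad samples with token
    g = min(1.0, good_count / max(1, n_good))  # share of good samples with token
    return max(0.01, min(0.99, b / (b + g)))


def combine(probs):
    """Naive Bayes combination: prod(p) / (prod(p) + prod(1 - p))."""
    # Work in log space so that many tokens do not underflow to zero.
    log_spam = sum(math.log(p) for p in probs)
    log_ham = sum(math.log(1.0 - p) for p in probs)
    diff = log_ham - log_spam
    if diff > 700:  # exp() would overflow; the result is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


probs = [
    token_spam_prob(40, 1, 100, 100),  # token seen mostly in bad samples
    token_spam_prob(2, 30, 100, 100),  # token seen mostly in good samples
    token_spam_prob(0, 0, 100, 100),   # unseen token
]
print(combine(probs))
```

The inputs to `combine` must lie strictly between 0 and 1, which the clamping in `token_spam_prob` guarantees.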

Let’s try to pinpoint the specifics of a blog (they are not the same as emails):

  • a blog has a title, url, meta tags, and HTML content (which does not appear in the RSS/Atom feed) 
  • a blog post has a title, url, date, meta tags and HTML content (which does appear in the feed)
  • the HTML of the blog or of each post can contain links (<a href="http://www.example.com">)

So intuitively I suspect the indicators of a splog will be (using car-deals-and-more.blogspot.com as example):

Keyword stuffing 

some words will appear over and over again, with small variations. In our example: over 3% of all words are “car”, 1.7% are “loan” and 1.09% are “credit”. Keywords in the URLs or in the domain name are an even stronger indication of a splog. 

Nonsense content 

some sites use really dumb search engine scraping to generate content, which fills the posts with script URLs and lists of keywords without any grammatical construction. 

Repetitive linking 

since a splog will be started to promote one product or site, the links will typically point to the same site and maybe to other ‘joined’ splogs. (in our example: always to www.car-financing.online-auto-center.info, a 41-character(?!) domain name) 

Post frequency 

Since recency and frequency are so important for a search engine, the splog will have many posts per day. If the owner is an idiot, they will appear every time at 2 minutes after the hour, but presumably the timing will be something more random. One might do an RFM analysis on this.
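
Two of these indicators, keyword stuffing and repetitive linking, are easy to measure directly. The sketch below computes the share of each of the most frequent words in a text and the share of links pointing at the single most-linked host. The function names and the crude tokenizer are my own choices, and what counts as “too high” is left open.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def keyword_density(text, top=3):
    """Return the `top` most frequent words with their share of all words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return []
    counts = Counter(words)
    return [(word, count / len(words)) for word, count in counts.most_common(top)]


def link_concentration(urls):
    """Share of links that point at the single most-linked host."""
    hosts = [h for h in (urlparse(u).netloc.lower() for u in urls) if h]
    if not hosts:
        return 0.0
    return Counter(hosts).most_common(1)[0][1] / len(hosts)


print(keyword_density("car loan car credit car"))
print(link_concentration([
    "http://a.example/deals",
    "http://a.example/loans",
    "http://b.example/",
]))
```

A regular blog would normally show function words such as “the” at the top of the density list and a spread of link targets, so it is the combination of a topical keyword at the top and a link concentration near 1 that looks suspicious.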

So one would have to adapt the Bayesian model as follows:
- add BLOGTITLE*whatever for each word in the blog title
- add BLOGPAGE*whatever for each word in the blog URL or domain (separated by “.”, “-” or “_”)
- add POSTTITLE*whatever for each word in the post title
- add POSTPAGE*whatever for each word in the post page URL (separated by “.”, “-” or “_”)
- add LINKURL*http://…/path/to/page.html for each link
- add LINKDOMAIN*http://www.example.com for each link
- calculate a POSTFREQ (posts/day stat) and POSTFREQDEV (standard deviation)
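
The list above could be turned into tokens roughly as follows. This is a sketch under my own assumptions: the input layout (a dict per post with `title`, `url` and `links`) is hypothetical, URL words are split on every non-alphanumeric character rather than only on “.”, “-” and “_”, and days without posts between the first and last post count as zero when computing the posts/day statistics.

```python
import re
from collections import Counter
from datetime import date, timedelta
from statistics import mean, pstdev
from urllib.parse import urlparse

WORD = re.compile(r"[a-z0-9]+")


def words(text):
    return WORD.findall(text.lower())


def url_words(url):
    parsed = urlparse(url)
    return words(parsed.netloc + " " + parsed.path)


def blog_tokens(blog_title, blog_url, posts):
    """posts: list of dicts with 'title', 'url' and 'links' (hypothetical layout)."""
    tokens = ["BLOGTITLE*" + w for w in words(blog_title)]
    tokens += ["BLOGPAGE*" + w for w in url_words(blog_url)]
    for post in posts:
        tokens += ["POSTTITLE*" + w for w in words(post["title"])]
        tokens += ["POSTPAGE*" + w for w in url_words(post["url"])]
        for link in post["links"]:
            parsed = urlparse(link)
            tokens.append("LINKURL*" + link)
            tokens.append("LINKDOMAIN*" + parsed.scheme + "://" + parsed.netloc)
    return tokens


def post_frequency(post_dates):
    """Return (POSTFREQ, POSTFREQDEV): mean and std deviation of posts per day."""
    if not post_dates:
        return 0.0, 0.0
    per_day = Counter(post_dates)
    first, last = min(post_dates), max(post_dates)
    span = (last - first).days + 1
    # Counter returns 0 for days on which nothing was posted.
    counts = [per_day[first + timedelta(days=i)] for i in range(span)]
    return mean(counts), pstdev(counts)


tokens = blog_tokens(
    "Car Deals",
    "http://car-deals-and-more.blogspot.com/",
    [{
        "title": "Cheap loan",
        "url": "http://car-deals-and-more.blogspot.com/2005/10/cheap-loan.html",
        "links": ["http://www.example.com/path/to/page.html"],
    }],
)
freq, dev = post_frequency([date(2005, 10, 1), date(2005, 10, 1), date(2005, 10, 3)])
print(tokens)
print(freq, dev)
```

Prefixing each word with its field keeps “car” in a blog title a different token from “car” in a post URL, so the filter can learn a separate weight for each position.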

The Bayesian Filtering would then place a ‘splog probability’ on each blog and a company like Blogger could set up a weekly scan of each blog and do the following:
SPLOGPROB > 99%: automatically disable all outgoing links, send warning to owner, if no positive reply after 7 days: delete blog
SPLOGPROB > 95%: automatically disable all outgoing links, send warning to owner, test again after 2 days
SPLOGPROB > 90%: add rel=”nofollow” to each link – add captcha to posting – next control in 1 week
SPLOGPROB > 80%: add captcha to posting – next control in 1 week
SPLOGPROB > 70%: do nothing – next control in 1 week
SPLOGPROB < 70%: do nothing – next control in 1 month
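
These tiers map naturally onto a small lookup table. The sketch below is one possible encoding: the action labels are made up, “1 month” is approximated as 30 days, and a probability of exactly 70% (which the list leaves undefined) falls into the lowest tier.

```python
# (threshold, actions, days until the next check); action labels are hypothetical.
TIERS = [
    (0.99, ["disable_links", "warn_owner", "delete_if_no_reply"], 7),
    (0.95, ["disable_links", "warn_owner"], 2),
    (0.90, ["nofollow_links", "captcha_on_post"], 7),
    (0.80, ["captcha_on_post"], 7),
    (0.70, [], 7),
]


def splog_action(prob):
    """Map a splog probability (0.0 to 1.0) to (actions, recheck_days)."""
    for threshold, actions, recheck_days in TIERS:
        if prob > threshold:
            return actions, recheck_days
    return [], 30  # assumption: "1 month" taken as 30 days


for p in (0.995, 0.97, 0.85, 0.5):
    print(p, splog_action(p))
```

Keeping the thresholds in one table makes it cheap to tune them later, once there is data on how many regular blogs end up in each tier.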
I’ll probably refine this model in the future.
All remarks welcome!


Old-style Nigerian scam: via fax


Amazing: I just got my first Nigerian (419) scam via FAX! In these days of practically free email sending, you have to admire someone who goes the extra mile and pays for sending faxes. A handwritten letter would have made me feel even more special, but it’s a start.

A Mister Victor Abbor (vicabbor2@yahoo.com or vicabbor@sify.com) from Lagos has picked me as a candidate for transferring a substantial amount of money ($17.6 million), formerly belonging to Mr. Akbar Ali, who unfortunately died in the Benin plane crash on Dec 25, 2003. I have to pose as a relative of Mr. Ali (I do have a slight tan from one week in Portugal, so I should pass easily for an African). In return, he would only take a meager 65% and leave me with the remaining 35% (what, no taxes?), which amounts to … $6.16 million, or €4.7 million. A day’s work for a day’s pay.

Victor refers to articles on CNN.com: Benin crash, and Pravda.ru: Bloody Christmas. Obviously, the fact that he knows these articles should be enough proof that he is indeed capable of touching the money. Funnily enough, just after asking for my personal details, he signs the letter as Paul Udo. I’m confused now. Is Udo Abbor’s secretary or something? While “Victor Abbor” has no Google hits yet, “Paul Udo” does have prior activity: on nigerian419fraud.freeserve.co.uk he appears to have promised 30% of 32 million to someone else. I suddenly don’t feel that special anymore.

The fax is supposedly sent from +234-7594683 (+234 is Nigeria’s country prefix)… Nah, I don’t think I’m gonna do it. If anyone is interested in doing business with Victor/Paul, just let me know. Just browse through The Spamletters first to get a feel for how to do business the Nigerian way.

Googlistics: messing with the big “G”


He probably also thought at first that it was an April Fool’s joke:
Matt Mullenweg from WordPress was discovered to have used his PageRank 8 site (WordPress is popular open-source blogging software) for hosting lots of irrelevant content, in order to score high in Google rankings and (let a customer of his) make money on Google Adsense.

The content in articles is essentially advertising by a third party that we host for a flat fee. I’m not sure if we’re going to continue it much longer, but we’re committed to this month at least, it was basically an experiment. However around the beginning of Feburary donations were going down as expenses were ramping up, so it seemed like a good way to cover everything. The adsense on those pages is not ours and I have no idea what they get on it, we just get a flat fee. The money is used just like donations but more specifically it’s been going to the business/trademark expenses so it’s not entirely out of my pocket anymore.
(from wordpress.org)

Andy Baio (Waxy) broke the news on March 30th, at a moment when Matt was on holiday (and off-line), so he only replied on April 1st, about a thousand angry emails later. His defense is that it was an interesting idea, badly implemented, not followed up and never evaluated. Since Matt does not have the profile of a cash-hungry opportunist, and he’s explaining this to an audience of people who understand these reasons (it reads like an IT project management what-not-to-do list), the storm will probably blow over.

Normally this is the kind of situation where one would say: “SEO? Leave that to the professionals!”. But the fact is that here in Belgium, some of the companies that claim to be SEO specialists use dirty tricks all over the place. Hidden links, bot cloaking, keyword spamming, <noscript> tricks, the whole shebang. It’s like they read the Google SEO warning page as a guideline. “Hey look! We could put ourselves and other customers on every client’s doorway pages. Neat!”.

Joris just posted another example (Immoweb.be) on his SEO blog. And again the so-called “SEO professional” fooling around is Extenseo, just as it was for Automagazine and Actel. As one can see on their unprotected Javascript hosting site, they recently added VW/MyWay to their customers, so we can expect those homepages to be featured in the Hoe Het Niet Moet (What Not To Do) series soon!
