February 8th, 2008


Merry Christmas and happy Easter

I've found out what those gibberish spams that don't seem to be selling anything are for: they're designed to poison spam filters' corpuses.

The idea is this: there's too much spam, so you write a program that filters it out, known, oddly enough, as a spam filter. Trouble is, spam doesn't come conveniently marked as such, so you need some pretty hairy techniques to try and recognise it.

There are various techniques, and one of the most popular starting points is called Bayesian filtering. I won't go into the maths here, but basically it's a statistical system which you "prime" by showing it a load of spam and saying "hey, filter, this stuff is evil - I want you to get rid of it", and then a load of real email (sometimes called ham, by analogy with spam) and saying "this is the good stuff. I want to keep it."

The Bayesian filter then attempts to work out phrases, patterns, and styles which distinguish spam from ham. Done right, it works well. My GMail account gets fewer than ten spams a week in the inbox, but several hundred a day go straight into the spam bucket.
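For the curious, here is a minimal sketch of the idea in Python. It is a toy naive Bayes filter that only counts single words, with add-one smoothing so unseen words don't break the maths. The class name, the training messages, and the "log odds above zero means spam" cut-off are all my own assumptions for illustration; real filters (GMail's included) are certainly far more elaborate.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy word-count Bayesian filter (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Crude tokenizer: lowercase, split on whitespace.
        return text.lower().split()

    def train(self, text, label):
        # "Prime" the filter with one example; label is "spam" or "ham".
        self.word_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def spam_log_odds(self, text):
        spam_words = self.word_counts["spam"]
        ham_words = self.word_counts["ham"]
        vocab_size = len(set(spam_words) | set(ham_words))
        spam_total = sum(spam_words.values())
        ham_total = sum(ham_words.values())
        # Prior: how common spam is compared with ham (add-one smoothed).
        score = math.log(
            (self.message_counts["spam"] + 1) / (self.message_counts["ham"] + 1)
        )
        for word in self.tokenize(text):
            # Add-one smoothing keeps unseen words from giving log(0).
            p_spam = (spam_words[word] + 1) / (spam_total + vocab_size + 1)
            p_ham = (ham_words[word] + 1) / (ham_total + vocab_size + 1)
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, text):
        # Positive log odds: the message looks more like spam than ham.
        return self.spam_log_odds(text) > 0


spam_filter = NaiveBayesFilter()
spam_filter.train("buy cheap pills now", "spam")
spam_filter.train("cheap pills cheap watches", "spam")
spam_filter.train("lunch at noon tomorrow", "ham")
spam_filter.train("meeting notes attached", "ham")

print(spam_filter.is_spam("cheap pills"))
print(spam_filter.is_spam("lunch meeting tomorrow"))
```

With only four training messages this is obviously a caricature, but it shows the shape of the thing: the filter has no idea what a pill or a meeting is, it just knows which words turned up more often in which pile.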

Yes, you read it right - several hundred. And in the three years I've had my GMail account, I've had one false positive (that I've been aware of, admittedly).

If you want to gauge how impressive it is that a computer can do that kind of filtering, think of it like this: imagine you're explaining to a small child, who has absolutely no grasp of context or the outside world, how to tell the difference between a trustworthy adult and a potential child-molester.

Difficult, huh? That's why we tell children, whose brains are millions of times more powerful than any computer, "don't talk to strangers". Kids just can't make that kind of quality judgement. So it's a feat of science that computers can, fairly reliably, distinguish spam from ham.

That sidetrack aside, for these filters to work, they need to be fed a large quantity of ham and spam to get them "primed". Also, systems like GMail are paying attention when you click "Report Spam": when you do that, the mail gets fed into the filter as an example of spam. When enough people do that, the spam starts getting blocked. GMail also takes some of the emails that you don't report as spam, and feeds them into the filter as examples of ham (the filter needs examples of both spam and ham so it can establish the differences).
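To make that feedback loop concrete, here is a rough Python sketch. Everything in it is hypothetical: the `SharedSpamCorpus` class, the report threshold of 3, and the trick of fingerprinting a message by hashing its normalised text are my guesses at one way such a thing could work, not a description of what GMail actually does.

```python
import hashlib
from collections import Counter

REPORT_THRESHOLD = 3  # hypothetical: reports needed before a mail is blocked


class SharedSpamCorpus:
    """Toy model of a collaborative "Report Spam" button (illustrative only)."""

    def __init__(self, threshold=REPORT_THRESHOLD):
        self.threshold = threshold
        self.reports = Counter()   # fingerprint -> number of reports
        self.spam_examples = []    # fed to the filter as spam
        self.ham_examples = []     # fed to the filter as ham

    @staticmethod
    def fingerprint(body):
        # Normalise case and whitespace so trivial variations collide.
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def report_spam(self, body):
        # A user clicked "Report Spam": count it and keep it as a spam example.
        self.reports[self.fingerprint(body)] += 1
        self.spam_examples.append(body)

    def keep_as_ham(self, body):
        # A mail the user did not report becomes an example of ham.
        self.ham_examples.append(body)

    def is_blocked(self, body):
        # Block once enough separate reports have arrived.
        return self.reports[self.fingerprint(body)] >= self.threshold


corpus = SharedSpamCorpus()
corpus.keep_as_ham("Lunch at noon tomorrow?")
for _ in range(3):
    corpus.report_spam("Buy CHEAP pills   now")

print(corpus.is_blocked("buy cheap pills now"))
print(corpus.is_blocked("Lunch at noon tomorrow?"))
```

The point is only that the reports do double duty: they block the specific mail once enough people complain, and they top up the pile of spam examples the filter learns from.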

The collection of emails used as examples of spam and ham is known as the corpus, from the Latin, meaning "body".

These gibberish emails, which are composed of grammatically sound English (or whatever language), get reported as spam and end up in the spam corpus, which then gets diluted with random crap.

In other words, by sending spams that only look like spam because they have no useful content, the spammers weaken the filters' ability to tell the one from the other.
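Here is a small Python sketch of that dilution at work, using the same kind of toy word-counting Bayesian score with add-one smoothing. The function name and all the sample messages are made up for illustration, and a real filter would be much harder to fool than this one.

```python
import math
from collections import Counter


def spam_log_odds(message, spam_corpus, ham_corpus):
    """Toy naive Bayes score: above zero leans spam, below zero leans ham."""
    spam_words = Counter(w for m in spam_corpus for w in m.lower().split())
    ham_words = Counter(w for m in ham_corpus for w in m.lower().split())
    vocab_size = len(set(spam_words) | set(ham_words))
    spam_total = sum(spam_words.values())
    ham_total = sum(ham_words.values())
    # Prior from the number of messages in each corpus (add-one smoothed).
    score = math.log((len(spam_corpus) + 1) / (len(ham_corpus) + 1))
    for word in message.lower().split():
        p_spam = (spam_words[word] + 1) / (spam_total + vocab_size + 1)
        p_ham = (ham_words[word] + 1) / (ham_total + vocab_size + 1)
        score += math.log(p_spam / p_ham)
    return score


spam = ["buy cheap pills now", "cheap watches buy now"]
ham = [
    "see you at the meeting tomorrow",
    "the report is attached for the meeting",
]
# Hypothetical gibberish spams: ordinary words, nothing for sale.
gibberish = [
    "the meeting report is attached tomorrow",
    "see the report at the meeting",
]

legit = "the meeting report is attached"
pitch = "buy cheap pills"

legit_before = spam_log_odds(legit, spam, ham)
legit_after = spam_log_odds(legit, spam + gibberish, ham)
pitch_before = spam_log_odds(pitch, spam, ham)
pitch_after = spam_log_odds(pitch, spam + gibberish, ham)

print(f"legit mail: {legit_before:.2f} -> {legit_after:.2f}")
print(f"sales pitch: {pitch_before:.2f} -> {pitch_after:.2f}")
```

Once the gibberish is reported and lands in the spam pile, the legitimate mail's score creeps up towards the spam side while the real sales pitch's score sinks towards the ham side. Both drift towards the middle, which is exactly the muddying the spammers are after.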

Luckily, collaborative systems like GMail pretty quickly filter out even the gibberish mails.