reCAPTCHA = altruism

The appeal of CMU’s reCAPTCHA spam prevention application isn’t the protection; it is the excavation of collective intelligence. Irresistible as that may be to a complex systems guy, I’m not going to install the plug-in on BlogSchmog. I don’t think we need the protection.


I heard about reCAPTCHA not long before Kynthia tried it out. Leveraging 150,000 hours of collective typing work each day for some other noble purpose—such as helping digitize books from the Internet Archive—is a great idea … provided the existing action is necessary in the first place. According to the reCAPTCHA application site, human wisdom is mined this way:

Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
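
The two-word scheme in that quote can be sketched in a few lines of Python. This is only a toy model of the description, not reCAPTCHA's actual code: the class name, the `min_votes` and `min_agreement` thresholds, and the sample words are all assumptions made up for illustration.

```python
from collections import Counter


class WordPairChallenge:
    """Toy model of the quoted two-word scheme (hypothetical, not reCAPTCHA's code)."""

    def __init__(self, known_answer, min_votes=3, min_agreement=0.75):
        self.known_answer = known_answer    # control word with a trusted answer
        self.votes = Counter()              # guesses collected for the unknown word
        self.min_votes = min_votes          # assumed threshold, not from the source
        self.min_agreement = min_agreement  # assumed threshold, not from the source

    def submit(self, control_guess, unknown_guess):
        """Return True if the user passes; record their guess for the unknown word."""
        if control_guess.strip().lower() != self.known_answer.lower():
            return False  # failed the control word, so the other guess is discarded
        self.votes[unknown_guess.strip().lower()] += 1
        return True

    def consensus(self):
        """Return the agreed reading of the unknown word, or None if not confident yet."""
        total = sum(self.votes.values())
        if total < self.min_votes:
            return None
        word, count = self.votes.most_common(1)[0]
        return word if count / total >= self.min_agreement else None
```

A person who gets the control word wrong is turned away and contributes nothing; everyone else casts one vote for the unreadable word, and the system only accepts a reading once enough people agree.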

That wonderful solution is offset by the increased nuisance that comes with forcing people to type words from images to prove they are human. It is an added barrier to reader comments, one that only addresses the problem of automated spam, not spam itself. Humans, after all, can easily pass a CAPTCHA test only to post something irrelevant.

Akismet, on the other hand, addresses spam after the post but before the publication.

Why we need spam protection
The motivation for CAPTCHA methods of verifying humanity is clear: most communication is spam. According to recent data from Message Labs, a messaging security and management service for businesses, three out of every four emails are spam. (That number is actually higher, at 83.6%, when considering the honeypot accounts the company sets up as an unprotected baseline.) While Message Labs says that July 2004 was the peak of email spam, Akismet statistics indicate we are experiencing the peak for spam commenting on blogs right now. Since the filtering tool began counting comments in November 2005, some 1,612,121,330 spam comments have been caught compared to 87,847,507 legitimate posts (read: about 95% of all blog comments are spam). That’s a lot of crap weighing down the system.
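
For anyone who wants to check that "about 95%" figure, the arithmetic is short. The two counts are the Akismet numbers quoted above; the variable names are just mine.

```python
# Akismet counts quoted in the post (November 2005 to the time of writing).
spam_caught = 1_612_121_330
legitimate = 87_847_507

# Share of all counted comments that were spam.
spam_share = spam_caught / (spam_caught + legitimate)
# Spam comments per legitimate comment.
spam_per_legit = spam_caught / legitimate

print(f"{spam_share:.1%} of comments were spam")
print(f"about {spam_per_legit:.0f} spam comments per legitimate one")
```

The share works out to a little under 95%, or roughly 18 spam comments for every real one.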

Message Labs spam threat
From the MessageLabs Intelligence Report: April 2007 (PDF)

Akismet stops the spam, not the spammer
Akismet leverages network intelligence by having blogs running the plug-in send all comments to a central database upon submission. If the post is a known bit of spam, it gets trapped and never bothers the blog author. Any mistakes (spam that gets through, or good comments labeled as spam) can be corrected by members on a case-by-case basis, helping to make the system smarter. A WordPress upgrade earlier this year added a link to double-check the comment queue in case any of those mistakes had already been corrected. Akismet is just one of many Automattic projects, now including an automatic stats plug-in for self-published WordPress authors.
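
Here is a rough sketch of that shared-database idea. Akismet's internals are not public, so everything here is an assumption for illustration: the `SharedSpamFilter` class, its method names, and especially the exact-match fingerprinting, which is far cruder than whatever the real service does. This is not the Akismet API.

```python
import hashlib


class SharedSpamFilter:
    """Toy stand-in for a central spam database (hypothetical, not Akismet's logic)."""

    def __init__(self):
        self.known_spam = set()  # fingerprints of comments reported as spam

    @staticmethod
    def fingerprint(comment):
        # Normalize case and whitespace so trivial edits map to the same key.
        normalized = " ".join(comment.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def is_spam(self, comment):
        """Check a newly submitted comment against everything reported so far."""
        return self.fingerprint(comment) in self.known_spam

    def report_spam(self, comment):
        """A blog author flags spam that slipped through."""
        self.known_spam.add(self.fingerprint(comment))

    def report_ham(self, comment):
        """A blog author rescues a good comment that was wrongly trapped."""
        self.known_spam.discard(self.fingerprint(comment))
```

The point of the sketch is the feedback loop: once one blog author reports a comment, every other blog checking the same database is protected from it, and a mistaken report can be undone the same way.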

In a recent interview of the WordPress guru by CNET’s Rafe Needleman, Matt Mullenweg talked about the inspiration for his spam filtering concept and why Akismet wasn’t an open source product. Mullenweg had noticed that the spam filters he had been writing had an effective life that was becoming shorter and shorter. With each new fix, spammers worked to beat it quickly. They had more resources and “fewer scruples” than he did, so Matt was looking for a different approach. When his mother said she was going to start blogging, the pressure to find an effective solution grew. “Spam tends to be some of the most vile stuff,” Mullenweg told Needleman, “and my mother is a fairly conservative Catholic woman. I thought, ‘Oh, God, this is what she’s going to think I do all day.’”

Akismet is proprietary only as a means to make it more effective against spammers. Making the source code available to everyone would make countermeasures easier to find, and Mullenweg doesn’t want to take that chance. In weighing the pros of opening the code against the benefits of thwarting spammers, he tips toward the latter. This is a difficult debate for an open source guru, but he is making other adjustments, too. WordPress has shifted to a 4-month development cycle, moving away from the practice of releasing incremental changes. He cites the benefit of having a million blogs running off of the hosted WordPress.com site, so the improvements are well debugged there before the official releases are announced.

Akismet works
The tool has drawn criticism, mainly for false positives that effectively eat legitimate comments. Aside from the primary problem of denying a reader a voice, there is also the related problem of assumption of blame. A blog reader isn’t going to think, “Oh, Akismet the Spam Filtering Application probably ate it” before assuming that the blog author deleted their feedback. BlogSchmog is only on the fringe of the blogosphere, so my experiences with the plug-in are quite different from those of others. Spam rarely makes it through the community filtering for me to see, and when something spammish does appear, it usually goes away when I re-check the queue. Every time I have checked the spam queue, there has been nothing but spam. (Captured comments aren’t deleted for 15 days, so if you scan the queue twice a month you should be able to keep it honest.)

Many blogs, like this one, have moderation enabled that requires at least one approved post before that person can freely comment. I used to force registration as a protection, but then came Akismet. Until this blog grows to a point where readers comment in long threads, I don’t think there is a reason to panic by adding CAPTCHA.

4 thoughts on “reCAPTCHA = altruism”

  1. Thanks for the update and sharing the link.

    Without knowing what is in the Akismet black box, I suspect that many false positives with spam comments are the result of blog authors marking legit comments they don’t want as spam instead of deleting. I think your solution seems reasonable to correct those kinds of mistakes, and further assume that method has been anticipated enough to keep spammers from effectively doing the same.
