Too Busy For Words - the PaulWay Blog

Fri 16th Feb, 2007

Comment spam eradication, attempt 2

Dave's Web Of Lies allows people to submit new lies, a facility that is of course abused by comment spammers. These cretins seem to not notice the complete absence of any linkback generation and the proscription of any text including the magic phrase http://. Like most spammers, they don't care if 100% of their effort is blocked somewhere, because it won't be blocked somewhere else. And there's no penalty for them brutalising a server: their botnets are just trawling away spamming continuously, leaving the spammers free to exploit new markets. It is vital to understand these two factors when considering how to avoid and, ultimately, eradicate spam.

For a while now, I've done a certain amount of checking that the lie submitted meets certain sanity guidelines that also filter out a lot of comment spam. In each case, the user is greeted with a helpful yet not prescriptive error message: for instance, when the lie contains an exclamation point the user is told "Your lie is too enthusiastic". (We take lying seriously at Dave's Web Of Lies.) This should be enough for a person to read and deduce what they need to do to get a genuine lie submitted, but not enough for a spammer to work out quickly what characters to remove for their submission to get anywhere. Of course, this is violating rule 1 above: spammers don't care if any number of messages get blocked, so long as one message gets through somehow.

This still left me with a healthy chunk of spam to wade through and mark as rejected. This also fills up my database (albeit slowly), and I object to this on principle. So I implemented a suggestion from someone's blog: include a hidden field called "website" that, when filled in, indicates that it's from a spammer (since it's ordinarily impossible for a real person to fill any text in the field). Then we silently ignore this field. No false positives? Sounds good to me.

Initial indications, however, were that it was having no effect. I changed the field from being hidden to having the style property "display: none", which causes any modern browser to not display it, but since this was in the stylesheet a spammer would have no real indication just by scraping the submit page that this field was not, in fact, used. This, alas, also had no effect. I surmised that this was probably because the form previously had no 'website' field and spammers were merely remembering what forms to fill in where, rather than re-scraping the form (though I have no evidence for this). Pity.

So my next step was to note that a lot of the remaining spam had a distinctive form. The 'lie' would be some random comment congratulating me on such an informative and helpful web site, the 'liar' would be a single word name, and there was a random character or two tacked on the lie to make it unlikely to be exactly the same as any previous submission. So I hand-crafted a 'badstarts.txt' file and, on lie submission, I read through this file and silently ignore the lie if it starts with a bad phrase. Since almost all of these are crafted to be such that no sane or reasonable lie could also start with the same words, this reduces the number of false positives - important (in my opinion) when we don't tell people whether their submission has succeeded or failed.

Sure enough, now we started getting rejected spams. The file now contains about 36 different phrases. I don't have any statistics on how many got through versus how many got blocked, but that's just a matter of time... And I'm probably reinventing some wheel somewhere, but it's a simple thing and I didn't want to use a larger, more complex but generalised solution.

I'd be willing to share the list with people, but I won't post the link in case spammers find it.

I really want to avoid a captcha system on the Web Of Lies. I like keeping Dave's original simplistic design, even if there are better, all-text designs that I could (or perhaps should) be using.

Last updated: | path: tech / web | permanent link to this entry


All posts licensed under the CC-BY-NC license. Author Paul Wayper.


Main index / tbfw/ - © 2004-2023 Paul Wayper
Valid HTML5 Valid CSS!