Spamvertized URL/Domain Realtime Blacklist

What is it?

The Spamvertized Domain Realtime Blacklist project provides a DNS-based realtime blacklist (RBL) that helps detect the presence of spammer domain names in the text (i.e. the subject or body) of an email. Their presence indicates a likelihood that the email is "spamvertizing": advertizing through the use of unsolicited bulk/commercial email.

As of 2004-05-07, the blacklist is operating under sd.sdbl.org (a DNS zone accepting signed dynamic updates) with some sample (real!) spamvertised domains and a patched SpamAssassin 2.63. The SpamAssassin changes consist of a crude patch and a module to hunt for domains.

How it works

  1. An email is received and scanned for domain names, perhaps via the SpamAssassin software or other mail filter software.
  2. The list of domains is then cropped back to their second- or third-level domains, e.g. zxcf.spamdomain.com becomes spamdomain.com and hjsde.spamdomain.co.uk becomes spamdomain.co.uk.
  3. Each domain is then prepended to sd.sdbl.org and a DNS lookup is performed on the result, e.g. a query for the A record for spamdomain.co.uk.sd.sdbl.org.
  4. If the record is present, an IP address (e.g. 127.0.0.2) is returned to indicate that the domain is in the blacklist.
  5. What happens after this is up to the querying software: some may reject the mail outright at this point, while SpamAssassin would increase the mail's spam score.
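
Steps 2 to 4 can be sketched roughly as follows. This is an illustrative Python sketch, not the SpamAssassin patch itself: the function names, the tiny TWO_PART_ENDINGS table and the injectable resolve parameter are all assumptions made for the example, and only the zone name sd.sdbl.org comes from the description above.

```python
import socket

BLACKLIST_ZONE = "sd.sdbl.org"  # zone named in the text

# Hypothetical, deliberately tiny table of two-part endings; a real
# filter would need a much fuller one.
TWO_PART_ENDINGS = {"co.uk", "org.uk", "com.au"}


def crop_domain(hostname):
    """Step 2: reduce a hostname to its second- or third-level domain."""
    labels = hostname.lower().rstrip(".").split(".")
    keep = 3 if ".".join(labels[-2:]) in TWO_PART_ENDINGS else 2
    return ".".join(labels[-keep:])


def query_name(hostname, zone=BLACKLIST_ZONE):
    """Step 3: prepend the cropped domain to the blacklist zone."""
    return f"{crop_domain(hostname)}.{zone}"


def is_listed(hostname, resolve=socket.gethostbyname):
    """Step 4: treat the domain as listed if the A record lookup succeeds."""
    try:
        resolve(query_name(hostname))
    except OSError:  # socket.gaierror (e.g. no such name) is a subclass
        return False
    return True
```

Step 5 is left to the caller: a filter might reject the mail when is_listed(...) is true, or merely add to its score. The resolve parameter exists only so the lookup can be swapped for a stub when trying the sketch without network access.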

Why do it this way?

Spam detection employs a substantial arsenal of techniques: looking for the presence, or probability of presence, of certain words, phrases, character sequences, sender IP addresses, and so on. While effective to a degree ("viagra" rarely comes up in routine email conversation), these techniques don't address the fundamental aspect of spam, which is that it is advertizing. This is the spammers' weak point: an advertisement ultimately requires a conduit, typically for a transaction. Without it the reader has no means to execute that transaction, and the spammer has no way to make money, have their political views expounded upon, etc. With spam, that conduit is either a phone number or, far more commonly, a website.

So we maintain a list of domains that have been verified as being associated with unsolicited bulk email and a mechanism for quickly looking them up.

Notes

Domains v. URLs

One obvious point is that using this scheme blacklists entire domains. So why not, say, encode a full URL (or, heavens, URI) and have the system query for that? Quite simply it seems that the vast majority of spamvertised domains are not URL-specific. That is to say, the entire domain is hosting only spamvertised products. The differences in URLs typically fall into two classes: randomized subdomains (presumably to defeat exactly this process), and URLs that encode affiliate information.

Algorithm for finding domains

It essentially looks for any domain-string of four or more characters (i.e. [-a-z0-9]{4,}) followed by one of the common top/second level endings, e.g. co.uk, biz, etc. This method has some advantages over the full URI scanning methods, which look for a leading http:// or www. prefix. Firstly, it's faster. Secondly, it finds more domains: I've seen a lot of examples of "please paste this into your browser: cheapassmeds.biz", which would be missed otherwise. Thirdly, it gets around the problem of redirect URLs whose base (e.g. rd.yahoo.com) is a legitimate domain.
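
As a rough illustration of this kind of scan, here is a minimal Python sketch. Only the character class [-a-z0-9]{4,} comes from the description above; the ENDINGS list is a short hypothetical stand-in for the real table of endings, and find_domains is a name invented for the example.

```python
import re

# Hypothetical, short list of endings; the real list is not given here.
# "co.uk" is listed before "com" only for readability; the trailing
# lookahead is what stops "com" matching inside a longer label.
ENDINGS = ["co.uk", "com", "net", "org", "biz", "info"]

DOMAIN_RE = re.compile(
    r"[-a-z0-9]{4,}\.(?:"
    + "|".join(re.escape(ending) for ending in ENDINGS)
    + r")(?![-a-z0-9])",
    re.IGNORECASE,
)


def find_domains(text):
    """Return the distinct domains found in text, in order of appearance."""
    found = []
    for match in DOMAIN_RE.finditer(text):
        domain = match.group(0).lower()
        if domain not in found:
            found.append(domain)
    return found
```

Because the pattern needs no http:// or www. prefix, find_domains picks up a bare "cheapassmeds.biz" in running text. For a redirect URL it reports the domains on both sides of the redirect, so the spamvertised target is still looked up even though the legitimate base is not in the blacklist.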

Implementation

There's then a hack to check that the character immediately before the found domain isn't an @ sign, to exclude email addresses. This could use some improvement.
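
One way such a check might look, as a hedged Python sketch rather than the actual module code: the pattern, its short list of endings and the name find_non_email_domains are assumptions for illustration. Matches are anchored to the start of a label so that the character before the match really is the one preceding the domain.

```python
import re

# Hypothetical pattern with a short list of endings, for illustration only.
# The lookbehind forces a match to begin at the start of a label.
DOMAIN_RE = re.compile(
    r"(?<![-a-z0-9])[-a-z0-9]{4,}\.(?:co\.uk|com|net|org|biz)(?![-a-z0-9])",
    re.IGNORECASE,
)


def find_non_email_domains(text):
    """Find domains, skipping any whose preceding character is '@'."""
    domains = []
    for match in DOMAIN_RE.finditer(text):
        start = match.start()
        if start > 0 and text[start - 1] == "@":
            continue  # looks like the host part of an email address
        domains.append(match.group(0).lower())
    return domains
```

The check is as crude as it sounds: an address such as bob@mail.example.com still yields example.com, because the character before the matched domain is a dot rather than an @ sign.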

Architecture of submission, revocation, etc

I haven't really looked into this. I just wanted to stem the flow of the spam ;-). Perhaps a model like,

I'll build something like this when I get time.

Credits

I first heard the idea of checking a range of domains against a blacklist from Jay Allen's MT-Blacklist / Comment Spam. That, coupled with noticing that the majority of spam I received that SpamAssassin wasn't catching had a) very little text and b) one or more URLs, suggested the possibility of targeting domain names in the body of emails. All that remained was a way to provide access to a list of domains that didn't require sucking down some ever-growing text file or RSS feed: a DNS-based RBL fitted the job perfectly.

Resources

In the classic style of finding info after the fact, here are a couple of useful links. I have since noticed that a similar idea was discussed on spamassassin-talk (list now obsolete) back in 2002: [SAtalk] URL blacklist. There are also three emails between Mark Reynolds and Justin Mason, back in 2001, discussing a very similar idea along with interesting thoughts on a "non-sue-able" blacklist process: http://bl.reynolds.net.au/ksi/email/.


All non-user content and code Copyright © 2000-2006 realprogrammers.com / Paul Makepeace. Comments & feedback welcome!