The Spamvertized Domain Realtime Blacklist project provides a DNS-based realtime blacklist (RBL) that helps detect spammer domain names in the text (i.e. the subject or body) of an email. Their presence indicates a likelihood that the email is "spamvertizing": advertizing through unsolicited bulk/commercial email.
Currently (2004-05-07) the blacklist is operating under sd.sdbl.org (a DNS zone accepting signed dynamic updates) with some sample (real!) spamvertized domains and a patched SpamAssassin 2.63. There is a crude patch and module to hunt for domains.
Spam detection employs a substantial arsenal of techniques, from looking for the presence (or probability of presence) of certain words, phrases and character sequences to checking the sender's IP address, and so on. While effective to a degree ("viagra" rarely comes up in routine email conversation), this doesn't address the fundamental aspect of spam, which is that it is advertizing. This is the spammers' weak point: an advertisement ultimately requires a conduit, typically for a transaction. Without it the reader has no means to carry out that transaction, and the spammer has no way to make money, have their political views expounded upon, etc. With spam, that conduit is either a phone number or, far more commonly, a website.
So we maintain a list of domains that have been verified as being associated with unsolicited bulk email and a mechanism for quickly looking them up.
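As a rough illustration of the lookup side, here is a minimal Python sketch. It assumes the usual DNSBL convention, which the text above does not spell out: the candidate domain is prepended to the blacklist zone, and any successful resolution counts as "listed". The function names and the injectable `resolve` parameter are hypothetical, and the sd.sdbl.org zone may no longer be live.

```python
import socket

# Zone named in the text; whether it still answers queries is not guaranteed.
ZONE = "sd.sdbl.org"


def query_name(domain, zone=ZONE):
    """Build the DNS name to look up for a candidate domain.

    Assumption: the domain is simply prepended to the blacklist zone,
    as is conventional for DNS-based blacklists.
    """
    return domain.strip(".").lower() + "." + zone


def is_listed(domain, zone=ZONE, resolve=socket.gethostbyname):
    """Return True if the query name resolves, False if the lookup fails.

    Assumption: any successful answer means "listed". The resolver is
    injectable so the logic can be exercised without network access.
    """
    try:
        resolve(query_name(domain, zone))
    except OSError:  # covers socket.gaierror raised for unknown names
        return False
    return True
```

A mail filter would call `is_listed("cheapassmeds.biz")` once per domain found in a message; because the answer comes from DNS, ordinary resolver caching keeps repeated lookups cheap.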
One obvious point is that this scheme blacklists entire domains. So why not, say, encode a full URL (or, heavens, URI) and have the system query for that? Quite simply, it seems that the vast majority of spamvertized domains are not URL-specific. That is to say, the entire domain is hosting only spamvertized products. The differences in URLs typically fall into two classes: randomized subdomains (presumably to defeat exactly this process), and URLs that encode affiliate information.
The module essentially looks for any domain-string of four or more characters (i.e. [-a-z0-9]{4,}) followed by one of the common top/second level endings, e.g. co.uk, biz, etc. This method has some advantages over the full URI scanning methods, which look for http:// or www. prefixes. Firstly, it's faster. Secondly, it finds more domains: I've seen a lot of examples of "please paste this into your browser: cheapassmeds.biz" which would be missed otherwise. Thirdly, it gets around the problem of redirect URLs whose base (e.g. rd.yahoo.com) is a legitimate domain.
There's then a hack to check that the character immediately before the found domain isn't an @ sign, to exclude email addresses. This could use some improvement.
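The scan described above can be sketched in Python as follows. This is not the actual SpamAssassin patch: the list of endings is a small illustrative subset, and the case-insensitive matching and the trailing lookahead (which stops "biz" from matching inside a longer word) are assumptions added for the example.

```python
import re

# Illustrative subset only; the endings used by the real module are not
# listed in the text.
ENDINGS = ["co.uk", "com", "net", "org", "biz", "info"]

# A label of four or more characters from [-a-z0-9], a dot, then one of the
# endings. The lookahead is an addition for this sketch: it rejects matches
# that run straight on into more label characters.
DOMAIN_RE = re.compile(
    r"[-a-z0-9]{4,}\.(?:"
    + "|".join(re.escape(e) for e in ENDINGS)
    + r")(?![-a-z0-9])",
    re.IGNORECASE,
)


def find_domains(text):
    """Return candidate domains in text, skipping ones preceded by '@'."""
    found = []
    for match in DOMAIN_RE.finditer(text):
        start = match.start()
        # The "@ hack": a domain directly after '@' is taken to be the
        # host part of an email address and is ignored.
        if start > 0 and text[start - 1] == "@":
            continue
        found.append(match.group(0).lower())
    return found
```

Given a body containing "paste this into your browser: cheapassmeds.biz", a contact address such as sales@example.com and a redirect through rd.yahoo.com to a second host, this picks up cheapassmeds.biz without any http:// prefix, skips example.com, and reports both yahoo.com and the redirect target. Only domains actually on the blacklist would then score. The @ check is as crude as described: an address like user@mail.example.com still yields example.com, because the character before the match is a dot rather than an @ sign.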
I haven't really looked into this. I just wanted to stem the flow of the spam ;-). Perhaps a model like,
I'll build something like this when I get time.
I first heard the idea of checking a range of domains against a blacklist from Jay Allen's MT-Blacklist / Comment Spam. That, coupled with noticing that the majority of spam I received that SpamAssassin wasn't catching had a) very little text and b) one or more URLs, suggested the possibility of targeting domain names in the body of emails. What remained was simply a way to provide access to a list of domains that didn't require sucking down some ever-growing text file or RSS feed: a DNS-based RBL fitted the job perfectly.
In the classic style of finding info after the fact, here are a couple of useful links. I have since noticed that a similar idea was discussed on spamassassin-talk (list now obsolete) back in 2002: [SAtalk] URL blacklist. There are also three emails between Mark Reynolds and Justin Mason discussing a very similar idea, with interesting thoughts on a "non-sue-able" blacklist process, back in 2001: http://bl.reynolds.net.au/ksi/email/.