Preventing SPAM

The professor was shocked. She never expected any of her students to do this. The computer monitor showed an obscene e-mail from her favorite student. In fact the message was spoofed spam: it was sent by spammers, not by her student.

This story is imaginary, but the situation is very real in our lives. So what is this spam?

Introduction

The Jargon File [1] defines spam as:

To mass-mail unrequested identical or nearly-identical email messages, particularly those containing advertising. Especially used when the mail addresses have been culled from network traffic or databases without the consent of the recipients.

We won't say more about what spam is, as everyone knows it. So let's look at ways to prevent it. First we'll see various methods of filtering spam once it reaches the MTA, and then more robust technologies that kill spam before it enters the mail server.

Filtering Spam

Spam filtering is done using rule based methods and statistical methods. Rule based methods look for regular expressions, or combinations of them, in a message. Statistical filtering, on the other hand, is more robust and uses the probability of occurrence of certain tokens in the mail.

Rule based Filtering - Server side

This is the basic method of filtering spam. The filter searches for specific words found in spam mails. The probability of false positives (legitimate mail marked as spam) is very high, so it's best to combine this method with those described in the sections below. It's advisable to filter at the server itself, so that you don't waste bandwidth transporting unwanted spam to your local inbox.

Using .procmailrc + Sendmail or Postfix

Both Sendmail and Postfix can use procmail as the Mail Delivery Agent (MDA). Procmail can filter spam using rules (recipes) defined in .procmailrc. A few examples are given below; they can be expanded to suit your environment.

:0
* ^(Received:.*spam\.net|From:.*spammer@nigerian-scam\.com)
/dev/null

It's an ugly way to filter, but it works if you often get spam from spam.net or from spammer@nigerian-scam.com. All mail from these sources is delivered to /dev/null, i.e. discarded. To keep the spam instead, replace /dev/null with the name of a regular file.

:0
* ^Content-Type:.*multipart/
* 1^1 B ?? ^Content-Type:.*application/x-msdownload
* 1^1 B ?? ^Content-Type:.*name=.*\.(exe|scr|pif|com|bat)
/dev/null

Send those lil virii to /dev/null: this recipe discards multipart messages that carry executable attachments. Let them live in the time-space void forever.

:0
* ^(From|To|Sender|Reply-To):[	]*.STRING-ADDED-BY-MAILSERVER
/dev/null

This catches unqualified addresses, which are typically used only by spammers. STRING-ADDED-BY-MAILSERVER must be replaced by the string your mail server adds to such addresses.

:0
* ^Subject:.*REMOVE\ ME|\
^Subject:.*viagra
/dev/null

This is the simplest filter based on subject.

Using .procmailrc + Exim

When Exim is used as the MTA, filtering can be done by having Exim invoke procmail through the user's .forward file. To enable this, add the following to the .forward file.

"|IFS=' ' && p=`which procmail` && test -f $p && exec $p -Yf- || exit 75 #username"

Once this is added to the .forward, the procmail recipes described above will start working.

Drawbacks of rule based filtering

  • Percentage of false positives is high (about 5-10%)
  • Filtering is mostly English specific
  • It's very difficult to automate, which makes it impractical in large establishments.
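
The false-positive problem can be seen in a small Python sketch. The two patterns mirror the Subject recipe shown earlier; the function name and the sample subject lines are made up for illustration.

```python
import re

# Same patterns as the Subject recipe, written as Python regexes.
# procmail matches case-insensitively by default, so we do the same.
SUBJECT_RULES = [
    re.compile(r"REMOVE ME", re.IGNORECASE),
    re.compile(r"viagra", re.IGNORECASE),
]

def is_spam_by_rule(subject):
    """Return True if any rule matches the subject line."""
    return any(rule.search(subject) for rule in SUBJECT_RULES)

subjects = [
    "Cheap viagra, no prescription",        # spam, caught
    "Pharmacy audit: viagra stock levels",  # legitimate mail, also caught
    "V1agra at half price",                 # spam, slips through
]
for subject in subjects:
    print(is_spam_by_rule(subject), subject)
```

A fixed pattern flags the legitimate pharmacy mail and misses the misspelled spam, which is why such rules need constant manual tuning.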

Statistical method for filtering

Bayesian Filtering

Bayesian spam filters calculate the spam score of a message from the probabilities of the tokens found in the mail. Each token gets a value, which is positive if the token appears mostly in spam and negative if it appears mostly in legitimate mail. These values are remembered, and when the message is completely processed they are added up. The resulting value may again be positive or negative. If it is equal to or greater than the threshold value for spam, the mail is marked as spam.

Unlike blind rule based filters, a Bayesian spam filter learns from every mail it sees. As time passes the system adapts itself and becomes more efficient and mature, and the number of false positives decreases.

In general, Bayesian filtering sorts mail into good and bad, spam and non-spam (ham), rather than only looking for spam. So even if a ham message contains words or tokens found in spam, it will still be recognized as ham based on the overall probability of its tokens: the effect of the "bad" tokens is nullified by the innocent ones. Spammers can exploit this by inserting innocent tokens into their spam.
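
As a rough illustration of the scoring described above, here is a minimal Python sketch. It is a simplified naive Bayes scorer with add-one smoothing, not the algorithm of any particular product; the training messages, the whitespace tokenizer and the threshold of 0 are all assumptions.

```python
import math
from collections import Counter

def train(messages):
    """Count, per token, how many messages of one class contain it."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts

def token_score(token, spam_counts, ham_counts, n_spam, n_ham):
    """Log-odds of a token: positive leans spam, negative leans ham."""
    # Add-one smoothing so unseen tokens do not produce log(0).
    p_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return math.log(p_spam / p_ham)

def spam_score(text, spam_counts, ham_counts, n_spam, n_ham):
    """Add up the per-token values of a whole message."""
    tokens = set(text.lower().split())
    return sum(token_score(t, spam_counts, ham_counts, n_spam, n_ham)
               for t in tokens)

# Tiny hypothetical training corpus.
spam = ["cheap viagra offer", "viagra offer click now", "click now cheap pills"]
ham = ["meeting agenda for monday", "lecture notes attached",
       "notes for monday lecture"]
spam_counts, ham_counts = train(spam), train(ham)

THRESHOLD = 0.0  # assumed cut-off; real filters let the user tune this
for msg in ["cheap viagra now", "monday lecture notes on viagra"]:
    score = spam_score(msg, spam_counts, ham_counts, len(spam), len(ham))
    print(f"{score:+.2f}", "spam" if score >= THRESHOLD else "ham", msg)
```

The second test message contains "viagra" but still gets a negative total, because its other tokens have been seen only in ham.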

The advantages of statistical spam filtering are given below.

  1. They are effective
  2. They generate few false positives
  3. They learn
  4. They let each user define what is spam
  5. They are hard to trick

Spamassassin : The most popular Bayesian Filter

Spamassassin is developed under the Apache Software Foundation, and its latest version at the time of writing (version 3.0) uses Bayesian filtering to filter spam.

Spamassassin comes in two main flavors: an on-demand scanner and a daemon. The former is invoked every time a message comes in; the latter runs continuously in memory and scans all incoming messages. This article focuses on the latter approach.

Spamassassin is a complete set of tools that can fight spam in various ways. It comprises three executables: spamassassin and spamd (Perl scripts) and spamc (a C program).

The Perl scripts run in taint mode for security reasons. The C program spamc is intended to be called from other programs. The basic Perl module behind spamassassin is Mail::SpamAssassin (the spam detector and markup engine), and its functionality can be extended with plugins and add-on modules. Spamassassin comes with a Bayes algorithm which "learns" to recognize new spam on the basis of old messages (both spam and ham). This makes it possible for the software to adapt automatically and identify spam even in the absence of specific header or body tests.

An (automatic) whitelist system makes it easy to list e-mail addresses that you already know or have verified as valid; messages from these senders are exempted from further filtering and routed directly to your mailbox.

How it works ?

Spamassassin works by performing a range of tests on every message it sees. A large number of tests are provided, including checks on whether the sender address and IP, recipient address, message dates etc. are valid, whether the message body contains any words from a locally stored list of forbidden words, whether any of the sending servers are blacklisted, and so on. Each test adds to the message's overall spam score. Messages whose score exceeds a user-defined threshold are treated as spam and can either be deleted or marked with a special spam header for further processing by other programs.
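
The score-and-threshold idea can be sketched in a few lines of Python. The test names, the point values and the threshold of 5.0 below are hypothetical and are not Spamassassin's real rule set; the sketch treats a score equal to the threshold as spam.

```python
import re

# Each test looks at the parsed headers (a dict) and the body (a string).
def missing_date(headers, body):
    return "Date" not in headers

def forbidden_word(headers, body):
    return re.search(r"\bviagra\b", body, re.IGNORECASE) is not None

def subject_all_caps(headers, body):
    return headers.get("Subject", "").isupper()

def sender_blacklisted(headers, body):
    return headers.get("From", "").endswith("@spam.example")

# (name, test function, points added when the test hits) -- all assumed.
TESTS = [
    ("MISSING_DATE", missing_date, 1.5),
    ("FORBIDDEN_WORD", forbidden_word, 2.5),
    ("SUBJECT_ALL_CAPS", subject_all_caps, 1.0),
    ("SENDER_BLACKLISTED", sender_blacklisted, 3.0),
]

def score_message(headers, body, required=5.0):
    """Run every test, add up the points, compare with the threshold."""
    hits = [(name, points) for name, test, points in TESTS
            if test(headers, body)]
    total = sum(points for _, points in hits)
    return total, total >= required, [name for name, _ in hits]

print(score_message({"From": "promo@spam.example", "Subject": "BUY NOW"},
                    "Cheap viagra here"))
print(score_message({"From": "student@university.example",
                     "Subject": "Homework question",
                     "Date": "Mon, 1 Nov 2004 10:00:00 +0000"},
                    "Is the deadline on Friday?"))
```

No single test decides the outcome; a message is flagged only when several independent hints add up.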

Spamassassin can be configured on a mail server with an MTA like Postfix, Exim or Sendmail. In such a setup the MTA accepts the mail and passes it to the MDA, which gives spamassassin control over the mail before completing the processing. Spamassassin checks the mail headers, the body of the message etc. and modifies the header of the message or takes other action as set in the configuration file. Whenever a message comes in, spamassassin scans it using the rules in /etc/spamassassin.rules or ~/.spamassassin/user_prefs [2]. Finally the MDA delivers the mail to the correct location.

Various actions that can be performed by Spamassassin are:

  • Submit to a distributed spam detection network

  • Mark as spam

  • Move to a separate spam box

  • Write the information collected, in the form of tokens, to various databases

Spamassassin is designed so that plugins can be added to it very easily, and it can be combined with various mail clients like Mutt, Outlook Express etc. In addition to the functions mentioned above, Spamassassin supports Hashcash and SPF, can submit mails marked as spam to distributed spam filtering networks like Vipul's Razor, and can perform lookups in various DNSBLs [3]. Hashcash and SPF are discussed in later sections.

Integrating with mail servers

In mail servers where the same binary handles the functions of both MTA and MDA (e.g. Exim), the configuration is done in the main startup script of the SMTP daemon. With SMTP servers that use an MDA like procmail, we can pipe the mails to spamassassin from the MDA.

E.g. for Postfix and Sendmail, add the following to the user's .procmailrc.

:0fw
| /usr/bin/spamc # correct path to spamc
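
In this recipe the f flag makes procmail treat the pipe as a filter and the w flag makes it wait for spamc to finish, so the message continues through .procmailrc with the headers Spamassassin added. The Python sketch below shows how a later program could read that verdict. It assumes the stock X-Spam-Status header in the form "Yes, score=7.9 required=5.0 ..."; a local configuration may rewrite or rename it, and the sample message is invented.

```python
import re
from email import message_from_string

# Assumed header layout: "Yes, score=<float> required=<float> ..."
STATUS_RE = re.compile(
    r"^(Yes|No),\s*score=(-?[\d.]+)\s+required=(-?[\d.]+)", re.IGNORECASE)

def read_spam_status(raw_message):
    """Return (is_spam, score, required), or None if no usable header."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status")
    if status is None:
        return None
    match = STATUS_RE.match(status.strip())
    if match is None:
        return None
    verdict, score, required = match.groups()
    return verdict.lower() == "yes", float(score), float(required)

sample = (
    "From: promo@spam.example\n"
    "Subject: BUY NOW\n"
    "X-Spam-Flag: YES\n"
    "X-Spam-Status: Yes, score=7.9 required=5.0 tests=EXAMPLE_TEST\n"
    "\n"
    "Cheap pills.\n"
)
print(read_spam_status(sample))
```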

But this is a very basic configuration, and impractical on servers with heavy traffic. There we can use a program called amavisd-new as an interface between the MTA and content filters (antivirus software, spam filters etc.). A combination of Postfix, amavisd-new and Spamassassin is claimed to be the best method to block both spam and viruses. More about the topic can be found here: http://www.linuxplanet.com/linuxplanet/tutorials/5561/3/

Blacklists - RBL

Practically of no use. RBLs (real-time blackhole lists) have existed for years, but they have never worked beyond a certain level. Spammers are too fast and smart to be blocked by RBLs. Moreover, a blacklist can be used as a weapon against web hosting service providers or companies by getting their IPs, or the domains hosted by them, added to it. The chance of generating false positives, and hence causing harm, is higher with such blacklists. So we don't discuss them here.

Filtering is brain dead

But I believe that spam "filtering" is a brain dead method of fighting spam. It's far, far better to prevent spam from being sent than to fight it by filtering. This saves a lot of the network bandwidth and CPU time otherwise spent processing spam. It is believed that more than 30% of the e-mails sent today are spam. So in the following sections we will discuss methods other than filtering to prevent spam.

HashCash

Hashcash is a denial-of-service countermeasure tool. Its main current use is to help hashcash users avoid losing e-mail to content based and blacklist based anti-spam systems [4]. Hashcash stamps each message with an X-Hashcash: header, and filtering systems and blacklists are encouraged to exempt mails that carry a valid stamp.

How it works ?

The basic idea is that clients must do some work before they can send mail (proof-of-work). They spend this proof of labour like money to get service. The stamp is a hash, similar to an md5 sum but computed with SHA1: the sender searches for a stamp whose hash begins with a required number of zero bits. This process is called minting, and the work it requires can be made arbitrarily expensive (from fractions of a second to hours). At the receiver's end, the stamp is checked with the checking function, and if the proof-of-work value is too low or bogus the mail is simply rejected. By default a stamp is valid for 28 days, after which it expires. Expiry is necessary because the receiver has to remember every stamp it has accepted in order to reject reuse, and without it that record would grow without limit.

Since each mail requires a considerable amount of work, sending mail in bulk becomes very expensive for spammers, and as a result the total amount of spam decreases considerably.
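
Minting and checking can be sketched in Python as follows. This is a simplified illustration and not the hashcash tool itself: the stamp follows the version 1 layout (version:bits:date:resource:ext:rand:counter), but the fixed date and random field, the plain decimal counter and the example address are assumptions, and the expiry and reuse checks are left out.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest):
    """Number of leading zero bits in a bytes digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()

def mint(resource, bits, date="041101", rand="c2FtcGxl"):
    """Try counters until SHA-1(stamp) starts with `bits` zero bits."""
    for counter in count():
        stamp = f"1:{bits}:{date}:{resource}::{rand}:{counter}"
        digest = hashlib.sha1(stamp.encode()).digest()
        if leading_zero_bits(digest) >= bits:
            return stamp

def check(stamp, resource, min_bits):
    """Verify a stamp with one hash; minting took about 2**bits hashes."""
    fields = stamp.split(":")
    if len(fields) != 7 or fields[0] != "1" or not fields[1].isdigit():
        return False
    if fields[3] != resource:          # stamp was minted for someone else
        return False
    claimed = int(fields[1])
    actual = leading_zero_bits(hashlib.sha1(stamp.encode()).digest())
    return claimed >= min_bits and actual >= claimed

# 12 bits keeps the demo fast; real stamps use a larger value.
stamp = mint("professor@university.example", 12)
print(stamp, check(stamp, "professor@university.example", 12))
```

The asymmetry is the point: the sender pays for thousands of hash attempts per recipient, while the receiver verifies with a single hash.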

SPF: Sender Policy Framework

This technology works by publishing a record of the locations (IPs) from which a domain sends mail, so forging that domain in e-mail becomes much harder. Spammers who still want to send mail have to use their own identity, and once the source of the spam is detected we can simply block it. SPF works by domains publishing "reverse MX" records to tell the world which machines send mail from the domain. When receiving a message from a domain, the recipient can check those records to make sure the mail is coming from where it should be coming from. With SPF the reverse MX records can be published with just one line in the DNS zone [5].

How to do it ?

Add a single record of type TXT to your DNS zone in the format given below.

domainname.com. TXT "v=SPF_version_identifier default_mechanism"

This announces which computers are allowed to deliver e-mail from your domain. The receiving SMTP server checks it before even accepting the message content, which significantly reduces bandwidth usage. Detailed information about configuration for scenarios like shared hosting can be found at the implementers' site: http://spf.pobox.com

The working of the SPF protocol can be described in 3 steps.

  1. A user sends mail from sender.com, or a spammer forges mail from sender.com, to a user at receiver.com

  2. The SMTP server at receiver.com checks sender.com's SPF record

  3. If the origin is not listed receiver.com gives the message a fail
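
Steps 2 and 3 can be sketched in Python. The record is passed in as a string, so no DNS lookup is made, and only the ip4, ip6 and all mechanisms are evaluated; the sample record and the IP addresses (taken from the documentation ranges) are assumptions.

```python
import ipaddress

RESULTS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}

def check_spf(record, sender_ip):
    """Evaluate the ip4/ip6/all mechanisms of an SPF record against an IP."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"                       # not an SPF record at all
    ip = ipaddress.ip_address(sender_ip)
    for term in terms[1:]:
        qualifier = "+"                     # default qualifier is "pass"
        if term[0] in RESULTS:
            qualifier, term = term[0], term[1:]
        name, _, value = term.partition(":")
        name = name.lower()
        if name == "all":                   # matches every sender
            return RESULTS[qualifier]
        if name in ("ip4", "ip6"):
            if ip in ipaddress.ip_network(value, strict=False):
                return RESULTS[qualifier]
        # Other mechanisms (a, mx, include, ...) need DNS lookups; skipped.
    return "neutral"

# Hypothetical record published by sender.com.
record = "v=spf1 ip4:192.0.2.0/24 -all"
print(check_spf(record, "192.0.2.25"))    # one of sender.com's own servers
print(check_spf(record, "203.0.113.9"))   # a forger's machine
```

Mechanisms are tried from left to right and the first match decides, so the trailing -all is what turns "not listed" into a fail.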

SPF is probably not the end to Spam in total, but it might be the end to Spam as we know it today: coming in masses, spoofing e-mail addresses and most of all severely annoying.

Possible flaw in SPF

Recently invented techniques like "invisible bullet proof hosting", in which certain hosting companies provide untraceable domains on dynamically changing web space, can weaken SPF to a certain extent. If the content filtering software installed on the server marks the incoming mail as spam and adds the IP of its origin to spam blocking lists, innocent end users will suffer. And it may bring the entire Internet to its knees by polluting the whole IP address space. (Note that passing the SPF check doesn't mean that the mail is not spam.)

An example : gmail.com

Simply type "dig gmail.com txt" to see the SPF entries. Mail that has been checked carries a Received-SPF: field in its header. A list of domains supporting SPF can be found here: http://personal.telefonica.terra.es/web/news/spf/

Domain Keys

This is a method proposed by Yahoo.com to fight spam. The method is fairly easy to understand: there is no centralized authority and no need to change the existing protocols. A mail server generates a public/private key pair and publishes the public key as part of its DNS record. Each outgoing mail is signed with the secret private key, and the signature can then be used to verify that the mail is not forged. In this way the presence or absence of a valid signature can be used to classify mail as spam or ham.

How it works ?

A rough overview of the working of the domain keys is given below.

  1. Generate a private key/public key pair and add the public key in the TXT field of the DNS record.

  2. Each outgoing mail is signed with the private key (rsa-sha1) and this is added to the e-mail header.

  3. The receiving SMTP server verifies the sender by checking the signature against the public key available in the TXT field of the DNS record of the sending domain.

  4. If the signature matches the server's public/private key pair, the message is accepted.

  5. If the test fails the message is rejected.
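
The sign-and-verify flow in these steps can be illustrated with a toy Python sketch. It uses textbook RSA over a SHA-1 digest with a deliberately tiny key built from two Mersenne primes, so it shows only the idea. Real Domain Keys signing also canonicalizes the message, pads the digest and base64-encodes the result, all of which are omitted here, and every name in the sketch is an assumption.

```python
import hashlib

# Step 1: toy key pair. Far too small for real use; a real deployment
# would generate keys with a vetted crypto library.
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q                              # public modulus
E = 65537                              # public exponent (goes into DNS)
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent (stays on server)

def digest(message):
    """SHA-1 of the message, reduced to fit the toy modulus."""
    return int.from_bytes(hashlib.sha1(message).digest(), "big") % N

def sign(message):
    """Step 2: the sending server signs with its private key."""
    return pow(digest(message), D, N)

def verify(message, signature):
    """Steps 3 to 5: the receiver checks with the published public key."""
    return pow(signature, E, N) == digest(message)

mail = b"From: student@sender.example\r\n\r\nHere is my homework."
signature = sign(mail)
print(verify(mail, signature))                           # genuine mail
print(verify(b"From: student@sender.example\r\n\r\nBuy pills", signature))
```

Only the holder of the private exponent can produce a signature that verifies, so a spammer forging the domain cannot attach a valid one.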

An Example : gmail.com

Gmail launched Domain keys around 15th September 2004 and it is still in the beta stages.

The headers of the mails coming from gmail.com look something like this:

DomainKey-Signature: a=rsa-sha1; c=nofws;
s=beta; d=gmail.com;
h=received:message-id:date:from:reply-to:to:subject:
mime-version:content-type:content-transfer-encoding;
b=UHWIvAn9.....jw5mJ7H+A

The various tokens in the header are

  • s=beta is the selector, i.e. the name under which the signing key is published; here it is "beta".

  • d=gmail.com is the sending domain's name.

  • a=rsa-sha1 is the algorithm used to generate the signature.

  • b=UHWIvA...7H+A is the signature of the message.

The following DNS query can be used to get the domain key information about a domain.

dig selector._domainkey.domain.com TXT

The Domain key information about gmail can be found by issuing "dig beta._domainkey.gmail.com TXT".
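
The way a receiver gets from the header to that DNS name can be sketched in Python. The helper names are made up for illustration, and the sample header is the one shown above with its signature still elided.

```python
def parse_domainkey_signature(value):
    """Split a DomainKey-Signature header value into tag=value pairs."""
    tags = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, val = part.partition("=")
        # Folded header lines leave whitespace inside values; remove it.
        tags[name.strip()] = "".join(val.split())
    return tags

def key_query_name(tags):
    """DNS name whose TXT record holds the key: <s>._domainkey.<d>."""
    return f"{tags['s']}._domainkey.{tags['d']}"

header_value = """a=rsa-sha1; c=nofws;
 s=beta; d=gmail.com;
 h=received:message-id:date:from:reply-to:to:subject:
 mime-version:content-type:content-transfer-encoding;
 b=UHWIvAn9.....jw5mJ7H+A"""

tags = parse_domainkey_signature(header_value)
print(key_query_name(tags))
```

The receiver would then fetch the TXT record at that name and use the key in it to check the b= value.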

Conclusion

These are some of the general methods used for spam prevention. New protocols like Sender ID from Microsoft are in development, and some time in the near future we can hope to live in a world without spam and spammers.

Addendum

1. One can use content filtering and Bayesian logic for CRM (Customer Relationship Management). A company usually manages communication through various mail addresses like info@company.com, careers@company.com etc. By using Bayesian filtering we can avoid using multiple mail addresses or sorting mails manually based on the subject line. The system requires initial training, and after that it does the work for us automatically. Spam can be deleted automatically, and any unclassified mails can be sorted manually. This topic is out of the scope of this article, so I will try to compile a separate document on it sometime in the future.

2. Popmail is an excellent program that can be used on the client side to filter spam. Its main advantage is that it performs the filtering at the server and deletes spam from the server itself. It can be used in combination with programs like fetchmail. The popmail project is hosted on Sourceforge.net.

Bibliography

[1] The Jargon File: http://www.catb.org/esr/jargon/
[2] Learning from messages: http://spamassassin.apache.org/full/3.0.x/dist/doc/Mail_Spamassassin_PerMsgLearner.html
[3] Spamassassin V3.0.0 docs: http://spamassassin.apache.org/full/3.0.x/dist/doc/
[4] Hashcash: hashcash.org
[5] SPF: http://spf.pobox.com
[6] Bayesian filtering: http://email.about.com/cs/bayesianfilters/a/bayesian_filter.htm
[7] A Plan for spam: http://www.paulgraham.com/spam.html
[8] Domain Keys: http://antispam.yahoo.com/domainkeys

Blessen Cherian
Member, Executive Team, Bobcares