February 22, 2019

How I Caught the Spam and What I Did With it When I Caught it - page 2

An Unpleasant Surprise

  • October 14, 1999
  • By Mark-Jason Dominus
Mail messages contain an actual message with some information that someone wanted to send you; that's called the body of the message. They also contain some meta-information such as who the message is from, who it was sent to, when it was sent, and so forth; this part of the message is called the header. I could try to recognize spam by looking in the body, or in the header, or both. But I couldn't think of a good way to recognize spam by looking at the body that wouldn't also lead to an unacceptably high false positive rate. Every time I thought about the problem, I got stuck at the same place: suppose I know that lots of spam arrives exhorting me to see sexy Annabel Chong, and suppose I were to establish a policy of rejecting messages that mention Annabel. Now suppose someone sends me a message that discusses spam filtering strategies and mentions Annabel in connection with this. ``Dominus, is it a good strategy to reject messages that mention Annabel Chong?'' Oops, I've just thrown their message away. Now suppose someone sent me a copy of the article you're reading right now. Oops, it mentions Annabel Chong also.

I decided that since there wasn't any way to tell whether a message actually mentions Annabel directly rather than mentioning that messages sometimes mention her, content filtering wasn't going to work. I didn't want to cut myself off from discussions about spam filtering.
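To make the false-positive problem concrete, here is a minimal sketch of the kind of body filter I rejected. It is only an illustration: the phrase list, the function name `body_looks_like_spam`, and the sample addresses are all made up for this example, and it assumes simple single-part plain-text messages.

```python
from email import message_from_string

# Hypothetical phrase list; any real list would be longer.
BANNED_PHRASES = ["annabel chong"]

def body_looks_like_spam(raw_message: str) -> bool:
    """Return True if the message body mentions a banned phrase.

    Assumes a single-part plain-text message, so get_payload()
    returns the body as one string.
    """
    msg = message_from_string(raw_message)
    body = msg.get_payload().lower()
    return any(phrase in body for phrase in BANNED_PHRASES)

# An actual advertisement (made-up sample).
spam = (
    "From: promo@spam.example\n"
    "Subject: Hello\n"
    "\n"
    "See sexy Annabel Chong now!\n"
)

# A legitimate question about filtering (made-up sample).
question = (
    "From: friend@example.org\n"
    "Subject: Filtering\n"
    "\n"
    "Dominus, is it a good strategy to reject messages "
    "that mention Annabel Chong?\n"
)

for name, raw in [("spam", spam), ("question", question)]:
    verdict = "reject" if body_looks_like_spam(raw) else "deliver"
    print(name, "->", verdict)
```

Both messages get rejected, because the filter sees only that the phrase occurs, not how it is being used.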

I decided I would have to filter based on information in the message header, not its body. The `subject' line is the most obvious place to start, but I didn't want to use it because it's really part of the message content and I would have had the same problem as if I were filtering on the message body. I didn't want to automatically reject mail that said

    Subject: These green card lottery articles are driving me crazy!

Now, among people who've followed this train of thought there seem to be two basic strategies. One strategy is to simply reject mail that's not addressed to you; if the `To:' address isn't yours, throw it out. That sounds good, but it has some problems. I get a lot of mailing list and carbon-copy mail that might not include me in the recipient list for perfectly good reasons. The typical mailing-list mail arrives in your inbox without your address on it anywhere; it says something like

    To: perl5-porters@perl.com

This is analogous to paper mailing list mail: You open it up and inside is a letter that says not `Dear Mark Dominus, ...' but rather `Dear Pigeon Fancier, ...'.

Also, I analyzed the spam messages I had collected and decided that this strategy wouldn't work well enough: about one spam message in five actually does arrive addressed to me.
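The `To:' strategy is easy to sketch, and the sketch shows both of its failures. Everything named here is an assumption for illustration: the address in `MY_ADDRESSES`, the function name `addressed_to_me`, and the two sample messages.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical address; stands in for the recipient's own address(es).
MY_ADDRESSES = {"me@example.com"}

def addressed_to_me(raw_message: str) -> bool:
    """Return True if any address in the To: header is one of mine."""
    msg = message_from_string(raw_message)
    to_fields = msg.get_all("To", [])
    recipients = {addr.lower() for _name, addr in getaddresses(to_fields)}
    return bool(recipients & MY_ADDRESSES)

# Legitimate mailing-list mail: my address appears nowhere in To:.
list_mail = (
    "From: porter@example.org\n"
    "To: perl5-porters@perl.com\n"
    "Subject: Patch review\n"
    "\n"
    "Comments on the patch follow.\n"
)

# Spam that happens to be addressed to me directly.
direct_spam = (
    "From: promo@spam.example\n"
    "To: Me <me@example.com>\n"
    "Subject: Green card lottery\n"
    "\n"
    "Act now!\n"
)

print("list mail delivered:", addressed_to_me(list_mail))
print("direct spam delivered:", addressed_to_me(direct_spam))
```

The mailing-list message is thrown out and the directly addressed spam is delivered, which is exactly backwards.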

Instead, the basic idea that I adopted was to make a list of domains that sent me a lot of spam, and to blacklist those domains. Any mail from a blacklisted domain would be rejected; other mail would be delivered. And now I'll show how to implement that, because the details turn out to be very interesting.
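Before getting into those details, here is the bare idea as a rough sketch. It is not the implementation described in the rest of this article: it takes the sender's domain from the `From:' header only, and it treats a subdomain of a blacklisted domain as blacklisted too. Both choices, along with the domain names and function names, are assumptions made for the example.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical blacklist of domains that have sent a lot of spam.
BLACKLISTED_DOMAINS = {"spam.example", "bulkmail.example"}

def sender_domain(raw_message: str) -> str:
    """Return the lowercased domain of the From: address ('' if missing)."""
    msg = message_from_string(raw_message)
    _name, addr = parseaddr(msg.get("From", ""))
    return addr.rpartition("@")[2].lower()

def is_blacklisted(domain: str) -> bool:
    """True if the domain or any parent domain is on the blacklist.

    Checks mail.spam.example, then spam.example, then example.
    """
    labels = domain.split(".")
    return any(
        ".".join(labels[i:]) in BLACKLISTED_DOMAINS
        for i in range(len(labels))
    )

def should_reject(raw_message: str) -> bool:
    """Reject mail from a blacklisted domain; deliver everything else."""
    return is_blacklisted(sender_domain(raw_message))

sample = (
    "From: Promo <offers@mail.spam.example>\n"
    "To: perl5-porters@perl.com\n"
    "Subject: Hello\n"
    "\n"
    "Act now!\n"
)
print("reject" if should_reject(sample) else "deliver")
```

Note that the decision never looks at the subject or the body, so a message that merely discusses spam is delivered as long as it comes from a domain that isn't on the list.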
