LinuxPlanet

How I Caught the Spam and What I Did With it When I Caught it
Message Headers

Mark-Jason Dominus
Thursday, October 14, 1999 03:43:35 PM

We're not done processing the message. A mail header is made up of several lines that carry different information. You're probably familiar with the structure already; it looks like this:

    Return-Path: LRS@getstartednow.com
    Return-Path: <LRS@getstartednow.com>
    Delivered-To: mjd-deliver@plover.com
    Received: (qmail 15266 invoked by uid 119); 10 Apr 1997 05:08:37 -0000
    Delivered-To: mjd-filter@plover.com
    Received: (qmail 15261 invoked by uid 119); 10 Apr 1997 05:08:33 -0000
    Delivered-To: mjd@plover.com
    Received: (qmail 15258 invoked from network); 10 Apr 1997 05:08:31 -0000
    Received: from renoir.op.net (root@206.84.208.4)
      by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000
    Received: from major.globecomm.net (major.globecomm.net [207.51.48.5])
      by renoir.op.net (8.7.1/8.7.1/$Revision: 1.10 $) with ESMTP id BAA02191
      for <mjd@op.net>; Thu, 10 Apr 1997 01:06:35 -0400 (EDT)
    From: LRS@getstartednow.com
    Received: from globecomm.net (ip252.new-haven.ct.pub-ip.psi.net [38.11.102.252])
      by major.globecomm.net (8.8.5/8.8.0) with SMTP id BAA00454;
      Thu, 10 Apr 1997 01:06:23 -0400 (EDT)
    Received: from mailhost.greatchances.com (alt3.greatchances.com(917.876.92.65))
      by greatchances.com (8.8.5/8.6.5) with SMTP id GAA04352
      for <friend@public.com>; Thu, 10 Apr 1997 01:00:43 -0600 (EST)
    To: friend@public.com
    Message-ID: <282732679098.HAb9037@greatchances.com>
    Date: Thu, 10 Apr 97 01:00:43 EST
    Subject: MAKE MONEY AT HOME!
    X-UIDL: 698987574a97aqd1p134jud427k9a6d

We'd like to break this up into the individual lines and then put the information into a hash so that we can find the various parts easily. For example, we'd like to be able to find the `To:' address in $hash{To} and the subject in $hash{Subject}.

Breaking a Perl string into lines is easy; just use split:

    @lines = split /\n/, $header;

This tells Perl to take the string $header and break it into lines wherever it sees a \n character. The \n's are discarded and the parts in between are stored in the elements of the Perl array @lines. Then we can dismantle each individual line:

    foreach $line (@lines) {
      my ($label, $value) = split /:\s*/, $line, 2;
      $hash{$label} = $value;
    }

This runs the loop once for each line. We use split again, this time to cut each line into two pieces. The /:\s*/ says that the pieces will be separated by a : followed by any amount of white space, possibly none; \s is an abbreviation for `space' and the star means that we don't know how much space there will be. For the header line Date: Thu, 10 Apr 97 01:00:43 EST this places Date into $label and Thu, 10 Apr 97 01:00:43 EST into $value. Notice that Perl does not split on the :s in the time; that's because the 2 in the split tells Perl to produce at most two fields, so it ignores any :s after the first one. If we had omitted the 2, Perl would have split this line into four fields: $value would have gotten Thu, 10 Apr 97 01, and the two other fields with 00 and 43 EST would have been thrown away.
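To see the effect of that 2 for yourself, here is a small sketch that splits the same Date line both ways. The variable @fields is our own name for illustration and is not part of the filter program.

```perl
use strict;
use warnings;

my $line = "Date: Thu, 10 Apr 97 01:00:43 EST";

# With a limit of 2, only the first colon acts as a separator.
my ($label, $value) = split /:\s*/, $line, 2;
print "label=[$label] value=[$value]\n";

# Without the limit, every colon is a separator, so the time is cut apart.
my @fields = split /:\s*/, $line;
print scalar(@fields), " fields: ", join(" | ", @fields), "\n";
```

The first print shows the whole date in $value; the second shows the four pieces we would get without the limit.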

Continuation Lines

There's a problem with this, though. When a header line is long, it can be broken across two or more physical lines. There's an example of this above. The two physical lines

    Received: from renoir.op.net (root@206.84.208.4)
      by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000

are actually one logical line that is broken in half. We are supposed to consider this to be one line, even though there's a newline character in the middle. The rule for recognizing these extended logical lines is simple: If a line starts with white space, it's a continuation of the previous line.

The code we have incorrectly breaks this header line in two. Then it sets $hash{Received} to the partial value

    from renoir.op.net (root@206.84.208.4)

Then it tries to process the following line the same way, not realizing that it's a continuation. It breaks the continuation into two pieces,

    by plover.com with SMTP; 10 Apr 1997 05

and 08:31 -0000 and it interprets the first piece as the header field name and the second piece as the value. This is obviously all wrong.
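Here is a small sketch that runs the naive loop over a three-line header so you can watch it go wrong. The shortened $header string is made up from the sample above for illustration.

```perl
use strict;
use warnings;

# A shortened header: one continued Received: field plus a To: field.
my $header = join "\n",
    'Received: from renoir.op.net (root@206.84.208.4)',
    '  by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000',
    'To: friend@public.com';

my %hash;
foreach my $line (split /\n/, $header) {
    # Naive version: every physical line is treated as a complete field.
    my ($label, $value) = split /:\s*/, $line, 2;
    $hash{$label} = $value;
}

# Brackets make the stray leading white space in the bogus label visible.
foreach my $label (sort keys %hash) {
    print "[$label] => [$hash{$label}]\n";
}
```

The hash ends up with three keys instead of two: the continuation line has produced a bogus field whose name begins with white space, and $hash{Received} holds only the first half of the real value.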

We need a way to handle the continuation lines. Here's one way: We'll split up the lines as before, and then put the continuations back together.

 1    @physical_lines = split /\n/, $header;
 2    @logical_lines = ();
 3    for $current_line (@physical_lines) {
 4      if ($current_line =~ /^\s/) {
 5        $previous_line .= $current_line;
 6      } else {
 7        push @logical_lines, $previous_line if defined $previous_line;
 8        $previous_line = $current_line;
 9      }
10    }
11    push @logical_lines, $previous_line;

@logical_lines will contain the array of header lines after we've pasted the continued lines back together. We loop over the physical lines, and check each one to see if it begins with white space. We saw that \s is a Perl pattern for white space; the ^ in front of it requires that the white space occur at the beginning of the string. In the normal case, the line does not begin with white space, and we come to line 7. We store the previous line into the @logical_lines array, if there was one, because we can be sure we're done with it. And we remember the current line in the $previous_line variable in case it turns out that it is continued on the following line.

If the current line does begin with white space, it is a continuation of the previous line, and on line 5 we append it to $previous_line. $previous_line will keep getting longer and longer as long as we keep seeing continuation lines, and then finally when we see a line that's not a continuation, we'll push the entire $previous_line onto the list of logical lines at line 7.

When we reach the end of the header and exit the loop, the last logical line is still in $previous_line, so we have line 11 to take care of it.
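If you want to try the loop on its own, here is one way to package it as a subroutine under use strict. This is only a sketch: the name join_continuations is hypothetical, and we have added a defined check to the final push so that an empty header yields an empty list rather than a single undefined element.

```perl
use strict;
use warnings;

# join_continuations is a hypothetical name chosen for this example.
sub join_continuations {
    my ($header) = @_;
    my @logical_lines;
    my $previous_line;
    for my $current_line (split /\n/, $header) {
        if ($current_line =~ /^\s/) {
            # Continuation: glue it onto the line we are building.
            $previous_line .= $current_line;
        } else {
            # A new field starts here, so the previous one is complete.
            push @logical_lines, $previous_line if defined $previous_line;
            $previous_line = $current_line;
        }
    }
    # Assumption: guard the last push so an empty header gives an empty list.
    push @logical_lines, $previous_line if defined $previous_line;
    return @logical_lines;
}

my $header = join "\n",
    'Received: from renoir.op.net (root@206.84.208.4)',
    '  by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000',
    'To: friend@public.com';

my @lines = join_continuations($header);
print scalar(@lines), " logical lines\n";
print "$_\n" for @lines;
```

The three physical lines come back as two logical lines. The newline in the middle of the Received: field is gone, but the white space that began the continuation line is kept, so the two halves do not run together.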

A Simpler Way to Deal with the Continuation Lines

This is a general pattern that you can apply to any problem that involves continuations or escape sequences. It's fairly simple, but it turns out that in Perl there's an even simpler way to write the same thing, if we're willing to use a little regex magic:

    @logical_lines = split /\n(?!\s)/, $header;

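This replaces the 11-line loop we had above, with one small difference: the newline inside a continued line stays in the string instead of being discarded.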

What's going on here? It says that a \n by itself isn't enough to be a delimiter between header lines. (?!foo) says that in order to match, Perl must not see foo coming up at that position in the string. (?!\s) says that the next character after the \n must not be a whitespace character. So where /\n/ will match any newline character, /\n(?!\s)/ will only match the newline characters that are not immediately followed by whitespace. These are precisely the ones that are at the ends of logical lines.
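Putting the pieces together, here is a sketch that uses the lookahead split to get the logical lines and then the label/value split from earlier to fill in the hash. The shortened $header string is assembled from the sample above for illustration.

```perl
use strict;
use warnings;

my $header = join "\n",
    'Received: from renoir.op.net (root@206.84.208.4)',
    '  by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000',
    'To: friend@public.com',
    'Date: Thu, 10 Apr 97 01:00:43 EST',
    'Subject: MAKE MONEY AT HOME!';

# Split only at newlines that are NOT followed by white space.
my @logical_lines = split /\n(?!\s)/, $header;

my %hash;
foreach my $line (@logical_lines) {
    my ($label, $value) = split /:\s*/, $line, 2;
    $hash{$label} = $value;
}

print scalar(@logical_lines), " logical lines\n";
print "To:      $hash{To}\n";
print "Subject: $hash{Subject}\n";
# The Received value still contains its embedded newline.
print "Received: $hash{Received}\n";
```

Five physical lines become four logical lines, and $hash{To} and $hash{Subject} hold just what we wanted.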
