How I Caught the Spam and What I Did With it When I Caught it
Message Headers

Mark-Jason Dominus
Thursday, October 14, 1999 03:43:35 PM
We're not done processing the message. A mail
header is made up of several lines that carry
different information. You're probably familiar
with the structure already; it looks like this:
Return-Path: LRS@getstartednow.com
Return-Path: <LRS@getstartednow.com>
Delivered-To: mjd-deliver@plover.com
Received: (qmail 15266 invoked by uid 119); 10 Apr 1997 05:08:37 -0000
Delivered-To: mjd-filter@plover.com
Received: (qmail 15261 invoked by uid 119); 10 Apr 1997 05:08:33 -0000
Delivered-To: mjd@plover.com
Received: (qmail 15258 invoked from network); 10 Apr 1997 05:08:31 -0000
Received: from renoir.op.net (root@206.84.208.4)
    by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000
Received: from major.globecomm.net (major.globecomm.net [207.51.48.5])
    by renoir.op.net (8.7.1/8.7.1/$Revision: 1.10 $) with ESMTP id BAA02191
    for <mjd@op.net>; Thu, 10 Apr 1997 01:06:35 -0400 (EDT)
From: LRS@getstartednow.com
Received: from globecomm.net (ip252.new-haven.ct.pub-ip.psi.net [38.11.102.252])
    by major.globecomm.net (8.8.5/8.8.0) with SMTP id BAA00454;
    Thu, 10 Apr 1997 01:06:23 -0400 (EDT)
Received: from mailhost.greatchances.com (alt3.greatchances.com(917.876.92.65))
    by greatchances.com (8.8.5/8.6.5) with SMTP id GAA04352
    for <friend@public.com>; Thu, 10 Apr 1997 01:00:43 -0600 (EST)
To: friend@public.com
Message-ID: <282732679098.HAb9037@greatchances.com>
Date: Thu, 10 Apr 97 01:00:43 EST
Subject: MAKE MONEY AT HOME!
X-UIDL: 698987574a97aqd1p134jud427k9a6d
We'd like to break this up into the individual
lines and then put the information into a hash
so that we can find the various parts easily.
For example, we'd like to be able to find the
`To:' address in $hash{To} and the subject
in $hash{Subject}.
Breaking a Perl string into lines is easy; just use split:
@lines = split /\n/, $header;
This tells Perl to take the string
$header and break it into
lines wherever it sees a \n
character. The \n's are
discarded and the parts in between are
stored into the elements of the Perl array
@lines. Then we can dismantle each
individual line:
foreach $line (@lines) {
my ($label, $value) = split /:\s*/, $line, 2;
$hash{$label} = $value;
}
This runs the loop once for each line. We
use split again, this time
to cut each line into two pieces. The
/:\s*/ says that the pieces will
be separated by a : followed
by any amount of white space; \s
matches a single white space character and the
star means `zero or more', since we don't know
how much space there will
be. For the header line Date: Thu,
10 Apr 97 01:00:43 EST this places
Date into $label
and Thu, 10 Apr 97 01:00:43 EST
into $value. Notice that Perl
does not split on the :s in
the date; that's because the 2
in the split tells Perl that
there are only two fields here, so that it
ignores any :s after the first
one. If we had omitted the 2, Perl
would have split on the other :s as well:
$value would have gotten only Thu,
10 Apr 97 01, and the rest of the time,
00 and 43 EST,
would have been thrown away.
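Here is a small sketch of the difference the limit makes, using the Date line from the sample header. The variable @pieces is introduced only for illustration; assigning the unlimited split to an array lets us count every piece Perl finds.

```perl
use strict;
use warnings;

my $line = "Date: Thu, 10 Apr 97 01:00:43 EST";

# With a limit of 2, everything after the first colon stays in one piece.
my ($label, $value) = split /:\s*/, $line, 2;

# With no limit, every colon is a separator, so the time is cut apart.
my @pieces = split /:\s*/, $line;

print "label:  $label\n";
print "value:  $value\n";
print "pieces: ", scalar(@pieces), "\n";
```

The last line reports 4 pieces, and $pieces[1] holds only the truncated date Thu, 10 Apr 97 01.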
Continuation Lines
Actually though, there's a problem with
this. When header lines are long, they can be
broken up into two or more lines and continue
on the following line. There's an example of
this above. The two physical lines
Received: from renoir.op.net (root@206.84.208.4)
    by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000
are actually one logical line that is
broken in half. We are supposed to consider
this to be one line, even though there's a
newline character in the middle. The rule for
recognizing these extended logical lines is
simple: If a line starts with white space,
it's a continuation of the previous line.
The code we have incorrectly breaks
this header line in two. Then it sets
$hash{Received} to the partial
value
from renoir.op.net (root@206.84.208.4)
Then it tries to process the following line the same
way, not realizing that it's a continuation. It
breaks the continuation into two pieces,
by plover.com with SMTP; 10 Apr 1997 05
and 08:31 -0000, and it interprets the first
piece as the header field name and the second
piece as the value. This is obviously all wrong.
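To see the failure for yourself, here is a short sketch that feeds just this one two-line field to the simple loop. The four spaces of indentation on the continuation line are an assumption for the demonstration; any leading white space behaves the same way.

```perl
use strict;
use warnings;

# One logical Received: field spread over two physical lines.
my $header = "Received: from renoir.op.net (root\@206.84.208.4)\n"
           . "    by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000\n";

my %hash;
foreach my $line (split /\n/, $header) {
    my ($label, $value) = split /:\s*/, $line, 2;
    $hash{$label} = $value;
}

# Print each key in brackets so leading white space is visible.
foreach my $label (sort keys %hash) {
    print "[$label] => [$hash{$label}]\n";
}
```

The hash ends up with two entries instead of one, and the second has a nonsense field name that begins with white space.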
We need a way to handle the continuation
lines. Here's one way: We'll split up the
lines as before, and then put the continuations
back together.
 1  @physical_lines = split /\n/, $header;
 2  @logical_lines = ();
 3  for $current_line (@physical_lines) {
 4    if ($current_line =~ /^\s/) {
 5      $previous_line .= $current_line;
 6    } else {
 7      push @logical_lines, $previous_line if defined $previous_line;
 8      $previous_line = $current_line;
 9    }
10  }
11  push @logical_lines, $previous_line if defined $previous_line;
@logical_lines will contain
the array of header lines after we've pasted
the continued lines back together. We loop
over the physical lines, and check each one
to see if it begins with white space. We saw
that \s is a Perl pattern for a
white space character; the ^ in front of
it requires that the white space occur at the
beginning of the string. In the normal case, the
line does not begin with white space,
and we come to line 7. We store the previous
line into the @logical_lines array, if there was one, because we can be sure we're
done with it. And we remember the current line
in the $previous_line variable in
case it turns out that it is continued on the
following line.
If the current line does begin
with white space, it is a continuation
of the previous line, and on line 5 we
append it to $previous_line.
$previous_line will keep getting
longer and longer as long as we keep seeing
continuation lines, and then finally when we
see a line that's not a continuation, we'll
push the entire $previous_line
onto the list of logical lines at line 7.
When we reach the end of the header and exit
the loop, the last logical line is still in
$previous_line, so we have line
11 to take care of it.
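If you want to try the loop on its own, here is one way it might be packaged. This is only a sketch: the subroutine name logical_lines and the shortened sample header are made up for the example, and the variables are declared with my so that nothing is left over from one call to the next.

```perl
use strict;
use warnings;

# Hypothetical helper: join continuation lines back onto the line they continue.
sub logical_lines {
    my ($header) = @_;
    my @logical_lines;
    my $previous_line;
    for my $current_line (split /\n/, $header) {
        if ($current_line =~ /^\s/) {
            # Continuation: glue it onto the line being accumulated.
            $previous_line .= $current_line;
        } else {
            # A new field starts, so the accumulated line is complete.
            push @logical_lines, $previous_line if defined $previous_line;
            $previous_line = $current_line;
        }
    }
    # The last logical line is still waiting in $previous_line.
    push @logical_lines, $previous_line if defined $previous_line;
    return @logical_lines;
}

my $header = "Delivered-To: mjd\@plover.com\n"
           . "Received: from renoir.op.net (root\@206.84.208.4)\n"
           . "    by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000\n"
           . "Subject: MAKE MONEY AT HOME!\n";

my @lines = logical_lines($header);
print scalar(@lines), " logical lines\n";
print "$_\n" for @lines;
```

Four physical lines go in and three logical lines come out. Note that the continuation is appended exactly as it appears, so its leading white space stays in the joined line.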
A Simpler Way to Deal with the Continuation Lines
This is a general pattern that you can apply
to any problem that involves continuations or
escape sequences. It's fairly simple, but it
turns out that in Perl there's an even simpler
way to write the same thing, if we're willing
to use a little regex magic:
@logical_lines = split /\n(?!\s)/, $header;
This replaces the 11-line loop we had above.
What's going on here? It says that the
delimiters between header lines aren't
simply \n characters; an
\n by itself isn't enough.
(?!foo) says that in
order to match, Perl must not see
foo coming up at that position in
the string. (?!\s) says that
the next character after the \n
must not be a whitespace character. So
where /\n/ will match any newline
character, /\n(?!\s)/ will only
match the newline characters that are not
immediately followed by whitespace. These are
precisely the ones that are at the ends of
logical lines.
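Here is a sketch that puts the one-line split together with the hash-building loop from earlier. One thing to be aware of: unlike the loop version, this split leaves the newline inside each continued line. The substitution that folds the newline and indentation into a single space is an assumption about what you would want; it is not the only reasonable choice. The shortened sample header has only one Received: field, because a plain hash keeps just the last value stored under each label.

```perl
use strict;
use warnings;

my $header = "Received: from renoir.op.net (root\@206.84.208.4)\n"
           . "    by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000\n"
           . "To: friend\@public.com\n"
           . "Subject: MAKE MONEY AT HOME!\n";

my %hash;
foreach my $line (split /\n(?!\s)/, $header) {
    my ($label, $value) = split /:\s*/, $line, 2;
    # Assumption: fold the embedded newline and indentation into one space.
    $value =~ s/\n\s+/ /g;
    $hash{$label} = $value;
}

print "$_ => $hash{$_}\n" for sort keys %hash;
```

After this runs, $hash{Received} holds the whole field on one line, and $hash{To} and $hash{Subject} are where we wanted them.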