
How I Caught the Spam and What I Did With it When I Caught it - page 4

An Unpleasant Surprise

  • October 14, 1999
  • By Mark-Jason Dominus
We're not done processing the message. A mail header is made up of several lines that carry different information. You're probably familiar with the structure already; it looks like this:

    Return-Path: LRS@getstartednow.com
    Return-Path:
    Delivered-To: mjd-deliver@plover.com
    Received: (qmail 15266 invoked by uid 119); 10 Apr 1997 05:08:37 -0000
    Delivered-To: mjd-filter@plover.com
    Received: (qmail 15261 invoked by uid 119); 10 Apr 1997 05:08:33 -0000
    Delivered-To: mjd@plover.com
    Received: (qmail 15258 invoked from network); 10 Apr 1997 05:08:31 -0000
    Received: from renoir.op.net (root@206.84.208.4)
      by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000
    Received: from major.globecomm.net (major.globecomm.net [207.51.48.5])
      by renoir.op.net (8.7.1/8.7.1/$Revision: 1.10 $) with ESMTP id BAA02191
      for ; Thu, 10 Apr 1997 01:06:35 -0400 (EDT)
    From: LRS@getstartednow.com
    Received: from globecomm.net (ip252.new-haven.ct.pub-ip.psi.net [38.11.102.252])
      by major.globecomm.net (8.8.5/8.8.0) with SMTP id BAA00454;
      Thu, 10 Apr 1997 01:06:23 -0400 (EDT)
    Received: from mailhost.greatchances.com (alt3.greatchances.com(917.876.92.65))
      by greatchances.com (8.8.5/8.6.5) with SMTP id GAA04352
      for ; Thu, 10 Apr 1997 01:00:43 -0600 (EST)
    To: friend@public.com
    Message-ID: <282732679098.HAb9037@greatchances.com>
    Date: Thu, 10 Apr 97 01:00:43 EST
    Subject: MAKE MONEY AT HOME!
    X-UIDL: 698987574a97aqd1p134jud427k9a6d

We'd like to break this up into the individual lines and then put the information into a hash so that we can find the various parts easily. For example, we'd like to be able to find the `To:' address in $hash{To} and the subject in $hash{Subject}.

Breaking a Perl string into lines is easy; just use split:

    @lines = split /\n/, $header;

This tells Perl to take the string $header and break it into lines wherever it sees a \n character. The \n's are discarded and the parts in between are stored in the elements of the Perl array @lines. Then we can dismantle each individual line:

    foreach $line (@lines) {
      my ($label, $value) = split /:\s*/, $line, 2;
      $hash{$label} = $value;
    }

This runs the loop once for each line. We use split again, this time to cut each line into two pieces. The /:\s*/ says that the pieces will be separated by a : followed by any amount of white space, possibly none; \s is an abbreviation for `space' and the star means that we don't know how much space there will be. For the header line `Date: Thu, 10 Apr 97 01:00:43 EST' this places Date into $label and Thu, 10 Apr 97 01:00:43 EST into $value. Notice that Perl does not split on the :s in the time; that's because the 2 in the split tells Perl to produce at most two fields, so it ignores any :s after the first one. If we had omitted the 2, Perl would have split this line into four fields: $label would still have gotten Date, $value would have gotten Thu, 10 Apr 97 01, and the two other fields, 00 and 43 EST, would have been thrown away.
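To see the effect of the limit argument, here is a small sketch that splits the same Date: line both ways. The variable names are chosen only for illustration.

```perl
use strict;
use warnings;

my $line = 'Date: Thu, 10 Apr 97 01:00:43 EST';

# With a limit of 2, only the first colon acts as a separator.
my ($label, $value) = split /:\s*/, $line, 2;
print "$label => $value\n";

# Without the limit, every colon is a separator.
my @fields = split /:\s*/, $line;
print scalar(@fields), " fields\n";
print "[$_]\n" for @fields;
```

The first print shows the whole date in $value; the second shows four fields, with the time cut apart at each colon.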

Continuation Lines

Actually, though, there's a problem with this. When header lines are long, they can be broken up into two or more physical lines. There's an example of this above. The two physical lines

    Received: from renoir.op.net (root@206.84.208.4)
      by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000

are actually one logical line that is broken in half. We are supposed to consider this to be one line, even though there's a newline character in the middle. The rule for recognizing these extended logical lines is simple: If a line starts with white space, it's a continuation of the previous line.

The code we have incorrectly breaks this header line in two. Then it sets $hash{Received} to the partial value

    from renoir.op.net (root@206.84.208.4)

Then it tries to process the following line the same way, not realizing that it's a continuation. It breaks the continuation into two pieces,

    by plover.com with SMTP; 10 Apr 1997 05

and

    08:31 -0000

and it interprets the first piece as the header field name and the second piece as the value. This is obviously all wrong.
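The failure is easy to reproduce. The following sketch runs the simple loop over a three-line excerpt of the header above and prints what ends up in the hash; the shortened $header string is an assumption made for the example.

```perl
use strict;
use warnings;

# A short excerpt of the sample header, including one continuation line.
my $header = "Received: from renoir.op.net (root\@206.84.208.4)\n"
           . "      by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000\n"
           . "Subject: MAKE MONEY AT HOME!";

my %hash;
for my $line (split /\n/, $header) {
    my ($label, $value) = split /:\s*/, $line, 2;
    $hash{$label} = $value;
}

# Brackets make leading white space in the labels visible.
for my $label (sort keys %hash) {
    print "[$label] => [$hash{$label}]\n";
}
```

The hash ends up with three keys instead of two, and one of them is the front half of the continuation line, leading spaces included.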

We need a way to handle the continuation lines. Here's one way: We'll split up the lines as before, and then put the continuations back together.

     1    @physical_lines = split /\n/, $header;
     2    @logical_lines = ();
     3    for $current_line (@physical_lines) {
     4      if ($current_line =~ /^\s/) {
     5        $previous_line .= $current_line;
     6      } else {
     7        push @logical_lines, $previous_line if defined $previous_line;
     8        $previous_line = $current_line;
     9      }
    10    }
    11    push @logical_lines, $previous_line;

@logical_lines will contain the list of header lines after we've pasted the continued lines back together. We loop over the physical lines, and check each one to see if it begins with white space. We saw that \s is a Perl pattern for a white space character; the ^ in front of it requires that the white space occur at the beginning of the string. In the normal case, the line does not begin with white space, and we come to line 7. We store the previous line into the @logical_lines array, if there was one, because we can be sure we're done with it. And we remember the current line in the $previous_line variable in case it turns out that it is continued on the following line.

If the current line does begin with white space, it is a continuation of the previous line, and on line 5 we append it to $previous_line. $previous_line will keep getting longer and longer as long as we keep seeing continuation lines, and then finally when we see a line that's not a continuation, we'll push the entire $previous_line onto the list of logical lines at line 7.

When we reach the end of the header and exit the loop, the last logical line is still in $previous_line, so we have line 11 to take care of it.
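For experimenting, the loop can be wrapped in a subroutine and run on a short header. This is only a sketch: the name join_continuations is hypothetical, the variables are scoped with my, and the final push is guarded so that an empty header yields an empty list.

```perl
use strict;
use warnings;

# join_continuations is a hypothetical helper name, not part of the article's code.
sub join_continuations {
    my ($header) = @_;
    my (@logical_lines, $previous_line);
    for my $current_line (split /\n/, $header) {
        if ($current_line =~ /^\s/) {
            # Continuation: glue it onto the line we are still collecting.
            $previous_line .= $current_line;
        } else {
            # A new field starts, so the previous logical line is complete.
            push @logical_lines, $previous_line if defined $previous_line;
            $previous_line = $current_line;
        }
    }
    push @logical_lines, $previous_line if defined $previous_line;
    return @logical_lines;
}

my @lines = join_continuations(
    "Received: from renoir.op.net (root\@206.84.208.4)\n"
  . "      by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000\n"
  . "Subject: MAKE MONEY AT HOME!\n"
);
print "$_\n" for @lines;
```

The three physical lines come back as two logical lines. Because the newlines were discarded by the first split, the Received: line is rejoined with only the continuation's own leading spaces between the two halves.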

A Simpler Way to Deal with the Continuation Lines

This is a general pattern that you can apply to any problem that involves continuations or escape sequences. It's fairly simple, but it turns out that in Perl there's an even simpler way to write the same thing, if we're willing to use a little regex magic:

    @logical_lines = split /\n(?!\s)/, $header;

This replaces the 11-line loop we had above.

What's going on here? It says that the delimiters between header lines aren't just \n characters; an \n by itself isn't enough. (?!foo) says that in order to match, Perl must not see foo coming up at that position in the string. (?!\s) says that the next character after the \n must not be a whitespace character. So where /\n/ will match any newline character, /\n(?!\s)/ will only match the newline characters that are not immediately followed by whitespace. These are precisely the ones that are at the ends of logical lines.
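Putting the pieces together, here is a minimal sketch that uses the lookahead split and then fills the hash as before. The four-line $header is a shortened excerpt assumed for the example.

```perl
use strict;
use warnings;

my $header = join "\n",
    'Delivered-To: mjd@plover.com',
    'Received: from renoir.op.net (root@206.84.208.4)',
    '      by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000',
    'Subject: MAKE MONEY AT HOME!';

# Split only at newlines that are not followed by white space.
my @logical_lines = split /\n(?!\s)/, $header;

my %hash;
for my $line (@logical_lines) {
    my ($label, $value) = split /:\s*/, $line, 2;
    $hash{$label} = $value;
}

print scalar(@logical_lines), " logical lines\n";
print "Received: $hash{Received}\n";
```

One difference from the loop version shows up in the output: the newline inside a continued field is not a delimiter here, so it stays in the value, and $hash{Received} still contains the line break and the indentation of the continuation.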
