Sawing Linux Logs with Simple Tools
Good Ole grep

Carla Schroder
Monday, September 20, 2004 11:22:15 AM
So there you are
with all of your Linux servers humming along happily. You have tested,
tweaked, and configured until they are performing at their peak of
perfection. Users are hardly whining at all. Life is good. You may
kick back and indulge in a few relaxing rounds of TuxKart. After all,
you earned it.
Except for one little
remaining chore: monitoring your log files. [insert horrible alarming
music of your choice here.] You're conscientious, so you know you can't
just ignore the logs until there's a problem, especially for public
services like Web and mail. Somewhere up in the pointy-haired suites,
they may even be plotting to require you to track and analyze all sorts
of server statistics.
Not to worry, for
there are many ways to implement data reduction, which is what log
parsing is all about. Unless you wish to devote your entire life to
manually analyzing log files, you want to slice and dice your logs to
present only the data you're interested in viewing. Even if you only pay
attention to log files when you're debugging a problem, having some
tools to weed out the noise is helpful.
The simplest method is
a keyword search. Suppose you want to separate out the 404 errors in
your Apache log, and see if you have any missing files:
$ grep 404 bratgrrl.com-Aug-2004
...
212.27.41.34 - - [30/Aug/2004:02:25:13 -0700] "GET /robots.txt HTTP/1.0" 404 - "-"
"Pompos/1.3 http://dir.com/pompos.html"
65.54.188.90 - - [30/Aug/2004:10:32:26 -0700] "GET /robots.txt HTTP/1.0" 404 - "-"
"msnbot/0.11 (+http://search.msn.com/msnbot.htm)"
207.65.113.58 - - [12/Aug/2004:06:49:11 -0700] "GET /favicon.ico HTTP/1.1" 404 - "-"
"Opera/7.21 (X11; Linux i686; U) [en]"
...
These entries are typical. This site has no robots.txt or favicon, so
any requests for these files generate a 404 error. The first two are
Web bots. The third entry is probably some random surfer. You can
ignore these. So let's screen out robots.txt and favicon, and see what
is left:
$ grep 404 bratgrrl.com-Aug-2004 | grep -v -E "favicon.ico|robots.txt"
....
200.16.116.3 - - [29/Aug/2004:20:59:27 -0700] "GET /images/142spacer.gif HTTP/1.0"
404 - "http://www.bratgrrl.com/" "Mozilla/5.0 Galeon/1.2.7 (X11; Linux i686; U;) Gecko/20030131"
200.16.116.3 - - [29/Aug/2004:21:00:08 -0700] "GET /email_crimes.html HTTP/1.0" 404
- "http://www.bratgrrl.com/" "Mozilla/5.0 Galeon/1.2.7 (X11; Linux i686; U;) Gecko/20030131"
....
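One caveat with a bare grep 404: it matches those three digits anywhere in the line, including byte counts and file names. Here is a minimal sketch of a stricter filter, assuming the log uses Apache's combined format, where the status code is the ninth whitespace-separated field and the requested path is the seventh. The sample log lines, addresses, and user agents are made up for illustration, and awk is swapped in for the first grep because it can test a single field.

```shell
# Hypothetical sample log; addresses and user agents are invented.
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.1 - - [30/Aug/2004:02:25:13 -0700] "GET /robots.txt HTTP/1.0" 404 - "-" "ExampleBot/1.0"
192.0.2.2 - - [30/Aug/2004:02:26:00 -0700] "GET /page404.html HTTP/1.0" 200 1404 "-" "ExampleBrowser/1.0"
192.0.2.3 - - [30/Aug/2004:02:27:00 -0700] "GET /email_crimes.html HTTP/1.0" 404 - "http://www.example.com/" "ExampleBrowser/1.0"
EOF

# Keep only requests whose status field ($9) is 404 and print the path ($7),
# then drop robots.txt and favicon.ico. The dots are escaped so they match
# a literal "." rather than any character.
missing=$(awk '$9 == 404 { print $7 }' "$log" |
  grep -v -E '/(favicon\.ico|robots\.txt)$')

echo "$missing"
rm -f "$log"
```

On this sample, a plain grep 404 would also catch the second line (status 200, but with 404 in both the path and the byte count); the field test skips it.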
Now we're getting somewhere. These two files, images/142spacer.gif and
email_crimes.html, are referenced somewhere on the Web site, but they
do not exist. This is something that should be fixed. How do you find
the pages that refer to these files? grep can do this too. Suppose all
the site files are in /var/www/bratgrrl:
$ grep -R "142spacer.gif" /var/www/bratgrrl
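If all you want is the list of pages to fix, a small variation on the recursive search helps. The sketch below assumes a throwaway directory standing in for /var/www/bratgrrl (the two HTML files are invented for the example). It adds -l to print only the names of matching files and -F to treat the pattern as a literal string, so the dot in the file name is not read as a wildcard.

```shell
# Hypothetical stand-in for the site directory, so the sketch runs anywhere.
site=$(mktemp -d)
printf '<img src="images/142spacer.gif">\n' > "$site/index.html"
printf '<p>No images here.</p>\n' > "$site/about.html"

# -R: search the directory tree recursively
# -l: print only the names of files that contain a match
# -F: treat the pattern as a fixed string, not a regular expression
pages=$(grep -R -l -F "142spacer.gif" "$site")

echo "$pages"
rm -rf "$site"
```

Only index.html is listed, because it is the only file that mentions the missing image.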
Here's another cool
grep trick for Apache logs. You doubtless noticed that the above
examples were referred from http://www.bratgrrl.com. When you're
checking to see where your traffic is coming from, you don't care about
local referrals. Weed them out with this:
$ cat bratgrrl.com-Aug-2004 | fgrep -v bratgrrl | cut -d\" -f4 | grep -v ^-
http://www.computerbits.com/archive/2004/0800/schroder0408.html
http://www.pdxlinux.org/resources/nw_linux
www.dianagaydon.com/
http://www.netcraft.com/survey/
http://www.techsupportforum.com/computer/topic/3520-1.html
http://us.altavista.com/web/results?tlb=1&kgs=0&ienc=utf8&q=carla+schroder
Now you can see where traffic to your site is coming from, uncluttered by local references. Here's how it works, piece by piece:
fgrep -v bratgrrl means "look for the literal string bratgrrl, then exclude lines that contain it."
cut -d\" -f4 means "using quotation marks as the delimiter, print only the text in the fourth field." The fourth field is the text between the third and fourth quotation marks.
grep -v ^- means "exclude lines that start with a hyphen." Try running the command without this to see why.
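To watch the three stages work together, here is a small sketch run against a made-up four-line log (the addresses, referrers, and user agents are all hypothetical). It uses grep -F, which behaves the same as fgrep, and adds one step that is not in the original pipeline: sort -u, to collapse repeated referrers into a single line each.

```shell
# Hypothetical sample log in combined format.
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.1 - - [29/Aug/2004:20:59:27 -0700] "GET / HTTP/1.0" 200 512 "http://www.example.org/links.html" "ExampleBrowser/1.0"
192.0.2.2 - - [29/Aug/2004:21:00:08 -0700] "GET /about.html HTTP/1.0" 200 734 "http://www.bratgrrl.com/" "ExampleBrowser/1.0"
192.0.2.3 - - [29/Aug/2004:21:01:00 -0700] "GET / HTTP/1.0" 200 512 "-" "ExampleBot/1.0"
192.0.2.4 - - [29/Aug/2004:21:02:00 -0700] "GET / HTTP/1.0" 200 512 "http://www.example.org/links.html" "ExampleBrowser/1.0"
EOF

# Stage 1: drop every line that contains the literal string "bratgrrl".
# Stage 2: split each line on double quotes and keep the fourth field,
#          which is the referrer.
# Stage 3: drop referrers that start with a hyphen (requests with none).
# Stage 4 (an addition for this sketch): sort and remove duplicates.
referrers=$(grep -F -v bratgrrl "$log" |
  cut -d'"' -f4 |
  grep -v '^-' |
  sort -u)

echo "$referrers"
rm -f "$log"
```

Swapping sort -u for sort | uniq -c | sort -rn would rank the referrers by number of hits instead.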