Python Squeezes the Web

By: Stephen Pitts
Saturday, October 23, 1999 12:16:50 PM EST
URL: http://www.linuxplanet.com/linuxplanet/tutorials/1132/1/
Introduction
Sometimes, when developing a web
application, you want to acquire data
from another source to put on your
site. For example, portal sites like portaloo
place the latest news headlines
on their pages, and the news headlines
are constantly updated without human
intervention.
I ran into a similar problem while
designing the web site and database system
of Memphis
Scholastic Chess. They were spending 5-6
hours per week manually locating the latest
ratings for 900 players on the United
States Chess Federation (USCF) web site, which
meant navigating through around 75 individual
pages and skimming each list for players in
our organization. I used
Python to write a server-side program that runs
automatically once a week, downloading and
parsing pages from the USCF's web site.
The problem... and the solution
Here's a sample set of data from the USCF's
web site that we need to parse with Python:
From http://www.64.com/cgi-bin/ratings.pl?nm=T&st=TN:
12-97 373p TN 03-97 <A HREF=/cgi-bin/ratings.pl/USCF/21005567>TARVER,NATHAN</A>
12-94 401p TN 02-95 <A HREF=/cgi-bin/ratings.pl/USCF/12613391>TASHIE,DAPHNE</A>
10-99 385 TN 11-99 <A HREF=/cgi-bin/ratings.pl/USCF/12752592>TATE,JEREMY</A> 10-05 367
From http://www.64.com/cgi-bin/ratings.pl?nm=P&st=MS:
12-96 1167p MS 04-97 <A HREF=/cgi-bin/ratings.pl/USCF/12660161>PATTERSON,RAPHAEL C</A>
08-99 1452 1261 MS 09-00 <A HREF=/cgi-bin/ratings.pl/USCF/12499243>PATTILLO,BILLY R</A> 09-22 1476 1261
12-94 960p MS 04-95 <A HREF=/cgi-bin/ratings.pl/USCF/12619152>PATTON,SAM R</A>
08-99 863 MS 04-00 <A HREF=/cgi-bin/ratings.pl/USCF/12739657>PAYNE,DANIEL</A>
The logic for the program is quite simple:
- Retrieve a list of the USCF ID #s of players
in our organization
- Retrieve a list of state/letter combinations
that we need to fetch
- Fetch each page
- Go through the pages line by line. If a
newer rating exists to the far right, use it;
otherwise use the existing rating.
- Write the changes to the database, where
the PHP3 scripts on the web server will
automatically pick up on the changes
And the Python program that does the dirty work
(in under 100 lines of code!):
#! /usr/bin/python
# A Python daemon that checks the uschess.org ratings against a
# mysql database and updates them accordingly.
import urllib
import re
import string
import calendar
import MySQL
# called on a per-state basis
def ProcessUSCFInfo(letter, state):
    "Processes the information from the USCF Web site and imports it into the MySQL database."
    print "Downloading data for state", state, "letter", letter
    uscf_data = urllib.urlopen("http://www.64.com/cgi-bin/ratings.pl?st=" + state + "&nm=" + letter).read()
    # the data that we want is inside the <pre> tag, but after the <b>-enclosed title
    # get a list of players
    beginMatch = beginDataRegexp.search(uscf_data)
    endMatch = endDataRegexp.search(uscf_data)
    uscf_data = uscf_data[beginMatch.end():endMatch.start()]
    uscf_player_lines = string.split(uscf_data, "\n")
    # parse the lines and fill up a list with USCFPlayer instances
    uscf_players = []
    playercount = 0
    for data_line in uscf_player_lines:
        # Use a regexp to extract the needed information from a line of data
        this_player = USCFPlayer()
        regexp_match = perLineRegexp.match(data_line)
        if regexp_match == None: continue
        this_player.USCFRating, exp_month, exp_year, this_player.PlayerId, w_rating = regexp_match.groups()
        # make sure that this player is in the database
        if this_player.PlayerId not in player_id_list: continue
        # the weekly updates
        if(w_rating != None): this_player.USCFRating = w_rating
        # handle Life memberships and the Y2K issue in the expiration dates
        if exp_month == None:
            this_player.ExpDate = "2099/12/31"
        else:
            if string.atoi(exp_year) > 70: exp_year = "19" + exp_year
            else: exp_year = "20" + exp_year
            # get the last day of the month
            exp_day = (calendar.monthrange(string.atoi(exp_year), string.atoi(exp_month)))[1]
            this_player.ExpDate = exp_year+"/"+exp_month+"/"+str(exp_day)
        # add the USCFPlayer to the list returned by this call
        # and to the global list that is saved to the database
        uscf_players.append(this_player)
        USCFPlayers.append(this_player)
        playercount = playercount + 1
    print "Retrieved", playercount, "players from state", state, "letter", letter
    return uscf_players
# used to hold data related to a USCF Player
class USCFPlayer:
    pass
# common regexps used by ProcessUSCFInfo
beginDataRegexp = re.compile(r"<pre>\n<b>.*</b>", re.I | re.DOTALL)
endDataRegexp = re.compile(r"</pre>")
perLineRegexp = re.compile(r".{5}\s+(\d{3,4})p?\s+(?:\d{3,4}p?)?\s*\w{2}\s+(?:(?:(\d{2})-(\d{2}))|Life)\s+<A.*USCF/(\d{8}).*/A>(?:\s+.{5}\s+(\d{3,4})\s*(?:\d{3,4}p?)?)?")
# global list of all players
USCFPlayers = []
# get a list with all of the valid playerids
db_conn = MySQL.connect("host_name_here", "user_id", "pass_word")
db_conn.selectdb("database_name")
player_id_list_tmp = db_conn.do("SELECT PlayerId FROM Players")
player_id_list = []
# eliminate all of the singletons
for pid_singleton in player_id_list_tmp:
    player_id_list.append(pid_singleton[0])
# get a list of the state/letter combinations
state_letter_list = db_conn.do("SELECT LEFT(LastName, 1) AS Letter, State, CONCAT(LEFT(LastName, 1), State) AS Sorter FROM Players GROUP BY Sorter")
# iterate and process each state/letter combo
for state_letter in state_letter_list:
    ProcessUSCFInfo(state_letter[0], state_letter[1])
# dump the whole mess to the database
print "Trying to save", len(USCFPlayers), "players to database...",
updated_players = 0
for uscf_player in USCFPlayers:
    db_conn.do("UPDATE Players SET USCFRating = " + uscf_player.USCFRating + ", ExpDate = '" + uscf_player.ExpDate + "' WHERE PlayerId = '" + uscf_player.PlayerId + "'")
print "done"
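The expiration-date handling in the listing above is easy to try in isolation. The following is only a sketch, written for a current Python 3 interpreter rather than the Python 1.5 used in this article; the function name expiration_date is made up for illustration, while the pivot year of 70 and the 2099/12/31 placeholder for Life members are taken from the listing.

```python
import calendar

def expiration_date(exp_month, exp_year):
    """Build a 'YYYY/MM/DD' string from two-digit month and year strings.

    Both arguments are None for Life members, mirroring what the
    per-line regular expression returns in that case.
    """
    if exp_month is None:
        return "2099/12/31"
    # Same pivot as the listing: years above 70 are 19xx, the rest 20xx.
    century = "19" if int(exp_year) > 70 else "20"
    full_year = century + exp_year
    # monthrange() returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(int(full_year), int(exp_month))[1]
    return full_year + "/" + exp_month + "/" + str(last_day)

print(expiration_date("11", "99"))   # 1999/11/30
print(expiration_date("02", "00"))   # 2000/02/29
print(expiration_date(None, None))   # 2099/12/31
```

Using the last day of the month means a membership listed as expiring in "11-99" stays valid through the end of November.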
All about Python
Now that I've piqued your interest with this mass of code, you are probably wondering ...
What is Python?
(from the Python
web site)
Python is an
interpreted, object-oriented, high-level
programming language with dynamic semantics. Its
high-level built in data structures, combined
with dynamic typing and dynamic binding,
make it very attractive for Rapid Application
Development, as well as for use as a scripting
or glue language to connect existing components
together. Python's simple, easy to learn
syntax emphasizes readability and therefore
reduces the cost of program maintenance. Python
supports modules and packages, which encourages
program modularity and code reuse. The Python
interpreter and the extensive standard library
are available in source or binary form without
charge for all major platforms, and can be
freely distributed.
Often, programmers
fall in love with Python because of the
increased productivity it provides. Since there
is no compilation step, the edit-test-debug
cycle is incredibly fast. Debugging Python
programs is easy: a bug or bad input will
never cause a segmentation fault. Instead,
when the interpreter discovers an error, it
raises an exception. When the program doesn't
catch the exception, the interpreter prints a
stack trace. A source level debugger allows
inspection of local and global variables,
evaluation of arbitrary expressions, setting
breakpoints, stepping through the code a line
at a time, and so on. The debugger is written
in Python itself, testifying to Python's
introspective power. On the other hand, often
the quickest way to debug a program is to add
a few print statements to the source: the fast
edit-test-debug cycle makes this simple approach
very effective.
What I Like about Python
I've written programs in a number of
different languages, including Visual
Basic, C/C++, Perl, and PHP3. There are
some things Python has that make it, in
my opinion, substantially more flexible than
other languages:
- Powerful Datatypes and Operations -
Python has
built-in strings, tuples, lists, dictionaries,
and more. Want your function to return two
values? Return a tuple, an immutable list
of values! Want to grab the elements at
indexes 4 and 5 of list MyList? Use slice
notation to write MyList[4:6] (the end index
is not included). This slice notation works on
strings, too, so "Monty Python"[0:5] evaluates
to "Monty". Lists can also grow dynamically.
You can easily iterate over the elements
of a list with the "for" statement, and the "in"
and "not in" operators let you take advantage
of Python's built-in search routines (a
linear scan, in the case of lists)
instead of having to code your own. Very few
languages have this type of functionality
built in and available as a core part of the
language. Most of the time, a special add-on
library (such as STL) is required to get all
these features.
- Rapid Development with an Interactive
Interpreter -
Rather than go
through the compile/test/run cycle of most
traditional programming languages, or even the
edit/run cycle of many scripting languages,
Python has an incredibly useful interactive
interpreter. During the development of the
aforementioned application, I pasted a chunk
of data into the interpreter, assigned it to a
variable, and wrote the string parsing regular
expression in about an hour. Whenever I'm
curious about the built-in methods of a list,
I pop into the interpreter and run dir([]). When
I'm not sure exactly how some esoteric feature
works, I define a test case and run it. I even
build my applications bottom up, importing and
testing critical functions in the interpreter
before I write the top-level code that uses
the functions.
- Runs on Multiple Platforms and Has the
Same Implementation on Multiple Platforms
-
I don't have to extol the virtues of a
multi-platform language to you; you have a
tremendous amount of flexibility in where you
develop and deploy your applications. You can
write Python on a Mac and upload it to a Unix
server, start out with a Linux server and move
up to a Sun Ultra, and so on. But, unlike some
"cross-platform" languages like ANSI C++
(which I originally wrote the uscfratingd
program in) and PHP3, it supports the same
features everywhere because there is only
one main implementation of Python in common
use. I have written GUI applications with Python
that run on Windows and Linux without changing
one line of code (more on this in a following
article). Can any other language (aside from
Perl) claim this type of functionality?
- Rich Core Library -
Out of
the box, on all platforms, Python programs
can use sockets to speak any protocol or use
predefined classes to speak HTTP, FTP, SMTP,
POP, Telnet and a variety of other Internet
protocols. Built-in classes are provided to permit
your app to parse XML, HTML, and SGML. Regular
Expressions, a powerful feature that allows text
parsing (look at the perLineRegexp variable in
the program for a useful example), were borrowed
from Perl and are present in Python on Windows,
Mac and Linux. Python/Tk, a moderately
powerful GUI framework, is available for Windows,
Mac and Unix, and wxPython, a wrapper to the
wxWindows C++ library, is available for Unix and
Windows and is under development for BeOS and
the Macintosh. Overall, Python provides a lot
of features for free that might require
costly third-party libraries in other languages.
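The datatype features from the first item in the list above can be tried in a few lines. This is a small sketch written for a current Python 3 interpreter (so print is a function, unlike in the Python 1.5 listings here); the helper name min_and_max is made up, and the numbers are ratings taken from the sample data earlier in the article.

```python
def min_and_max(values):
    """Return two results at once by packing them into a tuple."""
    return min(values), max(values)

ratings = [373, 401, 385, 1167, 1452, 960, 863]

# Tuple unpacking splits the returned pair into two names.
lowest, highest = min_and_max(ratings)
print(lowest, highest)            # 373 1452

# A slice takes a start index and an end index that is not included.
print(ratings[4:6])               # [1452, 960]
print("Monty Python"[0:5])        # Monty

# Lists grow on demand, and "for" walks their elements directly.
ratings.append(1476)
strong = []
for rating in ratings:
    if rating >= 1000:
        strong.append(rating)
print(strong)                     # [1167, 1452, 1476]
```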
The Example Explained
I don't have enough space to provide a
complete introduction to Python (check the Python
Tutorial for that), but I'll try to explain
things briefly as I go. If you've done some
sort of programming before, you'll find that
Python is extremely easy to learn and lets you
do a lot with a small amount of code. To try the
code in this article, install a copy of Python
for your distribution of Linux. Debian 2.1
users should be able to just type "apt-get
install python", and Python 1.5.x is included
with RedHat 5.0 or higher and can be installed
with glint. Using your favorite text editor
(I like VIM),
pull up a chair and follow along! Note, in order
to run the example program exactly as written,
you'll need to create a MySQL table called
"Players" like this:
CREATE TABLE Players (
    PlayerId char(8) NOT NULL PRIMARY KEY,
    LastName varchar(50) NOT NULL,
    FirstName varchar(50) NOT NULL,
    USCFRating mediumint NOT NULL,
    State char(2) NOT NULL,
    ExpDate date NOT NULL
);
And some sample data from the list above:
INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('21005567', 'Tarver', 'Nathan', 'TN');
INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('12613391', 'Tashie', 'Daphne', 'TN');
INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('12752592', 'Tate', 'Jeremy', 'TN');
INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('12660161','Patterson','Raphael','MS');
INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('12499243','Pattillo','Billy','MS');
INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('12619152','Patton','Sam','MS');
INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('12739657','Payne','Daniel','MS');
Defining the Regular Expression
In developing this program, I started
by looking at my data. For a couple of hours,
I dabbled in the interpreter like so:
Python 1.5.2 (#0, Sep 13 1999, 09:12:57) [GCC 2.95.1 19990816 (release)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import urllib, re, string
>>> beginDataRegexp = re.compile(r"<pre>\n<b>.*</b>", re.I | re.DOTALL)
>>> endDataRegexp = re.compile(r"</pre>")
>>> test_data = urllib.urlopen("http://www.64.com/cgi-bin/ratings.pl?nm=T&st=TN").read()
>>> len(test_data)
25286
Thus far, I've pulled in some libraries
that give me string, URL downloading, and
regular expression functions. Then, I defined
two regular expressions. Regular expressions
are analogous to keys. Put simplistically, the
regular expression is moved down the string
one character at a time, similar to trying
a key on a hallway full of doors. When the
regular expression matches, the door is opened.
The first regular expression looks for the
text "<pre>" followed by a newline,
followed by a "<b>" set of tags with
something inside of it. The second one looks for
the closing "</pre>" tag. After that,
I downloaded some test data from the USCF web
site to play with. One line of code, that's
all it takes! The original C++ version of this
program required 11 lines of code to emulate
the functionality of just this one! It also had
to be linked with the GNOME HTTP library, which
wasn't present on my target FreeBSD system.
>>> beginMatch = beginDataRegexp.search(test_data)
>>> endMatch = endDataRegexp.search(test_data)
>>> test_data_lines = string.split(test_data[beginMatch.end():endMatch.start()], "\n")
>>> len(test_data_lines)
Here, I'm using the regular expressions that
I defined earlier to search the string for the
start and end of the data. Then, I use string
slicing to grab that chunk of the data. The
data is then turned
into a list containing the individual lines.
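The same search-then-slice step can be reproduced without a network connection. The sketch below is for a current Python 3 interpreter, not the Python 1.5 session shown here, and the page string is a made-up stand-in for the downloaded HTML (only the two player lines come from the sample data above; the title text is invented).

```python
import re

begin_data = re.compile(r"<pre>\n<b>.*</b>", re.I | re.DOTALL)
end_data = re.compile(r"</pre>")

# Hypothetical, heavily shortened page with the same overall shape.
page = (
    "<html><body>\n"
    "<pre>\n"
    "<b>Sample title</b>\n"
    "12-97 373p TN 03-97 <A HREF=/cgi-bin/ratings.pl/USCF/21005567>TARVER,NATHAN</A>\n"
    "12-94 401p TN 02-95 <A HREF=/cgi-bin/ratings.pl/USCF/12613391>TASHIE,DAPHNE</A>\n"
    "</pre>\n"
    "</body></html>\n"
)

begin_match = begin_data.search(page)
end_match = end_data.search(page)

# Keep only the text between the end of the title and the closing tag.
lines = page[begin_match.end():end_match.start()].split("\n")

print(len(lines))   # 4: an empty string, two player lines, an empty string
print(lines[1])
```

Because the slice starts right after the title's closing tag, the first and last list entries are empty strings, which is why the session indexes the list at 1 rather than 0.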
>>> perLineRegexp = re.compile(r".{5}\s+(\d{3,4})p?")
>>> test_data_lines[1]
'12-96 523p TN 10-96 <A HREF=/cgi-bin/ratings.pl/USCF/12659889>TABAKOFF,ADRIAN</A>'
>>> perLineRegexp.search(test_data_lines[1]).groups()
('523',)
Note that I've put \d{3,4}
in parentheses. This represents an extremely
powerful aspect of regular expressions:
grouping! The regular expression parser will
save any values that it finds in parentheses,
and they can be accessed via the groups
method as shown above. The final part of this
regular expression is "p?". The USCF uses "p"
after a rating to denote it as "provisional",
meaning that the player has not yet played 20
games. For our purposes it is not needed, so
"p?" tells the regular expression parser "if
there is a p there, ignore it and move on."
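To see grouping and the optional "p" side by side, here is a short sketch for a current Python 3 interpreter. The helper name extract_rating is made up; the two test strings are the leading parts of lines from the sample data above.

```python
import re

rating_regexp = re.compile(r".{5}\s+(\d{3,4})p?")

def extract_rating(line):
    """Return the captured rating as a string, or None if nothing matches."""
    match = rating_regexp.search(line)
    if match is None:
        return None
    # group(1) is the first parenthesized group; the trailing "p" is
    # matched when present but sits outside the parentheses.
    return match.group(1)

print(extract_rating("12-97 373p TN 03-97"))        # 373
print(extract_rating("08-99 1452 1261 MS 09-00"))   # 1452
print(extract_rating("no rating here"))             # None
```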
I continued this trial-and-error sequence,
slowly expanding my regular expression until
I arrived at something like this:
>>> perLineRegexp = re.compile(r".{5}\s+(\d{3,4})p?\s+(?:\d{3,4}p?)?\s*\w{2}\s+(?:(?:(\d{2})-(\d{2}))|Life)\s+<A.*USCF/(\d{8}).*/A>(?:\s+.{5}\s+(\d{3,4})\s*(?:\d{3,4}p?)?)?")
>>> perLineRegexp.match(test_data_lines[1]).groups()
('523', '10', '96', '12659889', None)
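The finished expression can be checked against the sample lines from the start of the article. This sketch targets a current Python 3 interpreter; the pattern is the one above, split across three string literals only for readability, and the helper name current_rating is made up. It applies the same rule as the program: prefer the newer rating at the far right when there is one.

```python
import re

per_line_regexp = re.compile(
    r".{5}\s+(\d{3,4})p?\s+(?:\d{3,4}p?)?\s*\w{2}\s+"
    r"(?:(?:(\d{2})-(\d{2}))|Life)\s+<A.*USCF/(\d{8}).*/A>"
    r"(?:\s+.{5}\s+(\d{3,4})\s*(?:\d{3,4}p?)?)?"
)

def current_rating(line):
    """Return (player_id, rating) for a data line, or None if it doesn't match."""
    match = per_line_regexp.match(line)
    if match is None:
        return None
    rating, exp_month, exp_year, player_id, newer_rating = match.groups()
    if newer_rating is not None:
        rating = newer_rating
    return player_id, rating

sample_lines = [
    "12-97 373p TN 03-97 <A HREF=/cgi-bin/ratings.pl/USCF/21005567>TARVER,NATHAN</A>",
    "10-99 385 TN 11-99 <A HREF=/cgi-bin/ratings.pl/USCF/12752592>TATE,JEREMY</A> 10-05 367",
    "08-99 1452 1261 MS 09-00 <A HREF=/cgi-bin/ratings.pl/USCF/12499243>PATTILLO,BILLY R</A> 09-22 1476 1261",
]

for line in sample_lines:
    print(current_rating(line))
# ('21005567', '373')
# ('12752592', '367')
# ('12499243', '1476')
```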
Using Python with VIM
While regular expressions form a key
part of this program, they didn't account
for the fivefold performance increase I
experienced when porting this program from C++
to Python! The secret lies in the statement
"if this_player.PlayerId not in player_id_list:
continue". Before, with C++, I was stuck
between tough choices: suffer the performance
penalty of hitting the database more times than
needed (by issuing an UPDATE statement for each
row) or take the time to implement a binary
search algorithm. Being pressed for time,
I chose the former and immediately regretted
it. Luckily, Python came along and saved the
day, yet again, with its rich, powerful data
types built into the language itself.
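The filtering step singled out above can be sketched on its own. The example is written for a current Python 3 interpreter, and the two ID lists are small made-up samples built from player IDs that appear in this article. The set-based variant at the end is not part of my program (sets did not exist in Python 1.5); it is shown only as one option for very long lists, since "in" on a list checks the elements one by one.

```python
# IDs of players in our database (sample values).
player_id_list = ["21005567", "12613391", "12752592"]

# IDs found on a downloaded page (sample values).
downloaded_ids = ["12659889", "21005567", "12752592", "12739657"]

# Skip anyone who is not in our database, exactly as the program does.
kept = []
for player_id in downloaded_ids:
    if player_id not in player_id_list:
        continue
    kept.append(player_id)
print(kept)        # ['21005567', '12752592']

# Assumption: modern Python only. A set answers "in" with a hash lookup.
known_ids = set(player_id_list)
kept_with_set = [pid for pid in downloaded_ids if pid in known_ids]
print(kept_with_set == kept)   # True
```

Either way, only players who pass the test are queued for an UPDATE, so the database is never touched for the hundreds of other names on each page.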
Also, as another example of Python's
flexibility, it can be used as the embedded
scripting language for VIM. While writing this
article, I grew tired of manually replacing
> with &gt;, so I composed the following
function and put it in ~/vim-htmlify.py:
# a little macro for VIM that helps when composing HTML documents
import vim, string, htmlentitydefs
htmlequivs = {}
# swap built-in table, we want a dictionary indexed by
# characters that can't be used.
for key, value in htmlentitydefs.entitydefs.items():
    if key != "amp":
        htmlequivs[value] = key
def htmlify():
    for i in range(0, len(vim.current.range)):
        cLine = vim.current.range[i]
        # escape ampersands first so the entities added below are left alone
        cLine = string.replace(cLine, "&", "&amp;")
        for badchar in htmlequivs.keys():
            cLine = string.replace(cLine, badchar, "&" + htmlequivs[badchar] + ";")
        vim.current.range[i] = cLine
    print len(vim.current.range), "line(s) HTMLified"
and put this in ~/.vimrc:
pyfile ~/vim-htmlify.py
map h :py htmlify()<CR>
With the touch of "h", I could
convert <b>Foo Bar</b> into
&lt;b&gt;Foo Bar&lt;/b&gt;. This is
yet another example of the power and ubiquity
of Python!
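The escaping logic of the macro can also be tried outside VIM. The sketch below assumes a current Python 3 interpreter, where the entity table lives in html.entities rather than htmlentitydefs; the function name htmlify_line is made up, and the comparison with html.escape is only there as a cross-check for this particular input.

```python
import html
from html.entities import entitydefs

# Invert the name -> character table, skipping "amp" as the macro does.
html_equivs = {}
for name, char in entitydefs.items():
    if name != "amp":
        html_equivs[char] = name

def htmlify_line(line):
    """Replace every character that has a named HTML entity with that entity."""
    # Ampersands go first so the entities added below are not escaped again.
    line = line.replace("&", "&amp;")
    for bad_char, name in html_equivs.items():
        line = line.replace(bad_char, "&" + name + ";")
    return line

print(htmlify_line("<b>Foo & Bar</b>"))
# &lt;b&gt;Foo &amp; Bar&lt;/b&gt;

# For &, < and > the standard library gives the same answer.
print(html.escape("<b>Foo & Bar</b>", quote=False))
```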
Conclusion
I hope I have given you a taste of how
Python can be a very effective tool to parse
information from the Internet. In this case,
this simple program, written in 2 days with no
prior knowledge of Python, has saved Memphis
Scholastic Chess over 70 hours to date in the
3 months that it has been implemented. Python
combines ease of use with the ability to run on
multiple platforms and provides a rich library
that makes the tasks that are simple in theory
(downloading a web page, parsing an HTML file,
showing a window on the screen, etc.) simple
in practice. I strongly urge you to try out
Python. Stay tuned for the next article,
wherein I'll create a cross-platform GUI to run
on top of our data parsing application!
Copyright Jupitermedia Corp.
All Rights Reserved.