November 26, 2014
 
 

Python Squeezes the Web - page 3

Introduction

  • October 23, 1999
  • By Stephen Pitts

Now that I've piqued your interest with this mass of code, you are probably wondering ...

What is Python?

(from the Python web site)  

Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python's simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.

 

Often, programmers fall in love with Python because of the increased productivity it provides. Since there is no compilation step, the edit-test-debug cycle is incredibly fast. Debugging Python programs is easy: a bug or bad input will never cause a segmentation fault. Instead, when the interpreter discovers an error, it raises an exception. When the program doesn't catch the exception, the interpreter prints a stack trace. A source level debugger allows inspection of local and global variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a line at a time, and so on. The debugger is written in Python itself, testifying to Python's introspective power. On the other hand, often the quickest way to debug a program is to add a few print statements to the source: the fast edit-test-debug cycle makes this simple approach very effective.
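As a small illustration of the exception behavior described above, here is a sketch written for a current Python 3 interpreter (the article itself targets Python 1.5.2). The function name parse_rating is made up for this example.

```python
# Hypothetical helper: parse_rating is not part of the article's program.
def parse_rating(text):
    # int() raises ValueError on bad input instead of crashing the process.
    return int(text)

try:
    parse_rating("523p")
except ValueError as error:
    # The program keeps running and can report the problem.
    message = "bad input: %s" % error
    print(message)

# Good input is converted normally.
print(parse_rating("523"))
```

If the try/except is left out, the interpreter prints a stack trace that points at the failing line.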

 

What I Like about Python

I've written programs in a number of different languages, including Visual Basic, C/C++, Perl, and PHP3. Python has some features that make it, in my opinion, substantially more flexible than other languages:

  • Powerful Datatypes and Operations -

    Python has built-in strings, tuples, lists, dictionaries, and more. Want your function to return two values? Return a tuple, an immutable list of values! Want to grab the elements at indexes 4 through 6 of list MyList? Use slice notation to write: MyList[4:7] (the end index is excluded)! This slice notation works on strings, too, so "Monty Python"[0:5] evaluates to "Monty". Lists can also grow dynamically. You can easily iterate over the elements of a list with the "for" statement, and the "in" and "not in" operators let you take advantage of Python's built-in membership tests instead of having to code your own search. Very few languages have this type of functionality built in and available as a core part of the language. Most of the time, a special add-on library (such as STL) is required to get all these features.

  • Rapid Development with an Interactive Interpreter -

    Rather than go through the compile/test/run cycle of most traditional programming languages, or even the edit/run cycle of many scripting languages, Python has an incredibly useful interactive interpreter. During the development of the aforementioned application, I pasted a chunk of data into the interpreter, assigned it to a variable, and wrote the string parsing regular expression in about an hour. Whenever I'm curious about the built-in methods of a list, I pop into the interpreter and run dir([]). When I'm not sure exactly how some esoteric feature works, I define a test case and run it. I even build my applications bottom up, importing and testing critical functions in the interpreter before I write the top-level code that uses the functions.

  • Runs on Multiple Platforms and Has the Same Implementation on Multiple Platforms -

    I don't have to extol the virtues of a multi-platform language to you; you have a tremendous amount of flexibility in where you develop and deploy your applications. You can write Python on a Mac and upload it to a Unix server, start out with a Linux server and move up to a Sun Ultra, and so on. But unlike some "cross-platform" languages such as ANSI C++ (in which I originally wrote the uscfratingd program) and PHP3, Python supports the same features everywhere because there is only one main implementation of Python in common use. I have written GUI applications with Python that run on Windows and Linux without changing one line of code (more on this in a following article). Can any other language (aside from Perl) claim this type of functionality?

  • Rich Core Library -

    Out of the box, on all platforms, Python programs can use sockets to speak any protocol or use predefined classes to speak HTTP, FTP, SMTP, POP, Telnet and a variety of other Internet protocols. Built-in classes are provided to permit your app to parse XML, HTML, and SGML. Regular expressions, a powerful feature that allows text parsing (look at the perLineRegexp variable in the program for a useful example), were borrowed from Perl and are present in Python on Windows, Mac and Linux. Python/Tk, a moderately powerful GUI framework, is available for Windows, Mac and Unix, and wxPython, a wrapper for the wxWindows C++ library, is available for Unix and Windows and is under development for BeOS and the Macintosh. Overall, Python provides a lot of features for free that might require costly third-party libraries in other languages.
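A few of the datatype features from the list above can be tried directly in the interpreter. The following is a minimal sketch for a current Python 3 interpreter (the article itself targets Python 1.5.2); the names my_list and min_max are made up for illustration.

```python
# Illustrative only: my_list and min_max are hypothetical names.
my_list = [10, 11, 12, 13, 14, 15, 16, 17]

# A slice includes the start index and excludes the end index.
middle = my_list[4:7]               # the elements at indexes 4, 5 and 6
first_word = "Monty Python"[0:5]    # slicing works on strings too

# Lists grow on demand.
my_list.append(18)

# Return two values at once by returning a tuple.
def min_max(values):
    return min(values), max(values)

lowest, highest = min_max(my_list)

# Membership tests with "in" and "not in".
has_twelve = 12 in my_list
lacks_99 = 99 not in my_list

# dir([]) lists the names available on a list object.
list_can_append = "append" in dir([])

print(middle, first_word, lowest, highest)
print(has_twelve, lacks_99, list_can_append)
```

Note that membership tests on a list walk the elements one by one; a dictionary is usually the better choice when lookups need to be fast.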

The Example Explained

I don't have enough space to provide a complete introduction to Python (check the Python Tutorial for that), but I'll try to explain things briefly as I go. If you've done some sort of programming before, you'll find that Python is extremely easy to learn and lets you do a lot with a small amount of code. To try the code in this article, install a copy of Python for your distribution of Linux. Debian 2.1 users should be able to just type "apt-get install python", and Python 1.5.x is included with RedHat 5.0 or higher and can be installed with glint. Using your favorite text editor (I like VIM), pull up a chair and follow along! Note, in order to run the example program exactly as written, you'll need to create a MySQL table called "Players" like this:



CREATE TABLE Players (

        PlayerId char(8) NOT NULL PRIMARY KEY,

        LastName varchar(50) NOT NULL,

        FirstName varchar(50) NOT NULL,

        USCFRating mediumint NOT NULL,

        State char(2) NOT NULL,

        ExpDate date NOT NULL

);

And some sample data from the list above:



INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('21005567', 'Tarver', 'Nathan', 'TN');

INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('12613391', 'Tashie', 'Daphne', 'TN');

INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('21005567', 'Tate', 'Jeremy', 'TN');

INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('12660161','Patterson','Raphael','MS');

INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('12499243','Pattillo','Billy','MS');



INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('12660161','Patton','Sam','MS');

INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('12739657','Payne','Daniel','MS');

 

Defining the Regular Expression

In developing this program, I started by looking at my data. For a couple of hours, I dabbled in the interpreter like so:



Python 1.5.2 (#0, Sep 13 1999, 09:12:57)  [GCC 2.95.1 19990816 (release)] on linux2

Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam

>>> import urllib, re, string

>>> beginDataRegexp = re.compile(r"<pre>\n<b>.*</b>", re.I | re.DOTALL)

>>> endDataRegexp = re.compile(r"</pre>")

>>> test_data = urllib.urlopen("http://www.64.com/cgi-bin/ratings.pl?nm=T&st=TN").read()

>>> len(test_data)

25286

Thus far, I've pulled in some libraries that give me string, URL downloading, and regular expression functions. Then, I defined two regular expressions. Regular expressions are analogous to keys. Put simplistically, the regular expression is moved down the string one character at a time, much like trying a key on each door in a hallway. When the regular expression matches, the door is opened. The first regular expression looks for the text "<pre>" followed by a newline, followed by a "<b>" pair of tags with something inside of it. The second one looks for the closing "</pre>" tag. After that, I downloaded some test data from the USCF web site to play with. One line of code, that's all it takes! The original C++ version of this program required 11 lines of code to emulate the functionality of just this one! It also had to be linked with the GNOME HTTP library, which wasn't present on my target FreeBSD system.
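To see what the first pattern accepts and rejects, here is a small sketch for a current Python 3 interpreter. The test strings are made up for illustration and are not taken from the USCF page.

```python
import re

begin_data_regexp = re.compile(r"<pre>\n<b>.*</b>", re.I | re.DOTALL)

# Match: "<pre>", a newline, then a <b>...</b> pair.
# re.I makes the tag names case-insensitive.
hit = begin_data_regexp.search("junk <PRE>\n<B>Name</B> more")
print(hit.group())

# No match: there is no newline between "<pre>" and "<b>".
miss = begin_data_regexp.search("<pre><b>Name</b>")
print(miss)   # None
```

Because of re.DOTALL, the ".*" can also run across line breaks, so on a real page the match extends to the last "</b>" that follows.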



>>> beginMatch = beginDataRegexp.search(test_data)

>>> endMatch = endDataRegexp.search(test_data)

>>> test_data_lines = string.split(test_data[beginMatch.end():endMatch.start()], "\n")

>>> len(test_data_lines)

Here, I'm using the regular expressions that I defined earlier to search the string for matches. Then, I use string slicing to grab the chunk of data between the end of the first match and the start of the second. That chunk is then split into a list containing the individual lines.
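The search, slice, and split steps can be reproduced without a network connection. This is a minimal sketch for a current Python 3 interpreter, where strings have their own split method in place of string.split; the sample_page text is a made-up stand-in for the downloaded page.

```python
import re

# Made-up stand-in for the downloaded page; the real layout may differ.
sample_page = (
    "<html>\n<pre>\n<b>heading row</b>\n"
    "first data line\n"
    "second data line\n"
    "</pre>\n</html>\n"
)

begin = re.compile(r"<pre>\n<b>.*</b>", re.I | re.DOTALL).search(sample_page)
end = re.compile(r"</pre>").search(sample_page)

# Slice from the end of the first match to the start of the second.
chunk = sample_page[begin.end():end.start()]

# Split the chunk into individual lines.
lines = chunk.split("\n")
print(lines)
```

In this sample the first and last list entries are empty strings, because the chunk begins and ends with a newline. That is one reason to start looking at index 1 instead of index 0.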



>>> perLineRegexp = re.compile(r".{5}\s+(\d{3,4})p?")

>>> test_data_lines[1]

'12-96  523p        TN 10-96 <A HREF=/cgi-bin/ratings.pl/USCF/12659889>TABAKOFF,ADRIAN</A>'

>>> perLineRegexp.search(test_data_lines[1]).groups()

('523',)

Note that I've put \d{3,4} in parentheses. This represents an extremely powerful aspect of regular expressions: grouping! The regular expression parser saves any values that it finds inside parentheses, and they can be accessed via the groups method as shown above. The final part of this regular expression is "p?". The USCF uses "p" after a rating to denote it as "provisional", meaning that the player has not yet played 20 games. For our purposes it is not needed, so "p?" tells the regular expression parser "if there is a p there, ignore it and move on."
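The effect of the group and the optional "p" can be checked on two short strings. This sketch is for a current Python 3 interpreter; the second string, without the "p", is a hypothetical line made up for comparison.

```python
import re

per_line_regexp = re.compile(r".{5}\s+(\d{3,4})p?")

provisional = "12-96  523p        TN 10-96"
established = "12-96 1523         TN 10-96"   # hypothetical line without "p"

# Only the text inside the parentheses is captured; the "p" is skipped.
print(per_line_regexp.search(provisional).groups())   # ('523',)
print(per_line_regexp.search(established).groups())   # ('1523',)
```

The groups method always returns a tuple, so a pattern with a single group yields a one-element tuple.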

I continued this trial and error sequence, slowly expanding my regular expression until I arrived at something like this:

>>> perLineRegexp = re.compile(r".{5}\s+(\d{3,4})p?\s+(?:\d{3,4}p?)?\s*\w{2}\s+(?:(?:(\d{2})-(\d{2}))|Life)\s+<A.*USCF/(\d{8}).*/A>(?:\s+.{5}\s+(\d{3,4})\s*(?:\d{3,4}p?)?)?")

>>> perLineRegexp.match(test_data_lines[1]).groups()

('523', '10', '96', '12659889', None)
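One way to make the captured tuple easier to work with is to unpack it into named fields. The following is a sketch for a current Python 3 interpreter, not the article's own code: the function parse_line and the field names are guesses made for illustration, and the meaning of the fifth group in particular is an assumption.

```python
import re

# The same pattern as above, split across lines for readability.
per_line_regexp = re.compile(
    r".{5}\s+(\d{3,4})p?\s+(?:\d{3,4}p?)?\s*\w{2}\s+"
    r"(?:(?:(\d{2})-(\d{2}))|Life)\s+<A.*USCF/(\d{8}).*/A>"
    r"(?:\s+.{5}\s+(\d{3,4})\s*(?:\d{3,4}p?)?)?"
)

def parse_line(line):
    """Return a dict of captured fields, or None if the line does not match."""
    match = per_line_regexp.match(line)
    if match is None:
        return None
    # Field names are hypothetical labels for the five groups.
    rating, exp_month, exp_year, player_id, extra_rating = match.groups()
    return {
        "rating": int(rating),
        "exp_month": exp_month,
        "exp_year": exp_year,
        "player_id": player_id,
        "extra_rating": extra_rating,
    }

line = ("12-96  523p        TN 10-96 "
        "<A HREF=/cgi-bin/ratings.pl/USCF/12659889>TABAKOFF,ADRIAN</A>")
print(parse_line(line))
print(parse_line(""))   # None: an empty line does not match
```

Returning None for lines that do not match lets a caller skip blank or malformed rows without special cases.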
