Screen Scraping with Python

I’ve done a few projects now involving screen scraping web data with Python.  Here’s my (for now) preferred method.

Define the Problem

Always a good first programming step!  In this case, the problem is “how do I extract <bit of data> from <website address>?”

For this example, we’ll extract the population density of Jackson County, Missouri, from its city-data webpage.

Download the Webpage

There are several Python libraries that allow you to download a webpage.  I’ve found mechanize to be the easiest to use.

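import mechanize
mech = mechanize.Browser() #mechanize will mimic a web browser - web servers are none the wiser.
url = "" #an example URL from my last project.  Contains county data.
response = mech.open(url) #open() fetches the page and returns a response object
page = response.read() #read() returns the url's html code as one big string (bytes on Python 3, so decode it before regex matching)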

If you are behind a firewall or something and need to access the internet through a proxy server, then before calling mechanize’s open function, you need to set up the proxy.
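
Here’s a rough sketch of that ordering.  It assumes mechanize’s Browser.set_proxies(), which takes a dictionary mapping a URL scheme to a "host:port" string.  The proxy address and the open_via_proxy helper are made up for illustration, so swap in your own values.

```python
# Hypothetical proxy address; substitute your own host and port.
PROXIES = {"http": "proxy.example.com:3128"}


def open_via_proxy(browser, url, proxies=PROXIES):
    """Configure the proxy first, then fetch the page and return its HTML."""
    browser.set_proxies(proxies)  # must run before open()
    response = browser.open(url)
    return response.read()


# Usage, assuming mechanize is installed:
#   import mechanize
#   page = open_via_proxy(mechanize.Browser(), url)
```

The helper only needs an object with set_proxies() and open(), so the same function works with the “mech” browser created above.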


Extract the Desired Data with Regular Expressions

Now we have the webpage in html format.  It’s just a big ol’ string.  Here’s a snippet:

<table border="0" cellpadding="0" cellspacing="0"><tr><td>Population density: 1167 people per square mile&nbsp;</td><td><div align="left"><table border="2" cellpadding="0" cellspacing="0" width="20" bordercolor="#DDDD00" bgcolor="#e8e8e8"><tr><td>&nbsp;</td></tr></table></div></td> <td>&nbsp;(very high).</td></tr></table>

You can view the html source for a webpage with pretty much any web browser by right-clicking and selecting “View source.”  Looks kind of confusing, eh?

We need to search through the string we downloaded with mechanize for our piece of data.  Luckily, Python has a built-in, well-developed regular expression library that works great for this kind of problem.

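import re
extractor = re.compile(r'Population density: (.+?) people') #capture whatever sits between the label and " people"
data = extractor.findall(page)[0] #findall() returns every captured match; take the first one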

“compile()” sets up a regular expression for reuse in other functions.  The “r” in front of the search string marks it as a raw string, so Python leaves any backslashes in the pattern alone.  The parentheses ( ) mark a capturing group: findall() returns only the text matched inside them.  “.” matches any character (except a newline), “+” matches 1 or more of the preceding expression, and “?” after the “+” makes that repetition non-greedy, so it stops at the first “ people” it finds.
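
To see what the non-greedy “?” buys you, here is a small comparison.  The sample string is invented for illustration (it is not the real page), and the names lazy and greedy are mine.

```python
import re

# Made-up sample text containing the word "people" twice.
sample = "Population density: 1167 people per square mile. Most people live in cities."

lazy = re.compile(r'Population density: (.+?) people')    # stops at the first " people"
greedy = re.compile(r'Population density: (.+) people')   # runs on to the last " people"

lazy_match = lazy.findall(sample)
greedy_match = greedy.findall(sample)

print(lazy_match)    # ['1167']
print(greedy_match)  # ['1167 people per square mile. Most']
```

On a real page the word “people” may well show up again further down, so the non-greedy version is the safer choice.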

“data” is a string that can be printed, converted to a number and used in calculations or whatever further along in your script.
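
One possible way to wrap the extraction and the number conversion together is sketched below.  The extract_density function is a hypothetical helper, and stripping commas assumes the site might print large values like “1,167”.  It also returns None when the pattern is missing, instead of letting the [0] index raise an error.

```python
import re

DENSITY_PATTERN = re.compile(r'Population density: (.+?) people')


def extract_density(html):
    """Return the population density as an int, or None if the pattern is absent."""
    match = DENSITY_PATTERN.search(html)
    if match is None:
        return None
    # Drop thousands separators such as "1,167" before converting.
    return int(match.group(1).replace(",", ""))


density = extract_density("<td>Population density: 1167 people per square mile&nbsp;</td>")
print(density * 2)  # 2334
```

If the captured text is not a whole number, int() raises a ValueError, which is usually a useful signal that the page layout has changed.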

Congratulations!  You screen scraped some data with Python!
