Screen Scraping with Python

I’ve done a few projects now involving screen scraping web data with Python.  Here’s my (for now) preferred method.

Define the Problem

Always a good first programming step!  In this case, the problem is “how do I extract <bit of data> from <website address>?”

For this example, we’ll extract the population density of Jackson County, Missouri, from its city-data webpage.

Download the Webpage

There are several Python libraries that allow you to download a webpage.  I’ve found mechanize to be the easiest to use.

import mechanize
mech = mechanize.Browser() #mechanize will mimic a web browser - web servers are none the wiser.
url = "http://www.city-data.com/county/Jackson_County-MO.html" #an example URL from my last project.  Contains county data.
response = mech.open(url)
page = response.read() #read() returns the url's html code as one big string
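
One aside, in case you are running this on Python 3: there, read() should hand back bytes rather than a str, and a str regular expression will refuse to search bytes.  Here is a minimal sketch of decoding first.  The helper name to_text and the utf-8 encoding are my assumptions for illustration, not something mechanize provides, and the sample bytes just stand in for a downloaded page.

```python
def to_text(raw, encoding="utf-8"):
    """Return raw as a str, decoding it first if it is bytes."""
    if isinstance(raw, bytes):
        # errors="replace" keeps going if a few bytes are not valid in this encoding
        return raw.decode(encoding, errors="replace")
    return raw

# Stand-in for what response.read() might return; no network access needed here
sample = b"<td>Population density: 1167 people per square mile&nbsp;</td>"
text = to_text(sample)
print(type(text).__name__)
```

If the page declares a different encoding, pass that instead of utf-8.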

If you are behind a firewall and need to reach the internet through a proxy server, set up the proxy before calling mechanize’s open function.

PROXY = {'http':'PUT_YOUR_PROXY_SERVER_AND_PORT_HERE'}
mech.set_proxies(PROXY)

Extract the Desired Data with Regular Expressions

Now we have the webpage in html format.  It’s just a big ol’ string.  Here’s a snippet:

<table border="0" cellpadding="0" cellspacing="0"><tr><td>Population density: 1167 people per square mile&nbsp;</td><td><div align="left"><table border="2" cellpadding="0" cellspacing="0" width="20" bordercolor="#DDDD00" bgcolor="#e8e8e8"><tr><td>&nbsp;</td></tr></table></div></td> <td>&nbsp;(very high).</td></tr></table>

You can view the html source for a webpage with pretty much any web browser by right-clicking and selecting “View source.”  Looks kind of confusing, eh?

We need to search through the string we downloaded with mechanize for our piece of data.  Luckily, Python has a built-in, well-developed regular expression library that works great for this kind of problem.

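import re
extractor = re.compile(r'Population density: (.+?) people') #the parentheses capture just the number
data = re.findall(extractor,page)[0] #findall() returns a list of the captured strings; [0] takes the first match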

“compile()” turns the pattern into a regular expression object for use in other functions.  The “r” in front of the search string makes it a raw string, so Python leaves any backslashes alone before the regular expression library sees them.  The parentheses ( ) mark a capturing group: findall() returns only the text matched inside them, not the whole match.  “.” matches any character (except a newline), “+” matches 1 or more of the preceding expression, and “?” after the “+” makes it non-greedy, so it stops at the first “ people” it finds.
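
To see what the “?” buys you, here is a small sketch comparing the non-greedy pattern with a greedy one.  The text below is a made-up string (not the real page) that I wrote so the word “people” shows up twice.

```python
import re

# Made-up sample text with " people" appearing twice
text = "Population density: 1167 people per square mile. Most people live in the city."

# Non-greedy: the group stops at the first " people"
lazy = re.findall(r'Population density: (.+?) people', text)

# Greedy: the group runs on to the last " people" in the string
greedy = re.findall(r'Population density: (.+) people', text)

print(lazy)    # ['1167']
print(greedy)  # ['1167 people per square mile. Most']
```

On a real page with lots of text after the number, the greedy version can swallow far more than you wanted, which is why I stick with “.+?”.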

“data” is a string that can be printed, converted to a number and used in calculations or whatever further along in your script.
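
Here is a rough sketch of that conversion step, wrapped up so it does not blow up with an IndexError when the page has no match.  The function name parse_density is hypothetical, and stripping commas is my assumption about how a larger number might be formatted; the snippet is trimmed from the html shown earlier.

```python
import re

DENSITY_RE = re.compile(r'Population density: (.+?) people')

def parse_density(html):
    """Return the population density as an int, or None if there is no match."""
    match = DENSITY_RE.search(html)
    if match is None:
        return None
    # Drop thousands separators such as "1,167" before converting;
    # int() still raises ValueError if the captured text is not a number
    return int(match.group(1).replace(",", ""))

snippet = '<td>Population density: 1167 people per square mile&nbsp;</td>'
density = parse_density(snippet)
print(density * 2)  # it is a real number now, so arithmetic works
```

Returning None leaves it up to the rest of your script to decide what a missing value means.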

Congratulations!  You screen scraped some data with Python!


