Category Archives: Python

edX: Introduction to Computer Science and Programming Using Python (MIT)

Actually finished up this course a while ago, but hadn’t written it up until now.  I felt a little vindicated by how easy this was, especially after getting hammered by the Discrete Optimization course I took just before it.  I barely watched any of the videos; I really just did the programming assignments.

  • One on string manipulation, finding vowels or substrings
  • A debt/min payment equation solver
  • Hangman game
  • Scrabble-like game
  • Caesar cipher coding and decoding
  • Parsing an RSS feed for stories matching desired keywords, etc.

Everything was pretty much fill-in-the-blank.  Decent auto-grader, as seems to be the norm for these types of online programming classes.

Besides being a good Python review, I did learn more about properly working with dictionaries:

Remember that with a dictionary, the usual way to access a value is hand['a'], where 'a' is the key we want to find. However, this only works if the key is in the dictionary; otherwise, we get a KeyError. To avoid this, we can use the call hand.get('a',0). This is the “safe” way to access a value if we are not sure the key is in the dictionary. d.get(key,default) returns the value for key if key is in the dictionary d, else default.
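A quick sketch of the difference, with a made-up hand dictionary (hypothetical example data, not from the course):

hand = {'a': 1, 'p': 2, 'l': 1, 'e': 1}   # hypothetical Scrabble-style hand

print(hand['a'])         # 1 -- fine, 'a' is a key
print(hand.get('z', 0))  # 0 -- 'z' is missing, so the default is returned
# hand['z']              # would raise KeyError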

 


Discrete Optimization on Coursera

I recently finished the Discrete Optimization course on Coursera, taught by Pascal Van Hentenryck.  I thought it was really good!  Van Hentenryck is pretty crazy in the first set of videos, pretending to be Indiana Jones.  He gets quite a bit tamer later on, but still good material.

The course covers three main optimization techniques:

  • Constraint Programming – “Constraint Programming represents one of the closest approaches computer science has yet made to the Holy Grail of programming: the user states the problem, the computer solves it.” – Freuder, 1997
  • Local Search – define a set of moves (a neighborhood), start somewhere, then keep making moves.  Usually you move to decrease the objective function, but there is a danger of getting stuck in local minima, so you do stuff like simulated annealing or tabu search to break out (see the sketch after this list).
  • Linear and Integer Programming – Linear program solution is a vertex of polytope defined by intersection of constraints.  Simplex algorithm is a way to explore vertices.  The nice thing about linear programming is that it is easy to show optimality.
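To illustrate the local search idea, here is a minimal, generic simulated annealing loop.  This is just a sketch, not course code; neighbor() and cost() are placeholders that the caller supplies.

import math
import random

def simulated_annealing(initial, neighbor, cost, temp=1000.0, cooling=0.995, steps=10000):
    """Generic local search with simulated annealing (illustrative sketch only)."""
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    for _ in range(steps):
        candidate = neighbor(current)          # random move in the neighborhood
        delta = cost(candidate) - current_cost
        # Always accept improvements; accept worse moves with a probability that
        # shrinks as the temperature cools -- this is what escapes local minima.
        if delta < 0 or random.random() < math.exp(-delta / temp):
            current, current_cost = candidate, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temp *= cooling
    return best, best_cost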

I was most intrigued by constraint programming due to some applications to a constrained scheduling problem at my job, so I focused on CP solutions to the problems … which is probably why I didn’t score as well as I would have liked; I think part of the point of the course is that no single hammer is sufficient to optimally drive all these nails.

The programming assignments were the best part of the course.  For each one, a skeleton Python script is provided which reads in data files and performs some simple “dummy” approach.  The student’s job is to implement something better…there’s no real rules on how you get there.  You can even enlist the aid of external solver libraries, which is what I did (see below).  There is an auto-grader which will test your solution against 6 or so data sets of increasing complexity.

The problems are fairly standard ones from the OR world.  These kinds of problems are NP-hard, so finding optimal solutions eventually takes exponential time as the problem size increases, but as covered in the lectures, the goal is to push that curve out as far to the right as possible.

  1. Knapsack – given a set of items with a given value and given weight, select the set of items to place into a weight-limited knapsack which maximizes total value
  2. Graph Coloring – use as few colors as possible to color a map, such that no countries with a shared border have the same color
  3. Traveling Salesman – find the shortest distance route which visits each location once
  4. Facility Location – there is a set of possible factories, each of which can produce a given number of widgets but costs a certain amount to open (so some factories will likely remain closed).  Then there is a set of customers, each with a given widget demand.  The customers and factories are located in a 2D space.  The goal is to minimize the total cost = setup cost of all open factories + the distance between each customer and the factory that serves it.
  5. Vehicle Routing – a set of vehicles start at a warehouse.  Each has a fixed capacity.  There is a set of customers at various locations relative to the warehouse.  Each has a fixed demand.  Vehicle routes must be determined such that every customer is serviced and total vehicle distance traveled is minimized.  (Kind of a combo of knapsack and traveling salesman.)

One bad part of this course is that there wasn’t a lot of guidance on actually implementing some of the ideas covered in the lectures.  It’s pretty much a free-for-all.  I did follow the recommendation to start each assignment with a simple greedy algorithm.  For instance, for knapsack, “take the greatest value-density items first” actually turned out to be a good strategy (sketched below).  I was actually pretty surprised that these simple greedy algorithms, for most of the problems, came up with a 70% – 80% solution.  (That’s my estimate; the scoring for the class was a bit tougher than that: 3 points for what I would call a 75% solution, 7 points for a 90% solution, and 10 points for a 95%+ solution.)  But I guess that’s what separates the men from the boys in OR – a decent solution is okay (probably good enough for most purposes), but it isn’t OPTIMAL.
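For reference, the value-density greedy heuristic for knapsack is only a few lines.  This is a sketch of the general idea, not my actual submission:

def greedy_knapsack(items, capacity):
    """Greedy heuristic: take the items with the highest value per unit weight first.

    items is a list of (value, weight) tuples; returns (total_value, chosen_indices).
    """
    order = sorted(range(len(items)),
                   key=lambda i: float(items[i][0]) / items[i][1],
                   reverse=True)
    total_value, remaining, chosen = 0, capacity, []
    for i in order:
        value, weight = items[i]
        if weight <= remaining:        # item fits in what's left of the knapsack
            chosen.append(i)
            total_value += value
            remaining -= weight
    return total_value, chosen

greedy_knapsack([(8, 4), (10, 5), (3, 1)], 7)   # -> (11, [2, 0])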

So, after my greedy solution attempts, I elected to integrate Google’s or-tools.  I thought Google would be a sure thing, but I wasn’t totally satisfied in the end.  All my issues have to do with the subpar documentation.  There actually is what appears to be pretty good documentation, but this is an illusion: much of it is incomplete stubs, and virtually nothing is specific to the Python wrapper (which sits on top of a C++ implementation, so I couldn’t easily examine the code).

So I didn’t really have any good direction on how to use the library, except for a large selection of excellent examples, mainly from a guy called hakank, that are included in the or-tools distribution.  There are actually example implementations of several of the assigned problems; I got decent scores by making these work with the course data formats.  But not quite 10-point solutions, and not always 7-point solutions … I blame that on a lack of tuning, since, again, I did not have much visibility into what the or-tools algorithms were doing.

I was also probably hampered by some Python rustiness, such as forgetting about Python’s mutable list behavior (example below).  I’m still using PyScripter and think it is fine (a good debugger is the key for me!), but I still don’t have a great feel for the best Python data structures to use in different situations.
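The classic gotcha, for anyone who hasn’t been bitten by it, is that assigning a list doesn’t copy it:

a = [1, 2, 3]
b = a          # b is the SAME list object, not a copy
b.append(4)
print(a)       # [1, 2, 3, 4] -- a changed too

c = a[:]       # a slice (or list(a)) makes an actual copy
c.append(5)
print(a)       # still [1, 2, 3, 4] -- a is unaffected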

Finally, like I said above, I only focused on constraint programming which was probably not always the correct tool for the job.  In the end I got 223 points out of 320; the requirement for a certificate was 224.  Yes, I was one point off.  (Reminds me of the time I missed 3rd place in the state Academic Decathlon meet by something like 30 points out of 10000 … but that’s another story.)  I realize the certificates are kind of meaningless, but having a deadline and a goal gets me motivated and I was pretty bummed to miss out. I really want to continue poking at some of these (and related) problems – pretty interesting stuff.

Computational Investing part I on Coursera

I recently finished up my second course on Coursera – Computational Investing I by Tucker Balch.  Learned a little bit more about how the stock market works, and (maybe somewhat) how the quants do their thing.  A little bit of what they do could probably be called order book exploitation — looking for arbitrage opportunities between buy and sell orders at the microsecond level.  This type of computational investing is pretty much out of reach for anyone outside of a big firm with direct trading floor access.

What’s left for the little guy is offline, day-to-day analysis of stock information.  Dr. Balch has created a Python package, QSTK, which pulls downloaded stock price information into a data analysis friendly format within Python.  I really liked how the assignments all fit together — I now have a basic, yet functional, “Market Sim” in Python.  It is not really a predictive simulation for future market movement; rather it is an evaluation tool for how a certain metric-based investment strategy would have performed if executed during a past window of time.

For instance, the final assignment involved setting up an “event study” looking for instances where a Bollinger band metric behaved in a fashion showing that the stock was falling while the overall market was rising.  Whenever the event was observed for a given stock, the tool logs a buy order for the stock followed by a sell order 20 days later.  Then, a separate piece of code reads in the buy/sell orders and computes what the overall rate of return (and other metrics like Sharpe ratio) would be given actual stock price historical data, then compares the strategy to the overall performance of the market during the same time period.

Thus the Market Sim developed in this course would allow the user to try out different investment strategies, evaluating them against real historical data.
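Here is roughly what the Bollinger band event test looks like.  This is not the QSTK code from the course, just a pandas sketch assuming prices is a DataFrame of adjusted closes (one column per symbol) with a 'SPY' column standing in for the overall market; the threshold values are illustrative, not the course’s exact numbers.

import pandas as pd

def bollinger_events(prices, market='SPY', window=20, threshold=-2.0):
    """Flag days where a stock looks weak on its Bollinger value while the market looks fine."""
    rolling_mean = prices.rolling(window).mean()
    rolling_std = prices.rolling(window).std()
    bollinger = (prices - rolling_mean) / rolling_std   # position within the bands, in std devs

    events = pd.DataFrame(False, index=prices.index, columns=prices.columns)
    for symbol in prices.columns:
        if symbol == market:
            continue
        # Event: the stock falls below the threshold while the market is above its rolling mean.
        events[symbol] = (bollinger[symbol] < threshold) & (bollinger[market] > 0)
    return events

An order generator would then walk the True cells, logging a buy on the event date and a sell 20 trading days later, which is what the Market Sim evaluates.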

It’s a simple yet pretty cool project that I’m glad to have completed.  I’m especially glad for the opportunity to get back into Python, which I still love (two-timing now with Matlab, I guess…) and for which I’ve got a pretty decent workflow going with PyScripter (nice tool).

But I don’t think I’ll be going into business anytime soon as a day trader.  My first inclination was to somehow meld the Market Sim developed in this course with a neural network or other machine learning strategy — then you wouldn’t need to explicitly define any “events.”  But a quick Google search shows that many have tried this and failed.  Maybe it is due to my second reservation, which is that past performance is not a reliable indication of future performance.  (But it’s the best we have to go with, so maybe that’s ok…)  Third reservation is that many strategies would involve too many individual trades to be profitable for the little guy, both when trade commissions and when the time required to execute trades manually are taken into account.  (Are there any systems where you can specify auto-trades upon a certain event?  That would be cool, but scary too … if the perfect storm comes and your algorithm gets really fouled up, it could clear out your bank account real fast.  Automated gambling, anyone?)

Not saying it is impossible for the little guy to be profitable using these methods, but it would be difficult.  The course itself talks about the fundamental law, which basically says that if you aren’t able to execute a small number of sure-fire, big-money trades, then your only recourse is to execute a very large number of more modest trades, each with a slim yet positive probabilistic chance of profitability.  But lots and lots of trades are impractical for the little guy for the reasons discussed above.  Also, I tend to believe in the efficient market hypothesis, simply because there are so many people working on this problem with so much more knowledge and resources than I have to devote to it.  I think I’ll stick with index investing.

On a side note, shortly before taking this course I read an article on the Aladdin system used by Blackrock.  (I can’t find the article now … it wasn’t this Economist piece, nor was it this one from FT.)  A lot of money is controlled by algorithms now, for better or for worse.  I couldn’t help thinking of the Asimov story from I, Robot, where artificial intelligences are directly controlling the economy . . . are we there yet?

Generation Ship

Here’s a project from a while back that I worked on for a week or two, lost steam, meant to come back to and finish, but never did.  It’s got a simple XML-based system for loading in content that was kind of fun to create.  Since there’s not too much actual content, the XML system is pretty much the only interesting thing here. 🙂
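To give a flavor of the XML content system: a hypothetical event entry, and the ElementTree code to read it, might look something like this.  The tag and attribute names here are made up for illustration and may not match what’s actually in the download.

import xml.etree.ElementTree as ET

# Hypothetical event definition -- the actual tags in gen.zip may differ.
EVENT_XML = """
<event id="mutiny">
  <text>The crew is restless and a mutiny is brewing.</text>
  <choice result="morale:+10">Give in to their demands</choice>
  <choice result="population:-20">Jettison the ringleaders out the airlock</choice>
</event>
"""

event = ET.fromstring(EVENT_XML)
print(event.get('id'))            # mutiny
print(event.find('text').text)    # the event description
for choice in event.findall('choice'):
    print('%s -> %s' % (choice.text, choice.get('result')))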

Called “Generation Ship,” it’s basically “The Oregon Trail” in space.  A generation ship is a real theoretical concept that may be used in the future to deliver colonists from Earth to far flung destinations across the galaxy.  Assuming faster-than-light travel remains a science fiction fantasy (ha! see what I did there), the only possibility to cross the light years that lie between us and even the nearest star may be generation ships that take hundreds of years, and several generations of people living, reproducing, and dying in transit.

This game is a text adventure that puts the player in the shoes of a generation ship captain.  The goal is to get to the destination star with a sufficient number of crew remaining to form a viable colony.  Along the way there are many random events that require the captain to make a decision – do you cave in to the mutineers’ demands (temporarily increasing morale but possibly leading to future supremacy struggles), or jettison them out the airlock (lowering population, the ultimate required commodity of success)?

Along with “The Oregon Trail,” other inspirations are old school text adventures and also the text quests from Space Rangers 2.

Some game design thoughts:

  • Content is hard to create!  In “Generation Ship” the content is the different random events that may fire and the decisions that could be made by the player at each event.  My idea was to write a few lines for each; even that proved pretty difficult.  If anyone actually downloads and “plays” you’ll find the selection of events to be rather limited.  Lesson: for me, creating the game engine is more interesting than creating content.  So that’s what I’ll try to stick with.  I love the idea of Minecraft’s infinite, procedurally generated terrain.  Great variety and even beauty, but no need for an army of “level designers” to create it.
  • Along those same lines … there’s really no way for one guy, no matter how smart or clever or even how much time he has available (barring infinity or a lifetime, I guess) to create a game on par in all aspects with the latest AAA title from Bioware, Blizzard, Rockstar, or whoever.  It’s one of those “choose your battles” wisely things.  Where are your strengths?  What do you like the most about creating games?  Focus on those areas.
  • Feature creep.  I kept on thinking of scenarios that might happen to the generation ship and then outcomes that might result from the player’s decision.  Some might directly affect the number of crew members, some their morale, and some their health.  I have a different “meter” for each of these.  Then there are things like what’s in the ship’s cargo bay, the ship’s speed, and the ship’s condition.  Yet more “meters.”  But I don’t think they add too much to the experience and could be simplified into just a few meters.
  • Who knows about game balance.  Right now the frequency of events and in some cases which possible outcome is chosen is random; serious play testing or some kind of simulation would be needed to fine tune the balance.

I think I am beginning to see why there are game designers as well as game programmers, and that the two are not equivalent!

Download “Generation Ship” here: gen.zip.  Run wxui.py for the WxPython GUI version or gen.py for the command line version.  I warn you that although it is technically complete, it’s more like a skeleton that needs to be fleshed out than a fully formed beast.

Adding date photo taken to filename

Digital photo organization is a pain.  Everybody has their own scheme (although I suspect many have none at all).  Here’s mine:

  1. Keep all pictures in one directory.  Yup, no subdirectories at all.
  2. When downloading files from the camera’s SD card, batch rename the files according to the event/place/whatever (Win XP: select the group of files, then ‘F2’ or right-click and ‘Rename’; Picasa also has a batch rename function).  You’ll end up with ‘Grandpa’s Birthday.jpg’, ‘Grandpa’s Birthday (1).jpg’, ‘Grandpa’s Birthday (2).jpg’, …
  3. Append the date the picture was taken to the front of the filename.  That way, the files will display in chronological order.

My Python script jpg_batch_rename.py (uses EXIF.py) does #3 for me.  For all files in a chosen directory, it extracts the date a picture was taken from the jpg exif datetime field.  If the file is not a jpg or if exif datetime info is not present for some reason, then it will extract the date the file was last modified.  The resulting date is appended to the front of each filename.  I don’t want to change the “file last modified” date itself, so at the end I reset it to what it was before the file rename.

Besides EXIF.py, the rest of the magic is done primarily with the Python os module.
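A stripped-down sketch of the idea (this is not the full jpg_batch_rename.py, and the EXIF tag name is recalled from memory, so treat the details as assumptions):

import os
import time
import EXIF  # the old EXIF.py module

def date_prefix(path):
    """Return a YYYY-MM-DD prefix from EXIF data, falling back to the file's modified time."""
    taken = None
    try:
        f = open(path, 'rb')
        tags = EXIF.process_file(f)
        f.close()
        # Tag name assumed from EXIF.py's output.
        stamp = str(tags.get('EXIF DateTimeOriginal', ''))
        if stamp:
            taken = time.strptime(stamp, '%Y:%m:%d %H:%M:%S')
    except Exception:
        pass                                   # not a jpg, or no EXIF data
    if taken is None:
        taken = time.localtime(os.path.getmtime(path))
    return time.strftime('%Y-%m-%d', taken)

def rename_with_date(path):
    folder, name = os.path.split(path)
    mtime = os.path.getmtime(path)             # remember the original timestamp
    newpath = os.path.join(folder, date_prefix(path) + ' ' + name)
    os.rename(path, newpath)
    os.utime(newpath, (mtime, mtime))          # put "last modified" back the way it was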

For added usability, jpg_batch_rename can be used in conjunction with image_viewer2_SRO.py.  I adapted a script from The Mouse vs The Python blog that uses wxPython to put the jpg_batch_rename functionality in a GUI.

After loading a directory with the button in the upper left, the first picture will be displayed.  (If there aren’t any pictures in the directory then it will crash, sorry.)  “Previous” and “Next” allow you to scroll through the photos – left over from the script I stol- er, borrowed.  The “Add Date to Filename” button calls jpg_batch_rename and will append each file’s date to the front of every filename in the directory – even non-picture files.  Checking the “Make backup” option will copy the original files to a “backup” subdirectory before making any changes.  The field at the bottom allows you to add an optional descriptor to the filename along with the date.

Before:

After:

Words of the Prophets

The General Conference of the Church of Jesus Christ of Latter-Day Saints (Mormons) is held twice a year, in April and October.  General Authorities of the Church, including Prophets and Apostles, speak to the church membership about doctrinal issues and give other counsel.

During the last conference, I wondered: Is there a pattern to what is taught in General Conference?  Maybe Python can help us find out….

Methodology

This project involves screen scraping lds.org’s Ensign archives and then using a concordance (of sorts) to do some analysis for word counts and word usage frequency.  My thinking is, the more a certain word is used, the more the general authorities are giving counsel about a particular topic.  An index of all General Conference talks is also created.

The church magazine “Ensign” prints the full text of General Conference in the May and November issues each year.  Luckily, the Ensign is available online at lds.org.  So here’s what we’ll do:

  1. Screen scrape the General Conference Ensign articles
  2. Count up the words in each article and generate word count summaries for each General Conference
  3. Profit!  Er, I mean, make some charts and stuff.

Note: Each October (at least for the last several years) there is a General Relief Society meeting held the week prior to General Conference.  The proceedings of this meeting are included in the November Ensign, so they are included in this project as well.  Whether the meeting is part of General Conference proper is debatable, I guess, but the speakers are General Authorities, so they surely belong in this analysis.

Note #2: Only General Conferences back to 1974 are available online.  The online Ensign archives go back further, but prior to 1974 there was a separate “Conference Report” for the proceedings of General Conference, which is not available online as far as I can tell.  So all my results are 1974 – 2010.

Note #3: I didn’t include items in the Ensign that did not list an author.  This cuts out things like the “Sustaining of Church Officers”, the “Statistical Report”, etc.

Note #4: The Ensign articles use Unicode, which gave me some headaches while parsing.  So I ended up throwing out everything but the ASCII character set.  Therefore the resulting titles and words might occasionally be incorrect – mainly missing punctuation.  But it’s generally ok.
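The brute-force fix amounts to something like this (a sketch; the actual scripts may differ slightly):

def ascii_only(text):
    """Drop any character outside the ASCII range -- blunt, but it gets rid of the headaches."""
    return ''.join(ch for ch in text if ord(ch) < 128)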

Results

It took about an hour and a half to download and parse all the General Conference articles!  There are two output .csv files:

GenConfArticleSummary1974to2010.csv – index of LDS General Conference talks, 1974 – 2010.  Lists speaker, title, Ensign month, year and page number, word count, unique word count, unique word ratio (unique count / word count), and top 100 words for each conference talk.

WordsOfProphets1974to2010.csv – lists all the unique words found across all General Conference talks.  Gives the total count for each word.  For each General Conference, the percentage contribution of each unique word to the Conference’s total is given.  I.e., “0.389497” for the word “church” for “May1974” means that in the April 1974 General Conference, about 0.39% of the words spoken (well, scraped from the Ensign webpage) were the word “church.”

Some stats (1974 – 2010 General Conference):

  • 2,713 talks
  • 397 different speakers (176 of those gave just a single talk)
  • 5,181,241 total words
  • 51,274 unique words (0.98% of total)

A note on the “unique word ratio” ( = 100 * unique words / total words) : I’ve noticed this generally tends to decrease the longer the body of text is.  So it probably is only meaningful (although what the meaning is I do not know) to compare for texts of about the same size.

The next chart shows the top 20 General Conference speakers who gave the most talks (“Count of Title”).  Average total word count and the average unique word ratio are also shown.  Gordon B. Hinckley is tops, no surprise.  He was in the First Presidency (generally 2 talks per Conference) or was the Prophet (about 4 talks per Conference – usually gives a welcome and a goodbye talk in addition to 2 meatier ones) for much of the time period under consideration (1974 – 2010).  The same can be said about his successor and next most frequent General Conference speaker, Thomas S. Monson.  (This is only the top 20; see GenConfArticleSummary1974to2010.csv and sort in Excel or OpenOffice to get the full list.)

Speaker | Count of Title | Average of Word_Count | Average of Unique Ratio
Gordon B. Hinckley | 207 | 2065.545894 | 34.51755811
Thomas S. Monson | 162 | 2189.833333 | 36.73315815
James E. Faust | 98 | 2286.030612 | 33.84414864
L. Tom Perry | 75 | 2113.973333 | 32.77750744
Boyd K. Packer | 75 | 2276.106667 | 31.69806946
Spencer W. Kimball | 66 | 2264.893939 | 34.32113774
M. Russell Ballard | 62 | 2113.903226 | 32.98976187
Ezra Taft Benson | 57 | 2153.578947 | 31.79958897
Russell M. Nelson | 56 | 2328.535714 | 34.46947778
David B. Haight | 55 | 2022.8 | 33.50270983
Dallin H. Oaks | 54 | 2459.018519 | 30.85206586
Neal A. Maxwell | 53 | 1965.773585 | 39.94132846
Joseph B. Wirthlin | 53 | 2259.528302 | 33.46916535
Richard G. Scott | 51 | 1934.647059 | 34.08122863
Marion G. Romney | 51 | 2157.607843 | 30.11546943
Robert D. Hales | 46 | 2255.086957 | 30.38011966
Henry B. Eyring | 46 | 2437.673913 | 26.86417891
Howard W. Hunter | 45 | 1676.977778 | 34.74803859
Marvin J. Ashton | 38 | 2248.263158 | 34.31398562
N. Eldon Tanner | 36 | 2428.833333 | 31.85897606

Something else interesting regarding the “unique word ratio”.  Neal A. Maxwell’s is particularly high at 39.9%.  This is somewhat expected; Elder Maxwell was somewhat renowned for eloquence and a large vocabulary.  Surprisingly, Henry B. Eyring’s unique word ratio is particularly low at 26.8%.  But I wouldn’t call his talks rudimentary or simplistic by any means; quite the opposite.  The relative word count between Maxwell (1965.7) and Eyring (2437.6) may have something to do with these numbers — as I said before, longer data sources tend to have smaller unique word ratios.  But then again, N. Eldon Tanner’s total word count (2428.8) is close to Eyring’s, but Tanner’s unique word ratio is higher, 31.8%.

Now for some individual word analysis – using data in WordsOfProphets1974to2010.csv.  Probably lots of interesting stuff that could be done with this data, but for now we’ll just look at some Excel charts plotting the word usage frequency for some “interesting” words.

“Constitution” and “Pioneers”

I picked these first because I was fairly certain where there would be big spikes.  As expected, “Constitution” gets used more frequently in 1976 and 1987 — the bicentennial of the Declaration of Independence and the Constitution, respectively.  “Pioneers” gets a big spike in 1997 – the sesquicentennial of Mormon pioneers arriving in the Salt Lake valley in 1847.

I should note that the y-axis is a %.  For example, about 0.11% of the words scraped from talks in the May 1997 Ensign were the word “pioneers”.

“Internet” and “Pornography”

The internet isn’t mentioned at all til 1996, about the time it started becoming popular and mainstream.  The counsel from Church leaders about the evils of pornography seems to have increased, on average, in the years since the internet became more common and the problem of internet pornography more pervasive.

“Faith,” “Jesus,” “Christ,” “Savior,” “Lord”

Definite upward trend in the use of the word “faith”.  I guess that’s good.

I plotted both “Jesus” and “Christ” because while they are usually used together, when they are used separately it seems like “Christ” is used more often than “Jesus” … at least in recent years (see about 2004-2010).  During the 1970’s, the opposite appears slightly true – “Jesus” used more frequently than “Christ”.

Both “Jesus” and “Christ” steadily increase from 2006 to 2008, then abruptly plummet in November 2008 and stay at about the same level til 2010.  I have no good explanation for this; it seems intriguing.  President Hinckley passed away in early 2008 and President Monson became the new prophet — since the prophet speaks more in General Conference than anyone else, maybe Monson uses other words like “Savior” or “Lord” more frequently?

Perhaps.  There is an uptick for “Lord” and “Savior” starting in May 2010.

“Tithing” and “Prayer”

Hypothesis: more talk about prayer during economic hard times, and less about tithing?

Here’s a GDP per capita chart, from http://www.measuringworth.com/usgdp/.  We clearly see four recessions: early 1980’s, early 1990’s, early 2000’s (hey is there a ten year pattern here?) and 2008-present.

So how do the “prayer” and “tithing” General Conference frequencies compare during the recessions?

Recession | prayer | tithing
Early 1980’s | down | up
Early 1990’s | up | down
Early 2000’s | up | down
2008 – now | up | down


Hypothesis confirmed?  Well, it’s kind of inconclusive – pretty shallow data set.  But this type of analysis would be very interesting to pursue, methinks.  Not necessarily only about economic issues (although that type of stuff is very easy to get historical #s for).

Files

Here are the Python scripts used to extract the data in the .csv’s.

concordance.py – Upgraded somewhat from my previous post on concordances.  Now it is in a class and can be called with any block of text input, not just a file.

getEnsignData.py – Starts at the Ensign archive webpage and dives down into the year, month, and article pages.  Calls appropriate scraper routines from ldsscraper.py for each.  Pickles output to ensignData.txt and genConfData.txt.  This is so I can split the project up into two pieces – download the data (which takes a very long time) and save it; then assemble it for output later.
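The pickling split is just the standard save-now, analyze-later pattern (a sketch with made-up placeholder data):

import pickle

genConfData = {'May1974': ['talk text would go here']}   # placeholder for the scraped structure

# getEnsignData.py side: save the data once the slow download step is done...
with open('genConfData.txt', 'wb') as f:
    pickle.dump(genConfData, f)

# ...wordsOfProphets.py side: load it back for the (fast) analysis and .csv output step.
with open('genConfData.txt', 'rb') as f:
    genConfData = pickle.load(f)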

ldsscraper.py – Uses regular expressions and Beautiful Soup to parse each webpage.  Since each type of page (Archive, Year, Table of Contents (individual month issue), Article) has its own format, each has its own scraper.

scraper.py – Base class for the scrapers in ldsscraper.py.  Uses mechanize to download a webpage.  Also (optionally) strips out anything but the ASCII character set.

wordsOfProphets.py – Run after getEnsignData.py.  Loads up ensignData.txt and genConfData.txt and creates the two .csv files, GenConfArticleSummary1974to2010.csv and WordsOfProphets1974to2010.csv.

Counting Unique Words with Python

The impetus for this project is an NPR story from earlier this year that I just now found.  An English literature researcher constructed a concordance of Agatha Christie novels to support the hypothesis that she suffered from Alzheimer’s in the late stages of her career.  He found a 20% drop in her 73rd novel’s vocabulary, as compared with previous novels.

A concordance is an index of all the instances of a particular word in a body of text.  A concordancer is a program that generates a concordance.  I don’t think I’ve created a true concordancer, because I don’t keep track of where (what page or section of text) I find each word.  I just keep track of a count of unique words.  Call it “Concordancer Junior.”

The concordance.py script will scan an input .txt file and count up the instances of each unique word.  Doing this is almost trivial with Python’s dictionary structure.

def addword(word):
    if word in theIndex:
        theIndex[word] = theIndex[word] + 1  # increment count
    else:
        theIndex[word] = 1  # add word to dictionary with count = 1

Once all the words have been added to the dictionary, I want to display the most frequently found words.  However, Python dictionaries are not sorted at all, so just printing the first x entries of the dictionary won’t do.  Searching around turned up a Python wiki entry that has an easy solution using the sorted() function:

s1 = sorted(theIndex.items(),key=lambda item:item[0]) #secondary key: sort alphabetically
s2 = sorted(s1,key=lambda item:item[1], reverse=True) #primary key: sort by count

Ooohhh… lambda functions.  <shiver!>

I should note that sorted() returns a list of (key,value) tuples rather than a sorted dictionary.  But that’s fine since I’m not going to need to add anything further to the dictionary once I get to the sorting point.

For my first cut, the most common words were ‘the’, ‘a’, ‘and’, ‘he’, etc.  Not very interesting.  So, I revised the script to print out the top 100 words found in the file, EXCLUDING the 100 most common (English) words.
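Roughly what that revision looks like; COMMON_WORDS stands in for whatever 100-word stoplist you use (a sketch, not the exact concordance.py code):

# COMMON_WORDS would hold the 100 most common English words (truncated here).
COMMON_WORDS = set(['the', 'a', 'and', 'of', 'to', 'he', 'in', 'it', 'i', 'that'])

def top_words(theIndex, n=100):
    """Return the n most frequent words, skipping the common-word stoplist."""
    filtered = [(word, count) for word, count in theIndex.items()
                if word not in COMMON_WORDS]
    filtered.sort(key=lambda item: (-item[1], item[0]))   # count descending, then alphabetical
    return filtered[:n]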

For test cases I downloaded plain text files from Project Gutenberg, removing their entry/exit boilerplate stuff.

Without further ado, some results please!

The King James Bible:

Alice in Wonderland:

Ulysses:

Kim:

Huckleberry Finn:

Screen Scraping with Python

I’ve done a few projects now involving screen scraping web data with Python.  Here’s my (for now) preferred method.

Define the Problem

Always a good first programming step!  In this case, the problem is “how do I extract <bit of data> from <website address>?”

For this example, we’ll extract the population density of Jackson County, Missouri, from its City-Data webpage.

Download the Webpage

There are several python libraries that allow you to download a webpage.  I’ve found mechanize to be the easiest to use.

import mechanize
mech = mechanize.Browser() #mechanize will mimic a web browser - web servers are none the wiser.
url = "http://www.city-data.com/county/Jackson_County-MO.html" #an example URL from my last project.  Contains county data.
response = mech.open(url)
page = response.read() #read() returns the url's html code as one big string

If you are behind a firewall or something and need to access the internet through a proxy server, then before calling mechanize’s open function, you need to set up the proxy.

PROXY = {'http':'PUT_YOUR_PROXY_SERVER_AND_PORT_HERE'}
mech.set_proxies(PROXY)

Extract the Desired Data with Regular Expressions

Now we have the webpage in html format.  It’s just a big ol’ string.  Here’s a snippet:

<table border="0" cellpadding="0" cellspacing="0"><tr><td>Population density: 1167 people per square mile&nbsp;</td><td><div align="left"><table border="2" cellpadding="0" cellspacing="0" width="20" bordercolor="#DDDD00" bgcolor="#e8e8e8"><tr><td>&nbsp;</td></tr></table></div></td> <td>&nbsp;(very high).</td></tr></table>

You can view the html source for a webpage with pretty much any web browser by right-clicking and selecting “View source.”  Looks kind of confusing, eh?

We need to search through the string we downloaded with mechanize for our piece of data.  Luckily, Python has a built-in, well developed regular expression library that works great for this kind of problem.

import re
extractor = re.compile(r'Population density: (.+?) people')
data = re.findall(extractor,page)[0]

“compile()” sets up a regular expression for use in other functions.  The “r” in front of the search string marks it as a raw string literal, so backslashes aren’t treated as escape characters (which is handy for regular expressions).  The parentheses ( ) define a capture group, and findall() returns whatever that group matches.  “.” matches any character (except a newline), “+” matches 1 or more of the preceding expression, and “?” makes the preceding “+” non-greedy.

“data” is a string that can be printed, converted to a number and used in calculations or whatever further along in your script.
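For example, a minimal follow-on:

density = int(data)      # data is the "1167" captured by findall() above
print(density / 2.59)    # roughly people per square kilometer, if you prefer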

Congratulations!  You screen scraped some data with Python!

City-Data Screen Scraper and Maps

I wrote a screen scraper that extracts data from the City-Data website’s county pages.  Luckily, the format of the URLs and the county pages themselves is mostly consistent, so it was easy to extract the desired bits of data with regular expressions … well, easy after a bit of trial and error.  The input to the script is a .csv with county names, states, and FIPS codes.  The output is the same as the input file but with additional columns of data extracted from the webpage.  I extracted bits of data that I thought would be interesting and that were easy to get.  (A rough sketch of the main loop follows the list of hiccups below.)

There are 3,141 counties or county equivalents in the US.  (At least according to the list I used, which is modified from the US poverty rate data from the last post.)  The script took about 23 minutes to run – for each county it downloaded the webpage from the City-Data website, parsed it for the desired info, and added the info to the output file.  There were a few hiccups along the way due to the following:
  • Rather than counties, Alaska has Boroughs, Municipalities, and Census Areas.  Crazy Alaska!
  • Skagway-Hoonah-Angoon Census Area is now split up between the Hoonah-Angoon Census Area and the Skagway Municipality.
  • Broomfield County, Colorado was incorporated in 2001 and City-Data doesn’t have a county page for it yet.
  • Counties that begin with “De”, “La”, or “Mc” – some discrepancies in spacing and capitalization.
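The main loop is essentially this shape (a sketch only: the real script handles the hiccups above, and the column layout and URL pattern are assumptions based on the Jackson County example from the previous post):

import csv
import re
import mechanize

def scrape_county(url):
    """Download one county page and pull out the population density (illustrative)."""
    mech = mechanize.Browser()
    page = mech.open(url).read()
    density = re.findall(r'Population density: (.+?) people', page)
    return density[0] if density else ''

infile = open('counties.csv')           # columns assumed: county, state, FIPS
outfile = open('counties_out.csv', 'w')
writer = csv.writer(outfile)
for county, state, fips in csv.reader(infile):
    url = 'http://www.city-data.com/county/%s_County-%s.html' % (county.replace(' ', '_'), state)
    writer.writerow([county, state, fips, scrape_county(url)])
infile.close()
outfile.close()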

Once the data was all extracted into the output .csv, I used the modified mapmaker script from the last post to make some maps.  Without further ado….

Population density per square mile

<1 <2 <10 <50 <100 <200 <500 <1000 >1000

This is a pretty familiar type of map.  It can kind of be used as a sanity check on City-Data’s figures and my plotting script.  Do a Google Image Search for “US Population Density” and compare.  Example: http://en.wikipedia.org/wiki/File:USA-2000-population-density.gif

Cost of living (100 = US average)

<75 <85 <95 <105 <115 <125 <135 <145 >145

I wondered how this map would compare to the poverty rate map from the last post.  Answer: not much, really.  Places with a higher cost of living probably have higher salaries.  (Hey, how about we plot that next??)

Lowest cost of living: King County, TX (68.4)

Highest cost of living: Kings County, NY (194.1)

And, no, not every county name is a variation of the word “king.”  🙂

Median household income ($)

<25000 <35000 <45000 <55000 <65000 <75000 <85000 <95000 >95000

This map in conjunction with the preceding cost of living map does show some similarities to the poverty rate map.  Areas with low median incomes, but mid-level or high cost of living, generally have higher poverty rates.  Kind of “no duh,” I know.  Notable examples which stand out: eastern Kentucky, along the lower Mississippi River, the Navajo reservation in the Four Corners area.

Lowest median household income: Kalawao County, HI ($12,591).  This is the former leper colony of Molokai.

Highest median household income: Loudoun County, VA ($111,925).  The DC area – your tax dollars at work!

Federal Government Expenditure per Capita ($)

<5000 <7500 <10000 <12500 <15000 <17500 <20000 <22500 >22500

So, how good are those Washington fat cats at sharing the wealth with the voters back home?

Top 5:  Falls Church, VA ($145,164), Fairfax County, VA ($138,060), Los Alamos, NM ($105,868), District of Columbia ($67,982), Arlington County, VA ($52,254).  DC area once again….

Percent Foreign Born Residents

<1% <3% <5% <7% <9% <11% <13% <15% >15%

The map of the percent of foreign born residents looks pretty much as you would expect – higher percentages along the Mexican border and in cities – NYC, DC, Chicago.  And pretty much all of California.  The “finger” stretching up the Texas panhandle into Kansas is kind of interesting.  Also surprising figures in eastern Washington.

Gender (Im)balance

<-5% <-3% <-2% <-1% <1% <2% <3% <5% >5%

The gender balance of a county is calculated as the (#females-#males)/(#females+#males).  So, a positive percentage indicates more females than males, and a negative one indicates more males.

The South appears to be female-heavy, while the West is loaded with males.  Maybe those two should get together…

Lowest: Crowley County, CO (-34.5%).  Apparently one-third of the county residents are inmates at the state prison.  This is likely skewing the numbers toward males.

Highest: Pulaski County, GA (14.9%)

Mean travel time to work (minutes)

<10 <15 <20 <25 <30 <35 <40 <45 >45

Low: Aleutians East Borough, AK (6.3 minutes).  Maybe they all live on their fishing boats?

High: Elliott County, KY (48.7 minutes).  Ouch.  I’m guessing it’s just as long on the way back home … that’s almost 2 hours per day!

Percent affiliated with religious congregation

<20% <30% <40% <50% <60% <70% <80% <90% >90%

There is a pretty surprisingly wide swing here, and it is definitely regional.  Definite concentrations in the Midwest and Texas, as well as the Mormons in Utah and Idaho.  The percentages tend to diminish towards either coast.

Low: Camas County, ID (1.8%).  This figure is kind of suspect if you ask me.  About 1000 residents, and only 18 attend church?

High: Falls Church, VA (164.5%).  Once again the DC area!  Hey, wait a sec City-Data…how come this is above 100%?  People going to more than one church?  There are about 40 counties total with the figure > 100%.

Caveat Emptor…

I have no idea how the data presented on the City-Data website was collected.  It could be correct or it could be wildly off.  In any case the interpretation heavily depends on the data collection method, as with anything of a statistical nature.  (Something that the general news media, and the media-consuming populace, fail to consider many, many times IMHO.)

Related to the above – I found I could skew the look of the maps in different ways in the selection of the color scale.  If I wanted to “drown out” a few high outlier counties, for instance, I could set my high threshold low enough and presto, they’re gone and no one is the wiser.  Not that I have knowingly done that here … I tried to get the best scale that shows the overall trend.  But the reader should be wary of maps, statistics and the like – always ask yourself what the author is trying to get you to believe.

TastyCakes Map Maker

In the Wikipedia “Poverty in the United States” entry, there is a map of the US shaded with the overall poverty rates of each county.  It was made by user TastyCakes (thanks!) using this script, this blank .svg map of US counties, and data from the census bureau.

I modified the script a bit to include functionality for creating an HTML legend rather than a wikipedia one.  Here’s the census data reformatted to work with the script.  (Got rid of extraneous data and merged the state FIPS and county FIPS columns into one.)
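The FIPS merge is the only mildly fiddly part; the map script keys counties by the combined 5-digit code, so something like this (a sketch under that assumption):

def full_fips(state_fips, county_fips):
    """Combine state (2-digit) and county (3-digit) FIPS codes, preserving leading zeros."""
    return '%02d%03d' % (int(state_fips), int(county_fips))

print(full_fips('29', '95'))   # '29095' -- Jackson County, Missouri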

Here’s the same map as on the wikipedia article, but in green.  (TastyCakes script has a green color spectrum option.)

<5% <10% <15% <20% <25%
<30% <35% <40% >40%

 

I converted the .svg to a .jpg using this online converter.

Cool, huh?  Now all I need is some interesting data per county in a .csv file, and I can make nifty maps!