City-Data Screen Scraper and Maps

I wrote a screen scraper that extracts data from the City-Data website’s county pages.  Luckily, the formats of the URLs and of the county pages themselves are mostly consistent, so it was easy to extract the desired bits of data with regular expressions … well, easy after a bit of trial and error.  The input to the script is a .csv file with county names, states, and FIPS codes.  The output is the same file with additional columns of data extracted from each webpage.  I extracted the bits of data that I thought would be interesting and that were easy to get.
There are 3,141 counties or county equivalents in the US (at least according to the list I used, which is modified from the US Poverty Rate data from the last post).  The script took about 23 minutes to run – for each county it downloaded the webpage from the City-Data website, parsed it for the desired info, and added the info to the output file.  There were a few hiccups along the way due to the following:
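For the curious, here is a minimal sketch of that kind of flow.  It is not the original script: the URL pattern, the regular expressions, and the column names (county, state, median_income, cost_of_living) are all assumptions for illustration, and City-Data’s real markup may well differ.  The page fetcher is passed in as a function so the parsing and .csv steps can be tried on a canned page without touching the network.

```python
import csv
import io
import re

# Hypothetical patterns; the real page markup may differ.
PATTERNS = {
    "median_income": re.compile(r"median household income[^$]*\$([\d,]+)", re.I),
    "cost_of_living": re.compile(r"cost of living index[^\d]*(\d+(?:\.\d+)?)", re.I),
}


def county_url(county, state):
    """Build a county page URL (assumed pattern, not confirmed)."""
    slug = f"{county}-{state}".replace(" ", "_")
    return f"http://www.city-data.com/county/{slug}.html"


def parse_county_page(html):
    """Return {field: float or None} for every pattern in PATTERNS."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(html)
        result[field] = float(match.group(1).replace(",", "")) if match else None
    return result


def scrape(in_file, out_file, fetch):
    """Copy the input rows to out_file, adding one column per pattern."""
    reader = csv.DictReader(in_file)
    fieldnames = list(reader.fieldnames) + list(PATTERNS)
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        html = fetch(county_url(row["county"], row["state"]))
        row.update(parse_county_page(html))  # missing fields are written as blanks
        writer.writerow(row)


# Dry run with made-up input and a canned page instead of a real download.
canned_page = "Median household income: $50,000. Cost of living index: 100.5"
source = io.StringIO("county,state,fips\nExample County,XX,00000\n")
target = io.StringIO()
scrape(source, target, fetch=lambda url: canned_page)
print(target.getvalue())
```

For a real run, `fetch` could be any function that takes a URL and returns the page text, ideally with a short pause between requests to be polite to the server.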
  • Rather than counties, Alaska has Boroughs, Municipalities, and Census Areas.  Crazy Alaska!
  • Skagway-Hoonah-Angoon Census Area is now split up between the Hoonah-Angoon Census Area and the Skagway Municipality.
  • Broomfield County, Colorado was incorporated in 2001 and City-Data doesn’t have a county page for it yet.
  • Counties that begin with “De”, “La”, or “Mc” – some discrepancies in spacing and capitalization.
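One way to paper over the spacing and capitalization discrepancies is to compare names on a normalized key instead of the raw string.  This is a sketch of that idea, not necessarily what the script did; `name_key` is a hypothetical helper and the spellings below are only illustrative.

```python
import re


def name_key(name):
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Spelling variants of the same name collapse to one key.
print(name_key("De Kalb County"), name_key("DeKalb County"))
print(name_key("La Salle County"), name_key("LaSalle County"))
print(name_key("Mc Kean County"), name_key("Mckean County"))
```

The key also drops hyphens and apostrophes, so it is best paired with the state (or the FIPS code) to avoid merging two genuinely different counties.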

Once the data was all extracted into the output .csv, I used the modified mapmaker script from the last post to make some maps.  Without further ado….

Population density per square mile

<1 <2 <10 <50 <100 <200 <500 <1000 >1000

This is a pretty familiar type of map, so it works as a sanity check on City-Data’s figures and my plotting script.  Do a Google Image Search for “US Population Density” and compare.  Example: http://en.wikipedia.org/wiki/File:USA-2000-population-density.gif

Cost of living (100 = US average)

<75 <85 <95 <105 <115 <125 <135 <145 >145

I wondered whether this map would resemble the poverty rate map from the last post.  Answer: not really.  Places with a higher cost of living probably have higher salaries.  (Hey, how about we plot that next??)

Lowest cost of living: King County, TX (68.4)

Highest cost of living: Kings County, NY (194.1)

And, no, not every county name is a variation of the word “king.”  🙂

Median household income ($)

<25000 <35000 <45000 <55000 <65000 <75000 <85000 <95000 >95000

This map, in conjunction with the preceding cost of living map, does show some similarities to the poverty rate map.  Areas with low median incomes, but mid-level or high cost of living, generally have higher poverty rates.  Kind of “no duh,” I know.  Notable examples that stand out: eastern Kentucky, the lower Mississippi River, and the Navajo reservation in the Four Corners area.

Lowest median household income: Kalawao County, HI ($12,591).  This is the former leper colony on Molokai.

Highest median household income: Loudoun County, VA ($111,925).  The DC area – your tax dollars at work!

Federal Government Expenditure per Capita ($)

<5000 <7500 <10000 <12500 <15000 <17500 <20000 <22500 >22500

So, how good are those Washington fat cats at sharing the wealth with the voters back home?

Top 5:  Falls Church, VA ($145,164), Fairfax County, VA ($138,060), Los Alamos, NM ($105,868), District of Columbia ($67,982), Arlington County, VA ($52,254).  DC area once again….

Percent Foreign Born Residents

<1% <3% <5% <7% <9% <11% <13% <15% >15%

The map of the percent of foreign born residents looks pretty much as you would expect – higher percentages along the Mexican border and in cities – NYC, DC, Chicago.  And pretty much all of California.  The “finger” stretching up the Texas panhandle into Kansas is kind of interesting.  Also surprising figures in eastern Washington.

Gender (Im)balance

<-5% <-3% <-2% <-1% <1% <2% <3% <5% >5%

The gender balance of a county is calculated as (#females - #males) / (#females + #males), expressed as a percentage.  So, a positive percentage indicates more females than males, and a negative one indicates more males.
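In code, the calculation is a one-liner.  A small sketch, with `gender_balance` as a hypothetical helper name and made-up head counts:

```python
def gender_balance(females, males):
    """Return (females - males) / (females + males) as a percentage."""
    total = females + males
    if total == 0:
        raise ValueError("county has no residents")
    return 100.0 * (females - males) / total


# Made-up county with 5,500 females and 4,500 males.
print(gender_balance(5500, 4500))  # 10.0, i.e. female-heavy
```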

The South appears to be female-heavy, while the West is loaded with males.  Maybe those two should get together…

Lowest: Crowley County, CO (-34.5%).  Apparently one-third of the county residents are inmates at the state prison.  This is likely skewing the numbers toward males.

Highest: Pulaski County, GA (14.9%)

Mean travel time to work (minutes)

<10 <15 <20 <25 <30 <35 <40 <45 >45

Low: Aleutians East Borough, AK (6.3 minutes).  Maybe they all live on their fishing boats?

High: Elliott County, KY (48.7 minutes).  Ouch.  I’m guessing it’s just as long on the way back home … that’s over an hour and a half per day!

Percent affiliated with religious congregation

<20% <30% <40% <50% <60% <70% <80% <90% >90%

There is a surprisingly wide swing here, and it is definitely regional.  There are clear concentrations in the Midwest and Texas, as well as the Mormons in Utah and Idaho.  The percentages tend to diminish towards either coast.

Low: Camas County, ID (1.8%).  This figure is kind of suspect if you ask me.  About 1000 residents, and only 18 attend church?

High: Falls Church, VA (164.5%).  Once again the DC area!  Hey, wait a sec City-Data…how come this is above 100%?  People going to more than one church?  There are about 40 counties total with the figure > 100%.
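A cheap way to catch figures like this before they reach a map is to flag anything outside the range a percentage can sensibly take.  A minimal sketch, assuming each row is a dict and using a hypothetical column name `pct_religious`; the two sample rows are the extremes quoted above.

```python
def out_of_range(rows, column, low=0.0, high=100.0):
    """Return the rows whose value in `column` falls outside [low, high]."""
    flagged = []
    for row in rows:
        value = row.get(column)
        if value is None or value == "":
            continue  # nothing scraped for this county, so nothing to check
        if not low <= float(value) <= high:
            flagged.append(row)
    return flagged


rows = [
    {"county": "Camas County, ID", "pct_religious": "1.8"},
    {"county": "Falls Church, VA", "pct_religious": "164.5"},
]
for row in out_of_range(rows, "pct_religious"):
    print(row["county"], row["pct_religious"])
```

This only catches impossible values; a suspiciously low figure like the Camas County one still needs a human eye.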

Caveat Emptor…

I have no idea how the data presented on the City-Data website was collected.  It could be correct or it could be wildly off.  In any case the interpretation depends heavily on the data collection method, as with anything of a statistical nature.  (Something that the general news media, and the media-consuming populace, fail to consider many, many times, IMHO.)

Related to the above – I found I could skew the look of the maps in different ways through the choice of color scale.  If I wanted to “drown out” a few high outlier counties, for instance, I could set my high threshold low enough and presto, they’re gone and no one is the wiser.  Not that I have knowingly done that here … I tried to pick the scale that best shows the overall trend.  But the reader should be wary of maps, statistics and the like – always ask yourself what the author is trying to get you to believe.
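To make the color scale point concrete, here is a sketch of how a value gets assigned to a color bin.  It is an illustration, not the mapmaker script itself; `bin_label` is a hypothetical helper, while the thresholds and the top-5 figures are the ones from the federal expenditure map above.

```python
from bisect import bisect_right

# Thresholds from the federal expenditure per capita legend.
THRESHOLDS = [5000, 7500, 10000, 12500, 15000, 17500, 20000, 22500]


def bin_label(value, thresholds):
    """Return the legend label of the bin that `value` falls into."""
    i = bisect_right(thresholds, value)  # number of thresholds <= value
    if i < len(thresholds):
        return f"<{thresholds[i]}"
    # At or above the last threshold; the legend simply calls this ">".
    return f">{thresholds[-1]}"


top5 = {
    "Falls Church, VA": 145164,
    "Fairfax County, VA": 138060,
    "Los Alamos, NM": 105868,
    "District of Columbia": 67982,
    "Arlington County, VA": 52254,
}
for county, dollars in top5.items():
    print(county, bin_label(dollars, THRESHOLDS))
```

All five land in the same top bin, so on the map a county at $52,254 is the same color as one at $145,164.  That is exactly the kind of detail a color scale can hide.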


2 responses

  1. […] done a few projects now involving screen scraping web data with Python.  I thought I would write a post on my basic […]

  2. Here’s something interesting: http://www.openheatmap.com/
