
Words of the Prophets

The General Conference of the Church of Jesus Christ of Latter-day Saints (Mormons) is held twice a year, in April and October.  General Authorities of the Church, including Prophets and Apostles, speak to the church membership about doctrinal issues and give other counsel.

During the last conference, I wondered: Is there a pattern to what is taught in General Conference?  Maybe Python can help us find out….

Methodology

This project involves screen scraping lds.org's Ensign archives and then using a concordance (of sorts) to do some analysis of word counts and word usage frequency.  My thinking is that the more a certain word is used, the more the General Authorities are giving counsel about that topic.  An index of all General Conference talks is also created.

The church magazine “Ensign” prints the full text of General Conference in the May and November issues each year.  Luckily, the Ensign is available online at lds.org.  So here’s what we’ll do:

  1. Screen scrape the General Conference Ensign articles
  2. Count up the words in each article and generate word count summaries for each General Conference
  3. Profit!  Er, I mean, make some charts and stuff.

Note: Each October (at least for the last several years) there is a General Relief Society meeting held the week prior to General Conference.  The proceedings of this meeting are included in the November Ensign, so they are included in this project as well.  Whether the meeting is part of General Conference is a matter of debate, I guess, but the speakers are General Authorities, so they surely belong in this analysis.

Note #2: Only General Conferences back to 1974 are available online.  The online Ensign archives go back further, but prior to 1974 there was a separate “Conference Report” for the proceedings of General Conference, which is not available online as far as I can tell.  So all my results are 1974 – 2010.

Note #3: I didn't include items in the Ensign that did not list an author.  This cuts out things like the "Sustaining of Church Officers," the "Statistical Report," etc.

Note #4: The Ensign articles use Unicode, which gave me some headaches during parsing.  So I ended up throwing out everything but the ASCII character set.  Therefore the resulting titles and words might occasionally be incorrect – mainly missing punctuation.  But it's generally OK.

Results

It took about an hour and a half to download and parse all the General Conference articles!  There are two output .csv files:

GenConfArticleSummary1974to2010.csv – index of LDS General Conference talks, 1974 – 2010.  Lists speaker, title, Ensign month, year and page number, word count, unique word count, unique word ratio (unique count / word count), and top 100 words for each conference talk.

WordsOfProphets1974to2010.csv – lists all the unique words found across all General Conference talks.  Gives the total count for each word.  For each General Conference, the percentage contribution of each unique word to the Conference's total is given.  I.e., "0.389497" for the word "church" under "May1974" means that in the April 1974 General Conference, about 0.39% of the words spoken (well, scraped from the Ensign webpage) were the word "church."
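To make the arithmetic concrete, here is that frequency calculation in miniature (the counts below are made up purely for illustration):

    # Per-conference word frequency, as a percent of all words in that conference.
    # These counts are hypothetical, just to show the arithmetic.
    count_church = 1234     # occurrences of "church" scraped from the May 1974 issue
    total_words = 316800    # total words scraped from that issue
    print(100.0 * count_church / total_words)  # -> 0.3895..., i.e. about 0.39%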

Some stats (General Conferences, 1974 – 2010):

  • 2,713 talks
  • 397 different speakers (176 of those gave just a single talk)
  • 5,181,241 total words
  • 51,274 unique words (0.98% of total)

A note on the "unique word ratio" (= 100 * unique words / total words): I've noticed it generally tends to decrease the longer the body of text is.  So it is probably only meaningful (although what the meaning is, I do not know) to compare texts of about the same size.
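For reference, a minimal sketch of the ratio as just defined:

    # "Unique word ratio" = 100 * unique words / total words, per the definition above.
    def unique_word_ratio(text):
        words = text.lower().split()
        return 100.0 * len(set(words)) / len(words)

    # "the" repeats, so 8 unique words out of 9 total:
    print(unique_word_ratio("the quick brown fox jumps over the lazy dog"))  # -> 88.88...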

The next table shows the top 20 General Conference speakers who gave the most talks ("Count of Title").  The average total word count and the average unique word ratio are also shown.  Gordon B. Hinckley is tops, no surprise.  He was in the First Presidency (generally 2 talks per Conference) or was the Prophet (about 4 talks per Conference – he usually gives a welcome and a goodbye talk in addition to 2 meatier ones) for much of the time period under consideration (1974 – 2010).  The same can be said of his successor and the next most frequent General Conference speaker, Thomas S. Monson.  (This is only the top 20; see GenConfArticleSummary1974to2010.csv and sort in Excel or OpenOffice to get the full list.)

Speaker               Count of Title   Average of Word_Count   Average of Unique Ratio
Gordon B. Hinckley               207             2065.545894               34.51755811
Thomas S. Monson                 162             2189.833333               36.73315815
James E. Faust                    98             2286.030612               33.84414864
L. Tom Perry                      75             2113.973333               32.77750744
Boyd K. Packer                    75             2276.106667               31.69806946
Spencer W. Kimball                66             2264.893939               34.32113774
M. Russell Ballard                62             2113.903226               32.98976187
Ezra Taft Benson                  57             2153.578947               31.79958897
Russell M. Nelson                 56             2328.535714               34.46947778
David B. Haight                   55             2022.8                    33.50270983
Dallin H. Oaks                    54             2459.018519               30.85206586
Neal A. Maxwell                   53             1965.773585               39.94132846
Joseph B. Wirthlin                53             2259.528302               33.46916535
Richard G. Scott                  51             1934.647059               34.08122863
Marion G. Romney                  51             2157.607843               30.11546943
Robert D. Hales                   46             2255.086957               30.38011966
Henry B. Eyring                   46             2437.673913               26.86417891
Howard W. Hunter                  45             1676.977778               34.74803859
Marvin J. Ashton                  38             2248.263158               34.31398562
N. Eldon Tanner                   36             2428.833333               31.85897606

Something else is interesting regarding the "unique word ratio": Neal A. Maxwell's is particularly high at 39.9%.  This is somewhat expected; Elder Maxwell was renowned for his eloquence and large vocabulary.  Surprisingly, Henry B. Eyring's unique word ratio is particularly low at 26.8%.  But I wouldn't call his talks rudimentary or simplistic by any means; quite the opposite.  The relative average word counts of Maxwell (1965.7) and Eyring (2437.6) may have something to do with these numbers; as I said before, longer texts tend to have smaller unique word ratios.  But then again, N. Eldon Tanner's average word count (2428.8) is close to Eyring's, yet Tanner's unique word ratio is higher, at 31.8%.

Now for some individual word analysis – using data in WordsOfProphets1974to2010.csv.  Probably lots of interesting stuff that could be done with this data, but for now we’ll just look at some Excel charts plotting the word usage frequency for some “interesting” words.

“Constitution” and “Pioneers”

I picked these first because I was fairly certain where there would be big spikes.  As expected, "Constitution" gets used more frequently in 1976 and 1987, the bicentennials of the Declaration of Independence and the Constitution, respectively.  "Pioneers" gets a big spike in 1997 – the sesquicentennial of the Mormon pioneers arriving in the Salt Lake valley in 1847.

I should note that the y-axis is a %.  For example, about 0.11% of the words scraped from talks in the May 1997 Ensign were the word “pioneers”.

“Internet” and “Pornography”

The internet isn't mentioned at all until 1996, about the time it started becoming popular and mainstream.  The counsel from Church leaders about the evils of pornography seems to have increased, on average, in the years since the internet became more common and the problem of internet pornography more pervasive.

“Faith,” “Jesus,” “Christ,” “Savior,” “Lord”

Definite upward trend in the use of the word “faith”.  I guess that’s good.

I plotted both "Jesus" and "Christ" because, while they are usually used together, when they are used separately it seems that "Christ" is used more often than "Jesus" … at least in recent years (see about 2004-2010).  During the 1970s, the opposite appears slightly true: "Jesus" was used more frequently than "Christ".

Both "Jesus" and "Christ" steadily increase from 2006 to 2008, then abruptly plummet in November 2008 and stay at about the same level until 2010.  I have no good explanation for this; it seems intriguing.  President Hinckley passed away in early 2008 and President Monson became the new prophet.  Since the prophet speaks more in General Conference than anyone else, maybe Monson uses other words like "Savior" or "Lord" more frequently?

Perhaps.  There is an uptick for "Lord" and "Savior" starting in May 2010.

“Tithing” and “Prayer”

Hypothesis: more talk about prayer during economic hard times, and less about tithing?

Here's a GDP per capita chart, from http://www.measuringworth.com/usgdp/.  We clearly see four recessions: early 1980's, early 1990's, early 2000's (hey, is there a ten-year pattern here?) and 2008-present.

So how do the General Conference frequencies of "prayer" and "tithing" compare during the recessions?

                prayer   tithing
Early 1980's    down     up
Early 1990's    up       down
Early 2000's    up       down
2008 – now      up       down


Hypothesis confirmed?  Well, it's kind of inconclusive – a pretty shallow data set.  But this type of analysis would be very interesting to pursue, methinks.  And not necessarily only about economic issues (although that type of data is very easy to get historical numbers for).

Files

Here are the Python scripts used to extract the data in the .csv’s.

concordance.py – Upgraded somewhat from my previous post on concordances.  Now it is in a class and can be called with any block of text input, not just a file.
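A minimal sketch of what such a class might look like (the class and method names are my own, not necessarily those in concordance.py):

    import re
    from collections import defaultdict

    class Concordance(object):
        """Word counts for an arbitrary block of text."""
        def __init__(self, text):
            self.counts = defaultdict(int)
            for word in re.findall(r"[a-z']+", text.lower()):
                self.counts[word] += 1

        def total_words(self):
            return sum(self.counts.values())

        def unique_words(self):
            return len(self.counts)

        def top_words(self, n=100):
            # Most frequent words first
            return sorted(self.counts.items(), key=lambda kv: -kv[1])[:n]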

getEnsignData.py – Starts at the Ensign archive webpage and dives down into the year, month, and article pages.  Calls appropriate scraper routines from ldsscraper.py for each.  Pickles output to ensignData.txt and genConfData.txt.  This is so I can split the project up into two pieces – download the data (which takes a very long time) and save it; then assemble it for output later.
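The save/load split might look something like this (the file names are from the post; the shape of the pickled data is assumed):

    import pickle

    def save_data(data, filename="ensignData.txt"):
        # Save the scraped data so the slow download step only has to run once.
        with open(filename, "wb") as f:
            pickle.dump(data, f)

    def load_data(filename="ensignData.txt"):
        with open(filename, "rb") as f:
            return pickle.load(f)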

ldsscraper.py – Uses regular expressions and Beautiful Soup to parse each webpage.  Since each type of page (Archive, Year, Table of Contents (individual month issue), Article) has its own format, each has its own scraper.
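As a hypothetical illustration of the one-scraper-per-page-type idea (shown here with bs4-style Beautiful Soup calls, though a Python 2.6-era project likely used BeautifulSoup 3; the markup and URL patterns are guesses):

    import re
    from bs4 import BeautifulSoup

    def parse_article(html):
        # Pull the title and body text out of a single article page.
        soup = BeautifulSoup(html, "html.parser")
        title = soup.find("h1")
        body = "\n".join(p.get_text() for p in soup.find_all("p"))
        return (title.get_text() if title else "", body)

    def parse_archive(html):
        # Find links to the yearly archive pages with a regular expression.
        return re.findall(r'href="(/ensign/\d{4}[^"]*)"', html)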

scraper.py – Base class for the scrapers in ldsscraper.py.  Uses mechanize to download a webpage.  Also (optionally) strips out anything but the ASCII character set.
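Something along these lines (a sketch; the names are my own choosing):

    import mechanize

    class Scraper(object):
        """Base scraper: fetch a page, optionally keeping only ASCII."""
        def __init__(self, ascii_only=True):
            self.browser = mechanize.Browser()
            self.ascii_only = ascii_only

        def fetch(self, url):
            html = self.browser.open(url).read()
            if self.ascii_only:
                # Drop anything outside the ASCII range, as described above
                html = "".join(c for c in html if ord(c) < 128)
            return html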

wordsOfProphets.py – Run after getEnsignData.py.  Loads up ensignData.txt and genConfData.txt and creates the two .csv files, GenConfArticleSummary1974to2010.csv and WordsOfProphets1974to2010.csv.
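The CSV-writing step might look roughly like this (column layout per the description above; the in-memory structures are assumed):

    import csv

    def write_summary(talks, filename="GenConfArticleSummary1974to2010.csv"):
        # "wb" mode for the csv module under Python 2
        with open(filename, "wb") as f:
            writer = csv.writer(f)
            writer.writerow(["Speaker", "Title", "Month", "Year", "Page",
                             "Word_Count", "Unique_Count", "Unique_Ratio",
                             "Top_100_Words"])
            for t in talks:
                writer.writerow([t["speaker"], t["title"], t["month"],
                                 t["year"], t["page"], t["word_count"],
                                 t["unique_count"], t["unique_ratio"],
                                 " ".join(t["top_words"])])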


City-Data Screen Scraper and Maps

I wrote a screen scraper that extracts data from the City-Data website's county pages.  Luckily, the formats of the URLs and the county pages themselves are mostly consistent, so it was easy to extract the desired bits of data with regular expressions … well, easy after a bit of trial and error.  The input to the script is a .csv with county names, states, and FIPS codes.  The output is the same as the input file but with additional columns of data extracted from the webpage.  I extracted bits of data that I thought would be interesting and that were easy to get.
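A hypothetical sketch of the URL construction and regex extraction (the actual City-Data URL scheme and field patterns are simplified guesses here):

    import re

    def county_url(county, state_abbr):
        # e.g. ("Loudoun County", "VA") ->
        #   http://www.city-data.com/county/Loudoun_County-VA.html
        return "http://www.city-data.com/county/%s-%s.html" % (
            county.replace(" ", "_"), state_abbr)

    def extract_field(html, pattern):
        # Return the first capture group of the pattern, or "" if not found.
        m = re.search(pattern, html)
        return m.group(1) if m else ""

    # e.g. extract_field(html, r"Cost of living index[^\d]*([\d.]+)")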
There are 3,141 counties or county equivalents in the US.  (At least according to the list I used, which is modified from the US poverty rate data from the last post.)  The script took about 23 minutes to run – for each county it downloaded the webpage from the City-Data website, parsed it for the desired info, and added the info to the output file.  There were a few hiccups along the way due to the following:
  • Rather than counties, Alaska has Boroughs, Municipalities, and Census Areas.  Crazy Alaska!
  • Skagway-Hoonah-Angoon Census Area is now split up between the Hoonah-Angoon Census Area and the Skagway Municipality.
  • Broomfield County, Colorado was incorporated in 2001, and City-Data doesn't have a county page for it yet.
  • Counties that begin with "De," "La," or "Mc" have some discrepancies in spacing and capitalization.

Once the data was all extracted into the output .csv, I used the modified mapmaker script from the last post to make some maps.  Without further ado….

Population density per square mile

(Map color scale: <1, <2, <10, <50, <100, <200, <500, <1000, >1000)

This is a pretty familiar type of map, so it can serve as a sanity check on City-Data's figures and my plotting script.  Do a Google Image Search for "US Population Density" and compare.  Example: http://en.wikipedia.org/wiki/File:USA-2000-population-density.gif

Cost of living (100 = US average)

(Map color scale: <75, <85, <95, <105, <115, <125, <135, <145, >145)

I wondered how this map would compare to the poverty rate map from the last post.  Answer: not much correlation.  Places with a higher cost of living probably also have higher salaries.  (Hey, how about we plot that next??)

Lowest cost of living: King County, TX (68.4)

Highest cost of living: Kings County, NY (194.1)

And, no, not every county name is a variation of the word “king.”  🙂

Median household income ($)

(Map color scale: <25000, <35000, <45000, <55000, <65000, <75000, <85000, <95000, >95000)

This map, in conjunction with the preceding cost of living map, does show some similarities to the poverty rate map.  Areas with low median incomes but mid-level or high cost of living generally have higher poverty rates.  Kind of "no duh," I know.  Notable examples that stand out: eastern Kentucky, along the lower Mississippi River, and the Navajo reservation in the Four Corners area.

Lowest median household income: Kalawao County, HI ($12,591).  This is the former leper colony of Molokai.

Highest median household income: Loudoun County, VA ($111,925).  The DC area – your tax dollars at work!

Federal Government Expenditure per Capita ($)

(Map color scale: <5000, <7500, <10000, <12500, <15000, <17500, <20000, <22500, >22500)

So, how good are those Washington fat cats at sharing the wealth with the voters back home?

Top 5:  Falls Church, VA ($145,164), Fairfax County, VA ($138,060), Los Alamos, NM ($105,868), District of Columbia ($67,982), Arlington County, VA ($52,254).  DC area once again….

Percent Foreign Born Residents

(Map color scale: <1%, <3%, <5%, <7%, <9%, <11%, <13%, <15%, >15%)

The map of the percent of foreign-born residents looks pretty much as you would expect – higher percentages along the Mexican border and in cities like NYC, DC, and Chicago.  And pretty much all of California.  The "finger" stretching up the Texas panhandle into Kansas is kind of interesting.  There are also some surprising figures in eastern Washington.

Gender (Im)balance

(Map color scale: <-5%, <-3%, <-2%, <-1%, <1%, <2%, <3%, <5%, >5%)

The gender balance of a county is calculated as (#females - #males) / (#females + #males).  So a positive percentage indicates more females than males, and a negative one indicates more males.
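Or, as a quick one-liner (the counts here are made up for illustration):

    def gender_balance(females, males):
        # Positive = more females, negative = more males, per the definition above
        return 100.0 * (females - males) / (females + males)

    print(gender_balance(5100, 4900))  # -> 2.0, i.e. tilted 2% toward females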

The South appears to be female-heavy, while the West is loaded with males.  Maybe those two should get together…

Lowest: Crowley County, CO (-34.5%).  Apparently one-third of the county residents are inmates at the state prison.  This is likely skewing the numbers toward males.

Highest: Pulaski County, GA (14.9%)

Mean travel time to work (minutes)

(Map color scale: <10, <15, <20, <25, <30, <35, <40, <45, >45)

Low: Aleutians East Borough, AK (6.3 minutes).  Maybe they all live on their fishing boats?

High: Elliott County, KY (48.7 minutes).  Ouch.  I’m guessing it’s just as long on the way back home … that’s almost 2 hours per day!

Percent affiliated with religious congregation

(Map color scale: <20%, <30%, <40%, <50%, <60%, <70%, <80%, <90%, >90%)

There is a surprisingly wide swing here, and it is definitely regional.  There are definite concentrations in the Midwest and Texas, as well as among the Mormons in Utah and Idaho.  The percentages tend to diminish toward either coast.

Low: Camas County, ID (1.8%).  This figure is kind of suspect if you ask me.  About 1000 residents, and only 18 attend church?

High: Falls Church, VA (164.5%).  Once again the DC area!  Hey, wait a sec City-Data…how come this is above 100%?  People going to more than one church?  There are about 40 counties total with the figure > 100%.

Caveat Emptor…

I have no idea how the data presented on the City-Data website was collected.  It could be correct, or it could be wildly off.  In any case, the interpretation heavily depends on the data collection method, as with anything of a statistical nature.  (Something that the general news media, and the media-consuming populace, fail to consider many, many times, IMHO.)

Related to the above – I found I could skew the look of the maps in different ways through the selection of the color scale.  If I wanted to "drown out" a few high-outlier counties, for instance, I could set my high threshold low enough and presto, they're gone and no one is the wiser.  Not that I have knowingly done that here … I tried to pick the scale that best shows the overall trend.  But the reader should be wary of maps, statistics, and the like – always ask yourself what the author is trying to get you to believe.
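For what it's worth, the thresholding behind those map legends amounts to something like this (the binning function is my own illustration, not the actual mapmaker code):

    def pick_bin(value, thresholds):
        # Index of the first threshold the value falls under; values past the
        # last threshold land in the ">" catch-all bin.
        for i, t in enumerate(thresholds):
            if value < t:
                return i
        return len(thresholds)

    density_bins = [1, 2, 10, 50, 100, 200, 500, 1000]
    print(pick_bin(75, density_bins))  # -> 4, the "<100" bucket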

Minethings Screen Scraper

Minethings is a browser-based persistent MMOG.  Visit the site to find out more.  It has a very simple interface, and the "game" itself is not much to look at – mainly an economic simulation, IMHO.  But that makes it kind of interesting … all items in the game can be sold via auction to other players, so the going rate for a given item is determined solely by supply and demand.

I thought it would be interesting to track how prices in Minethings change over time.  I wondered if there were any patterns that, if known and exploited, could bring success in the game.  Hence the screen scraper.

mtscraper.py has two usage options.  The first is "census mode," run with "mtscraper.py -c".  This loops through all the miners currently registered with Minethings.  Names, number of melds, profession, and home city are printed to a file called "mtscrape_census_YYYYMMDDHHSS.txt" created in the same directory as mtscraper.py.  The second is "prices mode," run with "mtscraper.py" (no command-line option).  This loops through all items of each mine type and records the sales history as well as the current listings and bids.  (A comment in the script says average sales price, high bid, and low listing, but that is a bit off.  It actually lists all of them – all the sales, all the listings, and all the bids.)  Oh, and as written it only gets the prices for one city, Harmond.  To change cities, or get prices for ALL cities, look at the "cities" variable in the script.  The script takes quite a while (about a minute) to query all the items for sale in just one city, so doing all of the cities, or doing it frequently, will probably hammer the servers and cause the admin no end of headaches … sorry!
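The mode dispatch is simple enough to sketch (run_census and run_prices here stand in for the real scraping loops, and the timestamp format approximates the file name pattern above):

    import sys
    import time

    def run_census(outname):
        print("census mode -> " + outname)   # placeholder for the real miner loop

    def run_prices(cities):
        print("prices mode for: " + ", ".join(cities))  # placeholder

    def main():
        if "-c" in sys.argv[1:]:
            run_census(time.strftime("mtscrape_census_%Y%m%d%H%M%S.txt"))
        else:
            run_prices(["Harmond"])  # see the "cities" variable in the real script

    if __name__ == "__main__":
        main()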

Usage: you need a valid Minethings account; put your username and password into the USERNAME and PASSWORD fields of the script.  You may need a proxy server (I did), so there is a slot for that as well.  Getting a webpage to load is accomplished through mechanize, so you'll need to install that.  (If the way I connected isn't working, consult the mechanize documentation and try to figure it out.)  Python is required as well, of course; developed and tested with 2.6.
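The connection itself looks roughly like this (the entry URL, form index, and field names are guesses – check the actual Minethings pages):

    import mechanize

    USERNAME = "your_username"
    PASSWORD = "your_password"
    PROXY = ""  # e.g. "proxy.example.com:8080" if you need one

    br = mechanize.Browser()
    if PROXY:
        br.set_proxies({"http": PROXY})
    br.open("http://www.minethings.com/")  # assumed entry page
    br.select_form(nr=0)                   # assume the login form is the first form
    br["username"] = USERNAME              # assumed field names
    br["password"] = PASSWORD
    br.submit()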

The script is heavily dependent on the format of the Minethings internal webpages.  If the site were redesigned, mtscraper.py would likely need to be tweaked.  This fragility is a serious drawback of any screen scraper, and one shared by my earlier Captionator project.  Next time I'll see if I can avoid it; maybe Beautiful Soup can help…?

Anyway, copy the output of mtscraper.py into Excel and you can make some charts.  This one uses "census" data from July 21 and shows the average number of melds per job type:

With prices mode, it's interesting to copy the data for multiple days into a spreadsheet to see the prices change over time.  (This can generate a very large data set very quickly – I hit Excel's 65,536-row limit with only 9 days' worth of data, and that's just for Harmond!)  Here's the average bid/listing/sale price for a Starter Mine in Harmond: