Counting Unique Words with Python

The impetus for this project is an NPR story from earlier this year that I just now found.  An English literature researcher constructed a concordance of Agatha Christie novels to support the hypothesis that she suffered from Alzheimer’s in the late stages of her career.  He found a 20% drop in her 73rd novel’s vocabulary, as compared with previous novels.

A concordance is an index of all the instances of a particular word in a body of text.  A concordancer is a program that generates a concordance.  I don’t think I’ve created a true concordancer, because I don’t keep track of where (what page or section of text) I find each word.  I just keep track of a count of unique words.  Call it “Concordancer Junior.”

The concordance.py script will scan an input .txt file and count up the instances of each unique word.  Doing this is almost trivial with Python’s dictionary structure.

theIndex = {}  # maps each word to its count

def addword(word):
    if word in theIndex:  # dict.has_key() was removed in Python 3
        theIndex[word] = theIndex[word] + 1  # increment count
    else:
        theIndex[word] = 1  # add word to dictionary with count = 1
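To show how a function like this might be driven from raw text, here is a minimal sketch. The name count_words and the regular expression that decides what counts as a "word" are my own assumptions for illustration, not necessarily what concordance.py does.

```python
import re

def count_words(text):
    """Return a dictionary mapping each word in text to its count."""
    index = {}
    # Assumption: a "word" is a run of letters or apostrophes, compared case-insensitively
    for word in re.findall(r"[a-z']+", text.lower()):
        # dict.get returns 0 for a word that has not been seen yet
        index[word] = index.get(word, 0) + 1
    return index

counts = count_words("The cat saw the other cat. The end.")
print(counts)
```

The standard library's collections.Counter does the same bookkeeping, if you'd rather not write the increment yourself.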

Once all the words have been added to the dictionary, I want to display the most frequently found words.  However, Python dictionaries are not sorted by count, so just printing the first x entries of the dictionary won’t do.  Searching around, I found a Python wiki entry that has an easy solution using the sorted() function:

s1 = sorted(theIndex.items(),key=lambda item:item[0]) #secondary key: sort alphabetically
s2 = sorted(s1,key=lambda item:item[1], reverse=True) #primary key: sort by count

Ooohhh… lambda functions.  <shiver!>

I should note that sorted() returns a list of (key,value) tuples rather than a sorted dictionary.  But that’s fine since I’m not going to need to add anything further to the dictionary once I get to the sorting point.
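As a small illustration of the two-pass sort, here it is run on a made-up dictionary (the counts are invented for the example). It relies on sorted() being stable, so words with equal counts keep the alphabetical order from the first pass. The single-pass variant s3 is an alternative I'm adding for comparison, not something from the wiki entry.

```python
theIndex = {"whale": 3, "sea": 5, "ship": 3, "ahab": 1}  # made-up counts

s1 = sorted(theIndex.items(), key=lambda item: item[0])   # secondary key: alphabetical
s2 = sorted(s1, key=lambda item: item[1], reverse=True)   # primary key: count, highest first

# One-pass alternative: negate the count so larger counts sort first
s3 = sorted(theIndex.items(), key=lambda item: (-item[1], item[0]))

print(s2)  # [('sea', 5), ('ship', 3), ('whale', 3), ('ahab', 1)]
```

Both s2 and s3 are plain lists of (word, count) tuples, so slicing off the top entries is just s2[:100].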

For my first cut, the most common words were ‘the’, ‘a’, ‘and’, ‘he’, etc.  Not very interesting.  So, I revised the script to print out the top 100 words found in the file, EXCLUDING the 100 most common (English) words.
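The filtering step could look something like the sketch below. COMMON_WORDS here is a tiny stand-in for the real list of 100 common English words, and top_words is a hypothetical helper name; both are assumptions for illustration.

```python
# Assumption: a tiny stand-in for the real list of 100 common English words
COMMON_WORDS = {"the", "a", "and", "he", "of", "to", "in"}

def top_words(index, n=100, exclude=COMMON_WORDS):
    """Return the n most frequent (word, count) pairs, skipping excluded words."""
    ranked = sorted(index.items(), key=lambda item: (-item[1], item[0]))
    return [(word, count) for word, count in ranked if word not in exclude][:n]

sample = {"the": 40, "and": 25, "whale": 9, "sea": 7, "he": 12, "ship": 7}  # made-up counts
print(top_words(sample, n=2))  # [('whale', 9), ('sea', 7)]
```

Using a set for the excluded words keeps each membership check cheap, which matters little for 100 words but costs nothing.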

For test cases I downloaded plain text files from Project Gutenberg, removing their entry/exit boilerplate stuff.

Without further ado, some results please!

The King James Bible:

Alice in Wonderland:

Ulysses:

Kim:

Huckleberry Finn:


6 responses

  1. […] project involves screen scraping lds.org’s Ensign archives and then using a concordance (of sorts) to do some analysis for word counts and word usage frequency.  My thinking is, the more […]

  2. David Zeitlyn

    Hello
    I wonder if this could be used as the basis of a concordance plugin for Flyingmeat’s VoodooPad. It’s the one function I miss from that program. Mac only.
    davidz

    1. Maybe…never used that program myself, but the website says: “Whether it’s Python, Perl, Ruby or any other of your favorite Unix scripting languages- VoodooPad can run it and display the results. Just type in a script and hit Command-R and the page will run as a script like you were at a terminal. VoodooPad works great as a script library.”

  3. I was able to do count unique words in the file. I can’t out how to print the lines that the words are found in. Can someone point me in the right direction?

    1. Sorry for the grammar.

      “I was able to count unique words in the file. I can’t figure out how to print the lines that the words are found in. Can someone point me in the right direction?” To add to my original post, I have the txt in a list. I was thinking I can see if the key words are in that list and if so, print the index of that list.

      1. Still kind of unclear what you mean.

        After running concordance.py on your text file, Concordancer.wordIndex is a Python dictionary with entries [word:count], where “word” is a string that serves as the key into the dictionary and “count” is an integer count of how many times that word appeared in the text file. The line number in the text file is not recorded; in fact the whole file is read in as one big string, so you would have to change the script to read it one line at a time. Then maybe you could alter the dictionary to be [word:lineNumbers], where “lineNumbers” is a list of integers representing all the lines where the “word” was found…. might work! (Can a list live in a dictionary?)
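A dictionary value can be any Python object, including a list, so the [word:lineNumbers] idea does work. Here is a minimal sketch of it; index_lines, the sample lines, and the word-matching regular expression are assumptions for illustration and are not part of concordance.py.

```python
import re
from collections import defaultdict

def index_lines(lines):
    """Map each word to the list of 1-based line numbers it appears on."""
    line_index = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        # set() so a word repeated on one line is recorded only once for that line
        for word in set(re.findall(r"[a-z']+", line.lower())):
            line_index[word].append(line_number)
    return line_index

sample_lines = ["Call me Ishmael.", "Some years ago", "never mind how long ago"]
line_index = index_lines(sample_lines)
print(line_index["ago"])  # [2, 3]
```

For a real file, passing the open file object (which yields one line at a time) in place of sample_lines should work the same way.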

        Maybe you want to compare the wordIndex dictionary against another list of strings and print out which items in the second list are present in the wordIndex. This would be an easy add-on to the script as it currently exists. Just loop through your second list and do something like “if word in wordIndex” and then print out the word. Dictionaries don’t have “indices” per se; the word itself is the “index”.
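A quick sketch of that membership check, with a made-up wordIndex and a hypothetical keywords list standing in for the second list of strings:

```python
wordIndex = {"whale": 9, "sea": 7, "ship": 7}  # made-up counts
keywords = ["whale", "harpoon", "sea"]         # hypothetical second list

# Keep only the keywords that appear as keys in the dictionary
found = [word for word in keywords if word in wordIndex]
for word in found:
    print(word, wordIndex[word])
```

The `in` test looks the word up by key, so there is no need to scan the dictionary yourself.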

        Hope this helps!
