The impetus for this project is an NPR story from earlier this year that I only just found. An English literature researcher constructed a concordance of Agatha Christie novels to support the hypothesis that she developed Alzheimer's disease in the late stages of her career. He found a 20% drop in vocabulary in her 73rd novel compared with her previous ones.
A concordance is an index of all the instances of a particular word in a body of text. A concordancer is a program that generates a concordance. I don't think I've created a true concordancer, because I don't keep track of where (what page or section of text) each word appears; I just keep a count for each unique word. Call it "Concordancer Junior."
    def addword(word):
        if word in theIndex:
            theIndex[word] = theIndex[word] + 1   # increment count
        else:
            theIndex[word] = 1                    # add word to dictionary with count = 1
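The post doesn't show how the file gets fed into addword, but the driver presumably looks something like this. This is a minimal sketch: the tokenizing regex, the index_text helper, and the use of dict.get are my assumptions, not the original script's.

```python
import re

theIndex = {}

def addword(word):
    # same bookkeeping as above, written with dict.get
    theIndex[word] = theIndex.get(word, 0) + 1

def index_text(text):
    # lowercase first, then pull out runs of letters/apostrophes,
    # so "The" and "the" count as the same word
    for word in re.findall(r"[a-z']+", text.lower()):
        addword(word)

index_text("The cat sat on the mat.")
# theIndex now maps 'the' -> 2 and the other four words -> 1
```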
Once all the words have been added to the dictionary, I want to display the most frequently found words. However, Python dictionaries are unordered, so just printing the first x entries of the dictionary won't do. Some searching turned up a Python wiki entry with an easy solution using the sorted() function:
    s1 = sorted(theIndex.items(), key=lambda item: item[0])                # secondary key: sort alphabetically
    s2 = sorted(s1, key=lambda item: item[1], reverse=True)                # primary key: sort by count
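The two-pass trick works because Python's sort is stable: the alphabetical order from the first pass survives among words with equal counts in the second. On a toy dictionary (my example data, not from the post):

```python
theIndex = {"pear": 2, "apple": 2, "fig": 5}

s1 = sorted(theIndex.items(), key=lambda item: item[0])      # alphabetical
s2 = sorted(s1, key=lambda item: item[1], reverse=True)      # by count, descending

# ties on count stay alphabetical: fig first, then apple before pear
print(s2)   # [('fig', 5), ('apple', 2), ('pear', 2)]
```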
Ooohhh… lambda functions. <shiver!>
I should note that sorted() returns a list of (key,value) tuples rather than a sorted dictionary. But that’s fine since I’m not going to need to add anything further to the dictionary once I get to the sorting point.
For my first cut, the most common words were ‘the’, ‘a’, ‘and’, ‘he’, etc. Not very interesting. So, I revised the script to print out the top 100 words found in the file, EXCLUDING the 100 most common (English) words.
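I don't have the post's exact exclusion list, but the revised step can be sketched like this; the short COMMON set below is a stand-in for the real 100 most common English words, and top_words is a hypothetical helper name.

```python
COMMON = {"the", "a", "and", "he", "of", "to", "in"}   # stand-in for the 100-word list

def top_words(index, n=100, exclude=COMMON):
    # drop the common words, then take the n most frequent of what's left
    kept = [(w, c) for w, c in index.items() if w not in exclude]
    kept.sort(key=lambda item: item[0])                 # secondary key: alphabetical
    kept.sort(key=lambda item: item[1], reverse=True)   # primary key: count, descending
    return kept[:n]

print(top_words({"the": 50, "whale": 9, "sea": 9, "and": 30}))
# [('sea', 9), ('whale', 9)] -- the common words are gone
```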
For test cases I downloaded plain text files from Project Gutenberg, stripping out the boilerplate text they add at the start and end of each file.
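Gutenberg texts bracket the actual book with "*** START OF ..." and "*** END OF ..." marker lines, so the trimming I did by hand could also be scripted. A sketch, with the caveat that the exact marker wording varies between files:

```python
def strip_gutenberg(text):
    # keep only the lines between the START and END markers, if present
    lines = text.splitlines()
    start = next((i + 1 for i, l in enumerate(lines)
                  if l.startswith("*** START OF")), 0)
    end = next((i for i, l in enumerate(lines)
                if l.startswith("*** END OF")), len(lines))
    return "\n".join(lines[start:end]).strip()

sample = ("*** START OF THE PROJECT GUTENBERG EBOOK X ***\n"
          "Call me Ishmael.\n"
          "*** END OF THE PROJECT GUTENBERG EBOOK X ***")
print(strip_gutenberg(sample))   # Call me Ishmael.
```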
Without further ado, some results please!