In this recipe we look at simple mechanisms for measuring the complexity of different books. First we examine the Harry Potter series to see whether the books become more complex as the series progresses. Then we compare some classics from Project Gutenberg with modern books to look for similarities and differences between the two.
We consider three different factors as potential ways of measuring book complexity:
- Sentence Length: Average number of words per sentence in a book.
- Word Variety: Number of different words in the first 50,000 words of a book.
- Average Word Length: Average length, in characters, of all of the words in a book.
Note: We opted against using NLTK (the Natural Language Toolkit) to identify words and sentences in order to keep our scripts simple for beginners. This means that some of our sentence lengths and word lengths might be slightly off.
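To give a flavour of the kind of simple, NLTK-free splitting described above, here is a rough sketch of how the three measures might be computed. The function names and regular expressions are illustrative, not the recipe's actual code, and they share the same limitations noted above (abbreviations like "Mr." will be miscounted as sentence boundaries, for example):

```python
import re

def sentence_lengths(text):
    """Split crudely on ., ! and ? -- simple, but not 100% rigorous."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def word_variety(text, limit=50_000):
    """Count distinct words among the first `limit` words of the text."""
    words = re.findall(r"[a-zA-Z']+", text.lower())[:limit]
    return len(set(words))

def average_word_length(text):
    """Mean length, in characters, of every word in the text."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return sum(len(w) for w in words) / len(words)
```

The trade-off is deliberate: a few mis-split sentences matter little when the same crude rules are applied to every book, which is why the comparisons between books remain meaningful.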
These findings should be taken with a pinch of salt, given that our scripts for building word and sentence counts aren't 100% rigorous. That said, any errors should be consistent across all samples, so the overall comparisons should be reasonable. Given that the Harry Potter books are aimed at teenagers, we expect the classics we found on Project Gutenberg to show indicators of greater complexity. Here is what we found.
- Sentence length in the Harry Potter books was noticeably shorter than in the classic books we chose from Project Gutenberg.
- While Wuthering Heights and Moby Dick had more unique words in their first 50,000 words than the Harry Potter books, Pride and Prejudice had fewer than the first Harry Potter book.
- Both sentence length and unique-word counts increase from the first Harry Potter book to the last. The increase is not dramatic, but both measures rise steadily through the series, matching the rising age of the intended readership.
- Average word length appears to be essentially the same across books of all reading ages.
Download the "all_files_xxx.zip" file to grab all scripts and related data files for this recipe. Note: we have not included any of the source book data in this sample for copyright reasons, but we have included the intermediate data files required for generating these charts.
A short description of the files follows:
common.py: Contains common functions shared by all of the scripts – mostly related to setting up default styles for Matplotlib.
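A common.py along these lines might set shared chart defaults through Matplotlib's rcParams; the specific values below are assumptions for illustration, and the recipe's actual common.py may choose different styles:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so the scripts also run without a display
import matplotlib.pyplot as plt

def set_default_style():
    """Apply shared chart defaults; the actual values in common.py may differ."""
    plt.rcParams.update({
        "figure.figsize": (10, 6),  # consistent chart size across all plots
        "axes.grid": True,          # light grid makes the comparisons easier to read
        "font.size": 12,
    })
```

Each plotting script can then call `set_default_style()` once at startup so every chart shares the same look.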
generate_charts.sh: A shell script for macOS/Linux that generates all of the above charts in one command.
plot_sentence_length.py: Generates the sentence length charts from previously generated “sentence length” files.
plot_word_length.py: Generates the word length charts from previously generated “word summary” files.
plot_word_variety.py: Plots the word variety chart above. Note: use the files in data/50k/, as these contain information about just the first 50,000 words of each book (a longer book will naturally contain more words, so we have to normalise the counts).
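A minimal version of one of these plotting scripts might read per-book counts from a CSV file and draw a bar chart. The file layout (one `title,unique_words` row per book) and function name here are assumptions for illustration, not the recipe's actual data format:

```python
import csv
import matplotlib
matplotlib.use("Agg")  # headless backend for script use
import matplotlib.pyplot as plt

def plot_word_variety(csv_path, out_path):
    """Read 'title,unique_words' rows from csv_path and save a bar chart."""
    titles, counts = [], []
    with open(csv_path, newline="") as f:
        for title, count in csv.reader(f):
            titles.append(title)
            counts.append(int(count))
    fig, ax = plt.subplots()
    ax.bar(titles, counts)
    ax.set_ylabel("Unique words in first 50,000")
    ax.tick_params(axis="x", rotation=45)  # keep long book titles readable
    fig.tight_layout()
    fig.savefig(out_path)
```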
utils/: Contains two scripts for generating text/CSV files describing sentence lengths and word counts from a book in plain-text format.
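In spirit, the sentence-length utility might look something like the sketch below, reading a plain-text book and writing one row per sentence. The function name, CSV layout, and crude sentence splitting are all illustrative assumptions rather than the recipe's actual implementation:

```python
import csv
import re

def write_sentence_lengths(book_path, out_path):
    """Write one 'sentence,words' row per sentence, splitting crudely on . ! ?"""
    with open(book_path, encoding="utf-8") as f:
        text = f.read()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sentence", "words"])
        for i, sentence in enumerate(sentences, start=1):
            writer.writerow([i, len(sentence.split())])
```

Keeping this step separate from the plotting scripts means the (slow) text processing runs once per book, while the charts can be regenerated quickly from the intermediate CSV files.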