[History by Numbers] Volumes of data
Playing with one of Google’s lesser-known search tools
There have been many times when I have wished I could search the 2,000-odd books cluttering up my house for a specific reference or phrase. Short of digitising the whole lot at great expense, alas, that’s not going to happen. However, the search giant Google offers something similar on a grander scale with its Google Books project.
Launched back in 2004, the project has since scanned the contents of more than 25 million books – to the delight of many a historical researcher, and to the horror of many a publisher enraged by what they see as copyright violation. Not every book is fully viewable, but even those only available in ‘snippet view’ can be searched for key phrases or names of interest.
What is much less well known, however, is the spin-off project called Google Ngram Viewer, released in December 2010. This plumbs the vast corpora of texts in Google’s database – specifically those published between 1500 and 2008 – for occurrences of a given search term and plots their frequency on a graph over time; one can also compare results by separating multiple search terms with commas. To allow for the fact that far more books were published at the end of that period than at the beginning, the results are normalised: each year’s count is presented as a percentage of all the words (or word sequences of the same length) scanned for that year, rather than as a raw tally. The tool doesn’t search the whole of the Google Books collection, only about a fifth of it, and a match is plotted only if the term occurs in at least 40 books.
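To make that normalisation concrete, here is a minimal Python sketch. Every figure in it is invented for illustration – these are not real Google counts – and the function name is my own, not part of any Google tool; the point is simply that dividing by each year’s corpus size stops the ever-growing number of books from swamping the trend.

# Illustrative normalisation of ngram counts, in the spirit of the
# Ngram Viewer. All figures below are invented for this example.

# Raw occurrences of a term in books published in each year.
raw_counts = {1840: 120, 1900: 2_400, 1960: 3_100}

# Total words scanned from each year's books. The corpus grows hugely
# over time, which is why raw counts alone would mislead.
total_words = {1840: 60_000_000, 1900: 2_000_000_000, 1960: 5_500_000_000}

def normalised(year):
    """Return the term's frequency as a percentage of that year's words."""
    return 100 * raw_counts[year] / total_words[year]

for year in sorted(raw_counts):
    print(f"{year}: {normalised(year):.6f}%")

Plotted over time, it is these percentages, not the raw tallies, that appear on the viewer’s graph.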
What’s an ‘ngram’ (or ‘n-gram’)? Simply a run of n consecutive words (although linguists also use the term for sequences of letters or syllables): in the sentence ‘the quick brown fox’, the 2-grams are ‘the quick’, ‘quick brown’ and ‘brown fox’, as the short sketch below shows.
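A few lines of Python make the idea concrete (a minimal sketch of my own; the function and example sentence are just for illustration):

def ngrams(text, n):
    """Return every run of n consecutive words in the text."""
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams('the quick brown fox', 2))
# ['the quick', 'quick brown', 'brown fox']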
And how might this tool be of help to historical researchers?