Google NGram for early modern history?

Last month, I started a new job at Utrecht University within the ERC-funded project ARTECHNE. One of the things I am trying to figure out in my subproject, The Term ‘Technique’ in the History of the Arts and Sciences, 1500-1950, is when and why a term like ‘technique’ first started occurring in the vernacular to describe artistic skills (instead of only in Latin to describe any process of skilfully making or doing something, as was the case before the eighteenth century). Of course this can be done by studying primary sources one by one, but searching large amounts of historical texts semi-automatically could be a great help. However, Digital Humanities methods like data mining do require caution, as Pim Huijnen also describes in this excellent blog.

Book scanning at the University of Michigan, one of the libraries participating in Google Books

I soon saw this confirmed when I started experimenting with Google Books NGram Viewer (GNV), and figured my experience might be of interest to other early modern historians. In theory GNV is amazing for the analysis of historical texts, and for research that focuses on post-1800 texts this is true to some extent, as described here. As Wikipedia puts it, GNV ‘is an online search engine that charts frequencies of any set of comma-delimited search strings using a yearly count of n-grams found in sources printed between 1500 and 2008.’ An n-gram is an instance of a word or phrase within a corpus; n is a variable representing the number of words.[1] In other words, GNV counts how often a word or a combination of words occurs in the digitized printed sources available in Google Books in any given year between 1500 and 2008 and visualizes that in a nice chart. Google Books contains over 25 million titles and GNV works for those in American English, British English, French, German, Spanish, Russian, Hebrew, and Chinese, so in theory it would be a great way to figure out when the term ‘technique’, or a combination of words like ‘art’ and ‘technology’, first occurred in European languages, and how it spread.
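To make the counting idea concrete, here is a minimal Python sketch of a yearly n-gram count. It is only an illustration, not how GNV is actually implemented: the function names and the toy corpus are invented for this example, tokenization is a naive split on whitespace, and dividing the hits by the total number of n-grams in that year is an assumption of the sketch.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_relative_frequency(corpus, phrase):
    """Map each year to the share of that year's n-grams equal to `phrase`.

    `corpus` maps a year to a list of texts printed in that year.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    result = {}
    for year, texts in corpus.items():
        hits = 0
        total = 0
        for text in texts:
            grams = ngrams(text.lower().split(), n)  # naive tokenization
            total += len(grams)
            hits += sum(1 for gram in grams if gram == target)
        result[year] = hits / total if total else 0.0
    return result


# Invented toy corpus, purely for illustration (not real source texts).
toy_corpus = {
    1790: ["die technik der natur", "die kunst der natur"],
    1925: ["die technik und die technik der kunst"],
}

freq = yearly_relative_frequency(toy_corpus, "Technik")
for year in sorted(freq):
    print(year, round(freq[year], 3))
```

Searching for a two-word phrase such as "die Technik" works the same way, with n = 2. Note that the sketch takes both the text and the year of each document on trust, so any OCR misread or wrong date in the input goes straight into the resulting frequencies.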

However, there are a couple of reasons why this does not work for early modern printed works (roughly 1500-1800). First of all, the majority of the books on Google Books are not from this period, but from the nineteenth and twentieth centuries. A lot of early modern sources relevant for my research are simply not on Google Books. Second, old books are often printed in unusual and irregular fonts, which are hard for OCR (Optical Character Recognition) software to recognize. This leads to a lot of misses (i.e. the combination of the terms ‘art’ and ‘technique’ does occur but does not come up in GNV because the OCR does not recognize one or both words), but also to a lot of false positives (i.e. GNV ‘recognizes’ a word, but when you go to the source, it turns out to be an OCR misread, or the source is wrongly dated).

To give an example: I tried using GNV to see when the word ‘Technik’ starts to occur in German in Google Books. This is what the resulting ngram looks like:

[Screenshot, 2016-03-11: GNV chart for ‘Technik’ in German]

The upward line after 1800 is fairly reliable; sources printed after this date are generally suitable for OCR. However, before 1800 it is a different story. The three ‘peaks’ between 1650 and 1750 seem rather random, and if you start analysing them, it soon turns out they are. At the bottom of the screen, you can click on some of the periods in which, according to GNV, the term you searched for occurs. When you follow the link to the results for 1500-1727, it turns out that all these results are documents that are dated incorrectly in Google Books:

[Screenshot, 2016-03-11: Google Books results for ‘Technik’, 1500-1727]

When you click the first result that actually has a document attached to it (the third result here), it turns out that the word ‘Technik’ occurs not in a late-sixteenth-century book, but in a 1925 newspaper article that somehow ended up in the same file as a sixteenth-century Italian book:

[Screenshot, 2016-03-11: the 1925 newspaper article in which ‘Technik’ occurs]

On closer inspection, every supposed occurrence of ‘Technik’ between 1500 and 1727 turns out to be a case of a wrongly dated document or an OCR misread. For the period 1728-1800 (you can manually adjust the period), the results are slightly more reliable, but a quick look at them shows that the term occurs most often not around the middle of the eighteenth century, as GNV suggests, but in the last decade of the century, mostly in works related to Kant’s Kritik der Urteilskraft.
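The manual check described above can be recorded and tallied in a few lines of Python. This is a sketch under stated assumptions: the record layout and the function name are hypothetical, and the records are invented placeholders that only show the mechanics. They are not my actual results.

```python
from collections import Counter

# Each record: (year claimed by the catalogue,
#               year established by opening the scan,
#               whether the word is really on the page).
# All records are invented placeholders, not real search results.
checked_hits = [
    (1590, 1925, True),   # wrongly dated document
    (1702, 1702, False),  # OCR misread: the word is not actually there
    (1755, 1755, True),
    (1793, 1793, True),
    (1796, 1796, True),
]


def genuine_hits_per_decade(records, start, end):
    """Count hits that survive the manual check, grouped by decade.

    A hit counts only if the word is really on the page and the year
    established from the scan itself falls between start and end.
    """
    counts = Counter()
    for _claimed_year, verified_year, word_present in records:
        if word_present and start <= verified_year <= end:
            counts[verified_year // 10 * 10] += 1
    return dict(sorted(counts.items()))


per_decade = genuine_hits_per_decade(checked_hits, 1728, 1800)
print(per_decade)
```

Grouping by the verified year rather than the catalogue year means a wrongly dated document automatically drops out of the period it does not belong to.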

So although I can use GNV, and especially a period-limited search in Google Books, to partly back up my initial hunch about the emergence of the term ‘Technik’ in German (namely that it is first used in Kant’s Kritik der Urteilskraft), it is not a reliable way to establish how often a certain term occurs in digitized sources on Google Books from before 1800. That does not mean we can’t use digital humanities methods and tools like GNV at all, of course; you just have to be aware of their limitations and figure out an alternative that does work for your particular research. One of the solutions we are working on, together with the people at the Utrecht Digital Humanities Lab, is building our own database containing historical texts on art and technology. We have only just started, and it is a work in progress, so more on that to follow soon!

 

 

[1] In computational linguistics, n-grams are actually used for more complex things like probability predictions too.