Google NGram for early modern history?

Last month, I started a new job at Utrecht University within the ERC-funded project ARTECHNE. One of the things I try to figure out in my subproject, The Term ‘Technique’ in the History of the Arts and Sciences, 1500-1950, is when and why a term like ‘technique’ first started occurring in the vernacular to describe artistic skills (instead of only in Latin to describe any process of skilfully making or doing something, as was the case before the eighteenth century). Of course this can be done by studying primary sources one by one, but searching large amounts of historical texts semi-automatically could be a great help. However, Digital Humanities methods like data mining do require caution, as Pim Huijnen also describes in this excellent blog.

Book scanning at the University of Michigan, one of the libraries participating in Google Books

I soon saw this confirmed when I started experimenting with Google Books NGram Viewer (GNV), and figured that sharing my experience might be interesting for other early modern historians. In theory GNV is amazing for the analysis of historical texts, and for research that focuses on post-1800 texts this is true to some extent, as described here. As Wikipedia puts it, GNV ‘is an online search engine that charts frequencies of any set of comma-delimited search strings using a yearly count of n-grams found in sources printed between 1500 and 2008.’ An n-gram is an instance of a word or phrase within a corpus; n is a variable representing the number of words.[1] In other words, GNV counts how often a word or a combination of words occurs in the digitized printed sources available in Google Books in any given year between 1500 and 2008 and visualizes that in a nice chart. Google Books contains over 25 million titles and GNV works for those in American English, British English, French, German, Spanish, Russian, Hebrew, and Chinese, so in theory it would be a great way to figure out when the term ‘technique’, or a combination of words like ‘art’ and ‘technology’, first occurred in European languages, and how it spread.
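To make the idea of a yearly n-gram count concrete, here is a minimal Python sketch on an invented three-text corpus. The function names are hypothetical and this is not Google's implementation; in particular, dividing by the total number of n-grams per year is my assumption about how such counts are made comparable across years.

```python
from collections import Counter

def ngrams(text, n):
    """Return all n-grams in text as tuples of lowercased words."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def yearly_relative_frequency(corpus, phrase):
    """Per year: share of n-grams (same length as phrase) equal to phrase.

    corpus is a list of (year, text) pairs; phrase is a string of n words.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    hits = Counter()    # occurrences of the phrase per year
    totals = Counter()  # all n-grams of that length per year
    for year, text in corpus:
        grams = ngrams(text, n)
        totals[year] += len(grams)
        hits[year] += sum(1 for g in grams if g == target)
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}

# Invented toy corpus, for illustration only (not real Google Books data).
toy_corpus = [
    (1790, "die technik der natur"),
    (1790, "die kunst und die natur"),
    (1925, "die technik und die technik der kunst"),
]

freq = yearly_relative_frequency(toy_corpus, "technik")
# 1790: 1 hit among 9 words; 1925: 2 hits among 7 words.
for year, share in freq.items():
    print(year, round(share, 3))
```

Plotting such shares per year is, in essence, what the chart shows. The sketch also makes clear that the result can only be as good as the texts and the years attached to them.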

However, there are a couple of reasons why this does not work for early modern printed works (roughly 1500-1800). First of all, the majority of the books on Google Books are not from this period, but from the nineteenth and twentieth centuries. A lot of early modern sources relevant for my research are simply not on Google Books. Second, old books are often printed in unusual and irregular fonts, which are hard for OCR (Optical Character Recognition) software to recognize. This leads to a lot of misses (i.e. the combination of the terms ‘art’ and ‘technique’ does occur, but does not come up in GNV because the OCR does not recognize one or both words), but also to a lot of false positives (i.e. GNV ‘recognizes’ a word, but when you go to the source, it turns out to be an OCR misread, or the source is dated incorrectly).
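A small Python sketch of how such a miss comes about, using an invented list of OCR tokens in which one ‘Technik’ has been misread as ‘Tecbnik’. An exact search only finds the cleanly recognized token. The fuzzy comparison with Python's difflib is my own substitution for illustration, not something GNV offers, and the function name find_hits and the cutoff of 0.8 are assumptions. Note that a looser cutoff would recover more misreads, but would also let in more false positives.

```python
import difflib

def find_hits(tokens, term, cutoff=None):
    """Return tokens matching term exactly, or approximately if cutoff is set.

    cutoff is a similarity threshold between 0 and 1 (hypothetical choice).
    """
    term = term.lower()
    if cutoff is None:
        return [t for t in tokens if t.lower() == term]
    return [
        t for t in tokens
        if difflib.SequenceMatcher(None, t.lower(), term).ratio() >= cutoff
    ]

# Invented OCR output: the first 'Technik' was misread as 'Tecbnik'.
ocr_tokens = ["Die", "Tecbnik", "der", "Natur", "und", "die", "Technik"]

exact = find_hits(ocr_tokens, "Technik")              # misses the misread token
fuzzy = find_hits(ocr_tokens, "Technik", cutoff=0.8)  # also catches 'Tecbnik'
print(exact)
print(fuzzy)
```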

To give an example: I tried using GNV to see when the word ‘Technik’ starts to occur in German in Google Books. This is what the resulting ngram looks like:

[Screenshot, 2016-03-11: GNV chart for ‘Technik’ in German]

The upward line after 1800 is fairly reliable; sources printed after this date are generally suitable for OCR. However, before 1800 it is a different story. The three ‘peaks’ between 1650 and 1750 seem rather random, and if you start analysing them, it soon turns out they are. At the bottom of the screen, GNV lists a number of periods in which your search term occurs, and you can click each of them to see the underlying results in Google Books. When you follow the link to the results for 1500-1727, it turns out that all these results are documents that are dated incorrectly in Google Books:

[Screenshot, 2016-03-11: Google Books results for ‘Technik’, 1500-1727]

When you click the first result that actually has a document attached to it (the third result here), it turns out that the word ‘Technik’ occurs not in a late-sixteenth-century book, but in a 1925 newspaper article that somehow ended up in the same file as a sixteenth-century Italian book:

[Screenshot, 2016-03-11: the 1925 newspaper article containing ‘Technik’]

On closer inspection, every supposed occurrence of ‘Technik’ between 1500 and 1727 turns out to be a case of a wrongly dated document or an OCR misread. For the period 1728-1800 (you can manually adjust the period), the results are slightly more reliable, but a quick look at the results shows that the most common occurrence is not around the middle of the eighteenth century, as GNV suggests, but the last decade of the century, mostly in work related to Kant’s Kritik der Urteilskraft.

So although I can use GNV, and especially a period-limited search in Google Books, to partly back up my initial hunch about the emergence of the term ‘Technik’ in German (namely that it is first used in Kant’s Kritik der Urteilskraft), it is not a reliable way to say how often a certain term occurs in digitized sources on Google Books from before 1800. Of course, that does not mean we can’t use digital humanities methods and tools like GNV at all. You just have to be aware of the limitations and figure out an alternative that does work for your particular research. One of the solutions we are working on, together with the people at the Utrecht Digital Humanities Lab, is building our own database of historical texts on art and technology. We have only just started, and it is a work in progress, so more on that to follow soon!

 

 

[1] In computational linguistics, n-grams are actually used for more complex things like probability predictions too.


About mariekehendriksen

I am a historian of science and art, specialized in the material culture of eighteenth-century medicine and chemistry. I received my PhD from Leiden University in 2012, worked at the University of Groningen as a postdoc, and am now based at Utrecht University. I have been awarded fellowships by the National Maritime Museum in London, the Max Planck Institute for the History of Science in Berlin, the Wood Institute at the College of Physicians, the Chemical Heritage Foundation (both in Philadelphia), and a Wellcome Trust Grant at the Royal College of Surgeons of Edinburgh Library and Archives. The topics of my publications range from historical anatomical collections and medicine chests to anatomical preparation methods and the production of coloured glass. At Utrecht University I work as a postdoctoral researcher within the ERC-funded project Artechne. The project studies how technique was taught and learned in art and science between 1500 and 1950. Although the term ‘technical’ is readily used today, a history of the shifting meanings of the term ‘technique’ in arts and science is sorely lacking. My research is aimed at closing this gap in intellectual history, among other things through the development of an interactive semantic-geographical map of ‘technique’ and related terms.
This entry was posted in Digital Humanities.

3 Responses to Google NGram for early modern history?

  1. nicky553 says:

    Marieke, thanks for this. I too wish for a reliable tool for pre-modern searching. You may be interested in my thoughts on the n-gram when it first appeared:
    http://www.psmag.com/business-economics/culturomics-an-idea-whose-time-has-come-34742

    Best, Anita

  2. Thank you Anita, I can’t believe I hadn’t found your excellent article when researching this! I absolutely agree with you – culturomics and digital humanities tools can be a valuable part of our research methods, but they can never replace qualitative research, nor should they be used indiscriminately!

    Best, Marieke

  3. Pingback: Creating and integrating a database – work in progress | ARTECHNE – Technique in the Arts, 1500-1950
