London Seminar in Digital Text and Scholarship: 'Protocols for Encoding Shakespeare'

Martin Mueller wrote recently on the Project Bamboo blog that by 2015, the entire corpus of early modern printed books (1476-1700) will be available in accurately transcribed electronic texts. He then poses the question of "what you must 'do to' those texts before they release their query potential". My project, Encoding Shakespeare, addresses this question. I aim to create a model of Shakespeare's language use that is extensible to all of early modern English. By applying the right metadata to all 865,185 words of Shakespeare's complete works (e.g. standardized spellings, lemmas, parts of speech), I aim to create a training set for the Natural Language Processing (NLP) algorithms that will automate this tagging. The accuracy of automated tagging currently stands at 89% for Shakespeare and 95-97% for modern English; I aim to close that gap. To that end, I am designing protocols for error reduction, and a web platform for cloud-sourced encoding and correction of the NLP's errors. (I adopt this model from networked science projects that use human computation to solve similarly large, tractable problems like protein folding or galaxy classification.) Late 2012 is the stage for defining which metadata are the most agnostic: which will serve the most diverse text-analysis tools, both now and in the future. At the London Seminar in Digital Text and Scholarship, I propose to consult my colleagues on future-proofing my metadata protocols, and to beta-test my cloud-sourcing platform immediately before the recruitment phase.
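The token-level tagging the abstract describes (standardized spelling, lemma, part of speech) can be sketched minimally as follows. The lookup tables, tag labels, and function names here are illustrative assumptions for a handful of early modern spellings, not the project's actual protocol or tag set.

```python
# Hypothetical mini-lexicons; a real protocol would draw on full
# early modern spelling and lemma resources, not these toy tables.
STANDARD = {"loue": "love", "vnto": "unto", "haue": "have"}
LEMMA = {"loves": "love", "love": "love", "have": "have", "unto": "unto"}
POS = {"love": "n", "have": "v", "unto": "prep"}

def tag_token(token: str) -> dict:
    """Attach standardized spelling, lemma, and POS metadata to one token."""
    standard = STANDARD.get(token.lower(), token.lower())
    lemma = LEMMA.get(standard, standard)
    return {
        "original": token,
        "standard": standard,
        "lemma": lemma,
        "pos": POS.get(lemma, "unknown"),
    }

print(tag_token("loue"))
# → {'original': 'loue', 'standard': 'love', 'lemma': 'love', 'pos': 'n'}
```

An automated tagger trained on a hand-corrected corpus would replace the dictionary lookups with statistical predictions; the human-correction platform would then review tokens the model tags with low confidence.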

Michael Ullyot is Assistant Professor, Department of English, University of Calgary (Canada). His current research is on biographical self-consciousness in early modern England. He has published articles on elegies, anecdotes, modernized Chaucers, and Senecan drama.

Institute of English Studies
Michael Ullyot, University of Calgary (Canada)
Event date: Thursday, 8 November 2012