

The Old Bailey Corpus 2.0, 1720-1913

Image: Coloured aquatint by Thomas Rowlandson and Auguste Charles Pugin of a trial at the Old Bailey, published in Ackermann, Rudolph & Pyne, William Henry. 1808-1810. The Microcosm of London or London in Miniature. Vol. II. London: Rudolph Ackermann, plate facing p. 212.




The Old Bailey Corpus


The Old Bailey Corpus (OBC) is a sociolinguistically, pragmatically and textually annotated corpus based on a selection of the Proceedings of the Old Bailey (henceforth Proceedings), the published version of the trials at London's Central Criminal Court. The 2,163 volumes of the Proceedings contain almost 200,000 trials, totalling ca. 134 million words. These speech-related texts record Late Modern English as spoken in the courtroom. The Old Bailey trials were taken down in shorthand, so the published Proceedings are a reasonably close approximation of what was said in court, even though scribes, printers, publishers and the constraints of the printed medium acted as linguistic filters between the spoken word and its representation in the Proceedings.

The compilation of the OBC started in January 2006 at Justus Liebig University Giessen. Version 1.0 of the OBC was released in 2013, containing 14 million spoken words. OBC 2.0 contains a total of 24.4 million speech-related words. It was released in June 2016 and consists of 637 selected Proceedings, from 1720 to 1913. 

OBC 2.0 allows the linguist to analyze speech-related texts in a period that has been neglected both with regard to the compilation of primary linguistic data and the description of the structure, variability, and change of spoken English. With a high number of speakers and over half a million individual utterances, OBC 2.0 constitutes a fairly representative sample of spoken, rather formal Late Modern English in the courtroom setting. Moreover, every speaker turn is annotated for sociobiographical (gender, social class, age), pragmatic (role in the trial) and textual variables (the shorthand scribe, printer and publisher of individual Proceedings). OBC 2.0 is the largest diachronic collection of spoken English with this level of utterance-level sociolinguistic annotation. Although the corpus can be used for traditional investigations of language change, it is particularly suited for studies that correlate linguistic change and structural variability in Late Modern English with the social context. Its size, the time span covered (almost 200 years) and the available sociobiographical speaker information make OBC 2.0 ideal for fine-tuned studies involving several independent variables, including historical sociolinguistic approaches and the analysis of low-frequency features.

CLARIN-D Service Centre Saarbrücken
Project leader
Magnus Huber, English Dept., University of Giessen


- German Science Foundation
- German Federal Ministry of Education and Research


- The creators of the Proceedings of the Old Bailey Online: Tim Hitchcock (Department of History, University of Sussex), Robert Shoemaker (Department of History, Humanities Research Institute, University of Sheffield) and Sharon Howard (Department of History, Humanities Research Institute, University of Sheffield)
- Members of CLARIN-D: Christian Mair, Christian Drude, Julia Misersky, Elke Teich, Hannah Kermes, Jörg Knappen, Peter Fankhauser