Natural Language Processing for Historical Documents – a workshop report

Submitted by Linda Stokman

Experts on tools for working with historical documents met in Berlin in September for a CLARIN workshop to exchange ideas and experiences about tools and methods. The outputs included a draft resource guide and a plan of action to integrate more tools into the CLARIN infrastructure.


The main goal of the workshop was to produce a guide to software applications for processing historical language varieties: a document to help users find, understand, choose and deploy natural language processing software for the annotation and analysis of texts in such varieties. The guide will be published alongside the existing ‘Resource Families’ guides to datasets (https://www.clarin.eu/resource-families). The workshop took place at the BBAW in Berlin and was organized by Martin Wynne (Bodleian Libraries, University of Oxford), Bryan Jurish (ZDL, BBAW) and Christian Thomas (CLARIN-D, BBAW).

Photograph of the workshop

The workshop brought together 21 participants from 13 European countries who are creating or working with NLP tools such as tokenizers, normalizers, morphological analyzers, part-of-speech taggers and lemmatizers that work with historical language varieties, especially European languages of the period 1500-1800. The workshop enabled the mutual sharing of expertise, know-how, tools and resources. This historical period (roughly covered by the term ‘Early Modern’ in English) was selected because it is the period covered by many digitization programmes of early printed works, and a time when many languages were still recognizably similar in form to their contemporary varieties, yet different enough that standard software tools often cannot be applied to them with acceptable accuracy. The workshop discussed both the adaptation of NLP tools trained on or designed for modern language varieties and custom tools designed specifically for particular historical varieties.

Example of a workflow diagram for the annotation of historical text
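
To make the shape of such a workflow concrete, here is a minimal sketch in Python. The normalization table, the example sentence and the function names are all invented for illustration; they do not correspond to any specific tool discussed at the workshop.

```python
import re

# A minimal sketch of a tokenize -> normalize -> annotate workflow for
# historical text. The normalization table and the example below are
# illustrative placeholders, not components of any actual CLARIN tool.

# Toy mapping from early modern English spellings to modern forms; a real
# normalizer would use a large lexicon or a trained character-level model.
NORMALIZATION_TABLE = {
    "vpon": "upon",
    "haue": "have",
    "doth": "does",
    "ſea": "sea",  # long s
}

def tokenize(text: str) -> list[str]:
    """Split the raw text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(token: str) -> str:
    """Replace a historical spelling with its modern form, if known."""
    return NORMALIZATION_TABLE.get(token.lower(), token)

def annotate(text: str) -> list[dict]:
    """Tokenize and normalize; a modern POS tagger and lemmatizer would
    then be run on the normalized tokens downstream."""
    return [{"token": t, "normalized": normalize(t)} for t in tokenize(text)]

if __name__ == "__main__":
    for record in annotate("They haue set sail vpon the ſea."):
        print(record)
```

The point of the ordering is that once spellings have been modernized, standard tools trained on contemporary language can be applied downstream without modification.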

Preliminary investigation revealed two distinct approaches to dealing with historical varieties, both of which were represented and discussed in the workshop:

  1. modernization: creating modernized versions of the words in the texts so that existing NLP tools for contemporary language varieties can be applied to them; or

  2. domain adaptation: developing new tools, or retraining existing ones, so that they work directly with historical language varieties (see the sketch after this list).
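
In contrast to the modernization sketch above, here is an equally minimal illustration of the second strategy, assuming a tiny invented sample of hand-annotated early modern English: rather than changing the input, the tagger itself is trained on in-domain data. A real adaptation effort would use a statistical or neural tagger and a gold-standard corpus of the kind proposed at the end of this report.

```python
from collections import Counter, defaultdict

# Toy "domain adaptation" by retraining: fit a unigram tagger on a small
# hand-annotated historical sample instead of modern training data.
# The sentences and tags below are invented for illustration.
HISTORICAL_TRAINING_DATA = [
    [("he", "PRON"), ("doth", "VERB"), ("speake", "VERB")],
    [("thou", "PRON"), ("art", "VERB"), ("kinde", "ADJ")],
    [("the", "DET"), ("kinge", "NOUN"), ("doth", "VERB"), ("rule", "VERB")],
]

def train_unigram_tagger(sentences):
    """Record the most frequent tag seen for each word form in training."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, tokens, default="UNK"):
    """Tag tokens with the retrained model; unseen words fall back to UNK."""
    return [(t, model.get(t.lower(), default)) for t in tokens]

if __name__ == "__main__":
    model = train_unigram_tagger(HISTORICAL_TRAINING_DATA)
    print(tag(model, ["the", "kinge", "doth", "speake"]))
```

Because the model is fitted directly on historical forms such as ‘doth’ and ‘kinge’, no normalization step is needed; the trade-off is the cost of producing enough annotated in-domain training data.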

The workshop also generated a set of recommendations for candidate software applications for integration into the CLARIN infrastructure. While certain CLARIN Centres have considerable expertise in this domain, there are currently no tools suitable for processing historical language varieties available via the CLARIN Language Resource Switchboard, and only very few available via web service orchestration platforms such as WebLicht. The outputs of the workshop should help the Standing Committee of CLARIN Technical Centres identify suitable candidates for integration into the infrastructure. Beyond software applications, the discussion also covered annotated texts and lexical data, both key resources for many workflows, and concrete proposals were made for depositing such resources in CLARIN repositories.

The workshop is part of CLARIN’s mission to provide and support NLP for research in the humanities and social sciences. Implementing and improving tagging and lemmatization for historical documents is key to improving access to text collections, serves as a first step towards distributional semantics and ‘big data’ approaches, and enables new types of research.

The workshop concluded with a discussion of possible next steps for CLARIN in this domain. An outline plan for a user involvement workshop was formulated, focusing on helping researchers who are manually annotating data to create complete hand-crafted datasets which can serve as ‘gold standard’ data for training and/or evaluation purposes. A proposal will be developed for a ‘hackathon’ or ‘data carpentry’ event on this topic in 2020.

Quote: mwynne: ‘Natural Language Processing for Historical Documents – a workshop report.’ In: Im Zentrum Sprache, 24 September 2019, https://sprache.hypotheses.org/1790 (retrieved 23 October 2019).