Tour de CLARIN: German Reference Corpus (DeReKo) and German Text Archive (DTA)

Submitted by Jakob Lenardič on 15 May 2020

Blogpost written by Marc Kupietz and Christian Thomas, edited by Nathalie Walker, Darja Fišer, and Jakob Lenardič

German reference resources are flagships of the CLARIN-D consortium: the German Reference Corpus (DeReKo) for more modern language data and the German Text Archive (DTA) for historical language data.

One of CLARIN-D’s most important resources is the German Reference Corpus DeReKo, which has been built and maintained at the Leibniz Institute for the German Language (IDS), a B-certified CLARIN centre, since its foundation in the mid-1960s. It continuously samples contemporary German language use from around 1950 onwards in a stratified fashion and thus serves primarily as an empirical basis for synchronic German linguistics.

The DeReKo archive currently contains almost 47 billion words (with a growth rate of 3 billion words per year) from a variety of genres, ranging from newspaper texts from all areas of the German-speaking countries, through fiction and specialized texts, to Usenet news and Wikipedia talk pages. The entire archive is equipped with several morphosyntactic annotations (see Area D in Figure 1) as well as dependency annotations (see Area E) and constituency annotations (currently only Stanford CoreNLP, not shown in the figure).

Figure 1: ANNIS query on the Malt annotations of a DeReKo virtual corpus using KorAP, showing the virtual corpus builder (A), the query language selector (B), and expanded views for the metadata (C), token annotations (D), and Malt dependency annotations (E) of a search result.

One of the distinctive features of DeReKo is that it invites users to compile their own stratified virtual sub-corpora on the basis of extra-textual metadata (see Area C in Figure 1 for a subset), using, for example, KorAP’s virtual corpus builder (Area A). This allows for samples that are as representative as possible with regard to specific linguistic research questions, for example on diachronic differences between variants of German, or on specific language domains such as Austrian German as used in newspapers of the 1990s or German in computer-mediated communication. Currently, more than 50,000 linguists use DeReKo free of charge via the analysis platform COSMAS II and the open-source analysis platform KorAP, both of which are also available through the IDS Mannheim centre. Since September 2019, DeReKo has also been accessible via a library for the programming language R, making it easy to perform and visualize quantitative analyses in a reproducible fashion.
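To make the idea of a metadata-defined virtual sub-corpus concrete, here is a minimal sketch in Python. It does not use KorAP, COSMAS II, or the R library; the record fields (text_id, country, year, text_type, tokens), the sample records, and the helper virtual_corpus are all hypothetical and invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """Extra-textual metadata for one text (field names are assumptions)."""
    text_id: str
    country: str
    year: int
    text_type: str
    tokens: int


# Toy metadata records, invented for this example only.
TEXTS = [
    TextRecord("A/001", "AT", 1994, "newspaper", 1200),
    TextRecord("A/002", "AT", 2003, "newspaper", 900),
    TextRecord("D/001", "DE", 1995, "newspaper", 1500),
    TextRecord("A/003", "AT", 1998, "fiction", 4000),
    TextRecord("A/004", "AT", 1991, "newspaper", 800),
]


def virtual_corpus(texts, **conditions):
    """Keep the texts whose metadata satisfy every condition.

    Each keyword names a metadata field and maps to a predicate
    that is applied to that field's value.
    """
    return [
        t for t in texts
        if all(pred(getattr(t, field)) for field, pred in conditions.items())
    ]


# Austrian newspaper texts from the 1990s, selected by metadata alone.
austrian_90s_news = virtual_corpus(
    TEXTS,
    country=lambda c: c == "AT",
    year=lambda y: 1990 <= y <= 1999,
    text_type=lambda k: k == "newspaper",
)
total_tokens = sum(t.tokens for t in austrian_90s_news)
```

The sub-corpus here is only a selection over metadata; the underlying texts are not copied, which is roughly what makes such a corpus "virtual".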

Being part of the IDS CLARIN B-Centre in Mannheim, DeReKo has been integrated into the CLARIN infrastructure from the outset. DeReKo uses and implements many of the standards and best practices developed within CLARIN and can be accessed partially via CLARIN’s Federated Content Search (CLARIN-FCS) and WebLicht.

Users interested in language material from before 1950 will find another resource provided within CLARIN-D extremely useful: the Deutsches Textarchiv (“German Text Archive”, DTA). Hosted by the CLARIN centre at the Berlin-Brandenburg Academy of Sciences and Humanities, the DTA is the largest single corpus of historical New High German covering the period from the 16th to the early 20th century, comprising more than 350 million tokens in 1.34 million digitized pages. Focusing mostly on (digitized) printed material, the DTA also includes a growing number of hand-written documents. Specialty sub-corpora include historical newspapers and other periodicals. The DTA as a whole covers a rich variety of fiction and non-fiction texts, the latter including academic as well as non-academic writing.

Figure 2: Landing page of the DTA

The DTA is composed of the so-called DTA-Kernkorpus (DTAK, “DTA Core Corpus”) with approximately 1500 first editions from the 16th through the 19th century. Additionally, the DTA-Erweiterungen (DTAE, “DTA Extensions”) module contains specialty corpora and individual texts which have been curated in the context of CLARIN-D and other projects. The full-text sources provided by digitization projects and other discipline-specific initiatives have been (manually or semi-automatically) converted to a TEI-compatible XML format conforming to the DTA-Basisformat (DTABf, “DTA Base Format”) guidelines, including extensive metadata on the original sources and data preparation. OCR texts in the DTA Core Corpus, as well as numerous additional text resources, have been manually corrected. A continuous quality assurance process is made possible by the collaborative web-based platform DTAQ, with around 2000 currently registered users. All DTA corpora are prepared for use by automated computational linguistic analysis methods, including not only PoS-tagging and lemmatization but also, among others, the orthographic normalization of historical spelling variants, which allows users to formulate queries in modern orthography. Just like DeReKo, the DTA is fully integrated into the CLARIN infrastructure and can, for instance, be accessed via VLO, Federated Content Search, Language Resource Switchboard, and WebLicht. Among many other resources available from the CLARIN-D community, the DeReKo and the DTA data sets are flagships with tens of thousands of users.
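The effect of orthographic normalization on querying, as described above, can be illustrated with a toy sketch in Python. This is not the DTA's actual normalization pipeline or query engine: the lookup table NORMALIZED and the functions normalize and search are hypothetical, and a real system would rely on far more than a fixed word list.

```python
# Hypothetical mapping from historical spellings to modern forms.
# A real normalizer would not be a hand-written table like this one.
NORMALIZED = {
    "Theil": "Teil",
    "seyn": "sein",
    "Thür": "Tür",
}


def normalize(token):
    """Return the modern form of a token, or the token itself if unknown."""
    return NORMALIZED.get(token, token)


def search(tokens, modern_query):
    """Return the positions whose normalized form equals the modern query."""
    return [i for i, tok in enumerate(tokens) if normalize(tok) == modern_query]


# A toy token sequence containing historical spellings.
tokens = ["Der", "größte", "Theil", "der", "Thür", "muß", "offen", "seyn"]

hits_teil = search(tokens, "Teil")
hits_sein = search(tokens, "sein")
```

Because matching happens on the normalized layer, a query in modern spelling ("Teil") finds the historical form ("Theil") without the user having to know or enumerate the older variants.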

Figure 3: The Deutsches Textarchiv / German Text Archive: an integrated research platform. Illustration from: Geyken et al. 2018, p. 221.



