Blogpost written by Marc Kupietz and Christian Thomas, edited by Nathalie Walker, Darja Fišer, and Jakob Lenardič
German reference resources are flagships of the CLARIN-D consortium: the German Reference Corpus DeReKo covers contemporary language data, and the German Text Archive DTA covers historical language data.
One of CLARIN-D’s most important resources is the German Reference Corpus DeReKo. It has been built and maintained at the Leibniz Institute for the German Language (IDS), a B-certified CLARIN centre, since its foundation in the mid-1960s. It continuously samples contemporary German language use from around 1950 onwards in a stratified fashion and thus serves primarily as an empirical basis for synchronic German linguistics.
The DeReKo archive currently contains almost 47 billion words (with a growth rate of 3 billion words per year) from a variety of genres, ranging from newspaper texts from all areas of the German-speaking countries, through fiction and specialized texts, to Usenet news and Wikipedia talk pages. The entire archive is equipped with several morphosyntactic annotations (see Area D in Figure 1) as well as dependency annotations (see Area E) and constituency annotations (currently only Stanford CoreNLP, not shown in the figure).
Figure 1: ANNIS query on the Malt annotations of a DeReKo virtual corpus using KorAP, showing the virtual corpus builder (A), the query language selector (B), and expanded views of the metadata (C), token annotations (D) and Malt dependency annotations (E) of a search result.
One of the distinctive features of DeReKo is that it invites users to compile their own stratified virtual sub-corpora on the basis of extra-textual metadata (see Area C in Figure 1 for a subset), for instance with KorAP’s virtual corpus builder (Area A). This allows for samples that are as representative as possible with regard to specific linguistic research questions, such as diachronic differences between variants of German, or specific language domains, such as Austrian German as used in the newspapers of the 1990s or German in computer-mediated communication. Currently, more than 50,000 linguists use DeReKo free of charge via the analysis platform COSMAS II and the open-source analysis platform KorAP, both of which are also available through the IDS Mannheim centre. Since September 2019, DeReKo has also been accessible via a library for the programming language R, making it easy to perform and visualize quantitative analyses in a reproducible fashion.
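The idea of a virtual sub-corpus defined by extra-textual metadata can be illustrated with a minimal Python sketch. It does not use KorAP, COSMAS II or the R library; the record fields, the toy archive and the function names are all hypothetical and only stand in for the kind of metadata filtering described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical metadata record; the field names are illustrative only."""
    text_id: str
    country: str      # country of publication, e.g. "AT" or "DE"
    text_type: str    # e.g. "newspaper", "fiction"
    year: int         # year of publication
    tokens: int       # size of the text in tokens


def build_virtual_corpus(records, country=None, text_type=None, years=None):
    """Return the records that satisfy every constraint that was given.

    `years` is an inclusive (first_year, last_year) tuple.
    """
    selected = []
    for rec in records:
        if country is not None and rec.country != country:
            continue
        if text_type is not None and rec.text_type != text_type:
            continue
        if years is not None and not (years[0] <= rec.year <= years[1]):
            continue
        selected.append(rec)
    return selected


def corpus_size(records):
    """Total number of tokens in a (virtual) corpus."""
    return sum(rec.tokens for rec in records)


# Toy archive with invented entries, for illustration only.
ARCHIVE = [
    TextRecord("t1", "AT", "newspaper", 1994, 1200),
    TextRecord("t2", "AT", "newspaper", 2003, 900),
    TextRecord("t3", "DE", "newspaper", 1996, 1500),
    TextRecord("t4", "AT", "fiction", 1992, 4000),
    TextRecord("t5", "AT", "newspaper", 1999, 800),
]

# Virtual sub-corpus: Austrian newspaper texts from the 1990s.
austrian_news_90s = build_virtual_corpus(
    ARCHIVE, country="AT", text_type="newspaper", years=(1990, 1999)
)
print([rec.text_id for rec in austrian_news_90s], corpus_size(austrian_news_90s))
```

The sub-corpus is only a selection over the archive, not a copy of it, which is why many differently stratified samples can be defined over the same underlying data.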
Being part of the IDS CLARIN B-Centre in Mannheim, DeReKo has been integrated into the CLARIN infrastructure from the outset. DeReKo uses and implements many of the standards and best practices developed within CLARIN and can be accessed partially via CLARIN’s Federated Content Search (CLARIN-FCS) and WebLicht.
Users interested in language material from before 1950 will find another resource provided within CLARIN-D extremely useful: the Deutsches Textarchiv (“German Text Archive”, DTA). Hosted by the CLARIN centre at the Berlin-Brandenburg Academy of Sciences and Humanities, the DTA is the largest single corpus of historical New High German, covering the period from the 16th to the early 20th century and comprising more than 350 million tokens in 1.34 million digitized pages. While it focuses mostly on (digitized) printed material, the DTA also includes a growing number of handwritten documents. Specialty sub-corpora include historical newspapers and other periodicals. The DTA as a whole covers a rich variety of fiction and non-fiction texts, the latter including academic as well as non-academic writing.
Figure 2: Landing page of the DTA (http://www.deutschestextarchiv.de/)
The DTA is composed of the so-called DTA-Kernkorpus (DTAK, “DTA Core Corpus”) with approximately 1500 first editions from the 16th through the 19th century. Additionally, the DTA-Erweiterungen (DTAE, “DTA Extensions”) module contains specialty corpora and individual texts which have been curated in the context of CLARIN-D and other projects. The full-text sources provided by digitization projects and other discipline-specific initiatives have been (manually or semi-automatically) converted to a TEI-compatible XML format conforming to the DTA-Basisformat (DTABf, “DTA Base Format”) guidelines, including extensive metadata on the original sources and data preparation. OCR texts in the DTA Core Corpus, as well as numerous additional text resources, have been manually corrected. A continuous quality assurance process is made possible by the collaborative web-based platform DTAQ, which currently has around 2000 registered users. All DTA corpora are prepared for use by automated computational linguistic analysis methods, including not only PoS-tagging and lemmatization but also, among others, the orthographic normalization of historical spelling variants, which allows users to formulate queries in modern orthography. Just like DeReKo, the DTA is fully integrated into the CLARIN infrastructure and can, for instance, be accessed via the VLO, Federated Content Search, the Language Resource Switchboard, and WebLicht. Among the many resources available from the CLARIN-D community, the DeReKo and DTA data sets are flagships with tens of thousands of users.
Figure 3: The Deutsches Textarchiv / German Text Archive: an integrated research platform; Illustration from: Geyken et al. 2018, p. 221 (see: doi.org/10.1515/9783110538663-011)
- Kupietz, Marc, Cyril Belica, Holger Keibel, and Andreas Witt (2010): “The German Reference Corpus DeReKo: A primordial sample for linguistic research.” In: Proceedings of the seventh conference on International Language Resources and Evaluation (LREC 2010), Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias (eds.). European Language Resources Association (ELRA), 1848-1854.
- Kupietz, Marc, Harald Lüngen, Paweł Kamocki, and Andreas Witt (2018): “The German Reference Corpus DeReKo: New Developments – New Opportunities.” In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga (eds.), Miyazaki: European Language Resources Association (ELRA), 4353-4360.
- Bański, Piotr, Joachim Bingel, Nils Diewald, Elena Frick, Michael Hanl, Marc Kupietz, Piotr Pęzik, Carsten Schnober, and Andreas Witt (2013): “KorAP: the new corpus analysis platform at IDS Mannheim.” In: Vetulani, Zygmunt and Hans Uszkoreit (eds.): Human Language Technologies as a Challenge for Computer Science and Linguistics. Proceedings of the 6th Language and Technology Conference. Poznań: Uniwersytet im. Adama Mickiewicza w Poznaniu, 586-587.
- Boenig, Matthias, and Susanne Haaf (2019): “Aggregating resources in CLARIN: FAIR corpora of historical newspapers in the German Text Archive.” In: Proceedings of CLARIN Annual Conference 2019, Kiril Simov and Maria Eskevich (eds.), Leipzig: CLARIN, 124–128. PDF available at: https://office.clarin.eu/v/CE-2019-1512_CLARIN2019_ConferenceProceedings....
- Fischer, Frank, Susanne Haaf, and Marius Hug (2019): “The best of three worlds: Mutual enhancement of corpora of dramatic texts (GerDraCor, German Text Archive, TextGrid Repository).” In: Proceedings of CLARIN Annual Conference 2019, Kiril Simov and Maria Eskevich (eds.), Leipzig: CLARIN, 97–103. PDF available at: https://office.clarin.eu/v/CE-2019-1512_CLARIN2019_ConferenceProceedings....
- Jurish, Bryan, and Maret Nieländer (2019): “Using DiaCollo for historical research.” In: Proceedings of CLARIN Annual Conference 2019, Kiril Simov and Maria Eskevich (eds.), Leipzig: CLARIN, 40–43. PDF available at: https://office.clarin.eu/v/CE-2019-1512_CLARIN2019_ConferenceProceedings....
- Geyken, Alexander, Matthias Boenig, Susanne Haaf, Bryan Jurish, Christian Thomas, and Frank Wiegand (2018): “Das Deutsche Textarchiv als Forschungsplattform für historische Daten in CLARIN.” In: Henning Lobin, Roman Schneider, and Andreas Witt (eds.): Digitale Infrastrukturen für die germanistische Forschung (= Germanistische Sprachwissenschaft um 2020, vol. 6). Berlin/Boston, 2018, 219–248. DOI: https://doi.org/10.1515/9783110538663.
- Bański, Piotr, Susanne Haaf, and Martin Mueller (2018): “Lightweight Grammatical Annotation in the TEI: New Perspectives.” In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 7.–12. Mai 2018, Miyazaki (Jp), 1795–1802. PDF available at: http://www.lrec-conf.org/proceedings/lrec2018/pdf/422.pdf.
- Haaf, Susanne, and Christian Thomas (2017): “Enabling the Encoding of Manuscripts within the DTABf. Extension and Modularization of the Format.” In: Journal of the Text Encoding Initiative (jTEI), 10: 2015 Conference Issue. DOI: https://doi.org/10.4000/jtei.1650.
- Geyken, Alexander, and Thomas Gloning (2015): “A living text archive of 15th–19th-century German. Corpus strategies, technology, organization.” In: Jost Gippert and Ralf Gehrke (eds.): Historical Corpora. Challenges and Perspectives. Tübingen 2015, 165–180. PDF available at: http://www.deutschestextarchiv.de/files/Geyken-Gloning-2015_A-living-tex....
- Jurish, Bryan (2015): “DiaCollo: On the trail of diachronic collocations.” In: Koenraad De Smedt (ed.): Proceedings of the CLARIN Annual Conference 2015, Wroclaw, Poland, October 14–17, 28–31. PDF available at: http://www.deutschestextarchiv.de/files/jurish2015diacollo-clarin.pdf.
- Thomas, Christian, and Frank Wiegand (2015): “Making great work even better. Appraisal and digital curation of widely dispersed electronic textual resources (c. 15th–19th centuries) in CLARIN-D.” In: Jost Gippert and Ralf Gehrke (eds.): Historical Corpora. Challenges and Perspectives. Tübingen 2015, 181–196. PDF available at: http://www.deutschestextarchiv.de/files/Thomas-Wiegand-2015_Making-Great....