Reference corpora

Introduction

According to the linguist Geoffrey Leech (2002), a reference corpus "is designed to provide comprehensive information about the language […] It has to be a general corpus of wide coverage of the language, and hopefully it will be treated by its user community as some kind of “standard” for the language." Reference corpora thus contrast with specialised corpus families (e.g., parliamentary corpora, CMC corpora) in that they are comprehensive with respect to genre inclusion, typically sampling a diverse set of primarily written genres.

The CLARIN infrastructure offers access to 29 reference corpora for 20 languages. Most of the corpora are available through easy-to-use concordancers such as KonText and the noSketch Engine; the reference corpora are also well annotated, typically displaying rich morphosyntactic annotation.

To comment, suggest changes to the existing content, or propose new corpora for inclusion, send us an email.

This website was last updated on 21 September 2020.

Reference corpora in the CLARIN infrastructure

Corpus Language Description Availability

AbNC: Abkhaz National Corpus

Size: 10 million words
Annotation: MSD-tagged, lemmatized
Licence: CLARIN_PUB-BY-NC-ND

Abkhaz

This corpus includes Abkhaz texts published between 1920 and 2016. The corpus is encoded in TEI.

The corpus is available for online browsing through the Corpuscle concordancer (CLARINO distribution).

For a relevant publication, see Meurer (2018).

Concordancer

Bulgarian National Reference Corpus (BNRC)

Size: 70 million tokens
Annotation: tokenized, PoS-tagged
Licence: Individual terms of agreement

Bulgarian

This corpus includes Bulgarian texts taken from news media, literature, and administrative documents between 1997 and 2002.

The tokenised corpus is available through WebCLaRK, while the PoS-tagged version is available only upon request.

For a related publication, see Simov et al. (2004).

Concordancer

Croatian language corpus Riznica 0.1

Size: 101.8 million tokens, 85.3 million words, 4.7 million sentences, 14,781 texts
Annotation: sentence segmented, PoS-tagged, lemmatized
Licence: CC BY-NC-SA 4.0

Croatian

This corpus includes Croatian texts taken from fiction (28%) and specialised texts (72%).  

The corpus is available for online browsing via noSketch Engine and KonText and for download from the CLARIN.SI repository.

For a related publication, see Ćavar and Brozović Rončević (2012).

noSketch Engine

KonText

Download

Croatian National Corpus

Size: 101 million tokens

Croatian

This corpus includes Croatian texts taken from newspapers, magazines, popular texts, and fiction.

The corpus is available for online browsing through the noSketch Engine.

For a relevant publication, see Tadić (2002).

Concordancer

SYN2005: balanced corpus of written Czech

Size: 100 million words
Annotation: MSD-tagged, lemmatized
Licence: Czech National Corpus (Shuffled Corpus Data)

Czech

This corpus includes Czech texts published between 2000 and 2004. The corpus is encoded in XML.

The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository.

For a relevant publication, see Hnátková et al. (2014).

Concordancer

Download

SYN2010: balanced corpus of written Czech

Size: 100 million words
Annotation: MSD-tagged, lemmatized
Licence: Czech National Corpus (Shuffled Corpus Data)

Czech

This corpus includes Czech fiction, professional literature, newspapers etc. published between 2005 and 2009. The corpus is encoded in XML.

The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository.

For a relevant publication, see Hnátková et al. (2014).

Concordancer

Download

SYN2015: representative corpus of written Czech

Size: 100 million words
Annotation: MSD-tagged, lemmatized
Licence: Czech National Corpus (Shuffled Corpus Data)

Czech

This corpus includes Czech fiction, professional literature, newspapers etc. published between 2010 and 2014. The corpus is encoded in XML.

The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository.

For a relevant publication, see Hnátková et al. (2014).

Concordancer

Download

DK-CLARIN Reference Corpus of General Danish

Size: 45.1 million words
Annotation: PoS-tagged, sentence and paragraph segmentation, lemmatized
Licence: CLARIN ACA-NC

Danish

This corpus includes Danish texts published between 2008 and 2011.

The corpus is encoded in TEI. Non-linguistic metadata includes information on source and year of publication.

The corpus is available for download from the CLARIN-DK repository.

Download

SoNaR

Size: 500 million words
Annotation: PoS-tagged, lemmatized, named entities; coreference annotation and annotation of spatial and temporal relations for the manually annotated SoNaR-1 subset
Licence: Terms of Agreement

Dutch

This corpus includes representative Dutch texts (fiction, brochures, magazines, legal texts, newspapers, parliamentary proceedings, and computer-mediated communication).

Aside from written materials, the corpus also contains transcriptions of spoken language. The corpus is encoded in FoLiA.

The corpus is available for online browsing through the OpenSONAR concordancer and can be downloaded from the Dutch Language Institute (CLARIAH-NL).

Concordancer

Download

Corpus of Contemporary American English – Kielipankki version

Size: 440 million words; 190,000 texts
Annotation: PoS-tagged, lemmatized
Licence: CLARIN ACA (online version); CLARIN RES (downloadable version)

English (American)

This corpus includes American English texts evenly divided into the spoken, fiction, magazine, newspaper, and academic genres (around 88 million words each) published between 1990 and 2012.

The corpus is available for download from the Finnish Language Bank as well as for online browsing through the concordancer Korp (FIN-CLARIN distribution). 

Concordancer

Download

British National Corpus

Size: 100 million words
Annotation: PoS-tagged, lemmatized
Licence: BNC User Licence (restricted for the downloadable version)

English (British)

This corpus includes English texts (fiction, magazines, newspapers, and academic writing) published between 1980 and 1993.

The corpus is encoded in TEI. Non-linguistic metadata include contextual and bibliographic information. Aside from written materials, the corpus also includes transcriptions of spoken language.

The corpus is available for online browsing through a dedicated concordancer and can be downloaded from the Oxford Text Archive (CLARIN-UK).

Concordancer

Download

Estonian National Corpus 2019

Size: 1.5 billion words
Annotation: MSD-tagged, lemmatized
Licence: CC-BY-SA

Estonian

This corpus includes Estonian texts published between 1990 and 2019. Amongst others, this corpus contains the Estonian Reference Corpus as a subcorpus.

The corpus is available for download from META-SHARE (CELR distribution).

Download

Estonian Reference Corpus

Size: 175 million words
Annotation: MSD-tagged, lemmatized
Licence: free for non-commercial use

Estonian

This corpus includes Estonian texts (fiction, PhD theses, newspapers, magazines, parliamentary transcriptions, computer-mediated communication) published between 1990 and 2007. The corpus is encoded in TEI.

The corpus is available for online browsing through a dedicated concordancer and for download from CELR.

Concordancer

Download

GNC: Georgian National Corpus

Size: 217 million words
Annotation: MSD-tagged, lemmatized
Licence: CC-BY-NC, CLARIN_ACA-NC-LOC-PRIV-ND-*

Georgian (Old, Middle, Modern), Mingrelian, Svan

This corpus includes texts in languages spoken in Georgia, dating from 500 to 2013. The corpus is encoded in TEI XML.

The corpus is available for online browsing through a dedicated webpage.

For a relevant publication, see Meurer (2017).

Concordancer

DeReKo

Size: 31.7 billion words
Annotation: MSD-tagged, lemmatized
Licence: CC-BY-SA

German

This corpus includes German texts in a wide variety of genres published from 1947 onwards. Non-linguistic metadata include rich bibliographic information and partial layout information.

Part of the corpus is available for download from a dedicated webpage (CLARIN-D distribution), while the entire corpus can be queried online through the COSMAS II platform.

For a relevant publication, see Kupietz et al. (2018).

Concordancer

Download

Corpus of Greek Texts

Size: 27.6 million words
Annotation:
Licence: CC-BY-NC; ACA

Greek

This corpus includes representative Greek texts published between 1990 and 2010. Aside from written materials, the corpus also includes transcriptions of spoken language.

The corpus is available for online browsing through a dedicated concordancer.

For a relevant publication, see Goutsos (2010).

Concordancer

Diachronic corpus of Greek of the 20th century

Size: 20 million words
Annotation:
Licence: CC BY-NC

Greek

This corpus includes Greek texts published in the 20th century.

The corpus is available for download from CLARIN:EL.

Download

Hellenic National Corpus

Size: 47 million words
Annotation: sentence segmented
Licence: proprietary

Greek

This corpus includes Greek texts published from 1990 onwards.

The corpus is available for online browsing through a dedicated concordancer.

For a relevant publication, see Gavrilidou (2002).

Concordancer

Hungarian National Corpus

Size: 190 million tokens
Annotation: PoS-tagged

Hungarian

This corpus includes Hungarian texts (newspapers, literature, scientific articles, official and personal documents).

The corpus is available for online browsing through a dedicated concordancer.

For a relevant publication, see Váradi (2002).

Concordancer

The Icelandic Gigaword Corpus

Size: 1.3 billion words
Annotation: MSD-tagged, lemmatized
Licence: CC-BY and a special user licence

Icelandic

This corpus includes Icelandic texts (newspapers, parliamentary proceedings, adjudications, fiction and non-fiction) published until 2017.

The corpus is encoded in TEI. Non-linguistic metadata include bibliographic information. Aside from written materials, the corpus also contains transcriptions of spoken language.

The corpus is available for online browsing and download through CLARIN-IS (in two subsets, each with its own licence).

For a relevant publication, see Steingrímsson et al. (2018).

Concordancer

Download subset 1

Download subset 2

Corpus of the Contemporary Lithuanian Language

Size: 208.4 million tokens
Annotation: MSD-tagged, lemmatized
Licence: CLARIN RES

Lithuanian

This corpus includes Lithuanian texts (mostly newspapers but also fiction, non-fiction, and specialised magazines) published between 1990 and 2008.

The corpus is encoded in TEI. Non-linguistic metadata includes bibliographic information. Aside from written materials, the corpus also contains transcriptions of spoken language.

The corpus is available for online browsing through a dedicated concordancer.

Concordancer

The Lexicographic Corpus for Norwegian Bokmål (LBK)

Size: 100 million tokens
Annotation: PoS-tagged, lemmatized
Licence: CLARIN_ACA-NC-LOC-ND

Norwegian (Bokmål)

This corpus includes representative Norwegian (Bokmål) texts (newspapers and periodicals, non-fiction, fiction, TV subtitles, and small print) published between 1985 and 2013.

The corpus is available for online browsing through the concordancer Glossa (CLARINO).

For a relevant publication, see Lain Knudsen and Vatvedt Fjeld (2013).

Concordancer

Norsk Ordboks Nynorskkorpus (NNK)

Size: 107.8 million words
Annotation: MSD-tagged, lemmatized
Licence: CLARIN_RES-NC-DEP

Norwegian (Nynorsk)

This corpus includes representative Norwegian (Nynorsk) texts published between 1866 and 2012. The corpus is encoded in XML.

The corpus is available for online browsing through the Corpuscle concordancer (CLARINO).

Concordancer

National Corpus of Polish

Size: 1.8 billion tokens
Annotation: MSD-tagged, lemmatized

Polish

This is a written and spoken corpus that includes representative Polish texts published between 1945 and 2010.

The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author. Aside from written materials, the corpus also includes transcriptions of spoken language.

The corpus is available for online browsing through a dedicated concordancer.

For a relevant publication, see Przepiórkowski et al. (2012).

Concordancer

PAROLE Portuguese Corpus

Size: 3 million words
Annotation: MSD-tagged, manually disambiguated
Licence: ELRA

Portuguese

This corpus includes Portuguese texts (newspapers, books, periodicals, and miscellaneous texts) published between 1996 and 1997. The corpus is encoded in the PAROLE format.

The corpus is available for download from the ELRA catalogue.

Download

Written corpus ccGigafida 1.0

Size: 126.9 million tokens, 103.2 million words, 31,722 texts
Annotation: MSD-tagged, lemmatized
Licence: CC-BY-NC-SA 4.0

Slovenian

This corpus includes representative Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author.

This corpus is a downloadable subset of the representative Gigafida corpus (version 1). It can be downloaded from the CLARIN.SI repository.

For a relevant publication, see Erjavec and Logar Berginc (2012).

Download

Written corpus ccKres 1.0

Size: 12.2 million tokens, 9.8 million words
Annotation: MSD-tagged, lemmatized
Licence: CC-BY

Slovenian

This corpus includes balanced Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author.

This corpus is a downloadable subset of the balanced Kres corpus. It can be downloaded from the CLARIN.SI repository.

For a relevant publication, see Erjavec and Logar Berginc (2012).

Download

Written corpus Gigafida 2.0

Size: 1.3 billion tokens, 1.1 billion words, 38,310 texts
Annotation: MSD-tagged, lemmatized
Licence: Individual terms of agreement

Slovenian

This corpus includes representative Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2018. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author.

The corpus is available for online browsing through the noSketch Engine concordancer (CLARIN.SI distribution), as well as through a dedicated search engine.

For a relevant publication, see Krek et al. (2016).

noSketch Engine

Concordancer

Written corpus Kres 1.0

Size: 99 million words
Annotation: MSD-tagged, lemmatized
Licence: Individual terms of agreement

Slovenian

This corpus includes balanced Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011.

This corpus is a balanced subset of the representative Gigafida corpus (version 1). The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author.

The corpus is available for online browsing through a dedicated concordancer.

For a relevant publication, see Krek et al. (2016).

Concordancer

Publications

[Ćavar and Brozović Rončević 2012] Damir Ćavar and Dunja Brozović Rončević. 2012. Riznica: the Croatian Language Corpus. Prace filologiczne, 63: 51–65. 

[Erjavec and Logar Berginc 2012] Tomaž Erjavec and Nataša Logar Berginc. 2012. Referenčni korpusi slovenskega jezika (cc)Gigafida in (cc)KRES. In Zbornik Osme konference Jezikovne tehnologije, 57–62.

[Gavrilidou 2002] Maria Gavrilidou. 2002. The Hellenic National Corpus on-line. Revue belge de Philologie et d'Histoire, 80 (3): 1003–1015. 

[Goutsos 2010] Dionysis Goutsos. 2010. The Corpus of Greek Texts: a reference corpus for Modern Greek. Corpora, 5 (1): 29–44. 

[Hnátková et al. 2014] Milena Hnátková, Michal Křen, Pavel Procházka, and Hana Skoumalová. 2014. The SYN-series corpora of written Czech. In Proceedings of LREC 2014, 160–164.

[Krek et al. 2016] Simon Krek, Polona Gantar, Špela Arhar Holdt, and Vojko Gorjanc. 2016. In Proceedings of the Conference on Language Technologies and Digital Humanities, 200–202.

[Kupietz et al. 2018] Marc Kupietz, Harald Lüngen, Pawel Kamocki, and Andreas Witt. 2018. The German Reference Corpus DeReKo: New Developments – New Opportunities. In Proceedings of LREC 2018, 4353–4360.

[Lain Knudsen and Vatvedt Fjeld 2013] Rune Lain Knudsen and Ruth Vatvedt Fjeld. 2013. LBK2013: A balanced, annotated national corpus for Norwegian Bokmål. In Proceedings of the workshop on lexical semantic resources for NLP at NODALIDA 2013, 12–20.

[Leech 2002] Geoffrey Leech. 2002. The Importance of Reference Corpora.

[Meurer 2017] Paul Meurer. 2017. The Morphosyntactic Analysis of Georgian.

[Meurer 2018] Paul Meurer. 2018. The Abkhaz National Corpus. In Proceedings of LREC 2018, 2456–2460.

[Przepiórkowski et al. 2012] Adam Przepiórkowski, Mirosław Bańko, Rafał L. Górski, and Barbara Lewandowska-Tomaszczyk, editors. 2012. Narodowy Korpus Języka Polskiego.

[Simov et al. 2004] Kiril Simov, Petya Osenova, Sia Kolkovska, Elisaveta Balabanova, and Dimitar Doikoff. 2004. A Language Resources Infrastructure for Bulgarian. In Proceedings of LREC 2004, 1685–1688.

[Steingrímsson et al. 2018] Steinþór Steingrímsson, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson, and Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus. In Proceedings of LREC 2018, 4361–4366.

[Tadić 2002] Marko Tadić. 2002. Building the Croatian National Corpus. In Proceedings of LREC 2002, 441–446.

[Váradi 2002] Tamás Váradi. 2002. The Hungarian National Corpus. In Proceedings of LREC 2002, 385–389.