Tour de CLARIN: Resource from DLU/Flanders - Corpus of Contemporary Dutch

Submitted by karolina@clarin.eu on 27 March 2018

Blog post written by Griet Depoorter, Katrien Depuydt and Hans Westgeest

The Corpus of Contemporary Dutch (Corpus Hedendaags Nederlands - CHN) is a collection of more than 800,000 texts from sources such as newspapers, magazines, news broadcasts, legal writings and books, covering the period from 1814 to 2013.

Since 1994, the Institute for Dutch Lexicology (now the Dutch Language Institute) has made several corpora of contemporary Dutch available online: the 5, 27 and 38 Million Word Corpora and the Dutch Parole Corpus 2004. The material from these older corpora was merged, and a considerable amount of more recent material was added from the Dutch newspaper NRC Handelsblad and the Flemish newspaper De Standaard. Further sources were added from Suriname and the Netherlands Antilles (where Dutch is also an official language), including newspapers, material published on the internet (blogs, websites) and books written by Surinamese authors. The resulting collection became the Corpus of Contemporary Dutch, which serves as the first step towards a monitor corpus for contemporary Dutch.

The corpus contains approximately 440 million tokens:

224 million Dutch from the Netherlands
185 million Belgian Dutch (Flemish)
14.4 million Dutch as spoken in the Antilles
18.3 million Surinamese Dutch
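
As a quick check on these figures, the four subtotals add up to 441.7 million tokens, which is consistent with the rounded total of roughly 440 million. The short Python sketch below recomputes the sum and each variety's share of the corpus. The dictionary keys are labels chosen here for illustration only; the numbers are the ones listed above.

```python
# Token counts per language variety, in millions, as listed in the post.
# The keys are illustrative labels, not identifiers used by the corpus.
tokens_millions = {
    "Netherlands": 224.0,
    "Belgium": 185.0,
    "Antilles": 14.4,
    "Suriname": 18.3,
}

# Total size of the corpus in millions of tokens.
total = sum(tokens_millions.values())

# Share of each variety as a percentage of the total.
shares = {name: 100.0 * count / total for name, count in tokens_millions.items()}

print(f"total: {round(total, 1)} million tokens")
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```

Dutch from the Netherlands and from Belgium together make up well over 90% of the tokens, so results for the Surinamese and Antillean varieties rest on much smaller samples.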

The corpus has been lemmatized and POS-tagged. The main POS tags, which are documented in an INL working paper, are the following:

AA*: adjective or adverb
ADP*: preposition
ADV*: adverb
CONJ*: conjunction
INT: interjection
NOU-C*: common noun
NOU-P*: proper noun
NUM: numeral
PD*: pronoun, determiner, article
RES*: residual categories (abbreviation, formula, symbol, truncated word, unknown)
VRB*: verb

The CHN can be searched via a simple search interface or via CQL (Corpus Query Language). Users can search for or filter by five criteria: title, author, year of publication, medium and language variety. The possible values for the last criterion are NN (Dutch from the Netherlands), BN (Dutch from Belgium), SN (Dutch from Suriname) and AN (Dutch from the Netherlands Antilles).
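
To give an idea of what such a query looks like, here is a minimal Python sketch that builds a CQL token pattern and a matching set of metadata filters. The bracketed `[attribute="value"]` form is standard CQL, and attribute values are treated as regular expressions. Everything else is an assumption made for illustration: the attribute names `lemma` and `pos`, the helper functions, and the filter keys may not match the names the CHN interface actually uses, and the sketch does not escape special characters in values.

```python
# Language variety codes as listed in the post.
VARIETIES = {
    "NN": "Dutch from the Netherlands",
    "BN": "Dutch from Belgium",
    "SN": "Dutch from Suriname",
    "AN": "Dutch from the Netherlands Antilles",
}


def cql_token(**attrs):
    """Build a single-token CQL pattern, e.g. [lemma="gezellig"].

    Attribute names are passed through unchanged; whether the corpus
    calls them 'lemma' and 'pos' is an assumption of this sketch.
    """
    parts = [f'{name}="{value}"' for name, value in attrs.items()]
    return "[" + " & ".join(parts) + "]"


def metadata_filter(variety, year_from, year_to, medium=None):
    """Collect metadata criteria in a plain dict (hypothetical key names)."""
    if variety not in VARIETIES:
        raise ValueError(f"unknown language variety: {variety!r}")
    if year_from > year_to:
        raise ValueError("year_from must not be later than year_to")
    criteria = {
        "language_variety": variety,
        "year_from": year_from,
        "year_to": year_to,
    }
    if medium is not None:
        criteria["medium"] = medium
    return criteria


# All word forms of the lemma "gezellig" in Flemish newspapers, 2000 to 2010.
query = cql_token(lemma="gezellig")
filters = metadata_filter("BN", 2000, 2010, medium="newspaper")
print(query)
print(filters)
```

A pattern can combine several attributes, for instance `cql_token(lemma="gezellig", pos="AA.*")`, where the regular expression `AA.*` is one plausible way to cover all tags that start with AA.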

The software powering the CHN website was developed at the Dutch Language Institute during the course of the IMPACT and CLARIN projects and the corpus search is powered by BlackLab.

Image 1: searching for all word forms of the lemma “gezellig” (cosy) in Flemish newspapers from 2000 to 2010.

Image 2: some occurrences of the lemma “gezellig”

A great advantage of the Corpus of Contemporary Dutch is that it continues to grow. In the course of this year, a significant amount of new data will be added, including newspaper data from the period between 2014 and 2017.

The corpus data have already been used successfully in linguistic research. Jaspers et al. (2015) used CHN data to study the syntactic and semantic characteristics of Dutch scalar modifiers denoting small degrees (like few in English), while Devos (2016) used the corpus to investigate a special category of Dutch infinitival phrases that act as obligatory modifiers of nominal predicates and add a causative meaning to the clause. The corpus has also served as the main source of data for a number of student theses. For example, Saskia Lubrun at the University of Leiden researched the collocational properties of the Dutch subjunctive, while Wanda Polak at the University of Amsterdam investigated the phonological contexts of several Dutch suffixes.
