Resource from The Netherlands: The SoNaR reference corpus of Dutch

SoNaR (for a detailed description, see VLO) is a reference corpus of standard written Dutch. It comprises contemporary texts ranging from printed media such as books and periodicals to computer-mediated communication such as chats and tweets from the Netherlands and Flanders, the Dutch-speaking part of Belgium (SoNaR New Media). It is the result of the STEVIN project, which involved major universities in the Netherlands and Flanders (a full list can be found in the extended documentation of the corpus). The aim was to create a corpus of the contemporary written language, originally intended primarily for language and speech technology researchers and developers. The CLARIN-NL and CLARIAH projects later made it accessible and usable for humanities researchers by providing a web application with an interface dedicated to them.

SoNaR consists of two main subcorpora – SoNaR-1 and SoNaR-500. In addition, there is the SoNaR New Media Corpus.

SoNaR-1 contains 1 million tokens and is very richly annotated, especially at the semantic layers, which consist of named-entity labelling, annotation of co-reference relations, semantic role labelling and annotation of spatial and temporal relations. All of its annotations have been manually verified. As one of its pivotal subparts, SoNaR-1 includes the Dutch Parallel Corpus (for a detailed description, see VLO), a sentence-aligned parallel corpus of English, Dutch and French. The larger subcorpus, SoNaR-500, contains 500 million tokens of full texts. The texts in SoNaR-500 have been tokenised, tagged for part of speech and lemmatised, but without manual verification.

The SoNaR New Media corpus contains approximately 35 million words and consists of tweets, chats and SMS messages. All texts have been automatically tokenised, tagged for part of speech and lemmatised.

To provide easy access to the corpus, CLARIN-NL and CLARIAH-NL have developed the OpenSoNaR (for a detailed description, see VLO) search environment. OpenSoNaR, whose frontend is named WhiteLab and whose backend is named BlackLab, is a state-of-the-art concordancer that provides two primary interfaces, which can be used by laypeople and specialist researchers alike. In the Exploration interface (Figure 1), a researcher can look into the corpus distribution, see the statistical information of the subcorpora and retrieve n-grams. Through the Search interface, four search options are available:

  • simple, which limits the search to words only;
  • extended, which enables the researcher to query the corpus by either word form or lemma, set the part of speech and choose among semantic metadata filters (Figure 2);
  • advanced, which allows users to further specify the lemma or word forms that they’re interested in; and
  • expert, which provides an input for CQL commands.
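
To give an impression of what an expert-mode query looks like, the following minimal Python sketch builds CQL query strings from token-level constraints. It only assembles text to paste into the expert input and does not contact OpenSoNaR. The helper names cql_token and cql_sequence are hypothetical, and the attribute names (lemma, pos) and the tag pattern "ADJ.*" are assumptions about how the corpus is indexed, so they should be checked against the OpenSoNaR documentation.

```python
def cql_token(**constraints):
    """Build one CQL token pattern, e.g. [lemma="paard"].

    Attribute names such as 'lemma' and 'pos' are assumptions about
    the corpus index, not confirmed by the blog post.
    """
    if not constraints:
        return "[]"  # in CQL, an empty pattern matches any single token
    parts = []
    for attr, value in constraints.items():
        escaped = value.replace('"', '\\"')  # keep embedded quotes from ending the string
        parts.append(f'{attr}="{escaped}"')
    return "[" + " & ".join(parts) + "]"


def cql_sequence(*tokens):
    """Join token patterns into a query for consecutive tokens."""
    return " ".join(tokens)


# Hypothetical query: an adjective directly followed by a form of "paard".
query = cql_sequence(cql_token(pos="ADJ.*"), cql_token(lemma="paard"))
print(query)
```

The same lemma search shown for the extended interface could thus be written in expert mode as a single token pattern, with further constraints added per token as needed.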

The OpenSoNaR environment also stores previous search results, which lets researchers compare the temporary subcorpora that they have created during a single search session (Figure 3).

An initial version of a successor to OpenSoNaR, called OpenSoNaR+, was developed in 2015. A new and upgraded version of OpenSoNaR+ is expected to be released in early 2018.

Figure 1: The Exploration interface showing the distribution of the subcorpora within SoNaR. The highlighted box shows that tweets make up 0.03% of SoNaR.

Figure 2: The "extended" search interface. A search is being performed for the lemma "paard" (English: horse), while the drop-down menu shows metadata filters.

Figure 3: The "Results" tab shows that OpenSoNaR stores the results of previous queries.

Blog post written by Darja Fišer and Jakob Lenardič 

