Tour de CLARIN: The Nordic Dialect Corpus

Submitted by Jakob Lenardič on

Blog post written by Janne Bondi Johannessen, Kristin Hagen, Anders Nøklestad, and Joel Priestley, edited by Darja Fišer, Elisa Gorgaini, and Jakob Lenardič


The Nordic Dialect Corpus (NDC) is a speech corpus available at the CLARINO Text Laboratory Centre.

NDC is a corpus of spoken Danish, Faroese, Icelandic, Norwegian and Swedish (including Övdalian). It consists of spontaneous speech data from dialects of the North Germanic languages across all Nordic countries. The recordings come from various sources, and their transcriptions are part-of-speech tagged. The Danish, Icelandic and Norwegian recordings were collected for the project with the financial support of the individual national research councils; the Faroese recordings were made by a cross-Scandinavian research project, while the Swedish recordings were mainly donated to the corpus by SweDia 2000, an earlier dialect collection project.

Country                  Informants  Places  Tokens
Denmark                  81          15      220,360
Faroe                    20          5       64,803
Iceland                  48          8       94,338
Norway                   438         111     1,997,920
Sweden (incl. Övdalian)  150         44      376,868

Table 1: The recordings and transcriptions in NDC
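As a quick arithmetic check, the columns of Table 1 can be summed with a few lines of Python. This is only an illustrative sketch: the figures are copied from the table, while the names `TABLE_1`, `totals` and `token_share` are made up for this example and are not part of NDC or Glossa.

```python
# Figures copied from Table 1: (informants, places, tokens) per country.
# The variable and function names are illustrative assumptions only.
TABLE_1 = {
    "Denmark": (81, 15, 220_360),
    "Faroe": (20, 5, 64_803),
    "Iceland": (48, 8, 94_338),
    "Norway": (438, 111, 1_997_920),
    "Sweden (incl. Övdalian)": (150, 44, 376_868),
}


def totals(table):
    """Sum each column (informants, places, tokens) across all countries."""
    informants = sum(row[0] for row in table.values())
    places = sum(row[1] for row in table.values())
    tokens = sum(row[2] for row in table.values())
    return informants, places, tokens


def token_share(table, country):
    """Return one country's share of all tokens, as a percentage."""
    return 100 * table[country][2] / totals(table)[2]


informants, places, tokens = totals(TABLE_1)
print(f"{informants} informants, {places} places, {tokens:,} tokens")
print(f"Norway: {token_share(TABLE_1, 'Norway'):.1f}% of tokens")
```

The sums come to 737 informants, 183 places and 2,754,289 tokens, with the Norwegian part accounting for roughly 73% of all tokens.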

NDC contains over 2.75 million words from conversations and interviews recorded between 1998 and 2015 (see Table 1). Older Norwegian recordings can be found in the corpus LIA Norwegian - Corpus of Old Dialect Recordings, also available through the CLARINO Text Laboratory Centre. The informants whose recorded speech constitutes NDC were instructed to avoid interpersonal topics because of information privacy laws. The conversations therefore mostly cover topics such as holidays, school and life in the olden days, with only factual, non-private mentions of individuals, while sensitive topics such as politics are not addressed.

Since the recordings in NDC are classified as personal data, the corpus is accessible only via Glossa, a search tool developed at the Text Laboratory and reimplemented in the CLARINO project that was featured in a previous Tour de CLARIN post. Users can log in through eduGAIN, CLARIN or Feide (for Norwegian users).



Figure 1: A Simple search in the Norwegian part of the dialect corpus for ikke (“not”) returns 29,100 matches. The Norwegian and Övdalian parts of the corpus have two aligned transcriptions, one phonetic and one orthographic. Both are included in the concordance view shown in the figure. (The other languages have one orthographic transcription for each recording.)

 


Figure 2: The map view can show the distribution of the phonetic variants found by a search, here all the variants of the Norwegian word ikke (“not”). The variants are listed above the map. By clicking on a colour and then on a variant, you can see the distribution of the chosen variant on the map. In the figure above, the dialectal variants ‘ennte’, ‘ente’, ‘nnte’, and ‘nte’ are shown in yellow.

 


Figure 3: The search interface of the corpus is easy to use, with user-friendly buttons and menus, yet it still supports advanced searches. The search shown above is an Extended search for words that start with n- in phonetic form and whose orthographic word form is ikke. The orthographic form is chosen by clicking on the menu button to the left of the search word and filling in the box at the bottom, see Figure 4.
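The logic of the Extended search described above can be sketched in a few lines of Python. This is not how Glossa is implemented and it does not use any Glossa API; it only shows the idea of filtering on two aligned transcription layers at once. The `Token` type, the `extended_search` function and the sample list are assumptions made up for this sketch. The four variants of ikke are the ones named in the Figure 2 caption, and the last token is invented to show a word that starts with n- but is filtered out.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One word with its two aligned transcriptions (hypothetical structure)."""
    phonetic: str      # dialect-near transcription
    orthographic: str  # standard spelling


# Tiny made-up sample, not real corpus output.
SAMPLE = [
    Token("ennte", "ikke"),
    Token("ente", "ikke"),
    Token("nnte", "ikke"),
    Token("nte", "ikke"),
    Token("nå", "nå"),  # invented: starts with n- but is not "ikke"
]


def extended_search(tokens, phonetic_prefix, orthographic):
    """Keep tokens whose phonetic form starts with the given prefix
    and whose orthographic form equals the given word."""
    return [
        t for t in tokens
        if t.phonetic.startswith(phonetic_prefix)
        and t.orthographic == orthographic
    ]


hits = extended_search(SAMPLE, "n", "ikke")
print([t.phonetic for t in hits])  # prints ['nnte', 'nte']
```

Both conditions have to hold for the same token, which is why the invented nå entry is excluded even though its phonetic form starts with n-.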

 


Figure 4: The part-of-speech menu in NDC, with the specify or exclude box at the bottom. The transcriptions for all languages in NDC are morphologically tagged and searchable.

 

NDC has been used frequently in a wide range of research areas, such as phonology, morphology, syntax and lexicography. Janne Bondi Johannessen and Kristin Hagen (2014) edited the volume Språk i Norge og nabolanda: Ny forskning om talespråk (“Language in Norway and Neighbouring Countries: New Research on Spoken Language”), which presents linguistic research carried out on the basis of the corpus. Its topics include the syntactic structure of adjectival complements in Norwegian dialects, the use of possessive markers, the order of particles and objects in Norwegian and Swedish dialects, and the syntax of questions in Norwegian dialects.

The open-access Nordic Atlas of Language Structures (NALS) Journal actively encourages linguists to publish empirical research on geographical linguistic variation observed in data from NDC and from other dialect resources. For instance, Vangsnes (2014) used NDC to explore variation in the syntax of noun phrases, focusing on the marking of definiteness and the syntactic features of possessors, while Garbacz (2014) explored the non-standard use of definite articles in indefinite contexts in Swedish, Fenno-Swedish and Norwegian dialects. Finally, NDC has been used as the empirical basis for a number of MA and PhD theses at universities in the Nordic countries.

References

Garbacz, Piotr. 2014. Definite articles in indefinite contexts. The Nordic Atlas of Language Structures (NALS) Journal 1 (1): 87–93. https://doi.org/10.5617/nals.5369.

Johannessen, Janne Bondi and Kristin Hagen (eds.). 2014. Språk i Norge og nabolanda. Ny forskning om talespråk. Oslo: Novus forlag. https://novus.mamutweb.com/Shop/Product/JohannessenHagen-(red)-Språk-i-Norge-og-nabolanda/102627.

Vangsnes, Øystein A. 2014. Noun Phrases. The Nordic Atlas of Language Structures (NALS) Journal 1 (1): 4–9. https://doi.org/10.5617/nals.5361.
