Tour de CLARIN: Iceland

Submitted by Jakob Lenardič on 20 November 2020

Written by Eiríkur Rögnvaldsson

Iceland first joined CLARIN as an observer in November 2018. After the Icelandic Parliament passed a new law on European Research Infrastructure Consortia in 2019, Iceland was able to apply for full membership, which was approved in February 2020. The membership agreement was signed on March 10, 2020. The Icelandic consortium is led by the Árni Magnússon Institute for Icelandic Studies, and Professor Emeritus Eiríkur Rögnvaldsson is the National Coordinator.

The following institutions participate in the CLARIN-IS consortium and have signed a memorandum of partnership:

The CLARIN-IS office is based in Reykjavík, where it shares premises with the Árni Magnússon Institute’s Language Technology Unit, with which it cooperates closely. The staff consists of the National Coordinator, who works part-time, and a computer scientist, Samúel Þórisson, who works full-time.

In the first year of our CLARIN ERIC membership, the main tasks of CLARIN-IS have been to establish a National Consortium and to build a Metadata Providing Centre (CLARIN C-Centre) which hosts metadata for Icelandic language resources and makes them available through the Virtual Language Observatory. Furthermore, we have now established a repository which already hosts a number of tools and resources.

In connection with Iceland’s participation in the META-NORD project from 2011 to 2013, the Árni Magnússon Institute established a local website, Málföng (“Language Resources”), where the institute’s language resources and tools were stored and made accessible. The website also contains links to several resources and tools owned by others. Most of the institute’s tools and resources have now been made available through the CLARIN-IS website, and we are in the process of preparing them for archiving in our repository by adapting them to standards, writing metadata, and so on.

A number of our resources are already widely used, both by researchers and by the general public. For instance, the Database of Modern Icelandic Inflection (DMII) is a multipurpose linguistic resource which contains inflectional paradigms, with a vocabulary of 300,000 lemmas and 6.5 million inflectional forms. The online version is very popular among the general public. The Saga Corpus contains 49 Old Icelandic narrative texts, approximately 1.7 million words in total. The spelling has been normalized to Modern Icelandic spelling, and some inflectional endings have been changed to their Modern Icelandic forms. The Icelandic Gigaword Corpus (IGC) is a tagged and lemmatized corpus of Modern Icelandic containing approximately 1,550 million running words of text. Each running word is accompanied by a morphosyntactic tag and a lemma, and each text is accompanied by bibliographic information. Both the Saga Corpus and the Gigaword Corpus can be queried online using the Korp corpus tool developed by Språkbanken in Gothenburg.

We expect our repository to expand considerably in the coming years, especially with resources and tools developed within the Icelandic National Language Technology Programme, which started in 2019 and will run for five years. The Ministry of Education, Science and Culture, which funds the programme, requires that all its deliverables be submitted to CLARIN-IS and made accessible under maximally open licences. Thus, one of our main tasks in the next few years will be to validate these tools and resources, archive them, and make them openly available.

CLARIN-IS staff – Eiríkur Rögnvaldsson (right) and Samúel Þórisson (left)