Tour de CLARIN: The Database of Modern Icelandic Inflection

Submitted by Jakob Lenardič on 4 December 2020

Written by Kristín Bjarnadóttir and Eiríkur Rögnvaldsson

The Database of Modern Icelandic Inflection (DMII) contains inflectional paradigms and has a vocabulary of 300,000 lemmas with approximately 6.5 million inflectional forms. Uninflected words are also included. Downloadable data for use in language technologies is available from the CLARIN-IS repository under the CC BY-SA 4.0 licence. The download package contains four versions of the data in the CSV format, ranging from a simple list of word forms to data linking lemmas and inflectional forms to grammatical tags and usage information. A detailed description of the data is available on the DMII website.
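As a rough illustration of how the richer CSV versions might be consumed, the sketch below groups inflectional forms by lemma and looks up which lemmas a given word form belongs to. The column names (`lemma`, `word_class`, `form`, `tag`), the semicolon delimiter, the tag strings, and the helper names are assumptions made for this example only; the actual file layout and tag set are those described on the DMII website.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows: the header, delimiter, and tag strings are
# illustrative assumptions, not the documented DMII file layout.
SAMPLE_CSV = """lemma;word_class;form;tag
tölva;noun;tölva;nom.sg
tölva;noun;tölvu;acc.sg
tölva;noun;tölvu;dat.sg
tölva;noun;tölvu;gen.sg
"""


def load_paradigms(text, delimiter=";"):
    """Group (form, tag) pairs by lemma, preserving row order."""
    paradigms = defaultdict(list)
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    for row in reader:
        paradigms[row["lemma"]].append((row["form"], row["tag"]))
    return dict(paradigms)


def lemmas_for_form(paradigms, form):
    """Return the sorted lemmas whose paradigm contains the given word form."""
    return sorted(
        lemma
        for lemma, entries in paradigms.items()
        if any(entry_form == form for entry_form, _ in entries)
    )


paradigms = load_paradigms(SAMPLE_CSV)
print(paradigms["tölva"])
print(lemmas_for_form(paradigms, "tölvu"))
```

With a real download, the in-memory sample would be replaced by an opened file (with UTF-8 encoding), and the column names adjusted to match the chosen version of the data.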

The DMII was originally created as part of the initiative to provide Icelandic language technologies at the start of the millennium. The first version of the DMII was a set of XML files with 173,389 paradigms, made available on CDs for use in language technology (LT) in 2004 (Bjarnadóttir 2012), and individual paradigms have been accessible on the DMII website since the same year. Extensively used by the Icelandic public as a reference on inflection, the website has been very popular from the start – more than 298,000 users viewed over 5.6 million pages in the year starting September 1, 2019.

 
The online version of the DMII, showing the inflection of the word tölva (“computer”).

The DMII Core is a subset of the DMII that contains the core vocabulary of current Icelandic, i.e., common non-domain-specific words, and a selection of Icelandic named entities, i.e., personal names, common place names, and a few names of important institutions. The vocabulary contains approximately 58,000 words. Its sources are The Dictionary of Modern Icelandic (Íslensk nútímamálsorðabók), containing approximately 50,000 headwords, with additions from the 50,000 most frequent words (lemmas) of the Gigaword Corpus (Risamálheild). The DMII Core was created for use in third-party publications and is accessible through a RESTful API that is open to everyone. The API allows users to send simple queries and receive full paradigms in JSON format as a response.
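The query-and-response pattern described above can be sketched as follows. This is a minimal illustration only: the base URL, the `/word/` path, and the JSON field names (`lemma`, `word_class`, `forms`, `form`, `tag`) are hypothetical placeholders rather than the actual DMII Core API, whose endpoints and response schema should be taken from the DMII website. The sketch makes no network call; it builds a query URL and parses a hand-written sample response.

```python
import json
from urllib.parse import quote

# Hypothetical base URL, used only so the example is self-contained.
BASE_URL = "https://example.org/dmii-core/api"


def build_query_url(word, base_url=BASE_URL):
    """Percent-encode the word (UTF-8) and append it to an assumed lookup path."""
    return f"{base_url}/word/{quote(word)}"


# Hand-written stand-in for a JSON response; the field names are assumptions.
SAMPLE_RESPONSE = json.dumps(
    [
        {
            "lemma": "tölva",
            "word_class": "noun",
            "forms": [
                {"form": "tölva", "tag": "nom.sg"},
                {"form": "tölvu", "tag": "acc.sg"},
            ],
        }
    ],
    ensure_ascii=False,
)


def forms_by_tag(response_text):
    """Map each lemma in the response to a {tag: form} dictionary."""
    entries = json.loads(response_text)
    return {
        entry["lemma"]: {item["tag"]: item["form"] for item in entry["forms"]}
        for entry in entries
    }


print(build_query_url("tölva"))
print(forms_by_tag(SAMPLE_RESPONSE))
```

Percent-encoding matters here because Icelandic lemmas routinely contain non-ASCII letters such as ö, þ, and ð, which cannot appear unescaped in a URL path.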

The DMII has been used in a number of different language technology projects. It has proven useful in increasing the accuracy of PoS tagging (Loftsson et al. 2011, Steingrímsson et al. 2019); in the post-processing of OCR texts (Daðason et al. 2014); in linking lexicographic resources (Bjarnadóttir 2016); in developing a high-accuracy lemmatizer for Icelandic (Ingólfsdóttir et al. 2019); in developing a context-free grammar for Icelandic (Þorsteinsson et al. 2019); and more – see the list on the project website.

Furthermore, the DMII is currently being used to develop The Database of Icelandic Morphology (DIM; Bjarnadóttir et al. 2019), which is a multipurpose linguistic resource. Whereas the DMII is descriptive, the DIM is partly prescriptive, i.e., the “correctness” of both words and inflectional forms is marked in accordance with accepted rules of usage. This greatly widens the scope of applications using the data, from the purely analytical possibilities of the DMII (used, for example, in search engines, PoS tagging, and named-entity recognition) to the productive possibilities of the DIM, such as the correction and formulation of text. The analysis has recently been extended to include genre, style, domain, age, and various grammatical features. Work on error analysis of sub-standard forms is in progress, as is work on an analysis of word formation, including links from all constituents to lemmas in the DMII. More components of the DIM will be added to the CLARIN-IS repository as soon as they are finalized.

References

Bjarnadóttir, K. 2012. The Database of Modern Icelandic Inflection. In Proceedings of Language Technology for Normalization of Less-Resourced Languages, workshop at LREC 2012, 13–18.

Bjarnadóttir, K. 2016. The Case for Normalization: Linking Lexicographic Resources for Icelandic. In Nordiske Studier i Leksikografi, 79–88.

Bjarnadóttir, K., Hlynsdóttir, K.I., and Steingrímsson, S. 2019. DIM: The Database of Icelandic Morphology. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, 146–154.

Daðason, J.F., Bjarnadóttir, K., and Rúnarsson, K. 2014. The Journal Fjölnir for Everyone: The Post-Processing of Historical OCR Texts. In Proceedings of Language Resources and Technologies for Processing and Linking Historical Documents and Archives – Deploying Linked Open Data in Cultural Heritage, workshop at LREC 2014, 56–62.

Ingólfsdóttir, S.L., Loftsson, H., Daðason, J.F., and Bjarnadóttir, K. 2019. Nefnir: A high accuracy lemmatizer for Icelandic. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, 310–315.

Loftsson, H., Helgadóttir, S., and Rögnvaldsson, E. 2011. Using a Morphological Database to Increase the Accuracy in POS Tagging. In Proceedings of Recent Advances in Natural Language Processing, 49–55.

Steingrímsson, S., Kárason, Ö., and Loftsson, H. 2019. Augmenting a BILSTM tagger with a Morphological Lexicon and a Lexical Category Identification Step. In Proceedings of Recent Advances in Natural Language Processing, 1161–1168.

Þorsteinsson, V., Óladóttir, H., and Loftsson, H. 2019. A Wide-Coverage Context-Free Grammar for Icelandic and an Accompanying Parsing System. In Proceedings of Recent Advances in Natural Language Processing, 1397–1404.