Tour de CLARIN: The CLARIN-LV B-Centre

Although Latvia has been an active participant in CLARIN activities since its inception, the CLARIN-LV repository was only set up in 2020. After registering the first set of language resources and tools, we applied for B-centre status. The assessment procedure to become a CLARIN B-centre was completed in January 2023.

The CLARIN-LV B-centre is hosted by the Institute of Mathematics and Computer Science (Artificial Intelligence Laboratory) at the University of Latvia (IMCS UL). The centre aims to implement the CLARIN mission in Latvia by collecting, documenting, curating and providing easy and sustainable long-term access to the digital language data and tools developed by the Latvian research community for a wide group of users, including humanities and social science scholars, language technology developers and the general public. While open to submissions of data in any language, CLARIN-LV aims to support the preservation of and access to Latvian, Latgalian and Livonian digital language resources and tools.

The CLARIN-LV consortium consists of seven partners: the Institute of Literature, Folklore and Art at the University of Latvia, Rīga Stradiņš University, Liepaja University, Rezekne Academy of Technologies, the National Library of Latvia, and IMCS UL. Besides submissions from the consortium partners, the repository also registers language resources and tools from organisations outside the consortium.

The CLARIN-LV repository is still growing. It contains information about more than 50 language resources and tools, including 37 corpora, 14 lexical resources and three tools. Twenty-six language resources are downloadable, while the others are open for browsing. A small number of language resources have an academic license due to IPR restrictions. We actively cooperate with the digitalhumanities.lv initiative and with several research and development projects, both to widen the content of the repository and to support digital humanities researchers with our knowledge and services.

 

The Latvian Donate speech initiative 'Balsu talka' is collecting data for an open Latvian spoken language corpus.

 

Like the other CLARIN centres, CLARIN-LV operates in accordance with the Open Data Principles and the FAIR Data Principles, which are among the priorities of Latvian science for 2021-2027. The most popular items in the repository are:

  • The largest online Latvian lexicon, tezaurs.lv, which currently contains more than 390,000 lexical items. tezaurs.lv is among the oldest items in our repository. It is regularly updated, so a new version of the lexicon is added to the repository each year (the most recent is tezaurs.lv 2023). Recently, the Latvian WordNet has been integrated into tezaurs.lv, making it possible to find English translation equivalents of Latvian concepts.
  • The Balanced Corpus of Modern Latvian (LVK), the largest balanced corpus of modern Latvian. Several versions of LVK have been created and made available for browsing: LVK 2013, LVK 2018 and the recently released LVK 2022 (containing more than 100 million words).
  • The Latvian Treebank, which is released in two formats: UD format (available from the LINDAT/CLARIAH-CZ repository) and a hybrid format (LVTB), available from the CLARIN-LV repository. Six different versions, including the recently released LVTB v2.11 (almost 17 thousand manually annotated sentences), are available for download from the repository.

Most of the Latvian open-access corpora are collected in the Latvian National Corpus collection (LNCC). The LNCC includes both written and spoken corpora. An open-access federated search facility provides an overview of LNCC content, including the absolute and relative frequency of a given search term across all the LNCC corpora.

CLARIN-LV is an active partner of the CLARIN Knowledge Center for Systems and Frameworks for Morphologically Rich Languages (SAFMORIL). Besides individual consultations, we also regularly organise conferences, seminars and targeted practical workshops. Through the recently started recovery and sustainability plan project 'Language Technology Initiative', we aim to facilitate the development of high-level skills in creating and using language resources and tools, as well as to create crucially missing resources and technologies for our users. One such resource is an open Latvian spoken language corpus, which is currently being collected through the Latvian Donate speech initiative 'Balsu talka'.

 

Read the interview with Kristīna Korneliusa