
Programme CLARIN Annual Conference 2021

Event name: CLARIN Annual Conference 2021
Date: Monday, 27 September 2021 - Wednesday, 29 September 2021 (all times are CEST, UTC+2)
Location: Online (Zoom details are sent to registered participants individually)

Twitter Hashtag: #CLARIN2021

Conference Programme Outline

Monday, 27 September 2021

10.00 - 10.15 Opening & Steven Krauwer Award Ceremony
10.15 - 11.00 Keynote
11.00 - 11.30 

Five-Minute Paper Presentations:

  • Resources (Part 1)
  • Research Use Cases
11.30 - 12.30 CLARIN Café: Interactive Q&A Session for Newcomers in CLARIN from the SSH Domain
12.30 - 12.45 Break - video playing
12.45 - 13.30 The Technical State of the Infrastructure
13.30 - 14.30 PhD Student Session
14.30 - 15.00 Social Coffee Break 
15.00 - 16.00

Five-Minute Paper Presentations:

  • Research Data Management, Metadata and Curation (Part 1)
  • Legal Issues Related to the Use of LRs in Research (Part 1)
  • Resources (Part 2)
16.00 - 16.15 Day 1 Wrap-Up

Tuesday, 28 September 2021

09.45 - 10.00 Video playing
10.00 - 10.15 Presentation by Programme Committee Chair
10.15 - 10.45 State of the CLARIN Infrastructure
10.45 - 11.30 Keynote
11.30 - 12.20 Have Your Lunch with the BoD (Open to all Participants)
12.20 - 12.30 Break - video playing
12.30 - 13.30

Five-Minute Paper Presentations:

  • Repositories and National CLARIN Centres 
  • Research Data Management, Metadata and Curation (Part 2)
  • Legal Issues Related to the Use of LRs in Research (Part 2)
13.30 - 14.30

Teaching with CLARIN

Presentations & Award

14.30 - 15.00 Social Coffee Break 
15.00 - 16.00 Bazaar 
16.00 - 16.15 Day 2 Wrap-Up

Wednesday, 29 September 2021

09.45 - 10.00 Video playing
10.00 - 10.15 Social Morning Coffee
10.15 - 11.00 Panel: The Role of Corpora for the Study of Language Use and Mental Health Conditions
11.00 - 11.30 Presentations by CLARIN Committees 
11.30 - 12.20 Lunch
12.20 - 12.30 Break - video playing
12.30 - 13.45

Five-Minute Paper Presentations:

  • Annotation and Acquisition Tools
  • National CLARIN Centres
13.45 - 14.30  Keynote
14.30 - 15.00 Social Coffee Break
15.00 - 16.00

Five-Minute Paper Presentations:

  • Resources
16.00 - 16.15 Day 3 Wrap-Up and Closing



From Punched Cards to Linguistic Linked Data ...Through Infrastructures


Marco Passarotti

CIRCSE Research Centre, Catholic University of the Sacred Heart, Milano, Italy

Monday, 27 September, 10.15 - 11.00

Language Technologies Beyond Research: From Poetry to the Music Industry


Elena González-Blanco

Global Head of Digital for Wealth Management and Insurance at Banco Santander, and Research Director at IE University 

Tuesday, 28 September, 10.45 - 11.30

Language Modeling and Artificial Intelligence


Tomáš Mikolov

Czech Institute for Informatics, Robotics and Cybernetics of the Czech Technical University in Prague, Czech Republic

Wednesday, 29 September, 13.45 - 14.30


Conference Programme Details

Monday, 27 September 2021 

*All times are Central European Summer Time (CEST), UTC+2 

Times (CEST) Programme
10.00 - 10.15

Opening & Steven Krauwer Award Ceremony  

Franciska de Jong & Maciej Piasecki & Monica Monachini

10.15 - 11.00
Keynote | Chair: Monica Monachini 

From Punched Cards to Linguistic Linked Data ...Through Infrastructures (slides)

Marco Passarotti

Director of the CIRCSE Research Centre, Catholic University of the Sacred Heart, Milano, Italy.   

Abstract: The talk discusses how linguistic resources have become increasingly accessible and, lately, interoperable from the very first years of computational linguistics until the present day. Starting from the pioneering work of Father Roberto Busa on processing the Latin texts of Thomas Aquinas with IBM computers in the 1950s, the talk will touch upon the following phases in the light of those passed through by the Index Thomisticus corpus: 1) the recording of Thomas Aquinas' texts on punched cards (1950s and 1960s) and the publication of the 56 volumes of the Index Thomisticus (1970s and 1980s): (meta)data are analogue and isolated, with no interoperability with other linguistic resources; 2) the CD-ROM (1990s) and the web-based version (2000s) of the Index Thomisticus: (meta)data are digital, and (partly) accessible; 3) the storage of the Index Thomisticus Treebank in the CLARIN infrastructure, together with many other resources: (meta)data are accessible, reusable and (partly) findable. Full (and linguistically deep) interoperability between resources in the infrastructure is still to come; 4) the linking of the Index Thomisticus Treebank and other resources for Latin to the LiLa Knowledge Base: (meta)data are findable, accessible, reusable and fully interoperable by using principles, data categories and ontologies developed by the Linguistic Linked Open Data community.
Bio: Marco Passarotti is Associate Professor of Computational Linguistics at Università Cattolica del Sacro Cuore (Milan, Italy), where he is Director of the CIRCSE Research Centre, which he co-founded in 2009. His main research interests deal with building, using and disseminating linguistic resources and natural language processing tools for Latin. A former pupil of one of the pioneers of humanities computing, Father Roberto Busa SJ, since 2006 he has headed the Index Thomisticus Treebank project, which continues the legacy of Busa’s work on the opera omnia of Thomas Aquinas. He is the principal investigator of the LiLa project, an ERC-Consolidator Grant (2018-2023), which aims to build a Linked Data Knowledge Base of linguistic resources and natural language processing tools for Latin.
11.00 - 11.30 

Five-Minute Paper Presentations | Chair: Martin Wynne (slides)

  • Resources (Part 1)
  • Research Use Cases

ParlaMint: Comparable Corpora of European Parliamentary Data (slides)

Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Andrej Pančur, Nikola Ljubešić, Tommaso Agnoloni, Starkaður Barkarson, María Calzada Pérez, Çağrı Çöltekin, Matthew Coole, Roberts Darģis, Luciana D. de Macedo, Jesse de Does, Katrien Depuydt, Sascha Diwersy, Matyáš Kopp, Tomas Krilavičius, Giancarlo Luxardo, Vaidas Morkevičius, Costanza Navarretta, Paul Rayson, Orsolya Ring, Michał Rudolf, Kiril Simov, Steinþór Steingrímsson, Árni Magnússon, István Üveges, Ruben van Heusden, Giulia Venturi.  

This paper outlines the ParlaMint project from the perspective of its goals, tasks, participants, results and application potential. The project produced language corpora from the sessions of the national parliaments of 17 countries, almost half a billion words in total. The corpora are split into COVID-related subcorpora (from November 2019) and reference corpora (to October 2019). The corpora are uniformly encoded according to the ParlaMint schema with the same Universal Dependencies linguistic annotations. Samples of the corpora and conversion scripts are available from the project’s GitHub repository. The complete corpora are openly available via the CLARIN.SI repository for download, and through the NoSketch Engine and KonText concordancers as well as through the Parlameter interface for exploration and analysis.
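
The date-based split described in the abstract can be illustrated with a minimal sketch. The cut-off (reference material up to October 2019, COVID-related material from November 2019) comes from the abstract; the function name, the labels "covid" and "reference", and the sample dates are assumptions made for illustration and are not taken from the ParlaMint scripts.

```python
from datetime import date

# Cut-off taken from the abstract: reference corpora run to October 2019,
# COVID-related subcorpora start in November 2019.
COVID_START = date(2019, 11, 1)


def subcorpus_for(session_date: date) -> str:
    """Return a (hypothetical) subcorpus label for a session date."""
    return "covid" if session_date >= COVID_START else "reference"


# Hypothetical session dates, for illustration only.
sessions = [date(2019, 10, 31), date(2019, 11, 1), date(2020, 3, 15)]
labels = [subcorpus_for(d) for d in sessions]
print(labels)
```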


Corpora for Bilingual Terminology Extraction in Cybersecurity Domain (slides)

Andrius Utka, Sigita Rackevičienė, Liudmila Mockienė, Aivaras Rokas, Marius Laurinaitis and Agnė Bielinskienė. 

The paper presents English-Lithuanian corpora for bilingual term extraction (BiTE) in the cybersecurity domain, built within the framework of the DVITAS project. It is argued that a system of parallel, comparable, and training corpora for BiTE is particularly useful for less-resourced languages, as it makes it possible to exploit the strengths and avoid the weaknesses of comparable and parallel resources. Special attention is given to the open nature of the data, which is achieved by publishing the data in the CLARIN-LT repository.


How to Perform Linguistic Analysis of Emotions in a Corpus of Vernacular Semiliterate Speech with the Help of CLARIN Tools (slides)

Rosalba Nodari and Luisa Corona.  

Research has shown that words are constitutive of emotions and that language contributes to shaping feelings. However, less is known about how people with basic literacy use language to maintain, create and recreate affective bonds, and how they express themselves through the language of emotions. In this respect, digital humanities tools can help shed some light on the linguistic encoding of emotions. This proposal aims to show the potential of the CLARIN infrastructure tools for carrying out such analysis on a particular corpus of letters written in the 1960s by Michela Margiotta, a semiliterate Italian woman affected by tarantism, to the anthropologist Annabella Rossi. The research will show how corpora of semiliterate letters can pose several problems when conducting research using digital humanities tools. Different methodologies will therefore be compared in order to verify how CLARIN tools can help in the detection of encoded emotion in written documents.


Dependency Trees in Automatic Inflection of Multi-Word Expressions in Polish (slides)

Ryszard Tuora and Łukasz Kobyliński.  

Natural language generation for morphologically rich languages can benefit from automatic inflection systems. This paper presents such a system, which can tackle inflection with particular emphasis on Multi-Word Expressions (MWEs). This is done using rules induced automatically from a dependency treebank. The system is evaluated on a dictionary of Polish MWEs. Including such a tool in the CLARIN infrastructure will be beneficial for processing morphologically rich languages.


Q&A and Plenary Discussions
11.30 - 12.30

CLARIN Café: Interactive Q&A Session for Newcomers to CLARIN | Moderator: Francesca Frontini

12.30 - 12.45 Break - video playing 
12.45 - 13.30

The Technical State of the Infrastructure (slides)

Dieter Van Uytvanck

13.30 - 14.30 PhD Student Session | Chair: Darja Fišer (slides)
13.30 - 13.35

Reflecting Cognitive Processing of Trauma in Language – a Result of Trauma's Experience Corpus Analysis with a Usage of Word's Meanings Category

Wiktoria Mieleszczenko-Kowszewicz  

The aim of this research is to present the results of an analysis of a corpus of trauma narratives using a quantitative content analysis category comprising word meanings that indicate cognitive processing of trauma. This mechanism manifests itself in the occurrence of voluntary reflexive ruminations aimed at understanding the event and its consequences. Literature reviews show that people who use words reflecting cognitive effort when describing the event benefit in terms of physical health and general well-being. Individuals (N=22) who had experienced a traumatic event in the past six months were asked to describe this event. Their narratives were analysed with the 'own category' task in the Literary Exploration Machine (LEM) in CLARIN. The chosen category was Cognitive Processing of Trauma, which comprises word meanings indicating a person's intellectual activity related to understanding and finding the meaning of a given event, as well as indicating the consequences of the event for current life. The category includes 272 meanings of 207 words. The results of the analysis show that the average share of word meanings in a narrative that indicate cognitive processing of trauma is 1.01% (SD=0.44, Mdn=0.98).
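
The summary statistics reported above (mean share, SD, median) could be computed along the following lines. This is only a sketch under stated assumptions: it matches surface word forms rather than word meanings, it does not use LEM, and the category list and narratives are invented for illustration and are not the study's data.

```python
from statistics import mean, median, stdev


def category_share(tokens, category_words):
    """Percentage of tokens that belong to the category word list."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in category_words)
    return 100.0 * hits / len(tokens)


# Hypothetical mini category and narratives (not the study's data).
category = {"understand", "realise", "meaning", "because"}
narratives = [
    "I still try to understand what happened".split(),
    "It was loud and then it was over".split(),
    "I realise the meaning of that day only now".split(),
]

# One percentage per narrative, then the three summary statistics.
shares = [category_share(n, category) for n in narratives]
summary = {"mean": mean(shares), "sd": stdev(shares), "median": median(shares)}
print(summary)
```

A sense-level analysis like the one in the abstract would additionally need word sense disambiguation before counting, which this sketch leaves out.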
13.35 - 13.40

Machine Learning Applied to Voice Signal in Parkinson's Disease

Antonio Pallotti  

In this paper, a classification algorithm for Parkinson’s disease screening is proposed. The code processes specific voice signals recorded from healthy and ill subjects. With a view to future implementation and validation in a home telemonitoring system, the algorithm has been built to serve as a screening tool for the early referral of subjects at high risk of neurological diseases to instrumental exams. In fact, in several neurological disorders, such as Parkinson’s disease, motor impairments of the vocal apparatus arise earlier than postural and ambulatory symptoms. In a home telemonitoring system, in which the hardware would consist of a voice recorder (which could be a simple smartphone) and a server for the web platform, data would be acquired and instantly stored on a platform for processing by machine learning algorithms and review by specialists. For this purpose, a fully automatic process is needed. Therefore, in this work, audio preprocessing and feature computation are performed completely automatically using Matlab. The final models have been trained in the Matlab environment using Weka’s libraries. The family of developed models is trained with different types of phonation, from simple vowels to complex sounds, for a wider and more efficient analysis of vocal apparatus motor impairments. Moreover, the dataset comprises 612 observations, which is significantly above the mean size of similar works using simple phonations only. For a deeper analysis, different groups of parameters have been tested; cepstral features have been found to be optimal for classification and make up the larger part of the final algorithm. The developed models belong to the K-Nearest Neighbor family and are thus suitable for implementation in a web platform. Finally, the models obtained have shown high accuracies on the whole dataset, reaching values comparable with the literature but with more stability (standard deviation less than 1%).
These results have been confirmed in the final validation session, in which the models were exported and validated with 25% of the data, reaching a best performance with a true positive rate of 98% and a true negative rate of 87%.
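
To make the reported measures concrete, the following sketch shows a plain K-Nearest Neighbor classifier together with the true positive and true negative rates on a held-out set. It is written in Python rather than the Matlab and Weka setup named in the abstract, and the two-dimensional vectors are made-up stand-ins for cepstral features, so it illustrates the general technique only, not the author's models.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority label among the k training vectors closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def tpr_tnr(y_true, y_pred, positive=1):
    """True positive rate (sensitivity) and true negative rate (specificity).

    Assumes the data contain at least one positive and one negative case.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Toy 2-D feature vectors standing in for cepstral features (made up).
# Label 1 = patient, label 0 = healthy control.
train = [((0.0, 0.1), 0), ((0.2, 0.0), 0), ((0.1, 0.3), 0),
         ((1.0, 1.1), 1), ((0.9, 1.0), 1), ((1.2, 0.8), 1)]
held_out = [((0.1, 0.1), 0), ((1.0, 0.9), 1), ((0.3, 0.2), 0), ((1.1, 1.0), 1)]

y_true = [label for _, label in held_out]
y_pred = [knn_predict(train, x) for x, _ in held_out]
print(tpr_tnr(y_true, y_pred))
```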
13.40 - 13.45

Building English-Arabic Parallel Medical Corpora

Zainab Almugbel 

The research aims to build an open-access parallel English-Arabic corpus of medical texts, aligned at document, sentence and word level, and to demonstrate its use in case-study applications. In order to achieve this aim, we will pursue the following objectives:
1. Collate a large open-access parallel English-Arabic corpus of medical texts from reliable medical online resources, such as World Health Organization medical fact sheets, with a target size of up to 100 million words of both English and Arabic, comparable in size to the BNC. A small part of this corpus will be manually annotated for the evaluation.
2. Develop and evaluate NLP and deep learning alignment algorithms to align the English and Arabic parallel texts at document, sentence and word level.
3. Use the resulting aligned parallel corpus to extract a lexicon of English-Arabic translations of medical terminology.
4. Use the aligned parallel corpus in additional case studies of practical applications of the resource.
5. Publish the corpus as an open resource for research in AI for medicine in the CLARIN infrastructure, which supports re-use of and easy access to language data and tools for research, and promote and publicise this resource to encourage re-use and subsequent citations of this PhD thesis.
13.45 - 13.50

Attitudes and Language Acquisition: An Investigation on the Italian-French Community in Aix-Marseille

Fabio Ardolino  

The doctoral research presented here explores the relationship between social attitudes and spontaneous L2 acquisition in a sample of Italian-French migrants. In order to answer the research question, a specific protocol was developed, integrating methods borrowed from both social psychology (implicit and explicit attitudes' psychometrics) and sociolinguistics (speech elicitation and analysis). Regarding this last point, a crucial aspect of the data collection phase consisted in the construction of a wide bilingual corpus, representing the L1 and L2 varieties characterising the informants’ repertoires. In addition, the modular interview elicitation mode selected for the Italian speech elicitation has made the corpus a valuable source of social, psychosocial, and historical information about the condition of contemporary Italian migrants in France. Consistently, the corpus is intended to be integrated within the CLARIN infrastructure in order to have the greatest research re-usability. The comparative analysis of the collected psychosocial and linguistic datasets highlights interesting relationships between social attitudes and L2 development, varying significantly according to the involved stereotypical dimension (warmth or competence) and construct’s nature (explicit or implicit).
13.50 - 13.55

Towards the Precise Detection of Adverbial Roles in Hungarian – Manual Clustering of Adverbial Adjuncts

Noémi Ligeti-Nagy  

The research presented here is part of a project creating a question-answering (QA) system for Hungarian (Novák et al. 2019). The aim of the system is to comprehend texts and to model this comprehension by formulating relevant questions about the sentences. To achieve this, a precisely annotated corpus is indispensable, one in which the annotation contains all the detailed distinctive features necessary for asking appropriate questions.

I present an annotation of adverbial adjuncts that would be appropriate when designing such a training corpus. I used a HunCLARIN corpus, the Szeged Dependency Treebank (Vincze et al. 2010), and extracted those elements annotated with an Obl edge (thus being an adjunct and not an argument of the verb) that bear one of the nine case suffixes of the directional triad of locative suffixes: the internal locatives, inessive -bAn, illative -bA, and elative -bÓl; external locatives, concentrated on a given point, adessive -nÁl, allative -hOz, ablative -tÓl; external, surface-oriented locatives, superessive -On, sublative -rA and delative -rÓl.

I took these nine groups of words (consisting of 1097 lemmas) and 'pre-categorised' each group using the word embedding model of Siklósi-Novák (2016), which is based on the word2vec models (Mikolov et al. 2013) and uses hierarchical clustering. The output of the clustering was a list of groups of words, each group consisting of three to eight semantically closely related words. Then I manually cleaned these lists to collect words that could answer the same wh-word, meaning that they occur as the same type of adverbial in the sentence, together in one group. I defined 28 categories (altogether 50, counting the subcategories) into which I manually sorted the 1097 lemmas. In some cases, with some case suffixes, particular words may play a role different from the one defined by the default category, the labelling of which was also a task.
The categorisation presented here provides appropriate features in a training corpus to create a QA system, while the methodology could be extended to other languages as well.
13.55 - 14.00

Tough Constructions and Their Analogs in English, French and Russian: A Parallel Corpus Study Using the CLARIN VLO Platform

Alina Tsikulina 

The languages of the world differ strikingly in the encoding of some concepts. For instance, in the domain of evaluatives, there are some constructions involving tough predicates (e.g., this hill is difficult to climb) that present atypical mappings between structure and meaning, and great crosslinguistic variability. For example, some languages, extensively studied in this domain (Hicks, 2009; Legendre, 1986; Comrie & Matthews, 1990), such as English and French, are characterised by a typical object-to-subject raising, leaving an object-gap in the embedded infinitival clause and a semantic asymmetry in that the syntactic subject is interpreted as the semantic object of the sentence. Other languages, less studied and without this type of raising, such as Russian, seem to offer a variety of equivalent constructions – functional analogs to such tough constructions (TCs) – using mainly topicalisation of the NP with case marking (Comrie & Matthews, 1990), passives or deverbals (Paykin & Van Peteghem, 2020).

The present work reports on the methods and findings of a study based on a parallel corpus built using data accessible through the CLARIN VLO platform, aiming to test the above asymmetry between two languages that do have TCs (English and French) and a language that does not (Russian). The data were first extracted with the help of the Corpus Query Processor (CQP) of the OPUS corpus (Tiedemann & Thottingal, 2020), and more specifically from the OpenSubtitles2018 multilingual corpus for movies and TV series (Lison, Tiedemann & Kouylekov, 2018), with a special focus on occurrences involving tough adjectives (e.g., difficult, easy). After the extraction of the target occurrences with English as source language and French and Russian as target parallel translations, the data were annotated for TCs and their functional analogs (e.g., extraposed/intraposed constructions, compact adjectival uses, deverbals, passives, other distributed strategies).

The results show that even though English and French are thought to belong to the same language type (gap-strategy languages), French seems to allow a multitude of functional equivalents (extraposed, compact adjectivals, deverbals, passives, distributed strategies) that co-exist with typical TCs. With respect to Russian, although previous work suggests mainly three alternative constructions (topicalisation, passive uses and deverbals), the data show much greater variability: topicalisation as the most frequent analog followed by extraposition, passives, use of adjectivals and only marginally by deverbals. Overall, the findings suggest that French and Russian have much more in common than initially thought, and that the formal devices used to express evaluative relations of this type may be radically different not only across languages, but also within systems.

The present parallel corpus study allowed an in-depth investigation of a grammatical phenomenon that is only little discussed for Russian, and mainly explored from a theoretical (syntactic) point of view in English and in French. The project casts doubt on some common assumptions about the prototypical encoding strategies of the investigated languages in the domain of TCs. It not only has implications for corpus linguistics, semantics and syntax, through the proposed crosslinguistic comparisons (English-French-Russian), but also shows how datasets available through the CLARIN VLO platform can be used to shed light on the characteristics of less studied languages, thus contributing to linguistic theories and typology research.

14.00 - 14.05

The RigVeda Goes 'Universal': Annotating Comparative Constructions in the Most Ancient Poetry of India

Erica Biagetti  

Historical linguistics has always relied on collections of written texts, i.e. corpora, which constitute the only source of evidence available for ancient languages. Annotated corpora revolutionised historical linguistics because they allow scholars to automatically retrieve large quantitative evidence on linguistic phenomena whose account has been previously based on qualitative evidence and to capture correlations among them which could hardly be grasped by linguists’ naked eye (Eckhoff et al., 2018: 303; Biber, 2009; Anthony, 2013). Furthermore, morphosyntactically annotated corpora require automatic data selection through explicit query expressions, crucially making historical linguistic research replicable (Haug, 2015). This paper presents the author’s PhD project devoted to the annotation of the RigVeda (RV), as well as its employment for the study of equative and similative constructions attested in this text. The RV is a collection of 1028 hymns dating back to the second half of the second millennium BCE (Witzel, 1995), which constitutes the oldest layer of Vedic literature and whose language is strongly conditioned by the poetic and ritual character of the text, as well as by its metrical structure. The treebank of the RV was created as part of the larger Vedic Treebank (VTB; Hellwig et al., 2020), a corpus of selected passages from Vedic Sanskrit literature syntactically annotated according to the UD standard. A first version of the VTB was published on the occasion of the release of UD version 2.6 (15 May 2020); a new version will be integrated in the next UD release scheduled for November 15, 2021.
The main section of the paper presents a new annotation scheme developed by the author for equative and similative constructions attested in the RV and shows research outcomes obtained thanks to this scheme. In its main structure, the annotation follows the scheme indicated by UD guidelines for comparative constructions. In UD, there are no relations designed specifically to mark comparative constructions: phrasal comparatives are simply assimilated to other obliques (obl), and comparative clauses are treated in the same way as other adverbial clauses (advcl). Similarly, standard markers take the same deprel as other function words, such as adpositions (case) and subordinating conjunctions (mark). However, since comparative particles usually have other functions in a language, and since comparison is expressed by several different strategies in the RV as in many other languages (cf. Haspelmath et al., 2017, among others), the paper suggests exploiting UD openness to language-specific extensions in order to increase granularity and informativeness of the annotation and allow more targeted queries on comparative constructions. The paper will show that treebank-backed analysis has the advantage of providing quantitative data and highlighting tendencies in Rigvedic equative and similative constructions which could not previously be observed by linguists’ naked eye. However, since the kind of information stored in the treebank does not make it possible to check whether there are any metrical or formulaic factors constraining the kind of word order attested in Rigvedic comparative constructions, the paper also discusses some methodological issues that arise when annotating an ancient poetic text such as the RV and envisages the integration of metrical as well as semantic information within the CoNLL-U Plus Format as a possible solution.
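
How language-specific relation subtypes enable more targeted queries can be sketched as follows. The ten-column layout and the `universal:subtype` convention are part of the CoNLL-U format; the subtype labels `obl:cmp` and `case:cmp`, the helper names, and the placeholder tokens are hypothetical and are not the labels used in the Vedic Treebank.

```python
# Column order defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(block):
    """Parse one CoNLL-U sentence into a list of token dicts."""
    tokens = []
    for line in block.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        tokens.append(dict(zip(FIELDS, line.split("\t"))))
    return tokens


def with_subtype(tokens, universal, subtype):
    """Tokens whose DEPREL is exactly `universal:subtype`."""
    return [t for t in tokens if t["deprel"] == f"{universal}:{subtype}"]


# Placeholder tokens; "obl:cmp" and "case:cmp" are hypothetical subtypes.
rows = [
    ("1", "w1", "w1", "NOUN", "_", "_", "3", "nsubj", "_", "_"),
    ("2", "w2", "w2", "NOUN", "_", "_", "3", "obl:cmp", "_", "_"),
    ("3", "w3", "w3", "VERB", "_", "_", "0", "root", "_", "_"),
    ("4", "w4", "w4", "PART", "_", "_", "2", "case:cmp", "_", "_"),
]
sample = "# sent_id = demo\n" + "\n".join("\t".join(r) for r in rows)

tokens = parse_conllu(sample)
standards = with_subtype(tokens, "obl", "cmp")
print([t["form"] for t in standards])
```

With a plain `obl` label, the same query would also return every other oblique in the treebank, which is the loss of granularity the abstract points to.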
14.05 - 14.15 Plenary discussion Q&A 
14.15 - 14.30 Follow-Up Discussion in Breakout Rooms
14.30 - 15.00 Break
Social Coffee Break 'First Meeting/Informal Encounters' | Moderator: Ben Verhoeven
15.00 - 16.00

Five-Minute Paper Presentations | Chair: Juan Steyn (slides)

  • Research Data Management, Metadata and Curation (Part 1)
  • Legal Issues Related to the Use of LRs in Research (Part 1) 
  • Resources (Part 2)
15.00 - 15.15 Research Data Management, Metadata and Curation (Part 1) | Chair: Juan Steyn
Curation Criteria for Multimodal and Multilingual Data: A Mixed Study within the Quest Project

Amy Isard and Elena Arestau.  

We conducted a user survey and expert interviews within the ongoing Quest project to get an impression of the needs of users and researchers who are working with multimodal and multilingual linguistic corpora. This contribution describes the design and results of the mixed study, whose main goal is to improve the reuse potential of these resources, and to identify concrete topics which are important for the curation of such data.



Seamless Integration of Continuous Quality Control and Research Data Management for Indigenous Language Resources (slides)

Anne Ferger and Daniel Jettka.  

This paper reports on further substantial developments of the continuous quality control framework proposed by Hedeland and Ferger (2020) for assuring and enhancing the quality of linguistic research data, especially for indigenous language resources in the project INEL (Arkhipov and Däbritz, 2018). The focus lies on the seamless integration of continuous quality control into data creation workflows, as well as the induction and improvement of automated monitoring, reporting, and documentation mechanisms. Best practices as well as enhanced and new open access tools for projects intending to optimise their research data management are provided.


Flexible Metadata Schemes for Research Data Repositories – The Common Framework in Dataverse and the CMDI Use Case

Jerry de Vries, Vyacheslav Tykhonov, Andrea Scharnhorst, Eko Indarto and Femmy Admiraal.  

This paper presents how DANS, which participates in the CLARIAH+ project, works on a Common Framework which makes it possible to expose CMDI metadata via a DANS discovery service. The Common Framework refers to discussions in CLARIN about integrating standards in Dataverse. This paper informs CLARIAH+ about the explorations of the envisioned use of the Common Framework and reports on the possibilities and challenges of the interoperability of these metadata schemes. The challenges faced are: first, proposing a core set of CMDI metadata as a recommendation; second, extracting CMDI metadata and transforming and loading the metadata fields into the Dataverse core set of metadata; third, a workflow for predicting and linking concepts from external controlled vocabularies to CMDI metadata values; fourth, extending the Common Framework with support for FAIR controlled vocabularies to create FAIR metadata; fifth, extending the export functionality of Dataverse to export deposited CMDI metadata back to the original CMDI format.
15.15 - 15.25 Legal Issues Related to the Use of LRs in Research (Part 1) | Chair: Juan Steyn


The Interplay of Legal Regimes of Personal Data, Intellectual Property and Freedom of Expression in Language Research

Aleksei Kelli, Krister Lindén, Pawel Kamocki, Kadri Vider, Penny Labropoulou, Ramūnas Birštonas, Vadim Mantrov, Vanessa Hannesschläger, Riccardo del Gratta, Age Värv, Gaabriel Tavits and Andres Vutt.  

Sometimes legal scholars get relevant but baffling questions from laypersons, such as: 'The reference to a work is personal data, so does the GDPR actually require me to anonymise it?' Or 'As my voice data is personal data, does the GDPR automatically give me access to a speech recogniser using my voice sample?' Or 'Can I say anything about myself without the GDPR requiring the web host to anonymise or remove the post? What can I say about others like politicians?' And 'What can researchers say about patients in a research report?' Based on these questions, the authors address the interaction of intellectual property and data protection law in the context of data minimisation and attribution rights, access rights, trade secret protection, and freedom of expression.


Less is More When FAIR. The Minimum Level of Description in Pathological Oral and Written Data (slides)

Rosalba Nodari, Silvia Calamai and Henk van den Heuvel.  

This paper presents a case study under the DELAD initiative, on the basis of two different types of data originating in a former neuropsychiatric hospital in Italy: a collection of oral interviews recorded in 1977 by Anna Maria Bruzzone inside the hospital, and a long diary written by a schizophrenic patient in the 1970s. Given the vulnerability of the subjects involved, and the distance in time from the data collection, not all the audio and written material may be accessible. The aim of this work is to address some of the challenges in archiving and storing legacy data referring to vulnerable people in European infrastructures, and to present a minimum set of metadata that can be accessed for further research, according to the FAIR principles.
15.25 - 15.30 Resources (Part 2) | Chair: Juan Steyn


From Data Collection to Data Archiving: A Corpus of Italian Spontaneous Speech

Daniela Mereu.  

Interest in spontaneous speech has increased in the speech sciences, and researchers have begun to study the characteristics of spontaneous and casual speech in different languages on the basis of spontaneous speech corpora, which allow large amounts of data to be investigated and more robust theoretical generalisations to be formulated. For this kind of research on Italian, corpora of spontaneous speech suitable for phonetic analysis are very limited, because the available resources of spoken Italian are not always accompanied by audio files, or the recordings are not suited for acoustic analysis of speech. The main aim of this proposal is to present a new corpus of Italian spontaneous speech, representing the variety of Italian spoken in Bolzano (South Tyrol, Italy). Special attention will be given to corpus construction procedures, from data collection to database creation. Finally, the way of archiving the corpus in a CLARIN repository will be discussed, in order to reflect on the best practices for making this corpus available to the scientific community and archiving it in a safe and long-term way.


Breakout Sessions on Specific Topics:

  • Room 1: Research Use Cases (Martin Wynne)
  • Room 2: Resources (Parts 1 and 2) (Martin Wynne)
  • Room 3: Legal Issues Related to the Use of LRs in Research (Part 1) (Juan Steyn)
  • Room 4: Research Data Management, Metadata and Curation (Part 1) (Juan Steyn)
16.00 - 16.15 Day 1 Wrap-Up by Eva Soroli with Illustrations by Marta Fioravanti @nonlineare


Tuesday, 28 September 2021

*All times are Central European Summer Time (CEST), UTC+2 

Times (CEST) Programme
09.45 - 10.00 Video playing
10.00 - 10.15

Programme Committee Presentation (slides)

Monica Monachini

10.15 - 10.45

State of the CLARIN Infrastructure (slides)

Franciska de Jong
10.45 - 11.30

Keynote | Chair: Koenraad De Smedt

Language Technologies Beyond Research: From Poetry to the Music Industry (slides)

Elena González-Blanco

Global Head of Digital for Wealth Management and Insurance at Banco Santander, and Research Director at IE University   

Abstract: The age of machine learning and data analytics has changed the habits of entertainment. Recommendation systems have been improving in recent years, with relevant commercial purposes, and most global leading companies - such as Amazon, Google or Netflix - are investing heavily in improving their algorithms through Artificial Intelligence. The case of music has been especially relevant, as the market has drastically changed in the last ten years, moving towards a user-centric streaming model, where user preferences make the difference and dynamic playlists are the key to streaming success. Recommenders are usually built based on similarities between songs that are identified by their sound waves; classification using conventional tags, such as author, genre, or period; and collaborative tagging by users.
In this context, song lyrics (the text of songs) are barely considered for the improvement of recommendation strategies and, in most cases, they are analysed by hand with uneven criteria and filters. But times are changing quickly, and Natural Language Processing and Language Technologies are taking on an increasingly important role in the Artificial Intelligence scenario, becoming one of the key levers to drive growth and better customer experience across different industries. The music industry is also being disrupted by this trend. Including NLP tools for understanding the underlying poetry in song lyrics has proven to be the key to automation in content recommendation algorithms.

The challenge is not easy, as it requires training algorithms in different languages, and the availability of corpora, language resources and tools is unevenly distributed. The use of the CLARIN infrastructure to enrich, grow and accelerate AI-based industrial projects might be a definite step to expand the infrastructure beyond research and become one of the key levers for growth in Artificial Intelligence across the different languages.
Bio: Elena González-Blanco is an artificial intelligence and digital innovation expert with a special focus on language technologies, Fintech and Insurtech. Recently appointed as Global Head of Digital for Wealth Management and Insurance at Banco Santander, and Research Director at IE University, Elena González-Blanco was the General Manager of Europe at CoverWallet for the last four years and previously Head of Artificial Intelligence Product Development at Indra. She combines her business activity with an outstanding career as an international researcher, as Principal Investigator of the H2020 European Research Council Excellence Starting Grant Project POSTDATA and LYRAICS.
Intra-entrepreneur within the Spanish university, she was the Director and founder of LINHD (Digital Innovation Lab and IT solutions provider). Executive Committee/Advisory Board of key European digital research infrastructures and international associations (President of the Spanish Digital Humanities Association, Member of the Executive Committee of the European Alliance for Digital Humanities, Secretary of the International Alliance for Digital Humanities Organization, and member of the Advisory Board of CLARIN, among others).
Holding a PhD in Spanish Philology and awarded the 1st National MA Prize in Spanish Studies and Classics, she has carried out research and teaching activities at Harvard University, King’s College, UNAM, Bonn and UNED. She is currently Associate Professor of Artificial Intelligence Applied to Business at IE. Fluent speaker of English, French, German and Italian, Elena has been recognised as one of the Top100 Female Leaders in Spain (2016, 2017, 2018), and awarded the Julián Marías Prize 2017 for researchers under the age of 40, as well as the 2021 WIDS Prize for Women in Machine Learning and Data Science. She has been #1 and #3 in the Choiseul Ranking '100 Economic Leaders for the Future of Spain' (2018, 2019). She is also the mother of 4 children.
11.30 - 12.20


Have Your Lunch with the BoD (Open to all Participants)

12.20 - 12.30 Break - video playing
12.30 - 13.30

Five-Minute Paper Presentations | Chairs: Neeme Kahusk, Jurgita Vaičenonienė, Krister Lindén (slides)

  • Repositories and National CLARIN Centres
  • Research Data Management, Metadata and Curation (Part 2)
  • Legal Issues Related to the Use of LRs in Research (Part 2)
12.30-12.55 Repositories and National CLARIN Centres | Chair: Neeme Kahusk


ARCHE Suite: A Flexible Approach to Repository Metadata Management (slides)

Mateusz Żółtak, Martina Trognitz and Matej Durco 

This article presents an innovative approach to metadata handling implemented in the ARCHE Suite repository solution. It first discusses the technical requirements for metadata management and contrasts them with the shortcomings of the existing solutions. Then, it demonstrates how the ARCHE Suite addresses those problems. After one year of productive use, we can assert that the approach implemented in the ARCHE Suite is viable and provides important benefits.


A Data Repository for the Management of Dynamic Linguistic Datasets (slides)

Thomas Gaillat, Leonardo Contreras Roa and Juvénal Attoumbre.  

This paper addresses the issue of using Nakala, a dynamic database technology, for the management of language corpora. We present our ongoing attempt at storing and classifying multimedia documents of a corpus of language learner oral and written productions with universal resource identifiers. The architecture supports query APIs compatible with R packages and other tools which will facilitate the generation of linguistically enriched datasets for a more effective corpus-based study of language acquisition.


CLARIN-IT Resources in CLARIN ERIC - a Bird’s-Eye View (slides)

Dario Del Fante, Francesca Frontini, Monica Monachini and Valeria Quochi.  

This paper investigates the visibility of CLARIN-IT language resources within the services of the CLARIN ERIC central infrastructure, notably the Virtual Language Observatory, the Switchboard and the Federated Content Search, from a user perspective in order to identify possible issues. While the experiment focused on one national consortium, the ultimate goal is to develop an assessment methodology that can be used by any national consortium aiming to review the accessibility of its resources and tools within the CLARIN central services.


Opening Language Resource Infrastructures to Non-Research Partners: Practicalities and Challenges (slides)

Verena Lyding, Egon W. Stemle and Alexander König.  

By now, digital infrastructures for language data and tools have become commonplace in the research domain, but their possible benefits are still almost unknown outside of these circles. However, it stands to reason that the data and methods developed there could also be of use to non-research language actors like publishing houses or libraries. In this article, we present a use case within a local language infrastructure project that provides a newspaper portal with modern NLP tools via an API to help them improve their online search. We describe how this use case was implemented with a special focus on the problems that came up during the realisation, specifically those that arose from the interaction between a research and a non-research institution.


Q&A and Plenary Discussions
12.55-13.15 Research Data Management, Metadata and Curation (Part 2) | Chair: Jurgita Vaičenonienė


Bagman - A Tool that Supports Researchers Archiving Their Data (slides)

Claus Zinn.  

Getting researchers to archive their data properly is hard. Many factors are at play. In this paper, we present Bagman, a software tool that aims to ease research data management significantly. Bagman is a web-based application that helps researchers to package their data, assign a minimal set of metadata for their description, define a licence for the data’s future distribution, and submit the entire package in a safe manner to an archive of their choice.


The TEI-based ISO Standard 'Transcription of Spoken Language' as an Exchange Format within CLARIN and Beyond

Hanna Hedeland and Thomas Schmidt.  

This paper describes the TEI-based ISO standard 24624:2016 'Transcription of spoken language' and other formats used within CLARIN for spoken language resources. It assesses the current state of support for the standard and the interoperability between these formats and with relevant tools and services. The main idea behind the paper is that a digital infrastructure providing language resources and services to researchers should also allow the combined use of resources and/or services from different contexts. This requires syntactic and semantic interoperability. We propose a solution based on the ISO/TEI format and describe the necessary steps for this format to work as an exchange format with basic semantic interoperability for spoken language resources across the CLARIN infrastructure and beyond.


Citation Tracking and Versioning for Linguistic Examples

Tobias Weber.  

This paper outlines the possible implementation of a data citation tracking method within the CLARIN services. The method is based on Weber (2019) and has not been developed yet. The goal is to create collections of subsets of data, displaying the variation in their cited forms in the literature. This creates a citation infrastructure to increase transparency of scientific workflows, enrich data sets administered by CLARIN, and highlight their relevance.


Q&A and Plenary Discussions
13.15-13.30 Legal Issues Related to the Use of LRs in Research (Part 2) | Chair: Krister Lindén


Legal Issues Related to the Use of Twitter Data in Language Research

Pawel Kamocki, Vanessa Hannesschläger, Esther Hoorn, Aleksei Kelli, Marc Kupietz, Krister Lindén and Andrius Puksas.  

Twitter data is used in a wide variety of research disciplines in social sciences and humanities. Although most Twitter data is publicly available, its re-use and sharing raise many legal questions related to intellectual property and personal data protection. Moreover, the use of Twitter and its content is subject to the Terms of Service, which also regulate re-use and sharing. This extended abstract provides a brief analysis of these issues and introduces the new Academic Research product track, which enables authorised researchers to access the Twitter API on a preferential basis.


Ethnomusicological Archives and Copyright Issues: An Italian Case Study (slides)

Prospero Marra, Duccio Piccardi and Silvia Calamai.  

This paper adds a piece to the puzzle of the complex balance between diffusion and legal restraints in the management of oral archives. We focus on the Caterina Bueno Italian ethnomusicological archive, which is being processed by the Archivio Vi.Vo. project and represents a challenging case study in terms of protection of the original informants, the author of the arrangements and the other performers. In particular, the paper expounds problems and partial solutions related to authorship, the fixation of the musical performance, its reproduction, diffusion and the compensation for subsequent uses. Overall, the paper aims to promote awareness on legal protection while defusing the apprehension of potential obstacles and dampening excessive risk aversion in the diffusion of oral materials.


Q&A and Plenary Discussions
13.30 - 14.30 Teaching with CLARIN | Chairs: Iulianna van der Lek and Francesca Frontini (slides)
13.30 - 13.40 Introduction
13.40 - 14.10 Three-Minute Presentations
Tuomo Hiippala
Faculty of Arts, University of Helsinki, Finland
Anika Nicolosi and Beatrice Nava
University of Parma, Italy
Erik Axelson
Faculty of Arts, University of Helsinki, Finland
Diana Maynard
Faculty of Engineering, University of Sheffield, UK
Zuzana Neverilova 
Faculty of Arts, Masaryk University, Czech Republic
Mietta Lennes
Faculty of Humanities, University of Helsinki, Finland
Silvia Calamai and Rosalba Nodari
Faculty of Languages for Intercultural and Business Communication, University of Siena, Italy
Esther Hoorn and Henk van den Heuvel
University of Groningen, the Netherlands
Darja Fiser and Kristina Pahor de Maiti
Faculty of Arts, University of Ljubljana, Slovenia
14.10 - 14.20 Teaching with CLARIN Award 
14.20 - 14.30 Q&A and Plenary Discussions
14.30 - 15.00


Social Coffee Break 'Exchange of Opinions/Impressions So Far' | Moderator: Ben Verhoeven
15.00 - 16.00 Bazaar Session | Chair: Frieda Steurs (slides)
16.00 - 16.15 Day 2 Wrap-Up by Tomaž Erjavec with Illustrations by Marta Fioravanti @nonlineare

Wednesday, 29 September 2021

*All times are Central European Summer Time (CEST), UTC+2

Times (CEST) Programme
09.45 - 10.00 Video playing
10.00 - 10.15 Social Morning Coffee
10.15 - 11.00

Panel | Moderator: Henk van den Heuvel (slides)

The Role of Corpora for the Study of Language Use and Mental Health Conditions

About the Panel

Panellists: Gloria Gagliardi, Stefan Goetze, Saturnino Luz, Khiet Truong

Automatic detection of mental health conditions from text and speech has become a very appealing research field in recent years. Research into the topic is now accumulating into an impressive body of literature and special sessions at conferences. This CLARIN conference is an excellent platform to discuss infrastructural and strategic issues that are related to the resources needed for this type of research and their shareability. For instance, what are the biases that may intrude into the data annotation and the tools developed, and how can these be avoided? Or how to handle the challenges of collecting and sharing language resources that typically involve vulnerable people? In this panel we will discuss these issues with experts in the field. Each of them will present a short pitch, highlighting their research and the role they see for the CLARIN infrastructure in facilitating it. Each pitch will conclude with one or two brief statements, which will serve as the basis for the discussion with the session participants. Links to relevant publications and a short summary of the panellists' work in this domain can be found here.
11.00 - 11.30 Presentations by CLARIN Committees | Chair: Franciska de Jong (slides)
11.30 - 12.20 Lunch
12.20 - 12.30 Break - video playing
12.30 - 13.45

Five-Minute Paper Presentations | Chairs: António Branco and Kiril Simov (slides)

  • Annotation and Acquisition Tools
  • National CLARIN Centres
12.30-13.10 Annotation and Acquisition Tools | Chair: António Branco


A Method for Building Non-English Corpora for Abstractive Text Summarisation

Julius Monsen and Arne Jönsson.  

We present a method for building corpora for training, and testing, abstractive text summarisers for languages other than English. The method builds on the widely used English CNN/Daily Mail corpus and the assumption that corpora for other languages can be built by filtering language-specific news corpora to have properties similar to those of the CNN/Daily Mail corpus. In the paper, we show how to achieve this by removing texts from the target corpus that do not adhere to the characteristics of the CNN/Daily Mail corpus. Models are trained on these filtered subsets of the corpus and compared to results from training a model on the CNN/Daily Mail corpus. The results show that the method can be used to build corpora for training abstractive text summarisers for languages other than English that have properties on par with those trained using the CNN/Daily Mail corpus.


Enhancing CLARIN-DK Resources While Building the Danish ParlaMint Corpus

Bart Jongejan, Dorte Haltrup Hansen and Costanza Navarretta.  

In this paper we describe the Danish CLARIN resources, corpora, tools and workflow, which we used and enhanced in order to build the Danish ParlaMint corpus, as part of the CLARIN-funded ParlaMint project. More specifically, the article accounts for the manual and automatic processes involved in the preparation of the Danish Parliamentary speeches, with focus on the CLARIN-DK tools and Text Tonsorium workflow management. The tools annotated the speeches with metadata and linguistic information in compliance with the common ParlaMint TEI P5 format. As a spin-off of the project, the CLARIN-DK sentence tokeniser and the CST Named Entity Recogniser were improved. These tools, together with the CST-lemmatiser, Danish UD-Pipe software and several data transformation utilities, produced all the linguistic annotations in the correct format. We conclude the paper with a report of a pilot evaluation of the quality of some of the linguistic annotations in the Danish ParlaMint corpus.


Creating an Error Corpus: Annotation and Applicability (slides)

Þórunn Arnardóttir, Xindan Xu, Dagbjört Guðmundsdóttir, Lilja Björk Stefánsdóttir and Anton Karl Ingason.  

In this paper, we describe the Icelandic Error Corpus, a manually annotated error corpus for Icelandic. The Icelandic Error Corpus consists of texts from three sources: student essays, online news and Wikipedia articles, with a total of 56,794 annotated error instances. The corpus is used to analyse errors made by Icelandic native speakers, which are in turn used to guide the development of an Icelandic open-source spellchecker. The corpus is delivered in an augmented TEI format and published under an open-source license.


Q&A and Plenary Discussions


Reliability of Automatic Linguistic Annotation: Native vs Non-Native Texts (slides)

Elena Volodina, David Alfter, Therese Lindström Tiedemann, Maisa Lauriala and Daniela Piipponen.  

We summarise the results of a manual evaluation of the performance of automatic annotation on three different datasets: (1) texts written by native speakers, (2) essays written by second language (L2) learners of Swedish in the original form and (3) the normalised versions of the same essays. The focus of the evaluation is on lemmatisation, PoS-tagging, dependency annotation, word sense disambiguation and multi-word detection.


Annotation Management Tool: A Requirement for Corpus Construction (slides)

Yousuf Ali Mohammed, Arild Matsson and Elena Volodina.  

We present an annotation management tool, the SweLL portal, that has been developed for the purposes of the SweLL infrastructure project for building a learner corpus of Swedish (Volodina et al., 2020). The SweLL portal has been used for supervised access to the database, for data versioning, export and import of data and metadata, statistical overview, administration of annotation tasks, monitoring of annotation tasks and reliability controls. The development of the portal was driven by visions of longitudinal, sustainable data storage and was partially shaped by situational needs reported by the portal users, including project managers, researchers, and annotators.


ALEXIA: A Lexicon Acquisition Tool

Steinunn Rut Friðriksdóttir, Atli Jasonarson, Steinþór Steingrímsson and Einar Freyr Sigurðsson.  

We present a new corpus tool, ALEXIA, which is designed to facilitate research using the Icelandic Gigaword Corpus, but can be adapted to any text corpus. The tool aids the compilation and expansion of lexical databases and dictionaries by comparing the vocabulary of the database to that of the corpus in order to find gaps in the data. In particular, two well-known Icelandic language resources are incorporated into the design in order to explore the tool’s usage. We describe the design and functionality of the tool, how it can be adapted to various data sources and the process of filtering out noise in order to get a clean list of word candidates. Additionally, we present an extensive list of manually collected stop words that can be used to minimise distortion in research results.
13.10-13.25 National CLARIN Centres | Chair: Kiril Simov


Help Yourself from the Buffet: National Language Technology Infrastructure Initiative on CLARIN-IS (slides)

Anna Björk Nikulásdóttir, Þórunn Arnardóttir, Jón Guðnason, Þorsteinn Daði Gunnarsson, Anton Karl Ingason, Haukur Páll Jónsson, Hrafn Loftsson, Hulda Óladóttir, Einar Freyr Sigurðsson, Atli Þór Sigurgeirsson, Vésteinn Snæbjarnarson and Steinþór Steingrímsson.  

In this paper, we describe how a fairly new CLARIN member is building a broad collection of national language resources for use in language technology (LT). As a CLARIN C-centre, CLARIN-IS is hosting metadata for various text and speech corpora, lexical resources, software packages and models. The providers of the resources are universities, institutions and private companies working on a national (Icelandic) LT infrastructure initiative.


CLARIN Knowledge Centre for Belarusian Text and Speech Processing (K-BLP)

Yuras Hetsevich, Jauheniya Zianouka, David Latyshevich, Mikita Suprunchuk, Valer Varanovich and Katerina Lomat.  

This paper presents the CLARIN Knowledge Centre for Belarusian text and speech processing (K-BLP), which is based at the Speech Synthesis and Recognition Laboratory, the United Institute of Informatics Problems of the National Academy of Sciences of Belarus, Minsk. The CLARIN Knowledge Centre for Belarusian text and speech processing is part of CLARIN ERIC, which holds the European Strategy Forum on Research Infrastructures (ESFRI) certification as a landmark research infrastructure.


CLARIN Flanders: New Prospects (slides)

Vincent Vandeghinste, Els Lefever, Walter Daelemans, Tim Van de Cruys and Sally Chambers.  

We describe the creation of CLARIN Belgium (CLARIN-BE) and, associated with that, the plans of the CLARIN-VL consortium within the CLARIAH-VL infrastructure for which funding was secured for the period 2021-2025.


Breakout Sessions on Specific Topics:

  • Room 1: Annotation and Acquisition Tools (António Branco)
  • Room 2: Repositories and National CLARIN Centres (Neeme Kahusk, Kiril Simov)
  • Room 3: Research Data Management, Metadata and Curation (Part 2) (Jurgita Vaičenonienė)
  • Room 4: Legal Issues Related to the Use of LRs in Research (Part 2) (Krister Lindén)
  • Room 5: CLARIN committees (Franciska de Jong)
13.45 - 14.30 

Keynote | Chair: Jan Hajič

Language Modeling and Artificial Intelligence (slides)

Tomáš Mikolov

Czech Institute for Informatics, Robotics and Cybernetics of the Czech Technical University in Prague, Czech Republic.   

Abstract: Statistical language modelling has been labelled as an AI-complete problem by many famous researchers of the past. However, despite all the progress made in the last decade, it remains unclear how much progress towards truly intelligent language models we have made. In this talk, I will present my view on what has been accomplished so far, and what scientific challenges are still in front of us. We need to focus more on developing new mathematical models with certain properties, such as the ability to learn continually and without explicit supervision, generalise to novel tasks from limited amounts of data, and the ability to form non-trivial long-term memory. I will describe some of our attempts to develop such models within the framework of complex systems.
Bio: Tomáš Mikolov is a researcher at CIIRC, Prague. Currently he leads a research team focusing on the development of novel techniques within the area of complex systems, artificial life and evolution. Previously, he worked at Facebook AI and Google Brain, where he led the development of popular machine learning tools such as word2vec and fastText. He obtained his PhD at the Brno University of Technology for his work on neural language models (the RNNLM project) in 2012. His main research interest is to understand intelligence, and to create artificial intelligence that can help people to solve complex problems.
14.30 - 15.00


Social Coffee Break with Ben Verhoeven
15.00 - 16.00

Five-Minute Paper Presentations | Chair: Costanza Navarretta (slides)

  • Resources (Part 3)
15.00 - 16.00 Resources (Part 3) | Chair: Costanza Navarretta


The CIRCSE Collection of Linguistic Resources in CLARIN-IT (slides)

Rachele Sprugnoli and Marco Passarotti.  

In this paper, we present the collection of the linguistic resources for Latin made available by the CIRCSE Research Center in the CLARIN-IT repository. After an introduction about the history and the main research lines of the Center, the paper provides details on both the lexical and the textual resources that were built across more than a decade at the CIRCSE and that are now accessible in CLARIN-IT.


‘Cretan Institutional Inscriptions’ Meets CLARIN-IT (slides)

Irene Vagionakis, Riccardo Del Gratta, Federico Boschetti, Paola Baroni, Angelo Mario Del Grosso, Tiziana Mancinelli and Monica Monachini.  

This paper describes a project in the domain of Digital Epigraphy, named Cretan Institutional Inscriptions, developed at the Ca’ Foscari University of Venice. The project is supported by CLARIN-IT as part of the actions addressed to initiatives, projects and events in the field of humanities and social sciences. The main goal is to make the project visible through CLARIN channels with the hope that it will be a forerunner for other digital epigraphy projects in CLARIN. The article also illustrates the dockerisation process applied to the Cretan Institutional Inscriptions project, currently hosted on the CLARIN-IT servers.


IceTaboo: A Database of Contextually Inappropriate Words for Icelandic (slides)

Agnes Sólmundsdóttir, Lilja Björk Stefánsdóttir and Anton Karl Ingason.  

We present IceTaboo, a database of 2725 words that are inappropriate or offensive to at least some speakers in some contexts. Every word is coded for part of speech, a classification of reasons that trigger a negative reaction among some speakers, as well as information about the meaning expressed by the word. The database is released under an open CC BY 4.0 license on CLARIN and is already being used in an automatic proofreading tool, developed in collaboration with an industry partner in commercial software development. The proofreading tool is itself under development in an open repository on GitHub under an MIT license.


The Nature of Icelandic as a Second Language: An Insight From the Learner Error Corpus for Icelandic (slides)

Isidora Glišić and Anton Karl Ingason.  

The Icelandic L2 Error Corpus is an expanding collection of texts written by users of Icelandic as a second language, published on CLARIN. It currently consists of 17508 manually annotated errors in different categories pertaining to grammar, spelling, lexical and other issues. The corpus was used to perform a contrastive interlanguage analysis using a native speaker reference corpus, comparing it to the Icelandic Error Corpus. This paper presents the corpus and the first results of the analysis.


Swedish Word Metrics: A Swe-Clarin Resource for Psycholinguistic Research in the Swedish Language

Erik Witte, Jens Edlund, Arne Jönsson and Henrik Danielsson.  

We present Swedish Word Metrics (SWM), a new CLARIN resource for calculations of lexical and sub-lexical metrics of Swedish words. The calculations at SWM are based on the AFC-list, which is a freely available lexical database with 816404 entries containing spellings, phonetic transcriptions, word-class assignments, and word frequency data. Besides allowing for easy access to the AFC-list data, the SWM site calculates metrics of orthographic and phonological neighbourhood density, phonotactic probability, orthographic transparency, as well as phonetic and orthographic isolation points. The source code for all calculations has been made publicly available and can be extended with more types of word metrics, whereby it forms a framework for continued word-metric developments in the Swedish language.


Insights on a Swedish Covid-19 Corpus

Dimitrios Kokkinakis.  

The COVID-19 pandemic has had a serious impact on people all over the world, from mental and physical health to economic downturn, to education and social relationships, while political decisions in many countries have had a profound impact on the lives of all people regardless of age. Many of these effects can be studied with statistical and qualitative data, such as collected questionnaires and sickness absence rates. But large-scale studies require expertise in multiple domains and from many points of view. Språkbanken Text continuously collects text from various sources. To fill the gap left by the lack of an available Swedish COVID-19-related dataset, we started to build a Swedish COVID-19 corpus (sv-COVID-19). Various tools for e.g. lexical, semantic or pragmatic/discourse analyses can then be applied in order to answer relevant questions on e.g. how people, on a larger scale than what can be obtained through qualitative studies, experienced their everyday life through the different phases of the COVID-19 crisis, or how political decisions and their consequences are described and discussed.


Voices from Ravensbrück. Towards the Creation of an Oral and Multi-Lingual Resource Family (slides)

Silvia Calamai, Jeannine Beeken, Henk van den Heuvel, Max Broekhuizen, Arjan van Hessen, Christoph Draxler and Stefania Scagliola. 

This paper describes a pilot project aimed at introducing a new type of corpus into the CLARIN resource family tree, called ‘narratives’. To this end, a multilingual corpus of existing interviews with survivors of the Ravensbrück concentration camp will be curated following CLARIN-compliant standards. During WWII, this German camp imprisoned 130,000 women of 20 different nationalities. This diversity creates the opportunity to build a unique corpus of gender-specific interviews, covering the same topic, narrated in a similar structure, but voiced in different languages. The corpus will also be enriched with various types of annotation (e.g. transcripts).


Q&A and Plenary Discussions
16.00 - 16.15 Day 3 Wrap-Up and Closing by Andreas Witt with Illustrations by Marta Fioravanti @nonlineare