Programme CLARIN Annual Conference 2021

General Information

Event name: CLARIN Annual Conference 2021
Date: Monday, 27 September 2021 - Wednesday, 29 September 2021 (all times are CEST, UTC+2)
Location: Online (Zoom details are sent to registered participants individually)

Twitter Hashtag: #CLARIN2021

For more information on the conference, see the event page.

Conference Programme Outline

Monday, 27 September 2021

10.00 - 10.15 Opening & Steven Krauwer Award Ceremony
10.15 - 11.00 Keynote
11.00 - 11.30 

Five-Minute Paper Presentations:

  • Resources. Part 1
  • Research Use Cases
11.30 - 12.30 CLARIN Café: Interactive Q&A Session for Newcomers in CLARIN from the SSH Domain
12.30 - 13.30 The Technical State of the Infrastructure
13.30 - 14.30 PhD Students Session
14.30 - 15.00 Social Coffee Break 
15.00 - 16.00

Five-Minute Paper Presentations:

  • Research Data Management, Metadata and Curation. Part 1
  • Legal Issues Related to the Use of LRs in Research. Part 1
  • Resources. Part 2
16.00 - 16.15 Day 1 Wrap-Up

Tuesday, 28 September 2021

10.00 - 10.15 Presentation by Programme Committee Chair
10.15 - 10.45 The State of the Infrastructure
10.45 - 11.30 Keynote
11.30 - 12.30 Have Your Lunch with the BoD (Open to all Participants)
12.30 - 13.30

Five-Minute Paper Presentations:

  • Repositories and National CLARIN Centres 
  • Research Data Management, Metadata and Curation. Part 2
  • Legal Issues Related to the Use of LRs in Research. Part 2
13.30 - 14.30

Teaching with CLARIN

Presentations & Award

14.30 - 15.00 Social Coffee Break 
15.00 - 16.00 Bazaar 
16.00 - 16.15 Day 2 Wrap-Up

Wednesday, 29 September 2021

10.00 - 10.15 Social Morning Coffee
10.15 - 11.00 Panel: The Role of Corpora for the Study of Language Use and Mental Health Conditions
11.00 - 11.30 Presentations by CLARIN Committees 
11.30 - 12.30 Lunch
12.30 - 13.45

Five-Minute Paper Presentations:

  • Annotation and Acquisition Tools
  • National CLARIN Centres
13.45 - 14.30  Keynote
14.30 - 15.00 Social Coffee Break
15.00 - 16.00

Five-Minute Paper Presentations:

  • Resources
16.00 - 16.15 Day 3 Wrap-Up and Closing

Featured Presentations


From Punched Cards to Linguistic Linked Data ...Through Infrastructures


Marco Passarotti

CIRCSE Research Centre, Catholic University of the Sacred Heart, Milano, Italy

Monday, 27 September, 10.15 - 11.00

Language Technologies Beyond Research: From Poetry to the Music Industry


Elena González-Blanco

General Manager of Europe at CoverWallet, Director and Founder of LINHD, Spain 

Tuesday, 28 September, 10.45 - 11.30

Language Modeling and Artificial Intelligence


Tomáš Mikolov

Czech Institute for Informatics, Robotics and Cybernetics of the Czech Technical University in Prague, Czech Republic

Wednesday, 29 September, 13.45 - 14.30


Conference Programme Details

Monday, 27 September 2021 

*All times are Central European Summer Time (CEST), UTC+2 

Times (CEST) Programme
10.00 - 10.15

Opening & Steven Krauwer Award Ceremony  

Franciska de Jong, Maciej Piasecki and Monica Monachini

10.15 - 11.00
Keynote | Chair: Monica Monachini 

From Punched Cards to Linguistic Linked Data ...Through Infrastructures

Marco Passarotti

Director of the CIRCSE Research Centre, Catholic University of the Sacred Heart, Milano, Italy.   

Abstract: The talk discusses how linguistic resources have become increasingly accessible and, lately, interoperable from the very first years of computational linguistics until the present day. Starting from the pioneering work of Father Roberto Busa on processing the Latin texts of Thomas Aquinas with IBM computers in the 1950s, the talk will touch upon the following phases in the light of those passed through by the Index Thomisticus corpus: 1) the registration of Thomas Aquinas' texts on punched cards (1950s and 1960s) and the publication of the 56 volumes of the Index Thomisticus (1970s and 1980s): (meta)data are analogue and isolated, with no interoperability with other linguistic resources; 2) the CD-ROM (1990s) and the web-based version (2000s) of the Index Thomisticus: (meta)data are digital, and (partly) accessible; 3) the storage of the Index Thomisticus Treebank in the CLARIN infrastructure, together with many other resources: (meta)data are accessible, reusable and (partly) findable. Full (and linguistically deep) interoperability between resources in the infrastructure is still to come; 4) the linking of the Index Thomisticus Treebank and other resources for Latin to the LiLa Knowledge Base: (meta)data are findable, accessible, reusable and fully interoperable by using principles, data categories and ontologies developed by the Linguistic Linked Open Data community.

Bio: Marco Passarotti is Associate Professor of Computational Linguistics at Università Cattolica del Sacro Cuore (Milan, Italy), where he is Director of the CIRCSE Research Centre, which he co-founded in 2009. His main research interests deal with building, using and disseminating linguistic resources and natural language processing tools for Latin. A former pupil of one of the pioneers of humanities computing, Father Roberto Busa SJ, since 2006 he has headed the Index Thomisticus Treebank project, which continues the legacy of Busa’s work on the opera omnia of Thomas Aquinas. He is the principal investigator of the LiLa project, an ERC-Consolidator Grant (2018-2023), which aims to build a Linked Data Knowledge Base of linguistic resources and natural language processing tools for Latin.

11.00 - 11.30 
Five-Minute Paper Presentations | Chair: Martin Wynne
  • Resources. Part 1
  • Research Use Cases

ParlaMint: Comparable Corpora of European Parliamentary Data

Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Andrej Pančur, Nikola Ljubešić, Tommaso Agnoloni, Starkaður Barkarson, María Calzada Pérez, Çağrı Çöltekin, Matthew Coole, Roberts Darģis, Luciana D. de Macedo, Jesse de Does, Katrien Depuydt, Sascha Diwersy, Matyáš Kopp, Tomas Krilavičius, Giancarlo Luxardo, Vaidas Morkevičius, Costanza Navarretta, Paul Rayson, Orsolya Ring, Michał Rudolf, Kiril Simov, Steinþór Steingrímsson, Árni Magnússon, István Üveges, Ruben van Heusden, Giulia Venturi.  

This paper outlines the ParlaMint project from the perspective of its goals, tasks, participants, results and application potential. The project produced language corpora from the sessions of the national parliaments of 17 countries, almost half a billion words in total. The corpora are split into COVID-related subcorpora (from November 2019) and reference corpora (to October 2019). The corpora are uniformly encoded according to the ParlaMint schema with the same Universal Dependencies linguistic annotations. Samples of the corpora and conversion scripts are available from the project’s GitHub repository. The complete corpora are openly available via the CLARIN.SI repository for download, and through the NoSketch Engine and KonText concordancers as well as through the Parlameter interface for exploration and analysis.


Corpora for Bilingual Terminology Extraction in Cybersecurity Domain

Andrius Utka, Sigita Rackevičienė, Liudmila Mockienė, Aivaras Rokas, Marius Laurinaitis and Agnė Bielinskienė. 

The paper presents English-Lithuanian corpora for bilingual term extraction (BiTE) in the cybersecurity domain within the framework of the project DVITAS. It is argued that a system of parallel, comparable, and training corpora for BiTE is particularly useful for less resourced languages, as it makes it possible to exploit the strengths and avoid the weaknesses of comparable and parallel resources. A special focus is given to the open nature of the data, which is achieved by publishing the data in the CLARIN-LT repository.


How to Perform Linguistic Analysis of Emotions in a Corpus of Vernacular Semiliterate Speech with the Help of CLARIN Tools

Rosalba Nodari and Luisa Corona.  

Research has shown that words are constitutive of emotions and that language contributes to shaping feelings. However, less is known about how people with basic literacies can use language to maintain, create and recreate affective bonds, and how they express themselves through the language of emotions. In this respect, digital humanities tools can help shed some light on linguistic encoding of emotions. This proposal aims to show the potential of the CLARIN infrastructure tools for carrying out such analysis on a particular corpus of letters written in the 1960s by Michela Margiotta, a semiliterate Italian woman affected by tarantism, to the anthropologist Annabella Rossi. The research will show how corpora of semiliterate letters can pose several problems when conducting research using digital humanities tools. In this respect, different methodologies will be compared in order to verify how CLARIN tools can help in the detection of encoded emotion in written documents.


Dependency Trees in Automatic Inflection of Multi-Word Expressions in Polish

Ryszard Tuora and Łukasz Kobyliński.  

Natural language generation for morphologically rich languages can benefit from automatic inflection systems. This paper presents such a system, which can tackle inflection, with particular emphasis on Multi-Word Expressions (MWEs). This is done using rules induced automatically from a dependency treebank. The system is evaluated on a dictionary of Polish MWEs. Including such a tool in the CLARIN infrastructure will be beneficial for processing morphologically rich languages.


Q & A
11.30 - 12.30

CLARIN Café: Interactive Q&A Session for Newcomers to CLARIN | Moderator: Francesca Frontini

12.30 - 13.30

The Technical State of the Infrastructure

Dieter Van Uytvanck

13.30 - 14.30 PhD Students Session
13.30 - 13.35

Reflecting cognitive processing of trauma in language - a result of trauma's experience corpus analysis with a usage of word's meanings category

Wiktoria Mieleszczenko-Kowszewicz  

13.35 - 13.40

Machine learning applied to voice signal in Parkinson's disease

Antonio Pallotti  

13.40 - 13.45  

Building English-Arabic Parallel Medical Corpora

Zainab Almugbel, Eric Atwell and Mhd Alsalka 

13.45 - 13.50  

Attitudes and Language Acquisition: an investigation on the Italian-French community in Aix-Marseille

Fabio Ardolino  

13.50 - 13.55  

Towards the Precise Detection of Adverbial Roles in Hungarian – Manual Clustering of Adverbial Adjuncts

Noémi Ligeti-Nagy  

13.55 - 14.00  

Tough constructions and their analogs in English, French and Russian: a parallel corpus study using the CLARIN VLO platform

Alina Tsikulina and Efstathia Soroli  

14.00 - 14.05  

The RigVeda goes “universal”: annotating comparative constructions in the most ancient poetry of India

Erica Biagetti  

14.05 - 14.15

Plenary discussion Q&A

14.15 - 14.30

Follow-up discussion in break out rooms

14.30 - 15.00

Social Coffee Break 'First Meeting/Informal Encounters' | Moderator: Ben Verhoeven

15.00 - 16.00
Five-Minute Paper Presentations | Chair: Jan Steyn
15.00 - 15.10 Session 1: Research Data Management, Metadata and Curation. Part 1

Curation Criteria for Multimodal and Multilingual Data: A Mixed Study within the Quest Project

Amy Isard and Elena Arestau.  

We conducted a user survey and expert interviews within the ongoing Quest project to get an impression of the needs of users and researchers who are working with multimodal and multilingual linguistic corpora. This contribution describes the design and results of the mixed study, whose main goal is to improve the reuse potential of these resources, and to identify concrete topics which are important for the curation of such data.



Seamless Integration of Continuous Quality Control and Research Data Management for Indigenous Language Resources

Anne Ferger and Daniel Jettka.  

This paper reports on further substantial developments of the continuous quality control framework proposed by Hedeland and Ferger (2020) for assuring and enhancing the quality of linguistic research data, especially for indigenous language resources in the project INEL (Arkhipov and Däbritz, 2018). The focus lies on the seamless integration of continuous quality control into data creation workflows, as well as the induction and improvement of automated monitoring, reporting, and documentation mechanisms. Best practices as well as enhanced and new open access tools for projects intending to optimise their research data management are provided.


Flexible Metadata Schemes for Research Data Repositories - The Common Framework in Dataverse and the CMDI Use Case

Jerry de Vries, Vyacheslav Tykhonov, Andrea Scharnhorst and Eko Indarto.  

Research data repositories are increasingly expected to operate together. Standardisation and alignment of metadata schemes used to describe datasets are a precondition for any platform to work. At the same time, data repositories usually serve specific knowledge domains, and have tailored their indexing practices towards those communities. In short, there is a tension between serving one or few communities in a very detailed manner and being integratable into a cross-domain platform. The Dataverse community responded to this natural tension by offering both a standard, common core set of metadata called Citation Block and the possibility to extend this core set with custom fields defined as a discipline-specific metadata block. This paper discusses in detail the challenges and solutions when it comes to implementing Common Framework principles into a very concrete Dataverse instance and a very concrete community (Conzett et al., 2020). More specifically, this paper reports how the Data Archive and Network Services institute (DANS), which participates in the CLARIAH+ project, works on a Common Framework, which makes it possible to expose CMDI metadata via a DANS discovery service.
15.15 - 15.25 Session 2: Legal Issues Related to the Use of LRs in Research. Part 1


The Interplay of Legal Regimes of Personal Data, Intellectual Property and Freedom of Expression in Language Research

Aleksei Kelli, Krister Lindén, Pawel Kamocki, Kadri Vider, Penny Labropoulou, Ramūnas Birštonas, Vadim Mantrov, Vanessa Hannesschläger, Riccardo del Gratta, Age Värv, Gaabriel Tavits and Andres Vutt.  

Sometimes legal scholars get relevant but baffling questions from laypersons, such as: 'The reference to a work is personal data, so does the GDPR actually require me to anonymise it?' Or 'As my voice data is personal data, does the GDPR automatically give me access to a speech recogniser using my voice sample?' Or 'Can I say anything about myself without the GDPR requiring the web host to anonymise or remove the post? What can I say about others like politicians?' And 'What can researchers say about patients in a research report?' Based on these questions, the authors address the interaction of intellectual property and data protection law in the context of data minimisation and attribution rights, access rights, trade secret protection, and freedom of expression.


Less is More When FAIR. The Minimum Level of Description in Pathological Oral and Written Data

Rosalba Nodari, Silvia Calamai and Henk van den Heuvel.  

This paper presents a case study under the DELAD initiative, on the basis of two different types of data originating in a former neuropsychiatric hospital in Italy: a collection of oral interviews recorded in 1977 by Anna Maria Bruzzone inside the hospital, and a long diary written by a schizophrenic patient in the 1970s. Given the vulnerability of the subjects involved, and the distance in time from the data collection, not all the audio and written material may be accessible. The aim of this work is to address some of the challenges in archiving and storing legacy data referring to vulnerable people in European infrastructures, and to present a minimum set of metadata that can be accessed for further research, according to the FAIR principles.
15.25 - 15.30 Session 3: Resources. Part 2


From Data Collection to Data Archiving: A Corpus of Italian Spontaneous Speech

Daniela Mereu.  

Interest in spontaneous speech has increased in the speech sciences, and researchers have begun to study the characteristics of spontaneous and casual speech in different languages, on the basis of spontaneous speech corpora that allow the investigation of large amounts of data and the formulation of more robust theoretical generalisations. For this kind of research on Italian, corpora of spontaneous speech suitable for phonetic analysis are very limited, because the available resources of spoken Italian are not always accompanied by audio files, or the recordings are not suited for acoustic analysis of speech. The main aim of this proposal is to present a new corpus of Italian spontaneous speech, representing the variety of Italian spoken in Bolzano (South Tyrol, Italy). Special attention will be given to corpus construction procedures, from data collection to database creation. Finally, the way of archiving the corpus in a CLARIN repository will be discussed, in order to reflect on the best practices for making this corpus available to the scientific community and archiving it in a safe and long-term way.


Breakout Sessions on Specific Topics:

  • Room 1: Research Use Cases 

  • Room 2: Resources. Parts 1 and 2 

  • Room 3: Legal Issues Related to the Use of LRs in Research. Part 1

  • Room 4: Research Data Management, Metadata and Curation. Part 1

16.00 - 16.15 Day 1 Wrap-Up by Eva Soroli with Illustrations by Marta Fioravanti @nonlineare


Tuesday, 28 September 2021

*All times are Central European Summer Time (CEST), UTC+2 

Times (CEST) Programme

10.00 - 10.15

Programme Committee Presentation

Monica Monachini

10.15 - 10.45

State of the Infrastructure

Franciska de Jong

10.45 - 11.30

Keynote | Chair: TBA

Language Technologies Beyond Research: From Poetry to the Music Industry

Elena González-Blanco

General Manager of Europe at CoverWallet, Director and Founder of LINHD, Spain.   

Abstract: The age of machine learning and data analytics has changed the habits of entertainment. Recommendation systems have been improving in the last years, with relevant commercial purposes, and most global leading companies - such as Amazon, Google or Netflix - are investing heavily in improving their algorithms through Artificial Intelligence. The case of music has been especially relevant, as the market has drastically changed in the last ten years, moving towards a user-centric streaming model, where user preferences make the difference and dynamic playlists are the key to streaming success. Recommenders are usually built based on similarities between songs that are identified by their sound waves; classification using conventional tags, such as author, genre, or period; and collaborative tagging by users.
In this context, song lyrics (the text of songs) are barely considered for the improvement of recommendation strategies and, in most cases, they are analysed by hand with uneven criteria and filters. But times are changing quickly, and Natural Language Processing and Language Technologies are getting an increasingly important role in the Artificial Intelligence scenario, becoming one of the key levers to drive growth and better customer experience across different industries. The music industry is also being disrupted by this trend. Including NLP tools for understanding the underlying poetry in song lyrics has proven to be the key to automation in content recommendation algorithms.
The challenge is not easy, as it requires training algorithms in different languages, and corpora, language resources and tools are unevenly distributed across languages. The use of the CLARIN infrastructure to enrich, grow and accelerate AI-based industrial projects might be a definite step to expand the infrastructure beyond research and become one of the key levers for growth in Artificial Intelligence across the different languages.
Bio: Elena González-Blanco is an artificial intelligence and digital innovation expert with a special focus on language technologies, Fintech and Insurtech. Recently appointed as Global Head of Digital for Wealth Management and Insurance at Banco Santander, she was the General Manager of Europe at CoverWallet for the last four years and previously Head of Artificial Intelligence Product Development at Indra. She combines her business activity with an outstanding career as an international researcher, as Principal Investigator of the H2020 European Research Council Excellence Starting Grant Project POSTDATA and LYRAICS.
An intra-entrepreneur within the Spanish university, she was the Director and founder of LINHD (Digital Innovation Lab and IT solutions provider). She has held executive committee and advisory board roles in key European digital research infrastructures and international associations (President of the Spanish Digital Humanities Association, Member of the Executive Committee of the European Alliance for Digital Humanities, Secretary of the International Alliance for Digital Humanities Organization, and member of the Advisory Board of CLARIN, among others).
She holds a PhD in Spanish Philology, was awarded the 1st National MA Prize in Spanish Studies and Classics, and has carried out research and teaching at Harvard University, King’s College, UNAM, Bonn and UNED. She is currently Associate Professor of Artificial Intelligence Applied to Business at IE. A fluent speaker of English, French, German and Italian, Elena has been recognised as one of the Top100 Female Leaders in Spain (2016, 2017, 2018), and awarded the Julián Marías Prize 2017 for researchers under the age of 40, as well as the 2021 WIDS Prize for Women in Machine Learning and Data Science. She has been #1 and #3 in the Choiseul Ranking '100 Economic Leaders for the Future of Spain' (2018, 2019). She is also the mother of four children.

11.30 - 12.30


Have Your Lunch with the BoD (Open to all Participants)

12.30 - 13.30

Five-Minute Paper Presentations | Chairs: Neeme Kahusk, Jurgita Vaičenonienė, Krister Lindén


Session 1: Repositories and National CLARIN Centres


ARCHE Suite: A Flexible Approach to Repository Metadata Management

Mateusz Żółtak, Martina Trognitz and Matej Durco  

This article presents an innovative approach to metadata handling implemented in the ARCHE Suite repository solution. It first discusses the technical requirements for metadata management and contrasts them with the shortcomings of the existing solutions. Then, it demonstrates how the ARCHE Suite addresses those problems. After one year of productive use, we can assert that the approach implemented in the ARCHE Suite is viable and provides important benefits.


A Data Repository for the Management of Dynamic Linguistic Datasets

Thomas Gaillat, Leonardo Contreras Roa and Juvénal Attoumbre.  

This paper addresses the issue of using Nakala, a dynamic database technology, for the management of language corpora. We present our ongoing attempt at storing and classifying multimedia documents of a corpus of language learner oral and written productions with universal resource identifiers. The architecture supports query APIs compatible with R packages and other tools which will facilitate the generation of linguistically enriched datasets for a more effective corpus-based study of language acquisition.


CLARIN-IT Resources in CLARIN ERIC - a Bird’s-Eye View

Dario Del Fante, Francesca Frontini, Monica Monachini and Valeria Quochi.  

This paper investigates the visibility of CLARIN-IT language resources within the services of the CLARIN ERIC central infrastructure, notably the Virtual Language Observatory, the Switchboard and the Federated Content Search, from a user perspective in order to identify possible issues. While the experiment focused on one national consortium, the ultimate goal is to develop an assessment methodology that can be used by any national consortium aiming to review the accessibility of its resources and tools within the CLARIN central services.


Opening Language Resource Infrastructures to Non-Research Partners: Practicalities and Challenges

Verena Lyding, Egon W. Stemle and Alexander König.  

By now, digital infrastructures for language data and tools have become commonplace in the research domain, but their possible benefits are still almost unknown outside of these circles. However, it stands to reason that the data and methods developed there could also be of use to non-research language actors like publishing houses or libraries. In this article, we present a use case within a local language infrastructure project that provides a newspaper portal with modern NLP tools via an API to help them improve their online search. We describe how this use case was implemented with a special focus on the problems that came up during the realisation, specifically those that arose from the interaction between a research and a non-research institution.


Q & A


Session 2: Research Data Management, Metadata and Curation. Part 2


Bagman - A Tool that Supports Researchers Archiving Their Data

Claus Zinn.  

Getting researchers to archive their data properly is hard. Many factors are at play. In this paper, we present Bagman, a software tool that aims to ease research data management significantly. Bagman is a web-based tool that helps researchers package their data, assign a minimal set of metadata for their description, define a licence for the data’s future distribution, and submit the entire package in a safe manner to an archive of their choice.


The TEI-based ISO Standard 'Transcription of Spoken Language' as an Exchange Format within CLARIN and Beyond

Hanna Hedeland and Thomas Schmidt.  

This paper describes the TEI-based ISO standard 24624:2016 'Transcription of spoken language' and other formats used within CLARIN for spoken language resources. It assesses the current state of support for the standard and the interoperability between these formats and with relevant tools and services. The main idea behind the paper is that a digital infrastructure providing language resources and services to researchers should also allow the combined use of resources and/or services from different contexts. This requires syntactic and semantic interoperability. We propose a solution based on the ISO/TEI format and describe the necessary steps for this format to work as an exchange format with basic semantic interoperability for spoken language resources across the CLARIN infrastructure and beyond.


Citation Tracking and Versioning for Linguistic Examples

Tobias Weber.  

This paper outlines the possible implementation of a data citation tracking method within the CLARIN services, based on Weber (2019), which has not been developed yet. The goal is to create collections of subsets of data, displaying the variation in their cited forms in the literature. This creates a citation infrastructure to increase transparency of scientific workflows, enrich data sets administered by CLARIN, and highlight their relevance.


Q & A


Session 3: Legal Issues Related to the Use of LRs in Research. Part 2


Legal Issues Related to the use of Twitter Data in Language Research

Pawel Kamocki, Vanessa Hannesschläger, Esther Hoorn, Aleksei Kelli, Marc Kupietz, Krister Lindén and Andrius Puksas.  

Twitter data is used in a wide variety of research disciplines in social sciences and humanities. Although most Twitter data is publicly available, its re-use and sharing raise many legal questions related to intellectual property and personal data protection. Moreover, the use of Twitter and its content is subject to the Terms of Service, which also regulate re-use and sharing. This extended abstract provides a brief analysis of these issues and introduces the new Academic Research product track, which enables authorised researchers to access the Twitter API on a preferential basis.


Ethnomusicological Archives and Copyright Issues: An Italian Case Study

Prospero Marra, Duccio Piccardi and Silvia Calamai.  

This paper adds a piece to the puzzle of the complex balance between diffusion and legal restraints in the management of oral archives. We focus on the Caterina Bueno Italian ethnomusicological archive, which is being processed by the Archivio Vi.Vo. project and represents a challenging case study in terms of protection of the original informants, the author of the arrangements and the other performers. In particular, the paper expounds problems and partial solutions related to authorship, the fixation of the musical performance, its reproduction, diffusion and the compensation for subsequent uses. Overall, the paper aims to promote awareness of legal protection while defusing the apprehension of potential obstacles and dampening excessive risk aversion in the diffusion of oral materials.


Q & A

13.30 - 14.30

Teaching with CLARIN | Chairs: Francesca Frontini and Iulianna van der Lek

13.30 - 13.40 Introduction
13.40 - 14.10 Three-Minute Presentations


Applied Language Technologies
Tuomo Hiippala
Faculty of Arts, University of Helsinki, Finland


Archilochus of Paros: Elegiac Fragments - XML Archive
Anika Nicolosi and Beatrice Nava
University of Parma, Italy


Computational Morphology with HFST
Erik Axelson
Faculty of Arts, University of Helsinki, Finland


GATE, an Open-Source Toolkit for Natural Language Processing
Ian Roberts, Kalina Bontcheva, Xingyi Song, Mark A. Greenwood, Mehmet Bakir, Johann Petrak, Ye Jiang
Faculty of Engineering, University of Sheffield, UK


Introduction to Digital Humanities
Zuzana Neverilova 
Faculty of Arts, Masaryk University, Czech Republic


Introduction to Speech Analysis
Mietta Lennes
Faculty of Humanities, University of Helsinki, Finland


Oral Archives for Sociolinguistic Research
Silvia Calamai and Rosalba Nodari
Faculty of Languages for Intercultural and Business Communication, University of Siena, Italy


Privacy by Design in Research: How You Can Do a Data Protection Impact Assessment for an Innovative Research Scenario Involving Speech Data
Esther Hoorn
University of Groningen, the Netherlands


Voices of the Parliament: A Corpus Approach to Parliamentary Discourse Research
Darja Fiser and Kristina Pahor de Maiti
Faculty of Arts, University of Ljubljana, Slovenia

14.10 - 14.20

Q & A
14.20 - 14.30 Teaching with CLARIN Award

14.30 - 15.00


Social Coffee Break 'Exchange of Opinions/Impressions So Far' | Moderator: Ben Verhoeven

15.00 - 16.00

Bazaar Session

16.00 - 16.15

Day 2 Wrap-Up with Illustrations by Marta Fioravanti @nonlineare

Wednesday, 29 September 2021

*All times are Central European Summer Time (CEST), UTC+2

Times (CEST) Programme

10.00 - 10.15

Social Morning Coffee

10.15 - 11.00

Panel | Chair: Henk van den Heuvel

The Role of Corpora for the Study of Language Use and Mental Health Conditions

11.00 - 11.30

Presentations by CLARIN Committees | Chair: Franciska de Jong

11.30 - 12.30

Lunch

12.30 - 13.45

Five-Minute Paper Presentations | Chairs: António Branco and Kiril Simov


Session 1: Annotation and Acquisition Tools


A Method for Building Non-English Corpora for Abstractive Text Summarisation

Julius Monsen and Arne Jönsson.  

We present a method for building corpora for training, and testing, abstractive text summarisers for languages other than English. The method builds on the widely used English CNN/Daily Mail corpus and the assumption that corpora for other languages can be built by filtering language-specific news corpora to have similar properties to the CNN/Daily Mail corpus. In the paper, we show how to achieve this by removing texts from the target corpus that do not adhere to the characteristics of the CNN/Daily Mail corpus. Models are trained on these filtered subsets of the corpus and compared to results from training a model on the CNN/Daily Mail corpus. The results show that the method can be used to build corpora for training abstractive text summarisers for languages other than English that have properties on par with those trained using the CNN/Daily Mail corpus.


Enhancing CLARIN-DK Resources While Building the Danish ParlaMint Corpus

Bart Jongejan, Dorte Haltrup Hansen and Costanza Navarretta.  

In this paper we describe the Danish CLARIN resources, corpora, tools and workflow, which we used and enhanced in order to build the Danish ParlaMint corpus, as part of the CLARIN-funded ParlaMint project. More specifically, the article accounts for the manual and automatic processes involved in the preparation of the Danish Parliamentary speeches, with a focus on the CLARIN-DK tools and Text Tonsorium workflow management. The tools annotated the speeches with metadata and linguistic information in compliance with the common ParlaMint TEI P5 format. As a spin-off of the project, the CLARIN-DK sentence tokeniser and the CST Named Entity Recogniser were improved. These tools, together with the CST-lemmatiser, Danish UD-Pipe software and several data transformation utilities, produced all the linguistic annotations in the correct format. We conclude the paper with a report of a pilot evaluation of the quality of some of the linguistic annotations in the Danish ParlaMint corpus.


Creating an Error Corpus: Annotation and Applicability

Þórunn Arnardóttir, Xindan Xu, Dagbjört Guðmundsdóttir, Lilja Björk Stefánsdóttir and Anton Karl Ingason.  

In this paper, we describe the Icelandic Error Corpus, a manually annotated error corpus for Icelandic. The Icelandic Error Corpus consists of texts from three sources: student essays, online news and Wikipedia articles, with a total of 56,794 annotated error instances. The corpus is used to analyse errors made by Icelandic native speakers, which are in turn used to guide the development of an Icelandic open-source spellchecker. The corpus is delivered in an augmented TEI format and published under an open-source license.


Q & A


Reliability of Automatic Linguistic Annotation: Native vs Non-Native Texts

Elena Volodina, David Alfter, Therese Lindström Tiedemann, Maisa Lauriala and Daniela Piipponen.  

We summarise the results of a manual evaluation of the performance of automatic annotation on three different datasets: (1) texts written by native speakers, (2) essays written by second language (L2) learners of Swedish in the original form and (3) the normalised versions of the same essays. The focus of the evaluation is on lemmatisation, PoS-tagging, dependency annotation, word sense disambiguation and multi-word detection.


Annotation Management Tool: A Requirement for Corpus Construction

Yousuf Ali Mohammed, Arild Matsson and Elena Volodina.  

We present an annotation management tool, SweLL portal, that has been developed for the purposes of the SweLL infrastructure project for building a learner corpus of Swedish (Volodina et al., 2020). The SweLL portal has been used for supervised access to the database, for data versioning, export and import of data and metadata, statistical overview, administration of annotation tasks, monitoring of annotation tasks and reliability controls. The development of the portal was driven by visions of longitudinal sustainable data storage and was partially shaped by situational needs reported by the portal users, including project managers, researchers, and annotators.


ALEXIA: A Lexicon Acquisition Tool

Steinunn Rut Friðriksdóttir, Atli Jasonarson, Steinþór Steingrímsson and Einar Freyr Sigurðsson.  

We present a new corpus tool, ALEXIA, which is designed to facilitate research using the Icelandic Gigaword Corpus, but can be adapted to any text corpus. The tool aids the compilation and expansion of lexical databases and dictionaries by comparing the vocabulary of the database to that of the corpus in order to find gaps in the data. In particular, two well-known Icelandic language resources are incorporated into the design in order to explore the tool’s usage. We describe the design and functionality of the tool, how it can be adapted to various data sources and the process of filtering out noise in order to get a clean list of word candidates. Additionally, we present an extensive list of manually collected stop words that can be used to minimise distortion in research results.


Session 2: National CLARIN Centres


Help Yourself from the Buffet: National Language Technology Infrastructure Initiative on CLARIN-IS

Anna Björk Nikulásdóttir, Þórunn Arnardóttir, Jón Guðnason, Þorsteinn Daði Gunnarsson, Anton Karl Ingason, Haukur Páll Jónsson, Hrafn Loftsson, Hulda Óladóttir, Einar Freyr Sigurðsson, Atli Þór Sigurgeirsson, Vésteinn Snæbjarnarson and Steinþór Steingrímsson.  

In this paper, we describe how a fairly new CLARIN member is building a broad collection of national language resources for use in language technology (LT). As a CLARIN C-centre, CLARIN-IS is hosting metadata for various text and speech corpora, lexical resources, software packages and models. The providers of the resources are universities, institutions and private companies working on a national (Icelandic) LT infrastructure initiative.


CLARIN Knowledge Centre for Belarusian Text and Speech Processing (K-BLP)

Yuras Hetsevich, Jauheniya Zianouka and David Latyshevich.  

This paper presents the CLARIN Knowledge Centre for Belarusian Text and Speech Processing (K-BLP), which is based at the Speech Synthesis and Recognition Laboratory, the United Institute of Informatics Problems of the National Academy of Sciences of Belarus, Minsk. The CLARIN Knowledge Centre for Belarusian Text and Speech Processing is part of CLARIN ERIC, which holds the European Strategy Forum on Research Infrastructures (ESFRI) certification as a landmark research infrastructure.


CLARIN Flanders: New Prospects

Vincent Vandeghinste, Els Lefever, Walter Daelemans, Tim Van de Cruys and Sally Chambers.  

We describe the creation of CLARIN Belgium (CLARIN-BE) and, associated with that, the plans of the CLARIN-VL consortium within the CLARIAH-VL infrastructure for which funding was secured for the period 2021-2025.


Breakout Sessions on Specific Topics:

1. Annotation and Acquisition Tools

2. Repositories and National CLARIN Centres

3. Research Data Management, Metadata and Curation. Part 2 

4. Legal Issues Related to the Use of LRs in Research. Part 2 

Breakout room for CLARIN committees

13.45 - 14.30 

Keynote | Chair: Jan Hajič

Language Modeling and Artificial Intelligence

Tomáš Mikolov

Czech Institute for Informatics, Robotics and Cybernetics of the Czech Technical University in Prague, Czech Republic.   

Abstract: Statistical language modelling has been labelled as an AI-complete problem by many famous researchers of the past. However, despite all the progress made in the last decade, it remains unclear how much progress towards truly intelligent language models we have made. In this talk, I will present my view on what has been accomplished so far, and what scientific challenges are still in front of us. We need to focus more on developing new mathematical models with certain properties, such as the ability to learn continually and without explicit supervision, generalise to novel tasks from limited amounts of data, and the ability to form non-trivial long-term memory. I will describe some of our attempts to develop such models within the framework of complex systems.
Bio: Tomas Mikolov is a researcher at CIIRC, Prague. Currently he leads a research team focusing on the development of novel techniques within the area of complex systems, artificial life and evolution. Previously, he worked at Facebook AI and Google Brain, where he led the development of popular machine learning tools such as word2vec and fastText. He obtained his PhD at the Brno University of Technology for his work on neural language models (the RNNLM project) in 2012. His main research interest is to understand intelligence, and to create artificial intelligence that can help people to solve complex problems.

14.30 - 15.00


Social Coffee Break with Ben Verhoeven

15.00 - 16.00

Five-Minute Paper Presentations | Chair: Costanza Navarretta


The CIRCSE Collection of Linguistic Resources in CLARIN-IT

Rachele Sprugnoli and Marco Passarotti.  

In this paper, we present the collection of the linguistic resources for Latin made available by the CIRCSE Research Center in the CLARIN-IT repository. After an introduction about the history and the main research lines of the Center, the paper provides details on both the lexical and the textual resources that were built across more than a decade at the CIRCSE and that are now accessible in CLARIN-IT.


‘Cretan Institutional Inscriptions’ Meets CLARIN-IT

Irene Vagionakis, Riccardo Del Gratta, Federico Boschetti, Paola Baroni, Angelo Mario Del Grosso, Tiziana Mancinelli and Monica Monachini.  

This paper describes a project in the domain of Digital Epigraphy, named Cretan Institutional Inscriptions, developed at the Ca’ Foscari University of Venice. The project is supported by CLARIN-IT as part of the actions addressed to initiatives, projects and events in the field of humanities and social sciences. The main goal is to make the project visible through CLARIN channels with the hope that it will be a forerunner for other digital epigraphy projects in CLARIN. The article also illustrates the dockerisation process applied to the Cretan Institutional Inscriptions project, currently hosted on the CLARIN-IT servers.


IceTaboo: A Database of Contextually Inappropriate Words for Icelandic

Agnes Sólmundsdóttir, Lilja Björk Stefánsdóttir and Anton Karl Ingason.  

We present IceTaboo, a database of 2725 words that are inappropriate or offensive to at least some speakers in some contexts. Every word is coded for part of speech, a classification of reasons that trigger a negative reaction among some speakers, as well as information about the meaning expressed by the word. The database is released under an open CC BY 4.0 license on CLARIN and it is already being used in the development of an automatic proofreading tool, developed in collaboration with an industry partner in commercial software development. The proofreading tool is itself under development in an open repository on GitHub under an MIT license.


The Nature of Icelandic as a Second Language: An Insight From the Learner Error Corpus for Icelandic

Isidora Glisic and Anton Karl Ingason.  

The Icelandic L2 Error Corpus is an expanding collection of texts written by users of Icelandic as a second language, published on CLARIN. It currently consists of 17508 manually annotated errors in different categories pertaining to grammar, spelling, lexical and other issues. The corpus was used to perform a contrastive interlanguage analysis using a native speaker reference corpus, comparing it to the Icelandic Error Corpus. This paper presents the corpus and the first results of the analysis.


Swedish Word Metrics: A Swe-Clarin Resource for Psycholinguistic Research in the Swedish Language

Erik Witte, Jens Edlund, Arne Jönsson and Henrik Danielsson.  

We present Swedish Word Metrics (SWM), a new CLARIN resource for calculations of lexical and sub-lexical metrics of Swedish words. The calculations at SWM are based on the AFC-list, which is a freely available lexical database with 816404 entries containing spellings, phonetic transcriptions, word-class assignments, and word frequency data. Besides allowing for easy access to the AFC-list data, the SWM site calculates metrics of orthographic and phonological neighbourhood density, phonotactic probability, orthographic transparency, as well as phonetic and orthographic isolation points. The source code for all calculations has been made publicly available and can be extended with more types of word metrics, whereby it forms a framework for continued word-metric developments in the Swedish language.


Insights on a Swedish Covid-19 Corpus

Dimitrios Kokkinakis.  

The COVID-19 pandemic has had a serious impact on people all over the world, from mental and physical health to economic downturn, to education and social relationships, while political decisions in many countries have had a profound impact on the lives of all people regardless of age. Many of these effects can be studied with statistical and qualitative data, such as collected questionnaires and sickness absence rates. But large-scale studies require expertise in multiple domains and from many points of view. Språkbanken Text continuously collects text from various sources. To address the lack of an available Swedish COVID-19-related dataset, we started to build a Swedish COVID-19 corpus (sv-COVID-19). Various tools for e.g. lexical, semantic or pragmatic/discourse analyses can then be applied in order to answer relevant questions on e.g. how people, on a larger scale than what can be obtained through qualitative studies, experienced their everyday life through the different phases of the COVID-19 crisis, or how political decisions and their consequences are described and discussed.


Voices from Ravensbrück. Towards the Creation of an Oral and Multi-Lingual Resource Family

Silvia Calamai, Jeannine Beeken, Henk van den Heuvel, Max Broekhuizen, Arjan van Hessen, Christoph Draxler and Stefania Scagliola.

This paper describes a pilot project aimed at introducing a new type of corpus into the CLARIN resource family tree, called ‘narratives’. To this end, a multilingual corpus of existing interviews with survivors of the Ravensbrück concentration camp will be curated following CLARIN-compliant standards. During WWII, this German camp imprisoned 130,000 women of 20 different nationalities. This diversity creates the opportunity to build a unique corpus of gender-specific interviews, covering the same topic, narrated in a similar structure, but voiced in different languages. The corpus will also be enriched with various types of annotation (e.g. transcripts).


Q & A

16.00 - 16.15

Day 3 Wrap-Up and Closing with Illustrations by Marta Fioravanti @nonlineare