Programme CLARIN Annual Conference 2022

Event name: CLARIN Annual Conference 2022
Date: Monday, 10 October 2022 - Wednesday, 12 October 2022 (all times are CEST, UTC+2)
Location: OREA Hotel Pyramida Praha, Czechia, and online
Twitter Hashtag: #CLARIN2022
 
 

Conference Programme Outline

Monday 10 October 2022

9:00 - 10:30
  • Centre Assessment Committee (CAC), room: Mars
  • CLARIN National Coordinators' Forum (NCF) Part 1, room: Jupiter
10:30 - 11:00 Coffee break
11:00 - 13:00
  • CLARIN National Coordinators' Forum (NCF) Part 2, room: Jupiter
  • Standing Committee on CLARIN Technical Centres (SCCTC), room: Mars
  • User Involvement Committee, room: Observatory
13:00 - 14:00 Lunch break
14:00 - 15:30
  • CLARIN Standards Committee (CSC), room: Mars
  • Knowledge Infrastructure Committee (KIC), room: Observatory
  • CLARIN Legal Issues Committee (CLIC), room: Uranus
15:30 - 16:00 Coffee break
16:00 - 16:15 Conference opening session; Steven Krauwer Award Sun I+II
16:15 - 17:00 Keynote by Peter Leinen Sun I+II
17:00 - 17:40 Panel | CLARIN and Libraries: Infrastructures Working Together Sun I+II
17:40 - 18:40 Abstract presentations Sun I+II
18:40 - 19:00 SSH Open Cluster: Life After SSHOC Sun I+II
19:15 - 22:00 Welcome reception  
Tuesday 11 October 2022
09:00 - 09:10 Presentation by programme committee chair Sun I+II
09:10 - 09:30 Presentation by local National Coordinator Sun I+II
09:30 - 10:00 Pitches by CLARIN committees Sun I+II
10:00 - 10:30 State of the technical infrastructure Sun I+II
10:30 - 11:00 Coffee break  
11:00 - 13:00 Abstract presentations Sun I+II
13:00 - 13:05 Group picture Sun I+II
13:05 - 14:00 Lunch  
14:00 - 14:20 State of the infrastructure Sun I+II
14:20 - 15:05 Keynote by Barbara Plank Sun I+II
15:05 - 15:25 Abstract presentations Sun I+II
15:25 - 15:55 Coffee break  
15:55 - 16:55 Teaching with CLARIN Sun I+II
16:55 - 17:55 PhD student session Sun I+II
17:55 - 18:55 Bazaar Lobby
19:30 - 22:00 Conference dinner  
Wednesday 12 October 2022
09:00 - 10:00 Abstract presentations Sun I+II
10:00 - 10:40 Panel | CLARIN and Other SSH Platforms: You’ll Never Innovate Alone Sun I+II
10:40 - 11:10 Coffee break  
11:10 - 11:55 Keynote by Ariane Nabeth-Halber Sun I+II
11:55 - 12:55 Abstract presentations Sun I+II
12:55 - 13:00 Closing remarks Sun I+II
13:00 - 14:00 Lunch  

 


Keynotes

 

Enabling Digital Research - The German National Library as Part of a (National) Research Infrastructure

 
 

Peter Leinen

German National Library

Monday, 10 October, 16.15 - 17.00
 

Is Human Label Variation Really So Bad for AI?

 
 
 
 
Barbara Plank
 

LMU Munich

Tuesday, 11 October, 14.20 - 15.05

100 Years of Speech Recognition, the Data Fork and the Conversational Challenge. Stories from Today’s Speech Industry

 

Ariane Nabeth-Halber

ViaDialog

Wednesday, 12 October, 11.10 - 11.55


 

Conference Programme Details

Day One

Time Monday 10 October 2022 Room
9:00 - 10:30
  • Centre Assessment Committee (CAC), room: Mars
  • CLARIN National Coordinators' Forum (NCF) Part 1, room: Jupiter
10:30 - 11:00 Coffee break
11:00 - 13:00
  • CLARIN National Coordinators' Forum (NCF) Part 2, room: Jupiter
  • Standing Committee on CLARIN Technical Centres (SCCTC), room: Mars
  • User Involvement Committee, room: Observatory
13:00 - 14:00 Lunch break
14:00 - 15:30
  • CLARIN Standards Committee (CSC), room: Mars
  • Knowledge Infrastructure Committee (KIC), room: Observatory
  • CLARIN Legal Issues Committee (CLIC), room: Uranus
15:30 - 16:00 Coffee break  
  Start of the Conference
  • Sun I+II
16:00 - 16:15
  • Conference opening session
  • Steven Krauwer Award
  • Sun I+II
16:15 - 17:00

Keynote | Chair: Andreas Witt

Enabling Digital Research - The German National Library as Part of a (National) Research Infrastructure

Peter Leinen

Abstract: The growing digitisation in memory institutions such as libraries and archives results in completely new tasks in these institutions, such as the collection and long-term archiving of a continuously increasing amount of digital objects and data. At the same time, it facilitates new forms of utilisation, such as text and data mining, as well as new forms of cooperation with the scientific community. This opportunity must be seized and actively shaped. The lecture will present the strategic orientation of the German National Library as well as the experiences made and challenges faced by the library in concrete cooperation with the scientific community. Closely linked to these experiences is the German National Library’s commitment to the development of the National Research Data Infrastructure, which will be discussed in the second part of the lecture.
Bio: After his studies in mathematics and computer science, Peter Leinen worked in the area of applied mathematics and scientific computing, with the goal of developing high-performance, massively parallel methods for solving technical problems - such as problems in fluid mechanics - on high-performance and supercomputers. In 2004, Peter Leinen took over the management of the Computing Centre at the University of Trier and, in 2011, he accepted the appointment to the same position at the University of Mannheim. In 2016, Peter Leinen followed a call from the German National Library to take over the management of the Department of Information Infrastructure. Responsible for the entire IT supply of all processes of the German National Library, one focus of his work is on the methods and procedures for building, archiving and providing the digital collection, as well as for networking with science, especially with the digital humanities. His role also includes representing the German National Library in the development of the German National Research Data Infrastructure, as well as involvement in different consortia and cross-cutting issues such as standards, authority data and long-term archiving of research data. Peter Leinen is one of the elected spokespersons of nestor, the competence network for digital long-term preservation, and is active in other initiatives such as the German Initiative for Network Information (DINI).
  • Sun I+II
17:00 - 17:40

Panel | Chair: Martin Wynne

CLARIN and Libraries: Infrastructures Working Together

Sally Chambers, Andreas Witt, Peter Leinen

Abstract: While new research infrastructures for the arts, humanities and social sciences, such as CLARIN and DARIAH, have emerged in recent decades, libraries have been for many centuries the most important resource for researchers, and remain so today in the digital age. For virtual, digital, and distributed research infrastructures such as CLARIN to be effective, they need to work closely with libraries, which play key roles as creators and curators of digital data, and as intermediaries between researchers and digital data, tools and expertise. The CLARIN and Libraries workshop took place at KB National Library of the Netherlands in May 2022. This was the first workshop with the explicit aim of bringing together the international CLARIN community and research libraries in order to discuss issues relating to the delivery of digital content for researchers, and to plan the practical steps for future collaboration. This panel session will feature participants from the workshop, who will be discussing what these next steps should be.
  • Sun I+II
 

Thematic session: Language resources and CLARIN centres

Chair: Costanza Navarretta

  • Sun I+II
17:40 - 18:00

ACTER 1.5: Annotated Corpora for Term Extraction Research

Ayla Rigouts Terryn, Veronique Hoste and Els Lefever

Abstract: This contribution presents version 1.5 of the Annotated Corpora for Term Extraction Research (ACTER) dataset. It includes domain-specific corpora in three languages (English, French, and Dutch) and four domains (corruption, dressage (equitation), heart failure, and wind energy). Manual annotations of terms and Named Entities are available for each corpus, with almost 20k unique annotations in total. Significant improvements have been made, most notably the inclusion of sequential annotations. Additionally, an online demo – D-Termine – has been launched for monolingual and bilingual automatic term extraction from parallel corpora, based on the dataset.
 
18:00 - 18:20

Linguistic Autobiographies. Towards the Creation of a Multilingual Resource Family

Silvia Calamai, Rosalba Nodari, Claudia Soria and Alessandro Carlucci

Abstract: This paper describes a project aimed at adding a new type of corpus to the CLARIN resource family tree, called ‘linguistic autobiographies’. In a linguistic autobiography the writer explicitly reflects on the relationship between him/herself and the language. This genre is fruitfully used in different educational settings, and research has shown that it helps to uncover the social, affective, and psychological dimensions of language learning. The potential of a multilingual collection is discussed starting from Italian data.
 
18:20 - 18:40

CLARIN-LV: Many Steps till Operation

Inguna Skadiņa, Ilze Auziņa, Roberts Darģis, Eduards Lasmanis and Arnis Voitkāns

Abstract: Inspired by the previous submissions from CLARIN national consortia, in this abstract we summarize the most important steps and achievements during the implementation phase of the CLARIN-LV research infrastructure. Although CLARIN-LV was an active supporter of CLARIN goals during the preparatory phase, Latvia joined CLARIN ERIC only in 2016. During the last five years CLARIN-LV became an active C-centre, supporting and collaborating with digital humanities, developing and sharing language resources developed by the Latvian academic community, as well as an active contributor to and participant in CLARIN international activities.
 
18:40 - 19:00

SSH Open Cluster: Life After SSHOC | Chair: Francesca Frontini

Francesca Frontini, Matej Ďurčo, Laure Barbot, Carsten Thiel  

Abstract: Although the H2020 project SSHOC came to an end in April 2022, it will have a lasting effect on the RI landscape. In this short session, four speakers will zoom in on the project’s lasting impact on the collaboration between research infrastructures from the SSH domain. As the founding members of the SSH Open Cluster, they will maintain the recently launched SSH Open Marketplace, a discovery platform for resources such as data, tools, publications, and training materials. (An overview of the outcome of the SSHOC project can be found in the legacy booklet.) The SSH domain’s alignment with the development of the European Open Science Cloud (EOSC) at national, institutional and European level will also remain on the agenda.
  • Sun I+II
19:15 - 22:00  Welcome reception  

Day Two

Time Tuesday 11 October 2022 Room
09:00 - 09:10 Presentation by programme committee chair
  • Sun I+II
09:10 - 09:30 Presentation by local National Coordinator | Chair: Maciej Piasecki

Eva Hajicova, Barbara Hladká, Martin Popel

  • Sun I+II
09:30 - 10:00 Pitches by CLARIN committees | Chair: Antal van den Bosch
  • Sun I+II
10:00 - 10:30 State of the technical infrastructure | Chair: Antal van den Bosch
  • Sun I+II
10:30 - 11:00 Coffee break  
 

Thematic session: Tools and workflows. Part 1

Chair: Stelios Piperidis

  • Sun I+II
11:00 - 11:20

BabyLemmatizer: A Lemmatizer and POS-tagger for Akkadian

Aleksi Sahala, Tero Alstola, Jonathan Valk and Krister Lindén

Abstract: We present a hybrid lemmatizer and POS-tagger for Akkadian, the language of the ancient Assyrians and Babylonians, documented from 2350 BCE to 100 CE. In our approach the text is first POS-tagged and lemmatized with TurkuNLP trained with human-verified labels, and then post-corrected with dictionary-based methods to improve the lemmatization quality. The postcorrection also assigns labels with confidence scores to flag the most suspicious lemmatizations for manual validation. We demonstrate that the presented tool achieves a Lemma+POS labelling accuracy of 94%, and a lemmatization accuracy of 95% in a held-out test set.
 
11:20 - 11:40

WebLicht-Batch. A Web-Based Interface for Batch Processing Large Input with the WebLicht Workflow Engine

Claus Zinn and Ben Campbell

Abstract: WebLicht is a workflow engine that gives researchers access to a well-inhabited space of natural language processing tools that can be combined into tool chains to perform complex natural language analyses. In this paper, we present WebLicht-Batch, a web-based interface to WebLicht’s chainer back-end. WebLicht-Batch helps users to automatically feed large input data, or input data of multiple files into WebLicht. It disassembles large input into smaller, more digestible sizes, feeds the resulting parts into WebLicht’s pipelining and execution engine, and then assembles the results of such processing into files that preserve the usual input-output dichotomy.
 
11:40 - 12:00

The CLaDA-BG Dictionary Creation System: Specifics and Perspectives

Zhivko Angelov, Kiril Simov, Petya Osenova and Zara Kancheva

Abstract: The paper reports on the system for creating dictionaries within the CLaDA-BG infrastructure. At the heart of the system lies the BTB-Wordnet, around which all other language resources are organised. These include various dictionaries, ontologies and corpora. The specific features and functionalities are outlined. Also, the rationale behind the construction of such a system is given.
 
 

Thematic session: Tools and workflows. Part 2

Chair: Starkaður Barkarson

  • Sun I+II
12:00 - 12:20

A Lightweight NLP Workflow Engine for CLARIN-BE

Adriaan Lemmens and Vincent Vandeghinste

Abstract: This paper presents our work in progress on building a flexible workflow engine. The architecture of the engine is based on the message queue programming model and is implemented in Python. NLP tools are exposed as remotely executable tasks and wrapped in standard abstractions. The main contribution is to provide a uniform API for defining various kinds of tasks and workflows. The use case of the library is building a text analytics platform for digital humanities, providing a usage-friendly interface to predefined workflows.
 
12:20 - 12:40

Natural Language Processing for Literary Studies: Graph Literary Exploration Machine (GoLEM)

Agnieszka Karlińska, Wiktor Walentynowicz, Jan Wieczorek and Maciej Maryl

Abstract: This paper presents a design of a web-based application for the analysis and visualisation of relations between terms, named entities, and topics. The goal of this project is to create, in close cooperation between the literary research community and the IT professionals, a comprehensive workflow tailored to the specificity of literary studies at large and to the current debates and trends in humanities research. The application brings together the already existing tools offered by CLARIN-PL and the resources and tools developed at Dariah.lab. It consists of three components: the named entity relationship analysis component, the terminology extraction component and the topic modelling component. The whole system is not only based on an automatic operation of the components but is constructed to allow user intervention in individual sub-processes of the entire processing pipeline. A strong emphasis is put on the use of metadata of the analysed texts (e.g., for filtering and grouping documents) and the visualisation of results.
 
12:40 - 13:00

Supporting Ancient Historical Linguistics and Cultural Studies with EpiLexO

Valeria Quochi, Andrea Bellandi, Michele Mallia, Alessandro Tommasi and Cesare Zavattari

Abstract: This contribution presents a system of independent software components meant to support the creation of ecosystems of interrelated language data (i.e. lexica linked to textual testimonies, concepts, metadata, bibliographic references, and other external lexical resources) according to the current state-of-the-art representational models for the semantic web. The system is implemented as a set of autonomous servers exposing RESTful APIs that in principle can serve different frontend applications and use cases. In this work they serve the EpiLexO GUI application designed and geared to support scholars of ancient languages of fragmentary attestation in their studies. The development of both the back-ends and the front-end is still work in progress, but a first version is ready for use.
 
13:00 - 14:00 Lunch  
14:00 - 14:20 State of the infrastructure
  • Sun I+II
14:20 - 15:05

Keynote | Chair: TBA

Is Human Label Variation Really So Bad for AI?

Barbara Plank 

Abstract: Human variation in labeling is typically considered noise. Annotation projects in computer vision and natural language processing typically aim at minimizing human label variation, in order to maximize data quality and in turn optimize and maximize machine learning metrics. However, is human variation in labeling noise, or can we turn such information into signal for machine learning? In this talk, I will first illustrate the problem and then discuss approaches to tackle this fundamental issue.
Bio: Barbara Plank’s background is in computer science and language technologies. She is an expert in Natural Language Processing (NLP) and the chair (full professor) of AI and Computational Linguistics at LMU Munich, where she leads the MaiNLP research lab at the Center for Information and Language Processing (CIS). In her view, NLP should be able to handle any language and any domain, which is not easily achieved. Her work thus focuses on making NLP models more robust so that they can learn under circumstances of selection or annotation bias, or in cases of limited data availability, for example. Plank plays a pivotal role in several NLP research projects (DFF Sapere Aude, Amazon Research Award) and is involved in teaching and supervision activities for NLP scholars. She recently received an ERC Consolidator grant to work on natural language understanding for non-standard languages and dialects.
  • Sun I+II
 

Thematic session: Legal questions

Chair: Monica Monachini

  • Sun I+II
15:05 - 15:25

EU Data Governance Act: New Opportunities and New Challenges for CLARIN

Paweł Kamocki and Krister Lindén

Abstract: The Data Governance Act was proposed in late 2020 as part of the European Strategy for Data and adopted on 30 May 2022 (as Regulation 2022/868). It will enter into application on 24 September 2023. The Data Governance Act is a major development in the legal framework affecting CLARIN and the whole language community. With its new rules on the re-use of data held by the public sector bodies and on the provision of data sharing services, its new limits on international transfers of non-personal data, and especially its encouragement of data altruism, the Data Governance Act creates new opportunities and new challenges for CLARIN ERIC. This abstract briefly analyses the provisions of the Data Governance Act, and aims at initiating the debate on how they will impact CLARIN and the whole language community.
 
15:25 - 15:55 Coffee break  
15:55 - 16:55
Teaching with CLARIN 
  • Sun I+II
16:55 - 17:55 PhD student session | Chair: Darja Fišer

Language Representation Models for Low and Medium-Resource Languages

Jón Daðason

Abstract: Language Representation Models based on the Transformer architecture have obtained state-of-the-art results on a wide variety of natural language processing (NLP) tasks. The Transformer is far more scalable than previous neural network architectures, allowing larger and more powerful models to be trained than ever before. The size of the largest Transformer-based model has grown from 110 million parameters in 2018 (GPT) to 1.6 trillion parameters in 2021 (Switch Transformer). However, this growth comes at the cost of significantly higher computational requirements. Current state-of-the-art models can require several weeks to train, using thousands of high-end graphics processing units (GPUs) or tensor processing units (TPUs). Furthermore, there has been an exponential growth in the amount of text these models are trained on. Transformer-based language models are typically first pre-trained on large amounts of text on an unsupervised task, such as predicting the next word given a preceding text sequence. Once pre-trained, the model can be fine-tuned on more practical tasks (known as downstream tasks in this context), such as question answering and automatic text summarisation. It has been shown that increasing the size of pre-training corpora can significantly improve downstream performance. As a result, the size of the largest pre-training dataset for English has grown from 800 million words in 2018 (GPT) to 1.4 trillion words in 2022 (Chinchilla). For research on low and medium-resource languages, it becomes increasingly important to investigate how the Transformer architecture can best be utilised in settings where pre-training data is scarce and access to computational resources is limited.
We will evaluate the data efficiency of various pre-training methods, experiment with how small monolingual pre-training corpora can best be augmented with multilingual or machine-translated text and perform a thorough evaluation of commonly used text filtering techniques. Furthermore, we will consider how different subword tokenisation algorithms and vocabulary sizes affect downstream performance under low and medium-resource settings.

Beyond Babylonian Confusion: A Case Study-Based Approach for Multilingual NLP on Historical Literature

Tess Dejaeghere

Abstract: Text is inherently infused with historic linguistic properties and descriptions of traditions, cultural practices and social structures – making it an exceptional source for linguists and historians to study and quantify past and present trends. Text analysis is a core component of both Natural Language Processing (NLP) and the (Digital) Humanities, but there exist vast differences in user culture, end goals and the level of technical knowledge between these two research fields. While NLP researchers adhere to rigid workflows to cater to linguistics-centred questions and effectuate language model improvement, DH scholars seek to answer meta-textual questions in a heuristic framework. Partly due to this divergent perspective on text as data, a cross-pollination of methodologies between these two research fields remains rather limited. Specifically, Named Entity Recognition (NER) and Sentiment Analysis (SA) are heavily exploited information extraction tools in NLP applications. Despite having shown great potential in humanities research settings and cultural heritage institutions, NER and SA are often overlooked as research tools by literary scholars and historians. A lack of familiarity with NLP tools, along with the known technical challenges that come with literary-historical text processing, are often enough to deter digital humanists from applying NER or SA in their workflows, possibly leaving a wealth of valuable insights untapped. Answering the urgent calls for a methodology-driven agenda in the Digital Humanities, this research seeks to develop transparent, reproducible and durable step-wise workflows with the integration of NER and SA for application in heuristic literary-historical research settings.
For this purpose, a large multilingual corpus of travelogues in English, Dutch, German and French, ranging from the 15th to the 20th century is being collected, as the exceptional characteristics of travelogues as highly idiosyncratic lenses into the past account for a wide range of linguistic and historical variance. A set of relevant literary-historical research questions will be investigated with the support of NER and SA. Rather than focusing on resolving the research questions, our research takes a step back and aims to develop over-arching workflows regarding open-source tool selection and evaluation, tool adaptation and mitigating the challenges inherent to literary-historical and multilingual corpus processing. Finally, the workflows and code used for this purpose will be made open-source to encourage methodological transparency and reproducibility in similar future research. Our research aims to generate much-needed insights regarding the potential and limitations of NER and SA in literary-historical research and intends to foster a tool- and data-critical attitude among digital humanists through the development of clear-cut NLP-workflows, benefiting the CLARIN-infrastructure and DH-community alike.

Use Case: Searching for Strong Verbs in the Historical CourantenCorpus

Machteld de Vos

Abstract: The Dutch language contains several strong verbs, which are divided into seven different verb classes based on the ablaut. Over time, some of these verb classes have undergone language change. This happened, for example, to the verbs from verb class III, as can be illustrated with the verb zingen, ‘to sing’. This strong verb displayed a three-part ABC-paradigm in Middle Dutch: zing(en) ‘sing’ – zang, zongen ‘sang’ – gezongen ‘sung’, with zang being used for preterite singular (‘I, you (sg.), he or she sang’) and zongen for preterite plural (‘we, you (pl.) or they sang’; De Smet, 2021: 17; Van der Sijs, 2021: 453). Nowadays, Standard Dutch has lost the preterite singular variant in these verbs and uses the same form for singular and plural preterite, thus displaying an ABB-paradigm: zing(en) ‘sing’ – zong(en) ‘sang’ – gezongen ‘sung’. In a large corpus study on strong verbs in Dutch, De Smet (2021) has shown that this change solidified in the 17th century – a century that is also known as the formative period in the standardisation of Dutch (Van der Wal, 1995; Van der Sijs, 2021). Even though De Smet acknowledges that standardisation and concomitant codification of verb paradigms in the late 16th and early 17th century may have played a role here, she focuses her analysis on possible language internal factors influencing this change (2021: 108-109). The language external factor has therefore not been studied in detail yet, leaving the following question unanswered: could standardisation have played a part in this language change? In my poster presentation, I will present the results of a study carried out to try and answer this question, following the precept versus practice method (as detailed in e.g. Anderwald, 2011; 2012, for strong verbs in English). 
Firstly, I performed a comprehensive study of the norms for 41 17th-century class III-verbs by mapping out all stances on these verbs encountered in the 10 normative grammars on Standard Dutch written between 1550 and 1650. Secondly, I investigated the actual use of these verbs in the new CourantenCorpus, a corpus containing Dutch newspapers from 1618 to 1700 (ca. 19 million words in total). The second study especially provided a use case for CLARIAH tools such as Blacklab (e.g. De Does et al., 2017), the GaLAHaD platform for linguistic annotation of historical Dutch (Brouwer et al., 2022), and the PIE-based framework for tagging and lemmatising historical text (Creten et al., 2020). The focus of this poster presentation will therefore be on this second study and its use of the corpus and tools, addressing questions such as: were the search possibilities sufficient to find the relevant data? What was the impact of the inaccuracies of automatic linguistic annotation? To what extent can we still draw valid conclusions from noisy material? What can be done to circumvent these issues?

Adjectivisation of Participles in Lithuanian

Laima Jancaitė

Abstract: Lithuanian is a synthetic language with numerous forms. Particularly verbs have a variety of forms – they can be conjugated, inflected, and uninflected. Inflected verb forms are participles, for example, valgantis vaikas 'an eating child', nupirktas maistas 'food purchased'. Even though a participle is regarded as a verb form, it has the properties of both verbs and adjectives. It has the meaning of a verb, categories of voice and tense, and valency similar to that of verbs (vakar nupirktas maistas 'food that has been bought yesterday'); it can also be inflected like an adjective, have degrees, pronominal forms, and attributive as well as predicative functions. Some Lithuanian participles lose their verbal properties and become adjectivised, for example, įprastas ('usual'; verb įprasti 'get used to'); nevykęs ('lame'; verb nevykti 'not to happen'); užimtas žmogus ('busy person'; verb užimti 'to occupy'); valgomos uogos ('edible, not poisonous berries'; verb valgyti 'to eat'). They are used as adjectives and are often separated from verbal meaning; moreover, they lose the categories of tense and voice. Usually, these words are regarded as adjectives in dictionaries. Sometimes it is difficult to differentiate between adjectivised and simple participles. It is problematic to distinguish when participles should have separate entries in dictionaries; also, part-of-speech tagging poses some problems. The aim of the thesis is to identify criteria to differentiate between adjectives and participles that have a similar function. The research is based on Lithuanian corpora. Initially, a list of criteria to identify adjectivised participles was created. These criteria are semantic, grammatical, derivational, and statistical. Some of them are more subjective, for example, the synonyms and antonyms of participles and adjectives, or the changed lexical meaning of the participle. 
The other criteria are more objective, for example, the derivational property of the participle to form an adverb (papildomas 'additional', papildomai 'additionally', verb papildyti 'to replenish'), or the tendency for participles to combine with adverbs of measure/degree (labai mokytas 'highly-educated', verb mokyti 'to teach'). The list of criteria was created by analysing theoretical works and conducting a pilot study. During the pilot study, the 160 most frequent verbs and 43 participial adjectives from the Lexical Database of Lithuanian Language Usage (2021, https://kalbu.vdu.lt/mokymosi-priemones/leksikonas/) were analysed. It is planned that this database will soon be available in the CLARIN repository. This database is based on the Pedagogic Corpus of Lithuanian (669 000 words, https://clarin.vdu.lt/xmlui/handle/20.500.11821/50). The selected words were analysed in this corpus and, when needed, in the Corpus of Contemporary Lithuanian Language (https://clarin.vdu.lt/xmlui/handle/20.500.11821/16). Now it is planned to analyse more words and to classify participles by the degree of adjectivisation. Also, it is planned to use statistical information (for example, ratios of participles to other verbal forms and the frequency of participles) for automatic analysis. One of the possible results of the analysis could be word lists in which participles would be classified by the degree of adjectivisation. They would be useful for morphological and syntactic analysis, lexicographic works, etc. It is planned that they would appear in the CLARIN repository.

Semantic Classification of Prepositions in BulTreeBank WordNet

Zara Kancheva

Abstract: The aim of the thesis is to present a model for incorporating prepositions in the structure of BulTreeBank WordNet (BTB-WN). Prepositions are viewed as a necessary part of speech in a lexical resource such as wordnet, because their integration can be beneficial for several NLP tasks (such as semantic annotation, word sense disambiguation, machine translation, parsing, knowledge extraction, word embeddings, text analysis and generation, etc.). A semantic classification of Bulgarian prepositions is done and a model for preposition synsets and relations in BTB-WN will be presented.

The (Un)translatability of Languages: Discursive Strategies of Forensic Psychiatrists

Agnieszka Karlińska

Abstract:In my paper, I will introduce the research project underlying my PhD thesis, the process of creating a corpus and the preliminary results of my analysis. My study is situated at the intersection of legal and medical sociology, sociolinguistics and corpus linguistics. I aim to reconstruct discursive strategies adopted by forensic psychiatrists in response to the challenges and contradictions related to their role in the criminal justice system. When assessing the defendant’s sanity, forensic psychiatrists are required to draw a line to distinguish crime and insanity. This involves bringing scientific reliability in line with the logic of the law and also to reconcile two discourses: legal and medical. The forensic psychiatrists do so through linguistic means: by formulating a narrative about the perpetrator of a crime with the use of categories taken from the social and medical sciences. The process—although appearing to be rooted in neutrality—is not independent of the broader social context including gender stereotypes, particularly stereotypes of femininity. To date, this issue has remained on the margins of interest in the social science. I intend to fill this gap and explore how the meaning of normality is developed at the interface of psychiatry and law. The study will answer the question of how medical categories are translated into legal categories and, more broadly, how an agreement between representatives of (radically) different disciplines can be reached. I will also capture discursive mechanisms of constructing gender representations in documents issued by expert psychiatrists. The study is based on the mixed-methods approach. I combine methods that are extremely rarely employed in analysis conducted at the intersection of law, psychiatry and language: computational text analysis and discourse analysis developed in the sociology of science and discursive psychology. 
As the core research material, I compiled a corpus of 225 psychiatric reports issued by Polish forensic experts, totalling about 1.5 million tokens. Employing tools provided by CLARIN-PL, I identify the distinctive features of the language of forensic psychiatry and of forensic psychiatric reports as a genre. I compare, on the one hand, forensic psychiatric reports with strictly medical and legal texts, and, on the other hand, reports concerning men with reports on women. I analyse lexical, grammatical and narrative features, and, further on, identify the interpretative repertoires used by forensic experts. Among other things, I conduct stylometric analysis using the Open Stylometric System WebSty and semantic analysis using plWordNet. Written psychiatric reports as texts are still under-researched. My research is the first case of applying computational text analysis methods to explore these types of documents. The methodological framework I have created and the results of my analysis can serve as new showcases for the CLARIN infrastructure aimed at researchers in the fields of law, medicine and social sciences. They can also provide a starting point for comparative research carried out within other legal systems and for expanding the CLARIN family of legal corpora.

Neural Metaphor Detection for Slovene

Matej Klemen

Abstract: A metaphor is an expression that uses a comparison with another concept for rhetorical effect. For example, instead of saying ‘his words were offensive’, we might say ‘his words cut deeper than a knife’. In this case, words did not physically cause a cut; instead, the words caused a pain that is analogous to a cut. Metaphors are ubiquitous in language and add colour to our conversations. Because of this, having the ability to detect and form metaphors has potential applications in machine translation and automatic creative writing, such as news headline generation or rephrasal. In addition, the detection of metaphors allows us to analyse discourse and see how language evolves. In our work, we present initial experiments on the automatic token-level detection of metaphors for the Slovene language. To do so, we leverage several resources from the CLARIN resource repository:
  • The KOMET and G-KOMET corpora. KOMET is a corpus containing 259 839 tokens across 13 963 sentences and covers journalistic, fiction, and online text. G-KOMET is smaller, contains 52 955 tokens across 5 695 sentences, and covers speech transcripts. Both corpora contain metaphor signals, direct, indirect, and borderline metaphors. In our work, we consider direct and indirect metaphors as positive examples (metaphors), and the rest as negative examples (not metaphors).
  • The SloBERTa and CroSloEngual BERT language models. SloBERTa is a large transformer model, pre-trained on a large collection of Slovene text. Similarly, CroSloEngual BERT is a large transformer model, but it is pre-trained on text from two languages in addition to Slovene: Croatian and English.
We additionally experiment with a multilingual model XLM-RoBERTa, which is not part of the repository and is trained on data from 100 languages. To apply the models to our problem, we take existing pre-trained weights and fine-tune them for metaphor detection on the two corpora.
Specifically, we fine-tune the models by minimising the cross-entropy between the predicted metaphoricity tags and the annotated metaphoricity tags. We evaluate our models quantitatively using the token-level F1 score and qualitatively by analysing examples of correct and incorrect predictions. We perform our experiments on two versions of each corpus: the original (unmodified) corpus and the corpus with only noun and verb metaphors. We find that the models reach a reasonably high detection accuracy on the first version (up to 0.60 F1 score), but perform less well on the second version (up to 0.41 F1 score), which contains harder-to-detect but semantically more interesting metaphors. The best detection accuracy is consistently achieved by the monolingual SloBERTa, followed by the trilingual CroSloEngual BERT and the multilingual XLM-RoBERTa. Additionally, we test whether combining the training data of both corpora can improve the metaphor detection accuracy and find that it can lead to small improvements.
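The token-level F1 score mentioned above can be illustrated with a minimal sketch. It assumes binary tags (1 for a metaphorical token, 0 otherwise); the function name and the toy tag sequences are hypothetical and are not taken from the authors' code or data.

```python
from typing import Sequence


def token_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 score for the positive (metaphor) class over aligned token tags."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")
    # Count true positives, false positives and false negatives per token.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical tags): 2 true positives, 1 false positive, 1 false negative.
gold_tags = [1, 0, 1, 1, 0]
pred_tags = [1, 1, 0, 1, 0]
score = token_f1(gold_tags, pred_tags)
```

Because the score is computed only over the positive class, the many non-metaphorical tokens in a corpus do not inflate it the way plain accuracy would.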

Sentiments towards Migrants from/via Southeast Europe to Austria: Sentiment Analysis of Austrian Newspapers

Lucija Krušić

Abstract: Following the refugee crisis of 2015, migration has become a widely explored topic in the fields of Natural Language Processing (NLP) and sentiment analysis (SA) (Nerghes and Lee, 2018). The focus of SA is the automatic detection of sentiments, emotions and opinions found in textual data (Liu, 2012). This contribution proposes sentiment analysis of Austrian newspapers published in the period between the 18th and mid-20th century, with a focus on the development and variation of sentiments towards migration from South-East Europe to Austria. This long-term perspective, following three significant migration waves (Rupnow, 2017), might facilitate a deeper understanding of the current climate as well as potentially uncover past biases towards migrants. Austria has a long tradition of being a destination for migrants and refugees, partly because of its central position in Europe, but also due to its political and economic circumstances (Rupnow, 2017). However, the political narrative in Austria still avoids the label of an ‘immigrant country’ (Rheindorf & Wodak, 2018). So far, sentiment analysis of migration discourse in German has mainly focused on social media data (Backfried & Shalunts 2016; Heidenreich et al. 2020; Nerghes & Lee 2018). This contribution aims to fill that gap by investigating ANNO (Österreichische Nationalbibliothek, 2021), a corpus of Austrian newspapers. The intent is to contribute to the growing field of German sentiment analysis, both through the creation of a sentiment annotated newspaper corpus as well as through fine-tuning a state-of-the-art BERT (Devlin et al. 2019) model for the purposes of sentiment classification. The SA methodology is based on an existing BERT model, trained on German Europeana newspapers (Schweter, 2020). To facilitate fine-tuning of the model, sentiment annotations will be provided to selected portions of the ANNO corpus, following the best practices for corpus annotation (Schmidt, Dangel, & Wolff, 2021). 
The annotation studies will be conducted using CATMA (Gius et al., 2022), a dedicated annotation tool available through the CLARIN infrastructure. Furthermore, the annotators will be trained to ensure annotation quality. The annotated corpus will be made available in the Virtual Language Observatory (VLO), through the GAMS repository of the Institute Centre for Information Modelling in Graz, according to existing copyright restrictions. This will provide a meaningful addition to existing newspaper corpora available in the VLO as well as to resources for German sentiment analysis.

Validation of the Archivio Vi.Vo. Architecture: A Case Study on Rhotic Degemination in Tuscan Vernacular

Roberta Bianca Luzietti

Abstract: The research project aims to provide an experimental sociophonetic framework for the residual phonetic phenomenon of /r/ degemination in Tuscany, through the reuse of a historical oral archive initially conceived for other purposes. The selected archive was collected by the historian Angela Spinelli in the early 1980s in the rural area of Prato to preserve the memory of the post-WWII period. The archive was digitised in 2011 within the Gra.fo project and consists of more than 120 hours of interviews and related metadata. However, the archive is currently unavailable due to restructuring work on the original website. Given this context, the project goals are a) to insert Spinelli’s archive into the Archivio Vi.Vo. architecture, in order to safely store, preserve, organise and explore its content; b) to conduct the sociophonetic analysis on the productions of /r/. The first goal is an opportunity to test the adaptability of Archivio Vi.Vo. to different types of archives and eventually to favour the emergence of new ideas for the implementation of additional features. The second shows the potential of oral archives of the past to be a valuable source for conducting linguistic research. The project is an interdisciplinary work that benefits from the use of innovative technologies and tools developed within the CLARIN(-IT) research infrastructure and, at the same time, offers additional validation of the accessibility, adaptability, and reusability of the Archivio Vi.Vo. architecture.

Grammar-Aware Neural Methods to Modelling Meaning in Natural Language

Anssi Moisio

Abstract: Contemporary neural language models (LMs), such as BERT and GPT variants, learn implicit rules about how words are used in the provided training corpus, and are able to generalise some of this knowledge to novel sentences. However, these models require large training corpora to learn broad rules, since the rules are based on associations between specific words (or more accurately, the tokens that are derived from words). In contrast, humans (implicitly) know the grammar rules that generate syntactically correct sentences. This enables assigning meaning to statements that adhere to familiar grammar structures even when they contain novel combinations of words or morphemes. For example, a native speaker can understand the rarely used word ‘unmisunderstandable’ because she understands the rules that govern how morphemes can be combined to express a new meaning. It has been demonstrated that the currently used neural models lack this kind of robust capacity for compositional generalisation. Previous propositions to improve the LMs’ capacity for compositional generalisation have approached the problem from a few different angles. Some have focused specifically on grammar, and not meaning, and some have focused on pinpointing the neural models’ compositional capacities using artificial training and testing datasets. Recently, some efforts have been made to clarify and assess the question in the domain of realistic natural language tasks. My research project aims to advance this latter line of research. Specifically, I plan to extend the evaluation metrics to natural language tasks, such as machine translation and paraphrase detection. I will focus on Finnish, since morphologically rich languages offer a slightly different perspective on compositionality than languages like English, because much of the composition happens in morphology rather than in syntax. 
Ultimately, my aim is to develop new neural architectures that are better able to learn rule-abiding behaviour and thus better able to model the compositional nature of human languages. This work directly benefits from the resources in the Language Bank of Finland (LBF), provided by FIN-CLARIN. Most notably, I use the text corpora in the LBF to train language models, machine translation systems and other natural language processing systems. I have also contributed to the resources of the LBF: together with others at University of Helsinki, Aalto University, Finnish Broadcasting Company and other partners, we released in the LBF the 'Lahjoita puhetta' corpus of over 3000 hours of Finnish colloquial speech recorded by over 20k speakers all around Finland. About 1600 hours of this corpus is transcribed, providing a large corpus of spontaneous speech from all Finnish dialects also in textual form. This corpus has already helped us build better speech recognition systems, which will also be available in the LBF as a part of the Aalto-ASR tool.

ASR fine tuning for minority languages and speaker adaptation

Jim O'Regan

Abstract: The emergence of self-supervised methods in Automatic Speech Recognition (ASR) has led to an “ImageNet moment”: “base” models pretrained in an unsupervised manner on large collections of untranscribed speech can then be “fine tuned” for other tasks using much less data than would otherwise be required. In the case of ASR in particular, the wav2vec2 model, when pre-trained on a collection of multilingual data, has been shown to generalise to languages that were not seen in the pre-training process without resorting to techniques such as explicitly mapping phonemes. Thanks to the relative ease with which these models can be fine tuned to give high quality ASR models, there has been an explosion in the number of languages for which models are publicly available. Methods such as phoneme mapping can be adapted to the newer models with greater reliability, as a broader and more diverse set of languages can be drawn upon. We are investigating the use of these methods with the aim of providing ASR infrastructure through CLARIN not just for Swedish, but also for the minority languages of Sweden. In related recent work, we have found that continuing the fine-tuning of a model can be effective as a method for speaker adaptation, with relative reductions in Word Error Rate (WER) of between 20 and 40 percent with as little as 20 minutes of transcribed data, and an absolute WER of less than 1 percent with 16 hours of transcribed data. Drawing upon CLARIN resources related to the framing of terrorism in the Swedish parliament, we are creating sets of speaker-adapted models for each of the speakers in the parliamentary recordings. Just as the text of the parliamentary speeches, spanning multiple decades, provides insight into changes in language use, more accurate transcriptions, and phonetic transcriptions, can serve as a basis for investigation into change in the spoken language over time.
A system for performing speaker adaptation is being created as part of the SWE-CLARIN speech infrastructure, as there is interest in the humanities in having more accurate transcriptions of the speech of specific individuals.
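To make the distinction between absolute WER and relative WER reduction concrete, here is a minimal sketch. It assumes the standard definition of WER as word-level edit distance divided by the number of reference words; the function names and the example figures are hypothetical and do not come from the authors' experiments.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)


def relative_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction by which WER drops relative to the baseline."""
    return (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical example: one deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat".split(), "the cat sat on mat".split())
# Hypothetical figures: a baseline WER of 0.10 falling to 0.07 is a 30% relative reduction.
gain = relative_reduction(0.10, 0.07)
```

A relative reduction is therefore always read against the starting point: the same 30 percent relative gain corresponds to a much smaller absolute change when the baseline WER is already low.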

The Balanced Corpus of Modern Latvian (LVK)

Kristine Levane-Petrova

Abstract: This abstract presents the Balanced Corpus of Modern Latvian (LVK) and the development of the corpus from its beginnings until the present day. The latest version of LVK, LVK2018, is available at www.korpuss.lv and registered at the CLARIN-LV repository. As LVK2018 contains all previous versions of LVK, only the latest version (LVK2018) is publicly available. The new version, LVK2022, will be released in late 2022. The Balanced Corpus of Modern Latvian has been developed in multiple rounds. The history of the LVK series goes back to 2007, when the first 1-million-word corpus was created. The LVK design, compilation and text selection criteria were based on the Latvian Language Corpus Conception (Levāne-Petrova 2012). The experience from the design of other general language corpora was taken into account as well. The reviewed list of corpora includes the British National Corpus (Burnard 2007; Aston, Burnard 1998), the Czech National Corpus (Čermák 2002; Hnátková et al. 2014; Křen et al. 2016), the Corpus of Contemporary Lithuanian Language (Kovalevskaitė 2006; Rimkutė et al. 2010), and others. LVK is designed as a general-language, representative, synchronic and publicly available corpus. The corpus contains texts originally written in Latvian and published after 1991. The corpus contains five different sections: journalism, fiction, scientific, legal and parliamentary transcripts. The same corpus design criteria were also used for the subsequent LVK series. LVK2013 was released in 2013 with 4.5 million words (Levāne-Petrova 2012), while LVK2018 was released in late 2018 with 10 million words (Levāne-Petrova 2019; Darģis, Levāne-Petrova, Poikāns 2019). All corpora in the LVK series, except LVK2018, were manually created. Therefore, the main novelty in LVK2018 was the automation of all corpus development steps (data collection, data processing and data selection). All corpora are morphologically annotated (Paikens 2007; Paikens et al., 2013; Paikens 2016).
Morphosyntactic annotations contain PoS tags, lemmas and other Latvian-specific morphological and syntactic information. The corpora also contain metadata descriptions. There are three publicly visible metadata fields: unique identifier (id), section and reference. We are now preparing to release the new version of LVK (LVK2022) with 100 million words. Although LVK2022 will be an extended version of the LVK series, containing all the texts included in LVK2018, the corpus design criteria (proportions and also text selection criteria) will differ from the previous LVK corpora. For instance, LVK2022 will also include translations of fiction and other text genres into Latvian, not just texts originally written in Latvian. Some subcorpora (Wikipedia, PhD Thesis, Laws), to be included in the respective proportions in LVK2022, are already publicly available at www.korpuss.lv. All LVK corpora and subcorpora have been released in the framework of the Latvian National Corpus (Saulīte et al., 2022) and registered at the CLARIN-LV repository. LVK2018 and other corpora are freely available via the corpus query interface NoSketch Engine.

Using Classical Readability Formulas to Measure Text Readability in Sesotho

Johannes Sibeko

Abstract: The doctoral research presented here explores measuring text readability in Sesotho, a Bantu language spoken by more than 10 million speakers spread across South Africa, Lesotho, and Zimbabwe. It is used for a variety of purposes, including education at all levels, government, politics, media, and entertainment industries. Unfortunately, Sesotho reading skills are lacking. In education, teachers are expected to choose and adapt texts to their learners’ levels. However, these processes are intuitive and subjective. As a result, there is no objective way of assuring that texts administered for learning, teaching and assessment are of the correct readability levels. To this end, we propose that an objective measure of text readability in Sesotho will help in the selection and adaptation of texts for different purposes and expected levels. Therefore, this study aims to develop metrics for measuring text readability that can benefit researchers, authors, teachers, and readers. Furthermore, we hope to develop a web-based application to provide access to automated text readability analysis that will allow the user to paste texts and receive a readability analysis report. The application will be developed using our metric(s) for measuring readability in Sesotho. We adopt the classical readability formulas approach to text readability analysis. We aim to adapt nine existing readability metrics to Sesotho using English as a higher-resourced helper language. Four of these readability metrics, namely Flesch Reading Ease, Flesch-Kincaid Grade Level, Simple Measure of Gobbledygook, and Gunning Fog index, use syllable information. Four of the metrics, namely Rate index, Läsbarhetsindex, Automated Readability Index, and Coleman-Liau index, rely on word length. Finally, the Dale-Chall index is based on a 3000-word list of frequently used words that are expected to be known and readable to grade four learners.
So far, four resources have been developed as part of the study. Three of these resources have been published on the South African Centre for Digital Language Resources’ (SADiLaR) online repository. We surveyed digital language resources available for Sesotho and discovered the absence of a syllabification system. As indicated above, four of the metrics we are hoping to adapt to Sesotho rely on syllable identification. To this end, we developed two syllabification systems: a rule-based system and a machine learning pattern-based system. The rule-based syllabification system has 99.69% accuracy, while the machine learning system has 78.92% accuracy. Additionally, we published a corpus of 1355 single words and their syllable counterparts. We also prepared corpora of reading comprehension and summary writing texts at first-language level (indicated as home language) and second-language level (indicated as first additional language). The corpora contain exam texts from all eleven official languages of South Africa, namely South African English, Afrikaans, isiZulu, isiXhosa, Siswati, isiNdebele, Xitsonga, Tshivenda, Sepedi, Setswana and Sesotho. The corpora consist of 433 texts. We also plan on compiling a corpus of frequently used words for Sesotho. Such a corpus will be useful in adapting the Dale-Chall readability metric. All our modules will be published open access on SADiLaR’s repository.
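As an illustration of the classical formulas named in the abstract, the sketch below computes three of them from raw counts. The coefficients are the published English ones; the adapted Sesotho coefficients from the study are not given here, and the function names and example counts are hypothetical. Syllable counting, which would come from a syllabification system, is assumed to happen elsewhere.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease (English coefficients); higher means easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level (English coefficients), uses syllable counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def automated_readability_index(characters: int, words: int, sentences: int) -> float:
    """Automated Readability Index (English coefficients), uses word length in characters."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


# Hypothetical counts for a short text: 100 words, 5 sentences,
# 150 syllables and 450 characters.
fre = flesch_reading_ease(words=100, sentences=5, syllables=150)
fkgl = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
ari = automated_readability_index(characters=450, words=100, sentences=5)
```

The two Flesch formulas depend on syllables per word, which is why a syllabification system is a prerequisite, whereas the Automated Readability Index only needs character counts.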
  • Sun I+II
17:55 - 18:55 Bazaar
  • Lobby
19:30 - 22:00 Conference dinner  

Day Three

Time Wednesday 12 October 2022 Room
 

Thematic session: Curation of Language Resources

Chair: Koenraad De Smedt

  • Sun I+II
9:00 - 9:20

CLARIN Depositing Guidelines: State of Affairs and Proposals for Improvement

Jakob Lenardič and Darja Fišer

Abstract: The paper presents a review of the guidelines for depositing language resources in CLARIN B-centres. We discuss how the existing guidelines instruct depositors to document basic resource metadata such as size, annotation, and language. On the basis of our review, we propose a new set of guidelines to be adopted by CLARIN repositories for those metadata categories that are pivotal for the SSH community but are underrepresented in the existing guidelines.
 
9:20 - 9:40

The Resource Publishing Pipeline of the Language Bank of Finland

Ute Dieckmann, Mietta Lennes, Jussi Piitulainen, Jyrki Niemi, Erik Axelson, Tommi Jauhiainen and Krister Lindén

Abstract: We present the process of publishing resources in Kielipankki, the Language Bank of Finland. Our pipeline includes all the steps that are needed to publish a resource: from finding and receiving the original data to making the data available via different platforms, e.g., the Korp concordance tool or the download service. Our goal is to standardise the publishing process by creating an ordered checklist of tasks with the corresponding documentation and by developing conversion scripts and processing tools that can be shared and applied to different resources.
 
9:40 - 10:00

TEI and Git in ParlaMint: Collaborative Development of Language Resources

Tomaž Erjavec and Matyáš Kopp 

Abstract: This paper discusses the encoding, validation and development of language resources of the completed ParlaMint I and on-going ParlaMint II CLARIN projects, which centre on the collaborative development of TEI-encoded corpora of parliamentary proceedings. We introduce the use of TEI ODD to write the encoding guidelines and formal XML schemas for validation. We motivate and explain the use of Git to develop encoding schemas and language resources. Apart from revision control, issues and publishing documentation, we also emphasise GitHub Actions with their ability to integrate program code execution into the data submission process. The paper is written with a view to introducing SSH scientists to the two environments, as they can be valuable items in the toolbox for compiling language resources, especially in a collaborative setting.
 
10:00 - 10:40 Panel | CLARIN and Other SSH Platforms: You’ll Never Innovate Alone
  • Sun I+II
10:40 - 11:10 Coffee break  
11:10 - 11:55

Keynote | Chair: TBA

100 Years of Speech Recognition, the Data Fork and the Conversational Challenge. Stories from Today’s Speech Industry

Ariane Nabeth-Halber

Abstract: From "Radio Rex" to "Google Duplex", several revolutions have hit the speech technology domain. Each time, these revolutions were related to shifts in data, algorithms and infrastructures. The latest developments in self-supervised learning and large pretrained models seem to lead to a data fork between the use of massive raw data with dedicated training infrastructure and the use of small, targeted annotated data with on-demand training capabilities. This new state of the art also changes the landscape of solved and unsolved speech and language tasks, casting a new light on the long-time dream of "talking machines" and attracting huge interest from the media and the general public. Beyond the echoes of trending stories, how do these shifts translate in the industry? We’ll first take an insider's look at a few real-life speech industry stories, taking place in contact centers, but also on trading floors or in media monitoring spaces, scrutinizing how business problem solving can be related to speech and language R&D and to data and corpora questions, but also to many other key factors. With this in mind, we’ll review the changes that current advances are starting to induce in the speech industry landscape, and the main challenges that still need to be addressed. Spoiler alert: the "Conversational Challenge" is probably the one that stands as the next frontier.
Bio: Ariane Nabeth-Halber has worked in the speech industry for 25 years. She started her career in research (ATR, Japan; Thalès, France), and then moved to the speech industry, for example Nuance Communications and the French company Bertin IT, working with contact centres, broadcasters, trading floors and public ministries, but also academic labs such as LIUM and Avignon University/LIA. Since August 2021, Ariane Nabeth-Halber has led the Speech and Conversational AI team at ViaDialog, to deliver augmented and safe customer relationship experiences. A European Commission expert and LT-Innovate board member, Ariane holds a PhD in computer science and signal processing from Telecom ParisTech. She regularly speaks at conferences on AI and speech technology.
  • Sun I+II
 

Thematic session: Research cases

Chair: Krister Lindén

  • Sun I+II
11:55 - 12:15

Analysing changes in official use of the design concept using SweCLARIN resources

Lars Ahrenberg, Daniel Holmer, Stefan Holmlid and Arne Jönsson 

Abstract: We show how the tools and language resources developed within the SweClarin infrastructure can be used to investigate changes in the use and understanding of the related Swedish words arkitektur, design, form, and formgivning. Specifically, we compare their use in two governmental public reports on design, one from 1999 and the other from 2015. We test the hypothesis that their meaning has developed in a way that blurs distinctions that may be important to stakeholders in the respective fields.
 
12:15 - 12:35

A Snapshot of Climate Change Arguments: Searching for Recurring Themes in Tweets on Climate Change

Maria Skeppstedt and Robin Schäfer 

Abstract: We applied the topic modelling tool Topics2Themes to a collection of German tweets on the subject of climate change, the GerCCT corpus. Topics2Themes is currently being further developed and evaluated within Språkbanken Sam, which is a part of SWE-CLARIN. The tool automatically extracted 15 topics from the tweet collection. We used the graphical user interface of Topics2Themes to manually search for recurring themes among the eight tweets most closely associated with the topics extracted. Although the content of the tweets associated with a topic was often diverse, we were still able to identify recurring themes. More specifically, 14 themes that occurred at least three times were identified in the texts analysed.
 
12:35 - 12:55

Linguistic Framing of Political Terror: Distant and Close Readings of the Discourse on Terrorism in the Swedish Parliament 1993–2018

Magnus P. Ängsal, Daniel Brodén, Mats Fridlund, Leif-Jöran Olsson and Patrik Öhberg 

Abstract: This paper provides a study of the discourse on terrorism in Swedish parliamentary debate 1993–2018. The aim is to explore how terrorism is discursively constructed in parliamentary deliberations, drawing on the resources of Swe-Clarin in the form of the corpus tool Korp and the linguistic concept of ‘frame’. To map meanings attached to terrorism we pursue two research questions: what framing elements are connected to ‘terrorism’ and ‘terrorist’ in parliamentary speeches 1) as simplexes and 2) as part of compounds along the lines of controversies and party affiliations? The latter research question is probed through distant and close readings of the specific compound statsterrorism (‘state terrorism’). Our findings show that terrorism is typically framed as located outside of Sweden and as tied to Islamism, but the question of what countries are associated with state terrorism depends on the political affiliation of the interlocutor. The compound statsterrorism is most prominently used by the left and green parties and is then commonly associated with Israel and Turkey. We conclude by suggesting that a widened inquiry into compounds, in general as well as diachronically, is likely a productive way of expanding the scope of our research.
 
12:55 - 13:00 Closing remarks  
13:00 - 14:00 Lunch