- Invited talk
- CLARIN Café: “This is CLARIN. How can we help you?”
- Moderator-led discussions about papers. Paper topics:
- CLARIN Students Session
- CLARIN in the Classroom
- CLARIN Bazaar
Language Technology & Hypothesis Testing (Slides)
Dr. Antske Fokkens (Associate professor, Faculty of Humanities, Vrije Universiteit Amsterdam)
Both the quality and the accessibility of language technology have increased drastically over the last decade. Generic language models and deep learning have led to impressive results, and both models and code for creating and using them are often made available. As a result, these technologies are increasingly used in industry and in various research disciplines outside computational linguistics. Despite sometimes impressive results, however, our technologies are still far from perfect, and much is still unknown about how well our models work for specific use cases. In this talk, I will argue for the importance of going back to the foundations and grounding research in hypotheses, both for studying language technology itself and for applying it in other research domains.
The goal of the Café is to introduce and familiarize the audience with the CLARIN Infrastructure. What is CLARIN all about? How can CLARIN help? What does it have to offer to researchers, lecturers and students? And how does using the CLARIN Infrastructure help them during their research projects? During this Café these questions will all be addressed.
1University of Ljubljana, Slovenia 2Jožef Stefan Institute, Slovenia
This paper presents the current state of the CLARIN Resource and Tool families initiative, the aim of which is to provide user-friendly overviews of the available language resources and tools in the CLARIN infrastructure for researchers from digital humanities, social sciences and human language technologies. The initiative now consists of a total of 11 corpus families, 5 families of lexical resources, and 4 tool families, which together amount to 950 manually curated tools and resources as of 17 August 2020. We present the initiative from the perspective of missing metadata as well as problems related to the general accessibility of the tools and resources and their findability in the Virtual Language Observatory (VLO).
An Internationally FAIR Mediated Digital Discourse Corpus: Towards Scientific and Pedagogical Reuse
Rachel Panckhurst and Francesca Frontini
Université Paul-Valéry Montpellier 3, France
In this paper, the authors present a French Mediated Digital Discourse corpus, 88milSMS (http://88milsms.huma-num.fr, https://hdl.handle.net/11403/comere/cmr-88milsms). Efforts were undertaken over the years to ensure its publication according to the best practices and standards of the community, thus guaranteeing compliance with FAIR principles and CLARIN recommendations and enabling pertinent scientific and pedagogical reuse.
The First Dictionaries in Esperanto. Towards the Creation of a Parallel Corpus
Denis Eckert1 and Francesca Frontini2
1CNRS Paris, France 2Université Paul-Valéry Montpellier 3, France
Between 1887 (the date of creation of the Esperanto language by L. Zamenhof in Warsaw) and 1890, 37 documents in at least 16 different languages were prepared by Zamenhof himself or by early adopters of Esperanto, in order to present the new “international language” to the broadest possible public. This unique multilingual corpus is scattered across many national libraries. We have systematically collected digital versions of this set of documents and begun to analyze them in parallel. Many of them (17) contain the same basic dictionary, elaborated by people who mostly had limited or no knowledge of philology. These 17 dictionaries encompass about 920 entries in Esperanto, each time translated into a given target language. We are progressively digitizing the whole corpus of these small dictionaries (so far, 12 versions have been digitized and encoded) and aim at making it accessible to scholars of various disciplines in format in an Open Access repository. Given the great variety of the languages used, international and interdisciplinary co-operation is clearly needed in order to decipher and correctly encode linguistic issues that are likely to arise (non-standardized Hebrew, pre-independence Latvian or Lithuanian, old-style Russian spelling, etc.).
Digital Neuropsychological Tests and Biomarkers: Resources for NLP and AI Exploration in the Neuropsychological Domain
Dimitrios Kokkinakis and Kristina Lundholm Fors
University of Gothenburg, Sweden
Non-invasive, time- and cost-effective, easy-to-measure techniques for the early diagnosis of brain and mental disorders, or for monitoring their progression, are at the forefront of recent research in this field. Natural Language Processing and Artificial Intelligence can play an important role in supporting and enhancing data-driven approaches to improve the accuracy of prediction and classification. However, large datasets of, e.g., recorded speech in the domain of cognitive health are limited. To improve the performance of existing models we need to train them on larger datasets, which could raise the accuracy of clinical diagnosis and contribute to the detection of early signs at scale. In this paper, we outline our ongoing work to collect such data from a large population in order to support and conduct future research on modelling speech and language features in a cross-disciplinary manner. The final goal is to explore and combine linguistic and multimodal biomarkers from the same population and to compare hybrid models that could increase the predictive accuracy of the algorithms that operate on them.
CORLI: The French Knowledge-Centre
Eva Soroli1, Céline Poudat2, Flora Badin3, Antonio Balvet1, Elisabeth Delais-Roussaire4, Carole Etienne5, Lydia-Mai Ho-Dac6, Loïc Liégeois7, Christophe Parisse8
1University of Lille, France 2University Côte d'Azur, France 3LLL-CNRS Orléans, France 4University of Nantes, France 5ICAR-CNRS Lyon, France 6CLLE, University of Toulouse, France 7University of Paris, France 8INSERM, CNRS Paris, Nanterre University, France
As a first step towards increasing reproducibility of language data and promoting scientific synergies and transparency, CORLI (Corpus, Language and Interaction), a consortium involving members from more than 20 research labs and 15 Universities, part of the French large infrastructure Huma-Num, contributes to the European research infrastructure of CLARIN through the establishment of a knowledge sharing centre: the French Clarin CORpus Language and Interactions K-Centre (CORLI K-Centre). The purpose of the CORLI K-Centre is to provide expertise in corpus linguistics and French language, and support academic communities through actions towards FAIR and Open data. We describe the development of the CORLI K-Centre, its scope, targeted audiences, as well as its intuitive and interactive online platform which centralizes and offers both proactive and reactive services about: available language resources, databases and depositories, training opportunities, and best research practices (i.e., on legal/ethical issues, data collection, metadata standardization, anonymization, annotation and analysis guidelines, corpus exploration methods, format conversions and interoperability).
The CLASSLA Knowledge Centre for South Slavic Languages
Nikola Ljubešić1, Petya Osenova2, Tomaž Erjavec1 and Kiril Simov2
1Jožef Stefan Institute, Slovenia 2IICT-BAS, Bulgaria
We describe the recently set-up CLARIN Knowledge centre CLASSLA, which focuses on language resources and technologies for South Slavic languages. The Knowledge centre is currently run by the Slovene national consortium CLARIN.SI and the Bulgarian national consortium CLADA-BG. The two main aims of the Knowledge centre are coordination in the development of language resources and technologies for the typologically related languages, and joint training activities and helpdesk support for the underlying user base.
University of Tübingen, Germany
In this work, we introduce sticker2, a production-quality, neural syntax annotator for Dutch and German based on deep transformer networks. sticker2 uses multi-task learning to support simultaneous annotation of several syntactic layers (e.g. part-of-speech, morphology, lemmas, and syntactic dependencies). Moreover, sticker2 can finetune pretrained models such as XLM-RoBERTa (Conneau et al., 2019) for state-of-the-art accuracies.
To make the use of the deep syntax models tractable in execution environments such as WebLicht (Hinrichs et al., 2010), we apply model distillation (Hinton et al., 2015) to reduce the model’s size. Distillation results in models that are roughly 5.2-8.5 times smaller and 2.5-4.4 times faster, with only a small loss of accuracy.
sticker2 is widely available through its integration in WebLicht. Nix derivations and Docker images are made available for advanced users who want to use the models outside WebLicht.
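As a rough illustration of the model distillation mentioned above (Hinton et al., 2015), the following sketch computes a temperature-softened distillation loss for a single token. It is not taken from sticker2: the function names, the logits and the temperature value are assumptions made purely for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Scale by T^2 so the loss magnitude stays comparable across temperatures.
    return kl * temperature ** 2

# Hypothetical logits over three part-of-speech tags for one token.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.5]
loss = distillation_loss(teacher, student)
```

The smaller student model is trained to minimise this loss, so it learns from the teacher's full probability distribution rather than only from the single best label.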
Exploring and Visualizing WordNet Data with GermaNet Rover
Marie Hinrichs, Richard Lawrence and Erhard Hinrichs
University of Tübingen, Germany
This paper introduces GermaNet Rover, a new web application for exploring and visualizing GermaNet. It provides semantic relatedness calculations between two selected word senses using six algorithms, and allows regular expression or edit distance lookup in combination with other search constraint options. Visualizations include a concept's position in the graph and the shortest path between concepts. Rover provides easy access to these features and is available as a CLARIN resource, using single-sign-on authentication and authorization.
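To give an idea of what an edit distance lookup involves, here is a minimal sketch using the Levenshtein distance. It does not reflect how Rover is implemented; the `lookup` helper and the small word list are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def lookup(query, lexicon, max_distance=1):
    """Return all lexicon entries within max_distance edits of the query."""
    return sorted(w for w in lexicon if edit_distance(query, w) <= max_distance)

# Hypothetical word list; a real lookup would run over GermaNet's lexical units.
lexicon = ["Haus", "Maus", "Laus", "Hausse", "Baum"]
matches = lookup("Haus", lexicon)  # ["Haus", "Laus", "Maus"]
```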
Named Entity Recognition for Distant Reading in ELTeC
Francesca Frontini1, Carmen Brando2, Joanna Byszuk3, Ioana Galleron4, Diana Santos5 and Ranka Stanković6
1Université Paul-Valéry Montpellier 3, France 2CRH, EHESS Paris, France 3Institute of Polish Language, Poland 4LaTTiCe CNRS, Université Sorbonne Nouvelle - Paris 3, France 5Linguateca & University of Oslo 6University of Belgrade
The “Distant Reading for European Literary History” COST Action, which started in 2017, has among its main objectives the creation of an open source, multilingual European Literary Text Collection (ELTeC). In this paper we present the work carried out to manually annotate a selection of the ELTeC collection for Named Entities, as well as to evaluate existing NER tools as to their capacity to reproduce such annotation. In the final paragraph, points of contact between this initiative and CLARIN are discussed.
UiL-OTS, University of Utrecht, The Netherlands
This paper presents results of an application (Sasta) derived from the CLARIN-developed tool GrETEL for the automatic assessment of transcripts of spontaneous Dutch language. The techniques described here, if successful, (1) have important societal impact, (2) are interesting from a scientific point of view, and (3) may benefit the CLARIN infrastructure itself since they enable a derivative program called CHAMP-NL (CHAT-iMProver for Dutch) that can improve the quality of the annotations of Dutch data in CHAT-format.
A Neural Parsing Pipeline for Icelandic Using the Berkeley Neural Parser
Þórunn Arnardóttir and Anton Karl Ingason
University of Iceland, Iceland
We present a machine parsing pipeline for Icelandic which uses the Berkeley Neural Parser and includes every step necessary for parsing plain Icelandic text, delivering text annotated according to IcePaHC. The parser is fast and reports an 84.74 F1 score. We describe the training and evaluation of the new parsing model and the structure of the parsing pipeline. All scripts necessary for parsing plain text using the new parsing pipeline are provided in open access via the CLARIN repository and GitHub.
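For readers unfamiliar with the metric, the sketch below shows how an F1 score over labelled constituents can be computed. It is a simplified illustration, not the evaluation script used in the paper; the function name, the labels and the spans are assumptions.

```python
def bracket_f1(gold, predicted):
    """F1 over labelled constituents given as (label, start, end) tuples."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical constituents for one four-token sentence.
gold = {("IP-MAT", 0, 4), ("NP-SBJ", 0, 1), ("NP-OB1", 2, 4)}
predicted = {("IP-MAT", 0, 4), ("NP-SBJ", 0, 1), ("NP-OB1", 3, 4), ("ADVP", 2, 3)}
score = bracket_f1(gold, predicted)  # precision 2/4, recall 2/3, F1 = 4/7
```

A reported score such as 84.74 corresponds to this value multiplied by 100 and aggregated over a whole test set.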
Annotating Risk Factor Mentions in the COVID-19 Open Research Dataset
Maria Skeppstedt, Magnus Ahltorp, Gunnar Eriksson and Rickard Domeij
The Language Council of Sweden, The Institute for Language and Folklore, Sweden
We describe the creation of manually annotated training data for the Kaggle task “What do we know about COVID-19 risk factors?”. We applied our text mining tool to the “COVID-19 Open Research Dataset” to i) select data for manual annotation, ii) classify the data into initially established classification categories, and iii) analyse our data set in search of potential refinements of the annotation categories. The process resulted in a corpus consisting of 50,000 tokens, for which each token is annotated as to whether it is part of an expression that functions as a “risk factor trigger”. Two types of risk factor triggers were annotated: those indicating that the text describes a risk factor, and those indicating that something could not be shown to be a risk factor.
University of Bergen, Norway
Newspaper corpora that are continuously kept up to date are useful for monitoring language changes in almost real time. The COVID-19 pandemic prompted a case study in the Norwegian Newspaper Corpus. This corpus was mined for productive compounds with the stem “corona” and its alternative “korona”, tracing their frequencies and dates of first occurrence. The analysis traced not only the daily volume of such compounds, but also the sustained creation of many new compounds and a change in their preferred spelling.
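The kind of mining described above can be sketched in a few lines. This is a minimal illustration, not the method used in the study: the regular expression, the `mine_compounds` function and the sample sentences are hypothetical.

```python
import re
from collections import Counter

# Words starting with "corona" or "korona" followed by more letters (optional hyphen).
# Assumption: this crude pattern also catches inflected forms, not only true compounds.
STEM_PATTERN = re.compile(r"\b[ck]orona-?\w+", re.IGNORECASE)

def mine_compounds(dated_texts):
    """Return the frequency and the first-occurrence date of each matched word."""
    counts = Counter()
    first_seen = {}
    for date, text in sorted(dated_texts):  # ISO dates sort chronologically
        for match in STEM_PATTERN.findall(text):
            word = match.lower()
            counts[word] += 1
            first_seen.setdefault(word, date)
    return counts, first_seen

# Hypothetical sample data as (ISO date, text) pairs.
articles = [
    ("2020-03-12", "Koronakrisen rammer hardt."),
    ("2020-01-25", "Coronaviruset sprer seg."),
    ("2020-03-20", "Koronaviruset og koronakrisen."),
]
counts, first_seen = mine_compounds(articles)
```

Comparing the counts for the “c” and “k” variants over time gives a simple view of the change in preferred spelling.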
Trawling the Gulf of Bothnia of News: A Big Data Analysis of the Emergence of Terrorism in Swedish and Finnish Newspapers, 1780-1926
Mats Fridlund1, Daniel Brodén1, Leif-Jöran Olsson2 and Lars Borin2
1Centre for Digital Humanities, University of Gothenburg, Sweden 2Språkbanken Text, University of Gothenburg, Sweden
This study combines history domain knowledge and language technology expertise to evaluate and expand on research claims regarding the historical meanings associated with terrorism in Swedish and Finnish contexts. Using a cross-border comparative approach and large newspaper corpora made available by the CLARIN research infrastructure, we explore overlapping national discourses on terrorism, the concept’s historical diversity and its relations to different national contexts. We are particularly interested in testing the hypothesis that substate terrorism’s modern meaning was not yet established in the 19th century but primarily restricted to Russian terrorism. We conclude that our comparative study finds both uniquely national and shared meanings of terrorism and that our study strengthens the hypothesis. By extension, the study also serves as an exploration of the potential of cross-disciplinary evaluative studies based on extensive corpora and of cross-border comparative approaches to Swedish and Finnish newspaper corpora.
Studying Emerging New Contexts for Museum Digitisations on Pinterest
Bodil Axelsson1, Daniel Holmer2, Lars Ahrenberg2 and Arne Jönsson2
1Department of Culture and Society, Linköping University, Sweden 2Department of Computer and Information Science, Linköping University, Sweden
In a SweClarin cooperation project we apply topic modelling to the texts found with pins in Pinterest boards. The data in focus are digitisations from the Swedish History Museum and the underlying research question is how their historical objects are given new contextual meanings in the boards. We illustrate how topics can support interpretations by suggesting coherent themes for Viking Age Jewelry and localising it in different strands of popular culture.
Evaluation of a Two-OCR Engine Method: First Results on Digitized Swedish Newspapers Spanning over nearly 200 Years
Dana Dannélls1, Lars Björk2, Ove Dirdal3 and Torsten Johansson2
1Språkbanken Text, University of Gothenburg, Sweden 2Kungliga Biblioteket Stockholm, Sweden 3Zissor Oslo, Norway
In this paper we present a two-OCR engine method that was developed at Kungliga biblioteket (KB), the National Library of Sweden, for improving the correctness of the OCR for mass digitization of Swedish newspapers. We report the first quantitative evaluation results on material spanning nearly 200 years. In this first evaluation phase we experimented with word lists for different time periods. Although there was no significant overall improvement of the OCR results, the evaluation shows that some combinations of word lists are successful for certain periods and should therefore be explored further.
Stimulating Knowledge Exchange via Trans-National Access - The ELEXIS Travel Grants as a Lexicographical Use Case
Sussi Olsen1, Bolette S. Pedersen1, Tanja Wissik2, Anna Woldrich2 and Simon Krek3
1University of Copenhagen, Denmark 2Austrian Academy of Sciences, Austria 3Jožef Stefan Institute, Slovenia
This paper describes the intermediate outcome of one of the initiatives of the ELEXIS project: Transnational Access. The initiative aims at facilitating interaction between lexicographers/researchers from the EU and associated countries and lexicographical communities throughout Europe by giving out travel grants. Several of the grant holders have visited CLARIN centres, have been acquainted with the CLARIN infrastructure and have used CLARIN tools. The paper reports on the scientific outcome of the visits that have taken place so far: the origin of the grant holders, their level of experience, the kind of research projects the grant holders work with and the outcomes of their visits. Every six months ELEXIS releases a call for grants. So far 23 visits have been granted in total; 13 of these visits have been concluded and the reports of the grant holders are publicly available at the ELEXIS website.
PoetryLab as Infrastructure for the Analysis of Spanish Poetry
Javier de la Rosa1, Álvaro Pérez1, Laura Hernández1, Aitor Díaz2, Salvador Ros2 and Elena González-Blanco3
1LINHD, UNED Madrid, Spain 2Control and Communication Systems, UNED Madrid, Spain 3IE University Madrid, Spain
The development of the network of ontologies of the ERC POSTDATA Project brought to light some deficiencies in terms of completeness in the existing corpora. To tackle the issue in the realm of the Spanish poetic tradition, our approach consisted in designing a set of tools that any scholar could use to automatically enrich the analysis of Spanish poetry. The effort crystallized in the PoetryLab, an extensible open source toolkit for syllabification, scansion, enjambment detection, rhyme detection, and historical named entity recognition for Spanish poetry. We designed the system to be interoperable, compliant with the project ontologies, easy to use by tech-savvy and non-expert researchers, and requiring minimal maintenance and setup. Furthermore, we propose the integration of the PoetryLab as a core functionality in the tool catalog of CLARIN for Spanish poetry.
University of Tübingen, Germany
In this work we report on our experiences with using the Nix (Dolstra, 2006) package manager to build fully reproducible annotation services for the WebLicht (Hinrichs et al., 2010) workflow engine.
University of Copenhagen, Denmark
The Text Tonsorium (TT) is a workflow management system (WMS) for Natural Language Processing (NLP). The software implements a design goal that sets it apart from other WMSes: it operates without manually composed workflow designs. The TT invites tool providers to register and integrate their tools, without having to think about the workflows that new tools can become part of. Both input and output of new tools are specified by expressing language, file format, type of content, etc. in terms of an ontology. Likewise, users of the TT define their goal in terms of this ontology and let the TT compute the workflow designs that fulfill that goal. When the user has chosen one of the proposed workflow designs, the TT enacts it with the user’s input. This untraditional approach to workflows requires some familiarization. In this paper, we propose possible improvements of the TT that can facilitate its use by the CLARIN community.
Charles University, Czech Republic
In this paper we describe how the TEITOK corpus platform was integrated with the Kontext corpus platform at LINDAT to provide document visualization for both existing and future resources at LINDAT. The TEITOK integration also means that LINDAT resources will become available in TEI/XML format, and searchable using both Manatee and CWB.
Wittgenstein Archives at the University of Bergen, Norway
The Wittgenstein Archives at the University of Bergen (WAB) offer specialized tools for research access to its Wittgenstein resources; these tools, however, are in need of an upgrade to better serve user requirements. The paper discusses this need with reference to selected exemplary features of two such tools: Interactive Dynamic Presentation (IDP) of Wittgenstein’s philosophical Nachlass and Semantic Faceted Search and Browsing (SFB) of Wittgenstein metadata. The tasks of extending and better adapting these two tools to user requirements will be carried out within the Norwegian CLARINO+ project.
The Language Archive, Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands
At the beginning of 2018, The Language Archive at the Max Planck Institute for Psycholinguistics (TLA) migrated to a new CLARIN-compatible repository solution based on the open source Islandora/Fedora framework. This new solution, labeled FLAT, was developed together with the Meertens Institute and contains a customized ingest pipeline as well as a deposit frontend that is easy enough to use that researchers can deposit their own collections. At the beginning of this year, major new functionality for browsing and visualizing archived materials was added to the repository. After two years of using FLAT, we take stock in this paper and describe our experiences. What worked well? What could be improved? What turned out to be a serious shortcoming?
Building a Home for Italian Audio Archives
Silvia Calamai1, Niccolò Pretto2, Monica Monachini2, Maria Francesca Stamuli3, Silvia Bianchi1, Pierangelo Bonazzoli4
1DSFUCI, Siena University Arezzo, Italy 2Institute of Computational Linguistics "A. Zampolli" CNR, Pisa, Italy 3Ministry of Culture and Tourism, Florence, Italy 4Unione dei Comuni Montani del Casentino, Italy
Audio and audiovisual archives are at the crossroads of different fields of knowledge, yet they require common solutions for both their long-term preservation and their description, availability, use and reuse. Archivio Vi.Vo. is an Italian project financed by the Tuscany Region, aiming to (i) explore methods for long-term preservation and secure access to oral sources and (ii) develop an infrastructure under the CLARIN-IT umbrella offering several services for scholars from different domains interested in oral sources. This paper describes the project’s infrastructure and its methodology through a case study on Caterina Bueno’s audio archive.
Digitizing University Libraries - Evolving from Full Text Providers to CLARIN Contact Points on Campuses
Manfred Nölte and Martin Mehlberg
State and University Library Bremen, Germany
This paper describes the beginnings, the currently emerging demands and activities, and the future options of the relationship between the State and University Library Bremen (SuUB) and CLARIN. Section 1 presents a solution for supplying digital humanists with the tools and services best suited to their research. Because the SuUB is a digitizing academic library, this relationship concerns full-text transfers to CLARIN (Geyken et al., 2018; Nölte and Blenkle, 2019). Further connections could consist in providing advice and training for researchers in the Digital Humanities as potential CLARIN users (see section 2) and in a discussion about future structural options at the level of research infrastructures. In section 3 we suggest a collaboration between digitizing libraries to jointly agree upon standards of quality, file formats, interfaces and web services. We discuss the foundation of local CLARIN contact points that refer scholars and researchers to the relevant CLARIN contact or service. The relevance to CLARIN activities, resources, tools or services is described at the end of each section.
"Tea for Two": The Archive of the Italian Latinity of the Middle Ages meets the CLARIN Infrastructure
Federico Boschetti1, Riccardo Del Gratta1, Monica Monachini1, Marina Buzzoni2, Paolo Monella3 and Roberto Rosselli Del Turco3
1Institute of Computational Linguistics "A. Zampolli" CNR, Pisa, Italy 2ALIM, Università Ca' Foscari Venezia, Italy 3ALIM, Università degli Studi di Palermo, Italy
This paper presents the Archive of the Italian Latinity of the Middle Ages (ALIM) and focuses, particularly, on its structure and metadata for its integration into the ILC4CLARIN repository. Access to this archive of Latin texts produced in Italy during the Middle Ages is of great importance in providing CLARIN-IT and the CLARIN community, at large, with critically reliable texts for the use of philologists, historians of literature, historians of institutions, culture and science of the Middle Ages.
Use Cases of the ISO Standard for Transcription of Spoken Language in the Project INEL
Anne Ferger and Daniel Jettka
Universität Hamburg, Germany
This contribution addresses the benefits and challenges of using the ISO standard for transcription of spoken language, ISO 24624:2016, in a long-term research project. By exploring several use cases for the standard format, which include its application in archiving, dissemination, analysis and search of linguistic data, the practicality and versatility of its usage are examined. Various opportunities for further interdisciplinary research are also highlighted.
Evaluating and Assuring Research Data Quality for Audiovisual Annotated Language Data
Timofey Arkhangelskiy1, Hanna Hedeland2 and Aleksandr Riaposov1
1QUEST, Universität Hamburg, Germany 2QUEST, Leibniz-Institut für Deutsche Sprache Mannheim, Germany
This paper presents the QUEST project and describes concepts and tools that are being developed within its framework. The goal of the project is to establish quality criteria and curation criteria for annotated audiovisual language data. Building on existing resources developed by the participating institutions earlier, QUEST develops tools that could be used to facilitate and verify adherence to these criteria. An important focus of the project is making these tools accessible for researchers without substantial technical background and helping them produce high-quality data. The main tools we intend to provide are the depositors’ questionnaire and automatic quality assurance, both developed as web applications. They are accompanied by a Knowledge base, which will contain recommendations and descriptions of best practices established in the course of the project. Conceptually, we split linguistic data into three resource classes (data deposits, collections and corpora). The class of a resource defines the strictness of the quality assurance it should undergo. This division is introduced so that too strict quality criteria do not prevent researchers from depositing their data.
Towards Comprehensive Definitions of Data Quality for Audiovisual Annotated Language Resources
Leibniz-Institut für Deutsche Sprache Mannheim, Germany
Though digital infrastructures such as CLARIN have been successfully established and now provide large collections of digital resources, the lack of widely accepted standards for data quality and documentation still makes re-use of research data a difficult endeavour, especially for more complex resource types. The article gives a detailed overview of relevant characteristics of audiovisual annotated language resources and reviews possible approaches to data quality in terms of their suitability for the current context. In conclusion, various strategies are suggested in order to arrive at comprehensive and adequate definitions of data quality for this particular resource type.
Towards an Interdisciplinary Annotation Framework: Combining NLP and Expertise in Humanities
Laska Laskova, Petya Osenova and Kiril Simov
AIaLT, IICT-BAS, Bulgaria
The paper describes the initial steps in creating an annotation framework that would combine the knowledge coming from Bulgarian corpora, lexicons and linguistic analyzers with the expert knowledge coming from specialists in History, Diachronic Linguistics and Iconography. The proposed framework relies on the INCEpTION system, the CIDOC CRM ontology and FrameNet. Here the following steps are described: workflow, guideline principles and challenges. The domain focus is History. The ultimate goal is to provide enough manual and verified expert data for constructing a Bulgaria-centered knowledge graph in addition to the information coming from Wikipedia, Wikidata and Geonames. The annotations will also be used for training automatic semantic processors and linkers.
1Leibniz-Institut für Deutsche Sprache Mannheim, Germany 2Eberhard Karls Universität Tübingen, Germany
An implementation of CMDI-based signposts and its use is presented in this paper. Arnold et al. (2020) present Signposts as a solution to challenges in the long-term preservation of corpora, especially corpora that are continuously extended and subject to modification, e.g., due to legal injunctions, but that may also overlap with respect to constituents and may be subject to migrations to new data formats. We describe the contribution Signposts can make to the CLARIN infrastructure and document the design for the CMDI profile.
Extending the CMDI Universe: Metadata for Bioinformatics Data
Olaf Brandt, Holger Gauza, Steve Kaminski, Mario Trojan and Thorsten Trippel
Eberhard Karls Universität Tübingen, Germany
CMDI is a discipline-independent metadata framework, though it is currently mainly used within CLARIN and by initiatives in the humanities and social sciences. In this paper we investigate whether and how CMDI can be used in bioinformatics for metadata modelling and for describing research data.
The CMDI Explorer
Denis Arnold1, Ben Campbell2, Thomas Eckert3, Bernhard Fisseni1, Thorsten Trippel2 and Claus Zinn2
1Leibniz-Institut für Deutsche Sprache Mannheim, Germany 2Eberhard Karls Universität Tübingen, Germany 3Universität Leipzig, Germany
We present the CMDI Explorer, a tool that empowers users to easily explore the contents of complex CMDI records and to process selected parts of them with little effort. The tool allows users, for instance, to analyse virtual collections represented by CMDI records, and to send collection items to other CLARIN services such as the Switchboard for subsequent processing. The CMDI Explorer hence adds functionality that many users felt was lacking from the CLARIN tool space.
Going to the ALPS: A Tool to Support Researchers and Help Legality Awareness Building
Veronika Gründhammer, Vanessa Hannesschläger and Martina Trognitz
Austrian Centre for Digital Humanities and Cultural Heritage (ACDH-CH), Austria
In this paper, we describe the “ACDH-CH Legal issues Project Survey” (ALPS), a tool that helps researchers to understand all legal dimensions of their research project and the lifecycle of their data. The introduction explains the institutional preconditions and specific target groups of the tool, which was developed by the Austrian Centre for Digital Humanities and Cultural Heritage (ACDH-CH) of the Austrian Academy of Sciences and is used to support both its members and project partners to comply with legal requirements and the spirit of Open Science at the same time. The paper then focuses on the various elements of the survey and their goals, also explaining the workflow that is used to process the results of survey participants. The conclusion and outlook section suggests ways to open up this tool for use by a community beyond the members and project partners of the hosting institution.
Leibniz-Institut für Deutsche Sprache Mannheim, Germany
N-grams are of utmost importance for modern linguistics and language theory. The legal status of n-grams, however, raises many practical questions. Traditionally, text snippets are considered copyrightable if they meet the originality criterion, but no clear indicators as to the minimum length of original snippets exist; moreover, the solutions adopted in some EU Member States (the paper cites German and French law as examples) are considerably different. Furthermore, recent developments in EU law (the CJEU's Pelham decision and the new right of newspaper publishers) also provide interesting arguments in this debate. The proposed paper presents the existing approaches to the legal protection of n-grams and tries to formulate some clear guidelines as to the length of n-grams that can be freely used and shared.
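Since the legal discussion turns on the length of n-grams, a brief reminder of how n-grams are derived from a text may be useful. The sketch below is a minimal illustration; the `ngrams` function and the sample sentence are assumptions, not material from the paper.

```python
def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample text, tokenized on whitespace.
tokens = "the legal status of n-grams".split()
bigrams = ngrams(tokens, 2)   # 4 bigrams from 5 tokens
trigrams = ngrams(tokens, 3)  # 3 trigrams from 5 tokens
```

The larger n is, the more of the original wording each snippet preserves, which is why the length of an n-gram matters for the originality criterion.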
CLARIN Contractual Framework for Sharing Language Data: The Perspective of Personal Data Protection
Aleksei Kelli1, Krister Lindén2, Kadri Vider1, Pawel Kamocki3, Arvi Tavast4, Ramūnas Birštonas5, Gaabriel Tavits1, Mari Keskküla1, Penny Labropoulou6
1University of Tartu, Estonia 2University of Helsinki, Finland 3Leibniz-Institut für Deutsche Sprache Mannheim, Germany 4Institute of the Estonian Language, Estonia 5Vilnius University, Lithuania 6ILSP/ARC, Greece
The article analyses the responsibility for ensuring compliance with the General Data Protection Regulation (GDPR) in research settings. As a general rule, organisations are considered the data controller (the party responsible for GDPR compliance). Research constitutes a unique setting influenced by academic freedom. This raises the question of whether academics could be considered the controller as well. There are some court cases and policy documents on this issue, but it is not settled yet. The analysis serves as a preliminary analytical background for redesigning the CLARIN contractual framework for sharing data.
University of Macerata, Italy
My project deals with media interpreting and more specifically with film festival interpreting (FFI). My aim is to create a multimodal corpus of dialogue interpretations (Italian<>English) recorded at international film festivals and/or downloaded from the official web channels (social media platforms or web TV) that broadcast them in (live) streaming. The international film festival which has drawn my attention is the Giffoni Film Festival (Italy) because of its unique audience and jury made up of children and young people from 3 to over 18 years old.
The FFI multimodal corpus I will create may contribute to two CLARIN resource families, namely spoken and CMC corpora. Indeed, it deals with interpreter-mediated encounters (spoken corpus) broadcast in (live) streaming (CMC corpus).
I work with dialogue-like data taken from different settings such as meetings, Q&As and press conferences with foreign guests. The audience is made up of young jurors (except for the press conferences) plus a remote one following the (live) streaming. I aim to align the transcriptions with the accompanying videos and/or audio tracks through the use of software such as ELAN and EXMARaLDA. These data will be analysed in particular in terms of interactional dynamics, conversational formats and linguistic features: these are the specific areas that may require specific annotations.
As I said before I think that my corpus can be also defined as a CMC one because it deals with public (media) communication online broadcast in (live) streaming through social media platforms and official web TV services. My aim is to analyse whether and how the (live) streaming of these events influences the interpreting performances both at a micro and at a macro level, namely in terms of “audience design” (Bell, 1984, 1991, i.e. how interpreters adjust to this specific situational and interactional context) and “ethics of entertainment” (Katan & Straniero Sergio, 2001, i.e. the final aim of each and every TV show is to entertain the audience). Given that FFI can be subsumed under the wider field of MI (Merlini, 2017), this concept can be applied to other types of broadcast communication as well. The combination of oral data and computer-mediated communication could be my contribution to CLARIN: indeed, its CMC corpora deal with written data.
The Covid pandemic had an impact on the 50th edition of the Giffoni Film Festival and on my research as well. For example, I would have attended a residential course at the Institute for Computational Linguistics “A. Zampolli” in Pisa, but it has been postponed to a later date because of the lockdown. The contribution in this case would be twofold: the CLARIN infrastructure and its experts may contribute to my project with their expertise and technological resources, and I may contribute to CLARIN in particular as far as the settings and the presence of the interpreters are concerned.
The Institute of Literary Research of The Polish Academy of Sciences, Poland
In my paper, I would like to present the research project that forms the basis of my PhD thesis. The main area of my interest is the presence of various types of irony in the work of Juliusz Słowacki, one of the most important and influential poets of Polish Romanticism. This poet is known for his masterful use of various types of irony, which is unique in Polish literature. The nature and manner of its usage change with the development of the poet's work. The first element of my doctorate is to show the development of the use of irony and the artistic means that are used to achieve an ironic effect. I present different types of irony, my main interest being literary irony.
When examining irony, I use not only recognized methods of traditional literary studies, but also methods introduced by digital humanities, with particular use of tools provided by the CLARIN consortium. In my paper I present the process of creating a digital corpus of texts by Juliusz Słowacki and its use in research. The first important step was to use the LEM tool to generate statistics of lemmas and tags within the entire corpus, individual texts and groups of texts. Next, I created a list of stylistic irony markers that can be compiled in such a way that they can be detected using CLARIN tools. In some cases, the tools need to be adjusted, but most often it is possible to set the currently available tools so that the expected results can be interpreted in search of the described irony indicators.
The 'Mnemosyne Language' Tool: Towards the Creation of a New Lexical Resource for a Knowledge Model in Cultural Heritage
Digital Heritage Research Lab, Cyprus University of Technology, Cyprus
The combined PhD projects constitute the backbone of the research agenda of the EU Chair in Digital Cultural Heritage “Mnemosyne” Project (WIDESPREAD-03-2017, 2019-2024). The project is structured as a three-phase research programme that has as a main objective the holistic documentation in Cultural Heritage, which fully supports all the needs of the potential users. This novel approach gives any tangible Cultural Heritage objects voice, using a language understandable to the end users, by tying in a taxonomy of both movable and immovable objects.
We will present the results of the research undertaken in the first phase of the project, which consists of a hierarchical knowledge-based classification system for Cultural Heritage artefacts, monuments and sites. The advantage of this methodology is the novel integration of highly complex multimodal information and data in a unique computer-integrated knowledge management system, which will be the foundation of newly developed semantic rules and an innovative ontology.
The “Mnemosyne language” tool could be integrated in the future as a new resource in the CLARIN Research Infrastructure, available for being openly reused and implemented as one of the first contributions of the CLARIN-CY National Consortium.
University of Las Palmas de Gran Canaria, Spain
There is currently a total lack of bilingual French-Spanish resources in the field of architecture, a sector particularly prosperous after a major period of crisis.
The goal of this project is to solve this lack of material and to create an extract of a French-Spanish bilingual dictionary for the domain of architecture, as well as a methodological proposal.
Having used the resources proposed by ELEXIS for this project, we raise questions concerning the use of the already well-established CLARIN infrastructure. We will try to show the need for such resources to be easily shared and housed in a free long-term infrastructure.
Liepāja University, Ventspils University of Applied Sciences, Latvia
In Latvian linguistics, the language of science has been researched in various studies; however, the general landscape of research into the subject can still be regarded as fragmentary. The presented study on the summaries of doctoral dissertations in management science and the characteristic linguistic features of the texts and their translations is a study in contrastive linguistics using contrastive analysis and corpus-linguistic methods. First, the unilateral method is applied, analysing the grammatical and lexical phenomena of the texts and their micro- and macrostructure in English and Latvian; then the bilateral method is applied, performing semasiological analysis and specifying the results of the interlingual comparison.
Parallel corpora of management science thesis summary texts written in English and translated into Latvian from 2013 to 2020 are used, subsequently contrasting the texts with their translations. Two corpora, each containing 25 texts of management science doctoral dissertation summaries, have been created, one containing the original texts in English and the other containing their Latvian translations. The study is representative as it contains 98% of the management science thesis summary texts written in English in Latvian universities over the period. With the help of AntConc, microstructure and macrostructure analysis of the studied texts has been performed. A semasiological analysis of terms (false friends) found in the researched texts, which could cause difficulties in the translation process, has been carried out.
The use of the elements denoting the authors’ stance, such as the use of the first and the third person, passive voice, personal and impersonal constructions, as well as linguistic formulations, is being studied in detail.
The following macro-structural elements appear in the texts: annotation, introduction; brief theoretical justification; research methods; main results; discussion and interpretations; limitations; novelty; key conclusions, proposals; sources; words of gratitude. Also, macrostructure elements specific to the researched texts have been established.
The study could provide input into CLARIN projects by supplying information that the doctoral students need in order to write and translate their theses and summaries. After completion of the study the idea of including the paper into CLARIN repository will be considered. (Slides)
Crosslinguistic Influence and Cognitive Processing in Bi- and Multilinguals: A Parallel Corpus and Experimental Study
University of Lille, France
How do the languages we speak influence our thought? Despite a growing interest in the impact of bilingualism and a general “bilingual advantage” assumed by many researchers, little is still known about the universal vs. language-specific aspects of such influence or about the degree of implication of crosslinguistic interaction in bi/multilingual contexts.
The focus of the present study is on the specificities of the involved languages as observed in the syntactic and the semantic domains across and within systems, and on the degree of their impact at the computational and algorithmic levels. More specifically, the languages of the world differ in the encoding of concepts and events. For instance, there is a class of adjectival predicates (called Tough) like easy, tough, difficult which, when added to a construction, can cause asymmetries at both the syntactic and semantic levels. English (EN) and French (FR), for example, follow a gap-strategy (raising object to subject), whereas Russian (RU) allows for a variety of patterns (e.g., non-stative constructions with case marking, use of deverbals).
FR ≈ EN vs. RU
In the domain of events, and more specifically in the domain of motion events, the same languages follow a very different schema. While French tends to lexicalize the Path of a motion in the verb leaving the expression of Manner optional (Verb-framed strategy), English and Russian allow for dense constructions where Manner is lexicalized in the verb and Path encoded in satellites (Satellite-framed strategy).
FR vs. EN ≈ RU
Such dissociations constrain how speakers organize information in discourse and revive questions about how they deal, linguistically and cognitively, with such asymmetries especially when they learn a second/third language (FR-EN/RU-EN/RU-FR/RU-EN-FR).
The comparative approach of this project involves corpus data accessed through the CLARIN VLO-platform combined with a variety of experimental offline (categorization/acceptability judgements) and online measures (eye-tracking, RTs). This triangulation is used to investigate the relative impact of language-independent and language-specific factors on verbal and non-verbal processing of monolingual, bilingual and trilingual speakers, and thus provide deeper insights into the relation between language and thought.
This project (a) connects two areas that have never been studied together in an experimental way (tough- and motion-event constructions); (b) brings together different research methods for the investigation of the Language-Thought interface (offline and online experimental data gathered in the lab combined with corpus data gathered in naturalistic settings); and (c) contributes more generally to the debate about universal vs. language-specific dimensions of cognition. Apart from its (d) implications in the domain of Corpus and Cognitive Linguistics – through the proposed crosslinguistic comparisons (English-French-Russian) – this research also has (e) obvious applications in the field of second/third language acquisition and teaching. Finally, the project (f) capitalizes on existing language databases available through CLARIN certified organizations and Knowledge-Centres (e.g., Talkbank, Ortolang, CORLI) and (g) offers to enrich the infrastructure with new open-access and interoperable bi/trilingual corpora, aligned not only with audio/video formats but also with eye movement data coupled with offline measures – digital resources not yet available to scholars through CLARIN.
Faculty of Electronics, Wrocław University of Science and Technology, Poland
The main aim of this work was to find a method of automatically evaluating the quality of dimensionality reduction algorithms in terms of visual legibility for humans.
When a researcher is struggling with high-dimensional data (e.g. vector representation of documents), he often would like to reduce it so he can view it in a two-dimensional plane. To do this, he applies one of the algorithms such as PCA, t-SNE or Isomap. Most of them require at least a few parameters to be defined in advance. The researcher (often according to his intuition and previous experience) sets them manually in several ways, makes a visual comparison of the obtained results and selects the most accurate in his opinion.
Automating the evaluation process would allow scientists to search the space of hyperparameters for those that generate the objectively clearest results. In the context of CLARIN infrastructure, readability of outputs for all services that present data in a two-dimensional plane could be automatically improved, especially since the user often lacks the skills or the ability to do it by himself. The tools that could benefit from such work are for example WebSty (https://ws.clarin-pl.eu/websty.shtml) and TopicModeling (https://ws.clarin-pl.eu/topic.shtml).
Our solution is based on comparing ground truth (i.e. labels from manual tagging) with labels obtained by clustering in the lower space. If the reduction is correct, then clustering in the lower space (especially using Euclidean distance) should give us results comparable with ground truth, because most of the methods strongly rely on the simple fact that the points representing the same class should be placed in low dimensional space close to each other and far away from points representing other classes. It is worth noting that we do not need ground truth to use the algorithm. The second set of labels can be obtained, for example, in a process of clustering in a higher dimension.
In our work, we examined two clustering algorithms (K-means and Agglomerative Hierarchical Clustering with Euclidean distance and several linkage methods) on four different corpora (each corpus had predefined labels). Every document was represented by the mean word2vec vector generated using the KGR10 FastText model for Polish. To map every document on a 2D plane we used the t-SNE algorithm with the following set of searched parameters: metric, number of iterations, learning rate, perplexity. For evaluation purposes, we used the Adjusted Mutual Information (AMI) score. The set of parameter values with the highest score was considered the most accurate, which was confirmed by the visual reception of the plots.
The main problem of our research is the subjective evaluation of the results. In order to overcome this problem, we are planning to run a series of surveys that should show that the automatically obtained score actually grows with people's judgment. Besides, the method may need some adjustments to take into account human understanding of the words "far" and "near". For example, keeping a group too far from other groups can make it very difficult to actually see the results. It would also be worth trying to use other clustering algorithms. (Slides)
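The scoring step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the t-SNE reduction is left out, the two 2D layouts are hand-made stand-ins for embeddings produced under different hyperparameters, and the small k-means and AMI functions (arithmetic-mean normalisation) are simplified stdlib-only versions written for illustration. All function and variable names are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labelling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_info(a, b):
    """Mutual information between two labellings of the same items."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((nij / n) * math.log(n * nij / (ca[i] * cb[j]))
               for (i, j), nij in cab.items())


def log_fact(x):
    """log(x!) via the log-gamma function."""
    return math.lgamma(x + 1)


def expected_mutual_info(a, b):
    """Expected MI of two random labellings with the same cluster sizes."""
    n = len(a)
    emi = 0.0
    for ai in Counter(a).values():
        for bj in Counter(b).values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                log_p = (log_fact(ai) + log_fact(bj)
                         + log_fact(n - ai) + log_fact(n - bj)
                         - log_fact(n) - log_fact(nij)
                         - log_fact(ai - nij) - log_fact(bj - nij)
                         - log_fact(n - ai - bj + nij))
                emi += (nij / n) * math.log(n * nij / (ai * bj)) * math.exp(log_p)
    return emi


def ami(a, b):
    """Adjusted Mutual Information with arithmetic-mean normalisation."""
    emi = expected_mutual_info(a, b)
    denom = (entropy(a) + entropy(b)) / 2 - emi
    if abs(denom) < 1e-12:  # degenerate case, e.g. a single cluster
        return 1.0
    return (mutual_info(a, b) - emi) / denom


def kmeans(points, k, iters=50):
    """Tiny deterministic k-means (farthest-point initialisation)."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centres)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centres[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def score_embedding(points_2d, truth):
    """Cluster the 2D layout and compare the clusters with the ground truth."""
    return ami(truth, kmeans(points_2d, len(set(truth))))


# Hypothetical data: six documents, two classes, two candidate 2D layouts.
truth = ["a", "a", "a", "b", "b", "b"]
well_separated = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
mixed = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]

print("well separated:", score_embedding(well_separated, truth))
print("mixed:", score_embedding(mixed, truth))
```

In a hyperparameter search, `score_embedding` would be called once per candidate parameter set and the layout with the highest score kept. A tested library implementation of t-SNE, clustering and AMI would normally replace the simplified functions above.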
CLARIN in the Classroom is a new initiative open to university lecturers who have used CLARIN resources, tools or services in their courses. They are invited to present their experience and suggest future steps that can help facilitate and accelerate the further integration of CLARIN into university curricula.
Building and maintaining online courses in digital research methods
Mietta Lennes, University of Helsinki / FIN-CLARIN
The digital text and speech materials and tools offered via CLARIN can be widely used within the Social Sciences and Humanities. In order to promote a general awareness of these possibilities, FIN-CLARIN annually offers three open online courses: Corpus Linguistics and Statistical Methods (twice a year), Introduction to Speech Analysis, and Data Clinic. The Data Clinic is implemented in co-operation with the HELDIG network of Digital Humanities experts at the University of Helsinki.
In the presentation, the background of each of the three courses will be described briefly. Some of the practical issues in maintaining course contents in two languages on Moodle will also be discussed. Corpus Linguistics has been offered in Finnish and English for several years, the Data Clinic materials are primarily in English, and, starting from autumn 2020, Speech Analysis is also provided both in Finnish and in English.
Research methods, tools and software are under constant change and development. It is a nearly impossible task to keep track of all the technical skills that could or should be recommended for or required from students in, e.g., Digital Humanities in any given year. For similar reasons, multidisciplinary advanced-level courses like the Data Clinic do not really seem to get "finalized", but the contents must be actively reviewed, amended and updated. Fortunately, it is in many cases possible to use external learning materials and resources.
In multidisciplinary courses, learners come from different fields and backgrounds, and some of them may not yet have all of the required skills. This can place an extra burden on the teacher. In online teaching, the issue might be partially solved by sharing small, self-contained and well described packages of learning content (i.e., Learning Objects) that can be used flexibly in many different courses and replaced or removed when the content becomes obsolete. The possibilities of sharing Learning Objects within CLARIN will be discussed in the presentation.
Corpus literacy in German linguistics: the usage of corpus tools and platforms in academic classrooms
Laura Herzberg, University of Mannheim
The teaching of "corpus literacy", understood as "the ability to use the technology of corpus linguistics to investigate language and enhance the language development of students" (Heather & Helt 2012: 417), is a central component of linguistic degree programs. In approaches to "data driven learning", corpora are also gaining in importance as resources in language teaching (Leńko-Szymańska & Boulton 2015); however, corpora and corpus linguistic methods are yet to be used extensively in teacher training scenarios. The German Linguistics department at the University of Mannheim is filling this gap by offering corpus linguistic classes as an integral part of the bachelor's and master’s curricula in a variety of study programs, such as teacher training (B.A./M.A. of Education), as well as interdisciplinary study programs that combine German linguistics classes with cultural and political sciences or media communication and literature studies.
The corpus platforms offered by the Leibniz Institute for German Language (IDS) as well as the Digital Dictionary of the German Language (DWDS) of the BBAW provide the data bases in the courses. The students are introduced to these platforms by watching tutorials and solving complementary tasks. Independent of the content-related focus of the class, the students have to develop a linguistic research question that is the centre of their own research project. They are instructed to use a corpus linguistic approach to examine their research questions. These questions are usually based on personal interests and are influenced by their own experiences, e.g. the usage of anglicisms in online texts, such as blogs or discussion forums.
Use cases for corpus platforms carried out in our classes are manifold: In a teacher training class, the students learn how to use corpora with regard to their own future teaching career, for example, in order to investigate different German vocabularies by comparing newspaper corpora (e.g. DWDS corpus “die ZEIT”) with corpora of computer-mediated communication (e.g. DWDS blog corpus). In a different scenario, students use the Database for Spoken German (DGD) to analyse the usage of interaction signs, such as interjections, and compare their frequency to written language data, such as postings in Wikipedia talk pages that are available via the Corpus Search, Management and Analysis System (COSMAS II). The latter are equally often used for cross-lingual studies. The students also explore collocations, create word development curves as well as word profiles with the help of corpus tools.
Pedagogical Applications of ORVELIT Corpus
Jurgita Vaičenonienė and Jolanta Kovalevskaitė, Vytautas Magnus University
We will share our experience of creating a corpus tailored to the needs of translators, editors and Lithuanian language specialists, which has recently found a home in the CLARIN-LT repository. By showing how we integrate CLARIN-related content in our lectures, we will argue that teaching the students about the tools and resources for the analysis of Lithuanian stored in the national CLARIN centers provides knowledge on services offered by CLARIN in general. We will first introduce the goals of creating the comparable corpus of original and translated Lithuanian ORVELIT (Originalios ir vertimų lietuvių kalbos tekstynas). Next, we will present its integration into the curriculum of the MA course of “Contrastive Stylistics” at Vytautas Magnus University. Part of the course is devoted to showing the students how to access and use electronic language resources and analysis tools. Among other topics, the students get acquainted with the building procedure and characteristics of the ORVELIT corpus. From now on, they will be able to download the morphologically annotated and raw versions of the corpus to investigate the features of original and translated Lithuanian on their own. Finally, we will share our ideas on developing the uptake of the corpus. As the students in the programme of Applied English Linguistics are not usually taught to work with annotated corpora, our next steps are to expand the ORVELIT entry in the CLARIN-LT repository by providing pre-generated lists of parts of speech and preparing demos on searching the corpus with tools for annotated corpora (e.g., ANNIS).
About the use of CLARIN tools in the courses taught to students of empirical linguistics and language documentation
Katarzyna Klessa, Adam Mickiewicz University in Poznan
This contribution summarizes the classroom experience gained from lectures for the students of two study programmes at Adam Mickiewicz University in Poznań, Poland, i.e.: "Empirical Linguistics and Language Documentation" (MA programme taught in English, http://elldo.amu.edu.pl/) and "Computer Linguistics" (BA programme taught in Polish, http://computerlinguistics.amu.edu.pl/).
A number of tools made available by the members of the CLARIN consortium (especially CLARIN-PL but not exclusively) are applied within three kinds of university courses: documentary linguistics, corpus linguistics and experimental phonetics.
Selected examples of classroom curricula are discussed in the contexts of the above courses. The curricula assume the focus on (1) raising awareness about the tools among students, (2) practical use of some of the tools in the course of classroom activities, (3) identifying possible future applications of the CLARIN infrastructure.
The choice of the specific resources and task definition depends on the course topic. Among others, we use the CLARIN-PL speech tools package, Spokes Conversational Speech, dSpace Repository, Polish Wordnet or The Language Bank of Finland (Kielipankki).
Students are instructed to get familiar with the CLARIN websites, to comment on the contents, test the tools, and to solve simple tasks using the CLARIN resources. Student activities involve searching repositories, data and metadata training, team-working, writing reports as well as spoken data analysis and processing.
LABLASS and the BULGARIAN LABLING CORPUS for Teaching Linguistics
Velka Popova, Radostina Iglikova and Krasimir Kordov, Konstantin Preslavsky University of Shumen
The Applied Linguistics Laboratory (LABLING) at the Konstantin Preslavski University of Shumen is a technological partner of the ClaDa-BG National consortium. The LABLING team focuses its research on creating computer corpora of children's speech and collections of associative data. After two years of work, pilot versions of the web-based LABLASS and the BULGARIAN LABLING CORPUS are now available, and I, as a project participant, was immediately tempted to include them in the curriculum of the linguistic disciplines I teach and in my newly published textbook Psycholinguistics as Experimental Linguistics.
Publishing the BULGARIAN LABLING CORPUS on the CHILDES platform (https://childes.talkbank.org/access/Slavic/Bulgarian/LabLing.html) broadens the platform's capacity for cross-linguistic research to include another Slavic language. At the same time, the Bulgarian linguistic tradition is enriched with another universal and useful standard for researching linguistic ontogenesis, which makes it possible to compare large numbers of languages quickly, precisely and reliably and, on this basis, to build solid typologies and modern theories. With this in mind, I have decided to broaden the topics studied in the disciplines Psycholinguistics, Linguistics and Child Linguistics, which in turn has improved the standard and quality of independent student research (course assignments and theses). In addition, as a professor I often use the BULGARIAN LABLING CORPUS data in my teaching, since their multimodal format makes them applicable for various demonstrations.
The abilities of the web-based LABLASS system developed within the ClaDa-BG project are not limited to including available lexicographic resources but are instead much broader, extending to the creation of new dictionaries and the visualisation and comparison of data from different sources. Therefore the system has its place in the practical modules of the disciplines I teach, especially when they have to do with discussing topics such as mental lexicon and language ontogenesis. During classes students can put their own working hypotheses to the test by comparing and analyzing published data and creating their own dictionaries. Another positive result to be mentioned is the successful defense in 2019 of an MA thesis entitled “Specificities of the Vocabulary of the Bulgarian Native Speaker Nowadays”.
In conclusion, for me and my students, before CLARIN entered the academic classroom with its resources, instruments and services, the classroom became a sort of workshop for CLARIN, where in the process of creating corpora of children's speech and associative collections the students acquired research competences and skills, as well as the self-esteem that their future products would return to them and their colleagues in the university auditorium. The students were separated into two work groups and each student received their own individual project with specific tasks. The active participation of the students in recording and transcribing children's speech in the CHILDES universal format, as well as in collecting, systematizing and entering the associative data in the LABLASS web-system, has definitely played an important role in their personal and professional growth.
 Popova, Velka. 2020. Psycholinguistics as Experimental Linguistics. Shumen (in Bulgarian)
Wiktoria Mieleszczenko-Kowszewicz, SWPS University of Social Sciences and Humanities
One of the challenges language researchers face is polysemy, which affects the accuracy of research. Another is the lack of a tool that captures a specific category of words. The aim of the workshop is to teach participants how to create their own categories of words in specific meanings to achieve their scientific goals. Another aim is to present a potential way of using the new tool in the classroom. Polish Wordnet enables users to recognize the emotional and fundamental values of words. This information is used for frequency analysis of words belonging to a specific category in text corpora, using the “Sentiment analysis” option in the Literary Exploration Machine (LEM). A new option in LEM, “Own category”, performs frequency analysis of particular words in specific meanings in a text. During the workshop I will present the process of selecting words, including the choice of word meanings and guidelines for competent judges. This process will be presented as a way of engaging students in a potential research project.
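The idea of counting words from a user-defined category in a specific meaning can be sketched in a few lines of Python. This is only an illustration and not LEM's interface: it assumes the text has already been lemmatised and sense-annotated, and the category names, sense identifiers and function name are all hypothetical.

```python
from collections import Counter

# Hypothetical user-defined categories; each member is a (lemma, sense_id)
# pair, so the same lemma can belong to different categories per meaning.
CATEGORIES = {
    "finance": {("bank", "bank.n.1"), ("interest", "interest.n.2")},
    "landscape": {("bank", "bank.n.2"), ("river", "river.n.1")},
}


def category_frequencies(tokens, categories):
    """Count how many sense-annotated tokens fall into each category."""
    counts = Counter({name: 0 for name in categories})
    for token in tokens:
        for name, members in categories.items():
            if token in members:
                counts[name] += 1
    return counts


# Hypothetical sense-annotated text: a list of (lemma, sense_id) pairs.
TOKENS = [
    ("bank", "bank.n.1"),
    ("river", "river.n.1"),
    ("bank", "bank.n.2"),
    ("interest", "interest.n.1"),  # a sense that is in no category
    ("bank", "bank.n.1"),
]

result = category_frequencies(TOKENS, CATEGORIES)
print(dict(result))
```

Keying categories on lemma and sense together, rather than on the surface word, is what keeps polysemous words from inflating the counts of the wrong category.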
The making of the siParl tutorial
Kristina Pahor de Maiti, Faculty of Arts, University of Ljubljana
Darja Fišer, Jožef Stefan Institute, Ljubljana, Slovenia
In this talk we will share our experience with the development of the tutorial “Voices of the Parliament” (https://www.modernlanguagesopen.org/articles/10.3828/mlo.v0i0.295/) which shows how corpora can be used to investigate language use and communication practices in a specialised socio-cultural context of political discourse. The tutorial demonstrates the potential of parliamentary corpora research via concordancers without the need for programming skills. The intended audience of the tutorial are students and researchers of modern languages, but also users from other fields of digital humanities and social sciences who are interested in the study of socio-cultural phenomena through language. We will present the obstacles we ran into when selecting the resources and tools to be showcased, demonstrate the solutions we adopted, and point out the important next steps for CLARIN to take in order to make their infrastructure more suitable for use in pedagogic settings.
Digital Philology and Computational Linguistics
Federico Boschetti and Monica Monachini, CNR-ILC Pisa and CLARIN-IT
The CNR-ILC is present with educational activities in many Italian universities and high schools. This is a good opportunity for promoting the best practices of CLARIN in academic courses, internships and summer schools. In this contribution we will illustrate our experience in Pisa, Venice, Macerata and Siracuse.
The University of Pisa hosts a prestigious program for the study of digital humanities and computational linguistics (“Corso di Laurea Magistrale in Informatica Umanistica”). An introductory class of the courses in Digital Philology is dedicated to illustrating what CLARIN offers to classicists, such as Greek and Latin annotated corpora and treebanks. Some students also take an internship at the CNR-ILC, where they use CLARIN resources. Furthermore, this year the summer school promoted by the LabCD, “Digital Tools for Humanists” (https://bit.ly/3cfR84a), was to be funded by CLARIN, but unfortunately it has been suspended due to the COVID-19 emergency (it will be resumed next year).
In Venice, the CNR-ILC is present both at the Venice International University (VIU) and at the Venice Centre for Digital and Public Humanities (VeDPH). The VIU is a consortium of twenty institutions and its Globalization Program is a multicultural, international and interdisciplinary program in human and social sciences. CLARIN was explicitly introduced in the syllabus (https://bit.ly/2RLB4xv) for a course entitled “Digital Humanities: Web Resources, Tools and Infrastructures” in 2017. The national coordinator of CLARIN-IT gave a seminar, and students used CLARIN to identify multilingual resources for their multicultural studies. CLARIN will be part of the syllabus again this year (https://bit.ly/2RROWpT). Last year at the VeDPH a talk on CLARIN-IT (https://bit.ly/3kCwLkI) opened the cycle of seminars in Digital and Public Humanities addressed to PhD students, and CLARIN is in the syllabus of the “Literary and Linguistic Computing” course.
The PhD course “Humanities and Digital methods” at the University of Macerata is positioned between the humanities and new technologies. It highlights the potential of the interaction between these two fields of knowledge, thus fostering innovation. Its main research topics are digital archives and libraries, databases for historical research, computational linguistics, new tools for textual analysis and elaboration, and the ethical implications of techno-sciences. The students have been introduced by the national coordinator of CLARIN-IT to the advantages of using a research infrastructure for their studies, with the possibility to access, share and use data and tools. It is urgent in the humanities to raise awareness of the benefits of the two pillars that infrastructures rely on: Open Science and FAIR data. The course has shown that even for PhD students it is not immediately evident that data should comply with the FAIR principles, and that they are not aware of how to produce FAIR data during their research.
CNR-ILC also collaborates with high schools in Siracusa and Pisa: students are asked to annotate literary texts in order to identify relevant linguistic and stylistic phenomena. In this case, CLARIN is presented to the classes in order to enhance their motivation: for high school students too, it is very important to understand that digital resources must be FAIR, and that research infrastructures are the natural solution for this.
Simonetta Montemagni and Giulia Venturi, CNR-ILC Pisa and CLARIN-IT
Teaching Computational Linguistics to Master students within a Digital Humanities degree program is aimed at meeting two broad complementary goals, covering both the practical utility of NLP in real world applications and its promise for improving the understanding of human language and/or for exploring humanistic texts. The typical situation that has to be tackled in both cases is the adaptation of existing tools and resources to the specific language variety which needs to be automatically processed (e.g. historical varieties of language, social media or domain specific language, or different textual genres). During the course taught together with Giulia Venturi, this topic is practically investigated in a project articulated as follows: (i) the project goal is proposed to the students (typically in pairs) halfway through the semester; (ii) a presentation is given by each group during the last week of the course, illustrating the results achieved up to that point; (iii) a final report is provided before the exam, together with the annotated corpus developed during the project.
Over the last few years, in our CL course for Master students at the University of Pisa, we proposed a “domain adaptation” task (focusing on different varieties of language, e.g. historical texts or texts belonging to different genres) developed along the lines detailed above and carried out using tools and resources distributed via CLARIN, namely the UD treebanks and the UDPipe linguistic annotation chain. In all cases the main topic of the project has been corpus annotation, with a specific view to handling the peculiarities of the language variety dealt with and simultaneously harmonizing annotation choices with the general UD annotation strategy.
For the academic year 2019-2020, the goal of the project has been the construction of a social media test corpus to evaluate the accuracy of UDPipe trained on different treebanks representative of different varieties of language use. The stages of work can be summarised as follows:
- automatic annotation of a ~2,500 token corpus of tweets provided by the teachers to each group of students;
- manual revision of the automatically annotated text: at this stage, the revision is carried out individually;
- inter-annotator agreement analysis between the members of the group and construction of a unified version of the subcorpus assigned to them. During this phase, a qualitative and quantitative analysis of the type of errors found is carried out, with particular attention to the peculiarities of the language variety dealt with, individual annotation choices of group members and proposed annotation guidelines for incorrectly / inadequately treated constructions;
- use of the revised unified corpus to test the correctness of the annotation produced by UDPipe models trained on different language varieties (news vs social media language).
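The agreement and evaluation stages above can be illustrated with a minimal Python sketch. It is not the course's own code: the abstract does not say which agreement measure or evaluation script was used, so Cohen's kappa and plain per-token accuracy are assumed here, and the function names and the toy tag sequences are hypothetical.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two annotators labelling the same tokens.

    Assumption: both lists hold one label per token (e.g. UPOS tags),
    aligned position by position.
    """
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(tags_a)
    # Proportion of tokens on which the two annotators agree.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


def accuracy(gold, predicted):
    """Share of tokens where the automatic annotation matches the gold one."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Toy data (hypothetical): two manual revisions of the same four tokens.
annotator_a = ["NOUN", "VERB", "NOUN", "ADJ"]
annotator_b = ["NOUN", "VERB", "ADJ", "ADJ"]
kappa = cohen_kappa(annotator_a, annotator_b)

# Unified gold standard versus the output of two hypothetical models.
gold = ["NOUN", "VERB", "NOUN", "ADJ"]
news_model = ["NOUN", "VERB", "PROPN", "NOUN"]
social_model = ["NOUN", "VERB", "NOUN", "NOUN"]
scores = {
    "news": accuracy(gold, news_model),
    "social": accuracy(gold, social_model),
}
```

The same `accuracy` function can be applied to any annotation layer that has one label per token, such as part-of-speech tags, head indices or dependency relations, once these have been extracted from the CoNLL-U files.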
Computational Linguistics, CLARIN in the Classroom: Case of Latvia
Inguna Skadiņa, Ilze Auziņa and Baiba Saulīte, Institute of Mathematics and Computer Science, University of Latvia
In Latvia, the teaching of Computational Linguistics started in 2003 at Liepāja University. Two different courses were created: one for master students in computer science and one for master students in linguistics. While the course for linguistics students concentrated mostly on corpus linguistic methods, covering the creation of corpora and their use in linguistic studies, the course for computer science students dealt with algorithms and methods for implementing language processing tools. The content of these courses changed gradually with the development of new language resources and tools, as well as with new methods for studying and modelling language. A few years later Computational Linguistics was included in the doctoral study course on Modern approaches to Linguistics at Liepāja University, which has resulted in several PhD theses in which corpus-driven methods were applied.
In 2013 a Natural Language Processing course was introduced for bachelor students in computer science at the University of Latvia. The course mainly concentrates on language modelling, in particular Latvian language modelling, and on state-of-the-art tools and methods.
Although Latvia joined CLARIN in 2016, the language resources and tools that are currently available from the CLARIN-LV repository, as well as those that we plan to include in the near future, have been explored in courses since the first ones in 2003. When a new and important tool or resource is developed, it is usually presented in the corresponding course. It should be noted that the CLARIN-LV C-Center was registered only in March 2020, whereas the language resources and tools have been explored in teaching for much longer.
In 2017 a new Computational Linguistics course for Master students in English philology was started. This is the first course in which one lecture and one seminar are devoted to the CLARIN language resources and tools (LRTs). In the lecture we introduce students to the CLARIN research infrastructure, the Virtual Language Observatory and the resource families. We also highlight some LRTs that are particularly relevant to the research interests of the students in a given year. In the seminar students present their findings: a tool or resource that they found in the CLARIN VLO and that seemed interesting to them. We ask students to explore resources other than simple corpora.
In autumn 2020 a Computational Linguistics course started for Master students in Baltic Philology. This is the largest course (6 ECTS). In this course we will build on the experience from our previous courses: one lecture and one seminar will be devoted to the CLARIN language resources and tools, while in other lectures and tutorials students will study Latvian language resources (Corpus of Modern Latvian, Latvian Treebank, Latvian FrameNet corpus, etc.) that are available in the CLARIN-LV repository. In addition, we plan to involve students in the CLARIN Café, which will introduce CLARIN in a nutshell.
While we are familiar with language resources and tools for Latvian, tools and resources that might be of interest to students in English philology are less well known to us. We would be very interested in a collective effort by CLARIN partners to create an aligned list of language resources and tools for all languages (e.g., morphologically annotated corpora, treebanks, POS taggers, etc.).
- CLARIN-LV repository
- Computational Linguistic Course for Latvian Philology
- Computational Linguistic Course for English Philology
Integrating Computation into the Humanities: Using CLARIN Data in the Digital Humanities Hackathon in Helsinki
Mikko Tolonen, University of Helsinki
In this presentation, we will present the workflows employed in the Helsinki Digital Humanities Hackathon and demonstrate how CLARIN data and tools can easily be integrated into core humanities work and education. We will also introduce the University of Helsinki MA track in Digital Humanities, which serves two general objectives: 1) renewing the scholarly culture in particular areas of the humanities through the introduction of innovative methods; and 2) meeting the challenge of digitization in the training of humanities professionals to support their ability for critical reflection in the digital world and enable them to participate in multidisciplinary collaboration with professionals from different backgrounds.
The main learning objective of DH education at UH is to help students find and further develop a common language among the humanities, the social sciences, and data science. Our primary educational philosophy is to promote cross-disciplinary collaboration in the humanities. In the long term, this will enable the renewal of scholarly culture, bridging the gap between traditional humanities and those who experiment with new methods on humanities data. What is unique about DH teaching at the University of Helsinki is that the introductory and elective courses have a specific practical aim: they prepare students to put their skills and theoretical knowledge into practice at the end of each academic year in a multidisciplinary research project entitled Helsinki Digital Humanities Hackathon.
The Helsinki Digital Humanities Hackathon is a chance to experience an interdisciplinary research project from start to finish within the span of 10 days. For researchers and students from computer science and data science, the hackathon provides an opportunity to test their abstract knowledge against complex real-life problems. For people from the humanities and social sciences, it shows what they can achieve with such collaboration. For both, the hackathon gives the experience of intensely working with people from different backgrounds as part of an interdisciplinary team. During the hackathon, each group plans and conducts a digital humanities research project: working together, they formulate research questions with respect to particular data sets, develop and apply methods and tools to answer them, and present their work at the end of the hackathon. The DHH project brings together not only students but also academics and industry professionals. Thus the students are practising skills that will be useful to them regardless of whether they are planning an academic career or aiming to find a job outside academia.
UPSKILLS, an Erasmus+ project that will foster research-based teaching
Lonneke van der Plas, University of Malta
In this presentation, I would like to present the project UPSKILLS (UPgrading the SKIlls of Linguistics and Language Students) to you. We are an Erasmus+-funded strategic partnership featuring 6 partners (University of Malta as coordinator, University of Belgrade, University of Bologna, University of Graz, University of Rijeka, CLARIN ERIC) and many associated partners (University of Zurich, University of Geneva, the Open University of Cyprus and several industrial partners). We have just started and the project will run for 3 years.
Our central goal is to tackle skills gaps and mismatches in students of language-related disciplines by supporting the development of materials that better meet the learning outcomes needed in the current job market. Our analyses showed that there is a need for a more transdisciplinary approach that supports the acquisition of transferable, forward-looking skills, such as critical thinking and problem solving, knowledge of research design and data analysis, project management, and digital skills. We will make use of innovative pedagogies such as online educational games, develop and test modular and blended learning, and train lecturers accordingly. Furthermore, we will focus on real-world applications (work-based learning), on promoting inquiry-based learning, and on integrating existing research and research infrastructures into teaching. CLARIN is one of the partners in this project; together with the other partners we will design and implement methods for using CLARIN’s resources and tools in the classroom. Apart from presenting our project, I am very interested to hear from the participants in this session about their experiences, so that we can learn from them.
All abstracts of the CLARIN Bazaar can be found here.