CLAMS: Computational Linguistic Applications for Multimedia Services
In this talk, I will discuss a new collaborative research project between Brandeis and WGBH to build a smart archive platform, in order to assist archivists confronted with thousands of hours of digital data in multiple types of media. This platform, called Computational Linguistics Applications for Multimedia Services (CLAMS), is a workflow composition platform, using computational linguistic and computer vision tools to extract information that can be converted to descriptive metadata, for smarter archiving, search, retrieval, and analysis. Building on top of the success of the recent LAPPS-CLARIN integration and the interoperability between the LAPPS Interchange Format (LIF) and the CLARIN-D/WebLicht Text Corpus Format (TCF), we have begun development on a Multimedia Interchange Format (MMIF) that enables interoperability between content-analysis tools working over different media types (audio, video, text). The CLAMS platform will allow an archivist to create a new or access an existing workflow, by drag-and-drop from a toolshed of registered NL, ASR, and CV tools. The interchange formats, MMIF and LIF, act to bridge the individual components in the workflow, while ensuring type consistency between I/O operations. To illustrate the functionality of the platform, I will demonstrate the integration and deployment of both speech recognition and computational linguistic tools over a subset of the American Archive data, provided by WGBH. Specifically, by identifying text within video, categorizing different audio elements, distinguishing language types, types of locations/scenes, and content breaks, we demonstrate that CL tools can significantly enhance the descriptive content of A/V collections to improve both discoverability and access beyond their existing and often sparse item-level descriptive metadata.
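The idea of an interchange format "ensuring type consistency between I/O operations" can be illustrated with a minimal sketch. It does not reflect the actual MMIF or LIF specifications: the `Tool` class, the `validate_workflow` function and the type labels (`audio`, `transcript`, `named_entities`) are hypothetical names chosen for illustration. The sketch only shows one plausible way a workflow composer might check that every tool's declared inputs are produced by an earlier step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """A hypothetical registered tool with declared input and output types."""
    name: str
    requires: frozenset
    produces: frozenset


def validate_workflow(source_types, tools):
    """Return the set of types available after running the tools in order.

    Raises ValueError if a tool needs a type that no earlier step provides.
    """
    available = set(source_types)
    for tool in tools:
        missing = tool.requires - available
        if missing:
            raise ValueError(f"{tool.name} is missing inputs: {sorted(missing)}")
        available |= tool.produces
    return available


# Hypothetical toolshed entries; the type labels are illustrative only.
asr = Tool("asr", frozenset({"audio"}), frozenset({"transcript"}))
ner = Tool("ner", frozenset({"transcript"}), frozenset({"named_entities"}))

# A video item with an audio track, processed by ASR and then by NER.
result = validate_workflow({"video", "audio"}, [asr, ner])
```

Under these assumptions, placing the NER tool before the ASR tool would be rejected at composition time, because no earlier step supplies a transcript.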
Speech and gestures: computational linguistic studies
Face-to-face communication is multimodal since at least two modalities are involved, the auditive (speech) and the visual (gestures). Speech and gestures are related semantically and temporally on many levels. Co-speech gestures, which comprise e.g. head movements, facial expressions, body posture, arm and hand gestures are co-expressive but not redundant. Discovering the relation between speech and gestures is important for understanding communication, but has also practical applications such as the construction of ICT. In the talk, I will present studies investigating multimodal communication from a computational linguistic point of view. In particular, I will focus on the collection and annotation of multimodal corpora, which in this context are video- and audio-recorded monologues and dialogues, and research conducted on these data at the Centre for Language Technology, in order to investigate the relationship between speech and gestures at the prosodic, syntactic, semantic and pragmatic level.
CLARIN Social Media Session
The seminar is designed to explain the changing face of communication in the 21st century and the importance of using social media in academia and research infrastructures. Participants will learn which social media platforms best fit their needs and why, how to create engaging posts on popular platforms like Facebook and Twitter, when best to publish, what kind of content works, the importance of multimedia, how to schedule posts, how to deal with negative responses, and how to manage accounts efficiently. The seminar will include hands-on activities with lots of examples of good and bad social media practice from the CLARIN and related accounts.
EXMARaLDA meets WebAnno.
Steffen Remus, Hanna Hedeland, Anne Ferger, Kristin Bührig and Chris Biemann.
In this paper, we present an extension of the popular web-based annotation tool WebAnno, allowing for linguistic annotation of transcribed spoken data with time-aligned media files. Several new features have been implemented for our concomitant current use case: a novel teaching method based on pair-wise manual annotation of transcribed video data and systematic comparison of agreement between students. To enable annotation of spoken language data, apart from technical and data model related issues, the extension of WebAnno also offers a partitur view for the inspection of parallel utterances in order to analyze various aspects related to methodological questions in the analysis of spoken interaction.
Human-human, human-machine communication: on the HuComTech multimodal corpus.
Laszlo Hunyadi, Tamás Váradi, István Szekrényes, György Kovács, Hermina Kiss and Karolina Takács.
The present paper describes HuComTech, a multimodal corpus featuring over 50 hours of videotaped interviews with 112 informants. The interviews were carried out in a lab equipped with multiple cameras and microphones able to record posture, hand gestures, facial expressions, gaze etc. as well as the acoustic and linguistic features of what was said. As a result of large-scale manual and semi-automatic annotation, the HuComTech corpus offers a rich dataset on 47 annotation levels. The paper presents the objectives, the workflow and the annotation work, focusing on two aspects in particular, i.e. time alignment made with the Leipzig tool WEBMaus and the automatic detection of intonation contours developed by the HuComTech team. Early exploitation of the corpus included analysis of hidden patterns with the use of sophisticated multivariate analysis of temporal relations within the data points. The HuComTech corpus is one of the flagship language resources available through the HunCLARIN repository.
Oral History and Linguistic Analysis. A Study in Digital and Contemporary European History.
Florentina Armaselu, Elena Danescu and François Klein.
The article presents a workflow for combining oral history and language technology, and for evaluating this combination in the context of European contemporary history research and teaching. Two experiments are devised to analyse how interdisciplinary connections between history and linguistics are built and evaluated within a digital framework. The longer term objective of this type of enquiry is to draw an "inventory" of strengths and weaknesses of language technology applied to the study of history.
The Acorformed Corpus: Investigating Multimodality in Human-Human and Human-Virtual Patient Interactions.
Magalie Ochs, Philippe Blache, Grégoire Montcheuil, Jean-Marie Pergandi, Roxane Bertrand, Jorane Saubesty, Daniel Francon and Daniel Mestre.
The paper presents the Acorformed corpus, composed of human-human and human-machine interactions in French in the specific context of training doctors to break bad news to patients. In the context of human-human interaction, an audiovisual corpus of interactions between doctors and actors playing the role of patients during real training sessions in French medical institutions has been collected and annotated. This corpus has been exploited to develop a platform to train doctors to break bad news with a virtual patient. The platform has been exploited to collect a corpus of human-virtual patient interactions annotated semi-automatically and collected in different virtual reality environments with different degrees of immersion (PC, virtual reality headset and virtual reality room).
Media Suite: Unlocking Archives for Mixed Media Scholarly Research.
Roeland Ordelman, Liliana Melgar, Carlos Martinez-Ortiz and Julia Noordegraaf.
This paper discusses the rationale behind the development of a research environment –the Media Suite– in a sustainable, dynamic, multi-institutional infrastructure that supports mixed media scholarly research with large multimedia data collections, serving media scholars and digital humanists in general.
Parallel session 1: CLARIN in Relation to Other Infrastructures and Projects
Using Linked Data Techniques for Creating an IsiXhosa Lexical Resource - a Collaborative Approach.
Thomas Eckart, Bettina Klimek, Sonja Bosch and Dirk Goldhahn.
The CLARIN infrastructure already provides a variety of lexical resources for many languages. However, the published inventory is unevenly distributed, favouring languages with large groups of native speakers and languages spoken in highly developed countries. Improving the situation for so-called “under-resourced languages” is possible by close collaboration with the language-specific communities and expertise that - naturally - reside in the countries where those languages are spoken. This submission presents an example of such a collaboration, where a representative sample of an existing lexical resource for the isiXhosa language, which is spoken in South Africa, was processed, enriched, and published. The resource under discussion is intended to be a prototype for more resources to come.
A Platform for Language Teaching and Research (PLT&R).
Maria Stambolieva, Valentina Ivanova and Mariyana Raykova.
The Platform for Language Teaching and Research was designed and developed at New Bulgarian University in answer to important educational needs, some of them specific to Bulgaria. The aim of the paper is to present the tool developed to match those needs, its functionalities, architecture and applications – actual and envisaged. The Platform can provide 1/ course development support for native and foreign language (and literature) teachers and lecturers, 2/ data and tools for corpus-driven and corpus-based lexicography, corpus and contrastive linguistics, 3/ an environment for research, experimentation and comparison of new methods of language data preprocessing. The educational content organised and generated by the Platform is to be integrated in the CLARIN part of the CLaDA-BG infrastructure, of which New Bulgarian University is a partner.
Curating and Analyzing Oral History Collections.
This paper presents the digital interview collections available at Freie Universität Berlin, focusing on the online archive Forced Labor 1939–1945, and discusses the digital perspectives of curating and analyzing oral history collections. It specifically looks at perspectives of interdisciplinary cooperation with CLARIN projects and at the challenges of cross-collection search and de-contextualization.
Parallel session 2: CLARIN Knowledge Infrastructure, Legal Issues and Dissemination
New exceptions for Text and Data Mining and their possible impact on the CLARIN infrastructure.
Pawel Kamocki, Erik Ketzan, Julia Wildgans and Andreas Witt.
The proposed paper discusses new exceptions for Text and Data Mining that have recently been adopted in some EU Member States, and will probably soon be adopted at the EU level as well. These exceptions are of great significance for language scientists, as they exempt those who compile corpora from the obligation to obtain authorisation from rightholders. However, corpora compiled on the basis of such exceptions cannot be freely shared, which in the long run may have serious consequences for Open Science and the functioning of research infrastructures such as CLARIN.
Processing personal data without the consent of the data subject for the development and use of language resources.
Aleksei Kelli, Krister Lindén, Kadri Vider, Pawel Kamocki, Ramūnas Birštonas, Silvia Calamai, Chiara Kolletzek, Penny Labropoulou and Maria Gavrilidou.
The development and use of language resources often involve the processing of personal data. The General Data Protection Regulation (GDPR) establishes an EU-wide framework for the processing of personal data for research purposes while at the same time it allows for some flexibility on the part of the Member States. The paper discusses the legal framework for language research following the entry into force of the GDPR. To this goal, we first present some fundamental concepts of data protection relevant for language research and then focus on the models that certain EU member states use to regulate data processing for research purposes.
Toward a CLARIN Data Protection Code of Conduct.
Pawel Kamocki, Erik Ketzan, Julia Wildgans and Andreas Witt.
This abstract discusses the possibility of adopting a CLARIN Data Protection Code of Conduct pursuant to Art. 40 of the General Data Protection Regulation. Such a code of conduct would have important benefits for the entire language research community. The final section of this abstract proposes a roadmap to the CLARIN Data Protection Code of Conduct, listing various stages of its drafting and approval procedures.
Parallel session 3
From Language Learning Platform to Infrastructure for Research on Language Learning.
David Alfter, Lars Borin, Ildikó Pilán, Therese Lindström Tiedemann and Elena Volodina.
Lärka is an Intelligent Computer-Assisted Language Learning (ICALL) platform developed at Språkbanken as a flexible and valuable source of additional learning material (e.g. via corpus-based exercises) and a support tool for both teachers and L2 learners of Swedish and students of (Swedish) linguistics. Lärka is now being adapted into a central building block in an emerging second language research infrastructure within the larger context of the text-based research infrastructure developed by the national Swedish Language bank, Språkbanken, and SWE-CLARIN.
Bulgarian Language Technology for Digital Humanities: a focus on the Culture of Giving for Education.
Kiril Simov and Petya Osenova.
The paper presents the main language technology components that are necessary for supporting investigations within the digital humanities, with a focus on the culture of giving for education. This domain is socially significant and covers various historical periods. It also takes into consideration the social position of the givers, their gender and the type of the giving act (last posthumous will or financial support in one’s lifetime). The survey describes the adaptation of the tools to the task as well as the various ways of improving the targeted extraction from the specially designed corpus of texts related to giving. The main challenge was the language variety caused by the long time span of the texts (80-100 years). We provided two initial instruments for targeted information extraction: statistics with ranked word occurrences and content analysis. Even at this preliminary stage the provided technology proved to be very useful for our colleagues in sociology, cultural and educational studies.
Multilayer Corpus and Toolchain for Full-Stack NLU in Latvian.
Normunds Grūzītis and Artūrs Znotiņš.
We present work in progress to create a multilayer text corpus for Latvian. The broad application area we address is natural language understanding (NLU), and the aim of the corpus creation is to develop a data-driven toolchain for NLU in Latvian. Both the multilayered corpus and the downstream applications are anchored in cross-lingual state-of-the-art representations: Universal Dependencies (UD), FrameNet, PropBank and Abstract Meaning Representation (AMR). The corpus and toolchain also include named entity and coreference annotation required by AMR. The data set and the toolchain are to be added to the CLARIN infrastructure for both HSS research and development of cross-lingual NLP applications.
(Re-)Constructing “public debates” with CLARIAH MediaSuite tools in print and audiovisual media.
Berrie van der Molen, Jasmijn van Gorp and Toine Pieters.
This paper focuses on the proceedings of CLARIAH research pilot Debate Research Across Media (DReAM) by reflecting on the conceptualization of public debates used in it. In the pilot, heterogeneous datasets (of digitized print and audiovisual media) are searched with the levelled research approach (combining distant and close reading techniques) to do historical public debate analysis with tools of the CLARIAH MediaSuite. The qualitative research interest in public debates is fundamentally historical, but in order to bridge the gap between distant and close reading of the combined digital datasets a number of insights from media studies are reflected upon. The natures of the different media and digitization processes, the type of analysis and focus on the source material itself, and the necessity to combine historical expertise with a sensibility towards discursive relations are all taken into consideration before we argue that using this approach in the MediaSuite can help the researcher to gain an improved understanding of historical public debates.
Improving Access to Time-Based Media through Crowdsourcing and CL Tools: WGBH Educational Foundation and the American Archive of Public Broadcasting.
Karen Cariani and Casey Davis-Kaufman.
In this paper, we describe the challenges facing many libraries and archives trying to provide better access to their media collections through online discoverability. We present the initial results of a project that combines technological and social approaches for metadata creation by leveraging scalable computation and engaging the public, the end users, to improve access through crowdsourcing games and tools for time-based media. The larger need is for more accurate output and ease of use of computational tools for audiovisual archives to create descriptive metadata and annotations. As leaders in preservation, access, and analysis of culturally significant audiovisual material, WGBH is continually confronted with the need to enhance the descriptive data to improve discoverability for large-scale digital indexing and analysis of media collections.
Parallel session 4
Discovering software resources in CLARIN.
We present a profile for the description of software that enables discovery of the software and formal documentation of aspects of the software, and a proposal for faceted search in metadata for software. We have tested the profile by making metadata for over 70 pieces of software. The profile forms an excellent basis for formally describing properties of the software, and for a faceted search dedicated to software which enables better discoverability of software in the CLARIN infrastructure.
Towards a protocol for the curation and dissemination of vulnerable people archives.
Silvia Calamai, Chiara Kolletzek and Aleksei Kelli.
This paper reflects on the possibility of defining a protocol for the curation and dissemination of speech archives, which appear to have – de jure – the highest restrictions on their curation and dissemination. The case study is offered by the discovery of the Anna Maria Bruzzone archive, containing the voices of people with mental disabilities recorded in 1977 in a psychiatric hospital.
Versioning with Persistent Identifiers.
Martin Matthiesen and Ute Dieckmann.
We present the update process of a dataset using persistent identifiers (PIDs). The dataset is available in two different variants, for download and via an online web interface. During the update process we had to fundamentally rethink how we wanted to use PIDs and version numbering. We will also reflect on how to effectively use assignment in case of minor changes in the large dataset. We discuss the roles of different types of PIDs, the role of metadata and special landing pages.
Interoperability of Second Language Resources and Tools.
Elena Volodina, Maarten Janssen, Therese Lindström Tiedemann, Nives Mikelic Preradovic, Silje Karin Ragnhildstveit, Kari Tenfjord and Koenraad de Smedt.
Language learning based on learner corpora is an increasingly active area of research in CLARIN centres and beyond. In order to promote comparative research, the interoperability of data and tools in this area must be improved, and metadata and error annotation should be harmonized. A closer European collaboration in the field of learner corpus creation is desirable.
Tweak Your CMDI Forms to the Max.
Rob Zeeman and Menzo Windhouwer.
Metadata records created and provided via the Component Metadata Infrastructure (CMDI) can be of high quality due to the possibility to create a metadata profile tailored for a specific resource type. However, this flexibility comes with a cost: it's harder to create a metadata editor that can cope well with this diversity. In the Dutch CLARIAH project the aim is to create a user-friendly CMDI editor, which is able to deal with arbitrary profiles and can be embedded in the environments of the various partners. Already a few CMDI editors have been created, e.g., Arbil [Withers 2012], CMDI-Maker [CLASS 2018] and COMEDI [Lyse et al 2015]. Of these Arbil is not supported anymore and CMDI-Maker only supports a limited number of profiles. COMEDI can handle arbitrary CMDI profiles, but it comes with its own dedicated environment and stays very close to the profile, which makes certain technical limitations of CMDI still leak into the end user’s experience. An example is the lack of multilingual labels for elements in the CMDI profile specifications. In this abstract CLARIAH’s CMDI Forms (CCF; [KNAW HuC DI 2018a]) is introduced. It supports CMDI 1.2 and can handle any CMDI profile, but also allows various tweaks (usually small adjustments) to enhance usability. CMDI Forms can also be embedded, by a set of plugins, into a specific environment. The next sections will describe these features in more depth.
CLARIN Data Management Activities in the PARTHENOS Context.
Marnix van Berchum and Thorsten Trippel.
Data Management is one of the core activities of all CLARIN centres providing data and services for academia. In PARTHENOS, European initiatives and projects in the area of the humanities and social sciences assembled to compare policies and procedures. One of the areas of interest is data management. The data management landscape shows a lot of proliferation, for which an abstraction level is introduced to help centres, such as CLARIN centres, in the process of providing the best possible services to users with data management needs.
Integrating language resources in two OCR engines to improve processing of historical Swedish text.
Dana Dannélls and Leif-Jöran Olsson.
We aim to address the difficulties that many History and Social Sciences researchers face when bringing non-digitized text into language analysis workflows. In this paper we present the language resources and material we used for training two Optical Character Recognition engines for processing historical Swedish text written in Fraktur (blackletter). The trained models, resources and dictionaries are freely available and accessible through our web service, hosted at Språkbanken, giving users and developers easy access for extracting historical Swedish text that is only available as images, for further processing.
Looking for hidden speech archives in Italian institutions.
Vincenzo Galatà and Silvia Calamai.
The aims and the main results of an on-line survey concerning speech archives collected in the fields of Social Sciences and Humanities among Italian scholars are presented and discussed. A large number of speech archives are held by individual researchers: most of the resources are not accessible and legal issues are generally not addressed in depth. The great majority of the respondents would agree to store their archives in national repositories, if any were available.
Setting up the PORTULAN / CLARIN centre.
Luís Gomes, Frederico Apolónia, Ruben Branco, João Silva and António Branco.
This paper shares the lessons learned in setting up a CLARIN repository based on the META-SHARE software, which we have just used to develop the PORTULAN / CLARIN centre. This paper documents the changes and extensions to META-SHARE that were needed to fulfil the CLARIN requirements for becoming a B-type centre. The main purpose of this paper is to serve as a one-stop guide for teams pondering or having decided to adopt META-SHARE software for setting up their own CLARIN repositories in the future.
LaMachine: A meta-distribution for NLP software.
Maarten van Gompel and Iris Hendrickx.
We introduce LaMachine, a unified Natural Language Processing (NLP) open-source software distribution to facilitate the installation and deployment of a large number of software projects that have been developed in the scope of the CLARIN-NL project and its current successor CLARIAH. Special attention is paid to the encouragement of good software development practices and the reuse of established infrastructure in the scientific and open-source software development community. We illustrate the usage of LaMachine in an exploratory text mining project at the Dutch Health Inspectorate, where LaMachine was applied to create a research environment for automatic text analysis for health care quality monitoring.
XML-TEI-URS: using a TEI format for annotated linguistic resources.
Loïc Grobol, Frédéric Landragin and Serge Heiden.
This paper discusses XML-TEI-URS, a recently introduced TEI-compliant XML format for the annotation of referential phenomena in arbitrary corpora. We describe our experiments on using this format in different contexts, assess its perceived strengths and weaknesses, compare it with other similar efforts and suggest improvements to ease its use as a standard for the distribution of interoperable annotated linguistic resources.
Visible Vowels: a Tool for the Visualization of Vowel Variation.
Wilbert Heeringa and Hans Van de Velde.
Visible Vowels is a web app for the analysis and visualization of acoustic vowel measurements: f0, formants and duration. The app is a useful instrument for research in linguistics. The app combines user friendliness with maximum functionality and flexibility, using a live plot view.
ELEXIS - European lexicographic infrastructure.
Milos Jakubicek, Iztok Kosem, Simon Krek, Sussi Olsen and Bolette Sandford Pedersen.
This paper describes the establishment of the ELEXIS lexicographic infrastructure, a research infrastructure financed by the European Union through the H2020 funding scheme. We present the project as a whole in terms of its target audience and stakeholders as well as its key parts. We outline the components of the infrastructure, both those already implemented and those to be implemented in the course of the project (2018-2022). Close collaboration with CLARIN is supported by the Integration and Sustainability Committee.
Sustaining the Southern Dutch Dialects: the Dictionary of the Southern Dutch Dialects (DSDD) as a case study for CLARIN and DARIAH.
Jacques Van Keymeulen, Sally Chambers, Veronique De Tier, Jesse de Does, Katrien Depuydt, Tanneke Schoonheim, Roxane Vandenberghe and Lien Hellebaut.
In this paper, we report on an ongoing project, the Dictionary of the Southern Dutch Dialects (DSDD), funded by the Research Foundation Flanders (FWO). The DSDD is based on three dictionaries of the Flemish, Brabantic and Limburgian dialects. The project aims to aggregate and standardise the three comprehensive dialect lexicographic databases into one integrated dataset. The project, which started in January 2017, is organised in three phases: i) design and preparation, ii) implementation and iii) exploitation. The Ghent University DSDD team (Department of Dutch Linguistics/Ghent Centre for Digital Humanities) works closely together with the Dutch Language Institute (INT) who are responsible for the technical development and sustainability of the DSDD linguistic data infrastructure. During the project period (2017-2020), 3-4 research use cases will be developed to test the applicability of the newly aggregated DSDD for digital scholarship. At a later stage, the DSDD database can be linked with other dialect data in Belgium and the Netherlands. Within Flanders, work is underway to strengthen collaboration between DARIAH and CLARIN. As the DSDD is already working with both infrastructures, the DSDD is in a unique situation to benefit from CLARIAH.
SweCLARIN – Infrastructure for Processing Transcribed Speech.
Dimitrios Kokkinakis, Kristina Lundholm Fors and Charalambos Themistokleous.
In this paper we describe the spoken language resources (including transcriptions) under development within the project “Linguistic and extra-linguistic parameters for early detection of cognitive impairment”. The focus of the present paper is on the resources that are being produced and the way in which these could be used to pursue innovative research in dementia prediction, an area in which more scientific investigations are required in order to develop additional predictive value and improve early diagnosis and therapy. The language resources need to be thoroughly annotated and analyzed using state-of-the-art language technology tools, and for that purpose we apply Sparv, a corpus annotation pipeline infrastructure which is part of the Swe-CLARIN toolbox. Sparv offers state-of-the-art language technology as an e-research tool for analyzing and processing various types of Swedish corpora. We also highlight some of the difficulties in working with speech data and suggest ways to mitigate these.
TalkBankDB: A Comprehensive Data Analysis Interface to TalkBank.
John Kowalski and Brian MacWhinney.
TalkBank, a CLARIN B Centre, is the host for a collection of multilingual multimodal corpora designed to foster fundamental research in the study of human communication. It contains tens of thousands of audio and video recordings across many languages linked to richly annotated transcriptions, all in the CHAT transcription format. The purpose of the TalkBankDB project is to provide an intuitive on-line interface for researchers to explore TalkBank's media and transcripts, specify data to be extracted, and pass these data on to statistical programs for further analysis.
L2 learner corpus survey – Towards improved verifiability, reproducibility and inspiration in learner corpus research.
Therese Lindström Tiedemann, Jakob Lenardič and Darja Fišer.
We present a survey of the second language learner corpora available within CLARIN. The survey provides a test of the ease of finding these corpora through the VLO and of the extent of the metadata and documentation which users have included. Based on this we suggest some ways of improving the usefulness of the VLO and making more linguists aware of what CLARIN provides. Furthermore, we suggest that in addition to collecting data and metadata, a bibliographical database of research using and documenting work on second language learner corpora should be collaboratively maintained.
DGT-UD: a Parallel 23-language Parsebank.
Nikola Ljubešić and Tomaž Erjavec.
We present DGT-UD, a 2 billion word 23-language parallel parsebank, comprising the JRC DGT parallel corpus of European law parsed with UD-Pipe. The paper introduces the JRC DGT corpus, details its annotation with UD-Pipe and discusses its format under the two CLARIN.SI web-based concordancers and its repository. An analysis is presented that showcases the utility of the corpus for comparative multilingual research. The corpus is meant as a shareable CLARIN resource, useful for translators, service providers, and developers of language technology tools.
DI-ÖSS - Building a digital infrastructure in South Tyrol.
Verena Lyding, Alexander König and Elisa Gorgaini.
This paper presents the DI-ÖSS project, a local digital infrastructure initiative for South Tyrol, which aims at connecting institutions and organizations that are working with language data. It shall serve to facilitate and increase data exchange, joint efforts in processing and exploiting data and the overall increase of synergies, and thus links to big European infrastructure initiatives. However, while sharing the overall objectives to foster standardization and increase efficiency and sustainability, on the implementation level a local initiative faces a different set of challenges. It aims to involve institutions which are less familiar with the logic of infrastructure and have less experience and fewer resources to deal with technical matters in a systematic way. The paper will describe how DI-ÖSS addresses the needs for digital language infrastructure on a local level; lay out the course of action; and depict the targeted mid- and long-term outputs of the project.
Linked Open Data and the Enrichment of Digital Editions: the Contribution of CLARIN to the Digital Classics.
Monica Monachini, Francesca Frontini, Anika Nicolosi and Fahad Khan.
Semantic Web technologies allow scholars in the humanities to make links and connections between the multitude of digitised cultural artifacts which are now available on the World Wide Web, thus facilitating new scientific discoveries and opening up new avenues of research. Semantic Web and Linked Data technologies, by their very nature, are complex and require the adoption of a sustainable, long-term approach that takes research infrastructures like CLARIN into consideration. In this paper, we present the case study of a project (DEA) on an augmented digital edition of fragmentary Ancient Greek texts using Linked Data; this will highlight a number of the core issues that working in the Digital Classics brings up. We will discuss these issues and touch on the role CLARIN can play in the overall linked data lifecycle, in particular for humanities datasets.
How to use DameSRL: A framework for deep multilingual semantic role labeling.
Quynh Ngoc Thi Do, Artuur Leeuwenberg, Geert Heyman and Marie-Francine Moens.
This paper presents DAMESRL, a flexible and open source framework for deep multilingual semantic role labeling. It provides flexibility in its model construction in terms of word representation, sequence representation, output modeling, and inference styles and comes with clear output visualization. The framework is available under the Apache 2.0 license.
Speech Recognition and Scholarly Research: Usability and Sustainability.
For years we have been working on speech recognition (ASR) as a tool for scholarly research. The current state-of-the-art can be useful for many scholarly use cases focusing on audiovisual content, but practically applying ASR is often not so straightforward. In the CLARIAH Media Suite, a secured online portal for scholarly research for audiovisual media, we solved the most important hurdles for the practical deployment of ASR by focusing on usability and sustainability aspects.
Towards TICCLAT, the next level in Text-Induced Corpus Correction.
Martin Reynaert, Maarten van Gompel, Ko van der Sloot and Antal van den Bosch.
We give an update of the state-of-affairs of the tools we have gradually been developing for the Dutch CLARIN infrastructure over the past 10 years. We first focus on our OCR post-correction system TICCL, next describe its wider environment, the corpus building work flow PICCL, and then sketch the various guises in which these are made available to the broad research community.
SenSALDO: a Swedish Sentiment Lexicon for the SWE-CLARIN Toolbox.
Jacobo Rouces, Lars Borin, Nina Tahmasebi and Stian Rødven Eide.
The field of sentiment analysis or opinion mining consists in automatically classifying text according to the positive or negative sentiment expressed in it, and has become very popular in the last decade. However, most data and software resources are built for English and a few other languages. In this paper we describe the creation of SenSALDO, a comprehensive sentiment lexicon for Swedish, which is now freely available as a research tool in the SWE-CLARIN toolbox under an open-source CC-BY license.
Error Coding of Second-Language Learner Texts Based on Mostly Automatic Alignment of Parallel Corpora.
Dan Rosén, Mats Wirén and Elena Volodina.
Error coding of second-language learner text, that is, detecting, correcting and annotating errors, is a cumbersome task which in turn requires interpretation of the text to decide what the errors are. This paper describes a system with which the annotator corrects the learner text by editing it prior to the actual error annotation. During the editing, the system automatically generates a parallel corpus of the learner and corrected texts. Based on this, the work of the annotator consists of three independent tasks that are otherwise often conflated: correcting the learner text, repairing inconsistent alignments, and performing the actual error annotation.
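As an illustration of how editing can yield a parallel corpus, a token-level alignment between a learner sentence and its corrected version can be sketched with Python's standard `difflib` module. This is only a minimal sketch assuming whitespace tokenization; the function name `align_tokens` and the example sentences are hypothetical, and the code does not reflect the alignment algorithm of the system described above.

```python
from difflib import SequenceMatcher


def align_tokens(learner, corrected):
    """Align two whitespace-tokenized sentences.

    Returns a list of (operation, learner_tokens, corrected_tokens) triples,
    where operation is one of 'equal', 'replace', 'delete' or 'insert'.
    """
    src = learner.split()
    tgt = corrected.split()
    # autojunk=False keeps frequent tokens from being ignored as "junk"
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    return [
        (op, src[i1:i2], tgt[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


if __name__ == "__main__":
    # Invented example: each printed triple is one link of the parallel corpus
    for link in align_tokens("I has two cat", "I have two cats"):
        print(link)
```

Links labelled `replace`, `delete` or `insert` are the places where an annotator would then attach an error label, which keeps correction, alignment repair and annotation as separate steps.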
Using Apache Spark on Hadoop Clusters as Backend for WebLicht Processing Pipelines.
Soheila Sahami, Thomas Eckart and Gerhard Heyer.
Modern annotation tools and pipelines that support automatic text annotation and processing have become indispensable for many linguistic and NLP-driven applications. To simplify their active use and to relieve users from complex configuration tasks, platforms based on service-oriented architectures (SOAs), like the CLARIN-D WebLicht, have emerged. However, in many cases the current state of participating endpoints does not allow processing of “big data”-sized text material or the execution of many user tasks in parallel. A potential solution is the use of distributed computing frameworks as a backend for SOAs. Those systems and their corresponding software architecture already support many of the features relevant for processing big data for large user groups. This submission gives an example of a specific implementation based on Apache Spark and outlines potential consequences for improved processing pipelines in federated research infrastructures.
UWebASR – Web-based ASR engine for Czech and Slovak.
Jan Švec, Martin Bulín, Aleš Pražák and Pavel Ircing.
The paper introduces a beta-version of a user-friendly Web-based ASR engine for Czech and Slovak that enables users without a background in speech technology to have their audio recordings automatically transcribed. The transcripts are stored in a structured XML format that allows efficient manual post-processing.
Pictograph Translation Technologies for People with Limited Literacy.
Vincent Vandeghinste, Leen Sevens and Ineke Schuurman.
We present a set of Pictograph Translation Technologies, which automatically translate natural language text into pictographs, as well as pictograph sequences into natural language text. These translation technologies are combined with sentence simplification and an advanced spelling correction mechanism. The goal of these technologies is to enable people with a low level of literacy in a certain language to have access to information available in that language, and to allow these people to participate in online social life by writing natural language messages through pictographic input. The technologies and demonstration system will be added to the CLARIN infrastructure at the Dutch Language Institute in the course of this year, and have been presented on Tour De CLARIN.
PhD-students poster session
Automatic genre identification with machine learning methods.
Genre identification is an important task in natural language processing which can be useful for many practical and research purposes, a prime example being the creation of genre-specific (sub)corpora. However, this task is extremely hard because genre is not a homogeneous and unequivocal property of texts, and in many cases it is barely separable from the topic. In this research we compare the performance of two different automatic genre identification methods and a very simple lexical baseline method. We classified six text types: literary, academic, legal, press, spoken and personal. In one part of our research we ran experiments with traditional machine learning methods using linguistic, n-gram and error features. In the other part we tested the same task with a word-embedding-based neural network, experimenting with different training data (words only, POS tags only, words and POS tags, etc.). Our results revealed that a neural network is a suitable method for this task, while traditional machine learning showed significantly lower performance. We achieved high (around 70%) accuracy with the word-embedding-based method. The results also differed across text categories, which is related to the stylistic properties of the studied genres. Our experiments provided other interesting findings as well. The word embedding measurements revealed that using POS tags only can be more effective than expected. This suggests that genres have specific structural characteristics which allow them to be identified without lexical or topic-related features.
Improving OCR of historical newspapers and journals published in Finland by adding Swedish training data.
Optical character recognition (OCR) of Finnish historical newspapers and journals published in Finland between 1771 and 1920 still yields insufficiently good results. The online collection digitized and published by the National Library of Finland (digi.kansalliskirjasto.fi) contains over 11 million pages, mostly in Finnish and Swedish, of which approximately 5.11 million are freely available (Kettunen and Koistinen, 2018). Good quality OCR is essential to make this collection useful for harvesting and research.
Although optical character recognition of printed text has reached high accuracy rates for modern fonts, historical documents still pose a challenge for character recognition. Some of the reasons are that fonts differ between materials, there is no orthographic standard (the same words are spelled differently) and the material quality is sometimes poor.
In our previous work (Drobac et al., 2017), we trained OCR models with the open source program Ocropus (Breuel, 2008) on two Finnish fraktur datasets (DIGI and Natlib), and after post-processing we obtained character accuracy rates of 93.27% and 95.21% respectively. Although this result was a huge improvement over the previous 90.16% achieved with the commercial software ABBYY FineReader 11, the trained models performed poorly on Swedish text, with an accuracy of only 81.39%.
This result is not surprising because the models were trained specifically to recognize Finnish fraktur. However, poor results for the Swedish text, which is widely represented in the collection, are not satisfactory.
In this work, we add a small amount of Swedish training data to our Finnish data sets and train the models using the Ocropus software. All three data sets consist of manually transcribed lines of text. The DIGI data set contains approximately 12,000 random lines of Finnish, mostly fraktur text (a small amount of text is written in Antiqua). The Natlib set has about 54,000 lines of Finnish fraktur text and the Swedish set consists of around 4,000 lines of Swedish text in both fonts. DIGI and Swedish lines were picked randomly from the entire corpus, while Natlib lines were extracted from 255 random pages of text. The results show that even a small amount of additional Swedish data gives a big improvement on the Swedish test set (88.25%, +6.86%), but interestingly also on the Finnish test set. With a character accuracy rate of 95.31% on the DIGI test set, the result improved significantly (+2.04%) even without post-processing, whereas 94.62% on the Natlib test set (-0.21%) is close to the previous Natlib results before post-processing. While improvement on the Swedish test set was expected, it was not obvious that the Finnish test results would increase. It seems that the DIGI set is small enough that adding more versatile font information to the training data (Swedish texts are more often written in Antiqua) benefits the Finnish model as well. Additionally, Finnish texts contain Swedish names and places, which might be better recognized after the addition of Swedish information. However, the Natlib training set is too big and too specific to be largely influenced by our small addition of extra data.
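For readers who want to compute figures of this kind, character accuracy is commonly derived from the Levenshtein edit distance between the OCR output and the manually transcribed line. The sketch below assumes that definition (one minus the edit distance divided by the reference length); the abstract does not state the exact formula used, and the function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance (insertion, deletion, substitution each cost 1)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Character accuracy in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference line must not be empty")
    return 100.0 * (1 - edit_distance(reference, hypothesis) / len(reference))
```

Under this definition a four-character line with one misrecognized character scores 75%, and a corpus-level rate would typically divide the summed edit distances by the summed reference lengths rather than average per-line scores.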
Utilising Large Quantities of Found Audio Data.
Today national archives contain what might seem to be an unmanageable quantity of data. In Sweden the largest archives, ISOF (Institute for Language and Folklore) and the KB (National Library of Sweden), hold 13,000 hours and 7 million hours of audio or audio-visual recordings respectively. Data of this sort is often referred to as found data, i.e. data that was not recorded with the specific purpose of being used in research. Such data is in many cases more valuable than data collected in a controlled setting, as it provides higher ecological validity. We propose an explorative approach with the aim of utilising large amounts of audio data. Our approach is based on a concept we have named Temporally Disassembled Audio (TDA). At this stage, we have experimented with many different types of audio, mostly focusing on speech, and the results are promising.
Dramatic Languages: Foreign Languages in the Writings of Peter Handke.
Since the early 1980s, foreign language elements have gained importance and become more and more prominent in the works of the Austrian writer Peter Handke (*1942), especially in his texts for the stage. At the same time, Handke started his work as a translator from four other languages to German, all of which have influenced and subsequently been incorporated in his “own” writing, along with around ten further languages that Handke has since included more or less frequently in his (stage) texts. Additionally, he has written several works for the stage in French. This poster will introduce my PhD project, which aims to analyze the poetic use and function of foreign language elements in Handke’s dramatic oeuvre. To this end, a relational database collecting and interlinking several types of entities (context quotes, foreign language parts of quotes, individual words, foreign and German lemmas) was built and all relevant quotes from the stage texts have been collected. In addition to building the database, collecting the data and analyzing it in my PhD thesis, I am providing the collected material openly online in the web app “Handke: in Zungen” (https://handkeinzungen.acdh.oeaw.ac.at/), which has an API and thus enables reuse of the data.
Instrument of parliamentary discourse analysis – Saeima debate corpus.
The Latvian Saeima debate corpus (Department of Communication Studies of the RSU and the Artificial Intelligence Laboratory of the Institute of Mathematics and Computer Science of the LU) includes 13 million words that members of parliament (MPs), ministers and other participants in Saeima sessions have said on the podium since 1993. The instrument for the analysis of this corpus was created on the basis of transcripts collected from parliamentary sessions. It offers a convenient set of tools that can be used for parliamentary discourse research (addressing individual sessions, parliamentary terms, parties, personalities, and speeches by age or gender).
Texts are automatically tokenized, lemmatized, morphologically analyzed and tagged using a CMM-based tagger. Syntactic dependencies are inferred by a neural transition-based dependency parser trained on the Latvian Universal Dependencies corpus version 2.1. The research workspace is aimed at obtaining concordances and identifying commonly used word combinations. The program also shows the neighbouring words in the context where a particular word is used, which is essential for defining word connections and typical word combinations. The possibilities of the Saeima debate corpus are examined in these four research works (with application of critical discourse analysis):
1. Physicians in the Latvian parliament use the same discursive structure of language that is used for communication with a patient.
2. Politicians resist the initiative of environmental activists to abandon the use of wildlife in circuses. They argue their view in the media, referring to the national tradition, but in the analysed parliamentary speeches, the word "circus" is used with contempt.
3. In the parliamentary discourse the concept of state is vague. However, it is often compared to parents, a mother, or an owner, attributing a paternal function to it.
4. In parliamentary speeches during a single term, it was established that the word "sick" is rarely used in its common meaning; instead, it is used with an offensive intonation.
Automatization of Detection of Information-Dense Texts.
The notion of information is only formal here, i.e. it is defined as semantic, pragmatic, and only measurable in relative terms. A definition of information density is elaborated involving informativity (a relative measure of semantic and pragmatic information) per clause (following [1]). In computational linguistics, one of the most common characteristics employed to detect information-dense texts is lexical density. This presentation provides a part of the methodology proposed for automatic analysis of information density based on the lexical and syntactic levels of language. The methodology was developed by analysing two corpora compiled from the texts of two academic genres: research papers (3 479 442 running words) and their abstracts (85 616 running words). Both corpora will be available in the CLARIN-LT Repository [2]. The first stage of the research was based on the analysis of lexical structures of research papers and their abstracts, namely type/token ratio and lexical density, in each corpus separately. This analysis suggested that lexical density depends strongly on the form of texts. In other words, the lexical level of a text is related to the syntactic one, i.e. how words behave within a text, how they are connected with each other in a sentence, etc. Thus the first stage of the research suggested a further direction for the analysis: while information density is seen as too complex to measure globally, a study of both lexical and syntactic features allows a comparison of information density between different texts or different text genres.
[1] Mills, C. R. Information density in French and Dagara folktales: a corpus-based analysis of linguistic marking and cognitive processing. Doctoral thesis. Queen’s University Belfast (2014).
[2] CLARIN-LT Repository, https://clarin.vdu.lt/xmlui/.
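The two lexical measures used in the first stage, type/token ratio and lexical density, can be illustrated with a short sketch. It assumes tokenized, POS-tagged input and treats nouns, verbs, adjectives and adverbs as content words (using Universal Dependencies tag names); the abstract does not specify the tag set or the exact definitions used, so these choices are assumptions.

```python
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}  # assumed content-word classes


def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of running words."""
    if not tokens:
        raise ValueError("empty token list")
    return len({token.lower() for token in tokens}) / len(tokens)


def lexical_density(tagged_tokens):
    """Share of content words among all running words."""
    if not tagged_tokens:
        raise ValueError("empty token list")
    content = sum(1 for _, tag in tagged_tokens if tag in CONTENT_TAGS)
    return content / len(tagged_tokens)


if __name__ == "__main__":
    # Invented toy sentence with hand-assigned tags
    tagged = [("the", "DET"), ("cat", "NOUN"), ("saw", "VERB"),
              ("the", "DET"), ("dog", "NOUN")]
    print(type_token_ratio([word for word, _ in tagged]))
    print(lexical_density(tagged))
```

Because type/token ratio falls as texts get longer, comparing corpora of very different sizes (such as papers and abstracts) would normally require a length-normalized variant.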
Revealing relationships between Old Norse texts using computer-assisted methods.
Katarzyna Anna Kapitan.
The poster discusses the possibilities offered by the application of phylogenetic analysis in textual scholarship. It has long been recognized in the literature that there are clear similarities between the theoretical assumptions of cladistics and stemmatics (Howe et al. 2004), and the application of computer-assisted methods originating from phylogenetics to the questions of textual criticism is a powerful tool for revealing manuscripts’ filiation. Recently, Hall and Parsons (2013) presented a method of data sampling and tree generation using the PHYLIP package, which has gained some followers in Old Norse philology (Zeevaert et al. 2013, Kapitan 2017). This method of data collection with the aim of revealing the relationships between manuscripts is, however, time-consuming. It requires a remarkable amount of manual collation and annotation of texts. This poster presents an attempt to automate this process. I suggest using XML-based transcriptions as the basis of analysis and using Python scripts to convert the data from XML-encoded text to a numeric input file for the PHYLIP package (Felsenstein). Data collected in this way allows us to approach in a systematic way the important question of textual criticism (and New Stemmatics) regarding the kind of textual variation that should be used for analysis. There are competing approaches to this question. Salemans (1996) suggested a strictly systematized classification of parsimony-informative variants, while Robinson (1996) suggested that all types of variants should be analyzed, including linguistic variants. By marking up variants in XML following the TEI P5 guidelines we can distinguish various types of variation and conduct experiments aimed at assessing the value or usefulness of different types of variation for textual analysis. Because we generate trees of relationships for each type of variant separately, we can compare the results and reproduce the experiments.
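The conversion step can be sketched as follows. This is not the author's script: it assumes a TEI P5 parallel-segmentation encoding in which each `<app>` holds `<rdg>` elements with `wit` attributes, codes each distinct reading at a variation point as a digit, and writes a PHYLIP-style discrete-character matrix with taxon names padded to ten characters. The sample XML, the sigla and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample: three witnesses (A, B, C) and two variation points
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>
<app><rdg wit="#A #B">konungr</rdg><rdg wit="#C">kongr</rdg></app>
<app><rdg wit="#A">ok</rdg><rdg wit="#B #C">en</rdg></app>
</p></body></text></TEI>"""


def tei_to_phylip(xml_text):
    """Convert <app>/<rdg wit="..."> variation points to a PHYLIP-style matrix.

    Assumes fewer than ten distinct readings per variation point, so that
    every state fits in a single digit.
    """
    root = ET.fromstring(xml_text)
    columns = []    # one dict per variation point: siglum -> state digit
    witnesses = []  # sigla in order of first appearance
    for app in root.findall(".//tei:app", TEI_NS):
        states = {}  # reading text -> state digit at this variation point
        column = {}
        for rdg in app.findall("tei:rdg", TEI_NS):
            text = "".join(rdg.itertext()).strip()
            state = states.setdefault(text, str(len(states)))
            for siglum in rdg.get("wit", "").split():
                siglum = siglum.lstrip("#")
                column[siglum] = state
                if siglum not in witnesses:
                    witnesses.append(siglum)
        columns.append(column)
    lines = [f"{len(witnesses)} {len(columns)}"]
    for siglum in witnesses:
        # "?" marks a witness with no reading at a variation point
        row = "".join(column.get(siglum, "?") for column in columns)
        lines.append(f"{siglum[:10]:<10}{row}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(tei_to_phylip(SAMPLE))
```

If each `<rdg>` also carried a `type` attribute naming the kind of variation, the same loop could be filtered by type to build one matrix per variant category, which is what comparing trees across variant types would require.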
Speech Analysis in the Clarin-PL Project.
The poster describes the evolution of the speech analysis services created in the Clarin-PL project. The key elements added within the last period include an automatic speech recognition system, adaptation to various specific domains and integration with the EMU SDMS (Speech Database Management System). Our main goal centers on users and their demands, so much of our work involves improving the user experience of the tools. The poster also includes plans for the future, which will include a major overhaul of the website to improve usability for researchers in the humanities and social sciences.
Corpus-driven conversational agents: tools and resources for multimodal dialogue systems development.
Maria Di Maro.
This paper presents how tools made available through CLARIN can be applied for research purposes in the development of corpus-driven conversational agents. The starting point is the description of a standard architecture for multimodal dialogue systems. For some of its parts, specific available tools are briefly described, owing to their suitability for developing those parts.
Archival Dynamics: the Langues de France project and the building of the Judeo-Spanish Oral Archive (JSOA).
Until 2005, publicly accessible recordings in Judeo-Spanish were scarce, if not nonexistent. This is no longer the case, thanks to the “Judeo-Spanish Oral Archive” (JSOA) project, part of Langues de France, a CNRS programme initiated and supported by the French Ministry of Culture’s DGLFLF (Délégation générale à la langue française et aux langues de France) section. The programme ran between 2005 and 2009 and aimed at building oral corpora for endangered languages spoken in France. To do so, the CNRS (ADONIS and Huma-Num structures) supported digitisation of old recordings, as well as new fieldwork to account for the present state of these languages. More importantly, it provided researchers with a corpus-building methodology based on interoperable metadata, and helped them build freely and publicly available databases, in compliance with the OAI principles. The JSOA project participated in this process, going well beyond the requirements of the programme, as it is currently enlarging its scope from a linguistic documentation project to a multidisciplinary Oral History project, interested in Sephardic communities not only in France but worldwide. By October 2018, more than 150 additional hours of speech, thoroughly documented according to the CNRS’s standards of data management, should be published on the Collection de Corpus Oraux Numériques (CoCoON) platform, which hosts the Judeo-Spanish collection. The first aim of this presentation is to depict how a single-field, medium-scale initiative has triggered dynamics for an ambitious language and culture documentation project, and what results these dynamics have yielded to this day. The second and main aim is to show how appropriate metadata building, partly based on controlled vocabularies, can help the use of the JSOA archive in fields as diverse as anthropology, linguistics, musicology or history.
Modeling Lexical Knowledge for Natural Language Processing.
The poster/presentation will report results from a PhD dissertation project focused on representing lexical knowledge and leveraging such models for the purposes of natural language processing. It will outline several methods for modeling the lexicon with respect to the WordNet dictionary: in terms of knowledge graphs and in terms of distributed representations of words and senses. The models are obtained through various methods for enriching the WordNet knowledge graph and training embeddings on the constructed semantic networks. The generated knowledge graphs can be used to carry out knowledge-based word sense disambiguation; experimental work shows that accuracy scores are improved significantly with the addition of relations based on syntactic and formal semantic representations. The word and sense embedding models in turn are shown to improve results on the word similarity and relatedness tasks and can serve as input sources to supervised NLP systems. Two deep learning systems for performing word sense disambiguation and context embedding are implemented and evaluated; one of them depends crucially on the distributed representations generated on the basis of the knowledge graph. All created knowledge resources and tools are discussed in terms of their possible integration within the CLARIN infrastructure to be developed by the Bulgarian national consortium.
Completing the BLARK for Portuguese with finely-tuned Distributional Semantic Models.
Distributional semantics models (DSMs) have become a key asset in any basic language resources kit for any language. On a par with POS taggers or parsers, among others, they are instrumental in improving the performance of a wide range of applications and natural language processing tasks. A DSM is a computational representation of a given vocabulary. For each word of the vocabulary, a corresponding vector encodes the syntactic and semantic information obtained from the frequency of contexts in which each word occurs with neighboring words in a given text. DSMs allow the computation of semantic similarity between expressions, a useful property for a wide range of applications such as dialog systems, question answering, recommendation systems and sentiment analysis. We describe the development, tuning and public release of free DSMs for Portuguese. The training of the DSMs drew on a large corpus (2 billion tokens) that we collected and preprocessed from a wide range of domains. The evaluation used systematic fine-grained tasks and data sets (of the analogy, similarity, and categorization types) created in our group; these are comparable in size and domain to mainstream datasets in other languages. The DSMs were trained with the Skip-Gram model, a shallow neural network. Given a training text, the neural network learns the semantic space word by word by predicting the neighboring words, using as tunable parameters the size of the sliding window of neighboring words, the dimension of the semantic space vectors, the number of training epochs and the number of negative samples. We performed four main experiments aiming at the best overall accuracy. In the first experiment, we trained a DSM with the newly acquired data with baseline settings. In the second experiment, we trained with old data with an improved Skip-Gram implementation. In the third experiment, we trained with the former best sub-dataset joined with the newly acquired data. In the fourth experiment, we trained with all available data. With these experiments, we obtained improved state-of-the-art DSMs for Portuguese.
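To make the role of the sliding window and of vector similarity concrete, the sketch below generates the (target, context) pairs that a Skip-Gram model is trained to predict and computes the cosine similarity between two vectors. It is a simplified illustration in plain Python, not the training code used for the released models; the window size, the example sentence and the toy vectors are arbitrary.

```python
import math


def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs within a symmetric sliding window."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield target, tokens[j]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


if __name__ == "__main__":
    sentence = "o gato viu o rato".split()  # invented example sentence
    print(list(skipgram_pairs(sentence, window=1)))
    print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

A larger window yields more pairs per word and tends to capture topical rather than syntactic similarity, which is one reason the window size is among the parameters worth tuning.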
Studying the ins and outs of external possession with the CLARIN infrastructure.
In my PhD research the CLARIN infrastructure was indispensable, because it enabled me to study so-called external possession in Dutch dialects through the entire data life cycle.
External possessors appear outside of the possessum phrase syntactically, whereas semantically, they are interpreted as the possessor of one of their co-arguments. These patterns are for example very common in French and German. Some of them frequently occur in Dutch idioms, but are otherwise rare in standard Dutch. However, as data from previous dialect studies have shown (e.g. SAND), the structures do occur in dialects of Dutch. The previously existing data in the CLARIN infrastructure provided very important clues for my research with respect to interesting syntactic structures and relevant dialect areas. In addition, I used the environment of DynaSAND extensively to organize and visualize the data that I collected through an online written questionnaire. The system of ‘Kloeke codes’ enabled me to organize the linguistic phenomena I tested according to geographic areas. Subsequently, DynaSAND enabled me to put external possession on the Dutch/Flemish map, displaying the geographic distribution of these phenomena. The results show that external possessor constructions mostly occur in the eastern part of the Netherlands, along the German border. This fact is an important clue for my syntactic analysis. The CLARIN infrastructure contributed to this finding, because it made a systematic study into external possession possible in a large geographic area, while also giving the opportunity to visualize the results easily. Finally, the CLARIN-B status of the Meertens Institute proved to be useful for making my data available to others. I archived my data in their repository. The dataset is described with rich metadata, which contributes to the findability of this dataset and enables other researchers to interpret and reuse it.
Word Embeddings for Cross-Language Learning in Low-Resource Languages.
Dense-vector representations of words, commonly known as word embeddings, allow natural language processing applications to encode semantic information of words by mapping similar words to similar vectors. In recent years, word embeddings have been shown to improve performance in a variety of tasks. Another advantage of such vector representations is that they can potentially be language independent. This can be achieved by assigning words from different languages but with the same meaning to the same vector, thus mapping words from different languages into the same vector space. A variety of methods to achieve this have recently been proposed, mostly working by first constructing a separate embedding for each language and then aligning the embeddings into the same vector space. However, relatively little work has been done on discovering how such multilingual embeddings can be used to facilitate cross-lingual learning, which can be especially helpful for languages with a small amount of training data. In our work, we explore various practical applications of using multilingual embeddings in languages with a low amount of available training data, with a particular focus on South Slavic languages.
Automatic Collocation Identification Using Word Embeddings.
Collocations such as ’black coffee’ and ’French window’ are multiword expressions whose constituents show a high degree of statistical association and whose meaning is not entirely semantically transparent, i.e. one of the constituents carries a special meaning found only in this combination. Collocations are in the grey area between free phrases like ’black car’, where both of the components are fully transparent, and idiomatic expressions such as ’black sheep’, where all the constituents are opaque. Over the last decades, there has been a lot of research on the automatic identification of collocations. The majority of studies have concentrated on establishing statistical lexical association measures (AM) best suited for this purpose. The performance of different AMs depends on such factors as the language under investigation, the definition of a collocation, the size and the quality of the corpus, and the frequency threshold.
The present work aims at developing an efficient machine learning algorithm for automatic collocation identification using pre-trained word embeddings and combining them with AM scores. The experiments focus on German adjective-noun collocations using the dataset described in [Evert, 2008]. The database contains 1252 expressions annotated by professional lexicographers, where an expression was evaluated as a ’true collocation’ if it is useful for a bilingual dictionary: 520 true collocations and 732 non-collocations. The task at hand is the binary classification of the collocation candidates from the Evert dataset. In the previous experiments on the dataset in the MWE 2008 shared task [Pecina, 2008] the highest mean average precision (MAP) of 0.62 was achieved by the AM Piatersky-Shapiro coefficient followed by a machine learning algorithm where multiple AMs were used as features with the MAP of 0.61.
In the proposed model, the classical approach of using AM scores is combined with implementing dense word vectors (word embeddings) as features in the logistic regression classifier. The word embeddings used in the experiment had been trained on the raw text from the DECOW14ax corpus [Sch ̈afer and Bildhauer, 2012], the vocabulary contains one million words with corresponding 50-, 100-, 200-, and 300-dimensional vectors [Dima, 2015]. Each value of the vector is treated as a separate feature in the classifier as well as the four AM scores: minimum sensitivity, dice, chi-squared, and log- likelihood. The algorithm outperforms the previous models achieving the MAP of 0.74 in the classifier with 304 features (300 dimensions+4 AMs) in the 10-fold cross-validation setting and the MAP of 0.72 on the unseen test data.
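The feature layout and the evaluation measure described above can be sketched as follows. The code takes the 300-dimensional expression vector as given (the abstract does not say how the adjective and noun vectors are combined) and implements average precision for a single ranked candidate list; all function names and toy values are illustrative assumptions.

```python
def build_features(embedding, am_scores):
    """Concatenate embedding dimensions and association-measure scores."""
    return list(embedding) + list(am_scores)


def average_precision(ranked_labels):
    """Average precision of a ranked list of 0/1 labels (1 = true collocation).

    The list is assumed to be sorted by classifier score, best candidate first.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0


if __name__ == "__main__":
    # Toy values: a zero vector and four made-up AM scores
    features = build_features([0.0] * 300, [0.1, 0.2, 0.3, 0.4])
    print(len(features))
    print(average_precision([1, 0, 1, 0]))
```

A mean average precision would then average this value over several ranked lists, for example one per cross-validation fold.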
The experiments show that combining word embeddings with AMs yields high precision in the collocation identification task. Future work will expand on defining the lexical-semantic relations between the constituents of German adjective-noun collocations and integrate this information into GermaNet, a lexical-semantic network for German that is offered as part of the CLARIN infrastructure.
Using the Dutch Parallel Corpus to Calculate English-Dutch Word Translation Entropy.
Large corpora are frequently used as linguistic data that is representative of a language, or at least of a specifically defined subset of that language. When a corpus comprises texts that are available in two or more languages, it becomes more than just a collection of translations. In addition to the language-specific properties of each translation, the relationship between these translations, and especially between the different languages, surfaces as well. These so-called parallel corpora allow researchers to compare languages empirically and to make inferences about subjects of interest. The current study, situated within the PreDicT project (Predicting Difficulty in Translation), consults such a parallel corpus, namely the Dutch Parallel Corpus or DPC (Macken, De Clercq, & Paulussen, 2011), to calculate word translation entropy for the English-Dutch language pair. DPC is managed and distributed by the CLARIN B Centre Dutch Language Institute, and is available through CLARIN’s Virtual Language Observatory.
PreDicT aims to build a system that, given an input text in language x, and a target language y, can predict how difficult it would be to translate said text to language y. On top of that, the system would highlight segments of the source text that are difficult to translate. Upon completion, the system will be made available through a public repository and accessible through a web interface. The tool’s metadata will be recorded in CLARIN’s catalogs.
As a first step in creating this system, we completed a pilot study that correlated translation process data (duration, editing, and gaze features) with product data that, according to the literature, can indicate translation difficulty (number of errors, word translation entropy, syntactic equivalence) (Vanroy, De Clercq, & Macken, 2018). The dataset that we used was taken from Daems (2016) and consists of a variety of translation process and product data collected by CASMACAT (Alabau et al., 2013) and Inputlog (Leijten & Van Waes, 2013) and merged by post-processing scripts. In total, it contains data on 690 translated segments by 23 translators (13 professionals, 10 students). We found that all three product features indeed correlate with some process features, in particular with the number of times a translator has revised a segment’s translation, and with the duration of pauses relative to the segment’s total translation time.
In that pilot study, word translation entropy was calculated automatically as part of the process that converted the aforementioned dataset to the CRITT TPR-DB format (Carl & Schaeffer, 2018). To work out the entropy of a source word, the program looks up how it has been translated by all translators over all sentences. With a small translated corpus of 690 segments, entropy calculated in such a manner can be quite skewed and unrepresentative. In the current study, we use DPC to obtain more reliable entropy values. We use Moses (Koehn et al., 2007) to word-align the parallel corpus and retrieve the word translation entropy. Then, we calculate the correlations between the newly computed entropy and the process data and compare these with the initial correlations from our pilot study. By doing so, we get an idea of how reliable it is to calculate entropy on small corpora that are based on different translations of the same segments.
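As an illustration of the calculation described above, the sketch below computes word translation entropy from word-aligned sentence pairs. It assumes the usual Shannon entropy over the observed translation distribution of each source word, H(s) = -Σ p(t|s) log2 p(t|s), and alignments in the `i-j` link format that Moses writes for symmetrized word alignments. The function names are hypothetical, and counting every alignment link separately is a simplification; the actual processing in the study may differ.

```python
import math
from collections import Counter, defaultdict


def pairs_from_alignment(src_tokens, tgt_tokens, alignment):
    """Turn one alignment line such as "0-0 1-1" into (source, target) word pairs.

    Assumption: each link "i-j" connects source token i to target token j
    (0-based), and every link is counted as a separate translation.
    """
    pairs = []
    for link in alignment.split():
        i, j = link.split("-")
        pairs.append((src_tokens[int(i)], tgt_tokens[int(j)]))
    return pairs


def word_translation_entropy(aligned_pairs):
    """Return a dict mapping each source word to the entropy of its translations."""
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1

    entropy = {}
    for src, tgt_counts in counts.items():
        total = sum(tgt_counts.values())
        # p * log2(1 / p) summed over all observed translations of the source word.
        entropy[src] = sum(
            (c / total) * math.log2(total / c) for c in tgt_counts.values()
        )
    return entropy


if __name__ == "__main__":
    # Toy English-Dutch pairs, invented for illustration only.
    pairs = pairs_from_alignment(["the", "house"], ["het", "huis"], "0-0 1-1")
    pairs += [("house", "woning"), ("the", "het")]
    print(word_translation_entropy(pairs))
```

A source word that is always translated the same way gets an entropy of 0, while a word with two equally frequent translations gets an entropy of 1 bit.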
Alabau, V., Bonk, R., Buck, C., Carl, M., Casacuberta, F., García-Martínez, M., … Tsoukala, C. (2013). CASMACAT: An Open Source Workbench for Advanced Computer Aided Translation. The Prague Bulletin of Mathematical Linguistics, 100(1). https://doi.org/10.2478/pralin-2013-0016
Carl, M., & Schaeffer, M. (2018). The Development of the TPR-DB as Grounded Theory Method. Translation, Cognition & Behavior, 1(1), 168–193. https://doi.org/10.1075/tcb.00008.car
Daems, J. (2016). A Translation Robot for each Translator (PhD Thesis). Ghent University, Ghent, Belgium.
Koehn, P., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E., … Moran, C. (2007). Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the ACL 2007 Demo and Poster Sessions (pp. 177–180). Prague, Czech Republic. https://doi.org/10.3115/1557769.1557821
Leijten, M., & Van Waes, L. (2013). Keystroke Logging in Writing Research: Using Inputlog to Analyze and Visualize Writing Processes. Written Communication, 30(3), 358–392. https://doi.org/10.1177/0741088313491692
Macken, L., De Clercq, O., & Paulussen, H. (2011). Dutch Parallel Corpus: A Balanced Copyright-Cleared Parallel Corpus. Meta: Journal Des Traducteurs, 56(2), 374–390. https://doi.org/10.7202/1006182ar
Vanroy, B., De Clercq, O., & Macken, L. (2018). Predicting Difficulty in Translation: A Pilot Study. In Proceedings of the 3rd Conference on Technological Innovation for Specialized Linguistic Domains. Ghent, Belgium.
Regional variation in spoken Russian.
In my PhD project I examine non-linguists’ perspectives on regional variation in spoken Russian, based on 59 interviews with young Russians in three Russian cities: Moscow, Perm and Novosibirsk. The interviews were recorded and transcribed. Data from the interviews, primarily the transcription files, will be accessible for research purposes through the Norwegian Centre for Research Data (NSD), and will be made searchable in Corpuscle. Access to raw data from the interviews will make it easier for others to assess the conclusions I draw in my thesis. By making the data accessible I also make it possible for the data to be reused in other studies, e.g. on regional variation or other aspects of variation in Russian.