CLARIN2017 Book of Abstracts


Invited talks

Literary translations and tools for stylometric research
Karina van Dalen-Oskam

In many countries, readers of fiction consume both fiction written in their own language and fiction in other languages or translated from other languages. In my talk, I will describe how readers from four different countries (United Kingdom, France, Germany, and the Netherlands) look upon the quality of original and translated fiction, and compare some of the novels they rated using a small selection of stylometric tools. This will lead to observations about new ways of analysing the quality of translations and about how language technology may be used by translators and publishers for different purposes.

From multilingual to cross-lingual processing for Social Sciences and Humanities
Piek Vossen

In this presentation I will talk about the NewsReader project, which resulted in a reading machine for several languages that extracts event-centric knowledge graphs from texts that can be of interest to social sciences and humanities. How did we make sure that the processing of text is done in an interoperable way across these languages, given the different technological challenges for each? How do we represent the results of this processing in a uniform way that allows for differences and traces provenance relations?

I will focus on various aspects:
1. social sciences and humanities want semantic and pragmatic processing of text. How to achieve this in different languages and how to achieve interoperability across languages?
2. the role of language resources (lexicons and annotated corpora) for semantic and pragmatic processing of text and how to solve this for lesser-resourced languages
3. how to achieve interoperability in processing
4. how to achieve interoperability in the output of processing
5. how to deal with differences and the provenance of these differences

Thematic session: Multilingual Processing for Social Sciences and Humanities

Many a Little Makes a Mickle – Infrastructure Component Reuse for a Massively Multilingual Linguistic Study
Lars Borin, Shafqat Mumtaz Virk and Anju Saxena

We present ongoing work aiming at turning the linguistic material available in Grierson’s classical Linguistic Survey of India (LSI) into a digital language resource, a database suitable for a broad array of linguistic investigations of the languages of South Asia and studies relating language typology and contact linguistics. The project has two main aims: (1) to conduct a linguistic investigation of the claim that South Asia constitutes a linguistic area; (2) to develop state-of-the-art language technology for automatically extracting the relevant information from the text of the LSI. In this presentation we focus on how a number of existing research infrastructure components were ‘recycled’ in order to allow the linguists involved in the project to quickly orient themselves in the vast LSI material, and to be able to provide input to the language technologists designing the information extraction from the descriptive grammars. Read full paper

Parliamentary Corpora in the CLARIN infrastructure
Darja Fišer and Jakob Lenardič

This paper gives an overview of the parliamentary records and corpora from CLARIN countries with a focus on an analysis of their availability through the CLARIN infrastructure. Based on the results of the survey we draw a list of recommendations to optimize the depositing and cataloguing of the corpora in the CLARIN repositories in order to make them readily accessible for researchers from different disciplines. Read full paper

Paper session 1

Open Stylometric System WebSty: Towards Multilingual and Multipurpose Workbench
Maciej Piasecki, Tomasz Walkowiak and Maciej Eder

WebSty is an open, web-based stylometric system designed for SS&H users. It was designed according to the CLARIN philosophy: no need for installation, minimised requirements on the users’ technical skills and knowledge, focus on SS&H tasks. In the paper we present its latest extension with several visualisation methods, techniques for the extraction of characteristic features, and support for two more languages, namely English and German. Read full paper

Machine Exploration of Secondary Literature with Literary Exploration Machine
Maciej Maryl, Maciej Piasecki and Tomasz Walkowiak

This paper presents a design of a web-based application for textual scholars. The goal of this project is to create a complex and stable research environment allowing scholars to upload the texts they are analysing and either explore them with a suite of dedicated tools or transform them into another format (text, table, list). The latter functionality is especially important for research into Polish texts, because it allows for further processing with the tools built for English. This project utilises the already existing CLARIN-PL applications and supplements them with new functionalities. Read full paper

Paper session 2

Involving users and collaborating between disciplines in making cultural heritage accessible for research
Johanna Berg, Rickard Domeij, Jens Edlund, Gunnar Eriksson, David House, Zofia Malisz, Susanne Nylund Skog and Jenny Öqvist

This paper presents project design and initial experiences of involving users and collaborating between disciplines within the newly started project Tilltal (Berg et al.). The long-term goal of the Tilltal project is to make speech data at the Swedish memory archives more accessible to SSH researchers. We achieve this not only by describing methods by which speech technology can be used to reach SSH research goals, but also by providing fruitful examples of involving users, studying usage and collaborating between disciplines using the approach of participatory design (e.g. Kensing & Blomberg 1998). We use activity theory to survey the research activities surrounding the archival materials (e.g. Nardi 1996). We model characteristic situations of use following ideas in Hansen et al. 2014, propose language technology solutions and assess their usefulness in practice by means of use cases (Jacobson et al. 1992, 2011). Read full paper

Authorship and ownership in the digital oral archives domain: The digital archive in the CLARIN-IT repository
Silvia Calamai and Francesca Biliotti

The paper addresses the problem of authorship and ownership with relation to a digital oral archive created through the digitisation of several analogue archives. The case study is provided by the digital archive (Grammo-foni. Le soffitte della voce, Scuola Normale Superiore & University of Siena, Regione Toscana PAR FAS 2007-13), a collection of around 30 Tuscan oral archives that is in the process of being documented in the CLARIN-IT repository. Read full paper

A Bridge from EUDAT’s B2DROP cloud service to CLARIN’s Language Resource Switchboard
Claus Zinn

We describe the usage of EUDAT’s B2DROP cloud service to increase the usability, visibility and attractiveness of the CLARIN Language Resource Switchboard. Read full paper

Paper session 3

Morphological Productivity of Adjective Formation in German – A Diachronic Corpus Study Using the CLARIN-D Infrastructure
Erhard Hinrichs

This paper demonstrates how large digital collections of diachronic and synchronic corpus data can shed new light on traditional research questions in morphology and diachronic linguistics. More specifically, the question of productivity of derivational affixes, a long-standing issue in the study of morphology, can be addressed in a much more fine-grained manner, especially from a diachronic perspective. As its empirical basis, the study makes use of two linguistically annotated corpus collections whose individual texts include high-quality metadata information about the date of origin, text type, and text size. These corpus collections are available as part of the CLARIN-D research infrastructure and are accompanied by powerful query and concordancing tools that support searching for morpho-syntactic patterns and visualization of query results. Read full paper 

Digital Muqtabas CTS Integration in CLARIN
Till Grallert, Jochen Tiepmar, Thomas Eckart, Dirk Goldhahn and Christoph Kuras

This paper describes the CLARIN integration of the Canonical Text Services for a text corpus containing Muhammad Kurd Alī’s al-Muqtabas. This high-quality text resource was introduced into CLARIN’s infrastructure by using a fine-grained, persistently citable approach. In addition to the practical benefits of a newly integrated text resource, this paper illustrates the usefulness of CTS as a generic interface by showing that established workflows can be reused with new data sets. Read full paper

CORLI: a linguistic consortium for corpus, language and interaction
Christophe Parisse, Céline Poudat, Ciara Wigham, Michel Jacobson and Loïc Liégeois

CORLI is a consortium of Huma-Num, an organization that helps to organize and provide services for digital humanities in France. CORLI is the consortium for linguistics and includes all aspects of linguistic research and development. As France just joined CLARIN as an observer, the objective of our paper is to introduce the consortium CORLI to CLARIN; CORLI will act as an interface between CLARIN and the scientific community of linguists. The goal of CORLI is to help linguists create, use, and disseminate linguistic corpora and digital tools. CORLI has always maintained a policy of providing funding and technological help to finalize and publish corpora issued from a wide range of institutional or personal research projects. CORLI is also involved in recommending and circulating guidelines related to research and technical practices, especially about linguistic corpora. Finally, CORLI organizes workgroups whose goal is to create and moderate networks that target tools and practices in linguistics. These workgroups are organised thematically around topics including metadata, formats, tools and practices for corpus exploration, archiving systems, multimodal practices and annotations. Their goal is to help showcase innovative work and trends undertaken in research labs and to finalize and disseminate current methods and practices in digital humanities research. Read full paper

Paper session 4

Working towards a Metadata Federation of CLARIN and DARIAH-DE
Thomas Eckart and Tobias Gradl

Over the past years great effort went into the establishment of research infrastructures for the Humanities and Social Sciences. As a result of the diversity of their targeted research fields and communities, miscellaneous infrastructure projects have developed unique solutions for overlapping target groups. This especially holds for describing available resources by structured metadata and providing them to a wider audience in a user-friendly fashion. This paper presents recent work to bridge the gap between the metadata infrastructures of CLARIN and the German branch of the DARIAH project, focusing on design decisions made by both projects and on preliminary work towards the synergetic evolution of both. Read full paper

Component Metadata Infrastructure Best Practices for CLARIN
Thomas Eckart, Twan Goosen, Susanne Haaf, Hanna Hedeland, Oddrun Ohren, Dieter Van Uytvanck and Menzo Windhouwer

Last year, 2016, saw the release of both version 1.2 of the Component Metadata (CMD) Infrastructure (CMDI) [CLARIN ERIC 2017] and a first complete technical specification [CMDI Task Force 2016]. This new version provides new possibilities, which are gradually opened up by the ecosystem of tools and registries in CMDI. One of the key properties of CMDI is its flexibility, which makes it possible to create metadata records closely tailored to the requirements of resources and tools/services. However, design and implementation choices made at various levels in the CMD lifecycle might influence how well or easily a CMD record is processed and its associated resources made available in the CLARIN infrastructure. Knowledge on this has traditionally been scattered around in various documents, web pages and even completely hidden from sight in experts’ minds. To make this knowledge explicit, the CMDI and Metadata Curation Task Forces have teamed up to create a Best Practice guide. This guide, together with the technical CMDI 1.2 specification, will be a valuable knowledge base and will help any (technical) CMDI user to bring her CMD records to their full potential use within CLARIN. In May 2017, the writing of this guide is still an ongoing effort and this paper gives a first overview of and insight into its contents. Notice that this paper is not a replacement for the guide itself, which should be finished in 2017, but aims to draw attention to it and foster discussion and feedback. The next sections give a description of the scope, an outline and a glimpse into the content of the guide. Read full paper

Implementation of an Open Science Policy in the context of management of CLARIN language resources: a need for changes?
Aleksei Kelli, Krister Lindén, Kadri Vider, Penny Labropoulou and Erik Ketzan

The article explores whether CLARIN license categories are compatible with open science policy or should be reviewed. Read full paper

Examining Web User Flows and Behaviours in CLARIN Ecosystem
Go Sugimoto

This article attempts to draw a map of user flows and behaviours in CLARIN’s multi-layered web structure by cross-examining the dynamic movements of different types of users within (and outside of) the CLARIN domain. In particular, we analysed the user traffic of several websites including the main website, various CLARIN web applications, and partner websites, as well as the use of single sign-on. Consequently, we are able to gain a better understanding of user interactions in the context of a large web ecosystem rather than those of each individual website. The evolution of web traffic over a year reveals a comprehensive overview of the characteristics of end-users and provides a clue for the next strategic decisions over CLARIN’s user-oriented services and business sustainability. This preliminary research also demonstrates the potential of business intelligence of web analytics to measure the impact of aggregation services and complex research infrastructures in cultural heritage and digital humanities alike. Read full paper

Poster session

Digital Classics: A Survey of the Needs of Ancient Greek Scholars in Italy
Monica Monachini, Anika Nicolosi and Alberto Stefanini

This paper presents and discusses the findings of a survey carried out in order to assess the use of digital resources and technology with respect to work in Ancient Greek scholarship, as well as to identify the factors which are likely to constrain its use and to elicit needs and requirements of Ancient Greek scholars. The survey is aligned with the principles behind the recent user engagement strategy developed by CLARIN-ERIC and constitutes one of the national efforts undertaken by CLARIN-IT to contribute to the wider impact of CLARIN on Digital Classicists. Read full paper

Improved treebank querying: a facelift for GrETEL
Liesbeth Augustinus, Bram Vanroy and Vincent Vandeghinste

We describe the improvements to the interface of GrETEL, an online tool for querying treebanks. We demonstrate how we employed the results of two usability tests and individual user feedback in order to create a more user-friendly interface which meets the users’ needs. Read full paper

Multilingual Text Annotation of Slovenian, Croatian and Serbian with WebLicht
Nikola Ljubešić, Tomaž Erjavec, Darja Fišer, Erhard Hinrichs, Marie Hinrichs, Cyprian Laskowski, Filip Petkovski and Wei Qui

Linguistic annotation of text corpora is a prerequisite for corpus linguistics or any advanced explorations of language. In this paper we first introduce three of the CLARIN.SI suite of open source trainable tools, namely diacritic restoration, word-normalisation and part-of-speech tagging with lemmatisation, trained for three Slavic languages: Slovene, Croatian and Serbian. We then present the trial integration of the tagger/lemmatiser with WebLicht, which has so far offered annotation workflows mainly for German and other Western European languages. Read full paper

CLARIN-IT: State of Affairs, Challenges and Opportunities
Lionel Nicolas, Alexander König, Monica Monachini, Riccardo Del Gratta, Silvia Calamai, Andrea Abel, Alessandro Enea, Francesca Biliotti and Valeria Quochi

This paper gives an overview of the Italian national CLARIN consortium and the status of CLARIN-IT in general. It thus discusses the current state of affairs of the consortium and provides information on the members, especially with regard to what they offer to CLARIN in terms of resources, services and expertise, and what CLARIN offers them to further their own research. Read full paper

Expanding the functionalities of the Language Resources Switchboard by integrating a set of tools for the processing of Polish language
Rafał Jaworski and Maciej Ogrodniczuk

This paper presents the Multiservice platform and its integration with the CLARIN Language Resources Switchboard. Multiservice combines a set of offline natural language processing tools for the Polish language. It features, among others, disambiguating tagging, dependency parsing and coreference resolution. A demonstration version of the platform, available online, is also accessible to users of the CLARIN Language Resources Switchboard (CLRS). At CLRS, the user provides a text file, selects one of the predefined processing chains and is automatically redirected to the Multiservice, which is immediately ready to process the request. Read full paper

Man against Machine: Qualitative Comparison of Original, Translated and Post-edited Wikipedia Articles
Mark Fišel, Martin Luts, Arvi Tavast, Sirli Zupping and Kadri Vare

In 2017 Estonia started a nationwide project, Miljon+1, to get the Estonian Wikipedia among the Wikipedia versions with over 1 million articles. The project focuses on increasing the size of new articles in Estonian as well as increasing the size of translated articles. This paper describes one possible way in which machine translation (MT) could speed up and support the process of reaching a million Wikipedia articles in Estonian. Read full paper

Something will be connected - Semantic mapping from CMDI to PARTHENOS Entities
Matej Durco, Matteo Lorenzini and Go Sugimoto

The Parthenos project aims at pooling resources from existing infrastructures of the broad cultural heritage and humanities cluster. Central to this effort is the common semantic framework - Parthenos Entities - that shall serve as a target model for mapping of information about resources from participating infrastructures. As a representative of the linguistic domain, CLARIN will deliver metadata about language resources. Within the Parthenos project separate provisions are foreseen for the mapping task. However, given the complexity of CLARIN’s underlying metadata model (CMDI), traditional one-to-one schema mapping is not applicable and an alternative conceptual and technical approach is required. This paper presents the ongoing work on mapping CMDI to the Parthenos model and points out a number of issues identified during the process, some of them already well known from the ongoing metadata quality discussion within CLARIN. Read full paper

ChronoPress – Chronological Corpus of Polish Press Texts (1945–1962)
Adam Pawłowski

This contribution aims to introduce the main characteristics and functionalities of the ChronoPress corpus. It consists of three parts. First, some definitions and goals of chronological text analysis will be presented. Then selected web applications provided with tools of sequential analysis will be briefly reviewed. This theoretical and “state-of-the-art” introduction will be followed by a demonstration of the functionalities of the ChronoPress web service, such as time series, quantitative analysis, semantic word profiles and lexical maps. At this stage, different case studies will be examined. The final part will include a discussion of the current state of the web service and its potential future development. Read full paper

Developing a CLARIN compatible AAI solution for academic and restricted resources
Tommi A Pirinen, Daniel Jettka and Hanna Hedeland

In this article we introduce an authentication and authorisation infrastructure for a CLARIN-compatible digital repository. Due to privacy concerns, the corpora hosted by the repository normally cannot be made available under free licenses (e.g. Creative Commons). Only occasionally can they be made available for general academic use (i.e. Shibboleth-based SSO relying e.g. on the CLARIN SPF); most often they are restricted to academic, non-commercial use and are only available upon personal request. These characteristics were translated into system requirements that in turn called for a customised solution with several modifications to the off-the-shelf Drupal Shibboleth module used in our system. Read full paper

Correct-Annotator: An Annotation Tool for Learner Corpora
Felix Hultin

This paper presents CORRECT-ANNOTATOR, a browser-based, single-page annotation tool to annotate language errors for learner corpora in Swedish. Having grown out of the research project SweLL, CORRECT-ANNOTATOR attempts to significantly ease the workflow of an annotator, by allowing the annotator to edit and correct the text, from which the system induces potential language error annotations. This differs from previous annotation tools in that annotation is made on text transformations, rather than on a static text. With the expansion of learner corpora-related research in recent years, CORRECT-ANNOTATOR might prove a useful tool for building learner corpora, which would be a unique addition to the CLARIN infrastructure. Read full paper

An Arranged Marriage: Integrating DKPro Core in the Language Analysis Portal
Milen Kouylekov, Emanuele Lapponi, Stephan Oepen and Richard Eckart de Castilho

This paper describes an ongoing effort to create an inter-operation framework that makes the DKPro Core repository of Natural Language Processing (NLP) components accessible through the CLARINO Language Analysis Portal (LAP). Tight integration of the two projects will substantially enlarge the selection of NLP tools and their coverage across languages available to LAP users, and it will provide DKPro Core with an easy-to-use, in-browser user interface and broad availability as part of the CLARINO infrastructure. This work addresses interesting issues of metadata interpretation (in the description of input and output interfaces to DKPro Core components) as well as of interchange representations for diverse types of linguistic annotations. Read full paper


