Abstracts Overview CLARIN Annual Conference 2016

Keynotes

Texts, language and geography: Understanding literature using geographical text analysis
Ian Gregory

This paper presents a case study undertaken as part of the European Research Council-funded ‘Spatial Humanities: Texts, GIS, Places’ project and the Leverhulme Trust-funded ‘GeoSpatial Innovations in the Digital Humanities’ project. Combining corpus-based approaches, automated geo-parsing techniques, and geographic information systems technology, this study investigates the geographies underlying the aesthetic terminology historically used in writing about the landscape of the English Lake District, today a world-famous national park in North West England. The focus of this investigation is a 1.5-million-word corpus of writing about the Lake District, comprising 80 texts published between 1622 and 1900. In investigating this corpus, we exemplify how a hybrid geographical and corpus-based methodology can be used to study historical relationships between landscape writing and the wider environment in which the texts are set. The techniques could, however, be applied in any study where language and place are important.

Why technologies are not neutral, and why it matters for linguists
Sally Wyatt

Have you ever wondered why the Paris metro tunnels are so narrow, or why some public benches have armrests? Our physical world is designed by engineers, policy makers and many others, and this has consequences for how we live and find our way around. Software and tools are also designed by engineers, programmers and researchers. What kinds of consequences could this have for the work of researchers in linguistics and the humanities more generally? These questions will be the basis of this keynote by Sally Wyatt, Professor of Digital Cultures in Development, Maastricht University, and member of the eHumanities-NL executive committee.

Thematic session: Language resources and historical sources

What’s in a Name? The case of Albanisch-Albanesisch and Broader Implications
Alexander Erdmann, Erhard Hinrichs and Brian Joseph

This paper offers a use case of the CLARIN research infrastructure from the fields of historical linguistics and the history of linguistics. Using large electronically available corpora of historical English and German, it investigates differences in the terminology used in the two languages when referring to the people and the language of Albania. The search tools that are available for the DTA and the DWDS corpora as part of the CLARIN-D infrastructure make it possible to determine semantic change for the terminology under consideration. The paper concludes with a discussion of the broader implications of the present use case for the use of historical corpora and the functionality of query tools needed for digital humanities research. Read full paper.

Canonical Text Services in CLARIN - Reaching out to the Digital Classics and beyond
Jochen Tiepmar, Thomas Eckart, Dirk Goldhahn and Christoph Kuras

Providing both user-friendly and machine-readable interfaces to digital resources is one of the key tasks of highly integrated research infrastructures like CLARIN. The presented implementation of the Canonical Text Service Protocol covers many of the associated problems, like dealing with varying levels of text granularity, persistent identification and address resolution, and simple interfaces for integration in various automatic workflows. The paper also demonstrates additional benefits of our CTS implementation in the form of built-in text mining techniques. Read full paper.

Paper session 1

Annotating CLARIN.SI TEI corpora with WebAnno
Tomaž Erjavec, Špela Arhar Holdt, Jaka Čibej, Kaja Dobrovoljc, Darja Fišer, Cyprian Laskowski and Katja Zupan

The abstract presents the CLARIN.SI-supported WebAnno platform for manual annotation of corpora. We concentrate on the conversion of the corpus encoding to the WebAnno format and the merging of the WebAnno export back into the original TEI. We also give an overview of some annotation campaigns over Slovene corpora. Read full paper.

Discovering Resources in the VLO: Evaluation and Suggestions from the Perspective of Translation Studies
Vesna Lusicky and Tanja Wissik

CLARIN provides access to language resources for scholars in the humanities and social sciences. In theory, scholars and students of Translation Studies may be assumed to be active data providers of language resources, as well as prolific users of the CLARIN services. However, data show that the uptake of CLARIN services by this user group is rather low. This paper investigates the needs of students of Translation Studies and evaluates CLARIN from their perspective. It is based on a pilot study applying open and closed situated user assignments and an evaluation of the VLO service. The results provide insights into the needs of this user group and give suggestions to data and service providers that could increase the adoption of CLARIN services by this group. Read full paper.

Number game – Experience of a European research infrastructure (CLARIN) for the analysis of web traffic
Go Sugimoto

CLARIN, as a European research infrastructure, has successfully established a base technical and social structure and positioned itself well in a broad Digital Humanities agenda in Europe. However, it seems that its technical development is not adequately built upon a proper PDCA (Plan-Do-Check-Act) cycle. In particular, it is surprising that there has been only a limited amount of user evaluation activity aimed at improving the technical services. This article tries to illustrate the landscape of user evaluation statistically and in detail. It scrutinizes the trends in the web traffic and user behavior of CLARIN over the last two years in order to review the value of current services, in an attempt to adjust the track of its development and to improve our services. In addition, the process, methodology, and experience of web analytics in general are documented and discussed, which yields general lessons for marketing research rather than website-specific analyses. Furthermore, this paper raises awareness of Open Evaluation, which includes the sharing of user statistics. This may become a growing concern for the credibility of CLARIN as an open research infrastructure. Read full paper.

Paper session 2

ORTOLANG: a French infrastructure for Open Resources and TOols for LANGuage
Jean-Marie Pierrel and Christophe Parisse

ORTOLANG (Open Resources and Tools for Language: www.ortolang.fr) is a French infrastructure implemented in the framework of the “Programme d’Investissement d’Avenir” (PIA) funded by the “Investissements d’Avenir” French Government program. Based on the existing resource centers CNRTL (www.cnrtl.fr) and SLDR (http://sldr.org/), this infrastructure aims to ensure the management, mutualization, dissemination and long-term preservation of language resources such as corpora, lexicons, terminologies and language processing tools, with particular focus on the languages of France. It will be used as a technical language platform of written and oral language forms, to support the coordination actions of the TGIR HumaNum (http://www.huma-num.fr/). Read full paper.

Integrating corpora of computer-mediated communication into the language resources landscape: Initiatives and best practices from French, German, Italian and Slovenian projects
Michael Beißwenger, Thierry Chanier, Isabella Chiari, Tomaž Erjavec, Darja Fišer, Axel Herold, Nikola Ljubešić, Harald Lüngen, Céline Poudat, Egon Stemle, Angelika Storrer and Ciara Wigham

The paper presents best practices and results from projects in four countries dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC). Even though there are still many open issues related to building and annotating corpora of that type, there already exists a range of accessible solutions which have been tested in projects and which may serve as a starting point for a more precise discussion of how future standards for CMC corpora may (and should) be shaped. Read full paper.

The CLARIN Language Resource Switchboard
Claus Zinn

The CLARIN infrastructure gives users access to an increasingly rich set of language-related resources, using the Virtual Language Observatory, the Federated Content Search, and the Virtual Collection Registry. While there is ample support for searching resources using metadata-based search or full-text search, or for aggregating resources into virtual collections, there is little support to help users process resources in one way or another. While there is a considerable amount of processing software in the CLARIN world, there is no single point of access where users can find tools to fit their needs and the resources they have. In this paper, we present the CLARIN Language Resource Switchboard (LRS), which aims at helping users to connect resources with the tools that can process them. The LRS lists all applicable tools for a given resource, lists the tasks the tools can achieve, and invokes the selected tool in such a way that processing can start immediately with little or no prior tool parameterization. Read full paper.

Paper session 3

TillTal – making cultural heritage accessible for speech research
Johanna Berg, Rickard Domeij, Jens Edlund, Gunnar Eriksson, David House, Zofia Malisz, Susanne Nylund Skog and Jenny Öqvist

This paper announces the new Swedish research project TillTal, a cross-disciplinary project aiming to improve collaboration between SSH research and speech technology and to make Swedish speech archives more accessible to researchers. The project proposal was a direct result of Swe-Clarin efforts to boost speech in SSH, and the project will start at the beginning of 2017. Here, we provide the background and motivation for the project as well as the project’s outline and goals. Read full paper.

Polish Read Speech Corpus for Speech Tools and Services
Danijel Koržinek, Krzysztof Marasek and Łukasz Brocki

This paper describes the speech processing activities conducted at the Polish consortium of the CLARIN project. The purpose of this segment of the project was to develop specific tools that would allow for automatic and semi-automatic processing of large quantities of acoustic speech data. The tools include the following: grapheme-to-phoneme conversion, speech-to-text alignment, voice activity detection, speaker diarization, keyword spotting and automatic speech transliteration. Furthermore, in order to develop these tools, a large high-quality studio speech corpus was recorded and released under an open license, to encourage development in the area of Polish speech research. All the tools and resources were released on the Polish CLARIN website. This paper discusses the current limitations and future plans of the project. Read full paper.

TalkBank within CLARIN
Brian MacWhinney

TalkBank promotes the use of corpora, web-based access, multimedia linkage, and human language technology (HLT) for the study of spoken language interactions in a wide variety of discourse types across many languages, involving children, second language learners, bilinguals, people with language disorders, and classroom learners. Integration of these materials within CLARIN provides open access to a large amount of data to support researchers, as well as a good test bed for the development of new computational methods. Read full paper.

Conversion and Annotation Web Services for Spoken Language Data in CLARIN
Thomas Schmidt, Hanna Hedeland and Daniel Jettka

We present an approach to making existing CLARIN web services usable for spoken language transcriptions. Our approach is based on a new TEI-based ISO standard for such transcriptions. We show how existing tool formats can be transformed to this standard, how an encoder/decoder pair for the format enables users to feed this type of data through a WebLicht tool chain, and why and how web services operating directly on the standard format would be useful. Read full paper.

Paper session 4

AAI: systematically addressing the attribute release problem
Jozef Mišutka, Ondřej Košarko and Amir Kamran

The CLARIN Service Provider Federation is not the only inter-federation, but it is the first one to systematically address the attribute release issue, which arises when Identity Providers do not release mandatory attributes to a service. For instance, if a data set is licensed under a restrictive license, the user must be uniquely identifiable over time. However, if the Identity Provider does not release such information, the service cannot let the user download the data. Read full paper.

Curation module in action - preliminary findings on VLO metadata quality
Davor Ostojic, Go Sugimoto and Matej Ďurčo

Numerous problems and suggestions have been reported over the last few years concerning metadata aggregation for the VLO (Virtual Language Observatory), one of the core services of CLARIN. In response, we have developed a metadata curation module which is capable of assembling and reporting a wide range of statistics about CMD (Component Metadata) records, collections, and profiles, with the aim of monitoring metadata quality issues in the VLO. In this paper, we present its ongoing development and preliminary findings. With an easy-to-use interactive interface and scoring system, the module has successfully visualised the current state of the VLO. Our first set of analyses offers unprecedented views on the quality of CMD metadata. We have also identified future work, including the user interface, usability, input methods, and the calibration of the scoring algorithm. We strongly believe that the curation module has the potential to openly and collectively check and improve the metadata, fostering the comprehensive analysis and assessment of metadata quality to support the VLO in the long run. Read full paper.

Poster session 1

Integrating corpora of computer-mediated communication into the language resources landscape: Initiatives and best practices from French, German, Italian and Slovenian projects
Michael Beißwenger, Thierry Chanier, Isabella Chiari, Tomaž Erjavec, Darja Fišer, Axel Herold, Nikola Ljubešić, Harald Lüngen, Céline Poudat, Egon Stemle, Angelika Storrer and Ciara Wigham

The paper presents best practices and results from projects in four countries dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC). Even though there are still many open issues related to building and annotating corpora of that type, there already exists a range of accessible solutions which have been tested in projects and which may serve as a starting point for a more precise discussion of how future standards for CMC corpora may (and should) be shaped. Read full paper.

ORTOLANG Diffusion - A Component Based Digital Object Repository
Jerome Blanchard, Etienne Petitjean and Frederic Pierre

The ORTOLANG (Open Resources and TOols for LANGuage) platform offers a new Digital Object Repository service. By mixing a Service Oriented Architecture for high-level services and a Software Component Architecture for its Repository Service, the ORTOLANG platform tries to build a robust and reliable Digital Object Repository that provides rich functionality and a modern interface, with high performance and well-chosen optimization strategies. Through its hardware and software architecture choices, the ORTOLANG platform can evolve flexibly, so as to guarantee long-term support for hosted resources. Read full paper.

Not quite your usual kind of resource. Gra.fo and the documentation of Oral Archives in CLARIN
Francesca Frontini and Silvia Calamai

We present some reflections on the documentation of Oral History archives within CLARIN, with a focus on their accessibility through the CLARIN Virtual Language Observatory. The case study is the Grammo.foni Le soffitte della voce project, a collection of digitized and catalogued Tuscan oral archives. Read full paper.

ORTOLANG: a French infrastructure for Open Resources and TOols for LANGuage
Jean-Marie Pierrel and Christophe Parisse

ORTOLANG (Open Resources and Tools for Language: www.ortolang.fr) is a French infrastructure implemented in the framework of the “Programme d’Investissement d’Avenir” (PIA) funded by the “Investissements d’Avenir” French Government program. Based on the existing resource centers CNRTL (www.cnrtl.fr) and SLDR (http://sldr.org/), this infrastructure aims to ensure the management, mutualization, dissemination and long-term preservation of language resources such as corpora, lexicons, terminologies and language processing tools, with particular focus on the languages of France. It will be used as a technical language platform of written and oral language forms, to support the coordination actions of the TGIR HumaNum (http://www.huma-num.fr/). Read full paper.

CLARIN Resources for Classical Latin and Historical German
Brian MacWhinney, Uwe Springmann, Zarah Weiss, John Kowalski, Anke Lüdeling and Detmar Meurers

The LangBank Project is a collaboration between Carnegie Mellon University, the University of Tübingen, and Humboldt University in Berlin to create web-based corpus resources for the study of Classical Latin and Historical German by both language learners and scholars. These resources are all being made available through the TalkBank CLARIN-B Centre. Read full paper.

AHA: Anagram Hashing Application
Martin Reynaert

We briefly present AHA, the Anagram Hashing Application, a new web application and service that allows researchers to effortlessly analyse the lexical variation present in their Gold Standard data and to publish the results. Read full paper.

Researcher Hands-On Training in the Digital Humanities: An Austrian Case Study
Tanja Wissik and Claudia Resch

In this paper we discuss hands-on training in the Digital Humanities based on an Austrian case study. We introduce a seasonal initiative, the “ACDH Tool Galleries”, organized by the Austrian Centre for Digital Humanities (ACDH) of the Austrian Academy of Sciences, which allows developers and professionals to provide education and practical training in tools designed for Digital Humanities users. Furthermore, we present current survey data collected among the participants of these training courses. Read full paper.

Poster session 2

A SOLR/Lucene based Multi Tier Annotation Search solution
Matthijs Brouwer, Marc Kemps-Snijders and Hennie Brugmann

In recent years, multiple solutions have become available that provide search on huge amounts of plain text and metadata. Scalable search on annotated text, however, still appears to be problematic. We add annotation layers and structure to the existing Lucene approach of creating and searching indexes, and furthermore present an implementation as a Solr plugin providing both searchability and scalability. SOLR/Lucene is fast, scales well, and has a large base of users as well as developers. The latter stands in sharp contrast to several existing corpus search and management systems, for which one or a few developers have the task of maintaining and further developing the system. With SOLR/Lucene one gets this almost for free. Read full paper.

Web Service for Easy Text-to-TEI Normalization and Metadata Creation
Bart Jongejan, Lene Offersgaard and Dorte Haltrup Hansen

In CLARIN-DK our experience is that it is too difficult for users to create data and metadata in the specific formats required by the repository. As most of the deposited resources in CLARIN-DK are text resources, we have decided to build a web service that assists researchers in the preparation phase of text resources. The aim is on the one hand to transform text to the TEI P5 format and on the other hand to create sufficient metadata for the resource, from the point of view of the repository as well as that of the researcher. To reach out to many scholars with a solution to these problems, the information available online has also been extended with tutorials. The goal is to make providing data and metadata of acceptable quality much easier. Read full paper.

TalkBank within CLARIN
Brian MacWhinney

TalkBank promotes the use of corpora, web-based access, multimedia linkage, and human language technology (HLT) for the study of spoken language interactions in a wide variety of discourse types across many languages, involving children, second language learners, bilinguals, people with language disorders, and classroom learners. Integration of these materials within CLARIN provides open access to a large amount of data to support researchers, as well as a good test bed for the development of new computational methods. Read full paper.

Setting up the national infrastructure clarin:el
Stelios Piperidis and Maria Gavrilidou

This paper presents the Greek national infrastructure for language resources, clarin:el, a member of CLARIN since 2015. It describes the design principles for the creation of the network and lists the current members; it describes the infrastructure and its architecture and briefly elaborates on the resources and services offered. Read full paper.

Number game – Experience of a European research infrastructure (CLARIN) for the analysis of web traffic
Go Sugimoto

CLARIN, as a European research infrastructure, has successfully established a base technical and social structure and positioned itself well in a broad Digital Humanities agenda in Europe. However, it seems that its technical development is not adequately built upon a proper PDCA (Plan-Do-Check-Act) cycle. In particular, it is surprising that there has been only a limited amount of user evaluation activity aimed at improving the technical services. This article tries to illustrate the landscape of user evaluation statistically and in detail. It scrutinizes the trends in the web traffic and user behavior of CLARIN over the last two years in order to review the value of current services, in an attempt to adjust the track of its development and to improve our services. In addition, the process, methodology, and experience of web analytics in general are documented and discussed, which yields general lessons for marketing research rather than website-specific analyses. Furthermore, this paper raises awareness of Open Evaluation, which includes the sharing of user statistics. This may become a growing concern for the credibility of CLARIN as an open research infrastructure. Read full paper.

OpenSKOS next edition: triplestore support for controlled vocabularies
Olha Shkaravska, Marc Kemps-Snijders and Menzo Windhouwer

The OpenSKOS software implements a web-service-based approach to the management and use of vocabularies based on SKOS design principles. This paper motivates and describes the most recent developments in the software. These developments include migrating from a relational MySQL database to a triplestore, with corresponding changes in the software. Also, relation management has been extended so that user-defined relations can be added and used along with SKOS-specified relations. This transfer ensures a smooth upgrade of OpenSKOS-based projects and the reuse of those parts of the software that have preserved their relevance. Read full paper.

FLAT: A CLARIN-compatible repository solution based on Fedora Commons
Paul Trilsbeek and Menzo Windhouwer

This paper describes the development of a CLARIN-compatible repository solution that fulfils both the long-term preservation requirements and the current-day discoverability and usability needs of an online data repository of language resources. The widely used Fedora Commons open source repository framework, combined with the Islandora discovery layer, forms the basis of the solution. On top of this existing solution, additional modules and tools are developed to make it suitable for the types of data and metadata that are used by the participating partners. Read full paper.

Just for the record, CMDI should be about semantic interoperability
Thorsten Trippel and Claus Zinn

The Component MetaData Infrastructure (CMDI) provides a lego-brick framework for the creation, use and re-use of self-defined metadata formats. The design of CMDI can be a force for good, but history shows that it has often been misunderstood or badly executed. Consequently, it has led the community towards the dark ages of metadata clutter rather than the bright side of semantic interoperability. In this abstract, we report on the condition of CMDI but also outline an agenda to make the CMDI world a better place to use, share and profit from metadata. Read full paper.