The CLARIN Bazaar 2018

Below you will find a list of the stalls that can be visited at the CLARIN Bazaar. Please go and talk to stallholders, see their wares, and share your ideas!

Bazaar presentations

Title of stall   Description

The OpenAIRE Research Community Dashboard for CLARIN and other research communities

Miriam Baglioni

 

The effective implementation of Open Science calls for a scientific communication ecosystem capable of enabling the “Open Science publishing principles” of transparency and reproducibility. Such an ecosystem should provide the tools, policies, and trust needed by scientists for sharing/interlinking (for “discovery” and “transparent evaluation”) and re-using (for “reproducibility”) all research products produced during the scientific process, e.g. literature, research data, methods, software, workflows, protocols. OpenAIRE fosters Open Science by advocating its publishing principles across Europe and research communities and by offering technical services in support of Open Access. Its aim is to provide Research Infrastructures (RIs) with the services required to bridge the research life-cycle they support - where scientists produce research products - with the scholarly communication infrastructure - where scientists publish research products - in such a way that science is reusable, reproducible, and transparently assessable. It is fostering the establishment of reliable, trusted, and long-lasting RIs by (i) compensating for the lack of Open Science publishing solutions and (ii) providing the support required by RIs to upgrade existing solutions to meet Open Science publishing needs (e.g. technical guidelines, best practices, OA mandates). To this end, OpenAIRE has extended its service portfolio by introducing the Research Community Dashboard (RCD). Thanks to its functionality, scientists of RIs can (i) find tools for publishing all their research products, such as literature, datasets, software, research packages, etc. (provide metadata, get DOIs, and ensure preservation of files), (ii) interlink such products manually or by exploiting advanced mining techniques, and (iii) integrate their services to automatically publish metadata and/or payload of objects into OpenAIRE.
As a consequence, scientists populate and access an information space of interlinked objects dedicated to their RI, through which they can share any kind of product in their community, maximise re-use and reproducibility of science, and reach out to scholarly communication at large.

     

Leveraging Concepts in Open Access Publications

Andrea Bertino

 
     

Provenance Information in CMDI

Daan Broeder, Menzo Windhouwer

 

When managing data sets in research data workflows, almost all research disciplines face the challenge of how to deal with versioning or, more broadly, tracking provenance. At this stall we propose an extension to the CMD Infrastructure to specify (provenance) relationships among language resources. Although we are particularly interested in use cases for describing relations between corpora (update, enrichment etc.), we would also like to discuss provenance tracking and provenance use cases in general. Contributions to our work are very welcome.

     

Gaze as an annotation method in speech research

Mattias Bystedt, Zofia Malisz, David House, Jens Edlund

 

It is widely known that gaze is an important phenomenon in face-to-face conversation. This, however, is not the only way gaze is relevant to speech and language research. What we do with our gaze is useful from other perspectives as well. Here, we discuss how gaze can be used as a tool to achieve better annotations for speech technology, specifically to improve prosody in speech synthesis.

     

Sounds and images: Corpora for speech production analysis and visualization

Chiara Celata, Irene Ricci, Chiara Bertini

 

Speech production technologies are of increasing importance and diffusion in the domain of multimodal humanities, often providing high dimensional imaging data on speech articulators that can also be used for the realization of three-dimensional (3D) interfaces of the vocal tract. These technologies and related tools are less expensive and can be more flexibly used than in the past, e.g. by being applied to a larger sample of speakers and speaking situations, including fieldwork on endangered dialects and the study of speech pathologies. However, physiological imaging coupled with audio recordings requires the managing of extremely large quantities of experimental data. Safe storage and sharing of such datasets may go beyond the possibilities of single research centres and would thus greatly benefit from the existence of shared policies of data processing as well as of infrastructures sustaining speech production data archives. Among the major problems is also the lack of open access software and tools for speech production analysis, which implies a very low level of interoperability among systems - and research centres. The current lack of shared procedures and protocols for data storage and processing (including metadata organization) is also a major issue to be addressed in order to include speech production data and resources under the umbrella of shared linguistic infrastructures promoted by CLARIN. Ethical and privacy-related issues are also a crucial aspect in speech production research, which would require specific consideration, as they overlap only partially with the protocols currently developed in the domain of oral linguistic and historical archives. We will discuss these issues by providing concrete examples of datasets and procedures used in SIAMO, an ongoing project on speech motor disorders and visual feedback.

     

Digital Environment for Textual Scholarship 

Angelo Mario Del Grosso

 

This contribution aims at showing a set of tools developed within a number of national and international Digital Humanities projects.

In particular, we will show the models, the methodology and the technologies that have been adopted to implement a digital scholarly environment for studying historical texts.

Starting from the digital encoding of the resources using the TEI guidelines, we will illustrate a workflow encompassing a sequence of incremental steps, each of which will be addressed from both a methodological and a technological perspective.

The workflow we are developing considers five steps: 1) visualization and structuring of metadata, 2) transcription, 3) structural encoding, 4) annotation, 5) lexical and conceptual structuring.

The platform under development that manages the text and the textual annotations is called Omega; it is hosted in a GitHub repository at the following URL: https://github.com/literarycomputinglab/OmegaProject.

In order to present the output of the digital scholarly process the Edition Visualization Technology tool (EVT) has been appropriately customized. Specifically, new features concerning image visualization, diplomatic edition display and textual search have been developed.

Finally, a demo of the tool is available at the following two URLs:

1) The draft digital edition of the Life of San Teobaldo: http://licodemo.ilc.cnr.it/evt-rotulo-viewer-search.

2) A page of the manuscript concerning the al-qāmūs al-muḥīṭ lexicon at http://licodemo.ilc.cnr.it/qamus.

     

Human annotation of temporally disassembled audio using MMAEs: findings and considerations

Per Fallgren, Zofia Malisz, David House, Jens Edlund

 

In many cases, the sheer amount of audio data we find in archives is daunting. In the TillTRal project, we look to get an overview of 13 000 hours of data, which in many cases has virtually no metadata. For this reason, we are exploring techniques to quickly get an overview of large audio datasets. Here, we will discuss the effects of temporal disassembly (e.g. chopping up audio into small segments and playing them out of order) and massively multichannel acoustic environments (a method of replaying large numbers of small sound samples in a manner that creates a continuous acoustic environment, or a soundscape).
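
As a rough illustration of temporal disassembly as described above, the following Python sketch chops a sequence of samples into short segments and shuffles their order. The function name, the segment length and the toy integer "samples" are assumptions made for illustration only; the tooling actually used in the project is not described here.

```python
import random


def disassemble(samples, segment_len, seed=0):
    """Chop a sample sequence into fixed-length segments and shuffle their order."""
    segments = [
        samples[i:i + segment_len]
        for i in range(0, len(samples), segment_len)
    ]
    rng = random.Random(seed)  # fixed seed keeps the shuffle repeatable
    rng.shuffle(segments)
    return segments


# Toy "recording": ten integers standing in for audio frames (hypothetical data).
recording = list(range(10))
shuffled = disassemble(recording, segment_len=3)
print(shuffled)
```

Each segment stays internally intact; only the order of the segments changes, so the listener hears short, locally coherent snippets out of their original context.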

     

Rotterdam Exchange Format Initiative (REFI)

Jeanine Evers

 

The Rotterdam Exchange Format Initiative (REFI) was formed in 2016 to develop a standard for exchanging processed qualitative data between Qualitative Data Analysis Software packages. It is an open standard and any program can implement it, thus increasing the number of software programs that can ‘talk’ to one another.

The group created and tested the Codebook Exchange Standard in March 2018. This standard exchanges ‘codes’, their characteristics and notes between the software packages involved. 

The group aims to have completed and tested the Project Exchange Standard in March 2019. This standard will exchange the whole content of a project, i.e. processed data files, codes and links, their characteristics and the notes made, both about the objects in the project and the project itself.

Come and visit!

     

Technologies for computer-assisted translation, lexica and ontologies

Andrea Bellandi, Davide Albanesi, Emiliano Giovannetti

 

The in-depth study of a text, especially in a scholarly perspective, requires some specific levels of the text to be appropriately annotated and some textual content elements to be structured in external resources linked to the text. In particular, the terms that make up a text and which constitute its semantic "backbone" have to be formalized and structured in electronic terminologies. Moreover, the concepts denoted by each term have to be structured too and linked to the relative terms, in order to allow a user to browse a text also on a conceptual basis.

The translation of a text is no exception, especially if the text to be translated is linguistically and structurally complex. A multilingual termino-ontological resource that encodes the keyphrases (including terms and named entities) present in the source and target texts provides the translator with valuable information elements belonging to the deepest semantic layers of the texts.

At the Literary Computing group of ILC-CNR, on the basis of models and formalisms of the Semantic Web, two collaborative web applications have been developed, LexO and Traduco, the former for the construction and management of termino-ontological resources and the latter for the computer-assisted translation of the Babylonian Talmud into Italian.

     

Towards accurate, interoperable and reusable recommended metadata components [Poster]

Hanna Hedeland, Twan Goosen

 

We have identified the need for a set of recommended components for basic aspects of the description of language resources and tools within the Component MetaData Infrastructure (CMD). The availability of high quality generic components would greatly improve discoverability and comparability of resources hosted across various consortia, and thus the general usability of the infrastructure. Our inventory of existing components shows that there is substantial overlap regarding content, but still a rather inconsistent use of similar concepts in this area. We aim to develop a first draft of a CLARIN general information component based on this inventory and further requirements pertaining to adequacy, interoperability, and reusability. As a closely related topic, we also aim to address the issue of mapping between various schemas in use within and outside of the LRT community. Contextual semantics introduce significant complexity to such mapping tasks. As an alternative to previously considered relation registry based approaches, we suggest exploring the option of an augmentation framework for CMD to allow for in-line mapping information under the control and responsibility of the metadata modeller. At our bazaar stall, we are looking forward to getting your input and hope that some of you are interested in joining us in the further exploration and implementation of these enhancements.

     

Towards CLARIN recommended formats: a bottom-up approach

Hanna Hedeland, Piotr Bánski

 

In the lifetime of the CLARIN initiative, several more or less official “lists of recommended standards” have been proposed, several documents have been produced on that topic, a "format registry" has been set up, and a few surveys have been circulated. Many of the proposed lists are publicly accessible under the CLARIN label and they are sometimes very different in content: in the range of norms covered, granularity of versioning, internal classification and predicted use. This has resulted in a general feeling of uncertainty as to what and how exactly is recommended for CLARIN centres, and what their users can expect. The CLARIN Standards Committee has undertaken the task of preparing a uniform list of standards accepted and/or recommended by and for CLARIN centres, and our goal at the Bazaar is to make sure that the bottom-up part of that list reflects the actual practices employed at the individual centres. Part of that has already been completed as a result of a questionnaire. We invite the representatives of the centres to come over, verify the current state of the list and possibly extend it.

     

Perception in Online Restaurant Reviews

Hyun Jung Kang

 

The number of consumer reviews posted on the internet has exploded in recent years and, as a result, a new form of word of mouth has emerged: electronic word of mouth (eWOM). In a world characterized by three defining properties (3Vs - volume, variety and velocity), it has become very common in decision-making processes to rely on the eWOM of anonymous reviewers, which can be easily accessed on websites with large user-generated databases. To date, a majority of studies in natural language processing (NLP) have focused on extracting positive and negative opinions expressed in text and also the targets of these opinions. However, we take a slightly different approach: our hypothesis is that the function of online reviews is not just to evaluate a restaurant or to summarize aspects of a restaurant. Instead we consider a review as a perception of one’s experiences, that is, the way in which the experience at the restaurant is regarded and understood by reviewers. We test this hypothesis by applying linguistic and NLP methods to corpora of online restaurant reviews.

     

GROBID-Dictionaries: Infrastructure for Automatically Structuring Digitised Dictionaries and Entry-based Documents

Mohamed Khemakhem

 

GROBID-Dictionaries is the first machine learning infrastructure for structuring digitised dictionaries into TEI-compliant resources. The system’s architecture relies on a cascading approach for information extraction from textual information in PDF documents. The implemented pluggable models in GROBID-Dictionaries have shown enough flexibility to be applicable to a wide range of entry-based documents containing lexical or encyclopaedic information, from dictionaries to address directories. The usability has also been enhanced to ease the setup and the training of the system. The tool is still under development, and feedback from new users with new samples is very welcome.

     

The CLARIN ERIC deployment infrastructure and its applicability to reproducible research [Poster]

Alexander König, Egon Stemle, André Moreira, Willem Elbers

 

Over time a number of requirements and technological preconditions for the CLARIN ERIC central infrastructure have been defined. An introduction is provided on how containerization using Docker can help to meet these requirements. A fleshed out build and deployment workflow, that CLARIN ERIC is employing to ensure that all the goals for the central infrastructure are met in an efficient and sustainable way, is presented. In a second step, it is also shown how these same workflows can help researchers, especially in the fields of computational and corpus linguistics, to provide for more easily reproducible research by creating a virtual environment that can provide specific versions of data, programs and algorithms used for certain research questions and make sure that the exact same versions can still be used at a later stage to reproduce the results.

     

A laboratory for the study of languages' rhythmic and intonation standard and variation

Valentina De Iacovo

 

The laboratory deals with speech data of different languages and dialects and is engaged in their description in terms of phonetic/sentence features (with different goals, from didactic data extraction to the evaluation of legal-professional aspects). For some years now, the LFSAG has been working on the creation of speech archives of dialectal and regional data (www.lfsag.unito.it).

     

Mylly – The Mill: A platform for processing and analyzing your language data [Poster]

Mietta Lennes and Jussi Piitulainen

 

Mylly is a data analysis platform where language researchers can process their data in a graphical user interface. Users can populate their personal, persistent sessions in the Mylly workspace by importing local files (or files available on the web), by querying the Korp API directly from Mylly, and by processing their files in Mylly. The files resulting from each transformation can be examined directly in the user interface, or processed further, or exported locally. Mylly automatically tracks the workflow. Notes can be added, and workflows can be repeated on new files. Current tools include morphosyntactic analysis of plain text, automatic speech recognition for Finnish, finite-state transducer technology, conversions, some statistics, and a general relational toolkit. Tools to manipulate VRT documents (annotated tokenized text) are forthcoming.

Mylly is based on the open source Chipster platform, developed for bioinformatics at CSC – IT Center for Science. The previous Java client is being replaced with standard HTML5 technology that runs directly in a regular browser. The new version supports federated login such as HAKA or eduGAIN. The new backend uses OpenShift container technology that can distribute resources on virtual servers transparently and scalably. The current tools are being configured for the new Mylly implementation, which we expect to roll out in a few months. 

You are welcome to take a peek at the new Mylly and to discuss your wishes with us!

Find out more: https://www.kielipankki.fi/support/mylly

     

CLaSSES: a tool for exploring non-literary Latin [Poster]

Giovanna Marotta, Francesco Rovai, Irene De Felice, Lucia Tamponi

 

In spite of the growing interest in digital humanities, little attention has been paid to the development of resources specifically designed for linguistic studies on non-literary Latin. While collecting a large amount of data, the available databases are skewed and cannot provide linguists with rich qualitative and quantitative linguistic information focused on specific phenomena.


CLaSSES (http://classes-latin-linguistics.fileli.unipi.it/en) is a digital tool which allows users not only to access non-literary Latin texts but also to perform quantitative linguistic analysis on them. The corpus (more than 26000 tokens) includes documents written on different material (inscriptions, writing tablets, letters) in different periods and provinces of the Roman Empire: more than 1200 Latin inscriptions from Rome and Italy (4th century BC–1st century AD); 200 ink-written tablets from Vindolanda (Roman Britain, 1st–3rd century AD); 219 letters from the North-African and Near-East areas (1st century BC–6th century AD).

All texts have been tagged with linguistic information (including lemmatization and annotation of spelling variants) and other metadata (including dating, place of provenance and textual typology of the documents).


At the bazaar, we will demonstrate how users can exploit CLaSSES to search through ancient texts, to evaluate the statistical incidence of linguistic phenomena and to interpret them in the light of the socio-historical context of the Roman world.

     

CLARIAH and dataLegend: Linking Social History Data on the Web

Albert Meroño Peñuela

 

DataLegend is the platform developed in WP4 of CLARIAH, the Dutch Common Lab Research Infrastructure for the Arts and the Humanities. In dataLegend we combine the needs of social history research (a way of studying history by discovering patterns hidden in large amounts of population registers, like censuses, birth certificates or work affiliations) with the principles of Linked Data, knowledge graphs, and open science. To make this a reality, dataLegend provides tools enabling users to create Linked Data quickly and easily from their CSV files, possibly the most popular format to encode tables of diverse data. At the Bazaar, we will present four tools that make up the ecosystem and cycle of dataLegend: Druid, COW, cattle, and grlc. Druid is an efficient knowledge graph storage and querying database for fast Linked Data publishing. COW and cattle are command line tools and web services enabling users to create Linked Data easily from CSV files. grlc is an automatic API builder that leverages social and collaborative SPARQL query writing in order to make access to Linked Data portable and reproducible. We will show these tools individually on real-life datasets, and how they can be combined in an integrated pipeline for more open and reproducible research workflows.
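
To give a feel for what creating Linked Data from a CSV file involves, here is a minimal Python sketch using only the standard library. It is not how COW or cattle work internally; the `http://example.org/` namespace, the function name and the sample table are all hypothetical, and column names are assumed to be safe to use inside an IRI.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace, for illustration only


def csv_to_ntriples(csv_text, id_column):
    """Emit N-Triples: one subject per CSV row, one predicate per column."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}record/{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or value == "":
                continue  # the id becomes the subject; empty cells yield no triple
            # Escape backslashes and quotes so the literal stays valid N-Triples.
            literal = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} <{BASE}prop/{column}> "{literal}" .')
    return triples


# Hypothetical population-register extract.
sample = "id,occupation,birth_year\n1,weaver,1850\n2,farmer,1862\n"
for line in csv_to_ntriples(sample, "id"):
    print(line)
```

A real conversion would also map columns to shared vocabularies and assign datatypes to values, which is the kind of modelling decision the dedicated tools are meant to support.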

     

LETTERE: LETters Transcription Environment for REsearch

Giovanni Moretti, Stefano Menini, Rachele Sprugnoli

 

At this stall we will present LETTERE (LETters Transcription Environment for REsearch), a new standalone, platform-independent application specifically designed for the transcription of correspondence in accordance with requirements defined by history scholars. The development of LETTERE is part of the National Edition of De Gasperi’s correspondence, a project launched in 2017 with the support of the Italian Ministry of Cultural Heritage and Tourism that, for the first time, funded a National Edition in digital format.

     

The latest and the greatest on GrETEL 4 

Jan Odijk

 

GrETEL is a treebank search application whose crucial feature is the option to query by example (QBE). It was originally developed by KU Leuven [Augustinus et al. 2012]. In Utrecht we added more functionality, some of which we already reported in [Odijk et al. 2018]. In particular: (1) the possibility to upload one’s own data with its metadata in a variety of formats (CHAT, TEI, FoLiA, plain text) and turn them into a parsebank, which can then be queried; (2) the option to filter search results on the basis of metadata; (3) a graphical interface to drag nodes from a query into a pivot table in combination with metadata; (4) improved support for checking XPath queries which have been hand-crafted or are modified versions of the XPath query generated by QBE. We will demonstrate this functionality using a Dutch CHILDES corpus (Van Kampen Corpus) as an example.

[Augustinus et al. 2012] Liesbeth Augustinus, Vincent Vandeghinste, and Frank Van Eynde. 2012. Example-based treebank querying. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Asunción Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). European Language Resources Association (ELRA), Istanbul, Turkey.

[Odijk et al. 2018] Jan Odijk, Martijn van der Klis, and Sheean Spoel. 2018. Extensions to the GrETEL Treebank Query Application. In Eduard Bejcek, editor, Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories (TLT16), pages 46-55. Charles University, Prague, Czech Republic. http://aclweb.org/anthology/W/W17/W17-7608.pdf.
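
For readers unfamiliar with XPath queries over a treebank, the following Python sketch runs a simple query on a tiny, hand-made parse tree. The XML below is a simplified, hypothetical structure loosely modelled on node-based treebank formats, and the query uses only the XPath subset supported by Python's standard `xml.etree.ElementTree`; it is not a query generated by GrETEL.

```python
import xml.etree.ElementTree as ET

# Hypothetical parse of the Dutch sentence "de kat slaapt" ("the cat sleeps").
TREE = """
<alpino_ds>
  <node cat="smain">
    <node cat="np" rel="su">
      <node pt="lid" rel="det" word="de"/>
      <node pt="n" rel="hd" word="kat"/>
    </node>
    <node pt="ww" rel="hd" word="slaapt"/>
  </node>
</alpino_ds>
"""

root = ET.fromstring(TREE)

# Find every head ("hd") node that is a direct child of a noun phrase ("np").
np_heads = root.findall(".//node[@cat='np']/node[@rel='hd']")
words = [node.get("word") for node in np_heads]
print(words)
```

The verb is also marked as a head, but its parent is the clause rather than a noun phrase, so the query does not return it. Small structural conditions of this kind are what a query-by-example interface generates on the user's behalf.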

     

An Aligned Resource for Text Simplification

Evelina Rennes, Arne Jönsson

 

We will present our work towards data-driven text simplification for Swedish. We have collected and aligned a corpus of easy-to-read and standard language sentences from texts collected from websites of Swedish authorities and municipalities. The resource comprises all publicly available easy-to-read web texts from Swedish authorities and municipalities, as well as the rest of the texts written in Standard Swedish. To align the corpora, we evaluated three alignment methods, based on the similarity between word embeddings, that previously have shown promising results for English, resulting in a publicly available resource. The corpus will be used for automatic text simplification but is also an important resource in other projects on, for instance, manual text simplification and assessment of text complexity. The corpus complements the corpora collected at SweClarin/Språkbanken, and as the corpus comprises simplifications carried out by humans, it is also a unique resource for studies of how humans simplify formal language. Our vision is to use this resource to create a system that is able to simplify a text automatically, but this is not an obvious task. How do we move on from here? What results can we expect? We will demonstrate the corpus and discuss our future ideas.
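
The abstract does not name the three alignment methods, so the following Python sketch shows only one generic way to align sentences by word-embedding similarity: average the word vectors of each sentence and pair each easy-to-read sentence with its most similar standard sentence by cosine similarity. The toy 3-dimensional vectors, the English example words, the function names and the 0.8 threshold are all assumptions for illustration.

```python
import math

# Toy embeddings (hypothetical); real vectors would come from a trained model.
EMBEDDINGS = {
    "cat": (1.0, 0.0, 0.0), "feline": (0.9, 0.1, 0.0),
    "sleeps": (0.0, 1.0, 0.0), "rests": (0.1, 0.9, 0.0),
    "tax": (0.0, 0.0, 1.0), "levy": (0.0, 0.1, 0.9),
    "due": (0.1, 0.0, 0.9),
}


def sentence_vector(sentence, embeddings):
    """Average the vectors of all known words in the sentence."""
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    dims = len(next(iter(embeddings.values())))
    if not vectors:
        return [0.0] * dims
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def align(easy, standard, embeddings, threshold=0.8):
    """Pair each easy sentence with its most similar standard sentence."""
    pairs = []
    for e in easy:
        ev = sentence_vector(e, embeddings)
        best = max(standard, key=lambda s: cosine(ev, sentence_vector(s, embeddings)))
        if cosine(ev, sentence_vector(best, embeddings)) >= threshold:
            pairs.append((e, best))
    return pairs


EASY = ["cat sleeps", "tax due"]
STANDARD = ["feline rests", "levy due"]
print(align(EASY, STANDARD, EMBEDDINGS))
```

The threshold keeps easy-to-read sentences with no sufficiently similar counterpart out of the aligned resource instead of forcing a poor match.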

     

CLARINO

Victoria Rosén, Koenraad De Smedt

 

CLARINO is the Norwegian part of CLARIN. The following four centers have been established in Norway:

• CLARINO Bergen Center (University of Bergen) [Type B]
• The Text Laboratory (University of Oslo) [Type C]
• Språkbanken (National Library of Norway, Oslo) [Type C]
• TROLLing (UiT The Arctic University of Norway, Tromsø) [Type C]

Data and tools at these centers cover about 70 languages and may also be relevant for researchers outside of Norway. Some demos will be given.

     

DIGITAL HUMANITIES FOR EUROPE. "Commerce" a multicultural review between the two wars

Antonietta Sanna, Caterina Fiorani, Federico Boschetti

 

Computer for demo, posters. 
Demo of the digital edition of “Commerce - Cahiers trimestriels publiés par les soins de Paul Valéry, Léon-Paul Fargue, Valéry Larbaud”, a most important European review of literature. Published in Paris from 1924 to 1932, Commerce contributed to the diffusion of English, German, Russian and Italian Modernist literature, publishing translations of authors like Joyce, Woolf, Faulkner, Rilke and Ungaretti.
Size: 6000 pages
Languages: English, French, German, Italian.

Purpose:
Discuss how to facilitate research in European cultural studies.
Discuss how to improve the visibility of private archives.

Acknowledgement: Fondazione Camillo Caetani, Università di Pisa, Università di Perugia.

     

Querying and navigating historical mono- and bilingual corpora for DH: new approaches

Eva Sassolini, Sebastiana Cucurullo, Alessandra Cinini

 

We present online text analysis systems for the valorization of digital corpora of great historical, scientific, linguistic and cultural relevance. Our aim is, on the one hand, the long-term preservation of all texts and, on the other, to improve content usability and increase user interaction. We also make text resources more attractive with new graphic views and links that expand the possibilities for analysis. Besides decoding the original text format, encoding it in a standardized format, indexing the contents and extracting linguistic resources from the text, we integrate visual analytics and distant reading techniques into our text analysis systems to improve their query functionalities. The text valorization process starts with the recovery of text materials. Once we achieve an appropriate representation of the text, we implement new approaches for its promotion and dissemination.

We will show some web applications realized within national and regional projects, in particular: Galileo’s epistolary archive; The Digesta Iustiniani parallel corpus online query system (Latin – Italian); Atlante Lessicale Toscano online query system (ALT-Web) with Gabmap functionalities.

     

Use-Case Documentation of Research Infrastructure Services and User Community Needs

Melanie Grumt Suárez and Erhard Hinrichs

 

At this stall, we introduce current projects conducted by scholars from humanities and social science disciplines who participate in CLARIN-D and who use CLARIN(-D) resources and tools in their research. We are focusing on user community needs, and we have therefore begun documenting use cases of research infrastructure services. We kindly invite all participants to join us with their expertise and knowledge of using CLARIN tools and language data to solve research questions, in order to make such use case documentation available in a virtual library for the research community and the public.

Description poster #1 [Use Case Library]: “A Virtual Library of Use Case Documentations: Media Formats – Documentation Scheme – Dissemination”

Members of Working Group 1 “German Philology” have started to document diverse scientific use cases in different media formats (step-by-step tutorial, screencast, and expert interview). The use cases are structured in four parts: (1) name the research questions, (2) present the digital language data and analysis tools for scientific work with digital language data, with CLARIN-D URLs, (3) explain the modus operandi and usage, and (4) outline the results and, where appropriate, point to existing scholarly presentations of the results.

This coordinated use of digital language data and digital tools in the CLARIN-D context aims to solve research questions in German studies. Furthermore, the goal is to provide a virtual library of use case documentation which illustrates research and academic teaching with CLARIN-D resources within German studies. The YouTube channel CLARINGermany hosts the virtual library with the screencasts and expert interviews; the CLARIN-D website and blog will show further material such as step-by-step tutorials.

Links and further reading:

CLARIN-D Use Case Library: https://www.youtube.com/playlist?list=PLYxx1t2OIuvjJWG3QQAQXtXcgcjIdPonD

CLARIN-D Blog: https://www.clarin-d.net/de/blog-clarin-d

Description poster #2 [Curation Projects]: “Language Data: Curation Projects and Integration Scenarios”

The aim of the curation projects was to identify key language data and resources and then to sustainably integrate those data into the CLARIN-D infrastructure. In the area of German studies, three curation projects made it possible to add three types of data to the CLARIN-D centres: historical texts of the 15th-19th centuries, spoken language data, and computer-mediated communication (CMC). In addition, we developed integration scenarios that can be used for further integration use cases in a sustainable way.

Further information about the curation projects WG 1: 

CP1: “Curation and integration of historical text resources of the 15th-19th century into the CLARIN infrastructure (WG 1)”: https://www.clarin-d.net/en/curation-project-1-1-german-philologie

CP2: “Curation and integration of spoken academic language resources of the GeWiss project into the CLARIN infrastructure (WG 1)”: https://www.clarin-d.net/en/curation-project-1-2-german-philologie

CP3: “ChatCorpus2CLARIN: Integration of the Dortmund Chat Corpus into CLARIN-D”: https://www.clarin-d.net/en/curation-project-1-3-german-philology

Please have a look at the CLARIN-D dossier which presents all the Use Case Documentation on the YouTube channel CLARINGermany and the Curation Projects of WG 1.

     

Archiving and analysing spoken language data at The Language Archive

Paul Trilsbeek, Caroline Rowland

 

We will demonstrate the current state of affairs at The Language Archive (TLA) with respect to accessing, depositing and analysing collections of spoken/signed language resources, as well as present some ideas for future developments. 

In early 2018, TLA migrated its holdings from a repository system built in-house to a new system which is largely based on the open source Islandora and Fedora Commons software. This system (labelled FLAT) is being developed together with the Meertens Institute. The setup at TLA includes a built-in deposit interface, which allows researchers themselves to upload new data and create the corresponding CMDI metadata using web forms. In terms of discovering materials, the focus is now more on “facets” extracted from the CMDI metadata than on the hierarchical structure of the collections.

Annotating and analysing language recordings with ELAN or similar tools is still a hugely time-consuming task, as it involves a lot of manual work. Automatic detection of some relevant features or events in the audio and video signals typically only works on materials that were recorded under ideal circumstances. A first step towards making these automated tools capable of dealing with more realistic recordings is to create large amounts of training data that are systematically annotated for the kinds of features we would like to extract. The DARCLE consortium has proposed an annotation schema for this purpose.

     

DH Course Registry

Tanja Wissik & Hendrik Schmeer

 

Are you teaching a DH course or a course in a related field? Do you want to make your course more visible outside your university network? Do you want to attract more (foreign) students? Do you want to shape the landscape of DH teaching in Europe and beyond and put your own activities on the map?
Then come to our stand at the CLARIN Bazaar and get to know the DH Course Registry (url: https://registries.clarin-dariah.eu/courses/), a joint effort of CLARIN ERIC and DARIAH-EU, designed to showcase DH classes and encourage enrolment across Europe and beyond. 
Increasing the visibility of DH training activities, both at the local and at the European level, is of great concern to the DH community, not only in order to attract more students, but also as a way of consolidating DH as an academic discipline. As traditional academic structures are rather resistant to the inherent interdisciplinarity of DH initiatives, we have to look to other dissemination channels and go beyond barriers to reach our public outside the individual university. The DH Course Registry was built for just this purpose.
We will be happy to tell you more about the DH Course Registry at the CLARIN Bazaar.

     

Join ELEXIS: The Benefits for Institutions to join the European Lexicographic Infrastructure

Tanja Wissik & Anna Woldrich

 

ELEXIS (European Lexicographic Infrastructure) is an H2020 project whose goal is to establish an infrastructure that opens up dictionaries, linguistic data and language tools for European communities. While EU institutions can no longer join ELEXIS as partners, they are very welcome to join the project with observer status.

Observer status can be granted to research institutions or other legal entities that are involved in lexicography or are active in fields connected with the goals of the ELEXIS infrastructure project, such as natural language processing and artificial intelligence. Observers are expected to contribute to the ELEXIS project and infrastructure, either in the form of (a sample of) lexicographic/lexical data created at the observer institution, or by providing a type of expertise required by ELEXIS. This expertise would relate either to lexicography or to fields such as natural language processing and artificial intelligence.
As an observer your institution will benefit in the following ways:
-    Inclusion in ELEXIS activities (without direct funding)
-    Inclusion in ELEXIS communication channels
-    Access to (open) data and tools (for research purposes)
-    Active promotion of participation in trans-national access calls
-    Opportunity to participate in the ELEXIS post-project developments
If you want to know more about ELEXIS and how your institution can become an observer, come and visit us at the CLARIN Bazaar or visit www.elex.is

     

Development of Public Cloud NLP Services (SEMANTIKA2)

Andrius Utka and Tomas Krilavičius

 

We would like to present SEMANTIKA2, a project funded by the European Structural Funds that was launched in 2018 and is being implemented by Vytautas Magnus University and partners from the private sector. During the project, the architecture of the Information System for Syntactic and Semantic Analysis of the Lithuanian language (LKSSAIS) will be made cloud-ready, most NLP components will be reworked and implemented, and new public cloud NLP services will be provided. All component source code and all resources of the new project will be accessible via the CLARIN-LT repository.

     

CLARIN support for multimodality

Maria Eskevich, Franciska de Jong

CLARIN ERIC

 

More than a third of the abstracts accepted for the 2018 edition of the CLARIN Annual Conference address one or more topics from the multimodality spectrum: multimedia services, mixed-media analysis, support for speech disorders, multimodal dialogue annotation, interview processing, etc.

This apparent growth of interest in resources that are not limited to text raises the question of whether the CLARIN infrastructure is fully ready to serve the needs of providers and users of the materials and tools that play a role in multimodality work.

Please come to the CLARIN ERIC stall to share your requirements, wishes and dreams and help us to understand what could be done to serve the relevant communities better.

     

CLARIN metadata best practices, curation and support [Poster]

CMDI and Metadata Curation task forces

 

The CLARIN community is gradually switching to CMDI 1.2 as more and more of its features are supported by the infrastructure. At this bazaar stall the CMDI task force is available to answer any questions you have about the use of component metadata in CLARIN. Its members are also glad to discuss the best practices guide, which is under continuous development. We appreciate any feedback or suggestions in this regard. The best practices guide is developed by the CMDI task force in close cooperation with the Metadata Curation task force. Tooling and workflows for metadata curation in the CLARIN infrastructure are under active development, and we are happy to discuss issues and receive any feedback on this topic as well.

     

Europeana Research. The reuse of digital cultural heritage in the e-infrastructures domain.

Alba Irollo

 

In 2018 Europeana, the digital platform for cultural heritage funded by the European Commission, reached its 10th anniversary. It is managed by the Europeana Foundation, which addresses its activities to researchers, research institutions and e-infrastructures through Europeana Research. The latter also leads a Research Community within the Europeana Network Association, which is meant for professionals at cultural heritage institutions dealing with researchers’ needs.

The digitised content in the Europeana platform has a huge potential that is waiting to be exploited for research. All scholars who use cultural heritage as a source could potentially be interested in reusing this content, which currently consists of 58 million items. Five Europeana APIs are already available for this purpose. In the context of the CLARIN Annual Conference 2018, Europeana Research will be presenting the two Europeana collections that are most relevant for research in Linguistics: the Manuscript Collection and the Newspapers Collection. Europeana Research will also give a presentation at the satellite event DH Foresight: Gleaning the future of digital methods and infrastructures for the humanities.

     

POETRY LAB POSTDATA ERC PROJECT [Poster]

POSTDATA ERC- UNED Salvador Ros, Elena González-Blanco

 

POSTDATA is an ERC project devoted to poetry standardization and linked open data. It is based on three pillars: semantic modelling, poetry tools and a virtual research infrastructure. http://postdata.linhd.es.

Poetry tools is intended to be an open source marketplace where users can find different tools for processing poetry, from poetry model detection to more complex semantic processes. It is devoted not only to poetry processing as such, but to any process that improves and deepens the understanding of poetry.

Two tools are presented at this bazaar: ANJA and HISMETAG. The first deals with the concept of enjambment; the second deals with named entity recognition in medieval texts, and was developed because a lack of knowledge for processing texts from this period was detected.

     

Workshop presentations

ParlaCLARIN@LREC [Poster]

Darja Fišer

 

Workshop page

     

The DELAD initiative: Pathological speech data as a challenge for CLARIN 

Henk van den Heuvel

 

Corpora of disordered speech (CSD) are hard to obtain. They are costly to collect and difficult to share due to privacy issues. 
Moreover, they are often small in size and highly specific in terms of the language impairments addressed. These factors make re-use a challenge on the one hand, and a necessity on the other. We envisage that CSD can be hosted at local CLARIN centres and made findable through a central portal via their (harvested) metadata. CLARIN can offer the standards, best practices and services that are needed for this.

Workshop page

     

Interoperability of Second Language Resources and Tools

Elena Volodina

 

Workshop page

     

Oral History & Technology: Oral History Portal

Arjan van Hessen

 

Workshop page

     

For general information on the Conference, see the event page.