Recap on the CLARIN Café: Drinking Coffee in the Afternoon with Czech CLARIN

Submitted by Linda Stokman on 28 April 2021

The CLARIN Café, titled “Drinking Coffee in the Afternoon with Czech CLARIN”, took place via Zoom on 15 April 2021 and was organized by the LINDAT/CLARIAH-CZ team together with CLARIN National Coordinator prof. Eva Hajičová from the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic. It was attended by 50 participants. 

About LINDAT/CLARIAH-CZ and The Czech National Corpus

LINDAT/CLARIAH-CZ and the Czech National Corpus are two nodes of CLARIN in the Czech Republic. The disciplines they cover overlap slightly, but in general the two infrastructures complement each other in the data and services they offer. LINDAT/CLARIAH-CZ covers digital humanities and arts in general, with a particular focus on language technology and access to large and/or richly annotated language resources. The Czech National Corpus concentrates on language as well, taking care of representative corpora of Czech and providing tools for corpus studies and lexicography, including on restricted texts. Both infrastructures bring their own expertise in combining data, tools and services.

Introducing CLARIN ERIC

The Café was opened by Franciska de Jong, CLARIN executive director, with a brief introduction to CLARIN's technical and knowledge-sharing infrastructure.

Introduction by Eva Hajičová

The local organizers and hosts of the CLARIN Café welcomed the participants and expressed their regret that, for obvious reasons, the meeting could not take place in a more pleasant setting than a Zoom meeting. To evoke the cosy atmosphere of a real café, the hosts prepared a number of virtual desserts typical of Prague cafés and served one with each of the presentations.

LINDAT repository by Pavel Straňák

The talk first summarised the CLARIN-DSpace open-source project, which develops an easy-to-use and powerful repository system that fulfills CLARIN B-centre requirements and emphasises a user-friendly approach and broad integrations. After briefly presenting the software and the CLARIN community that maintains it, the talk concentrated on the main features of the system as it is run at LINDAT and on the main benefits from the centre and user perspectives: simple and efficient search, integration with Google Scholar and Dataset Search, data citation that gives authors real citations, versioning of datasets, and other aspects that make the repository conform well to the FAIR guidelines.

Language services by Jan Hajič

The talk focused on language services, tools and applications offered by the LINDAT/CLARIAH-CZ node of CLARIN ERIC. The tools, available as downloadable software packages with their associated machine-learned models, allow users to process text and speech in various ways: machine translation, speech recognition, and multilingual text analysis at various levels, such as basic morphological and syntactic analysis, named entity recognition and text correction. Morphological, valency and semantic lexicons in several languages are also provided, linked to their usage in corpora that are annotated both manually and automatically. Most of the tools and lexicons are also available as services, allowing 24/7 programmatic access from user software and scripts. Each service has its own application, i.e., a web-browser-based user interface that allows for easy testing and for processing smaller volumes of user files. In the talk, the services and applications were illustrated by two examples: Universal Dependencies-based text processing trained on the UD treebank collection in 100+ languages, and the popular, state-of-the-art machine translation service for several language pairs. Some information was also provided about how the services are implemented on the publicly accessible LINDAT/CLARIAH-CZ service cluster of CPUs and GPUs.
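As a rough illustration of what programmatic access to such a service could look like, here is a minimal Python sketch. It is not taken from the talk. The endpoint URL and the parameter names (tokenizer, tagger, parser, model, data) are assumptions that should be checked against the service documentation, and the sample output is hand-written CoNLL-U, not real service output. The sketch only builds a request URL and parses CoNLL-U text, so it runs without network access.

```python
from urllib.parse import urlencode

# ASSUMPTION: endpoint and parameter names; verify against the service docs.
UDPIPE_PROCESS_URL = "https://lindat.mff.cuni.cz/services/udpipe/api/process"


def build_request_url(text, model="english"):
    """Build a GET URL asking for tokenization, tagging and parsing of `text`."""
    params = {
        "tokenizer": "",  # empty value = use the default settings (assumed)
        "tagger": "",
        "parser": "",
        "model": model,   # hypothetical model name, for illustration only
        "data": text,
    }
    return UDPIPE_PROCESS_URL + "?" + urlencode(params)


def parse_conllu(conllu):
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in conllu.splitlines():
        if not line.strip():          # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # comment lines such as "# text = ..."
            continue
        cols = line.split("\t")
        # Keep only regular word lines: 10 columns and a plain integer ID
        # (this skips multiword ranges like "1-2" and empty nodes like "1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        current.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    if current:
        sentences.append(current)
    return sentences


# Hand-written CoNLL-U sample used only to exercise the parser.
SAMPLE = "\n".join([
    "# text = Coffee helps.",
    "1\tCoffee\tcoffee\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\thelps\thelp\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
])

if __name__ == "__main__":
    print(build_request_url("Coffee helps."))
    for token in parse_conllu(SAMPLE)[0]:
        print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```

In a real script the URL would be fetched (for example with urllib.request.urlopen) and the CoNLL-U text extracted from the response before parsing. The exact response format is not described in this recap and would need to be confirmed from the service documentation.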

Corpora in TEITOK by Maarten Janssen

The talk introduced TEITOK, an online corpus platform that can be used not only for searching corpora, but also for maintaining them and for visualizing corpus documents. Corpus documents are kept in tokenized and annotated TEI/XML, without cleaning out any mark-up, which means that documents can be visualized in full. TEITOK is a multi-purpose corpus system, and the visualization options depend on the type of corpus documents: text corpora, facsimile transcriptions, or spoken corpora. The talk provided a brief overview of the core philosophy and features of TEITOK, and then illustrated the options of the system using various corpora. At LINDAT, TEITOK has been integrated as one of the main tools for developing new corpora, and linked with two other corpus tools used at the institute: KonText and PML-TQ.

Czech National Corpus by Michal Křen

The talk introduced the Czech National Corpus (CNC), a long-term project that aims at continuous mapping of Czech by compiling language corpora and developing user applications for working with them. The talk also mentioned the CNC-run K-centre and its services. An overview of the CNC corpora included representative general-purpose corpora of Czech (printed, internet, spoken), the InterCorp multilingual parallel corpus, and hosted corpora of other languages. The second part of the talk introduced three user applications developed at the CNC that CLARIN users could benefit from: Word at a Glance, KonText and Calc. Word at a Glance is a word profile aggregator that presents a comprehensive overview of the behaviour of a word based entirely on corpus data. KonText is a web-based general-purpose concordancer that supports monolingual, parallel and spoken corpora and is also deployed by CLARIN centres outside the Czech Republic. Calc is an interactive statistical calculator that supports the most common corpus research tasks and presents the results in a clear visual format. The introduction of each application was accompanied by annotated screenshots.

Malach Center for Visual History by Jiří Kocián

This presentation introduced the activities and projects carried out at the MALACH Center. The Center’s mission is twofold: to take care of the data to which it provides access, primarily oral history video interviews with Holocaust survivors, and to present them to end users in the most versatile form possible. For that purpose, the Malach User Interface was launched in January 2021 as a simple tool for easy orientation and search across numerous databases, both external ones and those stored internally within the LINDAT/CLARIAH-CZ infrastructure. For the data stored and curated internally by the MALACH Center, several sub-projects are currently in operation to improve the usability of interviews that have so far been accessible only in their raw format. The Malach staff, in cooperation with the University of West Bohemia in Pilsen, generate automated transcripts using the AMALACH software and integrate these interviews into the AMALACH phonetic full-text search interface. Part of the transcripts is manually edited by student trainees within the framework of the Center's international internship program. The students also participate in further development of the keyword structures, created with the use of the LINDAT/CLARIAH-CZ NameTag2 service, in the GIS representation of the resulting geotags, and in SSNA analysis of the entire datasets. This collective effort ensures that older collections of memory sources, whether equipped with now obsolete catalogues or lacking them completely, will not fade into obscurity and will be used by researchers in the future.

There were two comments concerning the activities of the MALACH Center. The first asked whether any of the tools provided by the LINDAT/CLARIAH-CZ Centre had contributed to the new access tools at the MALACH Center. In reply, Jiří Kocián mentioned the cooperation with the LINDAT/CLARIAH-CZ group at the University of West Bohemia in Pilsen, who provided the technical solutions, using in part the USC-provided transcripts. Jan Hajič added that the UWB speech recognition tools had been used for the experiments, but because the transcripts were later corrected by humans, the more accurate human transcripts are now used. The other comment especially appreciated the pedagogical side of the MALACH Center's activities, aimed at teachers (not only Czech but also from abroad), who can use the combined material and the experience from their stay at the Center for teaching purposes at all school levels. The discussion period was also used by Pavel Straňák to demonstrate "live" some possible uses of the applications and services.

