This edition of the CLARIN Café is organized by Anas Fahad Khan (ILC-CNR and CLARIN-IT), Penny Labropoulou (ILSP/ARC and CLARIN:EL) and Marco Passarotti (CIRCSE Unicatt and CLARIN-IT). The café will be hosted by Dieter Van Uytvanck.
- Date: 29 April 2021
- Time: 14:00 - 16:00 (CEST)
- Venue: CLARIN virtual Zoom meeting
- Twitter hashtag: #CLARINcafe
A full overview of the scheduled Café sessions can be found on the CLARIN Café page.
Despite the increase in the quantity and coverage of language resources available today, many of them remain locked in data silos, which prevents users from taking advantage of both their individual and combined potential in a truly interoperable way. CLARIN offers a distributed infrastructure for hosting and sharing language resources, which is the first step towards combating resource isolation. The CLARIN Virtual Language Observatory (VLO) operates as a single query access point to multiple meta-collections of resources and tools, but it lacks a way of creating connections between them other than at the metadata level. To realise the full potential of resources, however, they need to be made interoperable at the level of data too, so that they can be queried, accessed and analysed by any tool, as well as linked to other resources with which they share common or complementary information. Within CLARIN, the Federated Content Search initiative represents a step in this direction, with a set of specifications that allow services to access corpora from heterogeneous search engines in a uniform way.
Over the last few years, there has been a growing interest within the Semantic Web and Computational Linguistics communities in applying the principles of the Linked Data (LD) paradigm to the (meta)data of language resources, thus creating Linguistic Linked Data (LLD). According to the LD paradigm, data published on the Web should be interlinked through semantically typed connections that can be accessed and processed with standardized protocols and query languages across different data providers, so that the structure of web data better serves the needs of users. The LD paradigm is a set of best practices and principles stating that unique resource identifiers (URIs) should be used to name things in a way that allows users to look them up, find useful information represented with standard formalisms and discover more things that are linked to those resources. Applying LD principles to interlink linguistic (meta)data from heterogeneous sources that contain complementary information allows for the growth of a graph of interoperable (yet distributed) language resources that can be consumed by applications to perform tasks or assist scholars in their research. Indeed, LD is increasingly being adopted by the Language Resources, Digital Humanities and Computational Lexicography communities, and a large number of efforts are now being devoted to the conversion of language resources to the Resource Description Framework (RDF), the standard model for data interchange on the Web that also forms the basis of LLD.
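The linking idea described above can be illustrated with a minimal sketch. It assumes two hypothetical lexical datasets published by different providers, models each statement as a plain (subject, predicate, object) triple of the kind RDF uses, and follows a typed link from one dataset into the other. All example.org URIs and the names `LEXICON_A`, `LEXICON_B`, `describe` and `linked_descriptions` are invented for illustration; only the `rdfs:label` and `owl:sameAs` URIs are real W3C vocabulary terms. A real deployment would more likely use an RDF library and a SPARQL endpoint than Python sets.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject URI, predicate URI, object URI or literal)

# Real W3C vocabulary terms.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical URIs, assumed purely for this example.
RUN_ENTRY = "http://example.org/lexiconA/entry/run"
LEMMA_42 = "http://example.org/lexiconB/lemma/42"
PART_OF_SPEECH = "http://example.org/vocab/partOfSpeech"
VERB = "http://example.org/vocab/Verb"

# Dataset from a first (hypothetical) provider: an entry with a label and
# a typed link pointing into the second provider's dataset.
LEXICON_A: Set[Triple] = {
    (RUN_ENTRY, RDFS_LABEL, "run"),
    (RUN_ENTRY, OWL_SAME_AS, LEMMA_42),
}

# Dataset from a second (hypothetical) provider holding complementary information.
LEXICON_B: Set[Triple] = {
    (LEMMA_42, PART_OF_SPEECH, VERB),
}

# Because both datasets name things with URIs, merging them is a set union.
GRAPH: Set[Triple] = LEXICON_A | LEXICON_B


def describe(graph: Set[Triple], uri: str) -> List[Tuple[str, str]]:
    """Return the sorted (predicate, object) pairs stated about `uri`."""
    return sorted((p, o) for s, p, o in graph if s == uri)


def linked_descriptions(
    graph: Set[Triple], uri: str, link_predicate: str
) -> Dict[str, List[Tuple[str, str]]]:
    """Describe `uri` and every resource one `link_predicate` hop away from it."""
    result = {uri: describe(graph, uri)}
    for predicate, obj in result[uri]:
        if predicate == link_predicate:
            result[obj] = describe(graph, obj)
    return result


if __name__ == "__main__":
    for resource, facts in linked_descriptions(GRAPH, RUN_ENTRY, OWL_SAME_AS).items():
        print(resource)
        for predicate, obj in facts:
            print("   ", predicate, "->", obj)
```

In this sketch the part-of-speech information is only reachable from the first dataset's entry because the `owl:sameAs` link connects the two URIs; neither dataset contains both facts on its own.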
The CLARIN Café will be specifically devoted to the role that LLD can play in the CLARIN Infrastructure, discussing how the LLD approach can be used to overcome the lack of interoperability between the (many) resources stored in CLARIN. Conversely, the role the CLARIN infrastructure could play in a broader LLD-based ecosystem of semantically interoperable language data and applications will also be analysed. In particular, the Café will focus on the following issues:
- What are the advantages of applying the linked data principles to language resources, NLP tools and language resource metadata?
- What are the currently available models, ontologies and their extensions to represent (meta)data as LLD?
- What are the most challenging issues that affect LLD-based applications, and how can CLARIN help?
- What research projects are already using LLD to make resources accessible, and what are some potential points of collaboration with CLARIN?
- Cimiano, P., C. Chiarcos, J.P. McCrae, and J. Gracia (2020). Linguistic Linked Data. Springer International Publishing. https://www.springer.com/gp/book/9783030302245
CLARIN-specific aspects of Linguistic Linked (Open) Data and/or RDF technology
- Chiarcos, C., C. Fäth, and F. Abromeit (2020). Annotation Interoperability for the Post-ISOCat Era. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 5668-5677).
- Hinrichs, E., N. Ide, J. Pustejovsky, J. Hajic, M. Hinrichs, M.F. Elahi et al. (2018). Bridging the LAPPS Grid and CLARIN. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation.
- Trippel, T. and C. Zinn (2020). Describing Research Data with CMDI — Challenges to Establish Contact with Linked Open Data. In: A. Pareja-Lora, M. Blume, B. C. Lust, C. Chiarcos (eds.), Development of Linguistic Linked Open Data Resources for Collaborative Data-Intensive Research in the Language Sciences. MIT Press.
- Windhouwer, M., E. Indarto, and D. Broeder (2017). CMD2RDF: building a bridge from CLARIN to linked open data. Data Archiving and Networked Services. In: J. Odijk and A. van Hessen (eds.), CLARIN in the Low Countries. London: Ubiquity Press.
14:00 - 14:10 Opening and CLARIN 101 (slides)
14:10 - 14:25 LLD: An Introduction - Christian Chiarcos (slides)
14:25 - 14:40 LLD: Benefits and some ongoing initiatives - Jorge Gracia (slides)
14:40 - 14:55 A view into metadata for LLD - Penny Labropoulou (slides)
14:55 - 15:10 Prefixes Matter. CLARIN and LLD in the light of the LiLa Knowledge Base - Marco Passarotti (slides)
15:10 - 15:15 Break
15:15 - 15:25 LLD: Issues and open challenges - Christian Chiarcos, Jorge Gracia (slides)
15:25 - 15:30 Response - Dieter Van Uytvanck
15:30 - 16:00 Discussion
How to join
Registration is closed.