A Recap on the Workshop on Data Management for FAIR CMC Corpora

Submitted by e.gorgaini@uu.nl on 24 November 2021

Blog post by Jennifer-Carmen Frey, Alexander König, Egon W. Stemle

About

On 27 October, a virtual workshop on ‘Data Management for FAIR CMC Corpora’ was held in association with the 8th Conference on CMC and Social Media Corpora for the Humanities (cmccorpora21), which took place in hybrid form, virtually and in Nijmegen, the Netherlands, on 28-29 October. The workshop was organised by the recently established CLARIN K-Centre for Computer-Mediated Communication and Social Media Corpora (CKCMC) with support from CLARIN ERIC.

The workshop was aimed at early career researchers and researchers about to start a new corpus linguistic project involving corpora of computer-mediated communication (CMC). It touched upon key aspects of corpus collection, compilation, publication and data preservation, tailored specifically to the field of CMC corpora.

Thematic Sessions

The workshop was split into four thematic sessions spread across the day and supplemented by two question and answer sessions to allow further discussion and exchange on the topics presented.

The day started with a short introduction to data management and the FAIR guiding principles for data stewardship by Egon Stemle, from the Institute of Applied Linguistics at Eurac Research, Italy. He elaborated on the need for, and added value of, clear and transparent data management in corpus-based linguistic research in general, and addressed issues arising from work with CMC corpora in particular.

After this, Pawel Kamocki, a legal expert on intellectual property and data protection for CLARIN ERIC at the Leibniz Institute for the German Language in Mannheim, gave a talk on intellectual property and legal issues with CMC corpora. He addressed questions about what can and cannot be done with content collected from the web and what individual researchers can do to make their data more reusable while guaranteeing ethical treatment of personal data.

The afternoon session was concerned with more practical issues of data management once the data has been collected. In order to foster interoperability among CMC corpora, Harald Lüngen, specialist for corpus creation and curation at the Leibniz Institute for the German Language in Mannheim, and Michael Beißwenger, professor of German Linguistics and Language Teaching at the University of Duisburg-Essen and convener of the TEI special interest group “computer-mediated communication”, presented different data formats and standards for the representation of language resources. These included a schema for the representation of CMC corpora in TEI, which is currently on its way to being officially integrated into the Text Encoding Initiative (TEI) and is already used by several corpora in the CLARIN Resource Family for CMC corpora.

The last presentation of the day was given by Alexander König, a representative of CLARIN ERIC and member of the task force on core metadata. He introduced language research infrastructures as well as the metadata formats and standards available to support corpus creators in making their CMC resources more findable, accessible, interoperable and reusable.

We thank all the participants for their active contributions during the workshop and the Q&A sessions.

All the slides and recordings will be available on the workshop webpage.
