CLARIN UI Event: Using TEI for representing CMC/social media data

Submitted by Linda Stokman on 25 October 2017

Blog post by Daniel Pfurtscheller (Universität Innsbruck)

On October 4, a workshop titled “How to use TEI for the annotation of CMC and social media resources: a practical introduction” was held in association with the 5th Conference on CMC and Social Media Corpora for the Humanities (cmccorpora17) at Eurac Research, Italy. The goal of the event was to give a practical introduction to the annotation of language data from genres of computer-mediated communication (CMC) and social media using the formats of the Text Encoding Initiative (TEI). The tutorial was run by Harald Lüngen (IDS Mannheim, Germany), Michael Beißwenger (University of Duisburg-Essen, Germany), and Laura Herzberg (University of Mannheim, Germany). This event was funded as a CLARIN User Involvement Event.

Michael Beißwenger started the first part of the workshop with a presentation on why and how one should encode CMC/social media corpora in TEI. Michael talked about issues of interoperability of CMC corpora and ways to close the ‘gap’ between text corpora and spoken language corpora, and outlined the state of the art. He then introduced the participants to the most recent TEI encoding schema draft for representing written interaction in CMC. This TEI schema draft has been developed within the TEI-SIG “computer-mediated communication” in the context of the CLARIN-D curation project ChatCorpus2CLARIN.

In the second part of the workshop, Harald Lüngen presented the basics of TEI. The detailed lesson dealt with the concepts and structural organization of TEI XML documents, the history of TEI, metadata, and best practices for TEI customization. Harald showed how to use the official guidelines and how to handle TEI customizations via ODD documents using Roma, a web application for customizing TEI schemata.

The third part of the workshop was a hands-on session. Under the supervision of Laura Herzberg, the participants encoded CMC data as TEI XML using the oXygen XML editor. Besides learning basic functions and techniques of XML annotation and validation, the participants could practice working with the official TEI guidelines and documentation. There was also ample room for questions and open discussion on representing specific features of CMC/social media data with the features of the customized CMC/CLARIN-D schema.

On behalf of all participants, I would like to thank the speakers and the whole team at Eurac for the friendly atmosphere; Michael Beißwenger, Egon W. Stemle (Eurac Research, Italy), and Ciara R. Wigham (Université Clermont Auvergne, France) for organizing the workshop; and CLARIN for funding it as a User Involvement Event.