
Workshop: CLARIN, Standards and the Text Encoding Initiative


CLARIN is a pan-European initiative which aims to build a research infrastructure for language resources, integrating numerous tools and resources in a distributed architecture, and which will respond to the needs of researchers across the humanities and social sciences. CLARIN is being built on open standards, but also with a recognition that standards and guidelines are only one part of a complex jigsaw which needs to be assembled to create reliable, durable and high-quality services. The Text Encoding Initiative is a long-standing community which develops guidelines for the encoding of scholarly texts in XML, and works with associated technologies. This workshop brings together those involved in these two sets of activities to share experiences and knowledge, and to find ways to work together productively in the next generation of infrastructure services.

Attendance at the workshop is a no-cost option when you register for the conference via the website:


09:30 Registration
10:00 Keynote address: TEI for written historical corpora: why and how? - Alexander Geyken (Berlin-Brandenburg Academy of Sciences) abstract presentation (pdf)
11:00 break
11:30 Presentation 1: The new corpus query engine KorAP: connections with CLARIN and the TEI - Andreas Witt & Piotr Bański (Institut für Deutsche Sprache) presentation (pdf)
12:00 Presentation 2: Poio API: a CLARIN-D curation project for language documentation and language typology - Peter Bouda (Centro Interdisciplinar de Documentação Linguística e Social, Minde) abstract (pdf) presentation (pdf)
12:30 Presentation 3: TEI, ALTO and : why we need all of them - Günter Mühlberger (University of Innsbruck) abstract (pdf) presentation (pdf)
13:00 lunch (not provided by conference organisers - see conference website for local restaurants)
14:30 Presentation 4: TEI and the Component Metadata Framework - Matej Durco and Karlheinz Mörth (Austrian Academy of Sciences) abstract (pdf) presentation (pdf)
15:00 Presentation 5: WebLicht's Text Corpus Format: susTEInability of CLARIN-D web services? - Jens Stegmann (University of Stuttgart)
15:30 Panel discussion: Responses: problems and opportunities - Arianna Ciula, Karlheinz Mörth and Laurent Romary
16:00 break
16:30 Panel discussion part 2: Next steps
17:00 End

Background and further information

The organizing committee of this workshop invited proposals for presentations on topics which link together CLARIN and the TEI, including:

  • the role of the TEI in developing standards for CLARIN services,
  • technical issues in the integration of TEI-conformant resources or TEI-aware tools in CLARIN services,
  • barriers and problems with the deployment and linking of CLARIN and TEI technologies,
  • training, awareness and advocacy activities.

Presenters are asked not to simply present an overview of their work, but to focus on precisely how, why (or why not) TEI formats, guidelines and technologies are being deployed, and to go into some technical detail to do this if necessary.

It is hoped that this will be only the start of promoting dialogue and collaboration between CLARIN and the TEI at many levels. One result would be an improved dialogue about the use of the TEI in higher-level initiatives to develop standards for the CLARIN architecture, but another would be enhanced engagement directly with the TEI community of developers and researchers in the many centres and institutions related to CLARIN.


TEI for written historical corpora: why and how?

Dr Alexander Geyken, Berlin-Brandenburg Academy of Sciences

In the first part of the talk I will report on our experiences at the Deutsches Textarchiv (German Text Archive, DTA) with the integration of texts from 15 external corpus projects (some of which used the TEI from the start, some not), including the pros and cons of using the TEI. The second part will explain the motivation behind the DTA-Base format, a strict subset of TEI-P5 that is intended to allow rich structural expressiveness while being as precise as possible, so that the different corpora remain interoperable.

Organizing Committee

Martin Wynne (Chair)
Oxford e-Research Centre
University of Oxford
martin[dot]wynne[at]it[dot]ox[dot]ac[dot]uk

Karlheinz Moerth
Institute for Corpus Linguistics and Text Technology
Austrian Academy of Sciences
Karlheinz[dot]Moerth[at]oeaw[dot]ac[dot]at

Ineke Schuurman
KU Leuven / U.Utrecht
Belgium / the Netherlands
ineke[at]ccl[dot]kuleuven[dot]be

Andreas Witt
Institut für Deutsche Sprache
witt[at]ids-mannheim[dot]de

Xavier Gomez Guinovart
Seminario de Linguistica Informatica
Universidade de Vigo
xgg[at]uvigo[dot]es


Sapienza Università di Roma
Piazzale Aldo Moro 5