CLARIN IT presents: MERLIN - A Written Learner Corpus for Czech, German, and Italian

Submitted by Jakob Lenardič on 15 March 2019

Blog post written by Alexander König, edited by Monica Monachini, Darja Fišer, and Jakob Lenardič

The MERLIN corpus is a written learner corpus for Czech, German, and Italian. It comprises over 2,200 texts (about 1,000 in German, 800 in Italian, and over 400 in Czech) and can be downloaded in various formats from the ERCC repository of the Italian CLARIN consortium. The corpus can also be browsed online via a multi-functional web interface that lets users explore authentic written learner productions in relation to their CEFR classification and annotated learner features.

The corpus has been designed to illustrate the Common European Framework of Reference for Languages (CEFR) with richly annotated authentic learner data. Since its publication in 2001, the CEFR has become the leading instrument of reference for the teaching and certification of languages and for the development of curricula. At the same time, there is a growing concern that the CEFR reference levels are not sufficiently illustrated, leaving practitioners such as teachers, test and curriculum developers, and textbook authors without comprehensive empirical characterizations of the relevant distinctions between the proficiency levels. This is particularly true for languages other than English, where supplementary empirical tools are urgently needed.

The MERLIN corpus was designed to address this demand for three languages, Czech, German, and Italian, by annotating authentic written learner productions and relating them to the CEFR in a methodologically sophisticated way. To create the corpus, the partners relied on existing corpus annotation and search tools as much as possible. As no single tool could fulfill all the annotation requirements, a combination of tools was needed to support the wide range of manual and automatic annotations that had been designed to illustrate the CEFR scales.

The manual annotation, which covers errors and linguistic characteristics of the learner language, was performed with two tools: the Falko add-on for Microsoft Excel, which provides an existing framework for annotating learners’ errors, and the MMAX2 multi-level annotation tool, a flexible GUI-based tool for creating new annotations as well as visualizing them. In parallel with the manual annotation, the developers of CLARIN-IT created a custom UIMA toolchain to enrich the corpus with additional layers of linguistic annotation, such as part-of-speech tagging and syntactic parsing. All in all, the texts were annotated with about 70 different features, covering the orthography, grammar, and lexicon of the learner language as well as specific sociolinguistic and pragmatic characteristics. The latter include, for example, the appropriate use of formality and politeness (e.g. the T/V distinction in German) and of idiomatic expressions such as greetings or closing formulae.

Figure 1: the online search interface of the MERLIN corpus

MERLIN is now mainly used by linguists specialized in learner language, but also by teachers and language test developers, who use its richly annotated authentic examples to improve their methodology. The MERLIN online platform is especially valuable for language teachers, as it provides ready-made usage scenarios in Czech, German, and Italian that show how the corpus can be used for data-driven teaching in a classroom environment. The platform also gives access to several prepared language learning tasks that students can solve by using the corpus. In addition, a YouTube demonstration aimed at language teachers shows how the corpus can be used as part of the syllabus.

Figure 2: a schema of the annotation levels in the corpus, which include the mark-up of both word/sentence- (e.g., orthography) and discourse-level (e.g., errors in achieving coherence) errors

The corpus was collected from 2012 to 2014 within the MERLIN project (“Multilingual Platform for the European Reference Levels: Interlanguage Exploration in Context”). The project was funded by the EU Lifelong Learning Programme with a consortium of seven partners: Technische Universität Dresden (DE) as the Lead Partner, the European Academy Bolzano (IT), Charles University (CZ), telc GmbH (DE), Berufsförderungsinstitut Oberösterreich (AT), Eberhard-Karls-Universität Tübingen (DE), and finally the European Centre for Modern Languages of the Council of Europe (AT) as Associated Partners.

The corpus has also been successfully used in several master’s theses:

Tina Schönfelder. 2014. REQUESTS im Italienischen und Deutschen als Fremdsprache (“REQUESTS in Italian and German as Foreign Languages”).
Tassja Weber. 2013. Verbvalenz und Rektion im Bereich Deutsch als Fremdsprache. Eine korpusgestützte Analyse zweier Verbgruppen (“Verb Valency and Government in German as a Foreign Language: A Corpus-Based Analysis of Two Verb Groups”).
Julia Hancke. 2013. Automatic Prediction of CEFR Proficiency Levels Based on Linguistic Features of Learner Language.

Publications:

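Andrea Abel & Katrin Wisniewski. 2015. MERLIN - die mehrsprachige Plattform für die europäischen Referenzniveaus (“MERLIN - the Multilingual Platform for the European Reference Levels”). Presented at the 6th ÖGSD (Österreichische Gesellschaft für Sprachendidaktik) Conference in Salzburg.
Katrin Wisniewski. Empirisch gestützte Arbeit mit dem GeRS: Zur Einschätzung schriftlicher Leistungen in Deutsch, Tschechisch und Italienisch als Fremdsprachen mit dem Lernerkorpus MERLIN (“Empirically Supported Work with the CEFR: Assessing Written Performance in German, Czech, and Italian as Foreign Languages with the MERLIN Learner Corpus”). Presented at the 26th Congress of the Deutsche Gesellschaft für Fremdsprachenforschung in Ludwigsburg.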
