L2 corpora

Introduction

L2 learner corpora play a crucial role in second language research and pedagogy. They allow for the systematic study of how learners acquire a new language at the lexical and syntactic levels, and of how that acquisition is influenced by their native language. A distinctive characteristic of this type of corpus is the markup of learners' errors and prosodic features.

The CLARIN infrastructure provides access to 35 L2 learner corpora. Eleven corpora are multilingual, while the remaining 24 provide written, spoken and even videotaped monolingual L2 data in nine languages: Arabic, Czech, English, Finnish, French, German, Hungarian, Norwegian and Swedish. Many of these corpora are available under public licences.

We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

To comment, suggest changes to the existing content or propose new corpora for inclusion, send us an email.

This website was last updated on 5 September 2018.

Monolingual L2 learner corpora in the CLARIN infrastructure

Written corpora

Corpus Language Description Availability

CzeSL – Czech as a Second Language

Size: 0.9 million words
Annotation: tokenised, PoS-tagged, lemmatised, error labels
Licence: CC-BY 

Czech

This corpus contains essays written in 2013 by learners from 54 L1 backgrounds.

The corpus is available for download from LINDAT.

For a related publication, see Rosen (2016).

Download

British Academic Written English Corpus

Size: 2,761 texts
Licence: CC-BY

English

This is primarily an L1 corpus, although it also contains L2 texts.

The corpus is available for download from the University of Oxford Text Archive.

Download

ETS Corpus of Non-Native Written English

Size: 12,100 essays (1,100 per L1)
Licence: restricted

English

The corpus contains texts written by learners from 11 L1 backgrounds as part of an international test of academic English proficiency. The prompts and the learners' proficiency levels are included in the metadata.

The corpus is available for download from the LDC catalogue.

Download

The Hanken Corpus of Academic Writing

Size: 500,000 words 
Licence: CC-BY

English

This corpus contains academic texts written by Finnish and Swedish native speakers.

The corpus is still under development.

ICLE International Corpus of Learner English

Size: 3 million words

English

This corpus contains texts written by learners of English from 14 L1 backgrounds.

The corpus can be purchased on CD-ROM; a new version (ICLE v.3) is in development.

The Uppsala Student English corpus

Size: 1.2 million tokens
Annotation: tokenised
Licence: CC-BY

English

This corpus contains essays written during the first three semesters of English studies at Uppsala University; most of the essays were written during the first semester. The corpus consists of text files, each identified by a student ID and a text ID that includes the course level. Information about the different prompts is also available.

The corpus is available for download from the University of Oxford Text Archive.

Download

The Advanced Finnish Learners’ Corpus


Size: 288,000 tokens
Annotation: tokenised
Licence: CLARIN RES

Finnish

This corpus contains academic texts written by MA students and collected in 2009.

The corpus consists of two subcorpora, the Exam Essays Subcorpus and the Course Papers Subcorpus, both of which are also available through Korp.

Concordancer

Download subcorpus

Download subcorpus

International Corpus of Learner Finnish (ICLFI)

Size: 1 million words
Annotation: morphological
Licence: CLARIN RES

Finnish

This corpus contains fictional (e.g., letters, narratives) and non-fictional (e.g., essays) texts.

The corpus provides information on a large number of variables concerning the linguistic background of the learner, the learning task, the learning context, etc. It is available through the concordancer Korp.

For a related publication, see Jantunen (2011).

Concordancer

Testipiste Corpus

Size: 840,000 tokens
Annotation: tokenised
Licence: CLARIN RES

Finnish

This corpus contains essays written by adult migrants from various L1 backgrounds.

The corpus will be made available through the concordancer Korp.

Commented Learner Corpus Academic Writing

Size: 853 texts
Licence: academic

German

This corpus contains texts written by students at the University of Hamburg from various L1 backgrounds.

The corpus is available for download through the repository of the University of Hamburg.

Download

ASK – Norsk andrespråkskorpus

Size: 618,000 tokens
Annotation: tokenised, PoS-tagged, errors
Licence: CLARIN RES

Norwegian

This corpus contains essays and tests written by students from 10 L1 backgrounds. It also contains L1 control essays.

The corpus is available through a dedicated concordancer provided by CLARINO.

Concordancer

FinSveStud 79-80

Size: 175,000 tokens
Annotation: tokenised
Licence: CLARIN RES

Swedish

This corpus contains texts written by students with Finnish as their L1 background.

The corpus is available through the concordancer Korp.

Concordancer

SW1203-essays

Size: 52,025 tokens
Annotation: tokenised, PoS-tagged, MSD-tagged, lemgrams, compound word forms
Licence: CC-BY

Swedish

This corpus contains essays.

The corpus is available for download through Språkbanken and through the concordancer Korp.

Together with the Tisus corpus, SW1203-essays is a subcorpus of the pilot SweLL corpus.

Concordancer

Download

Tisus corpus

Size: 60,000 tokens
Annotation: tokenised, PoS-tagged, MSD-tagged, lemgrams, compound word forms
Licence: CC-BY

Swedish

Together with the SW1203-essays, the Tisus corpus is a subcorpus of the pilot SweLL corpus.

The corpus is available for download from Språkbanken and through the concordancer Korp.

Concordancer

Download

Spoken corpora

Corpus Language Description Availability

The Anglish Corpus 

Annotation: interpausal units

English

This corpus contains various speech tasks performed by French native speakers and the associated transcriptions.

The corpus is available for download from Ortolang.

Download

GLBCC (Giessen - Long Beach Chaplin Corpus)

Size: 2,472 words per transcript
Licence: CC-BY

English

This corpus contains film retellings performed by English and German native speakers.

The corpus is available for download from the University of Oxford Text Archive.

Download

ISLE Speech Corpus

Size: approx. 18 hours
Annotation: phone-level annotation, stress errors
Licence: ELRA END USER

English

This corpus contains various speech tasks (reading simple sentences, using minimal pairs, giving answers to multiple choice questions) performed by German and Italian native speakers.

The corpus is available for download from the ELRA catalogue.

Download

A Learners' Corpus of Reading Texts

Licence: CLARIN RES

English

This corpus contains unprepared readings by first-year students at an English department whose native language is French.

The corpus is available for download from Ortolang.

Download

French Learner Language Oral Corpora (FLLOC)

Size: 1,375 transcripts
Annotation: MSD-tagged
Licence: CC-BY

French

This corpus contains various narrative and interactive speech tasks performed by English and Dutch native speakers. 

The corpus is available for download from the University of Oxford Text Archive.

Download

Hamburg Modern Times Corpus

Size: 24,000 words
Annotation: prosody
Licence: CLARIN RES

German

This corpus contains film retellings and the accompanying transcriptions.

LANGMAN

Licence: CC-BY

Hungarian

The Virtual Language Observatory (VLO) lists 11 subcorpora of LANGMAN, a spoken corpus of Chinese native speakers learning Hungarian as a second language.

The subcorpora are available for download from the VLO.

Download

Video and multimodal corpora

Corpus Language Description Availability

Arabic Learner Corpus

Size: 0.3 million tokens
Annotation: tokenised
Licence: CLARIN RES

Arabic

This corpus contains essays written by students from 67 L1 backgrounds. It also contains recordings of speech tasks and associated transcriptions.

The corpus is available for download from the LDC catalogue.

Download

English as a Foreign Language Corpus

Size: 24 hours

English

The corpus contains videotaped lessons involving students at Finnish secondary schools.

The Long Second Corpus

Licence: CLARIN RES

Finnish

This corpus contains written texts, audio recordings and videotaped lessons involving immigrants from the following L1 backgrounds: Estonian, Macedonian, Kurdish, Portuguese, Russian, and English.

The corpus is still in preparation. It is set to be made available on the LAT platform.

Multilingual L2 learner corpora in the CLARIN infrastructure

Written corpora

Corpus Language Description Availability

LETEC (Learning and Teaching Corpus)

Licence: CC-BY

English and French

The VLO lists 7 subcorpora of LETEC, an L2 corpus involving undergraduate and Master's level students who are native speakers of French, Mandarin, Korean, Polish or Arabic.

CEFLING Project Corpus

Finnish and English

This corpus contains texts written by primary and secondary school students (years 7-9).

DIALUKI: Diagnosing reading and writing in a second or foreign language

Size: 8,600 texts
Licence: CLARIN RES

Finnish and English

This corpus contains texts in both Finnish (written by Russian native speakers) and English (written by Finnish native speakers).

The corpus will be made available through Korp.

MERLIN Written Learner Corpus for Czech, German, Italian 1.1

Size: 2,287 texts
Annotation: a wide range of language characteristics that illustrate learner performance and progress across multiple proficiency levels
Licence: CC BY-SA 4.0

Czech, German, and Italian

This corpus contains learner texts produced in standardized language certifications covering CEFR levels A1-C1.

The corpus is available for download from the Eurac Research CLARIN Centre Repository.

Download

Topling - Paths in Second Language Acquisition

Size: 165,000 tokens
Annotation: tokenised
Licence: CLARIN End User Licence Agreement

Finnish, English and Swedish

This corpus contains texts in English, Swedish and Finnish written by students in the Finnish educational system. It is an extension of the CEFLING corpus, which it also includes.

The corpus is available through the concordancer Korp. 

Concordancer

Spoken corpora

Corpus Language Description Availability

AixOx

Size: 40 minutes/task
Licence: restricted

English and French

This corpus contains readings of written texts performed by French and English native speakers.

The corpus is available for download from Ortolang.

Download

LeaP: Learning the Prosody of a Foreign Language

Size: 31 hours
Annotation: PoS-tagged, lemmatised, prosody

English and German

This corpus contains recordings of English and German spoken by non-native speakers from 31 different native language backgrounds.

The corpus is available for download from the Language Archive.

Download

Repiso/Contrefactualité

Licence: CLARIN RES

French, Italian, Spanish

This corpus contains recordings of counterfactual sentences.

The corpus is available for download from Ortolang.

Download

Openprodat

Licence: GNU General Public License

Dutch, English, French, German, Italian, Arabic, Spanish, Hungarian, Japanese, Thai, Norwegian, Chinese

This corpus contains paragraph readings by participants in their L1 and in as many L2s as they felt they could manage.

The corpus is available for download from Ortolang.

For a related publication, see Hirst et al. (2013).

Download

Video and multimodal corpora

Corpus Language Description Availability

TAITO: Written and Oral Data of the TAITO-project

English, French, German, Italian, Swedish

This corpus contains texts written by undergraduate students at the beginning of their studies and videotaped discussions.

YKI National Certificates corpus

Licence: CLARIN RES

Italian, Swedish, Spanish, English, Finnish, German, French, Russian

This corpus contains written and spoken tasks.

Other L2 learner corpora

A further 128 L2 learner corpora are not part of the CLARIN infrastructure. However, since they are already listed on the website of the Catholic University of Louvain and many appear to be unavailable for download or online querying, we do not list them here.

Additional materials

CLARIN workshop on Interoperability of Second Language Resources and Tools, 6-8 December 2017, Gothenburg, Sweden. [html]

Publications on the L2 learner corpora

[Hirst et al. 2013] Daniel Hirst, Brigitte Bigi, Hyongsil Cho, Hongwei Ding, Sophie Herment, Ting Wang. 2013. Building OMProDat: an open multilingual prosodic database.

[Jantunen 2011] Jarmo Harri Jantunen. 2011. Kansainvälinen oppijansuomen korpus (ICLFI): typologia, taustamuuttujat ja annotointi.

[Orr and Quené 2017] Rosemary Orr and Hugo Quené. 2017. D-LUCEA: Curation of the UCU Accent Project Data.

[Rosen 2016] Alexandr Rosen. 2016. Building and using corpora of non-native Czech. [pdf]