L2 learner corpora

Introduction

L2 learner corpora play a crucial role in second language research and pedagogy, allowing for a systematic study of how learners acquire a second language at the lexical and syntactic levels, and how this acquisition is influenced by their native language. A special characteristic of this type of corpus is the markup of the learners' errors and prosodic features.

The CLARIN infrastructure provides access to 74 L2 learner corpora. Eleven corpora are multilingual, while the rest provide written, spoken and even videotaped forms of monolingual L2 data in 13 languages: Arabic, Czech, English, Finnish, French, German, Hungarian, Icelandic, Italian, Mandarin, Norwegian, Spanish, and Swedish. Many of these corpora are available through public licences.

We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

For comments, changes to the existing content, or the inclusion of new corpora, send us an email.

This website was last updated on 9 September 2021.

Monolingual L2 learner corpora in the CLARIN infrastructure

Written corpora

Corpus Language Description Availability

CzeSL – Czech as a Second Language

Size: 0.9 million words

Annotation: tokenised, PoS-tagged, lemmatised, error labels

Licence: CC-BY

Czech

This corpus contains essays written in 2013 by learners from 54 L1 backgrounds.

The corpus is available for download from LINDAT.

For the relevant publication, see Rosen (2016).

Download

British Academic Written English Corpus

Size: 2761 texts

Licence: CC-BY

English

This is primarily an L1 corpus, although it also contains L2 texts.

The corpus is available for download from the University of Oxford Text Archive.

Download

CORYL (Corpus of Young Learner Language)

Size: 191,568 tokens

Annotation: tokenised, anonymised, error labels, linked to CEFR levels

Licence: CC-BY

English

This corpus contains English texts written by Norwegian primary school pupils (7th, 10th, and 11th grade).

The corpus is available through the concordancer Corpuscle provided by CLARINO.

Concordancer

ETS Corpus of Non-Native Written English

Size: 12,100 essays (1100 / language)

Licence: restricted

English

This corpus contains texts written by learners from 11 L1 backgrounds as part of an international test of academic English proficiency. The prompts and the proficiency levels are part of the metadata.

The corpus is available for download from the LDC catalogue.

Download

The Hanken Corpus of Academic Writing

Size: 500,000 words

Licence: CC-BY

English

This corpus contains academic texts written by Finnish and Swedish native speakers.

The corpus is still under development.

 

ICLE International Corpus of Learner English

Size: 3 million words

English

This corpus contains texts written by learners of English from 14 L1 backgrounds.

The corpus can be

 

The Uppsala Student English corpus

Size: 1.2 million tokens

Annotation: tokenised

Licence: CC-BY

English

This corpus contains essays written during the first three semesters of English studies at Uppsala University; most of the essays were written during the first semester. The corpus consists of text files, each with a student ID and a text ID that includes the course level. Information about the different prompts is also available.

The corpus is available for download from the University of Oxford Text Archive.

Download

The Advanced Finnish Learners’ Corpus

Size: 288,000 tokens

Annotation: tokenised, MSD-tagged, lemmatised

Licence: CLARIN RES

Finnish

This corpus contains academic texts written by MA students and collected in 2009.

The corpus consists of two subcorpora, the Exam Essays Subcorpus and the Course Papers Subcorpus, both of which are also available through Korp.

Concordancer

Download

Download

International Corpus of Learner Finnish (ICLFI) Corpus

Size: 1 million words

Annotation: MSD-tagged

Licence: CLARIN RES

Finnish

This corpus contains fictional (e.g., letters, narratives) and non-fictional (e.g., essays) texts.

The corpus provides information on a large number of variables concerning the linguistic background of the learner, the learning task, the learning context, etc. It is available through the concordancer Korp.

For the relevant publication, see Jantunen (2011).

Concordancer

Testipiste Corpus

Size: 840,000 tokens

Annotation: tokenised

Licence: CLARIN RES

Finnish

This corpus contains essays written by adult migrants from various L1 backgrounds.

The corpus will be made available through the concordancer Korp.

 

Commented Learner Corpus Academic Writing

Size: 853 texts

Licence: CC BY-NC-SA 3.0

German

This corpus contains texts written by students at the University of Hamburg from various L1 backgrounds.

The corpus is available for download through the repository of the University of Hamburg.

Download

ASK – Norsk andrespråkskorpus

Size: 618,000 tokens

Annotation: tokenised, PoS-tagged, errors

Licence: CLARIN RES

Norwegian

This corpus contains essays and tests written by students from 10 L1 backgrounds. It also contains L1 control essays.

The corpus is available through a dedicated concordancer provided by CLARINO.

Concordancer

FinSveStud 79-80

Size: 175,000 tokens

Annotation: tokenised, lemmatised

Licence: CLARIN RES

Swedish

This corpus contains texts written by students with Finnish as their L1 background.

The corpus is available through the concordancer Korp.

Concordancer

SpIn

Size: 46,911 tokens; 4,302 sentences

Annotation: tokenised, PoS-tagged, MSD-tagged, lemgrams, compounds word forms

Licence: CC-BY

Swedish

This corpus contains essays from a Language Introduction course for newly arrived students (256 essays; 166 students, some of whom are recurrent) – i.e., course preparation for Swedish upper-intermediate school (gymnasium-level). It is a subcorpus of the SweLL-pilot corpus.

Aside from the automatic linguistic annotation, the corpus is manually annotated for CEFR labels (A1-B2). See the metadata description for further details on the automatic and manual annotation.

The corpus is available through the concordancer Korp and for download in Språkbanken Text / the SweLL infrastructure through an individual application form.

For the relevant publication, see Volodina et al. (2016).

Concordancer (Korp)

Online (application)

SW1203-essays

Size: 52,528 tokens; 3,145 sentences

Annotation: tokenised, PoS-tagged, MSD-tagged, lemgrams, compounds word forms

Licence: CC-BY

Swedish

This corpus contains essays from a preparatory university course, with three essays written by (almost) all students: (1) an entrance essay, (2) a mid-term essay, and (3) a final exam essay, plus (4) a final exam retake for some students. The corpus is therefore longitudinal to some extent. It is a subcorpus of the SweLL-pilot corpus.

Aside from the automatic linguistic annotation, the corpus is manually annotated for CEFR labels (B1-C2). See the metadata description for further details on the automatic and manual annotation.

The corpus is available for download from the Språkbanken Resource List, through the concordancer Korp, and in Språkbanken Text / the SweLL infrastructure through an individual application form.

For the relevant publication, see Volodina et al. (2016).

Concordancer (Korp)

Online (application)

Download

SweLL-gold

Size: 147,842 tokens (original version), 151,851 (normalized version); 7,807 sentences (original), 8,137 sentences (normalized)

Annotation: tokenised, PoS-tagged, MSD-tagged, lemgrams, compounds word forms

Licence: CC-BY

Swedish

This corpus contains essays from various education establishments in Sweden for non-Swedish speaking adult learners.

Aside from the automatic linguistic annotation, the corpus is manually annotated at the following levels: pseudonymization, normalization, and correction annotation. See the metadata description for further details on the automatic and manual annotation. While the SweLL-pilot corpus was collected in 2006–2016, SweLL-gold was collected in 2017–2020.

The corpus is available through the concordancer Korp and for download in Språkbanken Text / the SweLL infrastructure through an individual application form.

For the relevant publication, see Volodina et al. (2019).

Concordancer (original)

Concordancer (normalized)

Online (application)

Tisus corpus

Size: 60,632 tokens; 3,422 sentences

Annotation: tokenised, PoS-tagged, MSD-tagged, lemgrams, compounds word forms

Licence: CC-BY

Swedish

This corpus contains essays from a test situation written by adult learners (105 essays, 105 students; one essay per student). The essays are argumentative, on the topic of stress, and written at an advanced level. This is a subcorpus of the SweLL-pilot corpus.

Aside from the automatic linguistic annotation, the corpus is manually annotated for CEFR labels (B2-C1). See the metadata description for further details on the automatic and manual annotation.

The corpus is available for download from Språkbanken, through the concordancer Korp, and in Språkbanken Text / the SweLL infrastructure through an individual application form.

For the relevant publication, see Volodina et al. (2016).

Concordancer (Korp)

Online (application)

Download

Spoken corpora

Corpus Language Description Availability

The Dresden Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

Czech

The corpus contains speech recordings of ~32 German children learning Czech (type of study: interview).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Kubanek-German (2000)

Concordancer

Download

The Anglish Corpus 

Annotation: interpausal units

Licence: CLARIN RES

English

This corpus contains various speech tasks performed by French native speakers and the associated transcriptions.

The corpus is available for download from Ortolang.

Download

The Barcelona English Language Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

English

This corpus contains speech recordings of Spanish children and teenagers learning English in Barcelona across 4 tasks (written composition, oral narrative, oral interview and role play).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Muñoz (2006)

Concordancer

Download

The Barraja-Rohan Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

English

This corpus contains speech recordings of adult international students who spoke English as a second language and who had newly arrived at an Australian university. These undergraduate international students from various Asian backgrounds interacted over a period of seven months with Australian graduate students who were native speakers of English.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Barraja-Rohan (2013)

Concordancer

Download

The Connolly Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

English

The corpus contains speech recordings of 60 Japanese high school students learning English.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

Concordancer

Download

The CUHK corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

English

This corpus contains speech recordings of 6 children learning English in Hong Kong.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see MacWhinney (2016)

Concordancer

Download

The Dresden Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

English

The corpus contains speech recordings of ~32 German children learning English (type of study: interview).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Kubanek-German (2000)

Concordancer

Download

GLBCC (Giessen - Long Beach Chaplin Corpus)

Size: 2472 words/transcript

Licence: CC-BY

English

This corpus contains film retellings performed by English and German native speakers.

The corpus is available for download from the University of Oxford Text Archive.

Download

ISLE Speech Corpus

Size: approx. 18 hours

Annotation: phone-level annotation, stress errors

Licence: ELRA END USER

English

This corpus contains various speech tasks (reading simple sentences, using minimal pairs, giving answers to multiple choice questions) performed by German and Italian native speakers.

The corpus is available for download from the ELRA catalogue.

Download

A Learners' Corpus of Reading Texts

Licence: CLARIN RES

English

This corpus contains unprepared readings by first-year students at an English department who speak French as a native language.

The corpus is available for download from Ortolang.

Download

The Markee Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

English

This corpus contains speech recordings of 3 students learning English as a second language.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Markee (2000)

Concordancer

Download

The PAROLE Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

English

This corpus contains speech recordings of 95 students learning English in France (type of study: tasks/storytelling).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Hilton (2009)

Concordancer

Download

The QATAR Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

English

This corpus contains recorded interviews involving 19 Qatari learners of English.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Zhao and MacWhinney (2010)

Concordancer

Download

The Vercellotti Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

English

This corpus contains recordings of adult learners entering an Intensive English Program (IEP) in the United States during the year 2010. Tasks include 2-minute monologues.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Vercellotti (2017)

Concordancer

Download

The Dresden Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

French

The corpus contains speech recordings of ~32 German children learning French (type of study: interview).

The corpus is a part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Kubanek-German (2000)

Concordancer

Download

French Learner Language Oral Corpora (FLLOC)

Size: 1375 transcripts

Annotation: MSD-tagged

Licence: CC-BY

French

This corpus contains various narrative and interactive speech tasks performed by English and Dutch native speakers.

The corpus is available for download from the University of Oxford Text Archive. The transcripts and audio files can also be downloaded and browsed through TalkBank.

Download

The LANGSNAP Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

French

This corpus contains speech recordings of 28 British undergraduates learning French before, during and after a year abroad. Tasks include oral interviews and story retellings, aside from argumentative writing tasks.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Tracy-Ventura and Huensch (2018)

Concordancer

Download

The LANGSNAP3 Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

French

This corpus is a 3-year follow-up to the LANGSNAP Corpus, involving 18 participants.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Tracy-Ventura and Huensch (2018)

Concordancer

Download

The Newcastle Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

French

This corpus contains intermediate-level spoken French from second language learners aged 17-18, in years 12 to 13 of UK secondary education.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

Concordancer

Download

The PAROLE Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

French

This corpus contains speech recordings of 40 students learning French in France as a second language (type of study: tasks/storytelling).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Hilton (2009)

Concordancer

Download

The Trinity College (TCD) Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

French

This corpus contains recordings of 5 children (2 Irish, 1 Polish, 2 Cambodian) learning French in a school in France.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

Concordancer

Download

The Reading Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

French

This corpus contains oral proficiency interviews with 34 16-year-olds learning French in South Wales.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Chambers and Richards (1995)

Concordancer

Download

The UWI Corpus

Size: 15,068 tokens

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

French

This corpus consists of 25 recorded interviews with learners of French (9 adult learners) in Jamaica.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Péters (2017)

Concordancer

Download

The Dimroth SLA Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

German

The corpus contains speech recordings of 47 students learning German (type of study: interview).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Dimroth (2008)

Concordancer

Download

Hamburg Modern Times Corpus

Size: 24,000 words

Annotation: prosody

Licence: CLARIN RES

German

This corpus contains film retellings and the accompanying transcriptions.

The corpus is available for download from the HZSK CLARIN-D repository.

Download

The RyanDan Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

German

The corpus contains recordings of 4 Carnegie Mellon University students learning German.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Walter (2020)

Concordancer

Download

The VYSA Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

German

This corpus contains recordings of 3 high school students learning German abroad while living with German-speaking host families and attending German secondary schools in standard German-speaking urban and peri-urban regions of Germany.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Young-Scholten and Langer (2015)

Concordancer

Download

The Theodórsdóttir corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

Icelandic

The corpus contains recordings obtained in a longitudinal case study of L2 Icelandic.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Theodórsdóttir (2018)

Concordancer

Download

The PAROLE Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

Italian

This corpus contains speech recordings of 95 students learning Italian in France (type of study: tasks/storytelling).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Hilton (2009)

Concordancer

Download

The COPA Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

Mandarin

This corpus contains speech recordings of ~120 college students learning Mandarin in Hong Kong (type of study: responses to questions).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Zhang (2009)

Concordancer

Download

The HKPU Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

Mandarin

This corpus contains speech recordings of 20 college students learning Mandarin in Hong Kong. The tasks involve oral interviews.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Chang et al. (2013)

Concordancer

Download

LANGMAN

Size: 11 subcorpora

Annotation: error coding

Licence: CC-BY

Hungarian

This corpus is a spoken corpus involving Chinese native speakers who learn Hungarian as a second language.

The subcorpora are available for online browsing and download via TalkBank.

Concordancer

Download

The BCN-L2 Corpus

Annotation: error coding

Licence: public (acknowledgment required)

Spanish

This corpus contains speech recordings of Moroccan Arabic and Berber speakers learning Spanish. The participants were 88 native speakers of Moroccan Arabic (Darija) and 26 speakers of Berber (Amazigh) living in Catalonia.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Bet et al. (2016)

Concordancer

Download

The Díaz Rodríguez Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

Spanish

This corpus contains speech recordings of Indo-European and Asian learners, both semi-spontaneous and experimental, obtained in Barcelona, Spain (type of study: naturalistic, longitudinal).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Díaz (2002)

Concordancer

Download

The LANGSNAP Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

Spanish

This corpus contains speech recordings of 27 British undergraduates learning Spanish before, during and after a year abroad. Tasks include oral interviews and story retellings, aside from argumentative writing tasks.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Tracy-Ventura and Huensch (2018)

Concordancer

Download

The LANGSNAP3 Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

Spanish

This corpus is a 3-year follow-up to the LANGSNAP Corpus, involving 33 participants.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Tracy-Ventura and Huensch (2018)

Concordancer

Download

The Liceras Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

Spanish

This corpus contains speech recordings of 11 students learning Spanish as a second language.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Liceras et al. (1999)

Concordancer

Download

The Nebrija-CORELE-UA Corpus

Size: 1 hour 27 minutes, 10,292 words

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

Spanish

This corpus contains 10 recorded interviews involving students of Spanish as a Foreign Language at the University of Alicante.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Medina Soler (2017)

Concordancer

Download

The Nebrija-INMIGRA Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

Spanish

This corpus consists of oral interviews carried out in the context of the LETRA test of Spanish for immigrant workers. It is made up of semi-guided interviews carried out in Spanish which last approximately 10 minutes each. The participants are immigrants from 11 different countries who live in the Autonomous Community of Madrid (Spain).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Liceras (2017)

Concordancer

Download

The Nebrija-OAP Corpus

Size: 9 hours 19 minutes, 49,718 words

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

Spanish

This corpus contains 67 videotaped presentations involving 95 North American students of Spanish as a Foreign Language at Nebrija University in Madrid.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Vergara Padilla (2017)

Concordancer

Download

The Nebrija-WOCAE Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

Spanish

This corpus contains recordings of emails written and read by 28 Chinese students.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Vergara Padilla (2017)

Concordancer

Download

The Nicolás Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

Spanish

This corpus contains recordings of two children from Morocco learning Spanish in Spain.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see de Benito (2016)

Concordancer

Download

The SPLLOC1 Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

Spanish

This corpus contains recordings of L2 Spanish in a classroom context. There were 20 learners, all of whom were English native speakers, at each of 3 levels: beginners (Year 9 students aged 13-14), intermediate students (A2 students aged 17-18), and fourth year undergraduates.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Mitchell et al. (2008).

Concordancer

Download

The SPLLOC2 Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

Spanish

This corpus is an extension of the SPLLOC1 Corpus.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Mitchell et al. (2008).

Concordancer

Download

Video and multimodal corpora

Corpus Language Description Availability

Arabic Learner Corpus

Size: 0.3 million tokens

Annotation: tokenised

Licence: CLARIN RES

Arabic

This corpus contains essays written by students from 67 L1 backgrounds. It also contains recordings of speech tasks and associated transcriptions.

The corpus is available for download from the LDC catalogue.

Download

English as a Foreign Language Corpus

Size: 24 hours

Licence: Under Negotiation

English

The corpus contains videotaped lessons involving students at Finnish secondary schools. 

 

The Long Second Corpus

Licence: Under Negotiation

Finnish

This corpus contains written texts, audio recordings and videotaped lessons involving immigrants from the following L1 backgrounds: Estonian, Macedonian, Kurdish, Portuguese, Russian, and English.

The corpus is still in preparation. It is set to be made available on the LAT platform.

The van Compernolle Corpus

Annotation: audio/transcription linking

Licence: public (acknowledgment required)

French

This corpus contains a recorded examination of classroom interactional practices and actions in a beginning-level ESL reading class. Analytic foci include aspects of speech delivery and timing as well as nonverbal behaviors (e.g., eye gaze, gesture).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

Concordancer

Download

Multilingual L2 learner corpora in the CLARIN infrastructure

Written corpora

Corpus Language Description Availability

CEFLING Project Corpus

 

Finnish and English

This corpus contains texts written by primary and secondary school students (years 7-9).

 

DIALUKI: Diagnosing reading and writing in a second or foreign language

Size: 8,600 texts

Licence: CLARIN RES

Finnish and English

This corpus contains texts both in Finnish (written by Russian native speakers) and English (written by Finnish native speakers).

The corpus will be made available through Korp.

 

MERLIN Written Learner Corpus for Czech, German, Italian 1.1

Size: 2287 texts

Annotation: a wide range of language characteristics that provide researchers with concrete examples of learner performance and progress across multiple proficiency levels.

Licence: CC BY-SA 4.0

Czech, German, and Italian

This corpus contains learner texts produced in standardized language certifications covering CEFR levels A1-C1.

The corpus is available for download from the Eurac Research CLARIN Centre Repository.

Download

Topling - Paths in Second Language Acquisition

Size: 165,000 tokens

Annotation: tokenised

Licence: CLARIN End User Licence Agreement

Finnish, English and Swedish

This corpus contains written texts in English, Swedish and Finnish produced by students in the Finnish educational system. It is an extension of the CEFLING corpus, which it also includes.

The corpus is available through the concordancer Korp.

Concordancer

Spoken corpora

Corpus Language Description Availability

AixOx

Size: 40 minutes/task

Licence: restricted

English and French

This corpus contains readings of written texts performed by French and English native speakers.

 

LeaP: Learning the Prosody of a Foreign Language

Size: 31 hours

Annotation: PoS-tagged, lemmatised, prosody

English and German

This corpus contains recordings of English and German spoken by non-native speakers from 31 different native language backgrounds.

The corpus is available for download from the Language Archive.

Download

Repiso/Contrefactualité

Licence: CLARIN RES

French, Italian, Spanish

This corpus contains recordings of counterfactual sentences.

 

Openprodat

Licence: GNU General Public License

Dutch, English, French, German, Italian, Arabic, Spanish, Hungarian, Japanese, Thai, Norwegian, Chinese

This corpus contains paragraph readings by participants in their L1 and in as many L2s as they felt they could manage.

The corpus is available for download from Ortolang.

For the relevant publication, see Hirst et al. (2013).

Download

GeWiss

Size: 1.4 million tokens

Annotation: code switching

German (L2 and L1), English, Polish, Italian (L1)

This corpus contains L1 and L2 transcripts and audio recordings of spoken German academic discourse, as well as L1 data of spoken English, Polish, and Italian academic discourse.

For the relevant publication, see Fandrych et al. (2014).

Concordancer

Video and multimodal corpora

Corpus Language Description Availability

TAITO: Written and Oral Data of the TAITO-project

Licence: Under Negotiation

English, French, German, Italian, Swedish

This corpus contains texts written by undergraduate students at the beginning of their studies and videotaped discussions.

 

YKI National Certificates corpus

Licence: CLARIN RES

Italian, Swedish, Spanish, English, Finnish, German, French, Russian

This corpus contains written and speech tasks.

 

Other L2 learner corpora

An additional 128 L2 learner corpora that are not part of the CLARIN infrastructure are listed on the website of the Catholic University of Louvain.

See also LADDER (Learners' digital communication: a corpus for pragmatic competences in Italian L1/L2). This downloadable corpus consists of emails and instant messages written by (i) German learners of Italian at CEFR levels A2-C1, most of whom are students living in Tyrol (Austria), and (ii) native speakers of Italian, most of whom are students from Rome (Italy). For a related publication, see Brocca (2021).

Additional materials

CLARIN workshop on Interoperability of Second Language Resources and Tools, 6-8 December 2017, Gothenburg, Sweden. [html]

Publications on the L2 learner corpora

[Barraja-Rohan 2013] Anne-Marie Barraja-Rohan. 2013. Second Language Interactional Competence and its Development: A Study of International Students in Australia.

[Bel et al. 2016] Aurora Bel, Estela García-Alcaraz, and Elisa Rosado. 2016. Reference comprehension and production in bilingual Spanish. In Language Acquisition Beyond Parameters: Studies in honour of Juana M. Liceras, edited by Anahí Alba de la Fuente, Elena Valenzuela, and Cristina Martínez Sanz, 37–70.

[de Benito 2016] Estrella Nicolás de Benito. 2016. La adquisición del sintagma determinante en español por niños de lengua materna árabe marroquí. Doctoral dissertation.

[Chang et al. 2013] A. Chang, Z.H. Feng, and W.C. Yang. 2013. A new multimedia shared L2 spoken Mandarin Chinese corpus: construction and linguistic analyses. In Proceedings of the 21st Annual Meeting of the International Association of Chinese Linguistics.

[Chambers and Richards 1995] Francine Chambers and Brian Richards. 1995. The "free conversation" and the assessment of oral proficiency. Language Learning Journal, 11: 6–10.

[Dimroth 2008] Christine Dimroth. 2008. Age Effects on the Process of L2 Acquisition? Evidence From the Acquisition of Negation and Finiteness in L2 German. Language Learning, 58 (1): 117–150.

[Díaz 2002] Lourdes Díaz. 2002. Interferencias discursivas de hablantes bilingües castellano/catalán: uso oral y escrito. In Seminari sobre les llengües i educació de l’Estat, edited by J. Perera.

[Hirst et al. 2013] Daniel Hirst, Brigitte Bigi, Hyongsil Cho, Hongwei Ding, Sophie Herment, Ting Wang. 2013. Building OMProDat: an open multilingual prosodic database.

[Hilton 2009] Heather Hilton. 2009. Annotation and Analyses of Temporal Aspects of Spoken Fluency. CALICO Journal, 26 (3): 644–661.

[Kubanek-German 2000] Angelika Kubanek-German. 2000. Early Language Programmes in Germany. In An Early Start: Young Learners and Modern Languages in Europe and Beyond.

[Jantunen 2011] Jarmo Harri Jantunen. 2011. Kansainvälinen oppijansuomen korpus (ICLFI): typologia, taustamuuttujat ja annotointi.

[Liceras 2017] Juana M. Liceras. 2017. Herramientas para abordar el análisis de la gramática no nativa de los inmigrantes. In La formación de los docentes de español para inmigrantes en distintos contextos educativos, edited by Dimitrinka Georgieva Níkleva.

[Liceras et al. 1999] J.M. Liceras, E. Valenzuela, and L. Díaz. 1999. L1/L2 Spanish grammars and the pragmatic deficit hypothesis. Second Language Research, 15 (2): 161–190.

[MacWhinney 2016] Brian MacWhinney. 2016. A Shared Platform for Studying Second Language Acquisition. Language Learning, 67 (1).

[Markee 2000] Numa P. Markee. 2000. Conversation Analysis. Mahwah, New Jersey: Erlbaum.

[Medina Soler 2017] Isabela Medina Soler. 2017. La atenuación en el discurso oral de estudiantes de e/le universitarios con nivel b1 en contexto de inmersión para los actos de habla disentivo.

[Mitchell et al. 2008] Rosamond Mitchell, Laura Domínguez, María Arche, Florence Myles, and Emma Marsden. 2008. SPLLOC: A new corpus for Spanish second language acquisition research. Eurosla Yearbook, 8 (1): 287–304.

[Muñoz 2006] Carmen Muñoz (editor). 2006. Age and the Rate of Foreign Language Learning. Great Britain: Cromwell Press Ltd.

[Orr and Quené 2017] Rosemary Orr and Hugo Quené. 2017. D-LUCEA: Curation of the UCU Accent Project Data.

[Péters 2017] Hugues Péters. 2017. Comportements d'autocorrection et d'hésitation manifestés par les apprenants de FLE au cours de conversations orales spontanées. Bulletin VALS-ASLA, N° Spécial, 2: 133–145.

[Rosen 2016] Alexandr Rosen. 2016. Building and using corpora of non-native Czech. 

[Theodórsdóttir 2018] Guðrún Theodórsdóttir. 2018. L2 Teaching in the Wild: A Closer Look at Correction and Explanation Practices in Everyday L2 Interaction. The Modern Language Journal, 102 (1).

[Tracy-Ventura and Huensch 2018] Nicole Tracy-Ventura and Amanda Huensch. 2018. The potential of publicly shared longitudinal learner corpora in SLA research. In Critical Reflections on Data in Second Language Acquisition, edited by Aarnes Gudmestad and Amanda Edmonds, 149–170.

[Vercellotti 2015] Mary Lou Vercellotti. 2015. The Development of Complexity, Accuracy, and Fluency in Second Language Performance: A Longitudinal Study. Applied Linguistics, 38 (1): 90–111.

[Volodina et al. 2016] Elena Volodina, Ildikó Pilán, Ingegerd Enström, Lorena Llozhi, Peter Lundkvist, Gunlög Sundberg, and Monica Sandell. 2016. SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies. Proceedings of LREC 2016, Slovenia.

[Volodina et al. 2019] Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg, and Mats Wirén. 2019. The SweLL Language Learner Corpus: From Design to Annotation. Northern European Journal of Language Technology, Special Issue. (Non-final version)

[Vergara Padilla 2017] María Ángeles Vergara Padilla. 2017. La influencia de las tipologías textuales en la fluidez. Las presentaciones académicas orales de aprendientes estadounidenses de ele.

[Walter 2020] Daniel Walter. 2020. Student Uses of the First Language for L2 Classroom Interactions.

[Young-Scholten and Langer 2015] Martha Young-Scholten and Monika Langer. 2015. The role of orthographic input in second language German: Evidence from naturalistic adult learners’ production. Applied Psycholinguistics, 36 (1): 93–114.

[Zhang 2009] Yanhui Zhang. 2009. A Tutor for Learning Chinese Sounds through Pinyin (Unpublished Doctoral Dissertation). Carnegie Mellon University.

[Zhao and MacWhinney 2009] Yun Zhao and Brian MacWhinney. 2009. Competing Cues: A Corpus-based Study of the English Tense-Aspect in Second Language Acquisition. In Proceedings of the 34th annual Boston University Conference on Language Development, edited by Katie Franich, Kate M. Iserman, and Lauren L. Keil, 503–514.