L2 learner corpora

Introduction

L2 learner corpora play a crucial role in second language research and pedagogy: they allow a systematic study of how learners acquire a new language at both the lexical and the syntactic level, and of how that process is influenced by their native language. A distinctive feature of this type of corpus is the markup of learners' errors and prosodic features.

The CLARIN infrastructure provides access to 74 L2 learner corpora. Of these, 11 are multilingual, while the rest provide written, spoken and even videotaped monolingual L2 data in 13 languages: Arabic, Czech, English, Finnish, French, German, Hungarian, Icelandic, Italian, Mandarin, Norwegian, Spanish, and Swedish. Many of these corpora are available under public licences.

We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

To comment, suggest changes to the existing content, or propose new corpora for inclusion, send us an email.

This website was last updated on 29 July 2020.

Monolingual L2 learner corpora in the CLARIN infrastructure

Written corpora

Corpus Language Description Availability

CzeSL – Czech as a Second Language

Size: 0.9 million words
Annotation: tokenised, PoS-tagged, lemmatised, error labels
Licence: CC-BY 

Czech

This corpus contains essays written in 2013 by learners from 54 L1 backgrounds.

The corpus is available for download from LINDAT.

For a related publication, see Rosen (2016).

Download

British Academic Written English Corpus

Size: 2761 texts
Licence: CC-BY

English

This is primarily an L1 corpus, although it also contains L2 texts.

 

The corpus is available for download from the University of Oxford Text Archive.

Download

CORYL (Corpus of Young Learner Language)

Size: 191,568 tokens
Annotation: tokenised, anonymised, error labels, linked to CEFR levels
Licence: CC-BY

English

This corpus contains English texts written by Norwegian primary school pupils (7th, 10th, and 11th grade).

The corpus is available through the concordancer Corpuscle provided by CLARINO.

Concordancer

ETS Corpus of Non-Native Written English

Size: 12,100 essays (1100 / language)
Licence: restricted

English

The corpus contains texts written by learners from 11 L1 backgrounds as part of an international test of academic English proficiency. The prompts and the proficiency levels are part of the metadata.

 

The corpus is available for download from the LDC catalogue.

Download

The Hanken Corpus of Academic Writing

Size: 500,000 words 
Licence: CC-BY

English

This corpus contains academic texts written by Finnish and Swedish native speakers.

The corpus is still under development.

 

 

ICLE International Corpus of Learner English

Size: 3 million words

English

This corpus contains texts written by learners of English from 14 L1 backgrounds.

 

The corpus can be purchased on CD-ROM and a new version (ICLE v.3) is in development.

 

The Uppsala Student English corpus

Size: 1.2 million tokens
Annotation: tokenised
Licence: CC-BY

English

This corpus contains essays written during the first three semesters of English studies at Uppsala University; most of the essays were written during the first semester. The corpus consists of text files, each with a student ID and a text ID that includes the course level, and information about the different prompts is available.

The corpus is available for download from the University of Oxford Text Archive.

Download

The Advanced Finnish Learners’ Corpus

Size: 288,000 tokens
Annotation: tokenised, MSD-tagged, lemmatised
Licence: CLARIN RES

Finnish

This corpus contains academic texts written by MA students and collected in 2009.

The corpus consists of two subcorpora, the Exam Essays Subcorpus and the Course Papers Subcorpus, both of which are also available through Korp.

Concordancer

Download subcorpus

Download subcorpus

International Corpus of Learner Finnish (ICLFI) Corpus

Size: 1 million words
Annotation: MSD-tagged
Licence: CLARIN RES

Finnish

This corpus contains fictional (e.g., letters, narratives) and non-fictional (e.g., essays) texts.

The corpus provides information on a large number of variables concerning the linguistic background of the learner, the learning task, the learning context, etc. It is available through the concordancer Korp.

For a related publication, see Jantunen (2011).

Concordancer

Testipiste Corpus

Size: 840,000 tokens
Annotation: tokenised
Licence: CLARIN RES

Finnish

This corpus contains essays written by adult migrants from various L1 backgrounds.

The corpus will be made available through the concordancer Korp.

 

Commented Learner Corpus Academic Writing

Size: 853 texts
Licence: CC BY-NC-SA 3.0

German

This corpus contains texts written by students at the University of Hamburg from various L1 backgrounds.

The corpus is available for download through the repository of the University of Hamburg.

 

Download

ASK – Norsk andrespråkskorpus

Size: 618,000 tokens
Annotation: tokenised, PoS-tagged, errors
Licence: CLARIN RES

Norwegian

This corpus contains essays and tests written by students from 10 L1 backgrounds. It also contains L1 control essays.

The corpus is available through a dedicated concordancer provided by CLARINO.

Concordancer

FinSveStud 79-80

Size: 175,000 tokens
Annotation: tokenised, lemmatised
Licence: CLARIN RES

Swedish

This corpus contains texts written by students with Finnish as their L1 background.

The corpus is available through the concordancer Korp.

Concordancer

SW1203-essays

Size: 52,025 tokens
Annotation: tokenised, PoS-tagged, MSD-tagged, lemgrams, compounds word forms
Licence: CC-BY

Swedish

This corpus contains essays.

The corpus is available for download through Språkbanken and through the concordancer Korp.

Together with the Tisus corpus, SW1203-essays is a subcorpus of the pilot SweLL corpus.

Concordancer

Download

Tisus corpus

Size: 60,000 tokens
Annotation: tokenised, PoS-tagged, MSD-tagged, lemgrams, compounds word forms
Licence: CC-BY

Swedish

Together with the SW1203-essays, the Tisus corpus is a subcorpus of the pilot SweLL corpus.

 

The corpus is available for download from Språkbanken and through the concordancer Korp.

Concordancer

Download

Spoken corpora

Corpus Language Description Availability

The Dresden Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

Czech

The corpus contains speech recordings of ~32 German children learning Czech (type of study: interview).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Kubanek-German (2000).

Concordancer

Download

The Anglish Corpus 

Annotation: interpausal units
Licence: CLARIN RES

English

This corpus contains various speech tasks performed by French native speakers and the associated transcriptions.

 

The Barcelona English Language Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

English

This corpus contains speech recordings of Spanish children and teenagers learning English in Barcelona across 4 tasks (written composition, oral narrative, oral interview and role play).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Muñoz (2006).

Concordancer

Download

The Barraja-Rohan Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

English

This corpus contains speech recordings of adult international students who spoke English as a second language and who had newly arrived at an Australian university. These undergraduate international students from various Asian backgrounds interacted over a period of seven months with Australian graduate students who were native speakers of English. 

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Barraja-Rohan (2013).

Concordancer

Download

 

The Connolly Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

English

The corpus contains speech recordings of 60 Japanese high school students learning English.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

Concordancer

Download

The CUHK corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

English

This corpus contains speech recordings of 6 children learning English in Hong Kong.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see MacWhinney (2016).

Concordancer

Download

The Dresden Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

English

The corpus contains speech recordings of ~32 German children learning English (type of study: interview).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Kubanek-German (2000).

Concordancer

Download

GLBCC (Giessen - Long Beach Chaplin Corpus)

Size: 2472 words/transcript
Licence: CC-BY

English

This corpus contains film retellings performed by English and German native speakers.

The corpus is available for download from the University of Oxford Text Archive.

Download

ISLE Speech Corpus

Size: approx. 18 hours
Annotation: phone-level annotation, stress errors
Licence: ELRA END USER

English

This corpus contains various speech tasks (reading simple sentences, using minimal pairs, giving answers to multiple choice questions) performed by German and Italian native speakers.

The corpus is available for download from the ELRA catalogue.

 

Download

A Learners' Corpus of Reading Texts

Licence: CLARIN RES

English

This corpus contains unprepared readings by first-year students at an English department who speak French as a native language.

The corpus is available for download from Ortolang.

Download

The Markee Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

English

This corpus contains speech recordings of 3 students learning English as a second language.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Markee (2000).

Concordancer

Download

The PAROLE Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

 

English

This corpus contains speech recordings of 95 students learning English in France (type of study: tasks/storytelling).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Hilton (2009).

Concordancer

Download

The QATAR Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

English

This corpus contains recorded interviews involving 19 Qatari learners of English.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Zhao and MacWhinney (2010).

Concordancer

Download

The Vercellotti Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

English

This corpus contains recordings of adult learners entering an Intensive English Program (IEP) in the United States during the year 2010. Tasks include 2-minute monologues.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Vercellotti (2017).

Concordancer

Download

The Dresden Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

French

The corpus contains speech recordings of ~32 German children learning French (type of study: interview).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Kubanek-German (2000).

Concordancer

Download

French Learner Language Oral Corpora (FLLOC)

Size: 1375 transcripts
Annotation: MSD-tagged
Licence: CC-BY

French

This corpus contains various narrative and interactive speech tasks performed by English and Dutch native speakers. 

The corpus is available for download from the University of Oxford Text Archive. The transcripts and audio files can also be downloaded and browsed through TalkBank.

Download

The LANGSNAP Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

French

This corpus contains speech recordings of 28 British undergraduates learning French before, during and after a year abroad. Tasks include oral interviews and story retellings, as well as argumentative writing tasks.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Tracy-Ventura and Huensch (2018).

Concordancer

Download

The LANGSNAP3 Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

French

This corpus is a 3-year follow-up to the LANGSNAP Corpus, involving 18 participants.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Tracy-Ventura and Huensch (2018).

Concordancer

Download

The Newcastle Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

French

This corpus contains intermediate-level spoken French from 17- to 18-year-old second language learners in years 12 to 13 of UK secondary education.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

Concordancer

Download

The PAROLE Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

French

This corpus contains speech recordings of 40 students learning French in France as a second language (type of study: tasks/storytelling).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Hilton (2009).

Concordancer

Download

The Trinity College (TCD) Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

French

This corpus contains recordings of 5 children (2 Irish, 1 Polish, 2 Cambodian) learning French in a school in France.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

Concordancer

Download

The Reading Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

French

This corpus contains oral proficiency interviews with 34 16-year-olds learning French in South Wales.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Chambers and Richards (1995).

Concordancer

Download

The UWI Corpus

Size: 15,068 tokens
Annotation: audio/transcription linking
Licence: public (acknowledgment required)

French

This corpus consists of 25 recorded interviews with learners of French (9 adult learners) in Jamaica.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Péters (2017).

Concordancer

Download

The Dimroth SLA Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

German

The corpus contains speech recordings of 47 students learning German (type of study: interview).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Dimroth (2008).

Concordancer

Download

Hamburg Modern Times Corpus

Size: 24,000 words
Annotation: prosody
Licence: CLARIN RES

German

This corpus contains film retellings and the accompanying transcriptions.

The corpus is available for download from the HZSK CLARIN-D repository.

Download

The RyanDan Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

German

The corpus contains recordings of 4 Carnegie Mellon University students learning German.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Walter (2020).

Concordancer

Download

The VYSA Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

German

This corpus contains recordings of 3 high school students learning German abroad while living with German-speaking host families and attending German secondary schools in standard German-speaking urban and peri-urban regions of Germany.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Young-Scholten and Langer (2015).

Concordancer

Download

The Theodórsdóttir corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

Icelandic

The corpus contains recordings obtained in a longitudinal case study of L2 Icelandic.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Theodórsdóttir (2018).

Concordancer

Download

The PAROLE Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

Italian

This corpus contains speech recordings of 95 students learning Italian in France (type of study: tasks/storytelling).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Hilton (2009).

Concordancer

Download

The COPA Corpus

Annotation: error coding
Licence: public (acknowledgment required)

Mandarin

This corpus contains speech recordings of ~120 college students learning Mandarin in Hong Kong (type of study: responses to questions).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Zhang (2009).

Concordancer

Download

The HKPU Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

Mandarin

This corpus contains speech recordings of 20 college students learning Mandarin in Hong Kong. The tasks involve oral interviews.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Chang et al. (2013).

Concordancer

Download

LANGMAN

Size: 11 subcorpora
Annotation: error coding
Licence: CC-BY

Hungarian

This is a spoken corpus involving Chinese native speakers who are learning Hungarian as a second language.

The subcorpora are available for download and browsing through TalkBank.

Concordancer

Download

The BCN-L2 Corpus

Annotation: error coding
Licence: public (acknowledgment required)

Spanish

This corpus contains speech recordings of Berber students learning Spanish. The participants were 88 native speakers of Moroccan Arabic (Darija) and 26 speakers of Berber (Amazigh) living in Catalonia.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Bet et al. (2016).

Concordancer

Download

The Díaz Rodríguez Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

Spanish

This corpus contains speech recordings of Indo-European and Asian learners, both semi-spontaneous and experimental, obtained in Barcelona, Spain (type of study: naturalistic, longitudinal).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Díaz (2002).

Concordancer

Download

The LANGSNAP Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

Spanish

This corpus contains speech recordings of 27 British undergraduates learning Spanish before, during and after a year abroad. Tasks include oral interviews and story retellings, as well as argumentative writing tasks.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Tracy-Ventura and Huensch (2018).

Concordancer

Download

The LANGSNAP3 Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

Spanish

This corpus is a 3-year follow-up to the LANGSNAP Corpus, involving 33 participants.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Tracy-Ventura and Huensch (2018).

Concordancer

Download

The Liceras Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

Spanish

This corpus contains speech recordings of 11 students learning Spanish as a second language.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Liceras et al. (1999).

Concordancer

Download

The Nebrija-CORELE-UA Corpus

Size: 1 hour 27 minutes; 10,292 words
Annotation: audio/transcription linking
Licence: public (acknowledgment required)

Spanish

This corpus contains 10 recorded interviews involving students of Spanish as a Foreign Language at the University of Alicante.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Medina Soler (2017).

Concordancer

Download

The Nebrija-INMIGRA Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

Spanish

This corpus consists of oral interviews carried out in the context of the LETRA test of Spanish for immigrant workers. It is made up of semi-guided interviews carried out in Spanish which last approximately 10 minutes each. The participants are immigrants from 11 different countries who live in the Autonomous Community of Madrid (Spain). 

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Liceras (2017).

Concordancer

Download

The Nebrija-OAP Corpus

Size: 9 hours 19 minutes; 49,718 words
Annotation: audio/transcription linking
Licence: public (acknowledgment required)

Spanish

This corpus contains 67 videotaped presentations involving 95 North American students of Spanish as a Foreign Language at Nebrija University in Madrid.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Vergara Padilla (2017).

Concordancer

Download

The Nebrija-WOCAE Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

Spanish

This corpus contains recordings of emails written and read by 28 Chinese students.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Vergara Padilla (2017).

Concordancer

Download

The Nicolás Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

Spanish

This corpus contains recordings of 2 children from Morocco learning Spanish in Spain.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see de Benito (2016).

Concordancer

Download

The SPLLOC1 Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

Spanish

This corpus contains recordings of L2 Spanish in a classroom context. There were 20 learners, all of whom were English native speakers, at each of 3 levels: beginners (Year 9 students aged 13-14), intermediate students (A2 students aged 17-18), and fourth year undergraduates. 

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Mitchell et al. (2008).

Concordancer

Download

The SPLLOC2 Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

Spanish

This corpus is an extension of the SPLLOC1 Corpus.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For a relevant publication, see Mitchell et al. (2008).

Concordancer

Download

 

Video and multimodal corpora

Corpus Language Description Availability

Arabic Learner Corpus

Size: 0.3 million tokens
Annotation: tokenised
Licence: CLARIN RES

 

Arabic

This corpus contains essays written by students from 67 L1 backgrounds. It also contains recordings of speech tasks and associated transcriptions.

The corpus is available for download from the LDC catalogue.

Download

 

English as a Foreign Language Corpus

Size: 24 hours
Licence: Under Negotiation

 

English

The corpus contains videotaped lessons involving students at Finnish secondary schools. 

 

 

 

The Long Second Corpus

Licence: Under Negotiation

Finnish

This corpus contains written texts, audio recordings and videotaped lessons involving immigrants from the following L1 backgrounds: Estonian, Macedonian, Kurdish, Portuguese, Russian, and English.

The corpus is still in preparation. It is set to be made available on the LAT platform.

 

The van Compernolle Corpus

Annotation: audio/transcription linking
Licence: public (acknowledgment required)

French

This corpus contains a recorded examination of classroom interactional practices and actions in a beginning-level ESL reading class. Analytic foci include aspects of speech delivery and timing as well as nonverbal behaviors (e.g., eye gaze, gesture).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

Concordancer

Download

 

 

Multilingual L2 learner corpora in the CLARIN infrastructure

Written corpora

Corpus Language Description Availability

CEFLING Project Corpus

 

Finnish and English

This corpus contains texts written by primary and secondary school students (years 7-9).

 

 

DIALUKI: Diagnosing reading and writing in a second or foreign language

Size: 8,600 texts
Licence: CLARIN RES

Finnish and English

This corpus contains texts both in Finnish (written by Russian native speakers) and English (written by Finnish native speakers). 

 

The corpus will be made available through Korp.
 

MERLIN Written Learner Corpus for Czech, German, Italian 1.1

Size: 2287 texts
Annotation: a wide range of language characteristics that provide researchers with concrete examples of learner performance and progress across multiple proficiency levels.
Licence: CC BY-SA 4.0

Czech, German, and Italian

This corpus contains learner texts produced in standardized language certifications covering CEFR levels A1-C1.

The corpus is available for download from the Eurac Research CLARIN Centre Repository.

Download

Topling - Paths in Second Language Acquisition

Size: 165,000 tokens
Annotation: tokenised
Licence: CLARIN End User Licence Agreement

Finnish, English and Swedish

This corpus contains written texts in English, Swedish and Finnish produced by students in the Finnish educational system. It is an extension of the CEFLING corpus, which it also includes.

The corpus is available through the concordancer Korp. 

Concordancer

 

Spoken corpora

Corpus Language Description Availability

AixOx

Size: 40 minutes/task
Licence: restricted

English and French

This corpus contains readings of written texts performed by French and English native speakers.

 

LeaP: Learning the Prosody of a Foreign Language

Size: 31 hours
Annotation: PoS-tagged, lemmatised, prosody

 

English and German

This corpus contains recordings of English and German spoken by non-native speakers from 31 different native language backgrounds.

The corpus is available for download from the Language Archive.

Download

Repiso/Contrefactualité

Licence: CLARIN RES

French, Italian, Spanish

This corpus contains recordings of counterfactual sentences.

 

Openprodat

Licence: GNU General Public License

Dutch, English, French, German, Italian, Arabic, Spanish, Hungarian, Japanese, Thai, Norwegian, Chinese

This corpus contains paragraph readings by participants in their L1 and in as many L2s as they felt they could manage.

The corpus is available for download from Ortolang.

For a related publication, see Hirst et al. (2013).

Download

GeWiss

Size: 1.4 million tokens
Annotation: code switching

German (L2 and L1), English, Polish, Italian (L1)

This corpus contains transcripts and audio recordings of spoken academic discourse, primarily talks (including discussions) and oral exams.

For the relevant publication, see Fandrych et al. (2014)

Concordancer

 

Video and multimodal corpora

Corpus Language Description Availability

TAITO: Written and Oral Data of the TAITO-project

Licence: Under Negotiation

English, French, German, Italian, Swedish

This corpus contains texts written by undergraduate students at the beginning of their studies and videotaped discussions.

 

 

YKI National Certificates corpus

Licence: CLARIN RES

Italian, Swedish, Spanish, English, Finnish, German, French, Russian

This corpus contains written and spoken tasks.

 

Other L2 learner corpora

A further 128 L2 learner corpora exist that are not part of the CLARIN infrastructure. However, since they are already listed on the website of the Catholic University of Louvain, and many appear to be unavailable for download or online querying, we do not list them here.

Additional materials

CLARIN workshop on Interoperability of Second Language Resources and Tools, 6-8 December 2017, Gothenburg, Sweden. [html]

Publications on the L2 learner corpora

[Barraja-Rohan 2013] Anne-Marie Barraja-Rohan. 2013. Second Language Interactional Competence and its Development: A Study of International Students in Australia.

[Bel et al. 2016] Aurora Bel, Estela García-Alcaraz, and Elisa Rosado. 2016. Reference comprehension and production in bilingual Spanish. In Language Acquisition Beyond Parameters: Studies in honour of Juana M. Liceras, edited by Anahí Alba de la Fuente, Elena Valenzuela, and Cristina Martínez Sanz, 37–70.

[de Benito 2016] Estrella Nicolás de Benito. 2016. La adquisición del sintagma determinante en español por niños de lengua materna árabe marroquí. Doctoral dissertation.

[Chang et al. 2013] A. Chang, Z.H. Feng, and W.C. Yang. 2013. A new multimedia shared L2 spoken Mandarin Chinese corpus: construction and linguistic analyses. In Proceedings of the 21st Annual Meeting of the International Association of Chinese Linguistics.

[Chambers and Richards 1995] Francine Chambers and Brian Richards. 1995. The "free conversation" and the assessment of oral proficiency. Language Learning Journal, 11: 6–10. 

[Dimroth 2008] Christine Dimroth. 2008. Age Effects on the Process of L2 Acquisition? Evidence From the Acquisition of Negation and Finiteness in L2 German. Language Learning, 58 (1): 117–150.

[Díaz 2002] Lourdes Díaz. 2002. Interferencias discursivas de hablantes bilingües castellano/catalán: uso oral y escrito. In Seminari sobre les llengües i educació de l’Estat, edited by J. Perera.

[Hirst et al. 2013] Daniel Hirst, Brigitte Bigi, Hyongsil Cho, Hongwei Ding, Sophie Herment, Ting Wang. 2013. Building OMProDat: an open multilingual prosodic database.

[Hilton 2009] Heather Hilton. 2009. Annotation and Analyses of Temporal Aspects of Spoken Fluency. CALICO Journal, 26 (3): 644–661.

[Kubanek-German 2000] Angelika Kubanek-German. 2000. Early Language Programmes in Germany. In An Early Start: Young Learners and Modern Languages in Europe and Beyond.

[Jantunen 2011] Jarmo Harri Jantunen. 2011. Kansainvälinen oppijansuomen korpus (ICLFI): typologia, taustamuuttujat ja annotointi.

[Liceras 2017] Juana M. Liceras. 2017. Herramientas para abordar el análisis de la gramática no nativa de los inmigrantes. In La formación de los docentes de español para inmigrantes en distintos contextos educativos, edited by Dimitrinka Georgieva Níkleva.

[Liceras et al. 1999] J.M. Liceras, E. Valenzuela, and L. Díaz. 1999. L1/L2 Spanish grammars and the pragmatic deficit hypothesis. Second Language Research, 15 (2): 161–190.

[MacWhinney 2016] Brian MacWhinney. 2016. A Shared Platform for Studying Second Language Acquisition. Language Learning, 67 (1).

[Markee 2000] Numa P. Markee. 2000. Conversation Analysis. Mahwah, New Jersey: Erlbaum.

[Medina Soler 2017] Isabela Medina Soler. 2017. La atenuación en el discurso oral de estudiantes de e/le universitarios con nivel b1 en contexto de inmersión para los actos de habla disentivo.

[Mitchell et al. 2008] Rosamond Mitchell, Laura Domínguez, María Arceh, Florence Myles, and Emma Marsden. 2008. SPLLOC: A new corpus for Spanish second language acquisition research. Eurosla Yearbook, 8 (1): 287–304.

[Muñoz 2006] Carmen Muñoz (editor). 2006. Age and the Rate of Foreign Language Learning. Great Britain: Cromwell Press Ltd.

[Orr and Quené 2017] Rosemary Orr and Hugo Quené. 2017. D-LUCEA: Curation of the UCU Accent Project Data.

[Péters 2017] Hugues Péters. 2017. Comportements d'autocorrection et d'hésitation manifestés par les apprenants de FLE au cours de conversations orales spontanées. Bulletin VALS-ASLA, special issue, 2: 133–145.

[Rosen 2016] Alexandr Rosen. 2016. Building and using corpora of non-native Czech. 

[Theodórsdóttir 2018] Guðrún Theodórsdóttir. 2018. L2 Teaching in the Wild: A Closer Look at Correction and Explanation Practices in Everyday L2 Interaction. The Modern Language Journal, 102 (1).

[Tracy-Ventura and Huensch 2018] Nicole Tracy-Ventura and Amanda Huensch. 2018. The potential of publicly shared longitudinal learner corpora in SLA research. In Critical Reflections on Data in Second Language Acquisition, edited by Aarnes Gudmestad and Amanda Edmonds, 149–170.

[Vercellotti 2015] Mary Lou Vercellotti. 2015. The Development of Complexity, Accuracy, and Fluency in Second Language Performance: A Longitudinal Study. Applied Linguistics, 38 (1): 90–111.

[Vergara Padilla 2017] María Ángeles Vergara Padilla. 2017. La influencia de las tipologías textuales en la fluidez. Las presentaciones académicas orales de aprendientes estadounidenses de ele.

[Walter 2020] Daniel Walter. 2020. Student Uses of the First Language for L2 Classroom Interactions.

[Young-Scholten and Langer 2015] Martha Young-Scholten and Monika Langer. 2015. The role of orthographic input in second language German: Evidence from naturalistic adult learners’ production. Applied Psycholinguistics, 36 (1): 93–114.

[Zhang 2009] Yanhui Zhang. 2009. A Tutor for Learning Chinese Sounds through Pinyin (Unpublished Doctoral Dissertation). Carnegie Mellon University.

[Zhao and MacWhinney 2009] Yun Zhao and Brian MacWhinney. 2009. Competing Cues: A Corpus-based Study of the English Tense-Aspect in Second Language Acquisition. In Proceedings of the 34th annual Boston University Conference on Language Development, edited by Katie Franich, Kate M. Iserman, and Lauren L. Keil, 503–514.