Spoken corpora

Introduction

Corpora of spoken language contain transcriptions of spontaneous or planned speech, such as broadcast news or elicited narratives and dialogues. They are often aligned with the accompanying recordings. They are an invaluable resource for various kinds of linguistic research, such as phonology, conversation analysis, and dialectology. Such corpora are carefully sampled and rich in sociodemographic metadata.

There are 62 spoken corpora in the CLARIN infrastructure, 52 of which contain both transcriptions of planned or spontaneous speech and the associated recordings, and 10 of which contain only the transcriptions. Most of the corpora are monolingual, accounting for the following 15 languages: Arabic, Czech, Estonian, Finnish, French, German, Hungarian, Italian, Nepali, Norwegian, Polish, Skolt Saami, Slovenian, Spanish, and Swedish. In the vast majority of cases, the corpora can be directly downloaded from the national repositories or queried through easy-to-use online search environments. They are also richly tagged, many with mark-up specific to speech corpora, such as phonemic and prosodic annotation.

We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

To comment, suggest changes to the existing content, or propose new corpora for inclusion, send us an email.

This website was last updated on 5 October 2018.

Spoken corpora in the CLARIN infrastructure

Corpora with transcriptions and audio recordings

Corpus Language Description Availability

Arabic Speech Corpus

Licence: CC NC-SA 3.0

Arabic

The corpus is available for download from the Oxford Text Archive.

Download

DIALEKT v1: dialectal corpus with multi-tier transcription

Size: 100,000 words
Annotation: orthographically and phonetically (dialect features) transcribed, MSD-tagged, lemmatised
Licence: Academic Licence Agreement for Czech National Corpus Data

Czech

This corpus contains traditional dialectological material, mostly unprepared monologue-type speech.

The corpus is available for download (upon request) and through the concordancer KonText.

For a related publication, see Komrsková et al. (2018).

Concordancer

Download

ORAL2013: balanced corpus of informal spoken Czech (transcriptions & audio)

Size: 2.8 million words
Annotation: recordings and transcripts anonymised
Licence: Academic Licence Agreement for Czech National Corpus Data

Czech

This corpus contains informal conversations.

The corpus is available for download from LINDAT and through the concordancer KonText.

For a related publication, see Benešová et al. (2015).

Concordancer

Download

ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcription (transcriptions & audio)

Size: 1 million words
Annotation: orthographically and phonetically transcribed; MSD-tagged, lemmatised
Licence: Academic Licence Agreement for Czech National Corpus Data

Czech

This corpus contains informal conversations.

The corpus is available for download from LINDAT and through the concordancer KonText.

For a related publication, see Komrsková et al. (2018).

Concordancer

Download

OVM – Otázky Václava Moravce

Size: 35 hours
Annotation: word-by-word transcriptions, including the transcription of some non-speech events
Licence: CC BY-NC 3.0

Czech

This corpus contains transcribed recordings from the Czech political discussion broadcast “Otázky Václava Moravce”.

The corpus is available for download from LINDAT and through the concordancer KonText.

Concordancer

Download

Prague DaTabase of Spoken Czech 1.0

Size: 770,000 tokens, 7324 minutes
Licence: CC BY-NC-SA 4.0

Czech

This corpus contains spontaneous dialogue.

The corpus is available for download from LINDAT.

For a related publication, see Hajič et al. (2008).

Download

Spoken corpus of Karel Makoň

Size: 1000 hours
Licence: CC BY-SA 3.0

Czech

The corpus contains talks on Christian mysticism given by Karel Makoň.

The corpus is available for download from LINDAT.

Download

Czech Malach Cross-lingual Speech Retrieval Test Collection

Size: 592 hours
Annotation: manual annotations of selected topics and interviews' metadata
Licence: CC BY-NC-ND 4.0

Czech, English, French, German, Spanish

This corpus contains interviews with survivors of the Holocaust.

The corpus is available for download from LINDAT.

Download

Air Traffic Control Communication

Size: 20 hours
Annotation: speaker information
Licence: CC BY-NC-ND 3.0

English

This corpus contains recordings of communication between air traffic controllers and pilots.

The corpus is available for download from LINDAT and through the concordancer KonText.

Concordancer

Download

Boston University Radio Speech Corpus

Size: 7 hours
Annotation: PoS-tagged, phonetic alignment, prosodic markers
Licence: CLARIN RES

English

This corpus contains recordings and texts from radio news.

The corpus is available for download from the UPenn repository.

Download

Buckeye Corpus of Conversational Speech

Annotation: phonetic labels
Licence: CLARIN RES

English

This corpus contains interviews.

The corpus is available for download from ORTOLANG.

For a related publication, see Pitt et al. (2005).

Download

ELFA Corpus

Size: 13 hours
Licence: restricted

English

This corpus contains recorded lectures and seminars.

The corpus is available for download from FIN-CLARIN.

Download

Corpus of Lecture Speech

Size: 41 hours
Annotation: orthographically transcribed

Estonian

This corpus contains recordings of academic lectures and oral conference presentations.

The corpus is available for download from a dedicated webpage.

Download

Corpus of Radio Interviews

Size: 36 hours
Annotation: speech annotation, orthographically transcribed

Estonian

This corpus contains telephone interviews from different radio programmes.

 

Corpus of Radio News

Size: 19 hours
Annotation: speech annotation, orthographically transcribed

Estonian

This corpus contains public broadcast news.

The corpus is available for download from a dedicated webpage.

Download

Estonian Dialect Corpus

Size: 1.3 million words
Annotation: phonetically transcribed, MSD-tagged, partly syntactically parsed
Licence: CLARIN ACA

Estonian

This corpus contains interviews.

The corpus is available for download from META-SHARE.

Download

Estonian Emotional Speech Corpus 

Size: 1234 sentences
Licence: CC-BY

Estonian

This corpus contains read sentences that express anger, joy and sadness, or are neutral.

The corpus is available for download from META-SHARE.

For a related publication, see Altrov and Pajupuu (2012).

Download

Estonian North Wind and the Sun Corpus v.1.0.3

Annotation: words in standard orthography and phonemes in SAMPA

Estonian

This corpus contains recordings of the tale “Põhjatuul ja päike” (North Wind and the Sun) read by the same speakers who participated in the Phonetic Corpus of Estonian Spontaneous Speech.

The corpus is available for download from META-SHARE.

Download

Phonetic Corpus of Estonian Spontaneous Speech v.1.0.4

Size: 635,000 words, 90 hours
Annotation: orthographically and phonetically transcribed, syllables, prosodic feet, intonation phrases, changes in voice quality
Licence: CLARIN RES

Estonian

This corpus contains spontaneous speech by speakers with different dialectological and social backgrounds.

The corpus is available for download from META-SHARE.

Download

Faroese Danish Corpus Hamburg 0.2.dan (FADAC-0.2.dan Hamburg)

Licence: HZSK-RES (restricted, non-commercial only)

Faroese, Danish

This corpus contains informal interviews.

 

Aalto University DSP Course Conversation Corpus 2013-2016, Downloadable Version

Size: 5200 utterances
Licence: academic

Finnish

This corpus contains spontaneous conversations.

The corpus is available for download from FIN-CLARIN.

Download

Finnish Broadcast Corpus

Size: 18 hours
Licence: restricted

Finnish

This corpus contains radio and TV broadcasts.

The corpus is available for download from FIN-CLARIN and for online querying through the LAT platform.

Concordancer

Download

Follow-up Study of Dialects of Finnish

Licence: restricted

Finnish

This corpus contains interviews.

This corpus is available for online querying through the LAT platform.

LAT platform

Route to A wing

Size: 218 tokens
Annotation: PoS-tagged
Licence: CC-0

Finnish

This corpus contains spontaneous conversations.

This corpus is available for online querying through the concordancer Korp.

Concordancer

Samples of Spoken Finnish

Licence: CC-BY

Finnish

This corpus contains interviews.

This corpus is available for online querying through the LAT platform and through the concordancer Korp.

Concordancer

LAT platform

The Finnish Dialect Syntax Archive

Size: 1.2 million words
Annotation: MSD-tagged
Licence: CC-BY-NC-ND

Finnish

The corpus contains interviews.

The corpus is available for online querying through the LAT platform and through the concordancer Korp.

Concordancer

LAT platform

The Longitudinal Corpus of Finnish Spoken in Helsinki (1970s, 1990s and 2010s)

Size: 210 hours
Licence: restricted

Finnish

This corpus contains spontaneous speech and interviews.

The corpus is available for online querying through the LAT platform.

LAT platform

The Corpus of Border Karelia

Size: 120 hours
Licence: CC-BY

Finnish, Karelian

This corpus contains interviews.

The corpus is available for download from FIN-CLARIN.

Download

Plenary Sessions of the Parliament of Finland

Size: 22.5 million words
Licence: CC-BY-NC-ND

Finnish, Swedish

This corpus contains the proceedings of the Finnish Parliament.

The corpus is available through a dedicated webpage and through the concordancer Korp.

Concordancer

Download

CLAPI

Licence: CC BY-NC-SA 4.0 

French

This collection comprises around 40 corpora of social interactions in different contexts: professional, private, institutional, commercial, medical, and educational situations.

Most of the corpora can be downloaded and queried through a dedicated concordancer.

Concordancer

Corpus de Français Parlé Parisien des années 2000

Licence: CC-BY

French

This corpus contains interviews.

The corpus is available for download from a dedicated webpage.

Download

Phonologie du Français Contemporain

Licence: CC-BY

French

The corpus is available for download from a dedicated webpage.

Download

Corpora of the Database for Spoken German

Size: 27 corpora
Annotation: PoS-tagged, lemmatised, time-aligned, orthographically transcribed

German

This corpus collection contains a wide variety of spoken discourse, ranging from dialectal and extraterritorial varieties of German to broadcast TV debates and interviews.

The corpora are available for online querying via the Database for Spoken German.

Concordancer

Hamburg Modern Times Corpus

Size: 3 hours
Annotation: manual annotation of phonetic phenomena, accent/stress marking
Licence: HZSK-ACA (academic, non-commercial only)

German

This corpus contains task-oriented communication (e.g., a film retelling) in the context of studying adult L2 acquisition.

 

Corpora of the Bavarian Archive for Speech Signals

Size: 47 corpora
Licence: CLARIN PUB/ACA/RES

German, English

This corpus collection contains a wide variety of spoken discourse, such as elicited speech tasks and spontaneous conversations in different settings (e.g., in a taxi, over the telephone), involving a variety of speakers (e.g., from adolescents to adults, as well as speakers who are hard of hearing).

The corpora are available for download from the BAS CLARIN B Centre. 

Download

EXMARaLDA Demo Corpus 1.0

Size: 2 hours
Annotation: suprasegmental information, accentuation/stress marking
Licence: HZSK-PUB (public, non-commercial only)

German, English, French, Spanish, Turkish, Polish, Vietnamese, Swedish, Norwegian, Italian, Russian, Afrikaans, Portuguese

This corpus is a demo of the EXMARaLDA system.

The corpus is available for download from a CLARIN-D repository.

Download

Hamburg Adult Bilingual LAnguage (HABLA)

Size: 79 hours
Licence: HZSK-RES (restricted, non-commercial only)

German, French, Italian

This corpus contains interviews.

For a related publication, see Kupisch et al. (2012).

 

Budapest Sociolinguistic Interview - version 2

Size: 270,000 words
Annotation: MSD-tagged, spoken language phenomena (hesitation, consonant drops)
Licence: CLARIN RES

Hungarian

This corpus contains sociolinguistic interviews conducted with 50 individuals.

The corpus is available for download and through a dedicated concordancer.

For a related publication, see Kontra and Váradi (1997).

Concordancer

Download

CLIPS : corpora e lessici di italiano parlato e scritto

Size: 100 hours

Italian

This corpus contains speech from 15 different cities in Italy.

 

Mbochi speech corpus

Size: 5000 sentences, 4.5 hours
Licence: ELRA

Mbochi, French

The corpus is available for download from the ELRA catalogue.

Download

Nepali Spoken Corpus

Size: 31 hours 26 minutes
Annotation: phonetically transcribed
Licence: ELRA

Nepali

The corpus is available for download from the ELRA Catalogue.

Download

Nganasan Spoken Language Corpus (NSLC)

Size: 32 hours
Licence: HZSK-RES (restricted, non-commercial only)

Nganasan, Russian

 

LIA

Size: 1.5 million tokens
Annotation: orthographically and phonetically transcribed, MSD-tagged, lemmatised

Norwegian

This corpus contains interviews and conversations in Norwegian dialects.

The corpus is available through a Tekstlab concordancer (account needed).

Concordancer

NoTa-Oslo

Size: 1 million tokens
Annotation: orthographically transcribed, MSD-tagged, lemmatised

Norwegian

This corpus contains interviews and conversations in Oslo sociolects.

The corpus is available through a Tekstlab concordancer (account needed).

Concordancer

TAUS

Size: 270,000 tokens
Annotation: MSD-tagged, lemmatised, orthographically and partially phonetically transcribed

Norwegian

This corpus contains informal interviews in Oslo sociolects.

The corpus is available through a Tekstlab concordancer (account needed).

Concordancer

The BigBrother Corpus

Size: 440,300 tokens
Annotation: orthographically transcribed, MSD-tagged, lemmatised

Norwegian

This corpus contains recordings and transcripts from the Norwegian Big Brother in 2001.

The corpus is available through a Tekstlab concordancer.

Concordancer

Corpus of American Nordic Speech (CANS)

Size: 251,000 tokens
Annotation: orthographically and phonetically transcribed, MSD-tagged, lemmatised

Norwegian, Swedish

This corpus contains interviews and conversations in Norwegian and Swedish dialects spoken in America.

The corpus is available through a Tekstlab concordancer.

For a related publication, see Johannessen (2015).

Concordancer

Hamburg Corpus of Polish in Germany (HamCoPoliG)

Size: 38 hours
Licence: HZSK-RES (restricted, non-commercial only)

Polish

This corpus contains spontaneous speech and reading tasks.

For a related publication, see Czachór (2012).

 

Consecutive and Simultaneous Interpreting (CoSi)

Size: 6 hours
Licence: HZSK-RES (restricted, non-commercial only)

Portuguese, English

This corpus contains lectures in Portuguese with simultaneous interpretation in English.

 

Skolt Saami Documentation Corpus (2016)


Size: 19 hours
Annotation: MSD-tagged

Skolt Saami

This corpus contains interviews.

This corpus is available for online querying through the LAT platform.

LAT platform

Hamburg Corpus of Argentinean Spanish (HaCASpa)

Size: 19 hours
Annotation: orthographically transcribed
Licence: HZSK-RES (restricted, non-commercial only)

Spanish (Argentinian)

This corpus contains spontaneous speech and reading tasks.

For a related publication, see Gabriel et al. (2010).

 

Catalan in a bilingual context (PhonCAT)

Size: 144 hours
Annotation: orthographically and phonetically transcribed
Licence: HZSK-RES (restricted, non-commercial only)

Spanish (Catalan)

This corpus contains read, elicited and spontaneous speech.

For a related publication, see Benet et al. (2012).

 

Corpora with transcriptions only

Corpus Language Description Availability

ORAL2008: Balanced corpus of informal spoken Czech

Size: 1 million tokens
Licence: CC BY-NC-SA 3.0

Czech

This corpus contains informal conversations.

The corpus is available for download from LINDAT and through the concordancer KonText.

For a related publication, see Benešová et al. (2015).

Concordancer

Download

ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcription (transcriptions)

Size: 1 million tokens
Annotation: orthographically and phonetically transcribed, MSD-tagged, lemmatised
Licence: CC BY-NC-SA 4.0

Czech

This corpus contains informal conversations.

The corpus is available for download from LINDAT and through the concordancer KonText.

For a related publication, see Komrsková et al. (2018).

Concordancer

Download

Prague Dependency Treebank of Spoken Language (PDTSL) 0.5

Size: 120,000 words
Annotation: syntactic dependencies
Licence: ACADEMIC (PDTSL)

Czech

The corpus is available for download from LINDAT.

Download

ParCorFull: A Parallel Corpus Annotated with Full Coreference

Size: 160,000 tokens
Annotation: coreference (nominal and clausal)
Licence: CC BY-NC-ND 4.0

English, German

This corpus contains planned speech and newswire.

The corpus is available for download from LINDAT.

Download

The Spoken Wikipedia Corpora

Annotation: text segmentation, normalization, time-alignment
Licence: CC BY-SA 4.0

English, German, Dutch

The corpus contains transcripts of read Wikipedia articles.

The corpus is available for download from a CLARIN-D repository.

For a related publication, see Köhn et al. (2016).

Download

Corpus of Spoken Estonian

Size: 1 million words
Annotation: unspecified tagging

Estonian

The corpus contains transcripts of recordings from various domains.

 

ALCEBLA

Size: 72 hours
Licence: HZSK-RES (restricted, non-commercial only)

German, Spanish

This corpus contains speech tasks performed by bilingual children.

For a related publication, see Ulloa Saceda et al. (2012).

 

Corpus of Doctor-Patient Conversations from Ahus

Size: 958,830 tokens
Annotation: orthographically transcribed, MSD-tagged, lemmatised

Norwegian

This corpus contains doctor-patient conversations.

The corpus is available through a Tekstlab concordancer (account needed).

Concordancer

Spoken corpus Gos 1.0

Size: 1 million words, 120 hours
Licence: CC-BY 4.0

Slovenian

This corpus contains transcripts from radio and TV shows, school lessons, private conversations, and business meetings.

The corpus is available for download from CLARIN.SI as well as through a dedicated web concordancer.

For a related publication, see Verdonik and Zwitter-Vitez (2011).

Concordancer

Download

Spoken corpus Gos VideoLectures 3.0 (transcription)

Size: 126,000 words
Annotation: PoS-tagged, lemmatised, orthographically and phonetically transcribed
Licence: CC BY 4.0

Slovenian

This corpus contains public academic speech.

The corpus is available for download from CLARIN.SI and through the concordancer KonText.

A version with audio recordings is also available.

For a related publication, see Verdonik (2018).

Concordancer

Download

Gothenburg Dialogue Corpus

Size: 1,470,000 tokens
Annotation: MSD-tagged, lemmatised

Swedish

The corpus is available through the concordancer Korp (account needed).

Concordancer

Other spoken corpora

Corpus Language Description Availability

Griffith Corpus of Spoken Australian English

Size: 32,134 words

English

The corpus is available for download and through the concordancer of the Australian National Corpus.

Concordancer

Download

Spoken BNC2014

Size: 10 million words

English

The corpus contains face-to-face conversations between people who speak British English as their first language.

The corpus is available through the CQP concordancer.

Concordancer

The Aston Corpus of West Midlands English (ACWME)

Annotation: orthographically transcribed

English

The corpus contains recordings of performances - comedy, drama, poetry, song and story-telling - and related interviews with performers, members of the audience and local and national celebrities.

The corpus is available for download from a dedicated webpage.

Download

Vienna-Oxford International Corpus of English

English

The corpus contains naturally occurring, non-scripted face-to-face interactions in English as a lingua franca (ELF).

The corpus is available through a dedicated concordancer.

Concordancer

AN.ANA.S._MT

English, Italian, Spanish

The corpus contains TV-broadcasts and elicited dialogues.

 

Babel - A Multi Language Database

Annotation: orthographically transcribed

Hungarian

This corpus contains various elicited speech tasks.

 

BEA (Hungarian Spontaneous Speech Database)

Size: 465 recordings
Annotation: partial transcription
Licence: restricted

Hungarian

This corpus contains spontaneous speech.

 

Hungarian Broadcast News Database

Size: 25,000 words, 3.5 hours
Annotation: audio-level annotations
Licence: META-SHARE NC-NoReD

Hungarian

The corpus is available for download (upon request) from META-SHARE.

Download

Hungarian Gigaword Corpus / "spoken language" subcorpus

Size: 76 million words
Annotation: PoS-tagged, MSD-tagged

Hungarian

The corpus contains radio broadcasts (reading aloud and spontaneous conversation).

The corpus is available through the Hungarian Gigaword Corpus concordancer.

Concordancer

Hungarian Kindergarten Language Corpus

Size: 192,000 words
Annotation: PoS-tagged, MSD-tagged
Licence: restricted

Hungarian

This corpus contains elicited speech tasks (picture descriptions) and guided conversation with children.

The corpus is available for download through META-SHARE.

Download

Hungarian Reference Speech Database

Size: 6 hours
Annotation: partial phonemic-level annotation
Licence: META-SHARE No-Redistribution Commercial FF

Hungarian

This corpus contains reading tasks.

The corpus is available for download (upon request) from META-SHARE.

Download

Medical Speech Database

Annotation: phonetic transcription
Licence: META-SHARE C-NoReD-FF

Hungarian

The corpus is available for download (upon request) from META-SHARE.

Download

Corpus LIP

Size: 490,000 words

Italian

The corpus is available through a dedicated concordancer.

Concordancer

Corpus AVIP-API

Annotation: orthographically transcribed

Italian

The corpus contains quasi-spontaneous dialogues (a map task).

The corpus is available for download from a dedicated webpage.

Download

Corpus Lips

Size: 700,000 words, 100 hours
Annotation: PoS-tagged, lemmatised

Italian

This is an L2 learner corpus.

The corpus is available for download from a dedicated webpage.

Download

Selezione dal "Corpus di parlato telegiornalistico. Anni Sessanta vs. 2005"

Annotation: orthographically transcribed

Italian

This corpus contains news broadcasts.

The corpus is available for download from a dedicated webpage.

Download

SpIt-MDb (Spoken Italian - Multilevel Database)

Annotation: orthographically transcribed

Italian

This corpus contains spontaneous speech.

The corpus is available for download from a dedicated webpage.

Download

Uralic Languages under the Influence (UraLUID) database

Size: 108,000 tokens, 4 hours
Annotation: MSD-tagged, time-alignment, phonetic and orthographic transcription

Udmurt, Tundra Nenets, Synya Khanty, Surgut Khanty

This corpus contains narratives (e.g., folk stories).

The corpus is available for download from a dedicated website.

Download

Publications on the spoken corpora

[Altrov and Pajupuu 2012] Rene Altrov, Hille Pajupuu. 2012. Estonian Emotional Speech Corpus: theoretical base and implementation.

[Benešová et al. 2015] Lucie Benešová, Michal Křen, Martina Waclawičová. 2015. Korpus spontánní mluvené češtiny ORAL2013.

[Benet et al. 2012] Ariadna Benet, Susana Cortés, Conxita Lleó. 2012. Phonoprosodic Corpus of Spoken Catalan (PhonCAT).

[Czachór 2012] Agnieszka Czachór. 2012. Corpus of Polish Spoken in Germany. Collecting and Analyzing Written & Spoken Data for Investigating Contact-Induced Change.

[Gabriel et al. 2010] Christoph Gabriel, Ingo Feldhausen, Andrea Pešková, Laura Colantoni, Su-Ar Lee, Valeria Arana, Leopoldo Labastía. 2010. Argentinian Spanish Intonation.

[Hajič et al. 2008] Jan Hajič, Silvie Cinková, Marie Mikulová, Petr Pajas, Jan Ptáček, Josef Toman, Zdeňka Urešová. 2008. PDTSL: An Annotated Resource For Speech Reconstruction.

[Johannessen 2015] Janne Bondi Johannessen. 2015. The Corpus of American Norwegian Speech (CANS). 

[Komrsková et al. 2018] Zuzana Komrsková, Marie Kopřivová, David Lukeš, Petra Poukarová, Hana Goláňová. 2018. New Spoken Corpora of Czech: ORTOFON and DIALEKT.

[Kontra and Váradi 1997] Miklós Kontra and Tamás Váradi. 1997. The Budapest Sociolinguistic Interview.

[Köhn et al. 2016] Arne Köhn, Florian Stegen, Timo Baumann. 2016. Mining the Spoken Wikipedia for Speech Data and Beyond.

[Kupisch et al. 2012] Tanja Kupisch, Dagmar Barton, Giulia Bianchi, Ilse Stangen. 2012. The HABLA-Corpus (German-French and German-Italian).

[Pitt et al. 2005] Mark Pitt, Keith Johnson, Elizabeth Hume, Scott Kiesling, William Raymond. 2005. The Buckeye Corpus of Conversational Speech: Labeling Conventions and a Test of Transcriber Reliability.

[Ulloa Saceda et al. 2012] Marta Ulloa Saceda, Conxita Lleó, Izarbe García Sánchez. 2012. Corpora of spoken Spanish by simultaneous and successive German-Spanish bilingual and Spanish monolingual children.

[Verdonik 2018] Darinka Verdonik. 2018. Corpus and database GOS Videolectures.

[Verdonik and Zwitter-Vitez 2012] Darinka Verdonik and Ana Zwitter-Vitez. 2012. Slovenski govorni korpus Gos.