Multimodal Corpora

Multimodal corpora are data collections used to study how two or more modalities interface with one another in human communication. In this sense, multimodal corpora are often collections of video and speech recordings accompanied by transcriptions and gesture annotations, although multimodal corpora of textual data supplemented with images exist as well. Such corpora can be used for “the exploration of a range of lexical, prosodic and gestural features of conversation, and for investigations of the ways in which these features interact in real, everyday speech” (Abuczki and Baiat Ghazaleh 2013: 88).

The CLARIN infrastructure offers 17 multimodal corpora, 13 of which are monolingual (English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Slovenian, and Zulu). These corpora are richly annotated for various verbal and non-verbal elements of communication, such as body gesture, gaze direction, and head, eye, and lip movement.

For comments, changes to the existing content, or the inclusion of new corpora, send us an email: resource-families [at] (email).


The Multimodal Corpora in the CLARIN Infrastructure

Video-Audio Corpora

Corpus Language Description Availability

IFA Dialog Video corpus

Size: 5 hours
Annotation: functional annotation of dialogue utterances, annotated gaze direction
Licence: GNU General Public License


This corpus contains annotated video recordings of friendly Face-to-Face dialogues. It is modelled on the Face-to-Face dialogues in the Spoken Dutch Corpus (CGN). The procedures and design of the corpus were adapted to make this corpus useful for other researchers of Dutch speech. For this corpus 20 dialogue conversations of 15 minutes were recorded and annotated, in total 5 hours of speech. To stay close to the Face-to-Face dialogues in the CGN, pairs of well-acquainted participants were selected, either good friends, relatives, or long-time colleagues. The participants were allowed to talk about any topic they wanted.

The corpus is available for download from a dedicated webpage (hosted by CLARIAH-NL).

For a relevant publication, see van Son et al. (2008).

MPI ESF Corpus

Dutch, English, French, German, Swedish

This corpus was built under the ESF Foreign Language Speakers project. It contains many annotated audio recordings of multimodal interaction.

Eye-tracking in Multimodal Interaction Corpus

Licence: restricted

English

The corpus is available for download from the Language Archive (CLARIAH-NL).

TV News Corpus

Size: 30 hours
Licence: CC-BY-SA


This corpus contains video and audio recordings and their transcriptions.

The corpus is available for download from (CELR distribution).


Corpus d'interactions dialogales

Size: 8 hours
Annotation: prosody, interpausal units, gestures, syntax


A demo version of this corpus is available for download (videos and transcriptions) from the ORTOLANG repository.

For a relevant publication, see Bertrand et al. (2008).




BAS SmartWeb Video

Size: 36 hours
Annotation: orthography, phonology, speaker turn, noise, prosody, gaze direction


The corpus contains a collection of user queries to a naturally spoken Web interface with the main focus on the soccer world series in 2006. The recordings include 156 field recordings using a hand-held UMTS device (one person, SmartWeb Handheld Corpus SHC), 99 field recordings with video capture of the primary speaker and a secondary speaker (SmartWeb Video Corpus SVC) as well as 36 mobile recordings performed on a BMW motorbike (one speaker, SmartWeb Motorbike Corpus SMC).

The corpus is available for download from the BAS CLARIN-D repository.

For a relevant publication, see Mögele et al. (2006).


Natural Media Motion-Capture Corpus

Size: 3 hours
Annotation: gesture types, meta-information about encoding (e.g., difficult to encode)


This corpus contains data from 18 participants, whose task was to describe nine objects each to an experimenter, without using everyday vocabulary about forms, sizes or objects. The participants were recorded on audio and several video cameras, and their hand movements were recorded using an optical VICON motion capture system.

The corpus is available for download from the BAS CLARIN-D repository.


BAS SmartKom Public Video and Gesture corpus

Size: 15 hours

Annotation: orthography, phonology, speaker turn, noise, prosody, emotion, hand gesture, facial expression


German

This corpus contains multimodal recordings of 86 actors who use the SmartKom system. SmartKom Public is comparable to a traditional public phone booth but equipped with additional intelligent communication devices. Naive users were asked to test a 'prototype' for a market study, not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks within 4.5 minutes while they were left alone with the system. The instruction was kept to a minimum; the user only knew that the system could understand speech, gestures, and even facial expressions, and should communicate more or less like a human.

Bielefeld Speech and Gesture Alignment Corpus

Size: 9881 isolated words, 1764 gestures
Annotation: alignment of speech and gestures

German, English

This corpus contains 25 dialogues between 50 interlocutors, who engage in a spatial communication task combining direction-giving and sight description. The stimulus is a model of a town presented in a Virtual Reality (VR) environment. Upon finishing a “bus ride” through the VR town along five landmarks, a router explained the route as well as the wayside landmarks to an unknown and naive follower.

The corpus is available for download from the BAS CLARIN-D repository.

For a related publication, see Lücking et al. (2013).


Multimodal and multiparty corpus of text comprehension interactions

Annotation: orthographic transcription, gaze/head/eye/lip movements
Licence: CC BY-NC-SA


This corpus contains reading comprehension exercises in a high school setting involving 2 high school students and their teacher. The goal of the sessions is to represent how the interaction between a teacher and more than one student is performed: what the structure of the conversation is, how turn-taking is coordinated, and which multimodal feedback and attention signals the speakers employ.

The corpus is available for download from CLARIN:EL.

For a relevant publication, see Koutsombogera et al. (2016).


Hungarian Multimodal Corpus

Size: 50 hours
Annotation: non-verbal and verbal elements of communication
Licence: open and restricted



This corpus contains video and audio recordings of conversations divided into two major parts: a simulated job interview and a guided dialogue about personal topics. The participants are university students (54 females, 67 males), mostly with the same interviewer in both scenarios.

The corpus is available for online browsing through the MTA RIL Language Archive Server (HUN-CLARIN distribution) and for download from the Language Archive (CLARIAH-NL).

For a relevant publication, see Pápay et al. (2011).




PoliModal Corpus

Size: 100,870 tokens

Annotation: utterance phenomena, gesture annotations (facial, hand, body posture)

Licence: CC BY-NC-SA 4.0

This corpus includes the transcripts of 56 TV face-to-face interviews (14 hours total) taken from several broadcasts of the Italian political talk show Mezz'ora, from 24 September 2017 to 14 January 2018, aired on the Rai 3 channel.

The audio signal has been transcribed using a semi-supervised speech-to-text methodology (Google + manual correction). Annotation has been done using XML as the markup language and following the standard for Speech Transcripts in terms of utterances.

The corpus is available for download from the ILC4CLARIN repository.

For a related publication, see Trotta et al. (2019) and Trotta et al. (2020).

Multimodal corpus EVA 1.0

Size: 57 minutes
Annotation: MSD-tagged, non-verbal and verbal elements of communication
Licence: CC BY-NC-SA 4.0



This corpus contains one episode of an audio/video session plus corresponding orthographic transcriptions with a duration of 57 minutes. The multi-party spontaneous discourse in the recording is from an entertaining evening TV talk show, A si ti tut not padu, broadcast by the POP-TV Slovene commercial TV station in 2008, and represents a part of the Slovene spoken corpus GOS.

In addition to the original transcription and morphosyntactic annotation from the GOS corpus, the following layers of information are added:

  • statement sentiment
  • phrase breaks within statements
  • prominence of statements
  • sentences within the statement
  • sentence sentiment
  • sentence type
  • speaker visibility on the scene
  • gesture units
  • gesture phrases
  • emotions
  • semiotic intent
  • dialogue role

The corpus is available for download from the CLARIN.SI repository.

For a relevant publication, see Mlakar et al. (2019).


Video-linked Thai/Swedish child data corpus

Annotation: video-transcription alignment, word segmentation, phonetic transcription

Swedish, Thai

This corpus consists of 60 transcripts from interactions in everyday contexts between 6 children and their caregivers (10 transcripts per child), recorded longitudinally, for the period when the children are 18 to 27 months of age. All six children are growing up in middle class environments, in Sweden and Thailand (Bangkok area) respectively. The videos of the corpus are linked to the transcripts, on an utterance-by-utterance basis using the software CLAN (MacWhinney 2020).

The corpus is available for online browsing (CLARIN K-Centre Lund University Humanities Lab).

For a relevant publication, see Zlatev et al. (2006).

Unisa isiZulu Video Corpus

Zulu

The corpus is unavailable.

Text-Image Corpora

Corpus Language Description Availability

A Multimodal Corpus of Tourist Brochures Produced by the City of Helsinki, Finland (1967-2008)

Size: 58 double pages
Annotation: content, layout, graphic, typographic appearance, rhetorical structure


This corpus contains tourist brochures produced by the city of Helsinki, Finland, and is fully annotated using the XML schema provided for the Genre and Multimodality (GeM) model (Bateman 2008).

The corpus is available for download from FIN-CLARIN.


Hindi Visual Genome 1.0

Size: 32,925 items, 32,535 images, 32,925 sentences, 322,000 words
Licence: CC BY-NC-SA 4.0


Hindi, English

This corpus contains short English segments (captions) from Visual Genome along with associated images. The English texts are automatically translated to Hindi with manual post-editing, taking the associated images into account.

The corpus is available for download from the LINDAT repository.

For a relevant publication, see Parida et al. (2019).



[Abuczki and Baiat Ghazaleh 2013] Ágnes Abuczki and Esfandiari Baiat Ghazaleh. 2013. An overview of multimodal corpora, annotation tools, and schemes. Argumentum, 9: 86–98.

[Allwood 2008] Jens Allwood. 2008. Multimodal corpora. In Corpus linguistics: an international handbook (Vol 1), edited by A. Lüdeling and M. Kytö, 207–225.

[Bateman 2008] John A. Bateman. 2008. Multimodality and Genre. London: Palgrave Macmillan.

[Bertrand et al. 2008] Roxane Bertrand, Philippe Blache, Robert Espesser, Gaëlle Ferré, Christine Meunier, Béatrice Priego-Valverde, and Stéphane Rauzy. 2008. Le CID - Corpus of Interactional Data - Annotation et Exploitation Multimodale de Parole Conversationnelle. Traitement Automatique des Langues, 49 (3): 105–134.

[Koutsombogera et al. 2016] Maria Koutsombogera, Miltos Deligiannis, Maria Giagkou, and Harris Papageorgiou. 2016. Towards Modelling Multimodal and Multiparty Interaction in Educational Settings. In Toward Robotic Socially Believable Behaving Systems, edited by A. Esposito and L. Jain, vol. 106.

[Lücking et al. 2013] Andy Lücking, Kirsten Bergmann, Florian Hahn, Stefan Kopp, and Hannes Rieser. 2013. Data-based analysis of speech and gesture: the Bielefeld Speech and Gesture Alignment corpus (SaGA) and its applications. J Multimodal User Interfaces, 7: 5–18.

[MacWhinney 2020] Brian MacWhinney. 2020. Tools for Analyzing Talk Part 2: The CLAN Program.

[Mlakar et al. 2019] Izidor Mlakar, Darinka Verdonik, Simona Majhenič, and Matej Rojc. 2019. Towards Pragmatic Understanding of Conversational Intent: A Multimodal Annotation Approach to Multiparty Informal Interaction – The EVA Corpus. In SLSP 2019: Lecture Notes in Computer Science, edited by C. Martín-Vide, M. Purver, and S. Pollak, 19–30.

[Mögele et al. 2006] Hannes Mögele, Moritz Kaiser, and Florian Schiel. 2006. SmartWeb UMTS Speech Data Collection. The SmartWeb Handheld Corpus. In Proceedings of LREC2006, 2106–2111.

[Pápay et al. 2011] Kinga Pápay, Szilvia Szeghalmy, and István Szekrényes. 2011. HuComTech Multimodal Corpus Annotation. 

[Parida et al. 2019] Shantipriya Parida, Ondřej Bojar, and Satya Ranjan Dash. 2019. Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation. 

[Trotta et al. 2019] Daniela Trotta, Sara Tonelli, Alessio Palmero Aprosio, and Annibale Elia. 2019. Annotation and Analysis of the PoliModal Corpus of Political Interviews. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019).

[Trotta et al. 2020] Daniela Trotta, Alessio Palmero Aprosio, Sara Tonelli, and Annibale Elia. 2020. Adding Gesture, Posture and Facial Displays to the PoliModal Corpus of Political Interviews. In Proceedings of LREC2020, 4320–4326. 

[van Son et al. 2008] R. J. J. H. van Son, Wieneke Wesseling, Eric Sanders, and Henk van den Heuvel. 2008. Promoting free Dialog Video Corpora: The IFADV Corpus Example. In International LREC Workshop on Multimodal Corpora: MMCorp 2008: Multimodal Corpora, edited by M. Kipp et al., 18–37.

[Zlatev et al. 2006] Jordan Zlatev, Mats Andrén, and Soraya Osathanonda. 2006. A video-linked Thai/Swedish child data corpus: A tool for the study of comparative semiotic development.