Sign Language Resources

This CLARIN Resource family page lists corpora and lexica for Sign Languages (SL). A substantial number of these can already be found in the CLARIN infrastructure, but they deserve a spotlight of their own.

Sign language (SL) corpus resources contain transcriptions/annotations of spontaneous or elicited dialogues and narratives. All resources are in a video format because of the gestural/spatial-visual modality, a vital characteristic of signed languages (sign languages used by Deaf-blind signers can also be received in the tactile modality). SL corpora are crucial resources for various types of linguistic research, such as lexicography, phonology, syntax, and pragmatics, as well as for language typology.

This page also provides access to SL lexical resources. Some of them are connected to SL corpora. There are also independent lexical resources that were primarily created for language learning and teaching.

This page was constructed by harvesting metadata for SL corpora in three different ways:

  • By making an inventory of the material (datasets and resources) offered by CLARIN K-Centres with expertise in SL.
  • By making an inventory of other datasets in the VLO which may qualify as members of the new CRF by contacting the rights holders.
  • By making an inventory of any other material (e.g., new datasets, annotation tools, manuals) not yet accessible through the CLARIN Infrastructure by sending out questionnaires to SL communities.

For comments, changes to the existing content or inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

 

Sign language resources in the CLARIN infrastructure

Corpora

Corpus Language Description Availability

Adamorobe Sign Language Corpus

Size: 90 MPG1 and 90 MPEG2 clips 
Annotation: transcripts 
Licence: Restricted

Adamorobe Sign Language

The Adamorobe Sign Language Corpus contains almost 36 hours of video recordings of Adamorobe Sign Language, filmed in Adamorobe, Ghana, between 2000 and 2004 by Victoria Nyst. The deposit contains recordings of approximately 20 signers. The 39 original tapes were digitized, cut, compressed and converted into MPG1 and MPEG2 digital clips using the standard settings of the MPI in Nijmegen. The total number of clips is 90 MPG1 and 90 MPEG2. There are 27 complete synchronized ELAN transcriptions in English and in Twi, which is the Akwapim variety of Akan, the spoken language in Adamorobe. The recordings include spontaneous narratives, personal stories and stories about the history of Adamorobe, elicited data, and retellings of cartoons and picture stories.

Download

Balinese Homesign Corpus

 

Balinese

The collection includes sign language data from deaf homesigners in Bali, Indonesia. The data was collected between 2021 and 2023.

The collection is available for download from the Language Archive.

Download

Sign-Hub WP2.4: Life Stories

Size: 200 hours 
Annotation: not annotated yet 
Licence: CLARIN PUB

Catalan Sign Language (LSC), German Sign Language (DGS), Italian Sign Language (LIS), Dutch Sign Language (NGT), Spanish Sign Language (LSE), and Turkish Sign Language (TİD).

This is a collection of datasets connected to the Sign-Hub project. The corpus contains interviews conducted with elderly Deaf signers from five countries on their life experiences as well as a documentary movie based on these interviews. These interviews were conducted in five of the participating countries of the SIGN-HUB project and in six different sign languages: Catalan Sign Language (LSC), German Sign Language (DGS), Italian Sign Language (LIS), Sign Language of the Netherlands (NGT), Spanish Sign Language (LSE), and Turkish Sign Language (TİD). In each country, interviews have been conducted in different geographical areas. The exact number of interviews differs per sign language, but for every sign language, at least 20 interviews have been conducted, with interviewees being between 66 and 97 years of age. Interviews followed a pre-defined questionnaire; however, the addition of country-specific questions was encouraged.

This collection is available for download from the Ortolang repository.

For the relevant publication, see Pfau et al. (2021)

Download

Dogon Sign Language Corpus

Size: 341 clips in MPG1 and MPG2 format 
Annotation: EAF transcripts 
Licence: Restricted

Dogon Sign Language

The Dogon Sign Language Corpus contains 32 hours of video data, recorded in the Dogon area in Mali between 2010 and 2012. These recordings are cut into 341 shorter clips of varying lengths, in MPG1 and MPG2 format. The recordings feature the signing of 41 men and 27 women. The average age of all signers was 30 years. Recordings were made in 13 locations. Following approaches developed in earlier sign language corpora, the following types of data were collected for the Dogon Sign Language Corpus:

  • personal narratives
  • interviews about personal history
  • signed guided tours by deaf signers around the house, fields and nature
  • elicited lexical data
  • reports by the team members of the data collection

Metadata are stored in the sign language format, using the ARBIL editor software. The entire corpus, i.e. the video clips, annotations and metadata, is stored in the DoBeS archive at the Max Planck Institute for Psycholinguistics in Nijmegen.

 

Download

Hotel Review Corpus – Dutch Sign Language

Size: 21,825 words; 3.5 hours 
Licence: CC BY-NC 3.0

Dutch Sign Language (NGT)

This is a multimodal parallel corpus of hotel reviews that were originally written in Dutch and subsequently translated into Dutch Sign Language by 6 professionals, all of whom are deaf translators.

The corpus is available for download from the Dutch Language Institute.

Download

Corpus NGT

Size: 2375 sessions 
Annotation: ID-Glosses, sentence-level translations, and mouth actions for a subset of the sessions 
Licence: CC-BY-NC-SA 3.0 NL

Dutch Sign Language (NGT)

This corpus contains sessions with linked media files and ELAN annotation files (EAF); about 15% of the sessions are glossed and translated.

For the relevant publication, see Crasborn et al. (2008)

Download

The NGT Interactive Corpus

Size: 23 hours 
Annotation: Unannotated 
Licence: Restricted access, unspecified

Dutch Sign Language (NGT)

This corpus contains 15 spontaneous dialogues and multi-participant conversations by deaf signers, 10 of which were recorded in authentic settings such as a deaf club and a bar, and 5 of which were recorded in the lab. In addition, two informal three-party conversations were filmed in which each participant was wearing a mobile eye tracker.

Browse

VISIBASE CORPUS

Size: 32 recordings 
Annotation: yes 
Licence: Restricted

Dutch Sign Language (NGT)

The Visibase corpus is a collection of digitised and described NGT material that was present in the late 1990s at the sign language research groups at the University of Amsterdam and at Leiden University. The project ran from 1996 to 2001 and was based at Radboud University, the University of Amsterdam and Utrecht University.

Download

ECHO

Size: 76 MPEG1 recordings 
Annotation: EAF transcripts 
Licence: CC BY-NC-SA 3.0

Dutch Sign Language (NGT), British Sign Language (BSL), Swedish Sign Language (SSL), German Sign Language (DGS)

The corpus contains recorded sign narrations of five fable stories, a small lexicon, and interviews with the signers for each of the three languages. In addition, there is sign language poetry from BSL, NGT and SSL. Finally, the corpus includes two annotated segments of the Gehörlos So! corpus of German Sign Language (DGS) by Jens Heßmann.

For the relevant publication, see Crasborn et al. (2007)

Download

Corpus-PhD-Fusellier-Souza-2004

Size: 10 discourses with 3 Deaf emerging signers in Brazil
Annotation: partially annotated corpus 
Licence: CC BY-NC-ND 4.0

Emerging Sign Languages (in Brazil)

This is a corpus containing 10 discourses with 3 Deaf emerging signers in Brazil.

The corpus is available for download from the Huma-Num repository (COCOON).

Download

AddictionLink in Finnish Sign Language

Licence: Under negotiation

Finnish Sign Language (FinSL)

This corpus contains written and recorded (audio and video) materials on alcohol, drugs and addictions, on independent change programs, and on a self-assessment test of alcohol use.

Consumer Information in Finnish Sign Language

Licence: Under negotiation

Finnish Sign Language (FinSL)

This corpus contains written and recorded (video) materials offering advice to consumers on matters such as product defects, service-related complaints, cancelling orders and online shopping.

The corpus is available for download from a dedicated webpage.

Browse

Finnish Sign Language Learning Material

Licence: Under negotiation

Finnish Sign Language (FinSL)

This corpus contains written and recorded (audio and video) materials pertaining to Finnish sign language greetings, names of family members, numbers and telling the time, as well as basic verbs and related words.

The corpus is available for download from a dedicated webpage.

Browse

News in Finnish Sign Language

Licence: Under negotiation

Finnish Sign Language (FinSL)

This corpus contains recordings of Finnish news.

The corpus is available for download from a dedicated webpage.

Browse

The Kipo Corpus

Size: 163 minutes 
Licence: CC-BY-NC-SA

Finnish Sign Language (FinSL)

This is a video corpus of the language policy program for the National Sign Languages in Finland, translated by two people who use the sign language as their mother tongue.

The corpus is available for download from the Finnish Language Bank.

Download

Translations of the Bible and of the Church Manual into Finnish Sign Language

Licence: Under negotiation

Finnish Sign Language (FinSL)

This is a video corpus of Bible translations (including The Gospels of John and Luke and the Old Testament, Genesis 1:1-4:16, 6:1-9:17), mass and other religious ceremonies, as well as other religious documents.

The corpus is available for online browsing through a dedicated webpage.

Browse

Belgian Covid Sign Language Corpus (BeCoS Corpus)

Size: 177 hours of speech 
Annotation: speaker diarisation, ASR and post-ASR, punctuation prediction, signer diarisation, sign language identification, sign language keypoint recognition 
Licence: CC BY

Flemish Sign Language (VGT), French Belgian Sign Language (LSFB), French, Dutch

This corpus consists of the entire archive of official press conferences from the Belgian Federal Government concerning the COVID-19 pandemic. The speakers speak mostly Dutch or French and occasionally German, and nearly all speech is accompanied by a deaf signer who interprets live what is said.

The corpus is available for download from the Dutch Language Institute.

Download

Corpus Vlaamse Gebarentaal

Size: 140 hrs / 5 TB 
Annotation: ID-glosses 
Licence: CC BY-NC 3.0

Flemish Sign Language (VGT)

This is a collection of videos in Flemish Sign Language. 120 deaf people contributed to the Corpus VGT as informants. Age, region and gender were taken into account when selecting the informants. The informants were given a series of themes to talk about in pairs: telling a story, making agreements, discussing a theme, telling about their school days, etc. The conversations were recorded on video and edited for each assignment.

The corpus is available for download from the Dutch Language Institute and for browsing through a dedicated website.

Browse

Download

Hotel Review Corpus – Flemish Sign Language

Size: 21,825 words; 4 hours 
Licence: CC BY-NC 4.0

Flemish Sign Language (VGT), Dutch

This is a multimodal parallel corpus of hotel reviews that were originally written in Dutch and subsequently translated into Flemish Sign Language by 6 professionals, all of whom are deaf translators.

The corpus is available for download from the Dutch Language Institute.

Download

Creagest-Acquisition corpus

Size: 50 hours of recording in total. 
Annotation: partially annotated corpus. Elan annotation not accessible online 
Licence: CC BY-NC-ND 4.0

French Sign Language (LSF)

This is a corpus of children's LSF collected from 65 deaf children and 17 deaf adults (control group), with sessions conducted by four deaf interviewers from four different regions of France (4 stimuli, 2 cameras). The corpus comprises 50 hours of recording in total.

A sample of the corpus, consisting of 10 extracts of a Tom & Jerry cartoon narrative filmed with two cameras (20 files), is available for download from the Huma-Num repository.

For the relevant publication, see Balvet et al. (2010)

Download

Creagest-Dialogues corpus

Size: 7 extracts filmed with 3 cameras (21 files). 
Annotation: partially annotated corpus. Elan annotation not accessible online 
Licence: CC BY-NC-ND 4.0

French Sign Language (LSF)

This is a corpus of dialogues between Deaf adults (106 hours of video data): 51 interviews, conducted by four Deaf interviewers from four different regions of France (semi-directive interviews, 3 cameras).

A sample of the corpus, consisting of 7 extracts filmed with 3 cameras (21 files), is available for download from the Huma-Num repository.

For the relevant publication, see Garcia et al. (2013)

Download

LS-Colin corpus

Size: 2 hours 
Annotation: ELAN annotations 
Licence: CC BY-NC-ND 4.0

French Sign Language (LSF)

This is a reference corpus for LSF, recorded in January 2002 in Paris, involving 13 Deaf adults (monologues). The corpus is divided into 5 video files of various lengths. It contains a description in French (some metadata) and a translation in French of narratives and other discourses following the time code. The topics and genres included are: "Le Récit du Cheval" (narrative), "Le Récit des Oiseaux" (narrative), "L'Euro" (argumentative discourse), "La Recette de Cuisine" (cooking recipe), "Le 11 septembre 2001" (argumentative and narrative discourse) and "Le Thème Linguistique" (metalinguistic discourse).

The corpus is available for download from the Huma-Num repository.

For the relevant publication, see Cuxac et al. (2002)

Download

MEDIAPI-SKEL

Size: 368 subtitled videos 
Licence: CC BY-NC 4.0

French Sign Language (LSF)

This is a 2D-skeleton video corpus of LSF with French subtitles. The corpus consists of 368 subtitled videos produced by Média’Pi, a media company producing bilingual content with LSF and written French. The corpus was produced at the Laboratoire d’informatique pour la mécanique et les sciences de l’ingénieur (LIMSI).

From the original videos, 25 body keypoints, 2x21 hand keypoints and 70 face keypoints were extracted using OpenPose. 135 keypoints for every person in every frame of the 368 videos were provided, as well as the associated subtitles in French.

The corpus is available for download from Ortolang.

Download

MOCAP1

Annotation: partially annotated corpus 
Licence: CC BY-NC-ND 4.0

French Sign Language (LSF)

This is a corpus of French Sign Language (LSF) captured with a motion capture system and an HD camera. It was designed with the objective of carrying out multidisciplinary studies in Movement Sciences, Linguistics and Computer Science. The corpus consists of 5 tasks of different natures: description, explanation, narration and translation, performed by 4 speakers (8 for the description task).

The corpus is available for download from the Ortolang repository.

For the relevant publication, see Benchiheub et al. (2016)

Download

Corpus CATTEAU 2020

Size: 11 poems in LSF and 57 translations in French (several versions for each poem) 
Annotation: multimodal annotation, prosodic annotation 
Licence: CC BY-NC-ND 3.0

French Sign Language (LSF), French

This corpus contains eleven poetic works in LSF (French sign language) and their fifty-seven translations into oral French.

The corpus is available for download from the Ortolang repository.

For the relevant publication, see Catteau (2020)

Download

Signes en famille

Size: approx. 10 samples 
Annotation: partially annotated corpus 
Licence: CC BY-NC-ND 3.0

French Sign Language (LSF), French

This is a corpus of spontaneous exchanges between either hearing or deaf children on the one hand and either hearing or deaf parents on the other.

A sample of the corpus is available for download from the Ortolang repository.

For the relevant publication, see Leroy et al. (2009)

Download

DEGELS1

Size: 6 sessions x 4 video files 
Annotation: partially annotated corpus 
Licence: CC BY-NC-ND 4.0

French Sign Language (LSF), Spoken French

The theme of the dialogues is the description of routes and places in Marseille and Aix-en-Provence in France. The corpus is composed of 6 sessions: sessions 1, 2 and 3 are dialogues in French and sessions 4, 5 and 6 are dialogues in LSF. There is a single moderator for French and two moderators for LSF. The recording equipment consisted of 3 cameras and 2 headset microphones for the spoken French part.

Each dyad is composed of a speaker, located on the right of the overview and noted A, and a moderator, located on the left of the overview and noted B. Thus, in session 1, where the participants converse in French, A1 is the speaker on the right of the overview and B1 is the moderator on the left.

For each session there are 4 video files (mp4/AVC): 1 for the speaker, 1 for the moderator, 1 overview giving a profile view of the two participants, and 1 montage of these 3 videos. All the files are synchronised. For the LSF part, there is no sound track in the videos. For the French part, there are 2 sound files (WAV) in addition to the video files, 1 per participant. The first 3 videos do not contain a sound track; only the montage video contains sound, with the speaker on the right in the right channel and the moderator on the left in the left channel.

For the relevant publication, see Braffort and Boutora (2012)

Download

DICTA-SIGN corpus

Size: 10 text files; 25 hours of video (MPEG-4) 
Annotation: partially annotated corpus 
Licence: CC BY-NC-ND 4.0

French, Modern Greek, German, English, Greek Sign Language, British Sign Language (BSL), German Sign Language (DGS), French Sign Language (LSF)

This is a multimedia (video) corpus of four sign languages (British, French, German and Greek), with at least 14 informants per language and a session duration of approx. 2 hours, using the same elicitation materials (scripts and tasks) across languages.

For the relevant publication, see Efthimiou et al. (2010)

Browse

DGS CORPUS

Size: +50 hours 
Licence: Restricted, see here

German Sign Language (DGS)

The DGS Corpus is a collection of German Sign Language (DGS) data from 330 signers from Germany. The 15-year long-term project started in 2009 and is based at the Institute of German Sign Language and Communication of the Deaf at the Universität Hamburg. It is led by Thomas Hanke and Annika Herrmann. The DGS Corpus is used to build the DGS-German dictionary DW-DGS.

For the relevant publication, see the list of publications

Download

Italian Sign Language Corpus

Licence: Restricted, see here

Italian Sign Language (LIS)

The Italian Sign Language Corpus is a collection of Italian Sign Language (LIS) data from 180 signers from Italy. The core part of the project involved three universities: University of Milan-Bicocca, University Ca’Foscari and Sapienza University.

The corpus is available for download from MPI's Language Archive (CLARIAH-NL).

Download

Kata Kolok Child Signing Corpus

Size: Data from four focal deaf children accumulates to 95h 24min (Lutzenberger 2022:282). 
Annotation: Translations in Indonesian and English, ID-glosses linked to the Kata Kolok SignBank 
Licence: Restricted

Kata Kolok (Benkala Sign Language)

This corpus covers spontaneous child-caregiver interactions focused on five deaf and eight hearing children acquiring Kata Kolok natively. Ages range from 4 months to 8;4 (years;months).

The corpus is not freely accessible due to the vulnerable target group. Contact person: Hannah Lutzenberger (h.lutzenberger [at] bham.ac.uk)

For the relevant publication, see Lutzenberger (2022)

Browse

Kata Kolok Corpus

Size: 63.5 hours; data collection ongoing 
Annotation: of the 63.5 hours of video data, roughly 3:52 hours are translated into English and Indonesian, 3:44 hours are glossed, and about 1:45 hours are both translated and glossed. 
Licence: CC BY-NC-SA 4.0

Kata Kolok (Benkala Sign Language)

This corpus includes a wide range of elicited and spontaneous language materials accumulating to 100 hours of video data from generations III to V of adult deaf and hearing signers. Ongoing data collection (as of 2022) is focused on generation III, as they are currently among the eldest KK signers.

For the relevant publication, see de Vos (2016)

Browse

Corpus-PhD-Martinod-2019

Size: 27 minutes 
Annotation: annotated corpus 
Licence: CC BY-NC-ND 4.0

Marajó Sign Language (Brazil)

This is a corpus of sign language practiced in Soure, on the island of Marajó (Brazil, Pará). These data were collected between July and August 2015 and in March 2017.

This corpus is available for download from the Ortolang repository. The videos made available for download represent part of the total corpus of 8 hours and 27 minutes. They consist of elicited stories (9 minutes and 27 seconds) and spontaneous speech (17 minutes and 13 seconds).

Download

Corpus Maurician Sign Language by Univ Paris 8 & INJS

Size: 19 discourses (narratives and other genres) 
Annotation: partially annotated corpus 
Licence: CC BY-NC-ND 4.0

Mauritian Sign Language (LSM)

This is a corpus of 19 discourses (narratives and other genres).

The corpus is available for download from the Huma-Num repository (COCOON).

Download

Polytropon Parallel Corpus

Size: 3,600 sentences 
Annotation: lexical, morphosyntax, semantics, glosses 
Licence: CC BY-NC-SA 4.0

Modern Greek, Greek Sign Language

This is a parallel corpus for the language pair Greek Sign Language (GSL) – Greek. The corpus incorporates sentences performed by a single signer in three repetitions each, captured in front view by means of one HD and one Kinect camera. The corpus was annotated in the iLex annotation environment and provides information for the grammar levels of lexicon, morphology, syntax and semantics, incorporating annotation tiers for gloss, classifier type, shape and semantics, clause type, sentence type and equivalent translation in Greek at sentence level. The corpus consists of 3500 ELAN (.eaf) files.

The corpus is available for download from CLARIN:EL, though access requires registration.

For the relevant publication, see Efthimiou et al. (2018)

Download

"Exhibition Corpus" - Text, Sound, Sign

Size: 23 texts 
Licence: CC ZERO

Norwegian Bokmål, Norwegian Nynorsk, Norwegian, Norwegian Sign Language (NSL)

This corpus contains texts produced during a 2013 exhibition about languages - "Leve Språket". The exhibition aimed at showing the linguistic diversity in Norway, and it covered topics such as language conflict, the understanding of neighbouring languages and linguistic humor. The target audience was teenagers in school, and the texts are formulated accordingly. The texts were translated into Norwegian Sign Language and either Norwegian Bokmål or Nynorsk. The texts were also recorded to serve as an audio guide in the exhibition room.

Download

Norwegian Sign Language Corpus

Size: 8 video clips, 18 minutes 
Annotation: EAF transcripts, ELAN annotations 
Licence: CC BY-NC-SA 4.0

Norwegian Sign Language (NSL)

This corpus consists of data collected in 2007 for the purposes of a doctoral research project about boundary markers in Norwegian Sign Language. Four signers were filmed: two men and two women, both young and old. They are all deaf with deaf parents, siblings, or other family members. They live in central Eastern Norway, and all have gone to the deaf school in the area. The signers were asked to retell a children’s picture book entitled “Frog, Where Are You?” by Mercer Mayer and also to respond to the question “What happened on 9/11 and what did you do?” Video recordings of the signers were made in a studio, and sessions were led by a deaf adult man who is an L1 signer of Norwegian Sign Language. No other people were present during the recordings.

The corpus is available for download from the CLARINO repository.

Download

Hotel Review Corpus – Spanish Sign Language

Size: 20,609 words; 3 hours of videos 
Licence: CC BY-NC 3.0

Spanish Sign Language (LSE), Spanish

This is a multimodal parallel corpus of hotel reviews that were originally written in Dutch, subsequently translated into Spanish and finally into Spanish Sign Language by 6 professionals, all of whom are deaf translators.

The corpus is available for download from the Dutch Language Institute.

Download

Corpus LS Tunisienne (Fadwa Mhimdi)

Size: 10 narrative discourses 
Annotation: ELAN annotations 
Licence: CC BY-NC-ND 3.0

Tunisian Sign Language (TSL)

This is the first scientific corpus of narrative discourses in Tunisian Sign Language (LST) by Deaf adults. The data were filmed in the Tunis region.

The corpus is available for download from the Ortolang repository.

Download

Turkish sign language database

Licence: Restricted, see here

Turkish sign language (TİD)

This corpus collects Turkish sign language (TİD) data. For this project, native, early, and late TİD signers were recorded performing different tasks (narratives of short picture stories/cartoon clips) and engaging in free conversation. These recordings and their annotations are stored in this corpus.

The corpus is available for download from the MPI (CLARIAH-NL distribution).

Download

VIDI Sign Space Corpus

Annotation: EAF transcripts 
Licence: Restricted

Turkish Sign Language (TİD) and German Sign Language (DGS)

This is a corpus of DGS and TİD data collected by the Max Planck Institute for Psycholinguistics under the lead of Asli Özyürek from March 2007 to September 2012.

Download

Lexical resources

Corpus Language Description Availability

Adamorobe Sign Language Lexicon

Size: 250 signs 
Annotation: partial (phonology and iconicity) 
Licence: Restricted

Adamorobe Sign Language

This lexicon contains 250 signs in isolation. For a subset of the signs, encodings about phonological and iconic features are available.

The lexicon is available for download from the MPI Language Archive.

Download

BSL Lexicon CN

Licence: Public

British Sign Language (BSL)

This lexicon was derived from the British Sign Language Corpus and is part of the ECHO case study on sign languages.

The lexicon is available for download from the MPI Language Archive.

Download

CSL lexicon

Annotation: unannotated 
Licence: Restricted

Chinese Sign Language (CSL)

This lexicon demonstrates how a Deaf adult signs a story to Deaf children.

The lexicon is available for download from the MPI Language Archive.

Download

Czech Sign Language Corpus for Recognition – Amateur Signer

Licence: ELRA

Czech Sign Language

This is an amateur sign-language database comprising 25 signs from Czech sign language. 15 signers (4 women and 11 men) carried out 5 repetitions of each sign and were recorded from 3 different views. The first is a frontal view of the upper part of the body. The second one is similar, but with the camera placed about one meter higher than the first one so as to produce a frontal top-view, thus allowing 3D information to be detected. The last view is a frontal-detail view of the speaker's face, thus allowing lip-reading.

Czech Sign Language Corpus for Recognition – Professional Signer

Size: 378 signs 
Licence: ELRA

Czech Sign Language

This lexicon comprises signs performed by 4 everyday sign-language users (4 women, 2 of them deaf). 5 repetitions of each sign were recorded from 3 different views. The first is a frontal view of the upper part of the body. The second one is similar, but with the camera placed about one meter higher than the first one so as to produce a frontal top-view, thus allowing 3D information to be detected. The last view is a frontal-detail view of the speaker's face, thus allowing lip-reading.

For the relevant publication, see ELRA (European Language Resources Association)

 

ECHO NGT lexicon, Male signer

Size: 300 signs 
Annotation: ELAN transcriptions 
Licence: Public

Dutch Sign Language (NGT)

This lexicon forms part of the ECHO case study on sign languages.

The lexicon is available for download from the MPI Language Archive.

Download

ECHO NGT lexicon, Male signer 2

Size: 300 signs 
Annotation: unannotated 
Licence: Public

Dutch Sign Language (NGT)

This lexicon forms part of the ECHO case study on sign languages.

The lexicon is available for download from the MPI Language Archive.

Download

ECHO NGT lexicon, female signer 2

Annotation: unannotated 
Licence: Public

Dutch Sign Language (NGT)

This lexicon forms part of the ECHO case study on sign languages. The signer retells the fable The Shepherd Boy and the Wolf. The source of the retelling is a Dutch version of the fables by author Paul Biegel, consisting of approximately 300 words.

The lexicon is available for download from the MPI Language Archive.

Download

Woordenboek Vlaamse Gebarentaal

Size: 7.5 hours 
Licence: CC BY-NC 3.0

Flemish Sign Language (VGT)

This resource contains the video material of the Dictionary of Flemish Sign Language. Each of the 10,025 videos shows a single sign.

The dictionary is available for download from the Dutch Language Institute.

Download

DICTA-SIGN lexicon

Size: 1000 entries per language (video and text) 
Annotation: annotated, see description 
Licence: CC BY-NC-ND 4.0

French, Modern Greek (1453-), German, English, Modern Greek Sign Language, British Sign Language, German Sign Language, French Sign Language

This is a multilingual lexicon in which concepts are linked to graphically represented signs and accompanying videos showcasing the signing process.

The videos are annotated with HamNoSys ("Hamburg Sign Language Notation System").

The lexicon is available for online browsing via a dedicated interface.

For the relevant publication, see Efthimiou, S-E. Fotinea, et al. (2010)

Browse

NOEMA+

Size: 8,616 lemmas 
Annotation: citation forms, GSL synonyms, usage examples in GSL and Greek, concept clarification in the case of homonymy in Greek 
Licence: Freely accessible

Greek Sign Language (GSL)

This is an online dictionary of lemmas taken from three previously developed resources, namely (i) the NOEMA DB, from which it incorporates 3,000 revised entries, (ii) the GSL segment of the Dicta Sign Corpus, from which it incorporates 2,000 entries, and (iii) the POLYTROPON Parallel Corpus, from which it incorporates 3,616 new entries.

The lexicon is available for online browsing through an interface provided by the CLARIN:EL consortium.

For the relevant publication, see E. Efthimiou, S-E. Fotinea, et al. (2019)

Browse

NOEMA

Size: 3,000 video entries

Greek Sign Language (GSL)

This dictionary contains video-recorded signs paired with Modern Greek translations. The dictionary incorporates explanatory remarks that help non-native GSL users understand the meaning of the sign, while at the same time allowing native GSL signers to enrich their Modern Greek vocabulary. The dictionary allows users to search for a lemma by (i) hand shape, (ii) lemma classification according to syntactic category, or (iii) the alphabetic ordering of the sign translations in Modern Greek.

The dictionary is not available online.

 

ECHO SSL Lexicon, signer LM

Size: 300 signs 
Annotation: unannotated 
Licence: Public

Swedish Sign Language (SSL/STS)

This lexicon forms part of the ECHO case study on sign languages.

The lexicon is available for download from the MPI Language Archive.

Download

Other sign language resources

Corpora

Corpus Language Description Availability

British Sign Language Corpus

Size: BSL video data from 249 deaf signers of BSL 
Annotation: yes

British Sign Language (BSL)

The British Sign Language Corpus is a collection of British Sign Language (BSL) video clips of 249 deaf signers from the UK. The BSL Corpus project, based at the Deafness Cognition and Language Research Centre, University College London, ran from 2008 to 2011 and was led by Adam Schembri. A related dataset is the BSL Signbank.

For the relevant publication, see Schembri, A., et al. (2012)

Download

Corpus of the Danish Sign Language Dictionary

Size: 4.5 hours 
Annotation: ID-glosses and (ideally) sense indicators 
Licence: Only for internal use (= the dictionary staff) and guest researchers

Danish Sign Language (DTS)

This corpus consists of video material from 31 signers of DTS from Denmark. The corpus is used to build a DTS-Danish dictionary. The Danish Sign Language Dictionary project, which is building the corpus, is based at the Bachelor’s Degree Programme in Danish Sign Language and Speech-to-text Interpreter at University College Copenhagen and is led by Mads Jonathan Pedersen and Thomas Troelsgård. The project started in 2014 and is still ongoing.

For the relevant publication, see Kristoffersen and Troelsgaard (poster)

 

IPROSLA

Size: Around 500 hours 
Annotation: Unannotated 
Licence: Restricted end user license for academic use only.

Dutch Sign Language (NGT)

This corpus contains three sets of data. The first is a set of longitudinal data from deaf children of deaf and hearing parents, collected at the UvA since the late 1980s. The second is a new collection of longitudinal data collected at the RU from hearing and deaf children of deaf parents (2008–2020). The third is a set of data collected in an educational context by Nini Hoiting at the Kentalis Guyot school.

Browse

Corpus of Finnish Sign Language

Size: 14 hours 22 minutes 
Annotation: ID-glosses and translations 
Licence: CC BY-NC-SA 4.0

Finnish Sign Language (FinSL), Finnish

The corpus consists of video-recorded conversations and elicited narratives from 21 Finnish Sign Language signers who belong to different age groups and live in different parts of Finland. The signers perform seven fixed tasks:

  • introductions,
  • discussing work/hobbies,
  • narrating short cartoon strips,
  • narrating a video,
  • narrating a story from a picture book,
  • discussing a topic related to the deaf world, and
  • free discussion (e.g. on travel, sports).

All of the video data (14.5 hours by six camera angles) has been annotated for signs and translations. According to the tasks performed by the signers, the corpus has been divided into two subcorpora: one that contains the elicited narratives, and another that contains the conversations.

 

The corpus is available for download from the Meta-Share (FIN-CLARIN Distribution).

For the relevant publication, see Salonen et al. (2020)

Download

Content4All

Licence: CC BY-NC-SA 4.0

Flemish Sign Language (VGT), Swiss-German Sign Language (DSGS)

This is a collection of six datasets recorded and created by the Content4All research project. The datasets are hosted by the University of Surrey and are password-protected. To request download credentials, please contact r.bowden [at] surrey.ac.uk (Richard Bowden).

Download

Corpus LSFB (University of Namur)

Size: 10 hours 
Annotation: ID-glosses 
Licence: CC BY-NC-ND 4.0, see also the conditions

French Belgian Sign Language (LSFB)

This is the first large-scale digital corpus that illustrates the current use of French Belgian Sign Language (LSFB) and all its variations.

It was first conceived for linguistic research. However, this digital library is an unprecedented tool for teachers, students and interpreters, as well as a safeguard of the linguistic and cultural heritage of the Deaf Community.

 

Hungarian Sign Language Corpus

Size: 30 hours (Grammatical Corpus)

Hungarian Sign Language

The Hungarian Sign Language Corpus is a collection of Hungarian Sign Language (HSL) video data from 147 signers from Hungary. Overall, 1,750 hours were recorded. The HSL corpus project ran from 2016 to 2017, was based at the Research Institute for Linguistics at the Hungarian Academy of Sciences, and was led by Csilla Bartha.

For the relevant publication, see Bartha et al. (2016)

 

Signs of Ireland

 

Irish Sign Language (ISL)

The Signs of Ireland Corpus is a collection of Irish Sign Language (ISL) video data from 40 signers from Ireland. The project was based at Trinity College Dublin, took place in 2004 and was led by Lorraine Leeson.

For the relevant publication, see Leeson (2011)

 

PJM Corpus

 

Polish Sign Language (PJM)

This is a corpus of video data from 150 Deaf native signers of Polish Sign Language (PJM).

SIGNOR Corpus of SZJ

Annotation: tokenised, lemmatised, gestural annotation, mouth shape, ID-gloss 
Licence: Not freely accessible. Contact person: Prof. Dr. Špela Vintar, University of Ljubljana, spela.vintar [at] ff.uni-lj.si

Slovene Sign Language (SZJ)

This corpus is available for querying in its transcribed version, which provides an avatar demonstration of each sign. The corpus contains interviews with 80 informants. The entire corpus cannot currently be published due to data protection issues; however, permissions for publication are being collected so that the recordings can be released as well.

CORLSE

Size: 4 hours 52 minutes / 48 recordings 
Annotation: partly annotated

Spanish Sign Language (LSE)

This corpus is intended for the analysis of LSE argument structure, focusing on how signers organize nouns (and noun-like forms) and verbs (and other forms that have a predicative function) to communicate who does what, feels what, talks about what, etc. The aim was not to create a representative and structured corpus, but rather a set of examples that would allow the grammatical description to be based on contextualized uses. Only a part is accessible through the iSignos website. The corpus is annotated as follows: one part (2 hours and 21 minutes) has right-hand and left-hand ID-glosses, glosses for classifiers, Spanish translation, role shift, PoS, argument structure, locus and animacy. The other part has only glosses, Spanish translation and role shift; 16 recordings also include an analysis of the non-manual component.

For the relevant publication, see Pérez et al. (2019)

Concordancer

iSignos

Annotation: annotated for right-hand and left-hand id-glosses and glosses for classifiers and Spanish translations

Spanish Sign Language (LSE)

This corpus consists of a set of video recordings of signers who express themselves in LSE, presented together with the glosses of both hands and the Spanish translation. In the first stage, a set of videos with their corresponding glosses and translations is available; it will be expanded in successive phases. You can consult the list of recordings and select by genre or theme, and also by the sex or age range of the signers. The resource can be useful for anyone who needs this type of linguistic data for their work, for example for class exercises, interpretation practice, language evaluations or research on LSE.

The corpus is available through a dedicated search engine that allows you to explore the corpus and observe the context in which the searched glosses appear.

Browse

Swedish Sign Language Corpus

Size: 24 hours 
Annotation: ID-glosses, PoS tags 
Licence: CC BY-NC-SA 2.5

Swedish Sign Language (STS), Swedish

This is a web-based version of the Swedish Sign Language Corpus, consisting of approximately 93,000 annotated sign tokens. Previously, the corpus was only available through the special-purpose video annotation tool ELAN. The aim of this corpus is to provide a picture of what sign language sentences look like, and also to contribute new signs and variants to the Swedish Sign Language Dictionary. It can also be used to develop teaching materials.

For the relevant publication, see Öqvist et al. (2020)

Concordancer

Tactile Swedish Sign Language Corpus

Size: 4.5 hours 
Annotation: partially annotated corpus 
Licence: CC BY-NC-SA 2.5

Swedish Sign Language (STS), Swedish

This corpus contains dialogues and elicited narratives with 9 deafblind informants. The entire corpus cannot currently be published due to data protection issues; however, some parts are available through the STS-korpus. The project was funded by the Mo Gård Research Fund.

Concordancer

Giving Recognition a Hand Corpus

Size: 84 videos 
Licence: Restricted, see here

Turkish Sign Language (TİD), Dutch Sign Language (NGT)

This is a multilingual corpus of Turkish Sign Language (TİD) and Dutch Sign Language (NGT) as well as Turkish and Dutch data. It contains 84 video files of signers and speakers from Istanbul and Nijmegen. The project was based at the Max Planck Institute for Psycholinguistics, Centre for Language Studies.

The corpus is available for download from a dedicated webpage.

Download

Lexical resources

Corpus Language Description Availability

Adamorobe Sign Language Lexicon

Size: 250 signs 
Annotation: partial (phonology and iconicity) 
Licence: Restricted

Adamorobe Sign Language

This lexicon contains 250 signs in isolation. For a subset of the signs, encodings of phonological and iconic features are available.

The lexicon is available for download from the MPI Language Archive.

Download

BSL Lexicon CN

Licence: Public

British Sign Language (BSL)

This lexicon was derived from the British Sign Language Corpus and is part of the ECHO case study on sign languages.

The lexicon is available for download from the MPI Language Archive.

Download

CSL lexicon

Annotation: unannotated 
Licence: Restricted

Chinese Sign Language (CSL)

This lexicon demonstrates how a Deaf adult signs a story to Deaf children.

The lexicon is available for download from the MPI Language Archive.

Download

Czech Sign Language Corpus for Recognition – Amateur Signer

Licence: ELRA

Czech Sign Language

This is an amateur sign-language database comprising 25 signs from Czech Sign Language. 15 signers (4 women and 11 men) carried out 5 repetitions of each sign and were recorded from 3 different views. The first is a frontal view of the upper part of the body. The second is similar, but with the camera placed about one meter higher than the first so as to produce a frontal top view, which allows 3D information to be detected. The last is a frontal detail view of the speaker's face, which allows lip-reading.

Czech Sign Language Corpus for Recognition – Professional Signer

Size: 378 signs 
Licence: ELRA

Czech Sign Language

This lexicon comprises signs performed by 4 everyday sign-language users (4 women, 2 of them deaf). 5 repetitions of each sign were recorded from 3 different views. The first is a frontal view of the upper part of the body. The second is similar, but with the camera placed about one meter higher than the first so as to produce a frontal top view, which allows 3D information to be detected. The last is a frontal detail view of the speaker's face, which allows lip-reading.

For the relevant publication, see ELRA (European Language Resources Association)

 

ECHO NGT lexicon, Male signer

Size: 300 signs 
Annotation: ELAN transcriptions 
Licence: Public

Dutch Sign Language (NGT)

This lexicon forms part of the ECHO case study on sign languages.

The lexicon is available for download from the MPI Language Archive.

Download

ECHO NGT lexicon, Male signer 2

Size: 300 signs 
Annotation: unannotated 
Licence: Public

Dutch Sign Language (NGT)

This lexicon forms part of the ECHO case study on sign languages.

The lexicon is available for download from the MPI Language Archive.

Download

ECHO NGT lexicon, female signer 2

Annotation: unannotated 
Licence: Public

Dutch Sign Language (NGT)

This lexicon forms part of the ECHO case study on sign languages. The signer retells the fable The Shepherd Boy and the Wolf. The source of the retelling is a Dutch version of the fables by author Paul Biegel, consisting of approximately 300 words.

The lexicon is available for download from the MPI Language Archive.

Download

DICTA-SIGN lexicon

Size: 1000 entries per language (video and text) 
Annotation: annotated, see description 
Licence: CC BY-NC-ND 4.0

French, Modern Greek (1453-), German, English, Modern Greek Sign Language, British Sign Language, German Sign Language, French Sign Language

This is a multilingual lexicon in which concepts are linked to graphically represented signs and accompanying videos showcasing the signing process.

The videos are annotated with HamNoSys ("Hamburg Sign Language Notation System").

The lexicon is available for online browsing via a dedicated interface.

For the relevant publication, see E. Efthimiou, S-E. Fotinea, et al. (2010)

Browse

NOEMA+

Size: 8,616 lemmas 
Annotation: citation forms, GSL synonyms, usage examples in GSL and Greek, concept clarification in the case of homonymy in Greek 
Licence: Freely accessible

Greek Sign Language (GSL)

This is an online dictionary of lemmas taken from three previously developed resources, namely (i) the NOEMA DB, from which it incorporates 3,000 revised entries, (ii) the GSL segment of the Dicta Sign Corpus, from which it incorporates 2,000 entries, and (iii) the POLYTROPON Parallel Corpus, from which it incorporates 3,616 new entries.

The lexicon is available for online browsing through an interface provided by the CLARIN:EL consortium.

For the relevant publication, see E. Efthimiou, S-E. Fotinea, et al. (2019)

Browse

NOEMA

Size: 3,000 video entries

Greek Sign Language (GSL)

This dictionary contains video-recorded signs paired with Modern Greek translations. The dictionary incorporates explanatory remarks that help non-native GSL users understand the meaning of each sign, while also allowing native GSL signers to enrich their Modern Greek vocabulary. Users can search for a lemma by (i) hand shape, (ii) syntactic category, or (iii) the alphabetic ordering of the sign translations in Modern Greek.

The dictionary is not available online.

 

ECHO SSL Lexicon, signer LM

Size: 300 signs 
Annotation: unannotated 
Licence: Public

Swedish Sign Language (SSL/STS)

This lexicon forms part of the ECHO case study on sign languages.

The lexicon is available for download from the MPI Language Archive.

Download

Colophon

This resource family page was created by a working group of representatives from CLARIN Knowledge Centres with expertise in SL resources:

K-Centre ACE:  https://ace.ruhosting.nl 

K-Centre :EL  slt.ilsp.gr / https://www.clarin.gr/en/kcentre 

K-Centre CLARIN-SMS  https://sweclarin.se/eng/centers/stockholm

Contact person: henk.vandenheuvel [at] ru.nl (Henk van den Heuvel), K-Centre ACE