A Recap on CLARIN Café: Bilingual and Multilingual Corpora

Submitted by e.gorgaini@uu.nl on 24 May 2022


The CLARIN Café on Bilingual and Multilingual Corpora took place on 29 April 2022. More than forty participants from various countries and organisations attended the online event. The event was organised by Eva Soroli, Thomas Gaillat and Franck Cinato, and was divided into two parts:

  • Part I: The CLARIN infrastructure and the new French CLARIN Knowledge Centre CORLI, dedicated to providing expertise in corpus linguistics and the languages spoken in France, and to supporting academic communities through actions towards FAIR and Open data.
  • Part II: Examples of parallel, comparable and dialectal corpora (new or already published), together with demonstrations of how to collect/build, annotate, explore, analyse and archive such corpora in an interoperable way.

Part I: Introduction to CLARIN and Its Knowledge Centres

After a short introduction to CLARIN and its services by Eva Soroli (CLARIN ambassador and associate professor at the University of Lille, France), the CORLI team coordinators presented the new French CLARIN Knowledge Centre CORLI and the new directions of the consortium. The CORLI speakers were Christophe PARISSE, INSERM researcher in cognitive and computer sciences working at the University of Nanterre (France) in the domains of corpus linguistics, language development, language change and language pathology, and Céline POUDAT, associate professor of linguistics and discourse analysis at the University of Côte d’Azur (France). They presented the CORLI consortium, which involves members from more than 20 research labs and 15 universities, is part of the French infrastructure Huma-Num and has been a certified CLARIN K-centre since 2020. They also discussed its recent national developments and the progress made in building a sustainable national consortium similar to what European Research Infrastructure Consortia are doing on a European scale.

The speakers described the development of the CORLI K-centre, its scope and its organisation into working groups (e.g., the consortium’s Multilingualism working group), and its intuitive and interactive online platform, which centralises and offers both proactive and reactive services about available language resources, databases and repositories, training opportunities and best research practices. The speakers also discussed the new directions of the consortium: actions towards the development of a collaborative annotation platform, solutions for the sustainable citation of corpora, and the creation of an open reference corpus for the French language. The large number of participants shows that the centre’s topics and services are of great interest and relevance to other CLARIN national initiatives, as well as to researchers and professionals from other research infrastructures and communities (data scientists, engineers, educators, etc.).

Watch the recordings of the opening of the CLARIN Café and of CORLI: the French CLARIN Knowledge-Centre on the CLARIN YouTube channel.

Part II: Parallel, Comparable and Dialectal Bilingual/Multilingual Corpora

Bilingual, multilingual and multidialectal corpora are very common in language studies and relevant to researchers working, among other domains, in historical linguistics, endangered languages, language acquisition, language variation, dialectology and typology studies. Their collection, annotation, analysis and sustainability are a major challenge for everyone involved in comparative work, research data exploration, data management, etc.

Maximilien GUÉRIN, postdoctoral researcher at the CNRS research unit HTL at the University of Paris and typologist specialising in morphology and syntax, presented a multidialectal corpus of the Crescent dialects, highlighting collection, exploitation and analysis issues.

Situated in northern Limousin and Auvergne (France), the linguistic Crescent is an area where local Gallo-Romance varieties simultaneously display typical Oïlic and Occitan features. One of the main aims of this research project is to collect, exploit and analyse this multidialectal corpus before its varieties, now highly endangered, fall into oblivion.

M. Guérin described how the corpus is made available and accessible to both local communities and researchers. This corpus mainly contains raw linguistic data, either written or spoken: lexical items, morphological paradigms, original texts (belonging to various genres), translations (in particular of The Little Prince) and audiobooks. All these resources are associated with metadata providing information about the informants and the context in which the data were collected. The corpus has already been used for different kinds of work: grammatical descriptions, linguistic maps (for linguistic comparisons and variational approaches), morphological analysis (hierarchical clustering), phonetic comparison (mel-frequency cepstral coefficients), etc. This work contributes to the language sciences by providing a considerable amount of data for under-described linguistic varieties, a large typological parallel corpus, and new elements for investigation in Romance linguistics.

M. Guérin shared his work in progress and perspectives for future developments: the documentation of new varieties, the translation of more versions of The Little Prince, the recording of more audiobooks, and the preservation of the corpus on a long-term deposit platform, namely Cocoon (Digital Oral Corpus COllections), a service-providing CLARIN C-centre.

In the next presentation, Annemarie VERKERK and Luigi TALAMO, typologists from Saarland University (Germany) who work in corpus linguistics and specialise in phylogenetic methods, presented their parallel Corpus of Indo-European Prose Plus, CIEP+ (/kiːp plʌs/), a project currently in development. The CIEP+ corpus aims to include translations of 18 literary works in 43 languages: a balanced sample of 33 Indo-European languages and 10 non-Indo-European languages. As the corpus name suggests, all the texts are prose, belonging to the fiction and epistolary genres.

A. Verkerk and L. Talamo described how the corpus has been collected, as well as its structure and layers of annotation. The ultimate scientific aim of their project is to use CIEP+ to investigate information status and word order variability from an information-theoretical perspective. The main focus of the talk was therefore on the need to annotate information status, dependency structure, surprisal and more, using a combination of automated tools and human annotation, including sentence/word alignment, Universal Dependencies parsing, and crowd-sourced annotation for information status.

The speakers elaborated on the corpus-building process and its challenges (especially copyright and issues related to long-term storage and sharing), and discussed future perspectives on a shareable version of the corpus, possibly through the CLARIN VLO or CRF platform.

Watch the recording of the Building CIEP+, the Parallel Corpus of Indo-European Prose Plus talk on the CLARIN YouTube channel.

The last speaker and co-organiser of this event was Thomas GAILLAT, associate professor of corpus linguistics at the University of Rennes (France). He works at the intersection of natural language processing, corpus linguistics and machine learning. His current research mostly focuses on language acquisition questions and the development of tools that automatically extract and visualise linguistic profiles in texts written by learners of English. His talk covered the issue of storing a comparable learner corpus in a data repository. Comparable corpora are made up of many files, which need to be accessed in an orderly manner in order to extract coherent datasets. Subsequent analyses can then be conducted to make comparisons between speakers of different L1s or L2s.

T. Gaillat illustrated the issue with the Corpus InterLangue (CIL), a learner corpus of L2 French and English. He showed that the corpus storage architecture now supports online extractions and comparisons in both languages. Based on the Huma-Num Nakala infrastructure and with the use of R scripts, it is possible to extract corpus items, annotate texts automatically and create datasets supporting comparisons.

Comparability is ensured in several stages. Firstly, the French and English subsets of the corpus were collected on the same basis, i.e. identical tasks, similar proficiency, same file types and same metadata types. Secondly, the corpus data were formatted following the same transcription protocol and the same data formats (WAV for audio recordings, XML and TXT for transcriptions and CSV for metadata). Finally, the queries can be conducted with scripts that apply consistent extraction based on identical metadata information. The scripts also include automated linguistic annotation with UDPipe, providing French and English texts with Universal Dependencies and part-of-speech annotation. The scripts can be modified, as they are distributed under a Creative Commons licence.
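The idea of consistent, metadata-based extraction can be illustrated with a minimal sketch. The scripts presented in the talk are written in R; the Python example below is only an illustration and is not taken from the project. The column names (file_id, l2, task, proficiency), the file identifiers and the values are all hypothetical. The point it shows is that the same selection criteria are applied to every language subset, so the resulting groups stay comparable.

```python
import csv
import io

# Hypothetical metadata table. Column names and values are illustrative
# assumptions and do not come from the actual corpus.
METADATA_CSV = """file_id,l2,task,proficiency
fr_001,French,picture_description,B1
fr_002,French,interview,B2
en_001,English,picture_description,B1
en_002,English,picture_description,C1
en_003,English,interview,B1
"""


def load_metadata(text):
    """Parse CSV metadata into a list of dictionaries, one per corpus file."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_comparable(rows, task, proficiency):
    """Select files per L2 using identical task and proficiency criteria."""
    subsets = {}
    for row in rows:
        if row["task"] == task and row["proficiency"] == proficiency:
            subsets.setdefault(row["l2"], []).append(row["file_id"])
    return subsets


rows = load_metadata(METADATA_CSV)
comparable = extract_comparable(rows, task="picture_description", proficiency="B1")
print(comparable)
```

Because a single function filters both subsets, any change to the criteria automatically applies to French and English alike, which is the property the comparability design relies on.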

This CLARIN Café offered the perfect space to encourage discussions regarding corpora and multilinguality. The event presented some national and international initiatives in the domain and highlighted the emergence of three new corpus projects, including their purposes and features. Thanks to the presence of researchers from all around the world (27 participants from Europe, two from South America, two from Canada, two from the United States, four from Africa and three from Asia), presentations and discussions provided new insights into the specifics of multilingual and multidialectal corpora and underlined the need for common practices in the domain of multilingual corpus building and management.

Additional information on this CLARIN Café and the slides of the event are available on the event page.