
Rap Lyrics Corpora: Comparing Tools, Merging Data and Presenting in Paris

Submitted by Karina Berger on

By Alena Němcová Polická

Since February 2023, preparations have been underway to reshape the RapCor database of the DIGITALIA MUNI ARTS infrastructure, a local node of LINDAT/CLARIAH-CZ. The aims are to enable long-term storage, to open access to the research data, and to create a new repository system for francophone rap albums and lyrics. What makes the project specific is that it is not only a database of linguistic data on French (focused mainly on the collection and the morpho-semantic and lexicographic description of substandard expressions, and on the lemmatisation of graphically unstable sociolectisms and dynamically evolving neologisms), but also a repository of dematerialised musical media for these textual materials.


The term 'rap' in a central place on the museum's glass roof.

The main characteristic of the database's actual and potential user base is transdisciplinarity: the database could provide data for political scientists, sociologists, linguists and didacticians. This makes it difficult to reconcile the expectations of scholars who call for a large quantity of meta-information with those of text-oriented scholars, who are concerned with the qualitative cleaning of a database that has grown and specialised in step with the dynamic development of corpus linguistics over the 15 years of the corpus' existence (RapCor (

I applied for the CLARIN Mobility Grant in order to visit Université Paris Cité with two main objectives. My first aim was to present the project's achievements to linguists and sociologists in Paris, and to show and discuss possible use cases of the corpus. Thus, on 8 April, I met linguist Clara Romero from Université Paris Cité; on 9 April, I visited CNRS researcher and long-time collaborator, sociologist Karim Hammou (Academy of Sciences), at the CRESPPA laboratory (site Pouchet); and on 11 April, I discussed the protocol for beta-testing the programmed interface with another collaborator, musicologist Juliette Hubert from the Université Polytechnique Hauts-de-France.

During my stay in Paris, I was also able to visit public media libraries to collect missing data and further expand the database. On 10 April, I visited the Paris Musical Media library and the La Place Hip Hop Library and, on 13 April, thanks to my colleague from ENS/EHESS, Anne-Caroline Fiévet, the media library in Créteil in the southern suburbs of Paris. In addition, I was given access to the extensive personal media library of CNRS researcher and key stakeholder Karim Hammou, who has long been linked to the project as an external consultant.


Selfie of hip-hop scholars and colleagues Hammou, Hubert, de Courson and Němcová Polická (left to right) after the ENS meeting on 12 April 2024.

However, the second and main aim of this Mobility Grant visit to Paris was a working meeting on 12 April at the École Normale Supérieure (ENS, Paris) on the automation of data collection and the detection of new data. The meeting was with PhD student Benoît de Courson, who programmed the domain-specialised crawler Gallicagram ( and shared with us (Pavel Rychlý from Masaryk University joined online) tools capable of mining data to speed up the development of the textual part of the data collection. The main task of the meeting was to merge our systems for collecting and updating textual data and metadata. The result of this collaborative work is a corpus named RapCor boosted 1. It has been incorporated into our LINDAT pipelines, and links to it will be available soon via RapCor ( We also plan to list the RapCor corpus in the CLARIN Virtual Language Observatory (VLO).

A week of networking with Parisian colleagues was rounded off, before I headed to the airport, with a pleasant trip to the newly opened French language museum in Villers-Cotterêts, a surprise visit offered by my colleague A.-C. Fiévet.

The CLARIN Mobility Grant allowed me to collect interesting information, knowledge and data, to strengthen ongoing cooperation, and to share the results of our work as part of LINDAT/CLARIAH-CZ.