CLARIN Café - Creating pedagogical corpora with annotation of sensitive content and offensive language - the CrowLL project


General Information

This CLARIN Café is organised by Tanara Zingano Kuhn, Špela Arhar Holdt, Carole Tiberius, Kristina Koppel, Rina Zviel Rishin and Iztok Kosem.

The CLARIN host is Henk van den Heuvel.

  • Date: 04/04/2024
  • Time: 14:00 - 16:00 (CEST)
  • Venue: CLARIN virtual Zoom meeting
  • Twitter hashtag: #CLARINcafe
A full overview of planned CLARIN Café sessions can be found on the CLARIN Café page.


Evidence of authentic language use is fundamental for language teaching and learning. One way to develop authentic language learning materials is through the use of examples from corpora. However, these corpora might include sensitive content or offensive language, in addition to exhibiting structural problems. Although such use is unquestionably authentic, it is recommended that these corpora be carefully reviewed before they are used in education, so that inappropriate content is flagged and the choice of whether to use particular examples is left to teachers and developers of didactic materials, according to their needs and context. In other words, the main idea is that corpora for pedagogical purposes should not be cleaned, but rather labelled with categories that indicate potential problems.

The main goal of the CrowLL project was to create manually annotated pedagogical corpora that can be used by lexicographers, language teachers, and researchers. The languages were Brazilian Portuguese, Dutch, Estonian, and Slovene. Corpus sentences are annotated as “problematic” or “non-problematic” from the point of view of their use for pedagogical purposes. Sentences labelled as problematic also have annotations defining the category of the problem (offensive, vulgar, sensitive content, grammar/spelling problems, incomprehensible/lack of context). For each language, the corpus consists of 10,000 sentences, annotated by language experts. These corpora, together with annotation guidelines in each language and in English, are available on PORTULAN CLARIN.

Additionally, we have developed a gamified solution for further corpus growth, in which these annotated corpora serve as “seed corpora” to start the crowdsourcing-supported development of larger corpora. The CrowLL (Crowdsourcing for Language Learning) game is a multi-level, multi-language digital game in which the crowd helps with the task of manually annotating corpora. In this game, players identify problematic examples automatically extracted from existing corpora, categorise them, and point out the part of the sentence that is problematic. The code for gamified annotation is published on GitHub as open access under the Apache 2.0 licence.

In the future, researchers wanting to create such annotated corpora for their language can choose the expert approach (the annotation guidelines), crowdsourcing (the game), or both.

In this CLARIN Café, we will share the steps that were followed to create these manually annotated corpora and will discuss some of the challenges that were faced. We will also demo the game to foster further expansion of this type of data collection to other languages. Finally, we will reflect on future steps of this project.

How to join

You can register for free using this link in order to receive the meeting room details.


14:00 - 14:05 Opening and CLARIN 101 - Henk van den Heuvel (CLARIN)

14:05 - 14:35 Introducing the CrowLL project

Tanara Zingano Kuhn

14:35 - 15:00 Challenges in data preparation and manual corpus annotation

Špela Arhar Holdt and Carole Tiberius

15:00 - 15:10 Q&A

15:10 - 15:40 Game and data management demo

Kristina Koppel, Rina Zviel Rishin and Iztok Kosem

15:40 - 16:00 Questions and discussion