A Recap on the CLARIN Café on Text and Data Mining Exceptions a Year After - Has the Pony Become a Horse?

Submitted by e.gorgaini@uu.nl on 29 November 2022


The CLARIN Café on Text and Data Mining (TDM) Exceptions a Year After took place via Zoom on 8 November 2022 and was organised by the CLARIN Legal and Ethical Issues Committee (CLIC). It was attended by around 25 participants, including language researchers, lawyers and legal experts from both CLARIN institutions and the private sector. The aim of the event was to discuss the impact that the copyright exceptions for TDM, introduced by the recent Directive on Copyright in the Digital Single Market (hereinafter DSM Directive), have had so far on language research and technology. The event was a follow-up to a Café organised by the CLIC on the same subject in 2021, and was intended to revisit the issue from a longer-term perspective. It featured presentations from distinguished guest speakers: Thomas Margoni, research professor of intellectual property law at KU Leuven, and Toby Bond, data lawyer and partner at Bird & Bird London, as well as Antal van den Bosch (member of the CLARIN Board of Directors) and Jan Hajic (Charles University Prague). The event was moderated by Paweł Kamocki, chair of the CLIC.

Who’s Who and What They Said

After an introduction by Antal van den Bosch, Thomas Margoni presented a detailed overview of the TDM exceptions in the DSM Directive and shared his thoughts on their shortcomings, especially regarding the sharing of corpora created on the basis of these exceptions, in particular across borders.

Then, Toby Bond presented the situation in the UK, where the DSM Directive does not apply. The UK government has recently announced a new copyright exception for TDM purposes, which is likely to be even more favourable for data miners, potentially turning the UK into a “safe harbour” for data-intensive research projects.

In the third talk, Jan Hajic presented the new and ambitious HPLT project, in which very large amounts of data collected via web crawling in the US are stored, shared and processed at several institutions across Europe. The project raises many questions related to TDM exceptions, in particular regarding the international sharing of data.

Has the Pony Become a Horse?

The Café ended with a discussion, fuelled by questions from the audience. The three presentations resonated very well with the audience, as they addressed many of the questions researchers have been asking themselves. Unfortunately, even our eminent guests, some of the best experts on these matters, were sometimes unable to provide a clear answer, as best (and legally safest) practices in TDM have yet to emerge. However, all experts agreed that language researchers should 'keep calm and carry on' with their research, hoping that the interpretation of the legal framework will conform to the reality of their work.

The slides of the Café are available on the events page.