CLARIN Café on Text and Data Mining Exceptions a Year After - Has the Pony Become a Horse?


General Information

This edition of the CLARIN Café is organised by Paweł Kamocki, chair of the CLARIN Legal and Ethical Issues Committee (CLIC).  

Date: 08 November 2022

Time: 14:00 - 16:00 (CET)

Venue: CLARIN virtual Zoom meeting

A full overview of the Café sessions can be found on the CLARIN Café page.


Copyright exceptions for Text and Data Mining (TDM) are something the language research community has been requesting for many years. In 2019, the EU Directive on Copyright in the Digital Single Market introduced two such exceptions: one designed specifically for scientific research purposes, and one for the general public. Both provisions raised many doubts and questions (discussed during another CLARIN Café in late 2021). The Member States were required to implement them by mid-2021, and although some countries did so later than others, European researchers have by now had a chance to find out whether the new exceptions meet their needs and actually make their work easier.

During this CLARIN Café, legal experts and researchers from CLARIN countries will share their experience with the TDM exceptions. Together with all the participants, they will try to answer the question of whether the pony received last year has now become a full-grown horse, and whether that is actually a good thing.

How to Join

Please register for free using this link in order to receive the meeting room details.


14:00 - 14:15 Introduction and CLARIN 101 - Antal van den Bosch (CLARIN Board of Directors)

14:15 - 14:35 A Deeper Look into the EU Text and Data Mining Exceptions: Harmonisation, Data Ownership, and the Future of Technology - Thomas Margoni, KU Leuven

14:35 - 14:55 Tabula rasa: TDM exceptions in post-Brexit UK - Toby Bond, Bird & Bird

14:55 - 15:15 Imagine all the researchers crawling the Internet in peace. The HPLT project and the future of European language research - Jan Hajič, Charles University Prague

15:15 - 16:00 Discussion


Paweł Kamocki is a researcher at the Leibniz-Institut für Deutsche Sprache and Chair of the CLARIN Legal and Ethical Issues Committee. He holds a Doctor of Law (Dr. iur.) degree (Münster, Paris), as well as a Master’s degree in linguistics (Warsaw), and graduated from the Paris Barrister Training School. His scientific interests centre on legal issues affecting data-intensive science, especially in the fields of linguistics and Digital Humanities; he has published a number of peer-reviewed articles and book chapters on these questions. He also co-chaired the Working group on Data Access and Re-Use Policies and helped develop Legal Tech tools such as the Public License Selector and the DARIAH ELDAH Consent Form Wizard.


A computational linguist by training, Antal van den Bosch has worked in text mining, digital humanities, and on applications of computational language modelling in (psycho-, socio-, and neuro-)linguistics. He has led efforts to create sustainable and open source software packages for machine learning and (Dutch) natural language processing, much of this in the context of CLARIN-NL and CLARIAH-NL. He is Professor of Language, Communication and Computation at the Faculty of Humanities of Utrecht University. He is also guest professor at the Computational Linguistics and Psycholinguistics Research Center at the University of Antwerp, Belgium, a member of the Royal Netherlands Academy of Arts and Sciences, and a fellow of the European Association for Artificial Intelligence.


Thomas Margoni is Research Professor of intellectual property law at the Faculty of Law, KU Leuven, and a member of the Board of Directors of the Centre for IT & IP Law (CiTiP). His research concentrates on the relationship between law and new technologies, with particular attention to the role of the Internet and, more recently, of AI as new ways to create, transform and disseminate knowledge and information. Current examples of research projects include reCreating Europe, the EU H2020-funded project developing an integrated policy approach to copyright in the EU digital single market, where Thomas leads the task on AI and data ownership; OpenAIRE, the H2020 project developing an Open Science e-infrastructure for Europe, where Thomas is joint coordinator of the legal and policy task force; and OpenMinTeD, the now completed EU H2020 project for the development of an e-infrastructure for Text and Data Mining (TDM) in Europe, where Thomas coordinated the legal working group. Other areas of interest in which Thomas has developed institutional as well as funded research include the processes of EU copyright and design law harmonisation; data ownership and AI; copyright, design rights and additive manufacturing; the digitisation of cultural heritage and the digital public domain; open access and open science; online intermediaries, fundamental rights and the platform economy; and the role of property rights in sports.

Toby Bond is a partner in Bird & Bird’s Intellectual Property Group, based in London. Much of his work focuses on helping clients navigate issues relating to the protection and commercialisation of data as they take advantage of the power of big data analytics and artificial intelligence. He has a particular interest in the wider intellectual property issues arising from the development and deployment of AI systems and has been recognised by The Legal 500 as providing cutting-edge advice on copyright and the protection of AI-generated works. In 2021 he was named one of Global Data Review’s worldwide ‘40 under 40’ upcoming data lawyers.


Jan Hajič is a full professor of Computational Linguistics at the Institute of Formal and Applied Linguistics at the School of Computer Science, Charles University in Prague, where he also received his Ph.D. in 1995. He served as the head and deputy head of the Institute between 2001 and 2020. His interests cover morphology and part-of-speech tagging of inflective languages, machine translation, deep language understanding, and the application of statistical methods in natural language processing in general. He also has extensive experience in building language resources for multiple languages with rich linguistic annotation, and is currently the director of a large, multi-institutional research infrastructure on language resources in the Czech Republic, LINDAT/CLARIAH-CZ, which aims at making datasets and corpora openly available for linguistic and Digital Humanities research. His work experience includes both industrial research (IBM Research Yorktown Heights, NY, USA, in 1991-1993) and academia (Charles University in Prague, Czech Republic and Johns Hopkins University, Baltimore, MD, USA, 1999-2000, adjunct position at University of Colorado, USA, 2017-2022). He has published more than 200 conference and journal papers, a book on computational morphology, and several other book chapters, encyclopaedia and handbook entries. He regularly teaches basic and advanced courses on Statistical NLP and has extensive experience giving tutorials and lectures at various international training schools. He has been the PI or Co-PI of numerous international as well as large national grants and projects (including EU Framework Programme projects, such as H2020, and the NSF ITR program in the U.S.). He is the chair of the Executive Board of META-NET, the European research network in Language Technology.

Slides, Blog, Recordings