Tailored to the CLARIN Community
The release of ChatGPT, Bing Chat and, more recently, GPT-4 has turned the spotlight on language models and generative chatbots. The interest in AI-generated language data and human-machine interaction is significant, underpinned by extensive coverage in the media.
However, many of the discussions focus on the chatbots’ ability to produce content and on their potential future uses. Few address the specific issues relevant to those working with digital language resources. Legal disputes such as Getty Images vs. Stability AI (the developer of Stable Diffusion) have highlighted some of the important questions related to copyright and regulation. Because CLARIN repositories are located in many different countries, users need to know exactly how to use data in compliance with the law.
CLARIN’s legal experts, the CLARIN Legal and Ethical Issues Committee (CLIC), made this topic the focus of the latest CLARIN Café. Entitled 'Do Chatbots Dream of Copyright? Copyright in AI-generated Language Data', the Café brought together a group of experts from different fields to explore the legal implications of working with or using AI-generated texts, and specifically whether these texts can, and should, be protected by copyright. The virtual event was organised by Paweł Kamocki, researcher at the Leibniz-Institut für Deutsche Sprache and Chair of CLIC, and hosted by CLARIN’s Antal van den Bosch.
The virtual Café featured presentations by three speakers: organiser Paweł Kamocki; Thomas Margoni, Research Professor of Intellectual Property Law at the Faculty of Law, KU Leuven, and a member of the Board of Directors of the Centre for IT & IP Law (CiTiP); and Toby Bond, a partner in Bird & Bird’s Intellectual Property Group, based in London. The presentations were followed by a lively discussion.
Topics discussed included different concepts of authorship, the difference between AI-generated and AI-assisted text, the regulation of non-personal data as a potential proxy for regulating AI-generated data, the distinction between input and output, and different uses and levels of copyright. The discussion highlighted the complexity of the issue at hand, and also focused on the implications of these issues for the CLARIN community. As AI outputs are not currently protected by copyright, what does this mean for the use of AI-generated texts to train language models or software, for instance? While it is too early for answers, the Café succeeded in untangling some of the key issues, as well as identifying areas that will need to be further explored.
Paweł Kamocki made clear that CLARIN is taking these issues seriously: ‘As an infrastructure that hosts data, it is important for people to know exactly how to use this data. How do we protect our communities from inadvertently breaking the law?’ CLIC will continue to monitor the development of AI and offer support where possible, with further Cafés, as well as relevant publications, planned in the future.
‘Current legal disputes (e.g. Getty Images vs. Stable Diffusion) and regulatory decisions will shape the future development of AI. Specifically, the relationship between copyrighted content and training data is crucial. What are the implications of turning an astronomical amount of copyrighted content into a core component of a closed, proprietary model like GPT-4? There needs to be clarity about the role of copyright law in this context. But this question goes beyond the question of copyright. The transformation of the entire internet [...] into training data for a closed model means that there is an urgent need for democratic oversight over large language models.
Given the rapid pace of AI development, there needs to be a continuous exchange across disciplines. Technical expertise is indispensable for governance frameworks. The CLARIN Café provides a dynamic platform for such an exchange.’
Fabian Ferrari is a postdoctoral researcher in the humanities at Utrecht University, with a background in interdisciplinary social science. He is particularly interested in the governance of AI-generated media.
'Most of the available information on this topic tends to focus on the ability of chatbots to reproduce almost identical content and their potential incredible future developments, especially regarding how helpful they can be. However, only a fraction of these sources is dedicated to discussing intellectual property issues that arise with chatbots processing text and images and how to make their use transparent.
The event was informative and engaging, bringing together speakers with different working expertise, who provided a comprehensive view of the 'problem' at hand, presenting perspectives that are not typically seen in user-centered approaches.'
Roberta Luzietti is a PhD student in Linguistics at the University of Pisa, working on the reuse of a historical oral archive to conduct a sociophonetic investigation of a residual phonetic phenomenon in Tuscan vernacular (https://dilles.fileli.unipi.it/en/students-and-alumni/). She also collaborates with CLARIN-IT on the validation of the Archivio Vi.Vo. platform (https://www.clarin-it.it/it/content/archivio-vivo).