Hacking the GDPR to Conduct Research with Language Resources in Digital Humanities and Social Sciences


In the midst of a winter landscape, the workshop Hacking the GDPR to Conduct Research with Language Resources in Digital Humanities and Social Sciences took place in Vilnius, Lithuania, on 7 December 2018.

The workshop, organised by the CLARIN Legal and Ethical Issues Committee (CLIC), hosted by CLARIN-LT and supported by CLARIN ERIC, aimed to explore the changes brought to research in the Digital Humanities and Social Sciences by the introduction of the EU General Data Protection Regulation (GDPR) and, more specifically, the way its application affects the construction, distribution and exploitation of Language Resources (LRs).

Researchers are asking questions such as:

  • what constitutes "personal data" and "sensitive data"? Names, contact details, gender, country of birth, nationality, health status, ...?
  • when collecting material through interviews, task-based assignments, questionnaire surveys, etc. in order to conduct research on the use of language by specific groups of people (e.g. people with specific language impairments or learning disabilities, or coming from different regions, for gender studies, etc.), it's important to accompany the material with the personal data of the participants; so, should we ask for consent in order to acquire and process these data from the participants? is it ok to share these data or some of these data together with the collected material? under what circumstances?
  • what measures can be taken to ensure that personal data are not distributed together with a language resource? must we always use anonymisation? what other alternatives can we use?
  • should we have consent of individuals when collecting videos and photographs of them? even when used solely for research purposes?
  • if the personal data are already available on websites (e.g. authors' data from blog posts), is it ok to harvest them together with the language material and conduct research using them?
  • regarding legacy data collected before the advent of the GDPR with the consent of the participants, is it ok to share them with other researchers or the general public, or should we ask again for new consent to cover GDPR requirements?

These and many more questions were discussed during an intense and fruitful brainstorming workshop, which brought together around 25 participants from various disciplines, namely legal experts, NLP researchers and corpus scientists involved in processes that touch upon the legal management of Language Resources (LRs).

The workshop was organised in two sessions:

  • an introductory session with four short presentations aiming to establish a common ground of knowledge among the participants; more specifically, two presentations focused on creating and processing LRs with and without the consent of the data subjects, and another one gave an overview of the technical measures that can be taken to protect LRs with personal data; the fourth presentation described the general framework for research purposes and laid out a proposal for the creation of a Code of Conduct.
  • a "hackathon" session where the participants discussed specific use cases that present problems vis-à-vis the GDPR; the use cases, highlighting different angles of the GDPR in relation to LRs, were submitted by the participants as a means of triggering discussion on the way the GDPR influences the various stages of the LR lifecycle and on the particular legal and technological measures that can and/or should be deployed to ensure legal access to and processing of personal data under the GDPR regime.

Participants drew on their expertise and experience to suggest potential ways of handling each use case. They shared their views, asked questions, argued for or against the various measures and techniques, became more aware of the peculiarities of the GDPR and tried to understand how these measures can fit, or be adapted to fit, particular conditions.

Overall, all participants found both the structure and the contents of the workshop stimulating and expressed their interest in continuing the discussion through similar events.

More information, including links to the presentation slides and the main discussion points, can be found in the workshop's final report.