This section presents some basic concepts in European data protection law under the GDPR. It is divided into two subsections:
- the What aims to explain what "processing of personal data" is;
- the Who discusses the various actors that can be involved in data processing operations: data subjects, data controllers and data processors.
Article 4(1) of the GDPR defines personal data in a very broad manner as "any information relating to an identified or identifiable natural person". This definition can be analysed into four elements:
- any information regardless of its nature (facts, opinions, even untrue or unproven information) and of its form (textual data, sound, image, digital or analogue);
- relating to: information relates to a person if it tells something about that person, i.e. his or her identity, characteristics or behaviour, or if it can be used to evaluate the situation of an individual. Information can relate to an individual directly (e.g. "Peter is six feet tall") or indirectly, for instance via an object (e.g. "This extravagant limousine belongs to Peter" tells something about Peter’s economic situation, i.e. that he is well-off);
- identified or identifiable: a person is identified if he or she is singled out directly (via a name, unless it is very common, e.g. Smith) or indirectly (e.g. via a phone number). A person is identifiable if he or she can be identified by any means reasonably likely to be used (see recital 26 of the GDPR; cf. our remarks below about anonymisation). In assessing whether means are reasonably likely to be used, one should take into account the costs, the relevant interests of the data subject (i.e. the person that the information relates to), the potential benefits for the data controller (i.e. the person or entity that determines the purposes and means of the processing) and the risk of dysfunctions. For example, it is rather unlikely that someone would employ a costly high-end technology in order to learn that Mr. X is a plumber, or that he drives a Honda; when it comes to more sensitive information (Mr. X’s genetic predisposition to lung cancer or his social security number), the probability is higher. In short, the more sensitive the information, the more cautiously identifiability should be assessed;
- natural person, i.e. a living individual. The GDPR only protects information about natural persons; however, information about dead persons or legal persons may indirectly relate to identified or identifiable natural persons (e.g. The man who died of a rare genetic disease at age 42 was Peter’s father). Moreover, some Member States may have special frameworks that apply to information about dead individuals, or allow people to define what is going to happen with their data post mortem. Most notably, this is the case of France (see art. 85 of the French Data Protection Act).
Therefore, the GDPR covers all sorts of information that relate (even indirectly) to a person, including not only the person’s name, phone number and address, but also various facts about the person’s past, opinions about the person, his or her social security number, IP address, voice, biometric information (way of walking or speaking), DNA sequences etc. Importantly, it is irrelevant whether the information relates to the public (e.g. professional career) or the private (e.g. family situation) sphere of the individual’s life. This has to be kept in mind while processing all sorts of language resources, especially those containing interviews, images or voice recordings.
Some categories of personal data are regarded as particularly sensitive; their processing is in principle prohibited, unless one of the exceptions defined in art. 9 of the GDPR applies (most notably, if the person has given explicit consent for the processing for a specified purpose). This stricter framework concerns data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, as well as genetic data, data concerning health or data concerning a natural person's sex life or sexual orientation. The framework also applies to biometric data, but only if they are processed for the purpose of uniquely identifying a person (so, for example, an ID photo is not sensitive data unless it is processed by a piece of face recognition software). Concerning health data, it is worth mentioning that they are considered sensitive regardless of whether they reveal an illness or any other anomaly. The information that Peter is in perfectly good health is also to be regarded as sensitive.
As mentioned above, personal data is information that can be linked to a natural person by any means reasonably likely to be used. Data anonymisation is a process aimed at breaking this link, so that it becomes "reasonably unlikely" that the person that the data related to would be identified. Some anonymisation techniques include randomisation (noise addition, permutation, differential privacy) and generalisation (aggregation, k-anonymity, l-diversity, t-closeness).
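As an illustration of generalisation, the following minimal Python sketch coarsens two quasi-identifiers and then measures k-anonymity, i.e. the size of the smallest group of records that share the same generalised values. The field names (`age`, `postcode`, `diagnosis`), the sample records and the generalisation rules are assumptions made for this example only; reaching a given k does not by itself mean that a dataset is anonymised in the sense of the GDPR.

```python
from collections import Counter


def generalise(record):
    """Coarsen the quasi-identifiers of one record (hypothetical rules):
    age becomes a 10-year band, postcode keeps only its first two characters."""
    decade = (record["age"] // 10) * 10
    return (f"{decade}-{decade + 9}", record["postcode"][:2] + "***")


def k_anonymity(records):
    """Return the size of the smallest group of records sharing the same
    generalised quasi-identifiers (0 for an empty dataset)."""
    groups = Counter(generalise(r) for r in records)
    return min(groups.values()) if groups else 0


# Invented sample data, for illustration only.
records = [
    {"age": 34, "postcode": "75011", "diagnosis": "asthma"},
    {"age": 37, "postcode": "75019", "diagnosis": "flu"},
    {"age": 52, "postcode": "69003", "diagnosis": "asthma"},
    {"age": 58, "postcode": "69007", "diagnosis": "diabetes"},
]

k = k_anonymity(records)
print(f"Every generalised record is shared by at least {k} individuals")
```

A low k signals that some individuals remain easy to single out; in that case the generalisation can be made coarser (wider age bands, shorter postcode prefixes) at the cost of less precise data.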
The WP29’s opinion on anonymisation sets a very high standard for anonymisation, especially by pointing out that data subjects may be identified through cross-referencing with other available datasets (e.g. social media). If you want to anonymise your dataset, contacting the Data Protection Officer at your institution may be a good first step.
In any case, it should be kept in mind that anonymisation should be permanent and irreversible. Moreover, anonymisation is already a form of processing and, therefore, it should follow the data processing principles, including lawfulness.
Properly anonymised data are no longer personal data, and can therefore be freely processed. It is good practice, however, to periodically review the results of anonymisation, as what is not identifying today may become identifying in the future due to technological progress and the ever-increasing volume of publicly available data.
For more information about anonymisation techniques, see the WP29 Opinion 05/2014 on Anonymisation Techniques.
Pseudonymisation is somewhat similar to anonymisation, but it is reversible (cf. art. 4(5) of the GDPR). A dataset is pseudonymised if the information that relates to natural persons (e.g. their names, dates of birth and addresses) is separated from the original dataset; however, this information is still available, and so the dataset can be "de-pseudonymised". Pseudonymised data are still to be regarded as personal data, and their processing must respect the principles set forth in the GDPR. However, pseudonymisation can be regarded as an additional safeguard, particularly relevant when the processing is carried out for research purposes.
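The separation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list of identifying fields, the function names `pseudonymise` and `re_identify`, and the sample record are hypothetical, and in a real project the key table would have to be stored separately from the dataset and protected by appropriate technical and organisational measures.

```python
import secrets

# Hypothetical list of directly identifying fields for this example.
IDENTIFYING_FIELDS = ("name", "date_of_birth", "address")


def pseudonymise(records):
    """Split each record into a pseudonymised record and a key-table entry.

    Returns (pseudonymised_records, key_table). The key table maps each
    random pseudonym to the identifying fields removed from the record."""
    pseudonymised, key_table = [], {}
    for record in records:
        pseudonym = secrets.token_hex(8)  # random, carries no meaning
        key_table[pseudonym] = {
            f: record[f] for f in IDENTIFYING_FIELDS if f in record
        }
        rest = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
        pseudonymised.append({"pseudonym": pseudonym, **rest})
    return pseudonymised, key_table


def re_identify(record, key_table):
    """Reverse the pseudonymisation of one record using the key table."""
    identity = key_table[record["pseudonym"]]
    rest = {k: v for k, v in record.items() if k != "pseudonym"}
    return {**identity, **rest}


# Invented sample data, for illustration only.
original = [
    {"name": "Peter Example", "date_of_birth": "1980-01-01",
     "address": "1 Sample Street", "transcript": "Interview text"},
]
dataset, keys = pseudonymise(original)
restored = [re_identify(r, keys) for r in dataset]
```

Because anyone holding the key table can run `re_identify`, the pseudonymised dataset remains personal data; the operation would only become irreversible if the key table were securely destroyed and no other means of re-identification remained.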
Processing is defined in art. 4(2) of the GDPR as any operation performed on personal data, whether by automated or non-automated means, including collection, recording, storage, anonymisation, transfer, disclosure, annotation, alignment or even erasure of the data. In other words, every operation performed on personal data must respect the principles of the GDPR (unless it is carried out "in the course of a purely personal or household activity", such as keeping a personal agenda or an address book).
The first step in addressing GDPR compliance of personal data processing is to identify the various stakeholders:
- The data controller is the person or entity who "determines the purposes and means of processing". When you are processing personal data for a research project, the data controller will probably be your institution. There can also be several controllers for one processing operation; they are referred to as joint controllers and are jointly and severally responsible (in solidum), i.e. on the "one for all, all for one" basis. The status of data controller is determined by the facts of each case and cannot, for instance, be assigned by the parties to a contract. In other words, if a consortium agreement says that only one institution or a particular individual is to be regarded as the data controller, whereas in fact the decisions concerning the purposes and means of processing are made jointly by all the consortium members, then all the members are to be regarded as "joint controllers". Conversely, if the decisions concerning a particular processing operation are made by one member only, then that member is to be regarded as the controller, regardless of what the contract may stipulate.
- The data processor is the person or entity who processes data in the name and on behalf of the controller and under the controller’s instructions. If a controller "outsources" certain processing operations (e.g. long-term storage) to an external provider, the provider is to be regarded as a processor. If your institution stores third-party data, it is probably acting as a processor, too. Keep in mind that one institution can be both a controller and a processor with regard to two different processing operations performed on the same dataset (e.g. on one language resource in two different projects).
- Finally, the data subject is the natural person that the personal data relate to (cf. above the definition of personal data).