It has now been more than two years since the General Data Protection Regulation (GDPR) came into our lives, for better or for worse. European citizens can only welcome such a shield for their personal data. For businesses, on the other hand, ensuring GDPR compliance in the age of Big Data can be a source of cold sweats. In this article, we present some key points on applying the GDPR to big data.
GDPR: a reminder of the basic principles
The General Data Protection Regulation applies to personal data, as defined in Article 4 of the text: any information relating to a natural person who can be identified, directly or indirectly.
The French CNIL (Commission Nationale de l'Informatique et des Libertés) identifies five fundamental principles when a company processes personal data:
- The purpose principle: data may only be collected and kept for a defined and legitimate purpose;
- The principle of proportionality and relevance: only data that is useful for the purposes for which it was collected may be kept;
- The principle of a limited retention period;
- The principle of security and confidentiality; and
- The principle of respect for the rights of individuals, in particular the rights of access, rectification and deletion.
In the era of Big Data, where data can be bought and sold like apples at the market and companies tend to store reserves for the winter, it is easy to see that personal data raises many questions. The line between compliance and illegality is extremely fine and can spark debate.
GDPR and data lakes: how can the two be reconciled?
Data lakes are becoming increasingly popular. A data lake is simply a place where a company stores raw data, in bulk, for future use.
In 2020, companies have a wealth of data, but not necessarily the technology to exploit it. It is therefore easy to understand their desire to retain this valuable data so that they can use it later.
Raw data of all kinds kept for an indefinite period of time? One can quickly imagine the potential problems if the data lake is intended to store personal data.
To comply with the law, one option is to keep the data lake outside the scope of the GDPR altogether. The simplest way to do so is to ensure that the data stored in the data lake does not qualify as "personal data". It is therefore up to companies to anonymize the data.
How to anonymize data
The European Commission is very strict about the definition of anonymous data. On its website, it states:
"Personal data that has been made anonymous, encrypted or pseudonymized, but which can be used to identify a person again, is still personal data and is covered by the GDPR. Personal data that has been anonymized in such a way that the individual is not or no longer identifiable is no longer personal data. In order for data to be truly anonymized, the process of anonymization must be irreversible."
These explanations leave little room for interpretation and greatly complicate the work of Chief Data Officers and data managers. Various techniques exist to anonymize data. Google, for example, uses the following techniques:
- Generalization: this method replaces precise attributes with values shared by a group of individuals, so that each person blends into the group and is no longer identifiable. For example, instead of keeping your full address, a company will only keep your postal code;
- Differential privacy: this popular technique blurs information by adding statistical noise, that is, random elements. Some data points are altered, or false data is added to the dataset, so that no link can be made between an individual and the data. At large scale, this statistical noise has little influence on the results of data processing: the message is blurred but remains understandable.
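To make these two techniques more concrete, here is a minimal Python sketch. It is an illustration under our own assumptions, not a description of how Google or any other company implements them: the field names (street, postal_code, basket) and the functions generalize and noisy_count are hypothetical, and the noise is drawn with the Laplace mechanism, one common way of achieving differential privacy for counting queries.

```python
import random

def generalize(record):
    """Generalization: drop the street-level address, keep only the postal code."""
    return {key: value for key, value in record.items() if key != "street"}

def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centred on 0 with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count(records, predicate, epsilon, rng):
    """Answer a counting query with Laplace noise of scale 1/epsilon.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1. A smaller epsilon means more noise and more privacy.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1 / epsilon, rng)

# Hypothetical customer records (illustrative values only).
customers = [
    {"street": "12 rue de la Paix", "postal_code": "75002", "basket": 40},
    {"street": "3 rue Montmartre", "postal_code": "75002", "basket": 15},
    {"street": "8 place Grenette", "postal_code": "38000", "basket": 22},
]

# Only the generalized rows would be written to the data lake.
lake_rows = [generalize(customer) for customer in customers]

# A noisy answer to "how many customers live in 75002?"
rng = random.Random(0)
answer = noisy_count(lake_rows, lambda r: r["postal_code"] == "75002", epsilon=0.5, rng=rng)
print(lake_rows)
print(answer)
```

Neither step guarantees anonymity on its own. A postal code with very few residents can still point to one person, and the value of epsilon has to be chosen with the number of queries in mind.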
There are, of course, many ways of anonymizing data that we do not discuss in this article.
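One caution follows directly from the Commission's definition quoted above: replacing an identifier with a hash is pseudonymization, not anonymization. The short Python sketch below illustrates why, using hypothetical e-mail addresses and a hypothetical pseudonymize function. Anyone holding a list of candidate identifiers can recompute the hashes and find the person again.

```python
import hashlib

def pseudonymize(email):
    """Replace an e-mail address with its SHA-256 digest."""
    return hashlib.sha256(email.lower().encode("utf-8")).hexdigest()

# What a company might store instead of the address itself.
stored_token = pseudonymize("alice@example.com")

# Re-identification: hash every candidate address and look the token up.
candidates = ["bob@example.com", "alice@example.com", "carol@example.com"]
lookup = {pseudonymize(candidate): candidate for candidate in candidates}
reidentified = lookup.get(stored_token)

print(stored_token)
print(reidentified)
```

Because the link back to the individual can be rebuilt, data protected this way remains personal data under the definition above.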
Processing personal data: principles to follow
Anonymization is not always the answer. In some cases, the company has to keep some personal data. It can still respect the GDPR by adopting a few good habits:
- Limit data collection as much as possible to relevant data. There is little point in accumulating data if it is not used or does not benefit the company. While it can be argued that it may be used in the future, it is more likely to end up in the trash. Imagine the number of attics and cellars filled with things that will never be used in 99% of cases and you will have understood the principle. Yes, this old mismatched sock could one day be converted into a puppet, but is it a real asset? Setting an adequate retention period is also part of this. There is no point in keeping obsolete data;
- Ensure maximum transparency so that individuals can understand how their data is being used. Such transparency is an ongoing process and will also push the company to keep questioning its data policy. In the same vein, we can only recommend putting a clear policy in place for handling potential complaints;
- Secure data, considering its sensitivity and the complexity of the systems in place. Cybersecurity will be a pivotal element in the years to come. Beyond the loss of data, there is also a real risk of alteration, which would distort all the indications obtained from the data.
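The first habit, limiting collection and setting a retention period, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the field names, the ALLOWED_FIELDS allow-list and the one-year RETENTION value are hypothetical, and the right values depend on the purpose of each processing activity.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical allow-list: the only fields the stated purpose requires.
ALLOWED_FIELDS = {"customer_id", "postal_code", "collected_at"}
# Hypothetical retention period.
RETENTION = timedelta(days=365)

def minimize(record):
    """Keep only the fields on the allow-list."""
    return {key: value for key, value in record.items() if key in ALLOWED_FIELDS}

def purge_expired(records, now):
    """Drop records collected before the retention cutoff."""
    cutoff = now - RETENTION
    return [record for record in records if record["collected_at"] >= cutoff]

now = datetime(2020, 9, 1, tzinfo=timezone.utc)
raw = [
    {"customer_id": 1, "postal_code": "38000", "favourite_colour": "blue",
     "collected_at": datetime(2020, 6, 1, tzinfo=timezone.utc)},
    {"customer_id": 2, "postal_code": "69001", "favourite_colour": "red",
     "collected_at": datetime(2018, 1, 15, tzinfo=timezone.utc)},
]

kept = purge_expired([minimize(record) for record in raw], now)
print(kept)
```

In this example only the first record survives, without its favourite_colour field. The second record is older than the retention period and is dropped.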
Other avenues to ensure compatibility with the GDPR
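To ensure compliance with the General Data Protection Regulation, a company may choose to rethink the way it uses data. For example, it can consider real-time data processing using stream computing, which handles data as it arrives and can reduce the amount of raw data that has to be stored.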
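As a rough illustration of the idea, the Python sketch below consumes events one at a time and keeps only an aggregate. The event fields (user_id, postal_code) and the functions event_source and count_by_region are hypothetical, and a real deployment would rely on a dedicated streaming platform rather than a generator.

```python
from collections import Counter

def event_source():
    """Stand-in for a live stream of events (hypothetical fields and values)."""
    yield {"user_id": "u1", "postal_code": "38000"}
    yield {"user_id": "u2", "postal_code": "75002"}
    yield {"user_id": "u3", "postal_code": "38000"}

def count_by_region(events):
    """Process each event as it arrives and retain only per-region counts."""
    counts = Counter()
    for event in events:
        counts[event["postal_code"]] += 1
        # The raw event, including user_id, is not stored anywhere.
    return counts

totals = count_by_region(event_source())
print(totals)
```

Only the counts per postal code are retained. Whether such aggregates fall outside the scope of the GDPR still depends on the context, for instance on how small the groups are.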
It's clear that in 2020, many companies are struggling to comply with the regulation, even though the intention is there in most cases. The environment is becoming extremely complex, and the need for solutions using artificial intelligence is becoming apparent. In fact, Gartner estimates that spending on privacy compliance will increase dramatically in the coming years to reach $8 billion.
Ryax offers a secure and reliable data engineering platform that allows you to address these challenges with confidence. Using such a solution simplifies data security and harmonizes policies across the enterprise. For more information, contact us or consult our product sheet.
The Ryax Team.