According to the GDPR, it shouldn’t be possible to ‘single out an individual, link records relating to an individual or infer information concerning an individual’ in data about EU citizens without their consent. However, there are many situations, such as software development, where you need to work with real-world data but don’t have the necessary permission. That’s why data anonymization is crucial.
Data anonymization turns your sensitive data into usable data sets by stripping identifiable information and making it anonymous.
There are different techniques for anonymizing data, such as masking, noise addition and randomization. However, without sound practices, these processes can become confusing and, in turn, lead to mistakes that put your data at risk.
We want to shed some light on eight common mistakes (and myths) we hear regularly at CloverDX:
Anonymization might seem like an easy task. If you delete the names of individuals in your dataset, it’s done, right? Unfortunately, that’s not the case.
Variables that are not ‘identifiers’ can still supply context which may lead to identification. For example, when Netflix released a data set of movie ratings, they removed usernames and replaced customer IDs with random numbers. However, researchers at the University of Texas at Austin were able to match these anonymized records to IMDb users by comparing them with the ratings those users had posted publicly. The data was deanonymized without using any PII.
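The kind of matching described above can be sketched in a few lines. This is a minimal illustration rather than the method the researchers used: the records, the names and the `best_match` helper are all made up, and the matching rule (count identical title and rating pairs) is an assumption chosen for simplicity.

```python
def best_match(anon_ratings, public_profiles, min_overlap=3):
    """Return the public name whose ratings overlap most with one
    anonymized record, or None if the overlap is too small to trust."""
    best_name, best_score = None, 0
    for name, ratings in public_profiles.items():
        score = len(anon_ratings & ratings)  # identical (title, rating) pairs
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_overlap else None


# Hypothetical "anonymized" release: names removed, random IDs only.
anonymized = {
    "user_8841": {("Movie A", 5), ("Movie B", 2), ("Movie C", 4), ("Movie D", 1)},
    "user_1093": {("Movie A", 3), ("Movie E", 5)},
}

# Hypothetical public profiles on another site, with names attached.
public = {
    "jane_doe": {("Movie A", 5), ("Movie B", 2), ("Movie C", 4)},
    "john_roe": {("Movie E", 1), ("Movie F", 4)},
}

reidentified = {
    anon_id: best_match(ratings, public)
    for anon_id, ratings in anonymized.items()
}
```

No name, email or address is involved in the match. The ratings themselves act as a fingerprint, which is why removing the obvious identifiers is not enough.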
Synthetic data is artificially generated data that resembles your original dataset but contains completely fake information. Because it consists of valid values, it is well suited to certain types of testing and analysis, such as software testing, but it still has its limitations. It can differ from the original data so much that it becomes difficult to draw useful information from it.
Unlike synthetic data, anonymized data holds on to some of those important attributes which allow the data to be analyzed for business intelligence purposes. For example, in an anonymized data set you can still ask questions like “What is the most common first name?”, which wouldn’t be possible in a synthetic data set.
In practice, synthetic data and anonymization are therefore often used together. How you combine them will depend on the case. You can anonymize the data whose key characteristics you need to retain, and synthesize high-risk data or data you have no analytical use for. The two go hand in hand in giving you the most secure, usable data.
Pseudonymization and anonymization are not the same, according to the GDPR.
Why? Because pseudonymization is reversible if the original data is accessible.
For example, you remove all personal details from a transactions database and put a “customer number” there instead. Somewhere else, there is another database that matches each customer number to the customer’s details. If you give out just the transactions database, no one can tell who anyone is. But the process can be reversed if the third party gains access to your customer database. The data then becomes identifiable.
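The example above can be sketched as follows. The field names, the customer number format and both helper functions are hypothetical, chosen only to show why the separate lookup table is the weak point.

```python
def pseudonymize(transactions):
    """Replace each customer's name with a customer number.

    Returns the shareable records plus the lookup table that maps
    customer numbers back to names.
    """
    numbers = {}  # name -> customer number
    shareable = []
    for t in transactions:
        name = t["name"]
        if name not in numbers:
            numbers[name] = f"C{len(numbers) + 1:04d}"
        shareable.append({"customer_number": numbers[name], "amount": t["amount"]})
    lookup = {number: name for name, number in numbers.items()}
    return shareable, lookup


def reidentify(shareable, lookup):
    """Anyone holding the lookup table can undo the pseudonymization."""
    return [
        {"name": lookup[r["customer_number"]], "amount": r["amount"]}
        for r in shareable
    ]


transactions = [
    {"name": "Jane Doe", "amount": 19.99},
    {"name": "John Roe", "amount": 5.00},
    {"name": "Jane Doe", "amount": 42.50},
]

shared, lookup_table = pseudonymize(transactions)
restored = reidentify(shared, lookup_table)
```

Here `shared` contains no names, but `restored` is identical to the original `transactions`. As long as `lookup_table` exists somewhere, the shared data is pseudonymized rather than anonymized.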
Anonymization, on the other hand, is meant to be irreversible. Once identifying values have been replaced with meaningless characters and no mapping back to the originals exists, it is difficult to recover the key information they stood for.
There’s always a tradeoff between the risk of your original data set being reconstructed and the loss of its value. If you anonymize your data correctly, it will lose its link to the original dataset, unlike pseudonymized data, where that link is still present and can enable identification. Properly anonymized data is secure, stays as close to the real thing as possible and will still provide value to your business.
Depending on what level of anonymization you require, the data will still provide relevant relationships and properties that you need to make well-informed decisions, all while keeping your data subjects safe. For example, you can still get meaningful website traffic analysis using anonymized data.
If the source data is kept after anonymization takes place, the result is actually pseudonymized data. It is still considered personal data and is therefore ‘identifiable’.
Holding on to original, incorrectly processed data puts your business at risk. This data can only legally be processed in accordance with the relevant data protection legislation, including the GDPR. This ties into the next mistake you should be aware of:
So, you’ve successfully deidentified a data set and prepped it for a third party for analysis.
However, it’s likely that some of this information overlaps with data stored elsewhere in your business and can therefore be linked to it. Once the data sets are combined, the data becomes identifiable again.
To be safe, you can anonymize all occurrences of the data, reducing the risk of information being linked to individuals. However, you don’t necessarily need to anonymize absolutely everything. Just make sure that the data sets available to a specific party don’t pose a threat when they are combined.
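One rough way to check whether two data sets pose a threat when combined is to join them and look for combinations of values that occur only once. The sketch below assumes both sets share a `ref` column; the data and the helper names `join_on` and `unique_combinations` are hypothetical.

```python
from collections import Counter


def join_on(left, right, key):
    """Combine records from two data sets that share the same key value."""
    index = {r[key]: r for r in right}
    return [{**rec, **index[rec[key]]} for rec in left if rec[key] in index]


def unique_combinations(records, columns):
    """Return records whose values for `columns` occur exactly once,
    meaning that combination singles out one record."""
    keys = [tuple(r[c] for c in columns) for r in records]
    counts = Counter(keys)
    return [r for r, k in zip(records, keys) if counts[k] == 1]


# Two hypothetical data sets given to the same third party.
set_a = [
    {"ref": "r1", "postcode_area": "LS1"},
    {"ref": "r2", "postcode_area": "LS1"},
    {"ref": "r3", "postcode_area": "LS2"},
    {"ref": "r4", "postcode_area": "LS2"},
]
set_b = [
    {"ref": "r1", "birth_year": 1984},
    {"ref": "r2", "birth_year": 1951},
    {"ref": "r3", "birth_year": 1984},
    {"ref": "r4", "birth_year": 1951},
]

alone_a = unique_combinations(set_a, ["postcode_area"])
alone_b = unique_combinations(set_b, ["birth_year"])
combined = join_on(set_a, set_b, "ref")
exposed = unique_combinations(combined, ["postcode_area", "birth_year"])
```

On its own, each data set has no unique values (`alone_a` and `alone_b` are empty), yet every one of the four combined records ends up in `exposed`. A check like this is only a first indicator, not proof that the data is safe.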
The context of your data’s purpose determines the type of anonymization that needs to occur.
There are different anonymization techniques, which you can use individually or in combination, depending on the size and sensitivity of your data. You may also pair these with other privacy best practices.
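As a minimal sketch of what such techniques look like in practice, here are simplified versions of masking, noise addition and randomization by shuffling. The function names, the sample records and the specific rules (for example, keeping the first character of an email address) are assumptions for illustration, not a recommendation for any particular data set.

```python
import random


def mask_email(email):
    """Masking: keep the first character and the domain, hide the rest."""
    local, _, domain = email.partition("@")
    return local[:1] + "*" * max(len(local) - 1, 0) + "@" + domain


def add_noise(value, spread, rng):
    """Noise addition: shift a number by a random amount within +/- spread."""
    return value + rng.uniform(-spread, spread)


def shuffle_column(records, column, rng):
    """Randomization: shuffle one column so values no longer sit with
    the record they came from, while the overall distribution is kept."""
    values = [r[column] for r in records]
    rng.shuffle(values)
    return [{**r, column: v} for r, v in zip(records, values)]


rng = random.Random(42)  # seeded only so the example is repeatable

people = [
    {"email": "alice@example.com", "salary": 52000.0, "city": "Leeds"},
    {"email": "bob@example.com", "salary": 48000.0, "city": "York"},
    {"email": "carol@example.com", "salary": 61000.0, "city": "Hull"},
]

processed = shuffle_column(people, "city", rng)
processed = [
    {**p, "email": mask_email(p["email"]), "salary": add_noise(p["salary"], 1000.0, rng)}
    for p in processed
]
```

Each technique trades a different kind of accuracy for privacy: masking destroys the value itself, noise keeps it approximately right, and shuffling keeps the column’s statistics but breaks the link to the individual record.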
Differential privacy is another trending data protection technique that companies such as Apple, Google and Uber use.
Put simply, differential privacy is one of many types of privacy protection. It is a mathematical definition of privacy in the context of statistical and machine learning analysis. It’s a useful, albeit complicated, approach that allows you to measure the privacy of a database rather than actually privatize your data. Another problem with differential privacy is that it cannot provide useful results for smaller samples in the way anonymization can.
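To make this more concrete, here is a minimal sketch of one common way to achieve differential privacy for a counting query: adding Laplace noise scaled to a privacy parameter, usually called epsilon. The function names and the data are hypothetical, and the Laplace noise is drawn as the difference of two exponential samples because Python’s standard library has no Laplace sampler.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two
    exponential samples with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng):
    """Count matching values, then add noise calibrated to epsilon.

    Adding or removing one person changes a count by at most 1,
    so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)  # seeded only so the example is repeatable
ages = [23, 35, 41, 29, 62, 57, 33, 48]

# "How many people are 40 or older?" The true answer is 4.
noisy_answer = private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
```

The noise has the same scale whether the true count is 4 or 4 million. For a large count it barely matters, but for a small sample like this one it can be as large as the answer itself, which is the limitation mentioned above.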
Mistakes and misconceptions are common when it comes to anonymizing data. We’ve covered eight of the biggest in this blog post.
Put simply, gaining control of your data and complying with data protection laws, all while being able to draw value from your data, is of the utmost importance for your business.
Anonymization ensures you benefit from your data safely by removing sensitivity and keeping the data as close to the original source as possible.
So, are you falling into any of these anonymization traps? Perhaps it’s time to re-evaluate your techniques.