Data anonymization is a process of masking data so that individual people or records can’t be identified. It enables life-like data to be used for software testing, analytics, visualization or sharing with third parties, but with any sensitive data safely obscured.
Unlike the output of some other methods of creating non-identifiable data, anonymized data resembles your original dataset as closely as possible, keeping important characteristics and relationships that can make a big difference to your analysis or testing.
In an ideal world, you’d test on your production data to give you the most accurate test results. But there are many reasons you can’t do that:
There are various ways of creating fake data that avoid some of these issues and reduce the risk of exposing sensitive information. We look at some of these below, and each has its pros and cons. Essentially, for the most accurate testing you want to use data that’s fake, but still captures real-world situations.
Data can be made un-identifiable in various ways, some easier than others, and some more useful than others.
Randomizing data is relatively easy, but provides limited value. As you can see in the example below, your data may retain some formatting rules, but otherwise is pretty much gibberish. This means that if you want to test your software, you don’t have a very realistic dataset (so your tests aren’t going to catch many potential bugs). It also means that you can’t perform any meaningful analysis on your data.
Field | BEFORE (actual data) | AFTER (randomized data) |
Name | Frank Smith | Xxuzyg Mbdhu |
Social Security Number | 543-69-1573 | 888-88-8888 |
City | Denver | Xyzzz |
Date of Birth | 24 Jul 1975 | 1 Jan 2000 |
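To make the idea of randomization concrete, here is a minimal Python sketch. It is an illustration only, not how any particular tool does it: the `randomize_value` helper and the sample record are assumptions made up for this example. It keeps the formatting (dashes, spaces, letter case) and replaces everything else with random characters.

```python
import random
import string

def randomize_value(value: str, rng: random.Random) -> str:
    """Swap each letter for a random letter and each digit for a random digit.

    Punctuation and spaces are kept, so the format survives but the
    content becomes meaningless.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)
    return "".join(out)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
record = {"name": "Frank Smith", "ssn": "543-69-1573", "city": "Denver"}  # hypothetical row
randomized = {field: randomize_value(value, rng) for field, value in record.items()}
print(randomized)
```

The result still looks like a name, an SSN and a city, but none of the values mean anything, which is why this kind of data is of limited use for testing or analysis.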
Synthetic data (artificial data that resembles your original dataset but contains completely fake information) is a step above randomized data. It’s more elaborate and generates valid values, so it’s better for testing and analysis, but it still has some limitations.
Field | BEFORE (actual data) | AFTER (synthetic data) |
Name | Frank Smith | John Doe |
Social Security Number | 543-69-1573 | 123-45-6789 |
City | Denver | Chicago |
Date of Birth | 24 Jul 1975 | 8 Feb 2014 |
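As a rough sketch of synthetic data generation, the Python below builds records from scratch using small pools of plausible values. Everything here is an assumption for illustration: the name and city lists, the date range and the `synthetic_record` helper are invented, and the SSN is only shaped like one rather than checked against any official rules.

```python
import random
from datetime import date, timedelta

# Hypothetical value pools; a real generator would use much larger lists.
FIRST_NAMES = ["John", "Maria", "Wei", "Aisha"]
LAST_NAMES = ["Doe", "Garcia", "Chen", "Khan"]
CITIES = ["Chicago", "Boston", "Seattle", "Austin"]

def synthetic_record(rng: random.Random) -> dict:
    """Build one fake but well-formed record, unrelated to any real row."""
    earliest = date(1940, 1, 1)
    latest = date(2015, 12, 31)
    birth = earliest + timedelta(days=rng.randrange((latest - earliest).days + 1))
    # SSN-shaped string only; no check against official issuing rules.
    ssn = f"{rng.randrange(1, 900):03d}-{rng.randrange(1, 100):02d}-{rng.randrange(1, 10000):04d}"
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "ssn": ssn,
        "city": rng.choice(CITIES),
        "date_of_birth": birth,
    }

rng = random.Random(7)  # fixed seed only so the sketch is repeatable
synthetic_rows = [synthetic_record(rng) for _ in range(3)]
for row in synthetic_rows:
    print(row)
```

Because each field is drawn independently, the values are well-formed but carry none of the patterns of the original dataset, such as how many people actually live in each city.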
Unlike synthetic data, anonymized data does preserve some attributes of your original dataset.
Anonymization can, for example, change 'Frank from Denver' into 'John from Denver'. This is no longer a real person, but your data still keeps accurate information on the number of people in Denver. You do, of course, lose information on the number of Franks, so it’s important to decide which information you need to keep in your particular case.
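The 'Frank from Denver' example can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the replacement name list, the `anonymize_names` helper and the sample records are all hypothetical.

```python
import random
from collections import Counter

# Hypothetical pool of replacement first names.
REPLACEMENT_NAMES = ["John", "Maria", "Wei", "Aisha"]

def anonymize_names(records, rng):
    """Replace each name with a random one and leave every other field untouched."""
    return [{**r, "name": rng.choice(REPLACEMENT_NAMES)} for r in records]

people = [  # hypothetical sample rows
    {"name": "Frank", "city": "Denver"},
    {"name": "Alice", "city": "Denver"},
    {"name": "Frank", "city": "Boston"},
]
anonymized = anonymize_names(people, random.Random(42))

city_counts_before = Counter(p["city"] for p in people)
city_counts_after = Counter(p["city"] for p in anonymized)
print(city_counts_before == city_counts_after)  # city distribution is unchanged
```

The number of people per city is identical before and after, while the count of Franks is gone. Which fields to replace and which to keep is the decision that matters for your use case.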
Rather than creating completely fake data, data anonymization masks your existing dataset. There are several different methods of anonymizing your data, including changing certain values to remove identifying information, shuffling data around or altering values slightly.
The example below shows how some characteristics of the original production data remain, but it's no longer possible to identify individual records or tie sensitive information to a particular person.
Field | BEFORE (actual data) | AFTER (anonymized data), example 1 | AFTER, example 2 | AFTER, example 3 |
Name | Frank Smith | 町 達雄 | 町 達雄 | Frank Smith |
SSN | 543-69-1573 | 235-41-8875 | 543-67-0008 | 235-81-9568 |
City | Denver | New York | Delaware | Minneapolis |
Birth | 24 Jul 1975 | 14 Sep 1957 | 28 Jul 1975 | 17 Sep 1957 |
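Two of the methods mentioned above, shuffling data around and altering values slightly, might look something like this in Python. Treat it as a sketch only: the `shuffle_column` and `jitter_date` helpers, the five-day window and the sample rows are assumptions made for illustration.

```python
import random
from datetime import date, timedelta

def shuffle_column(records, field, rng):
    """Redistribute one field's values across records, keeping the overall set of values."""
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]

def jitter_date(d, max_days, rng):
    """Shift a date by a random number of days within +/- max_days."""
    return d + timedelta(days=rng.randint(-max_days, max_days))

rng = random.Random(3)  # fixed seed only so the sketch is repeatable
rows = [  # hypothetical sample rows
    {"city": "Denver", "birth": date(1975, 7, 24)},
    {"city": "New York", "birth": date(1957, 9, 14)},
    {"city": "Minneapolis", "birth": date(1982, 3, 2)},
]
masked = shuffle_column(rows, "city", rng)
masked = [{**r, "birth": jitter_date(r["birth"], 5, rng)} for r in masked]
for row in masked:
    print(row)
```

Shuffling keeps the exact set of cities in the dataset but breaks the link between a city and a specific record, while the small date shift keeps ages roughly right without exposing the true date of birth.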
Anonymized data:
Our webinar Data Anonymization for Better Software Testing explores how data anonymization can help you get better test data and improve the quality of your software releases. We'll go into detail about data anonymization; the different methods of achieving it; and the pros and cons of each approach, as well as taking a look at how CloverDX can help manage the data anonymization process at enterprise scale.