If you're working to a continuous delivery schedule, you need robust testing in place to avoid embarrassing problems after going live.
Watch the webinar now and learn:
How to test on production data without breaking compliance
Why generated (synthesized) data doesn't cut it
The benefits of data anonymization you might not know
Watch the webinar in full here: https://www.cloverdx.com/gc/lp/webinar/data-anonymization-improve-release-quality
3. When can anonymization be used?
Why does realistic data matter?
Data “manufacturing” techniques
Synthesis and anonymization comparison
Anonymization strategies
Solution for enterprise-level data privacy
Agenda
5. Maybe the most common case I’ve ever encountered. Developers desperately
need datasets to make their applications as robust as possible, covering all
the edge cases of production systems. Anonymization can mask data with
meaningful values while keeping relationships coherent, which makes it a
great data provisioning method for any development or test department.
Example
Credit card fraud detection often requires collaboration between multiple siloed systems to
detect anomalies. These systems by nature contain sensitive information that should not find
its way outside of a production environment.
Software development
6. Engineers and scientists often have limited amounts of data to train and test
their AI models. Thanks to its properties, data anonymization can be a viable
way to synthesize additional datasets backed by real data. As an added
benefit, the degree of similarity can be fully controlled, on a spectrum from
keeping the original data to generating completely synthesized data.
Example 1
A small-to-medium-sized service provider offering ML software that predicts traffic
congestion or identifies high-hazard segments of a road network.
Example 2
Target shuffling could be one use of data anonymization:
https://www.elderresearch.com/company/resource-center/videos/target-shuffling-presentation-berkleyhaas
Machine learning
9. Test with fabricated data only
= testing on production!
To paraphrase Sheldon Cooper:
“It’s funny because it’s true.”
Fabricated data:
• Works on assumptions which are not always reliable
• Tends to test algorithms rather than functionality (especially
during unit and integration tests)
• Is based on experience, best practices and known
border conditions
• Takes time to produce
• Serves a single purpose only
10. Why does it matter?
[Timeline diagram: before go-live vs. after go-live, comparing generated (synthetic) test data with real or life-like (anonymized) test data]
11. Production data
Name: Frank Smith / 王秀英
SSN: 543-69-1573 / 235-41-8875
City: Denver / New York
Date of Birth: 24 Jul 1975 / 14 Sep 1957
Randomized / Synthetic data
Name: Abc Def / John Doe
SSN: 888-88-8888 / 123-45-6789
City: Xyz / Chicago
Date of Birth: 1 Jan 2000 / 8 Feb 2014
Anonymized data
Name: 王秀英 / Frank Smith
SSN: 543-67-0008 / 235-81-9568
City: Delaware / Minneapolis
Date of Birth: 28 Jul 1975 / 17 Sep 1957
12. Production and synthesized data have different characteristics
Synthesized data is often prone to dictionary or programming limitations,
e.g. unawareness of regional customs or border conditions (international characters, mixed-up inputs)
The best testing dataset? Production data. But hold on a second…
No product owner will grant unnecessary permissions on a system they are responsible for
Some software requires a full license when working with production data, even in a development setting
Privacy and regulatory requirements
Solution?
Give your product owner a tool to copy data out of production which:
• Allows full control over when and how services are impacted
• Provides reliable but obscured data
Why is there a discrepancy in usefulness
between synthesized and production data?
13. The process of data fabrication that results in randomized data, valid in a given
context and domain.
In other words, for the domain of people’s names, synthesis gives John Sebastian Doe
instead of a random character sequence like Xxuzyg Mbdhu
For the given context City of London, the street domain may yield Baker Street
Limited capacity to simulate production situations
Only as good as the underlying datasets and models
What is data synthesis?
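To make the dictionary-backed idea concrete, here is a minimal Python sketch. The mini-dictionaries and the `synthesize_person` helper are hypothetical illustrations (a real synthesizer would use far larger, locale-aware dictionaries and models):

```python
import random

# Hypothetical mini-dictionaries; real synthesizers ship much larger,
# locale-aware ones.
FIRST_NAMES = ["John", "Sebastian", "Frank", "Jane"]
LAST_NAMES = ["Doe", "Smith", "Brown"]
STREETS = {"London": ["Baker Street", "Abbey Road"]}

def synthesize_person(city: str, rng: random.Random) -> dict:
    """Draw values that are valid in their domain (names, streets)
    but carry no relation to any real individual."""
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "street": rng.choice(STREETS[city]),
        "city": city,
    }
```

Every field is plausible in its domain, but the output is only as realistic as the dictionaries behind it, which is exactly the limitation noted above.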
14. The process of masking input data so that it keeps some of its original
attributes, but not to the extent that it could be used to infer a relation to real
people or entities.
Even simple data shuffling can turn John from New York into Frank from New York
It will not change the population of New York,
i.e. it keeps some statistical characteristics
(though it might lose information about how many Johns live in NY)
Transient translation tables can keep data consistent across multiple systems
while yielding different results on each execution
What is anonymization?
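The shuffling idea above can be sketched in a few lines of Python. The `shuffle_column` helper is a hypothetical illustration: names are permuted across records while the city column stays put, so per-city counts are preserved even though John from New York may become Frank from New York.

```python
import random

def shuffle_column(records: list, column: str, rng: random.Random) -> list:
    """Permute one column's values across all records; every other column
    (e.g. the city) stays in place, so per-city counts are unchanged."""
    values = [r[column] for r in records]
    rng.shuffle(values)
    return [{**r, column: v} for r, v in zip(records, values)]
```

Note that cross-column statistics (e.g. how many Johns live in NY) are deliberately destroyed, which is the privacy point.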
15. Examples of anonymization classes
Shuffling
Will retain distribution and values
If there is data containing errors, these are kept too
Mask
Changes values but keeps some identification, discarding sensitive information
Usually uses a pseudo-randomization technique
e.g. 223-64-8630 → 223-86-0042 remains in an even group of Virginia SSNs,
or IBAN CH9300762011623852957 → CH3729874746184983012 is still a valid Swiss one
Jitter
Returns a randomized value with configurable jitter
e.g. date of birth 5th Aug 1972 with jitter set to 3 days can result in
7th Aug 1972 or 2nd Aug 1972
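The jitter class is the simplest to sketch. The `jitter_date` helper below is a hypothetical illustration (whether a zero offset should be allowed is a policy choice for the anonymization rule):

```python
import datetime
import random

def jitter_date(value: datetime.date, days: int, rng: random.Random) -> datetime.date:
    """Shift a date by a uniformly random offset in [-days, +days]."""
    return value + datetime.timedelta(days=rng.randint(-days, days))
```

With `days=3`, a date of birth of 5 Aug 1972 can come out anywhere between 2 Aug and 8 Aug 1972.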
16. Anonymized but still, fake data. Correct?
Has similar parameters to synthesized data
Looks like production data
Is valid in given context
In addition to these, may retain real world properties:
Invalid values, encoding discrepancies and other impurities
Relationships
Statistical distribution
Yes, very much so…
It may not seem like it, but this is a GOOD thing
18. Card Number – Example of an Anonymization Rule
Naively generated: 1234 5678 9012 3456 (randomly generated digits)
Properly anonymized: 4024 0071 4314 0399
• Keeps the issuer code (VISA credit card issued by Bank of America)
• Randomizes the account number
• Valid Luhn checksum
Preserves card types and issuers, and preserves validity
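A rule like this can be sketched in Python. This is a minimal illustration under stated assumptions, not the exact rule from the webinar: keep the 6-digit issuer prefix (BIN), randomize the account number, and recompute the Luhn check digit so the result is still a structurally valid card number.

```python
import random

def luhn_check_digit(payload: str) -> int:
    """Luhn check digit for a digit string (the PAN without its last digit)."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:  # these positions are doubled once the check digit is appended
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10

def anonymize_card(pan: str, rng: random.Random) -> str:
    """Keep the issuer code (first 6 digits), randomize the account number,
    and append a freshly computed, valid Luhn check digit."""
    body = pan[:6] + "".join(str(rng.randrange(10)) for _ in range(len(pan) - 7))
    return body + str(luhn_check_digit(body))
```

The output still looks and validates like a card of the same type and issuer, but the account number carries no relation to the original.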
20. Synthesized
• Completely randomized (generated)
• Doesn’t reflect reality
• There are cheap tools to source synthetic data
• Useful for smaller-scale applications or specific
features where inputs are more atomic
without relations and dependencies
• Some data synthesizers can go as far as to also
generate related and/or dependent data, but they
are still limited by the lack of a realistic model
Anonymized
• Mimics real world behavior but is trickier to
generate
• We need to mask the original data in a way that it
cannot be reconstructed or inferred
• Preserves real world relationships and challenges
(e.g. inconsistencies, missing values, duplicates, etc.)
• Can be used in end-to-end system testing and AI
applications
• Does not skew perception of reality.
Both are free from PII or
other sensitive information