Data Anonymization
How to improve your release quality with better test data
Pavel Švec
Senior Consultant
CloverDX consultant
6 yrs experience on data engineering projects
15 yrs experience in SW development
Agenda
When can anonymization be used?
Why does realistic data matter?
Data “manufacturing” techniques
Synthesis and anonymization comparison
Anonymization strategies
Solution for enterprise-level data privacy
Anonymization use cases
Software development
Perhaps the most common case I have encountered. Developers desperately need datasets that make their applications as robust as possible by covering all the edge cases of production systems. Anonymization can mask data with meaningful values while keeping relationships coherent, so it is a great data provisioning method for any development or test department.
Example
Credit card fraud detection often requires multiple siloed systems to collaborate in detecting anomalies. By their nature, these systems contain sensitive information that should not find its way outside the production environment.
Machine learning
Engineers and scientists often have a limited amount of data to train and test their AI models. Thanks to its properties, data anonymization may be a viable way to synthesize additional datasets backed by real data. As an added benefit, the similarity to the original can be fully controlled, on a spectrum from keeping the original data to completely synthesized data.
Example 1
A small to medium-sized service provider that supplies ML software to predict traffic congestion or high-hazard segments of a road network.
Example 2
Target shuffling could be one use of data anonymization:
https://www.elderresearch.com/company/resource-center/videos/target-shuffling-presentation-berkleyhaas
Honorable mentions
Data quality for software testing
Testing with fabricated data only
= testing in production!
To paraphrase Sheldon Cooper:
“It’s funny because it’s true.”
Fabricated data:
Work with assumptions which are not always reliable
Tend to test algorithms, not functionality (especially during unit and integration tests)
Are based on experience, best practices and known boundary conditions
Take time to produce
Serve a single purpose only
Why does it matter?
Before go-live After go-live
Generated (synthetic) test data
Real or life-like (anonymized) test data
Production data
Name            Frank Smith     王秀英
SSN             543-69-1573     235-41-8875
City            Denver          New York
Date of Birth   24 Jul 1975     14 Sep 1957

Randomized / Synthetic
Name            Abc Def         John Doe
SSN             888-88-8888     123-45-6789
City            Xyz             Chicago
Date of Birth   1 Jan 2000      8 Feb 2014

Anonymized
Name            王秀英          Frank Smith
SSN             543-67-0008     235-81-9568
City            Delaware        Minneapolis
Date of Birth   28 Jul 1975     17 Sep 1957
Production and synthesized data have different characteristics
Synthesized data is often prone to dictionary or programming limitations,
e.g. unawareness of regional customs or boundary conditions (international characters, mixed-up inputs)
Best testing dataset? Production data. But hold on a second…
No product owner will grant unnecessary permissions on a system they are responsible for
Some software requires a full license when working with production data, even in a development setting
Privacy and regulatory requirements
Solution?
Give your product owner a tool to copy data out of production which:
• Allows full control over when and how services are impacted
• Provides reliable but obscured data
Why is there a discrepancy in usefulness
between synthesized and production data?
What is data synthesis?
The process of fabricating data, resulting in randomized data that is valid in a given context and domain.
In other words, instead of a random character sequence such as Xxuzyg Mbdhu, synthesis for the domain of people’s names gives John Sebastian Doe
For the given context City of London, the street domain may yield Baker Street
Limited capacity to simulate production situations
Only as good as the underlying datasets and models
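The difference between a random character sequence and dictionary-based synthesis can be sketched in a few lines of Python. This is only an illustration: the word lists and the function names (`random_characters`, `synthesize_name`, `synthesize_street`) are hypothetical and are not part of any CloverDX product.

```python
import random
import string

# Hypothetical dictionaries for illustration only. A real synthesizer would
# rely on much larger, region-aware word lists.
FIRST_NAMES = ["John", "Frank", "Mary", "Anna"]
MIDDLE_NAMES = ["Sebastian", "Lee", "Grace"]
LAST_NAMES = ["Doe", "Smith", "Jones"]
STREETS_BY_CITY = {
    "City of London": ["Baker Street", "Fleet Street", "Cannon Street"],
}


def random_characters(rng, length=6):
    """Naive generation: a capitalized sequence of random letters."""
    letters = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return letters.capitalize()


def synthesize_name(rng):
    """Dictionary-based synthesis: random, but valid for the 'person name' domain."""
    return " ".join(
        [rng.choice(FIRST_NAMES), rng.choice(MIDDLE_NAMES), rng.choice(LAST_NAMES)]
    )


def synthesize_street(rng, city):
    """Context-aware synthesis: the street is drawn from the given city's dictionary."""
    return rng.choice(STREETS_BY_CITY[city])


rng = random.Random()
print(random_characters(rng), random_characters(rng))  # meaningless letters
print(synthesize_name(rng))                            # a plausible full name
print(synthesize_street(rng, "City of London"))        # a street valid for the context
```

The output is only ever as varied as the dictionaries behind it, which is the "only as good as the underlying datasets and models" limitation in practice.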
What is anonymization?
The process of masking input data so that they keep some of their original attributes, but not to the extent that they could be used to infer a relation to real people or entities.
Even simple data shuffling can make John from New York a Frank from New York
Will not change the population of New York,
i.e. keeps some statistical characteristics
(e.g. might lose the information about how many Johns live in NY)
Transient translation tables may keep data consistent across multiple systems while allowing each execution to yield different results
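A minimal sketch of the two ideas above, column shuffling and a transient translation table, might look as follows. The helper `shuffle_column` and the class `TranslationTable` are hypothetical names chosen for this example, and the replacement values are made up.

```python
import random
from collections import Counter


def shuffle_column(records, column, rng):
    """Return copies of the records with one column's values permuted across rows."""
    values = [r[column] for r in records]
    rng.shuffle(values)
    return [{**r, column: v} for r, v in zip(records, values)]


class TranslationTable:
    """Transient mapping: consistent within one run, different between runs.

    Simplification: two different inputs may receive the same replacement.
    """

    def __init__(self, replacements, rng):
        self._replacements = list(replacements)
        self._rng = rng
        self._mapping = {}

    def translate(self, value):
        # The first lookup picks a replacement; later lookups reuse it.
        if value not in self._mapping:
            self._mapping[value] = self._rng.choice(self._replacements)
        return self._mapping[value]


people = [
    {"name": "John", "city": "New York"},
    {"name": "Frank", "city": "Denver"},
    {"name": "Mary", "city": "New York"},
]
rng = random.Random()  # unseeded, so every execution yields different results

shuffled = shuffle_column(people, "name", rng)
# Only the name column was permuted, so the population per city is unchanged.
print(Counter(p["city"] for p in shuffled))

table = TranslationTable(["Alice", "Bob", "Carol"], rng)
# The same input gets the same replacement in every system processed in this run.
print(table.translate("John"), table.translate("John"))
```

Because the table lives only for the duration of one run, a later run starts with an empty mapping and produces a different, but again internally consistent, translation.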
Examples of anonymization classes
Shuffling
Will retain distribution and values
If the data contains errors, these are kept too
Mask
Changes values but keeps some identification, discarding sensitive information
Usually uses a pseudo-randomization technique
e.g. 223-64-8630 → 223-86-0042 will remain in the even group of Virginia SSNs
or IBAN CH9300762011623852957 → CH3729874746184983012 is still a valid Swiss one
Jitter
Returns a randomized value with configurable jitter
e.g. date of birth 5th Aug 1972 with jitter set to 3 days can result in 7th Aug 1972 or 2nd Aug 1972
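The jitter and mask classes can be sketched roughly like this. Both functions are hypothetical illustrations. In particular, `mask_ssn` assumes that "keeping identification" means keeping the area number and the parity of the group number, as in the example above, and it does not model the real SSN issuing rules.

```python
import random
from datetime import date, timedelta


def jitter_date(value, max_days, rng):
    """Shift a date by a random whole number of days in [-max_days, +max_days].

    A shift of 0 days is possible, so the original date can occasionally survive.
    """
    return value + timedelta(days=rng.randint(-max_days, max_days))


def mask_ssn(ssn, rng):
    """Keep the area number and the parity of the group number; randomize the rest."""
    area, group, _serial = ssn.split("-")
    parity = int(group) % 2
    # Candidate two-digit groups (01..99) with the same parity as the original.
    candidates = [g for g in range(1, 100) if g % 2 == parity]
    new_group = rng.choice(candidates)
    new_serial = rng.randint(1, 9999)
    return f"{area}-{new_group:02d}-{new_serial:04d}"


rng = random.Random()
print(jitter_date(date(1972, 8, 5), 3, rng))  # some date between 2 and 8 Aug 1972
print(mask_ssn("223-64-8630", rng))           # still starts with 223, group stays even
```

Jitter keeps values close enough to the original for age-based logic to behave realistically, while the mask keeps only the parts of the identifier that carry no personal information.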
Anonymized, but still fake data. Correct?
Has parameters similar to synthesized data
Looks like production data
Is valid in given context
In addition to these, may retain real world properties:
Invalid values, encoding discrepancies and other impurities
Relationships
Statistical distribution
Yes, very much so…
It may not seem so, but it is a GOOD thing
[Figure: Wealth per Capita, source data (Wikipedia) compared with generated data]
Card Number – Example of an Anonymization Rule
Naively generated: 1234 5678 9012 3456
Randomly generated digits
Properly anonymized: 4024 0071 4314 0399
Keeping issuer code: VISA credit card, issued by Bank of America
Randomized account number
Valid Luhn checksum
Preserves card types, issuers, preserves validity
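A rule of this kind can be sketched with the Luhn algorithm. The code below is an illustration under stated assumptions, not the CloverDX implementation: the function names are hypothetical, and the issuer prefix is assumed to be the first 6 digits (`prefix_len=6`).

```python
import random


def luhn_checksum_ok(number):
    """Return True if the digits in the string pass the Luhn check."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def luhn_check_digit(payload):
    """Compute the digit that makes payload + digit pass the Luhn check."""
    total = 0
    for i, c in enumerate(reversed(payload)):
        d = int(c)
        if i % 2 == 0:  # these positions are doubled once the check digit is appended
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def anonymize_card(number, rng, prefix_len=6):
    """Keep the leading issuer digits, randomize the account part, fix the checksum."""
    digits = "".join(c for c in number if c.isdigit())
    prefix = digits[:prefix_len]
    account_len = len(digits) - prefix_len - 1  # last digit is the check digit
    account = "".join(str(rng.randint(0, 9)) for _ in range(account_len))
    payload = prefix + account
    return payload + luhn_check_digit(payload)


rng = random.Random()
print(luhn_checksum_ok("1234 5678 9012 3456"))  # the naive number from the slide
print(luhn_checksum_ok("4024 0071 4314 0399"))  # the anonymized number from the slide
print(anonymize_card("4024 0071 4314 0399", rng))
```

Of the two numbers on the slide, only 4024 0071 4314 0399 passes the Luhn check. A system under test that validates card numbers would therefore reject the naively generated value before any real business logic runs.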
Now I’m confused…
Synthesized, Anonymized? Which one should I go for?
Synthesized
• Completely randomized (generated)
• Doesn’t reflect reality
• There are cheap tools to source synthetic data
• Useful for smaller-scale applications or specific features where inputs are more atomic, without relations and dependencies
• Some data synthesizers can go as far as generating related and/or dependent data as well, but are still limited by the lack of a realistic model
Anonymized
• Mimics real-world behavior but is trickier to generate
• The original data must be masked so that it cannot be reconstructed or inferred
• Preserves real-world relationships and challenges (e.g. inconsistencies, missing values, duplicates, etc.)
• Can be used in end-to-end system testing and AI applications
• Does not skew the perception of reality
Both are free from PII or other sensitive information
Leaving theory
Data source discovery (CloverDX Harvester)
Interrogation of data sources
Data model and categorisation
Suggestion for anonymization strategy
Enterprise scale anonymization architecture
Sensitive data discovery (Harvester)
Configure anonymization policies (per domain)
Anonymization Engine
Production data
Anonymized
CloverDX Anonymization Engine
How we do it on systems with thousands of tables
Time for a little demo:
Q&A
hello@cloverdx.com

 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 

Data Anonymization For Better Software Testing

  • 1. Data Anonymization How to improve your release quality with better test data Pavel Švec
  • 2. Pavel Švec Senior Consultant CloverDX consultant 6 yrs experience on data engineering projects 15 yrs experience in SW development
  • 3. Agenda: When can anonymization be used? Why does realistic data matter? Data “manufacturing” techniques. Synthesis and anonymization comparison. Anonymization strategies. Solution for enterprise-level data privacy.
  • 5. Software development: Perhaps the most common case I’ve ever encountered. Developers desperately need datasets to make their applications as robust as possible, covering all the border cases of production systems. Anonymization can mask data with meaningful values while keeping relationships coherent, so it is a great data provisioning method for any development or test department. Example: Credit card fraud detection often requires several siloed systems to work together to detect anomalies. By nature these systems contain sensitive information that should not find its way out of a production environment.
  • 6. Machine learning: Engineers and scientists often have limited amounts of data to train and test their AI models. Thanks to its properties, data anonymization may be a viable way to derive additional datasets backed by real data. As an added benefit, similarity to the source can be fully controlled, on a spectrum from keeping the original to completely synthesized data. Example 1: A small to medium-sized service provider offering ML software that predicts traffic congestion or high-hazard segments of a road network. Example 2: Target shuffling could be one use of data anonymization: https://www.elderresearch.com/company/resource-center/videos/target-shuffling-presentation-berkleyhaas
  • 8. Data quality for software testing
  • 9. Test with fabricated data only = testing on production! To paraphrase Sheldon Cooper: “It’s funny because it’s true.” Fabricated data: works with assumptions that are not always reliable; tends to test algorithms rather than functionality (especially during unit and integration tests); is based on experience, best practices and known border conditions; takes time to produce; serves a single purpose only.
  • 10. Why does it matter? [Figure: generated (synthetic) test data compared with real or life-like (anonymized) test data, before go-live and after go-live.]
  • 11. Production data compared with randomized / synthetic data and anonymized data (two sample records each):
      Production data
        Name            Frank Smith    王秀英
        SSN             543-69-1573    235-41-8875
        City            Denver         New York
        Date of Birth   24 Jul 1975    14 Sep 1957
      Randomized / Synthetic
        Name            Abc Def        John Doe
        SSN             888-88-8888    123-45-6789
        City            Xyz            Chicago
        Date of Birth   1 Jan 2000     8 Feb 2014
      Anonymized
        Name            王秀英          Frank Smith
        SSN             543-67-0008    235-81-9568
        City            Delaware       Minneapolis
        Date of Birth   28 Jul 1975    17 Sep 1957
  • 12. Why is there a discrepancy in usefulness between synthesized and production data? Production and synthesized data have different characteristics. Synthesized data is often prone to dictionary or programming limitations, e.g. unawareness of regional customs or border conditions (international characters, mixed-up inputs). Best testing dataset? Production data. But hold on a second… No product owner will grant unnecessary permissions on a system they are responsible for. Some software requires a full license when working with production data, even in a development setting. Privacy and regulatory requirements also apply. Solution? Give your product owner a tool to copy data out of production which: • Allows full control over when and how services are impacted • Provides reliable but obscured data
  • 13. What is data synthesis? The process of fabricating data, resulting in randomized data that is valid in a given context and domain. In other words, instead of a random character sequence such as Xxuzyg Mbdhu, synthesis for the domain of people’s names gives John Sebastian Doe. For the context City of London, the street domain may yield Baker Street. Limited capacity to simulate production situations. Only as good as the underlying datasets and models.
  • 14. What is anonymization? The process of masking input data so that it keeps some of its original attributes, but not to the extent that it could be used to infer a relation to real people or entities. Even simple data shuffling can turn John from New York into Frank from New York. It will not change the population of New York, i.e. it keeps some statistical characteristics (although it might lose information such as how many Johns live in NY). Transient translation tables can keep data consistent across multiple systems while still yielding different results for each execution.
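The shuffling and translation-table ideas from slide 14 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CloverDX implementation: the function names `shuffle_column` and `build_translation_table`, the sample records and the pseudonym list are all hypothetical.

```python
import random
from collections import Counter


def shuffle_column(rows, column, rng):
    """Return copies of rows with the values of one column permuted."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


def build_translation_table(values, pseudonyms, rng):
    """Map each distinct original value to one pseudonym for this run."""
    distinct = sorted(set(values))
    # rng.sample raises ValueError if there are fewer pseudonyms than values.
    picks = rng.sample(pseudonyms, len(distinct))
    return dict(zip(distinct, picks))


# Hypothetical sample records, not taken from the slides.
people = [
    {"name": "John", "city": "New York"},
    {"name": "Frank", "city": "New York"},
    {"name": "Mary", "city": "Denver"},
    {"name": "Alice", "city": "Denver"},
]

rng = random.Random()  # unseeded, so every execution gives a different result
shuffled = shuffle_column(people, "name", rng)

# Cities stay in place, so the population per city is unchanged.
assert Counter(p["city"] for p in shuffled) == Counter(p["city"] for p in people)
# The set of names survives as a whole; only their assignment to rows changes.
assert Counter(p["name"] for p in shuffled) == Counter(p["name"] for p in people)

# One table, built once per run, keeps two systems consistent with each other.
table = build_translation_table(
    [p["name"] for p in people], ["Ann", "Bob", "Cid", "Dee", "Eve"], rng
)
crm = [table[p["name"]] for p in people]            # first system
billing = [table[name] for name in ["Frank", "John"]]  # second system
```

Note that a plain shuffle may occasionally leave a value in its original row, and that the translation table is only "transient" if it is discarded after the run.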
  • 15. Examples of anonymization classes. Shuffling: retains the distribution and the values; if the data contains errors, these are kept too. Mask: changes values but keeps some identifying structure, discarding sensitive information; usually uses a pseudo-randomization technique, e.g. 223-64-8630 → 223-86-0042 remains in the even group of Virginia SSNs, and IBAN CH9300762011623852957 → CH3729874746184983012 is still a valid Swiss one. Jitter: returns a randomized value with configurable jitter, e.g. a date of birth of 5 Aug 1972 with jitter set to 3 days can result in 7 Aug 1972 or 2 Aug 1972.
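The jitter and mask classes from slide 15 can be sketched as follows. This is a hedged illustration only: `jitter_date` and `mask_ssn` are hypothetical helper names, and the SSN rule used here (keep the three-digit area number, keep the parity of the group, randomize everything else) is a simplification inferred from the slide's example, not a full model of SSN allocation.

```python
import random
from datetime import date, timedelta


def jitter_date(value, max_days, rng):
    """Shift a date by a random whole number of days within +/- max_days."""
    # Assumption: an offset of 0 is allowed, so the value may stay unchanged.
    return value + timedelta(days=rng.randint(-max_days, max_days))


def mask_ssn(ssn, rng):
    """Keep the area number and the parity of the group, randomize the rest."""
    area, group, _serial = ssn.split("-")
    parity = int(group) % 2
    new_group = rng.choice([g for g in range(1, 100) if g % 2 == parity])
    new_serial = rng.randint(1, 9999)
    return f"{area}-{new_group:02d}-{new_serial:04d}"


rng = random.Random()  # unseeded: results differ between executions
born = date(1972, 8, 5)
jittered = jitter_date(born, 3, rng)
masked = mask_ssn("223-64-8630", rng)
print(jittered, masked)
```

With the jitter set to 3 days, the result always falls between 2 Aug 1972 and 8 Aug 1972, and the masked SSN always keeps the `223` prefix with an even group number.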
  • 16. Anonymized, but still fake data. Correct? Yes, very much so… It may not seem so, but that is a GOOD thing. Anonymized data has similar parameters to synthesized data: it looks like production data and is valid in the given context. In addition, it may retain real-world properties: invalid values, encoding discrepancies and other impurities; relationships; statistical distribution.
  • 17. Wealth per Capita (Source: Wikipedia) Wealth per Capita (Generated)
  • 18. Card Number: Example of an Anonymization Rule. Naively generated: 1234 5678 9012 3456 (randomly generated digits). Properly anonymized: 4024 0071 4314 0399 (keeps the issuer code: VISA credit card issued by Bank of America; randomized account number; valid Luhn checksum). Preserves card types, issuers and validity.
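The card-number rule from slide 18 can be sketched as below: keep the issuer prefix, randomize the account digits and recompute the Luhn check digit. This is a minimal sketch, not the CloverDX rule itself; the function names and the assumption that the issuer code is the first six digits are introduced here for illustration.

```python
import random


def luhn_valid(number):
    """Check a card number (spaces allowed) against the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counted from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def luhn_check_digit(partial):
    """Compute the check digit for a digit string that lacks one."""
    total = 0
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:  # these positions are doubled once the check digit is appended
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def anonymize_card(number, rng, keep=6):
    """Keep the first `keep` digits, randomize the rest, fix the checksum."""
    digits = "".join(c for c in number if c.isdigit())
    prefix = digits[:keep]  # assumed to be the issuer code
    body_len = len(digits) - keep - 1
    body = "".join(str(rng.randint(0, 9)) for _ in range(body_len))
    partial = prefix + body
    return partial + luhn_check_digit(partial)


rng = random.Random()
original = "4024 0071 4314 0399"  # the anonymized example shown on the slide
masked = anonymize_card(original, rng)

assert luhn_valid(original)
assert not luhn_valid("1234 5678 9012 3456")  # the naive example fails the check
assert masked.startswith("402400") and luhn_valid(masked)
```

Because the prefix is untouched, anything that keys on the issuer or card type behaves the same on the masked number, while the account portion no longer matches the original except by chance.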
  • 19. Now I’m confused… Synthesized, Anonymized? Which one should I go for?
  • 20. Synthesized • Completely randomized (generated) • Doesn’t reflect reality • There are cheap tools to source synthetic data • Useful for smaller-scale applications or specific features where inputs are more atomic, without relations and dependencies • Some data synthesizers can go as far as also generating related and/or dependent data, but are still limited by the lack of a realistic model. Anonymized • Mimics real-world behavior but is trickier to generate • We need to mask the original data in such a way that it cannot be reconstructed or inferred • Preserves real-world relationships and challenges (e.g. inconsistencies, missing values, duplicates, etc.) • Can be used in end-to-end system testing and AI applications • Does not skew the perception of reality. Both are free from PII or other sensitive information.
  • 22. Data source discovery (CloverDX Harvester) Interrogation of data sources Data model and categorisation Suggestion for anonymization strategy
  • 23. Enterprise scale anonymization architecture Sensitive data discovery (Harvester) Configure anonymization policies (per domain) Anonymization Engine Production data Anonymized
  • 24. CloverDX Anonymization Engine How we do it on systems with thousands of tables Time for a little demo: