In this webinar we discuss privacy, its relevance to data science, and how privacy-preserving synthetic data can help organizations build a bridge between compliance and efficient use of data.
How businesses can benefit from privacy-preserving synthetic data
1. Statice Webinar
How can businesses benefit from
privacy-preserving synthetic data?
Berlin 2020
2. Statice Webinar | 2020
Outline
1. What is privacy?
2. Data sharing
a. Why share data?
b. Data sharing done wrong
c. Synthetic data as a solution
3. What can you do with synthetic data?
4. Customer cases
5. Q+A
4. Statice Webinar | 2020
● English dictionary definition:
“Privacy is a state in which one is
not observed or disturbed by other
people”
● Lack of privacy => behavioral
change
● Privacy is fundamental to a free
society
Anonymous voting guarantees
freedom of choice
Privacy landscape
6. Statice Webinar | 2020
Privacy in the present
● Digital tracking
everywhere
● Social circle, browsing
habits, shopping details,
location tracking, emails,
calls ...
7. Statice Webinar | 2020
Data protection regulations
● Protection of individual privacy
● Over 80 countries and regions
worldwide
● Strictest regulation
○ GDPR - European Union (2018)
● High fines for violations
https://termly.io/resources/infographics/privacy-laws-around-the-world/
8. Data protection regulations make the use of sensitive data in
your company practically impossible:
Your data teams are slowed down as data is
generally accessible only after a long
governance process
Your production data cannot be stored or
processed using cloud resources as customer
consent is mostly not feasible for exploratory
data analysis.
Your production data cannot be shared
with partners for product development or
research.
9. Statice Webinar | 2020
Privacy promise: Opt-out scenario
● My data must have no
effect on any analysis
carried out on the dataset
● Problem: if nobody’s data
has any effect on any
analysis, then there will be
no utility.
10. Statice Webinar | 2020
Privacy promise:
what can we expect?
● A tradeoff
○ With or without my data,
the outcome of any
analysis should be almost
the same
○ The impact on me of sharing
my information in the dataset
will be limited to the general
learnings, not the specifics
of my information
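One well-known way to formalize a promise of this kind is differential privacy, which the slides do not name explicitly. The sketch below is only an illustration under that assumption: it releases a count with Laplace noise so that the result is statistically similar whether or not one person's record is included. The toy data, the function names and the `epsilon` value are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng):
    """Release a count with Laplace noise calibrated to sensitivity 1."""
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one person changes a count by at most 1,
    # so noise with scale 1/epsilon is used for this query.
    return true_count + laplace_noise(1.0 / epsilon, rng)


def is_senior(age):
    return age >= 65


# Hypothetical toy data: ages in a dataset, without and with "my" record.
without_me = [23, 35, 41, 58, 62, 67, 70]
with_me = without_me + [66]

rng = random.Random(0)
print("without me:", noisy_count(without_me, is_senior, epsilon=0.5, rng=rng))
print("with me:   ", noisy_count(with_me, is_senior, epsilon=0.5, rng=rng))
```

A smaller `epsilon` adds more noise (more privacy, less utility), which is the tradeoff the slide describes.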
12. Statice Webinar | 2020
Why share data?
● As individuals, we share data all the
time
○ With our doctors
○ With our accountants
○ In exchange for a trusted
service
● Privacy is not necessarily complete
non-disclosure
13. Statice Webinar | 2020
Why share data?
● Society benefits from individuals
sharing their data
○ Medical advances
○ Sociological research,
understanding society dynamics
● Examples:
○ Tracking commute patterns to
improve public transport
networks
○ Detecting epidemics and acting fast by
looking at search engine disease
queries/medicine orders
18. Statice Webinar | 2020
Illustration: Cambridge Analytica
● Infamous leak involved Personally Identifiable Information of over 50
million people
https://www.theguardian.com/technology/2018/mar/17/facebook-cambridge-analytica-kogan-data-algorithm
19. Statice Webinar | 2020
Information not unique to you: "quasi-identifiers"
20. Statice Webinar | 2020
Illustration: Massachusetts Governor leak
Sweeney, Latanya. Weaving Technology and Policy Together to Maintain
Confidentiality. Journal of Law, Medicine and Ethics, Vol. 25 1997, p. 98-110
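The slide does not spell out the mechanics, but the cited work is commonly summarized as linking a "de-identified" medical dataset to a public voter list on ZIP code, birth date and sex. The sketch below illustrates that kind of linkage on hypothetical toy records; the field names and data are assumptions made for illustration only.

```python
from collections import defaultdict

# Assumed quasi-identifier columns; none of them is a name or an ID.
QUASI_IDENTIFIERS = ("zip_code", "birth_date", "sex")


def link_records(anonymized, public, keys=QUASI_IDENTIFIERS):
    """Return {name: record} for public entries that match exactly one
    record of the 'anonymized' dataset on all quasi-identifiers."""
    index = defaultdict(list)
    for record in anonymized:
        index[tuple(record[k] for k in keys)].append(record)

    matches = {}
    for person in public:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # a unique combination means re-identification
            matches[person["name"]] = candidates[0]
    return matches


# Hypothetical toy data.
medical = [
    {"zip_code": "02138", "birth_date": "1950-01-01", "sex": "M", "diagnosis": "A"},
    {"zip_code": "02138", "birth_date": "1971-05-12", "sex": "F", "diagnosis": "B"},
    {"zip_code": "02139", "birth_date": "1971-05-12", "sex": "F", "diagnosis": "C"},
    {"zip_code": "02139", "birth_date": "1971-05-12", "sex": "F", "diagnosis": "D"},
]
voters = [
    {"name": "Person X", "zip_code": "02138", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Person Y", "zip_code": "02139", "birth_date": "1971-05-12", "sex": "F"},
]

linked = link_records(medical, voters)
print(sorted(linked))
```

In this toy data only "Person X" has a unique combination of quasi-identifiers; "Person Y" shares hers with another record and is therefore not linked.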
21. Statice Webinar | 2020
Fingerprint-like information
● On its own, a fingerprint
seems cryptic
● Around 100 minutiae in a
fingerprint
● Experts declare a fingerprint
match if 12 minutiae match
● Precise identification is
possible if fingerprints are
indexed and queryable
22. Statice Webinar | 2020
Illustration: Netflix movie preferences
Join movie
ratings
Ratings of only 4-5 movies
allowed successful
identification of a large
number of users.
Narayanan A, Shmatikov V. Robust de-anonymization of large sparse
datasets. In: Security and Privacy, 2008. SP 2008. IEEE Symposium on,
2008 May 18 (pp. 111-125). IEEE.
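To make the idea concrete, here is a deliberately simplified sketch of matching a handful of known ratings against a sparse ratings table. It is not the scoring algorithm from the cited paper; the data, the tolerance and the "clear winner" rule are assumptions chosen for illustration.

```python
def best_match(known_ratings, dataset, tolerance=1):
    """Return the user whose ratings agree with the most known ratings,
    or None if there is no unique best candidate.

    known_ratings: {movie: rating} learned from a public source
    dataset:       {user_id: {movie: rating}} from the "anonymous" release
    """
    scores = {}
    for user_id, ratings in dataset.items():
        scores[user_id] = sum(
            1
            for movie, rating in known_ratings.items()
            if movie in ratings and abs(ratings[movie] - rating) <= tolerance
        )
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return None
    if len(ranked) > 1 and ranked[1][1] == ranked[0][1]:
        return None  # tie: the known ratings do not single out one user
    return ranked[0][0]


# Hypothetical sparse ratings table: most users rated only a few movies.
released = {
    "u1": {"m1": 5, "m2": 1, "m3": 4, "m7": 2},
    "u2": {"m1": 5, "m4": 3, "m5": 2},
    "u3": {"m2": 1, "m3": 2, "m6": 5, "m7": 5},
}

print(best_match({"m1": 5, "m2": 1, "m3": 4, "m7": 2}, released))
print(best_match({"m1": 5}, released))
```

With four known ratings one user stands out; with a single popular movie several users tie and no match is returned.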
23. Statice Webinar | 2020
Illustration: Strava Running Tracks
Heatmap of 30 million runners
worldwide
Not that many in the
Sahara
French Military Base in Mali
24. Statice Webinar | 2020
And many more . . .
● Search queries
● Browser configuration
25. So how do we
enable the use of
sensitive customer
data while staying
privacy-compliant?
26. Recital 26 of the GDPR:
“This regulation does not therefore concern the processing of such
anonymous information, including for statistical or research
purposes.”
The best way to securely access and leverage sensitive customer
data is to use anonymous data.
27. The problem is that traditional
anonymization methods are unable
to preserve the granularity and
quality of the original data required
for further processing and analysis.
Either they obfuscate data to a large
extent or they do not properly protect
the data.
Data utility Data privacy
vs.
29. Statice is a data anonymization
engine that enables the secure
anonymization of data while
preserving its statistical utility and
data structure.
This allows you to perform meaningful
data analysis without ever exposing
the original data.
30. Data anonymization made easy.
Guaranteed data privacy
Statice generates
privacy-preserving synthetic
data which is based on
mathematical privacy
guarantees.
Automatic anonymization
and granular data quality
Statice anonymizes your
data preserving statistical
utility and data structure by
generating synthetic data.
Flexible integration
Statice can be conveniently
used on-premise either via a
CLI or as a Python library.
Support for all
structured data
Statice supports the
anonymization of tabular,
relational, time-series,
geolocation and other types
of structured data.
31. How Statice works
Original data -> Statice engine -> Anonymous synthetic data
1. Data analysis
● Automatic understanding
of provided data types
● Automatic data
classification
2. Training
● Generative algorithms
learn the statistical
structure and information
of the original data
3. Data generation
● Generation of anonymous
synthetic data
● Provision of automatic
utility and risk evaluations
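The three stages above (data analysis, training, data generation) can be pictured with a toy pipeline. This is not Statice's algorithm or API: it is a minimal sketch that models each column independently, ignores correlations, and offers no privacy guarantee. All function names and the example rows are hypothetical.

```python
import random
import statistics
from collections import Counter


def infer_types(rows):
    """Stage 1: classify each column as 'numeric' or 'categorical'."""
    types = {}
    for col in rows[0]:
        numeric = all(
            isinstance(r[col], (int, float)) and not isinstance(r[col], bool)
            for r in rows
        )
        types[col] = "numeric" if numeric else "categorical"
    return types


def fit(rows, types):
    """Stage 2: learn a simple per-column summary of the original data."""
    model = {}
    for col, kind in types.items():
        values = [r[col] for r in rows]
        if kind == "numeric":
            model[col] = ("numeric", statistics.mean(values), statistics.pstdev(values))
        else:
            counts = Counter(values)
            model[col] = ("categorical", list(counts.keys()), list(counts.values()))
    return model


def generate(model, n, seed=0):
    """Stage 3: sample new rows from the learned summaries."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for col, spec in model.items():
            if spec[0] == "numeric":
                row[col] = rng.gauss(spec[1], spec[2])
            else:
                row[col] = rng.choices(spec[1], weights=spec[2], k=1)[0]
        rows.append(row)
    return rows


# Hypothetical original data.
original = [
    {"age": 34, "plan": "basic"},
    {"age": 51, "plan": "premium"},
    {"age": 29, "plan": "basic"},
    {"age": 45, "plan": "basic"},
]

types = infer_types(original)
synthetic = generate(fit(original, types), n=5000)
print(types)
print(synthetic[0])
```

A real generative model would also have to capture relationships between columns and bound what can be learned about any individual, which this sketch does not attempt.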
32. Automatic
evaluation metrics
that are part of the
Statice software
show how well the
statistical
properties of the
original data are
preserved in the
newly generated
anonymous
synthetic data.
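The slides do not list which metrics the software computes. As a hedged illustration of what a utility evaluation can look like, the sketch below compares column means and pairwise Pearson correlations between an original and a synthetic table. The function names and example tables are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def utility_report(original, synthetic, columns):
    """Absolute gaps in means and in pairwise correlations (0 = identical)."""
    def column(rows, name):
        return [row[name] for row in rows]

    report = {"mean_gap": {}, "corr_gap": {}}
    for name in columns:
        o, s = column(original, name), column(synthetic, name)
        report["mean_gap"][name] = abs(sum(o) / len(o) - sum(s) / len(s))
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            corr_o = pearson(column(original, a), column(original, b))
            corr_s = pearson(column(synthetic, a), column(synthetic, b))
            report["corr_gap"][(a, b)] = abs(corr_o - corr_s)
    return report


# Hypothetical tables: same marginals, opposite relationship between x and y.
original = [{"x": 1, "y": 2}, {"x": 2, "y": 4}, {"x": 3, "y": 6}, {"x": 4, "y": 8}]
bad_synthetic = [{"x": 1, "y": 8}, {"x": 2, "y": 6}, {"x": 3, "y": 4}, {"x": 4, "y": 2}]

print(utility_report(original, original, ["x", "y"]))
print(utility_report(original, bad_synthetic, ["x", "y"]))
```

The second report shows why marginal statistics alone are not enough: the means match exactly, yet the correlation between the columns is reversed.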
34. Use data protection to
your advantage and
get the most value out
of your data
Build your data sandbox
Use Statice to effectively protect sensitive data in order to
share it easily with partners or across your organization for
quick access and collaborative use.
Train your machine learning algorithms
Leverage synthetic data by Statice to train your machine
learning models with the same accuracy as when using
real-world data.
Protect your customer data for BI analysis
By anonymizing customer data directly, you add a strong
safeguard for protecting your customers and enable quick
and flexible data analysis.
Enable your scalable use of cloud infrastructures
Process synthetic data in cloud instances without ever
putting sensitive data at risk and yet benefit from a scalable
infrastructure and the cost-efficient use of cloud resources
for your company.
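A claim about model accuracy on synthetic data can be checked with a "train on synthetic, test on real" comparison. The sketch below shows the shape of such a check on hypothetical one-feature data, with a toy per-class Gaussian standing in for a synthesizer and a nearest-centroid classifier standing in for a model. None of this reflects Statice's software; every name and parameter is an assumption.

```python
import random
import statistics


def make_real(n, rng):
    """Hypothetical 'real' data: one feature, two classes with different means."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        rows.append((rng.gauss(2.0 if label else -2.0, 1.0), label))
    return rows


def synthesize(rows, n, rng):
    """Toy stand-in for a synthesizer: one Gaussian per class, fitted to rows."""
    by_label = {0: [], 1: []}
    for x, label in rows:
        by_label[label].append(x)
    params = {
        label: (statistics.mean(xs), statistics.pstdev(xs))
        for label, xs in by_label.items()
    }
    share_of_ones = len(by_label[1]) / len(rows)
    out = []
    for _ in range(n):
        label = 1 if rng.random() < share_of_ones else 0
        mu, sigma = params[label]
        out.append((rng.gauss(mu, sigma), label))
    return out


def fit_centroids(rows):
    """Nearest-centroid 'model': the mean feature value of each class."""
    return {
        label: statistics.mean([x for x, lab in rows if lab == label])
        for label in (0, 1)
    }


def accuracy(centroids, rows):
    """Share of rows whose nearest centroid carries the correct label."""
    hits = sum(
        1
        for x, label in rows
        if min(centroids, key=lambda c: abs(x - centroids[c])) == label
    )
    return hits / len(rows)


rng = random.Random(42)
train_real = make_real(2000, rng)
test_real = make_real(1000, rng)           # held-out real data
train_synth = synthesize(train_real, 2000, rng)

acc_real = accuracy(fit_centroids(train_real), test_real)
acc_synth = accuracy(fit_centroids(train_synth), test_real)
print("trained on real:     ", acc_real)
print("trained on synthetic:", acc_synth)
```

The important design choice is that both models are scored on the same held-out real data; how close the two numbers are on a real use case depends on the data and the synthesizer.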
36. Customer case 1:
The Statice engine
enabling a German
insurance provider to
tailor products to its
customers
Challenges
● Impeded timely access to data and availability of granular
information because of legal constraints
● Complicated product development due to sensitive customer
data and privacy regulations
● Biased customer behavior modeling due to lack of access to
complete customer data sets
● A period of weeks or months between customer data acquisition and
data processing
Solutions
● Enabled timely access to data with Statice by generating
synthetic data based on real customer data
● Creation of anonymous data warehouse with much lower
compliance hurdles to allow data science teams to work faster on
more representative data
Long-term benefits
● Unlock sensitive customer data as a prime resource for product
innovation
● Massively reduced time-to-data for both internal and external
stakeholders (weeks/months to days)
● Lowered compliance overhead and enabled innovation
prototyping
37. Customer case 2:
The Statice engine
allowing a German
healthcare enterprise
to safely engage in
collaborative
partnerships
Challenges
● High risk of engaging in collaborative partnerships due
to sensitive customer data exchange processes
● Potential exposure to customer data leakage and its
legal implications
● Reduced ability to devise innovative strategies with third
parties due to data privacy and security concerns
Solutions
● Statice implemented to produce privacy-preserving
synthetic data
● Safe data, with much lower compliance hurdles for
partnerships, created for external sharing
Long-term benefits
● Compliant and collaborative product development &
data monetisation
● Facilitated innovative partnerships through
unconstrained customer data exchange
38. Customer case 3:
The Statice engine
enabling a German
telecommunications
company to use cloud
infrastructure at scale
Challenges
● Hugely valuable data in the business’ data exhaust
which cannot be properly exploited due to privacy
concerns
● Inability to scale a data processing and analysis pipeline
on cloud infrastructure due to sensitive data exposure
● High costs and major delays in innovation projects due
to the incapacity to perform and scale data processing
on the cloud infrastructure
Solutions
● Use customer data in the form of privacy-compliant
synthetic data which contains highly similar statistical
information
● Synthetic data generated with Statice makes it possible to
scale solutions on cloud infrastructure quickly, cost-efficiently
and safely, without concerns around customer data privacy
Long-term benefits
● Accelerated, cost-efficient use of cloud resources and
data for software testing
39. Statice ensures full
data privacy
compliance allowing
your data team to
work more efficiently
Using Statice you can:
Minimize your time-to-data from months to days.
Unlock your sensitive customer data as a prime
resource for product innovation.
Ensure your regulatory compliance for the whole
data value chain.
41. Unlock your data
with Statice.
ben@statice.ai
statice.ai
Ben Nolan
Head of Business Development
42. Statice Webinar | 2020
Are you interested in learning
more about working with us?
43. We follow three steps on the way to a cooperation
~8 weeks
1. Feasibility study
Goal: Understanding scope of data
and use case for the customer
2. Technical planning
Goal: Successful planning of the
infrastructure to be used
3. Project kick-off
Goal: Successful coordination of
joint project plan
Involved parties (each step): & the customer
Results
● Evaluation of shared
data schema
● Implementation plan
● Infrastructure plan
● Joint project plan
● Date for project start