"Future Analytics - Fabrication of Synthetic Data", Dr. Susan Wegner, VP Smart Data Analytics and Communication at Deutsche Telekom
YouTube Link: https://www.youtube.com/watch?v=a4FxA1v2rS4
About the Author:
Dr. Susan Wegner is responsible for the innovation field "Smart Data Analytics & Communications" at Telekom Innovation Laboratories. The focus areas within this field are strategically derived from the key drivers of future telecommunication business: web-based communication, Big Data, and user-driven innovation.
1. Dr. Susan Wegner, Telekom Innovation Laboratories
25 February 2015, BITKOM Big Data Summit, Hanau
Future Analytics - Fabrication of Synthetic Data
DATA NATIVES 2015, Dr. Susan Wegner, Telekom Innovation Laboratories
3. www.laboratories.telekom.com @T_Labs
Depersonalization Approaches

Depersonalization

Standard Anonymization Approaches: adaptation of real data using data manipulation techniques to increase k-anonymity*. Methods include perturbation, regression, classification tree, generalization, suppression, replacement, Markov chain, and further methods.

Synthesization: creation of new data with the same properties using machine learning methods (ongoing research).

*Each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release.
Source: http://whimsley.typepad.com/whimsley/2011/09/data-anonymization-and-re-identification-some-basics-of-data-privacy.html
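The k-anonymity idea above can be made concrete with a small sketch (the table and column names are invented for illustration, not from the talk): generalization coarsens quasi-identifiers until each record is indistinguishable from at least k-1 others.

```python
# Minimal sketch of measuring k-anonymity over quasi-identifier columns.
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier columns."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())

# Raw records: exact ZIP codes and ages make each individual distinguishable.
raw = [
    {"zip": "10115", "age": 34, "diagnosis": "A"},
    {"zip": "10117", "age": 36, "diagnosis": "B"},
    {"zip": "10115", "age": 35, "diagnosis": "A"},
]

# Generalization: truncate the ZIP code, bucket age into decades.
generalized = [
    {"zip": r["zip"][:3] + "**", "age": r["age"] // 10 * 10, "diagnosis": r["diagnosis"]}
    for r in raw
]

print(k_anonymity(raw, ["zip", "age"]))          # each record unique -> k = 1
print(k_anonymity(generalized, ["zip", "age"]))  # all share ("101**", 30) -> k = 3
```

Suppression and perturbation fit the same frame: they trade detail for a larger k.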
4.
Standard Anonymization Approaches
A tradeoff between anonymity and usefulness

It is not possible to create a dataset that is perfectly anonymized and, at the same time, perfectly useful to researchers.

Perfect anonymity: Pro: 100% data privacy. Con: data loss and distortions can compromise conclusions.
Perfect usefulness: Pro: a maximum of data-based insights is possible. Con: disclosure of individuals is possible.
Moving from perfect anonymity towards perfect usefulness means decreasing the intensity of anonymization.
5.
Standard Anonymization Approaches
Combining data sources endangers anonymity

Anonymized Netflix data (2007): 10 million movie ratings from 500,000 customers. Personal details were removed and replaced by random numbers.
Public IMDB* data (2007): users who entered movie rankings using their real names.
Combining the sources: users on IMDB using their real names had similar ranking patterns in the Netflix data. It was therefore possible to find all other preferences of those users in the Netflix data.
Conclusion: you can never fully estimate the anonymity of your data using standard approaches.

*IMDB = Internet Movie Database
Source: Narayanan & Shmatikov (2008), Robust De-anonymization of Large Sparse Datasets
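The linkage attack can be sketched in a few lines (all names, IDs, and ratings here are invented; the real attack used a more robust similarity over sparse, noisy ratings): match a public profile's ratings against every "anonymized" record and pick the closest one.

```python
# Hypothetical sketch of a linkage (re-identification) attack.

def similarity(public, anon):
    """Count of movies rated identically - a crude matching score."""
    return sum(1 for movie, rating in public.items() if anon.get(movie) == rating)

# "Anonymized" dataset: customer names replaced by random numbers.
netflix_anon = {
    817263: {"Movie A": 5, "Movie B": 1, "Movie C": 4},
    554901: {"Movie A": 2, "Movie B": 5, "Movie C": 2},
}

# Public profile under a real name, with overlapping ratings.
imdb_public = {"alice_real_name": {"Movie A": 5, "Movie C": 4}}

for name, ratings in imdb_public.items():
    best = max(netflix_anon, key=lambda uid: similarity(ratings, netflix_anon[uid]))
    print(name, "->", best)  # links the real name to record 817263
```

Once the record is linked, every remaining rating in it - including ones never made public - is attributed to the real name.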
6. Depersonalization Approaches (repeats the overview from slide 3)
8.
Advantages & Disadvantages of the Techniques

Anonymization: rendering data anonymous means modifying personal data so that the information concerning personal or material circumstances can no longer be attributed to an identified or identifiable individual.

Synthetization: actual data is used to develop patterns in which the characteristics of this actual data are largely retained. These patterns are then used to generate new data, which no longer has any reference to an individual in the actual data. Synthetic data makes it possible, for the first time, to use data that was previously unavailable.
9.
Standard Anonymization Approaches vs. Synthetic Data
Synthetic data is not always superior

Advantages of standard anonymization: creation is fast and easy; suitable for real-time provision; the individual is retained.
Main advantages of synthetic data: unrestricted data transfer; unrestricted data storage; 100% protection of individuals; no data loss or distortion.

Conclusion: no approach is completely superior to the other. Synthetic data beats standard anonymization if the latter leads to data loss, or if it doesn't allow unrestricted storage and transfer of data due to data privacy issues or volume restrictions.
10.
First Results
Comparison of Distributions

[Chart: distribution of the variable 'activity' in the source and synthetized data set - amount of cases for each of the five activities (AIF/MOC, AIF/MTC, AIF/Update Location, IuCS MOC, IuCS MTC), on a scale up to 200,000. Deviations are statistically not significant.]

[Charts: distribution of the variable 'Place' (longitude*latitude coordinate pairs) in the source and in the synthetized data set.]
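A claim like "deviations are statistically not significant" is typically backed by a goodness-of-fit test. A sketch with invented counts (not the talk's data), using a hand-rolled Pearson chi-square statistic:

```python
# Comparing a categorical variable's counts in the source and the
# synthesized data set with a Pearson chi-square statistic.

def chi_square(observed, expected):
    """Pearson chi-square statistic between two count vectors."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Invented counts per activity in the source and the synthesized data.
source    = [180000, 150000, 90000, 60000, 30000]
synthetic = [179800, 150300, 89800, 60100, 29900]

stat = chi_square(synthetic, source)
print(round(stat, 2))

# With 4 degrees of freedom, the 5% critical value is about 9.49;
# a statistic below it means the deviations are not significant.
assert stat < 9.49
```

In practice one would use a library routine such as scipy.stats.chisquare, but the statistic itself is this simple.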
11.
Why Synthetic Data?
The solution / USP:
Synthetic data have nearly the same quality as the original.
They cannot be traced back to their origin.
100% compliant with data privacy.
Patents pending (disruptive technology).
The data can be stored in any way and transferred to others.
This makes new services, including individualized services, possible.
14.
Data Modelling

Real data is transformed into new data in five steps:
1. Collection of several events (all possible events)
2. Clustering
3. Formation of regional patterns (regional distribution)
4. Probability model (local distribution)
5. Fabrication of synthetic data
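The five steps above can be sketched end to end (event names, regions, and counts are all invented; the talk's actual clustering and models are not specified): collect events, group them regionally, estimate a per-region probability model, then sample new events from the model.

```python
# Hypothetical sketch of the collect -> cluster -> model -> sample pipeline.
import random
from collections import Counter, defaultdict

random.seed(0)

# 1. Collection of several events: (region, activity) observations.
events = [("north", "call")] * 60 + [("north", "sms")] * 40 + \
         [("south", "call")] * 30 + [("south", "sms")] * 70

# 2./3. Clustering and regional patterns: here simply grouped by region.
by_region = defaultdict(list)
for region, activity in events:
    by_region[region].append(activity)

# 4. Probability model: empirical activity distribution per region.
model = {
    region: {a: c / len(acts) for a, c in Counter(acts).items()}
    for region, acts in by_region.items()
}

# 5. Fabrication of synthetic data: sample new events from the model.
def synthesize(n):
    regions = list(model)
    out = []
    for _ in range(n):
        region = random.choice(regions)
        acts, probs = zip(*model[region].items())
        out.append((region, random.choices(acts, weights=probs)[0]))
    return out

synthetic = synthesize(1000)
print(synthetic[:3])  # new events with no link to any original record
```

The sampled events reproduce the regional and activity distributions of the source, yet none of them corresponds to a real individual's record.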