"Future Analytics - Fabrication of Synthetic Data", Dr. Susan Wegner, VP Smart Data Analytics and Communication at Deutsche Telekom
YouTube Link: https://www.youtube.com/watch?v=a4FxA1v2rS4
About the Author:
Dr. Susan Wegner is responsible for the innovation field "Smart Data Analytics & Communications" at Telekom Innovation Laboratories. The focus areas within this field are strategically derived from the key drivers of future telecommunication business: web-based communication, Big Data, and user-driven innovation.
1. Dr. Susan Wegner, Telekom Innovation Laboratories
25 February 2015, BITKOM Big Data Summit, Hanau
Future Analytics - Fabrication of Synthetic Data
DATA NATIVES 2015, Dr. Susan Wegner, Telekom Innovation Laboratories
3. www.laboratories.telekom.com @T_Labs
Depersonalization Approaches

Depersonalization

Standard Anonymization Approaches: adaptation of real data using data manipulation techniques to increase k-anonymity*. Methods include perturbation, regression, classification tree, generalization, suppression, replacement, Markov chain, and further methods.

Synthesization: creation of new data with the same properties using machine learning methods (ongoing research).

*Each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release.
Source: http://whimsley.typepad.com/whimsley/2011/09/data-anonymization-and-re-identification-some-basics-of-data-privacy.html
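The k-anonymity idea above can be made concrete with a small sketch (the table and column names are invented for illustration, not from the talk): generalization coarsens quasi-identifiers until each record is indistinguishable from at least k-1 others.

```python
# Minimal sketch of measuring k-anonymity over quasi-identifier columns.
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier columns."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())

# Raw records: exact ZIP codes and ages make each individual distinguishable.
raw = [
    {"zip": "10115", "age": 34, "diagnosis": "A"},
    {"zip": "10117", "age": 36, "diagnosis": "B"},
    {"zip": "10115", "age": 35, "diagnosis": "A"},
]

# Generalization: truncate the ZIP code, bucket age into decades.
generalized = [
    {"zip": r["zip"][:3] + "**", "age": r["age"] // 10 * 10, "diagnosis": r["diagnosis"]}
    for r in raw
]

print(k_anonymity(raw, ["zip", "age"]))          # each record unique -> k = 1
print(k_anonymity(generalized, ["zip", "age"]))  # all share ("101**", 30) -> k = 3
```

Suppression and perturbation fit the same frame: they trade detail for a larger k.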
4.
Standard Anonymization Approaches
A tradeoff between anonymity and usefulness

It is not possible to create a dataset that is perfectly anonymized and, at the same time, perfectly useful to researchers.

Perfect anonymity: Pro: 100% data privacy. Con: data loss and distortions can compromise conclusions.
Perfect usefulness: Pro: a maximum of data-based insights is possible. Con: disclosure of individuals is possible.
Moving from perfect anonymity towards perfect usefulness means decreasing the intensity of anonymization.
5.
Standard Anonymization Approaches
Combining data sources endangers anonymity

Anonymized Netflix data (2007): 10 million movie ratings from 500,000 customers. Personal details were removed and replaced by random numbers.
Public IMDB* data (2007): users who entered movie rankings using their real names.
Combining the sources: users on IMDB using their real names had similar ranking patterns in the Netflix data. It was therefore possible to find all other preferences of those users in the Netflix data.
Conclusion: you can never fully estimate the anonymity of your data using standard approaches.

*IMDB = Internet Movie Database
Source: Narayanan & Shmatikov (2008), Robust De-anonymization of Large Sparse Datasets
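The linkage attack can be sketched in a few lines (all names, IDs, and ratings here are invented; the real attack used a more robust similarity over sparse, noisy ratings): match a public profile's ratings against every "anonymized" record and pick the closest one.

```python
# Hypothetical sketch of a linkage (re-identification) attack.

def similarity(public, anon):
    """Count of movies rated identically - a crude matching score."""
    return sum(1 for movie, rating in public.items() if anon.get(movie) == rating)

# "Anonymized" dataset: customer names replaced by random numbers.
netflix_anon = {
    817263: {"Movie A": 5, "Movie B": 1, "Movie C": 4},
    554901: {"Movie A": 2, "Movie B": 5, "Movie C": 2},
}

# Public profile under a real name, with overlapping ratings.
imdb_public = {"alice_real_name": {"Movie A": 5, "Movie C": 4}}

for name, ratings in imdb_public.items():
    best = max(netflix_anon, key=lambda uid: similarity(ratings, netflix_anon[uid]))
    print(name, "->", best)  # links the real name to record 817263
```

Once the record is linked, every remaining rating in it - including ones never made public - is attributed to the real name.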
6. Depersonalization Approaches (repeats the overview from slide 3)
8.
Advantages & Disadvantages of the Techniques

Anonymization: rendering data anonymous means modifying personal data so that the information concerning personal or material circumstances can no longer be attributed to an identified or identifiable individual.

Synthetization: actual data is used to develop patterns in which the characteristics of this actual data are largely retained. These patterns are then used to generate new data, which no longer has any reference to an individual in the actual data. Synthetic data makes it possible, for the first time, to use data that was previously unavailable.
9.
Standard Anonymization Approaches vs. Synthetic Data
Synthetic data is not always superior

Advantages of standard anonymization: creation is fast and easy; suitable for real-time provision; the individual is retained.
Main advantages of synthetic data: unrestricted data transfer; unrestricted data storage; 100% protection of individuals; no data loss or distortion.

Conclusion: no approach is completely superior to the other. Synthetic data beats standard anonymization if the latter leads to data loss, or if it doesn't allow unrestricted storage and transfer of data due to data privacy issues or volume restrictions.
10.
First Results
Comparison of Distributions

[Chart: distribution of the variable 'activity' in the source and synthetized data set - amount of cases for each of the five activities (AIF/MOC, AIF/MTC, AIF/Update Location, IuCS MOC, IuCS MTC), on a scale up to 200,000. Deviations are statistically not significant.]

[Charts: distribution of the variable 'Place' (longitude*latitude coordinate pairs) in the source and in the synthetized data set.]
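A claim like "deviations are statistically not significant" is typically backed by a goodness-of-fit test. A sketch with invented counts (not the talk's data), using a hand-rolled Pearson chi-square statistic:

```python
# Comparing a categorical variable's counts in the source and the
# synthesized data set with a Pearson chi-square statistic.

def chi_square(observed, expected):
    """Pearson chi-square statistic between two count vectors."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Invented counts per activity in the source and the synthesized data.
source    = [180000, 150000, 90000, 60000, 30000]
synthetic = [179800, 150300, 89800, 60100, 29900]

stat = chi_square(synthetic, source)
print(round(stat, 2))

# With 4 degrees of freedom, the 5% critical value is about 9.49;
# a statistic below it means the deviations are not significant.
assert stat < 9.49
```

In practice one would use a library routine such as scipy.stats.chisquare, but the statistic itself is this simple.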
11.
Why Synthetic Data?
The solution / USP:
Synthetic data have nearly the same quality as the original.
They cannot be traced back to their origin.
100% compliant with data privacy.
Patents pending (disruptive technology).
The data can be stored in any way and transferred to others.
This makes new services, including individualized services, possible.
14.
Data Modelling

Real data is transformed into new data in five steps:
1. Collection of several events (all possible events)
2. Clustering
3. Formation of regional patterns (regional distribution)
4. Probability model (local distribution)
5. Fabrication of synthetic data
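The five steps above can be sketched end to end (event names, regions, and counts are all invented; the talk's actual clustering and models are not specified): collect events, group them regionally, estimate a per-region probability model, then sample new events from the model.

```python
# Hypothetical sketch of the collect -> cluster -> model -> sample pipeline.
import random
from collections import Counter, defaultdict

random.seed(0)

# 1. Collection of several events: (region, activity) observations.
events = [("north", "call")] * 60 + [("north", "sms")] * 40 + \
         [("south", "call")] * 30 + [("south", "sms")] * 70

# 2./3. Clustering and regional patterns: here simply grouped by region.
by_region = defaultdict(list)
for region, activity in events:
    by_region[region].append(activity)

# 4. Probability model: empirical activity distribution per region.
model = {
    region: {a: c / len(acts) for a, c in Counter(acts).items()}
    for region, acts in by_region.items()
}

# 5. Fabrication of synthetic data: sample new events from the model.
def synthesize(n):
    regions = list(model)
    out = []
    for _ in range(n):
        region = random.choice(regions)
        acts, probs = zip(*model[region].items())
        out.append((region, random.choices(acts, weights=probs)[0]))
    return out

synthetic = synthesize(1000)
print(synthetic[:3])  # new events with no link to any original record
```

The sampled events reproduce the regional and activity distributions of the source, yet none of them corresponds to a real individual's record.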