An experimental comparison of globally-optimal data de-identification algorithms

Technische Universität München
Fabian Prasser, Florian Kohlmayer, Klaus A. Kuhn
Chair for Biomedical Informatics
Institute for Medical Statistics and Epidemiology
Klinikum rechts der Isar der TU München

Optimal de-identification algorithms
• Generalization hierarchies
• Pruning: predictive tagging
• Optimization: roll-up
• Privacy models, e.g.: k-anonymity, l-diversity, t-closeness, δ-presence
F. Prasser, F. Kohlmayer et al.: A Benchmark of Globally-Optimal Anonymization
Methods for Biomedical Data, CBMS 2014
12/19/16 2
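• Generalization lattice
Example: generalizing a dataset so that it becomes 2-anonymous (k = 2)
Original data:
Age    Gender  Zipcode
34     male    81667
45     female  81667
66     male    81925
70     female  81925
70     male    81925
Generalized data (k = 2):
Age    Gender  Zipcode
20-60  *       81667
20-60  *       81667
≥ 61   *       81925
≥ 61   *       81925
≥ 61   *       81925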
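
The example can be reproduced with a few lines of Python. This is a minimal sketch, not ARX code: the function names `generalize` and `is_k_anonymous` are hypothetical, and the age intervals are simply the two shown on the slide (ages below 20 are not covered there).

```python
from collections import Counter

# Example records from the slide: (age, gender, zipcode).
records = [
    (34, "male", "81667"),
    (45, "female", "81667"),
    (66, "male", "81925"),
    (70, "female", "81925"),
    (70, "male", "81925"),
]


def generalize(record):
    """Apply the transformation shown on the slide to a single record."""
    age, _gender, zipcode = record
    if 20 <= age <= 60:
        age_group = "20-60"
    elif age >= 61:
        age_group = "≥ 61"
    else:
        # Assumption: the slide defines no interval for ages below 20.
        raise ValueError("age below 20 is not covered by the example")
    # Gender is suppressed ("*"), the zipcode is kept unchanged.
    return (age_group, "*", zipcode)


def is_k_anonymous(rows, k):
    """True if every combination of attribute values occurs at least k times."""
    class_sizes = Counter(rows)
    return all(size >= k for size in class_sizes.values())


generalized = [generalize(r) for r in records]
print(is_k_anonymous(records, 2))      # original data
print(is_k_anonymous(generalized, 2))  # generalized data
```

In the original data every record is unique, so the check fails for k = 2. After generalization the records fall into one group of two and one group of three, so the check succeeds.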

Algorithms – Incognito
• LeFevre et al.
– SIGMOD 2005
• Dynamic programming
– Breadth-first search on lattices for powerset of quasi-identifiers

Algorithms – OLA & Flash
• OLA: El Emam et al.
– JAMIA 2009
• Divide & conquer
– Optimal Lattice Anonymization
– Binary search on sublattices
• Flash: Kohlmayer & Prasser et al.
– PASSAT 2012
• Greedy search
– Binary depth-first search
– Total order & priority queue

Algorithms – BFS, DFS & Questions
• Generic search methods
– Breadth-first search (BFS)
– Depth-first search (DFS)
→ Extended to use predictive tagging
• Research questions
– How do the algorithms compare in terms of performance?
– Are there further differences between them?
– Are the algorithms' properties influenced by the privacy models used?
– How do problem-specific methods compare to generic search algorithms?
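
A generic breadth-first search with predictive tagging can be sketched as follows. This is an illustration under stated assumptions, not the implementation of Incognito, OLA, Flash or ARX: the hierarchies in `HIERARCHIES` are hypothetical, only k-anonymity is checked, and predictive tagging is taken to mean that every generalization of an anonymous transformation is tagged as anonymous without being checked.

```python
from collections import Counter, deque
from itertools import product

# Example records: (age, gender, zipcode).
records = [
    (34, "male", "81667"),
    (45, "female", "81667"),
    (66, "male", "81925"),
    (70, "female", "81925"),
    (70, "male", "81925"),
]

# Hypothetical generalization hierarchies: one function per level and attribute.
HIERARCHIES = [
    [lambda a: a, lambda a: "20-60" if a <= 60 else "≥ 61", lambda a: "*"],  # age
    [lambda g: g, lambda g: "*"],                                            # gender
    [lambda z: z, lambda z: z[:3] + "**", lambda z: "*"],                    # zipcode
]
MAX_LEVELS = tuple(len(h) - 1 for h in HIERARCHIES)


def is_k_anonymous(node, k):
    """Apply the generalization levels in `node` and check k-anonymity."""
    rows = [
        tuple(HIERARCHIES[i][level](rec[i]) for i, level in enumerate(node))
        for rec in records
    ]
    return min(Counter(rows).values()) >= k


def successors(node):
    """Direct generalizations: raise exactly one attribute by one level."""
    for i, level in enumerate(node):
        if level < MAX_LEVELS[i]:
            yield node[:i] + (level + 1,) + node[i + 1:]


def tag_generalizations(node, tagged):
    """Predictive tagging: mark the node and all its generalizations."""
    ranges = [range(level, top + 1) for level, top in zip(node, MAX_LEVELS)]
    tagged.update(product(*ranges))


def bfs_with_predictive_tagging(k):
    """Classify all lattice nodes; return (anonymous nodes, number of checks)."""
    anonymous, checks = set(), 0
    bottom = (0,) * len(MAX_LEVELS)
    queue, seen = deque([bottom]), {bottom}
    while queue:
        node = queue.popleft()
        if node not in anonymous:  # tagged nodes need no check
            checks += 1
            if is_k_anonymous(node, k):
                tag_generalizations(node, anonymous)
        for succ in successors(node):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return anonymous, checks


anonymous, checks = bfs_with_predictive_tagging(k=2)
lattice_size = len(list(product(*[range(top + 1) for top in MAX_LEVELS])))
print(f"{checks} checks for {lattice_size} transformations")
```

The saving comes only from nodes above an already checked anonymous node. A bottom-up breadth-first search still has to check every non-anonymous transformation, which is one way to read the weak pruning power reported for BFS.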

Benchmark – Method
• Use all reasonable combinations of common privacy models with typical parameters
– (k)-anonymity, (l)-diversity, (t)-closeness, (δ)-presence
• Properties of the search space are influenced by combining privacy models:
– (k), (l), (t), (δ)
– (k, l), (k, t), (k, δ), (l, δ), (t, δ)
– (k, l, δ), (k, t, δ)
• Report three basic performance measures
– Pruning power: number of anonymity checks
– Optimizability: number of roll-ups
– Execution times in a highly efficient runtime environment (ARX)
• Five well-known benchmark datasets
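
The eleven configurations listed above can be enumerated programmatically. The sketch below assumes one exclusion rule that is inferred from the list, not stated on the slide: l-diversity and t-closeness are never combined with each other. The name `reasonable_combinations` is hypothetical.

```python
from itertools import combinations

MODELS = ["k", "l", "t", "δ"]


def reasonable_combinations(models=MODELS):
    """All non-empty subsets of the models that do not contain both l and t."""
    result = []
    for size in range(1, len(models) + 1):
        for combo in combinations(models, size):
            # Assumed rule (inferred from the slide's list): skip l together with t.
            if "l" in combo and "t" in combo:
                continue
            result.append(combo)
    return result


combos = reasonable_combinations()
for combo in combos:
    print("(" + ", ".join(combo) + ")")
```

Under this assumption the function yields exactly the eleven combinations on the slide, in the same order.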

Results – Averaged over datasets
[Figure: number of checks (lower is better), number of roll-ups (higher is better) and execution time in seconds (lower is better) for each algorithm, per combination of privacy models]
• Allows analyzing variations in results for different sets of privacy models
• Repeating patterns
→ Consistent results for different configurations
→ Differences between algorithms not influenced by privacy models used

• Breadth-first search is a worst-case strategy
→ No pruning power, no optimizability
→ Incognito suffers from similar performance problems

• Depth-first search is pretty efficient
→ Can outperform domain-specific methods (OLA)
→ Because of its optimizability (best method in terms of #roll-ups)

• Number of checks: OLA < Flash < DFS < Incognito < BFS
• Number of roll-ups: DFS > Flash > Incognito > OLA > BFS
• Execution times: Flash < OLA < DFS < Incognito < BFS

Results – Averaged over privacy models
[Figure: number of checks (lower is better), number of roll-ups (higher is better) and execution time in seconds (lower is better) for each algorithm, per dataset]
• Shows variations in results for different datasets
• Algorithms exhibit similar properties
• Flash provides the best overall performance
• Differences are mostly independent of datasets
• But:
– OLA provides performance comparable to Flash for smaller datasets
– DFS provides performance comparable to Flash for larger datasets

Lessons learned
• In general, domain-specific algorithms outperform generic methods
→ Up to several orders of magnitude (BFS)
→ OLA and Flash only check between 0.2% and 1.1% of all transformations in the solution space
→ Not necessarily true for large datasets (DFS)
• Flash effectively balances optimizability with pruning power
→ Should be used if optimized runtime environments are available
• OLA provides best pruning power
→ Should be used in general-purpose environments
• DFS outperforms OLA for large datasets
→ In these cases, optimizability is more important than pruning power
→ Optimized runtime environments required

Thank you for your attention!
• ARX is free software
– Download – Use – Contribute
– Repository: https://github.com/arx-deidentifier/arx
• Further information
– Website: http://arx.deidentifier.org
– Contact
● Fabian Prasser (prasser@in.tum.de)
● Florian Kohlmayer (florian.kohlmayer@tum.de)
