Collaboration and data sharing have become core elements of biomedical research. At the same time, there is a growing understanding of privacy threats related to data sharing, especially when sensitive data from distributed sources become available for linkage. Statistical disclosure control comprises well-known data anonymization techniques that protect data by introducing fuzziness. To protect datasets from different types of threats, different privacy criteria are commonly implemented. Data anonymization is an important measure, but it is computationally complex, and it can significantly reduce the expressiveness of data. To attenuate these problems, a number of algorithms have been proposed that aim to increase data quality or improve efficiency. Previous evaluations of such algorithms lack a systematic approach, as they focus on specific algorithms, specific privacy criteria, and specific runtime environments. Therefore, it is difficult for decision makers to decide which algorithm is best suited for their requirements. As a first step towards a comprehensive and systematic evaluation of anonymization algorithms, we report on our ongoing efforts to provide an open source benchmark. In this contribution, we focus on optimal algorithms utilizing global recoding with full-domain generalization. We present a systematic evaluation of domain-specific algorithms and generic search methods for a broad set of privacy criteria, including k-anonymity, l-diversity, t-closeness and δ-presence, and their application to multiple real-world datasets. Our results show that there is no single solution fitting all needs, and that generic search methods can outperform highly specialized algorithms.
An experimental comparison of globally-optimal data de-identification algorithms
1. Technische Universität München
Fabian Prasser, Florian Kohlmayer, Klaus A. Kuhn
Chair for Biomedical Informatics
Institute for Medical Statistics and Epidemiology
Klinikum rechts der Isar der TU München
2. Technische Universität München
Optimal de-identification algorithms
• Generalization hierarchies
• Pruning: predictive tagging
• Optimization: roll-up
• Privacy models, e.g.: k-anonymity, l-diversity, t-closeness, δ-presence
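The roll-up optimization named above can be illustrated with a minimal sketch. It assumes that "roll-up" means building the equivalence classes of a more general transformation by merging the already computed classes of a more specific one, instead of regrouping all records. The intermediate classes below (age generalized, gender kept) are derived from the five-record example in this deck; the function names `roll_up` and `suppress_gender` are hypothetical.

```python
from collections import Counter

# Equivalence classes (quasi-identifier tuple -> number of records) of a more
# specific transformation: age generalized to intervals, gender and zipcode kept.
# Derived by hand from the five-record example table (assumption for illustration).
specific_classes = Counter({
    ("20-60", "male", "81667"): 1,
    ("20-60", "female", "81667"): 1,
    (">=61", "male", "81925"): 2,
    (">=61", "female", "81925"): 1,
})


def suppress_gender(key):
    """Generalize one class key further by suppressing the gender attribute."""
    age, _gender, zipcode = key
    return (age, "*", zipcode)


def roll_up(classes, generalize_key):
    """Merge existing equivalence classes instead of regrouping raw records."""
    merged = Counter()
    for key, size in classes.items():
        merged[generalize_key(key)] += size
    return merged


rolled = roll_up(specific_classes, suppress_gender)
for key, size in sorted(rolled.items()):
    print(key, size)
```

The merge touches four classes rather than five records; on larger datasets with few classes the saving is correspondingly larger, which is presumably why the number of roll-ups is reported as a measure of optimizability.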
F. Prasser, F. Kohlmayer et al.: A Benchmark of Globally-Optimal Anonymization
Methods for Biomedical Data, CBMS 2014
12/19/16 2
• Generalization lattice
Example (k = 2):
Original data
Age    Gender  Zipcode
34     male    81667
45     female  81667
66     male    81925
70     female  81925
70     male    81925
Generalized data
Age    Gender  Zipcode
20-60  *       81667
20-60  *       81667
≥ 61   *       81925
≥ 61   *       81925
≥ 61   *       81925
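As a minimal sketch of what the k = 2 example expresses, the check below groups records by their quasi-identifier values and tests whether every group contains at least k records. The function name `is_k_anonymous` is an assumption for illustration and is not taken from ARX.

```python
from collections import Counter

# The two tables from the example; all three columns are treated as quasi-identifiers.
original = [
    ("34", "male", "81667"),
    ("45", "female", "81667"),
    ("66", "male", "81925"),
    ("70", "female", "81925"),
    ("70", "male", "81925"),
]
generalized = [
    ("20-60", "*", "81667"),
    ("20-60", "*", "81667"),
    (">=61", "*", "81925"),
    (">=61", "*", "81925"),
    (">=61", "*", "81925"),
]


def is_k_anonymous(rows, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    class_sizes = Counter(rows)
    return all(size >= k for size in class_sizes.values())


print(is_k_anonymous(original, 2))     # every original record is unique
print(is_k_anonymous(generalized, 2))  # classes of size 2 and 3
```

The generalized table forms two equivalence classes of sizes 2 and 3, so it satisfies 2-anonymity but would not satisfy 3-anonymity.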
3. Technische Universität München
Algorithms – Incognito
• LeFevre et al.
– SIGMOD 2005
• Dynamic programming
– Breadth-first search on lattices for powerset of quasi-identifiers
4. Technische Universität München
Algorithms – OLA & Flash
• El Emam et al.
– JAMIA 2009
• Divide & conquer
– Optimal Lattice Anonymization
– Binary search on sublattices
• Kohlmayer & Prasser et al.
– PASSAT 2012
• Greedy search
– Binary depth-first search
– Total order & priority queue
5. Technische Universität München
Algorithms – BFS, DFS & Questions
• Generic search methods
– Breadth-first search (BFS)
– Depth-first search (DFS)
→ Extended to use predictive tagging
• Research questions
– How do the algorithms compare in terms of performance?
– Are there further differences between them?
– Are the algorithms' properties influenced by the privacy models
used?
– How do problem-specific methods compare to generic search
algorithms?
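The generic breadth-first search with predictive tagging listed above can be sketched as follows. This is an illustrative toy, not the benchmark implementation: the hierarchies (`generalize_age` and friends), the level bounds and the function names are assumptions, and the data is the five-record example from this deck. Predictive tagging here relies on the monotonicity of k-anonymity under full-domain generalization: once a transformation is anonymous, every more general transformation is anonymous as well and needs no check.

```python
from collections import Counter
from itertools import product

ROWS = [
    (34, "male", "81667"),
    (45, "female", "81667"),
    (66, "male", "81925"),
    (70, "female", "81925"),
    (70, "male", "81925"),
]


# Hypothetical generalization hierarchies; level 0 keeps the original value.
def generalize_age(age, level):
    if level == 0:
        return str(age)
    if level == 1:
        return "20-60" if age <= 60 else ">=61"
    return "*"


def generalize_gender(gender, level):
    return gender if level == 0 else "*"


def generalize_zip(zipcode, level):
    return zipcode if level == 0 else "*"


HIERARCHIES = (generalize_age, generalize_gender, generalize_zip)
MAX_LEVELS = (2, 1, 1)  # highest generalization level per attribute


def is_k_anonymous(node, k):
    """Apply the transformation `node` (one level per attribute) and check k-anonymity."""
    classes = Counter(
        tuple(h(value, level) for h, value, level in zip(HIERARCHIES, row, node))
        for row in ROWS
    )
    return min(classes.values()) >= k


def bfs_with_tagging(k):
    """Level-wise bottom-up traversal of the lattice with predictive tagging."""
    nodes = sorted(product(*(range(m + 1) for m in MAX_LEVELS)), key=sum)
    anonymous = set()
    checks = 0
    for node in nodes:
        # Predictive tagging: a generalization of an anonymous node is anonymous.
        if any(all(a >= b for a, b in zip(node, t)) for t in anonymous):
            anonymous.add(node)
            continue
        checks += 1
        if is_k_anonymous(node, k):
            anonymous.add(node)
    return anonymous, checks


anonymous, checks = bfs_with_tagging(2)
print(sorted(anonymous), checks)
```

On this toy lattice of 12 transformations, tagging saves only a few checks because a bottom-up breadth-first traversal visits most non-anonymous nodes before it reaches the first anonymous one.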
6. Technische Universität München
Benchmark – Method
• Use all reasonable combinations of common privacy models with
typical parameters
– (k)-anonymity, (l)-diversity, (t)-closeness, (δ)-presence
• Properties of the search space are influenced by combining privacy
models:
– (k), (l), (t), (δ)
– (k, l), (k, t), (k, δ), (l, δ), (t, δ)
– (k, l, δ), (k, t, δ)
• Report three basic performance measures
– Pruning power: number of anonymity checks
– Optimizability: number of roll-ups
– Execution times in a highly efficient runtime environment (ARX)
• Five well-known benchmark datasets
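The eleven configurations listed above coincide with all non-empty subsets of the four privacy models that do not contain both l-diversity and t-closeness. That rule is an observation about the list, not something the slides state, and the names `MODELS` and `configurations` are hypothetical. A minimal sketch that enumerates them:

```python
from itertools import combinations

MODELS = ("k", "l", "t", "delta")


def configurations():
    """All non-empty model combinations, skipping those that mix l and t (assumed rule)."""
    configs = []
    for size in range(1, len(MODELS) + 1):
        for combo in combinations(MODELS, size):
            if "l" in combo and "t" in combo:
                continue
            configs.append(combo)
    return configs


for config in configurations():
    print("(" + ", ".join(config) + ")")
```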
7. Technische Universität München
Results – Averaged over datasets
[Figure: #roll-ups (higher is better), #checks (lower is better), execution time [s] (lower is better)]
• Allows analyzing variations in results for different sets of privacy models
8. Technische Universität München
Results – Averaged over datasets
• Repeating patterns
→ Consistent results for different configurations
→ Differences between algorithms not influenced by privacy models used
9. Technische Universität München
Results – Averaged over datasets
• Breadth-first search is a worst-case strategy
→ No pruning power, no optimizability
→ Incognito suffers from similar performance problems
10. Technische Universität München
Results – Averaged over datasets
• Depth-first search is pretty efficient
→ Can outperform domain-specific methods (OLA)
→ Because of its optimizability (best method in terms of #roll-ups)
11. Technische Universität München
Results – Averaged over datasets
• Number of checks: OLA < Flash < DFS < Incognito < BFS
• Number of roll-ups: DFS > Flash > Incognito > OLA > BFS
• Execution times: Flash < OLA < DFS < Incognito < BFS
12. Technische Universität München
Results – Averaged over privacy models
[Figure: #checks (lower is better), #roll-ups (higher is better), execution time [s] (lower is better)]
• Shows variations in results for different datasets
• Algorithms exhibit similar properties
• Flash provides the best overall performance
• Differences are mostly independent of datasets
• But:
– OLA provides performance comparable to Flash for smaller datasets
– DFS provides performance comparable to Flash for larger datasets
13. Technische Universität München
Lessons learned
• In general, domain-specific algorithms outperform generic methods
→ Up to several orders of magnitude (BFS)
→ OLA and Flash only check between 0.2% and 1.1% of all
transformations in the solution space
→ Not necessarily true for large datasets (DFS)
• Flash effectively balances optimizability with pruning power
→ Should be used if optimized runtime environments are available
• OLA provides best pruning power
→ Should be used in general-purpose environments
• DFS outperforms OLA for large datasets
→ In these cases, optimizability is more important than pruning power
→ Optimized runtime environments required
14. Technische Universität München
Thank you for your attention!
• ARX is free software
– Download – Use – Contribute
– Repository: https://github.com/arx-deidentifier/arx
• Further information
– Website: http://arx.deidentifier.org
– Contact
● Fabian Prasser (prasser@in.tum.de)
● Florian Kohlmayer (florian.kohlmayer@tum.de)