Engineering data privacy - The ARX data anonymization tool

•

2 gefällt mir•3,337 views

Website with further information: http://arx.deidentifier.org Description of this talk: While a plethora of methods have been proposed for dealing with many aspects of de-identifying clinical data, only few (prototypical) implementations are available. Actually, the complexity of implementing privacy technologies is an often overlooked challenge. In this talk we will present the open source data de-identification tool ARX, which has been carefully engineered to support multiple privacy technologies for relational datasets. Our tool bridges the gap between different scientific disciplines by integrating methods developed and used by the statistics community with data anonymization techniques developed by computer scientists. ARX has been designed from the ground up to ensure scalability and it is able to process very large datasets on commodity hardware. The software implements a large set of privacy models: (1) syntactic privacy models, such as k-anonymity, l-diversity, t-closeness and δ-presence, (2) statistical models for re-identification risks, and (3) differential privacy. In the talk, we will focus on measures to reduce the uniqueness of records. ARX also supports more than ten different methods for evaluating data utility, including loss, precision, non-uniform entropy and KL divergence. In ARX, de-identification of data can be performed automatically, semi-automatically and manually using a complex method that integrates global recoding, local recoding, categorization, generalization, suppression, microaggregation and top/bottom-coding. All methods are accessible via a comprehensive cross-platform graphical user interface.

Software

Technische Universität München
Fabian Prasser, Florian Kohlmayer, Klaus A. Kuhn
Chair for Biomedical Informatics
Institute for Medical Statistics and Epidemiologie
University of Technology Munich (TUM)
Engineering data privacy -
The ARX data anonymization tool

Technische Universität München
What is ARX?
●
= +
●
A tool for analyzing and reducing the uniqueness of records
in a (relational) dataset
●
Variety of methods
●
Highly scalable
●
Up to 50 dimensions (i.e. attributes)
●
Millions of records
●
(Semi-)automatically and/or manually
●
Comprehensive graphical user interface
ARX | Dagtuhl Genomic Privacy Workshop 201522.10.15 2
Images: https://commons.wikimedia.org/ users: Ysangkok, Scarce2
statistics computer science
Methods from

Technische Universität München
Example
ARX | Dagtuhl Genomic Privacy Workshop 201522.10.15 3
Generalization
Suppression
Microaggregation
Reduce uniqueness

Technische Universität München
Overview of methods implemented by ARX
Sample-based methods
• Fraction of sample uniques
• Average sample uniqueness
• k-anonymity
Population-based methods
• Model by Zayatz [1]
• Model by Hoshino [2]
• Model by Chen et al. [3] / Rinott [4]
• Model by Dankar et al. [5]
ARX | Dagtuhl Genomic Privacy Workshop 201522.10.15 4
[1] Zayatz, L.V.: Estimation of the percent of unique population
elements on a microdata file using the sample. Statistical
Research Division Report Number: Census/SRD/RR-91/08 (1991)
[2] Hoshino, N.: Applying pitmans sampling formula to microdata
disclosure risk assessment. J Off Stat 17(4), 499520 (2001)
[3] Chen, G., Keller-McNulty, S.: Estimation of identification disclosure
risk in microdata. J Off Stat 14, 7995 (1998)
[4] Rinott, Y.: On models for statistical disclosure risk estimation. In:
Proc ECE/Eurostat Work Session Stat Data Confid, p. 275285 (2003)
[5] Dankar, F., Emam, K.E., Neisa, A., Roffey, T.:
Estimating the re-identification risk of clinical
data sets. BMC Med Inform Decis Mak 12(1), 66 (2012)
Global and local recoding
• Can be weighted
Methods
• Categorization
• Generalization
• Cell suppression
• Record suppression
• Micro-aggregation
• Top/bottom coding
Weighted and parameterized
• Ability to control the application
of different coding models
Methods
• AECS, Discernibility, Precision
• (Normalized) Mean squared error
• (Normalized) Non-uniform entropy
• KL divergence
• Loss
Measures for utility Coding models Measures for uniqueness
Transform
Visualize
Analyze
Adapt

Technische Universität München
Screenshots
ARX | Dagtuhl Genomic Privacy Workshop 201522.10.15 5

Technische Universität München
Screenshots (cont'd)
ARX | Dagtuhl Genomic Privacy Workshop 201522.10.15 6

Technische Universität München
Further features offered by ARX
●
Syntactic privacy models
●
ℓ-diversity, t-closeness, δ-disclosure privacy, δ-presence
●
Risk-based anonymization
●
Differential privacy
●
Truthful (e,δ)-differentially private data release
●
Using random sampling
●
Detection of HIPAA identifiers
●
Based on heuristics
●
Import from multiple sources
●
RDBMS, Excel, CSV
●
Software library
●
Open source, cross-platform
ARX | Dagtuhl Genomic Privacy Workshop 201522.10.15 7

Technische Universität München
http://arx.deidentifier.org

Weitere ähnliche Inhalte

Was ist angesagt?

Data mining in telecommunications industryIssa Memari

Data Visualization: Impact, Intrigue, Value Add for APLIC 2014Amanda Makulec

Lec13: Clustering Based Medical Image Segmentation MethodsUlaş Bağcı

Data Governance and Data Science to Improve Data QualityDATAVERSITY

Ethics in Data Management.pptxRavindra Babu

02 Data MiningInstitute of Technology Telkom

Business intelligenceMustafa Ali Hassan, MBA

Alignment: Office of the Chief Data Officer & BCBS 239Craig Milroy

Data science for business leaders executive programmjitu309

Predictive analytics SAS Singapore Institute Pte Ltd

Management information systemPrachiprabha Sarangi

nnU-Net: a self-configuring method for deep learning-based biomedical image s...ivaderivader

Big data in fintech ecosystemBBVA API Market

BI and Data Analytics Incorta

Big Data, Big Deal: For Future Big Data ScientistsWay-Yen Lin

Comparative analysisAkshaya Nanodkar

Data MonetizationKiran Donepudi

Business Intelligence BankingPriya Ramakrishnan

Understanding big data and data analytics big dataSeta Wicaksana

sas-data-governance-framework-107325hurt3303

Was ist angesagt? (20)

Data mining in telecommunications industry

Data Visualization: Impact, Intrigue, Value Add for APLIC 2014

Lec13: Clustering Based Medical Image Segmentation Methods

Data Governance and Data Science to Improve Data Quality

Ethics in Data Management.pptx

02 Data Mining

Business intelligence

Alignment: Office of the Chief Data Officer & BCBS 239

Data science for business leaders executive program

Predictive analytics

Management information system

nnU-Net: a self-configuring method for deep learning-based biomedical image s...

Big data in fintech ecosystem

BI and Data Analytics

Big Data, Big Deal: For Future Big Data Scientists

Comparative analysis

Data Monetization

Business Intelligence Banking

Understanding big data and data analytics big data

sas-data-governance-framework-107325

Ähnlich wie Engineering data privacy - The ARX data anonymization tool

Challenges and opportunities for machine learning in biomedical researchFranciscoJAzuajeG

June 2020: Top Read Articles in Advanced Computational Intelligenceaciijournal

An introduction to machine learning in biomedical research: Key concepts, pr...FranciscoJAzuajeG

A review on early hospital mortality prediction using vital signalsReza Sadeghi

A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...theijes

Hanaa phd presentation 14-4-2017Aboul Ella Hassanien

Development of Computational Tool for Lung Cancer Prediction Using Data MiningEditor IJCATR

Pattern recognition using context dependent memory model (cdmm) in multimodal...ijfcstjournal

A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...arx-deidentifier

algorithmic-decisions, fairness, machine learning, provenance, transparencyPaolo Missier

FAULT DIAGNOSIS USING CLUSTERING. WHAT STATISTICAL TEST TO USE FOR HYPOTHESIS...JaresJournal

Prof. Mark Coles (Oxford University) - Data-driven systems medicinemntbs1

TBerger_FinalReportThaddeus Berger

SEMI SUPERVISED BASED SPATIAL EM FRAMEWORK FOR MICROARRAY ANALYSISIRJET Journal

Feature selection and microarray dataGianluca Bontempi

An Overview on Gene Expression AnalysisIOSR Journals

Charleston Conference 2016Anita de Waard

Intelligent generator of big data medicalNexgen Technology

Fractal Parameters of Tumour Microscopic Images as Prognostic Indicators of C...cscpconf

FRACTAL PARAMETERS OF TUMOUR MICROSCOPIC IMAGES AS PROGNOSTIC INDICATORS OF C...csandit

Ähnlich wie Engineering data privacy - The ARX data anonymization tool (20)

Challenges and opportunities for machine learning in biomedical research

June 2020: Top Read Articles in Advanced Computational Intelligence

An introduction to machine learning in biomedical research: Key concepts, pr...

A review on early hospital mortality prediction using vital signals

A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...

Hanaa phd presentation 14-4-2017

Development of Computational Tool for Lung Cancer Prediction Using Data Mining

Pattern recognition using context dependent memory model (cdmm) in multimodal...

A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...

algorithmic-decisions, fairness, machine learning, provenance, transparency

FAULT DIAGNOSIS USING CLUSTERING. WHAT STATISTICAL TEST TO USE FOR HYPOTHESIS...

Prof. Mark Coles (Oxford University) - Data-driven systems medicine

TBerger_FinalReport

SEMI SUPERVISED BASED SPATIAL EM FRAMEWORK FOR MICROARRAY ANALYSIS

Feature selection and microarray data

An Overview on Gene Expression Analysis

Charleston Conference 2016

Intelligent generator of big data medical

Fractal Parameters of Tumour Microscopic Images as Prognostic Indicators of C...

FRACTAL PARAMETERS OF TUMOUR MICROSCOPIC IMAGES AS PROGNOSTIC INDICATORS OF C...

Mehr von arx-deidentifier

An Open Source Tool for Game Theoretic Health Data De-Identificationarx-deidentifier

Anonymisierung und Risikomanagement mit ARXarx-deidentifier

An experimental comparison of globally-optimal data de-identification algorithmsarx-deidentifier

ARX - A Generic Method for Assessing the Quality of De-Identified Health Dataarx-deidentifier

ARX - a comprehensive tool for anonymizing / de-identifying biomedical dataarx-deidentifier

Mehr von arx-deidentifier (6)

An Open Source Tool for Game Theoretic Health Data De-Identification

Anonymisierung und Risikomanagement mit ARX

An experimental comparison of globally-optimal data de-identification algorithms

ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

ARX - a comprehensive tool for anonymizing / de-identifying biomedical data

Kürzlich hochgeladen

How to Track Employee Performance A Comprehensive Guide.pdfLivetecs LLC

Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110

Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini

Cyber security and its impact on E commercemanigoyal112

What is Fashion PLM and Why Do You Need ItWave PLM

React Server Component in Next.js by Hanief UtamaHanief Utama

Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig

GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko

办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea

MYjobs Presentation Django-based projectAnoyGreter

BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp

英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0

What is Advanced Excel and what are some best practices for designing and cre...Technogeeks

What are the key points to focus on before starting to learn ETL Development....kzayra69

Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed

PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122

Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel

EY_Graph Database Powered SustainabilityNeo4j

Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services

Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray

Kürzlich hochgeladen (20)

How to Track Employee Performance A Comprehensive Guide.pdf

Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...

Xen Safety Embedded OSS Summit April 2024 v4.pdf

Cyber security and its impact on E commerce

What is Fashion PLM and Why Do You Need It

React Server Component in Next.js by Hanief Utama

Automate your Kamailio Test Calls - Kamailio World 2024

GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf

办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样

MYjobs Presentation Django-based project

BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE

英国UN学位证,北安普顿大学毕业证书1:1制作

What is Advanced Excel and what are some best practices for designing and cre...

What are the key points to focus on before starting to learn ETL Development....

Unveiling Design Patterns: A Visual Guide with UML Diagrams

PREDICTING RIVER WATER QUALITY ppt presentation

Unveiling the Future: Sylius 2.0 New Features

EY_Graph Database Powered Sustainability

Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...

Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...

Engineering data privacy - The ARX data anonymization tool

1. Technische Universität München Fabian Prasser, Florian Kohlmayer, Klaus A. Kuhn Chair for Biomedical Informatics Institute for Medical Statistics and Epidemiologie University of Technology Munich (TUM) Engineering data privacy - The ARX data anonymization tool

2. Technische Universität München What is ARX? ● = + ● A tool for analyzing and reducing the uniqueness of records in a (relational) dataset ● Variety of methods ● Highly scalable ● Up to 50 dimensions (i.e. attributes) ● Millions of records ● (Semi-)automatically and/or manually ● Comprehensive graphical user interface ARX | Dagtuhl Genomic Privacy Workshop 201522.10.15 2 Images: https://commons.wikimedia.org/ users: Ysangkok, Scarce2 statistics computer science Methods from

3. Technische Universität München Example ARX | Dagtuhl Genomic Privacy Workshop 201522.10.15 3 Generalization Suppression Microaggregation Reduce uniqueness

4. Technische Universität München Overview of methods implemented by ARX Sample-based methods • Fraction of sample uniques • Average sample uniqueness • k-anonymity Population-based methods • Model by Zayatz [1] • Model by Hoshino [2] • Model by Chen et al. [3] / Rinott [4] • Model by Dankar et al. [5] ARX | Dagtuhl Genomic Privacy Workshop 201522.10.15 4 [1] Zayatz, L.V.: Estimation of the percent of unique population elements on a microdata file using the sample. Statistical Research Division Report Number: Census/SRD/RR-91/08 (1991) [2] Hoshino, N.: Applying pitmans sampling formula to microdata disclosure risk assessment. J Off Stat 17(4), 499520 (2001) [3] Chen, G., Keller-McNulty, S.: Estimation of identification disclosure risk in microdata. J Off Stat 14, 7995 (1998) [4] Rinott, Y.: On models for statistical disclosure risk estimation. In: Proc ECE/Eurostat Work Session Stat Data Confid, p. 275285 (2003) [5] Dankar, F., Emam, K.E., Neisa, A., Roffey, T.: Estimating the re-identification risk of clinical data sets. BMC Med Inform Decis Mak 12(1), 66 (2012) Global and local recoding • Can be weighted Methods • Categorization • Generalization • Cell suppression • Record suppression • Micro-aggregation • Top/bottom coding Weighted and parameterized • Ability to control the application of different coding models Methods • AECS, Discernibility, Precision • (Normalized) Mean squared error • (Normalized) Non-uniform entropy • KL divergence • Loss Measures for utility Coding models Measures for uniqueness Transform Visualize Analyze Adapt

5. Technische Universität München Screenshots ARX | Dagtuhl Genomic Privacy Workshop 201522.10.15 5

6. Technische Universität München Screenshots (cont'd) ARX | Dagtuhl Genomic Privacy Workshop 201522.10.15 6

7. Technische Universität München Further features offered by ARX ● Syntactic privacy models ● ℓ-diversity, t-closeness, δ-disclosure privacy, δ-presence ● Risk-based anonymization ● Differential privacy ● Truthful (e,δ)-differentially private data release ● Using random sampling ● Detection of HIPAA identifiers ● Based on heuristics ● Import from multiple sources ● RDBMS, Excel, CSV ● Software library ● Open source, cross-platform ARX | Dagtuhl Genomic Privacy Workshop 201522.10.15 7

8. Technische Universität München http://arx.deidentifier.org

Engineering data privacy - The ARX data anonymization tool

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Engineering data privacy - The ARX data anonymization tool

Ähnlich wie Engineering data privacy - The ARX data anonymization tool (20)

Mehr von arx-deidentifier

Mehr von arx-deidentifier (6)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Engineering data privacy - The ARX data anonymization tool