SlideShare ist ein Scribd-Unternehmen logo
1 von 16
Downloaden Sie, um offline zu lesen
Technische Universität München
Fabian Prasser, Florian Kohlmayer, Klaus A. Kuhn
Chair of Biomedical Informatics
Institute of Medical Statistics and Epidemiology
Rechts der Isar Hospital
Technische Universität München
ARX
A Comprehensive Tool for
Anonymizing Biomedical Data
Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data
Core-developers of ARX
• Computer scientists, background in IT security and database systems
• Research assistants at the Chair for Biomedical Informatics at TUM
ARX
Florian Kohlmayer Fabian Prasser
Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data
Motivation: Data sharing in biomedical research
• Data sharing is a core element of biomedical research
• Wellcome Trust: Sharing research data to improve public health [1]
• OECD: Principles and guidelines for access to research data from public
funding [2]
• Disclosure of data may lead to harm for individuals
• Data may be person-related and highly sensitive
• Large body of laws & regulations mandates privacy protection
• US: HIPPA Privacy Rule
• EU: European Data Protection Regulation
• DE: German Federal Data Protection Act
• Safeguards
• Access control, policies, agreements, …
• De-identification / anonymization
ARX
Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data
Overview: De-identification / anonymization
• Controlling interactive data analysis
• Subject: query results, …
• Methods: differential privacy, query-set-size control, …
• Implementations: Fuzz, PINQ, Airavat, HIDE
• Masking identifiers in unstructured data
• Subject: clinical notes, …
• Methods: machine learning, regular expressions, …
• Implementations: MIST, MITdeid, NLM Scrubber
• Transforming structured data (focus of ARX)
• Subject: tabular data, …
• Methods: generalization, suppression, …
• Implementations: ARX, sdcMicro, PARAT
ARX
Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data
Background: Transforming structured data
• Basic idea: transform datasets in such a way that they adhere to a set
of formal privacy guarantees
• Typical transformations
• Generalization: Germany  Europe (often applied to individual values)
• Suppression: Germany  * (often applied to whole entries)
• Example generalization hierarchies
<50
1
2
…
≥50
50
51
…
North
America
US
…
Asia …
… …
Europe
Germany
…
High
Very high [4.9, 10.0]
High [4.1, 4.9[
Borderline
high
[3.4, 4.1[
Normal Normal [2.6, 3.4[
Low
Low [1.8, 2.6[
Very low [0.0, 1.8[
Age Country LDL cholesterol
Level 0Level 2
ARX
Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data
Background: Transforming structured data (cont.)
<50
1
2
…
≥50
50
51
…
Tree
Level 0 Level 1 Level 2
1 <50 *
2 <50 *
… <50 *
50 ≥50 *
51 ≥50 *
… ≥50 *
Tabular representation
ARX
(0,0,0)
(1,0,0) (0,1,0) (0,0,1)
(2,0,0) (1,1,0) (1,0,1) (0,1,1) (0,0,2)
Level 0 for attribute “age”
Level 1 for attribute “zip code”
Level 0 for attribute “gender”
• Representation of gen. hierarchies
• Full-domain generalization: all values of an attribute are generalized
to the same level of the associated generalization hierarchy
• Search space: combinations of all generalization levels (lattice)
Age Gender Zip code
Schema:
ARX
Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data
Background: Privacy models
• Well-known models: k-anonymity, ℓ-diversity, t-closeness, δ-presence
• Example: k-anonymity
• Proposed by Samarati and Sweeney in 1998 [3]
• Attacker model: linkage via a set of quasi-identifiers (identity disclosure)
• Mitigated by: building groups of indistinguishable data entries
• Adherence can be achieved with generalization and suppression
Age Gender Zip code
34 male 81667
45 female 81675
34 female 81931
45 male 81925
70 female 81931
70 male 81931
66 male 80931
Age Gender Zip code
< 50 * 816**
< 50 * 816**
< 50 * 819**
< 50 * 819**
≥ 50 * 819**
≥ 50 * 819**
* * *
2-anonymity
Suppression
Generalization (1,1,2)
ARX
Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data
Background: k-Anonymity
Original dataset
2-anonymous dataset
Age Gender Zip code
34 male 81667
45 female 81675
34 female 81931
45 male 81925
70 female 81931
70 male 81931
66 male 80931
Age Gender Zip code
< 50 * 816**
< 50 * 816**
< 50 * 819**
< 50 * 819**
≥ 50 * 819**
≥ 50 * 819**
* * *
Age Gender Zip code Name
45 female 81675 Alice
45 male 81925 Bob
70 male 81931 Charlie
⁞ ⁞ ⁞ ⁞
Age Gender Zip code Name
45 female 81675 Alice
45 male 81925 Bob
70 male 81931 Charlie
⁞ ⁞ ⁞ ⁞
?
Background knowledge, e.g. voter list
Background knowledge, e.g. voter list
ARX
Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data
Challenge: Tool support
• Situation: anonymization of structured data is frequently recommended
(laws, regulations, guidelines) but in practice it is only used rarely
• Main reasons
• Lack of understanding of opportunities and limitations
• Lack of ready-to-use tools
• Non-trivial: implementing useful tools is challenging
• Usefulness has many dimensions
• Ability to balance data utility with privacy requirements
• Support a broad spectrum of privacy methods, transformation techniques
and methods for measuring and analyzing data utility
• Performance and scalability
• Intuitive visualization and parameterization of all process steps
• Provide methods to end-users as well as programmers
• In an integrated and harmonized manner
• Openness
ARX
Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data
Challenge: Related software
• sdcMicro
• Cross-platform open source software implemented in “R”
• Collection of a set of methods, not an integrated application
• Different types of recoding models and risk models
• Minimalistic graphical user interface
• µArgus
• Closed source software for MS Windows
• Methods comparable to sdcMicro but more comprehensive user interface
• Development has ceased
• PARAT
• Commercial tool for MS Windows
• Powerful graphical interface
• Methods implemented overlap with methods implemented in ARX
• Centered around a risk-based approach
• More comprehensive list: http://arx.deidentifier.org/related-software/
ARX
Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data
ARX: Highlights
• Flexible transformation methods: generalization and suppression in
a parameterizable and utility-driven manner
• Multiple privacy models: k-anonymity, ℓ-diversity (three variants),
t-closeness (two variants) and δ-presence, as well as arbitrary
combinations
• Multiple methods for measuring data utility: automatically as well
as manually
• Optimality: classification of the complete solution space
• Functional generalization rules: support for continuous and discrete
variables
• Highly scalable: several million data entries on commodity hardware
• Comprehensive cross-platform Graphical User Interface: wizards,
visualization of the solution space, analysis of data utility
• Application Programming Interface: full-blown Java library
ARX
Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data
ARX: Anonymization workflow
• Iteratively refine the anonymization process
• Supported by the scalability of our framework
• Three (potentially repeating) steps
- Create and edit rules
- Define privacy guarantees
- Parameterize coding model
- Configure utility measure
- Filter and analyze the
solution space
- Organize transformations
- Compare and analyze
input and output
ARX
Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data
ARX: Facts and credits
• Three years of work by two main developers: Florian Kohlmayer
and Fabian Prasser
• With help from multiple students: see credits on our website
• Interdisciplinary cooperation: Chair for IT Security, Chair for
Database Systems, Chair for Biomedical Informatics
• Code metrics
• ARX Core/API: 178 files, 200 classes
37,332 LOC (16,281 lines of comments)
• ARX GUI: 174 files, 207 classes
44,772 LOC (15,062 lines of comments)
• ARX Tests: 997 JUnit tests
• Commits: 1,722 commits (since 03/2013)
ARX
Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data
ARX: Publications
• Implementation framework: Proc Int Symp CBMS, 2012 [4]
• Anonymization algorithm: Proc Int Conf PASSAT, 2012 [5]
• Anonymization of distributed data: J Biomed Inform, 2013 [6]
• Benchmark for anonymity methods: Proc Int Symp CBMS, 2014 [7]
•
• “White Paper”: AMIA Annu Symp Proc, 2014 [8]
ARX
Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data
Thank you for your attention! Questions?
• Disclaimer
• Anonymization must be performed by experts
• Additional safeguards are required (e.g., contractual measures)
• ARX is open source software
• Contributions are welcome, e.g., feature requests, code reviews, criticism,
enhancements, questions
• Future developments: various projects, especially risk models
• Resources
• Project website: http://arx.deidentifier.org
• Code repository: https://github.com/arx-deidentifier/arx
• Get in touch
• Fabian Prasser (prasser@in.tum.de)
• Florian Kohlmayer (florian.kohlmayer@tum.de)
ARX
Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data
1. Wellcome Trust. Sharing research data to improve public health. 2013. [cited 2015 Jan 14].
Available at: http://www.wellcome.ac.uk/About-us/Policy/Spotlight-issues/Data-sharing/Public-
health-and-epidemiology/WTDV030690.htm.
2. OECD. Principles and guidelines for access to research data from public funding. 2006. [cited
2015 Jan 14]. Available at: www.oecd.org/sti/sci-tech/38500813.pdf.
3. Pierangela Samarati, Latanya Sweeney. Protecting privacy when disclosing information:
k-anonymity and its enforcement through generalization and suppression. Proc Symp Res
Secur Priv 1998.
4. Florian Kohlmayer*, Fabian Prasser*, Claudia Eckert, Alfons Kemper and Klaus A. Kuhn. Highly
efficient optimal k-anonymity for biomedical datasets. Proc Int Symp CBMS 2012.
5. Florian Kohlmayer*, Fabian Prasser*, Claudia Eckert, Alfons Kemper, Klaus. A. Kuhn. Flash:
efficient, stable and optimal k-anonymity. Proc Int Conf PASSAT 2012.
6. Florian Kohlmayer*, Fabian Prasser*, Claudia Eckert, Klaus A. Kuhn. A flexible approach to
distributed data anonymization. J Biomed Inform 2013.
7. Fabian Prasser*, Florian Kohlmayer*, Klaus A. Kuhn. A benchmark of globally-optimal
anonymization methods for biomedical data. Proc Int Symp CMBS 2014.
8. Fabian Prasser*, Florian Kohlmayer*, Ronald Lautenschlaeger, Klaus A. Kuhn. ARX – A
comprehensive tool for anonymizing biomedical data. AMIA Annu Symp Proc 2014.
*Equal contributors
References
ARX

Weitere ähnliche Inhalte

Was ist angesagt?

Standards and tools for model management in biomedical research
Standards and tools for model management in biomedical researchStandards and tools for model management in biomedical research
Standards and tools for model management in biomedical researchUniversity Medicine Greifswald
 
Model management tools for improved reproducibility in systems biology
Model management tools for improved reproducibility in systems biologyModel management tools for improved reproducibility in systems biology
Model management tools for improved reproducibility in systems biologyUniversity Medicine Greifswald
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data miningYashwant Rautela
 
20200130_Mannocci_OpenAIRE_ResearchGraph
20200130_Mannocci_OpenAIRE_ResearchGraph20200130_Mannocci_OpenAIRE_ResearchGraph
20200130_Mannocci_OpenAIRE_ResearchGraphOpenAIRE
 
Nicolay Lyfenko - Conceptual scheme for text classification system
Nicolay Lyfenko - Conceptual scheme for text classification systemNicolay Lyfenko - Conceptual scheme for text classification system
Nicolay Lyfenko - Conceptual scheme for text classification systemAIST
 
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings Kerstin Forsberg
 
Relevance Clues: Developing an Experimental Design to Examine the Criteria Be...
Relevance Clues: Developing an Experimental Design to Examine the Criteria Be...Relevance Clues: Developing an Experimental Design to Examine the Criteria Be...
Relevance Clues: Developing an Experimental Design to Examine the Criteria Be...Christiane Behnert
 
Data mining course learning outcomes,Data Mining CMAP
Data mining course learning outcomes,Data Mining CMAPData mining course learning outcomes,Data Mining CMAP
Data mining course learning outcomes,Data Mining CMAPjaya lakshmi
 
Real life application of statistics in engineering
Real life application of statistics in engineeringReal life application of statistics in engineering
Real life application of statistics in engineeringJannatulFerdous160
 
The case for cloud computing in Life Sciences
The case for cloud computing in Life SciencesThe case for cloud computing in Life Sciences
The case for cloud computing in Life SciencesOla Spjuth
 

Was ist angesagt? (18)

Medical 3D data retrieval
Medical 3D data retrievalMedical 3D data retrieval
Medical 3D data retrieval
 
Short introduction to SED-ML
Short introduction to SED-MLShort introduction to SED-ML
Short introduction to SED-ML
 
Data and Model Management for Systems Biology
Data and Model Management  for Systems BiologyData and Model Management  for Systems Biology
Data and Model Management for Systems Biology
 
Standards and tools for model management in biomedical research
Standards and tools for model management in biomedical researchStandards and tools for model management in biomedical research
Standards and tools for model management in biomedical research
 
Data and model management in Systems Biology
Data and model management in Systems BiologyData and model management in Systems Biology
Data and model management in Systems Biology
 
Model management tools for improved reproducibility in systems biology
Model management tools for improved reproducibility in systems biologyModel management tools for improved reproducibility in systems biology
Model management tools for improved reproducibility in systems biology
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
 
20200130_Mannocci_OpenAIRE_ResearchGraph
20200130_Mannocci_OpenAIRE_ResearchGraph20200130_Mannocci_OpenAIRE_ResearchGraph
20200130_Mannocci_OpenAIRE_ResearchGraph
 
Eira presentation
Eira presentationEira presentation
Eira presentation
 
Nicolay Lyfenko - Conceptual scheme for text classification system
Nicolay Lyfenko - Conceptual scheme for text classification systemNicolay Lyfenko - Conceptual scheme for text classification system
Nicolay Lyfenko - Conceptual scheme for text classification system
 
Heterogeneous data annotation
Heterogeneous data annotationHeterogeneous data annotation
Heterogeneous data annotation
 
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
 
Relevance Clues: Developing an Experimental Design to Examine the Criteria Be...
Relevance Clues: Developing an Experimental Design to Examine the Criteria Be...Relevance Clues: Developing an Experimental Design to Examine the Criteria Be...
Relevance Clues: Developing an Experimental Design to Examine the Criteria Be...
 
Data mining course learning outcomes,Data Mining CMAP
Data mining course learning outcomes,Data Mining CMAPData mining course learning outcomes,Data Mining CMAP
Data mining course learning outcomes,Data Mining CMAP
 
Project 0th Review
Project 0th ReviewProject 0th Review
Project 0th Review
 
Real life application of statistics in engineering
Real life application of statistics in engineeringReal life application of statistics in engineering
Real life application of statistics in engineering
 
Chaper 13 trend
Chaper 13 trendChaper 13 trend
Chaper 13 trend
 
The case for cloud computing in Life Sciences
The case for cloud computing in Life SciencesThe case for cloud computing in Life Sciences
The case for cloud computing in Life Sciences
 

Ähnlich wie ARX - a comprehensive tool for anonymizing / de-identifying biomedical data

Elixir at de.nbi meeting
Elixir at de.nbi meetingElixir at de.nbi meeting
Elixir at de.nbi meetingNiklas Blomberg
 
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...BigData_Europe
 
Secure data management, analysis, infrastructure and policy in an internation...
Secure data management, analysis, infrastructure and policy in an internation...Secure data management, analysis, infrastructure and policy in an internation...
Secure data management, analysis, infrastructure and policy in an internation...Carolyn Ten Holter
 
FAIR Sharing of Scientific Experiments - SWAT4LS Conference 2017
FAIR Sharing of Scientific Experiments - SWAT4LS Conference 2017FAIR Sharing of Scientific Experiments - SWAT4LS Conference 2017
FAIR Sharing of Scientific Experiments - SWAT4LS Conference 2017Oya Deniz Beyan
 
Data Virtualization at UMC Utrecht: Don't Collect, Connect! by Erik Fransen (...
Data Virtualization at UMC Utrecht: Don't Collect, Connect! by Erik Fransen (...Data Virtualization at UMC Utrecht: Don't Collect, Connect! by Erik Fransen (...
Data Virtualization at UMC Utrecht: Don't Collect, Connect! by Erik Fransen (...Patrick Van Renterghem
 
From allotrope to reference master data management
From allotrope to reference master data management From allotrope to reference master data management
From allotrope to reference master data management OSTHUS
 
Revolutionizing Laboratory Instrument Data for the Pharmaceutical Industry:...
Revolutionizing Laboratory  Instrument Data for the  Pharmaceutical Industry:...Revolutionizing Laboratory  Instrument Data for the  Pharmaceutical Industry:...
Revolutionizing Laboratory Instrument Data for the Pharmaceutical Industry:...OSTHUS
 
Standarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesStandarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesYasset Perez-Riverol
 
On chemical structures, substances, nanomaterials and measurements
On chemical structures, substances, nanomaterials and measurementsOn chemical structures, substances, nanomaterials and measurements
On chemical structures, substances, nanomaterials and measurementsNina Jeliazkova
 
eTRIKS Data Harmonization Service Platform
eTRIKS Data Harmonization Service PlatformeTRIKS Data Harmonization Service Platform
eTRIKS Data Harmonization Service Platformibemam
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer SchoolCarole Goble
 
Automating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseAutomating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseVaticle
 
Pistoia alliance harmonizing fair data catalog approaches webinar
Pistoia alliance harmonizing fair data catalog approaches webinarPistoia alliance harmonizing fair data catalog approaches webinar
Pistoia alliance harmonizing fair data catalog approaches webinarPistoia Alliance
 
(Open) Research Data Management in H2020 (ISERD – Tel Aviv, Oct 31, 2016)
(Open) Research Data Management in H2020 (ISERD – Tel Aviv, Oct 31, 2016)(Open) Research Data Management in H2020 (ISERD – Tel Aviv, Oct 31, 2016)
(Open) Research Data Management in H2020 (ISERD – Tel Aviv, Oct 31, 2016)OpenAIRE
 
Basics of Research Data Management
Basics of Research Data ManagementBasics of Research Data Management
Basics of Research Data ManagementOpenAIRE
 
Standardisation in BMS European infrastructures
Standardisation in BMS European infrastructuresStandardisation in BMS European infrastructures
Standardisation in BMS European infrastructuresRafael C. Jimenez
 

Ähnlich wie ARX - a comprehensive tool for anonymizing / de-identifying biomedical data (20)

Elixir at de.nbi meeting
Elixir at de.nbi meetingElixir at de.nbi meeting
Elixir at de.nbi meeting
 
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
 
Secure data management, analysis, infrastructure and policy in an internation...
Secure data management, analysis, infrastructure and policy in an internation...Secure data management, analysis, infrastructure and policy in an internation...
Secure data management, analysis, infrastructure and policy in an internation...
 
FAIR Sharing of Scientific Experiments - SWAT4LS Conference 2017
FAIR Sharing of Scientific Experiments - SWAT4LS Conference 2017FAIR Sharing of Scientific Experiments - SWAT4LS Conference 2017
FAIR Sharing of Scientific Experiments - SWAT4LS Conference 2017
 
Data Virtualization at UMC Utrecht: Don't Collect, Connect! by Erik Fransen (...
Data Virtualization at UMC Utrecht: Don't Collect, Connect! by Erik Fransen (...Data Virtualization at UMC Utrecht: Don't Collect, Connect! by Erik Fransen (...
Data Virtualization at UMC Utrecht: Don't Collect, Connect! by Erik Fransen (...
 
From allotrope to reference master data management
From allotrope to reference master data management From allotrope to reference master data management
From allotrope to reference master data management
 
Revolutionizing Laboratory Instrument Data for the Pharmaceutical Industry:...
Revolutionizing Laboratory  Instrument Data for the  Pharmaceutical Industry:...Revolutionizing Laboratory  Instrument Data for the  Pharmaceutical Industry:...
Revolutionizing Laboratory Instrument Data for the Pharmaceutical Industry:...
 
Standarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesStandarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata files
 
On chemical structures, substances, nanomaterials and measurements
On chemical structures, substances, nanomaterials and measurementsOn chemical structures, substances, nanomaterials and measurements
On chemical structures, substances, nanomaterials and measurements
 
HL7 Standards (March 21, 2018)
HL7 Standards (March 21, 2018)HL7 Standards (March 21, 2018)
HL7 Standards (March 21, 2018)
 
eTRIKS Data Harmonization Service Platform
eTRIKS Data Harmonization Service PlatformeTRIKS Data Harmonization Service Platform
eTRIKS Data Harmonization Service Platform
 
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
 
Automating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseAutomating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge Base
 
Pistoia alliance harmonizing fair data catalog approaches webinar
Pistoia alliance harmonizing fair data catalog approaches webinarPistoia alliance harmonizing fair data catalog approaches webinar
Pistoia alliance harmonizing fair data catalog approaches webinar
 
(Open) Research Data Management in H2020 (ISERD – Tel Aviv, Oct 31, 2016)
(Open) Research Data Management in H2020 (ISERD – Tel Aviv, Oct 31, 2016)(Open) Research Data Management in H2020 (ISERD – Tel Aviv, Oct 31, 2016)
(Open) Research Data Management in H2020 (ISERD – Tel Aviv, Oct 31, 2016)
 
Accessing Environmental Chemistry Data via Data Dashboards
Accessing Environmental Chemistry Data via Data Dashboards Accessing Environmental Chemistry Data via Data Dashboards
Accessing Environmental Chemistry Data via Data Dashboards
 
Scholze imcw 2014-11-25
Scholze imcw 2014-11-25Scholze imcw 2014-11-25
Scholze imcw 2014-11-25
 
Basics of Research Data Management
Basics of Research Data ManagementBasics of Research Data Management
Basics of Research Data Management
 
Standardisation in BMS European infrastructures
Standardisation in BMS European infrastructuresStandardisation in BMS European infrastructures
Standardisation in BMS European infrastructures
 

Kürzlich hochgeladen

Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 

Kürzlich hochgeladen (20)

Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 

ARX - a comprehensive tool for anonymizing / de-identifying biomedical data

  • 1. Technische Universität München Fabian Prasser, Florian Kohlmayer, Klaus A. Kuhn Chair of Biomedical Informatics Institute of Medical Statistics and Epidemiology Rechts der Isar Hospital Technische Universität München ARX A Comprehensive Tool for Anonymizing Biomedical Data
  • 2. Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data Core-developers of ARX • Computer scientists, background in IT security and database systems • Research assistants at the Chair for Biomedical Informatics at TUM ARX Florian Kohlmayer Fabian Prasser
  • 3. Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data Motivation: Data sharing in biomedical research • Data sharing is a core element of biomedical research • Wellcome Trust: Sharing research data to improve public health [1] • OECD: Principles and guidelines for access to research data from public funding [2] • Disclosure of data may lead to harm for individuals • Data may be person-related and highly sensitive • Large body of laws & regulations mandates privacy protection • US: HIPPA Privacy Rule • EU: European Data Protection Regulation • DE: German Federal Data Protection Act • Safeguards • Access control, policies, agreements, … • De-identification / anonymization ARX
  • 4. Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data Overview: De-identification / anonymization • Controlling interactive data analysis • Subject: query results, … • Methods: differential privacy, query-set-size control, … • Implementations: Fuzz, PINQ, Airavat, HIDE • Masking identifiers in unstructured data • Subject: clinical notes, … • Methods: machine learning, regular expressions, … • Implementations: MIST, MITdeid, NLM Scrubber • Transforming structured data (focus of ARX) • Subject: tabular data, … • Methods: generalization, suppression, … • Implementations: ARX, sdcMicro, PARAT ARX
  • 5. Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data Background: Transforming structured data • Basic idea: transform datasets in such a way that they adhere to a set of formal privacy guarantees • Typical transformations • Generalization: Germany  Europe (often applied to individual values) • Suppression: Germany  * (often applied to whole entries) • Example generalization hierarchies <50 1 2 … ≥50 50 51 … North America US … Asia … … … Europe Germany … High Very high [4.9, 10.0] High [4.1, 4.9[ Borderline high [3.4, 4.1[ Normal Normal [2.6, 3.4[ Low Low [1.8, 2.6[ Very low [0.0, 1.8[ Age Country LDL cholesterol Level 0Level 2 ARX
  • 6. Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data Background: Transforming structured data (cont.) <50 1 2 … ≥50 50 51 … Tree Level 0 Level 1 Level 2 1 <50 * 2 <50 * … <50 * 50 ≥50 * 51 ≥50 * … ≥50 * Tabular representation ARX (0,0,0) (1,0,0) (0,1,0) (0,0,1) (2,0,0) (1,1,0) (1,0,1) (0,1,1) (0,0,2) Level 0 for attribute “age” Level 1 for attribute “zip code” Level 0 for attribute “gender” • Representation of gen. hierarchies • Full-domain generalization: all values of an attribute are generalized to the same level of the associated generalization hierarchy • Search space: combinations of all generalization levels (lattice) Age Gender Zip code Schema: ARX
  • 7. Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data Background: Privacy models • Well-known models: k-anonymity, ℓ-diversity, t-closeness, δ-presence • Example: k-anonymity • Proposed by Samarati and Sweeney in 1998 [3] • Attacker model: linkage via a set of quasi-identifiers (identity disclosure) • Mitigated by: building groups of indistinguishable data entries • Adherence can be achieved with generalization and suppression Age Gender Zip code 34 male 81667 45 female 81675 34 female 81931 45 male 81925 70 female 81931 70 male 81931 66 male 80931 Age Gender Zip code < 50 * 816** < 50 * 816** < 50 * 819** < 50 * 819** ≥ 50 * 819** ≥ 50 * 819** * * * 2-anonymity Suppression Generalization (1,1,2) ARX
  • 8. Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data Background: k-Anonymity Original dataset 2-anonymous dataset Age Gender Zip code 34 male 81667 45 female 81675 34 female 81931 45 male 81925 70 female 81931 70 male 81931 66 male 80931 Age Gender Zip code < 50 * 816** < 50 * 816** < 50 * 819** < 50 * 819** ≥ 50 * 819** ≥ 50 * 819** * * * Age Gender Zip code Name 45 female 81675 Alice 45 male 81925 Bob 70 male 81931 Charlie ⁞ ⁞ ⁞ ⁞ Age Gender Zip code Name 45 female 81675 Alice 45 male 81925 Bob 70 male 81931 Charlie ⁞ ⁞ ⁞ ⁞ ? Background knowledge, e.g. voter list Background knowledge, e.g. voter list ARX
  • 9. Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data Challenge: Tool support • Situation: anonymization of structured data is frequently recommended (laws, regulations, guidelines) but in practice it is only used rarely • Main reasons • Lack of understanding of opportunities and limitations • Lack of ready-to-use tools • Non-trivial: implementing useful tools is challenging • Usefulness has many dimensions • Ability to balance data utility with privacy requirements • Support a broad spectrum of privacy methods, transformation techniques and methods for measuring and analyzing data utility • Performance and scalability • Intuitive visualization and parameterization of all process steps • Provide methods to end-users as well as programmers • In an integrated and harmonized manner • Openness ARX
  • 10. Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data Challenge: Related software • sdcMicro • Cross-platform open source software implemented in “R” • Collection of a set of methods, not an integrated application • Different types of recoding models and risk models • Minimalistic graphical user interface • µArgus • Closed source software for MS Windows • Methods comparable to sdcMicro but more comprehensive user interface • Development has ceased • PARAT • Commercial tool for MS Windows • Powerful graphical interface • Methods implemented overlap with methods implemented in ARX • Centered around a risk-based approach • More comprehensive list: http://arx.deidentifier.org/related-software/ ARX
  • 11. Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data ARX: Highlights • Flexible transformation methods: generalization and suppression in a parameterizable and utility-driven manner • Multiple privacy models: k-anonymity, ℓ-diversity (three variants), t-closeness (two variants) and δ-presence, as well as arbitrary combinations • Multiple methods for measuring data utility: automatically as well as manually • Optimality: classification of the complete solution space • Functional generalization rules: support for continuous and discrete variables • Highly scalable: several million data entries on commodity hardware • Comprehensive cross-platform Graphical User Interface: wizards, visualization of the solution space, analysis of data utility • Application Programming Interface: full-blown Java library ARX
  • 12. Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data ARX: Anonymization workflow • Iteratively refine the anonymization process • Supported by the scalability of our framework • Three (potentially repeating) steps - Create and edit rules - Define privacy guarantees - Parameterize coding model - Configure utility measure - Filter and analyze the solution space - Organize transformations - Compare and analyze input and output ARX
  • 13. Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data ARX: Facts and credits • Three years of work by two main developers: Florian Kohlmayer and Fabian Prasser • With help from multiple students: see credits on our website • Interdisciplinary cooperation: Chair for IT Security, Chair for Database Systems, Chair for Biomedical Informatics • Code metrics • ARX Core/API: 178 files, 200 classes 37,332 LOC (16,281 lines of comments) • ARX GUI: 174 files, 207 classes 44,772 LOC (15,062 lines of comments) • ARX Tests: 997 JUnit tests • Commits: 1,722 commits (since 03/2013) ARX
  • 14. Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data ARX: Publications • Implementation framework: Proc Int Symp CBMS, 2012 [4] • Anonymization algorithm: Proc Int Conf PASSAT, 2012 [5] • Anonymization of distributed data: J Biomed Inform, 2013 [6] • Benchmark for anonymity methods: Proc Int Symp CBMS, 2014 [7] • • “White Paper”: AMIA Annu Symp Proc, 2014 [8] ARX
  • 15. Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data Thank you for your attention! Questions? • Disclaimer • Anonymization must be performed by experts • Additional safeguards are required (e.g., contractual measures) • ARX is open source software • Contributions are welcome, e.g., feature requests, code reviews, criticism, enhancements, questions • Future developments: various projects, especially risk models • Resources • Project website: http://arx.deidentifier.org • Code repository: https://github.com/arx-deidentifier/arx • Get in touch • Fabian Prasser (prasser@in.tum.de) • Florian Kohlmayer (florian.kohlmayer@tum.de) ARX
  • 16. Technische Universität MünchenARX - A Comprehensive Tool for Anonymizing Biomedical Data 1. Wellcome Trust. Sharing research data to improve public health. 2013. [cited 2015 Jan 14]. Available at: http://www.wellcome.ac.uk/About-us/Policy/Spotlight-issues/Data-sharing/Public- health-and-epidemiology/WTDV030690.htm. 2. OECD. Principles and guidelines for access to research data from public funding. 2006. [cited 2015 Jan 14]. Available at: www.oecd.org/sti/sci-tech/38500813.pdf. 3. Pierangela Samarati, Latanya Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Proc Symp Res Secur Priv 1998. 4. Florian Kohlmayer*, Fabian Prasser*, Claudia Eckert, Alfons Kemper and Klaus A. Kuhn. Highly efficient optimal k-anonymity for biomedical datasets. Proc Int Symp CBMS 2012. 5. Florian Kohlmayer*, Fabian Prasser*, Claudia Eckert, Alfons Kemper, Klaus. A. Kuhn. Flash: efficient, stable and optimal k-anonymity. Proc Int Conf PASSAT 2012. 6. Florian Kohlmayer*, Fabian Prasser*, Claudia Eckert, Klaus A. Kuhn. A flexible approach to distributed data anonymization. J Biomed Inform 2013. 7. Fabian Prasser*, Florian Kohlmayer*, Klaus A. Kuhn. A benchmark of globally-optimal anonymization methods for biomedical data. Proc Int Symp CMBS 2014. 8. Fabian Prasser*, Florian Kohlmayer*, Ronald Lautenschlaeger, Klaus A. Kuhn. ARX – A comprehensive tool for anonymizing biomedical data. AMIA Annu Symp Proc 2014. *Equal contributors References ARX