Anonymizing Health Data
Webcast
Case Studies and Methods to Get You Started
Khaled El Emam & Luk Arbuckle

Part 1 of Webcast: Intro and Methodology
Part 2 of Webcast: A Look at Our Case Studies
Part 3 of Webcast: Questions and Answers

Part 1 of Webcast: Intro and Methodology

To Anonymize or not to Anonymize
Consent needs to be informed.
Not all health care providers are willing to share their patients’ PHI.
Anonymization allows for the sharing of health information.
Privacy protective behaviors by patients.
Compelling financial case. Breach cost ~$200 per patient.

Masking Standards
First name, last name, SSN.
Distortion of data: no analytics.
Creating pseudonyms.
Removing a whole field.
Replacing actual values with random ones.
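The three masking operations named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' tooling: the field names, the sample record, the FAKE_FIRST_NAMES pool, and the choice of a keyed hash (HMAC-SHA256) for pseudonyms are all assumptions made for the example.

```python
import hashlib
import hmac
import random

# Hypothetical pool of replacement names; any list of plausible values would do.
FAKE_FIRST_NAMES = ["Alex", "Sam", "Jordan", "Casey"]


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Map an identifier to a repeatable pseudonym using a keyed hash."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def mask_record(record: dict, secret_key: bytes, rng: random.Random) -> dict:
    """Return a masked copy of one record; the input record is not modified."""
    masked = dict(record)
    masked.pop("last_name", None)                            # remove a whole field
    masked["ssn"] = pseudonymize(record["ssn"], secret_key)  # create a pseudonym
    masked["first_name"] = rng.choice(FAKE_FIRST_NAMES)      # replace with a random value
    return masked


# Made-up record, for illustration only.
patient = {"first_name": "Jane", "last_name": "Doe", "ssn": "000-00-0000", "age": 42}
masked_patient = mask_record(patient, secret_key=b"example-key", rng=random.Random(0))
```

The same SSN always maps to the same pseudonym under one key, so records can still be linked. The masked name and SSN are no longer usable for analysis, which matches the "no analytics" point on the slide.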

De-identification Standards
Age, sex, race, address, income.
Minimal distortion of data, for analytics.
Safe Harbor in HIPAA Privacy Rule.

What’s “Actual Knowledge”?
Privacy Rule
Safe Harbor
Info, alone or in combo, that could identify an individual.
Has to be specific to the data set, not theoretical.
Occupation: Mayor of Gotham.

De-identification Standards
Heuristics, or rules of thumb.
Statistical method in HIPAA Privacy Rule.

De-identification Myths
Myth: It’s possible to re-identify most, if not all, data.
Using robust methods, evidence suggests risk can be very small.
Myth: Genomic sequences are not identifiable, or are easy to re-identify.
In some cases can re-identify, difficult to de-identify using our methods.

A Risk-based De-identification Methodology
The risk of re-identification can be quantified.
The Goldilocks principle: balancing privacy with data utility.
The re-identification risk needs to be very small.
De-identification involves a mix of technical, contractual, and other measures.

Steps in the De-identification Methodology
Step 1: Select Direct and Indirect Identifiers
Step 2: Setting the Threshold
Step 3: Examining Plausible Attacks
Step 4: De-identifying the Data
Step 5: Documenting the Process

Step 1: Select Direct and Indirect Identifiers
Direct identifiers: name, telephone number, health insurance card number, medical record number.
Indirect identifiers, or quasi-identifiers: sex, date of birth, ethnicity, locations, event dates, medical codes.

Step 2: Setting the Threshold
Maximum acceptable risk for sharing data.
Needs to be quantitative and defensible.
Is the data going to be in the public domain?
Extent of invasion of privacy when data was shared?

Step 3: Examining Plausible Attacks
Recipient deliberately attempts to re-identify the data.
Recipient inadvertently re-identifies the data. “Holly Smokes, I know her!”
Data breach at recipient’s site, “data gone wild”.
Adversary launches a demonstration attack on the data.

Step 4: De-identifying the Data
Generalization: reducing the precision of a field.
Dates converted to month/year, or year.
Suppression: replacing a cell with NULL.
Unique 55-year-old female in birth registry.
Sub-sampling: releasing a simple random sample.
50% of data set instead of all data.
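The three Step 4 techniques can be sketched as follows. This is a minimal sketch under stated assumptions: the record layout, the field names, and the hard-coded rule that flags the 55-year-old are hypothetical. A real pipeline would decide what to suppress from a risk measurement instead.

```python
import random
from datetime import date


def generalize_date(d: date, level: str = "month") -> str:
    """Reduce a full date to month/year or to year only."""
    if level == "month":
        return f"{d.month:02d}/{d.year}"
    if level == "year":
        return str(d.year)
    raise ValueError(f"unknown level: {level}")


def suppress(records: list, field: str, is_risky) -> list:
    """Replace `field` with None (NULL) in every record flagged by `is_risky`."""
    return [{**r, field: None} if is_risky(r) else dict(r) for r in records]


def subsample(records: list, fraction: float, rng: random.Random) -> list:
    """Release a simple random sample containing `fraction` of the records."""
    return rng.sample(records, round(len(records) * fraction))


# Made-up birth registry rows, for illustration only.
registry = [
    {"age": 29, "sex": "F", "delivery": date(2011, 3, 14)},
    {"age": 55, "sex": "F", "delivery": date(2011, 7, 2)},
    {"age": 31, "sex": "F", "delivery": date(2011, 9, 30)},
    {"age": 27, "sex": "F", "delivery": date(2011, 12, 5)},
]
generalized = [{**r, "delivery": generalize_date(r["delivery"])} for r in registry]
suppressed = suppress(generalized, "age", lambda r: r["age"] == 55)
released = subsample(suppressed, 0.5, random.Random(1))
```

Each step returns new records and leaves its input untouched, so the original data set stays available for measuring how much each step reduces risk.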

Step 5: Documenting the Process
Process documentation: a methodology text.
Results documentation: data set, risk thresholds, assumptions, evidence of low risk.

Measuring Risk Under Plausible Attacks
T1: Deliberate Attempt
Pr(re-id, attempt) = Pr(attempt) × Pr(re-id | attempt)
T2: Inadvertent Attempt (“Holly Smokes, I know her!”)
Pr(re-id, acquaintance) = Pr(acquaintance) × Pr(re-id | acquaintance)
T3: Data Breach (“data gone wild”)
Pr(re-id, breach) = Pr(breach) × Pr(re-id | breach)
T4: Public Data (demonstration attack)
Pr(re-id), based on data set only
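The four attack formulas share one shape, which a short Python sketch can make concrete. The function names and all input probabilities below are hypothetical. The sketch also assumes the same data-set-based Pr(re-id | T) applies to every attack, which is a simplification for illustration.

```python
def joint_risk(pr_event: float, pr_reid_given_event: float) -> float:
    """Pr(re-id, event) = Pr(event) * Pr(re-id | event)."""
    for p in (pr_event, pr_reid_given_event):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return pr_event * pr_reid_given_event


def risk_under_attacks(pr_reid_from_data: float, pr_attempt: float,
                       pr_acquaintance: float, pr_breach: float) -> dict:
    """Evaluate T1 to T4; T4 uses the data-set risk alone."""
    return {
        "T1": joint_risk(pr_attempt, pr_reid_from_data),
        "T2": joint_risk(pr_acquaintance, pr_reid_from_data),
        "T3": joint_risk(pr_breach, pr_reid_from_data),
        "T4": pr_reid_from_data,
    }


# Illustrative numbers only; they are not taken from the webcast.
risks = risk_under_attacks(pr_reid_from_data=0.05, pr_attempt=0.3,
                           pr_acquaintance=0.8, pr_breach=0.2)
```

With these inputs T4 is the largest value, because public data is the only attack whose risk is not scaled down by the probability that the attack happens.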

Choosing Thresholds
Many precedents going back multiple decades.
Recommended by regulators.
All based on max risk though.

Part 2 of Webcast: A Look at Our Case Studies

Cross Sectional Data: Research Registries
Better Outcomes Registry & Network (BORN) of Ontario
140,000 births per year.
Cross-sectional: mothers not traced over time.
Process of getting de-identified data from a research registry.

Researcher Ronnie wants data!
919,710 records from 2005-2011

Choosing Thresholds
Average risk of 0.1 for Researcher Ronnie (and the data he specifically requested).
0.05 if there were highly sensitive variables (congenital anomalies, mental health problems).

Measuring Risk Under Plausible Attacks
T1: Deliberate Attempt
Low motives and capacity; low mitigating controls.
Pr(attempt) = 0.4
T2: Inadvertent Attempt (“Holly Smokes, I know her!”)
119,785 births out of 4,478,500 women (= 0.027)
Pr(acquaintance) = 1 - (1 - 0.027)^(150/2) = 0.87
T3: Data Breach (“data gone wild”)
Based on historical data.
Pr(breach) = 0.27
T4: Public Data (demonstration attack)
Overall risk
Pr(re-id, T) = Pr(T) × Pr(re-id | T) ≤ 0.1
For T2: Pr(re-id, acquaintance) = 0.87 × Pr(re-id | acquaintance) ≤ 0.1
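The T2 arithmetic can be checked with a short Python sketch. The variable names are ours. Reading the exponent 150/2 as the number of women among roughly 150 acquaintances is an assumption, since the slide gives only the expression.

```python
births = 119_785
women = 4_478_500

# Chance that any one woman in the population gave birth in the year.
prevalence = round(births / women, 3)            # 0.027, as on the slide

# Assumed reading of the exponent: about 150 acquaintances, half of them women.
women_known = 150 / 2

# Chance that at least one of those women is in the registry.
pr_acquaintance = 1 - (1 - prevalence) ** women_known

# Largest Pr(re-id | acquaintance) that keeps the joint risk within the threshold.
threshold = 0.1
max_conditional_risk = threshold / pr_acquaintance

print(f"prevalence               = {prevalence}")
print(f"Pr(acquaintance)         = {pr_acquaintance:.2f}")
print(f"Pr(re-id | acquaintance) <= {max_conditional_risk:.3f}")
```

Because Pr(acquaintance) is close to 1, the bound on the data-set risk is only slightly looser than the 0.1 threshold itself.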

De-identifying the Data Set
Meeting Thresholds: k-anonymity
MDOB in 1-yy; BDOB in wk/yy; MPC of 1 char.
MDOB in 10-yy; BDOB in qtr/yy; MPC of 3 chars.
MDOB in 10-yy; BDOB in mm/yy; MPC of 3 chars.
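Checking a candidate generalization against a threshold comes down to counting how many records share each combination of quasi-identifier values. Below is a minimal Python sketch of that check. The field names and sample rows are hypothetical, and the average-risk formula (the mean of 1/class size over records) is the commonly used definition, not something stated on the slides.

```python
from collections import Counter


def equivalence_class_sizes(records: list, quasi_identifiers: list) -> list:
    """For each record, count the records sharing its quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]


def is_k_anonymous(sizes: list, k: int) -> bool:
    """True when every record sits in a class of at least k records."""
    return min(sizes) >= k


def max_risk(sizes: list) -> float:
    """Risk for the most exposed record: 1 / smallest class size."""
    return 1 / min(sizes)


def average_risk(sizes: list) -> float:
    """Mean of 1 / class size across all records."""
    return sum(1 / s for s in sizes) / len(sizes)


# Made-up generalized rows: decade of birth, quarter of birth, 3-char postal code.
rows = [
    {"mdob": "1970-1979", "bdob": "Q1/2008", "mpc": "K1A"},
    {"mdob": "1970-1979", "bdob": "Q1/2008", "mpc": "K1A"},
    {"mdob": "1970-1979", "bdob": "Q1/2008", "mpc": "K1A"},
    {"mdob": "1980-1989", "bdob": "Q2/2008", "mpc": "M5V"},
    {"mdob": "1980-1989", "bdob": "Q2/2008", "mpc": "M5V"},
]
sizes = equivalence_class_sizes(rows, ["mdob", "bdob", "mpc"])
```

Coarser generalizations merge classes, which raises the smallest class size and lowers both risk measures at the cost of precision.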

Year on Year: Re-using Risk Analyses
In 2006 Researcher Ronnie asks for 2005 (deleted).
In 2007 Researcher Ronnie asks for 2006 (deleted).
In 2008 Researcher Ronnie asks for 2007 (deleted).
In 2009 Researcher Ronnie asks for 2008 (deleted).
In 2010 Researcher Ronnie asks for 2009.
Can we use the same de-identification scheme every year?
BORN data pertains to very stable populations.
No dramatic changes in the number or characteristics of births from 2005-2010.
Revisit de-identification scheme every 18 to 24 months.
Revisit if any new quasi-identifiers are added or changed.

Longitudinal Discharge Abstract Data: State Inpatient Databases
Linking a patient’s records over time.
Need to be de-identified differently.

Meeting Thresholds: k-anonymity?

De-identifying Under Complete Knowledge

State Inpatient Database (SID) of California
Researcher Ronnie wants public data!

Measuring Risk Under Plausible Attacks
T1: Deliberate Attempt
T2: Inadvertent Attempt (“Holly Smokes, I know her!”)
T3: Data Breach (“data gone wild”)
T4: Public Data (demonstration attack)
Pr(re-id) ≤ 0.09 (maximum risk)

De-identifying the Data Set
BirthYear in 5-yy (cut at 1910-);
AdmissionYear unchanged;
DaysSinceLastService in 28-dd (cut at 7-, 182+);
LengthOfStay same as DaysSinceLastService.

Connected Variables
QI to QI
Similar QI?
Same generalization and suppression.
QI to non-QI
Non-QI is revealing?
Same suppression so both are removed.

Other Issues Regarding Longitudinal Data
Date shifting: maintaining order of records.
Long tails: truncation of records.
Adversary power: assumption of knowledge.
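One simple way to shift dates without disturbing the order of a patient's records is to apply a single random offset to all of that patient's dates. The Python sketch below illustrates that idea only; it is not necessarily the approach the presenters use, and the function name, the 30-day window, and the sample dates are assumptions.

```python
import random
from datetime import date, timedelta


def shift_dates(visits: list, rng: random.Random, max_shift_days: int = 30) -> list:
    """Shift every date of one patient by the same random number of days."""
    offset = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
    return [d + offset for d in visits]


# Made-up visit dates for a single patient, already in chronological order.
visits = [date(2009, 1, 5), date(2009, 1, 19), date(2009, 6, 1)]
shifted = shift_dates(visits, random.Random(7))
```

A shared offset keeps both the order and the gaps between visits. Shifting each date independently could reorder records that fall close together.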

Other Concerns to Think About
Free-form text: anonymization.
Geospatial information: aggregation and geoproxy risk.
Medical codes: generalization, suppression, shuffling (yes, as in cards).
Secure linking: linking data through encryption before anonymization.

Part 3 of Webcast: Questions and Answers

More Comments or Questions: Contact us!
Khaled El Emam: kelemam@privacyanalytics.ca
Luk Arbuckle: larbuckle@privacyanalytics.ca

Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptx
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of Powders
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 

O'Reilly Webcast: Anonymizing Health Data

  • 1. Anonymizing Health Data Webcast Case Studies and Methods to Get You Started Khaled El Emam & Luk Arbuckle
  • 2. Anonymizing Health Data Part 1 of Webcast: Intro and Methodology Part 2 of Webcast: A Look at Our Case Studies Part 3 of Webcast: Questions and Answers Khaled El Emam & Luk Arbuckle
  • 3. Anonymizing Health Data Part 1 of Webcast: Intro and Methodology Khaled El Emam & Luk Arbuckle
  • 4. Anonymizing Health Data To Anonymize or not to Anonymize Khaled El Emam & Luk Arbuckle
  • 5. Anonymizing Health Data Consent needs to be informed. To Anonymize or not to Anonymize Khaled El Emam & Luk Arbuckle
  • 6. Anonymizing Health Data Consent needs to be informed. Not all health care providers are willing to share their patients’ PHI. To Anonymize or not to Anonymize Khaled El Emam & Luk Arbuckle
  • 7. Anonymizing Health Data Consent needs to be informed. Not all health care providers are willing to share their patients’ PHI. Anonymization allows for the sharing of health information. To Anonymize or not to Anonymize Khaled El Emam & Luk Arbuckle
  • 8. Anonymizing Health Data Consent needs to be informed. Not all health care providers are willing to share their patients’ PHI. Anonymization allows for the sharing of health information. To Anonymize or not to Anonymize Compelling financial case. Breach cost ~$200 per patient. Khaled El Emam & Luk Arbuckle
  • 9. Anonymizing Health Data Consent needs to be informed. Not all health care providers are willing to share their patients’ PHI. Anonymization allows for the sharing of health information. To Anonymize or not to Anonymize Compelling financial case. Breach cost ~$200 per patient. Khaled El Emam & Luk Arbuckle
  • 10. Anonymizing Health Data Consent needs to be informed. Not all health care providers are willing to share their patients’ PHI. Anonymization allows for the sharing of health information. To Anonymize or not to Anonymize Privacy protective behaviors by patients. Compelling financial case. Breach cost ~$200 per patient. Khaled El Emam & Luk Arbuckle
  • 11. Anonymizing Health Data Masking Standards Khaled El Emam & Luk Arbuckle
  • 12. Anonymizing Health Data Masking Standards First name, last name, SSN. Khaled El Emam & Luk Arbuckle
  • 13. Anonymizing Health Data Masking Standards Distortion of data—no analytics. First name, last name, SSN. Khaled El Emam & Luk Arbuckle
  • 14. Anonymizing Health Data Masking Standards Creating pseudonyms. First name, last name, SSN. Distortion of data—no analytics. Khaled El Emam & Luk Arbuckle
  • 15. Anonymizing Health Data Masking Standards Removing a whole field. Creating pseudonyms. First name, last name, SSN. Distortion of data—no analytics. Khaled El Emam & Luk Arbuckle
  • 16. Anonymizing Health Data Masking Standards Removing a whole field. Creating pseudonyms. Replacing actual values with random ones. First name, last name, SSN. Distortion of data—no analytics. Khaled El Emam & Luk Arbuckle
  • 17. Anonymizing Health Data De-identification Standards Khaled El Emam & Luk Arbuckle
  • 18. Anonymizing Health Data De-identification Standards Age, sex, race, address, income. Khaled El Emam & Luk Arbuckle
  • 19. Anonymizing Health Data Minimal distortion of data—for analytics. Age, sex, race, address, income. De-identification Standards Khaled El Emam & Luk Arbuckle
  • 20. Anonymizing Health Data Minimal distortion of data—for analytics. Age, sex, race, address, income. De-identification Standards Safe Harbor in HIPAA Privacy Rule. Khaled El Emam & Luk Arbuckle
  • 21. Anonymizing Health Data What’s “Actual Knowledge”? Privacy Rule Safe Harbor Khaled El Emam & Luk Arbuckle
  • 22. Anonymizing Health Data What’s “Actual Knowledge”? Info, alone or in combo, that could identify an individual. Khaled El Emam & Luk Arbuckle
  • 23. Anonymizing Health Data What’s “Actual Knowledge”? Info, alone or in combo, that could identify an individual. Has to be specific to the data set—not theoretical. Khaled El Emam & Luk Arbuckle
  • 24. Anonymizing Health Data What’s “Actual Knowledge”? Info, alone or in combo, that could identify an individual. Has to be specific to the data set—not theoretical. Occupation Mayor of Gotham. Khaled El Emam & Luk Arbuckle
  • 25. Anonymizing Health Data Heuristics, or rules of thumb. Minimal distortion of data—for analytics. Age, sex, race, address, income. Safe Harbor in HIPAA Privacy Rule. De-identification Standards Khaled El Emam & Luk Arbuckle
  • 26. Anonymizing Health Data Heuristics, or rules of thumb. Statistical method in HIPAA Privacy Rule. Minimal distortion of data—for analytics. Age, sex, race, address, income. Safe Harbor in HIPAA Privacy Rule. De-identification Standards Khaled El Emam & Luk Arbuckle
  • 27. Anonymizing Health Data De-identification Myths Khaled El Emam & Luk Arbuckle
  • 28. Anonymizing Health Data De-identification Myths Myth: It’s possible to re-identify most, if not all, data. Khaled El Emam & Luk Arbuckle
  • 29. Anonymizing Health Data De-identification Myths Myth: It’s possible to re-identify most, if not all, data. Using robust methods, evidence suggests risk can be very small. Khaled El Emam & Luk Arbuckle
  • 30. Anonymizing Health Data De-identification Myths Myth: It’s possible to re-identify most, if not all, data. Myth: Genomic sequences are not identifiable, or are easy to re-identify. Using robust methods, evidence suggests risk can be very small. Khaled El Emam & Luk Arbuckle
  • 31. Anonymizing Health Data De-identification Myths Myth: It’s possible to re-identify most, if not all, data. Myth: Genomic sequences are not identifiable, or are easy to re-identify. In some cases can re-identify, difficult to de-identify using our methods. Using robust methods, evidence suggests risk can be very small. Khaled El Emam & Luk Arbuckle
  • 32. Anonymizing Health Data A Risk-based De-identification Methodology Khaled El Emam & Luk Arbuckle
  • 33. Anonymizing Health Data A Risk-based De-identification Methodology The risk of re-identification can be quantified. Khaled El Emam & Luk Arbuckle
  • 34. Anonymizing Health Data A Risk-based De-identification Methodology The risk of re-identification can be quantified. The Goldilocks principle: balancing privacy with data utility. Khaled El Emam & Luk Arbuckle
  • 35. Anonymizing Health Data Khaled El Emam & Luk Arbuckle
  • 36. Anonymizing Health Data A Risk-based De-identification Methodology The risk of re-identification can be quantified. The Goldilocks principle: balancing privacy with data utility. The re-identification risk needs to be very small. Khaled El Emam & Luk Arbuckle
  • 37. Anonymizing Health Data A Risk-based De-identification Methodology The risk of re-identification can be quantified. The Goldilocks principle: balancing privacy with data utility. De-identification involves a mix of technical, contractual, and other measures. The re-identification risk needs to be very small. Khaled El Emam & Luk Arbuckle
  • 38. Anonymizing Health Data Steps in the De-identification Methodology Step 1: Select Direct and Indirect Identifiers Step 2: Setting the Threshold Step 3: Examining Plausible Attacks Step 4: De-identifying the Data Step 5: Documenting the Process Khaled El Emam & Luk Arbuckle
  • 39. Anonymizing Health Data Step 1: Select Direct and Indirect Identifiers Khaled El Emam & Luk Arbuckle
  • 40. Anonymizing Health Data Direct identifiers: name, telephone number, health insurance card number, medical record number. Step 1: Select Direct and Indirect Identifiers Khaled El Emam & Luk Arbuckle
  • 41. Anonymizing Health Data Direct identifiers: name, telephone number, health insurance card number, medical record number. Indirect identifiers, or quasi-identifiers: sex, date of birth, ethnicity, locations, event dates, medical codes. Step 1: Select Direct and Indirect Identifiers Khaled El Emam & Luk Arbuckle
  • 42. Anonymizing Health Data Step 2: Setting the Threshold Khaled El Emam & Luk Arbuckle
  • 43. Anonymizing Health Data Maximum acceptable risk for sharing data. Step 2: Setting the Threshold Khaled El Emam & Luk Arbuckle
  • 44. Anonymizing Health Data Maximum acceptable risk for sharing data. Needs to be quantitative and defensible. Step 2: Setting the Threshold Khaled El Emam & Luk Arbuckle
  • 45. Anonymizing Health Data Maximum acceptable risk for sharing data. Needs to be quantitative and defensible. Is the data going to be in the public domain? Step 2: Setting the Threshold Khaled El Emam & Luk Arbuckle
  • 46. Anonymizing Health Data Maximum acceptable risk for sharing data. Needs to be quantitative and defensible. Is the data going to be in the public domain? Extent of invasion-of-privacy when data was shared? Step 2: Setting the Threshold Khaled El Emam & Luk Arbuckle
  • 47. Anonymizing Health Data Step 3: Examining Plausible Attacks Khaled El Emam & Luk Arbuckle
  • 48. Anonymizing Health Data Recipient deliberately attempts to re-identify the data. Step 3: Examining Plausible Attacks Khaled El Emam & Luk Arbuckle
  • 49. Anonymizing Health Data Recipient deliberately attempts to re-identify the data. Recipient inadvertently re-identifies the data. “Holly Smokes, I know her!” Step 3: Examining Plausible Attacks Khaled El Emam & Luk Arbuckle
  • 50. Anonymizing Health Data Recipient deliberately attempts to re-identify the data. Recipient inadvertently re-identifies the data. Data breach at recipient’s site, “data gone wild”. Step 3: Examining Plausible Attacks Khaled El Emam & Luk Arbuckle
  • 51. Anonymizing Health Data Recipient deliberately attempts to re-identify the data. Recipient inadvertently re-identifies the data. Data breach at recipient’s site, “data gone wild”. Adversary launches a demonstration attack on the data. Step 3: Examining Plausible Attacks Khaled El Emam & Luk Arbuckle
  • 52. Anonymizing Health Data Step 4: De-identifying the Data Khaled El Emam & Luk Arbuckle
  • 53. Anonymizing Health Data Step 4: De-identifying the Data Generalization: reducing the precision of a field. Dates converted to month/year, or year. Khaled El Emam & Luk Arbuckle
  • 54. Anonymizing Health Data Step 4: De-identifying the Data Generalization: reducing the precision of a field. Suppression: replacing a cell with NULL. Unique 55-year-old female in birth registry. Khaled El Emam & Luk Arbuckle
  • 55. Anonymizing Health Data Step 4: De-identifying the Data Generalization: reducing the precision of a field. Suppression: replacing a cell with NULL. Sub-sampling: releasing a simple random sample. 50% of data set instead of all data. Khaled El Emam & Luk Arbuckle
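The three Step 4 operations named on slides 53 to 55 can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the function names, the record layout, and the fixed seed are all assumptions made for the example.

```python
import random
from datetime import date


def generalize_date(d: date, level: str = "month") -> str:
    """Generalization: reduce a date to month/year or to year only."""
    if level == "month":
        return f"{d.month:02d}/{d.year}"
    if level == "year":
        return str(d.year)
    raise ValueError(f"unknown level: {level}")


def suppress(record: dict, field: str) -> dict:
    """Suppression: return a copy of the record with one cell set to None (NULL)."""
    out = dict(record)
    out[field] = None
    return out


def subsample(records: list, fraction: float, seed: int = 0) -> list:
    """Sub-sampling: release a simple random sample of the records."""
    rng = random.Random(seed)  # hypothetical fixed seed, for repeatability only
    return rng.sample(records, round(len(records) * fraction))


# Hypothetical record: an unusual value that might be suppressed.
record = {"age": 55, "sex": "F", "delivery": date(2011, 3, 17)}
print(generalize_date(record["delivery"]))          # month/year
print(generalize_date(record["delivery"], "year"))  # year only
print(suppress(record, "age"))
print(len(subsample(list(range(10)), 0.5)))
```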
  • 56. Anonymizing Health Data Step 5: Documenting the Process Khaled El Emam & Luk Arbuckle
  • 57. Anonymizing Health Data Step 5: Documenting the Process Process documentation—a methodology text. Khaled El Emam & Luk Arbuckle
  • 58. Anonymizing Health Data Step 5: Documenting the Process Results documentation—data set, risk thresholds, assumptions, evidence of low risk. Khaled El Emam & Luk Arbuckle Process documentation—a methodology text.
  • 59. Anonymizing Health Data Measuring Risk Under Plausible Attacks Khaled El Emam & Luk Arbuckle
  • 60. Anonymizing Health Data Measuring Risk Under Plausible Attacks T1: Deliberate Attempt Pr(re-id, attempt) = Pr(attempt) × Pr(re-id | attempt) Khaled El Emam & Luk Arbuckle
  • 61. Anonymizing Health Data Measuring Risk Under Plausible Attacks T1: Deliberate Attempt T2: Inadvertent Attempt (“Holly Smokes, I know her!”) Pr(re-id, acquaintance) = Pr(acquaintance) × Pr(re-id | acquaintance) Khaled El Emam & Luk Arbuckle
  • 62. Anonymizing Health Data Measuring Risk Under Plausible Attacks T1: Deliberate Attempt T2: Inadvertent Attempt (“Holly Smokes, I know her!”) T3: Data Breach (“data gone wild”) Pr(re-id, breach) = Pr(breach) × Pr(re-id | breach) Khaled El Emam & Luk Arbuckle
  • 63. Anonymizing Health Data Measuring Risk Under Plausible Attacks T1: Deliberate Attempt T2: Inadvertent Attempt (“Holly Smokes, I know her!”) T3: Data Breach (“data gone wild”) T4: Public Data (demonstration attack) Pr(re-id), based on data set only Khaled El Emam & Luk Arbuckle
  • 64. Anonymizing Health Data Choosing Thresholds Khaled El Emam & Luk Arbuckle
  • 65. Anonymizing Health Data Choosing Thresholds Khaled El Emam & Luk Arbuckle Many precedents going back multiple decades.
  • 66. Anonymizing Health Data Choosing Thresholds Khaled El Emam & Luk Arbuckle Many precedents going back multiple decades. Recommended by regulators.
  • 67. Anonymizing Health Data Choosing Thresholds Khaled El Emam & Luk Arbuckle Many precedents going back multiple decades. Recommended by regulators. All based on max risk though.
  • 68. Anonymizing Health Data Choosing Thresholds Khaled El Emam & Luk Arbuckle Many precedents going back multiple decades. Recommended by regulators. All based on max risk though.
  • 69. Anonymizing Health Data Part 2 of Webcast: A Look at Our Case Studies Khaled El Emam & Luk Arbuckle
  • 70. Anonymizing Health Data Cross Sectional Data: Research Registries Khaled El Emam & Luk Arbuckle
  • 71. Anonymizing Health Data Cross Sectional Data: Research Registries Khaled El Emam & Luk Arbuckle Better Outcomes Registry & Network (BORN) of Ontario
  • 72. Anonymizing Health Data Cross Sectional Data: Research Registries Khaled El Emam & Luk Arbuckle Better Outcomes Registry & Network (BORN) of Ontario 140,000 births per year.
  • 73. Anonymizing Health Data Cross Sectional Data: Research Registries Khaled El Emam & Luk Arbuckle Better Outcomes Registry & Network (BORN) of Ontario 140,000 births per year. Cross-sectional—mothers not traced over time.
  • 74. Anonymizing Health Data Cross Sectional Data: Research Registries Khaled El Emam & Luk Arbuckle Better Outcomes Registry & Network (BORN) of Ontario 140,000 births per year. Cross-sectional—mothers not traced over time. Process of getting de-identified data from a research registry.
  • 75. Anonymizing Health Data Cross Sectional Data: Research Registries Khaled El Emam & Luk Arbuckle Better Outcomes Registry & Network (BORN) of Ontario 140,000 births per year. Cross-sectional—mothers not traced over time. Process of getting de-identified data from a research registry.
  • 76. Anonymizing Health Data Researcher Ronnie wants data! Khaled El Emam & Luk Arbuckle
  • 77. Anonymizing Health Data Researcher Ronnie wants data! Khaled El Emam & Luk Arbuckle 919,710 records from 2005-2011
  • 78. Anonymizing Health Data Researcher Ronnie wants data! Khaled El Emam & Luk Arbuckle 919,710 records from 2005-2011
  • 79. Anonymizing Health Data Choosing Thresholds Khaled El Emam & Luk Arbuckle
  • 80. Anonymizing Health Data Choosing Thresholds Khaled El Emam & Luk Arbuckle Average risk of 0.1 for Researcher Ronnie (and the data he specifically requested).
  • 81. Anonymizing Health Data Choosing Thresholds Khaled El Emam & Luk Arbuckle 0.05 if there were highly sensitive variables (congenital anomalies, mental health problems). Average risk of 0.1 for Researcher Ronnie
  • 82. Anonymizing Health Data Measuring Risk Under Plausible Attacks Khaled El Emam & Luk Arbuckle
  • 83. Anonymizing Health Data Measuring Risk Under Plausible Attacks T1: Deliberate Attempt Low motives and capacity Khaled El Emam & Luk Arbuckle
  • 84. Anonymizing Health Data Measuring Risk Under Plausible Attacks T1: Deliberate Attempt Low motives and capacity; low mitigating controls. Khaled El Emam & Luk Arbuckle
  • 85. Anonymizing Health Data Measuring Risk Under Plausible Attacks T1: Deliberate Attempt Pr(attempt) = 0.4 Khaled El Emam & Luk Arbuckle
  • 86. Anonymizing Health Data Measuring Risk Under Plausible Attacks T1: Deliberate Attempt T2: Inadvertent Attempt (“Holly Smokes, I know her!”) 119,785 births out of 4,478,500 women (≈ 0.027) Khaled El Emam & Luk Arbuckle
  • 87. Anonymizing Health Data Measuring Risk Under Plausible Attacks T1: Deliberate Attempt T2: Inadvertent Attempt (“Holly Smokes, I know her!”) Pr(acquaintance) = 1 − (1 − 0.027)^(150/2) ≈ 0.87 Khaled El Emam & Luk Arbuckle
  • 88. Anonymizing Health Data Measuring Risk Under Plausible Attacks T1: Deliberate Attempt T2: Inadvertent Attempt (“Holly Smokes, I know her!”) T3: Data Breach (“data gone wild”) Based on historical data. Khaled El Emam & Luk Arbuckle
  • 89. Anonymizing Health Data Measuring Risk Under Plausible Attacks T1: Deliberate Attempt T2: Inadvertent Attempt (“Holly Smokes, I know her!”) T3: Data Breach (“data gone wild”) Pr(breach) = 0.27 Khaled El Emam & Luk Arbuckle
  • 90. Anonymizing Health Data Measuring Risk Under Plausible Attacks T1: Deliberate Attempt T2: Inadvertent Attempt (“Holly Smokes, I know her!”) T3: Data Breach (“data gone wild”) T4: Public Data (demonstration attack) Khaled El Emam & Luk Arbuckle
  • 91. Anonymizing Health Data Measuring Risk Under Plausible Attacks T1: Deliberate Attempt T2: Inadvertent Attempt (“Holly Smokes, I know her!”) T3: Data Breach (“data gone wild”) Overall risk Pr(re-id, T) = Pr(T) × Pr(re-id | T) ≤ 0.1 Khaled El Emam & Luk Arbuckle
  • 92. Anonymizing Health Data Measuring Risk Under Plausible Attacks T2: Inadvertent Attempt (“Holly Smokes, I know her!”) Pr(acquaintance) = 1 − (1 − 0.027)^(150/2) ≈ 0.87 Overall risk Pr(re-id, acquaintance) = 0.87 × Pr(re-id | acquaintance) ≤ 0.1 Khaled El Emam & Luk Arbuckle
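The arithmetic on slides 85 to 92 can be reproduced with a short Python sketch. It assumes the exponent 150/2 stands for roughly 150 acquaintances of whom half are women; the slides do not say this, so treat that reading as an assumption. The function names are hypothetical.

```python
def pr_acquaintance(prevalence: float, acquaintances: int = 150) -> float:
    """Probability of knowing at least one person in the data set.

    Assumes half of the acquaintances are women (hence the division by 2).
    """
    return 1 - (1 - prevalence) ** (acquaintances / 2)


def conditional_bound(threshold: float, pr_attack: float) -> float:
    """Largest Pr(re-id | T) allowed so that Pr(T) * Pr(re-id | T) <= threshold."""
    return threshold / pr_attack


threshold = 0.1  # average-risk threshold from the slides
pr_attack = {
    "T1 deliberate attempt": 0.4,
    "T2 inadvertent attempt": pr_acquaintance(0.027),
    "T3 data breach": 0.27,
}

for name, p in pr_attack.items():
    print(f"{name}: Pr(T) = {p:.2f}, Pr(re-id | T) must be <= {conditional_bound(threshold, p):.3f}")

# The most likely attack gives the strictest bound on the data itself.
strictest = min(conditional_bound(threshold, p) for p in pr_attack.values())
print(f"strictest bound: {strictest:.3f}")
```

With these inputs the inadvertent attempt dominates, which matches slide 92 singling out T2 for the overall risk calculation.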
  • 93. Anonymizing Health Data De-identifying the Data Set Khaled El Emam & Luk Arbuckle
  • 94. Anonymizing Health Data Meeting Thresholds: k-anonymity Khaled El Emam & Luk Arbuckle k
  • 95. Anonymizing Health Data Meeting Thresholds: k-anonymity Khaled El Emam & Luk Arbuckle
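A minimal sketch of how k-anonymity relates to the risk thresholds, assuming the common model in which a record's re-identification risk is 1 divided by the size of its equivalence class (the group of records sharing the same quasi-identifier values). The field names and records below are hypothetical.

```python
from collections import Counter


def equivalence_classes(records: list, quasi_identifiers: list) -> Counter:
    """Count the records sharing each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def max_risk(classes: Counter) -> float:
    """Maximum risk: 1 / size of the smallest equivalence class."""
    return 1 / min(classes.values())


def average_risk(classes: Counter) -> float:
    """Average of 1/class-size over records, i.e. classes / records."""
    return len(classes) / sum(classes.values())


def is_k_anonymous(classes: Counter, k: int) -> bool:
    """True if every equivalence class holds at least k records."""
    return min(classes.values()) >= k


# Hypothetical generalized records: sex and decade of birth.
records = (
    [{"sex": "F", "birth_decade": "1980s"}] * 3
    + [{"sex": "M", "birth_decade": "1970s"}] * 2
)
classes = equivalence_classes(records, ["sex", "birth_decade"])
print(max_risk(classes), average_risk(classes), is_k_anonymous(classes, 2))
```

Under this model a maximum-risk threshold of 1/k is met exactly when the data set is k-anonymous, while an average-risk threshold can be met with some smaller classes present.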
  • 96. Anonymizing Health Data De-identifying the Data Set Khaled El Emam & Luk Arbuckle MDOB in 1-yy; BDOB in wk/yy; MPC of 1 char.
  • 97. Anonymizing Health Data De-identifying the Data Set Khaled El Emam & Luk Arbuckle MDOB in 1-yy; BDOB in wk/yy; MPC of 1 char. MDOB in 10-yy; BDOB in qtr/yy; MPC of 3 chars.
  • 98. Anonymizing Health Data De-identifying the Data Set Khaled El Emam & Luk Arbuckle MDOB in 1-yy; BDOB in wk/yy; MPC of 1 char. MDOB in 10-yy; BDOB in qtr/yy; MPC of 3 chars. MDOB in 10-yy; BDOB in mm/yy; MPC of 3 chars.
  • 99. Anonymizing Health Data Year on Year: Re-using Risk Analyses Khaled El Emam & Luk Arbuckle
  • 100. Anonymizing Health Data Year on Year: Re-using Risk Analyses Khaled El Emam & Luk Arbuckle In 2006 Researcher Ronnie asks for 2005.
  • 101. Anonymizing Health Data Year on Year: Re-using Risk Analyses Khaled El Emam & Luk Arbuckle In 2006 Researcher Ronnie asks for 2005—deleted. In 2007 Researcher Ronnie asks for 2006.
  • 102. Anonymizing Health Data Year on Year: Re-using Risk Analyses Khaled El Emam & Luk Arbuckle In 2006 Researcher Ronnie asks for 2005. In 2007 Researcher Ronnie asks for 2006—deleted. In 2008 Researcher Ronnie asks for 2007.
  • 103. Anonymizing Health Data Year on Year: Re-using Risk Analyses Khaled El Emam & Luk Arbuckle In 2006 Researcher Ronnie asks for 2005. In 2007 Researcher Ronnie asks for 2006. In 2008 Researcher Ronnie asks for 2007—deleted. In 2009 Researcher Ronnie asks for 2008.
  • 104. Anonymizing Health Data Year on Year: Re-using Risk Analyses Khaled El Emam & Luk Arbuckle In 2006 Researcher Ronnie asks for 2005. In 2007 Researcher Ronnie asks for 2006. In 2008 Researcher Ronnie asks for 2007. In 2009 Researcher Ronnie asks for 2008—deleted. In 2010 Researcher Ronnie asks for 2009.
  • 105. Anonymizing Health Data Year on Year: Re-using Risk Analyses Khaled El Emam & Luk Arbuckle In 2006 Researcher Ronnie asks for 2005. In 2007 Researcher Ronnie asks for 2006. In 2008 Researcher Ronnie asks for 2007. In 2009 Researcher Ronnie asks for 2008—deleted. In 2010 Researcher Ronnie asks for 2009. Can we use the same de-identification scheme every year?
  • 106. Anonymizing Health Data Khaled El Emam & Luk Arbuckle
  • 107. Anonymizing Health Data Khaled El Emam & Luk Arbuckle
  • 108. Anonymizing Health Data Year on Year: Re-using Risk Analyses Khaled El Emam & Luk Arbuckle BORN data pertains to very stable populations.
  • 109. Anonymizing Health Data Year on Year: Re-using Risk Analyses Khaled El Emam & Luk Arbuckle BORN data pertains to very stable populations. No dramatic changes in the number or characteristics of births from 2005-2010.
  • 110. Anonymizing Health Data Year on Year: Re-using Risk Analyses Khaled El Emam & Luk Arbuckle BORN data pertains to very stable populations. No dramatic changes in the number or characteristics of births from 2005-2010. Revisit de-identification scheme every 18 to 24 months.
  • 111. Anonymizing Health Data Year on Year: Re-using Risk Analyses Khaled El Emam & Luk Arbuckle BORN data pertains to very stable populations. No dramatic changes in the number or characteristics of births from 2005-2010. Revisit de-identification scheme every 18 to 24 months. Revisit if any new quasi-identifiers are added or changed.
  • 112. Anonymizing Health Data Longitudinal Discharge Abstract Data: State Inpatient Databases Khaled El Emam & Luk Arbuckle
  • 113. Anonymizing Health Data Longitudinal Discharge Abstract Data: State Inpatient Databases Khaled El Emam & Luk Arbuckle Linking a patient’s records over time.
  • 114. Anonymizing Health Data Longitudinal Discharge Abstract Data: State Inpatient Databases Khaled El Emam & Luk Arbuckle Linking a patient’s records over time. Need to be de-identified differently.
  • 115. Anonymizing Health Data Meeting Thresholds: k-anonymity? Khaled El Emam & Luk Arbuckle k?
  • 116. Anonymizing Health Data Meeting Thresholds: k-anonymity? Khaled El Emam & Luk Arbuckle
  • 117. Anonymizing Health Data Meeting Thresholds: k-anonymity? Khaled El Emam & Luk Arbuckle
  • 118. Anonymizing Health Data De-identifying Under Complete Knowledge Khaled El Emam & Luk Arbuckle
  • 119. Anonymizing Health Data De-identifying Under Complete Knowledge Khaled El Emam & Luk Arbuckle
  • 120. Anonymizing Health Data De-identifying Under Complete Knowledge Khaled El Emam & Luk Arbuckle
  • 121. Anonymizing Health Data De-identifying Under Complete Knowledge Khaled El Emam & Luk Arbuckle
  • 122. Anonymizing Health Data State Inpatient Database (SID) of California Khaled El Emam & Luk Arbuckle
  • 123. Anonymizing Health Data State Inpatient Database (SID) of California Khaled El Emam & Luk Arbuckle Researcher Ronnie wants public data!
  • 124. Anonymizing Health Data State Inpatient Database (SID) of California Khaled El Emam & Luk Arbuckle Researcher Ronnie wants public data!
  • 125. Anonymizing Health Data State Inpatient Database (SID) of California Khaled El Emam & Luk Arbuckle
  • 126. Anonymizing Health Data Measuring Risk Under Plausible Attacks Khaled El Emam & Luk Arbuckle
  • 127. Anonymizing Health Data Measuring Risk Under Plausible Attacks T1: Deliberate Attempt T2: Inadvertent Attempt (“Holly Smokes, I know her!”) T3: Data Breach (“data gone wild”) T4: Public Data (demonstration attack) Pr(re-id) ≤ 0.09 (maximum risk) Khaled El Emam & Luk Arbuckle
  • 128. Anonymizing Health Data De-identifying the Data Set Khaled El Emam & Luk Arbuckle
  • 129. Anonymizing Health Data De-identifying the Data Set Khaled El Emam & Luk Arbuckle BirthYear in 5-yy (cut at 1910-); AdmissionYear unchanged; DaysSinceLastService in 28-dd (cut at 7-, 182+); LengthOfStay same as DaysSinceLastService.
  • 130. Anonymizing Health Data De-identifying the Data Set Khaled El Emam & Luk Arbuckle BirthYear in 5-yy (cut at 1910-); AdmissionYear unchanged; DaysSinceLastService in 28-dd (cut at 7-, 182+); LengthOfStay same as DaysSinceLastService.
  • 131. Anonymizing Health Data Connected Variables Khaled El Emam & Luk Arbuckle
  • 132. Anonymizing Health Data Connected Variables Khaled El Emam & Luk Arbuckle QI to QI
  • 133. Anonymizing Health Data Connected Variables Khaled El Emam & Luk Arbuckle QI to QI Similar QI? Same generalization and suppression.
  • 134. Anonymizing Health Data Connected Variables Khaled El Emam & Luk Arbuckle QI to QI Similar QI? Same generalization and suppression. QI to non-QI
  • 135. Anonymizing Health Data Connected Variables Khaled El Emam & Luk Arbuckle QI to QI Similar QI? Same generalization and suppression. QI to non-QI Non-QI is revealing? Same suppression so both are removed.
  • 136. Anonymizing Health Data Other Issues Regarding Longitudinal Data Khaled El Emam & Luk Arbuckle
  • 137. Anonymizing Health Data Other Issues Regarding Longitudinal Data Khaled El Emam & Luk Arbuckle Date shifting—maintaining order of records.
  • 138. Anonymizing Health Data Other Issues Regarding Longitudinal Data Khaled El Emam & Luk Arbuckle Date shifting—maintaining order of records. Long tails—truncation of records.
  • 139. Anonymizing Health Data Other Issues Regarding Longitudinal Data Khaled El Emam & Luk Arbuckle Date shifting—maintaining order of records. Long tails—truncation of records. Adversary power—assumption of knowledge.
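One simple way to shift dates while keeping a patient's records in order is to apply a single random offset per patient. The sketch below shows that variant only; it is an assumption for illustration and may differ from the scheme the presenters use. The function name, the 30-day window, and the sample visits are hypothetical.

```python
import random
from datetime import date, timedelta


def shift_dates(visits: list, max_shift_days: int = 30, seed=None) -> list:
    """Shift every date of a patient by the same random offset.

    Because the offset is constant within a patient, the order of that
    patient's records and the gaps between them are unchanged.
    """
    rng = random.Random(seed)
    offsets = {}
    shifted = []
    for patient_id, d in visits:
        if patient_id not in offsets:
            days = rng.randint(-max_shift_days, max_shift_days)
            offsets[patient_id] = timedelta(days=days)
        shifted.append((patient_id, d + offsets[patient_id]))
    return shifted


visits = [
    ("p1", date(2010, 1, 5)),
    ("p1", date(2010, 2, 1)),
    ("p2", date(2010, 1, 20)),
]
for patient_id, d in shift_dates(visits, seed=1):
    print(patient_id, d.isoformat())
```

Note that a constant offset preserves exact intervals between visits, which can themselves be identifying; that is one reason a real scheme may combine shifting with generalization of intervals.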
  • 140. Anonymizing Health Data Other Concerns to Think About Khaled El Emam & Luk Arbuckle
  • 141. Anonymizing Health Data Other Concerns to Think About Khaled El Emam & Luk Arbuckle Free-form text—anonymization.
  • 142. Anonymizing Health Data Other Concerns to Think About Khaled El Emam & Luk Arbuckle Free-form text—anonymization. Geospatial information—aggregation and geoproxy risk.
  • 143. Anonymizing Health Data Other Concerns to Think About Khaled El Emam & Luk Arbuckle Free-form text—anonymization. Geospatial information—aggregation and geoproxy risk. Medical codes—generalization, suppression, shuffling (yes, as in cards).
  • 144. Anonymizing Health Data Other Concerns to Think About Khaled El Emam & Luk Arbuckle Free-form text—anonymization. Geospatial information—aggregation and geoproxy risk. Medical codes—generalization, suppression, shuffling (yes, as in cards). Secure linking—linking data through encryption before anonymization.
  • 145. Anonymizing Health Data Part 3 of Webcast: Questions and Answers Khaled El Emam & Luk Arbuckle
  • 146. Anonymizing Health Data Khaled El Emam & Luk Arbuckle More Comments or Questions: Contact us!
  • 147. Anonymizing Health Data Khaled El Emam & Luk Arbuckle Khaled El Emam: kelemam@privacyanalytics.ca Luk Arbuckle: larbuckle@privacyanalytics.ca More Comments or Questions: Contact us!