Authors: Khaled El Emam, Luk Arbuckle
How can health data be released to analysts and app developers who desperately want it? Under current legislation, the use and disclosure of health data for secondary purposes is limited: either patients must consent to have their data used, which is often difficult to obtain and can lead to bias, or the data must be de-identified. (There are some exceptions, but we won't address them in this webcast.)
To ensure that end users get data that is anonymized and highly useful, we focus on the HIPAA Privacy Rule De-identification Standard. We've built our risk-based methodology for anonymizing data around the foundation created by HIPAA's Statistical Method. In this webcast we'll share several of the case studies that we've described in our O'Reilly book Anonymizing Health Data, which is devoted to examples of how we anonymized real-world data sets. In almost every case in which we've anonymized data, there have been new and interesting challenges to overcome.
2. Anonymizing Health Data
Part 1 of Webcast: Intro and Methodology
Part 2 of Webcast: A Look at Our Case Studies
Part 3 of Webcast: Questions and Answers
Khaled El Emam & Luk Arbuckle
10. Anonymizing Health Data
Consent needs to be informed.
Not all health care providers are willing to
share their patients’ PHI.
Anonymization allows for the sharing of health information.
To Anonymize or not to Anonymize
Privacy-protective behaviors by patients.
Compelling financial case. Breach cost ~$200 per patient.
Khaled El Emam & Luk Arbuckle
16. Anonymizing Health Data
Masking Standards
Removing a whole field.
Creating pseudonyms.
Replacing actual values with random ones.
First name, last name, SSN.
Distortion of data—no analytics.
Khaled El Emam & Luk Arbuckle
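The masking operations listed above (removing a field, creating pseudonyms, replacing values with random ones) can be sketched in a few lines of Python. This is only an illustration: the field names, the keyed-hash pseudonym, and the list of replacement names are assumptions made for the example, not the authors' implementation.

```python
import hashlib
import hmac
import random

# Hypothetical replacement values; a real tool would draw from a larger list.
FAKE_LAST_NAMES = ["Rivera", "Chen", "Okafor", "Novak"]


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable pseudonym: the same value and key give the same token."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def mask_record(record: dict, secret_key: bytes, rng: random.Random) -> dict:
    """Mask the direct identifiers of one record (field names are assumed)."""
    masked = dict(record)                       # leave the input untouched
    masked.pop("first_name", None)              # removing a whole field
    masked["ssn"] = pseudonymize(record["ssn"], secret_key)  # creating a pseudonym
    masked["last_name"] = rng.choice(FAKE_LAST_NAMES)        # random replacement
    return masked
```

Because masking distorts the fields it touches, masked fields are not meant to be analyzed; the other fields in the record pass through unchanged.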
24. Anonymizing Health Data
What’s “Actual Knowledge”?
Info, alone or in combo, that could identify
an individual.
Has to be specific to the data set—not
theoretical.
Example: occupation is Mayor of Gotham.
Khaled El Emam & Luk Arbuckle
26. Anonymizing Health Data
Heuristics, or rules of thumb.
Statistical method in HIPAA Privacy Rule.
Minimal distortion of data—for analytics.
Age, sex, race, address, income.
Safe Harbor in HIPAA Privacy Rule.
De-identification Standards
Khaled El Emam & Luk Arbuckle
31. Anonymizing Health Data
De-identification Myths
Myth: It’s possible to re-identify most, if not
all, data.
Myth: Genomic sequences are not
identifiable, or are easy to re-identify.
In some cases they can be re-identified; they are difficult to
de-identify using our methods.
Using robust methods, evidence suggests risk
can be very small.
Khaled El Emam & Luk Arbuckle
37. Anonymizing Health Data
A Risk-based De-identification Methodology
The risk of re-identification can be quantified.
The Goldilocks principle:
balancing privacy with data utility.
De-identification involves a mix of technical, contractual,
and other measures.
The re-identification risk needs to be very small.
Khaled El Emam & Luk Arbuckle
38. Anonymizing Health Data
Steps in the De-identification Methodology
Step 1: Select Direct and Indirect Identifiers
Step 2: Setting the Threshold
Step 3: Examining Plausible Attacks
Step 4: De-identifying the Data
Step 5: Documenting the Process
Khaled El Emam & Luk Arbuckle
41. Anonymizing Health Data
Direct identifiers: name, telephone number, health
insurance card number, medical record number.
Indirect identifiers, or quasi-identifiers: sex, date of birth,
ethnicity, locations, event dates, medical codes.
Step 1: Select Direct and Indirect Identifiers
Khaled El Emam & Luk Arbuckle
46. Anonymizing Health Data
Maximum acceptable risk for sharing data.
Needs to be quantitative and defensible.
Is the data going to be in the public domain?
Extent of invasion of privacy when the data was shared?
Step 2: Setting the Threshold
Khaled El Emam & Luk Arbuckle
49. Anonymizing Health Data
Recipient deliberately attempts to re-identify the data.
Recipient inadvertently re-identifies the data.
“Holly Smokes, I know her!”
Step 3: Examining Plausible Attacks
Khaled El Emam & Luk Arbuckle
51. Anonymizing Health Data
Recipient deliberately attempts to re-identify the data.
Recipient inadvertently re-identifies the data.
Data breach at recipient’s site, “data gone wild”.
Adversary launches a demonstration attack on the data.
Step 3: Examining Plausible Attacks
Khaled El Emam & Luk Arbuckle
53. Anonymizing Health Data
Step 4: De-identifying the Data
Generalization: reducing the precision of a field.
Dates converted to month/year, or year.
Khaled El Emam & Luk Arbuckle
54. Anonymizing Health Data
Step 4: De-identifying the Data
Generalization: reducing the precision of a field.
Suppression: replacing a cell with NULL.
Unique 55-year-old female in birth registry.
Khaled El Emam & Luk Arbuckle
55. Anonymizing Health Data
Step 4: De-identifying the Data
Generalization: reducing the precision of a field.
Suppression: replacing a cell with NULL.
Sub-sampling: releasing a simple random sample.
50% of data set instead of all data.
Khaled El Emam & Luk Arbuckle
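The three Step 4 techniques (generalization, suppression, sub-sampling) can be sketched as small Python helpers. The function names, the record layout, and the output formats below are assumptions made for illustration only.

```python
import random
from datetime import date


def generalize_date(d: date, level: str = "month") -> str:
    """Generalization: reduce a full date to month/year or to year."""
    if level == "month":
        return f"{d.month:02d}/{d.year}"
    if level == "year":
        return str(d.year)
    raise ValueError("level must be 'month' or 'year'")


def suppress(record: dict, field: str) -> dict:
    """Suppression: replace one cell with NULL (None in Python)."""
    out = dict(record)
    out[field] = None
    return out


def subsample(records: list, fraction: float, seed: int) -> list:
    """Sub-sampling: release a simple random sample of the records."""
    rng = random.Random(seed)
    k = round(len(records) * fraction)
    return rng.sample(records, k)
```

For example, `subsample(records, 0.5, seed=1)` would release half of the data set, and `suppress(record, "age")` would blank the age of a record that is unique on that value.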
58. Anonymizing Health Data
Step 5: Documenting the Process
Process documentation: a methodology text.
Results documentation: data set, risk thresholds,
assumptions, evidence of low risk.
Khaled El Emam & Luk Arbuckle
60. Anonymizing Health Data
Measuring Risk Under Plausible Attacks
T1: Deliberate Attempt
Pr(re-id, attempt) = Pr(attempt) × Pr(re-id | attempt)
T2: Inadvertent Attempt (“Holly Smokes, I know her!”)
Pr(re-id, acquaintance) = Pr(acquaintance) × Pr(re-id | acquaintance)
T3: Data Breach (“data gone wild”)
Pr(re-id, breach) = Pr(breach) × Pr(re-id | breach)
T4: Public Data (demonstration attack)
Pr(re-id), based on data set only
Khaled El Emam & Luk Arbuckle
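The first three attack formulas share one shape: the probability that the threat occurs, multiplied by the probability of re-identification given that it occurs. A minimal Python sketch follows. The function names are hypothetical, and treating the public-data case as a threat probability of 1 is our reading of "based on data set only", not something the slides state.

```python
def joint_risk(pr_threat: float, pr_reid_given_threat: float) -> float:
    """Pr(re-id, T) = Pr(T) x Pr(re-id | T) for a single plausible attack T."""
    for p in (pr_threat, pr_reid_given_threat):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return pr_threat * pr_reid_given_threat


def public_data_risk(pr_reid: float) -> float:
    """T4: assume an attack will happen, so only the data set risk matters."""
    return joint_risk(1.0, pr_reid)
```

With made-up inputs, `joint_risk(0.5, 0.2)` gives 0.1: the same data set carries less overall risk when the attack itself is less likely.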
68. Anonymizing Health Data
Choosing Thresholds
Khaled El Emam & Luk Arbuckle
Many precedents going back multiple decades.
Recommended by regulators.
All are based on maximum risk, though.
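The difference between maximum and average risk can be illustrated with equivalence classes, that is, groups of records sharing the same quasi-identifier values. The sketch below assumes a common formulation in which a record's risk is 1 divided by the size of its equivalence class; the slides do not spell out this definition, and the function names are hypothetical.

```python
from collections import Counter


def equivalence_class_sizes(records: list, quasi_identifiers: list) -> Counter:
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def max_risk(sizes: Counter) -> float:
    """Risk of the most exposed record: 1 / size of the smallest class."""
    return 1 / min(sizes.values())


def average_risk(sizes: Counter) -> float:
    """Mean of 1/size over all records, which equals classes / records."""
    return len(sizes) / sum(sizes.values())
```

A single unique record drives the maximum risk to 1.0 even when the average risk across the data set is much lower, which is why the choice between the two measures matters when setting a threshold.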
75. Anonymizing Health Data
Cross Sectional Data: Research Registries
Khaled El Emam & Luk Arbuckle
Better Outcomes Registry & Network (BORN)
of Ontario
140,000 births per year.
Cross-sectional—mothers not traced over time.
Process of getting de-identified data from a
research registry.
80. Anonymizing Health Data
Choosing Thresholds
Average risk of 0.1 for Researcher Ronnie
(and the data he specifically requested).
0.05 if there were highly sensitive variables
(congenital anomalies, mental health problems).
Khaled El Emam & Luk Arbuckle
84. Anonymizing Health Data
Measuring Risk Under Plausible Attacks
T1: Deliberate Attempt
Low motives and capacity; low mitigating controls.
T2: Inadvertent Attempt (“Holly Smokes, I know her!”)
119,785 births out of 4,478,500 women (= 0.027)
Pr(acquaintance) = 1 - (1 - 0.027)^(150/2) = 0.87
T3: Data Breach (“data gone wild”)
Based on historical data.
Pr(breach) = 0.27
T4: Public Data (demonstration attack)
Khaled El Emam & Luk Arbuckle
91. Anonymizing Health Data
Measuring Risk Under Plausible Attacks
T1: Deliberate Attempt
T2: Inadvertent Attempt (“Holly Smokes, I know her!”)
T3: Data Breach (“data gone wild”)
Overall risk
Pr(re-id, T) = Pr(T) × Pr(re-id | T) ≤ 0.1
Khaled El Emam & Luk Arbuckle
92. Anonymizing Health Data
Measuring Risk Under Plausible Attacks
Khaled El Emam & Luk Arbuckle
T2: Inadvertent Attempt (“Holly Smokes, I know her!”)
Pr(acquaintance) = 1 - (1 - 0.027)^(150/2) = 0.87
Overall risk
Pr(re-id, acquaintance) = 0.87 × Pr(re-id | acquaintance) ≤ 0.1
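The T2 arithmetic can be checked with a short Python calculation. Reading 150/2 as "150 acquaintances, about half of them women" is our interpretation of the slide, and the function name is hypothetical. The last line derives the largest Pr(re-id | acquaintance) that still keeps the overall risk at or below the 0.1 threshold.

```python
def pr_knows_someone(prevalence: float, acquaintances: float) -> float:
    """Chance that at least one of `acquaintances` people is in the data set."""
    return 1 - (1 - prevalence) ** acquaintances


births = 119_785
women = 4_478_500
prevalence = round(births / women, 3)           # 0.027, as on the slide

# Assumption: 150 acquaintances, half of whom are women.
pr_acquaintance = pr_knows_someone(prevalence, 150 / 2)

threshold = 0.1
# Overall risk = pr_acquaintance * Pr(re-id | acquaintance) <= threshold
max_conditional_risk = threshold / pr_acquaintance
```

Rounded to two decimals, `pr_acquaintance` is 0.87, so the data set itself has to be de-identified to a conditional risk of roughly 0.11 to meet the 0.1 threshold.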
98. Anonymizing Health Data
De-identifying the Data Set
Khaled El Emam & Luk Arbuckle
MDOB in 1-yy; BDOB in wk/yy; MPC of 1 char.
MDOB in 10-yy; BDOB in qtr/yy; MPC of 3 chars.
MDOB in 10-yy; BDOB in mm/yy; MPC of 3 chars.
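The generalization options above can be sketched as small Python helpers. We are assuming that MDOB is the mother's date of birth, BDOB the baby's date of birth, and MPC the mother's postal code, and that "10-yy" means ten-year intervals; the function names and label formats are hypothetical.

```python
from datetime import date


def ten_year_interval(d: date) -> str:
    """MDOB in 10-yy: keep only the decade of birth."""
    start = d.year // 10 * 10
    return f"{start}-{start + 9}"


def quarter_year(d: date) -> str:
    """BDOB in qtr/yy: keep only the quarter and year."""
    quarter = (d.month - 1) // 3 + 1
    return f"Q{quarter}/{d.year}"


def month_year(d: date) -> str:
    """BDOB in mm/yy: keep only the month and year."""
    return f"{d.month:02d}/{d.year}"


def truncate_postal_code(code: str, chars: int) -> str:
    """MPC of n chars: keep only the first n characters."""
    return code.replace(" ", "").upper()[:chars]
```

Each option trades precision on one quasi-identifier against precision on another, which is how several schemes can meet the same risk threshold.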
100. Anonymizing Health Data
Year on Year: Re-using Risk Analyses
Khaled El Emam & Luk Arbuckle
In 2006 Researcher Ronnie asks for 2005.
101. Anonymizing Health Data
Year on Year: Re-using Risk Analyses
Khaled El Emam & Luk Arbuckle
In 2006 Researcher Ronnie asks for 2005—deleted.
In 2007 Researcher Ronnie asks for 2006.
102. Anonymizing Health Data
Year on Year: Re-using Risk Analyses
Khaled El Emam & Luk Arbuckle
In 2006 Researcher Ronnie asks for 2005.
In 2007 Researcher Ronnie asks for 2006—deleted.
In 2008 Researcher Ronnie asks for 2007.
103. Anonymizing Health Data
Year on Year: Re-using Risk Analyses
Khaled El Emam & Luk Arbuckle
In 2006 Researcher Ronnie asks for 2005.
In 2007 Researcher Ronnie asks for 2006.
In 2008 Researcher Ronnie asks for 2007—deleted.
In 2009 Researcher Ronnie asks for 2008.
104. Anonymizing Health Data
Year on Year: Re-using Risk Analyses
Khaled El Emam & Luk Arbuckle
In 2006 Researcher Ronnie asks for 2005.
In 2007 Researcher Ronnie asks for 2006.
In 2008 Researcher Ronnie asks for 2007.
In 2009 Researcher Ronnie asks for 2008—deleted.
In 2010 Researcher Ronnie asks for 2009.
105. Anonymizing Health Data
Year on Year: Re-using Risk Analyses
Khaled El Emam & Luk Arbuckle
In 2006 Researcher Ronnie asks for 2005.
In 2007 Researcher Ronnie asks for 2006.
In 2008 Researcher Ronnie asks for 2007.
In 2009 Researcher Ronnie asks for 2008—deleted.
In 2010 Researcher Ronnie asks for 2009.
Can we use the same de-identification scheme every year?
111. Anonymizing Health Data
Year on Year: Re-using Risk Analyses
Khaled El Emam & Luk Arbuckle
BORN data pertains to very stable populations.
No dramatic changes in the number or characteristics of
births from 2005-2010.
Revisit de-identification scheme every 18 to 24 months.
Revisit if any new quasi-identifiers are added or changed.
114. Anonymizing Health Data
Longitudinal Discharge Abstract Data:
State Inpatient Databases
Khaled El Emam & Luk Arbuckle
Linking a patient’s records over time.
Need to be de-identified differently.
129. Anonymizing Health Data
De-identifying the Data Set
Khaled El Emam & Luk Arbuckle
BirthYear in 5-yy (cut at 1910-);
AdmissionYear unchanged;
DaysSinceLastService in 28-dd (cut at 7-, 182+);
LengthOfStay same as DaysSinceLastService.
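One way to read this scheme is as fixed-width bins with bottom- and top-coding. The Python sketch below assumes that "cut at 1910-" lumps every birth year up to 1910 into one group, and that "cut at 7-, 182+" lumps values of 7 days or fewer and 182 days or more into end groups. The alignment of the bins in between is also an assumption, and the function names are hypothetical.

```python
def bin_birth_year(year: int, width: int = 5, bottom: int = 1910) -> str:
    """BirthYear in 5-year bins, with everything up to `bottom` grouped."""
    if year <= bottom:
        return f"{bottom}-"
    start = bottom + 1 + (year - bottom - 1) // width * width
    return f"{start}-{start + width - 1}"


def bin_days(days: int, width: int = 28, low: int = 7, high: int = 182) -> str:
    """Day counts in 28-day bins, grouped at or below `low` and at or above `high`."""
    if days <= low:
        return f"{low}-"
    if days >= high:
        return f"{high}+"
    start = low + 1 + (days - low - 1) // width * width
    end = min(start + width - 1, high - 1)      # last bin stops before `high`
    return f"{start}-{end}"
```

Since LengthOfStay uses the same scheme as DaysSinceLastService, `bin_days` would be applied to both fields, while AdmissionYear passes through unchanged.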
135. Anonymizing Health Data
Connected Variables
Khaled El Emam & Luk Arbuckle
QI to QI
Similar QI?
Same generalization and suppression.
QI to non-QI
Non-QI is revealing?
Same suppression so both are removed.
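The "same suppression" rule for connected variables can be sketched as follows: when a quasi-identifier is suppressed, every field connected to it is suppressed in the same record. The mapping of connections and the field names are hypothetical and serve only to illustrate the rule.

```python
def suppress_with_connections(record: dict, field: str, connected: dict) -> dict:
    """Suppress `field` and every field connected to it in the same record."""
    out = dict(record)
    for name in [field, *connected.get(field, [])]:
        out[name] = None
    return out


# Hypothetical example: a drug code (non-QI) that reveals the diagnosis (QI).
CONNECTED = {"diagnosis_code": ["drug_code"]}
```

Suppressing only the diagnosis would leave a revealing drug code behind, so both cells are removed together.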
139. Anonymizing Health Data
Other Issues Regarding Longitudinal Data
Khaled El Emam & Luk Arbuckle
Date shifting—maintaining order of records.
Long tails—truncation of records.
Adversary power—assumption of knowledge.
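A simple way to shift dates while maintaining the order of a patient's records is to add the same random offset to all of that patient's dates. The Python sketch below shows this one approach under our own assumptions (a single offset per patient, a hypothetical function name). It is not necessarily the method the authors use.

```python
import random
from datetime import date, timedelta


def shift_patient_dates(visits: list, max_shift_days: int, rng: random.Random) -> list:
    """Shift all of one patient's dates by a single random offset.

    One offset per patient keeps both the order of the records and the
    intervals between them unchanged.
    """
    offset = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
    return [d + offset for d in visits]
```

Shifting each date independently would hide the true dates just as well, but it could reorder visits, for example by placing a discharge before its admission.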
144. Anonymizing Health Data
Other Concerns to Think About
Khaled El Emam & Luk Arbuckle
Free-form text—anonymization.
Geospatial information—aggregation and
geoproxy risk.
Medical codes—generalization, suppression,
shuffling (yes, as in cards).
Secure linking—linking data through
encryption before anonymization.
147. Anonymizing Health Data
Khaled El Emam & Luk Arbuckle
Khaled El Emam: kelemam@privacyanalytics.ca
Luk Arbuckle: larbuckle@privacyanalytics.ca
More Comments or Questions: Contact us!