O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data
Webcast
Case Studies and Methods to Get You Started
Khaled El Emam & Luk Arbuckle

Part 1 of Webcast: Intro and Methodology
Part 2 of Webcast: A Look at Our Case Studies
Part 3 of Webcast: Questions and Answers

Part 1 of Webcast: Intro and Methodology

To Anonymize or Not to Anonymize

Consent needs to be informed.

Not all health care providers are willing to
share their patients’ PHI.

Anonymization allows for the sharing of health information.

Compelling financial case. Breach cost ~$200 per patient.

Privacy protective behaviors by patients.

Masking Standards

First name, last name, SSN.

Distortion of data—no analytics.

Creating pseudonyms.

Removing a whole field.

Replacing actual values with random ones.
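
The masking operations listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the presenters' tooling: the field names (`first_name`, `last_name`, `ssn`, `record_id`), the `Pseudonymizer` class, and the `mask_record` helper are all hypothetical, and random hex tokens stand in for whatever pseudonym scheme a real system would use.

```python
import secrets


class Pseudonymizer:
    """Maps each distinct value to a stable random token (a pseudonym)."""

    def __init__(self):
        self._tokens = {}

    def pseudonym(self, value):
        # Reuse the token for a value seen before, so records still link up.
        if value not in self._tokens:
            self._tokens[value] = secrets.token_hex(8)
        return self._tokens[value]


def mask_record(record, pseudonymizer):
    """Return a masked copy of one record (hypothetical field names)."""
    masked = dict(record)
    # Removing a whole field: the SSN is dropped entirely.
    masked.pop("ssn", None)
    # Replacing actual values with random ones: names become random strings.
    masked["first_name"] = secrets.token_hex(4)
    masked["last_name"] = secrets.token_hex(4)
    # Creating pseudonyms: the record ID maps to a consistent token.
    masked["record_id"] = pseudonymizer.pseudonym(record["record_id"])
    return masked


pseudonymizer = Pseudonymizer()
visit_1 = {"first_name": "Ann", "last_name": "Lee", "ssn": "000-00-0000",
           "record_id": "A17", "age": 34}
visit_2 = {"first_name": "Ann", "last_name": "Lee", "ssn": "000-00-0000",
           "record_id": "A17", "age": 35}
masked_1 = mask_record(visit_1, pseudonymizer)
masked_2 = mask_record(visit_2, pseudonymizer)
```

Both masked visits carry the same pseudonym, so they can still be linked, while the names are pure noise and support no analytics.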

De-identification Standards

Age, sex, race, address, income.

Minimal distortion of data—for analytics.

Safe Harbor in HIPAA Privacy Rule.

What’s “Actual Knowledge”?
Privacy Rule
Safe Harbor

Info, alone or in combo, that could identify
an individual.

Has to be specific to the data set—not
theoretical.

Example: occupation listed as “Mayor of Gotham”.

Heuristics, or rules of thumb.

Statistical method in HIPAA Privacy Rule.

De-identification Myths

Myth: It’s possible to re-identify most, if not
all, data.

Using robust methods, evidence suggests risk
can be very small.

Myth: Genomic sequences are not
identifiable, or are easy to re-identify.

In some cases genomic sequences can be re-identified, and they are
difficult to de-identify using our methods.

A Risk-based De-identification Methodology

The risk of re-identification can be quantified.

The Goldilocks principle:
balancing privacy with data utility.

The re-identification risk needs to be very small.

De-identification involves a mix of technical, contractual,
and other measures.

Steps in the De-identification Methodology
Step 1: Selecting Direct and Indirect Identifiers
Step 2: Setting the Threshold
Step 3: Examining Plausible Attacks
Step 4: De-identifying the Data
Step 5: Documenting the Process

Direct identifiers: name, telephone number, health
insurance card number, medical record number.

Indirect identifiers, or quasi-identifiers: sex, date of birth,
ethnicity, locations, event dates, medical codes.

Maximum acceptable risk for sharing data.

Needs to be quantitative and defensible.

Is the data going to be in the public domain?

Extent of invasion-of-privacy when data was shared?

Recipient deliberately attempts to re-identify the data.

Recipient inadvertently re-identifies the data.
“Holly Smokes, I know her!”

Data breach at recipient’s site, “data gone wild”.

Adversary launches a demonstration attack on the data.

Generalization: reducing the precision of a field.
Dates converted to month/year, or year.

Suppression: replacing a cell with NULL.
Unique 55-year-old female in birth registry.

Sub-sampling: releasing a simple random sample.
50% of data set instead of all data.
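
As a rough sketch of the three operations above (generalization, suppression, sub-sampling), the snippet below applies each one to toy records. The record layout and the function names are assumptions made for illustration only.

```python
import random
from datetime import date


def generalize_date(d, level="month"):
    """Generalization: reduce a date to month/year, or to year only."""
    if level == "month":
        return d.strftime("%m/%Y")
    if level == "year":
        return d.strftime("%Y")
    raise ValueError(f"unknown level: {level}")


def suppress(record, field):
    """Suppression: return a copy with one cell replaced by None (NULL)."""
    masked = dict(record)
    masked[field] = None
    return masked


def subsample(records, fraction, seed=None):
    """Sub-sampling: a simple random sample drawn without replacement."""
    rng = random.Random(seed)
    size = round(len(records) * fraction)
    return rng.sample(records, size)


# Toy data set; the fields are hypothetical.
records = [
    {"id": i, "sex": "F", "age": 20 + i, "event_date": date(2010, 1 + i % 12, 15)}
    for i in range(10)
]

month_year = generalize_date(records[2]["event_date"])        # month/year
year_only = generalize_date(records[2]["event_date"], "year")  # year
outlier_hidden = suppress(records[9], "age")                   # cell set to None
sample = subsample(records, 0.5, seed=1)                       # 50% of the data set
```

Generalization and suppression change values inside the released records, while sub-sampling changes which records are released at all.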

Process documentation—a methodology text.

Results documentation—data set, risk thresholds,
assumptions, evidence of low risk.

Measuring Risk Under Plausible Attacks

T1: Deliberate Attempt
Pr(re-id, attempt) = Pr(attempt) × Pr(re-id | attempt)

T2: Inadvertent Attempt (“Holly Smokes, I know her!”)
Pr(re-id, acquaintance) = Pr(acquaintance) × Pr(re-id | acquaintance)

T3: Data Breach (“data gone wild”)
Pr(re-id, breach) = Pr(breach) × Pr(re-id | breach)

T4: Public Data (demonstration attack)
Pr(re-id), based on data set only
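
The four formulas above share one shape: the probability that the attack context occurs, multiplied by the probability of re-identification given that context. A small sketch, assuming for simplicity that the same conditional probability applies to T1 through T3 and that a public release (T4) is attacked with certainty. The input values are hypothetical, chosen only to show the arithmetic.

```python
def attack_risks(pr_attempt, pr_acquaintance, pr_breach, pr_reid_given_attack):
    """Overall risk under attacks T1 to T4 for one conditional re-id probability."""
    return {
        "T1 deliberate attempt": pr_attempt * pr_reid_given_attack,
        "T2 inadvertent attempt": pr_acquaintance * pr_reid_given_attack,
        "T3 data breach": pr_breach * pr_reid_given_attack,
        # T4: public data, so the risk is based on the data set only.
        "T4 public data": pr_reid_given_attack,
    }


# Hypothetical inputs, not taken from any case study.
risks = attack_risks(
    pr_attempt=0.3,
    pr_acquaintance=0.5,
    pr_breach=0.2,
    pr_reid_given_attack=0.1,
)
for attack, risk in risks.items():
    print(f"{attack}: {risk:.3f}")
```

For non-public releases, the largest of the T1 to T3 values is the one to compare against the threshold.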

Choosing Thresholds

Many precedents going back multiple decades.

Recommended by regulators.

All based on max risk though.

Part 2 of Webcast: A Look at Our Case Studies

Cross Sectional Data: Research Registries

Better Outcomes Registry & Network (BORN)
of Ontario

140,000 births per year.

Cross-sectional—mothers not traced over time.

Process of getting de-identified data from a
research registry.

Researcher Ronnie wants data!

919,710 records
from 2005-2011

Choosing Thresholds
Average risk of 0.1 for Researcher Ronnie
(and the data he specifically requested).

0.05 if there were highly sensitive variables
(congenital anomalies, mental health problems).


Low motives and capacity; low mitigating controls.

Pr(attempt) = 0.4

119,785 births out of 4,478,500 women (= 0.027)

Pr(acquaintance) = 1 - (1 - 0.027)^(150/2) = 0.87
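
The acquaintance probability can be checked numerically. In this sketch the prevalence is the births-per-woman ratio quoted on the slide. Reading 150 as the number of people an adversary knows, and the division by 2 as the share of them who are women, is an assumption: the slide gives the numbers but not their meaning. The function name is hypothetical.

```python
def pr_acquaintance(prevalence, acquaintances=150, share_eligible=0.5):
    """Probability that at least one eligible acquaintance is in the data set."""
    eligible = acquaintances * share_eligible
    return 1 - (1 - prevalence) ** eligible


prevalence = 119_785 / 4_478_500  # births per woman, from the slide
print(f"prevalence       = {prevalence:.3f}")
print(f"Pr(acquaintance) = {pr_acquaintance(prevalence):.2f}")
```

The result is driven mostly by the exponent: halving the assumed number of acquaintances to 75 drops the probability to roughly 0.64.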

Based on historical data.

Pr(breach) = 0.27

Overall risk
Pr(re-id, T) = Pr(T) × Pr(re-id | T) ≤ 0.1

Pr(re-id, acquaintance) = 0.87 × Pr(re-id | acquaintance) ≤ 0.1
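
Rearranging the inequality gives the largest conditional risk the data itself may carry: Pr(re-id | T) ≤ 0.1 / Pr(T). The sketch below does that division for the three context probabilities quoted in the case study (0.4, 0.87, 0.27). The last step, turning a conditional risk into a minimum equivalence class size via 1/k, is a simplifying assumption: 1/k bounds the maximum risk in a k-anonymous data set, so it is a conservative stand-in for the average risk used here. Both function names are hypothetical.

```python
import math


def max_conditional_risk(threshold, pr_context):
    """Largest Pr(re-id | T) that keeps Pr(T) * Pr(re-id | T) within threshold."""
    return min(1.0, threshold / pr_context)


def min_class_size(conditional_risk):
    """Smallest k with 1/k <= conditional_risk (assumes risk is bounded by 1/k)."""
    return math.ceil(1 / conditional_risk)


threshold = 0.1
contexts = {"attempt": 0.4, "acquaintance": 0.87, "breach": 0.27}
for name, pr_context in contexts.items():
    limit = max_conditional_risk(threshold, pr_context)
    print(f"{name}: Pr(re-id | T) <= {limit:.3f}, k >= {min_class_size(limit)}")
```

The acquaintance attack has the highest context probability, so it sets the tightest limit on the data.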

De-identifying the Data Set

Meeting Thresholds: k-anonymity

MDOB in 1-yy; BDOB in wk/yy; MPC of 1 char.

MDOB in 10-yy; BDOB in qtr/yy; MPC of 3 chars.

MDOB in 10-yy; BDOB in mm/yy; MPC of 3 chars.
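
To see how a generalization option like the ones above is scored, the sketch below generalizes a few toy records and measures the resulting equivalence classes. Several things here are assumptions: MDOB, BDOB, and MPC are read as mother's date of birth, baby's date of birth, and mother's postal code, the record layout and sample values are invented, and risk is taken as 1 divided by the equivalence class size.

```python
from collections import Counter


def generalize(record):
    """One option: MDOB to a 10-year band, BDOB to quarter/year, MPC to 3 chars."""
    decade = record["mdob_year"] // 10 * 10
    quarter = (record["bdob_month"] - 1) // 3 + 1
    return {
        "mdob": f"{decade}s",
        "bdob": f"Q{quarter}/{record['bdob_year']}",
        "mpc": record["mpc"][:3],
    }


def risk_summary(records, quasi_identifiers):
    """Smallest class size (k), maximum risk, and average risk over all records."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    per_record = [1 / sizes[key] for key in keys]
    return {
        "k": min(sizes.values()),
        "max_risk": max(per_record),
        "avg_risk": sum(per_record) / len(per_record),
    }


# Invented records for illustration only.
raw = [
    {"mdob_year": 1981, "bdob_year": 2008, "bdob_month": 2, "mpc": "K1A0B1"},
    {"mdob_year": 1985, "bdob_year": 2008, "bdob_month": 3, "mpc": "K1A4C2"},
    {"mdob_year": 1988, "bdob_year": 2008, "bdob_month": 1, "mpc": "K1A9Z9"},
    {"mdob_year": 1979, "bdob_year": 2008, "bdob_month": 5, "mpc": "M5V2T6"},
    {"mdob_year": 1972, "bdob_year": 2008, "bdob_month": 6, "mpc": "M5V1J1"},
]
generalized = [generalize(r) for r in raw]
summary = risk_summary(generalized, ["mdob", "bdob", "mpc"])
print(summary)
```

Comparing the summary for each candidate option against the threshold is one way to pick the least distorting option that still passes.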

Year on Year: Re-using Risk Analyses


In 2006 Researcher Ronnie asks for 2005—deleted.

Can we use the same de-identification scheme every year?

BORN data pertains to very stable populations.

No dramatic changes in the number or characteristics of
births from 2005-2010.

Revisit de-identification scheme every 18 to 24 months.

Revisit if any new quasi-identifiers are added or changed.

Longitudinal Discharge Abstract Data:
State Inpatient Databases

Linking a patient’s records over time.

Need to be de-identified differently.

Meeting Thresholds: k-anonymity?

De-identifying Under Complete Knowledge

State Inpatient Database (SID) of California

Researcher Ronnie wants public data!

Pr(re-id) ≤ 0.09 (maximum risk)

BirthYear in 5-yy (cut at 1910-);
AdmissionYear unchanged;
DaysSinceLastService in 28-dd (cut at 7-, 182+);
LengthOfStay same as DaysSinceLastService.
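
One possible reading of the scheme above, sketched in code. The interpretation is an assumption throughout: "5-yy" and "28-dd" are taken as fixed-width bins of 5 years and 28 days, "cut at 1910-" and "7-" as bottom-coding (that value or lower), and "182+" as top-coding (that value or higher). Bins are anchored at multiples of the width, which the slide does not specify. The function names and record layout are hypothetical.

```python
def generalize_number(value, width, low_cut=None, high_cut=None):
    """Bin an integer into fixed-width intervals with optional bottom/top codes."""
    if low_cut is not None and value <= low_cut:
        return f"{low_cut}-"
    if high_cut is not None and value >= high_cut:
        return f"{high_cut}+"
    start = value // width * width
    end = start + width - 1
    # Trim bin labels so they never overlap the bottom or top code.
    if low_cut is not None:
        start = max(start, low_cut + 1)
    if high_cut is not None:
        end = min(end, high_cut - 1)
    return f"{start}-{end}"


def generalize_visit(visit):
    """Apply the assumed scheme to one discharge record."""
    return {
        "BirthYear": generalize_number(visit["BirthYear"], 5, low_cut=1910),
        "AdmissionYear": visit["AdmissionYear"],  # unchanged
        "DaysSinceLastService": generalize_number(
            visit["DaysSinceLastService"], 28, low_cut=7, high_cut=182),
        "LengthOfStay": generalize_number(
            visit["LengthOfStay"], 28, low_cut=7, high_cut=182),
    }


visit = {"BirthYear": 1987, "AdmissionYear": 2007,
         "DaysSinceLastService": 30, "LengthOfStay": 3}
generalized_visit = generalize_visit(visit)
print(generalized_visit)
```

Bottom- and top-coding collapse the rare extreme values into one category, which is where unique records tend to sit.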

Connected Variables

QI to QI

Similar QI?
Same generalization and suppression.

QI to non-QI

Non-QI is revealing?
Same suppression so both are removed.

Other Issues Regarding Longitudinal Data

Date shifting—maintaining order of records.
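
A minimal sketch of date shifting that keeps a patient's records in order. It assumes one simple variant: every date belonging to a patient is moved by the same random offset, which preserves both the order and the gaps between visits. This is a stand-in chosen for illustration and may not match the method used in the case study. The function name and the 30-day window are hypothetical.

```python
import random
from datetime import date, timedelta


def shift_dates(visit_dates, max_shift_days=30, rng=None):
    """Shift all of one patient's dates by a single random offset."""
    rng = rng or random.Random()
    offset = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
    return [d + offset for d in visit_dates]


visits = [date(2009, 1, 5), date(2009, 1, 19), date(2009, 3, 2)]
shifted = shift_dates(visits, rng=random.Random(42))

gaps_before = [(b - a).days for a, b in zip(visits, visits[1:])]
gaps_after = [(b - a).days for a, b in zip(shifted, shifted[1:])]
print(gaps_before, gaps_after)
```

Shifting each date independently would be simpler still, but it can reorder visits, for example by placing a discharge before its admission.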

Long tails—truncation of records.

Adversary power—assumption of knowledge.

Other Concerns to Think About

Free-form text—anonymization.

Geospatial information—aggregation and
geoproxy risk.

Medical codes—generalization, suppression,
shuffling (yes, as in cards).

Secure linking—linking data through
encryption before anonymization.

Part 3 of Webcast: Questions and Answers
