BigInsight seminar on Practical Privacy-Preserving Distributed Statistical Computation

Health data and the re-identification threat – a real-world example
Giske Ursin
Cancer Registry of Norway
March 5, 2018
Seminar on privacy-preserving distributed
statistical computation,
Statistics Norway

Norwegian health registries
17 central + 54 clinical registries
Purpose:
- Assess distribution of disease
- Obtain information on how to prevent disease and
death from disease
Other health data:
- Population surveys
- 360+ biobanks
Cancer screening programs: all women 25-69

All these data…..
Let’s
link
them

Put the data somewhere safe…
Can only access them there….
But….

Kreft i Norge 2015 (Cancer in Norway 2015)
1. Are the data safe?
National platform
coming….

1. Are the data really SAFE?
http://www.free-bullion-investment-guide.com/homesafes.html

Kreft i Norge 2015
2. Does it matter?
Reidentification threat
Trust
…versus…..

Current systems are based on trust

An example
Month and year of birth
Dates of all cervical exams
Results of each test
Whether or not they get cancer
Cancer diagnosis date
1 million women

An example
Whether or not they get cancer
Cancer diagnosis date
1 million women
Linked to identifiers
on n = xxx women

What do we do?
Trust?
All data deliveries based on trust
….or

What do we do?
Reduce
reidentification threat
………………Exactly HOW??

Anonymization protocols
K-anonymization (categorizing variables)
Creating synthetic datasets
Fuzzification
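As an illustration of the first item, here is a minimal sketch of how categorizing a variable can raise k. The records, the `county` field, the five-year band width and the function names are all hypothetical and only serve to show the idea; this is not a description of any registry's actual procedure.

```python
from collections import Counter

def k_of(records, quasi_identifiers):
    """Smallest group size over all combinations of quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

def categorize_birth_year(record, width=5):
    """Replace the exact birth year with the start of a `width`-year band."""
    banded = dict(record)
    banded["birth_year"] = record["birth_year"] // width * width
    return banded

# Made-up records; "county" is a hypothetical second quasi-identifier.
records = [
    {"birth_year": 1960, "county": "A"},
    {"birth_year": 1961, "county": "A"},
    {"birth_year": 1963, "county": "A"},
    {"birth_year": 1972, "county": "B"},
    {"birth_year": 1974, "county": "B"},
]
qi = ["birth_year", "county"]
k_before = k_of(records, qi)                                      # exact years
k_after = k_of([categorize_birth_year(r) for r in records], qi)   # 5-year bands
```

With exact birth years every record in this toy dataset is unique (k = 1); after banding, each record shares its quasi-identifier values with at least one other record.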

Synthetic datasets
Reset all dates from reference date
Day of birth = day 0
Day started using drug (before diagnosis) = day 19 345
Day diagnosed with cancer = day 20 693
Challenge:
If some aspect of calendar year is needed
(treatments change over time)
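The date reset described above could be sketched roughly as follows, assuming each record is a plain dictionary of calendar dates and that the date of birth is the reference date. The function name `to_relative_days`, the field names and the example dates are hypothetical.

```python
from datetime import date

def to_relative_days(record, reference_field="birth"):
    """Replace each calendar date with the number of days since the reference date."""
    reference = record[reference_field]
    return {
        field: (value - reference).days if value is not None else None
        for field, value in record.items()
    }

# Made-up example record (not taken from any registry).
record = {
    "birth": date(1950, 3, 1),
    "drug_start": date(2003, 2, 17),
    "diagnosis": date(2006, 10, 27),
}
relative = to_relative_days(record)
```

Intervals between events are preserved exactly, but the calendar year is no longer recoverable from the output, which is the challenge the slide points to.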

Fuzzification – alter the data
- K-anonymization (Categorized variables)
- Excluded some observations (extreme dates/combinations)
- ALTERED all dates:
- Removed DAY
- CHANGED month – with random number (fuzzy factor)
- REMOVED month of birth

Fuzzification of cervix data
Ursin et al., Cancer Epidemiology Biomarkers Prevention 2017

• 5.6 million records
• All cervical exam dates
• Results
• Diagnosis dates of cancer
• 915 000 women

• Removed extreme dates/combinations
• Set day in dates to 15
• Used fuzzy factor on month:
• random value between -4 and +4
• All dates for one individual shifted by the same
random number
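The date steps listed above might look like this in code. This is a sketch under stated assumptions, not the published implementation: the fuzzy factor is assumed to be an integer drawn uniformly from -4 to +4 inclusive (including 0), and the names `shift_months` and `fuzzify` are hypothetical.

```python
import random
from datetime import date

def shift_months(d, offset):
    """Return d moved by `offset` whole months, keeping the day of month."""
    months = d.year * 12 + (d.month - 1) + offset
    return date(months // 12, months % 12 + 1, d.day)

def fuzzify(dates, rng, max_shift=4):
    """Set the day to 15 and shift every date of one woman by one shared offset."""
    offset = rng.randint(-max_shift, max_shift)  # same fuzzy factor for all her dates
    return [
        shift_months(d.replace(day=15), offset) if d is not None else None
        for d in dates
    ]

rng = random.Random(2018)  # seeded only to make the sketch reproducible
original = [date(1972, 7, 1), date(2000, 8, 2), date(2004, 11, 10), date(2007, 1, 21)]
fuzzed = fuzzify(original, rng)
```

Because one offset is shared by all dates of an individual, the number of months between her events is unchanged; only their position on the calendar moves.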

Original
ID              DOB         Exam 1      Exam 2      Diagnosis date
01071972 23456  1/7/1972    2/8/2000    10/11/2004  21/1/2007
03041960 45678  3/4/1960    5/1/1995    10/2/1998   ----
ID              DOB         Exam 1      Exam 2      Diagnosis date
001             15/7/1972   15/8/2000   15/11/2004  15/1/2007
002             15/4/1960   15/1/1995   15/2/1998   ----
Allocated ID    DOB         Exam 1      Exam 2      Diagnosis date
1023            15/10/1972  15/11/2000  15/2/2005   15/4/2007
4567            15/1/1960   15/11/1994  15/12/1997  ----
FINAL DATA
Allocated ID    DOB         Exam 1      Exam 2      Diagnosis date
1023            1972        15/11/2000  15/2/2005   15/4/2007
4567            1960        15/11/1994  15/12/1997  ----

Original
ID              DOB         Exam 1      Exam 2      Diagnosis date
01071972 23456  1/7/1972    2/8/2000    10/11/2004  21/1/2007
03041960 45678  3/4/1960    5/1/1995    10/2/1998   ----
FINAL DATA
Allocated ID    DOB         Exam 1      Exam 2      Diagnosis date
1023            1972        15/11/2000  15/2/2005   15/4/2007
4567            1960        15/11/1994  15/12/1997  ----

Assessing the risk of reidentification
• ARX tool
• Quantifies risk of re-identification based on
uniqueness
• Prosecutor scenario: assumes the attacker knows the person is in the dataset
• Classify variables as identifiable, quasi-identifiable
or sensitive
Prasser F, Kohlmayer F, Lautenschlager R, Kuhn KA. ARX--A
Comprehensive Tool for Anonymizing Biomedical Data. AMIA Annu
Symp Proc. 2014;2014:984-93.
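To make the uniqueness-based idea concrete, here is a minimal sketch of the usual textbook formulation of prosecutor risk: each record's risk is one divided by the number of records that share its quasi-identifier values. This is not ARX's API or implementation; the function name, field names and records are hypothetical.

```python
from collections import Counter

def prosecutor_risks(records, quasi_identifiers):
    """Per-record risk = 1 / size of the group sharing its quasi-identifier values."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    sizes = Counter(key(r) for r in records)
    return [1 / sizes[key(r)] for r in records]

# Made-up records: two women share the same values, one is unique.
records = [
    {"birth_year": 1960, "first_exam": "1995-01"},
    {"birth_year": 1960, "first_exam": "1995-01"},
    {"birth_year": 1972, "first_exam": "2000-08"},
]
risks = prosecutor_risks(records, ["birth_year", "first_exam"])
highest_risk = max(risks)                                   # risk of the most exposed record
share_unique = sum(r == 1.0 for r in risks) / len(risks)    # fraction of unique records
```

A record that is unique on its quasi-identifiers has risk 1, so coarsening or perturbing dates lowers risk by making more records share the same values.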

Assessing the reidentification risk
• D1. Realistic dataset
• D2. k-anonymization of dataset D1
• changing all dates in the dataset to 15th of the month
• D3. Fuzzifying the month in D2
• by adding a random factor between -4 and +4 months to each
month.

Fuzzification – WHAT helps?

Reidentification risk
• A simple step (setting all dates to the 15th of the month) reduces the risk of reidentification
• Adding a fuzzy factor makes reidentification even
more difficult

Current regulations
Norway:
The degree of personal identification shall not be greater than
necessary for the purpose in question. The degree of personal
identification shall be justified. The supervisory authority may
require the data controller to present the justification.
• Helseregisterloven (Norwegian Health Registry Act) §6
EU – GDPR:
Data Protection Impact Assessment
Article 35

Current practice - examples
Cancer Registry: Restrictive with dates
Helseregisterloven §6
Prescription Registry: Restrictive with dates
§4 «Forbud mot samtidig tilgang» (prohibition on simultaneous access)
(Differansedager, i.e. day differences = synthetic dataset)
Statistics Norway: ?
Common guidelines - and
better solutions - needed!
Income?

Large linkages continue
…..still based on trust
Can NOT build a national platform on TRUST alone

For the researchers…….
BALANCE

The researchers need:
Safe analysis
of large linked data
(no reidentification threat)
- Rapid and seamless analyses
- Ability to check individual records
Need national platforms that can do it all!

Thank you
Fuzzy paper:
Mari Nygård
Sagar Sen
Jean-Marie Mottu
Discussions with:
Jan Nygård
Bjørn Møller
Hilde Olav
Johanne Gulbrandsen
Datautleveringsenheten (the Data Delivery Unit)
Livmorhalsprogrammet (the Cervical Cancer Screening Programme)
