Norwegian health registries collect data from 17 central and 54 clinical registries for purposes like disease assessment and prevention. There are concerns about safely linking these datasets while avoiding reidentification. An example showed one woman could be reidentified from her birth month, cervical exam dates and cancer diagnosis. To reduce this risk, dates were altered by removing days, changing months randomly by -4 to +4 months, and removing birth months. This "fuzzification" technique significantly reduced the reidentification risk according to tools like ARX. However, current national data platforms still rely heavily on trust rather than technical solutions, which is insufficient for large linked datasets. Better anonymization protocols are needed to balance open analysis and individual privacy.
BigInsight seminar on Practical Privacy-Preserving Distributed Statistical Computation
1. Health data and the reidentification threat – a real-world example
Giske Ursin
Cancer Registry of Norway
March 5, 2018
Seminar on privacy-preserving distributed
statistical computation,
Statistics Norway
2. Norwegian health registries
17 central + 54 clinical registries
Purpose:
- Assess distribution of disease
- Obtain information on how to prevent disease and
death from disease
Other health data:
- Population surveys
- 360+ biobanks
Cancer screening programs: all women 25-69
10. An example
Month and year of birth
Dates of all cervical exams
Results of each test
Whether or not she gets cancer
Cancer diagnosis date
1 million women
11. An example
Month and year of birth
Dates of all cervical exams
Results of each test
Whether or not she gets cancer
Cancer diagnosis date
1 million women
Month and year of birth
Dates of all cervical exams
Results of each test
Linked to identifiers
on n = xxx women
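The threat on this slide can be sketched as a join between a de-identified research file and an external file that carries identifiers, matched on the fields the two share (month and year of birth, exam dates). This is only an illustration on made-up toy records; the field names, the `link` helper and "Person A" are hypothetical and not taken from the talk.

```python
from collections import defaultdict

# De-identified research records (toy data, no direct identifiers).
research = [
    {"birth": "1972-07", "exams": ("2000-08-02", "2004-11-10"), "diagnosis": "2007-01-21"},
    {"birth": "1960-04", "exams": ("1995-01-05", "1998-02-10"), "diagnosis": None},
    {"birth": "1960-04", "exams": ("1996-03-12",), "diagnosis": None},
]

# External records that carry a direct identifier plus the shared fields.
external = [
    {"name": "Person A", "birth": "1972-07", "exams": ("2000-08-02", "2004-11-10")},
]


def link(research_rows, external_rows):
    """Return (name, diagnosis) for every external person who matches
    exactly one research record on birth month and exam dates."""
    index = defaultdict(list)
    for row in research_rows:
        index[(row["birth"], row["exams"])].append(row)

    hits = []
    for person in external_rows:
        candidates = index.get((person["birth"], person["exams"]), [])
        if len(candidates) == 1:  # a unique match means reidentification
            hits.append((person["name"], candidates[0]["diagnosis"]))
    return hits


reidentified = link(research, external)
```

In this toy case the single external record matches exactly one research record, so the cancer diagnosis date becomes attached to a named person.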
12. What do we do?
Trust?
All data deliveries based on trust
….or
13. What do we do?
Reduce
reidentification threat
………………Exactly HOW??
15. Synthetic datasets
Reset all dates from reference date
Day of birth = day 0
Day started using drug (before diagnosis) = day 19 345
Day diagnosed with cancer = day 20 693
Challenge:
If some aspect of calendar year is needed
(treatments change)
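A minimal sketch of the date reset described on this slide, assuming the reference date is the date of birth. The birth date below is hypothetical and was chosen only so that the day counts reproduce the slide's numbers; `to_relative_days` is an illustrative helper, not a function from the talk.

```python
from datetime import date, timedelta


def to_relative_days(reference, events):
    """Express every event as the number of days since the reference date."""
    return {name: (day - reference).days for name, day in events.items()}


birth = date(1950, 3, 1)  # hypothetical date of birth (day 0)
events = {
    "birth": birth,
    "started_drug": birth + timedelta(days=19345),
    "diagnosis": birth + timedelta(days=20693),
}

relative = to_relative_days(birth, events)
```

The relative values keep every interval between events but no longer reveal the calendar year, which is exactly the challenge the slide notes when treatments change over time.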
16. Fuzzification – alter the data
- K-anonymization (Categorized variables)
- Excluded some observations (extreme dates/combinations)
- ALTERED all dates:
- Removed DAY
- CHANGED month – with random number (fuzzy factor)
- REMOVED month of birth
18. Fuzzification – alter the data
• 5.6 million records
• All cervical exam dates
• Results
• Diagnosis dates of cancer
• 915 000 women
Ursin et al., Cancer Epidemiology Biomarkers Prevention 2017
19. Fuzzification – alter the data
• Removed extreme dates/combinations
• Set day in dates to 15
• Used fuzzy factor on month:
• random value between -4 and +4
• All dates for one individual changed by the same
random number
Ursin et al., Cancer Epidemiology Biomarkers Prevention 2017
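The steps listed on this slide could be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the record layout, `shift_months` and `fuzzify` are hypothetical, the removal of extreme dates and the allocation of a new ID are not shown, and the birth year is assumed to be taken from the shifted date of birth.

```python
import random
from datetime import date


def shift_months(d, offset):
    """Shift a date by a whole number of months and set the day to 15."""
    total = d.year * 12 + (d.month - 1) + offset
    return date(total // 12, total % 12 + 1, 15)


def fuzzify(record, rng, max_shift=4):
    """Apply ONE random month offset to all dates of an individual."""
    offset = rng.randint(-max_shift, max_shift)  # fuzzy factor, inclusive bounds
    diagnosis = record["diagnosis"]
    return {
        # Assumption: only the year of the shifted date of birth is kept.
        "birth_year": shift_months(record["dob"], offset).year,
        "exams": [shift_months(d, offset) for d in record["exams"]],
        "diagnosis": shift_months(diagnosis, offset) if diagnosis else None,
    }


sample = {
    "dob": date(1972, 7, 1),
    "exams": [date(2000, 8, 2), date(2004, 11, 10)],
    "diagnosis": date(2007, 1, 21),
}
fuzzy = fuzzify(sample, random.Random(0))
```

Because the same offset is applied to every date of one woman, the number of months between her exams and her diagnosis is unchanged, so interval-based analyses are unaffected.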
20. Fuzzification – alter the data
Original ID     DOB         Exam 1      Exam 2      Diagnosis date
01071972 23456  1/7/1972    2/8/2000    10/11/2004  21/1/2007
03041960 45678  3/4/1960    5/1/1995    10/2/1998   ----
ID              DOB         Exam 1      Exam 2      Diagnosis date
001             15/7/1972   15/8/2000   15/11/2004  15/1/2007
002             15/4/1960   15/1/1995   15/2/1998   ----
Allocated ID    DOB         Exam 1      Exam 2      Diagnosis date
1023            15/10/1972  15/11/2000  15/2/2005   15/4/2007
4567            15/1/1960   15/11/1994  15/12/1997  ----
FINAL DATA
Allocated ID    DOB         Exam 1      Exam 2      Diagnosis date
1023            1972        15/11/2000  15/2/2005   15/4/2007
4567            1960        15/11/1994  15/12/1997  ----
21. Fuzzification – alter the data
Original ID     DOB         Exam 1      Exam 2      Diagnosis date
01071972 23456  1/7/1972    2/8/2000    10/11/2004  21/1/2007
03041960 45678  3/4/1960    5/1/1995    10/2/1998   ----
FINAL DATA
Allocated ID    DOB         Exam 1      Exam 2      Diagnosis date
1023            1972        15/11/2000  15/2/2005   15/4/2007
4567            1960        15/11/1994  15/12/1997  ----
22. Assessing the risk of reidentification
• ARX tool
• Quantifies risk of re-identification based on
uniqueness
• Prosecutor scenario: assumes the attacker knows the person is in the dataset
• Classify variables as identifiable, quasi-identifiable
or sensitive
Prasser F, Kohlmayer F, Lautenschlager R, Kuhn KA. ARX--A
Comprehensive Tool for Anonymizing Biomedical Data. AMIA Annu
Symp Proc. 2014;2014:984-93.
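To make the uniqueness idea concrete: under the prosecutor scenario, the risk for a record is commonly taken as 1 divided by the number of records that share its combination of quasi-identifiers. The sketch below computes that directly in plain Python. It is not ARX's API (ARX is a separate Java tool), and the records and field names are made up for illustration.

```python
from collections import Counter


def prosecutor_risks(records, quasi_identifiers):
    """Per-record risk = 1 / size of the group of records that share
    the same values on all quasi-identifiers."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    group_sizes = Counter(keys)
    return [1 / group_sizes[k] for k in keys]


# Toy records: "result" is treated as sensitive, the other two as quasi-identifiers.
records = [
    {"birth_year": 1972, "first_exam": "2000-08", "result": "normal"},
    {"birth_year": 1972, "first_exam": "2000-08", "result": "abnormal"},
    {"birth_year": 1960, "first_exam": "1995-01", "result": "normal"},
]

risks = prosecutor_risks(records, ["birth_year", "first_exam"])
highest_risk = max(risks)
average_risk = sum(risks) / len(risks)
```

The third record is unique on its quasi-identifiers, so its risk is 1; the first two share a group of size two and each carry a risk of 0.5.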
23. Assessing the reidentification risk
• D1. Realistic dataset
• D2. k-anonymization of dataset D1
• changing all dates in the dataset to 15th of the month
• D3. Fuzzifying the month in D2
• by adding a random factor between -4 and +4 months to each
month.
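The three dataset variants can be mimicked on a handful of invented records to see what each step changes. This is a toy sketch under assumptions: the dates are made up, dates are reduced to a month index once the day is fixed at 15, and the fraction of unique records is used as a crude stand-in for the risk measure. It does not reproduce the paper's results.

```python
import random
from collections import Counter
from datetime import date


def month_index(d):
    """Year and month as one integer; the day is dropped (i.e. fixed at 15)."""
    return d.year * 12 + d.month - 1


def unique_fraction(rows):
    """Share of rows whose combination of values occurs only once."""
    counts = Counter(rows)
    return sum(1 for row in rows if counts[row] == 1) / len(rows)


# D1: exact exam dates for four hypothetical women.
d1 = [
    (date(2000, 8, 2), date(2004, 11, 10)),
    (date(2000, 8, 20), date(2004, 11, 3)),
    (date(1995, 1, 5), date(1998, 2, 10)),
    (date(1995, 1, 25), date(1998, 2, 1)),
]

# D2: all days set to the 15th, so only year and month distinguish dates.
d2 = [tuple(month_index(d) for d in row) for row in d1]

# D3: one random offset between -4 and +4 months per woman, applied to all her dates.
rng = random.Random(2018)
d3 = []
for row in d2:
    offset = rng.randint(-4, 4)
    d3.append(tuple(m + offset for m in row))

# Records an attacker holding the true months could still match exactly.
exact_matches = sum(a == b for a, b in zip(d2, d3))
```

In this toy data every record is unique in D1 and none is unique in D2. In D3 a record still equals its true month-level values only when its offset happens to be 0, while the interval between a woman's exams is the same as in D2.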
26. Fuzzification – WHAT helps?
Ursin et al., Cancer Epidemiology Biomarkers Prevention 2017
27. Reidentification risk
• Simple step reduces the risk of reidentification
• Adding a fuzzy factor makes reidentification even
more difficult
28. Current regulations
The degree of personal identification shall not be greater than
necessary for the purpose in question. The degree of
personal identification shall be justified. The supervisory authority may
require the data controller to present the
justification.
• Helseregisterloven §6
EU – GDPR:
Data Protection Impact Assessment
Article 35
29. Current practice - examples
Cancer Registry: Restrictive with dates
Helseregisterloven §6
Prescription Registry: Restrictive with dates
§4 "Prohibition on simultaneous access"
(Difference days (differansedager) = synthetic dataset)
Statistics Norway: ?
Common guidelines - and
better solutions - needed!
Income?
32. The researchers need:
Safe analysis
of large linked data
(no reidentification threat)
- Rapid and seamless analyses
- Ability to check individual records
Need national platforms that can do it all!
33. Thank you
Fuzzy paper:
Mari Nygård
Sagar Sen
Jean-Marie Mottu
Discussions with:
Jan Nygård
Bjørn Møller
Hilde Olav
Johanne Gulbrandsen
Datautleveringsenheten
Livmorhalsprogrammet
Editor's notes
We have a lot of health data. First and foremost, many health registries. 17 central health registries:
The birth registry, cause of death registry, Norwegian Patient Registry, prescription registry, cancer registry, etc.
Then quality registries for various diseases.
Data collected in order to map…..
In addition, other health data