1. Technical Aspects of Data Anonymisation & Pseudonymisation
Risks, Challenges & Mitigations
Matt Lewis
Principal Consultant
2. Agenda
NCC Group – who we are and what we do
Anonymisation, Pseudonymisation & Re-identification – overview of concepts
Examples – when anonymisation goes wrong
Pitfalls of image anonymisation and other information leakage
through meta-data
A risk-based approach to anonymisation
Summary and advice
Questions
7/15/2013 © NCC Group 2
3. NCC Group
Global information assurance specialist
15,000 customers worldwide across all sectors
The Group has two complementary divisions - escrow and
assurance
Independence from hardware and software providers ensures we
provide unbiased and impartial advice
Largest penetration testing team in the world, with approximately
250 consultants
4. Me: Brief Bio
Over 12 years working in Information Security
Previous Employers:
• CESG – the Information Assurance arm of GCHQ
• Information Risk Management (IRM) plc – penetration testing
• KPMG – Executive Advisor in the Information Protection division of IT Advisory
• NCC Group – Principal Consultant, providing penetration testing and consultancy around all aspects of Information Security
5. Anonymisation – Overview
Anonymised data should be information that does not identify any
individuals, either in isolation or when cross-referenced with other
data already in the public domain
A careful balance is required around the level of anonymisation
versus the usefulness of the resultant data
Quantitative versus Qualitative – the latter is harder to anonymise in a consistent way, and requires more rigour on a "per record" basis – e.g. meeting minutes
6. Pseudonymisation – Overview
Information is anonymous to the receiver (e.g. researchers), but
contains codes or identifiers to allow others to re-identify
individuals from the pseudonymised data
Universally protecting pseudonymised data whilst allowing general analysis of it is difficult – requires careful management of the "codes" or "keys" that uniquely identify individuals
Quantitative versus Qualitative – again, the latter is harder to pseudonymise in a consistent way, and requires more rigour on a "per record" basis – e.g. meeting minutes
7. Anonymisation – Techniques and Methods
There are four main operations available for anonymising data:
Suppression, Substitution/Distortion, Generalisation and Aggregation
Consider the following dataset:
Name Sex Birth Date Post Code Complaint
John Male 02/12/1954 SE24 6TY Pain in left eye
Daniel Male 05/01/1984 NW1 6XD Chest pains
Sarah Female 04/08/1978 E17 7WE Chest pains
Samantha Female 03/10/1960 WC1 7RA Back pains
James Male 09/09/1990 NW7 5LK Headaches
8. Anonymisation – Techniques and Methods
Suppression - deleting or omitting data fields entirely
Sex Complaint
Male Pain in left eye
Male Chest pains
Female Chest pains
Female Back pains
Male Headaches
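As a minimal sketch (not from the original deck; field names follow the example table), suppression amounts to keeping only a whitelist of fields per record:

```python
# Suppression: delete or omit identifying fields entirely,
# keeping only the fields needed for analysis.
records = [
    {"Name": "John", "Sex": "Male", "BirthDate": "02/12/1954",
     "PostCode": "SE24 6TY", "Complaint": "Pain in left eye"},
    {"Name": "Daniel", "Sex": "Male", "BirthDate": "05/01/1984",
     "PostCode": "NW1 6XD", "Complaint": "Chest pains"},
]

KEEP = {"Sex", "Complaint"}  # fields retained after suppression

def suppress(record, keep=KEEP):
    """Return a copy of the record with all non-retained fields removed."""
    return {field: value for field, value in record.items() if field in keep}

suppressed = [suppress(r) for r in records]
print(suppressed[0])  # {'Sex': 'Male', 'Complaint': 'Pain in left eye'}
```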
9. Anonymisation – Techniques and Methods
Substitution/Distortion – e.g. replace a person's name with a unique number – this is also an example of pseudonymisation
Name Sex Birth Date Post Code Complaint
0000001 Male 02/12/1954 SE24 6TY Pain in left eye
0000002 Male 05/01/1984 NW1 6XD Chest pains
0000003 Female 04/08/1978 E17 7WE Chest pains
0000004 Female 03/10/1960 WC1 7RA Back pains
0000005 Male 09/09/1990 NW7 5LK Headaches
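The substitution step above can be sketched as follows (illustrative code, not the deck's method); note that the name-to-code mapping itself must be stored securely, since it permits re-identification:

```python
# Substitution/pseudonymisation: replace each name with a unique
# zero-padded code, as in the table above. The lookup table is the
# "key" that allows re-identification, so it must be protected.
names = ["John", "Daniel", "Sarah", "Samantha", "James"]

pseudonyms = {}  # secret lookup table: name -> code
for i, name in enumerate(names, start=1):
    pseudonyms[name] = f"{i:07d}"

print(pseudonyms["John"])   # 0000001
print(pseudonyms["James"])  # 0000005
```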
10. Anonymisation – Techniques and Methods
Generalisation - alter rather than delete identifier values to
increase privacy while preserving utility
Name Sex Birth Year Post Code Complaint
John Male 1954 SE24 Pain in left eye
Daniel Male 1984 NW1 Chest pains
Sarah Female 1978 E17 Chest pains
Samantha Female 1960 WC1 Back pains
James Male 1990 NW7 Headaches
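A sketch of the generalisation shown above (assuming DD/MM/YYYY dates and UK post codes, as in the example table):

```python
# Generalisation: coarsen identifier values rather than delete them -
# full birth date -> birth year, full post code -> outward code only.
def generalise(record):
    day, month, year = record["BirthDate"].split("/")
    return {
        "Name": record["Name"],
        "Sex": record["Sex"],
        "BirthYear": year,
        "PostCode": record["PostCode"].split()[0],  # "SE24 6TY" -> "SE24"
        "Complaint": record["Complaint"],
    }

john = {"Name": "John", "Sex": "Male", "BirthDate": "02/12/1954",
        "PostCode": "SE24 6TY", "Complaint": "Pain in left eye"}
print(generalise(john))  # BirthYear '1954', PostCode 'SE24'
```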
11. Anonymisation – Techniques and Methods
Aggregation - produce summary statistics across a dataset
instead of an anonymised dataset
40% of patients complain of chest pains
60% of patients are male
etc.
Name Sex Birth Date Post Code Complaint
John Male 02/12/1954 SE24 6TY Pain in left eye
Daniel Male 05/01/1984 NW1 6XD Chest pains
Sarah Female 04/08/1978 E17 7WE Chest pains
Samantha Female 03/10/1960 WC1 7RA Back pains
James Male 09/09/1990 NW7 5LK Headaches
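The aggregate figures above can be computed directly from the dataset; a small sketch (not from the deck):

```python
# Aggregation: publish summary statistics instead of row-level data.
from collections import Counter

records = [
    ("Male", "Pain in left eye"), ("Male", "Chest pains"),
    ("Female", "Chest pains"), ("Female", "Back pains"),
    ("Male", "Headaches"),
]

n = len(records)
sexes = Counter(sex for sex, _ in records)
complaints = Counter(complaint for _, complaint in records)

pct_chest = 100 * complaints["Chest pains"] // n
pct_male = 100 * sexes["Male"] // n
print(f"{pct_chest}% of patients complain of chest pains")  # 40%
print(f"{pct_male}% of patients are male")                  # 60%
```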
13. Anonymisation of Qualitative Data
Pseudonymisation in qualitative data can be much more difficult
The content/themes in this meeting, for example, might allow for re-identification of any pseudonymised individuals
14. Re-identification – What is it?
Re-identification is the act of cross-referencing anonymised data with
other data sources, and using inference, deduction and correlation to
identify individuals
Depending on the nature of data re-identified, this might raise data
protection concerns
15. Re-identification – Who does this and Why?
Researchers – e.g. computer scientists, genuinely interested in the challenges of re-identification
Malicious individuals – use re-identification to discriminate, harass or discredit a victim
Investigative journalists
Organised crime – re-identification can facilitate creation of fake identities, or be used to extort victims (if data is personal/sensitive in nature)
Competitors – seeking to re-identify and publish to discredit
State-sponsored data mining and correlation
The Internet is essentially a vast, ever-growing cross-correlation database, access to most of which is open and free to anyone…
17. Inference as a Starting Point
Recall our example:
Name Sex Birth Date Post Code Complaint
John Male 02/12/1954 SE24 6TY Pain in left eye
Daniel Male 05/01/1984 NW1 6XD Chest pains
Sarah Female 04/08/1978 E17 7WE Chest pains
Samantha Female 03/10/1960 WC1 7RA Back pains
James Male 09/09/1990 NW7 5LK Headaches
18. Inference as a Starting Point
Suppose the following anonymised aggregations are published:
60% of patients complain of chest pains
60% of patients are male
100% of Back pain sufferers live in WC1
100% of patients are over 21 years of age
100% of females suffer from chest or back pains
20% of patients suffer from Pain in the left eye
From this we can infer the following table fields:
Sex, Age, Condition, Post Code
19. Inference as a Starting Point
Suppose we know the sample size (i.e. 5), and the time of data
publication
60% of patients are male
Sex Birth Date Complaint Post Code
Male
Male
Male
Female
Female
20. Inference as a Starting Point
100% of patients are over 21 years of age
Sex Birth Date Complaint Post Code
Male <= 1992
Male <= 1992
Male <= 1992
Female <= 1992
Female <= 1992
21. Inference as a Starting Point
100% of females suffer from chest or back pains
Sex Birth Date Complaint Post Code
Male <= 1992
Male <= 1992
Male <= 1992
Female <= 1992 Chest Pains
Female <= 1992 Back Pains
22. Inference as a Starting Point
100% of Back pain sufferers live in WC1
Sex Birth Date Complaint Post Code
Male <= 1992
Male <= 1992
Male <= 1992
Female <= 1992 Chest Pains
Female <= 1992 Back Pains WC1
23. Inference as a Starting Point
20% of patients suffer from Pain in the left eye
Sex Birth Date Complaint Post Code
Male <= 1992 Pain in left eye
Male <= 1992
Male <= 1992
Female <= 1992 Chest Pains
Female <= 1992 Back Pains WC1
24. Inference as a Starting Point
60% of patients complain of chest pains
The next step would be to correlate/cross-reference with other
sources
Sex Birth Date Complaint Post Code
Male <= 1992 Pain in left eye
Male <= 1992 Chest Pains
Male <= 1992 Chest Pains
Female <= 1992 Chest Pains
Female <= 1992 Back Pains WC1
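The very first inference step – turning published percentages plus a known sample size back into absolute counts – can be sketched as follows (illustrative code; the aggregate labels are paraphrased from the slides above):

```python
# Given the sample size, each published percentage pins down an exact
# count of records with that property - the seed of the inference chain.
sample_size = 5
published = {
    "male": 60,               # "60% of patients are male"
    "pain in left eye": 20,   # "20% of patients suffer from pain in the left eye"
    "over 21": 100,           # "100% of patients are over 21 years of age"
}

counts = {fact: pct * sample_size // 100 for fact, pct in published.items()}
print(counts)  # {'male': 3, 'pain in left eye': 1, 'over 21': 5}
```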
25. An Example: Massachusetts Group Insurance Commission (GIC)
In the mid-1990s GIC released anonymised data on state employees that
showed every single hospital visit
The goal was to help researchers; the state spent time removing all
obvious identifiers such as name, address, and Social Security number
William Weld, then Governor of Massachusetts, assured the public that
GIC had protected patient privacy by deleting identifiers
Latanya Sweeney, then a computer science graduate student, requested a copy of the data and performed re-identification research on the dataset
Main anonymisation technique used: Suppression
26. An Example: Massachusetts Group Insurance Commission (GIC)
Governor Weld lived in Cambridge, MA – 54,000 residents across 7 post codes
For $20, Dr. Sweeney purchased the electoral roll, which contained name, address, post code, birth date, sex etc.
Cross-referencing the electoral roll with the GIC anonymised data:
• Only 6 people in Cambridge shared Weld's birthday
• Only 3 of these were men
• Only 1 lived in Weld's post code
Dr. Sweeney sent the Governor's health records (which included diagnoses and prescriptions) to his office
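The core of this linkage attack – joining an "anonymised" dataset against a public register on shared quasi-identifiers – can be sketched with entirely made-up records (not the real GIC or electoral-roll data):

```python
# Linkage attack sketch: join anonymised health rows against a public
# register on the quasi-identifiers birth date, sex and post code.
health = [  # "anonymised": names removed, quasi-identifiers retained
    {"birth": "01/03/1950", "sex": "M", "postcode": "CB1", "diagnosis": "X"},
    {"birth": "09/11/1962", "sex": "F", "postcode": "CB2", "diagnosis": "Y"},
]
register = [  # public register: names attached to the same quasi-identifiers
    {"name": "Alice Example", "birth": "09/11/1962", "sex": "F", "postcode": "CB2"},
    {"name": "Bob Example", "birth": "01/03/1950", "sex": "M", "postcode": "CB1"},
]

def link(health_rows, register_rows):
    """Attach names to health rows whose quasi-identifiers match the register."""
    key = lambda r: (r["birth"], r["sex"], r["postcode"])
    lookup = {key(r): r["name"] for r in register_rows}
    return [dict(row, name=lookup[key(row)])
            for row in health_rows if key(row) in lookup]

print(link(health, register)[0]["name"])  # Bob Example
```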
27. Another Example: America Online (AOL) Data Release
In 2006, AOL publicly released twenty million search queries for 650,000 users of AOL's search engine, summarising three months of activity
AOL suppressed username and IP address, but replaced these
with unique numbers that allowed researchers to correlate different
searches with a specific user (pseudonymisation)
New York Times reporters Michael Barbaro and Tom Zeller performed some research around User 4417749's identity. His/her searches had included:
• "landscapers in Lilburn, Ga"
• "several people with the last name Arnold"
• "homes sold in shadow lake subdivision gwinnett county georgia"
28. Another Example: America Online (AOL) Data Release
The reporters tracked down Thelma Arnold, a sixty-two-year-old widow from Lilburn, Georgia, who acknowledged that she had authored the searches, including queries such as
• "numb fingers", "60 single men" and "dog that urinates on everything"
Main anonymisation techniques used: Suppression and Substitution
29. Anonymisation of non-textual data and Metadata
Anonymisation might be required of non-textual data. E.g. images
and videos (obfuscating faces)
This might require release of hundreds or thousands of
anonymised files, rather than one large dataset
Often, hidden meta-data within those files is forgotten, and can be
a valuable source for individuals attempting re-identification
30. Anonymisation of non-textual data and Metadata
A simple example â a picture of a
heron visiting my garden
Suppose I obfuscate the heron to
protect its identity
33. A Risk-Based Approach
There is anonymisation guidance, but there is no anonymisation
formula
Anonymisation is not an exact science
Each data set presents a unique instance, and the choice of
anonymisation operation(s) must be carefully considered in order
to maintain anonymity and utility
A risk-based approach is therefore the only option…
34. Risk Mitigation Advice
Consult with experts before embarking on anonymisation, on the
proposed approach and potential risks
If there is no business case or perceived benefit in going through the process, then don't
Consider release to limited audiences â only go public if strictly
required
Protect the anonymisation method/formula
Quantitative anonymisation can typically be automated; qualitative anonymisation will require more manual effort
Always vet anonymisations of quantitative and qualitative data before release – don't just fire and forget
35. Risk Mitigation Advice
Perform your own rudimentary Google searches and correlation
attempts with other public data sources before publishing
If in doubt, engage again with experts on the likelihood of re-identification given the derived anonymised data set
Consider the quantity of released anonymised data - in practically
all re-identification studies performed, researchers have been more
successful with larger databases
Try to remove one or more of the top 3 culprits: Post Code, Birth Date and Sex.
âą In 2000 Dr. Latanya Sweeney showed that 87% of all Americans could
be uniquely identified using only these three bits of information
Metadata – prior to release, ensure all metadata in documents is removed. A number of tools exist for this, depending on the document type (e.g. Adobe Acrobat, Microsoft Office, JPEGs etc.)
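One way to vet a dataset before release is a simple k-anonymity check over the quasi-identifiers named above (post code, birth date, sex). This sketch is not from the deck; the function name and threshold are illustrative:

```python
# k-anonymity check: every combination of quasi-identifier values should
# be shared by at least k records; smaller groups stand out and are at
# higher risk of re-identification.
from collections import Counter

def risky_groups(rows, quasi_identifiers, k=2):
    """Return quasi-identifier combinations shared by fewer than k records."""
    combos = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return {combo: n for combo, n in combos.items() if n < k}

rows = [
    {"postcode": "SE24", "birth_year": 1954, "sex": "M"},
    {"postcode": "NW1", "birth_year": 1984, "sex": "M"},
    {"postcode": "NW1", "birth_year": 1984, "sex": "M"},
]
print(risky_groups(rows, ["postcode", "birth_year", "sex"]))
# {('SE24', 1954, 'M'): 1} - the first record is unique, hence re-identifiable
```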
36. Risk Mitigation Advice - Pseudonymisation
Don't choose reversible identifiers – for example, names replaced with identifiers that are derived from record data:
John Smith, 05/01/1978 -> JS05011978
Be aware of potential inferences from sorted data (e.g. alphabetical
ordering might provide clues for re-identification)
Keep the pseudonymisation formula secret – protect it with the same controls as for encryption keys and passwords
Perform pseudonymisation functions in segregated, secure environments; only copy/migrate the pseudonymised data (keep them separate)
Remove all metadata from pseudonymised data files – make sure the individual(s) performing the pseudonymisation are not referenced in the metadata
Any cryptographic hashing used as identifiers/keys should always
be salted
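For pseudonym identifiers, the "salt" needs to act like a dataset-wide secret key – a keyed hash such as an HMAC – so the same individual always maps to the same code; a fresh random salt per record would destroy linkability. A minimal sketch (stdlib only; not the deck's formula):

```python
# Keyed ("salted") hashing for pseudonyms: without the secret key, an
# attacker cannot brute-force names back from the codes. Protect the
# key with the same controls as encryption keys and passwords.
import hashlib
import hmac
import secrets

secret_key = secrets.token_bytes(32)  # generate once, store securely

def pseudonym(name: str, key: bytes) -> str:
    """Deterministic keyed-hash pseudonym for a name."""
    return hmac.new(key, name.encode("utf-8"), hashlib.sha256).hexdigest()[:12]

code = pseudonym("John Smith", secret_key)
# Unlike "JS05011978", the code reveals nothing about the record data,
# yet the same name under the same key always yields the same code.
assert code == pseudonym("John Smith", secret_key)
```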
37. References
ICO guidance on anonymisation: http://www.ico.org.uk/for_organisations/data_protection/topic_guides/anonymisation
Paper on the failures of anonymisation, Paul Ohm: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006
Research/blog on anonymisation: http://33bits.org/