Cylcia Bolibaugh spoke about reproducibility, open data and GDPR at the first Open Data in Practice event at the University of York on 15 November 2018.
2. Data sharing in Education
EROS (Education Researchers for Open Science)
(UYSEG, CRESJ, PERC, CReLLU)
• qualitative
• quantitative (experimental)
• quantitative (individual differences)
• Various goals for sharing data -- today’s
focus on reproducibility
– Verifiability of a publication’s findings -- data
and code
3. GDPR & Data Protection Act
complicate sharing of research data…
– Co-regulatory approach: a shift in accountability
from data protection authorities to data controllers
and data processors (us!)
– Adoption of open science practices hindered by
worries about compliance (funder and university
requirements, legal, ethical)
4. Personal data & identifiability
“‘personal data’ means any information relating to an
identified or identifiable natural person (‘data subject’);
an identifiable natural person is one who can be identified,
directly or indirectly, in particular by reference to an identifier
such as a name, an identification number, location data, an
online identifier or to one or more factors specific to the
physical, physiological, genetic, mental, economic, cultural
or social identity of that natural person”
5. The ‘motivated intruder’ test:
To determine whether a natural person is identifiable, account should
be taken of all the means reasonably likely to be used, such as
singling out, either by the controller or by another person to identify the
natural person directly or indirectly.
To ascertain whether means are reasonably likely to be used to identify
the natural person, account should be taken of all objective factors,
such as the costs of and the amount of time required for identification,
taking into consideration the available technology at the time of the
processing and technological developments. (Recital 26 EU GDPR)
6. Differentiating between personal
and anonymised data:
A balance between
(1) risk of disclosure/ re-identification
(2) consequences of disclosure (“perceived
value of the information”)
7. A toy dataset (Polish immigrants to the UK)
-- accuracy scores on language measure
-- reaction times on language measure
-- score on cognitive measure
-- score on cognitive measure
-- Age
-- Native language
-- Age of arrival to UK
-- Length of residence in UK
8. Assessing risk of reidentification (Klein et al., 2018)
Small population and
rare traits
Dyadic data
Hierarchical data (e.g.,
small subsamples of
students, co-workers)
Motivated intruder test
(e.g., jealous partner,
nosy neighbor, envious
co-worker, insurers,
criminals)
9. questions, questions…
1) do the biographical variables constitute indirect identifiers?
(1b) how can I systematically calculate the risk of re-identification (e.g. what is the
risk of reidentification for a Polish immigrant to the UK, based on their age, length of
residence in UK and age at time of immigration?)
(2) If there is only a very slight possibility that an individual could be indirectly
identified, is it still personal data?
(3) What if the perceived value of the information that might be linked to that
individual is actually quite low (e.g. how many milliseconds an individual took to
identify an English word, or their rating of how acceptable a particular phrase or
grammatical construction is)?
(4) How would one go about documenting their consideration of these factors?
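One possible starting point for question (1b) is sketched below in Python. It assumes the simple k-anonymity view of risk: group records by their combination of indirect identifiers and treat 1/k as the per-record risk, where k is the group size. The records are invented for illustration and are not from the study. The function names are hypothetical. Because this counts uniqueness within the sample only, it does not by itself say how unique a person is in the wider population.

```python
from collections import Counter

# Hypothetical toy records: (age, age_of_arrival, length_of_residence).
# These values are invented for illustration, not taken from the study.
records = [
    (34, 20, 14),
    (34, 20, 14),
    (41, 25, 16),
    (41, 25, 16),
    (41, 25, 16),
    (78, 5, 73),  # a rare combination that appears only once
]


def equivalence_class_sizes(rows):
    """Map each quasi-identifier combination to the number of rows sharing it."""
    return Counter(rows)


def reidentification_risks(rows):
    """Per-row risk under the simple 1/k model, where k is the class size."""
    sizes = equivalence_class_sizes(rows)
    return [1 / sizes[row] for row in rows]


sizes = equivalence_class_sizes(records)
risks = reidentification_risks(records)
k_anonymity = min(sizes.values())  # smallest class size in the sample
unique_rows = [row for row, n in sizes.items() if n == 1]

print("k-anonymity of the sample:", k_anonymity)
print("highest per-record risk:", max(risks))
print("sample-unique combinations:", unique_rows)
```

A table of class sizes like this could also serve as part of the documentation asked about in question (4).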
10. solutions?
Solution                                          Reproducibility   Open Data   Usability
Binning                                           ✗                 ✓✓          ✓✓✓
Permutation                                       ✓✗                ✓✓          ✓✓✓
K-anonymity tools (e.g. R package sdcMicro)       ✗                 ✓✓          ✓✓
Synthesized dataset (e.g. R package Synthpop)     ✓✓                ✗           ✓
Encrypted data with script (e.g. OSF)             ✓✓✓               ✗           ✓
Restricted access depository                      ✓✓✓               ✓✓✓         ✓✓
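As a rough illustration of the binning row, the Python sketch below coarsens two indirect identifiers into bands and compares the smallest group size (k) before and after. The records, the bin widths and the helper names are all assumptions made for this example. It shows why binning lowers disclosure risk, and also why it costs reproducibility: the exact values an analysis would need are gone.

```python
from collections import Counter

# Hypothetical (age, length_of_residence) pairs, invented for illustration.
records = [
    (31, 12), (33, 13), (34, 14), (36, 12), (38, 13), (39, 14),
    (42, 16), (45, 17), (47, 18),
]


def bin_value(value, width):
    """Replace a value with the lower bound of its bin, e.g. 34 -> 30 for width 10."""
    return (value // width) * width


def smallest_class(rows):
    """Return k: the size of the smallest group of identical rows."""
    return min(Counter(rows).values())


# Assumed bin widths: 10-year bands for age, 5-year bands for residence.
binned = [(bin_value(age, 10), bin_value(lor, 5)) for age, lor in records]

print("k before binning:", smallest_class(records))
print("k after binning:", smallest_class(binned))
print("distinct ages before:", len({age for age, _ in records}))
print("distinct ages after:", len({age for age, _ in binned}))
```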
11. OSF approved Protected Access
repositories which are GDPR compliant
- Research Data Center of the SOEP (DE)
- Datorium (DE)
- DataFirst (DE)
- PsychData (ZPID, Leibniz)
- University of Bristol Research Data
Repository
- The UK Data Service (ESRC)
12. Anonymisation
• Europe-wide standards for anonymisation are needed.
– OpenAIRE: the European Data Protection Board could issue
guidelines concerning anonymisation.
• Nationally, codes of conduct to differentiate between
personal and anonymised data.
– may only be binding for members
– involvement of umbrella orgs -- UKRN
• Institutionally, researcher-friendly guidance (decision
trees, case studies, tools for documentation of risk
assessment etc.)
Thanks!
Questions?
14. The Open Data badge is
earned for making publicly
available the digitally-
shareable data necessary
to reproduce the reported
results.
Editor's notes
Lack of clear procedural guidance and of precedent/case studies means that data controllers (i.e. researchers!) are understandably risk averse (risk being not only legal compliance, but also the time investment necessary to
(https://ico.org.uk/for-organisations/guide-to-the-general-data-protection-regulation-gdpr/what-is-personal-data/can-we-identify-an-individual-indirectly/)
If there is only a very slight possibility that an individual could be indirectly identified, is it still personal data?
You should assume that you are not looking just at the means reasonably likely to be used by an ordinary person, but also by a determined person with a particular reason to want to identify individuals.
The measures reasonably likely to be taken to identify an individual may vary depending upon the perceived value of the information.
https://www.york.ac.uk/library/info-for/researchers/data/sharing/#tab-3
“In practice, even sensitive and personal data may be shared ethically if care has been taken in anonymisation, suitable consent obtained, reuse conditions prudently planned and appropriate data access restrictions applied.”
From a project investigating whether there are differences in learning mechanisms between child and adult language learners: the minimal data required to model variability in the language attainment/proficiency of bilinguals as a function of their learning history (what age they started, how long their exposure has been) and the cognitive skills theorised to underlie particular learning mechanisms.
In this case, the biographical data are integral to the reproducibility of the analysis, and cannot be separated or binned etc without detriment to the reproducibility.
I have a sample of Polish immigrants, and data about their age at test, the age at which they arrived in the UK, and their length of residence. Is the combination of these indirect identifiers sufficient to reidentify an individual?
There are approximately 900,000 Polish immigrants to the UK, so my population is large and the risk of reidentification small. However, the sampling criteria (very advanced proficiency in English, and a minimum of 12 years' residence) likely increase that risk, but by how much? Finally, the risk is not evenly spread throughout the sample: consider WWII immigrants.
Depending on the answer to these questions, there are a variety of means by which data can be further anonymised, or other ways in which the data could be shared.
However, there is a tradeoff between increasing the availability of the dataset, and ensuring the reproducibility of analyses underlying a published output, which I tried to sketch here in a back of the envelope fashion.
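To make the tradeoff concrete, here is a minimal Python sketch of one option, permutation, under the assumption that it means shuffling an indirect identifier across participants. All values are invented for illustration. The shuffled column keeps the same set of values, so its mean is unchanged, but its pairing with the outcome measure is broken, so an analysis relating the two can no longer be reproduced from the shared file.

```python
import random
from statistics import mean

# Hypothetical values, invented for illustration.
age_of_arrival = [18, 20, 22, 24, 26, 28, 30, 32]
reaction_time_ms = [610, 640, 655, 690, 700, 735, 760, 790]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Shuffle a copy of the identifier column; a fixed seed keeps the sketch repeatable.
rng = random.Random(0)
permuted_age = list(age_of_arrival)
rng.shuffle(permuted_age)

print("mean age of arrival, original:", mean(age_of_arrival))
print("mean age of arrival, permuted:", mean(permuted_age))
print("correlation with RT, original:", round(pearson(age_of_arrival, reaction_time_ms), 3))
print("correlation with RT, permuted:", round(pearson(permuted_age, reaction_time_ms), 3))
```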
My feeling is that there is likely to be a bias toward placing data in restricted access repositories, even when the disclosure risk is relatively small. The problem with this solution, at least for language researchers, is that it eliminates the repositories that are most commonly used (IRIS, which is a repository specialised in materials for L2 research, and OSF, Figshare etc.).
If you are interested in obtaining an Open Data badge: a restricted access notation was added earlier this year, but only a small number of repositories have been certified. The first 4 on the list are in Germany, and relatively few
UKDA has an end user agreement
But in practice, the repositories most commonly used, OSF, Figshare, GitHub