Data Sharing with ICPSR was presented at IASSIST 2015 in Minneapolis, MN.
The learning objectives and content cover:
• Federal data sharing requirements and
other good reasons to share data
• Options for sharing data
• Protection of confidentiality when
sharing data
• Data discovery tools
• Online data exploration tools from ICPSR
1. Data Sharing with ICPSR: Fueling the Cycle
of Science through Discovery, Access
Tools, and Long-Term Availability
Johanna Bleckman and Kaye Marz, ICPSR
IASSIST 2015
Minneapolis, MN
June 2, 2015
2. Learning objectives
You will become more familiar with:
• Federal data sharing requirements and
other good reasons to share data
• Options for sharing data
• Protection of confidentiality when
sharing data
• Data discovery tools
• Online data exploration tools from ICPSR
4. Why share data?
• Federal data sharing requirements allow
effective use of federal agency resources
– Office of Science and Technology Policy
memorandum (2013)
– National Science Foundation (2008)
Requirement for Data Management Plans in 2010
– National Institutes of Health (2003)
Requirement for Data Sharing Plans included
– National Institute of Justice (1978)
5. Why share data?
• Benefits to the social science community (NIH, 2003)
– Reinforces open scientific inquiry
– Encourages diversity of analysis and opinion
– Promotes new research, testing of new hypotheses and
methods of analysis
– Supports studies on data collection methods and
measurement
– Facilitates education of new researchers
– Enables exploration of topics not envisioned by initial
investigators
– Permits creation of datasets combined from multiple sources
6. Why share data?
• Benefits to the researcher
– Data in the public domain generate new research which
cites the original research
– Data are preserved and can be obtained by the original
researchers if their copies are lost or destroyed
– Archiving data helps researchers meet requirements of
NIH and NSF data management plans and NIJ data
archiving special conditions requirements
– Information on research community’s interest in a dataset
can assist with the success of future grant funding
proposals
7. • Lost in a technical malfunction
• Destroyed in a flood in the
department
• Files were on the university
server but are long gone
• Data are kept for 10 years
beyond the last time used
• Purged after retirement
• Institutional review boards
required the data be destroyed
after the project
Where are the data now?
Source: Pienta, Amy, Myron Gutmann, &
Jared Lyle. 2009. “Research Data in The
Social Sciences: How Much is Being
Shared?” Research Conference on Research
Integrity, Niagara Falls, NY.
8. Sharing data…more publications!
Table. Median Number of Publications by Data Sharing Status

                             Data      Data Shared  Data Not
                             Archived  Informally   Shared    Total
                             (n=111)   (n=415)      (n=409)   (n=935)
Primary PI Publications         6          6           3         4
Secondary Publications          8          6           3         5
Publications with Students      4          3           1         2
Source: Pienta, Amy M., George Alter, and Jared Lyle. 2010. “The Enduring Value of Social Science Research: The Use
and Reuse of Primary Research Data.” Presented at the BRICK, DIME, STRIKE Workshop, The Organisation,
Economics, and Policy of Scientific Research, Turin, Italy, April 23‐24, 2010 (http://hdl.handle.net/2027.42/78307)
9. Data deposited
in archive
Discovered and
accessed from
archive
Analyzed and
results
published
New datasets or
research ideas
New data
collected
Cycle of
Science
15. How do ICPSR and openICPSR compare
to other data service providers?
• Have professional data curators, who are experts in
developing metadata (tags) for the social and behavioral
sciences, to review deposited materials
• Provide an immediate distribution network of over 760
institutions looking for research data, powerful search
tools, and a data catalog indexed by major search engines
• Sustained by a respected organization with over 50 years of
experience in reliably protecting research data
• Prepared to accept and disseminate sensitive and/or
restricted-use data
16. How is openICPSR different from ICPSR?
openICPSR:
• Sustained by a deposit fee charged to individuals from
non-member institutions ($600/project)
• Data freely available to the public (or for a nominal
charge for restricted-use data)
• Most data available only in the raw (bit-level) form as
deposited and described by the depositor; may be fully
curated if the Professional Curation option was chosen and
the quoted fee paid for ICPSR’s curation and/or disclosure
review services
ICPSR:
• Sustained by institutional member fees; depositors are not
charged to deposit data
• Data are freely available to individuals affiliated with
member institutions; non-members pay an access fee; both
may pay a fee to access restricted-use data in the VDE
• Data are fully curated, including professional processing,
value-added documentation, and renderings in popular
statistical programs and online analysis
17. Who might use openICPSR?
• Researchers required to share data freely with the
public to comply with grant/contract
requirements
• Authors required to share data for replication
purposes to comply with journal requirements
• Researchers required to share sensitive data or
data with disclosure risk with the public from a
secure digital environment
• Researchers, including students, who want to
share data publicly as good practice or for the
purposes of replication
18. Deposit agreements
• Holds implicit or explicit copyright and has the right to
make it publicly available through ICPSR
• Permits ICPSR to redisseminate, promote, catalog,
reformat, store, and preserve the data collection
• Permits ICPSR to transform or enhance the collection to
protect confidentiality and for usability
• Has removed all direct identifiers and done due diligence
to prevent disclosure of subject identities
• Holds ICPSR and University of Michigan harmless from
liability for breaches of subject confidentiality or invasions
of privacy
19. openICPSR deposit terms include more
• Deposited work will be distributed under an Attribution
4.0 Creative Commons License
• Depositor has institutional approval to share the data
collection
• Data collection will be preserved as-is and available at no
cost to data users
• Research subjects have consented to sharing the data
and/or the depositor has institutional approval to share
the data
20. openICPSR: Public-access sharing
solution for Institutions and Journals
• Branded repository to represent and showcase their
research data using a fully-hosted (cloud) service
• Able to fulfill grant requirements that will pass an audit
• No need to have technical staff or equipment to manage
the repository
• Easy and clean interface for deposit and access
• Administrative (usage) reporting available 24/7
• Able to assure the Research Administration Office that
data are safe and secure for the long term
21. openICPSR: Examples of branded sites and services
Your logo. Your colors.
A unique URL.
On-demand deposit.
On-demand reports.
22. Should the data be restricted?
• Re-identification
–Can individuals be identified from information in this material?
–If the data were made public, could someone use a
combination of variables (e.g. age, sex, race, occupation,
geography) to find individuals in a publicly available
database?
• Harm
–Does this material include sensitive information?
–Would the release of individually identifiable information
create a risk of harm (e.g. psychological distress, social
embarrassment, financial loss) greater than the risks that
people experience in everyday life?
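The re-identification question above (could a combination of variables single someone out in a public database?) can be checked mechanically. Here is a minimal sketch in Python, using hypothetical toy records, that counts how many records are unique on a chosen set of quasi-identifiers:

```python
from collections import Counter

def unique_on(records, quasi_ids):
    """Count records whose combination of quasi-identifier
    values is unique in the dataset (k-anonymity with k = 1)."""
    combos = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return sum(1 for r in records
               if combos[tuple(r[q] for q in quasi_ids)] == 1)

# Hypothetical toy records, for illustration only
people = [
    {"age": 34, "sex": "F", "zip": "48104"},
    {"age": 34, "sex": "F", "zip": "48104"},
    {"age": 71, "sex": "M", "zip": "48104"},
]
print(unique_on(people, ["age", "sex", "zip"]))  # 1 record is unique
```

Any record counted here is potentially identifiable if the same variables also appear in a publicly available database.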
23. Identifiers
• Direct
– Names
– Addresses, including ZIP codes and other postal codes
– Telephone and fax numbers, including area codes
– Social security numbers
– Other linkable numbers such as driver’s license numbers,
certification numbers, medical device numbers, etc.
• Indirect
– Detailed geographic information
– Organizations belonged to, offices or posts held by respondent
– Educational institutions attended, year of graduation
– Detailed occupational titles
– Place where respondent was born or grew up
– Exact dates of significant life events (birth, death, marriage, etc.)
– Detailed income
Direct identifiers must be removed from
data before sharing! Data with indirect
identifiers may need to be restricted.
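As a concrete illustration (field names here are hypothetical, not an ICPSR schema), stripping direct identifiers before deposit can be as simple as dropping the offending fields:

```python
# Hypothetical direct-identifier field names; adjust to the actual data
DIRECT_IDENTIFIERS = {"name", "address", "zip_code", "phone", "fax",
                      "ssn", "drivers_license", "medical_device_id"}

def strip_direct_identifiers(record):
    """Return a copy of the record with direct-identifier fields removed."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

respondent = {"name": "Jane Doe", "ssn": "000-00-0000",
              "age": 42, "occupation": "teacher"}
print(strip_direct_identifiers(respondent))  # {'age': 42, 'occupation': 'teacher'}
```

Removing direct identifiers is only the first step; the indirect identifiers that remain still need the disclosure review described below.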
24. What sensitive topics could cause harm?
• Psychological well being or mental health
• Sexual attitudes, preferences, or practices
• Use of alcohol or drugs
• Illegal behavior
• Behavior that puts the respondent at risk of criminal or
civil liability
• Behavior damaging to an individual’s financial standing,
employability, or reputation
• Medical information that could lead to discrimination,
stigmatization, etc.
25. Let’s make a deposit with openICPSR
www.openicpsr.org
27. Common Objection/Misperception:
“My data are too sensitive to share. . .”
• Many restricted data files can be processed for public
release
• ICPSR has been sharing restricted-use data for over a
decade via three methods:
– Secure Download
– Virtual Data Enclave
– Physical Enclave
• ICPSR stores & shares over 6,400 restricted-use datasets
associated with over 2,000 ‘active’ restricted-use data
contracts
28. Virtual Data Enclave
• Virtual machine launched from your desktop machine
• Functions just like your local machine, but with
restrictions on what can enter and exit the
environment
• Full suite of statistical packages and other software
31. Examples of Disclosure Risk Concerns
• Tabular output including demographics or unique
characteristics/activities, resulting in small cell sizes
• Geographic information (zip codes, cities/towns, counties)
• Geocodes, GIS data, maps
• Longitudinal data – data from research subjects at multiple time
points
• Verbatim text – interviews, video transcriptions, short answer
responses
• Photos, videos, audio recordings
32. Common Rules for Tables
Each cell size is sufficiently large to prevent identifiability
• Establish minimum threshold (often 3, 5, or 10), dependent on
type and context of the data
• Rules for cells, rows, and columns:
– All cases in any row or column should not be in a single cell
– Cell percentage should not correspond to a cell size less
than a threshold number
– A cell should not be a high percent of all cases included in a
row or column (more than 60%)
• Combinations of tables should also meet guidelines
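The rules above lend themselves to an automated check. Here is a sketch (the threshold and dominance share are illustrative defaults, not fixed rules) that flags a two-way frequency table failing the cell, row, or column criteria:

```python
def table_passes(table, threshold=5, max_share=0.60):
    """Check a two-way frequency table (a list of rows of counts):
    every nonzero cell meets the minimum threshold, and no cell
    holds more than max_share of its row or column total."""
    cols = list(zip(*table))
    for i, row in enumerate(table):
        for j, cell in enumerate(row):
            if 0 < cell < threshold:
                return False                      # small cell
            row_total, col_total = sum(row), sum(cols[j])
            if row_total and cell / row_total > max_share:
                return False                      # dominates its row
            if col_total and cell / col_total > max_share:
                return False                      # dominates its column
    return True

print(table_passes([[12, 9], [8, 11]]))   # True
print(table_passes([[2, 18], [15, 5]]))   # False: a cell of 2 is below threshold
```

A full review would also apply the check across combinations of tables, since differencing two passing tables can still reveal a small cell.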
33. Disclosure Risk Protections (DRP)
• Release data from only a sample of the population
• Remove/mask obvious identifiers and high-risk variables
• Limit geographic detail
• Limit the number and detailed breakdown of categories
• Top or bottom coding
• Recode into intervals or round values
• Addition of noise
• Swap records
• Blank and impute
• Aggregate and replace with mean value (aka blurring)
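Two of the protections listed, top coding and recoding into intervals, can be sketched in a few lines of Python (the cap and interval width are illustrative choices, not recommendations):

```python
def top_code(value, cap):
    """Top-code: collapse extreme values down to the cap."""
    return min(value, cap)

def recode_interval(value, width):
    """Recode an exact value to the lower bound of its interval."""
    return (value // width) * width

# Hypothetical income values: cap at 150,000, then report in 10,000 bands
incomes = [18_000, 52_000, 95_000, 1_450_000]
protected = [recode_interval(top_code(x, 150_000), 10_000) for x in incomes]
print(protected)  # [10000, 50000, 90000, 150000]
```

The outlier of $1.45 million, which alone might identify a respondent, becomes indistinguishable from every other top-coded case.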
36. Documenting Disclosure Mitigation
• Maintain and deposit the syntax used to
modify the data
• Include either:
– a narrative description of changes in the
study-level documentation, or
– a series of item-specific narratives to be
included at the variable level (codebook/user
guide)
38. • One of the world’s oldest and largest social
science data archives, est. 1962
• Data distributed on punch cards, then reel-
to-reel tape, now:
– Data available on demand
– Over 8,200 studies with over 68,700 data sets
• Membership organization among 21
universities, now:
– Currently over 760 members world-wide
– Federal funding of public collections
What is ICPSR?
- Then and Now -
39. The Concept of “Data Curation”
• Curation, from the Latin "to care," is the process used to add value to
data, maximize access, and ensure long-term preservation
• Data curation is akin to work performed by an art or museum curator.
– Data are organized, described, cleaned, enhanced, and preserved for
public use, much like the work done on paintings or rare books to make
the works accessible to the public now and in the future
• Curation provides meaningful and enduring access to data
• Data curation is the foundation for effective, long-term data sharing
40. Assessing the data in the collection
• Searching for and downloading data
• Codebooks
• Full descriptives
• Variable search and compare
• Simple crosstabs and frequencies
• Online analysis functionality
41. Study Search Behaviors
In practice, we encounter three typical search
behaviors from our users:
• A user has a research question in mind.
• A user is looking for a dataset that contains
specific variables.
• A user is looking for a specific dataset and has
the study title or investigator name.
42. Natural Language Searching
• Does juvenile drug use lead to delinquency?
• juvenile “drug use” delinquency
• “juvenile drug use”
44. Variable Search Tips, 1
• Enter words or strings that are likely to appear
in a variable name, label, question, and value
labels:
• Presidential election will return variables
dealing with all presidential elections
• Presidential election Obama will return only
variables dealing with the 2008 and 2012
presidential elections
45. Variable Search Tips, 2
• Use quotes to search for specific phrases:
• "life satisfaction" "minority rights" "community
programs"
• The minus sign may be used to remove certain
types of results:
• "Presidential debate" -"Bill Clinton" will
eliminate the debates from 1992 and 1996 in
which Clinton participated
46. Variable Search Tips, 3
• A Boolean "and" is implied in the search.
• The search automatically does stemming, so
there's no need to type in an asterisk. It's also
case-insensitive.
• A fielded search is also available.
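To make the search semantics concrete, here is a toy matcher (not ICPSR's actual engine; it omits stemming and fielded search) implementing the implied AND, quoted phrases, and minus-sign exclusion described above:

```python
import re

def matches(query, text):
    """Sketch of the search semantics: bare terms are ANDed,
    "quoted phrases" must appear verbatim, and -term or -"phrase"
    excludes a result. Case-insensitive; no stemming here."""
    text = text.lower()
    tokens = re.findall(r'(-?)"([^"]+)"|(-?)(\S+)', query.lower())
    for neg_phrase, phrase, neg_term, term in tokens:
        needle = phrase or term
        negated = bool(neg_phrase or neg_term)
        if (needle in text) == negated:
            return False
    return True

# Hypothetical variable label, for illustration only
label = "Attended the 2008 presidential debate in Oxford, Mississippi"
print(matches('presidential debate', label))                    # True
print(matches('"presidential debate" -"Bill Clinton"', label))  # True
print(matches('debate -presidential', label))                   # False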
48. It’s really a searchable database . . .
containing over 65,000 citations of known published
and unpublished works resulting from analyses of
data archived at ICPSR
. . .that can generate study bibliographies
associating each study with the literature about it
. . . Included in the integrated search
on the ICPSR Web site
The Bibliography of Data-related Literature
69. Assistance is available
• Our user support staff are available from 8:00
a.m. to 5:00 p.m., ET, Monday to Friday
– Phone: 734-647-2200
– netmail@icpsr.umich.edu
70. References
• Office of Science and Technology Policy memorandum (2013). Retrieved
27/05/2015 from
https://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_
access_memo_2013.pdf
• National Science Foundation, Social, Behavioral and Economic Sciences (SBE).
(2010). Data Management Plan Policy. Retrieved 27/05/2015 from
http://www.nsf.gov/sbe/SBE_DataMgmtPlanPolicy.pdf
• National Science Foundation, Social, Behavioral and Economic Sciences (SBE)
(2008). Data Archiving Policy. Retrieved 27/05/2015 from
http://www.nsf.gov/sbe/ses/common/archive.jsp .
• National Institutes of Health. (2003). NIH Data Sharing Policy and
Implementation Guidance (includes data sharing plan requirements). Retrieved
27/05/2015 from
grants2.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm#finfre
71. References (cont.)
• National Institute of Justice (2014). About the Data Resources Program.
Retrieved 27/05/2015 from http://www.nij.gov/funding/data-resources-
program/Pages/about.aspx
• National Institute of Justice (2012). Data Archiving Plans for NIJ Funding
Applicants. Retrieved 27/05/2015 from http://www.nij.gov/funding/data-
resources-program/applying/data-archiving-strategies.htm
• Pienta, Amy M., George Alter, and Jared Lyle. (2010). The enduring value of
social science research: The use and reuse of primary research data. Presented
at the BRICK, DIME, STRIKE Workshop, The Organisation, Economics, and Policy
of Scientific Research, Turin, Italy, April 23‐24, 2010
(http://hdl.handle.net/2027.42/78307)
72. References (cont.)
• Confidentiality and Data Access Committee (CDAC), Federal Committee on
Statistical Methodology. Statistical Policy Working Paper 22 (Second version,
2005), Report on Statistical Disclosure Limitation Methodology,
http://www.fcsm.gov/working-papers/spwp22.html.
• Other Requirements Relating to Uses and Disclosures of Protected Health
Information, 45 CFR 164.514 (2002),
http://edocket.access.gpo.gov/cfr_2002/octqtr/pdf/45cfr164.514.pdf
• U.S. Department of Health and Human Services, National Institutes of Health.
Research Repositories, Databases, and the HIPAA Privacy Rule (posted
January 12, 2004; revised July 2, 2004),
http://privacyruleandresearch.nih.gov/research_repositories.asp
• Sweeney, L. “Simple Demographics Often Identify People Uniquely.” Data
Privacy Working Paper 3. Carnegie Mellon University, Pittsburgh, 2000.
http://dataprivacylab.org/projects/identifiability/index.html
73. Resources
• Link to Guidelines for Effective Data Management Plans (PDF),
http://www.icpsr.umich.edu/files/datamanagement/DataManagementPlans-
All.pdf.
• Link to Guidelines and Resources for OSTP Data Access Plans (video),
https://www.youtube.com/watch?v=sWnMFEKmfnE
• Link to ICPSR Deposit Data (web page),
http://www.icpsr.umich.edu/icpsrweb/deposit/index.jsp
• Link to openICPSR (website and videos) https://www.openicpsr.org/
• Link to Disclosure Risk Training – For Public Use or Not For Public Use (video),
https://www.youtube.com/watch?v=9vdWseLay9g&feature=youtu.be&list=P
LqC9lrhW1VvaKgzk-S87WwrlSMHliHQo6%22
Abstract: Federal data sharing requirements increase public access to federally-funded scientific data. For researchers, data sharing is a key resource in translating research into knowledge, policies, and practices. This workshop will assist participants in facilitating data sharing in the cycle of science that starts with deposited data, which through additional use, leads to the sharing of knowledge that inspires new data collection.
The workshop will cover several deposit options (to fully-curated archives and the public-access archive, openICPSR), differences between sharing public-use and restricted-use data, and benefits to depositors through the ICPSR Website. A hands-on demonstration of making a deposit is planned.

Finding data for the unique needs of a research project can be challenging, particularly in a world that values both the liberal use and the protection of research data. The workshop will describe and demonstrate the array of discovery and exploration tools that leverage ICPSR’s vast data catalog, metadata, and online analysis options; discuss the discovery, use, and publishing from restricted-use data; and include group discussion of disclosure issues and hands-on time with ICPSR data tools.

Participants will become more familiar with:
Federal data sharing requirements
Options for sharing data
Data discovery tools
Protection of confidentiality when sharing data
Presenters: Johanna Bleckman and Kaye Marz. Tuesday, June 2, 2015, 1:30-3:30 p.m.
Our Learning Objectives today are to help you to become more familiar with:
Federal (U.S.) data sharing requirements and other good reasons to share data
Options for sharing data through ICPSR and a hands-on walk-through of making a deposit
Johanna will cover the protection of confidentiality when sharing data and give specific examples of assessing and modifying the data
She will show you ICPSR’s data discovery tools
And also demonstrate ICPSR online data exploration tools, which you can also try on your own laptop if you have one with you
So let’s proceed on to our first objective.
First, we’ll explain why individuals should share their research data. Number 1 is because the government says you have to if the research was federally-funded. Data sharing makes effective use of the agency’s resources by avoiding unnecessary duplication of data collection and conserves research funds to support more investigators. In the U.S., our federal data sharing requirements include:
In February 2013, the Executive Office of the President's Office of Science and Technology Policy published a memo (entitled "Increasing Access to the Results of Federally Funded Scientific Research"), which directs funding agencies with an annual R&D budget over $100 million to develop a public access plan for disseminating the results of their research. (ICPSR DM&C site)
The National Science Foundation and the National Institutes of Health, both major U.S. research funding agencies, issued data sharing requirements to their grantees. A couple of years later, NSF issued a requirement that data management plans be included in proposals submitted to NSF starting in January 2011; NIH’s data sharing plan was part of its initial requirement. DMPs provide a structure for grantees to plan for data sharing at the end of the award period. The National Institute of Justice in the U.S. Department of Justice established a Data Resources Program to share its grantee data already in the late 1970s. More info about DMPs is available on the ICPSR YouTube channel.
A second reason to share data is the benefits to the social science community
The National Institutes of Health (NIH) data sharing policy (2003) identifies the following benefits from sharing data:
Reinforces open scientific inquiry and encourages diversity of analysis and opinion, both important to the scientific enterprise
Promotes new research, the testing of new or alternative hypotheses and methods of analysis
Supports studies on data collection methods and measurement
Facilitates the education of new researchers
Enables the exploration of topics not envisioned by the initial investigators, and
Permits the creation of new datasets when data from multiple sources are combined. Such data integration, harmonization, and other “data mash-ups” are of high interest to funding agencies and researchers alike.
And the benefit isn’t only to the greater good; individual researchers benefit as well:
Data in the public domain generate new research which cites the original research
Data are preserved and can be obtained by the original researchers if their copies are lost or destroyed
Archiving data helps researchers meet federal agency requirements
Information on the research community’s interest in a dataset that ICPSR provides online as usage statistics can assist researchers with success of future grant funding proposals.
Dr. Pienta and her colleagues at ICPSR conducted a study in 2008 and received the following responses about where researchers’ data are located. The answers are a data counterpart to that lost library book: technical malfunctions or limitations, purged, or destroyed. More than 25% of the data were reported lost. Less than 15% of researchers reported archiving their data.
Well, those researchers missed out on a real opportunity, because sharing data leads to more publications! The same study showed that the median number of publications was higher for data archived or shared, whether primary publications by the original research team, by secondary analysts, or by their students.
So, here we have a graphic of the cycle of science, which this workshop seeks to fuel. In the upper left, new data are collected; these data are then deposited in an archive or repository; the data are discovered and accessed by other researchers, who analyze the data and publish their results; and this secondary research uncovers new research ideas and/or creates new datasets from the data mash-ups mentioned earlier, which then leads to new research and more data being collected.
So where do data librarians and archivists have a role in this cycle?
Helping with data management plans and data archiving
Helping users find data in the repository
Helping users find analyzed results in publications based on the data, which leads to more research and more data.
So data librarians and archivists are crucial actors in this cycle of science.
And now we will talk about options for sharing data.
openICPSR is ICPSR’s public-access data collection.
openICPSR assists researchers in meeting requirements for public access to federally funded research data. It ensures that data depositors fulfill public-access requirements of grant and contract RFPs. It is available to any researchers (from member and non-member institutions) who wish to make their data available to the public via a sustainable archive.
Calendar Year 2013: Over 185,200 searches for data conducted on ICPSR’s data search tools
ICPSR is an international consortium of member institutions. It uses a fee-for-access model: it is sustained by membership fees that are pooled to allow deposits to be made at no charge and members to access the data at no charge. Non-members of ICPSR can obtain member data for an access fee. Members and non-members may pay a fee to access restricted-use data in the VDE. These pooled funds also cover full data curation services: professional processing, value-added documentation, data offered in the popular statistical programs, and, for many collections, online analysis.
openICPSR is a public-access research data-sharing service that uses a fee-for-deposit model. Currently the fee of $600/project is charged to non-members and waived for researchers at ICPSR member institutions. The deposit fee enables the public to access research data without charge, or, in the case of restricted-use data, for a nominal charge. It is important to note that fees for openICPSR are intended to be written into grant and contract proposals; fees should be funded by the project, not outright by an institution or members of ICPSR. Most data in openICPSR are “as-is,” as deposited and described by the depositor, unless the depositor chose the Professional Curation option, which has a fee to cover use of ICPSR’s curation and/or disclosure review services.
Why is ICPSR venturing into this space? Because we were hearing from agencies that ICPSR might not be ‘open enough’ for some of our research scientists.
Now we’re going to wade through a bit of legalese. But just for a couple of slides. Both the ICPSR and openICPSR Deposit Agreements have the same terms shown here.
ICPSR requests the right to transform or enhance the collection even for openICPSR deposits, since ICPSR may elect to fully curate a very popular openICPSR collection for the membership.
But the openICPSR deposit agreement terms have 4 more clauses.
And just launched earlier this year: openICPSR is available as a branded and hosted site for institutions and journals. The organization pays an annual fee, with rates graduated based upon the projected number of new deposits per year and a maximum total GB of storage. Pricing plans are available on the openICPSR website.
Here are screenshots of openICPSR branded sites. On the left are site front pages with the organization’s logo and colors. The URL is also customized to be unique to that site. In the back on the right side, you can see the Browse Deposit list page with the site’s own colors. In the front, you can see the web traffic report for the other site.
But before sharing data we need to consider the important concepts of re-identification and harm to research subjects. Should the data be restricted?
Re-identification
Can individuals be identified from information in this material?
If the data were made public, could someone use a combination of variables (e.g. age, sex, race, occupation, geography) to find individuals in a publicly available database?
Harm
Does this material include sensitive information?
Would the release of individually identifiable information create a risk of harm (e.g. psychological distress, social embarrassment, financial loss) greater than the risks that people experience in everyday life?
Here is a brief list of the most obvious direct and indirect identifiers.
Direct identifiers must be removed from the data before sharing with ICPSR. Data that retain indirect identifiers may need to be made available as a restricted-use collection.
Note that ICPSR will remove a collection from openICPSR if it determines that the files pose risk of disclosure or harm to the individuals represented in the data. ICPSR will notify the investigator, the investigator’s institution, and funding agency that ICPSR has done so.
The other concern is potential harm. Sensitive topics become a disclosure concern when combined with any risk of re-identification of individuals in the data.
Let’s make a deposit with openICPSR. We do not want to create a test deposit in the production version of openICPSR, so for today we’ll use the openICPSR sandbox to make a deposit using the scenario in your workshop folder.
A common misperception that we encounter is the belief that an investigator’s primary data are too disclosive to share.
On the contrary, many restricted data files can be processed for public release, either in downloadable form or via a Restricted Use Agreement, with the right kind and level of risk analysis and remediation.
In fact, many of our studies start out too disclosive for public access, but can be modified such that they are still analytically valuable while disclosure risk is low enough that the files can be downloaded under simple Terms of Use.
ICPSR obviously shares an abundance of downloadable data, but we have also been sharing restricted-use data for over a decade. Restricted-use data applicants submit to us a Data Use Agreement signed by their institution, a data security plan (meaning how they will store the data securely), and documentation from their institution’s IRB.
We disseminate the data one of three ways.
1. Secure Download - most of the restricted data files curated and disseminated via ICPSR are distributed to approved users via a secure download link.
2. Virtual Data Enclave (I’ll talk more about this in a moment)
3. Physical Enclave – a room in the basement of ICPSR that allows access to the data on a stand-alone computer without access to the internet or ICPSR networks. We call this “monitored processing” – the analyst is accompanied by ICPSR staff and all of their output is hand-reviewed before s/he can leave the building.
ICPSR stores & shares over 6,400 restricted-use datasets associated with over 2,000 currently ‘active’ restricted-use data contracts
We’re currently adding about 50 new contracts each month.
So what is it…
Virtual machine launched from your desktop machine
Functions just like your local machine, but with restrictions on what can enter and exit the environment
Full suite of statistical packages and other software
This is a screenshot of my machine’s screen when the VDE is active.
You can see that the majority of the real estate here is the Virtual Environment, with my start menu, Windows Explorer accessing my files on the ICPSR servers…
but then you can see that I have TWO task bars, one that is inside and relevant to the Virtual Environment, and one that is outside and relevant to my physical machine.
I can toggle easily between the VDE and my email, the VDE and files on my local machine or on a server that I am connected to outside of the VDE
This is the logic of the VDE, with me as ICPSR staff and you as an approved researcher, working on our local machines, logging in through this University of Michigan virtual space, and accessing restricted data on ICPSR servers.
When we’re looking at data through the lens of disclosure, there are several things we’re targeting as likely sources of disclosure or calculations of risk.
Tabular output. Tables crossing variables, particularly demographic variables, are a typical source of disclosure risk.
Geographic information. Geographic information is always a concern, and can be an issue even for large geographic regions like states when combined with other information. As geographic information becomes more precise and available, the risk of disclosure increases.
Dr. Latanya Sweeney of Carnegie Mellon University, in her paper “Simple Demographics Often Identify People Uniquely,” reported the percentage of the United States population in the 1990 census whose characteristics likely made them unique based only on gender and date of birth, plus location information [discuss table]
Also Geocodes, GIS data, maps, etc.
Longitudinal or panel data. Studies that contain information across time for the same individuals of course increase the risk of reidentification, given that the history over time provides much more opportunity for a unique pattern of values than does a single case from a one-time study.
Verbatim responses or recordings.
Cases that include a person’s personal history in his or her own words greatly increase the risk of identification or reidentification.
Names and locations might be mentioned by the respondent, and events may be unique to the individual. The style of speaking, regionalisms, and accents (if it’s a recording) all provide more identifiable information than on a multiple-choice survey.
Any released information must include only content that has been selected and/or anonymized sufficiently to prevent identification.
One of the projects in my archive is a study of teaching effectiveness, and it includes thousands of hours of videos of classroom sessions. Sports teams, school names or mascots, etc., in addition to faces and voices.
Photographs or videos. A person may be recognized by an acquaintance, but reidentification could also occur through use of facial recognition software and other photos tagged with the person’s identity that are publicly available or available through social media.
Each cell size is sufficiently large to prevent identifiability:
An unweighted non-zero cell size should not be less than a small threshold number. Commonly used values are 3, 5, 10, depending on the sensitivity of the data.
Percentage and count based rules for cells, rows, and columns:
All cases in any row or column should not be in a single cell. If percentages are displayed, no cell should be 100% for a row or a column.
If a cell entry is a percentage, that percentage should not correspond to a cell size less than a threshold number.
A table cell should not ‘dominate’—a cell should not be a high percent of all cases included in a row or column, e.g. more than 60%.
Combinations of tables: any table that could be constructed by combining a set of tables (e.g., by addition or subtraction of rows or columns) also needs to meet these guidelines.
The values for ‘threshold’ numbers for cell size or value depend on the type and context of the data.
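As a rough illustration, the threshold and dominance checks above might be automated along these lines. This is a sketch, not ICPSR’s actual review tooling; the threshold of 5 and the 60% dominance cutoff are just two of the example values mentioned above.

```python
def table_risks(table, threshold=5, dominance=0.60):
    """Flag disclosure risks in a crosstab given as a list of rows of counts.

    Illustrative only: a real review would also apply these rules to
    columns and to any table derivable by combining rows or columns.
    """
    problems = []
    for i, row in enumerate(table):
        for j, cell in enumerate(row):
            # Rule: no non-zero cell below the minimum cell-size threshold
            if 0 < cell < threshold:
                problems.append(f"cell ({i},{j}) below threshold: {cell}")
        row_total = sum(row)
        if row_total:
            for j, cell in enumerate(row):
                # Rule: all cases in a row must not sit in a single cell (the 100% rule)
                if cell == row_total:
                    problems.append(f"row {i} concentrated in cell ({i},{j})")
                # Rule: no cell should dominate its row
                elif cell / row_total > dominance:
                    problems.append(f"cell ({i},{j}) dominates row {i}")
    return problems
```

Column-wise versions of the same rules could be run on the transposed table.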
So what do we do about it?
Quite often Disclosure Risk Protections (or DRP) need to be incorporated to create uncertainty in the data in order to protect subjects. So the goal is “plausible deniability” – meaning one cannot say with certainty that this information pertains to a particular individual or organization.
Listed on the screen right now are the DRP options.
One DRP is to release data from only a sample of the population
Remove or mask obvious identifiers and high-risk variables, i.e., delete them from the data or mask original values with a generic code
Limit geographic detail: you might recode up to a higher level of geography (such as county to state), or drop or further restrict access to geography variables
Limit the number and detailed breakdown of categories within variables, so in the criminal justice world, specific offense codes can be combined into property, violent, drug, other
Top- or bottom-code to combine extreme codes in a variable, e.g., particularly young or old ages, as appropriate for the data
Recode into intervals or round values, e.g., collapse individual age values into 5-year increments
Add noise by multiplying or adding a random number to the original value
Swap or rank swap records (aka switching), e.g., matching records on key variables (demographics and study-specific variables of interest) and swapping the record across states
Blank and impute: Selecting records at random or purposefully, blanking out selected variables, and replacing original values with imputed values
Aggregating across small groups of respondents and replacing one individual's reported value with the average (aka blurring)
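A few of these masking techniques can be sketched in a handful of lines. The parameters here (an 80-year top-code, 5-year bands, a ±2 noise spread) are hypothetical choices for illustration, not values from any particular ICPSR study.

```python
import random

def top_code(age, ceiling=80):
    """Collapse extreme high values into a single top category (top-coding)."""
    return min(age, ceiling)

def to_interval(age, width=5):
    """Recode an exact age into the bottom value of its 5-year band."""
    return (age // width) * width

def add_noise(value, spread=2, rng=random):
    """Perturb a value by a random integer amount in [-spread, +spread]."""
    return value + rng.randint(-spread, spread)
```

Each transformation trades some analytic precision for uncertainty about any one respondent’s true value, which is the “plausible deniability” goal described above.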
In a study of teaching effectiveness, 1,930 teachers were observed and videotaped during classes. The observations were coded, and several teaching practice and effectiveness measures were created.
A secondary analyst plans to study teacher effectiveness measures for eighth grade English. She is interested in the effect of years of teaching experience and student-and-teacher interactions on scores related to communicating with students. She wants to determine if there are significant differences by ethnicity and gender.
This is a table created to check unweighted sample sizes by years of teaching experience for the primary demographic categories she is interested in.
In the top table here, sample sizes are listed by year for the first 5 years of experience, and then by 5-year groups from 10 to 40 years.
As you can see, the shaded cells for ethnicity and gender don’t meet the minimum cell size requirement.
In the recoded version below you can see that I have collapsed Ethnicity into White, other, and Black or Hispanic. I also collapsed the 35 and 40 year values into a single 35+ value. Those actions together ameliorated this specific disclosure risk.
We are always careful to balance the research utility against disclosure protections and it may be the case that this is a reasonable analytic sacrifice within the context of this study. If not, we would pursue one or more of the other tactics described on the previous slide.
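The category collapses in this example can be expressed as simple recode maps. The detailed ethnicity codes below are invented for illustration; only the target groups (White; Other; Black or Hispanic) and the 35+ experience code come from the example above.

```python
# Hypothetical detailed codes mapped into the collapsed groups used above.
ETHNICITY_COLLAPSE = {
    "White": "White",
    "Black": "Black or Hispanic",
    "Hispanic": "Black or Hispanic",
    "Asian": "Other",
    "Other": "Other",
}

def collapse_years(years):
    """Collapse the 35- and 40-year experience values into a single 35+ code."""
    return "35+" if years >= 35 else years
```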
As we were processing these data it became apparent to us that individual students were identifiable via a combination of variables, but age in months was the biggest issue.
We wanted to preserve as much of that specificity as possible given what we know about the utility of these data (meaning that we did not want to coarsen the age variable – age in months matters for school-age children).
What we chose instead was to add “noise” to the age variable – adding or subtracting up to two months from each child’s age.
This table here shows the means and standard deviations pre-noise, so to speak, and post-noise, demonstrating that while we have changed the data, the effects are very small. The correlations are all well above .98, which was our minimum threshold.
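The noise step just described can be sketched as follows. The 0.98 correlation floor comes from the threshold mentioned above; the sample of ages is entirely made up for the demonstration, and Pearson’s r is computed by hand to keep the sketch self-contained.

```python
import random

random.seed(0)
# Invented sample: 500 children's ages in months (roughly ages 5-15).
ages = [random.randint(60, 180) for _ in range(500)]
# Add noise: shift each age by up to two months in either direction.
noisy = [a + random.randint(-2, 2) for a in ages]

def pearson(x, y):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(ages, noisy)
```

Because the ±2-month perturbation is tiny relative to the spread of ages, the pre- and post-noise variables remain almost perfectly correlated, mirroring the effect described in the table.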
I encourage you to document any disclosure mitigation that you carry out on your files before you share them.
Best practice to:
Maintain the syntax used to modify the data
Include either:
a narrative description of changes in the study-level documentation, or even better,
a series of item-specific narratives to be included at the variable level (codebook/user guide)
First, a little bit of history so that you can get a sense of how far we’ve come….
ICPSR is one of the world’s oldest and largest social science data archives, established in 1962
We first distributed data, of course, on punch cards,
Then reel to reel tape
And now, thankfully, the data are available on demand; most datasets can be downloaded directly from our websites.
We are a membership organization, meaning that we are sustained by membership dues paid by institutions who in return have free access to all of our data and other services.
We began in 1962 with 22 member universities, and now we have over 750 member institutions from across the globe.
As of January of this year, over 63,350 datasets (over 181,716 files) available for download associated with over 8,000 studies.
In addition, just over 1,200 restricted-use studies (which includes over 6,000 datasets) available for analysis.
ICPSR catalog indexes over 9,000 catalog entries/studies (as of 1/2015).
To give you a sense of the volume of downloads: in FY 2014, over 683,200 datasets (428,900 studies) were downloaded or accessed.
What is Data Curation?
In general, curation means to undergo a process to add value, maximize access, and ensure long-term preservation.
As you would imagine, data curation is similar to the work done by museum or art gallery curators – so we organize, describe, clean, enhance, and preserve social science data so that the public, and more specifically the research community, can more easily and broadly access and make use of the data now and into the future.
Similar to unique works of art, keeping research data only in our individual offices, or on our office computers, is a risky proposition if our goal is long-term preservation. It makes a lot more sense to turn it over to a professional curator to take the steps necessary to protect its integrity and make it findable and usable in perpetuity.
So – I included this slide mostly because one of the folks who responded to our survey was hoping to leave this session with more information about how ICPSR curates data. The technical aspects of that are a bit out of scope for this workshop, but whoever you are – or anyone who is interested! – please do find me afterward if you have specific questions!
A powerful search engine that supports multiple effective strategies for discovering data,
We create and support full documentation including codebooks, quite often instruments, and other rich metadata,
We have a variable search and compare tool that allows you to search our entire, enormous collection of variable level metadata to identify items of interest, whether it’s
to identify data you want to analyze,
compare how concepts are approached across various surveys, or perhaps across disciplines,
or to begin to think about creating your own survey.
Analysis tools
a simple crosstab or frequency tool
and then the full online analysis program called SDA, out of UC Berkeley.
Our study search indexes
the metadata records that ICPSR staff create,
as well as the full text of the documentation files,
including all the variable markup and question and answer text.
In the study search, the metadata record is heavily weighted, especially the study title and the subject terms. This is as opposed to the variable search which I’ll talk more about in a moment.
In practice, we encounter three typical search behaviors from our users that dictate which search options best meet the user’s needs.
A user has a specific research question in mind and they want to find data that directly addresses that issue.
The second common search behavior that we see is users searching for specific variables, which may or may not be conceptually related to one another.
The third search behavior I’ll demo is when a user is looking for a specific dataset and has the study title or investigator name. That’s a pretty easy one.
For some users, finding the right dataset is as simple as typing in the research question.
Unfortunately, weaker search engines are so common that users tend to provide only one or two query terms, assuming that more terms would narrow the results too far.
The challenge then for us is educating our users and making sure our metadata is sufficient to the task given the capabilities of the search.
To provide an example, let’s do a natural language search on the ICPSR website using a simple research question, like “Does juvenile drug use lead to delinquency?”
The results are pretty good, and tend to be better than concise keyword searches, such as “juvenile ‘drug use’ delinquency,” [demo] which results in studies that only grab two of the three factors or focus on prediction of delinquency instead of delinquency itself.
If I searched for “juvenile drug use” as a phrase, I’d actually get only one result. That’s one of the big limitations of phrase searching.
One great feature of our website and the search tool is the variable relevance sort.
So if you do a search and then sort by variable relevance,
we weight the variable metadata more heavily and then display matching variables on the search results page,
along with instructions that the user should separate different concepts with commas.
This makes it relatively easy for a user to find a dataset that has a specific combination of variables.
The variable display also includes checkboxes beside each variable and we provide a “compare variable” utility that allows users to view selected variables side by side, which I will demo in a few minutes.
If a user is searching for a particular investigator, the search is relatively easy, unless the person’s name is very common. (E.g., Smith.) Our site offers a “Browse by Author” page to make the task somewhat easier, and to allow users to see variations on the name if they run into trouble.
Site visitors can use quotation marks for phrase searching, but this can be problematic if the user is mistaken about the word order or gets a word in the title wrong. For example, “study” versus “survey.” In general, searching by title is very effective and we don’t need to provide the user with additional guidance.
ICPSR’s Bibliography of Data-related Literature is an ongoing project to grow a searchable database containing over 65,000 citations of known published and unpublished works resulting from analyses of data held in the ICPSR archive.
Our bibliography was developed with support from the National Science Foundation, and it represents over 50 years of scholarship in the quantitative social sciences, extending from the inception of ICPSR in 1962 to the present.
In the collection are:
Resources using data in the ICPSR holdings as the primary data source
Resources using ICPSR data in a comparison with the primary dataset investigated
Resources "about" an ICPSR dataset or study series.
The Bibliography of course facilitates literature searches to help folks
identify research that has already been done,
replicate analyses
avoid unintentionally duplicating analyses
and identify cross-disciplinary implications and uses for the data.
Some unexpected benefits of the Bibliography are also listed here:
I think of this as meta work –
So studying how data resources are used,
Conducting citations analyses,
Understanding the data life cycle
And learning about current methodological issues that folks are grappling with.
We have several different use cases for our citation search.
- Users who are looking for facts/tables, rather than raw data
- Users who want to see how a particular dataset has been utilized
- Researchers and funding agencies who want to gauge the impact of a particular study by seeing how much it has been cited
One caveat is that the citation search is a bit limited in that it only searches the citation, not the full text.
Due to relatively recent court rulings on the Google copyright case that favor indexing copyrighted content and providing snippets to users on the search results page, we are investigating the possibility of expanding this service to include full text of the publication, which would of course strengthen the database search tool.
One of the current strengths of this utility, however, is that we provide links to full text whenever possible.
We retain DOIs when we can locate them, and we provide dynamic links to WorldCat and Google Scholar if we don’t know the specific location of the article/report, making it that much easier for users to get to the full text.