Public data archiving: Who does? Who doesn't? What can we do about it?
1. Public data archiving:
Who shares?
Who doesn't?
What can we do about it?
Heather Piwowar
Presented at UBC BLISS, Sept 2010
DataONE postdoc with Dryad and NESCent, @UBC
PhD in Dept of Biomedical Informatics, U of Pittsburgh
20. As we seek to embrace and
encourage data sharing,
understanding patterns of adoption
will allow us to make informed
decisions about tools, policies, and
best practices.
Measuring adoption over time will
allow us to note progress and
identify best practices and
opportunities for improvement.
21. research questions
1. Is there benefit for those who share?
2. How can we study data sharing behaviour in
a scalable, systematic way?
3. What factors are correlated with sharing
and withholding data?
30. • gene expression microarray data
• raw intensity data
• upon publication
• publicly on the internet
• (centralized databases)
http://www.flickr.com/photos/paulhami/1020538523//
31. http://en.wikipedia.org/wiki/DNA_microarray
http://en.wikipedia.org/wiki/Image:Heatmap.png
http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG
microarray
data
35. currency of value?
Citations.
$50!
Diamond, Arthur M. What is a Citation Worth?
The Journal of Human Resources (1986)
vol. 21 (2) pp. 200-215
36. dataset
85 cancer microarray trials published in 1999-2003, as
identified by Ntzani and Ioannidis (2003)
citations
ISI Web of Science Citation index, citations from
2004-2005
data sharing locations
Publisher and lab websites, microarray databases, WayBack
Internet Archive, Oncomine
statistics
Multivariate linear regression
40. 2. Need automated methods to:
a) Identify studies that create datasets
b) Determine which of these
have in fact been shared
c) Extract attributes about the environment
41. a) Identify studies that create datasets
http://www.flickr.com/photos/lofaesofa/248546821/
43. Combined, these full-text portals reach 85%
of the articles available through
U of Pittsburgh library subscriptions.
44. But how to generate an effective query?
Use open access articles.
45. ⢠text analysis:
automatically catalogued
single words and word-pairs from full text
⢠assessed precision and recall
⢠combined the high performers:
46. Derived query:
("gene expression" AND microarray AND cell AND rna)
AND (rneasy OR trizol OR "real-time pcr")
NOT ("tissue microarray*" OR "cpg island*")
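A minimal sketch of how the derived query could be applied to an article's full text held locally. This is an illustration only, not the search the full-text portals actually run: the function names `has_term` and `matches_derived_query` are hypothetical, matching is by whole word with no stemming (so "cells" does not match "cell"), and a trailing `*` is treated as a simple word-ending wildcard.

```python
import re


def has_term(text: str, term: str) -> bool:
    """Return True if `term` occurs in `text`, ignoring case.

    Multi-word terms match across any run of whitespace.
    A trailing '*' matches any word ending (assumed wildcard semantics).
    """
    wildcard = term.endswith("*")
    core = term.rstrip("*")
    pattern = r"\b" + r"\s+".join(re.escape(word) for word in core.split())
    pattern += r"\w*" if wildcard else r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def matches_derived_query(text: str) -> bool:
    """Evaluate the three clauses of the derived query against one document."""
    # ("gene expression" AND microarray AND cell AND rna)
    required = all(
        has_term(text, t) for t in ("gene expression", "microarray", "cell", "rna")
    )
    # AND (rneasy OR trizol OR "real-time pcr")
    lab_method = any(
        has_term(text, t) for t in ("rneasy", "trizol", "real-time pcr")
    )
    # NOT ("tissue microarray*" OR "cpg island*")
    excluded = any(
        has_term(text, t) for t in ("tissue microarray*", "cpg island*")
    )
    return required and lab_method and not excluded
```

A real portal will tokenize and stem differently, so hits from this sketch would not necessarily agree with hits from the portals.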
47. Evaluation:
Ochsner et al. Nature Methods (2008)
400 studies across 20 journals
Precision: 90% (conf int: 86% to 93%)
Recall: 56% (conf int: 52% to 61%)
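A minimal sketch of how precision, recall, and a confidence interval for each could be computed from a manually reviewed sample. The slides do not say which interval method was used; the Wilson score interval below is an assumption, and the function names are hypothetical.

```python
import math


def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width
```

For precision, `successes` is the number of retrieved articles that truly created a dataset and `total` is the number retrieved; for recall, `total` is the number of articles in the gold standard that created a dataset.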
48. a) Identify studies that create datasets
b) Determine which of these
have in fact been shared
c) Extract attributes about the environment
54. Funder Journal Investigator Institution Study
Is research data shared
after publication?
55. Funder: funded by NIH?; size of grant; sharing plan req'd?; funded by non-NIH?
Journal: impact factor; strength of policy; open access?; number of microarray studies published
Investigator: years since first paper; # pubs; # citations; previously shared?; previously reused?; gender
Institution: sector; size; impact rank; country
Study: humans?; mice?; plants?; cancer?; clinical trial?; number of authors; year
57. journal data sharing policy
"An inherent principle of publication is that
others should be able to replicate and build
upon the authors' published claims.
Therefore, a condition of publication
in a Nature journal is that authors are
required to make materials, data and
associated protocols available in a publicly
accessible database ..."
http://www.nature.com/authors/editorial_policies/availability.html
http://www.nature.com/nature/journal/v453/n7197/index.html
60. author "experience"
Author publication history:
Author name disambiguation: Author-ity web service
Torvik & Smalheiser. (2009). Author Name Disambiguation in MEDLINE.
ACM Transactions on Knowledge Discovery from Data, 3(3):11.
Citation counts:
63. funder mandates
NIH requires a data sharing plan
for studies funded after October 2003
that receive more than $500 000 in
direct funding per year
64. funder mandates
Proxy for NIH data sharing policy
applicability:
If in any year since 2004,
• funded by an NIH grant number
with a "1" or "2" type code
• received more than $750 000 in
total funding from the grant
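The proxy rule above can be sketched as a small predicate. Several things here are assumptions: the record layout (`GrantYear`) and function name are hypothetical, the type code is read as the leading digit of the full NIH grant number, "since 2004" is treated as including 2004, and both bullet conditions are required to hold for the same grant in the same year.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GrantYear:
    """One grant's funding in one fiscal year (hypothetical record layout)."""
    grant_number: str     # full NIH grant number, e.g. "1R01CA000000-01" (made up)
    fiscal_year: int
    total_funding: float  # US dollars from this grant in this fiscal year


def policy_proxy_applies(
    records: Iterable[GrantYear],
    first_year: int = 2004,
    threshold: float = 750_000,
) -> bool:
    """True if any record meets the year, type-code, and funding conditions."""
    return any(
        r.fiscal_year >= first_year
        and r.grant_number[:1] in {"1", "2"}   # "1" or "2" type code
        and r.total_funding > threshold        # strictly more than the threshold
        for r in records
    )
```

If the two conditions were meant to be evaluated independently across years, the `any` would need to be split into two separate checks.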
66. Now equipped with automated methods to:
a) Identify studies that create datasets
b) Determine which of these
have in fact been shared
c) Extract attributes about the environment
70. Proportion of articles with shared datasets, by year
[Figure: proportion of articles with datasets found in GEO or ArrayExpress (axis range 0.05 to 0.35) against year article published (2000 to 2009). Panel label: Across time.]
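A minimal sketch of how a proportion-by-year series like this could be tallied, assuming each article has already been reduced to a (publication year, dataset found in GEO or ArrayExpress) pair. The function name and input layout are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def proportion_shared_by_year(articles: Iterable[tuple[int, bool]]) -> dict[int, float]:
    """Map each publication year to the fraction of its articles with a shared dataset."""
    totals: dict[int, int] = defaultdict(int)
    shared: dict[int, int] = defaultdict(int)
    for year, is_shared in articles:
        totals[year] += 1
        shared[year] += int(is_shared)
    # Years are returned in ascending order; years with no articles are absent.
    return {year: shared[year] / totals[year] for year in sorted(totals)}
```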
71. Proportion of datasets shared, by journal
[Figure: bar chart; axes are proportion of datasets shared (0.0 to 1.0) and Journals. Callout: Physiological Genomics.]
Journals, in the order listed on the chart:
Physiol Genomics; PLoS Genet; Genome Biol; Microbiology; PLoS One; BMC Genomics;
Plant Cell; Genome Res; Eukaryot Cell; Appl Environ Microbiol; BMC Med Genomics;
Hum Mol Genet; Proc Natl Acad Sci U S A; Infect Immun; Am J Respir Cell Mol Biol;
Dev Biol; J Bacteriol; Mol Endocrinol; BMC Cancer; Plant Physiol; Biol Reprod; Blood;
J Immunol; FASEB J; Toxicol Sci; J Exp Bot; Nucleic Acids Res; Diabetes; Mol Cell Biol;
Mol Cancer Ther; BMC Bioinformatics; Stem Cells; FEBS Lett; J Neurosci; Am J Pathol;
J Biol Chem; J Virol; OTHER; Cancer Res; J Clin Endocrinol Metab; Plant Mol Biol;
Clin Cancer Res; Genomics; Invest Ophthalmol Vis Sci; Mol Hum Reprod; Carcinogenesis;
Gene; Endocrinology; Oncogene; Cancer Lett; Biochem Biophys Res Commun
72. Proportion of datasets shared, by institution
[Figure: bar chart; axes are proportion of datasets shared (0.0 to 1.0) and Institutions. Callout: Stanford.]
Institutions, in the order listed on the chart:
Stanford University; University of Pennsylvania; University of Illinois;
University of California, Los Angeles; University of Wisconsin, Madison;
University of Washington; University of California, Davis;
The University of British Columbia; University of California, San Francisco;
University of Florida; University of California, San Diego;
University of Minnesota, Twin Cities; Baylor College of Medicine; OTHER;
Max Planck Gesellschaft; Harvard University; Duke University Medical Center;
Yale University; Johns Hopkins University; University of Pittsburgh;
Washington University in Saint Louis; University of Toronto;
University of California, Berkeley; University of Michigan, Ann Arbor;
Michigan State University; National Cancer Institute; Tokyo Daigaku
89. Multivariate nonlinear regressions with interactions
[Figure: Odds Ratio axis with ticks at 0.25, 0.50, 1.00, 2.00, 4.00, 8.00; legend: 0.95.]
Factors, in the order listed on the plot:
Has journal policy
Count of R01 & other NIH grants
Authors prev GEOAE sharing & OA & microarray creation
NO K funding or P funding
Journal impact
Institution high citations & collaboration
Journal policy consequences & long halflife
NOT animals or mice
Institution is government & NOT higher ed
Last author num prev pubs & first year pub
Large NIH grant
Humans & cancer
NO geo reuse + YES high institution output
First author num prev pubs & first year pub
92. Multivariate nonlinear regression with interactions
[Figure: Odds Ratio axis with ticks at 0.25, 0.50, 1.00, 2.00, 4.00; legend: 0.95.]
Factors, in the order listed on the plot:
OA journal & previous GEO-AE sharing
Amount of NIH funding
Journal impact factor and policy
Higher Ed in USA
Cancer & humans
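To make the odds-ratio scale concrete, here is a minimal sketch of an unadjusted odds ratio for a single yes/no factor, with a Wald interval on the log scale. This is deliberately much simpler than the multivariate regression with interactions shown on the slide and is not the method used there; the function name and argument layout are hypothetical.

```python
import math


def odds_ratio(
    shared_with: int,
    withheld_with: int,
    shared_without: int,
    withheld_without: int,
    z: float = 1.96,
) -> tuple[float, float, float]:
    """Return (odds ratio, lower bound, upper bound) from a 2x2 table.

    "with"/"without" refer to studies that do or do not have the factor,
    for example publishing in a journal with a data sharing policy.
    """
    cells = (shared_with, withheld_with, shared_without, withheld_without)
    if min(cells) <= 0:
        raise ValueError("all four cell counts must be positive")
    a, b, c, d = cells
    ratio = (a * d) / (b * c)
    # Standard error of log(odds ratio) for the Wald interval.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ratio) - z * se)
    upper = math.exp(math.log(ratio) + z * se)
    return ratio, lower, upper
```

An odds ratio above 1.00 means the odds of sharing are higher when the factor is present; an interval that spans 1.00 is consistent with no association.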
94. Conclusions:
• data sharing rates are increasing,
but overall levels are low
Preliminary evidence:
• levels are particularly low in cancer
• levels are highest for those who
  • publish in a journal with a policy
  • publish in an open access journal
  • have shared data before
95. • data and filters were imperfect
• many assumptions
• didn't capture all types of sharing
• don't know how generalizable across datatypes
• should be considered hypothesis-generating
http://www.flickr.com/photos/vlastula/300102949/
97. DataONE: NSF-funded distributed framework
and cyberinfrastructure for
environmental science.
Dryad is a repository of data
underlying scientific publications,
with an initial focus on evolution,
ecology, and related fields.
The National Evolutionary
Synthesis Center (NESCent), NSF-funded:
• Duke University
• UNC at Chapel Hill
• North Carolina State University
101. • evolution and ecology
datasets
• raw data that support results
• upon publication
or short embargo
• publicly on the internet
http://www.flickr.com/photos/paulhami/1020538523//
102. challenges!
1. No PubMed
2. Diverse data types, norms, repositories
3. Data almost always collected for a specific
hypothesis
4. Less public sharing so far
112. I post my data, code, and statistical scripts on
GitHub (links from http://researchremix.org)
Share yours too!
http://www.flickr.com/photos/myklroventine/892446624/
113. "Does anyone want your data?
That's hard to predict [...]
After all, no one ever knocked on your door asking to
buy those figurines collecting dust in your cabinet
before you listed them on eBay.
Your data, too, may simply be awaiting an effective
matchmaker."
Got data? Nature Neuroscience (2007)
114. Dept of Biomedical Informatics at U of Pittsburgh
Wendy Chapman for support and feedback
Todd Vision, Mike Whitlock for ongoing discussions
NIH NLM. NSF through DataONE, NESCent, Dryad.
Open science online community and those who release their
articles, datasets and photos openly
thank you
118. • readers
• reusers
• authors
• editors
• reviewers
• funders
• database designers, maintainers, curators
• patients, subjects, or populations
perspectives, and also driving towards
actionable results for these groups
122. Correlates with self-reported data withholding
[Figure: bar chart, scale 0 to 3.]
industry involvement
perceived competitiveness of field
male
sharing discouraged in training
human participants
academic productivity
Blumenthal et al. Acad Med. 2006
123. Self-reported reasons for data withholding
[Figure: bar chart, scale 0% to 80%.]
sharing is too much effort
want student or jr faculty to publish more
they themselves want to publish more
cost
industrial sponsor
confidentiality
commercial value of results
Campbell et al. JAMA 2002.
124. Table 2: Second-order factor loadings, by first-order factors
Amount of NIH funding
   0.88  Count of R01 & other NIH grants
   0.49  Large NIH grant
  -0.55  NO K funding or P funding
Cancer & humans
   0.83  Humans & cancer
OA journal & previous GEO-AE sharing
   0.59  Authors prev GEOAE sharing & OA & microarray creation
   0.43  Institution high citations & collaboration
   0.31  First author num prev pubs & first year pub
  -0.36  Last author num prev pubs & first year pub
Journal impact factor and policy
   0.57  Journal impact
   0.51  Last author num prev pubs & first year pub
Higher Ed in USA
   0.40  NO geo reuse + YES high institution output
  -0.44  Institution is government & NOT higher ed