Presentation by Heather Piwowar as PhD dissertation defense on March 24, 2010 at the Dept of Biomedical Informatics, U of Pittsburgh. "Foundational studies for
measuring the impact, prevalence, and patterns of publicly sharing biomedical research data." I passed :)
13. but how much isn’t
shared?
what isn’t shared?
who isn’t sharing it?
why not?
how much does it matter?
what can we do
about it?
14. Prior studies: surveys and/or manual audits
Blumenthal et al. Acad Med. 2006.
Campbell et al. JAMA. 2002.
Kyzas et al. J Natl Cancer Inst. 2005.
Vogeli et al. Acad Med. 2006.
Reidpath et al. Bioethics. 2001.
http://www.flickr.com/photos/jima/606588905/
15. Limitations of related work
• small sample sizes
• relatively few variables
• self-reporting bias
• not much focus on measuring demonstrated behavior
• not much focus on rewards
• not much focus on policy
• not much focus on biomedical data other than
DNA sequences
16. I believe analysis of the impact, prevalence, and patterns
with which researchers share and withhold biomedical data
can uncover rewards, best practices, and opportunities for
increased adoption of data sharing.
http://www.flickr.com/photos/archeon/2941655917/
17. Goal of this dissertation:
Collect useful evidence on patterns of
data sharing behaviour through methods
that can be applied broadly, repeatably,
and cost-effectively.
25. Aim 1: Does sharing have benefit
for those who share?
dataset
85 cancer microarray trials published in 1999-2003, as
identified by Ntzani and Ioannidis (2003)
citations
ISI Web of Science Citation index, citations from
2004-2005
data sharing locations
Publisher and lab websites, microarray databases, WayBack
Internet Archive, Oncomine
statistics
Multivariate linear regression
37. Aim 2a: Identify studies that create
gene expression microarray data
Look for wetlab methods in full text:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1522022&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1590031&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1482311&tool=pmcentrez#id331936
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2082469&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=126870&tool=pmcentrez#id442745
40. Features?
Unigrams and bigrams from full text
Training classifications?
Automatic filter for whether publication had
an associated dataset deposited in a database
Feature selection and combination:
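To illustrate the unigram and bigram features named on this slide, here is a minimal sketch. The tokenizer (lowercasing, alphanumeric runs, internal hyphens kept) and the function name are assumptions made for illustration; the slides do not say how the full text was tokenized.

```python
import re
from collections import Counter

def unigrams_and_bigrams(text):
    """Count single tokens and adjacent token pairs in a block of full text."""
    # Assumed tokenizer: lowercase alphanumeric runs, keeping internal hyphens
    # so that terms such as "real-time" stay together.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    unigrams = Counter(tokens)
    # A bigram is each pair of neighbouring tokens, joined with a space.
    bigrams = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return unigrams, bigrams

if __name__ == "__main__":
    uni, bi = unigrams_and_bigrams("Total RNA was isolated using TRIzol.")
    print(uni.most_common(3))
    print(bi.most_common(3))
```

Counts like these could then be fed to whatever feature selection step is used; that step is not sketched here.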
41. Derived query:
("gene expression" AND microarray AND cell AND rna)
AND (rneasy OR trizol OR "real-time pcr")
NOT ("tissue microarray*" OR "cpg island*")
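The derived query can be read as a boolean filter over an article's full text. The sketch below applies it locally with regular expressions. It assumes case-insensitive whole-word matching and treats the trailing * as a prefix wildcard; a real full-text search engine may tokenize and match differently, and the function names are hypothetical.

```python
import re

# Terms copied from the derived query on this slide.
REQUIRED = ["gene expression", "microarray", "cell", "rna"]
ANY_OF = ["rneasy", "trizol", "real-time pcr"]
EXCLUDED_PREFIXES = ["tissue microarray", "cpg island"]  # the "*" wildcard terms

def has_term(text, term, prefix=False):
    """True if `term` occurs as a whole word or phrase (or as a prefix if requested)."""
    tail = r"\w*" if prefix else r"\b"
    pattern = r"\b" + re.escape(term) + tail
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def matches_derived_query(full_text):
    """Apply (all REQUIRED) AND (any ANY_OF) NOT (any EXCLUDED_PREFIXES)."""
    if not all(has_term(full_text, t) for t in REQUIRED):
        return False
    if not any(has_term(full_text, t) for t in ANY_OF):
        return False
    return not any(has_term(full_text, t, prefix=True) for t in EXCLUDED_PREFIXES)

if __name__ == "__main__":
    sample = ("We measured gene expression with a microarray. "
              "Total RNA was extracted from each cell line using TRIzol.")
    print(matches_derived_query(sample))
```

The NOT clause is checked last so that an article mentioning tissue microarrays is excluded even when every other term is present.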
42. Evaluation:
Ochsner et al. Nature Methods (2008) vol. 5 (12) pp. 991
• 400 studies across 20 journals
Precision: 90% (86% to 93%)
Recall: 56% (52% to 61%)
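For reference, precision and recall with an interval can be computed from confusion counts as sketched below. The slide does not state which interval method was used, so the Wilson score interval here is an assumption, and the counts in the usage example are hypothetical rather than the study's counts.

```python
import math

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    tp, fp, fn = 90, 10, 30
    precision, recall = precision_recall(tp, fp, fn)
    print(precision, wilson_interval(tp, tp + fp))
    print(recall, wilson_interval(tp, tp + fn))
```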
54. Funder Journal Investigator Institution Study
Is research data shared
after publication?
55. Funder Journal Investigator Institution Study
Funder: funded by NIH?; size of grant; sharing plan req’d?; funded by non-NIH?
Journal: impact factor; strength of policy; open access?; number of microarray studies published
Investigator: years since first paper; # pubs; # citations; previously shared?; previously reused?; gender
Institution: sector; size; impact rank; country
Study: humans?; mice?; plants?; cancer?; clinical trial?; number of authors; year
57. journal data sharing policy
“An inherent principle of publication is that
others should be able to replicate and build
upon the authors' published claims.
Therefore, a condition of publication
in a Nature journal is that authors are
required to make materials, data and
associated protocols available in a publicly
accessible database …”
http://www.nature.com/authors/editorial_policies/availability.html
http://www.nature.com/nature/journal/v453/n7197/index.html
60. author “experience”
Author publication history:
Author name disambiguation: Author-ity web service
Torvik & Smalheiser. (2009). Author Name Disambiguation in MEDLINE. ACM Transactions on
Knowledge Discovery from Data, 3(3):11.
Citation counts:
63. funder mandates
Requires a data sharing plan
for studies funded after October 2003
that receive more than $500 000 in
direct funding per year
64. funder mandates
Proxy for NIH data sharing policy
applicability:
If in any year since 2004,
• funded by an NIH grant number
with a “1” or “2” type code
• received more than $750 000 in
total funding from the grant
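The proxy rule above can be sketched as a small predicate. This is an illustration under stated assumptions, not the code used in the study: the record layout and names are hypothetical, the type code is assumed to be the leading digit of the grant number, "since 2004" is read as 2004 or later, and the $750 000 threshold is applied to each year's record.

```python
from dataclasses import dataclass

@dataclass
class GrantYear:
    """One funding record for one grant in one year (hypothetical layout)."""
    grant_number: str        # assumed to start with the type code digit
    fiscal_year: int
    total_funding_usd: float

def policy_proxy_applies(records):
    """True if any record from 2004 onward has type code 1 or 2 and
    more than $750,000 in total funding (the proxy rule on this slide)."""
    return any(
        r.fiscal_year >= 2004
        and r.grant_number.strip()[:1] in {"1", "2"}
        and r.total_funding_usd > 750_000
        for r in records
    )

if __name__ == "__main__":
    # Made-up grant numbers and amounts, for illustration only.
    records = [
        GrantYear("5R01XX000001-03", 2005, 900_000),  # type code 5: ignored
        GrantYear("1R01XX000002-01", 2006, 800_000),  # satisfies the rule
    ]
    print(policy_proxy_applies(records))
```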
68. results
11,603 datapoints
we found shared datasets for 25%
69. Proportion of articles with shared datasets, by year
[Figure "Across time": proportion of articles with datasets found in GEO or ArrayExpress (y-axis, ticks 0.05 to 0.35) by year article published (x-axis, 2000 to 2009)]
71. Proportion of datasets shared, by journal
[Figure: proportion of datasets shared (axis ticks 0.0 to 1.0) for each of the journals below]
Physiol Genomics
PLoS Genet
Genome Biol
Microbiology
PLoS One
BMC Genomics
Plant Cell
Genome Res
Eukaryot Cell
Appl Environ Microbiol
BMC Med Genomics
Hum Mol Genet
Proc Natl Acad Sci U S A
Infect Immun
Am J Respir Cell Mol Biol
Dev Biol
J Bacteriol
Mol Endocrinol
BMC Cancer
Plant Physiol
Biol Reprod
Blood
J Immunol
FASEB J
Toxicol Sci
J Exp Bot
Nucleic Acids Res
Diabetes
Mol Cell Biol
Mol Cancer Ther
BMC Bioinformatics
Stem Cells
FEBS Lett
J Neurosci
Am J Pathol
J Biol Chem
J Virol
OTHER
Cancer Res
J Clin Endocrinol Metab
Plant Mol Biol
Clin Cancer Res
Genomics
Invest Ophthalmol Vis Sci
Mol Hum Reprod
Carcinogenesis
Gene
Endocrinology
Oncogene
Cancer Lett
Biochem Biophys Res Commun
72. Proportion of datasets shared, by institution
[Figure: proportion of datasets shared (axis ticks 0.0 to 1.0) for each of the institutions below]
Stanford University
University of Pennsylvania
University of Illinois
University of California, Los Angeles
University of Wisconsin, Madison
University of Washington
University of California, Davis
The University of British Columbia
University of California, San Francisco
University of Florida
University of California, San Diego
University of Minnesota, Twin Cities
Baylor College of Medicine
OTHER
Max Planck Gesellschaft
Harvard University
Duke University Medical Center
Yale University
Johns Hopkins University
University of Pittsburgh
Washington University in Saint Louis
University of Toronto
University of California, Berkeley
University of Michigan, Ann Arbor
Michigan State University
National Cancer Institute
Tokyo Daigaku
90. Multivariate nonlinear regressions with interactions
[Figure: odds ratio (axis ticks 0.25 to 8.00; interval label 0.95) for each of the factors below]
Has journal policy
Count of R01 & other NIH grants
Authors prev GEOAE sharing & OA & microarray creation
NO K funding or P funding
Journal impact
Institution high citations & collaboration
Journal policy consequences & long halflife
NOT animals or mice
Institution is government & NOT higher ed
Last author num prev pubs & first year pub
Large NIH grant
Humans & cancer
NO geo reuse + YES high institution output
First author num prev pubs & first year pub
93. [Figure: two panels showing the same 15 factors in different orders; the first panel's order is listed below]
Institution is government & NOT higher ed
NOT institution NCI or intramural
NO K funding or P funding
Journal policy consequences & long halflife
Authors prev GEOAE sharing & OA & microarray creation
Institution high citations & collaboration
NOT animals or mice
First author num prev pubs & first year pub
Humans & cancer
Count of R01 & other NIH grants
Large NIH grant
Has journal policy
NO geo reuse + YES high institution output
Last author num prev pubs & first year pub
Journal impact
97. Multivariate nonlinear regression with interactions
[Figure: odds ratio (axis ticks 0.25 to 4.00; interval label 0.95) for each of the factors below]
OA journal & previous GEO-AE sharing
Amount of NIH funding
Journal impact factor and policy
Higher Ed in USA
Cancer & humans
102. Open access/previous sharing: 31%
Less OA/prev sharing: 19%
cancer/human: 18%
Not cancer/human: 32%
Overall: 25%
103.
                               cancer/human   Not cancer/human   Overall
Open access/previous sharing   24%            37%                31%
Less OA/prev sharing           13%            25%                19%
Overall                        18%            32%                25%
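As a rough guide to reading the odds ratio plots against these percentages, the sketch below converts two sharing proportions into a crude odds ratio. These are unadjusted ratios computed from the marginal percentages on this slide; they are not the multivariate regression estimates, and the function names are only illustrative.

```python
def odds(p):
    """Convert a proportion into odds, p / (1 - p)."""
    return p / (1 - p)

def odds_ratio(p_group, p_reference):
    """Crude (unadjusted) odds ratio of one group against a reference group."""
    return odds(p_group) / odds(p_reference)

if __name__ == "__main__":
    # Marginal sharing rates taken from the slide.
    oa_vs_less = odds_ratio(0.31, 0.19)     # OA/previous sharing vs. less
    cancer_vs_not = odds_ratio(0.18, 0.32)  # cancer/human vs. not
    print(round(oa_vs_less, 2), round(cancer_vs_not, 2))
```

A ratio above 1 means the first group shared more often than the reference group; a ratio below 1 means less often.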
104. Conclusions:
• data sharing rates are increasing,
but overall levels are low
Preliminary evidence:
• levels are particularly low in cancer
• levels are highest for those who are
publishing OA,
have shared before
105. • data and filters were imperfect
• many assumptions
• didn’t capture all types of sharing
• don’t know how generalizable across datatypes
• should be considered hypothesis-generating
http://www.flickr.com/photos/vlastula/300102949/
106. Goal of this dissertation:
Collect useful evidence on patterns of
data sharing behaviour through methods
that can be applied broadly, repeatably,
and cost-effectively.
107. contribution
• Aim 1 publication cited 45 times in Google Scholar,
including by several editorials and books
• Aim 2 methods reused in a neuroethics study at UBC
• Aim 3 revealed evidence suggesting areas with high and
low data sharing adoption for future study
• data collection was mostly automated, using mostly free
and open resources
• dataset, collection code, analysis scripts to be made
openly available upon publication of thesis
109. More data analysis
Including:
• Citation analysis of the 11,603 articles
• Analysis with a focus on policy variables
• Causality through structural equation
modeling
doi/10.1371/journal.pone.0008469.g002
111. who reuses data?
why?
when?
who doesn’t?
which datasets are most likely to be
reused?
how many datasets could be reused but
aren’t?
why aren’t they?
what can we do about it?
what should we do about it?
112. Post‐doc of my dreams
Postdoctoral Research Associate
in the Sharing, Preservation,
and Stewardship of Scientific
Data
Potential areas of focus include:
• overcoming social and technological
barriers to data deposition among
scientists
• the roles and interactions of individual
scientists, journals/publishers,
institutions, and the variety of
disciplinary repositories
• ...
http://www.flickr.com/photos/gatewaystreets/3838452287/
113. Enable new science and knowledge
creation through universal access
to data about life on earth and
the environment that sustains it.
Dryad is a repository of data
underlying scientific publications,
with an initial focus on evolution,
ecology, and related fields.
The National Evolutionary
Synthesis Center, NSF-funded:
• Duke University,
• UNC at Chapel Hill
• North Carolina State University
114. Data sharing
is hard.
I share my code and data at http://www.researchremix.org
It is hard.
Some is better than none.
Be the change you want to see.
http://www.flickr.com/photos/myklroventine/892446624/
115. Thanks to
the Dept of Biomedical Informatics at the U of Pittsburgh,
the NLM for funding through training grant 5 T15 LM007059,
those who openly publish their data, source code, papers, photos,
Dr. Wendy Chapman for her support and feedback,
My family.