This document discusses problems with traditional scholarly publishing and proposes solutions centered on open data and transparency. It notes that traditional publishing hinders reproducibility through lack of access to data and methods, which has led to a growing number of non-reproducible findings and retractions. The document advocates incentivizing the publication of data, software, workflows and other research objects to improve reproducibility and transparency, and highlights several examples where making these elements openly available improved scrutiny and identified errors in published work.
2. The problems with publishing
• Scholarly articles are merely advertisements of scholarship. The actual scholarly artefacts, i.e. the data and computational methods that support the scholarship, remain largely inaccessible. --- Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995
• Lack of transparency, and lack of credit for anything other than 350-year-old-style “dead tree” publication
• Traditional publishing policies and practices are a hindrance (licensing & access, embargoes, the Ingelfinger rule, closed doors, anti-granularity & forking)
3. The consequences: growing replication gap
1. Ioannidis JPA et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 149-155
2. Ioannidis JPA (2005). Why Most Published Research Findings Are False. PLoS Med 2(8): e124
Out of 18 microarray papers, results from 10 could not be reproduced.
4. Consequences: increasing number of retractions
>15× increase in retractions in the last decade.
At the current rate of increase, by 2045 as many papers will be retracted as published.
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
5. STAP paper demonstrates problems:
Nature Editorial, 2 July 2014:
“We have concluded that we and the referees could not have detected the problems that fatally undermined the papers. The referees’ rigorous reports quite rightly took on trust what was presented in the papers.”
http://www.nature.com/news/stap-retracted-1.15488
6. STAP paper demonstrates problems:
Need:
…to publish protocols BEFORE analysis
…better access to supporting data
…more transparent & accountable review
…to publish replication studies
7. • Review
• Data
• Software
• Models
• Pipelines
• Re-use…
= Credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)
New incentives/credit
8. Not just carrots…
“The data discovery index (DDI) enabled through bioCADDIE is to do for data what PubMed (and PubMed Central) did for the literature.”
20. Nanopore MinION E. coli genome released via GigaDB 10 September 2014.
Curated & converted to ISA-Tab, and worked with EBI to get the raw data there.
Data Note submitted & preprint version out 26 September.
Peer reviewed & published 20 October.
2. Data
Reward Faster Data Release
http://www.gigasciencejournal.com/content/3/1/22
21. The real-time sequencing era needs real-time publication!
• Used as test data for “minoTour”: real-time data analysis tools for MinION data
• Nanopore data already used in (CC0, GitHub-based) teaching materials
• Next stop… errata, updates & more (see later)
1. minoTour http://minotour.nottingham.ac.uk/
2. https://github.com/lexnederbragt/INF-BIOx121_fall2014_de_novo_assembly
2. Data
Reward Faster Data Release
22. OMERO: providing access to imaging data
Already used by JCB.
View, filter, and measure raw images with direct links from the journal article.
See all image data, not just cherry-picked examples.
Download and reprocess (see the sketch below).
2. Data
Reward Imaging Data
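To make the “download and reprocess” point concrete, OMERO servers expose a Python API. A minimal sketch using omero-py's BlitzGateway; the host, credentials, and image ID below are placeholders, not details from the slides:

    from omero.gateway import BlitzGateway

    # Placeholder credentials and host; substitute a real OMERO server.
    conn = BlitzGateway("username", "password", host="omero.example.org", port=4064)
    if not conn.connect():
        raise RuntimeError("could not connect to the OMERO server")

    # Fetch an image by (placeholder) ID and pull one raw plane for reprocessing.
    image = conn.getObject("Image", 123)
    print(image.getName(), image.getSizeX(), image.getSizeY())
    plane = image.getPrimaryPixels().getPlane(0, 0, 0)  # (z, c, t) indices
    print(plane.shape)  # numpy array of raw pixel values

    conn.close()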
34. How reproducible can we get?
Data sets & analyses all linked to each other via DOIs (resolvable programmatically; see the sketch below):
• Open-Paper & Open-Review: DOI:10.1186/2047-217X-1-18 (>33,000 accesses & 270 citations); 7 reviewers tested the data on an FTP server & their named reports were published
• Open-Data: 78GB of CC0 data, DOI:10.5524/100038
• Open-Pipelines & Open-Workflows: DOI:10.5524/100044
• Open-Code: code in SourceForge under GPLv3: http://soapdenovo2.sourceforge.net/ (>36,000 downloads); enabled the code to be picked apart by bloggers in a wiki: http://homolog.us/wiki/index.php?title=SOAPdenovo2
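Each of the research objects above resolves through its DOI, and DataCite DOIs support standard content negotiation, so the metadata can be retrieved by machines as well as browsers. A minimal sketch using one of the dataset DOIs from this slide:

    import requests

    # Resolve the Open-Data DOI to machine-readable metadata via content
    # negotiation; the Accept header requests citeproc JSON.
    resp = requests.get(
        "https://doi.org/10.5524/100038",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=30,
    )
    resp.raise_for_status()
    meta = resp.json()
    print(meta.get("title"), "->", meta.get("DOI"))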
35. Post-publication: bloggers pull apart code & reviews in blogs + wiki:
SOAPdenovo2 wiki: http://homolog.us/wiki1/index.php?title=SOAPdenovo2
Homologus blogs: http://www.homolog.us/blogs/category/soapdenovo/
Reward open & transparent review
39. The SOAPdenovo2 Case study
Subjected to and tested with 3 models:
• Types of resources in an RO: Data, Method/Experimental protocol, Findings
• Models to describe each resource type: Wfdesc / ISA-TAB / ISA2OWL
See: http://biorxiv.org/content/early/2014/12/08/011973
41. 1. While there are huge improvements in the quality of the resulting assemblies, it was not stressed in the text (other than in the tables) that SOAPdenovo2 can be slightly slower than SOAPdenovo v1.
2. In the testing and assessment section (page 3), based on the correct results in Table 2, where we say the scaffold N50 metric from SOAPdenovo2 is an order of magnitude longer than from SOAPdenovo1, it was actually 45 times longer (see the N50 sketch below).
3. Also in the testing and assessment section, based on the correct results in Table 2, where we say SOAPdenovo2 produced a contig N50 1.53 times longer than ALL-PATHS, this should be 2.18 times longer.
4. Finally, in this section, where we say the assembly length produced by SOAPdenovo2 was 3-80 fold longer than SOAPdenovo1, this should be 3-64 fold longer.
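All three numerical corrections above are ratios of N50 values. For readers unfamiliar with the metric, a minimal sketch of how N50 is computed; the contig lengths are made-up illustrative values, not data from the paper:

    def n50(lengths):
        """N50: the largest length L such that contigs of length >= L
        together cover at least half of the total assembly length."""
        total = sum(lengths)
        running = 0
        for length in sorted(lengths, reverse=True):
            running += length
            if running * 2 >= total:
                return length
        return 0

    # Hypothetical contig lengths in bp; 500+400 covers half of the 1500 total.
    print(n50([100, 200, 300, 400, 500]))  # -> 400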
42. Lessons Learned
• Most published research findings are false, or at least have errors
• It is possible to push a button (or buttons) & recreate a result from a paper (see the sketch below)
• Reproducibility is COSTLY. How much are you willing to spend?
• Much easier to do this before rather than after publication
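The “push button(s)” bullet refers to re-running a published analysis as a hosted workflow (the deck points to galaxy.cbiit.cuhk.edu.hk). A minimal sketch using the BioBlend client for a Galaxy server; the URL, API key, workflow name, and history ID are placeholders, not the actual GigaGalaxy setup:

    from bioblend.galaxy import GalaxyInstance

    # Placeholder server and API key; substitute a real Galaxy instance.
    gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")

    # Look up a workflow by its (hypothetical) name and re-run it against an
    # existing history holding the input datasets.
    workflows = gi.workflows.get_workflows(name="soapdenovo2-assembly")
    wf_id = workflows[0]["id"]
    invocation = gi.workflows.invoke_workflow(wf_id, history_id="HISTORY_ID")
    print(invocation["id"], invocation["state"])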
43. The cost of staying with the status quo?
• Ioannidis estimates that 85% of research resources are wasted.
• Each retraction is estimated to cost $400,000.
44. In Summary
• Make your data, software & other ROs open (CC0, OSI)
• Get credit for your reviewing
• Publish your research objects (with us!)
scott@gigasciencejournal.com
www.gigasciencejournal.com
@gigascience
facebook.com/GigaScience
45. Thanks to:
Case study: Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST)
Team: Peter Li, Chris Hunter, Jesse Si Zhe, Rob Davidson, Nicole Nogoy, Laurie Goodman
Our collaborators: Amye Kenall (BMC), Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Lancaster), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford)
Funding from: CBIIT
www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com
@gigascience
facebook.com/GigaScience
blogs.biomedcentral.com/gigablog/
Editor's notes
Ferric Fang of the University of Washington and his colleagues quantified just how much fraud costs the government. It turns out that every paper retracted because of research misconduct costs about $400,000 in funds from the US National Institutes of Health (NIH), totaling $58 million for papers retracted between 1992 and 2012. Scientific fraud incurs additional costs.
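A quick back-of-envelope check of the two figures quoted above, using only the numbers from this note:

    # $58M total / $400k per misconduct retraction ~= 145 retractions, 1992-2012.
    cost_per_retraction = 400_000       # USD (Fang et al. estimate)
    total_nih_cost = 58_000_000         # USD over 1992-2012
    print(total_nih_cost / cost_per_retraction)  # -> 145.0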
That just leaves me to thank the GigaScience team: Laurie, Scott, Alexandra, Peter and Jesse; BGI for their support, specifically Shaoguang for IT and bioinformatics support; our collaborators on the database, website and tools: Tin-Lap, Qiong, Senhong, Yan, and the Cogini web design team; DataCite for providing the DOI service; and the isacommons team for their support and advocacy for best-practice use of metadata reporting and sharing.
Thank you for listening.