Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

ScSc
0000-0001-6444-1436
@SCEdmunds
scott@gigasciencejournal.com
NEW
M
O
DEL
O
pen
data
publishing
Scott Edmunds
Balti Bioinformatics

The problems with publishing
• Scholarly articles are merely advertisement of scholarship .
The actual scholarly artefacts, i.e. the data and computational
methods, which support the scholarship, remain largely
inaccessible --- Jon B. Buckheit and David L. Donoho, WaveLab
and reproducible research, 1995
• Lack of transparency, lack of credit for anything other than
350-year old style “dead tree” publication
• Traditional publishing policies and practices a hindrance
(licensing & access, embargoes, Ingelfinger, closed doors,
anti-granularity & forking)

The consequences: growing replication gap
1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)
Out of 18 microarray papers, results
from 10 could not be reproduced
Out of 18 microarray papers, results
from 10 could not be reproduced

Consequences: increasing number of retractions
>15X increase in last decade
At current % > by 2045 as many
papers published as retracted
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950

STAP paper demonstrates problems:
Nature Editorial, 2nd
July 2014:
“We have concluded that we and the referees could
not have detected the problems that fatally
undermined the papers. The referees’ rigorous
reports quite rightly took on trust what was
presented in the papers.”
http://www.nature.com/news/stap-retracted-1.15488

STAP paper demonstrates problems:
…to publish protocols BEFORE analysis
…better access to supporting data
…more transparent & accountable review
…to publish replication studies
Need:

• Review
• Data
• Software
• Models
• Pipelines
• Re-use…
= Credit
}
Credit where credit is overdue:
“One option would be to provide researchers who release data to public
repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data
set would enable appropriate attribution for those who share. “
Nature Biotechnology 27, 579 (2009)
New incentives/credit

Not just carrots…
“The data discovery index (DDI) enabled through
bioCADDIE is to do for data what PubMed (and
PubMed Central) did for the literature.”

Methods
Answer
Metadata
softwareAnalysis
(Pipelines)
Workflows/
Environments
Idea
Study
Rewarding the
DOI, etc.
Publication
Publication
Publication
Data

Open peer review
1. Transparency

The only drawback?
End reviewer 3 Downfall parody videos, now!
1. Transparency
Open peer review

Publons + AcademicKarma
= credit for reviewers efforts
http://publons.com/
1. Transparency/open peer review
http://academickarma.org/

1. Transparency
Reward pre-prints

http://tmblr.co/ZzXdssfOMJfy
arXiv + blogged reviews = real-time open-review
1. Transparency

arXiv + blogged reviews = real-time open-review
1. Transparency

IRRI GALAXY
Rice 3K project: 3,000 rice genomes, 13.4TB public data
2. (Big) Data

2. Data
Reward Intermediate Data

Nanopore MinION E. Coli genome
released via GigaDB 10-Sep-2014
Curated & converted to ISA-tab, &
worked with EBI to get raw data there
Data Note submitted & preprint version
out 26th
September
Peer reviewed & published 20th
October
2. Data
Reward Faster Data Release
http://www.gigasciencejournal.com/content/3/1/22

Real time sequencing era needs real time publication!
• Used as test data for
“minoTour”: real time data
analysis tools for minION data
• Nanopore data already used
in (CC0 GitHub based)
teaching materials
• Next stop…Erratums, Updates
& more (see later)
1. mioTour http://minotour.nottingham.ac.uk/
2. https://github.com/lexnederbragt/INF-BIOx121_fall2014_de_novo_assembly
2. Data
Reward Faster Data Release

OMERO: providing access
to imaging data
Already used by JCB.
View, filter, measure raw
images with direct links
from journal article.
See all image data, not just
cherry picked examples.
Download and reprocess.
2. Data
Reward Imaging Data

The alternative...
...look but don't touch
2. Data
Reward Imaging Data

3. Software
https://www.change.org/p/everyone-in-the-research-community-we-must-accept-that-s

galaxy.cbiit.cuhk.edu.hk
4. Workflows
Reward Sharing of Workflows

Visualisations
& DOIs for workflows
http://www.gigasciencejournal.com/series/Galaxy 26

• Can facilitate reproducibility, reuse & sharing with tools like:
Knitr, Sweave, iPython Notebook
5. Open Documents
Reward Open/Dynamic Workbooks

5. Virtual Machines
?
http://ivory.idyll.org/blog/vms-considered-harmful.html

http://dx.doi.org/10.5524/100106
http://www.gigasciencejournal.com/content/3/1/23
5. Virtual Machines

Taking a microscope to the
publication process

How reproducible can we get?
Data sets
Analyses
Linked to
Linked to
DOI
DOI
Open-Paper
Open-Review
DOI:10.1186/2047-217X-1-18
>33,000 accesses
& 270 citations
Open-Code
7 reviewers tested data in ftp server & named reports published
DOI:10.5524/100044
Open-Pipelines
Open-Workflows
DOI:10.5524/100038
Open-Data
78GB CC0 data
Code in sourceforge under GPLv3:
http://soapdenovo2.sourceforge.net/>36,000 downloads
Enabled code to being picked apart by bloggers in wiki
http://homolog.us/wiki/index.php?title=SOAPdenovo2
34

Post publication: bloggers pull apart code/reviews in blogs + wiki:
SOAPdenov2 wiki: http://homolog.us/wiki1/index.php?title=SOAPdenovo2
Homologus blogs: http://www.homolog.us/blogs/category/soapdenovo/
Reward open & transparent review

SOAPdenovo2 workflows implemented in

SOAPdenovo2 workflows implemented in
Implemented entire workflow in our Galaxy server, inc.:
• 3 pre-processing steps
• 4 SOAPdenovo modules
• 1 post processing steps
• Evaluation and visualization tools

Can we reproduce results? SOAPdenovo2 S. aureus pipeline

The SOAPdenovo2 Case study
Subject to and test with 3 models:
DataData
Method/Experi
mental protocol
Method/Experi
mental protocol
FindingsFindings
Types of resources in an RO
Wfdesc/ISA-
TAB/ISA2OWL
Wfdesc/ISA-
TAB/ISA2OWL
Models to describe each resource type
See: http://biorxiv.org/content/early/2014/12/08/011973

1. While there are huge improvements to the quality of the
resulting assemblies, other than the tables it was not stressed in
the text that the speed of SOAPdenovo2 can be slightly slower
than SOAPdenovo v1.
2. In the testing an assessment section (page 3), based on the
correct results in table 2, where we say the scaffold N50 metric
is an order of magnitude longer from SOAPdenovo2 versus
SOAPdenovo1, this was actually 45 times longer
3. Also in the testing an assessment section, based on the
correct results in table 2, where we say SOAPdenovo2
produced a contig N50 1.53 times longer than ALL-PATHS, this
should be 2.18 times longer.
4. Finally in this section, where we say the correct assembly
length produced by SOAPdenovo2 was 3-80 fold longer than
SOAPdenovo1, this should be 3-64 fold longer.

Lessons Learned
• Most published research findings are false. Or at
least have errors
• Is possible to push button(s) & recreate a result from
a paper
• Reproducibility is COSTLY. How much are you willing
to spend?
• Much easier to do this before rather than after
publication

The cost of staying with the status quo?
• Ioannidis estimate that 85% of research resources are wasted.
• Each retraction estimated to cost $400,000.

Make your data, software &
other ROs open (CC0, OSI)
Get credit for your reviewing
Publish your research objects
(with us!)
In Summary
scott@gigasciencejournal.com
www.gigasciencejournal.com
@gigascience
facebook.com/GigaScience

Ruibang Luo (BGI/HKU)
Shaoguang Liang (BGI-SZ)
Tin-Lap Lee (CUHK)
Qiong Luo (HKUST)
Senghong Wang (HKUST)
Yan Zhou (HKUST)
Thanks to:
@gigascience
facebook.com/GigaScience
blogs.biomedcentral.com/gigablog/
Peter Li
Chris Hunter
Jesse Si Zhe
Rob Davidson
Nicole Nogoy
Laurie Goodman
Amye Kenall (BMC)
Marco Roos (LUMC)
Mark Thompson (LUMC)
Jun Zhao (Lancaster)
Susanna Sansone (Oxford)
Philippe Rocca-Serra (Oxford)
Alejandra Gonzalez-Beltran (Oxford)
www.gigadb.org
www.gigasciencejournal.com
CBIITFunding from:
Our collaborators:team: Case study:
45

Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

Ähnlich wie Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing (20)

Mehr von GigaScience, BGI Hong Kong

Mehr von GigaScience, BGI Hong Kong (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

Hinweis der Redaktion