2. Challenges/Opportunities in the Data-Driven Era
Big Challenges:
Quick response to climate change, food security & disease outbreaks
Enables:
Using networking power of the internet to tackle problems
Can ask new questions & find hidden patterns & connections
Build on each other's efforts quicker & more efficiently
More collaborations across more disciplines
Harness wisdom of the crowds: crowdsourcing, citizen science,
crowdfunding
Enabled by:
Removing silos, standards/formats, open-access/data
4. What do publishers do?
the scholarly chicken
(tl;dr version)
Apologies: http://scholarlykitchen.sspnet.org/2014/10/21/updated-80-things-publishers-do-2014-edition/
5. The problems with publishing
• Scholarly articles are merely advertisement of scholarship.
The actual scholarly artefacts, i.e. the data and computational
methods, which support the scholarship, remain largely
inaccessible --- Jon B. Buckheit and David L. Donoho, WaveLab
and reproducible research, 1995
• Traditional publishing policies and practices a hindrance
(licensing & access, embargoes, Ingelfinger, closed doors,
anti-granularity & forking)
• Lack of transparency, lack of credit for anything other than
“regular” dead tree publication.
6. Are publishers really adding value?
1. http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124
7. The consequences: growing replication gap
Out of 18 microarray papers, results
from 10 could not be reproduced
1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)
8. Consequences: increasing number of retractions
>15X increase in last decade
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?
9. Consequences: increasing number of retractions
>15X increase in last decade
At current rate of increase, by 2045 as many
papers retracted as published
1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
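The 2045 projection is a compound-growth extrapolation. Below is a minimal sketch of that arithmetic, assuming the >15X-per-decade increase simply continues unchanged. The function name and the starting retraction fraction are placeholders chosen for illustration, not figures from these slides.

```python
import math


def years_to_parity(current_fraction, growth_per_decade=15.0):
    """Years until the retracted fraction reaches 1.0 under constant exponential growth.

    current_fraction: retractions / papers published today (hypothetical input).
    growth_per_decade: multiplicative increase per 10 years (the slide quotes >15X).
    """
    if not 0 < current_fraction <= 1:
        raise ValueError("current_fraction must be in (0, 1]")
    if growth_per_decade <= 1:
        raise ValueError("growth_per_decade must exceed 1")
    # Solve current_fraction * growth_per_decade ** (t / 10) == 1 for t.
    return 10.0 * math.log(1.0 / current_fraction) / math.log(growth_per_decade)


# Placeholder starting point: 1 retraction per 1,000 papers (illustrative only).
print(round(years_to_parity(0.001), 1))
```

The result is very sensitive to both inputs, so a projection like this is best read as a warning about the trend rather than a forecast.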
10. STAP paper demonstrates problems:
Nature Editorial, 2nd July 2014:
“We have concluded that we and the referees could not
have detected the problems that fatally undermined the
papers. The referees’ rigorous reports quite rightly took
on trust what was presented in the papers.”
http://www.nature.com/news/stap-retracted-1.15488
11. STAP paper demonstrates problems:
Need:
…to publish protocols BEFORE analysis
…better access to supporting data
…more transparent & accountable review
…to publish replication studies
12. The solutions for publishing?
1. http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124
2. http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.1001747
14. Credit where credit is overdue:
“One option would be to provide researchers who release data to public
repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data
set would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)
New incentives/credit:
• Data
• Software
• Review
• Re-use…
= Credit
17. BMC Series
Medical Journals
Reward open & transparent review
• Good data showing no difference in acceptance/rejection rates, but
better quality reviews.
• Does take marginally longer to find reviewers (and for them to return
reports).
Data from similar scope open/closed review journals in BMC Series shows ~5-10%
harder to get referees for open review. (data from Tim Sands at BMC)
18. Reward open & transparent review
GigaScience + Publons = further credit for reviewers' efforts
http://publons.com/
19. Reward faster review
GigaScience + AcademicKarma = even more credit
http://academickarma.org/
20. Reward open & transparent review
Real-time open-review = paper in arXiv + blogged reviews
www.gigasciencejournal.com/content/2/1/10 http://tmblr.co/ZzXdssfOMJfy
22. Reward open & transparent review
Real-time open-review = paper in arXiv + blogged reviews
(Assemblathon ‘publish for free’ contest: publishforfree@assemblathon.org)
24. Snapshots of the research cycle
Data, data, data…
Genomic: (cats, and minipigs, and parrots, and
elephants, oh my!)
Imaging: fMRI, myocardial MRI, micro-CT from
worms & centipedes, sea urchin MRIs
Neurophysiology: neural activity recordings, EEG
25. Snapshots of the research cycle
Software, pipelines, workflows…
Strict code availability policy in GigaScience (OSI compliant)
Publication/proof-of-record version archived in GigaDB
Provides extra credit & discoverability with DOI
Also link to dynamic/updating version in code repository, inc. our
GigaGitHub repo (https://github.com/gigascience)
Experimenting with supplemental tables in GitHub (see:
https://github.com/gigascience/paper-chen2014/wiki)
26. Snapshots of the research cycle
Implement workflows in a community-accepted format
http://galaxyproject.org
Open source
Over 50,000 main Galaxy server users
Over 1,000 papers citing Galaxy use
Over 60 Galaxy servers deployed
30. Further beyond dead trees:
Open lab books, dynamic documents
• Can facilitate reproducibility, reuse & sharing with tools like:
knitr, Sweave, IPython Notebook
• Working towards executable papers…
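The idea behind dynamic documents can be sketched in a few lines: the prose is generated from the data, so the reported numbers cannot drift from the analysis. This is only an illustration using the Python standard library, not how knitr, Sweave or IPython Notebook are implemented. The function name is hypothetical, and the counts are the ones quoted on the replication-gap slide.

```python
from string import Template

# Counts taken from the microarray repeatability slide earlier in this deck.
papers_examined = 18
not_reproduced = 10


def render_report(total, failed):
    """Fill the prose template from the data so text and numbers stay in sync."""
    percent = round(100 * failed / total)
    template = Template(
        "Out of $total microarray papers, results from $failed ($percent%) "
        "could not be reproduced."
    )
    return template.substitute(total=total, failed=failed, percent=percent)


print(render_report(papers_examined, not_reproduced))
```

Re-running the script after the underlying counts change regenerates the sentence, which is the property the tools above provide for whole papers.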
33. Aiding reproducibility of imaging studies
OMERO: providing access to imaging data
Already used by JCB.
View, filter, measure raw images with direct links from journal article.
See all image data, not just cherry-picked examples.
Download and reprocess.
37. Our first DOI:
To maximize its utility to the research community and aid those fighting
the current epidemic, genomic data is released here into the public
domain under a CC0 license. Until the publication of research papers on
the assembly and whole-genome analysis of this isolate we would ask you
to cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang,
J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J;
Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao,
X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the
Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium
(2011)
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI
Shenzhen. doi:10.5524/100001
http://dx.doi.org/10.5524/100001
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to
Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
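A dataset DOI is citable because it resolves to a stable landing page. Below is a small sketch of how the "doi:" form in the citation above maps onto the resolver link shown on the slide. The helper name is hypothetical, and the resolver prefix is simply the one used in this deck.

```python
def doi_to_url(doi, resolver="http://dx.doi.org/"):
    """Turn a bare DOI or a 'doi:'-prefixed string into a resolvable link."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    # Every DOI begins with the directory indicator "10.".
    if not doi.startswith("10."):
        raise ValueError("not a DOI: " + doi)
    return resolver + doi


print(doi_to_url("doi:10.5524/100001"))
```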
41. Downstream consequences:
1. Citations (~240)
2. Therapeutics (primers, antimicrobials)
3. Platform Comparisons
4. Example for faster & more open science
“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the
Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he
knew it might take days for the lawyers at his company — Pacific Biosciences — to parse the
agreements governing how his team could use data collected on the strain. Luckily, one team had
released its data under a Creative Commons licence that allowed free use of the data, allowing
Kasarskis and his colleagues to join the international research effort and publish their work without
wasting time on legal wrangling.”
43. 1.3 The power of intelligently open data
The benefits of intelligently open data were powerfully
illustrated by events following an outbreak of a severe gastro-intestinal
infection in Hamburg in Germany in May 2011. This
spread through several European countries and the US,
affecting about 4000 people and resulting in over 50 deaths.
All tested positive for an unusual and little-known Shiga-toxin–producing
E. coli bacterium. The strain was initially analysed
by scientists at BGI-Shenzhen in China, working together with
those in Hamburg, and three days later a draft genome was
released under an open data licence. This generated interest
from bioinformaticians on four continents. 24 hours after the
release of the genome it had been assembled. Within a week
two dozen reports had been filed on an open-source site
dedicated to the analysis of the strain. These analyses
provided crucial information about the strain’s virulence and
resistance genes – how it spreads and which antibiotics are
effective against it. They produced results in time to help
contain the outbreak. By July 2011, scientists published papers
based on this work. By opening up their early sequencing
results to international collaboration, researchers in Hamburg
produced results that were quickly tested by a wide range of
experts, used to produce new knowledge and ultimately to
control a public health emergency.
44. Nanopore MinION E. coli genome
released via GigaDB 10-Sep-2014
Curated & converted to ISA-Tab, & worked with EBI to get raw data there
Data Note submitted & preprint version out 26th September
Peer reviewed & published 20th October
second
http://dx.doi.org/10.5524/100102
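The turnaround implied by these dates can be worked out directly. This sketch assumes the preprint and publication dates fall in 2014, since the slide gives the year only for the data release. The helper name is illustrative.

```python
from datetime import date

released = date(2014, 9, 10)    # data released via GigaDB
preprint = date(2014, 9, 26)    # Data Note preprint out (year assumed)
published = date(2014, 10, 20)  # peer-reviewed version published (year assumed)


def days_between(start, end):
    """Whole days elapsed from start to end."""
    return (end - start).days


print("release to preprint:", days_between(released, preprint), "days")
print("release to publication:", days_between(released, published), "days")
```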
45. second
Real time sequencing era needs real time publication!
• Used as test data for
“minoTour”: real time data
analysis tools for MinION data
• Nanopore data already used
in (CC0 GitHub based)
teaching materials
• Next stop…poreathon!
(crowdsourced v2 assembly)
1. minoTour http://minotour.nottingham.ac.uk/
2. https://github.com/lexnederbragt/INF-BIOx121_fall2014_de_novo_assembly
46. Lessons Learned
• Most published research findings are false. Or at
least have errors.
• Is possible to push button(s) & recreate a result from
a paper
• Reproducibility is COSTLY. How much are you willing
to spend?
• Much easier to do this before rather than after
publication
47. The cost of staying with the status quo?
• Ioannidis estimates that 85% of research resources are wasted.
• Each retraction due to research misconduct estimated to cost about $400,000 in NIH funds.
48. In Summary
Make your data & software
open (CC0, OSI)
Get credit for your reviewing
Publish your research objects
(with us!)*
* Free APCs until end of 2014
scott@gigasciencejournal.com
@gigascience
facebook.com/GigaScience
www.gigasciencejournal.com
49. Thanks to:
team: Our collaborators: Case study:
Ruibang Luo (BGI/HKU)
Shaoguang Liang (BGI-SZ)
Tin-Lap Lee (CUHK)
Qiong Luo (HKUST)
Senghong Wang (HKUST)
Yan Zhou (HKUST)
Funding from: CBIIT
@gigascience
facebook.com/GigaScience
blogs.biomedcentral.com/gigablog/
Peter Li
Chris Hunter
Jesse Si Zhe
Rob Davidson
Nicole Nogoy
Laurie Goodman
Amye Kenall (BMC)
Marco Roos (LUMC)
Mark Thompson (LUMC)
Jun Zhao (Lancaster)
Susanna Sansone (Oxford)
Philippe Rocca-Serra (Oxford)
Alejandra Gonzalez-Beltran (Oxford)
www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com
Editor's notes
Ferric Fang of the University of Washington and his colleagues quantified just how much fraud costs the government.
It turns out that every paper retracted because of research misconduct costs about $400,000 in funds from the US National Institutes of Health (NIH)—totaling $58 million for papers retracted between 1992 and 2012.
Scientific fraud incurs additional costs.
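As a quick sanity check, the two figures quoted in these notes imply roughly how many misconduct retractions they cover. Both inputs are the rounded values given above, so the result is an order-of-magnitude check rather than a count from the underlying study. The names are illustrative.

```python
COST_PER_RETRACTION_USD = 400_000  # approximate NIH funding per retracted paper
TOTAL_COST_USD = 58_000_000        # total quoted for 1992-2012


def implied_paper_count(total, per_paper):
    """Number of papers implied by a total cost and an average cost per paper."""
    return total / per_paper


print(implied_paper_count(TOTAL_COST_USD, COST_PER_RETRACTION_USD))
```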
That just leaves me to thank the GigaScience team: Laurie, Scott, Alexandra, Peter and Jesse, BGI for their support - specifically Shaoguang for IT and bioinformatics support – our collaborators on the database, website and tools: Tin-Lap, Qiong, Senhong, Yan, the Cogini web design team, Datacite for providing the DOI service and the isacommons team for their support and advocacy for best practice use of metadata reporting and sharing.
Thank you for listening.