Scott Edmunds ICIS talk at UC Davis: Open Publishing for the Big Data era

@SCEdmunds
0000-0001-6444-1436
scott@gigasciencejournal.com

Challenges/Opportunities in the Data-Driven Era
Big Challenges:
Quick response to climate change, food security & disease outbreaks
Enables:
Using networking power of the internet to tackle problems
Can ask new questions & find hidden patterns & connections
Build on each other's efforts quicker & more efficiently
More collaborations across more disciplines
Harness wisdom of the crowds: crowdsourcing, citizen science,
crowdfunding
Enabled by:
Removing silos, standards/formats, open-access/data

What do publishers do?
the scholarly chicken
(tl;dr version)
Apologies: http://scholarlykitchen.sspnet.org/2014/10/21/updated-80-things-publishers-do-2014-edition/

The problems with publishing
• Scholarly articles are merely advertisement of scholarship.
The actual scholarly artefacts, i.e. the data and computational
methods, which support the scholarship, remain largely
inaccessible (Jon B. Buckheit and David L. Donoho, WaveLab
and Reproducible Research, 1995)
• Traditional publishing policies and practices are a hindrance
(licensing & access, embargoes, Ingelfinger, closed doors,
anti-granularity & forking)
• Lack of transparency, lack of credit for anything other than
“regular” dead tree publication.

Are publishers really adding value?
1. http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124

The consequences: growing replication gap
Out of 18 microarray papers, results
from 10 could not be reproduced
1. Ioannidis et al. (2009) Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)

Consequences: increasing number of retractions
>15X increase in last decade
At the current rate of increase, by 2045 as many
papers will be retracted as published
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?
3. Ioannidis et al. (2009) Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
4. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950

STAP paper demonstrates problems:
Nature Editorial, 2nd July 2014:
“We have concluded that we and the referees could not
have detected the problems that fatally undermined the
papers. The referees’ rigorous reports quite rightly took
on trust what was presented in the papers.”
http://www.nature.com/news/stap-retracted-1.15488

STAP paper demonstrates problems:
Need:
…to publish protocols BEFORE analysis
…better access to supporting data
…more transparent & accountable review
…to publish replication studies

The solutions for publishing?

Credit where credit is overdue:
“One option would be to provide researchers who release data to public
repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data
set would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)
• Data
• Software
• Review
• Re-use…
} = Credit
New incentives/credit

More transparency:
open peer review

BMC Series
Medical Journals
Reward open & transparent review
• Good data showing no difference in acceptance/rejection rates, but
better quality reviews.
• Does take marginally longer to find reviewers (and for them to return
reports).
Data from similar scope open/closed review journals in BMC Series shows it is
~5-10% harder to get referees for open review (data from Tim Sands at BMC).

GigaScience + Publons = further credit for reviewers' efforts
http://publons.com/

Reward faster review
GigaScience + AcademicKarma = even more credit
http://academickarma.org/

Real-time open-review = paper in arXiv + blogged reviews
www.gigasciencejournal.com/content/2/1/10 http://tmblr.co/ZzXdssfOMJfy

(Assemblathon ‘publish for free’ contest: publishforfree@assemblathon.org)

Snapshots of the research cycle
Data, data, data…
Genomic: (cats, and minipigs, and parrots, and
elephants, oh my!)
Imaging: fMRI, myocardial MRI, micro-CT from
worms & centipedes, sea urchin MRIs
Neurophysiology: neural activity recordings, EEG

Strict code availability policy in
GigaScience (OSI compliant)
Publication/proof of record
version archived in GigaDB
Provides extra credit &
discoverability with DOI
Also link to the dynamic/updating version in a code repository, including our
GigaGitHub repo (https://github.com/gigascience)
Experimenting with supplemental tables in GitHub (see:
https://github.com/gigascience/paper-chen2014/wiki)
Software, pipelines, workflows…

Implement workflows in a community-accepted format
http://galaxyproject.org
Open source
Over 50,000 main
Galaxy server users
Over 1,000 papers
citing Galaxy use
Over 60 Galaxy
servers deployed

Workflow publishing:
galaxy.cbiit.cuhk.edu.hk

Visualisations
& DOIs for workflows
http://www.gigasciencejournal.com/series/Galaxy

Next step: publishing VMs…
http://dx.doi.org/10.5524/100106

Further beyond dead trees:
Open lab books, dynamic documents
• Can facilitate reproducibility, reuse & sharing with tools like:
knitr, Sweave, IPython Notebook
• Working towards executable papers…
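
The idea behind a dynamic document can be sketched in a few lines: the reported sentence is rendered from the underlying counts, so the prose cannot drift away from the data. This is only an illustrative sketch in plain Python, not how knitr, Sweave or IPython Notebook work internally; the template wording and the `render` helper are assumptions made for this example, and the counts are the ones from the replication slide in this talk.

```python
from string import Template

# Counts taken from the replication slide in this talk (18 papers, 10 not reproduced).
papers_examined = 18
not_reproduced = 10

# Hypothetical sentence template: the numbers are filled in at render time.
TEMPLATE = Template(
    "Out of $total microarray papers, results from $failed "
    "($share) could not be reproduced."
)


def render(total: int, failed: int) -> str:
    """Render the sentence from the data instead of typing the numbers by hand."""
    share = f"{failed / total:.0%}"  # percentage rounded to a whole number
    return TEMPLATE.substitute(total=total, failed=failed, share=share)


sentence = render(papers_examined, not_reproduced)
print(sentence)
```

If the counts change, re-running the script updates the sentence, which is the property that makes such documents easier to re-check than static text.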

Aiding reproducibility of imaging studies
OMERO: providing
access to imaging data
Already used by JCB.
View, filter, measure raw
images with direct links
from journal article.
See all image data, not
just cherry picked
examples.
Download and reprocess.

The alternative...
...look but don't touch

Beneficiaries/users of our work
IRRI GALAXY

Rice 3K project: 3,000 rice genomes, 13.4TB public data

Our first DOI:
To maximize its utility to the research community and aid those fighting
the current epidemic, genomic data is released here into the public
domain under a CC0 license. Until the publication of research papers on
the assembly and whole-genome analysis of this isolate we would ask you
to cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang,
J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J;
Peng, Y; Pu, F; Sun, Y; Chen,Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao,
X; Chen, F; Yin, X; Song,Y ; Rohde, H; Li, Y; Wang, J; Wang, J and the
Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium
(2011)
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI
Shenzhen. doi:10.5524/100001
http://dx.doi.org/10.5524/100001
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to
Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
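
A dataset citation of this kind is easy to handle as structured data. The sketch below is a hypothetical illustration, not GigaDB's own tooling: the `DatasetCitation` class and its method names are invented for this example, the author list is abbreviated with "et al.", and the resolver prefix is simply the `http://dx.doi.org/` form used on the slide.

```python
from dataclasses import dataclass

# Resolver prefix as it appears on the slide.
DOI_RESOLVER = "http://dx.doi.org/"


@dataclass(frozen=True)
class DatasetCitation:
    """Hypothetical container for the fields of a dataset citation."""
    authors: str
    year: int
    title: str
    publisher: str
    doi: str

    def url(self) -> str:
        # A DOI becomes a resolvable link by prefixing the resolver address.
        return DOI_RESOLVER + self.doi

    def as_text(self) -> str:
        # Same field order as the citation on the slide.
        return (f"{self.authors} ({self.year}) {self.title}. "
                f"{self.publisher}. doi:{self.doi}")


ecoli = DatasetCitation(
    authors="Li, D; Xi, F; Zhao, M; et al.",  # abbreviated for the example
    year=2011,
    title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    publisher="BGI Shenzhen",
    doi="10.5524/100001",
)

print(ecoli.as_text())
print(ecoli.url())
```

Keeping the DOI as its own field means the same record can produce both the human-readable citation and the link.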

Downstream consequences:
1. Citations (~240) 2. Therapeutics (primers, antimicrobials) 3. Platform Comparisons
4. Example for faster & more open science
“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the
Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he
knew that it might take days for the lawyers at his company, Pacific Biosciences, to parse the
agreements governing how his team could use data collected on the strain. Luckily, one team had
released its data under a Creative Commons licence that allowed free use of the data, allowing
Kasarskis and his colleagues to join the international research effort and publish their work without
wasting time on legal wrangling.”

1.3 The power of intelligently open data
The benefits of intelligently open data were powerfully
illustrated by events following an outbreak of a severe gastro-intestinal
infection in Hamburg in Germany in May 2011. This
spread through several European countries and the US,
affecting about 4000 people and resulting in over 50 deaths.
All tested positive for an unusual and little-known Shiga-toxin–
producing E. coli bacterium. The strain was initially analysed
by scientists at BGI-Shenzhen in China, working together with
those in Hamburg, and three days later a draft genome was
released under an open data licence. This generated interest
from bioinformaticians on four continents. 24 hours after the
release of the genome it had been assembled. Within a week
two dozen reports had been filed on an open-source site
dedicated to the analysis of the strain. These analyses
provided crucial information about the strain’s virulence and
resistance genes – how it spreads and which antibiotics are
effective against it. They produced results in time to help
contain the outbreak. By July 2011, scientists published papers
based on this work. By opening up their early sequencing
results to international collaboration, researchers in Hamburg
produced results that were quickly tested by a wide range of
experts, used to produce new knowledge and ultimately to
control a public health emergency.

Nanopore MinION E. coli genome
released via GigaDB 10-Sep-2014
Curated & converted to ISA-tab, &
worked with EBI to get raw data there
Data Note submitted & preprint version
out 26th September
Peer reviewed & published 20th October
http://dx.doi.org/10.5524/100102

Real time sequencing era needs real time publication!
• Used as test data for
“minoTour”: real time data
analysis tools for MinION data
• Nanopore data already used
in (CC0 GitHub based)
teaching materials
• Next stop…poreathon!
(crowdsourced v2 assembly)
1. minoTour http://minotour.nottingham.ac.uk/
2. https://github.com/lexnederbragt/INF-BIOx121_fall2014_de_novo_assembly

Lessons Learned
• Most published research findings are false. Or at
least have errors.
• It is possible to push button(s) & recreate a result from
a paper
• Reproducibility is COSTLY. How much are you willing
to spend?
• Much easier to do this before rather than after
publication

The cost of staying with the status quo?
• Ioannidis estimates that 85% of research resources are wasted.
• Each retraction estimated to cost $400,000.

In Summary
Make your data & software
open (CC0, OSI)
Get credit for your reviewing
Publish your research objects
(with us!)*
* Free APCs until end of 2014
scott@gigasciencejournal.com
@gigascience
facebook.com/GigaScience
www.gigasciencejournal.com

Thanks to:
team: Our collaborators: Case study:
Ruibang Luo (BGI/HKU)
Shaoguang Liang (BGI-SZ)
Tin-Lap Lee (CUHK)
Qiong Luo (HKUST)
Senghong Wang (HKUST)
Yan Zhou (HKUST)
Funding from: CBIIT
@gigascience
facebook.com/GigaScience
blogs.biomedcentral.com/gigablog/
Peter Li
Chris Hunter
Jesse Si Zhe
Rob Davidson
Nicole Nogoy
Laurie Goodman
Amye Kenall (BMC)
Marco Roos (LUMC)
Mark Thompson (LUMC)
Jun Zhao (Lancaster)
Susanna Sansone (Oxford)
Philippe Rocca-Serra (Oxford)
Alejandra Gonzalez-Beltran (Oxford)
www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com