Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Scott Edmunds

GigaScience: Big Data, Data Citation
and Future Data Handling

William Gibson: "Information is the currency of the future world"

www.gigasciencejournal.com cc Flickr allan*

Data Tsunami?

Flickr cc: opensourceway

Rice v Wheat: consequences of publicly available
genome data.

[Chart: rice vs. wheat; vertical axis from 0 to 700, units not shown]

Sharing aids everyone…

Sharing Detailed Research
Data Is Associated with
Increased Citation Rate.
Piwowar HA, Day RS, Fridsma DB (2007)
PLoS ONE 2(3): e308.
doi:10.1371/journal.pone.0000308

Every 10 datasets collected contribute to at least 4 papers in the
following 3 years.
Piwowar HA, Vision TJ, Whitlock MC (2011). Data archiving is a good investment.
Nature 473(7347): 285. doi:10.1038/473285a

Problems?


[Figure: sequencing cost ($ per Mbp) compared with Moore's Law; ~100,000X. Source: E Lander/Broad]

[Figure: sequencing output and data compared with Moore's/Kryder's Law]

Dissemination?

Potential sequencing capacity

1 Illumina HiSeq 2000 (+ TruSeq upgrade) = 600 Gb/run (12 days)
× 128 HiSeq = 6 Tb/day = >2 Pb/year
= ~2000 human genomes/day
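The capacity figures above can be rechecked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation and makes assumptions that are not on the slide: "Gb" is read as gigabases, a human genome is taken as roughly 3 Gb, and the "~2000 genomes/day" figure is interpreted as raw output divided by genome size (about 1x coverage). The 30x depth in the last line is purely illustrative.

```python
# Back-of-the-envelope check of the slide's capacity figures.
# Figures taken from the slide:
GB_PER_RUN = 600       # gigabases per HiSeq 2000 run
DAYS_PER_RUN = 12      # days per run
MACHINES = 128         # number of HiSeq instruments

# Assumption (not on the slide): approximate human genome size in gigabases.
HUMAN_GENOME_GB = 3.0


def fleet_gb_per_day(machines=MACHINES):
    """Raw output of the whole fleet in gigabases per day."""
    return GB_PER_RUN / DAYS_PER_RUN * machines


def genomes_per_day(coverage=1.0, machines=MACHINES):
    """Human-genome equivalents per day at a given sequencing depth."""
    return fleet_gb_per_day(machines) / (HUMAN_GENOME_GB * coverage)


if __name__ == "__main__":
    gb_day = fleet_gb_per_day()
    print(f"{gb_day / 1000:.1f} Tb/day")
    print(f"{gb_day * 365 / 1e6:.2f} Pb/year")
    print(f"{genomes_per_day():.0f} genome equivalents/day at 1x")
    print(f"{genomes_per_day(coverage=30):.0f} genomes/day at 30x (illustrative depth)")
```

Under these assumptions the daily and yearly totals land close to the slide's rounded "6 Tb/day" and ">2 Pb/year", and the genome count depends strongly on the coverage assumed.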

Difficulties keeping up…


Do we have models for long-term funding?

Human Gene Mutation Database

Kyoto Encyclopedia of Genes and Genomes

?

Are there now too many hurdles?

Technical:
- volumes too large
- too heterogeneous
- no home for many data types
- too time-consuming

Economic:
- too expensive
- no long-term funding

Cultural:
- inertia
- no incentives to share
- unaware of how

Incentives/credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to
public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a
particular data set would enable appropriate attribution for those
who share.”
Nature Biotechnology 27, 579 (2009)

Prepublication data sharing
(Toronto International Data Release Workshop)
“Data producers benefit from creating a citable reference, as it can
later be used to reflect impact of the data sets.”
Nature 461, 168-170 (2009)

Data citation: DataCite and DOIs

Digital Object Identifiers (DOIs) offer a solution
- Most widely used identifier for scientific articles
- Researchers, authors, publishers know how to use them
- Put datasets on the same playing field as articles

Dataset
Yancheva et al (2007). Analyses on sediment of Lake Maar. PANGAEA.
doi:10.1594/PANGAEA.587840
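Because a DOI is just a string with a resolver in front of it, a dataset DOI can be handled exactly like an article DOI. The sketch below is a minimal illustration: the function names are hypothetical, the pattern is a deliberately loose sanity check rather than a full validator, and it assumes the current `https://doi.org/` resolver form (these slides use the older `http://dx.doi.org/` form).

```python
import re

# Loose sanity check only (an assumption for this sketch): "10.", a 4-9 digit
# registrant code, a slash, then a non-empty suffix without whitespace.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Common ways a DOI is written in citations.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Strip a leading 'doi:' or resolver URL and return the bare DOI."""
    s = raw.strip()
    for prefix in KNOWN_PREFIXES:
        if s.lower().startswith(prefix):
            s = s[len(prefix):].strip()
            break
    if not DOI_PATTERN.match(s):
        raise ValueError(f"does not look like a DOI: {raw!r}")
    return s


def doi_url(raw: str) -> str:
    """Return a resolvable URL for a DOI written in any of the forms above."""
    return "https://doi.org/" + normalize_doi(raw)


if __name__ == "__main__":
    print(doi_url("doi:10.1594/PANGAEA.587840"))
```

The same two functions work unchanged for an article DOI and for the PANGAEA dataset DOI shown on the slide, which is the "same playing field" point in practice.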

Data citation: DataCite and DOIs

>1 million DOIs since Dec 2009

Central metadata repository to link with WoS/ISI
- finally can track and credit use!

Now taking submissions…

Large-Scale Data
Journal/Database
In conjunction with:

Editor-in-Chief: Laurie Goodman, PhD
Editor: Scott Edmunds, PhD
Assistant Editor: Alexandra Basford, PhD

www.gigasciencejournal.com

Editorial Board: International
Stephan Beck, UK
Alvis Brazma, UK
Ann-Shyn Chiang, Taiwan
Richard Durbin, UK
Paul Flicek, UK
Robert Hanner, Canada
Yoshihide Hayashizaki, Japan
Henning Hermjakob, UK
Wolfgang Huber, Germany
Gary King, USA
Tin-Lap Lee, Hong Kong
Donald Moerman, Canada
Karen Nelson, USA
Francis Ouellette, Canada
Stephen O'Brien, USA
Hanchuan Peng, USA
Russell Poldrack, USA
Ming Qi, China/USA
Susanna-Assunta Sansone, UK
Michael Schatz, USA
David Schwartz, USA
Fritz Sommer, USA
Lincoln Stein, Canada
Sumio Sugano, Japan
Thomas Wachtler, Germany
Jun Wang, China
Alistair Young, New Zealand
Zang Yufeng, China
Marie Zins, France

Editorial Board: International
Stephan Beck, Epigenomics
Alvis Brazma, Transcriptomics
Ann-Shyn Chiang, Neuroscience
Richard Durbin, Genetics/Genomics
Paul Flicek, Genomics
Robert Hanner, DNA Barcoding/Ecology
Yoshihide Hayashizaki, Genomics
Henning Hermjakob, Proteomics
Wolfgang Huber, Functional Genomics
Gary King, Medicine
Tin-Lap Lee, Genomics
Donald Moerman, Functional Genomics
Karen Nelson, Metagenomics
Francis Ouellette, Genomics
Stephen O'Brien, Genomics
Hanchuan Peng, Imaging/Neuro
Russell Poldrack, Neuroscience
Ming Qi, Genetics
Susanna-Assunta Sansone, Standards
Michael Schatz, Cloud Computing
David Schwartz, Optical Mapping
Fritz Sommer, Neuroscience
Lincoln Stein, Cloud Computing
Sumio Sugano, Genomics
Thomas Wachtler, Neuroscience
Jun Wang, Genomics
Alistair Young, Medical Imaging
Zang Yufeng, Neuroscience
Marie Zins, Medicine

Criteria and Focus of Journal/Database
Reproducibility/Reuse
Utility/Usability
Standards/Searchability/Scale/Sharing
Data publishing/DOI

Use of Data = Importance (subjective?) + Usability (easier to assess)


Reproducibility/Reuse
- BGI Cloud Computing resources for handling and analyzing large-scale data.
- Integrated tools to promote more widespread access, viewing, and analysis of data.
- Encourage and aid use of workflow systems for methods (e.g. submission of Galaxy XML files).

Special Series/Hub for cloud-based tools
Technical notes: test tools in the BGI-Cloud.
Tools + Test Data (BGI or user) in one place.
Aids reproducibility.
Aids reviewers (free).
Aids authors: visibility (PubMed, etc.) and hosting (included/free offers).
Contact us: editorial@gigasciencejournal.com
Oledoe flickr cc


Standards/Searchability/Sharing
- ISA-Tab compatibility to aid and promote best practice in metadata reporting.
- All supporting data must be publicly available.
- Ask for MIBBI compliance and use of reporting checklists.
- Part of the BioSharing network.


Data publishing/DOI
- New journal format combines standard manuscript publication with an extensive database to host all associated data.
- Data hosting will follow standard funding agency and community guidelines.
- DOI assignment available for submitted data to allow ease of finding and citing datasets, as well as for citation tracking.

The era of the data consumer?

Free access to data, but analysis hubs/nodes will form around it

GDSAP: Genomic Data Submission and Analytical Platform
Data, Data, Data…

[Diagram elements: big data from the “Sequencing Farm”; data modeling; pipeline design; validation; commercial applications (“Apps”)]

Tin-Lap Lee, CUHK

New Database

www.gigaDB.org

BGI Datasets Get DOI®s

Invertebrate
Ant
- Florida carpenter ant
- Jerdon’s jumping ant
- Leaf-cutter ant
Roundworm
Silkworm

Human
Asian individual (YH)
- DNA Methylome
- Genome Assembly
- Transcriptome
Ancient DNA (coming soon)
- Saqqaq Eskimo
- Aboriginal Australian

Vertebrates
Giant panda
Macaque
- Chinese rhesus
- Crab-eating
Naked mole rat
Penguin
- Emperor penguin
- Adelie penguin
Pigeon, domestic
Polar bear
Sheep
Tibetan antelope

Plants
Chinese cabbage
Cucumber
Foxtail millet
Pigeonpea
Potato
Sorghum

Microbe
E. coli O104:H4 TY-2482

Cell-Line
Chinese Hamster Ovary

doi:10.5524/100004

BGI Datasets Get DOI®s
Many unpublished…

Data also submitted to NCBI (including SV data to dbVar)

Complemented by citable form, and data-types including:

- Assemblies of 3 strains
- Raw Data
- SNPs
- InDels
- CNVs
- SV

Our first DOI:

To maximize its utility to the research community and aid those fighting the current
epidemic, genomic data is released here into the public domain under a CC0
license. Until the publication of research papers on the assembly and
whole-genome analysis of this isolate we would ask you to cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G;
Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S;
Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z;
Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and
the Escherichia coli O104:H4 TY-2482 isolate genome sequencing
consortium (2011)
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI
Shenzhen. http://dx.doi.org/10.5524/100001

To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to
Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
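The citation above follows a simple pattern: creators, year, title, publisher, then the DOI as a resolvable link. As a rough sketch of how such a string could be generated from a metadata record, the code below uses a hypothetical `DatasetRecord` class and `format_citation` function; the field names and the optional "et al." truncation are assumptions made for illustration, not GigaDB's or DataCite's actual schema or tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata for a citable dataset."""
    creators: List[str]
    year: int
    title: str
    publisher: str
    doi: str


def format_citation(rec: DatasetRecord, max_creators: Optional[int] = None) -> str:
    """Build 'Creators (Year) Title. Publisher. URL' as in the slide's example.

    If max_creators is given and the list is longer, the list is truncated
    and 'et al.' is appended (an assumption of this sketch).
    """
    names = rec.creators
    if max_creators is not None and len(names) > max_creators:
        joined = "; ".join(names[:max_creators]) + "; et al."
    else:
        joined = "; ".join(names)
    # The slide uses the dx.doi.org resolver form, so the sketch does too.
    return (f"{joined} ({rec.year}) {rec.title}. "
            f"{rec.publisher}. http://dx.doi.org/{rec.doi}")


if __name__ == "__main__":
    record = DatasetRecord(
        creators=["Li, D", "Xi, F", "Zhao, M"],  # first three names only, for brevity
        year=2011,
        title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
        publisher="BGI Shenzhen",
        doi="10.5524/100001",
    )
    print(format_citation(record, max_creators=2))
```

Keeping the DOI as a separate field, rather than baking it into a free-text string, is what makes later citation tracking straightforward.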

“The way that the genetic data of the 2011 E. coli strain were disseminated
globally suggests a more effective approach for tackling public health
problems. Both groups put their sequencing data on the Internet, so scientists
the world over could immediately begin their own analysis of the bug's
makeup. BGI scientists also are using Twitter to communicate their latest
findings.”

“German scientists and their colleagues at the Beijing Genomics Institute in China have
been working on uncovering secrets of the outbreak. BGI scientists revised their draft
genetic sequence of the E. coli strain and have been sharing their data with dozens of
scientists around the world as a way to "crowdsource" this data. By publishing their data
publicly and freely, these other scientists can have a look at the genetic structure, and try
to sort it out for themselves.”

We want your
data!
scott@gigasciencejournal.com

editorial@gigasciencejournal.com

@gigascience
facebook.com/GigaScience

blogs.openaccesscentral.com/blogs/gigablog/

