This document discusses the challenges of handling large-scale genomic and biological data and proposes potential solutions. It notes that data volumes are increasing rapidly due to advances in sequencing technology, but that dissemination and data handling methods have not kept pace. Several hurdles to data sharing are described, including technical issues around data size, heterogeneity and longevity, as well as economic and cultural barriers. Potential solutions discussed include providing incentives for data sharing through attribution and citation, adopting data citation practices using Digital Object Identifiers, establishing funding models for long-term curation, and launching new databases and journals focused on publishing and analyzing large-scale datasets.
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
1. Scott Edmunds
GigaScience: Big Data, Data Citation
and Future Data Handling
William Gibson: "Information is the currency of the future world"
www.gigasciencejournal.com cc Flickr allan*
4. Rice v Wheat: consequences of publicly available
genome data.
[Figure: chart comparing rice and wheat; axis scale 0 to 700]
5. Sharing aids everyone…
Sharing Detailed Research
Data Is Associated with
Increased Citation Rate.
Piwowar HA, Day RS, Fridsma DB (2007)
PLoS ONE 2(3): e308.
doi:10.1371/journal.pone.0000308
Every 10 datasets collected contribute to at least 4 papers in the
following 3 years.
Piwowar HA, Vision TJ, Whitlock MC (2011)
Data archiving is a good investment. Nature 473(7347): 285.
doi:10.1038/473285a
16. Are there now too many hurdles?
Technical: too large volumes
too heterogeneous
no home for many data types
too time consuming
Economic: too expensive, no long-term funding
Cultural: inertia
no incentives to share
unaware of how
18. Incentives/credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to
public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a
particular data set would enable appropriate attribution for those
who share.”
Nature Biotechnology 27, 579 (2009)
Prepublication data sharing
(Toronto International Data Release Workshop)
“Data producers benefit from creating a citable reference, as it can
later be used to reflect impact of the data sets.”
Nature 461, 168-170 (2009)
19. Datacitation: Datacite and DOIs
Digital Object Identifiers (DOIs) offer a solution
Most widely used identifier for scientific articles
Researchers, authors, publishers know how to use them
Put datasets on the same playing field as articles
Dataset example: Yancheva et al (2007). Analyses on sediment of Lake Maar. PANGAEA. doi:10.1594/PANGAEA.587840
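To illustrate why a DOI works as a stable identifier for a dataset, here is a minimal Python sketch that normalizes the different ways a DOI is commonly written and turns it into a resolvable link. The function names `normalize_doi` and `doi_to_url` and the pattern `DOI_PATTERN` are hypothetical helpers written for this example, not part of any DataCite tooling. The pattern is a simplified check and does not validate every legal DOI.

```python
import re

# Simplified check: "10." + registrant code + "/" + suffix.
# This is an assumption for illustration, not the full DOI syntax.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Common ways a DOI is prefixed in citations and links.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Strip a leading 'doi:' or resolver URL and return the bare DOI."""
    text = raw.strip()
    for prefix in KNOWN_PREFIXES:
        if text.lower().startswith(prefix):
            text = text[len(prefix):]
            break
    if not DOI_PATTERN.match(text):
        raise ValueError(f"does not look like a DOI: {raw!r}")
    return text


def doi_to_url(raw: str) -> str:
    """Build a link through the public doi.org resolver."""
    return "https://doi.org/" + normalize_doi(raw)


if __name__ == "__main__":
    # The dataset DOI quoted on the slide.
    print(doi_to_url("doi:10.1594/PANGAEA.587840"))
```

Whichever form appears in a reference list, the same bare identifier comes out, which is what makes tracking and crediting reuse of a dataset feasible.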
20. Datacitation: Datacite and DOIs
>1 million DOIs since Dec 2009
Central metadata repository to link with WoS/ISI
- finally can track and credit use!
21. Now taking submissions…
Large-Scale Data
Journal/Database
In conjunction with:
Editor-in-Chief: Laurie Goodman, PhD
Editor: Scott Edmunds, PhD
Assistant Editor: Alexandra Basford, PhD
www.gigasciencejournal.com
23. Editorial Board: International
Stephan Beck, UK
Alvis Brazma, UK
Ann-Shyn Chiang, Taiwan
Richard Durbin, UK
Paul Flicek, UK
Robert Hanner, Canada
Yoshihide Hayashizaki, Japan
Henning Hermjakob, UK
Wolfgang Huber, Germany
Gary King, USA
Tin-Lap Lee, Hong Kong
Donald Moerman, Canada
Karen Nelson, USA
Francis Ouellette, Canada
Stephen O'Brien, USA
Hanchuan Peng, USA
Russell Poldrack, USA
Ming Qi, China/USA
Susanna-Assunta Sansone, UK
Michael Schatz, USA
David Schwartz, USA
Fritz Sommer, USA
Lincoln Stein, Canada
Sumio Sugano, Japan
Thomas Wachtler, Germany
Jun Wang, China
Alistair Young, New Zealand
Zang Yufeng, China
Marie Zins, France
24. Editorial Board: International
Stephan Beck, Epigenomics
Alvis Brazma, Transcriptomics
Ann-Shyn Chiang, Neuroscience
Richard Durbin, Genetics/Genomics
Paul Flicek, Genomics
Robert Hanner, DNA Barcoding/Ecology
Yoshihide Hayashizaki, Genomics
Henning Hermjakob, Proteomics
Wolfgang Huber, Functional Genomics
Gary King, Medicine
Tin-Lap Lee, Genomics
Donald Moerman, Functional Genomics
Karen Nelson, Metagenomics
Francis Ouellette, Genomics
Stephen O'Brien, Genomics
Hanchuan Peng, Imaging/Neuro
Russell Poldrack, Neuroscience
Ming Qi, Genetics
Susanna-Assunta Sansone, Standards
Michael Schatz, Cloud Computing
David Schwartz, Optical Mapping
Fritz Sommer, Neuroscience
Lincoln Stein, Cloud Computing
Sumio Sugano, Genomics
Thomas Wachtler, Neuroscience
Jun Wang, Genomics
Alistair Young, Medical Imaging
Zang Yufeng, Neuroscience
Marie Zins, Medicine
25. Criteria and Focus of Journal/Database
Reproducibility/Reuse
Utility/Usability
Standards/Searchability/Scale/Sharing
Data publishing/DOI
26. Use of Data = Importance + Usability
Importance: subjective?
Usability: easier to assess
27. Reproducibility/Reuse
BGI Cloud Computing resources for
handling and analyzing large-scale data.
Integrated tools to promote more
widespread access, viewing, and analysis of
data.
Encourage and aid use of workflow systems
for methods (e.g. submission of Galaxy XML
files).
28. Special Series/Hub for cloud-based tools
Technical notes: test tools in the BGI-Cloud.
Tools + Test Data (BGI or user) in one place.
Aids reproducibility.
Aids reviewers (free)
Aids authors: visibility (pubmed, etc.)
hosting (included/free offers)
–contact us: editorial@gigasciencejournal.com
Oledoe flickr cc
29. Standards/Searchability/Sharing
ISA-Tab compatibility to aid and promote
best practice in metadata reporting.
All supporting data must be publicly
available.
Ask for MIBBI compliance and use of
reporting checklists.
Part of the Biosharing network.
30. Data publishing/DOI
New journal format combines standard manuscript
publication with an extensive database to host all
associated data.
Data hosting will follow standard funding agency
and community guidelines.
DOI assignment available for submitted data to
allow ease of finding and citing datasets, as well as for
citation tracking.
34. The era of the data consumer?
Free access to data, but analysis hubs/nodes will form around it
?
35. GDSAP: Genomic Data Submission and Analytical Platform
Tin-Lap Lee, CUHK
[Diagram labels: Big data from the “Sequencing Farm” (Data, Data, Data…); Data Modeling; Pipeline design; Validation; Commercial applications “Apps”]
38. BGI Datasets Get DOI®s
Invertebrate:
Ant
- Florida carpenter ant
- Jerdon’s jumping ant
- Leaf-cutter ant
Roundworm
Silkworm
Human:
Asian individual (YH)
- DNA Methylome
- Genome Assembly
- Transcriptome
Ancient DNA (coming soon)
- Saqqaq Eskimo
- Aboriginal Australian
Vertebrates:
Giant panda
Macaque
- Chinese rhesus
- Crab-eating
Naked mole rat
Penguin
- Emperor penguin
- Adelie penguin
Pigeon, domestic
Polar bear
Sheep
Tibetan antelope
Plants:
Chinese cabbage
Cucumber
Foxtail millet
Pigeonpea
Potato
Sorghum
Microbe:
E. coli O104:H4 TY-2482
Cell-Line:
Chinese Hamster Ovary
doi:10.5524/100004
39. BGI Datasets Get DOI®s
Many unpublished…
41. Data also submitted to NCBI (including SV data to dbVar)
Complemented by a citable form, with data types including:
Assemblies of 3 strains
Raw data
SNPs
InDels
CNVs
SVs
42. Our first DOI:
To maximize its utility to the research community and aid those fighting the current
epidemic, genomic data is released here into the public domain under a CC0
license. Until the publication of research papers on the assembly and whole-
genome analysis of this isolate we would ask you to cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G;
Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S;
Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z;
Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and
the Escherichia coli O104:H4 TY-2482 isolate genome sequencing
consortium (2011)
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI
Shenzhen. http://dx.doi.org/10.5524/100001
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to
Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
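The citation requested above follows a simple pattern: authors, year, title, publisher, DOI link. As a rough sketch of how such a string could be assembled from dataset metadata, the Python below uses a hypothetical `DatasetRecord` class and `format_citation` function, both invented for this illustration and not taken from GigaScience or DataCite software. The optional `max_authors` shortening is likewise an assumption, since the slide asks for the full author list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Minimal dataset metadata (hypothetical structure for illustration)."""
    authors: List[str]
    year: int
    title: str
    publisher: str
    doi: str


def format_citation(record: DatasetRecord, max_authors: Optional[int] = None) -> str:
    """Format a citation as 'Authors (year) Title. Publisher. DOI link'."""
    authors = record.authors
    if max_authors is not None and len(authors) > max_authors:
        # Optional shortening; the full list is kept by default.
        names = "; ".join(authors[:max_authors]) + "; et al."
    else:
        names = "; ".join(authors)
    # Link form matches the one used on the slide.
    link = "http://dx.doi.org/" + record.doi
    return f"{names} ({record.year}) {record.title}. {record.publisher}. {link}"


if __name__ == "__main__":
    # Only the first four authors from the slide are listed here, for brevity.
    record = DatasetRecord(
        authors=["Li, D", "Xi, F", "Zhao, M", "Liang, Y"],
        year=2011,
        title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
        publisher="BGI Shenzhen",
        doi="10.5524/100001",
    )
    print(format_citation(record, max_authors=3))
```

Keeping the metadata in structured fields rather than as free text means the same record can be rendered in whatever citation style a journal requires.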
45. “The way that the genetic data of the 2011 E. coli strain were disseminated
globally suggests a more effective approach for tackling public health
problems. Both groups put their sequencing data on the Internet, so scientists
the world over could immediately begin their own analysis of the bug's
makeup. BGI scientists also are using Twitter to communicate their latest
findings.”
“German scientists and their colleagues at the Beijing Genomics Institute in China have
been working on uncovering secrets of the outbreak. BGI scientists revised their draft
genetic sequence of the E. coli strain and have been sharing their data with dozens of
scientists around the world as a way to "crowdsource" this data. By publishing their data
publicly and freely, these other scientists can have a look at the genetic structure, and try
to sort it out for themselves.”
47. We want your
data!
scott@gigasciencejournal.com
editorial@gigasciencejournal.com
@gigascience
facebook.com/GigaScience
blogs.openaccesscentral.com/blogs/gigablog/
www.gigasciencejournal.com