Scott Edmunds at the China National GeneBank Youth Biodiversity MegaData Forum: Democratising biodiversity and genomics research: open and citizen science to build trust and fill the data gaps. 18th December 2018
1. Democratising biodiversity and genomics
research: open and citizen science to
build trust and fill the data gaps.
Scott Edmunds
CNGB
18th December 2018
4. Scientists: need to convince public + politicians
Scientists: earning the trust of officials and the public
https://www.nature.com/articles/s41538-018-0018-4
“China’s Ministry of Agriculture and the science community generally expressed a positive attitude
toward GM food, but the percentage of respondents that trusted the government and scientists was
only 11.7 and 23.2%, respectively.”
7. How to regain trust?
Areas we need to tackle to allow citizens to trust us
• Citizen Science - Involve the public in the scientific process
• Open Science - Increase transparency & fill the data gaps
• Open Access - Change incentive systems away from dead tree advertising to reproducibility
8. How to build a community genome project using local pride
How to regain trust?
9. We need genetic literacy to make decisions on: health, starting a family, shopping
What we need to know: 21st Century Edition
Context:
12. Solve the Bauhinia Mystery?
HK Botanical & Afforestation Dept., 1903:
"The mysterious origin of the tree & its magnificent flowers at once arrest the interest. So far, all efforts to identify them with any foreign species have failed"
23. Need to fill biodiversity gaps
[Figure panels: expert predictions of species richness; completeness of biodiversity records]
https://www.nature.com/articles/ncomms9221
25. HK Citizens far outpacing academic research grade GBIF observations
https://www.gbif.org/country/HK/summary
…
• Much higher eBird (146,113) & iNaturalist (39,152) research grade observations than HKU Herbarium (1,061)
• Korean International School made 10,792 iNaturalist observations during Inter-schools Challenge, and CFSS saw 931 species
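As a rough illustration of the gap described on this slide, the quoted counts can be turned into simple ratios. This is only a minimal Python sketch: the dictionary and the function name `ratio_to_baseline` are made up for illustration, and the numbers are just the ones quoted on the slide, not a fresh GBIF query.

```python
# Research-grade observation counts for Hong Kong, as quoted on the slide.
# (Illustrative only; the names below are hypothetical, not from the talk.)
OBSERVATIONS = {
    "eBird": 146_113,
    "iNaturalist": 39_152,
    "HKU Herbarium": 1_061,
}


def ratio_to_baseline(counts, baseline):
    """Return each source's count divided by the baseline source's count."""
    base = counts[baseline]
    return {name: n / base for name, n in counts.items() if name != baseline}


ratios = ratio_to_baseline(OBSERVATIONS, "HKU Herbarium")
for name, r in ratios.items():
    print(f"{name}: about {r:.0f}x the HKU Herbarium count")
```

On the slide's numbers this works out to roughly 138 times (eBird) and 37 times (iNaturalist) the herbarium count.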
27. Rumour fills an information vacuum
How not to regain trust?
The abyss of lost trust?
https://www.independent.co.uk/news/world/asia/japan-cracks-down-on-leaks-after-scandal-of-fukushima-nuclear-power-plant-8965296.html
38. Buckheit & Donoho: Scholarly articles are merely advertisement of
scholarship. The actual scholarly artifacts, i.e. the data and
computational methods, which support the scholarship, remain largely
inaccessible.
How not to regain trust?
The abyss of lost trust?
39. Provide evidence not advertising
Transparency or bust
Show me the peer reviews
Give me the data/ code/protocols
Let me publish replication studies
How to regain trust?
Let the evidence speak
40. GigaScience Ethos/Policies: ‘Impact’ is subjective. Data is quantitative.
Reward evidence (data), not advertising
• Data
• Software
• Models
• Pipelines
• Reviews
• Re-use…
= Credit
41. Data Publishing: nothing new…
Data & Metadata Collection/Experiments
Analysis/Hypothesis/Analysis
Conclusions
+ Area of Interest/Question
1839 to 1859: 20 yrs.
42. Rewarding open data & code
http://gigasciencejournal.com/
Since July 2012. Publishes “Data Notes” for CC0 data, “Tech Notes” for OSI software.
43. Integrated GigaDB repository. DataCite DOIs. No size limits, APC covers storage.
http://gigadb.org/
Rewarding open data & code
46. Visualisations & DOIs for workflows
http://www.gigasciencejournal.com/series/Galaxy
Rewarding & enabling interaction
47. Workflows/Virtual Machines/containers
• Downloadable as virtual hard disk/available as Amazon Machine Image
• Now publishing container (docker) submissions
• CodeOcean widgets for code, “compute capsule” run on AWS
48. First journal with deep integration with protocols.io
Launched 2nd June 2016
Reward better handling of “wet” protocols…
• Create, share, modify forkable protocols in repo.
• Download & run on smartphone app.
• Widgets embedded in GigaDB
• Get discoverability, credit, DOIs for sharing methods.
• Create your own, or let us set up & you claim.
https://www.protocols.io/groups/gigascience-journal
49. Rewarding & enabling interaction
Building tools (incl. JBrowse for genomes, Sketchfab for 3D images) on top of datasets…
50. Democratising Data at GigaScience
• From Big Data to usable Data
• Example: WebTools for easy browsing and visualisation
• Pan-and-zoom map browser as a visual aid to allow the end user to
find datasets
51. Democratising Data at GigaScience
• From Big Data to usable Data
• Example: WebTools for easy browsing and visualisation
• 3D viewer allows users to interact and explore image data prior to data download
• 3D models are CC0, can be downloaded, and are printable
https://sketchfab.com/GigaDB
52. Democratising Data at GigaScience
• Widening the target audience
• Bioinformaticians and ‘Big Data’ scientists are a
primary target audience
• Plugins and visualisations make access easier for
the less technically inclined
• Democratises access through education potential and ease of use
https://www.thingiverse.com/GigaScience/designs
55. To maximize its utility to the research community and aid those fighting
the current epidemic, genomic data is released here into the public domain
under a CC0 license. Until the publication of research papers on the
assembly and whole-genome analysis of this isolate we would ask you to
cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang,
J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J;
Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X;
Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the
Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium
(2011)
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI
Shenzhen. doi:10.5524/100001
http://dx.doi.org/10.5524/100001
Our first DOI:
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to
Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
Open Data to the rescue…
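The citation above gives the dataset identifier in two forms, `doi:10.5524/100001` and a resolver URL. A minimal Python sketch of normalising either form into a single resolvable link might look like this; the function name `doi_to_url` is hypothetical, and the only external fact assumed is that https://doi.org/ is the standard DOI resolver.

```python
# Hypothetical helper: normalise the DOI forms seen on the slide into one URL.
DOI_RESOLVER = "https://doi.org/"

# Prefixes that may precede the bare DOI name (assumed list, not exhaustive).
KNOWN_PREFIXES = (
    "doi:",
    "http://dx.doi.org/",
    "https://dx.doi.org/",
    "http://doi.org/",
    "https://doi.org/",
)


def doi_to_url(doi: str) -> str:
    """Turn 'doi:10.5524/100001' or a resolver URL into a doi.org link."""
    cleaned = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    # DOI names start with the directory indicator "10."
    if not cleaned.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return DOI_RESOLVER + cleaned


print(doi_to_url("doi:10.5524/100001"))
print(doi_to_url("http://dx.doi.org/10.5524/100001"))
```

Both calls give the same link, which is the point of a persistent identifier: the citation stays stable however the DOI happens to be written.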
59. Downstream consequences:
“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli
strain that infected roughly 4,000 people in Germany between May and July. But he knew that it might take days
for the lawyers at his company — Pacific Biosciences — to parse the agreements governing how his team could
use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that
allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and
publish their work without wasting time on legal wrangling.”
(1) Many citations; (2) Therapeutics (primers, antimicrobials); (3) Platform comparisons; (4) Example for faster & more open science
60. 1.3 The power of intelligently open data
The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastro-intestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin–producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.
62. Oxford Nanopore in the spotlight, Sept 2014. Does it work?
https://doi.org/10.1111/1755-0998.12324
http://omicsomics.blogspot.com/2014/09/oxford-takes-some-flak-fires-back.html
Oxford Nanopore, launched in September 2014: does it work?
63. Nanopore MinION E. coli genome released via GigaDB 10-Sep-2014
Curated & converted to ISA-tab, & worked with EBI to get raw data there
Data Note submitted & preprint version out 26-Sept-2014
Peer reviewed & published 20-Oct-2014
http://dx.doi.org/10.5524/100102
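To make the speed of this release-to-publication cycle concrete, the three dates on this slide can be compared directly. The following is a small Python sketch using only the standard library; the variable names and the helper `days_between` are illustrative and not part of any GigaScience tooling.

```python
from datetime import date

# Dates as given on the slide.
released = date(2014, 9, 10)    # data released via GigaDB
preprint = date(2014, 9, 26)    # Data Note submitted & preprint out
published = date(2014, 10, 20)  # peer reviewed & published


def days_between(start: date, end: date) -> int:
    """Number of whole days from start to end."""
    return (end - start).days


print("release to preprint:", days_between(released, preprint), "days")
print("preprint to publication:", days_between(preprint, published), "days")
print("release to publication:", days_between(released, published), "days")
```

On these dates the data were public 16 days before the preprint and 40 days before the peer-reviewed paper.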
66. Try before you buy: inspect ALL the data yourselves
https://doi.org/10.1093/gigascience/gix024
• Comparisons with Illumina for PE50, 100 & 150
• Raw sequencing data in NCBI SRA
• FASTQ files in GigaDB
• Raw image files also shared
Would you trust a BGI sequencer?
67. Open, transparent and peer reviewed benchmarking
https://doi.org/10.1093/gigascience/gix024
http://dx.doi.org/10.5524/review.100698
http://dx.doi.org/10.5524/review.100699
Open Review
Would you trust a BGI sequencer?
70. Transparency saves wildlife
User-friendly pipeline for the rapid identification of CITES-listed
species in forensic samples using Illumina data.
• International validation trial by 16 laboratories.
• All input sequence data + results available in GigaDB.
• SOPs available in protocols.io.
https://doi.org/10.1093/gigascience/gix080
72. Democratising Data at GigaScience
• Challenges of Food security
• Rice, Oryza sativa L., is the staple food for half the world’s population
• By 2030, rice production must increase by at least 25% to keep pace with population growth
• 80% of countries face a serious burden of malnutrition, especially in Africa and SE Asia
73. Democratising Data at GigaScience
Rice 3K project
• 3,000 rice genomes
• 13.4TB public data
• 6 months to copy data to Sequence Read Archive (SRA)
• Data published 4 years before analysis published
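A back-of-envelope calculation shows what "13.4TB in 6 months" means as an average transfer rate. This Python sketch rests on loudly stated assumptions that are not on the slide: decimal terabytes (1 TB = 10^12 bytes), 6 months taken as 182.5 days, and the copy treated as one continuous transfer. In practice the bottleneck may have been submission processing rather than bandwidth, so this is an effective average only.

```python
SECONDS_PER_DAY = 86_400


def average_rate_mb_per_s(terabytes: float, days: float) -> float:
    """Effective average rate in MB/s for moving `terabytes` in `days`.

    Assumes decimal units (1 TB = 1e12 bytes, 1 MB = 1e6 bytes) and a
    continuous transfer over the whole period.
    """
    total_bytes = terabytes * 1e12
    return total_bytes / (days * SECONDS_PER_DAY) / 1e6


# 13.4 TB is from the slide; 182.5 days is an assumed length for "6 months".
rate = average_rate_mb_per_s(13.4, 182.5)
print(f"about {rate:.2f} MB/s, or {rate * 8:.1f} Mbit/s on average")
```

Under these assumptions the effective rate is under 1 MB/s, which illustrates why moving "open" data at this scale is still far from frictionless.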
74. Democratising Data at GigaScience
• Orphan Crops
• The African Orphan Crop Consortium (AOCC) is developing genomic resources for 101 crops that represent a significant part of African/Asian diets.
• To date, the AOCC is working on 69 genomes, the first 5 of which were just published in GigaScience.
Hyacinth bean
https://doi.org/10.1093/gigascience/giy152
75. Democratising Data at GigaScience
• Each AOCC genome is a single GigaDB dataset (with DOI)
76. From Big Data to usable(ish) Data
• Although the 13TB of data in GigaDB was open (CC0), after analysis on the Tianhe supercomputer the processed rice3K data came to 100TB
• AWS hosted for free, but expensive to process
https://aws.amazon.com/public-data-sets/3000-rice-genome/
77. Processed data finally published 1st May 2018, Nature v557, p43–49
https://www.nature.com/articles/s41586-018-0063-9
78. Democratising Data at GigaScience
• From Big Data to usable Data
• Example: Easy-to-use plug and play RiceGalaxy
• GUI means plant breeders can utilise genetic data without coding skills
• Funded to run at low cost (<100 USD/month) via AWS Singapore & local
servers (2 vCPUs, 8GB RAM, 2 mounted volumes, 200GB total storage)
• CGIAR Excellence in Plant Breeding Platform/model will roll out to other
crops
80. Other beneficiaries: you!
Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
Every 10 datasets collected contribute to at least 4 papers in the following 3 years.
Piwowar HA, Vision TJ, Whitlock MC (2011) Data archiving is a good investment. Nature 473(7347): 285. doi:10.1038/473285a
81. Open Science = Science
• Science is needed more than ever to tackle grave environmental challenges and fight disease
• Stand on the shoulders of giants, and allow others
to stand on yours
• Choose evidence not branding
• Being closed provokes distrust, prevents
downstream use, and ultimately harms science
• Being open helps science, your immediate
community, and ultimately your career
• Preempt new EU Open Science and MOST rules on
“strengthening research integrity”…
http://most.gov.cn/mostinfo/xinxifenlei/fgzc/gfxwj/gfxwj2018/201805/t20180531_139731.htm
82. Help GigaScience make it happen
www.gigasciencejournal.com
Give us your data,
pipelines & papers
Contact us:
scott@gigasciencejournal.com
editorial@gigasciencejournal.com
database@gigasciencejournal.com
Help GigaScience make the whole research process open
83. Thanks to:
Laurie Goodman, Editor in Chief
Nicole Nogoy, Editor
Hans Zauner, Assistant Editor
Hongling Zhao, Assistant Editor
Peter Li, Lead Data Manager
Chris Hunter, Lead BioCurator
Chris Armit, Data Scientist
Mary Ann Tulli, Data Editor
Xiao (Jesse) Si Zhe, Database Developer
Chen Qi, Shenzhen Office.
Follow us:
@GigaScience
facebook.com/GigaScience
http://gigasciencejournal.com/blog/
+ Weibo & WeChat
www.gigasciencejournal.com
www.gigadb.org