This document discusses open data in bioinformatics and the infrastructure needed to achieve the Sustainable Development Goals. It summarizes the exponential growth in biological data from advances like high-throughput sequencing platforms. The H3Africa initiative aims to apply genomics research to improve African health by supporting projects across 27 countries. The H3Africa Bioinformatics Network is developing capacity to archive and analyze the genomic and phenotypic data being collected from up to 75,000 research participants to understand disease susceptibility in African populations.
Open Data in Bioinformatics and Required Infrastructure towards achieving the SDGs/Samar Kassim
1. Open Data in Bioinformatics and Required Infrastructure towards achieving the SDGs
www.h3abionet.org
9th BioVisionAlexandria Conference, Alexandria, Egypt 2018
Prof. Samar Kassim
samar_kassim@med.asu.edu.eg
2. Introduction
• Major technological advances in molecular biology have increased the sophistication, diversity and scale, and decreased the cost, of the data being generated, e.g. by high-throughput platforms
• First human genome sequence:
– Throughput: 2.8 million bases per 24 hours on AB3730xl sequencers
– 13 years to sequence 3 billion bases at x10 coverage
– Cost: ~500 million USD (lower bound estimate)
• Next (now) generation sequencing:
– Throughput: 1 million bases per second
– ~10 hours to sequence 3 billion bases at x10 coverage
– Cost: ~4,000 USD per genome
https://www.genome.gov/sequencingcosts/
http://en.wikipedia.org/wiki/File:Historic_cost_of_sequencing_a_human_genome.svg (Author: Ben Moore)
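The throughput figures on the Introduction slide can be cross-checked with simple arithmetic. The sketch below is only a back-of-the-envelope calculation using the slide's numbers; the assumption of a single instrument running continuously is ours, not the slide's.

```python
# Back-of-the-envelope check of the throughput figures quoted on the slide.
GENOME_BASES = 3_000_000_000   # 3 billion bases (from the slide)
COVERAGE = 10                  # x10 coverage (from the slide)

total_bases = GENOME_BASES * COVERAGE

# Next-generation sequencing: 1 million bases per second (from the slide).
ngs_hours = total_bases / 1_000_000 / 3600

# Capillary sequencing: 2.8 million bases per 24 hours (from the slide).
# Assumption: one instrument running without interruption.
capillary_instrument_years = total_bases / 2_800_000 / 365

print(f"NGS: {ngs_hours:.1f} hours")                                   # NGS: 8.3 hours
print(f"Capillary: {capillary_instrument_years:.1f} instrument-years")  # Capillary: 29.4 instrument-years
```

The 8.3 hours agrees with the slide's "~10 hours". The slide's 13 years is a project duration rather than a single-instrument run time, so it is not directly comparable with the instrument-years figure.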
3. Data driven biological science - bioinformatics
• Decreasing data generation costs shifted the biological sciences to a data driven science, with bioinformatics as a major component
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical? PLOS Biology 13(7): e1002195.
https://doi.org/10.1371/journal.pbio.1002195
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
4. Genomics and Africa - H3Africa
• “The Human Heredity and Health in Africa (H3Africa) Initiative aims to facilitate a contemporary research approach to the study of genomics and environmental determinants of common diseases with the goal of improving the health of African populations.” (http://h3africa.org/)
• “The vision of H3Africa is to create and support a pan-continental network of laboratories that will be equipped to apply leading-edge research to the study of the complex interplay between environmental and genetic factors which determines disease susceptibility and drug responses in African populations.” (http://h3africa.org/about/vision)
5. H3Africa Phase I overview
• 25 research projects in Africa
• > 500 investigators
• Covers 27 African countries
• Up to 75,000 research participants
• > USD 76 million invested in phase 1
8 Collaborative Centers
7 Research Projects
3 Biorepositories
6 Ethics Grants
The H3Africa Consortium Bioinformatics Network
http://h3africa.org/consortium/projects
6. H3Africa Bioinformatics Network (H3ABioNet)
• Pan African Bioinformatics Network to develop bioinformatics capacity in Africa and support the H3Africa research projects
• 28 nodes in 17 African countries
• PI: Prof. Nicky Mulder, CBIO-UCT
• Education, infrastructure, research
• Archive African genomics data
7. H3Africa data being collected (Phase I)
• Phenotype data (associated with genotype data)
– Demographic information
– Anthropometric data
– Disease and health related phenotype data
• Genetic variation data (human and pathogen)
– Sequence data (whole genome, exome, targeted)
• Genotyping chip array data
– ~55,000 samples to be run on an H3Africa African custom chip
• Microbiome sequence data
– Patient/sample phenotypes
– Non-human 16S rRNA sequence data for microbiome
– Non-human full genome sequence data for microbiome
– Possible human sequence contamination
• Biospecimens to be deposited at the H3Africa biorepositories
Image credits: National Human Genome Research Institute (https://www.genome.gov/imagegallery/)
8. Lack of repository for African Genomics data
• 1,759 datasets with the query “African” – none in Africa
https://discover.repositive.io/
9. H3Africa Data Archive
• Assist H3Africa projects as data coordination center:
– Transfer
– Validate
– Store
– Submit to EGA
– Obtain EGA accessions for publications
• 0.5 petabytes storage size including offsite replication
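The slide does not say how the "Validate" step of the H3Africa Data Archive is carried out. One common way to confirm that a file arrived intact after a transfer is to compare checksums, and the sketch below illustrates that idea only. The use of MD5 and the function names `file_md5` and `validate_transfer` are assumptions for illustration, not a description of the archive's actual procedure.

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large sequence files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def validate_transfer(path, expected_md5):
    """Return True if the received file matches the sender's checksum."""
    return file_md5(path) == expected_md5.strip().lower()
```

A submitter would send the expected checksum alongside each file, and the receiving side would call `validate_transfer` before moving the file on to storage.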
10. H3Africa Catalogue
• Online catalogue with meta-data to search and apply for datasets and biospecimens (under development)
11. Human genetic data privacy
• H3Africa is a rich source of meta-data (phenotypes):
(1) Age
(2) Sex
(3) Country of birth
(4) Current residence
(5) Native language
(6) Ethno-linguistic/tribal affiliation
(7) Country of birth of father and mother
(8) Native language of father and mother
(9) Ethno-linguistic/tribal affiliation of mother and father
(10) Height
(11) Weight
(12) Current medications
(13) Smoking history
(14) Alcohol history
• The combination of phenotype and genetic data makes it possible to identify different populations and individuals, hence restricted access
Image credits: National Human Genome Research Institute (https://www.genome.gov/imagegallery/)
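The privacy concern on the slide above (combined metadata fields can single out individuals) can be illustrated by counting how many participants share the same combination of values. The sketch below uses made-up records and hypothetical field names; it is an illustration of the general idea, not an H3Africa tool or policy.

```python
from collections import Counter

# Hypothetical, made-up records; field names and values are illustrative only.
records = [
    {"sex": "F", "country_of_birth": "A", "native_language": "x"},
    {"sex": "F", "country_of_birth": "A", "native_language": "y"},
    {"sex": "F", "country_of_birth": "B", "native_language": "x"},
    {"sex": "M", "country_of_birth": "A", "native_language": "x"},
    {"sex": "M", "country_of_birth": "A", "native_language": "x"},
]

def smallest_group(rows, fields):
    """Size of the smallest group of rows sharing the same values for `fields`."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return min(counts.values())

print(smallest_group(records, ["sex"]))                                         # 2
print(smallest_group(records, ["sex", "country_of_birth", "native_language"]))  # 1
```

With one field, every participant shares their value with at least one other person. With three fields combined, some participants are unique in this toy dataset, which is the kind of re-identification risk that motivates restricted access.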
12. Sharing of research data and outputs
• Funders’ data sharing policies
“The Wellcome Trust is committed to ensuring that the outputs of the research it funds, including research data, are managed and used in ways that maximise public benefit. Making research data widely available to the research community in a timely and responsible manner ensures that these data can be verified, built upon and used to advance knowledge and its application to generate improvements in health.”
https://wellcome.ac.uk/funding/managing-grant/policy-data-management-and-sharing
“The National Institutes of Health (NIH) Genomic Data Sharing Policy expects that genomic research data from NIH-supported studies involving human specimens as well as non-human and model organisms will be submitted to an NIH-designated data repository. The list below provides examples of relevant databases.”
https://gds.nih.gov/02dr2.html
13. Limits to sharing human genetic data
• Ethics:
– Digital data (genomes) can be stored indefinitely, biobank specimens can be stored for up to 20 years – secondary use
– Rapid innovation with ‘omics technologies
• H3Africa: “Seven projects used broad consent, five projects used tiered consent and one used specific consent”§
• History of vulnerable populations, low education levels and exploitation
• Blood sample collection and visits to clinics associated with disease and treatment – even if a healthy control
• “All but one of the consent forms that we reviewed included a statement about data sharing.”§
§ Munung NS, Marshall P, Campbell M, et al. Obtaining informed consent for genomics research in Africa: analysis of H3Africa consent documents. Journal of Medical Ethics 2016;42:132-137.
Ethical considerations:
– Informed consent
– Participant identification
– Stigmatisation
– Benefit sharing
14. Limits to sharing human genetic data
• Non-harmonized national / regional laws and policies for ethics and genome data sharing within Africa
Image credits: https://en.wikipedia.org/wiki/African_Economic_Community
15. H3Africa data sharing and access policy
• Balance between ensuring adequate safeguards to protect participants and not being a barrier for scientists to advance research:
- Maximizing the availability of research data, in a timely and responsible manner.
- Protecting the rights and privacy of human subjects who participated in research studies.
- Recognizing the scientific contribution of researchers who generated the data.
- Considering the nature and ethics of the research proposed in establishing the timely release of data, and mechanisms of data sharing.
- Promoting deposition of genomic data in existing community data repositories whenever possible.
http://h3africa.org/images/DataSARWG_folders/FinalDocsDSAR/H3Africa%20Consortium%20Data%20Access%20%20Release%20Policy%20Aug%202014.pdf
16. Challenges in sharing data – metadata standards
• Meta-data (phenotype data) is collected via case report forms (CRFs)

Project 1 CRF    Project 2 CRF    Project 3 CRF
Female           Woman            1
Daily units      Weekly units     User defined time period

• Same question – data coded in different ways
• Similar measure – collected in different ways
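The two problems shown on the metadata standards slide (the same answer coded differently, and a similar measure collected over different periods) can be handled with explicit per-project mappings. The sketch below is a minimal illustration based on the slide's example values; the project keys, the canonical label `"female"`, and the function names are assumptions, not H3Africa conventions.

```python
# Per-project codings for the same answer, taken from the slide's example
# (Female / Woman / 1). The canonical label "female" is an assumption.
SEX_CODINGS = {
    "project1": {"Female": "female"},
    "project2": {"Woman": "female"},
    "project3": {"1": "female"},
}

# Factors that convert a reported amount to units per week.
# A user-defined period has no fixed factor, so its length must be supplied.
PERIOD_TO_WEEKLY = {"daily": 7.0, "weekly": 1.0}

def harmonize_sex(project, raw_value):
    """Translate a project-specific code into the shared vocabulary."""
    return SEX_CODINGS[project][str(raw_value).strip()]

def units_per_week(amount, period, period_days=None):
    """Normalize an amount reported per day, per week, or per custom period."""
    if period == "user_defined":
        if not period_days:
            raise ValueError("a user-defined period needs its length in days")
        return amount * 7.0 / period_days
    return amount * PERIOD_TO_WEEKLY[period]
```

For example, `harmonize_sex("project3", 1)` and `harmonize_sex("project2", "Woman")` both yield the same label, and `units_per_week(2, "daily")` gives 14.0 so it can be compared with a weekly figure from another project.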
17. Use established standards - Ontologies
• “An ontology defines a common vocabulary for researchers who need to share information in a domain. It includes machine-interpretable definitions of basic concepts in the domain and relations among them.”*
*http://protege.stanford.edu/publications/ontology_development/ontology101-noy-mcguinness.html
18. Options to aid data sharing
• Make data Findable, Accessible, Interoperable and Reusable (FAIR compliant)
• Beacons answer a single question: do you see a genetic variant at a specific position within your dataset? Yes / No, as in the case of the South African Human Genome Program (SAHGP)
Global Alliance for Genomics and Health: http://ga4gh.org/#/beacon
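The yes/no query described on the slide above can be sketched as a simple membership test. This is only an in-memory illustration of the idea; it does not use or reproduce the GA4GH Beacon API, and the variant tuples and the function name `beacon_answer` are hypothetical.

```python
# Hypothetical variants held by a dataset:
# (chromosome, 1-based position, alternate allele).
DATASET_VARIANTS = {
    ("1", 12345, "T"),
    ("7", 987654, "G"),
}

def beacon_answer(chromosome, position, alt_allele, variants=DATASET_VARIANTS):
    """Return True ("Yes") if the dataset holds this variant, else False ("No").

    Only a yes/no answer leaves the function; no genotype or
    participant-level data is exposed.
    """
    return (str(chromosome), int(position), alt_allele.upper()) in variants
```

Because the answer is a single boolean, a dataset can advertise what it contains without releasing the underlying records, which remain under controlled access.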
19. H3Africa genotyping chip
• Current genotyping technologies are designed for European populations
• African populations are under-represented, although they have the most diversity
Image credits: National Human Genome Research Institute (https://www.genome.gov/imagegallery/)
20. Designing the H3Africa genotyping chip
• Collaboration between H3ABioNet and the National Center for Supercomputing Applications (NCSA, US based) via US partner at University of Illinois
• Utilized the Bluewaters supercomputer facilities and CHPC facilities
– 212,000 node computing hours used at Bluewaters
– 600 TB of storage needed
• Chip has undergone assessment and is in use with positive results
https://twitter.com/billgates/status/800800954790465536?lang=en
Image credits: National Human Genome Research Institute (https://www.genome.gov/imagegallery/)
21. Connectivity for data transfers

GO endpoints              Transfer speeds in Mbps (min, max)
Baylor <-> Blue Waters    340, 1900
Blue Waters -> UCT        204, 322
CHPC <-> Blue Waters      81, 243
UCT <-> CHPC              34, 406
Sanger <-> UCT            38, 76

GO source and destination   Files to transfer and size per sample   Total size of transfer for 350 samples   Min transfer speed   Time to transfer
Baylor to Blue Waters       Baylor FASTQ.gzs / 100GB                75TB                                     340Mbps              21 days
Blue Waters to UCT          Baylor FASTQ.gzs / 100GB                75TB                                     200Mbps              35 days
Blue Waters to UCT          BW BAMs / 100GB                         40TB                                     200Mbps              19 days
UCT to CHPC                 BW BAMs / 100GB                         40TB                                     34Mbps               109 days
CHPC to UCT                 Union set / VCFs                        1TB                                      34Mbps               3 days
UCT to Sanger               Union set / VCFs                        1TB                                      34Mbps               3 days

Globus Online installed at Nodes
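The "time to transfer" figures in the connectivity table follow from total size divided by minimum speed. The sketch below reproduces them under two assumptions of ours: decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a sustained rate with no overhead or interruptions, rounded up to whole days.

```python
import math

def transfer_days(total_terabytes, speed_mbps):
    """Days needed to move `total_terabytes` at a sustained `speed_mbps`.

    Assumes decimal units and no protocol overhead or interruptions.
    """
    total_bits = total_terabytes * 10**12 * 8
    seconds = total_bits / (speed_mbps * 10**6)
    return seconds / 86_400  # seconds per day

# (route, total size in TB, minimum speed in Mbps), as listed on the slide.
routes = [
    ("Baylor to Blue Waters", 75, 340),
    ("Blue Waters to UCT (FASTQ)", 75, 200),
    ("Blue Waters to UCT (BAMs)", 40, 200),
    ("UCT to CHPC", 40, 34),
    ("CHPC to UCT", 1, 34),
]

for route, terabytes, mbps in routes:
    print(f"{route}: {math.ceil(transfer_days(terabytes, mbps))} days")
```

Rounded up, these come to 21, 35, 19, 109 and 3 days, matching the slide. The calculation also makes clear that the slowest link dominates: the same 40 TB takes 19 days at 200 Mbps but 109 days at 34 Mbps.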
22. Challenge of unequal infrastructures
• Diverse levels of expertise and infrastructure between different countries
www.project-redcap.org/map_fullscreen.php
• Software and hardware sanctions exacerbate existing inequalities, e.g. Sudan Node
http://mgafrica.com/article/2015-01-14-17-startling-facts-about-the-state-of-science-and-research-in-africa
23. Bioinformatics education
Aim:
• Basic bioinformatics training for interested H3Africa members (bioinformatics users – Introduction to Bioinformatics Training)
• Web-based bioinformatics tools and resources and how to use them
Course logistics:
• 3 months, 2 days contact time per week (3 hours per session)
• Distance learning model – physical classrooms connected to virtual classroom
• Mconf – video conferencing
• Vula – course management virtual classroom
24. IBT_2017 classroom sites
• 27 in total (vs. 20 classrooms in 2016)
• Countries that have joined IBT in 2017: Ethiopia, Burkina Faso
• Some participants from first course are going to be TAs
• Over 580 enrolled participants and over 130 volunteer staff
• Paper published on course design
[Map: IBT 2017 Classrooms. Legend: virtual classroom; classroom site 2016; new classroom site 2017; classroom site 2016 and 2017]
25. Conclusion
• Bioinformatics = big data and needs computational power, storage, fast read and write for processing
• Well defined meta-data standards are vital for interoperability and sharing of data
• Cyber infrastructure for moving and sharing large datasets is needed to foster open data and open science
• Education and skills development are essential for African citizens to take advantage of the data revolution
• Perceptions and attitudes – no amount of infrastructure will drive Open data and Open science if the sentiment is absent
26. Acknowledgements
• Prof Nicky Mulder and H3ABioNet members
• Ina Smith and the Academy of Science of South Africa
• BioVisionAlexandria 2018 organizers
H3ABioNet Consortium Members 2017
27. Conclusions
Provide a data archiving solution for H3Africa projects to ensure that a local copy of the data remains on the continent
28. Communication – H3Africa
• H3Africa working groups meet every fortnight
• Regular meetings are challenging due to diversity of timezones (most funders in the US) and daylight saving hours
Image credit: https://commons.wikimedia.org/wiki/File:UTC_hue4map_X_world_Robinson.png
29. Communication – H3Africa
• H3Africa funders and project members meet face to face every six months to provide reports and for working groups to also wrap up deliverables
30. Communication – H3ABioNet
• Within H3ABioNet the nodes are located in Africa so time differences are not a hindrance
• Working groups meet once a month and network meets annually for SAB review and network business
• Only some countries have toll free access to a booked conference call (costly)
• Challenges: communication platforms
http://mconf.org/
35. Ontologies work
OECD – WDS Workshop, Brussels 2017
• Adapting OMIABIS ontology to H3Africa data
• Mapping CRFs to ontologies, e.g. phenotype or disease ontology
• Mapping genomics data to Experimental Factor ontology
• Developing Sickle Cell Disease Ontology
36. Beacons in Africa
https://beacon-network.org//#/directory
• First Beacon in Africa “lit” in October 2016 for the SAHGP