Bioinformatics and NGS for advancing in hearing loss research

Joaquín Dopazo
Computational Genomics Department,
Centro de Investigación Príncipe Felipe (CIPF),
Functional Genomics Node, (INB),
Bioinformatics Group (CIBERER) and
Medical Genome Project,
Spain.
Bioinformatics and NGS: an
indissoluble marriage for advancing in
hearing loss research
http://bioinfo.cipf.es
http://www.medicalgenomeproject.com
http://www.babelomics.org
http://www.hpc4g.org
@xdopazo
Fundación Ramón Areces, Madrid, 5 March 2015

Why are bioinformatics and NGS important?
Lessons learned from the Spanish 1000 genomes project:
Rare and familial diseases sequencing initiative
• Metabolic (86 samples)
• Opitz
• Atypical fracture
• coQ10 deficiency
• Congenital disorder of glycosylation types I and II
• Maple syrup urine disease
• Pelizaeus-like
• 4 unknown syndromes
• Genetic (24 samples)
• Charcot-Marie-Tooth
• Rett Syndrome
• Neurosensorial (35 samples)
• Usher
• AD non-syndromic hearing loss
• AR non-syndromic hearing loss
• RP
• Mitochondrial (28 samples)
• Progressive external ophthalmoplegia
• Multi-enzymatic deficiency in mitochondrial
respiratory complexes
• CoQ disease
• Other
• APL (10 samples)
Autism (37 samples)
Mental retardation (autosomal recessive) (24)
Immunodeficiency (18)
Leber's congenital amaurosis (9)
Cataract (2)
RP(AR) (60)
RP(AD) (46)
Deafness (24)
CLAPO (4)
Skeletal Dysplasia (3)
Cantú syndrome (1)
Dubowitz syndrome (2)
Gorham-Stout syndrome (1)
Malpuech syndrome (4)
Hirschsprung's disease (81)
Hereditary macrothrombocytopenia (3)
MTC (41)
Controls (301)
1044 samples = 183 samples + 200 controls + 360 samples + 301 controls

Organization of the initiative
Diseases with:
• Unknown genes
• Known genes/mutations discarded
Search for:
• Novel genes
• Responsible genes known but unknown modifier genes
• Susceptibility Genes
• Therapeutic targets
http://www.gbpa.es/
Data production Sequencing platforms Data analysis
Big-Data Team
science paradigm

Data management, analysis
and storage
http://www.gbpa.es/
[Figure: data flow, with the following labels]
Samples
Raw files (FastQ)
Analysis pipeline
Processed files (BAM, VCF)
DB storage
K-DB
Prioritization report (Gene 1, Gene 2, ... Gene X, Gene Y)
Dialog with experts in the disease + validations

Pipeline of data analysis
Initial QC (FASTQ file)
• Sequence cleansing
• Base quality
• Remove adapters
• Remove duplicates
Mapping + QC (BAM file)
• Mapping (HPG)
• Remove multiple-mapping reads
• Remove low-quality mapping reads
• Realigning
• Base quality recalibration
Variant calling + QC (VCF file)
• Calling and labeling of missing values
• Calling SNVs and indels (GATK) using 6 statistics based on QC, strand bias and consistency (poor-QC calls are converted to missing values as well)
• Create multiple VCF with missing values, SNVs and indels
Variant and gene prioritization + QC (Report)
• Counts of sites with variants
• Variant annotation (function, putative effect, conservation, etc.)
• Inheritance analysis (including compound heterozygotes in recessive inheritance)
• Filtering by frequency with external controls (Spanish controls, dbSNP, 1000g, ESP) and annotation
• Multi-family intersection of genes and variants
• Function/Network-based prioritization
Stages: primary analysis (initial QC, mapping, variant calling) and gene prioritization
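The inheritance analysis step mentions compound heterozygotes under recessive inheritance. As a minimal sketch of what such a check could look like for a parent-child trio (this is not the pipeline's code; the function name, the record layout and the gene and variant names are all hypothetical), a gene is flagged when the child carries one heterozygous variant seen only in the father and another seen only in the mother:

```python
from collections import defaultdict


def compound_het_genes(variants):
    """Find genes with a trio-confirmed compound heterozygous pattern.

    Each variant is a dict with keys: gene, id, child, father, mother.
    Genotypes are alternative-allele counts (0, 1 or 2).
    """
    paternal = defaultdict(list)  # child-het variants seen only in the father
    maternal = defaultdict(list)  # child-het variants seen only in the mother
    for v in variants:
        if v["child"] != 1:
            continue
        if v["father"] == 1 and v["mother"] == 0:
            paternal[v["gene"]].append(v["id"])
        elif v["mother"] == 1 and v["father"] == 0:
            maternal[v["gene"]].append(v["id"])
    # A gene qualifies when it has at least one variant from each parent.
    return {
        gene: [(p, m) for p in paternal[gene] for m in maternal[gene]]
        for gene in paternal
        if gene in maternal
    }


# Toy trio data (hypothetical gene and variant names).
toy_variants = [
    {"gene": "GENE_A", "id": "v1", "child": 1, "father": 1, "mother": 0},
    {"gene": "GENE_A", "id": "v2", "child": 1, "father": 0, "mother": 1},
    {"gene": "GENE_B", "id": "v3", "child": 1, "father": 1, "mother": 0},
]
candidates = compound_het_genes(toy_variants)
```

Only GENE_A is reported here, because GENE_B has a single heterozygous variant. Missing genotypes are ignored in this sketch and would need explicit handling in practice.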

Primary analysis (primary processing)
• Initial QC (FASTQ file)
• Mapping (BAM file)
• Variant calling (VCF file)
Secondary analysis (successive filtering)
• Variant annotation
• Filtering by effect
• Filtering by MAF
• Filtering by family segregation
Gene prioritization (knowledge-based prioritization)
• Proximity to other known disease genes
• Functional proximity
• Network proximity
• Burden tests
• Other prioritization methods
VARIANT annotation tool

Variant annotation
HPG Variant: a suite of tools for HPC-based genomic variant annotation, implemented using OpenMP, Nvidia CUDA and MPI for large clusters. VARIANT = VARIant ANnotation Tool.
EFFECT: a cloud-based genomic variant effect predictor, available as a CLI and as a web application (http://variant.bioinfo.cipf.es, Medina 2012 NAR).
VCF: a C library and tool for analyzing large VCF files with a low memory footprint: stats, filter, split, merge, etc. Example: hpg-variant vcf --stats --vcf-file ceu.vcf
Annotations
sought
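To illustrate how VCF statistics can be computed with a low memory footprint, the sketch below reads the file one record at a time and keeps only counters. It is a plain-Python illustration of the idea, not the HPG Variant implementation; the function name and the counter labels are assumptions.

```python
from collections import Counter


def vcf_stats(lines):
    """Count records and alternative alleles by type, one line at a time."""
    stats = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information and the header
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        stats["records"] += 1
        for alt in alts:
            if alt == ".":
                continue  # no alternative allele reported
            if alt.startswith("<"):
                stats["other"] += 1  # symbolic allele such as <DEL>
            elif len(ref) == 1 and len(alt) == 1:
                stats["snvs"] += 1
            elif len(ref) != len(alt):
                stats["indels"] += 1
            else:
                stats["other"] += 1  # same-length multi-base substitution
    return stats


# Tiny in-memory example; a file handle can be passed in the same way.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tT\t.\t.\t.\n",
    "1\t200\t.\tG\tGA,C\t.\t.\t.\n",
]
summary = vcf_stats(example)
```

Because any iterable of lines is accepted, an open file such as `open("ceu.vcf")` can be passed directly, so the whole file never has to be held in memory.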

The knowledge database
CellBase (Bleda, 2012, NAR), a
comprehensive integrative database
and RESTful Web Services API,
more than 250GB of data and 90
tables exported in TXT and JSON:
● Core features: genes, transcripts,
exons, cytobands, proteins (UniProt),...
● Variation: dbSNP and Ensembl SNPs,
HapMap, 1000Genomes, Cosmic, ...
● Functional: 40 OBO ontologies (Gene
Ontology), Interpro, etc.
● Regulatory: TFBS, miRNA targets,
conserved regions, etc.
● Systems biology: Interactome (IntAct),
Reactome database, co-expressed
genes.
NoSQL and scales to TB
Wiki: http://docs.bioinfo.cipf.es/projects/cellbase/wiki
Project: http://bioinfo.cipf.es/compbio/cellbase
Now available at the EBI: http://www.ebi.ac.uk/cellbase/webservices/rest/v3/

1000 genomes
EVS
Local variants

Use known variants and their population frequencies to filter out common variants.
• Typically dbSNP, 1000 genomes and the 6515 exomes from the ESP are used as sources of population frequencies.
• We sequenced 300 healthy controls (rigorously phenotyped) to add an extra filtering step to the analysis pipeline.
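The frequency filter described above can be sketched as follows. This is an illustrative example only: the 1% threshold, the source labels and the variant records are assumptions, and a variant with no reported frequency in a source is treated as not excluded by that source.

```python
def filter_rare(variants, sources, max_maf=0.01):
    """Keep variants whose frequency is at most max_maf in every given source.

    A frequency of None means the source does not report the variant.
    """
    kept = []
    for v in variants:
        freqs = [v["freqs"].get(source) for source in sources]
        if all(f is None or f <= max_maf for f in freqs):
            kept.append(v["id"])
    return kept


# Hypothetical variants with made-up frequencies.
toy_variants = [
    {"id": "var1", "freqs": {"1000g": 0.002, "ESP": 0.001, "local": 0.08}},
    {"id": "var2", "freqs": {"1000g": None, "ESP": None, "local": 0.0}},
    {"id": "var3", "freqs": {"1000g": 0.20, "ESP": 0.18, "local": 0.22}},
]

external_only = filter_rare(toy_variants, ["1000g", "ESP"])
with_local = filter_rare(toy_variants, ["1000g", "ESP", "local"])
```

In this toy data, var1 looks rare in the external databases but is common in the local controls, so it is only removed when the local frequencies are added as a filtering source.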
Novembre et al., 2008. Genes mirror
geography within Europe. Nature
Comparison of MGP controls to 1000g
How important do you
think local information is
to detect disease genes?

Filtering with or without local variants
Number of genes as a function of individuals in the study of a dominant disease
Retinitis Pigmentosa autosomal dominant
The use of local
variants makes
an enormous
difference

The CIBERER Exome Server (CES): the first repository of variability of the Spanish population
http://ciberer.es/bier/exome-server/
Only one other similar initiative exists: the GoNL (http://www.nlgenome.nl/)
And, more recently, the Finnish population

Information provided
• Genotypes in the different reference populations
• Genomic coordinates, variation, gene
• SNP id, if any

Information provided
• PolyPhen and SIFT pathogenicity indexes
• Phenotype, if available

Variants can also be seen
within their genomic context
GenomeMaps viewer (Medina et al., 2013, NAR) embedded in the application.
GenomeMaps is the official genome viewer of the ICGC (http://dcc.icgc.org/)

Occurrence of pathological variants in
“normal” population
Reference
genome is
mutated
Nine carriers
in 1000
genomes
One affected
and 73 carriers
in EVS

Table of Spanish
Frequencies
(TSF)
DB of Spanish
variants (DBSV)
Chr Position Ref Alt 0/0 0/1 1/1
1 1365313 A T 75 0 0
1 1484884 G A 70 4 1
2 326252 T C 25 35 15
CES
use
Other countries
CES
input
External
Unrelated?
(DBSV)
VCFs Spanish?
(TSF)
YES YES
NO NO
Counts
Internal
Regional
AIM (Ancestry-informative
markers) are used to
discard kinship and
different ethnicity
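The Table of Spanish Frequencies stores genotype counts (0/0, 0/1, 1/1) per position. A small worked example of turning those counts into an alternative allele frequency, using the three rows shown on the slide and assuming diploid autosomal positions (the function name is illustrative):

```python
def alt_allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Alternative allele frequency from diploid genotype counts."""
    n_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    if n_alleles == 0:
        return None  # no genotyped samples at this position
    return (n_het + 2 * n_hom_alt) / n_alleles


# Rows from the slide: Chr, Position, Ref, Alt, 0/0, 0/1, 1/1
tsf_rows = [
    ("1", 1365313, "A", "T", 75, 0, 0),
    ("1", 1484884, "G", "A", 70, 4, 1),
    ("2", 326252, "T", "C", 25, 35, 15),
]

frequencies = {
    (chrom, pos): alt_allele_frequency(hom_ref, het, hom_alt)
    for chrom, pos, ref, alt, hom_ref, het, hom_alt in tsf_rows
}
```

For the second row, 4 heterozygotes and 1 homozygote among 75 samples give (4 + 2) / 150 = 0.04.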

Organization of the database
Project D1 D2 … Case Control Counts
A x x f1
X x f2
B X X f3
X X f4
C X X f5
X X f6
X X f7
… … … … … … …
Organized in projects / diseases / case-control. Frequencies are calculated for each project-disease-status combination, and selections can be made as required. The items can be combined to maximize the pseudo-control sample size.
Example: frequencies f1, f2, and f5 can be used as pseudo-controls for studying disease D2. Under a less stringent scenario f4 and f6 could also be used.
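Combining groups into a pseudo-control set amounts to adding up their genotype counts for each variant. The sketch below follows the example above (f1, f2 and f5 for disease D2, with f4 and f6 added in the less stringent scenario); the counts themselves are invented for illustration and the function name is an assumption.

```python
def pool_counts(group_counts, selected):
    """Sum genotype counts (0/0, 0/1, 1/1) over the selected groups."""
    total = [0, 0, 0]
    for name in selected:
        for i, n in enumerate(group_counts[name]):
            total[i] += n
    return tuple(total)


# Hypothetical genotype counts for one variant in each frequency group.
group_counts = {
    "f1": (20, 1, 0),
    "f2": (15, 0, 0),
    "f4": (30, 2, 0),
    "f5": (40, 3, 1),
    "f6": (10, 0, 0),
}

strict = pool_counts(group_counts, ["f1", "f2", "f5"])
relaxed = pool_counts(group_counts, ["f1", "f2", "f5", "f4", "f6"])
```

The relaxed selection yields a larger pseudo-control sample at the cost of including groups that are less clearly unrelated to the disease under study.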

Are we there yet?
Variability spectrum of the
Spanish population
A total of 131,897 variant positions unique to the Spanish population were detected across the 75 samples. Approximately 90,000 were singletons. 51,295 variants are non-synonymous changes and 18,450 are synonymous changes (the opposite pattern to that of variants shared with 1000g and EVS).

CIBERER
76 samples
CES II
76+269+X
Mixed
MGP
269 samples
Healthy controls
Phase I Phase II Phase III
CES II
1000+76+269+X
Mixed
More
CIBERER
samples
SPANEX:
1000 exomes
(200 ongoing)
CIBERER
CIBERER exome server roadmap and
the Spanish 1000 genomes project
(Spanex)
2014-June 2014 2015 Today
400

BiERapp: interactive web-based tool for easy
candidate prioritization by successive filtering
SEQUENCING CENTER
Data
preprocessing
VCF
FASTQ
Genome
Maps
BAM
BiERapp filters
No-SQL (Mongo)
VCF indexing
Population
frequencies Consequence types
Experimental
design
BAM viewer and
Genomic context?
Easy
scaleup

NA19660 NA19661
NA19600 NA19685
BiERapp: the interactive filtering tool for
easy candidate prioritization
http://bierapp.babelomics.org
Aleman et al., 2014 NAR

NA19660 NA19661
NA19600 NA19685
A/T A/T
T/T A/T
NA19660 NA19661
NA19600 NA19685
?/? A/T
T/T A/T
1
A proper filtering system must consider missing values
An alternative allele can go unreported because:
a) The position was read and the reference allele was found
b) The position could not be read and/or it was low quality (missing value)
Most VCF files do not allow the two scenarios to be told apart.
We specifically include missing values
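The difference between the two scenarios can be shown with the four samples from the slide. This is a hedged sketch, not BiERapp's code: it assumes the genotypes pair with the sample names by position, that A is the alternative allele, and that the three A/T samples are the expected carriers; the function names are hypothetical.

```python
MISSING_CALLS = {"./.", ".|.", "?/?"}


def carries_allele(genotype, allele):
    """True/False for a known call, None when the call is missing."""
    if genotype in MISSING_CALLS:
        return None
    return allele in genotype.replace("|", "/").split("/")


def fits_design(genotypes, carriers, non_carriers, allele, allow_missing=True):
    """Check that expected carriers have the allele and non-carriers do not.

    A missing call neither confirms nor rejects the variant unless
    allow_missing is False.
    """
    expected = [(s, True) for s in carriers] + [(s, False) for s in non_carriers]
    for sample, should_carry in expected:
        observed = carries_allele(genotypes[sample], allele)
        if observed is None:
            if not allow_missing:
                return False
            continue
        if observed != should_carry:
            return False
    return True


carriers = ["NA19660", "NA19661", "NA19685"]  # assumed experimental design
non_carriers = ["NA19600"]

with_missing = {"NA19660": "?/?", "NA19661": "A/T", "NA19600": "T/T", "NA19685": "A/T"}
# Same data, but the unread position is wrongly assumed to be reference (T/T).
as_reference = dict(with_missing, NA19660="T/T")

kept = fits_design(with_missing, carriers, non_carriers, "A")
lost = fits_design(as_reference, carriers, non_carriers, "A")
```

When the unread position is silently treated as homozygous reference, the candidate variant is discarded; keeping the call as missing lets it survive the filter.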

3-Methylglutaconic aciduria (3-MGA-uria) is
a heterogeneous group of syndromes
characterized by an increased excretion of
3-methylglutaconic and 3-methylglutaric
acids.
WES with a consecutive filter approach is
enough to detect the new mutation in this
case.
Successive Filtering approach
An example with 3-Methylglutaconic aciduria syndrome

Readjusting filtering thresholds
Primary
analysis
VCF
Frequency
Deleteriousness
Experimental design
GO enrichment
Network analysis
Pathway analysis
Gene
yes
no
Paper
BiERapp
Quite often the result is inconclusive, with either too many or too few candidates, and it depends entirely on the disease and the experimental setup.
In our experience, easy interactivity in the filtering is the best asset for gene discovery

Results: 36 new disease variants in known
genes and 27 disease variants in 13 new genes
WES
IRDs
arRP
(EYS)
BBS
arRP
arRP
(USH2)
3-MGA-uria
(SERAC1)
NBD
(BCKDK)

Diagnosis by targeted resequencing (panels, real or virtual, of genes)
http://team.babelomics.org
Collaboration with M.A. Moreno, Hospital Ramón y Cajal
Tool for defining panels
Diagnostic mutations
If no diagnostic variants appear, then variants of uncertain effect are studied
Incidental findings can also be handled
New filter based on local population variant frequencies

Virtual panels are a reality
4813 genes with known
phenotypes.
• One physical panel
• As many virtual panels
as you need
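A virtual panel can be thought of as a named gene list applied to the variants already sequenced on the physical panel. The sketch below is illustrative only: the panels hold small subsets of well-known Usher and RP genes rather than the full lists, and the variant records and function name are hypothetical.

```python
# Illustrative subsets only, not the complete diagnostic panels.
VIRTUAL_PANELS = {
    "USH": {"USH2A", "CLRN1", "MYO7A"},
    "RP": {"RHO", "PDE6B", "USH2A"},
}


def apply_virtual_panel(variants, panel_name, panels=VIRTUAL_PANELS):
    """Keep only the variants located in genes of the chosen virtual panel."""
    genes = panels[panel_name]
    return [v for v in variants if v["gene"] in genes]


# Hypothetical variants called on one physical panel.
called_variants = [
    {"id": "var1", "gene": "USH2A"},
    {"id": "var2", "gene": "RHO"},
    {"id": "var3", "gene": "MYO7A"},
    {"id": "var4", "gene": "BBS1"},
]

ush_hits = [v["id"] for v in apply_virtual_panel(called_variants, "USH")]
rp_hits = [v["id"] for v in apply_virtual_panel(called_variants, "RP")]
```

The same sequencing data can be re-examined under a different panel without new laboratory work, and variants outside the chosen panel (var4 here) stay hidden unless a wider panel is selected.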

Building virtual panels
An example with Inherited Retinal Dystrophies
[Figure: diagram of overlapping disease groups (RP, CORD/COD, CVD, MD, LCA, ERVR/EVR, NB, USH, BBS) with the following gene clusters placed in its regions]
• CACNA1F, CACNA2D4
• GNAT2
• C2ORF71, C8ORF37, CA4, CERKL, CNGA1, CNGB1, DHDDS, EYS, FAM161A, IDH3B, KLHL7, IMPG2, MAK, NRL, PAP1, PDE6A, PDE6G, PRCD, PRF3, PRPF8, PRPF31, RBP3, RGR, ROM1, RP1, RP2, SNRNP200, TOPORS, TTC8, ZNF513
• PDE6B, RHO, SAG
• GRK1, GRM6, NYX, TRPM1
• CABP4, LCA5, RD3
• CRB1, IMPDH1, LRAT, MERTK, RDH12, RPE65, SPATA7, TULP1
• CRX
• AIPL1, GUCY2D, RPGRIP1
• ADAM9, GUCA1A, HRG4/UNC119, KCNV2, PDE6H, PITPNM3, RAX2, RDH5, RIM1
• CNGA3, PDE6C
• BCP, GCP, RCP
• ABCA4, PROM1, PRPH2, RPGR
• RLBP1, SEMA4A
• C1QTNF5, EFEMP1, ELOVL4, HMNC1, RS1, TIMP3
• FSCN2, GUCA1B
• NR2E3
• BEST1
• FZD4, KCNJ13, LRP5, NDP, TSPAN12, VCAN
• ABHD12, CDH23, CIB2, DFNB31, GPR98, HARS, MYO7A, PCDH15, USH1C, USH1G
• CLRN1, USH2A
• CEP290
• BBS1
• ARL6, BBS2, BBS4, BBS5, BBS7, BBS9, BBS10, BBS12, INPP5E, LZTFL1, MKKS, MKS1, SDCCAG8, TRIM32, TTC8
Legend:
LCA: Leber Congenital Amaurosis
CORD/COD: Cone and cone-rod dystrophies
CVD: Colour Vision Defects
MD: Macular Degeneration
ERVR/EVR: Erosive and Exudative Vitreoretinopathies
USH: Usher Syndrome
RP: Retinitis Pigmentosa
NB: Night Blindness
BBS: Bardet-Biedl Syndrome

Panel for RP
[Figure: the same Inherited Retinal Dystrophies gene diagram, captioned "Panel for RP"]

Extended panel for RP
[Figure: the same Inherited Retinal Dystrophies gene diagram, captioned "Extended panel for RP"]

Super extended panel for RP
[Figure: the same Inherited Retinal Dystrophies gene diagram, captioned "Super extended panel for RP"]

Knowledge DB
Freq.popul.
MiSeq
IonTorrent
IonProton
HiSeq
IonProton
NO
Diagnostic
Therapeutic
decision
New variants
Disease
All
Candidate
Prioritization
Data preprocessing
Sequence DB
Sequences
Freqs.
Future
technologies
New knowledge
for future
diagnostic
The final schema: diagnostic and discovery

Implementation of tools for genomic big data
management in the IT4I Supercomputing
Center (Czech Republic)
The pipelines of primary and secondary analysis developed by the Computational Genomics Department have proven their efficiency in the analysis of more than 1000 exomes in a joint collaborative project of the CIBERER and the MGP.
A first pilot has been implemented in the IT4I supercomputing center, which aims to centralize the analysis of genomics data in the country.
Genomic data management solutions scalable to country size

What is next?
Miniaturized sequencing devices (still far away from the clinic)…
…that will bring sequencing closer to the bedside
We only lack the bioinformatics to deal with

Software development
See the interactive map of tool usage over the last 24 h: http://bioinfo.cipf.es/toolsusage
Babelomics is the third most cited tool for functional analysis. It includes more than 30 tools for advanced, systems-biology based data analysis.
More than 150,000 experiments were analyzed in our tools during the last year
HPC on CPU, SSE4,
GPUs on NGS data
processing
Speedups up to 40X
Genome maps is now part
of the ICGC data portal
Ultrafast
genome
viewer with
google
technology
Mapping
Visualization
Functional analysis
Variant annotation
CellBase Knowledge
database
Variant
prioritization
NGS
panels
Signaling network Regulatory
network
Interaction
network
Diagnostic
CellBase is now
available at EBI
Prototype running
in Czech Republic

The Computational Genomics Department at the
Centro de Investigación Príncipe Felipe (CIPF),
Valencia, Spain, and…
...the INB, National
Institute of
Bioinformatics
(Functional Genomics
Node)
and the BiER
(CIBERER Network of
Centers for Rare
Diseases)
@xdopazo
@bioinfocipf
