Bioinformatics and NGS for advancing in hearing loss research
1. Joaquín Dopazo
Computational Genomics Department,
Centro de Investigación Príncipe Felipe (CIPF),
Functional Genomics Node, (INB),
Bioinformatics Group (CIBERER) and
Medical Genome Project,
Spain.
Bioinformatics and NGS: an
indissoluble marriage for advancing in
hearing loss research
http://bioinfo.cipf.es
http://www.medicalgenomeproject.com
http://www.babelomics.org
http://www.hpc4g.org
@xdopazo
Fundación Ramón Areces, Madrid, 5 March 2015
2. Why are Bioinformatics and NGS important?
Lessons learned from the Spanish 1000 genomes project:
Rare and familial diseases sequencing initiative
• Metabolic (86 samples)
• Opitz
• Atypical fracture
• coQ10 deficiency
• Congenital disorder of glycosylation types I and II
• Maple syrup urine disease
• Pelizaeus-like
• 4 unknown syndromes
• Genetic (24 samples)
• Charcot-Marie-Tooth
• Rett Syndrome
• Neurosensorial (35 samples)
• Usher
• AD non-syndromic hearing loss
• AR non-syndromic hearing loss
• RP
• Mitochondrial (28 samples)
• Progressive External Ophthalmoplegia
• Multi-enzymatic deficiency in mitochondrial
respiratory complexes
• CoQ disease
• Other
• APL (10 samples)
Autism (37 samples)
Mental retardation (autosomal recessive) (24)
Immunodeficiency (18)
Leber's congenital amaurosis (9)
Cataract (2)
RP(AR) (60)
RP(AD) (46)
Deafness (24)
CLAPO (4)
Skeletal Dysplasia (3)
Cantú syndrome (1)
Dubowitz syndrome (2)
Gorham-Stout syndrome (1)
Malpuech syndrome (4)
Hirschsprung’s disease (81)
Hereditary macrothrombocytopenia (3)
MTC (41)
Controls (301)
1044 samples = 183 samples + 200 controls + 360 samples + 301 controls
3. Organization of the initiative
Diseases with:
• Unknown genes
• Known genes/mutations discarded
Search for:
• Novel genes
• Modifier genes, where the responsible genes are already known
• Susceptibility genes
• Therapeutic targets
http://www.gbpa.es/
Data production Sequencing platforms Data analysis
Big-Data Team
science paradigm
5. Pipeline of data analysis
Primary analysis
Initial QC (FASTQ file)
• Sequence cleansing
• Base quality
• Remove adapters
• Remove duplicates
Mapping + QC (BAM file)
• Mapping (HPG)
• Remove multiple mapping reads
• Remove low quality mapping reads
• Realigning
• Base quality recalibrating
Variant calling + QC (VCF file)
• Calling and labeling of missing values
• Calling SNVs and indels (GATK) using 6 statistics based on QC, strand bias, consistency (poor QC callings are converted to missing values as well)
• Create multiple VCF with missing, SNVs and indels
Gene prioritization
Variant and gene prioritization + QC (Report)
• Counts of sites with variants
• Variant annotation (function, putative effect, conservation, etc.)
• Inheritance analysis (including compound heterozygotes in recessive inheritance)
• Filtering by frequency with external controls (Spanish controls, dbSNP, 1000g, ESP) and annotation
• Multi-family intersection of genes and variants
• Function/Network-based prioritization
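The inheritance analysis step mentions compound heterozygotes under recessive inheritance. A minimal sketch of that check for a single trio follows. It assumes genotypes are encoded as alternative-allele counts (0, 1 or 2) and that each variant record carries a gene name; the record layout, the function name and the gene and variant identifiers are hypothetical and are not taken from the pipeline itself.

```python
from collections import defaultdict
from itertools import combinations


def compound_het_genes(variants):
    """Return genes in which the child carries two heterozygous variants,
    one inherited from each parent.

    Each variant is a dict with keys gene, id, child, father, mother;
    genotypes are alt-allele counts (0, 1 or 2). Hypothetical layout.
    """
    by_gene = defaultdict(list)
    for v in variants:
        if v["child"] != 1:
            continue  # only heterozygous calls in the child are relevant
        if v["father"] == 1 and v["mother"] == 0:
            by_gene[v["gene"]].append((v["id"], "father"))
        elif v["mother"] == 1 and v["father"] == 0:
            by_gene[v["gene"]].append((v["id"], "mother"))

    hits = {}
    for gene, calls in by_gene.items():
        # keep pairs of variants whose parental origins differ
        pairs = [(a[0], b[0]) for a, b in combinations(calls, 2) if a[1] != b[1]]
        if pairs:
            hits[gene] = pairs
    return hits


# Illustrative records only (not real data)
variants = [
    {"gene": "GENE_A", "id": "v1", "child": 1, "father": 1, "mother": 0},
    {"gene": "GENE_A", "id": "v2", "child": 1, "father": 0, "mother": 1},
    {"gene": "GENE_B", "id": "v3", "child": 1, "father": 1, "mother": 0},
    {"gene": "GENE_B", "id": "v4", "child": 1, "father": 1, "mother": 0},
]
result = compound_het_genes(variants)
```

In this toy input only GENE_A qualifies, because both GENE_B variants come from the same parent. A real implementation would also need to handle missing calls and phasing.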
6. Pipeline of data analysis
Primary analysis
• Primary processing: Initial QC (FASTQ file), Mapping (BAM file), Variant calling (VCF file)
• Secondary analysis (successive filtering): Variant annotation, Filtering by effect, Filtering by MAF, Filtering by family segregation
Gene prioritization
• Knowledge-based prioritization: Proximity to other known disease genes, Functional proximity, Network proximity, Burden tests, Other prioritization methods
Highlighted: VARIANT annotation tool
7. Variant annotation
HPG Variant: a suite of tools for HPC-based genomic variant annotation (VARIANT = VARIant ANnotation Tool). The tools are implemented using OpenMP, Nvidia CUDA and MPI for large clusters.
EFFECT: a cloud-based genomic variant effect predictor, available as a CLI and as a web application (http://variant.bioinfo.cipf.es, Medina 2012 NAR)
VCF: a C library and tool for analyzing large VCF files with a low memory footprint: stats, filter, split, merge, etc. Example: hpg-variant vcf --stats --vcf-file ceu.vcf
Annotations
sought
8. The knowledge database
CellBase (Bleda, 2012, NAR), a
comprehensive integrative database
and RESTful Web Services API,
more than 250GB of data and 90
tables exported in TXT and JSON:
● Core features: genes, transcripts,
exons, cytobands, proteins (UniProt),...
● Variation: dbSNP and Ensembl SNPs,
HapMap, 1000Genomes, Cosmic, ...
● Functional: 40 OBO ontologies (Gene
Ontology), Interpro, etc.
● Regulatory: TFBS, miRNA targets,
conserved regions, etc.
● Systems biology: Interactome (IntAct),
Reactome database, co-expressed
genes.
NoSQL and scales to TB
Wiki: http://docs.bioinfo.cipf.es/projects/cellbase/wiki
Project: http://bioinfo.cipf.es/compbio/cellbase
Now available at the EBI: http://www.ebi.ac.uk/cellbase/webservices/rest/v3/
9. Pipeline of data analysis
Primary analysis
• Primary processing: Initial QC (FASTQ file), Mapping (BAM file), Variant calling (VCF file)
• Secondary analysis (successive filtering): Variant annotation, Filtering by effect, Filtering by MAF, Filtering by family segregation
Gene prioritization
• Knowledge-based prioritization: Proximity to other known disease genes, Functional proximity, Network proximity, Burden tests, Other prioritization methods
Highlighted: 1000 genomes, EVS, Local variants
10. Use known variants and their population frequencies to filter out common variants.
• Typically dbSNP, 1000 genomes and the 6515 exomes from the ESP are used as sources of population frequencies.
• We sequenced 300 healthy controls (rigorously phenotyped) to add an extra filtering step to the analysis pipeline
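The effect of adding local controls to the external frequency sources can be sketched as follows. This is an illustrative example only: the 1% threshold, the source labels and the frequency values are assumptions made up for the sketch, and treating a variant that is absent from a source as having frequency 0 is a simplification.

```python
MAX_MAF = 0.01  # hypothetical threshold, not a value given in the talk

EXTERNAL = ("dbSNP", "1000g", "ESP")
WITH_LOCAL = EXTERNAL + ("local_controls",)


def passes_frequency_filter(freqs, sources, max_maf=MAX_MAF):
    """Keep a variant only if it is rare in every listed source.

    freqs maps source name -> alt allele frequency; a source that does
    not report the variant is treated as frequency 0 (an assumption).
    """
    return all(freqs.get(source, 0.0) <= max_maf for source in sources)


# Made-up frequencies for three candidate variants
candidates = {
    "var1": {"1000g": 0.002, "ESP": 0.001},
    "var2": {"1000g": 0.000, "local_controls": 0.08},  # common only locally
    "var3": {"dbSNP": 0.15, "1000g": 0.12, "ESP": 0.14, "local_controls": 0.13},
}

kept_external = [v for v, f in candidates.items()
                 if passes_frequency_filter(f, EXTERNAL)]
kept_with_local = [v for v, f in candidates.items()
                   if passes_frequency_filter(f, WITH_LOCAL)]
```

Here var2 survives the external filter but is discarded once local controls are included, which is the kind of population-specific polymorphism the extra filtering step is meant to catch.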
Novembre et al., 2008. Genes mirror
geography within Europe. Nature
Comparison of MGP controls to 1000g
How important do you
think local information is
to detect disease genes?
11. Filtering with or without local variants
Number of genes as a function of the number of individuals in the study of a dominant disease (autosomal dominant retinitis pigmentosa)
The use of local variants makes an enormous difference
12. The CIBERER Exome Server (CES): the first
repository of variability of the Spanish
population
http://ciberer.es/bier/exome-server/
Only one other similar initiative exists: the GoNL (http://www.nlgenome.nl/)
And, more recently, one for the Finnish population
15. Variants can also be seen
within their genomic context
GenomeMaps viewer (Medina et al., 2013, NAR) embedded in the application.
GenomeMaps is the official genome viewer of the ICGC (http://dcc.icgc.org/)
16. Occurrence of pathological variants in
“normal” population
Reference genome is mutated
Nine carriers in 1000 genomes
One affected and 73 carriers in EVS
17. Table of Spanish Frequencies (TSF) and DB of Spanish variants (DBSV)
Chr  Position  Ref  Alt  0/0  0/1  1/1
1    1365313   A    T    75   0    0
1    1484884   G    A    70   4    1
2    326252    T    C    25   35   15
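The genotype counts in the table are enough to derive allele frequencies. A small sketch, assuming diploid autosomal positions and using the three rows shown above (the function name is ours, not part of the CES):

```python
def alt_allele_frequency(hom_ref, het, hom_alt):
    """Alternative allele frequency from genotype counts (0/0, 0/1, 1/1).

    Assumes diploid genotypes; returns None if no genotypes were called.
    """
    n_alleles = 2 * (hom_ref + het + hom_alt)
    if n_alleles == 0:
        return None
    return (het + 2 * hom_alt) / n_alleles


# Rows copied from the table: chr, position, ref, alt, 0/0, 0/1, 1/1
rows = [
    ("1", 1365313, "A", "T", 75, 0, 0),
    ("1", 1484884, "G", "A", 70, 4, 1),
    ("2", 326252, "T", "C", 25, 35, 15),
]

freqs = {
    (chrom, pos): alt_allele_frequency(hom_ref, het, hom_alt)
    for chrom, pos, _ref, _alt, hom_ref, het, hom_alt in rows
}
```

Each row sums to 75 individuals (150 alleles), so the second row gives (4 + 2·1)/150 = 0.04 and the third gives (35 + 2·15)/150 ≈ 0.43.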
[Diagram: CES input and CES use. Labels: External, Internal, VCFs, Unrelated? (DBSV), Spanish? (TSF), YES/NO, Counts, Regional, Other countries]
Ancestry-informative markers (AIMs) are used to exclude samples showing kinship or a different ethnicity
18. Organization of the database
Project D1 D2 … Case Control Counts
A x x f1
X x f2
B X X f3
X X f4
C X X f5
X X f6
X X f7
… … … … … … …
Organized in projects / diseases / case-control. Frequencies are
calculated for each project-disease-status combination, and selections can be
made as required. The items can be combined to maximize
pseudo-control sample size.
Example: frequencies f1, f2, and f5 can be used as pseudo-controls
for studying disease D2. Under a less stringent scenario
f4 and f6 could also be used.
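The pooling of project-disease-status groups into pseudo-controls can be sketched as below. The group layout, the selection rule (any group whose disease differs from the one under study) and all counts are assumptions for illustration; they do not reproduce the table on the slide or the actual database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    """Genotype counts for one variant in one project/disease/status group."""
    project: str
    disease: str
    status: str  # "case" or "control"
    hom_ref: int
    het: int
    hom_alt: int


def select_pseudo_controls(groups, target_disease):
    """Hypothetical rule: use every group not tied to the target disease."""
    return [g for g in groups if g.disease != target_disease]


def pooled_alt_frequency(groups):
    """Pool genotype counts across groups and return the alt allele frequency."""
    het = sum(g.het for g in groups)
    hom_alt = sum(g.hom_alt for g in groups)
    n_alleles = 2 * sum(g.hom_ref + g.het + g.hom_alt for g in groups)
    return (het + 2 * hom_alt) / n_alleles if n_alleles else None


# Made-up counts for a single variant
groups = [
    Group("A", "D1", "case", 18, 2, 0),
    Group("A", "D1", "control", 20, 0, 0),
    Group("B", "D2", "case", 10, 8, 2),
    Group("C", "D3", "case", 29, 1, 0),
]

pseudo = select_pseudo_controls(groups, "D2")
pseudo_freq = pooled_alt_frequency(pseudo)
```

Pooling raw genotype counts rather than averaging per-group frequencies keeps the result weighted by sample size, which matters when the groups differ widely in size.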
19. Are we there yet?
Variability spectrum of the
Spanish population
A total of 131,897 variant positions unique to the Spanish population were
detected across all 75 samples. Approximately 90,000 were
singletons. 51,295 variants are non-synonymous changes and 18,450
are synonymous changes (the opposite of the pattern for variants shared
with 1000g and EVS).
20. CIBERER exome server roadmap and the Spanish 1000 genomes project (Spanex)
Phase I: CIBERER, 76 samples
Phase II: CES II, 76+269+X (mixed); MGP, 269 samples (healthy controls)
Phase III: CES II, 1000+76+269+X (mixed); more CIBERER samples; SPANEX: 1000 exomes (200 ongoing)
Timeline labels: 2014-June, 2014, 2015, Today; 400
21. BiERapp: interactive web-based tool for easy
candidate prioritization by successive filtering
SEQUENCING CENTER
Data
preprocessing
VCF
FASTQ
Genome
Maps
BAM
BiERapp filters
No-SQL (Mongo)
VCF indexing
Population
frequencies Consequence types
Experimental
design
BAM viewer and
Genomic context?
Easy
scaleup
23. A proper filtering system must consider missing values
Samples:            NA19660  NA19661  NA19600  NA19685
All calls present:  A/T      A/T      T/T      A/T
With missing call:  ?/?      A/T      T/T      A/T
Unreported alternative alleles can happen because:
a) The position was read and the reference allele was found
b) The position could not be read and/or it was low quality (missing value)
Most VCF formats do not allow these two scenarios to be told apart.
We specifically include missing values
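The difference between "reference allele found" and "position not read" can be made concrete with a small segregation check. This is a sketch under assumptions: genotypes are written VCF-style (0 = reference, "." or "?" = missing), the disease model is dominant, and the sample names and function names are hypothetical.

```python
def parse_gt(gt):
    """Return the number of alt alleles, or None when the call is missing."""
    alleles = gt.replace("|", "/").split("/")
    if any(a in (".", "?") for a in alleles):
        return None
    return sum(a != "0" for a in alleles)


def consistent_with_dominant(genotypes, affected, missing_ok=True):
    """Check that carriers are exactly the affected samples.

    A missing call is skipped when missing_ok is True, instead of being
    silently read as homozygous reference.
    """
    for sample, gt in genotypes.items():
        n_alt = parse_gt(gt)
        if n_alt is None:
            if not missing_ok:
                return False
            continue
        if (n_alt > 0) != (sample in affected):
            return False
    return True


affected = {"s1", "s2", "s4"}

# s1 could not be read at this position
with_missing = {"s1": "./.", "s2": "0/1", "s3": "0/0", "s4": "0/1"}
# the same family if the missing call is wrongly reported as reference
as_reference = {"s1": "0/0", "s2": "0/1", "s3": "0/0", "s4": "0/1"}

kept_when_missing_is_explicit = consistent_with_dominant(with_missing, affected)
kept_when_missing_is_reference = consistent_with_dominant(as_reference, affected)
```

With the missing call kept explicit the variant stays a candidate; if the same position is reported as reference, the variant is wrongly discarded because an affected sample appears not to carry it.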
24. Successive filtering approach: an example with 3-Methylglutaconic aciduria syndrome
3-Methylglutaconic aciduria (3-MGA-uria) is a heterogeneous group of syndromes characterized by an increased excretion of 3-methylglutaconic and 3-methylglutaric acids.
WES with a successive filtering approach is enough to detect the new mutation in this case.
25. Readjusting filtering thresholds
Primary
analysis
VCF
Frequency
Deleteriousness
Experimental design
GO enrichment
Network analysis
Pathway analysis
Gene
yes
no
Paper
BiERapp
Quite often the result is not conclusive, because there are either
too many or too few candidates, and it depends entirely on the
disease and the experimental setup.
In our experience, easy interactivity in the filtering is
the best asset for gene discovery
26. Results: 36 new disease variants in known
genes and 27 disease variants in 13 new genes
WES
• IRDs
• arRP (EYS)
• BBS
• arRP
• arRP (USH2)
• 3-MGA-uria (SERAC1)
• NBD (BCKDK)
27. Diagnosis by targeted resequencing (panels, real or virtual, of genes)
http://team.babelomics.org
Collaboration with M.A. Moreno, Hospital Ramón y Cajal
• Tool for defining panels
• Diagnostic mutations
• If no diagnostic variants appear, then variants of uncertain effect are studied
• Incidental findings can also be handled
• New filter based on local population variant frequencies
28. Virtual panels are a reality
4813 genes with known
phenotypes.
• One physical panel
• As many virtual panels
as you need
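A virtual panel amounts to restricting the variants from one physical capture to a chosen gene list. A minimal sketch, assuming each variant record carries a gene symbol; the panel names and variant identifiers are hypothetical, and the gene symbols are reused from earlier slides purely for illustration.

```python
def apply_virtual_panel(variants, panel_genes):
    """Return only the variants located in genes of the chosen panel."""
    panel = set(panel_genes)
    return [v for v in variants if v["gene"] in panel]


# Hypothetical panels defined over the same sequenced gene set
PANELS = {
    "retina": {"EYS"},
    "metabolic": {"SERAC1", "BCKDK"},
}

# Illustrative variant records only
variants = [
    {"id": "v1", "gene": "EYS"},
    {"id": "v2", "gene": "SERAC1"},
    {"id": "v3", "gene": "BCKDK"},
]

retina_hits = apply_virtual_panel(variants, PANELS["retina"])
metabolic_hits = apply_virtual_panel(variants, PANELS["metabolic"])
```

Because the panel is only a filter over data already sequenced, a new panel can be defined or revised without any additional wet-lab work.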
34. Implementation of tools for genomic big data
management in the IT4I Supercomputing
Center (Czech Republic)
The primary and secondary analysis pipelines
developed by the Computational Genomics
Department have proven their efficiency
in the analysis of more than 1000
exomes in a joint collaborative
project of the CIBERER and the
MGP.
A first pilot has been implemented in
the IT4I supercomputing center,
which aims to centralize the analysis
of genomics data in the country.
Genomic data management solutions scalable to country size
35. What is next?
Miniaturized sequencing devices (still far away from the clinic)…
…that will bring sequencing closer to the bedside
We only lack the bioinformatics to deal with
36. Software development
See the interactive map of tool usage over the last 24 h: http://bioinfo.cipf.es/toolsusage
Babelomics is the third most cited tool for
functional analysis. It includes more than 30
tools for advanced, systems-biology-based
data analysis
More than 150,000 experiments were analyzed with our tools during the last year
HPC on CPU, SSE4 and GPUs for NGS data processing: speedups up to 40X
Genome Maps is now part of the ICGC data portal: an ultrafast genome viewer built with Google technology
Mapping
Visualization
Functional analysis
Variant annotation
CellBase Knowledge
database
Variant
prioritization
NGS
panels
Signaling network Regulatory
network
Interaction
network
Diagnostic
CellBase is now
available at EBI
Prototype running
in Czech Republic
37. The Computational Genomics Department at the
Centro de Investigación Príncipe Felipe (CIPF),
Valencia, Spain, and…
...the INB, National
Institute of
Bioinformatics
(Functional Genomics
Node)
and the BiER
(CIBERER Network of
Centers for Rare
Diseases)
@xdopazo
@bioinfocipf