Bioinformatics and Big Data in the era of Personalized Medicine
10th Anniversary Instituto Roche Forum on Personalized Medicine: Challenges for the next decade.
Santiago de Compostela (Spain), September 25th 2014
1. Joaquín Dopazo
Computational Genomics Department,
Centro de Investigación Príncipe Felipe (CIPF),
Functional Genomics Node (INB),
Bioinformatics Group (CIBERER) and
Medical Genome Project,
Spain.
http://bioinfo.cipf.es
http://www.medicalgenomeproject.com
http://www.babelomics.org
http://www.hpc4g.org
@xdopazo
2. Allison, 2008. Is personalized medicine finally arriving? Nature.
Personalized medicine is essentially about a better understanding of the genotype-phenotype relationship
Personalized medicine through precision medicine
•Precision medicine requires better ways of defining diseases by introducing genomic technologies into diagnostic procedures.
•A more precise diagnosis of diseases, based on the description of their molecular mechanisms, is critical for creating innovative diagnostic, prognostic, and therapeutic strategies properly tailored to each patient’s needs
3. The future of personalized medicine is strongly based on genomics
•Personalized medicine is based on the availability of diagnostic biomarkers
•Genome sequencing offers ALL this information (if properly analyzed)
•Genome sequencing prices are in free fall (exome price expected < 300€ in 2-3 years)
•Some 30-40% of the budget (> $500 billion per year) is spent on costs associated with “overuse, underuse, misuse, ...”
4. While the cost falls, the amount of data to manage and its complexity rise exponentially.
Costs are already almost competitive enough for sequencing to be used in the clinic
The problem is… are we ready to deal with this data?
Exome sequencing has been used successfully. NGS prices will soon be affordable.
http://www.genome.gov/sequencingcosts/
6. Personalized Genomic Medicine. Phase I: generating the knowledge database
Patient → sequencing → List of variants → Database (query: variant/pathway) → Therapy → Outcome → System feedback
Genetic variants are linked to therapies through the knowledge of their functional effects (systems biology)
Initially the system will need much feedback: Knowledge generation phase.
Growing knowledge database
Genomic medicine
Knowledge database
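The feedback loop of Phase I can be sketched in a few lines. This is only an illustration under stated assumptions: the `KnowledgeDB` class, the variant-to-pathway table and all variant, pathway and therapy names are hypothetical, and a real system would rely on functional annotation rather than a lookup table.

```python
from collections import defaultdict


class KnowledgeDB:
    """Toy knowledge database: outcomes observed per (pathway, therapy) pair."""

    def __init__(self):
        # (pathway, therapy) -> [number of successes, number of cases]
        self._stats = defaultdict(lambda: [0, 0])

    def feedback(self, pathway, therapy, success):
        """Store the outcome of one treated patient (the 'system feedback' step)."""
        record = self._stats[(pathway, therapy)]
        record[0] += int(success)
        record[1] += 1

    def suggest(self, pathway):
        """Therapies already seen for this pathway, best observed success rate first."""
        rates = {
            therapy: wins / total
            for (p, therapy), (wins, total) in self._stats.items()
            if p == pathway
        }
        return sorted(rates, key=rates.get, reverse=True)


# Hypothetical annotation linking variants to the pathway they affect.
VARIANT_TO_PATHWAY = {"variant_1": "pathway_X", "variant_2": "pathway_X"}

db = KnowledgeDB()
db.feedback("pathway_X", "therapy_A", success=True)
db.feedback("pathway_X", "therapy_A", success=True)
db.feedback("pathway_X", "therapy_B", success=False)

# A new patient: variants are mapped to pathways, pathways to ranked therapies.
patient_variants = ["variant_2"]
suggestions = [db.suggest(VARIANT_TO_PATHWAY[v]) for v in patient_variants]
print(suggestions)
```

The more outcomes are fed back, the more the ranking reflects real experience, which is why the early phase depends so heavily on feedback.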
7. Personalized genomic medicine. Phase II: applying the knowledge database
Patient
1)Genomic sequencing
2)Database of markers
3)Therapy prediction
Genomic core facility phase II
Clinician receives hints on possible prescriptions and therapeutic interventions
+
Other factors (risk, cost, etc.)
Prescription
Pre-symptomatic:
• Genetic predisposition to acquired diseases (>6000, some treatable)
• Early diagnosis of genetic diseases
Symptomatic analysis:
• Diagnosis of acquired diseases
• Early cancer detection
• Cancer treatment recommendation
8. From genetics to genomic medicine
Genetic medicine: Test 1, Test 2 → Therapy 1, Therapy 2, Therapy 3, ?
Genomic medicine: Test → Therapy 1, Therapy 2, Therapy 3, ? (+ feedback)
Genomic analysis allows patients to be matched with therapies from the very beginning, saving time and costs and increasing the success of treatments.
10. Preparing the scenario for the introduction of the genome in the clinic
Patient
Treatment
eHR
feedback
Corporate systems: Orion Clinic, Abucasis, Gaia, etc.
•Decision support techniques: algorithms that relate biomarkers to treatments, outcomes, etc. (gene prioritization and predictors)
•Integration of the data in the eHR
•Visualization and data presentation, ready for clinical interpretation
•Acceleration of algorithms for data pre-processing. Data storage optimization
11. Preparing the scenario for the introduction of the genome in the clinic
12. New Big Data storage strategies
CLOUD
Stage                               Time          File (exome size)
Automatic QC, sequence cleansing    8-10 hours    FASTQ (10 GB)
Mapping + QC                        8-12 hours    BAM (7 GB)
Variant calling + QC                8-12 hours    VCF (200 MB)
Data sizes are for exomes. For whole genomes, sizes are more than 20 times larger
Remote visualization of big data. Data production phase
e-health record
Final human supervision of data QC
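A quick back-of-the-envelope calculation shows what these file sizes mean for storage. The per-exome sizes and the ">20x" factor come from the slide; the cohort size of 1000 samples and the decimal units (1 TB = 1000 GB) are assumptions made only for illustration, and the factor of 20 is treated as a lower bound.

```python
# Per-exome file sizes from the slide, in GB.
EXOME_FILE_SIZES_GB = {"FASTQ": 10.0, "BAM": 7.0, "VCF": 0.2}

# The slide says whole genomes are ">20x" larger; 20 is used as a lower bound.
GENOME_FACTOR = 20


def storage_tb(n_samples, factor=1):
    """Total storage in TB if FASTQ, BAM and VCF are all kept for every sample."""
    per_sample_gb = sum(EXOME_FILE_SIZES_GB.values()) * factor
    return n_samples * per_sample_gb / 1000


n = 1000  # hypothetical cohort size
print(f"{n} exomes:  {storage_tb(n):.1f} TB")
print(f"{n} genomes: at least {storage_tb(n, GENOME_FACTOR):.1f} TB")
```

Keeping only the VCF reduces the footprint by almost two orders of magnitude, which is one reason storage strategy is part of the pipeline design.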
13. Tools developed to improve the pipeline
Genome Maps, an HTML5+SVG data visualization tool for VCF and BAM
•Genome-scale data visualization plays an important role in the data analysis process. It is a big data management problem.
•Features of Genome Maps (Medina, 2013, NAR; ICGC data analysis portal)
●First 100% HTML5 web-based genome browser: HTML5+SVG (inspired by Google Maps)
●Always up to date, no browser plugins or installation needed
●Data taken from CellBase, remote NGS data, local files and DAS servers: genes, transcripts, exons, SNPs, TFBS, miRNA targets, etc.
●Other features: multi-species, API-oriented, easy integration, plugin framework, etc.
BAM viewer
VCF viewer
ICGC genomic viewer
www.genomemaps.org
14. Preparing the scenario for the introduction of the genome in the clinic
15. Finding new biomarkers
Test
Therapy 1
Therapy 2
Therapy 3
?
feedback
Feedback: treatment failures are reanalyzed to search for:
1)Biomarkers (of failure)
2)Subgroups (to search for new personalized and rational therapeutic interventions)
Treatables
Failure treatment biomarkers
Group A biomarkers
Group A biomarkers
Irrelevant
Non treatables
Signaling
Protein interaction
Regulation
Variants are used as biomarkers to distinguish between responders and non-responders and to sub-classify non-responders
Rational design of therapies relies on Systems Biology concepts. Pathways are complex and must be understood with the proper bioinformatic tools
16. Preparing the scenario for the introduction of the genome in the clinic
17. BiERapp: interactive web-based tool for easy candidate prioritization by successive filtering
SEQUENCING CENTER
Data preprocessing
VCF
FASTQ
Genome Maps
BAM
BiERapp filters
NoSQL (MongoDB) VCF indexing
Population frequencies
Consequence types
Experimental design
BAM viewer and genomic context
Easy scale-up
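Successive filtering can be illustrated with a small sketch. It is not the BiERapp implementation: the variant records, gene names, the 1% frequency cutoff and the set of damaging consequence types are assumptions chosen for the example (the consequence names follow Sequence Ontology terms).

```python
# Hypothetical variant records, as they might look after annotation.
variants = [
    {"id": "v1", "gene": "GENE_A", "consequence": "missense_variant", "pop_freq": 0.20},
    {"id": "v2", "gene": "GENE_B", "consequence": "synonymous_variant", "pop_freq": 0.001},
    {"id": "v3", "gene": "GENE_C", "consequence": "stop_gained", "pop_freq": 0.0005},
]

MAX_POP_FREQ = 0.01  # assumed cutoff: discard variants common in the population
DAMAGING_CONSEQUENCES = {"missense_variant", "stop_gained", "frameshift_variant"}


def successive_filtering(variants, filters):
    """Apply the filters one after another, recording how many variants survive each step."""
    remaining = list(variants)
    counts = [len(remaining)]
    for keep in filters:
        remaining = [v for v in remaining if keep(v)]
        counts.append(len(remaining))
    return remaining, counts


filters = [
    lambda v: v["pop_freq"] <= MAX_POP_FREQ,              # population frequency filter
    lambda v: v["consequence"] in DAMAGING_CONSEQUENCES,  # consequence type filter
]

candidates, counts = successive_filtering(variants, filters)
print(counts, [v["id"] for v in candidates])
```

Recording the count after every step makes it easy to see which filter removes most of the irrelevant variants.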
18. BiERapp: the interactive filtering tool for easy candidate prioritization
Samples shown: NA19660, NA19661, NA19600, NA19685
http://bierapp.babelomics.org
Aleman et al., 2014 NAR
19. Successive filtering approach: an example with 3-methylglutaconic aciduria syndrome
3-Methylglutaconic aciduria (3-MGA-uria) is a heterogeneous group of syndromes characterized by an increased excretion of 3-methylglutaconic and 3-methylglutaric acids.
WES with a successive filtering approach is enough to detect the new mutation in this case.
20. Use known variants and their population frequencies to filter out irrelevant polymorphisms.
•Typically dbSNP, 1000 Genomes and the 6515 exomes from the ESP are used as sources of population frequencies.
•We sequenced 300 healthy controls (rigorously phenotyped) to add an extra filtering step to the analysis pipeline
Novembre et al., 2008. Genes mirror geography within Europe. Nature
Comparison of MGP controls to 1000g
How important do you think local information is to detect disease genes?
21. Filtering with or without local variants
Number of candidate genes as a function of the number of individuals in the study of a dominant disease (autosomal dominant retinitis pigmentosa)
The use of local variants makes an enormous difference
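The effect can be reproduced with a toy example. For a dominant disease, a candidate gene should carry a variant in every affected individual; variants also seen in local healthy controls are discarded before counting. All variant and gene names below are hypothetical, and the data are made up only to show the mechanics.

```python
def candidate_genes(individuals, local_control_variants=frozenset()):
    """Genes with at least one non-excluded variant in every affected individual."""
    per_individual = [
        {gene for variant, gene in ind if variant not in local_control_variants}
        for ind in individuals
    ]
    return set.intersection(*per_individual) if per_individual else set()


# Each affected individual is a set of (variant_id, gene) pairs (hypothetical data).
affected = [
    {("v1", "GENE_A"), ("v2", "GENE_B"), ("v3", "GENE_C")},
    {("v4", "GENE_A"), ("v2", "GENE_B"), ("v5", "GENE_D")},
    {("v6", "GENE_A"), ("v2", "GENE_B")},
]

# v2 is a polymorphism that also appears in local healthy controls.
LOCAL_CONTROL_VARIANTS = frozenset({"v2"})

for k in range(1, len(affected) + 1):
    without_local = len(candidate_genes(affected[:k]))
    with_local = len(candidate_genes(affected[:k], LOCAL_CONTROL_VARIANTS))
    print(f"{k} individuals: {without_local} genes without local filter, {with_local} with it")
```

In this toy data the shared local polymorphism keeps GENE_B among the candidates no matter how many individuals are added, unless local controls are used.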
22. New variants and disease genes found with WES and successive filtering
WES
IRDs
arRP (EYS)
BBS
arRP
arRP (USH2)
3-MGA-uria (SERAC1)
NBD (BCKDK)
23. The final schema: diagnosis and discovery
Knowledge DB
Freq. popul.
MiSeq, Ion Torrent, Ion Proton, Illumina
Future technologies
NO
Diagnosis
Therapeutic decision
New variants
Disease
All
Candidate prioritization
Data preprocessing
Sequence DB
Sequences
Freqs.
New knowledge for future diagnosis
24. Diagnosis by targeted sequencing (panels of genes)
Tool for defining panels
New filter based on local population variant frequencies
If no diagnostic variants appear, then secondary findings are studied
Diagnostic mutations
http://team.babelomics.org
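The panel-based decision flow can be sketched as follows. This is a simplified reading of the slide, not the tool's actual logic: "diagnostic" is taken here to mean a rare variant in a panel gene, and the function name, the 1% local frequency cutoff and all gene and variant names are assumptions.

```python
def panel_diagnosis(variants, panel_genes, local_freqs,
                    secondary_genes=frozenset(), max_local_freq=0.01):
    """Report diagnostic variants in the panel; fall back to secondary findings if none."""
    # Local population filter: variants absent from the local table count as frequency 0.
    rare = [v for v in variants if local_freqs.get(v["id"], 0.0) <= max_local_freq]
    diagnostic = [v["id"] for v in rare if v["gene"] in panel_genes]
    if diagnostic:
        return {"diagnostic": diagnostic, "secondary": []}
    secondary = [v["id"] for v in rare if v["gene"] in secondary_genes]
    return {"diagnostic": [], "secondary": secondary}


# Hypothetical patient data.
patient = [
    {"id": "v1", "gene": "GENE_A"},
    {"id": "v2", "gene": "GENE_B"},
    {"id": "v3", "gene": "GENE_C"},
]
local_freqs = {"v1": 0.15}  # v1 is common in the local population

report = panel_diagnosis(patient, {"GENE_A"}, local_freqs, secondary_genes={"GENE_C"})
print(report)
```

Here the only variant in the panel gene is locally common, so no diagnostic variant is reported and the secondary finding in GENE_C is returned instead.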
25. Implementation of tools in the IT4I Supercomputing Center (Czech Republic)
The pipelines for primary and secondary analysis developed by the Computational Genomics Department of the CIPF, in close collaboration with the Bull Chair, have proven their efficiency in the analysis of more than 1000 exomes in a joint collaborative project of the CIBERER and the MGP
A first pilot implementation has been done in the IT4I supercomputing center, which aims to centralize the analysis of genomics data in the country.
26. Implementation in the AVS
1PB DB
We have taken advantage of the corporate medical imaging system, which is already in operation, by following a very similar philosophy.
eHR
gateway
Upload image
Retrieve (by patient ID)
Genomic gateway
Pilot project with 20 leukemias
27. Gene discovery and diagnosis implemented
But… what about personalized treatments?
28. Are individualized treatments a realistic option?
Patient
Patient’s omic data: genomics and transcriptomics, epigenomics, proteomics, metabolomics
Biological knowledge: regulation, interaction, function
Systems biology computational models
Diagnostic biomarkers
Therapeutic targets
Network drugs
Cell culture
Xenograft model
Drug treatment
Best combination
Personalized therapy
Personalized medicine
Dopazo, 2003, Drug Discovery Today
29. Modeling pathways
The effect of gene expression on signaling can be estimated. Virtual KOs (or over-expressions) can be simulated
Colorectal cancer activates a signaling circuit of the VEGF pathway that produces PGI2.
Virtual KO of COX2 interrupts the circuit (known therapeutic inhibitor in CRC)
COX2 gene
KO
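A virtual KO can be illustrated with a deliberately simplified model. This is a sketch, not the modeling method used in the talk: the circuit is assumed to be a linear chain, the signal reaching the end is taken as the product of node activities scaled to [0, 1], and the node names other than COX2 as well as all activity values are hypothetical.

```python
def circuit_signal(circuit, activity):
    """Signal reaching the end of a linear circuit: product of node activities (0 to 1)."""
    signal = 1.0
    for node in circuit:
        signal *= activity[node]
    return signal


def knock_out(activity, gene):
    """Virtual KO: return a copy of the activities with one gene switched off."""
    return {**activity, gene: 0.0}


# Hypothetical linear circuit ending in the step that produces PGI2.
circuit = ["receptor", "intermediate", "COX2"]
activity = {"receptor": 0.9, "intermediate": 0.8, "COX2": 0.9}  # made-up values

before = circuit_signal(circuit, activity)
after = circuit_signal(circuit, knock_out(activity, "COX2"))
print(f"signal before KO: {before:.3f}, after COX2 KO: {after:.3f}")
```

A virtual over-expression works the same way, setting the activity of the chosen gene to 1.0 instead of 0.0.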
31. The ENCODE project suggests a functional role for a large fraction of the genome
What percentage of the genome is occupied by:
•Coding genes: 2.4%
•TFBSs: 8.1%
•Open chromatin regions: 15.2%
•Different RNA types: 62.0%
•Total annotated elements: 80.4%
Exomes cover only a small fraction (2.4%) of the potentially functional genome.
Is the missing heritability hidden in the remaining 78%?
If so, what types of variants should we expect to discover? SNVs? SVs?
32. Future prospects
We need to efficiently query all the information contained in the genome, including all the epigenomic signatures as well as the structural variation.
This involves data integration and “epistatic” queries.
We need to prepare our health systems to deal with the flood of genomic data
Information about variations          Processed    Raw
Genome variant information (VCF)      150 MB       250 GB
Epigenome                             150 MB       250 GB
Each transcriptome                    20 MB        80 GB
Individual complete variability       400 MB       525 GB
Hospital (100,000 patients)           40 TB        50 PB
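The hospital-level figures follow from scaling the per-individual totals. The short calculation below assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes); the per-individual sizes and the number of patients are taken from the table.

```python
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15

# Per-individual totals from the table.
PROCESSED_PER_INDIVIDUAL = 400 * MB
RAW_PER_INDIVIDUAL = 525 * GB
PATIENTS = 100_000

processed_total_tb = PROCESSED_PER_INDIVIDUAL * PATIENTS / TB
raw_total_pb = RAW_PER_INDIVIDUAL * PATIENTS / PB

print(f"Processed data for the hospital: {processed_total_tb:.1f} TB")
print(f"Raw data for the hospital: {raw_total_pb:.1f} PB")
```

The processed total reproduces the 40 TB in the table; the raw total comes out at 52.5 PB, which the table rounds to 50 PB.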
We are only starting to realize the dimension of the daunting challenges posed by genomic big data
There are technical (data size) and conceptual problems (data analysis) in the way genomic information is managed that must be addressed.
33. The Computational Genomics Department at the Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain, and…
...the INB, National Institute of Bioinformatics (Functional Genomics Node) and the CIBERER Network of Centers for Rare Diseases.
@xdopazo
@bioinfocipf