One of the most exciting frontiers in science is building automated systems that use existing biomedical data to understand and ultimately treat human disease. The key difficulty in the case of cancer is that it is a highly heterogeneous disease, making it challenging to uncover which molecular alterations in tumors are important for the disease and to predict how an individual will respond to treatment. This talk presents an overview of integrative computational methods for analyzing cancer genomes that leverage a diverse range of complementary data in order to extract biomedically relevant insights.
5. Cancer Genome Landscapes
32 cancer subtypes
11,315 patient samples
• 22 cancer subtypes
• 20,487 patient samples
• Many mutations per cancer genome
• Only a few mutations within an individual “drive” the cancer
Mutations per 1 million DNA bases
6. Cancer Genome Landscapes
❶ Discover “causal” cancer driver mutations and genes
❷ Predict drug response
8. Uncovering Cancer Genes In The Context of Other Information
Population Genomic Data
• 8,608,691 varying sites
• 60,706 individuals
Probabilistic Sequence Patterns
• 16,230 “domains” cover 88% of human proteins
Protein Structures
• >127,000 PDB structures
Biological Networks
• ~300,000 interactions
9. Proteins Function Through Interactions
[Figure: proteins 1 through 20K, each shown with its interaction types]
• protein−protein
• protein−DNA
• protein−RNA
• protein−ion
• protein−small molecule
13. 22,712 total genes in human
• 13,923 (61%) genes with computationally inferred interaction site info
• 2,871 (13%) genes w/ structural knowledge of any interaction sites
[Figure: example protein sequence MISILRRGLLVLLAAFPLLALAVQTPHEVVQSTTNELLGDLKANKE with a partial, per-position interaction potential from 0 to 1]
14. Uncovering Significantly Mutated Binding Sites
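[Figure: a protein drawn from N terminus to C terminus, with regions of modeled interactions and regions with no known interactions; somatic mutations are overlaid on per-position binding potentials (0 to 1)]
• Xi = sum of binding potentials where mutations land
• Analytically compute the mean and variance of Xi
• Z-score: Zi = (Xi − E[Xi]) / sqrt(Var[Xi])
• ~7X speedup per shuffle
• Typically >1,000 shuffles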
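The Z-score on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the PertInInt implementation: it assumes a null model in which each mutation lands independently and uniformly at random on one of the protein's positions, so the mean and variance of the summed binding potential follow in closed form. The function name `analytic_z_score` and the toy numbers are hypothetical.

```python
import math

def analytic_z_score(potentials, mutation_positions):
    """Z-score of the summed binding potential at mutated positions.

    Assumed null model (not taken from the slides): each mutation lands
    independently and uniformly at random on one of the protein's positions.
    """
    n_mut = len(mutation_positions)
    length = len(potentials)
    # Observed statistic X: sum of potentials where mutations land.
    observed = sum(potentials[pos] for pos in mutation_positions)
    # Per-draw mean and variance of the potential under uniform placement.
    mean_w = sum(potentials) / length
    var_w = sum((w - mean_w) ** 2 for w in potentials) / length
    # Independent draws: means and variances add across the n mutations.
    expected = n_mut * mean_w
    variance = n_mut * var_w
    if variance < 1e-12:
        return 0.0  # flat track or no mutations: nothing to score
    return (observed - expected) / math.sqrt(variance)

# Toy example: a 10-residue protein with a 2-residue binding site.
potentials = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
mutations = [2, 3, 2]  # 0-based positions; all three hit the binding site
z = analytic_z_score(potentials, mutations)
print(round(z, 2))  # 3.46
```

Because the mean and variance are available in closed form under this assumption, no per-gene shuffling is needed to obtain the score; an empirical alternative would re-place the mutations many times and compare the observed sum against the shuffled sums.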
17. PertInInt Identifies Cancer-Relevant Genes
[Figure: enrichment of genes in the Cancer Gene Census (y-axis, 0 to 30) vs. gene rank (x-axis, 1 to 200); curves: Frequency based, Conservation, Domain, Interaction, All]
~10 minutes to process 10,000+ tumor samples (2.4-2.7 GHz processor, <4 GB RAM)
18. PertInInt In Summary
• Perturbed interactions predictive for cancer genes
• Integrative framework identifies cancer-relevant genes
– Novel and distinct mutational avenues for driver genes
• Alternate way to prioritize mutations in an individual’s cancer
20. Tumor Growth in the Presence of Drug “X”
Illustration obtained from Verschoor et al. 2013
Compound is more effective
Compound is less effective
21. Drug Effectiveness Varies Across Tumor Cells
Source: Genomics of Drug Sensitivity in Cancer (GDSC)
Activity of 250 compounds on 960 cell lines
~160K drug-cell pairings
23. Our Data
Activity of drugs on diverse cell lines
Gene expression measurements on untreated cells
Chemical structure of drugs
Goal: Predict activity of drugs on a tumor using gene expression profiles and drug features
New tumors, new drugs!
24. Genomics Data Has Modular Structure
Illustration obtained from https://rgd.mcw.edu/rgdweb/pathway/pathwayRecord.html?acc_id=PW:0000
Can we use this modular information to aid in our predictions?
25. Solution
Use modular knowledge of cellular function
Starting feature space: 960 cell lines x 20K features
(Resistant/Sensitive)
26. Approach: Autoencoders
Neural networks that learn a reduced feature space, guided by the modular structure of genomic data.
Use gene set autoencoded features for prediction
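One way to picture gene set autoencoding is to train a small autoencoder per gene set and concatenate the resulting codes. The NumPy sketch below is an illustration under stated assumptions, not the architecture from the talk: the one-hidden-layer network, the tanh activation, the hyperparameters, the function names, and the toy gene sets `pathway_A` and `pathway_B` are all hypothetical.

```python
import numpy as np

def train_autoencoder(X, n_hidden, n_epochs=300, lr=0.05, seed=0):
    """Fit a one-hidden-layer autoencoder (tanh encoder, linear decoder)
    by full-batch gradient descent on squared reconstruction error.
    Returns the encoder weights, encoder bias, and per-epoch loss."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    W_enc = rng.normal(0.0, 0.1, size=(n_features, n_hidden))
    b_enc = np.zeros(n_hidden)
    W_dec = rng.normal(0.0, 0.1, size=(n_hidden, n_features))
    b_dec = np.zeros(n_features)
    losses = []
    for _ in range(n_epochs):
        H = np.tanh(X @ W_enc + b_enc)              # encode
        X_hat = H @ W_dec + b_dec                   # decode
        diff = X_hat - X
        losses.append(0.5 * np.sum(diff ** 2) / n_samples)
        d_out = diff / n_samples                    # gradient w.r.t. X_hat
        d_hid = (d_out @ W_dec.T) * (1.0 - H ** 2)  # backprop through tanh
        W_dec -= lr * (H.T @ d_out)
        b_dec -= lr * d_out.sum(axis=0)
        W_enc -= lr * (X.T @ d_hid)
        b_enc -= lr * d_hid.sum(axis=0)
    return W_enc, b_enc, losses

def encode_by_gene_set(X, gene_sets, n_hidden=2):
    """Train one autoencoder per gene set and concatenate the hidden codes."""
    codes = []
    for name in sorted(gene_sets):
        X_set = X[:, gene_sets[name]]
        W_enc, b_enc, _ = train_autoencoder(X_set, n_hidden)
        codes.append(np.tanh(X_set @ W_enc + b_enc))
    return np.hstack(codes)

# Toy data: 30 cell lines x 12 genes, split into two hypothetical gene sets.
rng = np.random.default_rng(1)
expression = rng.normal(size=(30, 12))
gene_sets = {"pathway_A": [0, 1, 2, 3, 4, 5], "pathway_B": [6, 7, 8, 9, 10, 11]}
reduced = encode_by_gene_set(expression, gene_sets, n_hidden=2)
print(reduced.shape)  # (30, 4)
```

Each gene set contributes `n_hidden` features, so the reduced matrix grows with the number of gene sets rather than the number of genes.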
27. Merging Multiple Genomic Sources
Mutation within known cancer genes (CGCs)
Reduced set of gene expression values
28. The Other Half: Structure of Drugs
Features: 2D structural descriptors (chemical subgroups) and physical features (e.g., size, charge)
Starting space: 250 drug compounds
30. Physical Features
1444 PaDEL physicochemical features from SMILES strings.
Molecular free energy, volume, topology.
Apply autoencoders to reduce the feature space to 90 features
35. Summary
• Biologically-guided deep net approach to predict response to drugs
• By training model across drugs and tumors, can make predictions for new drugs & tumors
• Ultimate goals:
– Personalized oncology
– In silico drug development
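The idea of training across drugs and tumors can be illustrated by describing every (cell line, drug) pairing with the concatenation of the two feature vectors, so a fitted model can score a drug it never saw in training. The sketch below is only a toy under stated assumptions: logistic regression stands in for the deep net, the labels are synthetic, and all names and sizes are hypothetical.

```python
import numpy as np

def build_pair_features(cell_features, drug_features, pairs):
    """Concatenate cell-line and drug feature vectors for each (cell, drug) pair."""
    cell_idx = np.array([c for c, _ in pairs])
    drug_idx = np.array([d for _, d in pairs])
    return np.hstack([cell_features[cell_idx], drug_features[drug_idx]])

def predict_proba(X, w, b):
    """Predicted probability of the positive class (e.g., 'sensitive')."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def fit_logistic(X, y, n_epochs=200, lr=0.1):
    """Logistic regression by full-batch gradient descent.
    A simple stand-in for the deep net described in the talk."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        grad = predict_proba(X, w, b) - y   # gradient of the log loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# Toy data: 8 cell lines x 3 reduced expression features,
# 5 drugs x 2 reduced structure features.
rng = np.random.default_rng(0)
cells = rng.normal(size=(8, 3))
drugs = rng.normal(size=(5, 2))

# Train on drugs 0 to 3; drug 4 is held out as a "new drug".
train_pairs = [(c, d) for c in range(8) for d in range(4)]
X_train = build_pair_features(cells, drugs, train_pairs)
y_train = (X_train[:, 0] + X_train[:, 3] > 0).astype(float)  # synthetic labels
w, b = fit_logistic(X_train, y_train)

new_pairs = [(c, 4) for c in range(8)]
probs = predict_proba(build_pair_features(cells, drugs, new_pairs), w, b)
print(probs.shape)  # (8,)
```

A linear model on concatenated features cannot represent interactions between tumor and drug features; capturing those would need a nonlinear model such as a neural network.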