17th Annual International Conference on Critical Assessment of Massive Data Analysis (CAMDA 2018)
Cancer Data Integration Challenge (http://camda.info/)
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk for Survival Prediction in Multiple Cancer Studies
1. Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk for Survival Prediction in Multiple Cancer Studies
So Yeon Kim, Hyun-Hwan Jeong, Jaesik Kim,
Jeong-Hyeon Moon, Kyung-Ah Sohn
4. Motivation (1/3)
• The rich information in multi-omics data provides opportunities for better biological understanding and improved clinical outcome prediction
• Integrative analysis is important for discovering interrelationships between multiple levels of data
Weinstein et al. Nature genetics, 2013
5. Motivation (2/3)
• Graph-based integration methods combine multi-omics data effectively because they account for the interactions between different types of genomic data
Kim et al. Journal of the American Medical Informatics Association, 2014
6. Motivation (3/3)
• Incorporating genomic knowledge such as pathway information into the integrated graph can increase prediction power and help find important genes and pathways in cancers
Liu et al. Bioinformatics, 2013
7. Related works (1/7)
• Pathway-based integrative methods
− They transform a single genomic profile into a pathway profile using an activity scoring measure
− Pathway level analysis of gene expression (PLAGE) uses the first singular vector from the singular value decomposition of a given gene set's expression matrix
Tomfohr et al. Bioinformatics, 2005
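As a rough illustration of the PLAGE idea, the sketch below standardizes each gene across samples and takes the first right singular vector of the pathway's genes × samples matrix as the per-sample activity. It uses plain-Python power iteration on the sample Gram matrix as a stand-in for a full SVD; the function names and the toy matrix are hypothetical, and the sign of a singular vector is arbitrary.

```python
import math

def standardize_rows(matrix):
    """Z-score each gene (row) across samples; constant rows become zeros."""
    out = []
    for row in matrix:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((x - mean) ** 2 for x in row) / (len(row) - 1))
        out.append([(x - mean) / sd if sd > 0 else 0.0 for x in row])
    return out

def plage_activity(expr, n_iter=200):
    """Approximate the first right singular vector of a genes x samples matrix.

    Power iteration on X^T X (samples x samples) replaces a full SVD here;
    the returned unit vector holds one activity value per sample.
    """
    x = standardize_rows(expr)
    n_samples = len(x[0])
    # Gram matrix over samples: G = X^T X
    gram = [[sum(g[i] * g[j] for g in x) for j in range(n_samples)]
            for i in range(n_samples)]
    v = [1.0 + 0.1 * i for i in range(n_samples)]  # arbitrary starting vector
    for _ in range(n_iter):
        w = [sum(gram[i][j] * v[j] for j in range(n_samples))
             for i in range(n_samples)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:
            return [0.0] * n_samples
        v = [c / norm for c in w]
    return v

# Toy pathway: 3 genes x 4 samples (hypothetical values)
toy_pathway = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
activity = plage_activity(toy_pathway)
```

In the toy matrix all three genes follow the same trend up to sign, so the activity vector is simply that trend scaled to unit length.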
8. Related works (2/7)
• Pathway-based integrative methods
− The z-score method converts the gene expression profile into z-scores and combines the z-scores of the genes in each pathway per sample
− These methods treat a pathway as a plain set of genes
− It is better to also consider gene-gene interactions
Lee et al. PLoS Comput Biol, 2008
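The z-score approach can be sketched in a few lines. The example below z-scores each gene across samples and then combines the member genes of a pathway per sample; dividing the sum by the square root of the number of member genes is an assumption of this sketch (one common way to combine z-scores), and the gene names and values are hypothetical.

```python
import math

def zscore_genes(expr):
    """Z-score each gene across samples. expr maps gene -> list of values."""
    scored = {}
    for gene, values in expr.items():
        mean = sum(values) / len(values)
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
        scored[gene] = [(v - mean) / sd if sd > 0 else 0.0 for v in values]
    return scored

def pathway_activity(zscores, pathway_genes):
    """Combine the z-scores of the pathway's genes into one value per sample."""
    members = [g for g in pathway_genes if g in zscores]
    if not members:
        raise ValueError("no pathway genes found in the expression profile")
    n_samples = len(zscores[members[0]])
    scale = math.sqrt(len(members))  # assumed normalization
    return [sum(zscores[g][i] for g in members) / scale
            for i in range(n_samples)]

# Hypothetical profile: 3 genes x 3 samples; G9 is absent and is skipped
toy_expr = {"G1": [1.0, 2.0, 3.0], "G2": [2.0, 4.0, 6.0], "G3": [5.0, 5.0, 5.0]}
toy_activity = pathway_activity(zscore_genes(toy_expr), ["G1", "G2", "G9"])
```

Note that this score ignores which genes interact with which, which is the limitation the slide points out.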
9. Related works (3/7)
• Some methods utilized gene-gene interactions on a graph
− A denoising algorithm based on relevance network topology (DART) integrates pathways by deriving perturbation signatures that reflect gene contributions in each pathway
Jiao et al. Bioinformatics, 2011
10. Related works (4/7)
• Some methods utilized gene-gene interactions on a graph
− A directed random walk-based pathway activity inference method (DRW) identifies topologically important genes and pathways by weighting the genes in the gene-gene network
Liu et al. Bioinformatics, 2013
11. Related works (5/7)
• Some methods utilized gene-gene interactions on a graph
− DRW-GM extends DRW to integrated multi-omics data
− It improved prediction performance
− It found many risk metabolite pathways and topologically important genes for cancer through a joint analysis of gene expression and metabolite data
Liu, et al. Scientific reports, 2015
12. Related works (6/7)
• Integrative DRW (iDRW) incorporates interactions between gene expression and methylation features by building on DRW-based methods
Kim et al. BMC Medical Genomics, 2018 (to appear)
13. Related works (7/7)
• iDRW improved survival prediction power and jointly analyzed gene expression and methylation data on an integrated gene-gene graph
Kim et al. BMC Medical Genomics, 2018 (to appear)
14. Overview
• Investigate the effectiveness of the iDRW method on other types of genomic profiles for two different cancers
• Reflect the interactions between gene expression and copy number data on an integrated graph
• Construct the graph with the updated pathway database
• Classify breast cancer and neuroblastoma patient samples into survival groups
17. Integrated gene-gene graph construction (1/2)
• 327 human pathways and corresponding gene sets from the KEGG database
• Interactions between genes were defined using the R KEGGgraph package
• Integrated directed gene-gene graph
− 7,390 nodes and 58,426 edges
[Figure: genes A and B as graph nodes, derived from the KEGG PATHWAY Database]
18. Integrated gene-gene graph construction (2/2)
• To reflect the impact of copy number variation on gene expression, we assign directed edges to all overlapping genes, i.e. genes present in both the gene expression and the copy number data
[Figure: gene expression and copy number alteration graphs linked through their overlapping genes]
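A minimal sketch of how such an integrated graph could be assembled is given below. It assumes that each data type keeps its own copy of the pathway edges and that the cross-layer edge for an overlapping gene runs from its copy number node to its expression node; both the layered layout and the edge direction are assumptions of this sketch, and all names are hypothetical.

```python
def build_integrated_graph(pathway_edges, expr_genes, cna_genes):
    """Return a set of directed edges over (gene, layer) nodes.

    layer is "EXP" or "CNA". Within-layer edges copy the pathway
    interactions; cross-layer edges run CNA -> EXP for overlapping genes
    (the direction is an assumption of this sketch).
    """
    edges = set()
    for layer, genes in (("EXP", expr_genes), ("CNA", cna_genes)):
        for src, dst in pathway_edges:
            if src in genes and dst in genes:
                edges.add(((src, layer), (dst, layer)))
    # Link the two layers through the genes measured in both data types
    for gene in expr_genes & cna_genes:
        edges.add(((gene, "CNA"), (gene, "EXP")))
    return edges

# Hypothetical toy input: two pathway edges, gene B measured in both layers
toy_edges = [("A", "B"), ("B", "C")]
graph = build_integrated_graph(toy_edges, {"A", "B"}, {"B", "C"})
```

In the toy input only gene B is measured in both data types, so the result contains one expression edge, one copy number edge and one cross-layer edge.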
19. Pathway activity inference
• The weight of gene g, $w_g$, is the p-value from
− DESeq2 analysis (RNA-Seq)
− Two-tailed t-test (microarray)
− $\chi^2$-test of independence (copy number data)
• Weight initialization: $W_0 = -\log(w_g + \epsilon)$
[Figure: gene expression and CNA matrices (genes × samples) with entries $z_{gi}$]
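The weight initialization and a directed random walk over the graph can be sketched as follows. The update rule W ← (1 − r)·MᵀW + r·W₀ with a row-normalized adjacency matrix M is the usual random walk with restart and is assumed here, as are the restart value r = 0.7, the normalization of W₀ to sum to 1, and the toy p-values.

```python
import math

def initial_weights(p_values, eps=1e-10):
    """W_0 = -log(w_g + eps), normalized to sum to 1 (normalization assumed)."""
    raw = {g: max(0.0, -math.log(p + eps)) for g, p in p_values.items()}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("all initial weights are zero")
    return {g: v / total for g, v in raw.items()}

def directed_random_walk(edges, w0, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate W <- (1 - restart) * M^T W + restart * W_0 until convergence.

    M is the row-normalized adjacency matrix, so each node passes its
    weight to its out-neighbours in equal shares. Every node appearing
    in edges must have an entry in w0.
    """
    out_neighbours = {g: [] for g in w0}
    for src, dst in edges:
        out_neighbours[src].append(dst)
    w = dict(w0)
    for _ in range(max_iter):
        nxt = {g: restart * w0[g] for g in w0}
        for src, targets in out_neighbours.items():
            if targets:
                share = (1 - restart) * w[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
        delta = sum(abs(nxt[g] - w[g]) for g in w0)
        w = nxt
        if delta < tol:
            break
    return w

# Hypothetical p-values: gene A is strongly significant, B and C are not
toy_p = {"A": 0.001, "B": 0.5, "C": 0.5}
toy_edges = [("A", "B"), ("B", "C")]
w0 = initial_weights(toy_p)
w_final = directed_random_walk(toy_edges, w0)
```

In this toy chain the weight of the significant gene A flows downstream, so B and C end up with more weight than their own p-values alone would give them.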
23. Pathway feature selection and survival prediction
• Feature ranking strategy
− Pathways are ranked by the p-values from the t-test of their pathway activities
− The top-k pathways across samples are the input to the classification model
[Figure: pathway profile (pathways × samples) of activities $a_{P_j}$, ranked by t-test p-value for top-k pathway feature selection]
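The ranking step can be sketched without a statistics library. The example below assumes a pooled-variance two-sample t-test between the good and poor groups; because every pathway then shares the same degrees of freedom, sorting by |t| in descending order gives the same order as sorting by two-tailed p-value in ascending order. Function names and toy activities are hypothetical.

```python
import math

def pooled_t_statistic(a, b):
    """Two-sample Student t statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    pooled_var = (ssa + ssb) / (na + nb - 2)
    if pooled_var == 0:
        return 0.0 if ma == mb else math.copysign(math.inf, ma - mb)
    return (ma - mb) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def top_k_pathways(activities, labels, k):
    """Rank pathways by |t| between 'good' and 'poor' samples; keep the top k.

    activities maps pathway -> list of per-sample activity values.
    """
    scores = {}
    for pathway, values in activities.items():
        good = [v for v, y in zip(values, labels) if y == "good"]
        poor = [v for v, y in zip(values, labels) if y == "poor"]
        scores[pathway] = abs(pooled_t_statistic(good, poor))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical pathway profile: 3 pathways x 6 samples
toy_labels = ["good", "good", "good", "poor", "poor", "poor"]
toy_profile = {
    "P1": [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],
    "P2": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
    "P3": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
}
selected = top_k_pathways(toy_profile, toy_labels, k=2)
```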
24. Pathway feature selection and survival prediction
• Survival prediction
− A logistic regression model classifies the samples into good and poor groups
− The number k of top pathway features is chosen empirically as the value that gave the best classification performance
[Figure: ranked pathway profile (pathways × samples) of activities $a_{P_j}$, with example pathways 00410 and 00060, feeding risk-active pathway identification and survival prediction]
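To make the classification step concrete, here is a small logistic regression fitted by batch gradient descent in plain Python. It is only a sketch: the learning rate, epoch count, 0.5 decision threshold and the toy feature matrix (two hypothetical pathway activities per sample) are assumptions, and a real analysis would more likely rely on an established library.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b):
    """Return 1 (good) if the predicted probability is at least 0.5, else 0 (poor)."""
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
            for xi in X]

# Hypothetical top-2 pathway activities for six samples; 1 = good, 0 = poor
toy_X = [[2.0, 0.1], [1.5, -0.2], [1.8, 0.3],
         [-1.7, 0.0], [-2.1, 0.2], [-1.4, -0.1]]
toy_y = [1, 1, 1, 0, 0, 0]
weights, bias = fit_logistic(toy_X, toy_y)
predictions = predict(toy_X, weights, bias)
```

Selecting k empirically would then amount to repeating this fit for several values of k and keeping the one with the best validation accuracy.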
26. Challenge Dataset (1/2)
• Breast cancer patient data from the METABRIC dataset
− mRNA expression profile of 24,368 genes from the Illumina Human v3 microarray, as log intensity levels
− Putative copy-number alteration data for 22,544 genes
− 1,648 patient samples divided into 908 good (> 10 years) and 740 poor (≤ 10 years) samples
Average survival: 10 years
Average age at diagnosis: 62
27. Performance evaluation (1/2)
Survival prediction: 1,648 patients, 908 in the good group (long-term survival) and 740 in the poor group (short-term survival)
Confusion matrix (rows: actual group, columns: predicted group)
Actual \ Predicted | good | poor
good | TP | FN
poor | FP | TN
Classification accuracy:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$
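The accuracy formula translates directly into code. The sketch below counts the four confusion-matrix cells with "good" as the positive class, as on the slide, and then applies the formula; the function names and the toy labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive="good"):
    """Count TP, FN, FP, TN with `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn

def accuracy(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / (TP + FN + FP + TN)."""
    return (tp + tn) / (tp + fn + fp + tn)

# Hypothetical labels for five patients
toy_actual = ["good", "good", "good", "poor", "poor"]
toy_predicted = ["good", "poor", "good", "poor", "good"]
counts = confusion_counts(toy_actual, toy_predicted)
toy_accuracy = accuracy(*counts)
```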
28. Performance evaluation (1/2)
5-fold cross-validation: the 1,648 patients are split into five folds (fold 1 to fold 5); each fold is used once as the validation set while the remaining four folds form the training set
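The fold construction can be sketched with the standard library alone. The generator below shuffles the sample indices with a fixed seed and yields one (training, validation) split per fold; the seed, the unstratified split and the function name are assumptions of this sketch. Setting k equal to the number of samples gives leave-one-out cross-validation.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, validation) index lists for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# 5-fold split over 1,648 samples, as in the breast cancer evaluation
splits = list(k_fold_indices(1648, 5))

# Leave-one-out is the special case k == n_samples
loo_splits = list(k_fold_indices(144, 144))
```

Each sample appears in exactly one validation fold, so averaging the per-fold accuracies uses every patient once for evaluation.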
29. Challenge Dataset (2/2)
• Neuroblastoma dataset from NCBI GSE49711
− RNA sequencing gene expression profile of 60,586 genes
− DNA copy number data for 22,692 genes
− 144 patient samples are divided into 38 good and 105 poor samples (binary class label for overall survival days provided by the NCBI dataset)
[Figure: patient chart with counts 88 and 56]
Average survival: < 1 year
Average age at diagnosis: 16 months
30. Performance evaluation (2/2)
Survival prediction: 144 patients, 38 in the good group (long-term survival) and 105 in the poor group (short-term survival)
Confusion matrix (rows: actual group, columns: predicted group)
Actual \ Predicted | good | poor
good | TP | FN
poor | FP | TN
Classification accuracy:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$
31. Performance evaluation (2/2)
Leave-one-out cross-validation: with N patients there are N folds (fold 1 to fold N); each patient is used once as the validation set while the remaining N − 1 patients form the training set
32. Pathway-based methods
• For the gene expression data in each dataset, four pathway-based methods were compared
− PLAGE [Tomfohr et al. Bioinformatics, 2005]
− Z-score [Lee et al. PLoS Comput Biol, 2008]
− DART [Jiao et al. Bioinformatics, 2011]
− DRW [Liu et al. Bioinformatics, 2013]
• Their classification performance was evaluated in the same way as for the proposed method
33. Integrative analysis on multi-omics data improves
survival prediction performance (1/2)
• Four pathway-based methods applied to the gene expression profile alone
• The iDRW method applied to the gene expression profile and copy number data in breast cancer (A) or neuroblastoma (B) patients
[Figure panels: (A) Breast cancer, (B) Neuroblastoma]
34. Integrative analysis on multi-omics data improves
survival prediction performance (2/2)
• Performance improved when interactions between genes on a graph were utilized
• In particular, DRW-based methods contributed more to the performance improvement
• iDRW performed the best in both cancer datasets
[Figure panels: Breast cancer, Neuroblastoma]
35. iDRW identifies cancer-associated pathways and genes (1/5)
Dataset: Breast cancer (k = 25)
Pathway ID | Pathway name | Total genes | EXP | CNA
hsa04740 | Olfactory transduction | 419 | 54 | 268
hsa04014 | Ras signaling pathway | 232 | 68 | 164
hsa04015 | Rap1 signaling pathway | 206 | 64 | 142
hsa04916 | Melanogenesis | 101 | 37 | 73
hsa04722 | Neurotrophin signaling pathway | 119 | 38 | 84
hsa05200 | Pathways in cancer | 526 | 166 | 359
hsa04933 | AGE-RAGE signaling pathway in diabetic complications | 99 | 37 | 67
hsa04530 | Tight junction | 170 | 53 | 107
hsa04510 | Focal adhesion | 199 | 76 | 125
hsa04080 | Neuroactive ligand-receptor interaction | 278 | 64 | 193
hsa05225 | Hepatocellular carcinoma | 168 | 56 | 112
hsa04020 | Calcium signaling pathway | 182 | 59 | 136
hsa04024 | cAMP signaling pathway | 198 | 58 | 139
Top-k pathways ranked by the iDRW method in breast cancer. For each pathway, the table shows the total number of genes and the number of significant genes (p-value $w_g$ < 0.05) from gene expression (EXP) or copy number data (CNA).
36. iDRW identifies cancer-associated pathways and genes (2/5)
Dataset: Breast cancer (k = 25), continued
Pathway ID | Pathway name | Total genes | EXP | CNA
hsa04217 | Necroptosis | 164 | 49 | 97
hsa04060 | Cytokine-cytokine receptor interaction | 270 | 70 | 192
hsa05152 | Tuberculosis | 179 | 58 | 112
hsa05165 | Human papillomavirus infection | 319 | 103 | 210
hsa04810 | Regulation of actin cytoskeleton | 208 | 64 | 132
hsa04151 | PI3K-Akt signaling pathway | 352 | 119 | 241
hsa04022 | cGMP-PKG signaling pathway | 163 | 58 | 109
hsa04630 | Jak-STAT signaling pathway | 162 | 43 | 112
hsa05167 | Kaposi's sarcoma-associated herpesvirus infection | 186 | 61 | 114
hsa04010 | MAPK signaling pathway | 295 | 87 | 209
hsa04371 | Apelin signaling pathway | 137 | 46 | 99
hsa04390 | Hippo signaling pathway | 154 | 58 | 100
Top-k pathways ranked by the iDRW method in breast cancer. For each pathway, the table shows the total number of genes and the number of significant genes (p-value $w_g$ < 0.05) from gene expression (EXP) or copy number data (CNA).
37. iDRW identifies cancer-associated pathways and genes (3/5)
Hanahan et al. Cell, 2011
Six biological capabilities acquired during tumor development
Some of the top-ranked pathways (Ras signaling, Necroptosis, Regulation of actin cytoskeleton, and PI3K-Akt signaling pathway) are related to at least one of these six capabilities
“…overexpression of 34 Olfactory Receptors genes has been reported in patients bearing breast tumors caused by CHEK2 1100delC mutation…”
38. iDRW identifies cancer-associated pathways and genes (4/5)
Dataset: Neuroblastoma (k = 5)
Pathway ID | Pathway name | Total genes | EXP | CNA
hsa04976 | Bile secretion | 71 | 13 | 5
hsa05034 | Alcoholism | 180 | 22 | 7
hsa01100 | Metabolic pathways | 1273 | 43 | 93
hsa04080 | Neuroactive ligand-receptor interaction | 278 | 21 | 24
hsa04151 | PI3K-Akt signaling pathway | 352 | 19 | 31
Top-k pathways ranked by the iDRW method in neuroblastoma data. For each pathway, the table shows the total number of genes and the number of significant genes (p-value $w_g$ < 0.05) from gene expression (EXP) or copy number data (CNA).
39. iDRW identifies cancer-associated pathways and genes (5/5)
“… we propose a mechanism underlying a potent and selective anti-tumor effect of LCA in cultured human neuroblastoma cells …”
“…the level of Urinary catecholamine metabolites which consist of vanillylmandelic acid (VMA), homovanillic acid (HVA) and dopamine elevated in neuroblastoma patients…”
40. Conclusions
• We showed the effectiveness of an integrative directed random walk-based method utilizing pathway information (iDRW) on different cancer datasets
• We benchmarked iDRW against several state-of-the-art pathway-based methods for survival prediction
41. Conclusions
• Contributions
− Revamped the directed gene-gene graph to account for the interactions between gene expression and copy number data
− Jointly identified cancer-related pathways and genes from gene expression and copy number data for the breast cancer and neuroblastoma datasets
42. Acknowledgements
All lab members of LAMDA lab
Kyung-Ah Sohn
Byungkon Kang
Yenewondim Biadgie
Garam Lee
Habtamu Minassie Aycheh
Sehee Wang
Jungryul Seo
Nam-Hyuk Ahn
Min-Soo Kim
Tae-rim Kim
Young-Bum Choi
Jun-hyung Yu
Jeong-hyun Moon
Jaesik Kim
Sijin Kim
Heejin Kim
Joon-seon Hwang
Hyun-Hwan Jeong, Ph.D., Post-doctoral associate, Baylor College of Medicine, Texas Children’s Hospital
Kyung-Ah Sohn, Ph.D., Associate Professor, Department of Software and Computer Engineering, Ajou University
Jaesik Kim, Graduate student, Masters course, Department of Software and Computer Engineering, Ajou University
Jeong-Hyeon Moon, Graduate student, Masters course, Department of Software and Computer Engineering, Ajou University