Comparative genomics
Standard expression experiments: cases vs. controls -> differential genes -> interpretation
Problems
Small number of samples
Non-specific signal
Interpretation of a gene set / gene ranking
Goal: find specific changes for a tested disease
E.g., an up-regulated pathway
Crucial for clinical studies
Previous integrative classification studies
Huang et al. PNAS 2010 (9,160 samples); Schmid et al. PNAS 2012 (3,030 samples); Lee et al. Bioinformatics 2013 (~14,000 samples)
Multilabel classification
Global expression patterns
Only 1-3 platforms
Many datasets were removed from GEO
No “healthy” class (Huang); no diseases (Lee)
Pathprint (Altschuler et al. 2013)
Use pathways
Tissue classification (as in Lee et al.)
Integrating pathways and molecular profiles
Enrichment tests
Improves interpretability
GSEA, GSA
Rank-based
Higher statistical power
Classification
Extract pathway features
Example: given a pathway, remove its non-differential genes
Not clear whether prediction performance improves compared to using genes (Staiger et al. 2013)
Workflow overview (schematic)
Pathways: KEGG, Reactome, Biocarta, NCI
Expression profiles: GSE, GDS, TCGA
Sample labels: disease terms from dataset/sample descriptions
Platform data
Single-sample, single-pathway analysis: for each pathway, compute the mean and SD
XP: samples x pathway features
Y: samples x disease labels
Single sample analysis (schematic)
Sample j: ranked genes/transcripts -> weighted ranks, $W_i = i \, e^{-i/k}$ -> standardized profile (from low expression to high expression)
Single sample analysis
Input: an expression profile of a sample
A vector of real values for each patient
Step 1: rank the genes
Step 2: calculate a score for each gene
Score inputs: the rank of gene g in sample s, and the total number of ranked genes
(Yang et al. 2012, 2013)
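The two steps above (rank the genes, then score each gene from its rank) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weighting r * exp(-r / k) and the rank direction (rank 1 = lowest expression) are assumptions, and the gene names and values are made up.

```python
import math

def weighted_rank_scores(expression):
    """Map {gene: expression value} to {gene: rank-based score}.

    Assumptions (not confirmed by the slides): rank 1 is the
    lowest-expressed gene, rank k the highest, and the score is
    r * exp(-r / k), where k is the total number of ranked genes.
    Ties are broken by input order.
    """
    genes = sorted(expression, key=lambda g: expression[g])  # ascending
    k = len(genes)
    return {g: r * math.exp(-r / k) for r, g in enumerate(genes, start=1)}

# Hypothetical expression profile of one sample.
profile = {"TP53": 7.2, "MYC": 9.1, "GAPDH": 12.4, "BRCA1": 5.0}
scores = weighted_rank_scores(profile)
for gene, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{gene}: {score:.3f}")
```

Because the score depends only on ranks, it is comparable across samples and platforms with different intensity scales. Under this weighting the score increases with rank for all r up to k.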
Pathway features
1723 pathways in total
Covering 7842 genes
Mean size: 36.35 (median 15)
Score all genes that are in the pathway databases
Pathway statistics:
Mean score
Standard deviation
Skewness
KS test
Pathway DBs: KEGG, Reactome, Biocarta, NCI
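A minimal sketch of the four pathway statistics for one sample and one pathway. The slides do not say which KS test is used; here it is assumed to be a two-sample KS statistic comparing the scores of pathway genes against all other scored genes. Function names and the toy data are hypothetical.

```python
import math

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the ECDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    for x in sorted(set(a) | set(b)):
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def pathway_features(scores, pathway_genes):
    """Mean, SD, skewness, and KS statistic of a pathway's gene scores."""
    members = set(pathway_genes)
    inside = [v for g, v in scores.items() if g in members]  # skips unscored genes
    outside = [v for g, v in scores.items() if g not in members]
    n = len(inside)
    mean = sum(inside) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in inside) / n)  # population SD
    skew = sum((v - mean) ** 3 for v in inside) / n / sd ** 3 if sd > 0 else 0.0
    return {"mean": mean, "sd": sd, "skewness": skew,
            "ks": ks_statistic(inside, outside)}

# Toy gene scores for one sample and a hypothetical three-gene pathway.
toy_scores = {"g1": 1.0, "g2": 2.0, "g3": 3.0, "g4": 4.0, "g5": 5.0, "g6": 6.0}
features = pathway_features(toy_scores, ["g4", "g5", "g6"])
print(features)
```

Each sample then contributes one row of pathway features, which is how a matrix of samples by pathway features could be assembled.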
Patient labels
Unite ~180 datasets, >14,000 samples
Public databases contain ‘free text’
Problem: automatic mapping fails. Example:
GDS4358: “lymph-node biopsies from classic Hodgkins lymphoma HIV- patients before ABVD chemotherapy”
MetaMap top score: “HIV infections”
Solution: manual analysis
Read descriptions and papers
Current microarray data
Data from GEO
13,314 samples
17 platforms
Sample annotation
Ignore terms with fewer than 100 samples or fewer than 5 datasets
48 disease terms
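The term filter can be sketched as below. It assumes a term is kept only when it meets both thresholds (at least 100 samples and at least 5 datasets); the triple layout and all identifiers are hypothetical.

```python
from collections import defaultdict

def frequent_terms(annotations, min_samples=100, min_datasets=5):
    """Return the disease terms that pass both support thresholds.

    annotations: iterable of (sample_id, dataset_id, term) triples.
    """
    samples = defaultdict(set)
    datasets = defaultdict(set)
    for sample_id, dataset_id, term in annotations:
        samples[term].add(sample_id)
        datasets[term].add(dataset_id)
    return sorted(
        t for t in samples
        if len(samples[t]) >= min_samples and len(datasets[t]) >= min_datasets
    )

# Toy annotations: (term, number of datasets, samples per dataset).
toy = []
for term, n_datasets, per_dataset in [("asthma", 6, 20), ("rare", 2, 100), ("small", 5, 10)]:
    for d in range(n_datasets):
        for s in range(per_dataset):
            toy.append((f"{term}-{d}-{s}", f"{term}-D{d}", term))

kept = frequent_terms(toy)
print(kept)
```

Requiring several datasets per term matters for the later leave-dataset-out validation, since a term seen in a single dataset cannot be both trained on and tested.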
XP: samples x pathway features
Y: samples x disease terms, entries in {0,1}
Multi-label classification algorithms
Learn a single classifier for each disease
Ignore class dependencies
Adaptation: Bayesian correction
Learn single classifiers
Correct errors using the Disease Ontology (DO) DAG
Transformation: use the label powerset and learn a multiclass model
Using RF: multi-label trees
Performed better than most approaches in an experimental study (Madjarov et al. 2012)
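Two of the strategies above differ only in how the label matrix Y is turned into training targets. A minimal sketch with a hypothetical toy Y (the disease names are illustrative, and no classifier is trained here):

```python
def binary_relevance_targets(Y, labels):
    """One 0/1 target vector per disease: a single classifier per label,
    ignoring dependencies between classes."""
    return {label: [row[j] for row in Y] for j, label in enumerate(labels)}

def label_powerset_targets(Y):
    """Map each distinct label combination to one class id, so that a
    standard multiclass learner can be used."""
    class_of = {}
    targets = []
    for row in Y:
        key = tuple(row)
        class_of.setdefault(key, len(class_of))
        targets.append(class_of[key])
    return targets, class_of

labels = ["cancer", "lymphoma", "asthma"]
Y = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 0],
]
per_label = binary_relevance_targets(Y, labels)
classes, class_of = label_powerset_targets(Y)
print(per_label["lymphoma"])  # one binary problem
print(classes)                # one multiclass problem
```

The powerset transformation preserves label co-occurrence but can create many rare classes when the number of disease terms is large.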
How to validate a classifier?
Use leave-dataset-out cross-validation
Global AUC scores: each prediction Pij vs. the correct label Yij
Disease-based AUC scores: consider each column separately
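A minimal sketch of the validation scheme, assuming each sample carries the identifier of its source dataset. The AUC is computed as the fraction of positive/negative pairs ranked correctly, with ties counted as one half. Dataset identifiers and the toy matrices are hypothetical.

```python
def leave_dataset_out(dataset_of_sample):
    """Yield (held_out, train_idx, test_idx), holding out one whole dataset per fold."""
    for held_out in sorted(set(dataset_of_sample)):
        test = [i for i, d in enumerate(dataset_of_sample) if d == held_out]
        train = [i for i, d in enumerate(dataset_of_sample) if d != held_out]
        yield held_out, train, test

def roc_auc(labels, scores):
    """Probability that a positive scores above a negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def global_auc(Y, P):
    """Pool every (sample, disease) cell: each Pij vs. its label Yij."""
    return roc_auc([y for row in Y for y in row], [p for row in P for p in row])

def per_disease_auc(Y, P):
    """Score each disease column separately."""
    return [roc_auc([r[j] for r in Y], [r[j] for r in P]) for j in range(len(Y[0]))]

folds = list(leave_dataset_out(["D1", "D1", "D2", "D3"]))
Y = [[1, 0], [0, 1], [0, 0], [1, 1]]
P = [[0.9, 0.2], [0.3, 0.8], [0.1, 0.4], [0.7, 0.3]]
print(folds[0])
print(global_auc(Y, P), per_disease_auc(Y, P))
```

Holding out whole datasets keeps samples from the same study out of both the training and test sets, so batch effects cannot inflate the scores.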
Test set:
Y: samples x disease terms, true labels in {0,1}
P: samples x disease terms, predicted probabilities in [0,1] (the output of a multi-label learner)
A problem (!)
What is in the background?
For a disease D define:
Positives: disease samples
Negatives: direct controls
Background controls
Example: 500 positives, 500 negatives, 10,000 background controls (BGCs)
Multistep validation
It is recommended to use several scores (Lee et al. 2013)
Measure global AUPR
For each disease we calculate three scores
Measure -> (additional) information used
AUPR: check separation between positives and all others -> sick vs. not sick
ROC: test for separation between positives and negatives -> direct use of negatives
Meta-analysis p-value: calculate the overall separation significance within the original datasets -> mapping of samples to datasets
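The first two per-disease measures listed above can be sketched as follows. AUPR is approximated here by average precision, which is one common estimator and an assumption on our part. The meta-analysis p-value is left out because the slides do not specify how the per-dataset results are combined. Group labels and probabilities are toy values.

```python
def average_precision(labels, scores):
    """Average precision over the ranking induced by the scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else None

def roc_auc(labels, scores):
    """Probability that a positive scores above a negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def disease_scores(groups, probs):
    """groups[i] is "pos", "neg" (direct control), or "bgc" (background control)."""
    is_pos = [1 if g == "pos" else 0 for g in groups]
    aupr_vs_all = average_precision(is_pos, probs)  # positives vs. all others
    direct = [(y, p) for g, y, p in zip(groups, is_pos, probs) if g != "bgc"]
    roc_vs_neg = roc_auc([y for y, _ in direct], [p for _, p in direct])
    return aupr_vs_all, roc_vs_neg

groups = ["pos", "pos", "neg", "neg", "bgc", "bgc", "bgc"]
probs = [0.9, 0.6, 0.5, 0.2, 0.8, 0.7, 0.1]
aupr, roc = disease_scores(groups, probs)
print(aupr, roc)
```

In this toy case the positives separate perfectly from their direct negatives, yet two background controls outrank a positive and lower the AUPR. This is why a single score can be misleading.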
Pathway-Disease network
Steps (for each of the selected diseases):
1. Disease-pathway edges
   a. RF importance: select the top features
   b. Test for disease relevance
2. Add edges between diseases
   a. Use the DO structure
3. Add edges between pathways
   a. Based on significant overlap in genes
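For step 3, the slides do not name the test used for "significant overlap". One common choice, assumed here, is a one-sided hypergeometric test on the number of shared genes. The gene identifiers and universe size below are toy values.

```python
from math import comb

def overlap_pvalue(pathway_a, pathway_b, universe_size):
    """P(overlap >= observed) when pathway_b is drawn at random from the universe.

    Assumes both pathways are subsets of a gene universe of the given size.
    """
    a, b = set(pathway_a), set(pathway_b)
    k = len(a & b)
    n_a, n_b = len(a), len(b)
    total = comb(universe_size, n_b)
    tail = sum(
        comb(n_a, x) * comb(universe_size - n_a, n_b - x)
        for x in range(k, min(n_a, n_b) + 1)
    )
    return tail / total

# Toy example: two 3-gene pathways sharing 2 genes in a 10-gene universe.
p = overlap_pvalue({"g1", "g2", "g3"}, {"g1", "g2", "g4"}, universe_size=10)
print(round(p, 4))
```

A pathway-pathway edge would then be added when the p-value falls below a chosen threshold, ideally after correcting for the number of pathway pairs tested.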
Summary
Large scale integration
Multi-label learning
Careful validation
Pathway-based features as biomarkers
Summary of the results in a network
Currently
Add genes: overcome missing values
Shows improvement in validation