Probabilistic refinement of cellular pathway models
1. Probabilistic refinement of cellular pathway models
Cambridge Statistical Laboratory
Networks seminar series
2009 Jan 21
Florian Markowetz
florian.markowetz@cancer.org.uk
2. What is a signaling pathway?
[Figure: a signaling pathway. Environmental stimuli -> receptor protein in the cell membrane -> protein cascade -> transcription factors regulating target genes (DNA -> mRNA -> protein)]
3. Pathway reconstruction
Signaling pathways are important
- Deregulation causes many diseases incl. cancer
Signaling pathways are poorly understood
- Only parts-lists
- missing are interactions within and between pathways
Biological research
- So far mostly focused on individual genes
New genome-scale datasets
- Opportunity for data integration and novel methods
4. What data do we have?
Proteins:
- interactions between proteins
- binding to DNA
mRNA (bulk of data: microarray):
- expression under different stimuli
- binding to DNA
Sequence:
- binding motifs
- epigenetic marks
[Figure labels: Protein, mRNA, DNA, Morphology]
5. Pathways as graphs
• Nodes are (mostly) known
• Goal: infer edges from data
• Data are heterogeneous
Edges:
• co-expression between genes
• interactions between proteins
• binding motifs at genes
• binding of proteins to DNA
Nodes:
• Protein domains
• Functional annotation
Paths:
• Cause-effect data: changing environments, experimental perturbations
6. Pathway reconstruction
“Classical” statistical approaches:
Treat the genes/proteins as random variables and
explore correlation structure in the data:
– Correlation graphs
– Gaussian graphical models (partial correlation)
– Bayesian networks
Challenges/Problems/Opportunities
1. Correlation may be uninformative
2. Need to integrate heterogeneous, noisy, and complementary data sources
Review: Markowetz and Spang (2007)
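The difference between a correlation graph and a Gaussian graphical model can be illustrated with a small simulation. The sketch below assumes a hypothetical chain A -> B -> C with made-up noise levels; the helper names `pearson` and `partial_corr` are illustrative and not taken from the talk.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after conditioning on a single variable z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Simulated chain A -> B -> C (assumed toy data, fixed seed for repeatability).
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(2000)]
b = [v + 0.5 * rng.gauss(0, 1) for v in a]
c = [v + 0.5 * rng.gauss(0, 1) for v in b]

corr_ac = pearson(a, c)
pcorr_ac_given_b = partial_corr(a, c, b)
print(f"corr(A, C)      = {corr_ac:.2f}")
print(f"pcorr(A, C | B) = {pcorr_ac_given_b:.2f}")
```

In this toy setting a correlation graph would connect A and C, whereas the partial correlation given B is close to zero, so a Gaussian graphical model would omit that edge.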
8. Experimental perturbations
[Figure: perturbations (drugs, small molecules, RNAi, stress, knockout) acting at the protein, mRNA, or DNA level]
Readout:
Global gene expression measurements
9. Drosophila immune response
Columns: perturbed genes
Rows: effects on other genes
1. Silencing tak1 reduces expression of all LPS-inducible transcripts
2. Silencing rel (key) or mkk4/hep reduces expression of subsets of induced transcripts
(Boutros et al, Dev Cell 2002)
10. (!) Two types of entities
- Components of the signaling pathway, which are experimentally perturbed
- Downstream effect reporters
11. (!!) Only indirect information
No direct observation of
perturbation effects on
other pathway
components!
Inference from observed
perturbation effects on
downstream reporters.
12. The information gap
Direct information: effects are visible at other pathway components
Indirect information: effects are only visible at down-stream reporters
[Figure: two panels, each showing a pathway with components A, B, C, D]
Down-stream reporters:
- Cell survival or death
- Growth rate
- downstream genes
13. Correlation won’t do
“Classical” approach (works on the pathway components A, B, C, D):
- Correlation
- Graphical models: Bayes Nets, GGMs
- Mutual Information
Nested Effects Models (work on the downstream regulated genes)
14. Nested Effects Models
INPUT
1. Set of candidate pathway genes
2. High-dimensional phenotypic profile, e.g. microarray
OUTPUT
Graph representation of information flow explaining the phenotypes
[Figure: phenotypic profiles (gene perturbations A to H × effects) and the inferred pathway with nodes AB, CD, EF, GH]
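The idea that a pathway graph predicts nested sets of effects can be sketched in a few lines. The example assumes a hypothetical linear pathway X -> Y -> Z with two reporters attached to each gene; the topology, the attachment, and the function names are illustrative assumptions only.

```python
def transitive_closure(edges, nodes):
    """Map each node to the set of nodes reachable from it (including itself)."""
    reach = {v: {v} for v in nodes}
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            new = reach[b] - reach[a]
            if new:
                reach[a] |= new
                changed = True
    return reach

def expected_effects(edges, nodes, attachment):
    """Reporters expected to respond when each pathway gene is perturbed.

    attachment maps a reporter name to the pathway gene it is attached to.
    """
    reach = transitive_closure(edges, nodes)
    return {
        g: {e for e, pos in attachment.items() if pos in reach[g]}
        for g in nodes
    }

# Hypothetical example: X -> Y -> Z, two reporters per gene.
nodes = ["X", "Y", "Z"]
edges = [("X", "Y"), ("Y", "Z")]
attachment = {"E1": "X", "E2": "X", "E3": "Y", "E4": "Y", "E5": "Z", "E6": "Z"}

effects = expected_effects(edges, nodes, attachment)
for gene in nodes:
    print(gene, sorted(effects[gene]))
```

Perturbing an upstream gene is expected to affect every reporter that a downstream gene affects, so the effect sets are nested: effects of Z within effects of Y within effects of X.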
15. NEM: model formulation
[Figure: pathway X, Y, Z with effect reporters E1 … E6; panels "Expected" and "Observed" effect patterns, with false positives (FP) and false negatives (FN) marked in the observed panel]
Pathway genes: X, Y, Z
• core topology
• to be reconstructed
= Model M
Effect reporters: E1, …, E6
• states are observed = Data D
• positions in pathway unknown = Parameters θ
Posterior: P(M | D) = 1/Z · P(D | M) · P(M), where P(D | M) is the marginal likelihood
16. Likelihood P( D | M, θ )
Compare predictions with observations:
[Figure: pathway X, Y, Z with effect reporters E1 and E2]
Prediction: E1=0, E2=1
Observations:
1. E1=1, E2=1
2. E1=0, E2=1
Error probabilities
e.g. false NEG rate 20%, false POS rate 5%
Lik = Pr(E1 = 1) ⋅ Pr(E2 = 1) ⋅ Pr(E1 = 0) ⋅ Pr(E2 = 1)
= 0.05 ⋅ 0.80 ⋅ 0.95 ⋅ 0.80
(false POS, true POS, true NEG, true POS, given the prediction E1=0, E2=1)
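The comparison of predictions with observations can be written as a short function. This sketch assumes the conventional reading of the error rates: a reporter predicted to be 1 is observed as 1 with probability 1 minus the false-negative rate, and a reporter predicted to be 0 is observed as 1 with probability equal to the false-positive rate. The function names are illustrative.

```python
def p_observation(predicted, observed, fp_rate, fn_rate):
    """Probability of one observed reporter state given its predicted state."""
    if predicted == 1:
        return 1 - fn_rate if observed == 1 else fn_rate
    return fp_rate if observed == 1 else 1 - fp_rate

def likelihood(prediction, replicates, fp_rate, fn_rate):
    """Product over replicates and reporters of the per-observation probabilities."""
    lik = 1.0
    for replicate in replicates:
        for reporter, observed in replicate.items():
            lik *= p_observation(prediction[reporter], observed, fp_rate, fn_rate)
    return lik

# Values from the slide: prediction, two replicate observations, error rates.
prediction = {"E1": 0, "E2": 1}
replicates = [{"E1": 1, "E2": 1}, {"E1": 0, "E2": 1}]

lik = likelihood(prediction, replicates, fp_rate=0.05, fn_rate=0.20)
print(f"Lik = {lik:.4f}")  # factors: 0.05 * 0.80 * 0.95 * 0.80
```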
17. Marginal likelihood
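\[
\begin{aligned}
P(D \mid M) &= \int P(D \mid M, \Theta)\, P(\Theta \mid M)\, d\Theta \\
            &= \frac{1}{n^{m}} \prod_{i=1}^{m} \sum_{j=1}^{n} \prod_{k=1}^{l} P(e_{ik} \mid M, \theta_i = j)
\end{aligned}
\]
- 1/n^m: uniform prior over positions
- product over i: product over all effect reporters
- sum over j: average over possible positions in the pathway
- product over k: product over replicate observations
- P(e_ik | M, θ_i = j): distribution of a single effect reporter observation with known position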
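The marginal likelihood formula can be sketched directly as nested loops. The data layout here is an assumption made for illustration: `reach[s]` is the set of pathway genes affected when gene `s` is perturbed (the transitively closed model), and `data[i]` lists `(perturbed_gene, observed_state)` pairs for reporter `i`. Error rates are the example values from the likelihood slide.

```python
from math import prod

def p_observation(predicted, observed, fp_rate, fn_rate):
    """Probability of one observed reporter state given its predicted state."""
    if predicted == 1:
        return 1 - fn_rate if observed == 1 else fn_rate
    return fp_rate if observed == 1 else 1 - fp_rate

def marginal_likelihood(reach, data, fp_rate, fn_rate):
    """(1/n^m) * prod_i sum_j prod_k P(e_ik | M, theta_i = j)."""
    genes = list(reach)
    n = len(genes)
    total = 1.0
    for observations in data.values():        # product over reporters i
        averaged = sum(                        # average over positions j
            prod(                              # product over observations k
                p_observation(int(j in reach[s]), obs, fp_rate, fn_rate)
                for s, obs in observations
            )
            for j in genes
        ) / n
        total *= averaged
    return total

# Hypothetical two-gene comparison: model X -> Y versus model Y -> X.
data = {"E1": [("X", 1), ("Y", 0)]}
forward = {"X": {"X", "Y"}, "Y": {"Y"}}
reverse = {"X": {"X"}, "Y": {"X", "Y"}}

ml_forward = marginal_likelihood(forward, data, fp_rate=0.05, fn_rate=0.20)
ml_reverse = marginal_likelihood(reverse, data, fp_rate=0.05, fn_rate=0.20)
print(f"P(D | X->Y) = {ml_forward:.3f}, P(D | Y->X) = {ml_reverse:.3f}")
```

Because the unknown reporter positions are averaged out, models can be compared without first deciding where each reporter attaches.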
18. NEM: inference
Model space: all transitively closed directed graphs
Exhaustive enumeration: score all models to find
the one fitting the data best
Markowetz et al. Bioinformatics, 2005
MCMC, Simulated Annealing: take small
probabilistic steps to explore model space
(with A. Tresch; in preparation)
Divide and conquer: break a big model into smaller,
manageable pieces and then re-assemble
Markowetz et al. ISMB 2007
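For very small pathways the model space can be listed explicitly. The sketch below enumerates all transitively closed directed graphs on a set of labelled nodes, treating self-loops as implicit; it is only an illustration of exhaustive enumeration, and the function names are assumptions. Each enumerated model would then be scored, for example by its marginal likelihood.

```python
from itertools import permutations, product

def is_transitive(edges):
    """True if a->b and b->c (with a != c) always implies a->c."""
    return all(
        (a, c) in edges
        for a, b in edges
        for b2, c in edges
        if b == b2 and a != c
    )

def transitively_closed_graphs(nodes):
    """Yield every transitively closed edge set over the given nodes."""
    pairs = list(permutations(nodes, 2))
    for mask in product([0, 1], repeat=len(pairs)):
        edges = {pair for pair, bit in zip(pairs, mask) if bit}
        if is_transitive(edges):
            yield frozenset(edges)

models = list(transitively_closed_graphs(["X", "Y", "Z"]))
print(len(models), "candidate models for 3 pathway genes")
```

The number of candidate graphs grows very quickly with the number of genes (2^(n(n-1)) edge sets are checked here), which is why heuristic search and divide-and-conquer strategies are needed for larger pathways.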
19. NEM: extensions
- Drop transitivity requirement
- Likelihood based on log-ratios of effects
- Feature selection to concentrate on informative effect reporters
Tresch and Markowetz (2008)
21. Summary of part 1
1. Gene perturbation screens with gene-
expression readouts
2. Perturbation screens suffer from the
information gap between pathways and
reporters
3. Nested Effects Models reconstruct pathway
features from subset relations between
observed effects
22. – Part 2 –
Data integration and
probabilistic refinement of
a signaling pathway hypothesis
23. Pathway refinement
1. Start from given pathway hypothesis
Even if our understanding of pathways is poor, that does
not mean we have none at all!
2. Evaluate evidence for hypothesis in
data
3. Identify weakly supported areas and
likely extensions
Not reconstruction from scratch.
Step 1: assemble pathway hypothesis
(KEGG, literature, …) for pheromone
response pathway in Yeast
24. Edge data I
Support for hypothesis in
protein-protein interaction data
25. Edge data II
Support for hypothesis in
co-expression data
26. Edge data III
Why is it so hard to reconstruct the nuclear regulatory network from correlations?
27. Edge data IV
Support for hypothesis in
TF-DNA binding data
28. Paths: cause-effect data
Expression profiling of knock-out mutants
(Hughes et al., 2000)
Result:
transcriptional response to perturbation
only visible on down-stream genes
(information gap!)
29. Conclusion from data analysis
• Every data source is informative for a specific
compartment of the pathway
• No data source is informative in all
compartments
• We expect these observations also to hold for
other MAPK and signaling pathways.
Need compartment-specific integrative model
encompassing edge, node, and path data.
30. Integrative model
- Pathway graph as hidden/latent variables
- Conditional distributions for each data type
- Prior, parameters
- Different data types contribute to each compartment
Graphical model defines posterior P(G|data)
-> inference by Gibbs sampler
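The talk does not spell out the conditional distributions, so the following is only a toy sketch of Gibbs sampling over latent edge indicators. It assumes binary evidence per data type, made-up true-positive and false-positive rates, hypothetical data-type names (`ppi`, `coexpr`), and conditionally independent edges, which the actual model may not assume.

```python
import random

def edge_conditional(evidence, prior, tpr, fpr):
    """P(edge = 1 | evidence) for one edge; evidence[d] is 0 or 1 per data type d."""
    p1, p0 = prior, 1 - prior
    for d, x in evidence.items():
        p1 *= tpr[d] if x else 1 - tpr[d]
        p0 *= fpr[d] if x else 1 - fpr[d]
    return p1 / (p1 + p0)

def gibbs_edges(evidence_by_edge, prior, tpr, fpr, sweeps=2000, seed=0):
    """Resample each latent edge from its full conditional; return edge frequencies."""
    rng = random.Random(seed)
    state = {e: 0 for e in evidence_by_edge}
    counts = {e: 0 for e in evidence_by_edge}
    for _ in range(sweeps):
        for e, ev in evidence_by_edge.items():
            state[e] = int(rng.random() < edge_conditional(ev, prior, tpr, fpr))
            counts[e] += state[e]
    return {e: c / sweeps for e, c in counts.items()}

# Assumed toy inputs: two candidate edges, two data types.
tpr = {"ppi": 0.8, "coexpr": 0.6}
fpr = {"ppi": 0.1, "coexpr": 0.3}
evidence_by_edge = {
    ("A", "B"): {"ppi": 1, "coexpr": 1},
    ("B", "C"): {"ppi": 0, "coexpr": 1},
}

posterior = gibbs_edges(evidence_by_edge, prior=0.5, tpr=tpr, fpr=fpr)
for edge, p in posterior.items():
    print(edge, round(p, 2))
```

With independent edges the full conditional already equals the exact posterior, so sampling is not strictly needed here; the sampler becomes useful once the prior or the data couple edges, as in a compartment-specific model.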
31. Evaluation
1. Fit model parameters on pheromone
response pathway (training)
2. Use fitted model on other MAPK pathways
(generalization to closely related examples)
3. Use fitted model on all other Yeast signaling
pathways (generalization to everything else)
… work in progress …
32. Acknowledgements
Nested Effects Models
Rainer Spang (Univ. Regensburg) .:. Dennis
Kostka (UC SF) .:. Achim Tresch (Gene Center
Munich) .:. Holger Fröhlich (DKFZ Heidelberg)
.:. Tim Beißbarth (Univ. Göttingen) .:. Josh
Stuart, Charlie Vaske (UC SC) .:.
Data integration
Olga G. Troyanskaya (Princeton) .:. Edoardo
Airoldi (Harvard) .:. David Blei (Princeton) .:.
33. Probabilistic refinement of
cellular pathway models
Thank you !
Florian Markowetz
florian.markowetz@cancer.org.uk