Probabilistic refinement of cellular pathway models

Cambridge Statistical Laboratory
Networks seminar series
2009 Jan 21

Florian Markowetz
florian.markowetz@cancer.org.uk

What is a signaling pathway?

[Figure: signaling pathway. Labels: environmental stimuli; protein receptor in cell membrane; protein cascade (pathway); transcription factors regulating target genes; levels Protein, mRNA, DNA.]

Pathway reconstruction
Signaling pathways are important
- Deregulation causes many diseases incl. cancer
Signaling pathways are poorly understood
- Only parts lists are known
- Interactions within and between pathways are missing
Biological research
- So far mostly focused on individual genes
New genome-scale datasets
- Opportunity for data integration and novel methods

What data do we have?
Proteins:
- interactions between proteins
- binding to DNA
mRNA:
- expression under different stimuli
Sequence:
- binding motifs
- epigenetic marks
Morphology
Bulk of data: Microarray
[Figure labels: Protein, mRNA, DNA]

Pathways as graphs
• Nodes are (mostly) known
• Goal: infer edges from data
• Data are heterogeneous
Edges
• co-expression between genes
• interactions between proteins
• binding motifs at genes
• binding of proteins to DNA
Nodes
• Protein domains
• Functional annotation
Paths
• Cause-effect data: changing environments, experimental perturbations

Pathway reconstruction
“Classical” statistical approaches:
Treat the genes/proteins as random variables and
explore correlation structure in the data:
– Correlation graphs
– Gaussian graphical models (partial correlation)
– Bayesian networks

Challenges/Problems/Opportunities
1. Correlation may be uninformative
2. Integrate heterogeneous, noisy, and complementary data sources
Review: Markowetz and Spang (2007)
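
The difference between a correlation graph and a Gaussian graphical model can be illustrated with a small sketch. It assumes a simulated three-gene chain A -> B -> C (hypothetical data, not from the talk) and uses the standard first-order partial correlation formula; the helper names `pearson` and `partial_corr` are illustrative.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y given z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Toy chain A -> B -> C (simulated; an assumption for illustration only).
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(5000)]
b = [ai + rng.gauss(0, 0.5) for ai in a]
c = [bi + rng.gauss(0, 0.5) for bi in b]

corr_ac = pearson(a, c)            # marginal correlation: A and C look linked
pcorr_ac = partial_corr(a, c, b)   # conditioning on B removes the indirect link
print(round(corr_ac, 2), round(pcorr_ac, 2))
```

A correlation graph would draw an A-C edge here, while the partial correlation is close to zero, so a Gaussian graphical model would omit it.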

– Part 1 –

Nested Effects Models

Experimental perturbations
[Figure: perturbations (drugs, small molecules, RNAi, stress, knockout) acting on the levels Protein, mRNA, DNA]

Readout:
Global gene expression measurements

Drosophila immune response
Columns: perturbed genes
Rows: effects on other genes

1. Silencing tak1 reduces expression of all LPS-inducible transcripts
2. Silencing rel (key) or mkk4/hep reduces expression of subsets of induced transcripts

(Boutros et al, Dev Cell 2002)

(!) Two types of entities

Components of signaling
pathway which are
experimentally
perturbed

Downstream effect
reporters

(!!) Only indirect information

No direct observation of
perturbation effects on
other pathway
components!

Inference from observed
perturbation effects on
downstream reporters.

The information gap

Direct information: effects are visible at other pathway components
[Figure: pathway with components A, B, C, D]

Indirect information: effects are only visible at down-stream reporters
[Figure: pathway with components A, B, C, D]
- Cell survival or death
- Growth rate
- downstream genes

Correlation won’t do
“Classical” approach:
- Correlation
- Graphical models: Bayes Nets, GGMs
- Mutual Information
[Figure: pathway with components A, B, C, D]

Nested Effects Models
[Figure label: downstream regulated genes]

INPUT
1. Set of candidate pathway genes
2. High-dimensional phenotypic profile, e.g. microarray

OUTPUT
Graph representation of information flow explaining the phenotypes

[Figure: phenotypic profiles (gene perturbations A to H against effects) next to the inferred pathway with node labels AB, CD, EF, GH]

NEM: model formulation
[Figure: model M’xyz over pathway genes X, Y, Z and effect reporters E1 to E6, with expected and observed effect patterns; the observed pattern is marked with false negatives (FN) and a false positive (FP)]

Pathway genes: X, Y, Z
• core topology
• to be reconstructed
= Model M
Effect reporters: E1, …, E6
• states are observed = Data D
• positions in pathway unknown = Parameters θ

Posterior: P( M | D ) = 1/Z ⋅ P( D | M ) ⋅ P( M )
where P( D | M ) is the marginal likelihood

Likelihood P( D | M, θ )

Compare predictions with observations:
[Figure: pathway genes X, Y, Z with effect reporters E1, E2]
Prediction: E1=0, E2=1
Observation 1: E1=1, E2=1
Observation 2: E1=0, E2=1

Error probabilities
e.g. false NEG rate 20%, false POS rate 5%
Lik = Pr( E1 = 1 | pred. 0 ) ⋅ Pr( E2 = 1 | pred. 1 ) ⋅ Pr( E1 = 0 | pred. 0 ) ⋅ Pr( E2 = 1 | pred. 1 )
= 0.05 ⋅ 0.80 ⋅ 0.95 ⋅ 0.80 ≈ 0.03
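
The comparison of predictions with observations can be sketched in a few lines. The error rates, the prediction, and the two observations are taken from the slide; the function names `obs_prob` and `likelihood` are illustrative and not part of any NEM software. The sketch assumes that an observed effect where none is predicted counts as a false positive (5%), and a missing effect where one is predicted counts as a false negative (20%).

```python
FN_RATE = 0.20  # false negative rate (from the slide)
FP_RATE = 0.05  # false positive rate (from the slide)

def obs_prob(observed, predicted, fp=FP_RATE, fn=FN_RATE):
    """P(observed reporter state | predicted state) under the error model."""
    if predicted == 1:
        return 1 - fn if observed == 1 else fn
    return fp if observed == 1 else 1 - fp

def likelihood(prediction, observations):
    """Product over replicate observations and reporters."""
    lik = 1.0
    for obs in observations:
        for reporter, predicted in prediction.items():
            lik *= obs_prob(obs[reporter], predicted)
    return lik

prediction = {"E1": 0, "E2": 1}
observations = [{"E1": 1, "E2": 1}, {"E1": 0, "E2": 1}]

lik = likelihood(prediction, observations)
print(lik)  # 0.05 * 0.80 * 0.95 * 0.80, about 0.0304
```

Under these assumptions each matching observation of a predicted effect contributes 0.80, each matching non-effect 0.95, and the single unexpected effect 0.05.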

Marginal likelihood

\[
P(D \mid M) = \int P(D \mid M, \Theta)\, P(\Theta \mid M)\, d\Theta
= \frac{1}{n^m} \prod_{i=1}^{m} \sum_{j=1}^{n} \prod_{k=1}^{l} P(e_{ik} \mid M, \theta_i = j)
\]

- 1/n^m: uniform prior over positions
- Product over i: all effect reporters
- Sum over j (with the factor 1/n): average over possible positions in the pathway
- Product over k: replicate observations
- P(e_ik | M, θ_i = j): distribution of a single effect reporter with known position
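
The marginal likelihood can be evaluated directly for small examples. The following is a minimal sketch, not the author's implementation: it assumes binary effect data, the error rates from the likelihood slide, and a toy data set with three pathway genes and three reporters (all hypothetical). `adj[s][j] == 1` encodes that perturbing gene `s` is predicted to affect reporters attached to gene `j`.

```python
def obs_prob(observed, predicted, fp, fn):
    """P(observed reporter state | predicted state) under the error model."""
    if predicted:
        return 1 - fn if observed else fn
    return fp if observed else 1 - fp

def marginal_likelihood(adj, experiments, data, fp=0.05, fn=0.20):
    """Average each reporter over all attachment positions, then multiply."""
    genes = list(adj)
    total = 1.0
    for row in data.values():                    # product over reporters i
        avg = 0.0
        for j in genes:                          # sum over positions j
            p = 1.0
            for s, e in zip(experiments, row):   # product over observations k
                p *= obs_prob(e, adj[s][j], fp, fn)
            avg += p / len(genes)                # uniform prior 1/n
        total *= avg
    return total

# Transitively closed chain X -> Y -> Z (self-effects included).
chain = {"X": {"X": 1, "Y": 1, "Z": 1},
         "Y": {"X": 0, "Y": 1, "Z": 1},
         "Z": {"X": 0, "Y": 0, "Z": 1}}
# Competing model without any edges between pathway genes.
unconnected = {"X": {"X": 1, "Y": 0, "Z": 0},
               "Y": {"X": 0, "Y": 1, "Z": 0},
               "Z": {"X": 0, "Y": 0, "Z": 1}}

experiments = ["X", "Y", "Z"]       # perturbed gene in each observation
data = {"E1": [1, 0, 0],            # hypothetical nested effect patterns
        "E2": [1, 1, 0],
        "E3": [1, 1, 1]}

ml_chain = marginal_likelihood(chain, experiments, data)
ml_unconnected = marginal_likelihood(unconnected, experiments, data)
print(ml_chain, ml_unconnected)
```

Because the toy effects are nested (E1 within E2 within E3), the chain model receives a much higher score than the unconnected one.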

NEM: inference
Model space: all transitively closed directed graphs
- Exhaustive enumeration: score all models to find the one fitting the data best (Markowetz et al. Bioinformatics, 2005)
- MCMC, Simulated Annealing: take small probabilistic steps to explore model space (. . . with A Tresch; in preparation)
- Divide and conquer: break a big model into smaller, manageable pieces and then re-assemble (Markowetz et al. ISMB 2007)
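
Why exhaustive enumeration is limited to small models can be seen by counting the model space. This sketch enumerates all transitively closed directed graphs on n nodes by brute force; it assumes the convention that every node has a self-loop (a perturbed gene affects its own reporters), and the function names are illustrative.

```python
from itertools import product

def is_transitive(adj):
    """True if a -> b and b -> c always imply a -> c."""
    n = len(adj)
    return all(
        adj[a][c]
        for a in range(n) for b in range(n) for c in range(n)
        if adj[a][b] and adj[b][c]
    )

def transitively_closed_graphs(n):
    """Yield adjacency matrices of all transitively closed graphs on n nodes."""
    pairs = [(a, b) for a in range(n) for b in range(n) if a != b]
    for bits in product((0, 1), repeat=len(pairs)):
        # Self-loops are fixed to 1 (assumed convention).
        adj = [[1 if a == b else 0 for b in range(n)] for a in range(n)]
        for (a, b), bit in zip(pairs, bits):
            adj[a][b] = bit
        if is_transitive(adj):
            yield adj

counts = {n: sum(1 for _ in transitively_closed_graphs(n)) for n in (2, 3, 4)}
print(counts)
```

The count grows from 4 models for two genes to 29 for three and 355 for four, which is why larger pathways call for stochastic search or divide-and-conquer strategies.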

NEM: extensions

Drop transitivity requirement

Likelihood based on log-ratios of effects

Feature selection to concentrate on informative effect reporters

Tresch and Markowetz (2008)

Summary of part 1

1. Gene perturbation screens with gene-expression readouts
2. Perturbation screens suffer from the information gap between pathways and reporters
3. Nested Effects Models reconstruct pathway features from subset relations between observed effects

– Part 2 –

Data integration and
probabilistic refinement of
a signaling pathway hypothesis

Pathway refinement
1. Start from given pathway hypothesis
Even if our understanding of pathways is poor, that does
not mean we have none at all!
2. Evaluate evidence for hypothesis in
data
3. Identify weakly supported areas and
likely extensions
Not reconstruction from scratch.
Step 1: assemble pathway hypothesis
(KEGG, literature, …) for pheromone
response pathway in Yeast

Edge data I
Support for hypothesis in
protein-protein interaction data

Edge data II
co-expression data

Edge data III
Why is it so hard to reconstruct
nuclear regulatory network from
correlations?

Edge data IV
TF-DNA binding data

Paths: cause-effect data
Expression profiling of knock-out mutants
(Hughes et al., 2000)

Result:
transcriptional response to perturbation
only visible on down-stream genes
(information gap!)

Conclusion from data analysis

• Every data source is informative for a specific
compartment of the pathway
• No data source is informative in all
compartments
• We expect these observations also to hold for
other MAPK and signaling pathways.

Need compartment-specific integrative model
encompassing edge, node, and path data.

Integrative model
- Pathway graph as hidden/latent variables
- Conditional distributions for each data type
- Prior and parameters
- Different data types contribute to each compartment
- Graphical model defines posterior P(G|data) -> inference by Gibbs sampler
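
The idea of a latent pathway graph with one conditional distribution per data type can be sketched with a toy Gibbs sampler. This is a strongly simplified illustration and not the model from the talk: it assumes binary evidence per candidate edge, a single compartment, a fixed edge prior of 0.5, and Beta priors on each data type's sensitivity and false-positive rate. The data values and the names `gibbs_edges`, `ppi`, and `coexpression` are hypothetical.

```python
import random

def gibbs_edges(evidence, n_iter=3000, burn_in=500, prior_edge=0.5, seed=0):
    """Estimate P(edge present | data) for each candidate edge by Gibbs sampling.

    evidence maps a data type to a list of 0/1 observations, one per edge.
    """
    rng = random.Random(seed)
    types = list(evidence)
    n_edges = len(evidence[types[0]])
    g = [rng.randint(0, 1) for _ in range(n_edges)]  # latent graph (edge states)
    sens = {t: 0.8 for t in types}  # P(evidence = 1 | edge present)
    fpr = {t: 0.2 for t in types}   # P(evidence = 1 | edge absent)
    hits = [0] * n_edges
    for it in range(n_iter):
        # Step 1: sample each latent edge given parameters and data.
        for e in range(n_edges):
            p1, p0 = prior_edge, 1 - prior_edge
            for t in types:
                x = evidence[t][e]
                p1 *= sens[t] if x else 1 - sens[t]
                p0 *= fpr[t] if x else 1 - fpr[t]
            g[e] = 1 if rng.random() < p1 / (p1 + p0) else 0
        # Step 2: sample parameters given the graph (Beta-Bernoulli conjugacy).
        for t in types:
            x = evidence[t]
            n11 = sum(1 for e in range(n_edges) if g[e] and x[e])
            n10 = sum(1 for e in range(n_edges) if g[e] and not x[e])
            n01 = sum(1 for e in range(n_edges) if not g[e] and x[e])
            n00 = sum(1 for e in range(n_edges) if not g[e] and not x[e])
            sens[t] = rng.betavariate(8 + n11, 2 + n10)  # assumed Beta(8, 2) prior
            fpr[t] = rng.betavariate(2 + n01, 8 + n00)   # assumed Beta(2, 8) prior
        if it >= burn_in:
            for e in range(n_edges):
                hits[e] += g[e]
    return [h / (n_iter - burn_in) for h in hits]

# Hypothetical evidence for eight candidate edges from two data types.
evidence = {
    "ppi":          [1, 1, 1, 1, 0, 0, 0, 0],
    "coexpression": [1, 1, 1, 0, 1, 0, 0, 0],
}
posterior = gibbs_edges(evidence)
print([round(p, 2) for p in posterior])
```

Edges supported by both data types end up with high posterior probability and edges supported by neither with low probability. A compartment-specific version would need separate parameters per data type and compartment.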

Evaluation

1. Fit model parameters on pheromone
response pathway (training)
2. Use fitted model on other MAPK pathways
(generalization to closely related examples)
3. Use fitted model on all other Yeast signaling
pathways (generalization to everything else)

… work in progress …

Acknowledgements
Rainer Spang (Univ. Regensburg) .:. Dennis Kostka (UC SF) .:. Achim Tresch (Gene Center Munich) .:. Holger Fröhlich (DKFZ Heidelberg) .:. Tim Beißbarth (Univ. Göttingen) .:. Josh Stuart, Charlie Vaske (UC SC)
Data integration: Olga G. Troyanskaya (Princeton) .:. Edoardo Airoldi (Harvard) .:. David Blei (Princeton)

Thank you !
Florian Markowetz
florian.markowetz@cancer.org.uk
