Sufficient statistics for CTMCs:
Methods, Implementations and Applications
Paula Tataru
Bioinformatics Research Center
Problem and Motivation
Continuous-time Markov chains (CTMCs) are a widely used modelling tool. Applications include DNA sequence evolution, ion channel gating behaviour and
mathematical finance. I consider the problem of calculating the expected number of jumps between any two states and expected waiting time in a state, conditioned on
the end points of the process. These statistics are needed in bioinformatics to estimate model parameters by maximum likelihood using the EM algorithm or for the
robust computation of pairwise distances for aligned DNA sequences, as well as in other fields.
Summary Statistics
A CTMC is a stochastic process X(t) that fulfills the Markov property: given the present state and the past states, the future states depend
only upon the present. The process is characterized by a rate matrix Q that describes the rates at which the process moves from one state to
another. We are interested in

\[ T_{ab}^{\Gamma} = \sum_{i \in \Gamma} E[T_i \, 1(X(T) = b) \mid X(0) = a] = \sum_{i \in \Gamma} I_{ab}^{ii}(T) \]

\[ N_{ab}^{\Delta} = \sum_{(i,j) \in \Delta} E[N_{ij} \, 1(X(T) = b) \mid X(0) = a] = \sum_{(i,j) \in \Delta} Q_{ij} \, I_{ab}^{ij}(T) \]

where

\[ I_{ab}^{ij}(T) = \int_0^T P_{ai}(t) \, P_{jb}(T - t) \, dt, \qquad P(t) = e^{Qt}. \]

Here T_i is the waiting time in state i and N_ij is the number of jumps from state i to state j.

[Figure: state diagram of a CTMC on the codons AAA, AAC, ..., TTT, with rates Q12, Q21, ..., Q1n, Qn1, Q2n, Qn2 between the states.]
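As a sanity check on the definitions above, the integral I_ab^{ij}(T) can be approximated by direct quadrature. The sketch below is illustrative only: the 3-state rate matrix, the sets Γ and Δ, the end points and the function names are hypothetical, and P(t) is computed through an eigendecomposition, which assumes that Q is diagonalizable.

```python
import numpy as np

def transition_entry(Q, row, col, ts):
    """P_{row,col}(t) for every t in ts, using P(t) = U exp(Lambda t) U^{-1}."""
    lam, U = np.linalg.eig(Q)
    Uinv = np.linalg.inv(U)
    return (np.exp(np.outer(ts, lam)) @ (U[row, :] * Uinv[:, col])).real

def I_quadrature(Q, a, b, i, j, T, steps=2000):
    """I_ab^{ij}(T) = int_0^T P_ai(t) P_jb(T - t) dt by the composite trapezoid rule."""
    ts = np.linspace(0.0, T, steps + 1)
    vals = transition_entry(Q, a, i, ts) * transition_entry(Q, j, b, T - ts)
    return (T / steps) * (vals.sum() - 0.5 * (vals[0] + vals[-1]))

# Hypothetical 3-state rate matrix (each row sums to zero).
Q = np.array([[-1.0, 0.7, 0.3],
              [0.4, -0.9, 0.5],
              [0.2, 0.6, -0.8]])
T, a, b = 1.5, 0, 2

# Expected time spent in Gamma = {0, 1}, jointly with ending in state b.
T_ab = sum(I_quadrature(Q, a, b, i, i, T) for i in (0, 1))
# Expected number of jumps in Delta = {(0, 1), (1, 0)}, jointly with ending in b.
N_ab = sum(Q[i, j] * I_quadrature(Q, a, b, i, j, T) for (i, j) in ((0, 1), (1, 0)))

# Dividing by P_ab(T) conditions on both end points.
P_ab = transition_entry(Q, a, b, np.array([T]))[0]
print("E[time in Gamma | a, b] =", T_ab / P_ab)
print("E[jumps in Delta | a, b] =", N_ab / P_ab)
```

Quadrature is only meant as a cross-check here. It needs many evaluations of P(t), which is why closed-form methods are preferable in practice.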
Paula Tataru
paula@birc.au.dk
89 42 31 75
Bioinformatics Research Centre
Aarhus University
C.F. Møllers Allé
8000 Aarhus C
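Methods
1. Eigenvalue Decomposition - eigen
Let \( Q = U \Lambda U^{-1} \) be the eigenvalue decomposition and

\[ J_{ij}(T) = \begin{cases} T e^{\lambda_i T} & \text{if } \lambda_i = \lambda_j \\ \dfrac{e^{\lambda_i T} - e^{\lambda_j T}}{\lambda_i - \lambda_j} & \text{if } \lambda_i \neq \lambda_j \end{cases} \]

then

\[ T^{\Gamma}(T) = U \cdot [J \ast [U^{-1} \cdot (1_{\Gamma} \cdot U)]] \cdot U^{-1} \]

\[ N^{\Delta}(T) = U \cdot [J \ast [U^{-1} \cdot ((1_{\Delta} \ast Q) \cdot U)]] \cdot U^{-1} \]

2. Uniformization - uni
Let us define \( \mu = \max_i (-Q_{ii}) \) and \( R = \frac{1}{\mu} Q + I \), then

\[ T^{\Gamma}(T) = \frac{1}{\mu} \sum_{m=0}^{\infty} \mathrm{Pois}(m+1; \mu T) \sum_{l=0}^{m} R^{l} \cdot 1_{\Gamma} \cdot R^{m-l} \]

\[ N^{\Delta}(T) = \frac{1}{\mu} \sum_{m=0}^{\infty} \mathrm{Pois}(m+1; \mu T) \sum_{l=0}^{m} R^{l} \cdot (1_{\Delta} \ast Q) \cdot R^{m-l} \]

3. Exponentiation - expm
Let us define

\[ A_B = \begin{pmatrix} Q & B \\ 0 & Q \end{pmatrix} \]

then

\[ T^{\Gamma}(T) = \left( e^{A_{1_{\Gamma}} T} \right)_{1:n,\,(n+1):2n} \]

\[ N^{\Delta}(T) = \left( e^{A_{1_{\Delta} \ast Q} T} \right)_{1:n,\,(n+1):2n} \]

For the exponentiation, I used the improved Scaling and Squaring method with Padé approximation, implemented in the R expm package.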
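The three methods compute the same matrix integral, \int_0^T e^{Qt} B e^{Q(T-t)} dt, with B = 1Γ for the waiting times and B = 1Δ∗Q for the jumps. The following is a minimal Python sketch for comparing them on a toy example, not the R code used for the poster. It reads ∗ as the entrywise product, 1Γ as a diagonal 0/1 matrix and 1Δ as a 0/1 matrix over state pairs. These readings, the 3-state rate matrix, the function names and the truncation rule in the uniformization series are all assumptions. SciPy's `expm` stands in for the R expm package.

```python
import numpy as np
from scipy.linalg import expm

def integral_eigen(Q, B, T):
    """U [J * (U^{-1} B U)] U^{-1}, assuming Q is diagonalizable."""
    lam, U = np.linalg.eig(Q)
    Uinv = np.linalg.inv(U)
    e = np.exp(lam * T)
    diff = lam[:, None] - lam[None, :]
    same = np.isclose(diff, 0.0)
    # J_ij = T e^{lam_i T} if lam_i = lam_j, else (e^{lam_i T} - e^{lam_j T}) / (lam_i - lam_j)
    J = np.where(same, T * e[:, None], (e[:, None] - e[None, :]) / np.where(same, 1.0, diff))
    return (U @ (J * (Uinv @ B @ U)) @ Uinv).real

def integral_uni(Q, B, T, tol=1e-12, max_terms=10_000):
    """(1/mu) sum_m Pois(m+1; mu T) S_m with S_m = sum_l R^l B R^(m-l), truncated."""
    n = Q.shape[0]
    mu = np.max(-np.diag(Q))
    R = Q / mu + np.eye(n)
    weight = np.exp(-mu * T) * mu * T        # Pois(1; mu T)
    tail = 1.0 - np.exp(-mu * T) - weight    # Poisson mass not yet used
    S, Rpow = B.astype(float), np.eye(n)     # S_0 = B and R^0
    total = weight * S
    m = 0
    while tail > tol and m < max_terms:
        m += 1
        Rpow = Rpow @ R                      # R^m
        S = S @ R + Rpow @ B                 # S_m = S_{m-1} R + R^m B
        weight *= mu * T / (m + 1)           # Pois(m + 1; mu T)
        total += weight * S
        tail -= weight
    return total / mu

def integral_expm(Q, B, T):
    """Upper-right n x n block of exp([[Q, B], [0, Q]] T)."""
    n = Q.shape[0]
    A = np.block([[Q, B], [np.zeros((n, n)), Q]])
    return expm(A * T)[:n, n:]

# Hypothetical 3-state example.
Q = np.array([[-1.0, 0.7, 0.3],
              [0.4, -0.9, 0.5],
              [0.2, 0.6, -0.8]])
T = 1.5
one_Gamma = np.diag([1.0, 1.0, 0.0])         # Gamma = {0, 1}
one_Delta = np.zeros((3, 3))
one_Delta[0, 1] = one_Delta[1, 0] = 1.0      # Delta = {(0, 1), (1, 0)}

for name, B in (("T^Gamma", one_Gamma), ("N^Delta", one_Delta * Q)):
    ref = integral_expm(Q, B, T)
    print(name,
          "eigen agrees:", np.allclose(integral_eigen(Q, B, T), ref),
          "uni agrees:", np.allclose(integral_uni(Q, B, T), ref))
```

Entry (a, b) of each result corresponds to the summary statistic for start state a and end state b.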
Results
[Figure: Running time in seconds (sec, 0 to 3.5) against the number of sequences (# seq, 2 to 16) for eigen, uni and expm.]
The EM algorithm has been used to estimate a 61x61 rate matrix for codon evolution. The data set contained 16 codon-aligned DNA sequences
from the pol HIV gene. For each data set (using the first 2, 3, …, 15, 16 sequences), a fixed tree was assumed.
Application
CTMCs can be used to describe the evolution of DNA sequences. My
application involves looking at codon sequences and their evolution, and estimating
the CTMC from the data. In this case, the rate matrix is 61x61; it was first
introduced by Goldman and Yang (1994). It is given by:
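\[ Q_{ij} = \begin{cases}
0 & \text{if } i \to j \text{ by more than one substitution} \\
\kappa \pi_j & \text{if } i \to j \text{ by a synonymous transition} \\
\pi_j & \text{if } i \to j \text{ by a synonymous transversion} \\
\omega \kappa \pi_j & \text{if } i \to j \text{ by a non-synonymous transition} \\
\omega \pi_j & \text{if } i \to j \text{ by a non-synonymous transversion}
\end{cases} \]
where π is the stationary distribution (the codon frequencies), κ is the
transition / transversion ratio and ω is the non-synonymous / synonymous rate ratio.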
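The case definition above translates directly into code. The following is a minimal sketch, assuming the standard genetic code with the 61 sense codons as states and a diagonal chosen so that each row sums to zero, as for any rate matrix. The codon frequencies and the values of κ and ω are hypothetical, and the function name is illustrative.

```python
import itertools
import numpy as np

BASES = "TCAG"
# Standard genetic code, codons enumerated with the first base varying slowest.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(itertools.product(BASES, repeat=3), AMINO)}
CODONS = [c for c, aa in CODE.items() if aa != "*"]   # the 61 sense codons
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def gy94_rate_matrix(pi, kappa, omega):
    """Rate matrix with the off-diagonal entries of the case definition."""
    n = len(CODONS)
    Q = np.zeros((n, n))
    for i, ci in enumerate(CODONS):
        for j, cj in enumerate(CODONS):
            diffs = [(x, y) for x, y in zip(ci, cj) if x != y]
            if len(diffs) != 1:
                continue                      # same codon or more than one substitution
            rate = pi[j]
            if diffs[0] in TRANSITIONS:
                rate *= kappa                 # transition rather than transversion
            if CODE[ci] != CODE[cj]:
                rate *= omega                 # non-synonymous change
            Q[i, j] = rate
    Q[np.diag_indices(n)] = -Q.sum(axis=1)    # rows of a rate matrix sum to zero
    return Q

# Hypothetical parameter values, for illustration only.
pi = np.arange(1, 62, dtype=float)
pi /= pi.sum()
Q = gy94_rate_matrix(pi, kappa=2.0, omega=0.5)
print(Q.shape, np.allclose(Q.sum(axis=1), 0.0))
```

Because every off-diagonal rate is π_j times a factor that is symmetric in i and j, the resulting chain is reversible with stationary distribution π.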
To estimate Q, I use a maximum likelihood approach (optimized via the EM
algorithm) which requires the calculation of the summary statistics. For example,
I am interested in the sums of the expected number of jumps between states that
differ by a transition (which can be either synonymous or non-synonymous).
The sum will be computed over the set Δ that contains these pairs of states:
\[ \Delta = \{ (i, j) \mid i \to j \text{ by a transition} \} \]
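The set Δ can be represented as a 0/1 matrix, which is how it enters expressions such as 1Δ∗Q in the methods. The following is a minimal sketch, assuming the standard genetic code (stop codons TAA, TAG and TGA excluded) and reading ∗ as the entrywise product. The names are illustrative.

```python
import itertools
import numpy as np

BASES = "TCAG"
STOPS = {"TAA", "TAG", "TGA"}
CODONS = ["".join(c) for c in itertools.product(BASES, repeat=3)
          if "".join(c) not in STOPS]
PURINES = set("AG")

def is_transition(ci, cj):
    """True if the codons differ at exactly one position, by a transition."""
    diffs = [(x, y) for x, y in zip(ci, cj) if x != y]
    if len(diffs) != 1:
        return False
    x, y = diffs[0]
    # A transition swaps purine with purine (A, G) or pyrimidine with pyrimidine (C, T).
    return (x in PURINES) == (y in PURINES)

# one_Delta[i, j] = 1 if (i, j) is in Delta, else 0.
one_Delta = np.array([[1.0 if is_transition(ci, cj) else 0.0 for cj in CODONS]
                      for ci in CODONS])
print(one_Delta.shape, int(one_Delta.sum()), "pairs in Delta")
```

For a 61x61 rate matrix `Q` on the same codon ordering, `one_Delta * Q` then keeps only the rates of the jumps being counted.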
The EM algorithm contains two steps: the expectation step, where the summary
statistics are computed, and the maximization step, where the likelihood is
maximized with respect to the unknown parameters.
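The two steps can be illustrated on a much simpler problem than the codon model. The sketch below swaps in an unconstrained rate matrix on three states, observed only through counts of (start, end) pairs over a single time interval T. In that setting the maximization step has the closed form Q_ij = N_ij / T_i, which is not the update for κ and ω used in the application. The data, the starting point and the function names are hypothetical, and SciPy's `expm` is used to evaluate the integrals.

```python
import numpy as np
from scipy.linalg import expm

def integral(Q, B, T):
    """int_0^T e^{Qt} B e^{Q(T-t)} dt as the upper-right block of exp([[Q, B], [0, Q]] T)."""
    n = Q.shape[0]
    A = np.block([[Q, B], [np.zeros((n, n)), Q]])
    return expm(A * T)[:n, n:]

def em_step(Q, counts, T):
    n = Q.shape[0]
    W = counts / expm(Q * T)                  # counts[a, b] / P_ab(T)
    time = np.zeros(n)
    jumps = np.zeros((n, n))
    # Expectation step: expected waiting times and jump counts given the end points.
    for i in range(n):
        for j in range(n):
            E = np.zeros((n, n))
            E[i, j] = 1.0
            I = integral(Q, E, T)             # I[a, b] = I_ab^{ij}(T)
            if i == j:
                time[i] = np.sum(W * I)
            else:
                jumps[i, j] = Q[i, j] * np.sum(W * I)
    # Maximization step for an unconstrained rate matrix.
    Q_new = jumps / time[:, None]
    Q_new[np.diag_indices(n)] = -Q_new.sum(axis=1)
    return Q_new

def log_likelihood(Q, counts, T):
    return float(np.sum(counts * np.log(expm(Q * T))))

# Hypothetical data: counts[a, b] pairs observed with X(0) = a and X(T) = b.
counts = np.array([[60.0, 25.0, 15.0],
                   [20.0, 55.0, 25.0],
                   [10.0, 30.0, 60.0]])
T = 1.0
Q_hat = np.array([[-1.0, 0.5, 0.5],
                  [0.5, -1.0, 0.5],
                  [0.5, 0.5, -1.0]])
history = [log_likelihood(Q_hat, counts, T)]
for _ in range(20):
    Q_hat = em_step(Q_hat, counts, T)
    history.append(log_likelihood(Q_hat, counts, T))
print(np.round(Q_hat, 3))
print("log-likelihood:", history[0], "->", history[-1])
```

Each iteration should not decrease the log-likelihood, which gives a convenient check that the summary statistics are computed correctly.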
In order to estimate Q, I also need a tree that shows the relationships between
the sequences. The tree describes how the sequences evolved from each other.
