Sufficient statistics for CTMCs:
Methods, Implementations and Applications
Paula Tataru
Bioinformatics Research Centre
Problem and Motivation
Continuous-time Markov chains (CTMCs) are a widely used modelling tool. Applications include DNA sequence evolution, ion channel gating behaviour and
mathematical finance. I consider the problem of calculating the expected number of jumps between any two states and the expected waiting time in a state, conditioned on
the end points of the process. These statistics are needed in bioinformatics to estimate model parameters by maximum likelihood using the EM algorithm, or for the
robust computation of pairwise distances between aligned DNA sequences, as well as in other fields.
Application
[Figure: the codon states AAA, AAC, ..., TTT, with the rates Q12, Q21, Q1n, Qn1, Q2n, Qn2, ... between them.]
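Summary Statistics
A CTMC is a stochastic process X(t) that fulfils the
Markov property: given the present state and the past
states, the future states depend only upon the present.
The process is characterized by a rate matrix Q that
describes the rates at which the process moves from one
state to another. We are interested in
\[
T^{\Gamma}_{ab} = \sum_{i \in \Gamma} E\left[ T_i \, 1(X(T)=b) \mid X(0)=a \right] = \sum_{i \in \Gamma} I^{ii}_{ab}(T)
\]
\[
N^{\Delta}_{ab} = \sum_{(i,j) \in \Delta} E\left[ N_{ij} \, 1(X(T)=b) \mid X(0)=a \right] = \sum_{(i,j) \in \Delta} Q_{ij} \, I^{ij}_{ab}(T)
\]
where
\[
I^{ij}_{ab}(T) = \int_0^T P_{ai}(t) \, P_{jb}(T-t) \, dt, \qquad P(t) = e^{Qt}.
\]
Here T_i is the time spent in state i and N_ij is the number of jumps from state i to state j
during [0, T]; Γ is a set of states and Δ is a set of pairs of states.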
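As an illustration of the definition, the integral I^{ij}_{ab}(T) can be approximated directly by numerical quadrature. The sketch below is a brute-force check of the definition, not one of the methods compared on this poster. It assumes NumPy and a diagonalizable Q; the 3-state rate matrix is made up and the function names are hypothetical.

```python
import numpy as np

def transition_matrix(Q, t):
    """P(t) = exp(Qt), computed from an eigendecomposition (assumes Q is diagonalizable)."""
    lam, U = np.linalg.eig(Q)
    return ((U * np.exp(lam * t)) @ np.linalg.inv(U)).real

def integral_I(Q, T, i, j, steps=400):
    """Matrix whose (a, b) entry approximates I^{ij}_{ab}(T) with the trapezoid rule."""
    ts = np.linspace(0.0, T, steps + 1)
    # Integrand at time t: P_ai(t) * P_jb(T - t), for all a and b at once.
    vals = np.array([np.outer(transition_matrix(Q, t)[:, i],
                              transition_matrix(Q, T - t)[j, :]) for t in ts])
    h = T / steps
    return h * (vals[0] / 2 + vals[1:-1].sum(axis=0) + vals[-1] / 2)

# Made-up 3-state rate matrix (rows sum to zero), for illustration only.
Q = np.array([[-1.0, 0.6, 0.4],
              [0.3, -0.5, 0.2],
              [0.5, 0.5, -1.0]])
T = 1.5

# Expected time in state 0 and expected number of 0 -> 1 jumps,
# each jointly with the end state b, given the start state a.
time_in_0 = integral_I(Q, T, 0, 0)
jumps_0_to_1 = Q[0, 1] * integral_I(Q, T, 0, 1)

# Sanity check: summing the time statistic over all states gives T * P(T).
total_time = sum(integral_I(Q, T, i, i) for i in range(3))
print(np.allclose(total_time, T * transition_matrix(Q, T)))
```

The sanity check uses the fact that the times spent in all states add up to T, so the sum over i of I^{ii}_{ab}(T) equals T P_ab(T).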
Paula Tataru
paula@birc.au.dk
89 42 31 75
Bioinformatics Research Centre
Aarhus University
C.F. Møllers Allé
8000 Aarhus C
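Methods

1. Eigenvalue Decomposition - eigen
Let Q = U Λ U^{-1} be the eigenvalue decomposition and
\[
J_{ij}(T) =
\begin{cases}
T e^{\lambda_i T} & \text{if } \lambda_i = \lambda_j \\
\dfrac{e^{\lambda_i T} - e^{\lambda_j T}}{\lambda_i - \lambda_j} & \text{if } \lambda_i \neq \lambda_j
\end{cases}
\]
then
\[
T^{\Gamma}(T) = U \cdot \left[ J \ast \left[ U^{-1} \cdot (1_{\Gamma} \cdot U) \right] \right] \cdot U^{-1}
\]
\[
N^{\Delta}(T) = U \cdot \left[ J \ast \left[ U^{-1} \cdot ((1_{\Delta} \ast Q) \cdot U) \right] \right] \cdot U^{-1}
\]

2. Uniformization - uni
Let us define \mu = \max_i (-Q_{ii}) and R = \frac{1}{\mu} Q + I, then
\[
T^{\Gamma}(T) = \frac{1}{\mu} \sum_{m=0}^{\infty} \mathrm{Pois}(m+1; \mu T) \sum_{l=0}^{m} R^{l} \cdot 1_{\Gamma} \cdot R^{m-l}
\]
\[
N^{\Delta}(T) = \frac{1}{\mu} \sum_{m=0}^{\infty} \mathrm{Pois}(m+1; \mu T) \sum_{l=0}^{m} R^{l} \cdot (1_{\Delta} \ast Q) \cdot R^{m-l}
\]

3. Exponentiation - expm
Let us define
\[
A_B = \begin{bmatrix} Q & B \\ 0 & Q \end{bmatrix}
\]
then
\[
T^{\Gamma}(T) = \left( e^{A_{1_{\Gamma}} T} \right)_{1:n,\,(n+1):2n}
\]
\[
N^{\Delta}(T) = \left( e^{A_{1_{\Delta} \ast Q} T} \right)_{1:n,\,(n+1):2n}
\]
For the exponentiation, I used the improved Scaling and
Squaring method with Padé approximation, implemented
in the R expm package.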
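All three methods evaluate the same matrix integral of e^{Qt} B e^{Q(T-t)} over [0, T], with B = 1_Γ for the time statistic and B = 1_Δ ∗ Q for the jump statistic. The following is a minimal NumPy sketch that cross-checks them on a made-up 3-state chain. It rests on several assumptions: 1_Γ is read as the diagonal 0/1 matrix of the states in Γ, 1_Δ as the 0/1 matrix of the pairs in Δ, and ∗ as the entry-wise product. The function names are hypothetical, the series truncation rule is a heuristic, and `expm_taylor` is a simple Taylor-based stand-in, not the Padé routine of the R expm package.

```python
import numpy as np

def stat_eigen(Q, B, T):
    """U [J * (U^-1 B U)] U^-1, where Q = U diag(lam) U^-1 (Q assumed diagonalizable)."""
    lam, U = np.linalg.eig(Q)
    U_inv = np.linalg.inv(U)
    e = np.exp(lam * T)
    diff = lam[:, None] - lam[None, :]
    same = np.isclose(diff, 0.0)
    # J_ij = T e^{lam_i T} if lam_i = lam_j, else (e^{lam_i T} - e^{lam_j T}) / (lam_i - lam_j)
    J = np.where(same, T * e[:, None],
                 (e[:, None] - e[None, :]) / np.where(same, 1.0, diff))
    return (U @ (J * (U_inv @ B @ U)) @ U_inv).real

def stat_uni(Q, B, T, tol=1e-12, max_terms=10_000):
    """(1/mu) sum_m Pois(m+1; mu T) sum_{l<=m} R^l B R^(m-l), truncated heuristically."""
    n = Q.shape[0]
    mu = np.max(-np.diag(Q))
    R = Q / mu + np.eye(n)
    S = np.array(B, dtype=float)   # inner sum for m = 0 is just B
    R_pow = np.eye(n)              # R^m
    pois = np.exp(-mu * T)         # Pois(0; mu T)
    cdf = pois
    total = np.zeros((n, n))
    for m in range(max_terms):
        pois *= mu * T / (m + 1)   # now Pois(m + 1; mu T)
        cdf += pois
        total += pois * S
        if 1.0 - cdf < tol:        # heuristic stop: remaining Poisson mass is tiny
            break
        R_pow = R_pow @ R
        S = R @ S + B @ R_pow      # inner sum for m + 1
    return total / mu

def expm_taylor(A, terms=20):
    """exp(A) by scaling and squaring with a Taylor series (stand-in for a Pade routine)."""
    norm = np.linalg.norm(A, 1)
    s = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    A_scaled = A / 2.0**s
    E = term = np.eye(A.shape[0])
    for k in range(1, terms + 1):
        term = term @ A_scaled / k
        E = E + term
    for _ in range(s):
        E = E @ E
    return E

def stat_expm(Q, B, T):
    """Upper-right n x n block of exp(A T), with A = [[Q, B], [0, Q]]."""
    n = Q.shape[0]
    A = np.block([[Q, B], [np.zeros((n, n)), Q]])
    return expm_taylor(A * T)[:n, n:]

# Made-up example: Gamma = {0}, Delta = {(0, 1), (1, 2)}.
Q = np.array([[-1.0, 0.6, 0.4],
              [0.3, -0.5, 0.2],
              [0.5, 0.5, -1.0]])
T = 1.5
one_gamma = np.diag([1.0, 0.0, 0.0])
one_delta = np.zeros((3, 3))
one_delta[0, 1] = one_delta[1, 2] = 1.0

for name, B in [("time", one_gamma), ("jumps", one_delta * Q)]:
    results = [f(Q, B, T) for f in (stat_eigen, stat_uni, stat_expm)]
    print(name, all(np.allclose(results[0], r) for r in results[1:]))
```

The sketch ignores numerical issues that matter for larger problems, such as underflow of the Poisson weights when μT is large or a nearly defective Q in the eigenvalue method.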
Results
The EM algorithm has been used to estimate a 61x61 rate matrix for codon evolution. The data set contained
16 codon-aligned DNA sequences from the HIV pol gene. For each data set (using the first 2, 3, …, 15, 16
sequences), a fixed tree was assumed.
[Figure: Running time. Time in seconds (sec, axis from 0 to 3.5) against the number of sequences (# seq, 2 to 16) for the eigen, uni and expm methods.]
CTMCs can be used to describe the evolution of DNA sequences. My
application involves looking at codon sequences and their evolution, and estimating
the CTMC from the data. In this case, the rate matrix is 61x61; it was first
introduced by Goldman and Yang (1994). It is given by:
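\[
Q_{ij} =
\begin{cases}
0 & \text{if } i \to j \text{ by more than one substitution} \\
\kappa \pi_j & \text{if } i \to j \text{ by a synonymous transition} \\
\pi_j & \text{if } i \to j \text{ by a synonymous transversion} \\
\omega \kappa \pi_j & \text{if } i \to j \text{ by a non-synonymous transition} \\
\omega \pi_j & \text{if } i \to j \text{ by a non-synonymous transversion}
\end{cases}
\qquad (i \neq j)
\]
where π is the stationary distribution (the codon frequencies), κ is the
transition / transversion rate ratio and ω is the non-synonymous / synonymous rate ratio.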
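The case distinction above can be written out as a short construction of the 61x61 matrix. This is a minimal sketch under assumptions that are not stated on the poster: it uses the standard genetic code, sets the diagonal so that every row sums to zero (the usual convention for a rate matrix), and does no rescaling. The function name and the parameter values are hypothetical.

```python
from itertools import product
import numpy as np

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
CODONS = [c for c, aa in CODE.items() if aa != "*"]   # the 61 sense codons
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def gy94_rate_matrix(kappa, omega, pi):
    """Rate matrix following the case distinction above; pi[j] is the frequency of CODONS[j]."""
    n = len(CODONS)
    Q = np.zeros((n, n))
    for i, ci in enumerate(CODONS):
        for j, cj in enumerate(CODONS):
            diffs = [(x, y) for x, y in zip(ci, cj) if x != y]
            if len(diffs) != 1:
                continue                  # identical codons, or more than one substitution
            rate = pi[j]
            if diffs[0] in TRANSITIONS:   # A <-> G or C <-> T
                rate *= kappa
            if CODE[ci] != CODE[cj]:      # amino acid changes: non-synonymous
                rate *= omega
            Q[i, j] = rate
    # Assumption (not on the poster): diagonal chosen so that each row sums to zero.
    Q[np.diag_indices(n)] = -Q.sum(axis=1)
    return Q

# Arbitrary illustrative values: uniform codon frequencies, kappa = 2, omega = 0.5.
pi = np.full(len(CODONS), 1.0 / len(CODONS))
Q = gy94_rate_matrix(kappa=2.0, omega=0.5, pi=pi)
print(Q.shape)
```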
To estimate Q, I use a maximum likelihood approach (optimized via the EM
algorithm) which requires the calculation of the summary statistics. For example,
I am interested in the sums of the expected number of jumps between states that
differ by a transition (which can be either synonymous or non-synonymous).
The sum is computed over the set Δ that contains these pairs of states:
\[
\Delta = \{ (i, j) \mid i \to j \text{ by a transition} \}
\]
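The set Δ can be enumerated directly from the codon alphabet. The sketch below is one possible reading, assuming the standard genetic code to exclude stop codons and taking "by a transition" to mean exactly one nucleotide difference that is A <-> G or C <-> T. The helper name `is_transition` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c, aa in zip(product(BASES, repeat=3), AMINO) if aa != "*"]
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def is_transition(ci, cj):
    """True if the codons differ at exactly one position, by A <-> G or C <-> T."""
    diffs = [(x, y) for x, y in zip(ci, cj) if x != y]
    return len(diffs) == 1 and diffs[0] in TRANSITIONS

# Delta as a set of index pairs (i, j) into CODONS; synonymous and
# non-synonymous transitions are both included.
delta = {(i, j)
         for i, ci in enumerate(CODONS)
         for j, cj in enumerate(CODONS)
         if is_transition(ci, cj)}

print(len(delta))
```

A 0/1 indicator matrix for Δ can then be filled by setting entry (i, j) to 1 for each pair in `delta`.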
The EM algorithm contains two steps: the expectation step, where the summary
statistics are computed, and the maximization step, where the likelihood is
maximized with respect to the unknown parameters.
In order to estimate Q, I also need a tree that shows the relationships between
the sequences. The tree describes how the sequences evolved from each other.