2. Probabilistic Document Clustering
and Topic Models
A popular method for probabilistic document clustering is that of topic
modeling.
The idea of topic modeling is to create a probabilistic generative model for the
text documents in the corpus.
The main approach is to represent a corpus as a function of hidden random
variables, the parameters of which are estimated using a particular document
collection.
The primary assumptions in any topic modeling approach (together with the
corresponding random variables) are as follows:
Each of the n documents in the corpus is assumed to have a probability of belonging
to each of the k topics.
Thus, a given document may have a probability of belonging to multiple
topics, and this reflects the fact that the same document may contain a
multitude of subjects.
3. For a given document Di, and a set of topics T1 . . . Tk, the probability that
the document Di belongs to the topic Tj is given by P(Tj |Di).
The topics are essentially analogous to clusters, and the value of P(Tj |Di)
provides a probability of cluster membership of the ith document to the jth
cluster.
In non-probabilistic clustering methods, the membership of documents to
clusters is deterministic in nature, and therefore the clustering is typically a
clean partitioning of the document collection.
This becomes a problem when there are overlaps in document subject matter across multiple
clusters; the use of a soft cluster membership in terms of probabilities is an
elegant solution to this dilemma.
4. In this scenario, the determination of the membership of the documents to
clusters is a secondary goal to that of finding the latent topical clusters in
the underlying text collection.
Although topic modeling is related to the clustering problem, it is often studied as a
distinct area of research from clustering.
The value of P(Tj |Di) is estimated using the topic modeling approach, and
is one of the primary outputs of the algorithm.
The value of k is one of the inputs to the algorithm and is analogous to the
number of clusters.
Each topic is associated with a probability vector, which quantifies the
probability of the different terms in the lexicon for that topic.
5. Let t1 . . . td be the d terms in the lexicon. Then, for a document that belongs
completely to topic Tj , the probability that the term tl occurs in it is given
by P(tl|Tj ).
The value of P(tl|Tj) is another important parameter which needs to be
estimated by the topic modeling approach.
The number of documents is denoted by n, the number of topics by k, and the lexicon size
(number of terms) by d.
Most topic modeling methods attempt to learn the above parameters using
maximum likelihood methods, so that the probabilistic fit to the given
corpus of documents is as large as possible.
There are two basic methods used for topic modeling:
Probabilistic Latent Semantic Indexing (PLSI)
Latent Dirichlet Allocation (LDA)
6. Probabilistic Latent Semantic Indexing Method
The sets of random variables P(Tj|Di) and P(tl|Tj) model the probability of a term tl
occurring in any document Di.
The probability P(tl|Di) of the term tl occurring in document Di can be expressed
in terms of these parameters as

P(tl | Di) = Σ_{j=1..k} P(tl | Tj) · P(Tj | Di).
For each term tl and document Di, this generates an n × d matrix of probabilities in
terms of these parameters,
where n is the number of documents and d is the number of terms.
For a given corpus, the n × d term-document occurrence matrix X, tells us
which term actually occurs in each document, and how many times the term
occurs in the document.
In other words, X(i, l) is the number of times that term tl occurs in document
Di.
Therefore, we can use a maximum likelihood estimation algorithm which
maximizes the product of the probabilities of terms that are observed in each
document in the entire collection.
7. The log-likelihood Σ_{i,l} X(i, l) · log P(tl | Di) is maximized subject to the constraints that
the probability values over each of the topic-document and term-topic spaces
must sum to 1:

Σ_{j=1..k} P(Tj | Di) = 1 for each document Di, and Σ_{l=1..d} P(tl | Tj) = 1 for each topic Tj.
The Lagrangian solution essentially leads to a set of iterative update equations
for the corresponding parameters that need to be estimated.
These parameters can be estimated with the iterative update of two matrices
[P1]k×n and [P2]d×k containing the topic-document probabilities and term-topic
probabilities respectively.
8. Initialize the matrices randomly and normalize each of them so that the
probability values in their columns sum to one.
Then, iteratively perform the following steps on each of P1 and P2
respectively:
The process is iterated to convergence.
The outputs of this approach are the two matrices P1 and P2, whose entries
provide the topic-document and term-topic probabilities respectively.
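A minimal sketch of this PLSI EM iteration in NumPy is shown below; it follows the standard PLSI update equations rather than the exact equations on the slides, and everything apart from the roles of X, P1, and P2 (function name, dense responsibility array, convergence handling) is an illustrative assumption.

# Illustrative PLSI EM sketch. X is a dense n x d count array,
# P1[j, i] ~ P(T_j | D_i), P2[l, j] ~ P(t_l | T_j).
import numpy as np

def plsi(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    P1 = rng.random((k, n)); P1 /= P1.sum(axis=0)   # columns sum to one
    P2 = rng.random((d, k)); P2 /= P2.sum(axis=0)
    for _ in range(iters):
        # E-step: responsibility R[i, l, j] of topic j for term l in document i
        R = P1.T[:, None, :] * P2[None, :, :]       # shape (n, d, k)
        R /= R.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate both probability matrices from expected counts
        C = X[:, :, None] * R                       # expected counts, shape (n, d, k)
        P1 = C.sum(axis=1).T; P1 /= P1.sum(axis=0) + 1e-12   # new P(T_j | D_i)
        P2 = C.sum(axis=0);   P2 /= P2.sum(axis=0) + 1e-12   # new P(t_l | T_j)
    return P1, P2

A practical implementation would loop only over the non-zero entries of X rather than forming the dense n × d × k responsibility array.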
9. Latent Dirichlet Allocation
The term-topic probabilities and topic-document probabilities are modeled
with a Dirichlet distribution as a prior.
The LDA method is the Bayesian version of the PLSI technique; the PLSI method is
equivalent to the LDA technique when applied with a uniform Dirichlet prior.
The LDA method can be used to model the topic distribution of a new
document more robustly, even if it is not present in the original data set.
The EM concepts used for topic modeling are quite general and can be used for
different variations of the text clustering task, such as text classification or
incorporating user feedback into clustering.
10. LDA’s main advantage over the PLSI method is that it is not quite as
susceptible to overfitting.
This is generally true of Bayesian methods which reduce the number of
model parameters to be estimated, and therefore work much better for
smaller data sets.
Even for larger data sets, PLSI has the disadvantage that the number of
model parameters grows linearly with the size of the collection.
The PLSI model is not a fully generative model, because there is no
accurate way to model the topical distribution of a document which is not
included in the current data set.
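For orientation, the two parameter sets described above can be obtained from an off-the-shelf LDA implementation; the sketch below uses scikit-learn's LatentDirichletAllocation on a toy corpus. The corpus, the topic count, and the row-normalization of components_ into term probabilities are illustrative assumptions, not material from the slides.

# Illustrative use of an off-the-shelf LDA implementation (scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stock market trading prices",
        "football match goal score",
        "market prices fall on the trading floor",
        "the team scored a late goal"]
X = CountVectorizer().fit_transform(docs)           # n x d term-document count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topics = lda.transform(X)                       # analogue of P(T_j | D_i)
term_topics = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # ~ P(t_l | T_j)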
11. Probabilistic Models for Information
Extraction
Probabilistic models show better accuracy and robustness against noise
than categorical models.
Useful for the different tasks in extracting meaning from natural
language texts.
Most prominent among these probabilistic approaches are
Hidden Markov Models (HMMs),
Stochastic Context-free Grammars (SCFG), and
Maximal Entropy (ME).
12. Probabilistic Models for Information
Extraction Hidden Markov Models
The Three Classic Problems Related to HMMs
The Forward–Backward Procedure
The Viterbi Algorithm
The Training of the HMM
Dealing with Training Data Sparseness
Stochastic Context-Free Grammars
Using SCFGs
Maximal Entropy Modeling
Computing the Parameters of the Model
Maximal Entropy Markov Models
Training the MEMM
Conditional Random Fields
The Three Classic Problems Relating to CRF
Computing the Conditional Probability
Finding the Most Probable Label Sequence
Training the CRF
13. Hidden Markov Models
An HMM is a finite-state automaton with stochastic state transitions
and symbol emissions.
The automaton models a probabilistic generative process.
In this process, a sequence of symbols is produced by
Starting in an initial state,
Emitting a symbol selected by the state,
Making a transition to a new state,
Emitting a symbol selected by the state, and
Repeating this transition–emission cycle until a designated final
state is reached.
14. HMM Assumptions
Markov assumption: the state transition depends only on the origin and
destination
Output-independent assumption: all observation frames are dependent on
the state that generated them, not on neighbouring observation frames
15. Formally,
Let O = {o1, . . . oM} - finite set of observation symbols and
Q ={q1, . . . qN} - finite set of states.
A first-order Markov model λ is a triple (π, A, B),
where π : Q→ [0, 1] defines the starting probabilities,
A : Q× Q→ [0, 1] defines the transition probabilities, and
B : Q× O→ [0, 1] denotes the emission probabilities.
Since the functions π, A, and B define true probabilities, they must satisfy

Σ_{q∈Q} π(q) = 1,
Σ_{o∈O} B(q, o) = 1 for all states q, and
Σ_{q'∈Q} A(q, q') = 1 for all states q.

A model λ together with the random process described above induces a
probability distribution over the set O* of all possible observation
sequences.
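A minimal sketch of how such a model λ = (π, A, B) might be stored and checked, assuming integer-indexed states and symbols; all names and sizes are illustrative.

# Illustrative HMM container: pi[q], A[q, q'], B[q, o] as NumPy arrays.
import numpy as np

n_states, n_symbols = 3, 2
rng = np.random.default_rng(0)
pi = rng.random(n_states);              pi /= pi.sum()
A  = rng.random((n_states, n_states));  A  /= A.sum(axis=1, keepdims=True)
B  = rng.random((n_states, n_symbols)); B  /= B.sum(axis=1, keepdims=True)

# The constraints above: pi sums to 1, each row of A and each row of B sums to 1.
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)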
16. The Three Classic Problems
Related to HMMs
Most applications of hidden Markov models can be reduced to three basic
problems:
1. Find P(T | λ) [Evaluation] – the probability of a given observation
sequence T in a given model λ (compute the probability distribution induced
by the model).
2. Find argmax_{S∈Q^|T|} P(T, S | λ) [Decoding] – the most likely state
trajectory given λ and T (find the most probable state sequence for a given
observation sequence).
3. Find argmax_λ P(T | λ) [Learning] – the model that best accounts for a
given sequence (adjust the model itself to maximize the likelihood of the
given observation).
17. Description of how these three problems can be solved:
Calculate P(T | λ), where T is a sequence of observation symbols T = t1t2 . . .
tk ∈ O∗.
Enumerate every possible state sequence of length |T|.
Let S = s1,s2 . . . s|T| ∈ Q|T| be one such sequence.
Calculate the probability P(T | S, λ) of generating T knowing that the process
went through the states sequence S.
By Markovian assumption, the emission probabilities are all independent of
each other. Therefore,
P(T | S, λ) = ∏_{i=1..|T|} B(s_i, t_i).
18. Similarly, the transition probabilities are independent. Thus the probability
P(S | λ) for the process to go through the state sequence S is

P(S | λ) = π(s_1) · ∏_{i=1..|T|-1} A(s_i, s_{i+1}).

Using the above probabilities, we find that the probability P(T | λ) of generating
the sequence can be calculated as

P(T | λ) = Σ_{S∈Q^|T|} P(T | S, λ) · P(S | λ).

This solution is of course infeasible in practice because of the exponential
number of possible state sequences.
To solve the problem efficiently, we use a dynamic programming technique.
The resulting algorithm is called the forward–backward procedure.
19. The Forward–Backward Procedure
Let αm(q), the forward variable, denote the probability of generating the
initial segment t1, t2 . . . tm of the sequence T and finishing at the state q at
time m. This forward variable can be computed recursively as follows:
1. α_1(q) = π(q) · B(q, t_1),
2. α_{n+1}(q) = [ Σ_{q'∈Q} α_n(q') · A(q', q) ] · B(q, t_{n+1}).

Then, the probability of the whole sequence T can be calculated as

P(T | λ) = Σ_{q∈Q} α_{|T|}(q).
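The recursion translates directly into code; a sketch below reuses the pi, A, B arrays from the earlier illustrative sketch and an integer-coded observation sequence (these conventions are assumptions).

# Forward procedure: alpha[m, q] holds the forward variable for time m+1.
import numpy as np

def forward(pi, A, B, obs):
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]                        # step 1 of the recursion
    for m in range(1, len(obs)):
        alpha[m] = (alpha[m - 1] @ A) * B[:, obs[m]]    # step 2 of the recursion
    return alpha

# P(T | lambda) is the sum of the last row:
# prob_T = forward(pi, A, B, [0, 1, 0])[-1].sum()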
20. In a similar manner, one can define βm(q), the backward variable, which
denotes the probability of starting at the state q and generating the final
segment tm+1 . . . t|T| of the sequence T.
The backward variable can be calculated starting from the end and going
backward to the beginning of the sequence:
1. β_{|T|}(q) = 1,
2. β_n(q) = Σ_{q'∈Q} A(q, q') · B(q', t_{n+1}) · β_{n+1}(q').

The probability of the whole sequence is then

P(T | λ) = Σ_{q∈Q} π(q) · B(q, t_1) · β_1(q).
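The backward recursion, in the same illustrative style as the forward sketch above:

# Backward procedure: beta[m, q] holds the backward variable for time m+1.
import numpy as np

def backward(A, B, obs):
    beta = np.zeros((len(obs), A.shape[0]))
    beta[-1] = 1.0                                      # step 1: beta_|T|(q) = 1
    for m in range(len(obs) - 2, -1, -1):
        beta[m] = A @ (B[:, obs[m + 1]] * beta[m + 1])  # step 2 of the recursion
    return beta

# P(T | lambda) = sum_q pi(q) * B(q, t_1) * beta_1(q):
# prob_T = (pi * B[:, obs[0]] * backward(A, B, obs)[0]).sum()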
21. The Viterbi Algorithm
Solution of the second problem – finding the most likely state sequence for a
given sequence T.
As with the previous problem, enumerating all possible state sequences S
and choosing the one maximizing P(T, S | λ) is infeasible.
Dynamic programming utilizes the following property of the optimal
state sequence: if T' = t_1 t_2 . . . t_{|T'|} is some initial segment of the sequence T = t_1 t_2 . . . t_|T|
and S = s_1 s_2 . . . s_|T| is a state sequence maximizing P(T, S | λ), then
S' = s_1 s_2 . . . s_{|T'|} maximizes P(T', S' | λ) among all state sequences of
length |T'| ending with s_{|T'|}.
The resulting algorithm is called the Viterbi algorithm.
22. Let γ n(q) denote the state sequence ending with the state q, which is
optimal for the initial segment Tn = t1t2 . . . tn among all sequences ending
with q, and let δn(q) denote the probability P(Tn, γ n(q) | λ) of generating
this initial segment following those optimal states. Delta and gamma can
be recursively calculated as follows:
1. δ_1(q) = π(q) · B(q, t_1), γ_1(q) = (q),
2. δ_{n+1}(q) = max_{q'∈Q} [ δ_n(q') · A(q', q) ] · B(q, t_{n+1}), γ_{n+1}(q) = γ_n(q*) · q,

where q* = argmax_{q'∈Q} δ_n(q') · A(q', q) · B(q, t_{n+1}) and γ_n(q*) · q denotes appending q to γ_n(q*).

Then, the best state sequence among {γ_|T|(q) : q ∈ Q} is the optimal one:

argmax_{S∈Q^|T|} P(T, S | λ) = γ_|T|( argmax_{q∈Q} δ_|T|(q) ).
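The δ/γ recursion translates almost line for line into code; in the sketch below, back-pointers play the role of γ, and the pi, A, B conventions follow the earlier illustrative sketches.

# Viterbi algorithm: returns the most probable state sequence for obs.
import numpy as np

def viterbi(pi, A, B, obs):
    n, N = len(obs), len(pi)
    delta = np.zeros((n, N))
    back  = np.zeros((n, N), dtype=int)          # back-pointers encode gamma
    delta[0] = pi * B[:, obs[0]]
    for m in range(1, n):
        # scores[q', q] = delta_{m}(q') * A(q', q) * B(q, t_{m+1})
        scores = delta[m - 1][:, None] * A * B[:, obs[m]][None, :]
        back[m]  = scores.argmax(axis=0)
        delta[m] = scores.max(axis=0)
    # Reconstruct the optimal sequence by following the back-pointers.
    states = [int(delta[-1].argmax())]
    for m in range(n - 1, 0, -1):
        states.append(int(back[m, states[-1]]))
    return states[::-1]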
23. Example of the Viterbi Computation
Using the HMM described in the figure below with the sequence (a, b, a), the Viterbi
algorithm proceeds through the following steps:
[Figure: A sample HMM]
24. Computation of the optimal path
using the Viterbi algorithm
Two optimal paths: {S1, S3, S1} and {S3, S2, S3}.
25. The Training of the HMM
Baum–Welch re-estimation formulas
Let μn(q) be the probability P(sn = q | T, λ) of being in the state q at time n
while generating the observation sequence T. Then μn(q) ·
P(T | λ) is the probability of generating T passing through the state q at time n.
By definition of the forward and backward variables, this probability is equal
to αn(q) · βn(q). Thus,

μ_n(q) = α_n(q) · β_n(q) / P(T | λ).

Also let ϕn(q, q') be the probability P(sn = q, sn+1 = q' | T, λ) of passing from
state q to state q' at time n while generating the observation sequence T. As in
the preceding equation,

ϕ_n(q, q') = α_n(q) · A(q, q') · B(q', t_{n+1}) · β_{n+1}(q') / P(T | λ).

The sum of μn(q) over all n = 1 . . . | T | can be seen as the expected number of
times the state q was visited while generating the sequence T.
Or, if one sums over n = 1 . . . | T |−1, the expected number of transitions out of
the state q results because there is no transition at time |T|.
26. Similarly, the sum of ϕn(q, q') over all n = 1 . . . | T | −1 can be interpreted as
the expected number of transitions from the state q to q'
The Baum–Welch formulas re-estimate the parameters of the model λ
according to these expectations:

π'(q) := μ_1(q),
A'(q, q') := Σ_{n=1..|T|-1} ϕ_n(q, q') / Σ_{n=1..|T|-1} μ_n(q),
B'(q, o) := Σ_{n=1..|T|, t_n = o} μ_n(q) / Σ_{n=1..|T|} μ_n(q).

It can be shown that the model λ' = (π', A', B') either is equal to λ, in which
case λ is a critical point of the likelihood function P(T | λ), or λ'
accounts for the training sequence T better than the original model λ in the
sense that P(T | λ') > P(T | λ).
Therefore, the training problem can be solved by iteratively applying the re-
estimation formulas until convergence.
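Combining the earlier forward() and backward() sketches, one re-estimation step for a single training sequence might look as follows. This is illustrative only; a full trainer would iterate until P(T | λ) stops improving, and would typically sum expectations over many sequences.

# One Baum-Welch re-estimation step for a single observation sequence obs,
# reusing the illustrative forward() and backward() sketches above.
import numpy as np

def baum_welch_step(pi, A, B, obs):
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    prob_T = alpha[-1].sum()
    mu = alpha * beta / prob_T                      # mu[n, q] = P(s_n = q | T, lambda)
    B_next = B[:, obs[1:]].T                        # B_next[n, q'] = B(q', t_{n+1})
    phi = (alpha[:-1, :, None] * A[None, :, :]
           * B_next[:, None, :] * beta[1:, None, :]) / prob_T
    new_pi = mu[0]                                  # pi'(q) = mu_1(q)
    new_A = phi.sum(axis=0) / mu[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for n, o in enumerate(obs):                     # expected emission counts
        new_B[:, o] += mu[n]
    new_B /= mu.sum(axis=0)[:, None]
    return new_pi, new_A, new_B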
27. Dealing with Training Data Sparseness
Two techniques for dealing with data sparseness in probabilistic modeling:
Smoothing
Shrinkage
Smoothing
Process of flattening a probability distribution implied by a model so that
all reasonable sequences can occur with some probability.
Broadening the distribution by redistributing weight from high-probability
regions to zero-probability regions.
Example
Laplace smoothing
o Every possible training event occurs one time more than it actually
does. Any constant can be used instead of “one.”
Other possible methods may include back-off smoothing, deleted
interpolation, and others.
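A two-line illustration of Laplace smoothing on a vector of raw counts; the counts and the constant k are made-up values.

# Laplace smoothing: add a constant to every possible event before
# normalizing, so that no event has zero probability.
import numpy as np

counts = np.array([5, 0, 3, 0])          # raw counts of four possible events
k = 1.0                                  # "one" in Laplace smoothing; any constant works
smoothed = (counts + k) / (counts + k).sum()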
28. Shrinkage
Defined in terms of some hierarchy representing the expected similarity
between parameter estimates.
With respect to HMMs, the hierarchy can be defined as a tree with the
HMM states for the leaves – all at the same depth.
Hierarchy is created as follows:
First, the most complex HMM is built and its states are used for the leaves
of the tree.
Then the states are separated into disjoint classes within which the states are
expected to have similar probability distributions.
The classes become the parents of their constituent states in the hierarchy
(HMM structure at the leaves induces a simpler HMM structure at the level
of the classes).
This simpler HMM is generated by summing the probabilities of emissions and transitions of
all states in a class.
This process may be repeated until only a single-state HMM remains at the
root of the hierarchy
29. Successful Application Areas of HMM
Online handwriting recognition
Speech recognition
Gesture recognition
Language recognition
Motion Video analysis and tracking
Protein sequence / gene sequence alignment
Stock price prediction
…
30. Stochastic context-free grammars
An SCFG is a quintuple G = (T, N, S, R, P),
where, T is the alphabet of terminal symbols (tokens),
N is the set of nonterminals, S is the starting nonterminal,
R is the set of rules, and P : R → [0, 1] defines their probabilities.
The rules have the form: n→ s1s2 . . . sk,
where, n is a nonterminal and
si is either a token or another nonterminal.
An SCFG generates (or accepts) a given string (sequence of tokens) if the string can
be produced starting from a sequence containing just the starting symbol S and
expanding nonterminals one by one in the sequence using the rules from the
grammar.
The string generated can be naturally represented by a parse tree,
Starting symbol as a root,
Nonterminals as internal nodes, and
Tokens as leaves.
31. SCFG is a usual context-free grammar with the addition of the P function.
The semantics of the probability function P are straightforward.
If r is the rule n → s1s2 . . . sk, then P(r) is the frequency of expanding n
using this rule.
In Bayesian terms, if it is known that a given sequence of tokens was
generated by expanding n, then P(r) is the a priori likelihood that n was
expanded using the rule r.
For every nonterminal n the sum Σ P(r ) of probabilities of all rules r
headed by n must be equal to one.
32. Using SCFGs
Classical definition of SCFG:
It is assumed that the rules are all independent.
Find the (unconditional) probability of a given parse tree by simply
multiplying the probabilities of all rules participating in it (a small sketch follows this slide).
Parsing problem is formulated as follows:
Given a sequence of tokens (a string), find the most probable parse tree
that could generate the string.
A simple generalization of the Viterbi algorithm is able to solve this
problem efficiently.
Practical applications of SCFGs:
It is rarely the case that the rules are truly independent.
Let the probabilities P(r) be conditioned on the context where the rule is
applied.
If the conditioning context is chosen reasonably, the Viterbi algorithm still
works correctly even for this more general problem.
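A minimal sketch of the "multiply the rule probabilities" computation for a tiny made-up grammar; the grammar, the tree encoding, and the probability values are illustrative assumptions, and rule independence is assumed as in the classical definition.

# Probability of a parse tree under an SCFG = product of the probabilities
# of all rules used in the tree (assuming rule independence).
P = {                                   # toy grammar: rule -> probability
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("dog",)):     0.6,
    ("NP", ("cat",)):     0.4,
    ("VP", ("barks",)):   0.7,
    ("VP", ("sleeps",)):  0.3,
}

def tree_prob(tree):
    # tree = (nonterminal, [children]); leaves are plain token strings
    head, children = tree
    rule = (head, tuple(c if isinstance(c, str) else c[0] for c in children))
    p = P[rule]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

# Probability of the tree for "dog barks": 1.0 * 0.6 * 0.7 = 0.42
print(tree_prob(("S", [("NP", ["dog"]), ("VP", ["barks"])])))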
33. Maximal Entropy Modeling
Consider a random process of an unknown nature that produces a single output
value y, a member of a finite set Y of possible output values.
The process of generating y may be influenced by some contextual information
x – a member of the set X of possible contexts.
The task is to construct a statistical model that accurately represents the
behavior of the random process.
Such a model is a method of estimating the conditional probability of
generating y given the context x.
Let P(x, y) denote the unknown true joint probability distribution of the
random process, and let p(y | x) be the model we are trying to build, taken from
the class ℘ of all possible models.
To build the model we are given a set of training samples generated by
observing the random process for some time.
The training data consist of a sequence of pairs (xi, yi) of different outputs
produced in different contexts.
34. In many cases the set X is too large and underspecified to be used directly.
For instance, X may be the set of all dots “.” in all possible English texts.
In contrast, Y may be extremely simple while remaining interesting.
In the preceding case, the Y may contain just two outcomes:
“SentenceEnd” and “NotSentenceEnd.”
The target model p(y | x) would in this case solve the problem of finding sentence
boundaries.
In such cases it is impossible to use the context x directly to generate the output y.
There are usually many regularities and correlations, however, that can be
exploited.
Different contexts are usually similar to each other in all manner of ways, and
similar contexts tend to produce similar output distributions.
To express such regularities and their statistics, we can use constraint functions
and their expected values.
A constraint function f : X × Y → R can be any real-valued function.
Binary-valued trigger functions:
Such a trigger function returns one for pair (x, y) if the context x satisfies the
condition predicate C and the output value y is yi.
A common short notation for such a trigger function is C→yi.
For the example above, useful triggers are
previous token is “Mr”→NotSentenceEnd,
next token is capitalized→SentenceEnd.
Given a constraint function f, we express its importance by requiring our target model to
faithfully reproduce f's expected value in the true distribution:

Σ_{x,y} P(x) · p(y | x) · f(x, y) = Σ_{x,y} P(x, y) · f(x, y).
36. In practice we cannot calculate the true expectation and must use an empirical
expected value calculated by summing over the training samples:

(1/N) Σ_{i=1..N} Σ_{y∈Y} p(y | x_i) · f(x_i, y) = (1/N) Σ_{i=1..N} f(x_i, y_i),

where N is the number of training samples.
The choice of feature functions is domain dependent. Let us assume the
complete set of features F={ fk} is given.
We compensate for the possible incompleteness of the set of features by requiring that the model agree with
all the expected value constraints while otherwise being as uniform as possible.
The uniformity requirement defines the target model uniquely. The degree of
uniformity of a model is expressed by its conditional entropy

H(p) = − Σ_{x,y} P(x) · p(y | x) · log p(y | x),

or, empirically,

H_E(p) = − (1/N) Σ_{i=1..N} Σ_{y∈Y} p(y | x_i) · log p(y | x_i).
37. The constrained optimization problem of finding the maximal-entropy target
model is solved by application of Lagrange multipliers and the Kuhn–Tucker
theorem.
Let us introduce a parameter λk (the Lagrange multiplier) for every feature.
Define the Lagrangian Λ(p, λ) by

Λ(p, λ) = H_E(p) + Σ_k λ_k · ( E_p[f_k] − E_E[f_k] ),

where E_p[f_k] and E_E[f_k] denote the model and empirical expectations of f_k.
Holding λ fixed, we compute the unconstrained maximum of the Lagrangian
over all p ∈ ℘. Denote by pλ the p where Λ(p, λ) achieves its maximum and by
Ψ(λ) the value of Λ at this point.
The functions pλ and Ψ(λ) can be calculated using simple calculus:

p_λ(y | x) = (1/Z_λ(x)) · exp( Σ_k λ_k f_k(x, y) ),

where Z_λ(x) is a normalizing constant determined by the requirement that
Σ_{y∈Y} p_λ(y | x) = 1.
The dual optimization problem is to find λ* = argmax_λ Ψ(λ).
38. The Kuhn–Tucker theorem asserts that, under certain conditions, the
solutions of the primal and dual optimization problems coincide.
The model p, which maximizes HE(p) while satisfying the constraints, has
the parametric form pλ*.
The function Ψ(λ) is simply the log-likelihood of the training sample as
predicted by the model pλ.
Thus, the model pλ* maximizes the likelihood of the training sample
among all models of the parametric form pλ.
39. Computing the Parameters of the Model
The function Ψ(λ) is well behaved from the perspective of numerical optimization,
for it is smooth and concave.
Consequently, various methods can be used for calculating λ*.
Generalized iterative scaling is the algorithm specifically tailored for the
problem. This algorithm is applicable whenever all constraint functions are non-
negative: fk(x, y) ≥ 0.
The algorithm starts with an arbitrary choice of λ’s – for instance λk= 0 for all k.
At each iteration the λ's are adjusted by λ_k := λ_k + Δλ_k.
In the simplest case, when f# = Σ_k f_k(x, y) is constant, Δλ_k is simply
(1/f#) · log( E_E[f_k] / E_{p_λ}[f_k] ), the log-ratio of the empirical and model expectations of f_k.
Otherwise, any numerical algorithm for solving the resulting equation can be used, such as Newton's
method. (A small sketch of GIS follows this slide.)
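A compact sketch of generalized iterative scaling for a conditional model. The training-data format, the feature-function interface, and the treatment of f# as the maximum feature sum are all illustrative assumptions; when Σ_k f_k(x, y) is not constant, a slack feature would normally be added.

# Generalized iterative scaling (GIS) sketch for a conditional maxent model.
# train is a list of (x, y) pairs; feats(x, y) returns a non-negative NumPy
# feature vector of length n_feats; Y is the set of possible outputs.
import numpy as np

def gis(train, feats, Y, n_feats, iters=50):
    lam = np.zeros(n_feats)
    f_sharp = max(feats(x, y).sum() for x, _ in train for y in Y)   # assumed constant
    emp = sum(feats(x, y) for x, y in train) / len(train)           # empirical E[f_k]
    for _ in range(iters):
        model = np.zeros(n_feats)
        for x, _ in train:
            scores = np.array([np.exp(lam @ feats(x, y)) for y in Y])
            p = scores / scores.sum()                                # p_lambda(y | x)
            model += sum(p_y * feats(x, y) for p_y, y in zip(p, Y))
        model /= len(train)                                          # model E[f_k]
        lam += np.log(emp / model) / f_sharp                         # GIS update
    return lam

# Assumes every feature has non-zero empirical and model expectation.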
40. Maximal Entropy Markov Models
A MEMM is a probabilistic finite-state acceptor.
Unlike an HMM, which has separate transition and emission probabilities,
a MEMM has only transition probabilities, which depend on the observations.
A slightly modified version of the Viterbi algorithm solves the problem of
finding the most likely state sequence for a given observation sequence.
A MEMM consists of a set Q = {q1, . . . , qN} of states and a set of transition
probability functions Aq : X × Q → [0, 1], where X denotes the set of all
possible observations. Aq(x, q') gives the probability P(q' | q, x) of transition
from q to q', given the observation x.
The model does not generate x but only conditions on it.
The set X need not be small and need not even be fully defined.
The transition probabilities Aq are separate exponential models trained using
maximal entropy.
41. The task of a trained MEMM is to produce the most probable sequence of states given
the observation, solved by a simple modification of the Viterbi algorithm.
The forward–backward algorithm loses its meaning because here it computes the
probability of the observation being generated by any state sequence, which is always
one.
The forward and backward variables are still useful for the MEMM training. The
forward variable αm(q) (defined as for the HMM) denotes the probability of being in state q at time m
given the observation. It is computed recursively as

α_{m+1}(q) = Σ_{q'∈Q} α_m(q') · A_{q'}(x_{m+1}, q).

The backward variable βm(q) denotes the probability of starting from state q at time m given
the observation. It is computed similarly as

β_m(q) = Σ_{q'∈Q} A_q(x_{m+1}, q') · β_{m+1}(q').

The model Aq for transition probabilities from a state is defined parametrically using
constraint functions. If fk : X × Q → R is the set of such functions for a given state q, then
the model Aq can be represented in the form

A_q(x, q') = (1/Z(x, q)) · exp( Σ_k λ_k f_k(x, q') ),

where λk are the parameters to be trained and Z(x, q) is the normalizing factor making
the probabilities of all transitions from a state sum to one.
42. Training the MEMM
If the true states sequence for the training data is known, the parameters of the
models can be straightforwardly estimated using the GIS algorithm for training ME
models.
If the sequence is not known (for instance, if there are several states with the same
label in a fully connected MEMM), the parameters must be estimated using a
combination of the Baum–Welch procedure and iterative scaling.
Every iteration consists of two steps:
1. Using the forward–backward algorithm and the current transition functions to
compute the state occupancies for all training sequences.
2. Computing the new transition functions using GIS with the feature frequencies
based on the state occupancies computed in step 1.
It is unnecessary to run GIS to convergence in step 2; a single GIS iteration is
sufficient.
43. Conditional Random Fields(CRF)
Problem description
Why conditional random fields(CRF)
Introduction to CRF
CRF model
Inference of CRF
Learning of CRF
44. Problem Description
Given observed data X, we wish to predict Y (labels)
Example:
X = {Temperature, Humidity, ...} Xn = observation on day n
Y = {Sunny, Rainy, Cloudy} Yn = weather on day n
[Diagram: observations such as 30°C temperature, 20% humidity, and a light breeze map to a hidden label (Sunny? Rainy? Cloudy?); the observations may depend on one another, and the label may depend on the weather of yesterday.]
45. Generative Model vs. Discriminative Model
Generative model
A model that generates the observed data randomly
Model the joint probability p(x,y)
Discriminative model
Directly estimate the posterior probability p(y|x)
Aim at modeling the “discrimination” between different outputs
              Single variable           Sequence                     General
Generative    Naïve Bayes, …            HMM, …                       Bayesian network, MRF, …
Conditional   Logistic regression, …    Linear-chain CRF, MEMM, …    General CRF, …
46. Why Conditional Random Fields
Generative model
Generative model targets to find the joint probability p(x,y) and make the
prediction based on Bayes rule to calculate p(y|x)
Ex: Naive Bayes (single output) and HMM (Hidden Markov Model)
(sequence output)
Naïve Bayes (x is a vector of K features; assume that given y, the features are independent):

p(x, y) = p(y) · ∏_{k=1..K} p(x_k | y)

HMM (sequence output; assumptions: 1. each state y_t depends only on its immediate predecessor, 2. conditional independence of the observations given the states):

p(x, y) = ∏_{t=1..T} p(y_t | y_{t-1}) · p(x_t | y_t)
47. Why Conditional Random Fields
[Diagram: the weather example as a generative model (an arrow A → B means A causes B), with observations Mon. {30°C, 20%, light breeze}, Tue. {28°C, 30%, light breeze}, Wed. {25°C, 40%, moderate breeze}, Thu. {22°C, 60%, moderate breeze}; humidity, temperature, and the wind scale are treated as independent.]
48. Why Conditional Random Fields
Difficulties for generative models
Not practical to represent multiple interacting features (hard to
model p(x)) or long-range dependencies of the observations
Very strict independence assumptions on the observations
[Diagram: the same Mon.–Thu. observation sequence, illustrating interacting features and long-range dependencies among the observations.]
49. Why Conditional Random Fields
Discriminative models
Directly model the posterior p(y|x)
Aim at modeling the “discrimination” between different outputs
Ex: logistic regression (maximum entropy) and CRF
Advantages of discriminative models
The training process aims at finding optimal coefficients for the features
regardless of whether the features are correlated
Not sensitive to unbalanced training data
Especially for the classification problem, we don’t have to care about
p(x)
50. Why Conditional Random Fields
Logistic regression (maximum entropy)
Suppose we have a bin of candies, each with an associated label (A,B,C, or D)
Each candy has multiple colors in its wrapper
Each candy is assigned a label randomly based on some distribution over
wrapper colors
Observation: the color
of the wrapper
Label: 4 kinds of flavors
A: chocolate
B: strawberry
C: lemon
D: milk
51. Why Conditional Random Field
For any candy with a red label pulled from the bin:
P(A|red)+P(B|red)+P(C|red)+P(D|red) = 1
Infinite number of distributions exist that fit this constraint
The distribution that fits with the idea of maximum entropy is: (the
most uniform)
o P(A|red)=0.25
o P(B|red)=0.25
o P(C|red)=0.25
o P(D|red)=0.25
52. Why Conditional Random Field
Now suppose we add some evidence to our model
We note that 80% of all candies with red labels are either labeled A or B
o P(A|red) + P(B|red) = 0.8
The updated model that reflects this would be:
o P(A|red) = 0.4
o P(B|red) = 0.4
o P(C|red) = 0.1
o P(D|red) = 0.1
As we make more observations and find more constraints, the model gets
more complex
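The 0.4/0.4/0.1/0.1 solution can also be recovered numerically by maximizing entropy under the two constraints; the small sketch below uses SciPy, and the variable names and tolerance bounds are illustrative.

# Maximize the entropy of (P(A|red), P(B|red), P(C|red), P(D|red)) subject to
# P(A|red) + P(B|red) = 0.8 and all four probabilities summing to 1.
import numpy as np
from scipy.optimize import minimize

neg_entropy = lambda p: np.sum(p * np.log(p))
cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},
        {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.8}]
res = minimize(neg_entropy, x0=np.full(4, 0.25), bounds=[(1e-9, 1)] * 4,
               constraints=cons)
print(res.x)          # approximately [0.4, 0.4, 0.1, 0.1]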
53. Why Conditional Random Field
Given a collection of facts, choose a model which is consistent with all
the facts, but otherwise as uniform as possible
p(y | x; w) = exp( Σ_j w_j F_j(x, y) ) / Z(x, w),

where Z(x, w) = Σ_y exp( Σ_j w_j F_j(x, y) ) is a normalization term, the weights w_j are obtained by learning, and the F_j are defined feature functions over the evidence x.

[Factor graph: a single output node y connected to the observation nodes x1, x2, . . . , xd; a factor between nodes A and B is a function f(A, B).]
54. Linear-Chain CRF
If we extend the logistic regression to a sequence problem:

p(y | x; w) = exp( Σ_j w_j F_j(x, y) ) / Z(x, w),

where Z(x, w) = Σ_y exp( Σ_j w_j F_j(x, y) ) is a normalization term, and
F_j(x, y) = Σ_t f_j(y_{t-1}, y_t, x) is a sum along the entire sentence.

[Factor graph: a chain of output nodes . . . , y_{t-1}, y_t, y_{t+1}, . . . , each connected to the entire observation x = (x1, x2, . . . , xd).]
56. General CRF
Divide Graph G into many templates ψA. The parameters inside each
template are tied
K(A) is the number of feature functions for the template
p(y | x) = (1/Z(x)) · exp( Σ_{ψ_A ∈ G} Σ_{k=1..K(A)} λ_{Ak} · f_{Ak}(y_A, x_A) )
57. Inference of CRF
Problem description:
Given the observations({xi}) and the probability model(parameters
such as ωi mentioned above), we target to find the best state
sequence
For general graphs, the problem of exact inference in CRFs is
intractable
Chain- or tree-like CRFs can yield exact inference
Otherwise, approximate solutions are used
58. Inference of Linear-Chain CRF
• The inference of linear-chain CRF is very similar to that of HMM
Example: POS(part of speech) tagging
The identification of words as nouns, verbs, adjectives, adverbs,
etc.
Students need another break
noun verb article noun
60. Inference of Linear-Chain CRF
Then back to CRF
y* = argmax_y p(y | x; w)
   = argmax_y exp( Σ_j w_j F_j(x, y) ) / Z(x, w)
   = argmax_y Σ_j w_j F_j(x, y)
   = argmax_y Σ_j w_j Σ_i f_j(y_{i-1}, y_i, x)
   = argmax_y Σ_i Σ_j w_j f_j(y_{i-1}, y_i, x)
   = argmax_y Σ_i g_i(y_{i-1}, y_i)
61. Inference of Linear-Chain CRF
g_i can be represented as an M × M matrix, where M is the cardinality of the
set of tags
g_i(y_{i-1}, y_i) = Σ_j w_j f_j(y_{i-1}, y_i, x)

[Figure: g_i drawn as a matrix with rows indexed by y_{i-1} ∈ {N, V, ART} and columns indexed by y_i ∈ {N, V, ART}.]
62. Inference of Linear-Chain CRF
The inference of linear-chain CRF is similar to that of HMM, which
uses Viterbi algorithm.
Let v range over the tags.
Define U(k, v) to be the score of the best sequence of tags from 1 to k, where
tag k is required to be v.
U(k, v) = max_{y_1, . . . , y_{k-1}} [ Σ_{i=1..k-1} g_i(y_{i-1}, y_i) + g_k(y_{k-1}, v) ]
        = max_{y_{k-1}} [ U(k-1, y_{k-1}) + g_k(y_{k-1}, v) ]
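The U(k, v) recursion is essentially the HMM Viterbi recursion with the g matrices in the role of (log-)scores; a sketch follows. The g matrices are assumed to be precomputed, with a separate score vector g0 for the first position, and all names are illustrative.

# Viterbi for a linear-chain CRF: U[k][v] = best score of a tag sequence for
# positions 1..k whose k-th tag is v.  g0 is the score vector for the first tag;
# g is a list of M x M matrices, g[k][prev_tag, tag].
import numpy as np

def crf_viterbi(g0, g):
    U = [np.asarray(g0)]
    back = []
    for gk in g:
        scores = U[-1][:, None] + gk                 # U(k-1, y_{k-1}) + g_k(y_{k-1}, v)
        back.append(scores.argmax(axis=0))
        U.append(scores.max(axis=0))
    tags = [int(U[-1].argmax())]
    for bp in reversed(back):                        # follow the back-pointers
        tags.append(int(bp[tags[-1]]))
    return tags[::-1]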
63. Learning of CRF
Problem description
Given training pairs ({xi,yi}), we wish to estimate the parameters of
the model ({ωi})
Method
Chain- or tree-structured CRFs can be trained by maximum
likelihood; we will focus on the learning of the linear-chain CRF
General CRFs are intractable, hence approximate solutions are
necessary
64. Learning of Linear-chain CRF
Conditional maximum likelihood (CML)
x: observations; y: labels
Apply CML to the learning of CRF
It can be shown that the conditional log-likelihood of the linear-chain
CRF is a convex function, so we can apply gradient ascent to the CML
problem
max_w L(w; y | x) = max_w p(y | x; w), or equivalently, max_w log p(y | x; w).

Setting the gradient to zero:

∂ log p(y | x; w) / ∂w_j = F_j(x, y) − ∂ log Z(x, w) / ∂w_j = 0.
65. Learning of Linear-chain CRF
From the previous slide, setting the gradient to zero gives, for a single training pair (x, y),

F_j(x, y) = Σ_{y'} F_j(x, y') · p(y' | x; w) = E_{y' ~ p(y' | x; w)}[ F_j(x, y') ],

where E_p[·] denotes expectation with respect to distribution p.
For the entire training set T, this becomes

Σ_{(x,y)∈T} F_j(x, y) = Σ_{x∈T} E_{y ~ p(y | x; w)}[ F_j(x, y) ].

The left-hand side is the expectation of the feature F_j with respect to the empirical distribution of the training data; the right-hand side is its expectation with respect to the model distribution.
66. Learning of Linear-chain CRF
To yield the best model:
The expectation of each feature with respect to the model distribution
is equal to the expected value under the empirical distribution of the
training data
The same as the “maximum entropy model”
Logistic regression (maximum entropy) → extend to sequence → Linear-Chain CRF