Text Data Mining
(part-2)
PROBABILISTIC MODELS
Probabilistic Document Clustering
and Topic Models
 A popular method for probabilistic document clustering is that of topic
modeling.
 The idea of topic modeling is to create a probabilistic generative model for the
text documents in the corpus.
 The main approach is to represent a corpus as a function of hidden random
variables, the parameters of which are estimated using a particular document
collection.
 The primary assumptions in any topic modeling approach (together with the
corresponding random variables) are as follows:
 The n documents in the corpus are assumed to have a probability of belonging
to one of k topics.
 Thus, a given document may have a probability of belonging to multiple
topics, and this reflects the fact that the same document may contain a
multitude of subjects.
 For a given document Di, and a set of topics T1 . . . Tk, the probability that
the document Di belongs to the topic Tj is given by P(Tj |Di).
 The topics are essentially analogous to clusters, and the value of P(Tj |Di)
provides a probability of cluster membership of the ith document to the jth
cluster.
 In non-probabilistic clustering methods, the membership of documents to
clusters is deterministic in nature, and therefore the clustering is typically a
clean partitioning of the document collection.
 When there are overlaps in document subject matter across multiple clusters, the use of a soft cluster membership in terms of probabilities is an elegant solution to this dilemma.
 In this scenario, the determination of the membership of the documents to
clusters is a secondary goal to that of finding the latent topical clusters in
the underlying text collection.
 Although topic modeling is related to the clustering problem, it is often studied as a distinct area of research from clustering.
 The value of P(Tj |Di) is estimated using the topic modeling approach, and
is one of the primary outputs of the algorithm.
 The value of k is one of the inputs to the algorithm and is analogous to the
number of clusters.
 Each topic is associated with a probability vector, which quantifies the
probability of the different terms in the lexicon for that topic.
 Let t1 . . . td be the d terms in the lexicon. Then, for a document that belongs
completely to topic Tj , the probability that the term tl occurs in it is given
by P(tl|Tj ).
 The value of P(tl|Tj) is another important parameter which needs to be
estimated by the topic modeling approach.
 The number of documents is denoted by n, topics by k and lexicon size
(terms) by d.
 Most topic modeling methods attempt to learn the above parameters using
maximum likelihood methods, so that the probabilistic fit to the given
corpus of documents is as large as possible.
 There are two basic methods which are used for topic modeling,
 Probabilistic Latent Semantic Indexing (PLSI)
 Latent Dirichlet Allocation (LDA)
Probabilistic Latent Semantic Indexing Method
 The set of random variables P(Tj |Di) and P(tl|Tj) models the probability of a term tl occurring in any document Di.
 The probability P(tl|Di) of the term tl occurring in document Di can be expressed in terms of these parameters as
P(tl|Di) = Σj=1..k P(tl|Tj) · P(Tj |Di)
 Computing this for each term tl and document Di generates an n × d matrix of probabilities in terms of these parameters, where n is the number of documents and d is the number of terms.
 For a given corpus, the n × d term-document occurrence matrix X tells us which terms actually occur in each document and how many times each term occurs in that document.
 In other words, X(i, l) is the number of times that term tl occurs in document
Di.
 Therefore, we can use a maximum likelihood estimation algorithm which
maximizes the product of the probabilities of terms that are observed in each
document in the entire collection.
 The log-likelihood Σi,l X(i, l) · log(P(tl|Di)) is maximized subject to the constraints that the probability values over each of the topic-document and term-topic spaces must sum to 1.
 The Lagrangian solution essentially leads to a set of iterative update equations for the corresponding parameters that need to be estimated.
 These parameters can be estimated with the iterative update of two matrices
[P1]k×n and [P2]d×k containing the topic-document probabilities and term-topic
probabilities respectively.
 The matrices are initialized randomly, and each of them is normalized so that the probability values in its columns sum to one.
 Then, the update steps derived from these conditions are performed iteratively on P1 and P2 respectively.
 The process is iterated to convergence.
 The output of this approach is the two matrices P1 and P2, whose entries provide the topic-document and term-topic probabilities respectively.
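 As a concrete illustration, the following is a minimal EM sketch of PLSI in Python with NumPy. The exact update equations of the original slides are not reproduced above, so the E-step and M-step below are the standard PLSI updates derived from the same log-likelihood; the toy count matrix X and the topic number k = 2 are invented for illustration.

import numpy as np

def plsi(X, k, iters=50, seed=0):
    # X: (n, d) term-document count matrix, k: number of topics.
    # Returns P1 (k x n) holding P(Tj | Di) in column i and
    #         P2 (d x k) holding P(tl | Tj) in column j.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    P1 = rng.random((k, n)); P1 /= P1.sum(axis=0, keepdims=True)   # columns sum to one
    P2 = rng.random((d, k)); P2 /= P2.sum(axis=0, keepdims=True)
    for _ in range(iters):
        # E-step: responsibility of topic j for (document i, term l),
        # proportional to P(tl | Tj) * P(Tj | Di).
        r = P2[None, :, :] * P1.T[:, None, :]              # shape (n, d, k)
        r /= r.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate both matrices from the expected counts.
        weighted = X[:, :, None] * r                        # shape (n, d, k)
        P2 = weighted.sum(axis=0)
        P2 /= P2.sum(axis=0, keepdims=True) + 1e-12
        P1 = weighted.sum(axis=1).T
        P1 /= P1.sum(axis=0, keepdims=True) + 1e-12
    return P1, P2

# Tiny example: 4 documents over a 5-term lexicon, 2 topics.
X = np.array([[4, 3, 0, 0, 1],
              [3, 5, 1, 0, 0],
              [0, 0, 4, 3, 2],
              [0, 1, 3, 4, 3]], dtype=float)
P1, P2 = plsi(X, k=2)
print(P1.round(2))   # P(Tj | Di), one column per document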
Latent Dirichlet Allocation
 The term-topic probabilities and topic-document probabilities are modeled
with a Dirichlet distribution as a prior.
 The LDA method is the Bayesian version of the PLSI technique; the PLSI method is equivalent to the LDA technique when applied with a uniform Dirichlet prior.
 The LDA method can be used to model the topic distribution of a new
document more robustly, even if it is not present in the original data set.
 EM-concepts used for topic modeling are quite general, and can be used for
different variations on the text clustering tasks, such as text classification or
incorporating user feedback into clustering.
 LDA’s main advantage over the PLSI method is that it is not quite as
susceptible to overfitting.
 This is generally true of Bayesian methods which reduce the number of
model parameters to be estimated, and therefore work much better for
smaller data sets.
 Even for larger data sets, PLSI has the disadvantage that the number of
model parameters grows linearly with the size of the collection.
 The PLSI model is not a fully generative model, because there is no
accurate way to model the topical distribution of a document which is not
included in the current data set.
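 For instance, a hedged sketch using scikit-learn's LatentDirichletAllocation (assuming scikit-learn is available; the example documents are made up) shows both uses: estimating topic distributions for the training corpus and for a document that was not in the original data set.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

# Term-document counts (the X matrix), then an LDA model with k = 2 topics.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)          # rows approximate P(Tj | Di)
print(doc_topic.round(2))

# Because LDA is a fully generative model, a document outside the training
# set can also be assigned a topic distribution.
new_doc = vectorizer.transform(["markets and investors"])
print(lda.transform(new_doc).round(2))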
Probabilistic Models for Information
Extraction
 Probabilistic models show better accuracy and robustness against noise than categorical models.
 Useful for the different tasks in extracting meaning from natural
language texts.
 Most prominent among these probabilistic approaches are
 Hidden Markov Models (HMMs),
 Stochastic Context-free Grammars (SCFG), and
 Maximal Entropy (ME).
Probabilistic Models for Information
Extraction Hidden Markov Models
 The Three Classic Problems Related to HMMs
 The Forward–Backward Procedure
 The Viterbi Algorithm
 The Training of the HMM
 Dealing with Training Data Sparseness
 Stochastic Context-Free Grammars
 Using SCFGs
 Maximal Entropy Modeling
 Computing the Parameters of the Model
 Maximal Entropy Markov Models
 Training the MEMM
 Conditional Random Fields
 The Three Classic Problems Relating to CRF
 Computing the Conditional Probability
 Finding the Most Probable Label Sequence
 Training the CRF
Hidden Markov Models
 An HMM is a finite-state automaton with stochastic state transitions
and symbol emissions.
 The automaton models a probabilistic generative process.
 In this process, a sequence of symbols is produced by
 Starting in an initial state,
 Emitting a symbol selected by the state,
 Making a transition to a new state,
 Emitting a symbol selected by the state, and
 Repeating this transition–emission cycle until a designated final
state is reached.
HMM Assumptions
 Markov assumption: the state transition depends only on the origin and
destination
 Output-independent assumption: all observation frames are dependent on
the state that generated them, not on neighbouring observation frames
 Formally,
 Let O = {o1, . . . oM} - finite set of observation symbols and
 Q ={q1, . . . qN} - finite set of states.
 A first-order Markov model λ is a triple (π, A, B),
where π : Q→ [0, 1] defines the starting probabilities,
A : Q× Q→ [0, 1] defines the transition probabilities, and
B : Q× O→ [0, 1] denotes the emission probabilities.
 Since the functions π, A, and B define true probabilities, they must satisfy
 A model λ together with the random process described above induces a
probability distribution over the set O* of all possible observation
sequences.
Σq∈Q π(q) = 1,   Σq'∈Q A(q, q') = 1   and   Σo∈O B(q, o) = 1   for all states q
The Three Classic Problems
Related to HMMs
 Most applications of hidden Markov models can be reduced to three basic
problems:
1. Find P(T | λ) [Evaluation] – the probability of a given observation sequence T in a given model λ (compute the probability distribution induced by the model).
2. Find argmaxS∈Q^|T| P(T, S | λ) [Decoding] – the most likely state trajectory given λ and T (find the most probable state sequence for a given observation sequence).
3. Find argmaxλ P(T | λ) [Learning] – the model that best accounts for a given sequence (adjust the model itself to maximize the likelihood of the given observation).
 Description of how these three problems can be solved:
Calculate P(T | λ), where T = t1t2 . . . tk ∈ O∗ is a sequence of observation symbols.
 Enumerate every possible state sequence of length |T|.
 Let S = s1,s2 . . . s|T| ∈ Q|T| be one such sequence.
 Calculate the probability P(T | S, λ) of generating T knowing that the process
went through the states sequence S.
 By Markovian assumption, the emission probabilities are all independent of
each other. Therefore,
P(T | S, λ) = Πi=1..|T| B(si, ti)
 Similarly, the transition probabilities are independent. Thus the probability P(S|λ) for the process to go through the state sequence S is
P(S | λ) = π(s1) · Πi=1..|T|−1 A(si, si+1)
 Using the above probabilities, we find that the probability P(T|λ) of generating the sequence can be calculated as
P(T | λ) = ΣS∈Q^|T| P(T | S, λ) · P(S | λ)
 This solution is of course infeasible in practice because of the exponential number of possible state sequences.
 To solve the problem efficiently, we use a dynamic programming technique. The resulting algorithm is called the forward–backward procedure.
The Forward–Backward Procedure
 Let αm(q), the forward variable, denote the probability of generating the
initial segment t1, t2 . . . tm of the sequence T and finishing at the state q at
time m. This forward variable can be computed recursively as follows:
1. α1(q) = π(q) · B(q, t1)
2. αn+1(q) = Σq'∈Q αn(q') · A(q', q) · B(q, tn+1)
 Then, the probability of the whole sequence T can be calculated as
P(T | λ) = Σq∈Q α|T|(q)
 In a similar manner, one can define βm(q), the backward variable, which denotes the probability of starting at the state q at time m and generating the final segment tm+1 . . . t|T| of the sequence T.
 The backward variable can be calculated starting from the end and going backward to the beginning of the sequence:
1. β|T|(q) = 1
2. βn(q) = Σq'∈Q A(q, q') · B(q', tn+1) · βn+1(q')
 The probability of the whole sequence is then
P(T | λ) = Σq∈Q π(q) · B(q, t1) · β1(q)
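 A minimal NumPy sketch of these recursions, with an invented two-state, two-symbol model for illustration (a real implementation would normally use scaling or log probabilities to avoid underflow):

import numpy as np

def forward(pi, A, B, obs):
    # Forward variables alpha_m(q) for an HMM.
    # pi: (N,) starting probabilities, A: (N, N) transitions,
    # B: (N, M) emissions, obs: list of observation-symbol indices.
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                 # alpha_1(q) = pi(q) B(q, t_1)
    for m in range(1, T):
        # alpha_{m+1}(q) = sum_q' alpha_m(q') A(q', q) B(q, t_{m+1})
        alpha[m] = (alpha[m - 1] @ A) * B[:, obs[m]]
    return alpha

def backward(A, B, obs):
    # Backward variables beta_m(q), computed from the end of the sequence.
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))                       # beta_|T|(q) = 1
    for m in range(T - 2, -1, -1):
        # beta_m(q) = sum_q' A(q, q') B(q', t_{m+1}) beta_{m+1}(q')
        beta[m] = A @ (B[:, obs[m + 1]] * beta[m + 1])
    return beta

# Toy model with 2 states and symbols "a" = 0, "b" = 1; the numbers are illustrative.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0]
alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
print(alpha[-1].sum())                           # P(T | lambda) via the forward variables
print((pi * B[:, obs[0]] * beta[0]).sum())       # the same value via the backward variables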
The Viterbi Algorithm
 Solution of the second problem – finding the most likely state sequence for a
given sequence T.
 As with the previous problem, enumerating all possible state sequences S
and choosing the one maximizing P(T, S | λ) is infeasible.
 Dynamic programming is used, utilizing the following property of the optimal state sequence: if T' = t1 t2 . . . t|T'| is some initial segment of the sequence T = t1 t2 . . . t|T| and S = s1 s2 . . . s|T| is a state sequence maximizing P(T, S | λ), then S' = s1 s2 . . . s|T'| maximizes P(T', S' | λ) among all state sequences of length |T'| ending with s|T'|.
 The resulting algorithm is called the Viterbi algorithm.
 Let γ n(q) denote the state sequence ending with the state q, which is
optimal for the initial segment Tn = t1t2 . . . tn among all sequences ending
with q, and let δn(q) denote the probability P(Tn, γ n(q) | λ) of generating
this initial segment following those optimal states. Delta and gamma can
be recursively calculated as follows:
1. δ1(q) = π(q) · B(q, t1),   γ1(q) = (q)
2. δn+1(q) = maxq'∈Q δn(q') · A(q', q) · B(q, tn+1),   γn+1(q) = γn(q*)·q (the sequence γn(q*) extended by q),
where q* = argmaxq'∈Q δn(q') · A(q', q) · B(q, tn+1)
 Then, the best state sequence among {γ|T|(q) : q ∈ Q} is the optimal one:
argmaxS∈Q^|T| P(T, S | λ) = γ|T|( argmaxq∈Q δ|T|(q) )
Example of the Viterbi Computation
 Using the HMM described in the figure with the sequence (a, b, a), the Viterbi algorithm proceeds as follows:
A sample HMM
Computation of the optimal path
using the Viterbi algorithm
Two optimal paths: {S1, S3, S1} and {S3, S2, S3}.
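 The following is a small Python sketch of the delta/gamma recursion above. The parameters of the sample HMM in the figure are not reproduced here, so the three-state model below is hypothetical and used only to show the mechanics.

import numpy as np

def viterbi(pi, A, B, obs):
    # Most likely state sequence for an observation sequence (0-based indices).
    # pi: (N,), A: (N, N), B: (N, M), obs: list of symbol indices.
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))             # delta_n(q): best score of a path ending in q
    back = np.zeros((T, N), dtype=int)   # argmax predecessors, to recover gamma
    delta[0] = pi * B[:, obs[0]]
    for n in range(1, T):
        scores = delta[n - 1][:, None] * A        # scores[q', q] = delta_{n-1}(q') A(q', q)
        back[n] = scores.argmax(axis=0)
        delta[n] = scores.max(axis=0) * B[:, obs[n]]
    # Trace the best path backwards from the best final state.
    path = [int(delta[-1].argmax())]
    for n in range(T - 1, 0, -1):
        path.append(int(back[n][path[-1]]))
    return path[::-1]

# Hypothetical 3-state HMM over the symbols "a" = 0 and "b" = 1.
pi = np.array([0.5, 0.2, 0.3])
A = np.array([[0.1, 0.4, 0.5], [0.4, 0.2, 0.4], [0.5, 0.4, 0.1]])
B = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
print(viterbi(pi, A, B, [0, 1, 0]))   # indices of the most likely state sequence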
The Training of the HMM
Baum–Welch re-estimation formulas
 Let μn(q) be the probability P(sn = q | T, λ) of being in the state q at time n while generating the observation sequence T. Then μn(q) · P(T | λ) is the probability of generating T passing through the state q at time n.
 By definition of the forward and backward variables, this probability is equal to αn(q) · βn(q). Thus,
μn(q) = αn(q) · βn(q) / P(T | λ)
 Also let ϕn(q, q') be the probability P(sn = q, sn+1 = q' | T, λ) of passing from state q to state q' at time n while generating the observation sequence T. As in the preceding equation,
ϕn(q, q') = αn(q) · A(q, q') · B(q', tn+1) · βn+1(q') / P(T | λ)
 The sum of μn(q) over all n = 1 . . . |T| can be seen as the expected number of times the state q was visited while generating the sequence T.
 Or, if one sums over n = 1 . . . |T|−1, the expected number of transitions out of the state q results, because there is no transition at time |T|.
 Similarly, the sum of ϕn(q, q') over all n = 1 . . . | T | −1 can be interpreted as
the expected number of transitions from the state q to q'
 The Baum–Welch formulas re-estimate the parameters of the model λ according to these expectations:
π'(q) := μ1(q)
A'(q, q') := Σn=1..|T|−1 ϕn(q, q') / Σn=1..|T|−1 μn(q)
B'(q, o) := Σn=1..|T|, tn=o μn(q) / Σn=1..|T| μn(q)
 It can be shown that the model λ' = (π', A', B') is either equal to λ, in which case λ is a critical point of the likelihood function P(T | λ), or λ' accounts for the training sequence T better than the original model λ in the sense that P(T | λ') > P(T | λ).
 Therefore, the training problem can be solved by iteratively applying the re-estimation formulas until convergence.
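 A compact NumPy sketch of one re-estimation step for a single training sequence, directly following the formulas above (a practical implementation would work in log space and combine the expectations from many sequences):

import numpy as np

def baum_welch_step(pi, A, B, obs):
    # One Baum-Welch re-estimation step for a single observation sequence.
    # pi: (N,) starting probabilities, A: (N, N) transitions,
    # B: (N, M) emissions, obs: list of symbol indices t_1..t_|T|.
    N, M = B.shape
    T = len(obs)

    # Forward and backward variables (see the forward-backward procedure).
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for n in range(1, T):
        alpha[n] = (alpha[n - 1] @ A) * B[:, obs[n]]
    beta[T - 1] = 1.0
    for n in range(T - 2, -1, -1):
        beta[n] = A @ (B[:, obs[n + 1]] * beta[n + 1])
    p_T = alpha[T - 1].sum()                      # P(T | lambda)

    # mu_n(q) and phi_n(q, q') as defined above.
    mu = alpha * beta / p_T                       # shape (T, N)
    phi = (alpha[:-1, :, None] * A[None] *
           (B[:, obs[1:]].T * beta[1:])[:, None, :]) / p_T    # shape (T-1, N, N)

    # Baum-Welch re-estimation formulas.
    pi_new = mu[0]
    A_new = phi.sum(axis=0) / mu[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for o in range(M):
        mask = np.array(obs) == o
        B_new[:, o] = mu[mask].sum(axis=0) / mu.sum(axis=0)
    return pi_new, A_new, B_new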
Dealing with Training Data Sparseness
 Techniques for data sparseness problems in probabilistic modeling
 Smoothing
 Shrinkage
 Smoothing
 Process of flattening a probability distribution implied by a model so that
all reasonable sequences can occur with some probability.
 Broadening the distribution by redistributing weight from high-probability
regions to zero-probability regions.
 Example
 Laplace smoothing
o Every possible training event is counted as occurring one time more than it actually does. Any constant can be used instead of "one."
 Other possible methods may include back-off smoothing, deleted
interpolation, and others.
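 For example, a small Python sketch of Laplace smoothing applied to the emission counts of a single HMM state (the lexicon and counts are invented):

from collections import Counter

# Laplace (add-one) smoothing of emission counts for one HMM state:
# every symbol in the lexicon is counted once more than it was observed.
lexicon = ["a", "b", "c", "d"]
observed = Counter({"a": 7, "b": 3})       # "c" and "d" were never emitted

k = 1  # the added constant; any positive constant can be used instead of one
total = sum(observed.values()) + k * len(lexicon)
smoothed = {o: (observed[o] + k) / total for o in lexicon}
print(smoothed)   # no symbol has zero probability any more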
 Shrinkage
 Defined in terms of some hierarchy representing the expected similarity
between parameter estimates.
 With respect to HMMs, the hierarchy can be defined as a tree with the
HMM states for the leaves – all at the same depth.
 Hierarchy is created as follows:
 First, the most complex HMM is built and its states are used for the leaves
of the tree.
 Then the states are separated into disjoint classes within which the states are
expected to have similar probability distributions.
 The classes become the parents of their constituent states in the hierarchy
(HMM structure at the leaves induces a simpler HMM structure at the level
of the classes).
 The class-level HMM is generated by summing the probabilities of emissions and transitions of all states in a class.
 This process may be repeated until only a single-state HMM remains at the
root of the hierarchy
Successful Application Areas of HMM
 Online handwriting recognition
 Speech recognition
 Gesture recognition
 Language recognition
 Motion Video analysis and tracking
 Protein sequence / gene sequence alignment
 Stock price prediction
 …
Stochastic context-free grammars
 An SCFG is a quintuple G = (T, N, S, R, P),
where, T is the alphabet of terminal symbols (tokens),
N is the set of nonterminals, S is the starting nonterminal,
R is the set of rules, and P : R→ [0, 1] defines their probabilities.
 The rules have the form: n→ s1s2 . . . sk,
where, n is a nonterminal and
si is either a token or another nonterminal.
 An SCFG generates (or accepts) a given string (sequence of tokens) if the string can
be produced starting from a sequence containing just the starting symbol S and
expanding nonterminals one by one in the sequence using the rules from the
grammar.
 The string generated can be naturally represented by a parse tree,
 Starting symbol as a root,
 Nonterminals as internal nodes, and
 Tokens as leaves.
 An SCFG is an ordinary context-free grammar with the addition of the probability function P.
 The semantics of the probability function P are straightforward.
 If r is the rule n → s1s2 . . . sk, then P(r) is the frequency of expanding n
using this rule.
 In Bayesian terms, if it is known that a given sequence of tokens was
generated by expanding n, then P(r) is the a priori likelihood that n was
expanded using the rule r.
 For every nonterminal n the sum Σ P(r ) of probabilities of all rules r
headed by n must be equal to one.
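 As a small illustration, a toy SCFG can be encoded in Python as a map from rules to probabilities; the grammar below is invented and only shows the normalization requirement and the rule-independence assumption used for scoring a parse.

from collections import defaultdict

# A toy SCFG over T = {"a", "b"}, N = {"S"}, encoded as a rule -> probability map.
# Each rule is (head, tuple of right-hand-side symbols); the names are illustrative.
rules = {
    ("S", ("a", "S", "b")): 0.4,
    ("S", ("a", "b")): 0.6,
}

# The probabilities of all rules headed by the same nonterminal must sum to one.
totals = defaultdict(float)
for (head, _), p in rules.items():
    totals[head] += p
assert all(abs(t - 1.0) < 1e-9 for t in totals.values())

# Assuming rule independence, the probability of a parse tree is the product of
# the probabilities of the rules used in it; e.g. the tree for "aabb" uses
# S -> a S b and then S -> a b:
print(rules[("S", ("a", "S", "b"))] * rules[("S", ("a", "b"))])   # 0.24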
Using SCFGs
Classical definition of SCFG:
 It is assumed that the rules are all independent.
 Find the (unconditional) probability of a given parse tree by simply
multiplying the probabilities of all rules participating in it.
 Parsing problem is formulated as follows:
 Given a sequence of tokens (a string), find the most probable parse tree
that could generate the string.
 A simple generalization of the Viterbi algorithm is able to solve this
problem efficiently.
Practical applications of SCFGs:
 It is rarely the case that the rules are truly independent.
 Instead, the probabilities P(r) are allowed to be conditioned on the context where the rule is applied.
 If the conditioning context is chosen reasonably, the Viterbi algorithm still
works correctly even for this more general problem.
Maximal Entropy Modeling
 Consider a random process of an unknown nature that produces a single output
value y, a member of a finite set Y of possible output values.
 The process of generating y may be influenced by some contextual information
x – a member of the set X of possible contexts.
 The task is to construct a statistical model that accurately represents the
behavior of the random process.
 Such a model is a method of estimating the conditional probability of
generating y given the context x.
 Let P(x, y) denote the unknown true joint probability distribution of the random process, and let p(y | x) be the model we are trying to build, taken from the class ℘ of all possible models.
 To build the model we are given a set of training samples generated by
observing the random process for some time.
 The training data consist of a sequence of pairs (xi, yi) of different outputs
produced in different contexts.
 In many cases the set X is too large and underspecified to be used directly.
 For instance, X may be the set of all dots “.” in all possible English texts.
 By contrast, Y may be extremely simple while remaining interesting.
 In the preceding case, Y may contain just two outcomes: "SentenceEnd" and "NotSentenceEnd."
 The target model p(y | x) would in this case solve the problem of finding sentence
boundaries.
 In such cases it is impossible to use the context x directly to generate the output y.
 There are usually many regularities and correlations, however, that can be
exploited.
 Different contexts are usually similar to each other in all manner of ways, and
similar contexts tend to produce similar output distributions.
 To express such regularities and their statistics, can use constraint functions
and their expected values.
 A constraint function f : X × Y → R can be any real-valued function.
 The most useful constraint functions are binary-valued trigger functions.
 Such a trigger function returns one for a pair (x, y) if the context x satisfies the condition predicate C and the output value y is yi, and zero otherwise.
 A common short notation for such a trigger function is C→yi.
 For the example above, useful triggers are
previous token is “Mr”→NotSentenceEnd,
next token is capitalized→SentenceEnd.
 Given a constraint function f, we express its importance by requiring our target model to faithfully reproduce f's expected value in the true distribution:
Σx,y P(x, y) · f(x, y) = Σx P(x) Σy p(y | x) · f(x, y)
 In practice we cannot calculate the true expectation and must use an empirical expected value calculated by summing over the N training samples:
PE( f ) = (1/N) Σi f(xi, yi)
 The choice of feature functions is domain dependent. Let us assume the
complete set of features F={ fk} is given.
 We express the completeness of the set of features by requiring that the model agree with all the expected value constraints while otherwise being as uniform as possible.
 The uniformity requirement defines the target model uniquely. The degree of uniformity of a model is expressed by its conditional entropy
H(p) = −Σx,y P(x) · p(y | x) · log p(y | x)
 Or, empirically,
HE(p) = −(1/N) Σi Σy p(y | xi) · log p(y | xi)
 The constrained optimization problem of finding the maximal-entropy target
model is solved by application of Lagrange multipliers and the Kuhn–Tucker
theorem.
 Let us introduce a parameter λk (the Lagrange multiplier) for every feature. Define the Lagrangian Λ(p, λ) by
Λ(p, λ) = HE(p) + Σk λk · ( pE( fk) − PE( fk) )
 Holding λ fixed, we compute the unconstrained maximum of the Lagrangian over all p ∈ ℘. Denote by pλ the p where Λ(p, λ) achieves its maximum and by Ψ(λ) the value of Λ at this point.
 The functions pλ and Ψ(λ) can be calculated using simple calculus:
pλ(y | x) = (1/Zλ(x)) · exp( Σk λk · fk(x, y) ),   Ψ(λ) = Λ(pλ, λ),
where Zλ(x) is a normalizing constant determined by the requirement that Σy∈Y pλ(y | x) = 1.
 The dual optimization problem is to find λ∗ = argmaxλ Ψ(λ).
 The Kuhn–Tucker theorem asserts that, under certain conditions, the solutions of the primal and dual optimization problems coincide.
 The model p, which maximizes HE(p) while satisfying the constraints, has the parametric form pλ∗.
 The function Ψ(λ) is simply the log-likelihood of the training sample as predicted by the model pλ.
 Thus, the model pλ∗ maximizes the likelihood of the training sample among all models of the parametric form pλ.
Computing the Parameters of the Model
 The function Ψ(λ) is well behaved from the perspective of numerical optimization, for it is smooth and concave.
 Consequently, various methods can be used for calculating λ∗.
 Generalized iterative scaling is the algorithm specifically tailored for the
problem. This algorithm is applicable whenever all constraint functions are non-
negative: fk(x, y) ≥ 0.
 The algorithm starts with an arbitrary choice of λ's – for instance, λk = 0 for all k.
 At each iteration the λ's are adjusted as λk := λk + Δλk, where Δλk is the solution of
PE( fk ) = Σx,y P(x) · pλ(y | x) · fk(x, y) · exp( Δλk · f#(x, y) ),   with f#(x, y) = Σk fk(x, y).
 In the simplest case, when f# is constant, Δλk is simply (1/f#) · log( PE( fk) / pλE( fk) ).
 Otherwise, any numerical algorithm for solving the equation can be used, such as Newton's method.
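 The following is a minimal Python sketch of GIS under the simplifying assumption that f#(x, y) is the same constant C for every pair, which is what makes the closed-form update valid; the feature functions and training data in the sanity check are invented.

import numpy as np

def gis(contexts, labels, features, ys, iters=100):
    # Generalized iterative scaling, assuming f#(x, y) = sum_k f_k(x, y)
    # is the same constant C for every (x, y) pair.
    K, N = len(features), len(contexts)
    C = sum(f(contexts[0], ys[0]) for f in features)      # assumed constant f#
    lam = np.zeros(K)

    # Empirical expectations P_E(f_k), averaged over the training sample.
    emp = np.array([sum(f(x, y) for x, y in zip(contexts, labels)) / N
                    for f in features])

    def p(y, x):                                          # p_lambda(y | x)
        scores = {yy: np.exp(sum(lam[k] * features[k](x, yy) for k in range(K)))
                  for yy in ys}
        return scores[y] / sum(scores.values())

    for _ in range(iters):
        # Model expectations of f_k under the empirical context distribution.
        model = np.array([sum(p(yy, x) * f(x, yy) for x in contexts for yy in ys) / N
                          for f in features])
        lam += (1.0 / C) * np.log((emp + 1e-12) / (model + 1e-12))
    return lam

# Sanity check: with one indicator feature per label (so f# = 1 everywhere),
# the fitted model reproduces the empirical label distribution.
ys = ["SentenceEnd", "NotSentenceEnd"]
contexts = [".", ".", ".", "."]        # the contexts are ignored by these features
labels = ["SentenceEnd", "SentenceEnd", "SentenceEnd", "NotSentenceEnd"]
features = [lambda x, y, t=t: 1.0 if y == t else 0.0 for t in ys]
lam = gis(contexts, labels, features, ys)
# p_lambda("SentenceEnd" | ".") is now close to the empirical value 0.75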
Maximal Entropy Markov Models
 A MEMM is a probabilistic finite-state acceptor.
 Unlike an HMM, which has separate transition and emission probabilities, an MEMM has only transition probabilities, which depend on the observations.
 A slightly modified version of the Viterbi algorithm solves the problem of
finding the most likely state sequence for a given observation sequence.
 A MEMM consists of a set Q = {q1, . . . , qN} of states and a set of transition probability functions Aq : X × Q → [0, 1], where X denotes the set of all possible observations. Aq(x, q') gives the probability P(q' | q, x) of transition from q to q', given the observation x.
 The model does not generate x but only conditions on it.
 The set X need not be small and need not even be fully defined.
 The transition probabilities Aq are separate exponential models trained using
maximal entropy.
 The task of a trained MEMM is to produce the most probable sequence of states given
the observation, solved by a simple modification of the Viterbi algorithm.
 The forward–backward algorithm loses its original meaning here, because it computes the probability of the observation being generated by any state sequence, which is always one.
 The forward and backward variables are still useful for the MEMM training. The forward variable αm(q) (cf. the HMM section) denotes the probability of being in state q at time m given the observation. It is computed recursively as
αm+1(q) = Σq'∈Q αm(q') · Aq'(xm+1, q)
 The backward variable βm(q) denotes the probability of starting from state q at time m given the rest of the observation. It is computed similarly as
βm(q) = Σq'∈Q Aq(xm+1, q') · βm+1(q')
 The model Aq for transition probabilities from a state is defined parametrically using constraint functions. If fk : X × Q → R is the set of such functions for a given state q, then the model Aq can be represented in the form
Aq(x, q') = (1/Z(x, q)) · exp( Σk λk · fk(x, q') ),
where λk are the parameters to be trained and Z(x, q) is the normalizing factor making the probabilities of all transitions from a state sum to one.
Training the MEMM
 If the true states sequence for the training data is known, the parameters of the
models can be straightforwardly estimated using the GIS algorithm for training ME
models.
 If the sequence is not known (for instance, if there are several states with the same label in a fully connected MEMM), the parameters must be estimated using a combination of the Baum–Welch procedure and iterative scaling.
 Every iteration consists of two steps:
1. Using the forward–backward algorithm and the current transition functions to
compute the state occupancies for all training sequences.
2. Computing the new transition functions using GIS with the feature frequencies
based on the state occupancies computed in step 1.
 It is unnecessary to run GIS to convergence in step 2; a single GIS iteration is
sufficient.
Conditional Random Fields (CRF)
 Problem description
 Why conditional random fields (CRF)
 Introduction to CRF
 CRF model
 Inference of CRF
 Learning of CRF
Problem Description
 Given observed data X, we wish to predict Y (labels)
 Example:
 X = {Temperature, Humidity, ...}  Xn = observation on day n
 Y = {Sunny, Rainy, Cloudy}  Yn = weather on day n
(Diagram: example observations such as 30°C, 20% humidity and a light breeze, with the label to predict being Sunny, Rainy or Cloudy. The observed quantities may depend on one another, and today's weather may depend on the weather of yesterday.)
Generative Model vs. Discriminative Model
 Generative model
 A model that generates observed data randomly
 Model the joint probability p(x,y)
 Discriminative model
 Directly estimate the posterior probability p(y|x)
 Aim at modeling the “discrimination” between different outputs
                Single variable           Sequence                      General
Generative      Naïve Bayes, …            HMM, …                        Bayesian network, MRF, …
Conditional     Logistic regression, …    Linear-chain CRF, MEMM, …     General CRF, …
Why Conditional Random Fields
 Generative model
 A generative model targets the joint probability p(x,y) and makes the prediction by applying Bayes' rule to calculate p(y|x)
 Ex: Naive Bayes (single output) and HMM (Hidden Markov Model) (sequence output)
 Naive Bayes, with x a vector of features assumed independent given y:
p(x, y) = p(y) · Πk=1..K p(xk | y)
 HMM, assuming (1) each state depends only on its immediate predecessor and (2) each observation is conditionally independent of the others given its state:
p(x, y) = Πt=1..T p(yt | yt−1) · p(xt | yt)
Why Conditional Random Fields
(Diagram: the weather example as a generative model. Humidity, temperature and the wind scale are assumed independent given the weather, with daily observations such as Mon. {30°C, 20%, light breeze}, Tue. {28°C, 30%, light breeze}, Wed. {25°C, 40%, moderate breeze}, Thu. {22°C, 60%, moderate breeze}; an arrow A → B means "A causes B".)
Why Conditional Random Fields
 Difficulties for generative models
 Not practical to represent multiple interacting features (hard to
model p(x)) or long-range dependencies of the observations
 Very strict independence assumptions on the observations
Why Conditional Random Fields
 Discriminative models
 Directly model the posterior p(y|x)
 Aim at modeling the “discrimination” between different outputs
 Ex: logistic regression (maximum entropy) and CRF
 Advantages of discriminative models
 The training process aims at finding optimal coefficients for the features even if the features are correlated
 Not sensitive to unbalanced training data
 Especially for the classification problem, we don’t have to care about
p(x)
Why Conditional Random Fields
 Logistic regression (maximum entropy)
 Suppose we have a bin of candies, each with an associated label (A,B,C, or D)
 Each candy has multiple colors in its wrapper
 Each candy is assigned a label randomly based on some distribution over
wrapper colors
Observation: the color
of the wrapper
Label: 4 kinds of flavors
A: chocolate
B: strawberry
C: lemon
D: milk
Why Conditional Random Field
 For any candy with a red label pulled from the bin:
 P(A|red)+P(B|red)+P(C|red)+P(D|red) = 1
 Infinite number of distributions exist that fit this constraint
 The distribution that fits with the idea of maximum entropy is: (the
most uniform)
o P(A|red)=0.25
o P(B|red)=0.25
o P(C|red)=0.25
o P(D|red)=0.25
Why Conditional Random Field
 Now suppose we add some evidence to our model
 We note that 80% of all candies with red labels are either labeled A or B
o P(A|red) + P(B|red) = 0.8
 The updated model that reflects this would be:
o P(A|red) = 0.4
o P(B|red) = 0.4
o P(C|red) = 0.1
o P(D|red) = 0.1
 As we make more observations and find more constraints, the model gets
more complex
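 The numbers in this example can be checked with a short script; the sketch below (assuming SciPy is available) maximizes the entropy of the four label probabilities subject to the stated constraints.

import numpy as np
from scipy.optimize import minimize

# Maximum entropy distribution over the four candy labels given one constraint:
# P(A|red) + P(B|red) = 0.8, plus the probabilities summing to one.
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.8},
]
result = minimize(neg_entropy, x0=np.full(4, 0.25),
                  bounds=[(0.0, 1.0)] * 4, constraints=constraints)
print(result.x.round(2))   # approximately [0.4, 0.4, 0.1, 0.1]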
Why Conditional Random Field
 Given a collection of facts, choose a model which is consistent with all
the facts, but otherwise as uniform as possible
The maximum entropy (logistic regression) model has the form
p(y | x; w) = exp( Σj wj Fj(x, y) ) / Z(x, w),
where Z(x, w) = Σy exp( Σj wj Fj(x, y) ) is a normalization term, the weights w are obtained by learning, and the feature functions Fj encode the evidence.
(Factor graph: a factor node between nodes A and B carries a function f(A, B); here a single factor connects the output y to the features x1, x2, . . . , xd.)
Linear-Chain CRF
 If we extend logistic regression to a sequence problem (the output y becomes a label sequence):
p(y | x; w) = exp( Σj wj Fj(x, y) ) / Z(x, w),
where Z(x, w) = Σy exp( Σj wj Fj(x, y) ) is a normalization term and
Fj(x, y) = Σt fj(yt−1, yt, x),
a sum along the entire sentence; each factor may see the entire observation sequence x.
(Diagram: adjacent labels yt−1, yt, yt+1 are linked in a chain, and every factor is connected to the whole input x1, x2, . . . , xd.)
Linear-Chain CRF
(Diagram: two linear-chain CRF variants over labels y1, y2, y3; in one, each yt is connected to its own observation xt, in the other the whole chain is connected to the entire observation x.)
General CRF
 Divide Graph G into many templates ψA. The parameters inside each
template are tied
 K(A) is the number of feature functions for the template
p(y | x) = (1/Z(x)) · ΠψA∈G exp( Σk=1..K(A) λak · fak(ya, xa) )
Inference of CRF
 Problem description:
 Given the observations ({xi}) and the probability model (parameters such as the weights ωi mentioned above), we aim to find the best state sequence
 For general graphs, the problem of exact inference in CRFs is
intractable
 Chain- or tree-like CRFs yield exact inference
 Otherwise, approximate solutions are used
Inference of Linear-Chain CRF
• The inference of linear-chain CRF is very similar to that of HMM
 Example: POS(part of speech) tagging
 The identification of words as nouns, verbs, adjectives, adverbs,
etc.
Students need another break
noun verb article noun
Inference of Linear-Chain CRF
 We first illustrate the inference of an HMM
(Figure: a Viterbi trellis for "Students need another break". Each word has candidate tags V, N, P and ART, and each node carries the probability of the best path reaching it, e.g. 0.00725 for students/N; the highest-scoring path yields the tagging N V ART N shown earlier.)
Inference of Linear-Chain CRF
 Then back to CRF
y∗ = argmaxy p(y | x; w)
   = argmaxy exp( Σj wj Fj(x, y) ) / Z(x)
   = argmaxy Σj wj Fj(x, y)
   = argmaxy Σj wj Σi fj(yi−1, yi, x)
   = argmaxy Σi gi(yi−1, yi)
Inference of Linear-Chain CRF
 gi can be represented as an M×M matrix, where M is the cardinality of the set of the tags:
gi(yi−1, yi) = Σj wj · fj(yi−1, yi, x)
(Diagram: gi as a matrix indexed by yi−1 (rows) and yi (columns), e.g. over the tags N, V and ART.)
Inference of Linear-Chain CRF
 The inference of a linear-chain CRF is similar to that of an HMM and uses the Viterbi algorithm.
 Let v range over the tags, and define U(k, v) to be the score of the best sequence of tags from position 1 to k, where tag k is required to be v:
U(k, v) = max{y1, . . . , yk−1} [ Σi=1..k−1 gi(yi−1, yi) + gk(yk−1, v) ]
        = maxyk−1 [ U(k−1, yk−1) + gk(yk−1, v) ]
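 A small NumPy sketch of this recursion; the gi values are random stand-ins for the weighted feature sums, and the handling of the first position (a fixed "start" row) is an assumption of this sketch.

import numpy as np

def crf_viterbi(g):
    # Most probable tag sequence for a linear-chain CRF.
    # g: array of shape (T, M, M) where g[i, a, b] = g_i(y_{i-1} = a, y_i = b)
    # is the weighted feature sum defined above; g[0] is read with a fixed
    # "start" value (row 0) for the nonexistent previous tag.
    # Maximizing sum_i g_i maximizes p(y | x; w), since Z(x) does not depend on y.
    T, M, _ = g.shape
    U = np.zeros((T, M))            # U[k, v]: best score of tags 1..k ending in v
    back = np.zeros((T, M), dtype=int)
    U[0] = g[0, 0]                  # scores of the first tag (row 0 = "start")
    for k in range(1, T):
        scores = U[k - 1][:, None] + g[k]     # scores[u, v] = U[k-1, u] + g_k(u, v)
        back[k] = scores.argmax(axis=0)
        U[k] = scores.max(axis=0)
    tags = [int(U[-1].argmax())]
    for k in range(T - 1, 0, -1):
        tags.append(int(back[k][tags[-1]]))
    return tags[::-1]

# Toy example with 3 tags and a 4-word sentence; the g values are illustrative.
rng = np.random.default_rng(0)
print(crf_viterbi(rng.normal(size=(4, 3, 3))))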
Learning of CRF
 Problem description
 Given training pairs ({xi,yi}), we wish to estimate the parameters of
the model ({ωi})
 Method
 For chain or tree structured CRFs, they can be trained by maximum
likelihood  we will focus on the learning of linear chain CRF
 General CRFs are intractable, hence approximate solutions are necessary
Learning of Linear-chain CRF
 Conditional maximum likelihood (CML)
 x: observations; y: labels
maxw L(w; y | x) = maxw p(y | x; w)
 Apply CML to the learning of CRF: maximize p(y | x; w), or equivalently log p(y | x; w)
 It can be shown that the conditional log-likelihood of the linear-chain CRF is a convex function  we can apply gradient ascent to the CML problem. Setting the gradient to zero gives
∂/∂wj log p(y | x; w) = Fj(x, y) − ∂/∂wj log Z(x, w) = 0
Learning of Linear-chain CRF
 For the entire training set T, the gradient of the conditional log-likelihood is obtained by summing the single-pair gradients
∂/∂wj log p(y | x; w) = Fj(x, y) − ∂/∂wj log Z(x, w)
 = Fj(x, y) − Σy' Fj(x, y') · p(y' | x; w)
 = Fj(x, y) − Ey'~p(y'|x;w)[ Fj(x, y') ]
where Ep[·] denotes expectation with respect to the distribution p. Over all training pairs this gives
Σ(x,y)∈T Fj(x, y) − Σx∈T Ey~p(y|x;w)[ Fj(x, y) ],
the expectation of the feature Fj with respect to the empirical distribution minus its expectation with respect to the model distribution.
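 To make the formula concrete, the brute-force Python sketch below computes this gradient for a tiny invented linear-chain example by enumerating every label sequence; this is only feasible for toy sizes, but it mirrors the expression exactly.

import numpy as np
from itertools import product

# Brute-force check of the gradient formula above:
# dL/dw_j = F_j(x, y) - E_{y' ~ p(y'|x;w)}[ F_j(x, y') ].
# The feature functions, tags and weights below are made up for illustration.
tags = [0, 1]
T = 3                                             # sequence length

def F(y):                                         # global features F_j(x, y)
    # two illustrative features: number of 0->1 transitions, number of tag-1 positions
    trans = sum(1 for a, b in zip(y[:-1], y[1:]) if (a, b) == (0, 1))
    ones = sum(y)
    return np.array([trans, ones], dtype=float)

w = np.array([0.5, -0.3])
all_y = list(product(tags, repeat=T))             # enumerate every label sequence
scores = np.array([w @ F(y) for y in all_y])
p = np.exp(scores) / np.exp(scores).sum()         # p(y' | x; w), with Z by enumeration

y_observed = (0, 1, 1)
model_expectation = sum(pr * F(y) for pr, y in zip(p, all_y))
gradient = F(y_observed) - model_expectation
print(gradient)    # empirical feature vector minus model expectation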
Learning of Linear-chain CRF
 To yield the best model:
 The expectation of each feature with respect to the model distribution
is equal to the expected value under the empirical distribution of the
training data
 The same as the “maximum entropy model”
Logistic regression (maximum entropy)  extend to sequence  Linear-Chain CRF

Was ist angesagt?

Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsGabriel Moreira
 
Text similarity measures
Text similarity measuresText similarity measures
Text similarity measuresankit_ppt
 
Introduction to MCMC methods
Introduction to MCMC methodsIntroduction to MCMC methods
Introduction to MCMC methodsChristian Robert
 
Language modelling and its use cases
Language modelling and its use casesLanguage modelling and its use cases
Language modelling and its use casesKhrystyna Skopyk
 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsVarad Meru
 
Text classification using Text kernels
Text classification using Text kernelsText classification using Text kernels
Text classification using Text kernelsDev Nath
 
Logistic Regression in Python | Logistic Regression Example | Machine Learnin...
Logistic Regression in Python | Logistic Regression Example | Machine Learnin...Logistic Regression in Python | Logistic Regression Example | Machine Learnin...
Logistic Regression in Python | Logistic Regression Example | Machine Learnin...Edureka!
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet AllocationSangwoo Mo
 
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hakky St
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsSelman Bozkır
 
Probabilistic retrieval model
Probabilistic retrieval modelProbabilistic retrieval model
Probabilistic retrieval modelbaradhimarch81
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingSangwoo Mo
 
Fuzzy relations
Fuzzy relationsFuzzy relations
Fuzzy relationsnaugariya
 
Text clustering
Text clusteringText clustering
Text clusteringKU Leuven
 
A Theory of the Learnable; PAC Learning
A Theory of the Learnable; PAC LearningA Theory of the Learnable; PAC Learning
A Theory of the Learnable; PAC Learningdhruvgairola
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai UniversityMadhav Mishra
 
Artificial Neural Networks 1
Artificial Neural Networks 1Artificial Neural Networks 1
Artificial Neural Networks 1swapnac12
 
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximizationbutest
 

Was ist angesagt? (20)

Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
 
Text similarity measures
Text similarity measuresText similarity measures
Text similarity measures
 
Introduction to MCMC methods
Introduction to MCMC methodsIntroduction to MCMC methods
Introduction to MCMC methods
 
Language modelling and its use cases
Language modelling and its use casesLanguage modelling and its use cases
Language modelling and its use cases
 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its Applications
 
Text classification using Text kernels
Text classification using Text kernelsText classification using Text kernels
Text classification using Text kernels
 
CS8080 information retrieval techniques unit iii ppt in pdf
CS8080 information retrieval techniques unit iii ppt in pdfCS8080 information retrieval techniques unit iii ppt in pdf
CS8080 information retrieval techniques unit iii ppt in pdf
 
Logistic Regression in Python | Logistic Regression Example | Machine Learnin...
Logistic Regression in Python | Logistic Regression Example | Machine Learnin...Logistic Regression in Python | Logistic Regression Example | Machine Learnin...
Logistic Regression in Python | Logistic Regression Example | Machine Learnin...
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet Allocation
 
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
 
Relation Extraction
Relation ExtractionRelation Extraction
Relation Extraction
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systems
 
Probabilistic retrieval model
Probabilistic retrieval modelProbabilistic retrieval model
Probabilistic retrieval model
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
Fuzzy relations
Fuzzy relationsFuzzy relations
Fuzzy relations
 
Text clustering
Text clusteringText clustering
Text clustering
 
A Theory of the Learnable; PAC Learning
A Theory of the Learnable; PAC LearningA Theory of the Learnable; PAC Learning
A Theory of the Learnable; PAC Learning
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
 
Artificial Neural Networks 1
Artificial Neural Networks 1Artificial Neural Networks 1
Artificial Neural Networks 1
 
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximization
 

Ähnlich wie Tdm probabilistic models (part 2)

Discovering Novel Information with sentence Level clustering From Multi-docu...
Discovering Novel Information with sentence Level clustering  From Multi-docu...Discovering Novel Information with sentence Level clustering  From Multi-docu...
Discovering Novel Information with sentence Level clustering From Multi-docu...irjes
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...kevig
 
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...ijnlc
 
lecture_mooney.ppt
lecture_mooney.pptlecture_mooney.ppt
lecture_mooney.pptbutest
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligencevini89
 
Similarity Features, and their Role in Concept Alignment Learning
Similarity Features, and their Role in Concept Alignment Learning Similarity Features, and their Role in Concept Alignment Learning
Similarity Features, and their Role in Concept Alignment Learning Shenghui Wang
 
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015rusbase
 
Introduction to complexity theory assignment
Introduction to complexity theory assignmentIntroduction to complexity theory assignment
Introduction to complexity theory assignmenttesfahunegn minwuyelet
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modelingHiroyuki Kuromiya
 
20070702 Text Categorization
20070702 Text Categorization20070702 Text Categorization
20070702 Text Categorizationmidi
 
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...romovpa
 
Topic models
Topic modelsTopic models
Topic modelsAjay Ohri
 
Topic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsTopic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsClaudia Wagner
 
Search Engines
Search EnginesSearch Engines
Search Enginesbutest
 
ECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGGeorge Simov
 

Ähnlich wie Tdm probabilistic models (part 2) (20)

Discovering Novel Information with sentence Level clustering From Multi-docu...
Discovering Novel Information with sentence Level clustering  From Multi-docu...Discovering Novel Information with sentence Level clustering  From Multi-docu...
Discovering Novel Information with sentence Level clustering From Multi-docu...
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
 
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
 
lecture_mooney.ppt
lecture_mooney.pptlecture_mooney.ppt
lecture_mooney.ppt
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Similarity Features, and their Role in Concept Alignment Learning
Similarity Features, and their Role in Concept Alignment Learning Similarity Features, and their Role in Concept Alignment Learning
Similarity Features, and their Role in Concept Alignment Learning
 
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
 
Introduction to complexity theory assignment
Introduction to complexity theory assignmentIntroduction to complexity theory assignment
Introduction to complexity theory assignment
 
Equirs: Explicitly Query Understanding Information Retrieval System Based on Hmm
Equirs: Explicitly Query Understanding Information Retrieval System Based on HmmEquirs: Explicitly Query Understanding Information Retrieval System Based on Hmm
Equirs: Explicitly Query Understanding Information Retrieval System Based on Hmm
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modeling
 
20070702 Text Categorization
20070702 Text Categorization20070702 Text Categorization
20070702 Text Categorization
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
 
A survey of xml tree patterns
A survey of xml tree patternsA survey of xml tree patterns
A survey of xml tree patterns
 
Topic models
Topic modelsTopic models
Topic models
 
Topic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsTopic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic Models
 
Search Engines
Search EnginesSearch Engines
Search Engines
 
TEXT CLUSTERING.doc
TEXT CLUSTERING.docTEXT CLUSTERING.doc
TEXT CLUSTERING.doc
 
ECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERING
 

Tdm probabilistic models (part 2)

  • 2. Probabilistic Document Clustering and Topic Models  A popular method for probabilistic document clustering is that of topic modeling.  The idea of topic modeling is to create a probabilistic generative model for the text documents in the corpus.  The main approach is to represent a corpus as a function of hidden random variables, the parameters of which are estimated using a particular document collection.  The primary assumptions in any topic modeling approach (together with the corresponding random variables) are as follows:  The n documents in the corpus are assumed to have a probability of belonging to one of k topics.  Thus, a given document may have a probability of belonging to multiple topics, and this reflects the fact that the same document may contain a multitude of subjects.
  • 3.  For a given document Di, and a set of topics T1 . . . Tk, the probability that the document Di belongs to the topic Tj is given by P(Tj |Di).  The topics are essentially analogous to clusters, and the value of P(Tj |Di) provides a probability of cluster membership of the ith document to the jth cluster.  In non-probabilistic clustering methods, the membership of documents to clusters is deterministic in nature, and therefore the clustering is typically a clean partitioning of the document collection.  When there are overlaps in document subject matter across multiple clusters. The use of a soft cluster membership in terms of probabilities is an elegant solution to this dilemma.
  • 4.  In this scenario, the determination of the membership of the documents to clusters is a secondary goal to that of finding the latent topical clusters in the underlying text collection.  Topic modeling is related to the clustering problem, it is often studied as a distinct area of research from clustering.  The value of P(Tj |Di) is estimated using the topic modeling approach, and is one of the primary outputs of the algorithm.  The value of k is one of the inputs to the algorithm and is analogous to the number of clusters.  Each topic is associated with a probability vector, which quantifies the probability of the different terms in the lexicon for that topic.
  • 5.  Let t1 . . . td be the d terms in the lexicon. Then, for a document that belongs completely to topic Tj , the probability that the term tl occurs in it is given by P(tl|Tj ).  The value of P(tl|Tj) is another important parameter which needs to be estimated by the topic modeling approach.  The number of documents is denoted by n, topics by k and lexicon size (terms) by d.  Most topic modeling methods attempt to learn the above parameters using maximum likelihood methods, so that the probabilistic fit to the given corpus of documents is as large as possible.  There are two basic methods which are used for topic modeling,  Probabilistic Latent Semantic Indexing (PLSI)  Latent Dirichlet Allocation (LDA)
  • 6. Probabilistic Latent Semantic Indexing Method  Set of random variables P(Tj |Di) and P(tl|Tj) model the probability of a term tl occurring in any document Di.  The probability P(tl|Di) of the term tl occurring document Di can be expressed in terms:  For each term tl and document Di, generate a n × d matrix of probabilities in terms of these parameters, where, n - number of documents and d - number of terms.  For a given corpus, the n × d term-document occurrence matrix X, tells us which term actually occurs in each document, and how many times the term occurs in the document.  In other words, X(i, l) is the number of times that term tl occurs in document Di.  Therefore, we can use a maximum likelihood estimation algorithm which maximizes the product of the probabilities of terms that are observed in each document in the entire collection.
  • 7.  Log likelihood probability Σi,l X(i, l) ·log(P(tl|Di)) subject to the constraints that the probability values over each of the topic-document and term-topic spaces must sum to 1:  The Lagrangian solution essentially leads to a set of iterative update equations for the corresponding parameters need to be estimated.  These parameters can be estimated with the iterative update of two matrices [P1]k×n and [P2]d×k containing the topic-document probabilities and term-topic probabilities respectively.
  • 8.  Initializing the matrices randomly, and normalize each of them so that the probability values in their columns sum to one.  Then, iteratively perform the following steps on each of P1 and P2 respectively:  The process is iterated to convergence.  The output of this approach are the two matrices P1 and P2, the entries of which provide the topic document and term-topic probabilities respectively.
  • 9. Latent Dirichlet Allocation  The term-topic probabilities and topic-document probabilities are modeled with a Dirichlet distribution as a prior.  LDA method is the Bayesian version of the PLSI technique. PLSI method is equivalent to the LDA technique, when applied with a uniform Dirichlet prior.  The LDA method can be used to model the topic distribution of a new document more robustly, even if it is not present in the original data set.  EM-concepts used for topic modeling are quite general, and can be used for different variations on the text clustering tasks, such as text classification or incorporating user feedback into clustering.
  • 10.  LDA’s main advantage over the PLSI method is that it is not quite as susceptible to overfitting.  This is generally true of Bayesian methods which reduce the number of model parameters to be estimated, and therefore work much better for smaller data sets.  Even for larger data sets, PLSI has the disadvantage that the number of model parameters grows linearly with the size of the collection.  The PLSI model is not a fully generative model, because there is no accurate way to model the topical distribution of a document which is not included in the current data set.
  • 11. Probabilistic Models for Information Extraction  Probabilistic models show better accuracy and robustness against the noise than categorical models.  Useful for the different tasks in extracting meaning from natural language texts.  Most prominent among these probabilistic approaches are  Hidden Markov Models (HMMs),  Stochastic Context-free Grammars (SCFG), and  Maximal Entropy (ME).
  • 12. Probabilistic Models for Information Extraction Hidden Markov Models  The Three Classic Problems Related to HMMs  The Forward–Backward Procedure  The Viterbi Algorithm  The Training of the HMM  Dealing with Training Data Sparseness  Stochastic Context-Free Grammars  Using SCFGs  Maximal Entropy Modeling  Computing the Parameters of the Model  Maximal Entropy Markov Models  Training the MEMM  Conditional Random Fields  The Three Classic Problems Relating to CRF  Computing the Conditional Probability  Finding the Most Probable Label Sequence  Training the CRF
  • 13. Hidden Markov Models  An HMM is a finite-state automaton with stochastic state transitions and symbol emissions.  The automaton models a probabilistic generative process.  Process, a sequence of symbols is produced by  Starting in an initial state,  Emitting a symbol selected by the state,  Making a transition to a new state,  Emitting a symbol selected by the state, and  Repeating this transition–emission cycle until a designated final state is reached.
  • 14. HMM Assumptions  Markov assumption: the state transition depends only on the origin and destination  Output-independent assumption: all observation frames are dependent on the state that generated them, not on neighbouring observation frames
  • 15.  Formally,  Let O = {o1, . . . oM} - finite set of observation symbols and  Q ={q1, . . . qN} - finite set of states.  A first-order Markov model λ is a triple (π, A, B), where π : Q→ [0, 1] defines the starting probabilities, A : Q× Q→ [0, 1] defines the transition probabilities, and B : Q× O→ [0, 1] denotes the emission probabilities.  The functions π, A, and B define true probabilities, they must satisfy  A model λ together with the random process described above induces a probability distribution over the set O* of all possible observation sequences. 1)(   qQq    Oo oqB 1),( 1)',('  Qq qqA for all states q
  • 16. The Three Classic Problems Related to HMMs  Most applications of hidden Markov models can be reduced to three basic problems: 1. Find P(T | λ) [Evaluation]– the probability of a given observation sequence T in a given model λ. (compute the probability distribution induced by the model) 2. Find argmaxS∈Q |T| P(T, S | λ) [Decoding]– the most likely state trajectory given λ and T. (finds the most probable states sequence for a given observation sequence) 3. Find argmax λ P(T, | λ) [Learning]– the model that best accounts for a given sequence. (adjusts the model itself to maximize the likelihood of the given observation)
  • 17.  Description of how these three problems can be solved: Calculate P(T | λ), where , T is a sequence of observation symbols T = t1t2 . . . tk ∈ O∗.  Enumerate every possible state sequence of length |T|.  Let S = s1,s2 . . . s|T| ∈ Q|T| be one such sequence.  Calculate the probability P(T | S, λ) of generating T knowing that the process went through the states sequence S.  By Markovian assumption, the emission probabilities are all independent of each other. Therefore, ),(),|( ||...1 iiTi tsBSTP  
  • 18.  Similarly, the transition probabilities are independent. Thus the probability P(S|λ) for the process to go through the state sequence S is  Using the above probabilities, we find that the probability P(T|λ) of generating the sequence can be calculated as  This solution is of course infeasible in practice because of the exponential number of possible state sequences.  To solve the problem efficiently, we use a dynamical programming technique. The resulting algorithm is called the forward–backward procedure. ),()()|( 11||...11  iiTi ssAsSP     || )|(),|()|( T QS SPSTPTP 
  • 19. The Forward–Backward Procedure  Let αm(q), the forward variable, denote the probability of generating the initial segment t1, t2 . . . tm of the sequence T and finishing at the state q at time m. This forward variable can be computed recursively as follows: 1. α1(q) = π(q) · B(q, t1), 2. αn+1(q) = (Σq'∈Q αn(q') · A(q', q)) · B(q, tn+1).  Then, the probability of the whole sequence T can be calculated as P(T | λ) = Σq∈Q α|T|(q).
  • 20.  In a similar manner, one can define βm(q), the backward variable, which denotes the probability of starting at the state q at time m and generating the final segment tm+1 . . . t|T| of the sequence T.  The backward variable can be calculated starting from the end and going backward to the beginning of the sequence: 1. β|T|(q) = 1, 2. βn(q) = Σq'∈Q A(q, q') · B(q', tn+1) · βn+1(q').  The probability of the whole sequence is then P(T | λ) = Σq∈Q π(q) · B(q, t1) · β1(q).
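The two recursions translate almost line by line into code. Below is a hedged sketch reusing the dictionary-based λ = (π, A, B) representation from the earlier snippet; the function names are my own. Both ways of computing P(T | λ) should return the same value, which is a convenient sanity check.

```python
def forward(T, pi, A, B):
    """alpha[m][q] = probability of generating t_1..t_m and being in state q at time m."""
    alpha = [{q: pi[q] * B[q][T[0]] for q in pi}]
    for t in T[1:]:
        prev = alpha[-1]
        alpha.append({q: sum(prev[q2] * A[q2][q] for q2 in pi) * B[q][t]
                      for q in pi})
    return alpha

def backward(T, pi, A, B):
    """beta[m][q] = probability of generating t_{m+1}..t_|T| starting from state q at time m."""
    beta = [{q: 1.0 for q in pi}]
    for t in reversed(T[1:]):
        nxt = beta[0]
        beta.insert(0, {q: sum(A[q][q2] * B[q2][t] * nxt[q2] for q2 in pi)
                        for q in pi})
    return beta

def sequence_probability(T, pi, A, B):
    # Both expressions compute P(T | lambda) and should agree.
    alpha, beta = forward(T, pi, A, B), backward(T, pi, A, B)
    p_fwd = sum(alpha[-1][q] for q in pi)
    p_bwd = sum(pi[q] * B[q][T[0]] * beta[0][q] for q in pi)
    return p_fwd, p_bwd
```

For instance, `sequence_probability(["a", "b", "a"], pi, A, B)` evaluates P((a, b, a) | λ) for the toy model sketched earlier.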
  • 21. The Viterbi Algorithm  Solution of the second problem – finding the most likely state sequence for a given sequence T.  As with the previous problem, enumerating all possible state sequences S and choosing the one maximizing P(T, S | λ) is infeasible.  Dynamic programming works here as well, utilizing the following property of the optimal state sequence: if T' = t1 t2 . . . t|T'| is some initial segment of the sequence T = t1 t2 . . . t|T| and S = s1 s2 . . . s|T| is a state sequence maximizing P(T, S | λ), then S' = s1 s2 . . . s|T'| maximizes P(T', S' | λ) among all state sequences of length |T'| ending with s|T'|.  The resulting algorithm is called the Viterbi algorithm.
  • 22.  Let γn(q) denote the state sequence ending with the state q, which is optimal for the initial segment Tn = t1 t2 . . . tn among all sequences ending with q, and let δn(q) denote the probability P(Tn, γn(q) | λ) of generating this initial segment following those optimal states. Delta and gamma can be recursively calculated as follows: 1. δ1(q) = π(q) · B(q, t1), γ1(q) = (q); 2. δn+1(q) = maxq'∈Q δn(q') · A(q', q) · B(q, tn+1), γn+1(q) = γn(q*) · (q), where q* = argmaxq'∈Q δn(q') · A(q', q) · B(q, tn+1).  Then, the best state sequence among {γ|T|(q) : q ∈ Q} is the optimal one: argmaxS∈Q|T| P(T, S | λ) = γ|T|(argmaxq∈Q δ|T|(q)).
  • 23. Example of the Viterbi Computation  Using the HMM described in the figure with the sequence (a, b, a), the Viterbi algorithm performs the computation shown on the next slide. (Figure: a sample HMM.)
  • 24. (Figure: computation of the optimal path using the Viterbi algorithm.)  Two optimal paths result: {S1, S3, S1} and {S3, S2, S3}.
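For completeness, here is a sketch of the δ/γ recursion in Python, again assuming the dictionary-based λ = (π, A, B) representation used above. The sample HMM from the figure is not reproduced here, so this code is illustrative rather than a re-derivation of the two optimal paths listed on the slide.

```python
def viterbi(T, pi, A, B):
    """Return (best_prob, best_state_sequence) = argmax_S P(T, S | lambda)."""
    # delta[q]: probability of the best path for t_1..t_m that ends in state q
    # gamma[q]: that best path itself
    delta = {q: pi[q] * B[q][T[0]] for q in pi}
    gamma = {q: [q] for q in pi}
    for t in T[1:]:
        new_delta, new_gamma = {}, {}
        for q in pi:
            best_prev = max(pi, key=lambda q2: delta[q2] * A[q2][q])
            new_delta[q] = delta[best_prev] * A[best_prev][q] * B[q][t]
            new_gamma[q] = gamma[best_prev] + [q]
        delta, gamma = new_delta, new_gamma
    best_final = max(pi, key=lambda q: delta[q])
    return delta[best_final], gamma[best_final]
```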
  • 25. The Training of the HMM – Baum–Welch re-estimation formulas  Let μn(q) be the probability P(sn = q | T, λ) of being in the state q at time n while generating the observation sequence T. Then μn(q) · P(T | λ) is the probability of generating T passing through the state q at time n.  By definition of the forward and backward variables, this probability is equal to αn(q) · βn(q). Thus, μn(q) = αn(q) · βn(q) / P(T | λ).  Also let ϕn(q, q') be the probability P(sn = q, sn+1 = q' | T, λ) of passing from state q to state q' at time n while generating the observation sequence T. As in the preceding equation, ϕn(q, q') = αn(q) · A(q, q') · B(q', tn+1) · βn+1(q') / P(T | λ).  The sum of μn(q) over all n = 1 . . . |T| can be seen as the expected number of times the state q was visited while generating the sequence T.  Or, if one sums over n = 1 . . . |T|−1, the expected number of transitions out of the state q results, because there is no transition at time |T|.
  • 26.  Similarly, the sum of ϕn(q, q') over all n = 1 . . . |T|−1 can be interpreted as the expected number of transitions from the state q to q'.  The Baum–Welch formulas re-estimate the parameters of the model λ according to these expectations: π'(q) := μ1(q), A'(q, q') := Σn=1..|T|−1 ϕn(q, q') / Σn=1..|T|−1 μn(q), B'(q, o) := Σn=1..|T|: tn=o μn(q) / Σn=1..|T| μn(q).  It can be shown that the model λ' = (π', A', B') either is equal to λ, in which case λ is a critical point of the likelihood function P(T | λ), or λ' accounts for the training sequence T better than the original model λ in the sense that P(T | λ') > P(T | λ).  Therefore, the training problem can be solved by iteratively applying the re-estimation formulas until convergence.
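One re-estimation pass can be sketched as follows, reusing the forward and backward functions from the earlier snippet and assuming a single training sequence with no zero-probability corner cases; this is an illustration of the formulas above, not the exact implementation behind the slides.

```python
def baum_welch_step(T, pi, A, B):
    """One Baum-Welch re-estimation pass on a single training sequence T."""
    alpha, beta = forward(T, pi, A, B), backward(T, pi, A, B)
    pT = sum(alpha[-1][q] for q in pi)          # P(T | lambda)
    n = len(T)
    # mu[m][q]  = P(s_m = q | T, lambda)
    # phi[m][(q, q2)] = P(s_m = q, s_{m+1} = q2 | T, lambda)
    mu  = [{q: alpha[m][q] * beta[m][q] / pT for q in pi} for m in range(n)]
    phi = [{(q, q2): alpha[m][q] * A[q][q2] * B[q2][T[m + 1]] * beta[m + 1][q2] / pT
            for q in pi for q2 in pi} for m in range(n - 1)]
    new_pi = {q: mu[0][q] for q in pi}
    new_A = {q: {q2: sum(phi[m][(q, q2)] for m in range(n - 1)) /
                     sum(mu[m][q] for m in range(n - 1))
                 for q2 in pi} for q in pi}
    new_B = {q: {o: sum(mu[m][q] for m in range(n) if T[m] == o) /
                    sum(mu[m][q] for m in range(n))
                 for o in B[q]} for q in pi}
    return new_pi, new_A, new_B
```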
  • 27. Dealing with Training Data Sparseness  Techniques for data sparseness problems in probabilistic modeling:  Smoothing  Shrinkage  Smoothing  The process of flattening a probability distribution implied by a model so that all reasonable sequences can occur with some probability.  It broadens the distribution by redistributing weight from high-probability regions to zero-probability regions.  Example  Laplace smoothing o Every possible training event is counted as occurring one time more than it actually does. Any constant can be used instead of “one.”  Other possible methods include back-off smoothing, deleted interpolation, and others.
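A minimal sketch of Laplace (add-one, or more generally add-k) smoothing applied to a single emission distribution; the counts and the vocabulary below are invented for illustration.

```python
from collections import Counter

def laplace_smoothed_emissions(observed_counts, vocabulary, k=1.0):
    """Add-k smoothing (Laplace smoothing for k = 1): every symbol in the
    vocabulary is counted k more times than observed, so no symbol ends up
    with zero probability."""
    total = sum(observed_counts.values()) + k * len(vocabulary)
    return {o: (observed_counts.get(o, 0) + k) / total for o in vocabulary}

# Example: a state emitted "a" 3 times and "b" once; "c" was never seen.
counts = Counter({"a": 3, "b": 1})
print(laplace_smoothed_emissions(counts, vocabulary=["a", "b", "c"]))
# -> {'a': 4/7 = 0.571..., 'b': 2/7 = 0.285..., 'c': 1/7 = 0.142...}
```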
  • 28.  Shrinkage  Defined in terms of some hierarchy representing the expected similarity between parameter estimates.  With respect to HMMs, the hierarchy can be defined as a tree with the HMM states as the leaves – all at the same depth.  The hierarchy is created as follows:  First, the most complex HMM is built, and its states are used for the leaves of the tree.  Then the states are separated into disjoint classes within which the states are expected to have similar probability distributions.  The classes become the parents of their constituent states in the hierarchy (the HMM structure at the leaves induces a simpler HMM structure at the level of the classes).  The class-level HMM is generated by summing the probabilities of emissions and transitions of all states in a class.  This process may be repeated until only a single-state HMM remains at the root of the hierarchy.
  • 29. Successful Application Areas of HMM  Online handwriting recognition  Speech recognition  Gesture recognition  Language recognition  Motion Video analysis and tracking  Protein sequence / gene sequence alignment  Stock price prediction  …
  • 30. Stochastic context-free grammars  An SCFG is a quintuple G = (T, N, S, R, P), where T is the alphabet of terminal symbols (tokens), N is the set of nonterminals, S is the starting nonterminal, R is the set of rules, and P : R → [0, 1] defines their probabilities.  The rules have the form n → s1 s2 . . . sk, where n is a nonterminal and each si is either a token or another nonterminal.  An SCFG generates (or accepts) a given string (sequence of tokens) if the string can be produced starting from a sequence containing just the starting symbol S and expanding nonterminals one by one using the rules from the grammar.  The generated string can be naturally represented by a parse tree, with  the starting symbol as the root,  nonterminals as internal nodes, and  tokens as leaves.
  • 31.  An SCFG is an ordinary context-free grammar with the addition of the P function.  The semantics of the probability function P are straightforward.  If r is the rule n → s1 s2 . . . sk, then P(r) is the frequency of expanding n using this rule.  In Bayesian terms, if it is known that a given sequence of tokens was generated by expanding n, then P(r) is the a priori likelihood that n was expanded using the rule r.  For every nonterminal n, the sum Σ P(r) of the probabilities of all rules r headed by n must equal one.
  • 32. Using SCFGs Classical definition of SCFG:  It is assumed that the rules are all independent.  The (unconditional) probability of a given parse tree is then found by simply multiplying the probabilities of all rules participating in it.  The parsing problem is formulated as follows:  Given a sequence of tokens (a string), find the most probable parse tree that could generate the string.  A simple generalization of the Viterbi algorithm is able to solve this problem efficiently. Practical applications of SCFGs:  It is rarely the case that the rules are truly independent.  Let the probabilities P(r) be conditioned on the context where the rule is applied.  If the conditioning context is chosen reasonably, the Viterbi algorithm still works correctly even for this more general problem.
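To make the rule-independence assumption concrete, the following toy sketch computes a parse tree's probability as the product of the probabilities of the rules used in it. The grammar, its rule probabilities, the tree encoding, and the example parse are all invented for illustration.

```python
# Toy SCFG: rules are keyed by (head, expansion); the probabilities of the
# rules sharing the same head sum to one.
P_rule = {
    ("S",  ("NP", "VP")):        1.0,
    ("NP", ("noun",)):           0.7,
    ("NP", ("article", "noun")): 0.3,
    ("VP", ("verb", "NP")):      1.0,
}

def tree_probability(tree):
    """tree = (nonterminal, [children]); a leaf is a plain token string.
    Under the independence assumption, the parse-tree probability is the
    product of the probabilities of all rules used in it."""
    head, children = tree
    expansion = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = P_rule[(head, expansion)]
    for child in children:
        if not isinstance(child, str):      # internal node: recurse
            p *= tree_probability(child)
    return p

# S -> NP VP, NP -> noun, VP -> verb NP, NP -> article noun
parse = ("S", [("NP", ["noun"]), ("VP", ["verb", ("NP", ["article", "noun"])])])
print(tree_probability(parse))              # 1.0 * 0.7 * 1.0 * 0.3 = 0.21
```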
  • 33. Maximal Entropy Modeling  Consider a random process of an unknown nature that produces a single output value y, a member of a finite set Y of possible output values.  The process of generating y may be influenced by some contextual information x – a member of the set X of possible contexts.  The task is to construct a statistical model that accurately represents the behavior of the random process.  Such a model is a method of estimating the conditional probability of generating y given the context x.  Let P(x, y) denote the unknown true joint probability distribution of the random process, and let p(y | x) be the model we are trying to build, taken from the class ℘ of all possible models.  To build the model we are given a set of training samples generated by observing the random process for some time.  The training data consist of a sequence of pairs (xi, yi) of different outputs produced in different contexts.
  • 34.  In many cases the set X is too large and underspecified to be used directly.  For instance, X may be the set of all dots “.” in all possible English texts.  By contrast, Y may be extremely simple while remaining interesting.  In the preceding case, Y may contain just two outcomes: “SentenceEnd” and “NotSentenceEnd.”  The target model p(y | x) would in this case solve the problem of finding sentence boundaries.  In such cases it is impossible to use the context x directly to generate the output y.  There are usually many regularities and correlations, however, that can be exploited.  Different contexts are usually similar to each other in all manner of ways, and similar contexts tend to produce similar output distributions.
  • 35.  To express such regularities and their statistics, one can use constraint functions and their expected values.  A constraint function f : X × Y → R can be any real-valued function.  Binary-valued trigger functions are a common choice:  such a trigger function returns one for a pair (x, y) if the context x satisfies the condition predicate C and the output value y is yi, and zero otherwise.  A common short notation for such a trigger function is C→yi.  For the example above, useful triggers are previous token is “Mr”→NotSentenceEnd and next token is capitalized→SentenceEnd.  Given a constraint function f, we express its importance by requiring our target model to reproduce f ’s expected value under the true distribution: Ep( f ) = EP( f ), that is, Σx,y P(x) p(y | x) f(x, y) = Σx,y P(x, y) f(x, y).
  • 36.  In practice we cannot calculate the true expectation and must use the empirical expected value calculated by summing over the N training samples: Ẽ( f ) = (1/N) Σi f(xi, yi).  The choice of feature functions is domain dependent. Let us assume the complete set of features F = { fk} is given.  The completeness of the set of features is expressed by requiring that the model agree with all of the expected-value constraints while otherwise being as uniform as possible.  The uniformity requirement defines the target model uniquely. The degree of uniformity of a model is expressed by its conditional entropy H(p) = −Σx P(x) Σy p(y | x) log p(y | x), or, empirically, HE(p) = −(1/N) Σi Σy p(y | xi) log p(y | xi).
  • 37.  The constrained optimization problem of finding the maximal-entropy target model is solved by application of Lagrange multipliers and the Kuhn–Tucker theorem.  Let us introduce a parameter λk (the Lagrange multiplier) for every feature and define the Lagrangian Λ(p, λ) = H(p) + Σk λk (Ep( fk) − Ẽ( fk)).  Holding λ fixed, we compute the unconstrained maximum of the Lagrangian over all p ∈ ℘. Denote by pλ the p at which Λ(p, λ) achieves its maximum and by Ψ(λ) the value of Λ at this point.  The functions pλ and Ψ(λ) can be calculated using simple calculus: pλ(y | x) = (1/Zλ(x)) exp(Σk λk fk(x, y)) and Ψ(λ) = Λ(pλ, λ), where Zλ(x) is a normalizing constant determined by the requirement that Σy∈Y pλ(y | x) = 1.  The dual optimization problem is to find λ* = argmaxλ Ψ(λ).
  • 38.  The Kuhn–Tucker theorem asserts that, under certain conditions, the solutions of the primal and dual optimization problems coincide.  The model p, which maximizes HE(p) while satisfying the constraints, has the parametric form pλ*.  Moreover, the function Ψ(λ) is simply the log-likelihood of the training sample as predicted by the model pλ.  Thus, the model pλ* maximizes the likelihood of the training sample among all models of the parametric form pλ.
  • 39. Computing the Parameters of the Model  The function Ψ(λ) is well behaved from the perspective of numerical optimization, for it is smooth and concave.  Consequently, various methods can be used for calculating λ*.  Generalized iterative scaling (GIS) is an algorithm specifically tailored to this problem. It is applicable whenever all constraint functions are non-negative: fk(x, y) ≥ 0.  The algorithm starts with an arbitrary choice of λ's – for instance, λk = 0 for all k.  At each iteration the λ's are adjusted by λk := λk + Δλk, where Δλk solves Ẽ( fk) = Σx,y P̃(x) pλ(y | x) fk(x, y) exp(Δλk f#(x, y)), with f#(x, y) = Σk fk(x, y).  In the simplest case, when f# is constant, Δλk is simply (1/f#) · log(Ẽ( fk) / Epλ( fk)).  In general, any numerical algorithm for solving the equation can be used, such as Newton's method.
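A compressed sketch of one scaling update in the simplest case, where f#(x, y) equals the same constant C for every pair and the empirical feature expectations are nonzero. The function and variable names are mine; a production implementation would repeat this step until the λ's converge.

```python
import math

def gis_step(lams, features, train, labels, C):
    """One generalized-iterative-scaling update, assuming sum_k f_k(x, y) == C
    for every (x, y).  lams: current lambda_k values; features: the functions
    f_k(x, y); train: list of (x, y) samples; labels: the output set Y."""
    def p(y, x):
        # p_lambda(y | x): the exponential (maximum-entropy) model
        score = lambda yy: math.exp(sum(l * f(x, yy) for l, f in zip(lams, features)))
        return score(y) / sum(score(yy) for yy in labels)

    new_lams = []
    for k, f in enumerate(features):
        emp = sum(f(x, y) for x, y in train) / len(train)        # empirical expectation of f_k
        mod = sum(p(y, x) * f(x, y)
                  for x, _ in train for y in labels) / len(train)  # model expectation of f_k
        new_lams.append(lams[k] + (1.0 / C) * math.log(emp / mod))
    return new_lams
```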
  • 40. Maximal Entropy Markov Models  An MEMM is a probabilistic finite-state acceptor.  Unlike an HMM, which has separate transition and emission probabilities, an MEMM has only transition probabilities, and they depend on the observations.  A slightly modified version of the Viterbi algorithm solves the problem of finding the most likely state sequence for a given observation sequence.  An MEMM consists of a set Q = {q1, . . . , qN} of states and a set of transition probability functions Aq : X × Q → [0, 1], where X denotes the set of all possible observations. Aq(x, q') gives the probability P(q' | q, x) of a transition from q to q', given the observation x.  The model does not generate x but only conditions on it.  The set X need not be small and need not even be fully defined.  The transition probabilities Aq are separate exponential models trained using maximal entropy.
  • 41.  The task of a trained MEMM is to produce the most probable sequence of states given the observation; this is solved by a simple modification of the Viterbi algorithm.  The forward–backward algorithm loses its original meaning because here it computes the probability of the observation being generated by any state sequence, which is always one.  The forward and backward variables are still useful for MEMM training. The forward variable αm(q) (cf. the HMM case) denotes the probability of being in state q at time m given the observation. It is computed recursively as αm+1(q) = Σq'∈Q αm(q') · Aq'(xm+1, q).  The backward variable βm(q) denotes the probability of starting from state q at time m given the observation. It is computed similarly as βm(q) = Σq'∈Q Aq(xm+1, q') · βm+1(q').  The model Aq for transition probabilities from a state is defined parametrically using constraint functions. If fk : X × Q → R is the set of such functions for a given state q, then the model Aq can be represented in the form Aq(x, q') = (1/Z(x, q)) exp(Σk λk fk(x, q')), where λk are the parameters to be trained and Z(x, q) is the normalizing factor making the probabilities of all transitions from a state sum to one.
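The per-state exponential model can be sketched as a softmax over candidate next states. The representation below (per-state weight vectors lams[q] shared over one common list of constraint functions) is an assumption made for illustration, not the slides' exact formulation.

```python
import math

def memm_transition(x, q, lams, feats, states):
    """A_q(x, .): a maximum-entropy distribution over the next state, given
    the current state q and the observation x.  lams[q] holds the weights of
    the per-state exponential model, feats the constraint functions f_k(x, q'),
    and Z(x, q) normalizes over all candidate next states."""
    score = {q2: math.exp(sum(l * f(x, q2) for l, f in zip(lams[q], feats)))
             for q2 in states}
    Z = sum(score.values())
    return {q2: s / Z for q2, s in score.items()}
```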
  • 42. Training the MEMM  If the true state sequence for the training data is known, the parameters of the models can be straightforwardly estimated using the GIS algorithm for training ME models.  If the sequence is not known – for instance, if there are several states with the same label in a fully connected MEMM – the parameters must be estimated using a combination of the Baum–Welch procedure and iterative scaling.  Every iteration consists of two steps: 1. Using the forward–backward algorithm and the current transition functions to compute the state occupancies for all training sequences. 2. Computing the new transition functions using GIS with the feature frequencies based on the state occupancies computed in step 1.  It is unnecessary to run GIS to convergence in step 2; a single GIS iteration is sufficient.
  • 43. Conditional Random Fields (CRF)  Problem description  Why conditional random fields (CRF)  Introduction to CRF  CRF model  Inference of CRF  Learning of CRF
  • 44. Problem Description  Given observed data X, we wish to predict Y (labels)  Example:  X = {Temperature, Humidity, ...}  Xn = observation on day n  Y = {Sunny, Rainy, Cloudy}  Yn = weather on day n  (Illustration: an observation such as {30°C, 20% humidity, light breeze} must be mapped to Sunny, Rainy, or Cloudy; the features may depend on one another, and today's weather may depend on the weather of yesterday.)
  • 45. Generative Model vs. Discriminative Model  Generative model  A model that generates the observed data randomly  Models the joint probability p(x,y)  Discriminative model  Directly estimates the posterior probability p(y|x)  Aims at modeling the “discrimination” between different outputs  (Table – generative vs. conditional models by output structure: Single variable: Naïve Bayes → Logistic regression; Sequence: HMM → Linear-chain CRF, MEMM; General: Bayesian network, MRF → General CRF.)
  • 46. Why Conditional Random Fields  Generative models  A generative model targets the joint probability p(x,y) and makes the prediction via Bayes' rule to calculate p(y|x)  Ex: Naive Bayes (single output) and HMM (Hidden Markov Model) (sequence output)  Naive Bayes: p(x, y) = p(y) Πk=1..K p(xk | y), where x is a vector of features; it assumes that, given y, the features are independent.  HMM (sequence output): p(x, y) = Πt=1..T p(yt | yt−1) p(xt | yt), with the assumptions that (1) each state yt depends only on its immediate predecessor and (2) each observation xt is conditionally independent of the others given yt.
  • 47. Why Conditional Random Fields  (Illustration: under these assumptions, humidity, temperature, and the wind scale are treated as independent given the label. Observations: Mon. {30°C, 20%, light breeze}, Tue. {28°C, 30%, light breeze}, Wed. {25°C, 40%, moderate breeze}, Thu. {22°C, 60%, moderate breeze}. A → B means A causes B.)
  • 48. Why Conditional Random Fields  Difficulties for generative models  It is not practical to represent multiple interacting features (it is hard to model p(x)) or long-range dependencies of the observations  Very strict independence assumptions are imposed on the observations (as in the Mon.–Thu. example on the previous slide)
  • 49. Why Conditional Random Fields  Discriminative models  Directly model the posterior p(y|x)  Aim at modeling the “discrimination” between different outputs  Ex: logistic regression (maximum entropy) and CRF  Advantages of discriminative models  The training process aims at finding optimal coefficients for the features, whether or not the features are correlated  Not sensitive to unbalanced training data  Especially for the classification problem, we do not have to care about p(x)
  • 50. Why Conditional Random Fields  Logistic regression (maximum entropy)  Suppose we have a bin of candies, each with an associated label (A,B,C, or D)  Each candy has multiple colors in its wrapper  Each candy is assigned a label randomly based on some distribution over wrapper colors Observation: the color of the wrapper Label: 4 kinds of flavors A: chocolate B: strawberry C: lemon D: milk
  • 51. Why Conditional Random Field  For any candy with a red label pulled from the bin:  P(A|red)+P(B|red)+P(C|red)+P(D|red) = 1  Infinite number of distributions exist that fit this constraint  The distribution that fits with the idea of maximum entropy is: (the most uniform) o P(A|red)=0.25 o P(B|red)=0.25 o P(C|red)=0.25 o P(D|red)=0.25
  • 52. Why Conditional Random Field  Now suppose we add some evidence to our model  We note that 80% of all candies with red labels are either labeled A or B o P(A|red) + P(B|red) = 0.8  The updated model that reflects this would be: o P(A|red) = 0.4 o P(B|red) = 0.4 o P(C|red) = 0.1 o P(D|red) = 0.1  As we make more observations and find more constraints, the model gets more complex
  • 53. Why Conditional Random Fields  Given a collection of facts, choose a model which is consistent with all the facts but otherwise as uniform as possible: p(y | x; w) = exp(Σj wj Fj(x, y)) / Z(x, w), where Z(x, w) = Σy' exp(Σj wj Fj(x, y')) is a normalization term.  The feature functions Fj over the evidence x1, x2, . . . , xd and the label y are defined by hand; the weights wj are obtained by learning.  (Factor graph: a factor node f(A, B) sits between the nodes A and B that it connects.)
  • 54. Linear-Chain CRF  If we extend logistic regression to a sequence problem (y = y1 y2 . . . yT): p(y | x; w) = exp(Σj wj Fj(x, y)) / Z(x, w), where Z(x, w) = Σy' exp(Σj wj Fj(x, y')) is a normalization term and Fj(x, y) = Σt fj(yt−1, yt, x), a sum along the entire sentence x.  (Diagram: each label yt−1, yt, yt+1 is connected to the observation features x1, x2, . . . , xd.)
  • 55. Linear-Chain CRF  (Diagrams: an HMM-style chain in which each label yi is linked only to its own observation xi, versus a linear-chain CRF in which each label yi is connected to the entire observation x.)
  • 56. General CRF  Divide the graph G into many templates ψA; the parameters inside each template are tied.  K(A) is the number of feature functions for the template: p(y | x) = (1/Z(x)) ΠψA∈G exp(Σk=1..K(A) λAk fAk(yA, xA)).
  • 57. Inference of CRF  Problem description:  Given the observations ({xi}) and the probability model (parameters such as the ωj mentioned above), we want to find the best state sequence  For general graphs, the problem of exact inference in CRFs is intractable  Chain- or tree-structured CRFs allow exact inference  Otherwise, approximate solutions are needed
  • 58. Inference of Linear-Chain CRF • The inference of a linear-chain CRF is very similar to that of an HMM  Example: POS (part-of-speech) tagging  The identification of words as nouns, verbs, adjectives, adverbs, etc.  Students/noun need/verb another/article break/noun
  • 59. Inference of Linear-Chain CRF  We first illustrate the inference of an HMM.  (Trellis: starting from o/s, each word of “students need another break” is expanded into the candidate tags V, N, P, and ART; the slide shows the Viterbi probability computed for every word/tag pair, most of which are zero.)
  • 60. Inference of Linear-Chain CRF  Then back to the CRF: y* = argmaxy p(y | x; w) = argmaxy exp(Σj wj Fj(x, y)) / Z(x, w) = argmaxy Σj wj Fj(x, y) = argmaxy Σj wj Σi fj(yi−1, yi, x) = argmaxy Σi gi(yi−1, yi).
  • 61. Inference of Linear-Chain CRF  gi(yi−1, yi) = Σj wj fj(yi−1, yi, x).  Each gi can be represented as an M × M matrix, where M is the cardinality of the set of tags (rows indexed by yi−1 and columns by yi, e.g., over the tags V, ART, N).
  • 62. Inference of Linear-Chain CRF  The inference of a linear-chain CRF is similar to that of an HMM, which uses the Viterbi algorithm.  v: ranges over the tags  U(k, v) is defined to be the score of the best sequence of tags from 1 to k, where tag k is required to be v: U(k, v) = max{y1,...,yk−1} [Σi=1..k−1 gi(yi−1, yi) + gk(yk−1, v)] = maxyk−1 [U(k−1, yk−1) + gk(yk−1, v)].
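The recursion for U(k, v) is again a Viterbi-style dynamic program; a hedged sketch follows. The feature functions here take the position i as an explicit extra argument, which is a convenience of this sketch rather than the slides' notation, and the scores are the (log-space) sums Σi gi(yi−1, yi).

```python
def crf_viterbi(x, tags, weights, feats):
    """Most probable tag sequence under a linear-chain CRF.
    feats: functions f_j(y_prev, y, x, i); weights: the matching w_j.
    g_i(y_prev, y) = sum_j w_j * f_j(y_prev, y, x, i)."""
    def g(i, y_prev, y):
        return sum(w * f(y_prev, y, x, i) for w, f in zip(weights, feats))

    # U[v]: score of the best tag sequence for positions 0..i that ends in tag v
    U = {v: g(0, None, v) for v in tags}          # position 0 uses a dummy previous tag
    back = {v: [v] for v in tags}
    for i in range(1, len(x)):
        new_U, new_back = {}, {}
        for v in tags:
            best = max(tags, key=lambda u: U[u] + g(i, u, v))
            new_U[v] = U[best] + g(i, best, v)
            new_back[v] = back[best] + [v]
        U, back = new_U, new_back
    best_last = max(tags, key=lambda v: U[v])
    return back[best_last]
```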
  • 63. Learning of CRF  Problem description  Given training pairs ({xi, yi}), we wish to estimate the parameters of the model ({ωi})  Method  For chain- or tree-structured CRFs, the parameters can be trained by maximum likelihood  We will focus on the learning of the linear-chain CRF  General CRFs are intractable; hence, approximate solutions are necessary
  • 64. Learning of Linear-Chain CRF  Conditional maximum likelihood (CML)  x: observations; y: labels  maxw L(w; y | x) = maxw p(y | x; w)  Apply CML to the learning of the CRF:  maximize p(y | x; w), or equivalently log p(y | x; w), by setting ∂ log p(y | x; w)/∂wj = Fj(x, y) − ∂ log Z(x, w)/∂wj = 0.  It can be shown that the conditional log-likelihood of the linear-chain CRF is a convex function  We can apply gradient ascent to the CML problem.
  • 65. Learning of Linear-Chain CRF  ∂ log p(y | x; w)/∂wj = Fj(x, y) − ∂ log Z(x, w)/∂wj = Fj(x, y) − Σy' Fj(x, y') p(y' | x; w) = Fj(x, y) − Ey'∼p(y'|x;w)[Fj(x, y')] = 0, where Ep[·] denotes expectation with respect to the distribution p.  For the entire training set T: Σ(x,y)∈T Fj(x, y) = Σx∈T Ey'∼p(y'|x;w)[Fj(x, y')], i.e., the expectation of the feature Fj with respect to the empirical distribution (left side) must equal its expectation with respect to the model distribution (right side).
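For very short sequences, this gradient can be checked by brute force, enumerating every label sequence to compute the model expectation exactly. The sketch below does just that and is meant only to make the "empirical count minus expected count" form tangible; the feature functions again take the position as an extra argument, as an assumption of this sketch.

```python
import math
from itertools import product

def loglik_gradient(train, tags, weights, feats):
    """Gradient of the conditional log-likelihood of a linear-chain CRF,
    computed by brute-force enumeration of label sequences (tiny inputs only):
    dL/dw_j = sum over (x, y) of  F_j(x, y) - E_{y'~p(y'|x;w)}[F_j(x, y')]."""
    def F(j, x, y):                                # global feature F_j(x, y)
        return sum(feats[j](y[i - 1] if i else None, y[i], x, i)
                   for i in range(len(x)))

    def score(x, y):                               # unnormalized exp(sum_j w_j F_j(x, y))
        return math.exp(sum(w * F(j, x, y) for j, w in enumerate(weights)))

    grad = [0.0] * len(weights)
    for x, y in train:
        all_y = list(product(tags, repeat=len(x)))
        Z = sum(score(x, yp) for yp in all_y)
        for j in range(len(weights)):
            expected = sum(score(x, yp) / Z * F(j, x, yp) for yp in all_y)
            grad[j] += F(j, x, y) - expected       # empirical minus model expectation
    return grad
```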
  • 66. Learning of Linear-Chain CRF  To yield the best model:  The expectation of each feature with respect to the model distribution is equal to its expected value under the empirical distribution of the training data  This is the same condition as in the “maximum entropy model”  (Diagram: logistic regression (maximum entropy), extended to sequences, yields the linear-chain CRF.)