3. Conditional Probability and Tags
• P(Verb) is the probability of a randomly selected word being a verb.
• P(Verb|race) is “what’s the probability of a word being a verb, given that it’s the word ‘race’?”
• Race can be a noun or a verb.
• It’s more likely to be a noun.
• P(Verb|race) can be estimated by looking at some corpus and saying “out of all the times we saw ‘race’, how many were verbs?”
• In the Brown corpus, P(Verb|race) = 96/98 ≈ .98
• How to calculate for a tag sequence, say P(NN|DT)?
P(V | race) = Count(race is verb) / total Count(race)
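A minimal Python sketch of this count-based estimate, assuming NLTK and its tagged Brown corpus are available (nltk.download('brown')); the exact counts can differ from the 96/98 above depending on the corpus version and tagset:

from collections import Counter
from nltk.corpus import brown

# Collect the tags observed with the word "race" in the tagged Brown corpus.
tag_counts = Counter(tag for word, tag in brown.tagged_words()
                     if word.lower() == "race")
total = sum(tag_counts.values())
# Brown verb tags start with "VB" (VB, VBD, VBZ, ...).
verbs = sum(n for tag, n in tag_counts.items() if tag.startswith("VB"))
print(f"P(Verb | race) = {verbs}/{total} = {verbs / total:.2f}")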
4. Stochastic Tagging
• Stochastic taggers generally resolve tagging ambiguities by using a
training corpus to compute the probability of a given word having a
given tag in a given context.
• A stochastic tagger is also called an HMM tagger, a Maximum Likelihood tagger, or a Markov model tagger, since it is based on the Hidden Markov Model.
• For a given word sequence, Hidden Markov Model (HMM) Taggers
choose the tag sequence that maximizes,
P(word | tag) * P(tag | previous-n-tags)
5. Stochastic Tagging
• A bigram HMM tagger chooses the tag ti for word wi that is most
probable given the previous tag, ti-1
ti = argmaxj P(tj | ti-1, wi)
• From the chain rule for probability factorization,
P(T, W) = Πi P(wi | w1…wi-1, t1…ti) × P(ti | w1…wi-1, t1…ti-1)
• Some approximations are introduced to simplify the model, such as the two on the next slide.
6. Stochastic Tagging
• The word probability depends only on the tag:
P(wi | w1…wi-1, t1…ti) ≈ P(wi | ti)
• The dependence of a tag on the preceding tag history is limited in time, i.e. a tag depends only on the two preceding ones:
P(ti | w1…wi-1, t1…ti-1) ≈ P(ti | ti-2, ti-1)
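After these approximations, the bigram rule of slide 5 reduces to choosing ti = argmaxj P(tj | ti-1) × P(wi | tj). A toy Python sketch, applied greedily left to right; the probability tables are hypothetical illustration values, not trained estimates:

# Hypothetical toy tables: trans is P(ti | ti-1), emit is P(wi | ti).
trans = {("<s>", "DT"): 0.5, ("DT", "NN"): 0.6, ("DT", "VB"): 0.1}
emit = {("DT", "the"): 0.7, ("NN", "race"): 0.002, ("VB", "race"): 0.001}

def greedy_bigram_tag(words, tags=("DT", "NN", "VB")):
    prev, out = "<s>", []
    for w in words:
        # ti = argmax over tj of P(tj | ti-1) * P(wi | tj);
        # unseen events get a tiny floor probability.
        best = max(tags,
                   key=lambda t: trans.get((prev, t), 1e-8) * emit.get((t, w), 1e-8))
        out.append(best)
        prev = best
    return out

print(greedy_bigram_tag(["the", "race"]))  # ['DT', 'NN']

A greedy pass can lock in an early mistake; the Viterbi sketch after slide 14 searches over whole sequences instead.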
7. Statistical POS Tagging (Allen95)
• Let’s step back a minute and remember some probability theory and its use in
POS tagging.
• Suppose, with no context, we just want to know given the word “flies” whether
it should be tagged as a noun or as a verb.
• We use conditional probability for this: we want to know which is greater
PROB(N | flies) or PROB(V | flies)
• Note the definition of conditional probability:
PROB(a | b) = PROB(a & b) / PROB(b)
– Where PROB(a & b) is the probability of the two events a & b occurring
simultaneously
8. Calculating POS for “flies”
We need to know which is greater:
• PROB(N | flies) = PROB(flies & N) / PROB(flies)
• PROB(V | flies) = PROB(flies & V) / PROB(flies)
• Estimate these probabilities by counting on a corpus (a sketch follows below).
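A worked Python sketch of this comparison with hypothetical counts (not real Brown figures); note that PROB(flies) appears in both denominators, so the comparison reduces to comparing the joint counts:

count_flies_N = 21        # hypothetical: "flies" tagged as a noun
count_flies_V = 23        # hypothetical: "flies" tagged as a verb
n_tokens = 1_000_000      # hypothetical corpus size

p_flies = (count_flies_N + count_flies_V) / n_tokens   # PROB(flies)
p_N = (count_flies_N / n_tokens) / p_flies             # PROB(flies & N) / PROB(flies)
p_V = (count_flies_V / n_tokens) / p_flies             # PROB(flies & V) / PROB(flies)
print(p_N, p_V)   # the corpus size cancels: 21/44 vs 23/44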
9. Stochastic Tagging
• The simplest stochastic tagger applies the following approaches for POS
tagging –
Approach 1: Word Frequency Approach
• In this approach, the stochastic taggers disambiguate the words based on
the probability that a word occurs with a particular tag.
• We can also say that the tag encountered most frequently with the word in
the training set is the one assigned to an ambiguous instance of that word.
• The main issue with this approach is that it may yield inadmissible sequences of tags, since each word is tagged independently of its neighbors. A sketch of the approach follows below.
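A minimal sketch of this word-frequency approach, assuming the training corpus is given as a list of (word, tag) pairs; the tiny corpus and the NN fallback for unknown words are illustrative choices:

from collections import Counter, defaultdict

def train_most_frequent(tagged_words):
    by_word = defaultdict(Counter)
    for word, tag in tagged_words:
        by_word[word.lower()][tag] += 1
    # For each word, keep the tag it occurred with most often.
    return {w: c.most_common(1)[0][0] for w, c in by_word.items()}

def tag_sentence(words, model, unknown="NN"):
    return [(w, model.get(w.lower(), unknown)) for w in words]

corpus = [("heat", "NN"), ("heat", "NN"), ("heat", "VB"), ("water", "NN")]
model = train_most_frequent(corpus)
print(tag_sentence(["heat", "water"], model))  # [('heat', 'NN'), ('water', 'NN')]

Because each word is tagged in isolation, nothing prevents an inadmissible sequence of tags; that is what Approach 2 addresses.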
10. Stochastic Tagging
• Assign each word its most likely POS tag
– If w has tags t1, …, tk, then we can use
– P(ti | w) = c(w, ti) / (c(w, t1) + … + c(w, tk)), where
– c(w, ti) = the number of times w/ti appears in the corpus
– Success: 91% for English
Example: heat :: noun/89, verb/5 (worked out below)
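A worked instance of the estimator with the slide’s counts for heat:

# P(ti | w) = c(w, ti) / (c(w, t1) + ... + c(w, tk)), with the counts above.
counts = {"noun": 89, "verb": 5}
total = sum(counts.values())          # 94
print(counts["noun"] / total)         # P(noun | heat) ~ 0.947
print(counts["verb"] / total)         # P(verb | heat) ~ 0.053, so tag heat as noun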
11. Stochastic Tagging
Approach 2: Tag Sequence Probabilities
• It is another approach to stochastic tagging, where the tagger calculates the probability of a given sequence of tags occurring.
• It is also called the n-gram approach.
• It is called so because the best tag for a given word is determined by the probability of it occurring with the n previous tags.
12. Stochastic Tagging
• Given: a sequence of words W
– W = w1, w2, …, wn (a sentence)
– e.g., W = heat water in a large vessel
• Assign a sequence of tags T:
– T = t1, t2, …, tn
• Find the T that maximizes P(T | W)
13. Stochastic Tagging
• But P(ti | wi) is difficult to compute directly, so the Bayesian classification rule is used:
P(x | y) = P(x) P(y | x) / P(y)
• Applied to the sequence of words, this gives
P(ti | wi) = P(ti) P(wi | ti) / P(wi)
• where P(wi) does not change across candidate tags and thus does not need to be calculated.
• Thus, the most probable tag sequence is found by maximizing the product of two probabilities for each possible sequence:
– the prior probability of the tag sequence (the context): P(ti)
– the likelihood of the sequence of words given the sequence of (hidden) tags: P(wi | ti)
14. Stochastic Tagging
• Two simplifications for computing the most probable sequence of tags (see the sketch below):
– The prior probability of the part-of-speech tag of a word depends only on the tag of the previous word (bigrams: the context is reduced to the previous tag). This facilitates the computation of P(ti).
– Ex.: the probability of a noun after a determiner.
– The probability of a word depends only on its part-of-speech tag (it is independent of the other words in the context). This facilitates the computation of the likelihood P(wi | ti).
– Ex.: given the tag noun, the probability of the word dog.
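A minimal Viterbi-style sketch that puts slides 12-14 together: it finds the tag sequence maximizing the product of the bigram prior P(ti | ti-1) and the likelihood P(wi | ti). All probability tables are hypothetical toy values, not trained estimates:

def viterbi(words, tags, trans, emit, start="<s>", floor=1e-8):
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (trans.get((start, t), floor) * emit.get((t, words[0]), floor), [t])
            for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            # Pick the best previous tag s for extending a path with t.
            p, path = max(((best[s][0] * trans.get((s, t), floor), best[s][1])
                           for s in tags), key=lambda x: x[0])
            new[t] = (p * emit.get((t, w), floor), path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

trans = {("<s>", "DT"): 0.5, ("DT", "NN"): 0.6, ("DT", "VB"): 0.1, ("NN", "VB"): 0.3}
emit = {("DT", "the"): 0.7, ("NN", "heat"): 0.003, ("VB", "heat"): 0.001,
        ("NN", "water"): 0.002, ("VB", "water"): 0.001}
print(viterbi(["the", "heat"], ["DT", "NN", "VB"], trans, emit))  # ['DT', 'NN']

Keeping only the best path per tag at each position makes the search over all tag sequences O(n·k²) instead of exponential in the sentence length.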
15. Stochastic Tagging
• Based on the probability of a certain tag occurring, given the various possibilities
• Necessitates a training corpus
• No probabilities are available for words not in the corpus.
• The training corpus may be too different from the test corpus.
16. Stochastic Tagging (cont.)
Simple method: choose the most frequent tag in the training text for each word!
– Result: 90% accuracy
– Why measure this? It serves as a baseline: other methods will do better.
– The HMM tagger is one example.