Natural Language Processing
Unit 2 – Language Model: Part 2
Anantharaman Narayana Iyer
narayana dot Anantharaman at gmail dot com
Part 1 on 28th Aug 2014, Part 2 on 4th Sep 2014
References
• Week 2 "Language Modeling" lectures of the Coursera course on Natural Language Processing by Dan Jurafsky and Christopher Manning
• Week 1 "Language Modeling Problem" and "Parameter Estimation in Language Models" lectures of the Coursera course on Natural Language Processing by Michael Collins
Objective of today’s and the next class
• The objective of Part 1 of the LM class was to:
• Describe the Language Model
• Understand its applications
• Describe n-gram, bigram and trigram models using Markov processes
• Measure the performance of Language Models
• In Part 2 of this topic, we discuss:
• Parameter estimation techniques: Interpolation, Discounting Methods
• Smoothing
• Prerequisite for this class:
• Mathematics for Natural Language Processing (Covered in Unit 1)
Summary of Part 1
• Given a sequence of words drawn from a vocabulary, a Language Model assigns a probability to that word sequence
• This probability distribution satisfies the two fundamental axioms of probability: the probabilities of all possible word sequences sum to 1, and the probability of any sequence lies between 0 and 1, both limits inclusive
• Computing this probability in its exact form as a joint distribution is intractable for non-trivial vocabulary sizes and sequence lengths
• Unigram models assume the words in the sequence are independent, so the joint probability reduces to the product of individual word probabilities
• Though unigram models are useful for certain classes of applications, they have severe limitations, as in reality there is often a dependency between the words of a sequence.
• Bigrams approximate the joint probability distribution by assuming that a word in a sequence depends only on the previous word of the sequence. Trigrams extend this further by allowing the 2 previous words to be considered while computing the probability of a word in a sequence.
• Bigrams can be mathematically represented by a first order Markov process, while a second order Markov process can be used to represent trigrams. In general, a k-gram LM can be represented by a (k-1)th order Markov process.
• Note that it is good to think of domain concepts (such as bigram, trigram) as separate from the representation (such as Markov processes) of such concepts. A given representation may be applicable to multiple classes of problems from different domains, and a given concept may have multiple representations. The Markov Model is a formal framework applicable to multiple problems – the NLP LM is one of them.
Summary of Part 1 - continued
• The model complexity and the number of model parameters increase as we increase the order of our representation. The simplest representation, the unigram, involves a direct multiplication of the probabilities of the words in a sequence, while a trigram model involves on the order of N^3 parameters, since each word in a trigram is a random variable that can take N = |V| values.
• Perplexity is a metric used to evaluate the performance of a LM
• Based on the nature of the application, simple models might result in errors due to underfitting, while increasing the model complexity requires a larger training corpus and may also cause errors due to overfitting
• Hence, bias–variance trade-offs need to be considered while selecting the LM
• Studies by Goodman showed that trigrams provide an optimal trade-off for many applications
• Back-off and interpolation techniques can be used to select the right mix of models for the application
Back off and Interpolation
• Back-off techniques successively back off from a higher complexity language model to lower ones, based on the evidence available for each model
• Use the trigram if there is good evidence that it produces good results, i.e. we have adequate data to train the system and the trigram estimates converge close to their true values
• If trigrams are not showing good results, try using bigrams; otherwise fall back to unigrams
• Interpolation produces a new language model by mixing all the different n-gram models, using weights as coefficients
• In general, interpolation is known to produce better results than back-off
Interpolation
• Main challenge: sparse data
• The natural estimate is the MLE:
• qML(wi | wi-2, wi-1) = Count(wi-2, wi-1, wi) / Count(wi-2, wi-1)
• There are N^3 parameters in the trigram model – e.g. if N = |V| = 10,000, then there are 10^12 parameters in the model
• In many cases the trigram count, or even the bigram count, may be zero, leading to the sparse data problem
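As a minimal sketch of the MLE estimate and the sparse-data problem (assuming a toy tokenized corpus; function names such as mle_trigram_prob are illustrative, not from the lecture):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams in a token list (no START/STOP padding, for brevity)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_trigram_prob(w1, w2, w3, tri_counts, bi_counts):
    """q_ML(w3 | w1, w2) = Count(w1, w2, w3) / Count(w1, w2)."""
    bigram = bi_counts[(w1, w2)]
    if bigram == 0:
        return None  # undefined: the conditioning bigram was never seen
    return tri_counts[(w1, w2, w3)] / bigram

corpus = "the dog barks the dog runs the cat runs".split()
tri = ngram_counts(corpus, 3)
bi = ngram_counts(corpus, 2)

print(mle_trigram_prob("the", "dog", "barks", tri, bi))  # 0.5
print(mle_trigram_prob("the", "cat", "barks", tri, bi))  # 0.0 -> unseen trigram
print(mle_trigram_prob("a", "cat", "barks", tri, bi))    # None -> unseen context bigram
```

Even on this toy corpus, most trigrams over the vocabulary get probability zero or are undefined, which is the sparse-data problem the interpolation and discounting methods below address.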
Bias Variance
• Unigram, Bigram and Trigram models have different strengths and
weaknesses
• Trigram models exhibit relatively low bias as they take into account the local context, but they need more data to train
• Unigram models ignore the context and hence suffer from high bias
• We need an estimator that can trade off these strengths and
weaknesses
Linear Interpolator
• We define the probability of a word wi given wi-1 and wi-2 to be:
q(wi|wi-2, wi-1) = λ1 qML(wi|wi-2, wi-1) + λ2 qML(wi|wi-1) + λ3 qML(wi)
With the constraints:
1. λ1 + λ2 + λ3 = 1
2. λi >= 0 for all i
• The constraints ensure that the interpolator produces a valid
probability distribution
Let V be the vocabulary and V’ = V U {STOP}
Σw∈V’ q(w | u, v) = 1 and q(w | u, v) >= 0 for all w ∈ V’
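A minimal sketch of the linear interpolator (assuming MLE unigram, bigram and trigram estimates are already available as dictionaries; the helper names are illustrative):

```python
def interpolated_prob(w1, w2, w3, q_tri, q_bi, q_uni, lambdas=(0.5, 0.3, 0.2)):
    """q(w3 | w1, w2) = l1*q_ML(w3|w1,w2) + l2*q_ML(w3|w2) + l3*q_ML(w3).

    q_tri, q_bi, q_uni hold the MLE estimates; missing entries are treated as 0.
    """
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9 and min(lambdas) >= 0  # the two constraints
    return (l1 * q_tri.get((w1, w2, w3), 0.0)
            + l2 * q_bi.get((w2, w3), 0.0)
            + l3 * q_uni.get(w3, 0.0))
```

Because the λ's are non-negative and sum to 1, the result is a convex combination of three valid distributions and is therefore itself a valid distribution.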
Determining the λ coefficients
• Create a validation (held-out) dataset that is kept separate from the training data
• Compute qML parameters for unigram, bigram and trigram from the
training data
• Define c’(w1, w2, w3) as the count of trigram (w1, w2, w3) in the
validation set
• Choose λ1, λ2, λ3 such that these coefficients maximize:
L(λ1, λ2, λ3) = Σw1,w2,w3 c’(w1, w2, w3) log q(w3 | w1, w2) such that:
q(wi|wi-2, wi-1) = λ1 qML(wi|wi-2, wi-1) + λ2 qML(wi|wi-1) + λ3 qML(wi)
With the constraints:
1. λ1 + λ2 + λ3 = 1
2. λi >= 0 for all i
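One common way to maximize L(λ1, λ2, λ3) under these constraints is an EM-style iteration over the held-out counts (the slides do not prescribe a specific optimizer, so this is a sketch; names such as estimate_lambdas are illustrative):

```python
def estimate_lambdas(held_out_counts, q_tri, q_bi, q_uni, iters=50):
    """EM-style estimation of interpolation weights on held-out trigram counts.

    held_out_counts: dict mapping (w1, w2, w3) -> c'(w1, w2, w3)
    q_tri / q_bi / q_uni: MLE estimates computed on the training data.
    """
    lambdas = [1 / 3, 1 / 3, 1 / 3]                       # start from uniform weights
    for _ in range(iters):
        expected = [0.0, 0.0, 0.0]
        for (w1, w2, w3), c in held_out_counts.items():
            parts = [lambdas[0] * q_tri.get((w1, w2, w3), 0.0),
                     lambdas[1] * q_bi.get((w2, w3), 0.0),
                     lambdas[2] * q_uni.get(w3, 0.0)]
            total = sum(parts)
            if total == 0.0:
                continue                                   # no component explains this trigram
            for k in range(3):                             # E-step: fractional responsibilities
                expected[k] += c * parts[k] / total
        z = sum(expected)
        if z == 0.0:
            break
        lambdas = [e / z for e in expected]                # M-step: renormalize
    return lambdas
```

Each iteration is guaranteed not to decrease the held-out log-likelihood, and the constraints λ1 + λ2 + λ3 = 1, λi ≥ 0 hold by construction.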
Varying the values of λ coefficients
• Often it is useful to keep different sets of λ coefficients in order to get better performance.
• For this, we define a partition function π(wi-1, wi-2)
• Example:
• π(wi-1, wi-2) = 1 if Count(wi-1, wi-2) = 0
• π(wi-1, wi-2) = 2 if 1 ≤ Count(wi-1, wi-2) ≤ 2
• π(wi-1, wi-2) = 3 if 3 ≤ Count(wi-1, wi-2) ≤ 5
• π(wi-1, wi-2) = 4 otherwise
• We can write:
q(wi | wi-2, wi-1) = λ1^π(wi-1, wi-2) * qML(wi | wi-2, wi-1) + λ2^π(wi-1, wi-2) * qML(wi | wi-1) + λ3^π(wi-1, wi-2) * qML(wi)
where the superscript π(wi-1, wi-2) indicates that a separate set of λ coefficients is used for each bucket of the partition function.
Subject to the constraints, for each bucket π(wi-1, wi-2):
1. λ1^π(wi-1, wi-2) + λ2^π(wi-1, wi-2) + λ3^π(wi-1, wi-2) = 1
2. λi^π(wi-1, wi-2) ≥ 0 for all i
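A minimal sketch of bucket-dependent interpolation weights, using the example partition above (the λ values shown are placeholders, not tuned; names are illustrative):

```python
def bucket(bigram_count):
    """Partition function pi(w_{i-1}, w_{i-2}) based on the context bigram count."""
    if bigram_count == 0:
        return 1
    if bigram_count <= 2:
        return 2
    if bigram_count <= 5:
        return 3
    return 4

# One (l1, l2, l3) triple per bucket, each tuned separately on held-out data.
LAMBDAS_BY_BUCKET = {1: (0.0, 0.2, 0.8), 2: (0.2, 0.4, 0.4),
                     3: (0.4, 0.4, 0.2), 4: (0.6, 0.3, 0.1)}

def bucketed_prob(w1, w2, w3, q_tri, q_bi, q_uni, bi_counts):
    """Interpolated q(w3 | w1, w2) with lambdas chosen by the context's bucket."""
    l1, l2, l3 = LAMBDAS_BY_BUCKET[bucket(bi_counts.get((w1, w2), 0))]
    return (l1 * q_tri.get((w1, w2, w3), 0.0)
            + l2 * q_bi.get((w2, w3), 0.0)
            + l3 * q_uni.get(w3, 0.0))
```

The intuition is that when the context bigram is frequent, the trigram estimate is trustworthy and gets a large weight; when the context is rare or unseen, more weight shifts to the bigram and unigram estimates.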
Add 1 Smoothing
• Add-1 smoothing doesn't always work well when the n-gram matrix is sparse
• It may be adequate when dealing with classification problems (as we saw in the Naïve Bayes classifier), where the sparsity is less, but it does a poor job for language model tasks
• However, it is still useful as a baseline technique
Table: Courtesy Dan Jurafsky’s Coursera Slides
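A minimal sketch of add-1 (Laplace) smoothing for bigram probabilities (assuming bigram and unigram counts are available; names are illustrative):

```python
def add_one_bigram_prob(w1, w2, bi_counts, uni_counts, vocab_size):
    """P_add1(w2 | w1) = (Count(w1, w2) + 1) / (Count(w1) + |V|)."""
    return (bi_counts.get((w1, w2), 0) + 1) / (uni_counts.get(w1, 0) + vocab_size)
```

Every unseen bigram receives the same small non-zero probability, and when the matrix is sparse the |V| term in the denominator dominates, which is why the reconstituted counts can look very different from the original ones.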
Discounting Methods
w1 w2 count
the researchers 515
the university 421
the journal 315
the first 312
the study 281
the same 253
the world 246
the new 205
the most 185
the scientists 183
the australian 164
the us 153
the research 140
the brain 131
the sun 122
the team 118
the other 113
the findings 97
the past 97
the next 94
the earth 85
the toddlers 1
the larynx 1
the scripts 1
the investor 1
the vagus 1
the fradulent 1
the lomonosov 1
the fledgling 1
the cotsen 1
the elgin 1
the single-wing 1
the stout 1
the britons 1
the bonds 1
the armand 1
the construct 1
the fin 1
the lebanese 1
the frog-sniffing 1
the did 1
the nano-nylon 1
the dread 1
the lizard-like 1
the auditory 1
the manuscipts 1
the closet 1
• Motivation
• Language Models are sensitive to the training corpus.
• N-gram models are sparsely populated as n increases.
• Test data might contain several word sequences that were not covered in the training data
• The estimated probability may be zero, or sometimes indeterminate, even if the sequence is perfectly valid and frequently occurring
• Add-1 smoothing doesn't always work well
• Discounting methods attempt to address this problem
• The strategy is to extract some probability mass out of the probability distribution estimated from the training data and assign it to new sequences
• For each n-gram, we subtract a fixed number from its count, hoping to redistribute the freed-up mass to new sequences in the test data in a fair manner
• Define: count*(x) = count(x) – d, where d is a fixed discount (say 0.5)
Discounting Methods
• Calculate the new probabilities using Count* for each word w
• As we have subtracted some counts from the initial values, we are left with a missing probability mass for each context wi-1:
α(wi-1) = 1 − Σw count*(wi-1, w) / count(wi-1)
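A minimal sketch of discounted bigram probabilities and the left-over mass α (assuming d = 0.5 and that Count(wi-1) is taken as the number of times wi-1 starts a bigram; names are illustrative):

```python
def discounted_bigram_probs(bi_counts, context_counts, d=0.5):
    """Discounted estimates q_D(w | w_prev) = (Count(w_prev, w) - d) / Count(w_prev),
    plus alpha(w_prev), the probability mass freed up for unseen continuations.

    bi_counts: dict (w_prev, w) -> count
    context_counts: dict w_prev -> Count(w_prev)
    """
    q_d = {(w_prev, w): (c - d) / context_counts[w_prev]
           for (w_prev, w), c in bi_counts.items()}
    alpha = {}
    for w_prev in context_counts:
        seen_mass = sum(p for (u, _), p in q_d.items() if u == w_prev)
        alpha[w_prev] = 1.0 - seen_mass        # mass reserved for unseen (w_prev, w) pairs
    return q_d, alpha
```

For the table above, each of the once-seen bigrams like "the toddlers" loses 0.5 of its single count, and the sum of all those small subtractions becomes α("the"), available for bigrams such as "the X" that never occurred in training.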
Katz Back off
• For bigram and trigram back-off models based on the discounting algorithm discussed in this class, please refer to Prof Michael Collins' Coursera notes on Language Modeling
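For completeness, here is a minimal sketch of a Katz-style back-off bigram estimate built from the discounted probabilities and α above (a simplified illustration under the stated assumptions, not Prof Collins' exact formulation):

```python
def katz_bigram_prob(w1, w2, q_discounted, alpha, uni_counts):
    """Back-off bigram estimate:
    - if (w1, w2) was seen, use the discounted probability q_D(w2 | w1);
    - otherwise, share alpha(w1) among the unseen continuations of w1,
      in proportion to their unigram probabilities."""
    if (w1, w2) in q_discounted:
        return q_discounted[(w1, w2)]
    total = sum(uni_counts.values())
    # unigram mass of words NOT seen after w1 (normalizer for the back-off part)
    unseen_mass = sum(c for w, c in uni_counts.items()
                      if (w1, w) not in q_discounted) / total
    if unseen_mass == 0.0:
        return 0.0
    return alpha.get(w1, 1.0) * (uni_counts.get(w2, 0) / total) / unseen_mass
```

By construction the seen and unseen parts together sum to 1 for each context w1, so the back-off model remains a valid probability distribution.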