Hidden Topic Markov Models - Gruber et al
Alex Klibisz, alex.klibisz.com, UTK STAT645
September 29, 2016
Hidden Markov Models Overview
A system of tokens, assumed to follow a Markov process.
The state sequence is “hidden”, but the tokens (which depend on
the state) are visible.
Each state has a probability distribution over the possible
tokens.
Sequence of tokens generated by an HMM gives some
information about the sequence of states.
Hidden Markov Models Example
Task: Given a sentence, determine the most likely sequence for its
parts of speech.1
Some information is known based on prior data
States are parts of speech, tokens are the words.
p(s′|s) - probability of transitioning from one state (part of
speech) to another. Example: p(noun -> verb) = 0.9
p(word|s) - probability of token (word) given a state. Example:
p("Blue"|noun) = 0.4
Traverse the words to compute probability of each sequence.
Sentence: The blue bank closed.
p(det, adj, noun, verb) =
p("the" | det) · p(adj | det) · p("blue" | adj) · ...
1
Source: https://www.youtube.com/watch?v=7glSTzgjwuU
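The traversal above can be sketched as a running product of transition and emission probabilities. The two probabilities from the slide are kept; every other number in the tables below is a made-up placeholder for illustration.

```python
# Hypothetical HMM tables for the POS example. Only p(noun -> verb) = 0.9
# and the 0.4 emission come from the slide; the rest are invented.
trans = {
    ("det", "adj"): 0.4, ("adj", "noun"): 0.7, ("noun", "verb"): 0.9,
}
emit = {
    ("the", "det"): 0.6, ("blue", "adj"): 0.4,
    ("bank", "noun"): 0.3, ("closed", "verb"): 0.5,
}

def sequence_prob(words, states, start_prob=1.0):
    """Probability of one (words, states) pairing under the HMM:
    the product of emission probs and state-transition probs."""
    p = start_prob * emit[(words[0], states[0])]
    for i in range(1, len(words)):
        p *= trans[(states[i - 1], states[i])] * emit[(words[i], states[i])]
    return p

p = sequence_prob(["the", "blue", "bank", "closed"],
                  ["det", "adj", "noun", "verb"])
```

In practice one would score every candidate state sequence this way (or use the Viterbi algorithm) and keep the most probable one.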
Hidden Topic Markov Models Introduction
Topics in a document are hidden and should be extracted.
Bag of words is an unrealistic oversimplification.
Topics should only transition at the beginning of a new
sentence.
Each document has a θd vector, representing its topic
distribution.
Topics transition based on a binary transition variable
ψn ∈ {0, 1} for every word w1...wNd in a document.
HTMM vs. LDA Visually
HTMM (top) segments by sentence; LDA (bottom) assigns topics
word by word.
HTMM Definition
HTMM Definition (Annotated)
For every latent topic z = 1...K, draw a βz ∼ Dirichlet(η)
Generate each document d = 1...D as follows:
Draw a topic distribution θ ∼ Dirichlet(α)
Word 1 starts a new topic, ψ1 = 1
For every word n = 2...Nd :
If it’s the first word in a sentence, draw ψn ∼ Binom(ε),
otherwise no topic transition: ψn = 0
For every word n = 1...Nd :
If ψn = 0, the topic doesn’t change: zn = zn−1.
Else draw a new zn ∼ multinomial(θ)
Draw wn ∼ multinomial(βzn )
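The generative process above can be sketched as follows. This is a minimal sketch, assuming numpy and made-up hyperparameter values; `eps` plays the role of the Binom parameter ε.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(sentence_lengths, K, V, beta, alpha=0.5, eps=0.2):
    """Sketch of the HTMM generative process for one document.
    beta: (K, V) per-topic word distributions; eps: probability of a
    topic switch at a sentence start. Hyperparameter values are assumptions."""
    theta = rng.dirichlet([alpha] * K)      # document topic distribution
    words, topics = [], []
    z = None
    for length in sentence_lengths:
        for n in range(length):
            if z is None:                   # word 1: always a new topic
                psi = 1
            elif n == 0:                    # sentence start: maybe switch
                psi = rng.binomial(1, eps)
            else:                           # mid-sentence: never switch
                psi = 0
            if psi == 1:
                z = rng.choice(K, p=theta)  # draw a new topic
            topics.append(z)
            words.append(rng.choice(V, p=beta[z]))  # emit a word
    return words, topics
```

By construction, the topic is constant within each sentence and can only change at sentence boundaries.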
Parameter Approximation
Parameter Approximation (cont.)
Use the Expectation-Maximization (EM) algorithm
EM for HMMs distinguishes between latent variables (topics zn,
transition variables ψn) and parameters.
Estimation step uses Forward-Backward Algorithm
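The forward-backward pass used in the E-step can be sketched for a generic HMM as below (posterior state marginals with scaling). Note this is the textbook version; the paper exploits the special structure of the ψn variables, which this generic sketch ignores.

```python
import numpy as np

def forward_backward(pi, A, B_obs):
    """Posterior state marginals gamma[t, k] = p(state_t = k | observations).
    pi: (K,) initial distribution; A: (K, K) transition matrix;
    B_obs: (T, K) emission likelihoods b_k(o_t) for each time step."""
    T, K = B_obs.shape
    alpha = np.zeros((T, K))
    scale = np.zeros(T)
    # forward pass with per-step normalization to avoid underflow
    alpha[0] = pi * B_obs[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B_obs[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    # backward pass reusing the forward scaling factors
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B_obs[t + 1] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

The M-step then re-estimates the parameters from these expected counts, and the two steps alternate until convergence.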
Unknown Parameters
θd - topic distribution for each document
β - used for multinomial word distributions
ε - used for binomial topic transition variables
Known Parameters
Based on prior research
α = 1 + 50/K - used for drawing θ ∼ Dirichlet(α)
η = 1.01 - used for drawing βz ∼ Dirichlet(η)
Experiment: NIPS Dataset
Data
1740 documents, 1557 training, 183 testing.
12113 words in vocabulary.
Extract vocabulary words, preserving order.
Split sentences on punctuation . ? ! ;
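The preprocessing above can be sketched with a regular expression; the whitespace tokenization is an assumption, not the paper's exact pipeline.

```python
import re

def split_sentences(text):
    """Split text into sentences on the punctuation . ? ! ; and
    tokenize each sentence on whitespace (tokenization is assumed)."""
    parts = re.split(r"[.?!;]", text)
    return [p.split() for p in parts if p.strip()]

sents = split_sentences("The model works. Does it scale? Yes!")
```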
Metric
Perplexity for HTMM vs. LDA vs. VHTMM1 2
Perplexity reflects the difficulty of predicting a new unseen
document after learning from a training set, lower is better.
2
VHTMM1 uses constant ψn = 1, so every sentence starts a new topic.
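Perplexity as described above can be sketched as the exponential of the negative per-word test log-likelihood:

```python
import numpy as np

def perplexity(log_likelihoods, word_counts):
    """Perplexity over a test set: exp of the negative total
    log-likelihood divided by the total number of words.
    log_likelihoods: per-document log p(w_d); word_counts: per-document N_d."""
    return np.exp(-np.sum(log_likelihoods) / np.sum(word_counts))
```

As a sanity check, a uniform model over a vocabulary of size V has perplexity exactly V; lower perplexity means the model finds new documents easier to predict.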
Figure 2: Perplexity as a function of observed words
HTMM is significantly better than LDA for N ≤ 64.
Average document length is 1300 words.
Figure 3: topical segmentation in HTMM
HTMM attributes “Support” to two different topics,
mathematical and acknowledgments.
Figure 5: topical segmentation in LDA
LDA attributes “Support” to only one topic.
Figure 6: Perplexity as a function of K
Each topic limited to N = 10 words.
Figure 7: ε as a function of K
Fewer topics → lower ε → infrequent transitions.
More topics → higher ε → frequent transitions.
Table 1: Lower perplexity due to degrees of freedom?
Rule out the possibility that HTMM’s perplexity is lower than
LDA’s only because HTMM has fewer degrees of freedom (due to the
dependencies between latent topics).
Generate and train on two datasets, D = 1000, V = 200,
K = 5
Dataset 1 generated with HTMM with ε = 0.1 (likely to
transition topics).
Dataset 2 generated with LDA “bag of words”.
HTMM still learns the correct parameters and outperforms on
ordered data, but does not outperform on “bag of words” data.
Careful: maybe bag of words was just a bad assumption for the
NIPS dataset.
Conclusion
Authors’ Conclusion
HTMM should be considered an extension of LDA.
The Markovian structure learns more coherent topics and
disambiguates the topics of ambiguous words.
Efficient learning and inference algorithms for HTMM already
exist.
Questions
Why only one dataset?
How does it perform on unstructured speech? For example:
run-on sentences or transcripts of live speech, debates, etc.
Why is ψn necessary for every word? Could you consider only the
first word of each sentence?
Thoughts for Project
Maybe shopping trips follow a Markov Model
A shopper who purchases a crib on one store visit then diapers
on the next visit might be having a baby soon.
These may be purchased with many other unrelated items, but
there is still some meaning to them.
Extracting the crib, diapers, etc. sequence could help
determine this meaning, whereas treating each shopping trip
independently might ignore it.

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Hidden Topic Markov Models for Document Segmentation and Topic Extraction

  • 1. Hidden Topic Markov Models - Gruber et al.
    Alex Klibisz, alex.klibisz.com, UTK STAT645
    September 29, 2016
  • 2. Hidden Markov Models Overview
    A system of tokens assumed to follow a Markov process.
    The state sequence is “hidden”, but the tokens (which depend on the states) are visible.
    Each state has a probability distribution over the possible tokens.
    The sequence of tokens generated by an HMM therefore gives some information about the sequence of states.
  • 3. Hidden Markov Models Example
    Task: given a sentence, determine the most likely sequence for its parts of speech.1
    Some information is known from prior data:
    States are the parts of speech; tokens are the words.
    p(s′|s) - probability of transitioning from one state (part of speech) to another. Example: p(noun -> verb) = 0.9
    p(word|s) - probability of a token (word) given a state. Example: p("Blue"|noun) = 0.4
    Traverse the words to compute the probability of each sequence.
    Sentence: The blue bank closed.
    p(det, adj, noun, verb) = p("the" | det) · p(adj | det) · p("blue" | adj) · ...
    1 Source: https://www.youtube.com/watch?v=7glSTzgjwuU
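The chain-rule product on this slide can be sketched in a few lines of Python. The transition and emission probabilities below are illustrative assumptions made up for the example sentence, not values from real data.

```python
# Hypothetical transition and emission tables for "The blue bank closed."
# (values are illustrative assumptions, not estimated from a corpus).
trans = {("start", "det"): 0.5, ("det", "adj"): 0.4,
         ("adj", "noun"): 0.7, ("noun", "verb"): 0.9}
emit = {("det", "the"): 0.6, ("adj", "blue"): 0.4,
        ("noun", "bank"): 0.3, ("verb", "closed"): 0.5}

def sequence_probability(states, words):
    """p(states, words) = product over n of p(s_n | s_{n-1}) * p(w_n | s_n)."""
    p = 1.0
    prev = "start"
    for s, w in zip(states, words):
        p *= trans[(prev, s)] * emit[(s, w)]
        prev = s
    return p

p = sequence_probability(["det", "adj", "noun", "verb"],
                         ["the", "blue", "bank", "closed"])
```

In practice one would compute this for every candidate tag sequence (or use the Viterbi algorithm) and keep the most probable one.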
  • 4. Hidden Topic Markov Models Introduction
    Topics in a document are hidden and should be extracted.
    Bag of words is an unrealistic oversimplification.
    Topics should only transition at the beginning of a new sentence.
    Each document has a θd vector representing its topic distribution.
    Topics transition based on a binomial transition variable ψn ∈ {0, 1} for every word w1...wNd in a document.
  • 5. HTMM vs. LDA Visually
    HTMM (top) segments by sentence; LDA (bottom) segments only individual words.
  • 7. HTMM Definition (Annotated)
    For every latent topic z = 1...K, draw βz ∼ Dirichlet(η).
    Generate each document d = 1...D as follows:
    Draw a topic distribution θ ∼ Dirichlet(α).
    Word 1 starts a new topic: ψ1 = 1.
    For every word n = 2...Nd: if it is the first word in a sentence, draw ψn ∼ Binom(ε); otherwise there is no topic transition, ψn = 0.
    For every word n = 1...Nd: if ψn = 0 the topic does not change, zn = zn−1; else draw a new zn ∼ multinomial(θ).
    Draw wn ∼ multinomial(βzn).
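The generative process on this slide can be sketched with NumPy. This is an illustrative sketch under stated assumptions (a `beta` matrix of per-topic word distributions and a list of sentence lengths as inputs), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(beta, alpha, epsilon, sentence_lengths):
    """Sketch of the HTMM generative process for one document.
    beta: (K, V) per-topic word distributions; alpha, epsilon as on the slide.
    sentence_lengths: number of words in each sentence of the document."""
    K, V = beta.shape
    theta = rng.dirichlet([alpha] * K)  # document topic distribution
    words, topics = [], []
    z = None
    for i, length in enumerate(sentence_lengths):
        for n in range(length):
            if z is None:
                psi = 1  # word 1 of the document always starts a new topic
            elif n == 0:
                psi = rng.binomial(1, epsilon)  # transition allowed at sentence start
            else:
                psi = 0  # never mid-sentence
            if psi == 1:
                z = rng.choice(K, p=theta)  # draw a new topic from theta
            topics.append(z)
            words.append(rng.choice(V, p=beta[z]))  # emit a word from topic z
    return words, topics
```

Note how the ψn mechanism guarantees that every word within a sentence shares the same topic, which is exactly the constraint that separates HTMM from LDA's bag of words.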
  • 9. Parameter Approximation (cont.)
    Use the Expectation-Maximization (EM) algorithm.
    EM for HMMs distinguishes between latent variables (topics zn, transition variables ψn) and parameters.
    The expectation step uses the Forward-Backward algorithm.
    Unknown parameters:
    θd - topic distribution for each document
    β - used for the multinomial word distributions
    ε - used for the binomial topic transition variables
    Known parameters (based on prior research):
    α = 1 + 50/K - used for drawing θ ∼ Dirichlet(α)
    η = 1.01 - used for drawing βz ∼ Dirichlet(η)
  • 10. Experiment: NIPS Dataset
    Data: 1740 documents; 1557 training, 183 testing. 12113 words in the vocabulary.
    Extract vocabulary words, preserving order.
    Split sentences on punctuation: . ? ! ;
    Metric: perplexity for HTMM vs. LDA vs. VHTMM1.2
    Perplexity reflects the difficulty of predicting a new, unseen document after learning from a training set; lower is better.
    2 VHTMM1 uses a constant ψn = 1, so every sentence starts a new topic.
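Perplexity is the exponential of the negative average per-word log-likelihood. A minimal sketch, with the uniform-model check included as a sanity test (a model assigning every word probability 1/V has perplexity exactly V):

```python
import math

def perplexity(log_likelihoods, num_words):
    """Perplexity = exp(- total log-likelihood / total word count).
    log_likelihoods: per-document log p(words_d | model); num_words: total words."""
    return math.exp(-sum(log_likelihoods) / num_words)

# Sanity check: a uniform model over a vocabulary of V = 12113 words
# (the NIPS vocabulary size above) assigns each word probability 1/V,
# so its perplexity equals V.
V, n = 12113, 100
uniform_ll = [n * math.log(1.0 / V)]
```

Lower perplexity means the model spreads less probability mass over implausible continuations of the held-out text.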
  • 11. Figure 2: Perplexity as a function of observed words
    HTMM is significantly better than LDA for N ≤ 64.
    Average document length is 1300 words.
  • 12. Figure 3: topical segmentation in HTMM
    HTMM attributes “Support” to two different topics: mathematical and acknowledgments.
  • 13. Figure 5: topical segmentation in LDA
    LDA attributes “Support” to only one topic.
  • 14. Figure 6: Perplexity as a function of K
    Each topic limited to N = 10 words.
  • 15. Figure 7: ε as a function of K
    Fewer topics → lower ε → infrequent transitions.
    More topics → higher ε → frequent transitions.
  • 16. Table 1: Lower perplexity due to degrees of freedom?
    Rule out the possibility that HTMM’s perplexity is lower than LDA’s only because it has fewer degrees of freedom (due to the dependencies between latent topics).
    Generate and train on two datasets: D = 1000, V = 200, K = 5.
    Dataset 1 generated with HTMM with ε = 0.1 (likely to transition topics).
    Dataset 2 generated with LDA “bag of words”.
    HTMM still learns the correct parameters and outperforms on ordered data; it does not outperform on “bag of words” data.
    Careful: maybe bag of words was just a bad assumption for the NIPS dataset.
  • 17. Conclusion
    Authors’ conclusion:
    HTMM should be considered an extension of LDA.
    The Markovian structure can learn more coherent topics and disambiguate the topics of ambiguous words.
    Efficient learning and inference algorithms for HTMM already exist.
    Questions:
    Why only one dataset?
    How does it perform on unstructured speech? For example: run-on sentences or transcripts of live speech, debates, etc.
    Why is ψn necessary for every word? Could you consider only the first word of each sentence?
  • 18. Thoughts for Project
    Maybe shopping trips follow a Markov model.
    A shopper who purchases a crib on one store visit and diapers on the next might be having a baby soon.
    These may be purchased alongside many unrelated items, but there is still some meaning to them.
    Extracting the crib, diapers, etc. sequence could help determine this meaning, whereas treating each shopping trip independently might ignore it.