Topic models are a family of unsupervised learning algorithms that can help translate raw, unlabeled text into actionable insights. This presentation provides an overview of topic models before discussing concrete examples.
2. Handling Raw, Unlabeled Text
- Common Datasets:
  - Product/Customer Reviews
  - Call Center Transcripts
  - Newspaper Articles
  - Legal Documents
- Common Tasks:
  - Find documents we're interested in?
  - Categorize documents?
  - Retrieve information?
3. Handling Raw, Unlabeled Text
- The Challenge
  - Normal quantitative approaches don’t work with text.
  - Datasets are large, complicated, sparse, and unwieldy.
  - Data is often unlabeled.
4. Example: Understanding Customer Reviews
- Mon Ami Gabi is a restaurant in the Paris Las Vegas Hotel and Casino.
- Thousands of customer reviews for the restaurant over the last 8 years. What are customers saying?
  - "Excellent breakfast menu. They just need to hire more staff to have a better service."
  - "Great place for brunch! Highly recommend the steak and fries and sitting outside."
  - "Had a great meal with a great atmosphere"
  - "Food was ok… What it has going for it is the view from the outside terrace."
5. Topic Modeling: Framework
- Documents: "Excellent breakfast menu. They just need to hire more staff to have a better service"
- Topics: Breakfast; Quality of Service
- Words and Phrases: breakfast; better service; staff
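The documents, topics, and words hierarchy above can be written as a single formula. This is a sketch that assumes a mixture-style topic model such as LDA, which the slides do not name explicitly; the symbols d (document), w (word), z (topic), and K (number of topics) are introduced here for illustration only.

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)
```

Under that assumption, the first factor corresponds to a topic-word distribution and the second to a document-topic distribution, so a review that mentions "breakfast" and "staff" is explained as a blend of a Breakfast topic and a Quality of Service topic.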
6. Topic Modeling: Preprocessing
- Tokenize: Extract meaningful units from sentences
  - Example: I ordered a french toast → {I, ordered, a, french toast}
  - Regular expression cleanup, end-of-line hyphenation, contraction, and sentence-initial capitalization rules.
- Stemming Algorithm: Consolidate feature space into word stems or lemmas
  - Example: {I, ordered, a, french toast} → {I, order, a, french toast}
  - Suffix stripping, part of speech tagging
- Matrix Factorization: Convert text into a data structure for learning algorithms.
  - Word-document matrices often have 1,000,000,000,000+ values. Need special compression algorithms to make data manageable.
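The three preprocessing steps can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the presentation: the `PHRASES` table, the three-suffix stemmer, and the dictionary-based sparse storage are all assumptions made for the example. In particular, the sparse dictionary stands in for the "special compression algorithms" the slide mentions, and a real stemmer (such as Porter's) applies many more rules.

```python
import re

# Hypothetical phrase table (an assumption for this sketch): two-word
# sequences that should be kept together as a single token.
PHRASES = {("french", "toast"): "french toast"}


def tokenize(text):
    """Split text into word tokens, merging known two-word phrases."""
    words = re.findall(r"[A-Za-z']+", text)
    tokens, i = [], 0
    while i < len(words):
        pair = tuple(w.lower() for w in words[i:i + 2])
        if pair in PHRASES:
            tokens.append(PHRASES[pair])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


def stem(token):
    """Crude suffix stripping; keeps at least three characters of the stem."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_sparse_matrix(docs):
    """Return (vocab, counts), where counts maps (doc_id, term_id) -> count.

    Only non-zero cells are stored, which is one simple way to keep a
    mostly empty word-document matrix manageable.
    """
    vocab, counts = {}, {}
    for d, text in enumerate(docs):
        for tok in tokenize(text):
            term = stem(tok.lower())
            t = vocab.setdefault(term, len(vocab))
            counts[(d, t)] = counts.get((d, t), 0) + 1
    return vocab, counts


tokens = tokenize("I ordered a french toast")
stems = [stem(t) for t in tokens]
vocab, counts = build_sparse_matrix(
    ["I ordered a french toast", "The french toast was great"]
)
```

Here `tokens` and `stems` reproduce the two sets shown on the slide, and `counts` holds one entry per word that actually occurs in a document rather than one per cell of the full matrix.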
7. Topic Modeling: Estimation with Gibbs Sampler
- Use Markov Chain Monte Carlo methods to simulate our document-topic and topic-word probability distributions.
- Results:
  - Topic-Word
    - Breakfast topic: Breakfast: 0.31, Eggs: 0.27, Coffee: 0.24
    - Service topic: Service: 0.28, Staff: 0.24, Friendly: 0.21
  - Document-Topic
    - "The french toast was great": French Toast: 0.71, Breakfast: 0.25, Service: 0.03
    - "The staff was great, but the outdoor patio was a bit noisy.": Service: 0.51, Environment: 0.44, Breakfast: 0.02
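To make the estimation step concrete, the following is a minimal sketch of a collapsed Gibbs sampler, assuming an LDA-style model with symmetric Dirichlet priors. The slides do not specify the model or its settings, so the function name `gibbs_lda`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions for illustration. The numbers it produces are not the results shown above.

```python
import random


def gibbs_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for an LDA-style topic model (sketch).

    docs is a list of token lists. Returns (theta, phi): theta[d][k] is the
    estimated document-topic probability, phi[k][w] the topic-word probability.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]        # tokens in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(n_topics)]   # times word v is assigned to topic k
    topic_total = [0] * n_topics                      # total tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + V * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed estimates from the final state of the chain.
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        {w: (topic_word[k][word_id[w]] + beta) / (topic_total[k] + V * beta)
         for w in vocab}
        for k in range(n_topics)
    ]
    return theta, phi


# Toy corpus (hypothetical, for illustration only).
toy_docs = [
    ["breakfast", "eggs", "coffee"],
    ["staff", "service", "friendly"],
    ["breakfast", "coffee", "staff"],
]
theta, phi = gibbs_lda(toy_docs, n_topics=2)
```

Each row of `theta` plays the role of a Document-Topic column in the results above, and each entry of `phi` plays the role of a Topic-Word column. Reading the estimates from the final state only is a simplification; in practice samples are usually averaged after a burn-in period.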