Turning Text Into Insights: An Introduction to Topic Models

Topic models are a family of unsupervised learning algorithms that can help translate raw, unlabeled text into actionable insights. This presentation provides an overview of topic models before discussing concrete examples.

1. Turning Text into Insight: An Introduction to Topic Modeling

2. Handling Raw, Unlabeled Text
- Common datasets:
  - Product and customer reviews
  - Call center transcripts
  - Newspaper articles
  - Legal documents
- Common tasks:
  - Find the documents we're interested in
  - Categorize documents
  - Retrieve information

3. Handling Raw, Unlabeled Text
- Common datasets:
  - Product and customer reviews
  - Call center transcripts
  - Newspaper articles
  - Legal documents
- Common tasks:
  - Find the documents we're interested in
  - Categorize documents
  - Retrieve information
- The challenge:
  - Normal quantitative approaches don't work with text.
  - Datasets are large, complicated, sparse, and unwieldy.
  - Data is often unlabeled.

4. Example: Understanding Customer Reviews
- Mon Ami Gabi is a restaurant in the Paris Las Vegas Hotel and Casino.
- The restaurant has received thousands of customer reviews over the last 8 years.
- What are customers saying?
  - "Excellent breakfast menu. They just need to hire more staff to have a better service."
  - "Great place for brunch! Highly recommend the steak and fries and sitting outside."
  - "Had a great meal with a great atmosphere"
  - "Food was ok… What it has going for it is the view from the outside terrace."

5. Topic Modeling: Framework
- Document: "Excellent breakfast menu. They just need to hire more staff to have a better service"
- Topics: Breakfast, Quality of Service
- Words and phrases: breakfast, better, service, staff

6. Topic Modeling: Preprocessing
- Tokenize: Extract meaningful units from sentences.
  - Example: "I ordered a french toast" becomes {I, ordered, a, french toast}
  - Regular expression cleanup, end-of-line hyphenation, contraction, and sentence-initial capitalization rules.
- Stemming Algorithm: Consolidate feature space into word stems or lemmas.
  - Example: {I, ordered, a, french toast} becomes {I, order, a, french toast}
  - Suffix stripping, part-of-speech tagging.
- Matrix Factorization: Convert text into a data structure for learning algorithms.
  - Word-document matrices often have 1,000,000,000,000+ values. Special compression algorithms are needed to make the data manageable.
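The three preprocessing steps can be sketched in plain Python. This is a minimal illustration, not the pipeline used for the slides: the phrase list, the three-suffix stemmer, and the function names are all assumptions, and unlike the slide example the sketch lowercases every token. The "compression" shown here is simply storing only the nonzero cells of the word-document matrix.

```python
import re
from collections import Counter

# Hypothetical phrase list: multiword units to keep as single tokens.
PHRASES = ("french toast",)
# Deliberately tiny suffix list; real stemmers (e.g. Porter) use many more rules.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text, phrases=PHRASES):
    """Lowercase, keep known phrases as one token, then split into word tokens."""
    text = text.lower()
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return [tok.replace("_", " ") for tok in re.findall(r"[a-z_']+", text)]


def stem(token):
    """Strip the first matching suffix, leaving at least three characters."""
    if " " in token:  # leave multiword phrases untouched
        return token
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def sparse_counts(documents):
    """Map (doc_index, term) -> count, storing only the nonzero cells."""
    cells = Counter()
    for doc_index, text in enumerate(documents):
        for token in tokenize(text):
            cells[(doc_index, stem(token))] += 1
    return cells


tokens = tokenize("I ordered a french toast")
stems = [stem(t) for t in tokens]
matrix = sparse_counts(["I ordered a french toast", "The french toast was great"])
```

Here `tokens` is `['i', 'ordered', 'a', 'french toast']` and `stems` is `['i', 'order', 'a', 'french toast']`. The two toy documents share 7 distinct terms, so a dense matrix would hold 14 cells, while `matrix` stores only the 8 that are nonzero. On real review corpora the gap is far larger, which is why sparse storage matters.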

7. Topic Modeling: Estimation with Gibbs Sampler
- Use Markov chain Monte Carlo methods to simulate our document-topic and topic-word probability distributions.
- Results:

Topic-word probabilities

  Breakfast            Service
  Breakfast: 0.31      Service: 0.28
  Eggs: 0.27           Staff: 0.24
  Coffee: 0.24         Friendly: 0.21

Document-topic probabilities

  "The french toast was great"
    French Toast: 0.71
    Breakfast: 0.25
    Service: 0.03

  "The staff was great, but the outdoor patio was a bit noisy."
    Service: 0.51
    Environment: 0.44
    Breakfast: 0.02
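A minimal sketch of how a Gibbs sampler can produce both distributions, assuming the underlying model is latent Dirichlet allocation (LDA), which the slides do not name. The function name `gibbs_lda`, the hyperparameters `alpha` and `beta`, and the toy documents are illustrative assumptions. The sketch uses only the final sample, where a production sampler would average over several samples after burn-in.

```python
import random


def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    vocab_size = len(vocab)

    # Count tables that are updated as topic assignments change.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [{word: 0 for word in vocab} for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        row = []
        for word in doc:
            k = rng.randrange(num_topics)
            row.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(row)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1

                # Unnormalized conditional probability of each topic
                # given the assignments of all other tokens.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1

    # Smoothed point estimates from the final sample.
    theta = [
        [(count + alpha) / (len(doc) + alpha * num_topics) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        {w: (topic_word[t][w] + beta) / (topic_total[t] + beta * vocab_size) for w in vocab}
        for t in range(num_topics)
    ]
    return theta, phi


# Toy tokenized reviews (hypothetical, built from the words on this slide).
docs = [
    ["breakfast", "eggs", "coffee", "breakfast"],
    ["service", "staff", "friendly", "service"],
    ["eggs", "coffee", "staff", "friendly"],
]
theta, phi = gibbs_lda(docs, num_topics=2)
```

`theta[d]` plays the role of the document-topic table and `phi[t]` the topic-word table. The sampler assigns topics as anonymous indices; labels such as "Breakfast" or "Service" are chosen by a person reading the highest-probability words in each `phi[t]`.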

8. Harnessing the Model: Topic Frequency
- What are my customers talking about?
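One way to answer this question from a fitted model is to aggregate the document-topic weights across reviews. The sketch below reuses the two example reviews from slide 7; the function names and the choice of two summary statistics (dominant-topic counts and mean weight) are assumptions, not something the slides prescribe.

```python
from collections import Counter, defaultdict

# Document-topic weights copied from the two example reviews on slide 7.
doc_topics = [
    {"French Toast": 0.71, "Breakfast": 0.25, "Service": 0.03},
    {"Service": 0.51, "Environment": 0.44, "Breakfast": 0.02},
]


def dominant_topic_counts(doc_topics):
    """Count how many documents have each topic as their highest-weight topic."""
    return Counter(max(weights, key=weights.get) for weights in doc_topics)


def mean_topic_weight(doc_topics):
    """Average each topic's weight over all documents (absent topics count as 0)."""
    totals = defaultdict(float)
    for weights in doc_topics:
        for topic, weight in weights.items():
            totals[topic] += weight
    return {topic: total / len(doc_topics) for topic, total in totals.items()}


counts = dominant_topic_counts(doc_topics)
means = mean_topic_weight(doc_topics)
```

Dominant-topic counts are easy to chart but discard secondary topics; the mean weight keeps them, so "Service" scores well here even though it dominates only one of the two reviews.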

9. Harnessing the Model: Evaluate Products and Verticals
- How do customers feel about my products?

10. Harnessing the Model: Temporal Insights
- How has customer sentiment evolved among my product lines over time?
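A rough sketch of one way to track this, assuming each review carries a date and a star rating that can stand in for sentiment. The slides do not say how sentiment was measured, and the records below are hypothetical values invented for illustration. Each review's rating is credited to its topics in proportion to the document-topic weights.

```python
from collections import defaultdict

# Hypothetical records: (review year, star rating, document-topic weights).
reviews = [
    (2012, 5, {"Breakfast": 0.8, "Service": 0.2}),
    (2012, 2, {"Breakfast": 0.1, "Service": 0.9}),
    (2013, 4, {"Breakfast": 0.5, "Service": 0.5}),
]


def rating_by_year_and_topic(reviews):
    """Topic-weighted average star rating for every (year, topic) pair."""
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for year, stars, weights in reviews:
        for topic, weight in weights.items():
            weighted_sum[(year, topic)] += weight * stars
            weight_total[(year, topic)] += weight
    # Assumes every listed weight is positive, so no total is zero.
    return {key: weighted_sum[key] / weight_total[key] for key in weighted_sum}


trend = rating_by_year_and_topic(reviews)
```

Reading `trend` along the year axis for a single topic gives that topic's sentiment trajectory; dropping the year from the key gives the per-topic view asked for on the previous slide.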

11. Harnessing the Model: Deep Product Insights
- Which properties of French Toast drive satisfaction (or dissatisfaction)?

12. Thank you.