AN INTRODUCTION TO TOPIC MODELING
Turning text into insight
  
Handling Raw, Unlabeled Text

• Common Datasets:
  - Product/Customer Reviews
  - Call Center Transcripts
  - Newspaper Articles
  - Legal Documents
• Common Tasks:
  - Find documents we're interested in?
  - Categorize documents?
  - Retrieve information?
  
• The Challenge:
  - Normal quantitative approaches don’t work with text.
  - Datasets are large, complicated, sparse, and unwieldy.
  - Data is often unlabeled.
  
Example: Understanding Customer Reviews

• Mon Ami Gabi is a restaurant in the Paris Las Vegas Hotel and Casino.
• The restaurant has received thousands of customer reviews over the last 8 years.

What are customers saying?

  "Excellent breakfast menu. They just need to hire more staff to have a better service."
  "Great place for brunch!"
  "Highly recommend the steak and fries and sitting outside."
  "Had a great meal with a great atmosphere"
  "Food was ok… What it has going for it is the view from the outside terrace."
  
Topic Modeling: Framework

Documents → Topics → Words and Phrases

• Document: "Excellent breakfast menu. They just need to hire more staff to have a better service"
• Topics: Breakfast, Quality of Service
• Words and phrases: breakfast, better, service, staff
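
One way to read this framework, assuming a probabilistic topic model (the slides do not name a specific model): each document is a mixture of topics, and each topic is a distribution over words. The symbols w (word), d (document), z (topic) and K (number of topics) are introduced here for illustration only.

```latex
p(w \mid d) = \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)
```

Under this reading, the "Documents → Topics" step corresponds to p(z = k | d) and the "Topics → Words and Phrases" step corresponds to p(w | z = k).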
  
Topic Modeling: Preprocessing

• Tokenize: Extract meaningful units from sentences
  - I ordered a french toast → {I, ordered, a, french toast}
  - Regular expression cleanup, end-of-line hyphenation, contraction, and sentence-initial capitalization rules.
• Stemming Algorithm: Consolidate feature space into word stems or lemmas
  - {I, ordered, a, french toast} → {I, order, a, french toast}
  - Suffix stripping, part of speech tagging
• Matrix Factorization: Convert text into data structure for learning algorithms.
  - Word-document matrices often have 1,000,000,000,000+ values. Need special compression algorithms to make data manageable.
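
The three preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for the slides: the phrase list, the three suffix rules, and the function names are all assumptions, and lowercasing everything is a crude stand-in for the capitalization rules mentioned above. The "matrix" is stored sparsely, keeping only nonzero counts per document, which is one simple way to keep a very large word-document matrix manageable.

```python
import re
from collections import Counter

# Hypothetical phrase list: multiword units to keep as single tokens.
PHRASES = ("french toast",)

# Deliberately naive suffix rules, tried in order (an assumption,
# far simpler than a real stemmer such as Porter's).
SUFFIXES = ("ing", "ed", "s")


def tokenize(text, phrases=PHRASES):
    """Lowercase the text, then split it into word and phrase tokens."""
    text = text.lower()
    # Join each known phrase with an underscore so it survives the split.
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return [tok.replace("_", " ") for tok in re.findall(r"[a-z_']+", text)]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    if " " in token:  # leave multiword phrases untouched
        return token
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def term_counts(documents):
    """Return one sparse {stem: count} mapping per document."""
    return [Counter(stem(tok) for tok in tokenize(doc)) for doc in documents]


docs = ["I ordered a french toast", "The french toast was great"]
tokens = tokenize(docs[0])            # ['i', 'ordered', 'a', 'french toast']
stems = [stem(tok) for tok in tokens]  # ['i', 'order', 'a', 'french toast']
matrix = term_counts(docs)
```

Each entry of `matrix` holds only the terms that occur in that document; every other cell of the full word-document matrix is an implicit zero.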
  
Topic Modeling: Estimation with Gibbs Sampler

• Use Markov Chain Monte Carlo methods to simulate our document-topic and topic-word probability distributions.
• Results:

Topic-Word

  Breakfast          Service
  Breakfast: 0.31    Service: 0.28
  Eggs: 0.27         Staff: 0.24
  Coffee: 0.24       Friendly: 0.21

Document-Topic

  "The french toast was great"    "The staff was great, but the outdoor patio was a bit noisy."
  French Toast: 0.71              Service: 0.51
  Breakfast: 0.25                 Environment: 0.44
  Service: 0.03                   Breakfast: 0.02
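
The slides do not say which topic model the sampler estimates. Assuming latent Dirichlet allocation (LDA), a common choice, a collapsed Gibbs sampler might look like the sketch below. The function name `gibbs_lda`, the hyperparameters `alpha` and `beta`, and the toy corpus are illustrative assumptions, and the numbers it produces are not the results shown above.

```python
import random


def gibbs_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assign = []                                            # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assign[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Weight of each topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed point estimates taken from the final counts.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + n_words * beta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return vocab, theta, phi


# Hypothetical, already-preprocessed toy corpus.
corpus = [
    ["breakfast", "eggs", "coffee", "breakfast"],
    ["service", "staff", "friendly", "staff"],
    ["eggs", "coffee", "breakfast"],
    ["staff", "service", "friendly"],
]
vocab, theta, phi = gibbs_lda(corpus, n_topics=2)
```

Here `theta[d]` plays the role of the document-topic table and `phi[k]` the topic-word table; each row sums to 1. Topics come out as unlabeled indices, so names such as "Breakfast" or "Service" have to be assigned by a person reading the top words.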
  
Harnessing the Model: Topic Frequency

What are my customers talking about?
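
One simple way to answer this question is to average each topic's weight over all documents. The sketch below does so for the two example reviews from the document-topic results; the function name `topic_frequency` is hypothetical, and treating unlisted topics as weight 0 is an assumption.

```python
from collections import defaultdict

# Document-topic weights copied from the two example reviews;
# topics not listed for a review are treated as 0 (an assumption).
doc_topics = [
    {"French Toast": 0.71, "Breakfast": 0.25, "Service": 0.03},
    {"Service": 0.51, "Environment": 0.44, "Breakfast": 0.02},
]


def topic_frequency(doc_topics):
    """Average each topic's weight over all documents."""
    totals = defaultdict(float)
    for weights in doc_topics:
        for topic, weight in weights.items():
            totals[topic] += weight
    n_docs = len(doc_topics)
    return {topic: total / n_docs for topic, total in totals.items()}


freq = topic_frequency(doc_topics)
# Topics ordered from most to least discussed.
ranked = sorted(freq, key=freq.get, reverse=True)
```

With only two reviews the ranking means little; over thousands of reviews the same average gives the share of customer attention each topic receives.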
  
Harnessing the Model: Evaluate Products and Verticals

How do customers feel about my products?
  
Harnessing the Model: Temporal Insights

How has customer sentiment evolved among my product lines over time?
  
Harnessing the Model: Deep Product Insights

Which properties of French Toast drive satisfaction (or dissatisfaction)?
  
Thank you.
  
