Mining text data for topics
Aka: Unsupervised clustering
mathieu.lacage@alcmeon.com
The objective
Input: a corpus of text documents
Output:
● List of topics (max 10 to 40)
● Human description for each topic
● Size of each topic
What this talk is about:
● Help you quickly get a rough idea of what the content is about
● No requirement that you master deep learning concepts or fancy maths
How does this work ?
Text documents → Tokenized documents → Vectorized documents → Document/topic mapping
How does this work ?
Text documents → Tokenized documents
Input: “hey, how are you?”
Output: [“hey”, “how”, “are”, “you”, “?”]
N documents
How does this work ?
Tokenized documents → Vectorized documents
Input: [“hey”, “how”, “are”, “you”, “?”]
Output: M-sized vector [0, 0, …, 0, 1, 0, …, 0, 1, 0, …]
N documents, M distinct tokens (dictionary size)
How does this work ?
Vectorized documents → Document/topic mapping
Input: N * M matrix: [[0, 0, …, 0, 1, 0, …, 0, 1, 0, …], …]
Output: N-sized vector: [2, 4, 2, 1, 8, …, 9, 3, 0]
N documents, M distinct tokens (dictionary size), K topics
The code
On GitHub: https://github.com/mathieu-lacage/sophiaconf2018
1. Collect a dataset: do-collect.py -k france
2. Tokenize text: do-tokenize.py --lang=fr
3. Calculate document frequencies: do-df.py --min-df=1
4. Generate document vectors: do-bag-of-words.py --model=boolean
5. Cluster vectors: do-kmeans.py -k 10
6. Visualize the clusters: do-summarize.py
Step 1: collect a dataset
do-collect.py -k france
“Sample” Twitter stream:
● 1% of all tweets which contain the word “france”
● Ran for a couple of hours on June 25th
Be careful:
● The Twitter app ids are hardcoded
● Generate your own app ids: https://apps.twitter.com/ !
Step 2: tokenize the text input
do-tokenize.py --lang=fr
Depends on the language
● “Easy” for English: spaces and hyphens are word boundaries.
● CJKV languages: no spaces (tough)
→ We focus on a “simple” language and use an open-source library (NLTK) to ignore the problem
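The talk relies on NLTK for this step. As a rough stand-in that needs only the standard library, a regular expression can split text into URLs, words and punctuation marks. This is a minimal sketch, not the NLTK tokenizer: `tokenize` and its pattern are hypothetical, and it keeps every punctuation mark as its own token.

```python
import re

# Hypothetical pattern (not the one used by NLTK or by do-tokenize.py):
# a URL, or a run of word characters, or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"https?://\S+|\w+|[^\w\s]")


def tokenize(text):
    """Split a text into a list of tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("hey, how are you?")
tweet_tokens = tokenize("voir https://t.co/abc !")
```

Because `\w` matches Unicode letters in Python 3, accented French words stay in one piece, but apostrophes split words such as "l'île" into three tokens, which a real tokenizer would handle more carefully.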
Step 3: calculate document frequencies
do-df.py --min-df=1
Number of documents which contain each token at least once
Eliminate all tokens which appear only once
Store number of documents as a special zero-length string token
[-1, "", 10842]
[0, "https://t.co/lzpNXIe2if", 1]
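A document-frequency table of this shape could be built as follows. This is a sketch under assumptions: the row layout `[token_id, token, df]` is inferred from the two sample rows above, `document_frequencies` is a hypothetical name, and `min_df` is taken here to mean "keep tokens with df >= min_df", which may not be how the real `--min-df` flag behaves.

```python
from collections import Counter


def document_frequencies(documents, min_df=1):
    """Return rows [token_id, token, df] for tokens with df >= min_df."""
    counts = Counter()
    for tokens in documents:
        counts.update(set(tokens))  # count each token once per document
    # Special row: the empty token stores the total number of documents.
    rows = [[-1, "", len(documents)]]
    kept = sorted(token for token, df in counts.items() if df >= min_df)
    for token_id, token in enumerate(kept):
        rows.append([token_id, token, counts[token]])
    return rows


documents = [
    ["allez", "la", "france"],
    ["la", "france", "gagne"],
    ["vive", "la", "france"],
]
rows = document_frequencies(documents, min_df=2)
```

With `min_df=2`, tokens seen in a single document ("allez", "gagne", "vive") are dropped, which shrinks the dictionary size M used by the later steps.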
Step 4: generate document vectors
do-bag-of-words.py --model=boolean
Models
● boolean: the simplest model: 1 if the token is present in the document, 0 otherwise
● tf-idf: more weight for tokens which appear rarely in the corpus
→ we start with the simplest option !
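The two models can be sketched in a few lines of plain Python. The function names are hypothetical, and the tf-idf weight used here (term count times log(N / df)) is only one common variant; the script in the repository may compute it differently.

```python
import math
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to a column index (sorted for determinism)."""
    distinct = sorted({token for tokens in documents for token in tokens})
    return {token: index for index, token in enumerate(distinct)}


def boolean_vector(tokens, vocabulary):
    """1 if the token is present in the document, 0 otherwise."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:
            vector[vocabulary[token]] = 1
    return vector


def tfidf_vector(tokens, vocabulary, df, n_documents):
    """Weight = count in document * log(N / document frequency)."""
    vector = [0.0] * len(vocabulary)
    for token, count in Counter(tokens).items():
        if token in vocabulary:
            vector[vocabulary[token]] = count * math.log(n_documents / df[token])
    return vector


documents = [["hey", "how", "are", "you", "?"], ["how", "are", "things"]]
vocabulary = build_vocabulary(documents)
df = Counter(token for tokens in documents for token in set(tokens))

boolean_matrix = [boolean_vector(tokens, vocabulary) for tokens in documents]
tfidf_row = tfidf_vector(documents[1], vocabulary, df, len(documents))
```

In the second document, "how" and "are" appear in every document and get a tf-idf weight of 0, while "things" appears only once in the corpus and gets a positive weight.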
Step 5: Cluster document vectors
do-kmeans.py -k=10
Search for 10 clusters:
● Complexity = O(n^(mk+1)) → hurts
● MiniBatch option is much faster but less stable numerically
● What you really want is to reduce M (curse of dimensionality)
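To make the clustering step concrete, here is a small pure-Python sketch of the usual iterative k-means heuristic (Lloyd's algorithm). It is not the code behind `do-kmeans.py`: the function names are hypothetical, and the deterministic farthest-first initialisation is a simplification chosen so the example is reproducible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Return one cluster label per vector."""
    # Farthest-first initialisation: start from the first vector, then
    # repeatedly add the vector farthest from the centroids chosen so far.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


vectors = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
labels = kmeans(vectors, k=2)
```

Each iteration costs on the order of N * M * K operations, which is why a large dictionary size M quickly becomes the bottleneck.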
Step 6: visualize the clusters
do-summarize.py
For each cluster, keep the tokens with the highest difference between:
● Frequency of the token in the cluster
● Frequency of the token in the corpus
→ Inspired by KL divergence
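The scoring rule above can be sketched as follows. Two assumptions are made here that the slide does not spell out: "frequency" is taken to be the fraction of documents containing the token, and ties are broken alphabetically. The names `document_frequency` and `summarize` are hypothetical.

```python
from collections import Counter


def document_frequency(documents):
    """Fraction of documents in which each token appears at least once."""
    counts = Counter()
    for tokens in documents:
        counts.update(set(tokens))
    return {token: count / len(documents) for token, count in counts.items()}


def summarize(documents, labels, top_n=2):
    """Return {cluster: (size, top tokens)} ranked by cluster freq - corpus freq."""
    corpus_freq = document_frequency(documents)
    summary = {}
    for cluster in sorted(set(labels)):
        members = [d for d, label in zip(documents, labels) if label == cluster]
        cluster_freq = document_frequency(members)
        scores = {t: f - corpus_freq[t] for t, f in cluster_freq.items()}
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        summary[cluster] = (len(members), ranked[:top_n])
    return summary


documents = [
    ["france", "foot", "match"],
    ["france", "foot"],
    ["france", "vote", "loi"],
    ["france", "vote"],
]
labels = [0, 0, 1, 1]
summary = summarize(documents, labels)
```

The token "france" appears in every document, so its score is 0 in both clusters and it never makes it into a summary, which is the desired effect for a corpus collected with that keyword.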
Results
0. 3165 MAIS PERSONNE https://t.co/Xg4fOi9Q1c ACCOSTER #TraduisonsLes
1. 2407 prenne égalera battra protéger entrer
2. 255 bousiller travaillé aies gar jaloux
3. 372 262 légaux 3A https://t.co/WyunDG4wLs optim
4. 896 tchadien Tchad zénith annonçons lor
5. 110 traiter https://t.co/zCAlZJjzfX rt pute met
6. 326 GAGNANTS https://t.co/1XGv3j526K PASSE PayPal
7. 2598 Mauvais marquage Archives-Verrerie chuter Générosités
8. 242 https://t.co/byRBwkSa3U Faire l'île
9. 471 altitude giflée bled baisser Francais
Comments
Small clusters are pretty coherent
Big clusters are a mix of lots of small clusters
→ Choosing a good K is crucial !
● Too small: mishmash of topics
● Too big: many small clusters which are all about the same topic
Things you could do
1. More/different data
2. Compare accuracy loss of MiniBatchKMeans against KMeans
3. Test other clustering algorithms
4. Better summarization
5. Visualize topic relationships
6. Compare LSA and LDA to Clustering output
7. Automatically pick number of topics by optimizing for silhouette coefficient
8. Decrease dimensionality with doc2vec, LSA or LDA to replace bag-of-words
9. ...
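As an illustration of item 7, the silhouette coefficient can be computed for several candidate clusterings and the best-scoring K kept. This is a minimal sketch on toy 2-D points: the labelings are written by hand here, whereas in practice each would come from a clustering run with a different K, and `silhouette` is a hypothetical helper that needs at least two clusters.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient; higher means tighter, better separated clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        total += (b - a) / denominator if denominator else 0.0
    return total / len(points)


points = [[0, 0], [0, 1], [10, 0], [10, 1]]
# Hypothetical labelings, keyed by the number of clusters K.
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
scores = {k: silhouette(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
```

Splitting the second group into two singletons (K = 3) roughly halves the score, so K = 2 is selected.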
Questions ?
Dimensionality reduction: word2vec
python ./do-word-vector-model.py -d sample-big
mv sample-big-word-vector-mode sample-word-vector-model
python ./do-doc2vec.py
“Distributed Representations of Words and Phrases and their Compositionality”, 2013
Open source implementation: gensim

SophiaConf 2018 - M. LACAGE (ALCMEON)
