© 2015 Lexalytics Inc. All rights reserved
Discovery++
Clustering + Text Analytics
Seth Redmore, CMO, Lexalytics, Inc.
@sredmore
Paul Barba, Senior Architect, Lexalytics, Inc.
Agenda
• Who is Lexalytics
• What our stack looks like
• How to fit Machine Learning and Text Analytics together
• Text and its annoying challenges
• Clustering and extraction process
• Interesting results
• What else could we have done?
• Human/Computer Partnership
Who is Lexalytics?
• Founded in 2003
• Text Analytics Engine
– Entities, Sentiment, Themes, Summaries, Intentions, Categories
• On-Premise, SaaS, Desktop
• Popular in Social Listening, Customer Experience Mgmt.
• Billions of documents/day processed across our customers
• Hybrid approach to text analytics using machine learning, natural language processing algorithms, pattern files,
and dictionaries
• Fun fact: We maintain almost 40 different machine learning models
Layers of Interpretation: Transparent Deep Learning
[Diagram]
• Multi-layered Text Deconstruction (Text Preparation): Sentence Breaking, Tokenization, Lexical Chaining, PoS, Chunk, Syntax
• Base Knowledge: Syntax Matrix, Vertical Optimization, Concept Matrix
• Feature Extraction: Intentions, Themes, Entities, Sentiment (+/-), Summaries, Categories
The Discovery Problem vs. The Prediction Problem
• Two obvious ways to integrate NLP and Machine Learning
• Learn, then NLP → Discovery
• NLP first, then Learn → Predictions
• We decided to give the first one a try, as discovery is often the first thing an analyst needs from text.
• “Ok, I just got 500k tweets dumped on me and I need to understand what’s up.”
• Once some degree of “importance” is measured, it is easy to integrate into predictive models
Text and why it’s annoying
• Medium dimensionality
– As compared to:
• Video: gazillions of dimensions
• Netflix rating data:
– Lots of users and a bunch of movies as dimensions (Users * Movies), also sparse
• But very sparse across those dimensions
– Of the ~100,000-200,000 lemmas that come up with reasonable frequency, how many are you getting in your corpus?
http://www.oxforddictionaries.com/us/words/the-oec-facts-about-the-language
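To make the sparsity point concrete, here is a minimal sketch that builds a bag-of-words view of a few made-up tweets and measures what fraction of the document-by-term matrix is non-zero. The sample texts and the helper name `term_matrix_density` are illustrative assumptions, not part of the talk.

```python
from collections import Counter

def term_matrix_density(docs):
    """Return (n_docs, vocab_size, density) for a whitespace-tokenized corpus.

    density = non-zero cells / total cells of the document-by-term matrix.
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = set().union(*counts)           # every distinct term in the corpus
    nonzero = sum(len(c) for c in counts)  # distinct terms per doc = non-zero cells
    density = nonzero / (len(docs) * len(vocab))
    return len(docs), len(vocab), density

# Hypothetical sample texts, for illustration only.
docs = [
    "the debate was great tonight",
    "new phone announcement tonight",
    "supply issues delay the global rollout",
]
n_docs, vocab_size, density = term_matrix_density(docs)
print(n_docs, vocab_size, round(density, 3))
```

With short texts such as tweets, each document touches only a handful of terms, so the density drops quickly as the vocabulary grows.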
Discovery Process – Cluster then Extract
• Clustering allows us to discover naturally occurring
groupings of text
• Post-clustering, we will then extract the features from the
clusters to see what’s in them
– Terms
– Bigrams
– Trigrams
– Themes
– Entities
– Sentiment
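As a rough sketch of the post-clustering extraction step, the snippet below counts the most frequent bigrams inside each cluster. It covers only the n-gram features (themes, entities, and sentiment come from the text analytics engine and are not reproduced here). The function names, sample texts, and cluster labels are hypothetical.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams_by_cluster(texts, labels, n, top=3):
    """Count n-grams per cluster label and return the most common ones."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        counts[label].update(ngrams(text.lower().split(), n))
    return {label: [g for g, _ in c.most_common(top)] for label, c in counts.items()}

# Hypothetical texts and cluster assignments, for illustration only.
texts = [
    "supply issues hit global rollout",
    "global rollout hit by supply issues",
    "great town hall meeting today",
    "town hall meeting was great",
]
labels = [0, 0, 1, 1]
result = top_ngrams_by_cluster(texts, labels, n=2, top=2)
print(result)
```

Calling the same function with `n=1` or `n=3` gives the term and trigram lists.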
Themes example
House and Senate leaders hatched their plans Thursday to avoid a politically risky shutdown next week, moving to separate an acrimonious battle over abortion from a must-pass bill to keep government agencies open.
After Pope Francis addressed a joint meeting of Congress, Speaker John Boehner told his leadership team he would immediately put a plan to defund Planned Parenthood into a legislative vehicle known on Capitol Hill as "reconciliation," which cannot be filibustered in the Senate.
The speaker's team argues that by putting the provision in a reconciliation bill, there's a good chance it will be approved in both chambers of Congress and it will force Obama to use his veto pen. It would also allow them to pass a stop-gap measure free of Planned Parenthood restrictions before the Oct. 1 deadline to keep the government open.
The move is bound to anger conservatives, and Boehner will pitch the plan Friday morning to a closed-door conference meeting.
Extracted themes (sentiment)
• anger conservatives (-2.07)
• risky shut down (-3.82)
• acrimonious battle (-4.50)
• must-pass bill (-2.32)
• good chance (+3.00)
Themes
[Diagram: theme extraction pipeline. Labels: Text; Text Prep; Theme Candidate PoS Patterns; Candidate Themes; Algorithm; Scoring; Patterns; Tuning; Scored Themes]
Clustering
• H2O supports k-means clustering
• k-means clustering:
– Find k center points such that the distance between each cluster's members and its center point is minimized (“Within Cluster Sum of Squares” – WCSS)
• k-means can be solved in reasonable time with fixed dimensionality
and number of clusters
• 3 steps:
– Decide what you’re going to cluster on
– Initialize the set – k-means++
– Run some sort of optimized algorithm
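The three steps above can be sketched in plain Python. This is a minimal illustration of k-means++ seeding followed by Lloyd's iterations, not H2O's implementation. The function names and the toy 2-D points are assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest chosen center.
    (Assumes the points are not all identical, so the weights are not all zero.)"""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j])) for p in points]

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and center update until stable."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    wcss = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, centers, wcss

# Toy data: two well-separated groups (hypothetical).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centers, wcss = kmeans(points, k=2)
print(labels, wcss)
```

The returned `wcss` is the quantity k-means tries to minimize, so it can be compared across runs with different seeds or values of k.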
Datasets
• 2 test datasets:
– ~10k tweets from New Hampshire
that talk about the current
election cycle
– 20,000 tweets from a
Samsung® announcement
• We want to know if there are any
interesting, natural groupings in the
content that we should be aware of.
Challenges in Clustering
• Dimensionality vs. Sparseness
• We tried clustering on:
– Terms (single words) (stemmed +
unstemmed)
– Bigrams (stemmed + unstemmed)
– Themes (stemmed and unstemmed)
• Each attempt produced a single mega-cluster
• Data is too high-dimensional and sparse
– We need to reduce dimensions
Reducing Dimensions (and improving sparseness)
• Principal Component Analysis (PCA) is native to H2O, so we tried that first.
– PCA reduces dimensionality by first finding the “principal component” that accounts for
the most variability.
– Then, it finds the component that has the next largest variability – with the constraint that
this component must be orthogonal to the first component
– Lather, rinse, repeat
• PCA ran for over a week on the fairly hefty cluster we were given to use, then
went down.
• PCA is thus too slow for this problem
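The "find the largest component, then the next orthogonal one" loop can be illustrated with power iteration plus deflation on a covariance matrix. This is only a small pure-Python sketch of the idea, not how H2O computes PCA. The function names and the toy data are assumptions.

```python
def covariance(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def top_component(cov, iters=500):
    """Power iteration: direction of largest variance and that variance."""
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-zero start vector
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    return v, variance

def pca(rows, n_components):
    """Repeatedly take the top component, then remove it (deflation)."""
    cov = covariance(rows)
    d = len(cov)
    components = []
    for _ in range(n_components):
        v, var = top_component(cov)
        components.append((v, var))
        # Subtract the found component so the next one is orthogonal to it.
        cov = [[cov[a][b] - var * v[a] * v[b] for b in range(d)] for a in range(d)]
    return components

# Toy data lying close to the line y = x (hypothetical).
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
(v1, var1), (v2, var2) = pca(rows, n_components=2)
print(v1, var1)
print(v2, var2)
```

On a term matrix with tens of thousands of columns the covariance matrix alone becomes very large, which gives a feel for why a dense PCA can be slow on this kind of data.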
Word2Vec
• Word2Vec is an open-source toolset for
– calculating the cosine distance between words
– categorizing words based on a training corpus
• You can train it yourself on your own corpora, or use one of the pre-trained Word2Vec models out there already (see below)
• The word vectors can be used to reduce the dimensionality: words that are close in cosine distance end up close together in a space with an arbitrary number of dimensions
• We used 300, because This Is SPARTAAAAA!
– Actually because we used the pre-existing Google model, which already has 300-dimensional vectors
– https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
https://code.google.com/p/word2vec/
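A small sketch of how word vectors can give every text a fixed, dense set of dimensions. The talk does not say how the word vectors were combined per tweet, so averaging is an assumption here, and the 3-dimensional `EMBEDDINGS` table is a made-up stand-in for the 300-dimensional Google model.

```python
import math

# Hypothetical 3-dimensional embeddings; the pre-trained Google model has 300 dimensions.
EMBEDDINGS = {
    "phone":  [0.9, 0.1, 0.0],
    "tablet": [0.8, 0.2, 0.1],
    "senate": [0.0, 0.9, 0.3],
    "vote":   [0.1, 0.8, 0.4],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (cosine distance is 1 minus this)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def document_vector(text, embeddings):
    """Average the vectors of the known words; unknown words are skipped.

    Returns None when no word of the text is in the embedding table.
    """
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None
    return [sum(col) / len(vectors) for col in zip(*vectors)]

tech = document_vector("new phone and tablet", EMBEDDINGS)
politics = document_vector("senate vote today", EMBEDDINGS)
print(cosine_similarity(tech, politics))
```

Every document now has the same small number of dense dimensions regardless of vocabulary size, which is the property the clustering step needs.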
Clustering on Word2Vec processed content
• Yay! We’re not getting one big cluster any more.
• Now we need to figure out how many clusters are optimal
– Remember, we’re just doing discovery here, so, we don’t have to spend a lot of time optimizing
– We tried 8, 30, and 100 clusters
Politics-30-split: Cluster 14, Size = 305, Sentiment = -0.31
Bigrams
• #alpolitics #iacaucus
• #alpolitics #tennessee
• #alpolitics #ukip
• #alpolitics @anncoulter
• #alpolitics @realdonaldtrump
• #gopdebate #nhpolitics
• #iacaucus #alpolitics
• #iacaucus #ukip
• #iacaucus @anncoulter
• #iacaucus @realdonaldtrump
Trigrams
• #alpolitics @anncoulter #tennessee
• #alpolitics @anncoulter @vdare
• #alpolitics @realdonaldtrump #ukip
• #iacaucus #alpolitics #tennessee
• #iacaucus #alpolitics #ukip
• #iacaucus #alpolitics @anncoulter
• #iacaucus #alpolitics @realdonaldtrump
• #iacaucus @anncoulter #alpolitics
• #iacaucus @anncoulter @vdare
• #iacaucus @realdonaldtrump #ukip
Terms
• #GOPDebate
• #Immigration
• #NHGOP
• #TPP
• #UKIP
• #alpolitics
• #fitn
• #iacaucus
• #immigration
• #nhpolitics
Politics-30: Cluster 14, Size = 305, Sentiment = -0.31
Entities
• Bush(-9.89)
• Marco Rubio(-16.83)
• Hillary Clinton(-10.42)
• Mexico(-0.63)
• @AnnCoulter(-4.44)
• AMNESTY(-8.94)
• @realDonaldTrump(-4.89)
• @BruceBourgoine(-0.23)
• Mass(-0.88)
• Libya(-0.23)
Themes
• open borders mass immigration(1.87500047684)
• wage-reducing mass immigration(-6.61054801941)
• nation-wrecking mass immigration(-4.53758764267)
• alien invaders(-18.000005722)
• legal immigration(-7.35771656036)
• rancid whores(-8.327501297)
• job-killing trade deal scams(-3.09999990463)
• Trans-Pacific Partnership trade deal scam(-0.490000009537)
• multicultural mayhem(-3.75)
• treasonous rat(-3.75)
Politics-100: Cluster 37, Size = 213, Sentiment = -0.38
Entities
• AMNESTY(-9.45)
• @realDonaldTrump(-5.75)
• @elraymer(0.00)
• @MartinOMalley(0.00)
• Martin O'Malley(-1.14)
• Maryland(-1.00)
• @RefugeeWatcher(0.00)
• @vaughnFNC(-1.20)
• Hillary Clinton(-0.72)
• @CampaignReg(-0.13)
Themes
• illegal aliens(-4.00000047684)
• wage-reducing mass immigration(-2.23903226852)
• nation-wrecking mass immigration(-1.57310771942)
• multicultural mayhem(-2.25)
• alien invaders(-1.20000004768)
• open border(0.450000017881)
• big money donors(0.882178008556)
• sanctuary state(-0.483333349228)
• immigrant crime(-1.20319712162)
• visa foreigners(-0.600000023842)
Politics-30: Cluster 25, Size = 407, Sentiment = +0.27
Terms
• #11
• #2016election
• #3
• #603forHRC
• #911Anniversary
• #ACEs
• #Bernie2016
• #BernieAtUNH
• #Brooklyn
• #CNN
Bigrams
• #11 candidate
• #603forhrc #hillary2016
• #603forhrc #newhampshire
• #603forhrc @hillaryfornh
• #603forhrc together
• #bernie2016 #feelthebern
• #brooklyn today
• #carly2016 #fitn
• #carly2016 #nhgop
• #carly2016 listen
Trigrams
• #11 candidate i
• #603forhrc #hillary2016 http
• #carly2016 #fitn #nhgop
• #carly2016 #fitn #nhpolitics
• #climateactionnow thank @berniesanders
• #cnndebate stage tonight
• #delay #nh #nhpoli
• #feelthebern #climateactionnow #stopthenhpipeline
• #feelthebern #fitn #nhpolitics
• #fitn #bernie2016 #feelthebern
Politics-30-split: Cluster 25, Size = 407, Sentiment = +0.27
Entities
• @ThisWeekABC(0.00)
• @donnabrazile(0.27)
• Senator Bernie Sanders(1.81)
• @Women4Bernie(0.00)
• NH(5.36)
• @BernieSanders(4.17)
• @CornelWest(0.00)
• RI Gov Lincoln Chafee(0.00)
• 4(1.10)
• Wheeler Hall(0.00)
Themes
• race car start(0.0)
• town hall meeting(0.0)
• 2nd day(0.487500011921)
• inviting folks(0.24375000596)
• great day(0.40000000596)
• Convention crowd cheers(0.980000019073)
• clear winner(0.490000009537)
• 17 town hall(0.0)
• Living room(0.0)
• state convention(0.0)
Samsung-30 Interesting Clusters (Themes Only)
Cluster 5, Size = 50, Sentiment = +0.28
• Android smartphone profits(8.37637424469)
• filling pre-orders(-2.89171385765)
• supply issues(-2.91585707664)
• global supply shortages(-0.980000019073)
• global rollout(-1.94809389114)
• initial supplies(-1.94912362099)
• mobile device market(5.76794099808)
• global rollout(-0.895333886147)
• Android profits(4.18818712234)
• lion share(3.84502887726)
Cluster 6, Size = 657, Sentiment = +0.32
• Limited edition(2.96772003174)
• 2 cover case leather sleeve brown(0.0)
• Rechargeable Power(0.147000014782)
• waxed leather(0.0)
• Cheap price(0.32262301445)
• Soft Skin(0.475291997194)
• S line(0.237645998597)
• Assorted Colors(0.118822999299)
What else could we have done?
• Different cluster sizes
• Semantic meaning of the themes for associations
• Pre-sorting based on queries of candidates, or topical queries
• Gathering other examples for comparison
• Building queries to pull out common items
– See which clusters each item appears in: is it spread across all the clusters?
• Demographic data, Klout scores
Human/Computer Partnership
[Diagram: Text Content → Reduce Dimensions → Cluster → Extract → Examine → Pick one lens (Entities, Sentiment, Themes, Categories); loop if broken]
Loop through to dive into one area, or segment text by features, then classify
E.g. “What are the clusters for each of the candidates?”
or
“I built classifiers for
– Solar Power
– Fossil Fuels
– Wind”
Summary
• Text Analytics relies heavily on machine learning to do its job
• Text Analytics can come before other Machine Learning for predictive analysis
• Machine Learning can come before Text Analytics for discovery processes
• Reducing dimensionality of the text is an important step because of the sparse nature of the
matrix
• PCA was unsuitable (we used Word2Vec, but Sparse PCA might work as well)
• For discovery, an interesting process is to loop, taking a lens built from the first run (entities,
categories, etc), and then going back to step one and looking at the related clusters for that
lens
Thanks!
• H2O for providing us with all the processing power we needed and excellent technical
support and tools. We were very impressed with their responsiveness and
professionalism
• Paul Barba for doing the heavy lifting
• Y’all for listening :)
• Happy Diwali Everyone!
© 2015 Lexalytics Inc. All rights reserved

Weitere ähnliche Inhalte

Was ist angesagt?

H2O PySparkling Water
H2O PySparkling WaterH2O PySparkling Water
H2O PySparkling WaterSri Ambati
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamSri Ambati
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSpark Summit
 
H2O World - H2O Deep Learning with Arno Candel
H2O World - H2O Deep Learning with Arno CandelH2O World - H2O Deep Learning with Arno Candel
H2O World - H2O Deep Learning with Arno CandelSri Ambati
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 KeynotePeter Wang
 
Applied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopApplied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopAvkash Chauhan
 
H2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneH2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneSri Ambati
 
Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...
Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...
Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...Sri Ambati
 
Using H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital One
Using H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital OneUsing H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital One
Using H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital OneSri Ambati
 
ArnoCandelAIFrontiers011217
ArnoCandelAIFrontiers011217ArnoCandelAIFrontiers011217
ArnoCandelAIFrontiers011217Sri Ambati
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuDataiku
 
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian BharadwajH2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian BharadwajSri Ambati
 
Big Data – A New Testing Challenge
Big Data – A New Testing ChallengeBig Data – A New Testing Challenge
Big Data – A New Testing ChallengeTEST Huddle
 
H2O World - Building a Smarter Application - Tom Kraljevic
H2O World - Building a Smarter Application - Tom KraljevicH2O World - Building a Smarter Application - Tom Kraljevic
H2O World - Building a Smarter Application - Tom KraljevicSri Ambati
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku
 
H2O AutoML roadmap - Ray Peck
H2O AutoML roadmap - Ray PeckH2O AutoML roadmap - Ray Peck
H2O AutoML roadmap - Ray PeckSri Ambati
 
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SF
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SFH2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SF
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SFSri Ambati
 

Was ist angesagt? (19)

H2O PySparkling Water
H2O PySparkling WaterH2O PySparkling Water
H2O PySparkling Water
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya Hristakeva
 
H2O World - H2O Deep Learning with Arno Candel
H2O World - H2O Deep Learning with Arno CandelH2O World - H2O Deep Learning with Arno Candel
H2O World - H2O Deep Learning with Arno Candel
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 Keynote
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Applied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopApplied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R Workshop
 
H2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneH2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to Everyone
 
Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...
Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...
Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...
 
Using H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital One
Using H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital OneUsing H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital One
Using H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital One
 
ArnoCandelAIFrontiers011217
ArnoCandelAIFrontiers011217ArnoCandelAIFrontiers011217
ArnoCandelAIFrontiers011217
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - Dataiku
 
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian BharadwajH2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
 
Big Data – A New Testing Challenge
Big Data – A New Testing ChallengeBig Data – A New Testing Challenge
Big Data – A New Testing Challenge
 
H2O World - Building a Smarter Application - Tom Kraljevic
H2O World - Building a Smarter Application - Tom KraljevicH2O World - Building a Smarter Application - Tom Kraljevic
H2O World - Building a Smarter Application - Tom Kraljevic
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
 
H2O AutoML roadmap - Ray Peck
H2O AutoML roadmap - Ray PeckH2O AutoML roadmap - Ray Peck
H2O AutoML roadmap - Ray Peck
 
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SF
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SFH2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SF
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SF
 

Ähnlich wie Discovering Themes and Groups in Political Text Data Using Clustering

Text Analytics in Action - Finding the Killer Patent. February 2015 Webinar b...
Text Analytics in Action - Finding the Killer Patent. February 2015 Webinar b...Text Analytics in Action - Finding the Killer Patent. February 2015 Webinar b...
Text Analytics in Action - Finding the Killer Patent. February 2015 Webinar b...kCura_Relativity
 
Visualizing Text: Seth Redmore at the 2015 Smart Data Conference
Visualizing Text: Seth Redmore at the 2015 Smart Data ConferenceVisualizing Text: Seth Redmore at the 2015 Smart Data Conference
Visualizing Text: Seth Redmore at the 2015 Smart Data Conferencesredmore
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014ALTER WAY
 
POWRR Tools: Lessons learned from an IMLS National Leadership Grant
POWRR Tools: Lessons learned from an IMLS National Leadership GrantPOWRR Tools: Lessons learned from an IMLS National Leadership Grant
POWRR Tools: Lessons learned from an IMLS National Leadership GrantLynne Thomas
 
Big Data Day LA 2015 - Building a Big Data Culture in the Entertainment Indus...
Big Data Day LA 2015 - Building a Big Data Culture in the Entertainment Indus...Big Data Day LA 2015 - Building a Big Data Culture in the Entertainment Indus...
Big Data Day LA 2015 - Building a Big Data Culture in the Entertainment Indus...Data Con LA
 
Deep learning for text analytics
Deep learning for text analyticsDeep learning for text analytics
Deep learning for text analyticsErik Tromp
 
16-nlp (2).ppt
16-nlp (2).ppt16-nlp (2).ppt
16-nlp (2).ppttestbest6
 
Metadata is a Love Note to the Future
Metadata is a Love Note to the FutureMetadata is a Love Note to the Future
Metadata is a Love Note to the FutureRachel Lovinger
 
LYCKE Artificial intelligence, hype or hope?
LYCKE Artificial intelligence, hype or hope?LYCKE Artificial intelligence, hype or hope?
LYCKE Artificial intelligence, hype or hope?FIAT/IFTA
 
Software Ecosystems = Big Data
Software Ecosystems = Big DataSoftware Ecosystems = Big Data
Software Ecosystems = Big DataTom Mens
 
Get full visibility and find hidden security issues
Get full visibility and find hidden security issuesGet full visibility and find hidden security issues
Get full visibility and find hidden security issuesElasticsearch
 
Twitter data analysis using R
Twitter data analysis using RTwitter data analysis using R
Twitter data analysis using Rsantoshi mangalgi
 
Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introductionamiyadash
 
Data science for advanced dummies
Data science for advanced dummiesData science for advanced dummies
Data science for advanced dummiesSaurav Chakravorty
 
Webinar - Harness the Power of Data with Tableau - 2016-02-18
Webinar - Harness the Power of Data with Tableau - 2016-02-18Webinar - Harness the Power of Data with Tableau - 2016-02-18
Webinar - Harness the Power of Data with Tableau - 2016-02-18TechSoup
 
Building Information Governance Policies and Workflows
Building Information Governance Policies and WorkflowsBuilding Information Governance Policies and Workflows
Building Information Governance Policies and WorkflowskCura_Relativity
 
BEA 2015 Generating Metadata by Machine Final
BEA 2015 Generating Metadata by Machine FinalBEA 2015 Generating Metadata by Machine Final
BEA 2015 Generating Metadata by Machine FinalS. M. Hassan Zaidi
 

Ähnlich wie Discovering Themes and Groups in Political Text Data Using Clustering (20)

Chatbots: Automated Conversational Model using Machine Learning
Chatbots: Automated Conversational Model using Machine LearningChatbots: Automated Conversational Model using Machine Learning
Chatbots: Automated Conversational Model using Machine Learning
 
Text Analytics in Action - Finding the Killer Patent. February 2015 Webinar b...
Text Analytics in Action - Finding the Killer Patent. February 2015 Webinar b...Text Analytics in Action - Finding the Killer Patent. February 2015 Webinar b...
Text Analytics in Action - Finding the Killer Patent. February 2015 Webinar b...
 
Visualizing Text: Seth Redmore at the 2015 Smart Data Conference
Visualizing Text: Seth Redmore at the 2015 Smart Data ConferenceVisualizing Text: Seth Redmore at the 2015 Smart Data Conference
Visualizing Text: Seth Redmore at the 2015 Smart Data Conference
 
Data Mining & Engineering
Data Mining & EngineeringData Mining & Engineering
Data Mining & Engineering
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
 
POWRR Tools: Lessons learned from an IMLS National Leadership Grant
POWRR Tools: Lessons learned from an IMLS National Leadership GrantPOWRR Tools: Lessons learned from an IMLS National Leadership Grant
POWRR Tools: Lessons learned from an IMLS National Leadership Grant
 
Big Data Day LA 2015 - Building a Big Data Culture in the Entertainment Indus...
Big Data Day LA 2015 - Building a Big Data Culture in the Entertainment Indus...Big Data Day LA 2015 - Building a Big Data Culture in the Entertainment Indus...
Big Data Day LA 2015 - Building a Big Data Culture in the Entertainment Indus...
 
Deep learning for text analytics
Deep learning for text analyticsDeep learning for text analytics
Deep learning for text analytics
 
16-nlp (2).ppt
16-nlp (2).ppt16-nlp (2).ppt
16-nlp (2).ppt
 
Metadata is a Love Note to the Future
Metadata is a Love Note to the FutureMetadata is a Love Note to the Future
Metadata is a Love Note to the Future
 
LYCKE Artificial intelligence, hype or hope?
LYCKE Artificial intelligence, hype or hope?LYCKE Artificial intelligence, hype or hope?
LYCKE Artificial intelligence, hype or hope?
 
Software Ecosystems = Big Data
Software Ecosystems = Big DataSoftware Ecosystems = Big Data
Software Ecosystems = Big Data
 
Lean Security
Lean SecurityLean Security
Lean Security
 
Get full visibility and find hidden security issues
Get full visibility and find hidden security issuesGet full visibility and find hidden security issues
Get full visibility and find hidden security issues
 
Twitter data analysis using R
Twitter data analysis using RTwitter data analysis using R
Twitter data analysis using R
 
Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introduction
 
Data science for advanced dummies
Data science for advanced dummiesData science for advanced dummies
Data science for advanced dummies
 
Webinar - Harness the Power of Data with Tableau - 2016-02-18
Webinar - Harness the Power of Data with Tableau - 2016-02-18Webinar - Harness the Power of Data with Tableau - 2016-02-18
Webinar - Harness the Power of Data with Tableau - 2016-02-18
 
Building Information Governance Policies and Workflows
Building Information Governance Policies and WorkflowsBuilding Information Governance Policies and Workflows
Building Information Governance Policies and Workflows
 
BEA 2015 Generating Metadata by Machine Final
BEA 2015 Generating Metadata by Machine FinalBEA 2015 Generating Metadata by Machine Final
BEA 2015 Generating Metadata by Machine Final
 

Mehr von Sri Ambati

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxSri Ambati
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek Sri Ambati
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thSri Ambati
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionSri Ambati
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Discovering Themes and Groups in Political Text Data Using Clustering

  • 1. Discovery++: Clustering + Text Analytics. Seth Redmore, CMO, Lexalytics, Inc. (@sredmore); Paul Barba, Senior Architect, Lexalytics, Inc. © 2015 Lexalytics Inc. All rights reserved.
  • 2. Agenda
      • Who is Lexalytics
      • What our stack looks like
      • How to fit Machine Learning and Text Analytics together
      • Text and its annoying challenges
      • Clustering and extraction process
      • Interesting results
      • What else could we have done?
      • Human/Computer Partnership
  • 3. Who is Lexalytics?
      • Founded in 2003
      • Text Analytics Engine
        – Entities, Sentiment, Themes, Summaries, Intentions, Categories
      • On-Premise, SaaS, Desktop
      • Popular in Social Listening, Customer Experience Mgmt.
      • Billions of documents/day processed across our customers
      • Hybrid approach to text analytics using machine learning, natural language processing algorithms, pattern files, and dictionaries
      • Fun fact: We maintain almost 40 different machine learning models
  • 4. Layers of Interpretation: Transparent Deep Learning
      • [Diagram] Multi-layered Text Deconstruction (Text Preparation): Sentence Breaking, Tokenization, PoS, Chunk, Syntax, Lexical Chaining
      • [Diagram] Base Knowledge: Syntax Matrix, Concept Matrix, Vertical Optimization
      • [Diagram] Feature Extraction: Entities, Themes, Intentions, Sentiment (+/-), Summaries, Categories
  • 5. The Discovery Problem vs. The Prediction Problem
      • Two obvious ways to integrate NLP and Machine Learning
        – Learn, then NLP → Discovery
        – NLP first, then Learn → Predictions
      • We decided to give the first one a try, as that’s often the first question an analyst needs answered about text.
        – “Ok, I just got 500k tweets dumped on me and I need to understand what’s up.”
      • Once some degree of “importance” is measured, it is then easy to integrate into predictive models
  • 6. Text and why it’s annoying
      • Medium dimensionality, as compared to:
        – Video: gazillions of dimensions
        – Netflix rating data: lots of users and a bunch of movies (Users × Movies dimensions), also sparse
      • But very sparse across those dimensions
        – Of the, say, ~100,000-200,000 lemmas that come up with reasonable frequency, how many are you getting in your corpus?
      • Source: http://www.oxforddictionaries.com/us/words/the-oec-facts-about-the-language
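To make the sparseness point on this slide concrete, here is a minimal sketch that measures how many cells of a bag-of-words matrix are zero. The three example tweets and the function name `term_document_sparsity` are made up for illustration; whitespace tokenization is an assumption and is much cruder than the text preparation described in the deck.

```python
from collections import Counter

def term_document_sparsity(docs):
    """Return (vocabulary size, fraction of zero cells) for a bag-of-words matrix."""
    # One Counter per document: the non-zero cells of that document's row.
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = set()
    for c in counts:
        vocab.update(c)
    total_cells = len(docs) * len(vocab)          # rows x columns
    nonzero_cells = sum(len(c) for c in counts)   # distinct terms per document
    return len(vocab), 1.0 - nonzero_cells / total_cells

# Hypothetical mini-corpus; real tweet corpora are far larger and far sparser.
tweets = [
    "debate tonight in new hampshire",
    "new phone announcement today",
    "watching the debate",
]
vocab_size, sparsity = term_document_sparsity(tweets)
print(vocab_size, round(sparsity, 3))
```

Even on three short texts most cells are empty; with tens of thousands of tweets and a vocabulary in the tens of thousands, nearly every cell is zero, which is the situation the later slides run into.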
  • 7. Discovery Process – Cluster then Extract
      • Clustering allows us to discover naturally occurring groupings of text
      • Post-clustering, we will then extract the features from the clusters to see what’s in them
        – Terms
        – Bigrams
        – Trigrams
        – Themes
        – Entities
        – Sentiment
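The "extract after clustering" step for terms, bigrams, and trigrams can be sketched as simple n-gram counting per cluster. This is an illustrative sketch only: the documents, the cluster labels, and the helper names `ngrams` and `top_ngrams_per_cluster` are hypothetical, and themes, entities, and sentiment would come from a text analytics engine rather than from code like this.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams_per_cluster(docs, labels, n=2, k=3):
    """Count n-grams separately for each cluster label and keep the k most common."""
    by_cluster = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        by_cluster[label].update(ngrams(doc.lower().split(), n))
    return {label: counter.most_common(k) for label, counter in by_cluster.items()}

# Hypothetical documents with cluster assignments from an earlier clustering run.
docs = [
    "#fitn #nhpolitics town hall tonight",
    "great town hall in concord",
    "supply issues delay global rollout",
    "global rollout hit by supply issues",
]
labels = [0, 0, 1, 1]
result = top_ngrams_per_cluster(docs, labels, n=2, k=1)
print(result[0])
```

Calling the same function with `n=1` or `n=3` gives the per-cluster terms and trigrams.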
  • 8. Themes example
      • Sample text: House and Senate leaders hatched their plans Thursday to avoid a politically risky shutdown next week, moving to separate an acrimonious battle over abortion from a must-pass bill to keep government agencies open. After Pope Francis addressed a joint meeting of Congress, Speaker John Boehner told his leadership team he would immediately put a plan to defund Planned Parenthood into legislative vehicle known on Capitol Hill as "reconciliation," which cannot be filibustered in the Senate. The speaker's team argues that by putting the provision in a reconciliation bill, there's a good chance it will be approved in both chambers of Congress and it will force Obama to use his veto pen. It would also allow them to pass a stop-gap measure free of Planned Parenthood restrictions before the Oct. 1 deadline to keep the government open. The move is bound to anger conservatives, and Boehner will pitch the plan Friday morning to a closed-door conference meeting.
      • Extracted themes (sentiment):
        – anger conservatives: -2.07
        – risky shut down: -3.82
        – acrimonious battle: -4.50
        – must-pass bill: -2.32
        – good chance: +3.00
  • 9. Themes
      • [Pipeline diagram; labels: Text, Text Prep, Theme Candidate PoS Patterns, Candidate Themes, Scoring Algorithm, Scored Themes, Tuning]
  • 10. Clustering
      • H2O supports k-means clustering
      • k-means clustering:
        – Find k center points that minimize the sum of squared distances between each cluster’s members and its center point (“Within Cluster Sum of Squares”, WCSS)
      • k-means can be solved in reasonable time with fixed dimensionality and number of clusters
      • 3 steps:
        – Decide what you’re going to cluster on
        – Initialize the set of center points (k-means++)
        – Run some sort of optimized algorithm
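The three steps on the clustering slide can be sketched in plain Python. This is not H2O's implementation; it is a small illustration that assumes Lloyd's algorithm for the "optimized algorithm" step, uses k-means++ seeding as named on the slide, and reports WCSS. The function names and the toy 2-D points are invented for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: pick later centers with probability proportional
    to the squared distance from the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm; returns (labels, centers, WCSS)."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    wcss = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, centers, wcss

# Toy data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, wcss = kmeans(points, k=2)
print(labels, round(wcss, 4))
```

On real text the points would be high-dimensional document vectors rather than 2-D tuples, which is where the dimensionality and sparseness problems discussed on the following slides come in.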
  • 11. Datasets
      • 2 test datasets:
        – ~10k tweets from New Hampshire that talk about the current election cycle
        – 20,000 tweets from a Samsung® announcement
      • We want to know if there are any interesting, natural groupings in the content that we should be aware of.
  • 12. Challenges in Clustering
      • Dimensionality vs. Sparseness
      • We tried clustering on:
        – Terms (single words) (stemmed + unstemmed)
        – Bigrams (stemmed + unstemmed)
        – Themes (stemmed + unstemmed)
      • Each one got a single mega-cluster
      • Data is too high-dimensional and sparse: need to reduce dimensions
  • 13. Reducing Dimensions (and improving sparseness)
      • Principal Component Analysis (PCA) is native to H2O, so we tried that first.
        – PCA reduces dimensionality by first finding the “principal component” that accounts for the most variability.
        – Then, it finds the component that has the next largest variability, with the constraint that this component must be orthogonal to the first component
        – Lather, rinse, repeat
      • PCA ran for over a week on the fairly hefty cluster we were given to use, then went down.
      • PCA is thus too slow for this problem
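The "find the direction of most variability, then the next one orthogonal to it" description of PCA can be written as a sequence of constrained maximizations. This is the standard textbook formulation, not something taken from the slides; the symbols are introduced here for illustration: X is the mean-centered n × d data matrix and w_k is the k-th principal direction.

```latex
% First principal direction: the unit vector with maximum projected variance.
w_1 = \arg\max_{\lVert w \rVert = 1} \; \lVert X w \rVert^2

% k-th principal direction: same objective, restricted to directions
% orthogonal to all previously found components.
w_k = \arg\max_{\substack{\lVert w \rVert = 1 \\ w \perp w_1, \dots, w_{k-1}}} \; \lVert X w \rVert^2

% Equivalently, the w_k are the eigenvectors of the sample covariance matrix
% \tfrac{1}{n-1} X^{\top} X, ordered by decreasing eigenvalue.
```

For a term-by-document matrix, d is the vocabulary size, so the covariance matrix alone has d² entries, which gives some intuition for why a dense PCA can be expensive on text.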
  • 14. Word2Vec
      • Word2Vec is an open-source toolset for
        – calculating the cosine distance between words (via word vectors learned from a training corpus)
        – categorizing words based on a training corpus
      • You can train it yourself on your own corpora, or can use some of the pre-trained Word2Vec models out there already (see below)
      • The word vectors can be used to reduce the dimensionality: each word is mapped into a dense space with an arbitrary number of dimensions
      • We used 300, because This Is SPARTAAAAA!
        – Actually because we used the pre-existing Google model, whose vectors already have 300 dimensions
        – https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
      • https://code.google.com/p/word2vec/
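One common way to turn word vectors into a fixed-length document vector is to average the vectors of the words in the document; the slides do not say how tweets were mapped into the 300-dimensional space, so treat the averaging below as an assumption. The 3-dimensional `toy_embeddings` are invented stand-ins for a pre-trained model, and the function names are hypothetical.

```python
import math

def average_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding; None if none do."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (cosine distance is 1 minus this)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented 3-dimensional embeddings; a real pre-trained model would supply
# 300-dimensional vectors for a vocabulary of millions of words.
toy_embeddings = {
    "senate": [0.9, 0.1, 0.0],
    "congress": [0.8, 0.2, 0.0],
    "phone": [0.0, 0.1, 0.9],
    "tablet": [0.1, 0.0, 0.9],
}

doc_a = average_vector("senate congress vote".split(), toy_embeddings)
doc_b = average_vector("phone tablet launch".split(), toy_embeddings)
doc_c = average_vector("congress senate".split(), toy_embeddings)

print(cosine_similarity(doc_a, doc_c) > cosine_similarity(doc_a, doc_b))
```

Every document ends up as a dense vector of the same length regardless of vocabulary size, which is what makes k-means workable where the sparse term matrix was not.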
  • 15. Clustering on Word2Vec processed content
      • Yay! We’re not getting one big cluster any more.
      • Now we need to figure out how many clusters are optimal
        – Remember, we’re just doing discovery here, so we don’t have to spend a lot of time optimizing
        – We tried 8, 30, and 100 clusters
  • 16. Politics-30-split: Cluster 14 (Size = 305, Sentiment = -0.31)
      • Bigrams: #alpolitics #iacaucus; #alpolitics #tennessee; #alpolitics #ukip; #alpolitics @anncoulter; #alpolitics @realdonaldtrump; #gopdebate #nhpolitics; #iacaucus #alpolitics; #iacaucus #ukip; #iacaucus @anncoulter; #iacaucus @realdonaldtrump
      • Trigrams: #alpolitics @anncoulter #tennessee; #alpolitics @anncoulter @vdare; #alpolitics @realdonaldtrump #ukip; #iacaucus #alpolitics #tennessee; #iacaucus #alpolitics #ukip; #iacaucus #alpolitics @anncoulter; #iacaucus #alpolitics @realdonaldtrump; #iacaucus @anncoulter #alpolitics; #iacaucus @anncoulter @vdare; #iacaucus @realdonaldtrump #ukip
      • Terms: #GOPDebate; #Immigration; #NHGOP; #TPP; #UKIP; #alpolitics; #fitn; #iacaucus; #immigration; #nhpolitics
  • 17. Politics-30: Cluster 14 (Size = 305, Sentiment = -0.31)
      • Entities: Bush (-9.89); Marco Rubio (-16.83); Hillary Clinton (-10.42); Mexico (-0.63); @AnnCoulter (-4.44); AMNESTY (-8.94); @realDonaldTrump (-4.89); @BruceBourgoine (-0.23); Mass (-0.88); Libya (-0.23)
      • Themes: open borders mass immigration (1.87500047684); wage-reducing mass immigration (-6.61054801941); nation-wrecking mass immigration (-4.53758764267); alien invaders (-18.000005722); legal immigration (-7.35771656036); rancid whores (-8.327501297); job-killing trade deal scams (-3.09999990463); Trans-Pacific Partnership trade deal scam (-0.490000009537); multicultural mayhem (-3.75); treasonous rat (-3.75)
  • 18. Politics-100: Cluster 37 (Size = 213, Sentiment = -0.38)
      • Entities: AMNESTY (-9.45); @realDonaldTrump (-5.75); @elraymer (0.00); @MartinOMalley (0.00); Martin O'Malley (-1.14); Maryland (-1.00); @RefugeeWatcher (0.00); @vaughnFNC (-1.20); Hillary Clinton (-0.72); @CampaignReg (-0.13)
      • Themes: illegal aliens (-4.00000047684); wage-reducing mass immigration (-2.23903226852); nation-wrecking mass immigration (-1.57310771942); multicultural mayhem (-2.25); alien invaders (-1.20000004768); open border (0.450000017881); big money donors (0.882178008556); sanctuary state (-0.483333349228); immigrant crime (-1.20319712162); visa foreigners (-0.600000023842)
  • 19. Politics-30: Cluster 25 (Size = 407, Sentiment = +0.27)
      • Terms: #11; #2016election; #3; #603forHRC; #911Anniversary; #ACEs; #Bernie2016; #BernieAtUNH; #Brooklyn; #CNN
      • Bigrams: #11 candidate; #603forhrc #hillary2016; #603forhrc #newhampshire; #603forhrc @hillaryfornh; #603forhrc together; #bernie2016 #feelthebern; #brooklyn today; #carly2016 #fitn; #carly2016 #nhgop; #carly2016 listen
      • Trigrams: #11 candidate i; #603forhrc #hillary2016 http; #carly2016 #fitn #nhgop; #carly2016 #fitn #nhpolitics; #climateactionnow thank @berniesanders; #cnndebate stage tonight; #delay #nh #nhpoli; #feelthebern #climateactionnow #stopthenhpipeline; #feelthebern #fitn #nhpolitics; #fitn #bernie2016 #feelthebern
  • 20. Politics-30-split: Cluster 25 (Size = 407, Sentiment = +0.27)
      • Entities: @ThisWeekABC (0.00); @donnabrazile (0.27); Senator Bernie Sanders (1.81); @Women4Bernie (0.00); NH (5.36); @BernieSanders (4.17); @CornelWest (0.00); RI Gov Lincoln Chafee (0.00); 4 (1.10); Wheeler Hall (0.00)
      • Themes: race car start (0.0); town hall meeting (0.0); 2nd day (0.487500011921); inviting folks (0.24375000596); great day (0.40000000596); Convention crowd cheers (0.980000019073); clear winner (0.490000009537); 17 town hall (0.0); Living room (0.0); state convention (0.0)
  • 21. Samsung-30 Interesting Clusters (Themes Only)
      • Cluster 5 (Size = 50, Sentiment = +0.28): Android smartphone profits (8.37637424469); filling pre-orders (-2.89171385765); supply issues (-2.91585707664); global supply shortages (-0.980000019073); global rollout (-1.94809389114); initial supplies (-1.94912362099); mobile device market (5.76794099808); global rollout (-0.895333886147); Android profits (4.18818712234); lion share (3.84502887726)
      • Cluster 6 (Size = 657, Sentiment = +0.32): Limited edition (2.96772003174); 2 cover case leather sleeve brown (0.0); Rechargeable Power (0.147000014782); waxed leather (0.0); Cheap price (0.32262301445); Soft Skin (0.475291997194); S line (0.237645998597); Assorted Colors (0.118822999299)
  • 22. What else could we have done?
      • Different cluster sizes
      • Semantic meaning of the themes for associations
      • Pre-sorting based on queries of candidates, or topical queries
      • Gathering other examples for comparison
      • Building queries to pull out common items.
        – Look to see which clusters it’s appearing in: is it across all the clusters?
      • Demographic data, Klout scores
  • 23. Human/Computer Partnership
      • [Flow diagram; labels: Text Content, Entities, Sentiment, Themes, Categories, Reduce Dimensions, Cluster, Extract, Examine, Pick one lens, Loop if broken]
      • Loop through to dive into one area, or segment text by features, then classify
      • E.g. “What are the clusters for each of the candidates?” or “I built classifiers for Solar Power, Fossil Fuels, Wind”
  • 24. Summary
      • Text Analytics relies heavily on machine learning to do its job
      • Text Analytics can come before other Machine Learning for predictive analysis
      • Machine Learning can come before Text Analytics for discovery processes
      • Reducing dimensionality of the text is an important step because of the sparse nature of the matrix
      • PCA was unsuitable (we used Word2Vec, but Sparse PCA might work as well)
      • For discovery, an interesting process is to loop, taking a lens built from the first run (entities, categories, etc.), and then going back to step one and looking at the related clusters for that lens
  • 25. Thanks!
      • H2O, for providing us with all the processing power we needed and excellent technical support and tools. We were very impressed with their responsiveness and professionalism.
      • Paul Barba, for doing the heavy lifting
      • Y’all, for listening
      • Happy Diwali, everyone!

Editor's notes

  1. Themes are lexically important noun phrases. Think of them as the “buzz” from the document. They work really well when rolled up across many documents, so you can get a feel for what, exactly, people are saying. They are completely automatic. We can also tell you the themes that are lexically associated with an Entity, and not just the themes that are important inside a document.
  2. We extract themes by identifying candidate themes via part-of-speech patterns. If you are a Salience customer, you can tweak these, but most people don’t. Once we have extracted them, we score them using a combination of Lexical Chaining and some of our own proprietary scoring algorithms.
  3. Iterative process: cluster and see, then allow for exploration into each of those ideas by building clusters that are associated with the various things. So, a tool that allows you to see the big ideas and who they are connected with, then what the spread is across the different users. Then iterate based on the topics, entities, etc. to dive deeper into the issue. Computers as partners, not as replacements: if you know the buckets already, then categorization is a great thing. But this process allows you to go back and forth and dive in and out. A tool with which to do research.
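The candidate-theme step described in note 2 (matching part-of-speech patterns) can be illustrated with a small sketch. Everything here is hypothetical: the tag names, the two patterns in `PATTERNS`, and the hand-tagged tokens are invented for the example and are not Salience's actual pattern files, and the scoring step (Lexical Chaining plus proprietary algorithms) is not attempted.

```python
def candidate_themes(tagged_tokens, patterns):
    """Return phrases whose part-of-speech tag sequence matches any pattern.

    tagged_tokens: list of (word, tag) pairs, assumed to come from a PoS tagger.
    patterns: list of tag tuples, e.g. ("ADJ", "NOUN").
    """
    tags = [tag for _, tag in tagged_tokens]
    found = []
    for pattern in patterns:
        n = len(pattern)
        # Slide a window of the pattern's length over the tag sequence.
        for i in range(len(tags) - n + 1):
            if tuple(tags[i:i + n]) == pattern:
                found.append(" ".join(word for word, _ in tagged_tokens[i:i + n]))
    return found

# Hypothetical noun-phrase patterns.
PATTERNS = [("ADJ", "NOUN"), ("ADJ", "NOUN", "NOUN")]

# Hand-tagged fragment echoing the themes example slide.
tagged = [
    ("an", "DET"), ("acrimonious", "ADJ"), ("battle", "NOUN"),
    ("over", "ADP"), ("a", "DET"), ("risky", "ADJ"), ("shutdown", "NOUN"),
]
themes = candidate_themes(tagged, PATTERNS)
print(themes)
```

In a full pipeline each candidate would then be scored and only the highest-scoring phrases kept as themes.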