The document describes a computational framework for generating visual summaries of topical clusters in Twitter streams. It involves preprocessing tweets, constructing a word co-occurrence graph, performing hierarchical clustering to group related words into topics, extracting keywords for each topic based on their frequency, and creating visual summaries like treemaps or word clouds to display the results.
Visual Summaries of Topical Clusters in Twitter Streams
1. Semantic Modeling
Computational Framework for Generating Visual Summaries of Topical Clusters in Twitter Streams*
Authors: Miray Kas, Bongwon Suh
Presenter: Sebastian Alfers - HTW Berlin
* http://link.springer.com/chapter/10.1007%2F978-3-319-02993-1_9
2. Visual Summaries of Twitter Streams
http://flowingdata.com/wp-content/uploads/2010/02/treemap-revised1.gif
http://www.infobarrel.com/media/image/54054.jpg
7.
• OAuth + HTTP
• here: Java library with Scala and Play Framework
8. Step 1: Preprocessing
• transform Tweets
- easy-to-analyze / clean format
• Process of cleaning:
1. lowercase
2. remove URLs, user mentions and stop words
• like @user, „a“ or „123“
3. remove special characters (#,.)
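The cleaning steps above can be sketched in a few lines. This is a minimal illustration in Python, not the presenter's Scala code; the stop-word list and the regular expressions are assumptions, and stop words are removed last so that punctuation attached to a word does not hide it from the filter.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and", "in", "on", "at", "for", "with"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Keep ASCII letters, digits, whitespace and hyphens (so "akka-http" survives).
SPECIAL_RE = re.compile(r"[^a-z0-9\s-]")


def preprocess(tweet: str) -> list[str]:
    """Return the cleaned tokens of one tweet."""
    text = tweet.lower()                  # 1. lowercase
    text = URL_RE.sub(" ", text)          # 2. remove URLs ...
    text = MENTION_RE.sub(" ", text)      #    ... and user mentions
    text = SPECIAL_RE.sub(" ", text)      # 3. remove special characters (#,.)
    tokens = [t.strip("-") for t in text.split()]
    # 2. (continued) drop stop words, pure numbers such as "123" and empty tokens
    return [t for t in tokens if t and t not in STOP_WORDS and not t.isdigit()]
```

Usage: `preprocess("Reactive Streams with #Scala at @scaladays")` yields a list of lowercase tokens without the mention, the hash sign or the stop words.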
10. Step 1: Preprocessing
• Example Tweets
- new york time reactive programming tool scala scale techrepublic
- akka-http based reactive stream scala scaladay
12. Step 2: Graph
• Word Co-Occurrence Graph
- Word = Node (Unigrams)
- Tweet = Link between Nodes
• Example
akka-http based stream reactive scala scaladay
* http://alias-i.com/lingpipe/
14. Step 2: Graph
• Word Co-Occurrence Graph
- Word = Node (Unigrams)
- Tweet = Link between Nodes
• Example
[Figure: co-occurrence graph of the example tweet with the nodes based, akka-http, reactive, stream, scaladay and scala]
19. Step 2: Graph
• Co-Occurrence Graph
- connect nodes (words) within and between tweets
- add strength (weight) and cost (distance)
• Words that co-occur more frequently
- increase the strength
- decrease the cost
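A minimal Python sketch of such a weighted co-occurrence graph follows. The slides do not say how cost is derived from strength, so the inverse relation used here is an assumption, as are the function names.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(tweets):
    """tweets: list of token lists. Returns {(word_a, word_b): strength}."""
    strength = defaultdict(int)
    for tokens in tweets:
        # Every unordered pair of distinct words in one tweet is a link;
        # the pair is stored in alphabetical order.
        for a, b in combinations(sorted(set(tokens)), 2):
            strength[(a, b)] += 1
    return dict(strength)


def cost(strength_value):
    # Assumption: cost is the inverse of strength, so words that
    # co-occur often end up close to each other.
    return 1.0 / strength_value
```

Pairs that appear in several tweets accumulate strength, which in turn lowers their cost.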
21. Step 2: Clustering
• Here: „complete link (max) clustering“ algorithm
- hierarchical clustering algorithm that forms clusters by merging subgroups
• Group Words from Tweets
- words that frequently appear together on a topic
- cluster = topic
* http://nlp.stanford.edu/IR-book/html/htmledition/single-link-and-complete-link-clustering-1.html
22. Step 2: Clustering
• Here: „complete link (max) clustering“ algorithm
• each node starts as an individual cluster
Clusters = Nodes = Words in tweet
• close clusters are successively merged together
- closeness of two clusters = highest cost between any pair of their words (lower is closer)
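The merging procedure can be sketched as a naive agglomerative loop in Python. This is an illustration of complete-link clustering in general, not the implementation used in the talk; the function name `complete_link` and the `dist` callback are hypothetical.

```python
from itertools import combinations


def complete_link(words, dist):
    """Agglomerative complete-link clustering.

    words: list of words (each starts as its own cluster)
    dist:  function (word_a, word_b) -> cost
    Returns the merge history as (cluster_a, cluster_b, height) tuples,
    which is the information a dendrogram draws.
    """
    clusters = [frozenset([w]) for w in words]
    merges = []

    def linkage(c1, c2):
        # Complete link: the distance of two clusters is the highest
        # cost between any word of one and any word of the other.
        return max(dist(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest complete-link distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        height = linkage(c1, c2)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((c1, c2, height))
    return merges
```

The loop recomputes all pairwise distances in every round, which is fine for a handful of words but too slow for a large vocabulary.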
30. Step 2: Clustering
• Final step: Dendrogram
- tree diagram
- represents the arrangement of hierarchical clusters
• why?
- easy to apply threshold metrics
31. Step 2: Clustering
• Final step: Dendrogram
- closer to the root = lower similarity
[Figure: dendrogram showing the root and the first cluster, reactive scala]
33. Step 2: Clustering
• Final step: Dendrogram
- closer to the root = lower similarity
[Figure: dendrogram from the root to the words new york programming … akka-http based stream scaladay, with the cluster reactive scala and the thresholds marked]
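Applying a threshold to a dendrogram amounts to keeping only the merges below that cost. The Python sketch below assumes the dendrogram is given as a list of (cluster_a, cluster_b, height) merges ordered from the leaves towards the root; this representation and the function name are illustrative assumptions.

```python
def cut_dendrogram(words, merges, threshold):
    """Return the clusters that exist at the given cost threshold.

    merges: (cluster_a, cluster_b, height) tuples in merge order, i.e. from
            the leaves towards the root, with non-decreasing heights
            (which holds for complete-link clustering).
    """
    clusters = {frozenset([w]) for w in words}
    for a, b, height in merges:
        if height > threshold:
            break  # everything closer to the root is less similar
        clusters -= {frozenset(a), frozenset(b)}
        clusters.add(frozenset(a) | frozenset(b))
    return clusters
```

A low threshold yields many small, tight clusters; a threshold at the root height yields a single cluster containing every word.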
36. Step 3: Extract topical keywords
• keywords
- express a topic
- frequently used
- summarize tweet content
• Questions
- „What are the relevant keywords?“
- „In what clusters do they appear?“
37. Step 3: Extract topical keywords
• How?
- „topical tweets“ vs. „general tweets“
• frequent in topical tweets
- search keywords „reactive scala“
• not frequent in general tweets
- general Twitter stream (all tweets)
38. Step 3: Extract topical keywords
• Strength of a word
- is a word relevant for that topical cluster?
[Figure: 2×2 matrix of low/high word frequency in topical tweets vs. low/high word frequency in general tweets]
39. Step 3: Extract topical keywords
• Strength of a word
- is a word relevant for that topical cluster?
[Figure: 2×2 matrix of low/high word frequency in topical tweets vs. general tweets; ✔ marks the cell relevant for the topic / cluster]
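The slides do not give a formula for the strength of a word. One simple way to express "frequent in topical tweets, not frequent in general tweets" is a ratio of the two shares, sketched below in Python. The ratio, the add-one smoothing and the function name are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter


def topical_strength(topical_tweets, general_tweets):
    """Score each word that occurs in the topical tweets.

    Both arguments are lists of token lists. The score is the share of
    topical tweets containing the word divided by the smoothed share of
    general tweets containing it (an illustrative assumption).
    """
    topical_df = Counter(w for t in topical_tweets for w in set(t))
    general_df = Counter(w for t in general_tweets for w in set(t))
    scores = {}
    for word, count in topical_df.items():
        topical_share = count / len(topical_tweets)
        # Add-one smoothing avoids division by zero for words that
        # never occur in the general stream.
        general_share = (general_df[word] + 1) / (len(general_tweets) + 1)
        scores[word] = topical_share / general_share
    return scores
```

Words that are common everywhere get a low score, while words concentrated in the topical tweets get a high one.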
40. Step 3: Extract topical keywords
• Result
- topical strength for each keyword
- sort them by relevancy
- select top 20 keywords
• choose clusters that contain these words
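Sorting by strength and selecting the matching clusters could look like the Python sketch below. The input shapes (a dict of scores and a list of word sets) and the function name are assumptions for illustration.

```python
def select_keywords(scores, clusters, k=20):
    """Pick the k strongest keywords and the clusters that contain them.

    scores:   {word: topical strength}
    clusters: list of sets of words
    """
    ranked = sorted(scores, key=scores.get, reverse=True)  # sort by relevancy
    keywords = ranked[:k]                                  # select the top k
    chosen = [c for c in clusters if any(w in c for w in keywords)]
    return keywords, chosen
```

Usage: `select_keywords(scores, clusters)` keeps the default of 20 keywords mentioned on the slide; a smaller `k` gives a sparser summary.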
41. Final Step
• Combine clusters and keywords
• create visual summary
42. Final Step
[Figure: keyword list (Keyword1, Keyword2, Keyword3, Keyword4, …) ordered from high relevancy to low relevancy]
44. Final Step
• Treemap Visualisation
- color = cluster
- area of word = frequency of word
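To make "area of word = frequency of word" concrete, here is a minimal slice-and-dice treemap layout in Python. The layout strategy is an assumption chosen for brevity (the slides do not name one), and the cluster name is returned so that a renderer can map it to a color.

```python
def treemap(clusters, width=100.0, height=100.0):
    """clusters: {cluster_name: {word: frequency}}.

    Returns a list of (word, cluster_name, x, y, w, h) rectangles whose
    area is proportional to the word frequency. Each cluster becomes a
    vertical strip and its words are stacked inside that strip.
    """
    total = sum(sum(words.values()) for words in clusters.values())
    rects = []
    x = 0.0
    for name, words in clusters.items():
        cluster_total = sum(words.values())
        strip_w = width * cluster_total / total   # strip width ~ cluster size
        y = 0.0
        for word, freq in words.items():
            cell_h = height * freq / cluster_total  # cell height ~ word share
            rects.append((word, name, x, y, strip_w, cell_h))
            y += cell_h
        x += strip_w
    return rects
```

Slice-and-dice tends to produce thin rectangles for skewed frequencies; squarified layouts trade simplicity for better aspect ratios.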
45. Final Step
• Wordcloud Visualisation
- color = cluster
- size of word = frequency of word
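The two visual encodings of the word cloud can be sketched as a mapping from words to a font size and a color. The linear scaling, the size range and the palette below are illustrative assumptions in Python, not values from the talk.

```python
def word_cloud_styles(clusters, min_size=10, max_size=48,
                      palette=("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728")):
    """clusters: {cluster_name: {word: frequency}}.

    Returns {word: (font_size, color)}.
    """
    freqs = [f for words in clusters.values() for f in words.values()]
    lo, hi = min(freqs), max(freqs)
    styles = {}
    for i, words in enumerate(clusters.values()):
        color = palette[i % len(palette)]  # color = cluster
        for word, freq in words.items():
            # Linear interpolation: size of word = frequency of word.
            scale = (freq - lo) / (hi - lo) if hi > lo else 1.0
            size = round(min_size + scale * (max_size - min_size))
            styles[word] = (size, color)
    return styles
```

The palette repeats when there are more clusters than colors, so a real visualization would need a longer palette or a generated one.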
46. Final Notes
• 4. Million Topical Tweets
• 15 Days
• User Study
- Treemap vs. Word Cloud
47. Thank You!
• Discussion
- Losing precision while cleaning tweets
- Losing meaning while removing stop words like „not“ (negation)
- Unigram vs. Multigram?
- ?