Visual Summaries of Topical Clusters in Twitter Streams

Semantic Modeling
Computational Framework for Generating Visual Summaries of Topical Clusters in Twitter Streams*
Authors: Miray Kas, Bongwon Suh
Presenter: Sebastian Alfers - HTW Berlin
* http://link.springer.com/chapter/10.1007%2F978-3-319-02993-1_9

Visual Summaries of Twitter Streams
http://flowingdata.com/wp-content/uploads/2010/02/treemap-revised1.gif
http://www.infobarrel.com/media/image/54054.jpg

Step 1: get & pre-process data
• Keywords
• Stream Tweets
• Preprocessing/Cleaning
Step 2: construct graph & clustering
• Construct Graph
• Clustering
Step 3: extract keywords & summarize
• Select Relevant Clusters
• Extract Topical Keywords
• Visual Cluster Summary

Input: Keywords
• initial set of Keywords
• similar to Twitter Search

Step 1: Stream Tweets
• HTTP-based API
- JSON, REST

• OAuth + HTTP
• here: Java library, used with Scala and the Play Framework

Step 1: Preprocessing
• transform Tweets
- easy-to-analyze / clean format
• Process of cleaning:
1. lowercase
2. remove URLs, user mentions and stop words
• like @user, "a" or "123"
3. remove special characters (#,.)
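
The cleaning steps above can be sketched in Python as follows. This is a minimal illustration, not the presenter's implementation: the function name `clean_tweet`, the regular expressions and the tiny `STOP_WORDS` set are assumptions made for the example, and stemming (removing tense and plurals) is left out. Hyphens are kept so that a term like "akka-http" stays one token.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "with", "for", "to", "in", "on"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Everything except lowercase letters, digits, whitespace and hyphens.
SPECIAL_RE = re.compile(r"[^a-z0-9\s-]")


def clean_tweet(text):
    """Return the list of cleaned tokens for one tweet."""
    text = text.lower()                 # 1. lowercase
    text = URL_RE.sub(" ", text)        # 2. remove URLs ...
    text = MENTION_RE.sub(" ", text)    #    ... and user mentions
    text = SPECIAL_RE.sub(" ", text)    # 3. remove special characters such as # and .
    # Strip stray hyphens left at token borders and drop empty tokens.
    tokens = [t.strip("-") for t in text.split()]
    # 2. (continued) drop stop words and purely numeric tokens such as "123".
    # Stop words are filtered last here so that "#a" or "a." are caught as well.
    return [t for t in tokens if t and t not in STOP_WORDS and not t.isdigit()]
```

With this sketch, the spellings SCALA, Scala, scala and #scala all end up as the same token.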

• Example Keywords:
- SCALA
- Scala
- scala
- #scala
(all four reduce to: scala)
• LingPipe Library*
- remove tense and plurals
*http://alias-i.com/lingpipe/

• Example Tweets (terms remaining after cleaning)
- new york time / reactive programming / tool scala scale / techrepublic
- akka-http based / reactive stream / scala scaladay

Step 2: Graph
• Word Co-Occurrence Graph
- Word = Node (Unigrams)
- Tweet = Link between Nodes
• Example
akka-http based stream reactive scala scaladay

Step 2: Graph
• Example
- Nodes: akka-http, based, stream, reactive, scala, scaladay
- Links: drawn between the nodes of the tweet

Step 2: Graph
• Co-Occurrence Graph
- connect nodes (words) within and between tweets
- add strength (weight) and cost (distance)
• Words that co-occur more frequently
- increase the strength
- decrease cost
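
A minimal Python sketch of such a co-occurrence graph is shown below. The function name `build_cooccurrence_graph` and the data layout are made up for the illustration. The slides do not give a formula for the cost, so the sketch assumes cost = 1 / strength, which matches the example values 0.5 and 1 used in the clustering slides.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(tweets):
    """Build a weighted word co-occurrence graph.

    tweets: list of token lists (already cleaned).
    Returns {(word_a, word_b): {"strength": int, "cost": float}},
    with word_a < word_b alphabetically.
    """
    strength = Counter()
    for tokens in tweets:
        # Every unordered pair of distinct words in one tweet is one link.
        for a, b in combinations(sorted(set(tokens)), 2):
            strength[(a, b)] += 1
    # Assumption: cost is the inverse of strength, so frequent pairs are "closer".
    return {pair: {"strength": s, "cost": 1.0 / s} for pair, s in strength.items()}
```

Two tweets that both contain "reactive" and "scala" give that link strength 2 and cost 0.5, while a pair seen in only one tweet keeps strength 1 and cost 1.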

Step 2: Graph
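• Summary
- the graphs of the individual tweets are added up (+) into one co-occurrence graph (=)
- example nodes: reactive, scala, stream, based, …, uses, programming, …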

Step 2: Clustering
• Here: "complete-link (max) clustering" algorithm*
- hierarchical clustering algorithm that forms clusters by merging subgroups
• Group words from tweets
- words that frequently appear together on a topic
- cluster = topic
* http://nlp.stanford.edu/IR-book/html/htmledition/single-link-and-complete-link-clustering-1.html

Step 2: Clustering
• Here: "complete-link (max) clustering" algorithm
• each node starts as an individual cluster
- Clusters = Nodes = Words in tweets
• close clusters are successively merged together
- close = judged by the highest cost (max distance) between the members of the two clusters
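
The merging procedure can be sketched in plain Python as below. This is an illustrative sketch of complete-link agglomerative clustering, not the code used in the talk: the function name, the `dist` dictionary layout and the `max_distance` stopping threshold are assumptions, and pairs without a known distance are treated as infinitely far apart.

```python
import math
from itertools import combinations


def complete_link_clusters(words, dist, max_distance):
    """Agglomerative complete-link clustering.

    words: iterable of node names.
    dist: dict mapping frozenset({a, b}) -> distance; missing pairs count as infinite.
    max_distance: hypothetical threshold; merging stops once the closest pair is farther.
    Returns (clusters, merges); merges is a list of (distance, merged_cluster).
    """
    clusters = [frozenset([w]) for w in words]  # each node starts as its own cluster
    merges = []

    def linkage(c1, c2):
        # Complete link: the distance between the two farthest members.
        return max(dist.get(frozenset([a, b]), math.inf) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest complete-link distance.
        d, c1, c2 = min(
            ((linkage(x, y), x, y) for x, y in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        if d > max_distance:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((d, c1 | c2))
    return clusters, merges
```

The recorded `merges` list is the information a dendrogram draws: each entry is one join and the distance at which it happens, so a lower `max_distance` corresponds to cutting the tree further from the root.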

Step 2: Clustering
• Graph Representation vs. Cluster Representation
- example nodes: reactive, scala, stream, based, …
- edge labels in the graph: cost = distance = 0.5 and cost = distance = 1
- distances shown in the cluster view as it is built up: 0.5, 1, 1 and 2

Step 2: Clustering
• Final step: Dendrogram
- tree diagram
- represents the arrangement of hierarchical clusters
• Why?
- easy to apply threshold metrics

Step 2: Clustering
• Dendrogram example
- closer to the root = lower similarity
- first cluster: reactive, scala
- leaves below the root: new, york, programming, …, akka-http, based, stream, scaladay, reactive, scala
- thresholds are marked on the tree

Step 3: Extract topical keywords

• keywords
- express a topic
- frequently used
- summarize the tweets' content
• Questions
- "What are the relevant keywords?"
- "In what clusters do they appear?"

• How?
- "topical tweets" vs. "general tweets"
• frequent in topical tweets
- search keywords "reactive scala"
• not frequent in general tweets
- general Twitter stream (all tweets)

• Strength of a word
- is a word relevant for that topical cluster?
- compare its frequency (low / high) in Topical Tweets with its frequency (low / high) in General Tweets
- high frequency in Topical Tweets + low frequency in General Tweets: ✔ relevant for topic / cluster
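
One way to turn this comparison into a number is sketched below. The slides do not state the scoring formula, so the ratio used here (relative frequency in the topical tweets divided by a smoothed relative frequency in the general tweets) is an assumption for illustration only, as are the function and parameter names.

```python
from collections import Counter


def topical_strength(topical_tweets, general_tweets, smoothing=1.0):
    """Score each word of the topical tweets; higher = more topic-specific.

    Assumption: score = relative frequency in the topical tweets divided by the
    additively smoothed relative frequency in the general tweets.
    Both arguments are lists of token lists.
    """
    topical = Counter(w for tokens in topical_tweets for w in tokens)
    general = Counter(w for tokens in general_tweets for w in tokens)
    n_topical = sum(topical.values())
    n_general = sum(general.values())
    vocab = len(set(topical) | set(general))
    scores = {}
    for word, count in topical.items():
        p_topical = count / n_topical
        # Smoothing keeps the ratio finite for words absent from the general stream.
        p_general = (general[word] + smoothing) / (n_general + smoothing * vocab)
        scores[word] = p_topical / p_general
    return scores
```

A word such as "scala" that is frequent in the topical tweets but rare in the general stream scores high, while an everyday word such as "new" scores low even if it also appears in topical tweets.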

• Result
- topical strength for each keyword
- sort them by relevancy
- select the top 20 keywords
• choose clusters that contain these words
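
The selection step can be sketched as follows. The function name and the rule "keep a cluster if it contains at least one selected keyword" are assumptions for this example; the slides only say that clusters containing these words are chosen.

```python
def select_keywords_and_clusters(scores, clusters, top_k=20):
    """Pick the top_k keywords by topical strength and the clusters containing them.

    scores: {word: topical strength}; clusters: iterable of word sets.
    """
    # Sort by descending strength; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    keywords = ranked[:top_k]
    keyword_set = set(keywords)
    # Assumption: a cluster is relevant if it shares at least one word with the keywords.
    relevant = [c for c in clusters if keyword_set & set(c)]
    return keywords, relevant
```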

Final Step
• Combine clusters and keywords
• create visual summary

Final Step
• Keywords ordered from high relevancy to low relevancy
- Keyword1
- Keyword2
- Keyword3
- Keyword4
- …

Final Step
• Treemap Visualisation
- color = cluster
- area of word = frequency of word

Final Step
• Wordcloud Visualisation
- color = cluster
- size of word = frequency of word
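
The two frequency mappings (area in the treemap, size in the word cloud) can be sketched as below. Both functions are hypothetical helpers that only compute the numbers; the linear font scaling and the point-size range are assumptions, and the actual drawing and the cluster colors are left out.

```python
def treemap_areas(frequencies, total_area=1.0):
    """Share of the treemap area per word, proportional to its frequency.

    frequencies: non-empty {word: count} with a positive total.
    """
    total = sum(frequencies.values())
    return {w: total_area * f / total for w, f in frequencies.items()}


def wordcloud_font_sizes(frequencies, min_pt=10, max_pt=48):
    """Font size per word, scaled linearly between min_pt and max_pt."""
    lo, hi = min(frequencies.values()), max(frequencies.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {
        w: min_pt + (f - lo) * (max_pt - min_pt) / span
        for w, f in frequencies.items()
    }
```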

Final Notes
• 4. Million Topical Tweets
• 15 Days
• User Study
- Treemap vs. Word Cloud

Thank You!
• Discussion
- Losing precision while cleaning tweets
- Losing meaning when removing stop words like "not" (negation)
- Unigram vs. Multigram?
- ?
