My INSURER PTE LTD - Insurtech Innovation Award 2024
Twinder: A Search Engine for Twitter Streams
1. Twinder: A Search Engine for Twitter Streams
#ICWE2012 Berlin, Germany
July 25th, 2012
Ke Tao, Fabian Abel, Claudia Hauff, Geert-Jan Houben
Web Information Systems, TU Delft
1
2. Get information from Twitter
How do people use Twitter as a source of information?
• Twitter is more like a news media.
• How do people search Twitter?
• Search on Twitter (Teevan et al.)
Web Search Twitter Search
Query length (chars) 18.80 12.00
Query length (words) 3.08 1.64
Is a celebrity name 3.11% 15.22%
Twinder: A search engine for Twitter streams 2
3. Research Questions
What are the challenges we are facing?
1. Given a topic, can we identify the relevant tweets
based on the characteristics of the tweets?
2. Are semantics meaningful for determining the tweets’
relevance for a topic?
3. How can we design an architecture that is scalable?
Twinder: A search engine for Twitter streams 3
4. Search on Twitter
What you can do on current Twitter?
• Twitter search interface
• Ordered by time
• Keyword-based match
Twinder: A search engine for Twitter streams 4
6. Core Components of Twinder
Feature Extraction
• Receive Twitter messages from Social Web Streams
• Features of two categories:
• (1) Topic-sensitive features
• (2) Topic-insensitive features
• Different extracting strategies designed for different features
Twinder: A search engine for Twitter streams 6
7. Core Components of Twinder
Feature Extraction Task Broker
• Twinder makes use of MapReduce and cloud computing
infrastructures to allow for high scalability and frequent
updates of its multifaceted index.
• Feature Extration Task Broker dispatches features
extration tasks and indexing tasks to cloud computing
infrastructure.
Twinder: A search engine for Twitter streams 7
8. Core Components of Twinder
Relevance Estimation
• Accepting search queries from front-end, passing them to
Feature Extration Component.
• Tweets are classified into the relevant and the non-relevant
by Relevance Estimation component, and are further
delivered to front-end for rendering.
• Twinder can learn the classification model, initially from
training dataset, then from usage data.
Twinder: A search engine for Twitter streams 8
9. Efficiency of Indexing
How good does Twinder make use of cloud-computing
infrastructure?
Corpus size Mainstream Server EMR(10 instances)
100k (13MBytes) 0.4 min 5 min
1m (122MBytes) 5 min 8 min
10m (1.3GBytes) 48 min 19 min
32m (3.9GBytes) 283 min 47 min
Twinder: A search engine for Twitter streams 9
10. Features of Microposts
Hypothesis H1: The greater the keyword-
based relevance score, the more relevant
and interesting the tweet is to the topic.
We already have keyword-based relevance,
and…?
Topic sensitive Topic insensitive
Keyword-based
relevance
?
Twinder: A search engine for Twitter streams 10
11. Semantic-based relevance
Expand the queries to match more tweets
dbp:United_States
Reformulated query is expected to get a
dbp:Hu_Jintao
more accurate retrieval score.
Hypothesis H2 : The greater the semantic-
based relevance score, the more relevant
and interesting the tweet is.
Twinder: A search engine for Twitter streams 11
12. Semantic-based relatedness
Is there a semantic overlap between the query and the tweet?
dbp:the_United_States
dbp:Hu_Jintao
Hypothesis H3 : If a tweet is considered to
be semantically related to the query then it
is also relevant and interesting for the user.
Twinder: A search engine for Twitter streams 12
13. Overview of features
What do we have now?
Topic sensitive Topic insensitive
Keyword-based ?
Semantic-based
Twinder: A search engine for Twitter streams 13
14. Syntactical feature : Hashtag
Is a tweet more relevant if it contains a #hashtag?
Hypothesis 4: tweets that contain hashtags
are more likely to be relevant than tweets
that do not contain hashtags.
Twinder: A search engine for Twitter streams 14
15. Syntactical feature : hasURL
Is a tweet that contains a URL more relevant?
Hypothesis 5: tweets that contain a URL are
more likely to be relevant than tweets that
do not contain a URL.
Twinder: A search engine for Twitter streams 15
16. Syntactical feature : isReply
Is a tweet which is a reply to @somebody more relevant?
Hypothesis 6: tweets that are formulated as
a reply to another tweet are less likely to be
relevant than other tweets.
Twinder: A search engine for Twitter streams 16
17. Syntactical feature : length
Does the length of a tweet influence its relevance for a topic?
Hypothesis 7: the longer a tweet, the more
likely it is to be relevant and interesting.
Twinder: A search engine for Twitter streams 17
18. Overview of features
Short summary
Topic sensitive Topic insensitive
Keyword-based Syntactical features
Semantic-based ?
Are there further features that
allow for estimating the relevance?
Twinder: A search engine for Twitter streams 18
19. Semantic features
Find semantics in a tweet to estimate the relevance
dbp:Tim_Berners-Lee dbp:World_Wide_Web
dbp:France
dbp:Lyon
dbp:International_World_Wide_Web_Conference
Twinder: A search engine for Twitter streams 19
20. Semantic features : #entity
Is a tweet with more entities more interesting?
• 5 entities extracted.
Hypothesis 8: the more entities a tweet
mentions, the more likely it is to be relevant
and interesting.
Twinder: A search engine for Twitter streams 20
21. Semantic features : diversity
How many types are there in the entities?
• 4 types of entities
Hypothesis 9: the greater the diversity of
concepts mentioned in a tweet, the more
likely it is to be interesting and relevant.
Twinder: A search engine for Twitter streams 21
22. Semantic features : sentiment
Was the author of the tweet happy or not?
• Sentiment : Neutral
Hypothesis 10: the likelihood of a tweet’s
relevance is influenced by its sentiment
polarity.
Twinder: A search engine for Twitter streams 22
23. Overview of features
By now, we have 4 types of features.
Topic sensitive Topic insensitive
Keyword-based Syntactical
Semantic-based Semantics
Can we utilize the contextual
information of tweets?
Twinder: A search engine for Twitter streams 23
24. Contextual features
Does the number of followers influence the relatedness?
Hypothesis 11: The higher the number of
followers a creator of a message has, the
more likely it is that her tweets are relevant.
Twinder: A search engine for Twitter streams 24
25. Contextual features
Or the number of followers that the author appears in?
Hypothesis 12: The higher the number of
lists in which the creator of a message
appears, the more likely it is that her tweets
are relevant.
Twinder: A search engine for Twitter streams 25
26. Contextual features
How long has been the author on Twitter?
Hypothesis 13: The older the Twitter
account of a user, the more likely it is that
her tweets are relevant.
Signed up Post
July 2008 June 2012
Twinder: A search engine for Twitter streams 26
27. Summary of Features
The features
Topic sensitive Topic insensitive
Keyword-based Syntactical
Semantic-based Semantics
Contextual
Twinder: A search engine for Twitter streams 27
28. Analysis
• Research Questions:
1. Which features are more influential on predicting the
relatedness of a tweet to a certain topic?
2. Which types of features are more important? Are
semantics meaningful?
3. What’s the performance that we can achieve by utilizing
these features?
• Twinder Setup
• Consider the search problem as a classification task
• Classification algorithm = Logistic Regression
Twinder: A search engine for Twitter streams 28
29. Dataset
From TREC 2011 Microblog Track
• Twitter corpus
• 16 million tweets (Jan. 24th, 2011 – Feb. 8th)
• 4,766,901 tweets classified as English
• 6.2 million entity-extractions (140k distinct entities)
• Relevance judgments
• 49 topics
• 40,855 (topic, tweet) pairs
• 60.31 relevant tweets per topic (on average)
Twinder: A search engine for Twitter streams 29
30. Results
Which type of features matters?
Features Precision Recall F-measure
keyword relevance 0.3036 0.2851 0.2940
semantic relevance 0.3050 0.3294 0.3167
topic-sensitive 0.3135 0.3252 0.3192
topic-insensitive 0.1956 0.0064 0.0123
without semantics 0.3363 0.4618 0.3965
without sentiment 0.3701 0.3923 0.4048
without context 0.3827 0.4714 0.4225
all features 0.3674 0.4736 0.4138
Overall, we can achieve the precision and
recall of over 35% and 45% respectively by
applying all the features.
Twinder: A search engine for Twitter streams 30
31. Weights of features Syntactical
Which feature matters? 2
1
Keyword-based 0 hasHashtag hasURL isReply length
2
-1
1
0 Semantics
-1 Keyword-based relevance 2
1
Semantic-based 0
2 #entities diversity sentiment
-1
1
Contextual
0
2
-1 Relevance Relatedness
1
0 #followers #lists Age
-1
Twinder: A search engine for Twitter streams 31
32. Topics of different categories
The impact on the performance and models
• 49 topics categorized into 2 parts w.r.t. 3 dimensions:
• Popularity
• Gobal vs. Local
• Temporal persistence
• Popularity
• Higher recall for popular topics
• Less impact from sentiment features on unpopular topics
• Temporal persistence
• Higher performance on shorter-term topics
• Less impact from sentiment features on persistent topic
Twinder: A search engine for Twitter streams 32
33. Conclusions
What are our contributions?
1. Twinder search engine proposed: analyzing various
features to determine the relevance and interestingness of
Twitter messages for a given topic.
2. Scalability demonstrated for the Twinder search engine.
3. Extensive analysis on 13 features along two-dimensions:
topic-sensitive features and topic-insensitive features.
Twinder: A search engine for Twitter streams 33
34. Conclusions
The lessons learned
1. The learned models which take advantage of semantics
and topic-sensitive features outperform those which do not
take the semantics and topic-sensitive features into
account.
2. Contextual features that characterize the users who are
posting the messages have little impact on the relevance
estimation.
3. The importance of a feature differs depending on the topic
characteristics; for example, the sentiment-based features
are more important for popular than for unpopular topics.
Twinder: A search engine for Twitter streams 34