How to automatically distinguish between high-quality and low-quality content on Twitter?
Twitter is a rapidly growing microblogging platform that hosts content of large volume, great diversity and varying quality. In order to surface higher-quality content (e.g. posts mentioning news, events, useful facts or well-formed opinions) when a user searches for tweets on Twitter, we propose a new method to filter and rank tweets according to their quality. To model tweet quality, we devise a new set of link-based features in addition to content-based features. We examine the implicit links between tweets, URLs, hashtags and users, and propose novel metrics to reflect the popularity as well as the quality-based reputation of websites, hashtags and users. We then evaluate both the content-based and link-based features in terms of classification effectiveness and identify an optimal feature subset that achieves the best classification accuracy.
Presentation given at the DASFAA 2012 conference (15-18 April 2012, Busan, South Korea).
Authors: Jan Vosecky, Kenneth Wai-Ting Leung, and Wilfred Ng
Full paper: http://www.cse.ust.hk/~wilfred/paper/dasfaa12.pdf
Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links
1. Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links
Jan Vosecky, Kenneth Wai-Ting Leung, and Wilfred Ng
Department of Computer Science and Engineering
HKUST
Hong Kong
DASFAA'12
2. Agenda
Introduction
Proposed method
Quality features of tweets
Experiments
Conclusions
3. Introduction
4. Microblogs
[Figure: two example tweets annotated with user, mentioned user, timestamp, hashtag and URL link]
Both social network and social media
Links between users (follow, mention, re-tweet)
Users post updates (tweets)
5. Searching for “ipad” on Twitter
Around 50 tweets mentioning “iPad” were posted within a 1-minute period
6. Research challenge
Twitter: user-generated content
Short messages, often comments or opinions
High volume
Varying quality
“Most tweets are not of general interest (57%)” (Alonso et al.'10)
Information overload
Research questions:
How to distinguish content worth reading from useless or less important messages?
How to promote 'high quality' content?
10. Research goals
Quality-based tweet filtering
Filtering out low-quality tweets
In Twitter feeds
In search results
Quality-based tweet ranking
Re-ranking Twitter search results
For a given time period
11. Proposed Method
12. Representation of tweets
Vector-space model: not sufficient
Short tweet length, terms often malformed
Ignores special features in Twitter
Feature-vector representation
Extract features from tweet
Traditional features: e.g. length, spelling
Twitter-specific features:
Exploiting hashtags, URL links, mentioned usernames
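The Twitter-specific parts of such a feature vector can be pulled out of raw tweet text with simple pattern matching. A minimal sketch (the feature names below are illustrative, not the paper's exact attribute set):

```python
import re

def extract_features(tweet: str) -> dict:
    """Illustrative tweet feature extraction via regular expressions."""
    return {
        "length": len(tweet),
        "num_hashtags": len(re.findall(r"#\w+", tweet)),    # e.g. "#ipad"
        "num_mentions": len(re.findall(r"@\w+", tweet)),    # mentioned users
        "contains_link": int(bool(re.search(r"https?://\S+", tweet))),
        "is_retweet": int(tweet.startswith("RT @")),
        "is_reply": int(tweet.startswith("@")),
    }

extract_features("RT @user: Check this out http://t.co/abc #news")
```

Traditional features (length, spelling) would be appended to the same vector.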
13. Quality Features of Tweets
14. Feature categories
1. Punctuation and Spelling: number of exclamation marks; number of question marks; max. no. of repeated letters; % of correctly spelled words; no. of capitalized words; max. no. of consecutive capitalized words
2. Syntactic and semantic complexity: max. & avg. word length; length of tweet; percentage of stopwords; contains numbers; contains a measure; contains emoticons; uniqueness score
3. Grammaticality: has first-person part-of-speech; formality score; number of proper names; max. no. of consecutive proper names; number of named entities
4. Link-based: contains link; is reply-tweet; is re-tweet; no. of mentions of users; number of hashtags; URL domain reputation score; RT source reputation score; hashtag reputation score
5. Timestamp
15. Punctuation and spelling
Excessive punctuation
Number of exclamation marks
Number of question marks
Max. number of consecutive dots
Capitalization
Presence of all-capitalized words
Largest number of consecutive words in capital letters
Spellchecking
Number of correctly spelled words
Percentage of words found in a dictionary
Example: RT @_ChocolateCoco: WHO IS CHUCK NORRIS??!!!?? lls. He's only the greatest guy next to jesus lmao
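Signals like these can be computed with a few lines of string processing. A minimal sketch (illustrative, not the paper's implementation):

```python
import re

def punctuation_features(tweet: str) -> dict:
    """Punctuation and capitalization signals for one tweet."""
    feats = {
        "exclamations": tweet.count("!"),
        "questions": tweet.count("?"),
        # Longest run of consecutive dots, e.g. "wow...." -> 4.
        "max_consecutive_dots": max((len(m) for m in re.findall(r"\.+", tweet)), default=0),
        # Longest run of a single repeated letter, e.g. "loool" -> 3.
        "max_repeated_letters": max((len(m.group()) for m in re.finditer(r"([A-Za-z])\1*", tweet)), default=0),
    }
    # Longest run of consecutive all-caps words.
    run = best = 0
    for word in tweet.split():
        run = run + 1 if word.isupper() else 0
        best = max(best, run)
    feats["max_caps_run"] = best
    return feats
```

On the example tweet above, the caps run spans "WHO IS CHUCK NORRIS??!!!??" and the heavy `?`/`!` counts flag it as likely low quality.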
16. Syntactic and semantic complexity
Syntactic complexity
Tweet length
Max. & avg. word length
Percentage of stopwords
Presence of emoticons and other sentiment indicators
Presence of measure symbols ($, %)
Numbers – number of digits
Tweet uniqueness
Uniqueness of the tweet relative to other tweets by the author
17. Grammaticality
Parts-of-speech labelling
Presence of first person parts-of-speech
Formality score [Heylighen'02]
F = (noun freq. + adjective freq. + preposition freq. + article freq. − pronoun freq. − verb freq. − adverb freq. − interjection freq. + 100) / 2
Names
Number of 'proper names': words with a single initial capital letter
Number of consecutive 'proper names'
Number of named entities
F. Heylighen and J.-M. Dewaele. Variation in the contextuality of language: An empirical measure. Context in Context, special issue of Foundations of Science, 7(3):293–340, 2002.
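Given part-of-speech frequencies (as percentages of all words), the published formula translates directly into code; missing tags default to zero here for convenience:

```python
def formality_score(pos_freq: dict) -> float:
    """Heylighen & Dewaele formality score F.
    pos_freq maps a part-of-speech tag to its frequency as a
    percentage of all words in the tweet."""
    f = pos_freq.get
    return (f("noun", 0) + f("adjective", 0) + f("preposition", 0) + f("article", 0)
            - f("pronoun", 0) - f("verb", 0) - f("adverb", 0) - f("interjection", 0)
            + 100) / 2
```

Noun-heavy, article-heavy text scores high (formal); pronoun- and verb-heavy chatty tweets score low. An empty tag map yields the neutral value 50.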
18. Link-based features
Links to other items
Re-tweet (RT), reply tweet, mention of other users
Presence of a URL link
Number of hashtags, as indicated by the “#” sign
Link target's quality reputation
Metrics to reflect the quality of tweets which relate to a URL domain, a hashtag, or a user
19. URL domain reputation
Observation:
Tweets which link to news articles are usually of better quality than tweets which link to photo-sharing websites
[Figure: example tweets with quality labels Q=1 to Q=5; low-quality tweets link to Tweetpic.com, high-quality tweets link to NYtimes.com]
Questions:
What does the quality of tweets linking to a website say about its quality?
Can we predict the quality of future tweets linking to that website?
20. URL domain reputation
Step 1: URL translation
Short link to original link
bit.ly/e2jt9F → http://www.reuters.com/4151120
Step 2: summarize tweets linking to a URL domain
Accumulate “quality reputation” over time
21. URL domain reputation
Average URL domain quality: AvgQ(d) = (1/|Td|) Σ_{t ∈ Td} qt
Td = set of tweets linking to domain d
qt = quality label of tweet t
Weakness:
Does not reflect the number of inlink tweets in the score
Favours domains with few inlink tweets
22. URL domain reputation
Domain reputation score DRS(d), where AvgQ(d) is between [−1, +1]
“Collecting evidence” behaviour:
Score grows in magnitude as more good-quality inlink tweets accumulate
[Figure: DRS plotted against |Td| from 1 to 1000 (log scale), one curve per AvgQ value in {−1, −0.5, 0, 0.5, 1}]
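The transcript omits the exact DRS formula. The sketch below is an assumption consistent with the described behaviour: AvgQ rescaled to [−1, +1] and amplified logarithmically by the number of inlink tweets; the log10 scaling is hypothetical, not the paper's formula.

```python
import math

def avg_q(quality_labels, lo=1, hi=5):
    """AvgQ(d): mean quality label of tweets linking to domain d,
    rescaled from the label range [lo, hi] to [-1, +1]."""
    mean = sum(quality_labels) / len(quality_labels)
    return 2 * (mean - lo) / (hi - lo) - 1

def domain_reputation(quality_labels):
    """Hypothetical DRS: AvgQ scaled by log10 of the inlink-tweet count,
    so the score "collects evidence" as more tweets link to the domain."""
    return avg_q(quality_labels) * math.log10(1 + len(quality_labels))
```

With this form, a domain with many high-quality inlink tweets climbs well above one with the same AvgQ but only a handful of inlinks, fixing the weakness noted on the previous slide.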
24. Reputation of hashtag & user
[Figure: example tweets with quality labels Q=1 to Q=5; low-quality tweets use #justforfun, high-quality tweets use #DASFAA]
Hashtag reputation: #DASFAA vs. #justforfun
Re-tweet source user reputation: @barackobama vs. @wysz22212
25. Experiments
26. Dataset
10,000 tweets
100 users, 100 recent tweets per user
Users:
50 random users
50 influential users
Selected from listorious.com
5 categories: technology, business, politics,
celebrities, activism
10 users per category
27. Labelling
Crowdsourcing
Amazon Mechanical Turk
3 labels per tweet from different reviewers
Possible labels: 1 to 5
1 = low quality, 5 = high quality
Random order of tweets
28. Labelling results
Tweet quality distribution
Quality score:
29. Feature analysis
29 features in total
Top 5 features based on Information Gain:
0.374 Domain reputation
0.287 Contains link
0.130 Formality score
0.127 Num. proper names
0.113 Max. proper names
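Information Gain for a discretized feature is the drop in class entropy once the feature's value is known. A self-contained sketch (illustrative; the paper's tooling for this step is not named in the transcript):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(class) - H(class | feature), for one discrete feature."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond
```

A feature that perfectly separates high- from low-quality tweets attains IG equal to the class entropy; a constant feature attains 0.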
30. Feature selection
Greedy attribute selection
15 selected features: domain reputation, RT source reputation, formality, tweet uniqueness, no. named entities, % correctly spelled words, max. no. repeated letters, no. hashtags, contains numbers, no. capitalized words, is reply-tweet, is re-tweet, avg. word length, contains first-person, no. exclamation marks
31. Classification and Ranking
Method
Classification:
SVM, binary classification (high-quality vs. low-quality)
50/50 split for training/testing
Ranking:
Learning-to-rank (Rank SVM)
30 queries from 5 topic categories
Process:
1. Retrieve tweets matching a query
2. Extract features from the tweets
3. 'Query-tweet vector' pairs + quality scores of the tweets
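To make the classification step concrete: the paper trains an SVM, but the dependency-free sketch below substitutes a simple perceptron as the linear classifier over tweet feature vectors (a stand-in for illustration, not the authors' implementation):

```python
def train_linear(X, y, epochs=50, lr=0.1):
    """Perceptron stand-in for the SVM: learns a linear separator.
    X: feature vectors; y: +1 (high-quality) / -1 (low-quality)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin <= 0:  # misclassified or on the boundary: update
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    """Sign of the learned linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

# Toy 1-D data: a single illustrative feature (say, domain reputation),
# where higher values indicate higher quality.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [-1, -1, 1, 1]
w, b = train_linear(X, y)
```

In the real pipeline the vectors would hold the 15 selected features and the 50/50 train/test split would precede training.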
32. Classification results
Features                    #attrs   HQ Prec.  HQ Rec.  LQ Prec.  LQ Rec.  AUC
Link only                       1    0.798     0.702    0.894     0.934    0.818
TF-IDF                       3322    0.862     0.665    0.885     0.960    0.813
Subset.Reputation               3    0.812     0.746    0.909     0.936    0.841
Subset.SVM (“greedy”)          15    0.715     0.758    0.912     0.936    0.847
All quality features           29    0.815     0.660    0.882     0.944    0.802
All quality ftrs + TF-IDF    3351    0.739     0.775    0.915     0.899    0.837
(HQ = high-quality class, LQ = low-quality class; Prec. = precision, Rec. = recall)
Optimal feature set (15 attrs.) outperforms TF-IDF (3322 attrs.)
Link-based “reputation” features (3 attrs.) achieve the 2nd best result
Combining quality features + TF-IDF does not improve result
33. Classification results
Features                    #attrs   AUC
Link only                       1    0.818
TF-IDF                       3322    0.813
Subset.Reputation               3    0.841
Subset.SVM (“greedy”)          15    0.847
All quality features           29    0.802
All quality ftrs + TF-IDF    3351    0.837
[Charts: training time and storage cost per feature set]
Optimal feature set achieves reduced training time and storage cost
34. Ranking results
Features                 #attrs   NDCG@1  NDCG@2  NDCG@5  NDCG@10  MAP
Link only                    1    0.067   0.111   0.220   0.324    0.398
Subset.Reputation            3    0.822   0.777   0.777   0.764    0.661
Subset.SVM (“greedy”)       15    0.867   0.767   0.778   0.769    0.653
All quality features        29    0.733   0.733   0.763   0.753    0.637
Optimal feature set (15 attrs.) achieves the best result
Link-based “reputation” features (3 attrs.) achieve the 2nd best result
35. Conclusions
36. Summary
Method for quality-based classification and ranking of tweets
Proposed and evaluated a set of tweet features to capture tweet quality
Link-based features lead to the best performance
37. Future work
Consider different types of queries in Twitter
E.g. searching for hot topics, movie reviews, facts, opinions, etc.
Different features may be important in different scenarios
Incorporating recent hot topics
Personalized re-ranking
38. Q/A
39. Thank You
40. Related work
Spam detection
Bag-of-words, keyword-based
Feature-based approaches
Combinations
Social networks
Finding quality answers in Q-A systems
E.g. Yahoo Answers
Feature-based
Web search
Quality-based ranking of web documents
Feature-based quality score (WSDM'11)
41. ROC Curve
Area under the ROC curve: probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
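This probability interpretation gives a direct (if O(n·m)) way to compute AUC from classifier scores; a small sketch:

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen positive instance
    is scored above a randomly chosen negative one (ties count 1/2)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

A perfect separation yields 1.0 and a random scorer about 0.5, matching the AUC columns reported in the classification-results tables.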