Sentiment analysis over Twitter offers organisations and individuals a fast and effective way to monitor the public's feelings towards them and their competitors. To assess the performance of sentiment analysis methods over Twitter, a small set of evaluation datasets has been released in the last few years. In this paper, we present an overview of eight publicly available and manually annotated evaluation datasets for Twitter sentiment analysis. Based on this review, we show that a common limitation of most of these datasets, when assessing sentiment analysis at target (entity) level, is the lack of distinct sentiment annotations for the tweets and for the entities contained in them. For example, the tweet "I love iPhone, but I hate iPad" can be annotated with a mixed sentiment label, but the entity iPhone within this tweet should be annotated with a positive sentiment label. Aiming to overcome this limitation, and to complement current evaluation datasets, we present STS-Gold, a new evaluation dataset where tweets and targets (entities) are annotated individually and therefore may present different sentiment labels. This paper also provides a comparative study of the various datasets along several dimensions, including total number of tweets, vocabulary size and sparsity. We also investigate the pair-wise correlation among these dimensions, as well as their correlation with the sentiment classification performance on different datasets.
Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold
1. Evaluation Datasets for Twitter Sentiment Analysis
A survey and a new dataset, the STS-Gold
Hassan Saif, Miriam Fernandez, Yulan He and Harith Alani
Knowledge Media Institute, The Open University,
Milton Keynes, United Kingdom
1st Workshop on Emotion and Sentiment in Social and Expressive Media: Approaches and perspectives from AI
2. Outline
• Definition & Background
• Evaluation Datasets for Twitter Sentiment Analysis
• STS-Gold
• Comparative Study
• Conclusion
3. Sentiment Analysis – Definition
“Sentiment analysis is the task of identifying positive and negative opinions, emotions and evaluations in text”
• “The main dish was delicious” → Positive
• “It is a Syrian dish” → Neutral
• “The main dish was salty and horrible” → Negative
5. Evaluation Datasets for Twitter Sentiment Analysis
Dataset:
• SA Level
• SA Task
• No. of Tweets
• Construction & Annotation
• Vocabulary Size
• Class Distribution
• Sparsity
7. What is Missing?
• Details about the annotation methodology (STS, HCR, Sanders)
• Entity-level Sentiment Evaluation:
– Most works focus on assessing the performance of sentiment classifiers at the tweet level (STS, OMD, SS-Tweet, Sanders)
– Datasets that allow for sentiment evaluation at the entity level assign similar sentiment labels to the tweet and the entities within it (HCR, WAB, GASP)
8. STS-Gold
Enables evaluation at both the entity and tweet levels
Tweets and entities are annotated independently
Contains 58 Entities & 3000 Tweets
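One way to picture independent annotation is a record that keeps the tweet label separate from a per-entity label map. The sketch below is illustrative only: the class name `AnnotatedTweet`, its fields and the label strings are assumptions, not the STS-Gold file format, and the `iPad` label is inferred from the tweet text rather than taken from the dataset.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AnnotatedTweet:
    """Hypothetical record: tweet label and entity labels are stored separately."""
    text: str
    tweet_label: str
    entity_labels: Dict[str, str] = field(default_factory=dict)

    def label_for(self, entity: str) -> str:
        # Entity-level evaluation reads the entity's own label,
        # never the tweet-level label.
        return self.entity_labels[entity]


example = AnnotatedTweet(
    text="I love iPhone, but I hate iPad",
    tweet_label="mixed",
    # The iPad label is an assumption inferred from the wording of the tweet.
    entity_labels={"iPhone": "positive", "iPad": "negative"},
)
```

With this layout, a tweet-level classifier would be scored against `tweet_label`, while an entity-level classifier would be scored against `label_for("iPhone")` and `label_for("iPad")`, which may differ from the tweet label.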
9. Data Collection
STS Corpus (180K Tweets)
→ Entity-Extraction
→ Identify Frequent Concepts
→ Select 28 Entities (Top & Mid Frequent Entities)
→ Select 100 Tweet/Entity: 2800 Tweets
→ +200 tweets: 3000 Tweets
→ Entity-Extraction: 147 Entities
→ STS-Gold
Entity-Extraction tool: Alchemy API
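The "Select 100 Tweet/Entity" step can be sketched as a per-entity selection over tweets that are already grouped by entity. This is a rough illustration under stated assumptions: the helper name `sample_tweets_per_entity` is hypothetical, the slide does not say how the 100 tweets were chosen (random sampling is assumed here), and the toy corpus is generated placeholder text.

```python
import random
from typing import Dict, List, Sequence


def sample_tweets_per_entity(
    tweets_by_entity: Dict[str, Sequence[str]],
    per_entity: int = 100,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Pick up to `per_entity` tweets for each entity (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    selected: Dict[str, List[str]] = {}
    for entity, tweets in tweets_by_entity.items():
        k = min(per_entity, len(tweets))  # entities with few tweets keep all of them
        selected[entity] = rng.sample(list(tweets), k)
    return selected


# Placeholder corpus: 28 entities with 150 generated "tweets" each.
toy = {f"entity_{i}": [f"tweet {i}-{j}" for j in range(150)] for i in range(28)}
picked = sample_tweets_per_entity(toy)
total = sum(len(tweets) for tweets in picked.values())  # 28 * 100 = 2800
```

The total matches the 2800 tweets on the slide because 28 entities times 100 tweets per entity is 2800; the extra 200 tweets are added in a separate step.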
13. Comparative Study.1
Vocabulary Size vs. No. of Tweets
- There is a high correlation between the vocabulary size and the number of tweets (ρ = 0.95)
- However, increasing the number of tweets does not always increase the vocabulary size (OMD)
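A correlation such as the ρ above can be computed from one (number of tweets, vocabulary size) pair per dataset. The sketch below assumes ρ denotes a Pearson correlation coefficient, which the slide does not state, and the numbers are made up for illustration; they are not the statistics of the surveyed datasets.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up numbers for illustration only, not the datasets in the survey.
num_tweets = [1000, 2000, 3000, 4000]
vocab_size = [3000, 5500, 8000, 9000]
rho = pearson(num_tweets, vocab_size)
```

A value of `rho` close to 1 means that datasets with more tweets tend to have larger vocabularies, but it does not guarantee this for every individual dataset, which is the point of the OMD remark.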
14. Comparative Study.2
Data Sparsity
Dataset sparsity is an important factor that affects the overall performance of machine learning classifiers [17]. According to Saif et al., tweets are sparser than other types of data (e.g., movie review data). In this section, we aim to compare the presented datasets.
- Twitter datasets are generally very sparse
- Increasing either the number of tweets or the vocabulary size increases the sparsity degree of the dataset:
- ρno_of_tweets = 0.71
- ρvocabulary_size = 0.77
To calculate the sparsity degree of a given dataset we use [13]:
$$S_d = 1 - \frac{\sum_{i}^{n} N_i}{n \times |V|}$$
where $N_i$ is the number of distinct words in tweet $i$, $n$ is the number of tweets in the dataset and $|V|$ the vocabulary size.
The TweetNLP tokenizer can be downloaded from http: … TweetNLP/
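The sparsity degree, one minus the sum of distinct words per tweet divided by the number of tweets times the vocabulary size, can be computed directly from tokenised tweets. This is a minimal sketch: the function name `sparsity_degree` is hypothetical, and lowercased whitespace splitting stands in for the TweetNLP tokenizer mentioned on the slide, so the values it gives will differ from those obtained with a real tweet tokenizer.

```python
from typing import Iterable, List, Set


def sparsity_degree(tweets: Iterable[str]) -> float:
    """Return 1 - (sum of distinct words per tweet) / (n * |V|).

    Tokenisation is lowercased whitespace splitting, which is an assumption
    made for this sketch and not the tokenizer used in the study.
    """
    token_sets: List[Set[str]] = [set(t.lower().split()) for t in tweets]
    n = len(token_sets)                      # number of tweets
    vocabulary = set().union(*token_sets)    # V: all distinct words in the dataset
    if n == 0 or not vocabulary:
        raise ValueError("need at least one non-empty tweet")
    distinct_total = sum(len(s) for s in token_sets)  # sum over i of N_i
    return 1.0 - distinct_total / (n * len(vocabulary))


# Three toy tweets: N_i = 2 each, n = 3, |V| = 4, so S_d = 1 - 6/12 = 0.5
toy_sparsity = sparsity_degree(["good phone", "bad phone", "good day"])
```

A dataset in which every tweet used the whole vocabulary would score 0, and the score approaches 1 as each tweet covers a smaller share of the vocabulary, which is why short tweets over a large vocabulary come out very sparse.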
15. Comparative Study.3
Classification Performance vs. Dataset Sparsity (1)
According to Makrehchi et al. (2008) and Saif et al. (2012): in a given dataset the classification performance and the sparsity degree are negatively correlated, i.e., increasing the dataset sparsity hinders the classification performance.
[Figure from M. Makrehchi and M.S. Kamel: Fig. 2. Classifier performance as a function of sparsity: (a) Rocchio, and (b) SVM. x-axis: Average Sparsity; y-axis: Average Classifier Performance; series: Industry Sectors, 20 newsgroups, Reuters.]
16. Comparative Study.3
Classification Performance vs. Dataset Sparsity (2)
- No correlation between the classification performance and the sparsity degree across the datasets (ρacc = −0.06, ρf1 = 0.23)
- The sparsity-performance correlation is intrinsic, meaning that it might exist within a dataset itself, but not necessarily across the datasets.
17. Conclusion
• Current datasets to evaluate Twitter sentiment classifiers:
– Focus on the tweet level.
– Assign similar sentiment labels to the tweets and the entities within them.
• STS-Gold allows for sentiment evaluation at both the tweet and the entity levels.
• A correlation between the vocabulary size and the number of tweets does not always exist.
• The sparsity-performance correlation is intrinsic, i.e., it only exists within the dataset itself, but not across the different datasets.