A Probabilistic Approach to Tweets' Sentiment Classification - ACII 2013 Conference

 Francesco Colace, Massimo De Santo, Luca Greco
DIEM –Università degli Studi di Salerno
{fcolace, desanto, lgreco}@unisa.it
ACII 2013 – Geneva, 2-5 September 2013

 Web 2.0 (or Web X.Y) rules!
 Social Networks, Blogs, Microblogs, Review
Collector Sites: a huge quantity of
heterogeneous and opinionated data

 Open issues:
o How to manage this information?
o How to extract the sentiment inside the data?
o How to understand something about the users?
o How to evaluate the opinion of people about some topics or
products?
 Sentiment Analysis

 Brief introduction to Sentiment Analysis
o Related Work
 Towards a Sentiment Analysis Framework
o The Proposed Approach
• The LDA Approach
• The Mixed Graph of Terms
• A sentiment mining algorithm
 Experimental results
 Conclusions and Future Work

 Sentiment:
o a thought, view, or attitude, especially one based mainly on emotion rather
than reason
 Sentiment Analysis (also known as Opinion Mining):
o use of Natural Language Processing (NLP) and computational
techniques to automate the extraction and classification of sentiment
from unstructured texts

 Consumer information
o Product reviews (Amazon, e-Bay, …)
 Marketing
o Consumer attitudes
o Trends
 Politics
o Politicians want to know voters’ points of view
o Voters want to know politicians’ stances and who else supports them
 Social
o Find like-minded individuals or communities

 Which features to adopt?
o Words
o Sentences
 How to interpret features for sentiment detection?
o As a bag of words
o By the use of annotated lexicons
o According to syntactic patterns
o Analyzing the paragraph structure

 Naïve Bayes
 Maximum Entropy Classifier
 SVM
 Markov Blanket Classifier
 … … …
 Latent Dirichlet Allocation (LDA)

 With the Bag of Words approach, a document
is represented as an unordered collection of its words
 Problems:
o Which words best express the sentiment of a text?
o How to compare the different «bags of words» derived from texts with the
same sentiment?
o Can a bag of words represent the documents’
domain of interest?
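
The second problem above can be made concrete with a minimal sketch. The helper names (bag_of_words, overlap), the tokenisation rule and the example sentences are all assumptions made for illustration; they are not taken from the presentation.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Return word counts for a text, ignoring case and word order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

def overlap(bag_a, bag_b):
    """Jaccard overlap between the vocabularies of two bags (0.0 to 1.0)."""
    words_a, words_b = set(bag_a), set(bag_b)
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

# Three hypothetical positive reviews.
a = bag_of_words("I loved this movie, great plot")
b = bag_of_words("Great movie, I loved the plot")
c = bag_of_words("Wonderful film with a superb story")

print(overlap(a, b))  # reordered wording: most words are shared
print(overlap(a, c))  # same sentiment, but no word in common
```

Texts a and c carry the same sentiment yet share no word, so a plain comparison of bags says nothing about their similarity.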

 The mixed Graph of Terms (mGT) is a graph-based representation
of documents
 In the proposed approach, a mixed Graph of Terms is obtained
by automatically extracting words with probabilistic
clustering techniques such as Latent Dirichlet Allocation (LDA)
 In a mixed Graph of Terms the words are linked according to
their mutual occurrence probability, and «aggregator words»
and «aggregated words» can be recognized
 Our proposal: a mixed Graph of Terms can be used as a
«sentiment filter»

 In the proposed approach, two different layers can be
recognized in a mixed Graph of Terms:
 The Aggregator Layer: the words with a higher degree of
interconnection with the other words in the documents
 The “Aggregated Words” Layer: the words that have a
higher degree of interconnection with one or more
Aggregator Words
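
The two layers can be pictured with a toy data structure. This is only a sketch under strong assumptions: it uses raw document co-occurrence frequencies as a stand-in for the LDA-derived probabilities of the proposed approach, and the function name build_mgt, its parameters and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_mgt(docs, n_aggregators=2, n_aggregated=3):
    """Build a toy two-layer graph: {aggregator: {aggregated_word: weight}}."""
    doc_sets = [set(d.lower().split()) for d in docs]
    vocabulary = Counter(w for s in doc_sets for w in s)
    pair_count = Counter()
    for s in doc_sets:
        for u, v in combinations(sorted(s), 2):
            pair_count[(u, v)] += 1

    def cooc(u, v):
        # Fraction of documents that contain both words.
        return pair_count[tuple(sorted((u, v)))] / len(doc_sets)

    # Degree of interconnection: total co-occurrence with all other words.
    degree = {w: sum(cooc(w, o) for o in vocabulary if o != w)
              for w in vocabulary}
    aggregators = sorted(degree, key=lambda w: (-degree[w], w))[:n_aggregators]

    graph = {}
    for agg in aggregators:
        linked = [(o, cooc(agg, o)) for o in vocabulary
                  if o != agg and cooc(agg, o) > 0]
        linked.sort(key=lambda pair: (-pair[1], pair[0]))
        graph[agg] = dict(linked[:n_aggregated])
    return graph

docs = ["great movie great cast", "great movie", "boring movie"]
print(build_mgt(docs, n_aggregators=1, n_aggregated=2))
```

Each key of the returned dictionary plays the role of an aggregator word, and its nested dictionary holds the aggregated words with their link weights.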

 In natural language processing, Latent Dirichlet Allocation (LDA) is a
generative model that allows sets of observations to be explained by
unobserved groups that explain why some parts of the data are similar
 For example, if observations are words collected into documents, it
posits that each document is a mixture of a small number of topics and
that each word's creation is attributable to one of the document's topics
 The basic idea is that the documents are represented as random
mixtures over latent topics, where a topic is characterized by a
distribution over words
 By the use of the Latent Dirichlet Allocation technique a set of
documents can be represented as a mixed Graph of Terms
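
The generative story described above (a document is a mixture of topics, and each word is drawn from one of those topics) can be sketched with the standard library alone. The two topics, their word probabilities and the alpha values are invented for illustration, and this covers only the generative direction, not the inference that learns topics from real documents.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution via normalised Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate words: draw a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(length):
        # Pick the topic responsible for this word.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[z].items())
        # Pick the word from that topic's distribution over words.
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics: each one is a distribution over words.
topics = [
    {"great": 0.5, "love": 0.3, "movie": 0.2},
    {"boring": 0.5, "hate": 0.3, "movie": 0.2},
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(words)
```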

 Step_1: Learn a mixed Graph of Terms from
labelled documents (i.e. Positive or
Negative), obtaining:
o mGT positive
o mGT negative
 Step_2: Use the mixed Graphs of Terms as filters
in order to classify the sentiment of texts
o Comparing concepts that are both in the mGTs and
in the text
o Comparing words that are both in the mGTs and in
the text
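
Step_2 can be sketched as a comparison of one text against the two graphs. Several things here are assumptions and not the authors' algorithm: the graph format ({aggregator: {aggregated_word: weight}}), treating aggregator words as the "concepts", the additive scoring rule, the concept_weight parameter, the "undecided" outcome for ties, and the two tiny example graphs.

```python
def mgt_score(tokens, mgt, concept_weight=1.0):
    """Score a set of tokens against one graph of terms."""
    score = 0.0
    for aggregator, aggregated in mgt.items():
        if aggregator in tokens:
            # Concept match: the aggregator word itself occurs in the text.
            score += concept_weight
        # Word matches: add the link weight of each aggregated word found.
        score += sum(w for word, w in aggregated.items() if word in tokens)
    return score

def classify(text, mgt_positive, mgt_negative):
    """Return the label of the graph that matches the text more strongly."""
    tokens = set(text.lower().split())
    pos = mgt_score(tokens, mgt_positive)
    neg = mgt_score(tokens, mgt_negative)
    if pos == neg:
        return "undecided"
    return "positive" if pos > neg else "negative"

# Hypothetical graphs; real ones would be learned in Step_1.
mgt_positive = {"great": {"movie": 0.6, "cast": 0.3}}
mgt_negative = {"boring": {"movie": 0.5, "plot": 0.4}}

print(classify("A great movie with a great cast", mgt_positive, mgt_negative))
print(classify("boring plot", mgt_positive, mgt_negative))
```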

 Dataset: Movie Reviews
Approach                  Accuracy (%)
Support Vector Machine*   82.90
Naive Bayes*              81.50
Maximum Entropy*          81.00
mGT-LDA                   88.50
*[Bo Pang, 2002]

 Dataset: Real Tweets related to Politics
 Training Set: 3980 Tweets
 Test Set: 32185 Tweets
Approach                  Accuracy (%)
mGT-LDA                   87.10
SVM                       79.20
Naive Bayes               76.60
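
As a reading aid for the accuracy figures, the sketch below shows how accuracy is commonly computed and derives the margins of mGT-LDA over the two baselines from the values reported for the tweets dataset. It assumes those values are percentages; the toy labels and the helper name accuracy are hypothetical.

```python
def accuracy(predictions, labels):
    """Percentage of predictions that equal the true labels."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return 100.0 * correct / len(labels)

# Toy example with made-up labels: 3 of 4 predictions are correct.
print(accuracy(["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"]))

# Values reported for the tweets dataset (assumed to be percentages).
reported = {"mGT-LDA": 87.10, "SVM": 79.20, "Naive Bayes": 76.60}
margins = {name: round(reported["mGT-LDA"] - value, 2)
           for name, value in reported.items() if name != "mGT-LDA"}
print(margins)
```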

http://193.205.190.209/elezioni2013/

[Figure: accuracy plotted over days]

Masterchef - http://193.205.190.209/tvshow/masterchef/

 Pro:
o Independent of language
o Fast classification
o Continuous updating
o Small training set
 Cons:
o In general, the mGT building process takes a long time
o An annotated lexicon is needed

 To improve the classification through continuous updating of
the training set
 To introduce SentiWordNet as the annotated lexicon
 To adopt an ontological formalism for a better
representation of the mGT
 To build a bigger dataset of tweets

Don’t forget to tweet your sentiment!!! 
