Sentiment Analysis on Twitter Dataset using R Language
SENTIMENT ANALYSIS OF TWITTER DATA
ANARGHA GANGADHARAN
anarghagangadharan@gmail.com
ANJU ANIL
anjuanil1217@gmail.com
MARY LIS JOSEPH
marylisjp@gmail.com
PARVATHY D
parvathydevaraj8@gmail.com
B.Tech Scholars
Department of Computer Science
College of Engineering Cherthala
Abstract—Microblogging has become a very popular
communication tool. Millions of people share their views
and opinions on various topics on these sites, which have
therefore become a rich source of opinion. Among the many
microblogging sites, Twitter is one of the most popular.
Today it is a daily practice for many people to read the
news online. In this paper we examine sentiment analysis
of Twitter data, focusing on news channels and other news
sites that post about current news. The tweets about the
news posted daily are analysed to obtain the overall
sentiment of that news. We present a system that gives a
score indicating whether the news is positive or negative.
Each news item is tokenized, and its sentiment is
calculated using a naive Bayes classifier, which classifies
the data as positive, negative or neutral. The main feature
is that the sentiment calculation is done on real-time
data.
Key words: sentiment analysis, machine learning,
naive bayes classifier.
I. INTRODUCTION
Microblogging sites have become a part of our
day-to-day life as a source of various kinds of
information, because people rely on websites more than on
any other medium and can post real-time messages and
opinions on various topics. Among these sites we have
chosen Twitter as the platform for sentiment analysis
because of the facilities and features it provides; for
example, it lets organisations communicate directly with
their potential customers. The Twitter audience ranges
from regular users to celebrities, company
representatives, politicians, students and even
high-ranking government officials, including presidents.
It is therefore possible to collect text posts from users
of various categories.

Most work on sentiment analysis has been done on
subjective text types such as blogs, result prediction and
product reviews. Authors of such texts typically express
their individual opinions freely, and the sentiment may be
restricted to a single group of people or even to a single
person. The situation is different for news articles. News
can be good or bad, but it is seldom neutral. Analysing
the news and calculating the sentiment expressed by the
Twitter audience can provide a meaningful sense of how the
latest news impacts important entities. Another difference
between reviews and news is that reviews are frequently
about a relatively concrete object, the target subject,
whereas news articles cover a larger subject domain with
more complex event descriptions and a whole range of
targets.

Our paper concentrates on an experimental evaluation of a
set of real-time news items posted on Twitter by various
news channels and newspapers, and thereby evaluates the
overall impact of the news on people. We look at a news
article and obtain the tweets based on that news; a tweet
may be a link, an opinion or a query. We classify the news
as positive, negative or neutral and consider only
positive and negative news for sentiment calculation. This
paper is structured as follows. The first module is about
collecting the data. The second module is text
pre-processing. The third module deals with term
frequencies. The fourth module discusses the rugby data
set and term co-occurrences, and the fifth module deals
with data visualisation basics.
II. LITERATURE SURVEY
Social media accounts for an important share of
the web. Users have become co-creators of content on the
web and now contribute a major part of social media,
ranging from articles and news to reviews. This leads to
the creation of a large body of unstructured text on the
web. Among all social media, Twitter plays an important
role in interacting with people all around the world.
The task here is to analyse the sentiment of such data,
which is a pertinent research topic in recent times.
In previous studies, Namrata Godbole,
Manjunath Srinivasaiah and Steven Skiena performed
sentiment analysis on news articles and blogs. Kiran
Shriniwas Dodd, Dr. Mrs. Y. V. Haribhakta and Dr. Parag
Kulkarni also performed sentiment analysis on online news
media. However, not much research in opinion mining
considers blogs, and even less addresses microblogging.
Turney (2002) and Pang and Lee (2004) carried out
sentiment analysis as document-level classification,
whereas Hu and Liu (2004) and Kim and Hovy (2004) analysed
data at the sentence level. Bermingham and Smeaton (2010)
analysed data but did not break it into tokens and
handled only unigrams. Go et al. (2009) succeeded in
breaking data into tokens but also did not handle
n-grams. In "Sentiment Analysis of Twitter Data", Apoorv
Agarwal et al. included POS-specific prior polarity
features and demonstrated two kinds of models: tree
kernel and feature-based models. Alexander Pak and Patrick
Paroubek used TreeTagger for POS tagging and presented a
method for automatically collecting data used to train a
sentiment classifier; the authors used syntactic
structures to describe emotions or state facts. The work
by James Spencer and Gulden Uchyigit on sentiment
analysis of Twitter data dealt only with common NLP
processes for finding the sentiment or meaning of a given
phrase or text, and it gave an accuracy of only 50%.
In another paper on sentiment analysis of
news, by Alexandra Balahur, the sentiment of particular
news items is calculated, but no method is included for
evaluating the effect of negation and valence shifters.
In all the papers we considered as references for our
work, sentiment analysis on Twitter has been done only on
structured data such as product reviews, election
prediction and blogs; no past work has been done on the
news that is posted daily on Twitter. None of the past
work dealt with real-time data for sentiment calculation,
and none followed a specific algorithm for calculating
sentiment.
III. DATA DESCRIPTION
Twitter is one of the most famous social
networking sites, on which users can post real-time
messages called tweets. Tweets are short and limited to
140 characters. As a result of this constraint, users
employ wordplay, spelling mistakes and emoticons to
express their ideas. The following jargon is associated
with tweets.
Hashtags: A word or phrase prefixed with the hash symbol
(#) that identifies the topic of a tweet.
Emoticons: A representation of a facial expression that
conveys the user's feelings towards a particular topic.
Targets: A target is expressed with the @ symbol and
identifies a particular user.
We collected real-time messages from
Twitter. There were no restrictions on the collection of
data; the collection consists of all the tweets received.
After gathering the tweets we arranged them into two
types: positive and negative.
IV. COLLECTING DATA
The first step in collecting data is the
registration of our application. For this we log in to
our Twitter account and register a name and a description
for the application. After entering these details, a
consumer key and a consumer secret are obtained; these
should be kept private. From the configuration page we
obtain an access token and an access token secret; the
access granted to the application is read-only. Twitter
provides an API to interact with its services. We use
Tweepy, a Python library, to stream data from Twitter.
Tweepy provides a convenient Cursor interface to iterate
through different types of objects.
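
The registration and retrieval steps can be sketched roughly as follows. This is a minimal sketch and not the exact code used in this work: the environment variable names and the helper functions load_credentials and fetch_timeline are hypothetical, and the Tweepy calls (OAuthHandler, set_access_token, API, Cursor) assume a Tweepy release and Twitter API access level that still support them.

```python
import os

# Hypothetical environment variable names for the four credentials.
CREDENTIAL_KEYS = (
    "CONSUMER_KEY",
    "CONSUMER_SECRET",
    "ACCESS_TOKEN",
    "ACCESS_TOKEN_SECRET",
)


def load_credentials(env=os.environ):
    """Read the four secrets from a mapping so they stay out of the source."""
    missing = [key for key in CREDENTIAL_KEYS if key not in env]
    if missing:
        raise KeyError(f"missing credentials: {missing}")
    return {key: env[key] for key in CREDENTIAL_KEYS}


def fetch_timeline(creds, limit=10):
    """Return the text of up to `limit` tweets from the home timeline."""
    import tweepy  # third-party; imported here so the helper above has no dependency

    auth = tweepy.OAuthHandler(creds["CONSUMER_KEY"], creds["CONSUMER_SECRET"])
    auth.set_access_token(creds["ACCESS_TOKEN"], creds["ACCESS_TOKEN_SECRET"])
    api = tweepy.API(auth)
    # Cursor handles pagination; items(limit) stops after `limit` statuses.
    return [status.text for status in tweepy.Cursor(api.home_timeline).items(limit)]
```

Keeping the keys in the environment rather than in the script is one simple way to keep them private, as required above.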
V. TEXT ANALYSIS
Text analysis is used to extract meaningful
patterns from unstructured text. Here we use components
and concepts from text analysis to analyse the sentiment
in tweets. The process consists of multiple steps.

The first step is breaking the text into words, a process
known as tokenization. Its purpose is to split a tweet,
which is streamed in real time, into several smaller
units called tokens. Tokens can be either words or
phrases, and they are the primary building blocks for our
sentiment analysis. Tokenization is crucial for Twitter
data in particular, since the nature of the language used
poses many challenges.

In the second phase we extract meaningful terms and their
counts from the tweets, called term frequencies. This
phase contains three parts: counting terms, stopword
removal and term filtering. In counting terms we observe
which terms are most commonly used in the data set. In
every language some words are particularly common and do
not convey any special meaning; these are called
stopwords. After stopword removal, counting and sorting,
we get the most frequently used words.

Sometimes terms that occur together make more sense than
single terms; we apply this concept in term
co-occurrence. The visualization phase presents a graph
of the frequently used words. Finally, we calculate the
sentiment of real-time tweets using the naive Bayes
algorithm.
A. Tokenization
The tokenization is based on regular
expressions. Some specific types of tokens will not be
captured; this can be addressed by improving the regular
expressions, or by employing more advanced techniques
such as Named Entity Recognition. The key component of
the tokenizer is the regex_str variable, which is a list
of possible patterns. In particular, we need patterns for
emoticons, HTML tags, Twitter @usernames (@-mentions),
Twitter #hashtags, URLs, numbers, and words with and
without dashes and apostrophes. Punctuation and
whitespace may or may not be included in the resulting
list of tokens. All contiguous strings of alphabetic
characters are part of one token, and likewise with
numbers. Tokens are separated by whitespace characters,
such as a space or line break, or by punctuation
characters.
After tokenization with the NLTK libraries, @-mentions,
emoticons, URLs and #hashtags are preserved as individual
tokens. Table 1 (tokenization of tweets) shows what the
tokenized tweets look like: each token is preserved as
an individual item.
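
A regular-expression tokenizer of the kind described above might look like the following. It is a simplified sketch using only Python's standard re module rather than NLTK; the individual patterns in regex_str are illustrative assumptions and not the exact patterns used in this work.

```python
import re

# Order matters: more specific patterns come first so that, for example,
# a URL is not split into separate words.
regex_str = [
    r"[:=;][oO\-]?[D\)\]\(/\\OpP]",     # emoticons such as :) ;-) :D
    r"<[^>]+>",                         # HTML tags
    r"@\w+",                            # @-mentions
    r"#\w+",                            # hashtags
    r"https?://\S+",                    # URLs
    r"\d+(?:[.,]\d+)*",                 # numbers
    r"[A-Za-z]+(?:['\-][A-Za-z]+)*",    # words, with optional dashes or apostrophes
    r"\w+",                             # any other word-like token
    r"\S",                              # anything else, e.g. punctuation
]

tokens_re = re.compile("|".join(f"(?:{pattern})" for pattern in regex_str))


def tokenize(text):
    """Split a tweet into tokens, skipping whitespace."""
    return tokens_re.findall(text)


print(tokenize("RT @user: just an example! :D https://example.com #NLP"))
```

In this sketch punctuation is kept as separate tokens, so it has to be filtered out later if it is not wanted.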
B. Term Frequencies
In term frequency we extract the frequently
used meaningful tokens and their counts. This step can
be partitioned into three parts:
• Counting terms
• Stopword removal
• Term filter
Table 1: Tokenization of tweets

Tweets | Tokenized tweets
"How I feel when dealing with Unicode strings in #python n #programming https://t.co/xqFmmmyJiJ" | 'How', 'I', 'feel', 'when', 'dealing', 'with', 'Unicode', 'strings', 'in', '#Python', 'n', '#programming', 'http://t.co/xqFmmmyJiJ'
A $5 microcontroller with wi-fi that runs python #python | 'A', 'microcontroller', 'with', 'wi-fi', 'that', 'runs', '#', 'python'
A # python coding dojo to end the day @ Downham Market Academy #rocks | 'A', '#', 'python', 'coding', 'dojo', 'to', 'end', 'the', 'day', '@', 'Downham', 'Market', 'Academy', '#rocks'

By performing a simple word count we can find
the most commonly used terms in the data set.
In order to keep track of the frequencies while
we are processing the tweets, we can use
collections.Counter(), which internally is a
dictionary with some useful methods like
most_common().
Terms Count
The 42
It 25
Has 06
On 14
And 23
After processing the tokens, we get the
frequency of each word, as in the table above. Sometimes
the most frequent words are not exactly meaningful. This
is due to the presence of articles, conjunctions,
adverbs, etc. in a language, which are commonly called
stop-words. Stop-word removal is an important step that
should be considered during the pre-processing stages.
Anyone can build a custom list of stop-words or use
available lists; NLTK provides a simple list of English
stop-words. We also remove punctuation marks and terms
like RT (used for re-tweets) and via, which are not in
the default stop-word list. After counting and sorting,
we get the most commonly used terms.
Filtered term frequencies alone do not give us a deep
explanation of what the text is about.
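
The three parts above (counting, stop-word removal and term filtering) can be sketched with collections.Counter as follows. The small STOP_WORDS set is a hand-written stand-in for illustration only (NLTK's English list is much longer), and the term_frequencies helper and its keep parameter are hypothetical names.

```python
import string
from collections import Counter

# Illustrative stop-word list: a few common words, punctuation, plus RT and via.
STOP_WORDS = (
    {"the", "it", "has", "on", "and", "a", "an", "is", "to", "in", "of"}
    | set(string.punctuation)
    | {"rt", "via"}
)


def term_frequencies(token_lists, keep=lambda term: True):
    """Count lower-cased terms across tweets, skipping stop-words.

    `keep` is an optional term filter, e.g. to count hashtags only.
    """
    counts = Counter()
    for tokens in token_lists:
        terms = (token.lower() for token in tokens)
        counts.update(t for t in terms if t not in STOP_WORDS and keep(t))
    return counts


tweets = [
    ["RT", "@user", ":", "The", "#news", "is", "good"],
    ["The", "news", "on", "TV", "#news"],
]
print(term_frequencies(tweets).most_common(3))
print(term_frequencies(tweets, keep=lambda t: t.startswith("#")))
```

Passing a different keep function gives other term filters, for example terms that are neither hashtags nor mentions.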
C. Term co-occurrence
To place things in context, let us consider
sequences of two terms, because terms that occur together
give more insight into the meaning of the text than
single terms; see the bigrams table. Such pairs of
adjacent terms are called bigrams. The bigrams() function
from NLTK takes a list of tokens and produces a list of
tuples using adjacent tokens. If we decide to analyse
longer n-grams, that is, sequences of n tokens, it could
make sense to keep the stop-words, in case we want to
capture phrases like those given in the table.
Terms that occur together give us better information
about the meaning of a term, supporting applications such
as word disambiguation or semantic similarity. We build a
co-occurrence matrix that contains the number of times
the term x has been seen in the same tweet as the term y.
For each term, we then extract the most frequent
co-occurrent terms, creating a list of tuples.
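
Both ideas can be illustrated with the standard library alone. The sketch below is an assumption-laden illustration, not the code used in this work: the bigrams function here is a plain zip over adjacent tokens (standing in for the NLTK function of the same name), and cooccurrences and most_common_pairs are hypothetical helper names.

```python
from collections import Counter, defaultdict
from itertools import combinations


def bigrams(tokens):
    """Pairs of adjacent tokens, in order."""
    return list(zip(tokens, tokens[1:]))


def cooccurrences(token_lists):
    """Sparse co-occurrence matrix: com[x][y] = tweets containing both x and y.

    Each unordered pair is stored once, with x < y alphabetically.
    """
    com = defaultdict(Counter)
    for tokens in token_lists:
        for x, y in combinations(sorted(set(tokens)), 2):
            com[x][y] += 1
    return com


def most_common_pairs(com, n=5):
    """The n most frequent co-occurring pairs as ((x, y), count) tuples."""
    pairs = Counter()
    for x, row in com.items():
        for y, count in row.items():
            pairs[(x, y)] = count
    return pairs.most_common(n)


print(bigrams(["to", "be", "or", "not"]))
com = cooccurrences([["a", "b"], ["b", "a", "c"]])
print(most_common_pairs(com, 1))
```

Storing each pair once keeps the matrix half the size of a full symmetric table.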
D. Visualisation
A good pictorial representation
of our data can help us make sense of it and highlight
interesting insights. While there are several options for
creating plots in Python, using libraries like matplotlib
or ggplot, Vincent bridges the gap between a Python
back-end and a front-end that supports D3.js
visualisation, allowing us to benefit from both sides.
Using the list of most frequent terms (without hashtags)
from our rugby data set, we want to plot their
frequencies; we can plot many different types of charts
with Vincent.
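
As a rough illustration of the back-end side only, the most frequent terms can be turned into a small JSON document that a D3.js front-end could load. This sketch deliberately does not use Vincent; the chart_data and to_json helpers and the "x"/"y" field names are assumptions made for this example.

```python
import json
from collections import Counter


def chart_data(counts, n=20):
    """Labels and frequencies of the n most common terms, ready for a bar chart."""
    top = counts.most_common(n)
    return {"x": [term for term, _ in top], "y": [freq for _, freq in top]}


def to_json(counts, n=20):
    """Serialise the chart data so a JavaScript front-end can read it."""
    return json.dumps(chart_data(counts, n))


counts = Counter({"news": 3, "match": 2, "win": 1})
print(to_json(counts, 2))
```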
E. Naive Bayes Classifier Algorithm
The final step is to calculate the
sentiment of the real-time tweets using the naive Bayes
algorithm. We used naive Bayes (NB) classification
because it is a simple and natural method that combines
efficiency with reasonable accuracy. The extracted text
can be tokenised easily, and each token is treated as a
feature. It is evident that the words in a tweet are not
truly independent of one another, but the classifier
treats them as if they were: it is a classification
technique based on Bayes' theorem with an assumption of
independence among predictors. In simple terms, a naive
Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the
presence of any other feature.
A naive Bayes model is easy to build and particularly
useful for very large data sets. Along with its
simplicity, naive Bayes is known to outperform even
highly sophisticated classification methods in some
cases.

Example bigrams (see C. Term co-occurrence):
To be
Not to be
Miss you
I know
Look better

Here we use two types of data set: test data and
training data. The naive Bayes algorithm uses supervised
learning, which is the machine learning task of
inferring a function from labelled training data. The
training data consist of a set of training examples. In
supervised learning, each example is a pair consisting
of an input object and a desired output value. The
training data is the historical data.
Two different naive Bayes classifiers have been
built, according to two different strategies; here we
use the second classifier. It was trained on a
simplified training corpus and makes use of a polarity
lexicon. The corpus was simplified in that only positive
and negative tweets were considered; neutral tweets were
not taken into account. As a result, a basic binary (or
Boolean) classifier that identifies only positive and
negative tweets was trained. In order to detect tweets
without polarity (neutral tweets), the following basic
rule is used: if the tweet contains at least one word
that is also found in the polarity lexicon, then the
tweet has some degree of polarity. Otherwise, the tweet
has no polarity at all and is classified as neutral.
The binary classifier is well suited to determining
the basic polarity between positive and negative,
reaching a precision of more than 80% in a corpus
with just these two categories. Bayes' theorem
provides a way of calculating the posterior probability
P(c|x) from P(c), P(x) and P(x|c):

\[ P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \]

Above,
• P(c|x) is the posterior probability of class (c,
target) given predictor (x, attributes).
• P(c) is the prior probability of class.
• P(x|c) is the likelihood which is the probability of
predictor given class.
• P(x) is the prior probability of predictor.
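
As a small worked example of Bayes' theorem with the quantities listed above, suppose (purely hypothetical numbers, chosen for illustration) that positive and negative tweets are equally likely and that a given word x appears in 20% of positive tweets and 5% of negative tweets:

```latex
\begin{align*}
P(\text{pos}) &= P(\text{neg}) = 0.5, \qquad
P(x \mid \text{pos}) = 0.2, \qquad
P(x \mid \text{neg}) = 0.05 \\
P(x) &= P(x \mid \text{pos})\,P(\text{pos}) + P(x \mid \text{neg})\,P(\text{neg})
      = 0.2 \times 0.5 + 0.05 \times 0.5 = 0.125 \\
P(\text{pos} \mid x) &= \frac{P(x \mid \text{pos})\,P(\text{pos})}{P(x)}
      = \frac{0.1}{0.125} = 0.8
\end{align*}
```

Under these assumed values, seeing the word x raises the probability that the tweet is positive from 0.5 to 0.8.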
With this approach we are able to get almost 73%
accuracy. This is reasonably close to human agreement,
as people apparently agree on sentiment only around 80%
of the time.
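
The binary classifier and the lexicon rule for neutral tweets can be sketched as follows. This is a toy illustration under stated assumptions, not the classifier evaluated above: the class NaiveBayes and the function classify are hypothetical names, add-one (Laplace) smoothing is an assumption added here so that unseen word and class combinations do not produce zero probabilities, and the training tweets and lexicon are made up.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        # Log prior P(c), estimated from class frequencies in the training data.
        self.prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.counts[label].update(tokens)
        self.vocab = set().union(*self.counts.values())
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, tokens):
        size = len(self.vocab)

        def score(c):
            # log P(c) + sum of log P(token | c); P(x) is the same for every
            # class, so it can be dropped when comparing classes.
            return self.prior[c] + sum(
                math.log((self.counts[c][t] + 1) / (self.totals[c] + size))
                for t in tokens
                if t in self.vocab
            )

        return max(self.classes, key=score)


def classify(tokens, model, lexicon):
    """Neutral if no token is in the polarity lexicon, else the binary NB label."""
    if not any(t in lexicon for t in tokens):
        return "neutral"
    return model.predict(tokens)


train_docs = [["good", "news"], ["great", "win"], ["bad", "news"], ["terrible", "loss"]]
train_labels = ["positive", "positive", "negative", "negative"]
lexicon = {"good", "great", "bad", "terrible"}
model = NaiveBayes().fit(train_docs, train_labels)

print(classify(["great", "news"], model, lexicon))
print(classify(["match", "today"], model, lexicon))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.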
VI. CONCLUSION
We presented results for sentiment
analysis on Twitter based on daily news. We used SVM and
naive Bayes classifiers to find the sentiment of people
based on current news, and dealt with two possible kinds
of sentiment: positive and negative. We also handled
unigrams, bigrams and longer n-grams, and considered
hyphenated words, as well as tweets that come in the
form of a query or a link. As future work, we look
forward to developing an application that carries out
our textual analysis on voice data, and to extending
the analysis to specify the overall impact of news on
people as either positive or negative, along with its
root cause.
VII. REFERENCES
[1] “Large Scale Sentiment Analysis for News and
Blogs” by Namrata Godbole, Manjunath
Srinivasaiah, Steven Skiena.
[2] “Sentiment Analysis of Twitter Data” by
Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen
Rambow, Rebecca Passonneau.
[3] Apoorv Agarwal, Fadi Biadsy, and Kathleen
Mckeown 2009. “Contextual phrase-level polarity
analysis using lexical affect scoring and syntactic
n-grams”. Proceedings of the 12th Conference of
the European Chapter of the ACL.
[4] “Sentimentor: Sentiment Analysis of Twitter
Data” by James Spencer and Gulden Uchyigit.
[5] Pang, B., and Lee, L. 2008. “Opinion mining and
sentiment analysis.” Foundations and Trends in
Information Retrieval, Volume 2, Issue 1-2, 1–94.
[6] Pak, A., and Paroubek, P. 2010. “Twitter as a
corpus for sentiment analysis and opinion mining.”
[7] Pang, B., and Lee, L. 2008. “Opinion mining
and sentiment analysis.” Foundations and Trends
in Information Retrieval.