Twitter analysis
1. Performing sentiment analysis on Twitter data
(2011 Norway attacks)
Team –
Aparna Dhanashri Jayaprakash – 50094768
Himanshu Yadav – 50093151
Inder Puneet Singh – 50094241
Sabah Abdul Mannan Khan – 50094894
Vidya Mulukutla – 50095830
2. Analysis of Twitter Data Set
Introduction
Big Data is increasingly pertinent in today’s digitalized world and is used in many
different domains. Because social media is so pervasive, it is a natural source of data sets
for analysis in areas ranging from politics to entertainment. We chose ‘Twitter’ as our data
source because it has a wide user base that includes regular people as well as popular
individuals from the fields of media, movies, sports and politics. Many analytical results
can be derived from a popular and widely used social media platform like Twitter, and we
processed the data generated from it using an implementation built on Apache Hadoop and
Hive. To gauge the reactions of the users who responded to significant events in July 2011,
we performed a sentiment analysis. Sentiment analysis, also known as opinion mining, is the
process of extracting subjective information from text through natural language processing,
computational linguistics and text analysis. Two important and completely contrasting
events took place in July 2011, and we prepared a comparison analysis of them. The events
are described below:
The Norway attacks of 2011 were the deadliest attacks on the country. Two sequential
attacks took place within a span of two hours on 22nd July 2011. The first was a car bomb
that exploded in the executive government quarter, killing eight people and injuring
around 209. The second was a deadly assault on an island that was hosting a summer camp
organized by the youth division of the ruling party. A gunman gained access to the camp
and opened fire on the participants. This attack claimed 69 lives and seriously injured
110 people. The accused in the case, Anders Behring Breivik, was sentenced to 21 years of
imprisonment.
Amy Winehouse was a hugely popular British singer and songwriter. Her work was
critically as well as commercially acclaimed and she won multiple Grammy Awards for her
songs. Her sudden death from alcohol poisoning on 23rd July 2011 shocked millions of her
fans worldwide and sent online networking sites into a frenzy.
Hypothesis
Our hypothesis concerned how users from different geographical locations reacted to
both stories on Twitter. We assumed that the Norway attacks would affect the public more
than the death of Amy Winehouse and would garner more tweets, hashtags and retweets,
because an attack in which many lives were lost and even more people were critically
injured is the more important event. We compared the two events using sentiment analysis.
Technology
For our implementation, we used Apache Hadoop deployed on Amazon EC2 instances to
process the data. For the Hadoop master we used the m1.large instance type, whereas for
the Hadoop slaves we used the m1.small instance type. We chose the M1 general-purpose
instance types primarily because they are an extremely low-cost option for running
applications. They are suited to workloads that need moderate CPU performance.
Apache Hive was used to analyze, summarize and query the data using a SQL-like language
known as HiveQL.
Data Preparation
Data Selection
The extracted data was segregated into different tables for convenience of analysis.
One of the tables from the Norway attacks event is shown below:
Hashtag          Count
politics         32
NFL              32
utoya            27
PrayForNorway    27
Utøya            27
CNN              26
Islam            24
oslobomb         24
Data Cleaning:
Contrary to our expectation that the data set would be limited to one specific time
period of, say, one year, the information in the data set spanned many years, so no single
time period held a high density of information. This meant that we first had to find
events that occurred within a specific time period. In addition, because the data set was
acquired from a varied number of sources, it contained a lot of redundant data, which
makes the deletion of duplicate information mandatory before any analysis can be
conducted.
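As an illustration of restricting a multi-year data set to one event window, the following is a minimal Python sketch rather than the Hive-based processing used in this project. It assumes each tweet is a dictionary with a `created_at` field in the classic Twitter REST API timestamp format (for example `Fri Jul 22 14:30:00 +0000 2011`); the function names are hypothetical.

```python
from datetime import datetime, timezone

# Assumed timestamp layout of the "created_at" field in classic Twitter JSON.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def in_window(tweet, start, end):
    """Return True if the tweet was created in the half-open window [start, end)."""
    created = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
    return start <= created < end


def filter_window(tweets, start, end):
    """Keep only the tweets whose creation time falls inside the window."""
    return [tweet for tweet in tweets if in_window(tweet, start, end)]


if __name__ == "__main__":
    start = datetime(2011, 7, 22, tzinfo=timezone.utc)
    end = datetime(2011, 8, 1, tzinfo=timezone.utc)
    sample = [
        {"id": 1, "created_at": "Fri Jul 22 14:30:00 +0000 2011"},
        {"id": 2, "created_at": "Mon Jan 03 10:00:00 +0000 2011"},
    ]
    print([tweet["id"] for tweet in filter_window(sample, start, end)])
```

Both window bounds are timezone-aware so that they can be compared with the parsed timestamps.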
Because we were dealing with huge data sets, we partitioned the data to make the
analysis easier and to improve query performance. Another important aspect of data
cleaning is geotagging locations. This needs to be considered because the same address
can be written in various forms. For example, Bangalore, Bangalore Karnataka and
Bangalore Karnataka India are all different ways to write the same location. To perform
an accurate analysis, the location needs to be normalized and converted into a single
format. We used Google’s Geocoding API to do this. The API provides a straightforward
way to convert an address into coordinates (latitude and longitude) that can be used for
map positioning.
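The normalization step described above could be sketched in Python as follows. This is a hedged illustration, not the code used in the project: it assumes the public JSON endpoint of the Geocoding API (`address` and `key` query parameters, a `status` field, and coordinates under `results[0].geometry.location`), and the `fetch` callable and function names are hypothetical. Passing `fetch` in keeps the sketch runnable without network access.

```python
import json
from urllib.parse import urlencode

# Public JSON endpoint of the Geocoding API (assumed here; an API key is required).
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Build a request URL for a free-text address."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def parse_geocode_response(body):
    """Return (lat, lng) from a JSON response body, or None if nothing matched."""
    payload = json.loads(body)
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return (location["lat"], location["lng"])


def normalize_locations(raw_locations, fetch, api_key):
    """Map each distinct location string to coordinates.

    `fetch` is any callable that takes a URL and returns the response body as
    text. Strings are lower-cased and whitespace-collapsed first so that
    trivially different spellings share one lookup.
    """
    cache = {}
    for raw in raw_locations:
        key = " ".join(raw.lower().split())
        if key not in cache:
            url = build_geocode_url(key, api_key)
            cache[key] = parse_geocode_response(fetch(url))
    return cache
```

In a live setting `fetch` might wrap `urllib.request.urlopen`. Different spellings of one place, such as "Bangalore" and "Bangalore Karnataka India", can then be grouped by comparing the returned coordinates.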
Challenges faced during Implementation:
Some of the hindrances that we encountered with the extracted data are:
Duplicate files:
The extracted data contained a huge number of repeated files with identical content.
This is a major nuisance, because the files with unique content have to be filtered out
through additional processing, which is also very time consuming.
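One common way to filter out files with identical content is to compare content hashes. The sketch below is an assumption about how such a filter could look in Python, not a record of the processing actually performed; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unique_files(directory):
    """Return one path per distinct file content, visiting paths in sorted order."""
    seen = set()
    kept = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        fingerprint = file_digest(path)
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(path)
    return kept
```

Reading in chunks keeps memory use bounded for large extract files, and only the first file seen for each digest is kept.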
Parsing data:
Parsing is difficult and can fail for varied reasons, for example when the data on
Twitter contains many languages. Another cause is a JSON structure that was closed
incorrectly, which prevents the data beyond that point from being read.
Complete data not recovered:
This issue concerns the failure to recover the complete data when extracting through
Apache Hive. Because we were dealing with huge data sets, a lot of extra programming and
debugging was required to repair the situation. Parsing exceptions were raised, which we
resolved by locating the erroneous files.
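The idea of locating erroneous records instead of letting one bad record stop the read can be sketched in Python as below. It assumes one JSON tweet per line, which the report does not state, and the function name is hypothetical.

```python
import json


def parse_tweets(lines):
    """Parse one JSON tweet per line and return (tweets, errors).

    `errors` holds (line_number, message) for every record that could not be
    parsed, so the offending input can be located and repaired or dropped.
    """
    tweets, errors = [], []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # skip blank lines
        try:
            record = json.loads(text)
        except json.JSONDecodeError as exc:
            errors.append((number, exc.msg))
            continue
        if isinstance(record, dict):
            tweets.append(record)
        else:
            errors.append((number, "not a JSON object"))
    return tweets, errors
```

A record with an unclosed brace is reported with its line number while the records after it are still read.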
Analysis
After the data selection and data cleaning process, we selected different tables that
were representative of various aspects of the analysis of the two events, Amy Winehouse
and the Norway attacks: a comparison analysis of the two events along with a sentiment
analysis of each. The following aspects help to proceed with an analysis of the events in
hand:
Data Distribution, Hashtags count table, URLs count table, Tweet sentiment, and
Famous tweeters.
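The hashtag, URL and user mention count tables listed above could be produced along the following lines. This Python sketch assumes tweets in the classic Twitter JSON layout, where an `entities` object carries `hashtags` (with `text`), `urls` (with `url`) and `user_mentions` (with `screen_name`); it is an illustration and not the HiveQL used in the project.

```python
from collections import Counter


def count_entities(tweets):
    """Count hashtags, shared URLs and user mentions across a collection of tweets."""
    hashtags, urls, mentions = Counter(), Counter(), Counter()
    for tweet in tweets:
        entities = tweet.get("entities") or {}
        hashtags.update(tag["text"] for tag in entities.get("hashtags", []))
        urls.update(link["url"] for link in entities.get("urls", []))
        mentions.update(user["screen_name"] for user in entities.get("user_mentions", []))
    return hashtags, urls, mentions


if __name__ == "__main__":
    sample = [
        {"entities": {"hashtags": [{"text": "utoya"}], "urls": [],
                      "user_mentions": [{"screen_name": "Reuters"}]}},
        {"entities": {"hashtags": [{"text": "utoya"}, {"text": "oslobomb"}],
                      "urls": [{"url": "http://t.co/example"}], "user_mentions": []}},
    ]
    tags, links, users = count_entities(sample)
    print(tags.most_common(2))
```

Hashtags are counted exactly as written, so spellings that differ in case or accents are counted separately.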
Event 1: Amy Winehouse
Figure: No of Tweets (vertical axis from 0 to 25000).
Figure: URL Share Count. URLs shown: http://t.co/0IGT940, http://t.co/kLYO5t5,
http://huff.to/oDwgHC, http://t.co/BtIzsiW, http://t.co/CahfKYh,
http://on.msnbc.com/4dpW6f, http://nyp.st/qYGM9L, http://bit.ly/oapSdd,
http://t.co/TkKR8Qm, http://n.pr/nnu5XS.
Figure: Hashtag Count (axis from 0 to 600).
Event 2: Norway attacks
Figure: User Mention Count (axis from 0 to 450). Accounts shown: SkyNewsBreak, YouTube,
BreakingNews, HuffingtonPost, Reuters, NewYorkPost, iamshortymack, RollingStone,
HotNewHipHop, mashable.
Figure: No of Tweets (vertical axis from 0 to 8000).
Comparison Analysis
The Amy Winehouse event occurred on 23rd July 2011, whereas the Norway attacks event
occurred on 22nd July 2011. As can be seen from the charts, the number of tweets for
event 1 peaked on the day of the event and dropped steeply over the week until the tweets
finally died down. The Norway attacks event, on the other hand, had its maximum number of
tweets on the day of the event and over the next couple of days, and the drop in the
number of tweets was gradual. However, it is interesting to note that event 1 garnered
its maximum of over 20000 tweets on the day it occurred. Despite being of a more serious
nature, event 2 saw far fewer tweets on the day of its occurrence.
Sentiment Analysis
The sentiments, in terms of positive, negative and neutral tweets, towards the two events
over the period from 07/22/2011 to 07/31/2011 are visualized. The graphs below depict the same:
Figure: User Mention Count (axis from 0 to 450). Accounts shown: BreakingNews, Reuters,
CBSNews, YouTube, HuffingtonPost, YahooNews, StateDept, mpoppel, ggreenwald,
SenatorSanders.
Event 1: Amy Winehouse
Event 1 garnered the most neutral tweets and the fewest positive tweets overall.
Event 2: Norway Attacks
Event 2 also garnered the most neutral tweets and the fewest positive tweets overall.
Interestingly, the number of negative tweets exceeded the number of neutral and positive
tweets on the days following the event.
Figure: Tweet Count by sentiment (Positive tweet, Negative Tweet, Neutral Tweet), 20-Jul-11
to 1-Aug-11 (vertical axis from 0 to 12000).
Figure: Tweet Count by sentiment (Positive, Negative, Neutral), 20-Jul-11 to 1-Aug-11
(vertical axis from 0 to 8000).
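The report does not state how tweets were labelled as positive, negative or neutral. One simple possibility is a word-list (lexicon) approach, sketched below in Python. The two word sets are tiny hypothetical examples chosen for illustration, and the function names are assumptions; a real analysis would need a much larger lexicon or a trained classifier.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"love", "great", "hope", "beautiful", "brave"}
NEGATIVE = {"sad", "tragic", "horrible", "terrible", "shocked"}


def classify(text):
    """Label a tweet by comparing counts of positive and negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(word in POSITIVE for word in words) - sum(word in NEGATIVE for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def daily_sentiment(tweets):
    """Aggregate labels per day; `tweets` is an iterable of (day, text) pairs."""
    by_day = {}
    for day, text in tweets:
        by_day.setdefault(day, Counter())[classify(text)] += 1
    return by_day
```

Under a scheme like this, any tweet without a lexicon word, such as a plain news headline, is labelled neutral, so the neutral share depends heavily on how complete the word lists are.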
Conclusion
Managing huge amounts of data is becoming more convenient with the advent of
distributed file systems. They make it possible to manage and analyze huge volumes of
data, which can help assess a particular event’s significance over a period of time.
The analysis negates the hypothesis we had initially assumed and brought us to the
conclusion that the Amy Winehouse event was at least as popular as an event as grave as
the Norway attacks. The retweets that the events generated help to determine the most
discussed issues among Twitter users. It is extremely surprising that a celebrity death
can take precedence over an assault on a nation. One possible explanation is that people
are very careful when commenting on issues that are sensitive in nature and choose to
refrain from expressing their views. The sentiment analysis supports this: with the graphs
showing that neutral tweets were the most numerous for both events, it can be interpreted
that most people are reserved in their opinions and therefore take a neutral stand on a
public platform where most activities are scrutinized, especially on an issue as delicate
as the Norway attacks.
References
http://en.wikipedia.org/wiki/Sentiment_Analysis
http://en.wikipedia.org/wiki/Apache_Hive
http://aws.amazon.com/ec2/instance-types/#selecting-instance-types
https://developers.google.com/maps/documentation/geocoding/?hl=el