The document analyzes social media data from Twitter during the 2016 US Presidential Election over one week. It identifies the major topics discussed and performs a sentiment analysis of significant terms. The analysis found that 18% of over 117,000 words were negative, while 11% were positive and 71% neutral. Most discussions were about political figures and groups, almost a quarter about Donald Trump, rather than issues. Future analysis could examine sentiment in each identified topic cluster.
CUS 695 Project Presentation
1. Social Analytics on Twitter
By: Adam Ghassouine, Robert Monegro and Adrian Duran
CUS 695 – Capstone Project
Dr. Giancarlo Crocetti
Mondays 7:10 p.m. – 9:10 p.m.
2. Executive Summary
This report analyzes social media data, specifically Twitter posts about the 2016 United States Presidential
Election, collected over a period of one week. The purpose of this report is to identify the major topics being
discussed in relation to the election. The method of analysis is a sentiment analysis of all significant terms
related to the topics considered in this study. All script files from this analysis can be found in the
appendices of this report.
The analysis shows that, during the 2016 Presidential Election, negative language was used more often than
positive language. 18% of 117,655 words were considered negative. 11% of conversations over Twitter used
positive language, while 71% were neutral.
The report finds that most of the discussion on social media is gossip about political figures and groups,
almost a quarter of it about Donald Trump, rather than discussion of actual political issues, which is
not unexpected for Twitter. Recommendations for future analysis include analyzing the sentiment of each
cluster.
6. StopWord Analysis
• Using a StopWords dictionary, one can extract the frequency table of all words.
• The purpose of this analysis is to detect and remove unnecessary words that add
little to no substance to this research.
• ‘https’ was appearing frequently, causing unnecessary n-grams to be generated. This in turn
led to the removal of this term.
• Another example of a high-frequency word was ‘RT’, which marks a post that has been
retweeted. This term added nothing to the overall analysis.
• Unnecessary URLs in each post, random words with no meaning such as ‘absfwi’ and ‘acbqdi’,
were also eliminated.
• The result of this analysis is a set of words that relate only to the 2016 Presidential Election.
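The frequency-table step described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample posts and the threshold of two occurrences are hypothetical, and the original analysis was done with a StopWords dictionary rather than with this script.

```python
from collections import Counter

# Hypothetical sample posts; the real input was the collected tweet text.
posts = [
    "RT Trump rally tonight https",
    "RT Clinton speech https",
    "Trump and Clinton debate",
]

def frequency_table(texts):
    """Count how often each whitespace-separated word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

table = frequency_table(posts)

# Words that occur at least twice (an arbitrary, illustrative threshold).
# An analyst would review this list and move terms with no meaning for
# the study, such as 'rt' and 'https', into the stop-word list.
candidates = [word for word, n in table.most_common() if n >= 2]
print(candidates)
```

Frequent terms such as candidate names stay in the data; only the reviewed terms that carry no meaning are removed.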
12. Sentiment Code
• Used to extract positive and negative scores to further discern sentiment for the clusters generated
13. Sentiment Analysis
• The analysis shows that, during the 2016
presidential election, negative language was used
more often than positive language. 18% of 117,655 words
were considered negative. 11% of
conversations over Twitter used positive
language, while 71% were neutral.
• 28 positive words were most commonly used,
such as: “Happiness, Congratulations,
Splendid, Excellent and Admirability”
• 16 negative words were most commonly used,
such as: “Threaten, Downhill,
Apocalyptic, Negative, Trashed”
Finished pulling all the tweets required to move forward with the project
A modest number, approximately 7,000
Create a Twitter Connection by choosing a name and using the provided access token.
Select the keyword used to query Twitter posts (query=‘2016 election’).
Select the number of Twitter posts to query at one time (limit=‘1000’).
Run the operator repeatedly to build up a large collection of posts.
Write the data to a CSV file to aid the analysis.
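Because the query is run repeatedly, the same post can come back in more than one batch. The sketch below shows one way to append each batch to a CSV file while skipping posts that are already stored. It assumes each post is a dictionary with hypothetical `id` and `text` fields. It does not call Twitter, and it is not the RapidMiner process used in the project.

```python
import csv
import os

def append_posts(path, posts):
    """Append posts to a CSV file, skipping ids that are already stored.

    Each post is assumed to be a dict with 'id' and 'text' keys
    (hypothetical field names chosen for this sketch).
    Returns the number of rows actually written.
    """
    seen = set()
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    if not is_new:
        with open(path, newline="", encoding="utf-8") as f:
            seen = {row["id"] for row in csv.DictReader(f)}

    added = 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "text"])
        if is_new:
            writer.writeheader()
        for post in posts:
            post_id = str(post["id"])
            if post_id not in seen:
                writer.writerow({"id": post_id, "text": post["text"]})
                seen.add(post_id)
                added += 1
    return added
```

Calling `append_posts("tweets.csv", batch)` after every query run keeps one growing file with no duplicate rows.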
The Process Documents from Data operator takes unstructured text and generates a vector space model using TF-IDF.
Change all the words to lower case.
Remove the http hyperlink from all posts using the following regular expression.
Apply tokenization to split all words at non-letters.
-This operator splits the text of a document into a sequence of tokens. There are several options for specifying the splitting points. You may split at all non-letter characters, which is the default setting. This results in tokens consisting of one single word, which is the most appropriate option before finally building the word vector.
-Or, if you are going to build windows of tokens or something similar, you will probably want to split into complete sentences. This is possible by setting the split mode to specify characters and entering all splitting characters.
-The third option lets you define regular expressions and is the most flexible for very special cases. In the default mode, each non-letter character is used as a separator. As a result, each word in the text is represented by a single token.
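The first three preprocessing steps (lower-casing, hyperlink removal, and tokenizing at non-letters) can be approximated in Python as follows. The URL pattern is an assumption made for this sketch and is not the regular expression used in the project. The tokenizer treats only the ASCII letters a-z as letters, which may differ from RapidMiner's definition.

```python
import re

# Assumed pattern: an http or https link up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")
# Split points: any run of characters that are not the letters a-z.
NON_LETTERS = re.compile(r"[^a-z]+")

def tokenize(post):
    """Lower-case a post, strip hyperlinks, and split it at non-letters."""
    text = post.lower()
    text = URL_PATTERN.sub(" ", text)
    # Splitting can leave empty strings at the edges; drop them.
    return [token for token in NON_LETTERS.split(text) if token]

tokens = tokenize("RT @user: Election 2016 is HERE! https://t.co/abc123")
print(tokens)  # ['rt', 'user', 'election', 'is', 'here']
```

Removing the link before tokenizing matters: otherwise the URL would be broken into fragments such as 'https' and 'abc' that then show up as spurious terms.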
Filter tokens to remove all words shorter than 3 letters.
Filter stopwords to remove terms such as ‘https’, @ symbols, and hashtags.
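The two filtering steps can be combined in one small function. The stop-word set below is a short illustrative list, not the dictionary used in the project.

```python
# Illustrative stop words only; the project's actual list is not reproduced here.
STOP_WORDS = frozenset({"https", "the", "and", "for", "that"})

def filter_tokens(tokens, min_length=3, stop_words=STOP_WORDS):
    """Keep tokens that have at least min_length letters and are not stop words."""
    return [
        token
        for token in tokens
        if len(token) >= min_length and token not in stop_words
    ]

kept = filter_tokens(["rt", "https", "trump", "is", "the", "election"])
print(kept)  # ['trump', 'election']
```

Note that the length filter alone already removes 'rt', while 'https' has to be removed by the stop-word list.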
Stem all tokens to their base form.
-A stemmer providing several stemming algorithms written for the Snowball language.
-This operator stems words by applying stemming algorithms written for the Snowball language. Various stemming algorithms for different languages can be chosen
Generate n-grams of length 3.
-Creates term n-Grams of tokens in a document.
-A term n-Gram is defined as a series of consecutive tokens of length n. The term n-Grams generated by this operator consist of all series of consecutive tokens of length n.
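The definition above, every series of n consecutive tokens, translates directly into a sliding window. In this sketch the tokens of each n-gram are joined with an underscore, which is an illustrative convention chosen here.

```python
def ngrams(tokens, n=3):
    """Return every run of n consecutive tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

trigrams = ngrams(["make", "america", "great", "again"], n=3)
print(trigrams)  # ['make_america_great', 'america_great_again']
```

A document with fewer than n tokens produces no n-grams, so very short posts contribute nothing at this step.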
Now that we had the data formatted correctly, we needed to see which terms appeared the most.
When a term appears in a few documents, but not in all documents, its importance increases.
This means that this term is very good at describing those documents
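This weighting idea is what TF-IDF captures. The sketch below uses the textbook form, term frequency multiplied by log(N / document frequency). That choice is an assumption for illustration, and RapidMiner's exact weighting and normalization may differ.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute textbook TF-IDF weights for a list of tokenized documents.

    Returns one {term: weight} dict per document.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already preprocessed posts.
docs = [
    ["trump", "election"],
    ["clinton", "election"],
    ["election", "debate"],
]
scores = tf_idf(docs)
print(scores[0])
```

In this toy input 'election' occurs in every document, so its weight is zero, while 'trump' occurs in only one document and receives a positive weight there.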
The analysis shows that the first 3-4 terms of each cluster properly convey the sentiment/topic of said cluster
WordNet is a lexical database that groups words into sets of synonyms called synsets
WordNet does a great job at distinguishing different kinds of words such as nouns, verbs, adjectives, and adverbs
SentiWordNet is an extension of WordNet that provides three additional measures for each synset: a Positive Score, a Negative Score, and a Neutral (objective) Score
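To illustrate how positive and negative scores can be turned into the kind of percentage breakdown reported in this deck, the sketch below labels each word by comparing its two scores and then computes the share of each label. The lexicon entries and their scores are hypothetical values invented for this example. They are not actual SentiWordNet scores, and this is not the project's sentiment code.

```python
from collections import Counter

# Hypothetical (positive, negative) scores in the style of SentiWordNet.
LEXICON = {
    "excellent": (0.75, 0.0),
    "splendid": (0.625, 0.0),
    "threaten": (0.0, 0.5),
    "apocalyptic": (0.0, 0.625),
}

def label(word, lexicon=LEXICON):
    """Label a word by comparing its positive and negative scores."""
    pos, neg = lexicon.get(word, (0.0, 0.0))  # unknown words count as neutral
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

def sentiment_shares(words):
    """Return the fraction of words with each sentiment label."""
    if not words:
        return {"positive": 0.0, "negative": 0.0, "neutral": 0.0}
    counts = Counter(label(word) for word in words)
    return {
        name: counts[name] / len(words)
        for name in ("positive", "negative", "neutral")
    }

shares = sentiment_shares(
    ["excellent", "threaten", "apocalyptic", "election", "vote"]
)
print(shares)
```

Running the same function once per cluster, rather than over the whole corpus, would give the per-cluster sentiment suggested as future work.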
AYLIEN is a RapidMiner extension that is used to analyze the sentiment of text.
Extracting sentiment from a piece of text such as a tweet, a review or an article can provide us with valuable insight about the author's emotions and perspective: whether the tone is positive, neutral or negative, and whether the text is subjective (meaning it's reflecting the author's opinion) or objective (meaning it's expressing a fact).