2. Twitter Data Analysis Using R
• Create a Twitter app developer account
• Get access credentials
• Install required packages in R
• Connect R to Twitter
• Extract tweets in real time
• Create a corpus
• Data preprocessing and text mining
• Word cloud
• Frequent term mining
• Sentiment analysis using a lexicon
3. R Packages
• Twitter data extraction: twitteR
• Text cleaning and mining: tm
• Word cloud: wordcloud
• Topic modelling: topicmodels, lda
• Sentiment analysis: sentiment, syuzhet
• Social network analysis: igraph, sna
• Visualisation: wordcloud, Rgraphviz, ggplot2
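To illustrate what lexicon-based sentiment analysis does before reaching for a package such as syuzhet, here is a toy base-R scorer. The two mini-lexicons are made up for illustration; real analyses use the much larger lexicons shipped with the packages above:

```r
# Toy lexicon-based sentiment scoring in base R.
# The two word lists are hypothetical mini-lexicons.
positive <- c("good", "great", "love", "excellent")
negative <- c("bad", "poor", "hate", "terrible")

score_sentiment <- function(text) {
  # split on anything that is not a letter, then count matches
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(words %in% positive) - sum(words %in% negative)
}

score_sentiment("Great phone, I love it")    # 2
score_sentiment("Poor battery, bad screen")  # -2
```

The score is simply (positive hits) minus (negative hits); lexicon packages refine this with larger dictionaries, valence weights, and negation handling.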
4. Text Cleaning Functions
• Convert to lower case: tolower
• Remove punctuation: removePunctuation
• Remove numbers: removeNumbers
• Remove stop words (like 'a', 'the', 'in'): removeWords, stopwords
• Remove extra white space: stripWhitespace
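The effect of each cleaning step can be seen on a small example. This sketch uses base-R equivalents of the tm functions listed above (in practice the tm versions run via tm_map on a corpus):

```r
# Base-R equivalents of the tm cleaning steps listed above.
txt <- "The 3 BIG cats, in   the box!"

txt <- tolower(txt)                    # tolower
txt <- gsub("[[:punct:]]", "", txt)    # removePunctuation
txt <- gsub("[[:digit:]]", "", txt)    # removeNumbers
# removeWords with a tiny stop word list ('a', 'the', 'in')
txt <- gsub("\\b(a|the|in)\\b", "", txt)
txt <- gsub("\\s+", " ", trimws(txt))  # stripWhitespace
txt                                    # "big cats box"
```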
5. Text Mining (Package tm)
• Remove numbers, punctuation, words, or extra whitespace
• removeNumbers, removePunctuation, removeWords, stripWhitespace
• Remove sparse terms from a term-document matrix
• removeSparseTerms
• Various kinds of stopwords
• stopwords
• Stem words and complete stems
• stemDocument, stemCompletion
• Build a term-document matrix or a document-term matrix
• TermDocumentMatrix, DocumentTermMatrix
• Generate a term frequency vector
• termFreq
• Find frequent terms or associations of terms
• findFreqTerms, findAssocs
• Various ways to weight a term-document matrix
• weightBin, weightTf, weightTfIdf, weightSMART, WeightFunction
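The matrix-building functions above combine into a short pipeline. A minimal sketch, assuming tm is installed, using a tiny hand-made corpus in place of real tweets:

```r
library(tm)  # assumes tm is installed

docs <- c("good phone good screen", "bad battery bad screen")
corpus <- Corpus(VectorSource(docs))

# Term-document matrix: rows are terms, columns are documents
tdm <- TermDocumentMatrix(corpus)

# Terms occurring at least twice across the corpus
findFreqTerms(tdm, lowfreq = 2)  # "bad" "good" "screen"

# Raw counts, if you want to inspect them directly
as.matrix(tdm)
```

DocumentTermMatrix builds the transposed matrix, and the weight* functions replace the raw counts (e.g. weightTfIdf) via the control argument of TermDocumentMatrix.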
6. Prerequisites
• You have already installed R version 3.4.3 and are using RStudio.
• In order to extract tweets, you will need a Twitter application and hence a
Twitter account.
• If you don’t have a Twitter account, please sign up.
• Use your Twitter login ID and password to sign in at Twitter Developers.
• https://apps.twitter.com/
7. New App Form
Fill out the new app form. Names
should be unique, i.e., no one
else should have used this name
for their Twitter app.
Give a brief description of the
app. You can change this later on
if needed. Enter your website or
blog address. Callback URL can be
left blank.
Once you’ve done this, make sure
you’ve read the “Developer Rules
Of The Road” blurb, check the
“Yes, I agree” box, fill in the
CAPTCHA and click the “Create
Your Twitter Application” button.
8. Create My Access Token
Scroll down and click on “Create my
access token” button.
Note the values of consumer key and
consumer secret and keep them handy for
future use. You should keep these secret: if
anyone were to obtain these keys, they could
effectively access your Twitter account.
10. Install And Load R Packages
• R comes with a standard set of packages. A number of other packages are available
for download and installation
• We will need the following packages:
– ROAuth: Provides an interface to the OAuth 1.0 specification, allowing users to authenticate via
OAuth to the server of their choice.
– twitteR: Provides an interface to the Twitter web API.
• Let’s start by installing and loading all the required packages.
install.packages("twitteR")
install.packages("ROAuth")
library("twitteR")
library("ROAuth")
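With the packages loaded, the next step is to authenticate. A minimal sketch using twitteR's setup_twitter_oauth(); the four credential strings are placeholders you must replace with the values from your own app page (and should never commit to version control):

```r
# Placeholder credentials from the Twitter app page
consumer_key    <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
access_token    <- "YOUR_ACCESS_TOKEN"
access_secret   <- "YOUR_ACCESS_SECRET"

# Authenticate this R session against the Twitter API
setup_twitter_oauth(consumer_key, consumer_secret,
                    access_token, access_secret)
```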
11. Extract Tweets
• Use searchTwitter to search Twitter based on the supplied search string and return a list. The “lang”
parameter is used below to restrict tweets to the “English” language.
>tweets <- searchTwitter(search.string, n=no.of.tweets, cainfo="cacert.pem", lang="en")
>tweets
>searchTwitter(searchString, n=25, lang=NULL, since=NULL, until=NULL, locale=NULL,
geocode=NULL, sinceID=NULL, maxID=NULL, resultType=NULL, retryOnRateLimit=120, ...)
Rtweets(n=25, lang=NULL, since=NULL, ...)
Examples
# searchTwitter("RUAS", n=100)
# Rtweets(n=37)
### Search between two dates
# searchTwitter('NarendraModi', since='2015-03-01', until='2018-03-02')
### geocoded results
# searchTwitter('patriots', geocode='42.375,-71.1061111,10mi')
# ## using resultType
# searchTwitter('world cup+brazil', resultType="popular", n=15)
# searchTwitter('from:hadleywickham', resultType="recent", n=10)
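searchTwitter returns a list of status objects, which is usually flattened to plain text before cleaning. A sketch, assuming authentication has already succeeded (both accessors are from the twitteR package):

```r
# Assumes setup_twitter_oauth() has already succeeded
tweets <- searchTwitter("RUAS", n = 100, lang = "en")

# Extract the text of each status object into a character vector
tweets.text <- sapply(tweets, function(t) t$getText())

# Alternatively, convert the whole list to a data frame
tweets.df <- twListToDF(tweets)
```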
12. Clean Up Text
We have already been authenticated and successfully retrieved the text from the tweets. The first step in creating a word cloud is to clean
up the text: convert it to lower case and remove punctuation, usernames, links, etc. We are using the function gsub to replace unwanted
text. gsub replaces all occurrences of a given pattern. Although there are alternative packages that can perform this operation, we
have chosen gsub for its simplicity and readability.
# Convert all text to lower case
tweets.text <- tolower(tweets.text)
# Remove the retweet marker "rt"
tweets.text <- gsub("\\brt\\b", "", tweets.text)
# Remove @UserName
tweets.text <- gsub("@\\w+", "", tweets.text)
# Remove punctuation
tweets.text <- gsub("[[:punct:]]", "", tweets.text)
# Remove links
tweets.text <- gsub("http\\w+", "", tweets.text)
# Remove tabs
tweets.text <- gsub("[ |\t]{2,}", "", tweets.text)
# Remove blank spaces at the beginning
tweets.text <- gsub("^ ", "", tweets.text)
# Remove blank spaces at the end
tweets.text <- gsub(" $", "", tweets.text)
13. Remove Stop Words
• In the next step we will use the text mining package tm to remove stop words. A stop word is a commonly
used word such as “the”.
• If tm is not already installed you will need to install it (available from the Comprehensive R Archive
Network).
• #install tm – if not already installed
install.packages("tm")
library(tm)
#create corpus
tweets.text.corpus <- Corpus(VectorSource(tweets.text))
#clean up by removing stop words
tweets.text.corpus <- tm_map(tweets.text.corpus, function(x) removeWords(x, stopwords()))
14. Generate Word Cloud
• Generate the word cloud using the wordcloud package.
• In this example we plot at most 150 words that occur at least twice, with random colors
(random.order = FALSE places the most frequent words in the centre).
#install wordcloud if not already installed
install.packages("wordcloud")
library(wordcloud)
#generate wordcloud
wordcloud(tweets.text.corpus, min.freq = 2, scale=c(7,0.5),colors=brewer.pal(8, "Dark2"),
random.color= TRUE, random.order = FALSE, max.words = 150)