Twitter Data Analytics using R
By:
Santoshi Kumari
RUAS
Twitter Data Analysis Using R
• Create a Twitter developer app account
• Get access credentials
• Install required packages in R
• Connect R to Twitter
• Extract tweets in real time
• Create a corpus
• Data preprocessing and text mining
• Word cloud
• Frequent term mining
• Sentiment analysis using a lexicon
R Packages
• Twitter data extraction: twitteR
• Text cleaning and mining: tm
• Word cloud: wordcloud
• Topic modelling: topicmodels, lda
• Sentiment analysis: sentiment, syuzhet
• Social network analysis: igraph, sna
• Visualisation: wordcloud, Rgraphviz, ggplot2
Text Cleaning Functions
• Convert to lower case: tolower
• Remove punctuation: removePunctuation
• Remove numbers: removeNumbers
• Remove stop words (like 'a', 'the', 'in'): removeWords, stopwords
• Remove extra white space: stripWhitespace
Text Mining (Package tm)
• Remove numbers, punctuation, words or extra whitespace
– removeNumbers, removePunctuation, removeWords, stripWhitespace
• Remove sparse terms from a term-document matrix
– removeSparseTerms
• Various kinds of stop words
– stopwords
• Stem words and complete stems
– stemDocument, stemCompletion
• Build a term-document matrix or a document-term matrix
– TermDocumentMatrix, DocumentTermMatrix
• Generate a term frequency vector
– termFreq
• Find frequent terms or associations of terms
– findFreqTerms, findAssocs
• Various ways to weight a term-document matrix
– weightBin, weightTf, weightTfIdf, weightSMART, WeightFunction
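The functions listed above can be chained into a short pipeline. The following is a minimal sketch, assuming the tm package is installed; the three sample documents and the frequency threshold are invented for illustration and are not part of the original slides.

```r
# Sketch only: sample documents are made up for illustration.
library(tm)

docs <- c("R makes text mining easy",
          "Text mining with the tm package in R",
          "Mining tweets needs clean text")

# Build a corpus and apply the cleaning transformations in turn
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Terms as rows, documents as columns
tdm <- TermDocumentMatrix(corpus)

# Terms that occur at least twice across the corpus
frequent <- findFreqTerms(tdm, lowfreq = 2)
print(frequent)
```

tolower is wrapped in content_transformer because it is a base R function rather than a tm transformation.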
Prerequisites
• You have already installed R version 3.4.3 and are using RStudio.
• In order to extract tweets, you will need a Twitter application and hence a Twitter account.
• If you don’t have a Twitter account, please sign up.
• Use your Twitter login ID and password to sign in at Twitter Developers: https://apps.twitter.com/
New App Form
Fill out the new app form. The app name must be unique, i.e., no one else may have used it for their Twitter app. Give a brief description of the app; you can change this later if needed. Enter your website or blog address. The callback URL can be left blank.
Once you’ve done this, make sure you’ve read the “Developer Rules Of The Road” blurb, check the “Yes, I agree” box, fill in the CAPTCHA and click the “Create Your Twitter Application” button.
Create My Access Token
Scroll down and click the “Create my access token” button.
Note the values of the consumer key and consumer secret and keep them handy for future use. Keep them secret: if anyone were to get these keys, they could effectively access your Twitter account.
Save Access Credentials
Install And Load R Packages
• R comes with a standard set of packages. Many other packages are available for download and installation.
• We will need the following packages:
– ROAuth: Provides an interface to the OAuth 1.0 specification, allowing users to authenticate via OAuth to the server of their choice.
– twitteR: Provides an interface to the Twitter web API.
• Let’s start by installing and loading all the required packages.
install.packages("twitteR")
install.packages("ROAuth")
library("twitteR")
library("ROAuth")
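With the packages loaded, the saved credentials can be used to connect R to Twitter. The following is a minimal sketch using twitteR's setup_twitter_oauth function; the environment variable names are hypothetical, chosen here so that the keys never appear in the script itself.

```r
# Sketch only: the environment variable names below are assumptions.
# Set them outside the script (e.g. in .Renviron) to keep the keys private.
consumer_key    <- Sys.getenv("TWITTER_CONSUMER_KEY")
consumer_secret <- Sys.getenv("TWITTER_CONSUMER_SECRET")
access_token    <- Sys.getenv("TWITTER_ACCESS_TOKEN")
access_secret   <- Sys.getenv("TWITTER_ACCESS_SECRET")

credentials <- c(consumer_key, consumer_secret, access_token, access_secret)

if (all(nzchar(credentials))) {
  # Registers the OAuth credentials for the rest of the R session
  twitteR::setup_twitter_oauth(consumer_key, consumer_secret,
                               access_token, access_secret)
} else {
  message("Twitter credentials not set; skipping authentication.")
}
```

The guard lets the script run without failing when the credentials have not been configured yet.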
Extract Tweets
• Use searchTwitter to search Twitter for the supplied search string; it returns a list of tweets. The lang parameter is used below to restrict tweets to English.
tweets <- searchTwitter(search.string, n=no.of.tweets, cainfo="cacert.pem", lang="en")
tweets
Usage
searchTwitter(searchString, n=25, lang=NULL, since=NULL, until=NULL, locale=NULL,
geocode=NULL, sinceID=NULL, maxID=NULL, resultType=NULL, retryOnRateLimit=120, ...)
Rtweets(n=25, lang=NULL, since=NULL, ...)
Examples
# searchTwitter("RUAS", n=100)
# Rtweets(n=37)
### Search between two dates
# searchTwitter('NarendraModi', since='2015-03-01', until='2018-03-02')
### Geocoded results
# searchTwitter('patriots', geocode='42.375,-71.1061111,10mi')
### Using resultType
# searchTwitter('world cup+brazil', resultType="popular", n=15)
# searchTwitter('from:hadleywickham', resultType="recent", n=10)
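searchTwitter returns a list of status objects, while the cleaning steps that follow work on a character vector named tweets.text. One way to bridge the two is to call each status object's getText method. The sketch below assumes that approach; the helper name extract_text and the stand-in objects are hypothetical, used only so the example runs without API access.

```r
# Pull the text out of each status object returned by searchTwitter().
# extract_text is a hypothetical helper name, not part of twitteR.
extract_text <- function(statuses) {
  vapply(statuses, function(status) status$getText(), character(1))
}

# Stand-in objects that mimic the getText() method of a twitteR status,
# so the sketch can run offline. With real data, pass the list from
# searchTwitter() instead.
mock_statuses <- list(
  list(getText = function() "RT @user: Learning R at RUAS! http://t.co/abc123"),
  list(getText = function() "Text mining   with tm is fun")
)

tweets.text <- extract_text(mock_statuses)
print(tweets.text)
```

twitteR also provides twListToDF, which converts the list to a data frame with a text column, if a tabular form is preferred.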
Clean Up Text
We have already been authenticated and have successfully retrieved the text from the tweets. The first step in creating a word cloud is to clean up the text by converting it to lower case and removing punctuation, usernames, links, etc. We use the function gsub, which replaces all occurrences of a given pattern. Although other packages can perform this operation, we have chosen gsub because of its simplicity and readability.
# Convert all text to lower case
tweets.text <- tolower(tweets.text)
# Remove the retweet marker "rt" (whole word only, so "smart" is untouched)
tweets.text <- gsub("\\brt\\b", "", tweets.text)
# Remove @UserName mentions
tweets.text <- gsub("@\\w+", "", tweets.text)
# Remove punctuation
tweets.text <- gsub("[[:punct:]]", "", tweets.text)
# Remove links (punctuation is already gone, so each link is one run of word characters)
tweets.text <- gsub("http\\w+", "", tweets.text)
# Collapse tabs and repeated spaces into a single space
tweets.text <- gsub("[ \t]+", " ", tweets.text)
# Remove blank spaces at the beginning
tweets.text <- gsub("^ ", "", tweets.text)
# Remove blank spaces at the end
tweets.text <- gsub(" $", "", tweets.text)
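The cleaning steps can also be packaged into one reusable function and checked on a couple of sample strings. This is a sketch in base R under a few assumptions: the function name clean_tweets and the sample tweets are invented for illustration, and links are removed before punctuation here (using a non-space pattern) rather than after.

```r
# clean_tweets is a hypothetical helper; it uses only base R.
clean_tweets <- function(text) {
  text <- tolower(text)
  text <- gsub("\\brt\\b", "", text)     # retweet marker, whole word only
  text <- gsub("@\\w+", "", text)        # @username mentions
  text <- gsub("http\\S+", "", text)     # links, up to the next space
  text <- gsub("[[:punct:]]", "", text)  # remaining punctuation
  text <- gsub("[ \t]+", " ", text)      # collapse spaces and tabs
  trimws(text)                           # strip leading/trailing space
}

# Made-up sample tweets for illustration
sample_tweets <- c("RT @user: Learning R at RUAS! http://t.co/abc123",
                   "Smart   text mining,\twith tm")

cleaned <- clean_tweets(sample_tweets)
print(cleaned)
# → "learning r at ruas" "smart text mining with tm"
```

The word boundary in the "rt" pattern matters: without it, the letters "rt" would also be stripped from words such as "smart".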
Remove Stop Words
• In the next step we will use the text mining package tm to remove stop words. A stop word is a commonly used word such as “the”.
• If tm is not already installed, you will need to install it (available from the Comprehensive R Archive Network).
# Install tm if not already installed
install.packages("tm")
library(tm)
# Create corpus
tweets.text.corpus <- Corpus(VectorSource(tweets.text))
# Clean up by removing English stop words
tweets.text.corpus <- tm_map(tweets.text.corpus, removeWords, stopwords("english"))
Generate Word Cloud
• Generate the word cloud using the wordcloud package.
• In this example we plot no more than 150 words that occur more than once, with randomly assigned colors. Because random.order is FALSE, the most frequent words are plotted first.
# Install wordcloud if not already installed
install.packages("wordcloud")
library(wordcloud)
# Generate word cloud
wordcloud(tweets.text.corpus, min.freq = 2, scale = c(7, 0.5), colors = brewer.pal(8, "Dark2"),
random.color = TRUE, random.order = FALSE, max.words = 150)
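Before plotting, it can help to look at the word frequencies that drive the cloud. The sketch below counts words in base R and then passes the counts to wordcloud through its words and freq arguments; the sample cleaned text is invented for illustration, and the plot is skipped if the wordcloud package is not installed.

```r
# Made-up cleaned tweet text for illustration
cleaned <- c("learning r at ruas",
             "text mining with r",
             "mining tweets with r")

# Split into words and count occurrences, most frequent first
words <- unlist(strsplit(cleaned, " ", fixed = TRUE))
word_freq <- sort(table(words), decreasing = TRUE)
print(head(word_freq, 5))

# Plot only if the wordcloud package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = names(word_freq),
                       freq = as.vector(word_freq),
                       min.freq = 2, max.words = 150,
                       random.order = FALSE)
}
```

With min.freq = 2, only words that occur at least twice are drawn, which matches the threshold used in the slide.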
Reference
• http://www.rdatamining.com/docs
• https://apps.twitter.com
Thank You
