Twitter sentiment analysis

Yasas Senarath, Researcher, Doctoral Student at George Mason University, US
Sentiment Analysis
Demonstration: Classification & Clustering
Yasas Senarath - Information Retrieval
Dataset and Tools Required
● Dataset
○ https://www.kaggle.com/c/si650winter11 (Training Dataset Only)
○ You can submit predictions for the test set on Kaggle.
● Tools Required
○ Python 3.6 (or later)
○ Scikit-Learn Toolkit
○ NLTK (download the 'stopwords' corpus with nltk.download('stopwords'))
High Level Architecture
● Goals
○ to classify the sentiment of each sentence into "positive" or "negative".
○ to identify clusters
[Diagram: Documents → Classify (polarity classification) and Cluster → Combine → Cluster Polarity]
Step 1: Loading Dataset
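def read_dataset():
    # Each line holds a label and a sentence separated by a tab.
    with open('../resc/data/training.txt', 'r', encoding='utf-8') as f:
        records = list(zip(*[line.rstrip('\n').split('\t') for line in f]))
    # records[0] holds the labels, records[1] the sentences.
    return records[1], records[0]

train_text, train_labels = read_dataset()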
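To check the parsing logic without the Kaggle file, the same split can be run on an in-memory sample. This is a minimal sketch: the two sample lines and the `parse_records` helper are hypothetical, and the label, tab, sentence layout is assumed from the loader above.

```python
import io

def parse_records(lines):
    """Split 'label<TAB>text' lines into (texts, labels)."""
    rows = [line.rstrip('\n').split('\t', 1) for line in lines if line.strip()]
    labels, texts = zip(*rows)
    return list(texts), list(labels)

# Hypothetical sample in the assumed format: label, a tab, then the sentence.
sample = io.StringIO("1\tI loved this movie.\n0\tThis book was boring.\n")
texts, labels = parse_records(sample)
print(texts)
print(labels)
```

Note that the labels come back as strings ('0' and '1'), which matters for the classifier step later on.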
Step 2: Extracting Features
● We will try out TF-IDF features
from nltk import TweetTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
# Keep this as a list: recent scikit-learn versions reject a set for stop_words.
stops = stopwords.words('english')
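kwargs = {
    'encoding': 'utf-8',
    'preprocessor': None,
    'stop_words': stops,
    'lowercase': True,
    'tokenizer': TweetTokenizer().tokenize  # tweet-aware tokenizer from NLTK
}
tfidfVec = TfidfVectorizer(**kwargs)
# Learn the vocabulary and IDF weights from the training text.
X_train = tfidfVec.fit_transform(train_text)
# X_test = tfidfVec.transform(test_text)
# Convert the sparse matrix to a dense array (memory-hungry for large corpora).
X_train = X_train.toarray()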
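The commented-out `X_test` line hints at an important detail: the vectorizer is fitted on the training text only, and unseen text is transformed with the same vocabulary. A minimal sketch on a hypothetical toy corpus (the documents and variable names here are illustrative, and the default scikit-learn tokenizer is used instead of `TweetTokenizer` to keep the example dependency-free):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for train_text / test_text.
train_docs = ["I loved the movie", "I hated the movie", "the book was awesome"]
test_docs = ["the movie was awesome"]

vec = TfidfVectorizer(lowercase=True)
X_tr = vec.fit_transform(train_docs)   # learn vocabulary and IDF weights
X_te = vec.transform(test_docs)        # reuse them; do not refit on test data

print(X_tr.shape, X_te.shape)
print(sorted(vec.vocabulary_))
```

Both matrices end up with the same number of columns, one per vocabulary term, so a classifier trained on one can score the other.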
Step 4: Training the Classifier
● Define the Classifier
○ Let's create a linear SVC (Support Vector Classifier)
● Training the classifier
from sklearn.svm import LinearSVC

svc = LinearSVC()
svc.fit(X_train, train_labels)
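● Oops!
ValueError: pos_label=1 is not a valid label: array(['0', '1'], dtype='<U1')
● Fix: encode the string labels as integers
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(train_labels)  # '0'/'1' strings -> 0/1 integers
svc = LinearSVC()
svc.fit(X_train, y)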
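The label-encoding fix can be tried end to end on a few sentences. The following is a sketch under stated assumptions: the four documents and their labels are hypothetical, and the names `docs`, `str_labels`, `X_toy`, `y_enc`, and `clf` are illustrative rather than taken from the slides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC

# Hypothetical toy data; labels are strings, as read from the file.
docs = ["I loved it", "awesome movie", "I hated it", "terrible movie"]
str_labels = ["1", "1", "0", "0"]

le = LabelEncoder()
y_enc = le.fit_transform(str_labels)        # '0' -> 0, '1' -> 1
X_toy = TfidfVectorizer().fit_transform(docs)

clf = LinearSVC()
clf.fit(X_toy, y_enc)

pred = clf.predict(X_toy)
print(le.classes_)                  # original string labels, in encoded order
print(le.inverse_transform(pred))   # map integer predictions back to strings
```

Keeping the fitted `LabelEncoder` around makes it possible to translate predictions back to the original label strings.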
Step 5: Evaluation
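● 5-Fold Cross-Validation
from sklearn.model_selection import cross_val_score, train_test_split

# Assumption: X is the full TF-IDF matrix built in Step 2 (named X_train there);
# y holds the integer-encoded labels.
X = X_train
scores = cross_val_score(svc, X, y, cv=5, scoring='f1')
● Train / Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, shuffle=True)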
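The two evaluation strategies can be compared side by side. This sketch uses synthetic data from `make_classification` as a stand-in for the TF-IDF matrix and encoded labels; the `_demo` names and the data itself are assumptions for illustration, not the deck's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF matrix X and encoded labels y.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# 5-fold cross-validation: one F1 score per fold.
cv_scores = cross_val_score(LinearSVC(), X_demo, y_demo, cv=5, scoring='f1')
print('CV F1: {:.3f} +/- {:.3f}'.format(cv_scores.mean(), cv_scores.std()))

# Hold-out evaluation: fit on the training part, score on the unseen part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, shuffle=True)
clf = LinearSVC().fit(X_tr, y_tr)
holdout_f1 = f1_score(y_te, clf.predict(X_te))
print('Hold-out F1: {:.3f}'.format(holdout_f1))
```

Cross-validation gives a spread of scores across folds, while the single split gives one number that depends on `random_state`.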
Clustering
Step 1: Training the Clustering Algorithm
from sklearn.cluster import KMeans

NUM_CLUSTERS = 4
kmeans = KMeans(
    n_clusters=NUM_CLUSTERS,
    random_state=0  # fixed seed for reproducible centroids
)
kmeans.fit(X)
Step 2: Evaluating Clusters
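from sklearn.metrics import silhouette_score

labels = kmeans.labels_  # cluster index assigned to each document
# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
score = silhouette_score(X, labels)
print('Silhouette Score: {}'.format(score))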
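The silhouette score is also a common way to compare candidate values for the number of clusters. A minimal sketch, assuming synthetic blob data in place of the TF-IDF matrix; the blob centers, the range of `k`, and the `_demo` names are illustrative choices, not values from the slides.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix: three well-separated blobs.
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores_by_k = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    scores_by_k[k] = silhouette_score(X_demo, km.labels_)

best_k = max(scores_by_k, key=scores_by_k.get)
print(scores_by_k)
print('Best k by silhouette:', best_k)
```

On real text features the peak is usually much less pronounced than on clean blobs, so the score is best read as a guide rather than a verdict.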
Clusters...
● I really enjoyed the Da Vinci Code but thought I would be disappointed in the other books….
● this was the first clive cussler i've ever read, but even books like Relic, and Da Vinci code were more plausible than this.
● Brokeback Mountain was amazing, and made me cry like a bitch.
● Brokeback Mountain is an excellent movie, I love it after watching it!
● The Da Vinci Code book is just awesome.
● i liked the Da Vinci Code a lot.
● friday i stayed in & watched Mission Impossible 3 which is amazing by the way.
● I LOVED Mission Impossible 3..
Cluster topics: Da Vinci Code, Brokeback Mountain, Mission Impossible
Combining the two methods...
A simple approach: find the percentage of positive documents in each cluster.
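The per-cluster percentage can be computed directly from the cluster assignments and the polarity labels. A minimal sketch: the helper `positive_share_by_cluster` and the six example values are hypothetical, and polarity is assumed to be encoded as 0 (negative) and 1 (positive).

```python
import numpy as np

def positive_share_by_cluster(cluster_labels, polarity):
    """Return {cluster id: fraction of members with polarity 1}."""
    cluster_labels = np.asarray(cluster_labels)
    polarity = np.asarray(polarity)
    return {
        int(c): float(polarity[cluster_labels == c].mean())
        for c in np.unique(cluster_labels)
    }

# Hypothetical values: cluster ids (e.g. kmeans.labels_) and 0/1 polarity
# (e.g. encoded labels or classifier predictions) for six documents.
clusters = [0, 0, 1, 1, 1, 2]
polarity = [1, 0, 1, 1, 1, 0]
shares = positive_share_by_cluster(clusters, polarity)
for c, share in shares.items():
    print('Cluster {}: {:.0%} positive'.format(c, share))
```

A cluster with a share near 1 or 0 can be read as a mostly positive or mostly negative topic; shares near 0.5 indicate mixed opinion.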