Contents
1. Analysis Plan
– Outline and Purposes
2. Analysis of Election Results
– Data preconditioning
– Exploratory data analysis
– Modeling and Test
– Analysis & Conclusion
3. Twitter Text Analysis
– Motivation
– Tools/Packages
– Dataset
– Twitter Text Analysis
4. Challenges and Suggestions
1. Analysis Plan – Outline and Purpose
The purposes of the analysis
• Identify how Trump won and who supported him
• Analyze what Trump and Hillary mentioned on Twitter
Methods of analysis
1. Linear regression and decision tree analysis
– Model the dependent variable (Trump vote rate) on the independent variables (US county facts).
– Characterize the groups that supported Trump using decision tree analysis.
– Data: 2016 and 2012 vote results; US county facts
2. Text mining and sentiment analysis
– Analyze frequent words in the Twitter data and find associations between words.
– Classify sentiment automatically using the Naive Bayes classification method.
– Data: Twitter data from July 26th to Aug 21st
1. Modeling for Analysis of 2016 Election Results
How did Donald Trump beat Hillary Clinton?
Who supported Donald Trump?
• Linear Regression
• Decision Tree Analysis
Data Preconditioning
Download the datasets
• US 2012 election county-level results
• US 2016 election county-level results
• County facts data
Data Preconditioning – county_facts.csv
Remove the unneeded variables and rename the rest. The R code produces the county2 data set:

fips  area_name       state_abbreviation  population  under.5.y
0     United States   NA                  318857056   6.2
1000  Alabama         NA                  4849377     6.1
1001  Autauga County  AL                  55395       6
1003  Baldwin County  AL                  200111      5.6
1005  Barbour County  AL                  26887       5.7
1007  Bibb County     AL                  22506       5.3
…     …               …                   …           …
Data Preconditioning – Merge the ‘votes.csv’ data
• Select the meaningful variables in the ‘votes’ data set and rename them so that their meaning is easy to recognize.
• Merge the ‘county2’ data and the ‘vote2’ data by ‘fips’ code.
• Add a column named ‘winner’ whose value is ‘1’ if Trump’s vote rate is higher than Clinton’s and ‘0’ otherwise.
• Delete all rows with NA values in ‘data’ using ‘na.omit’.

fips  area_name       state_abbreviation.x  population  under.5.y
1001  Autauga County  AL                    55395       6
1003  Baldwin County  AL                    200111      5.6
1005  Barbour County  AL                    26887       5.7
1007  Bibb County     AL                    22506       5.3
1009  Blount County   AL                    57719       6.1
1011  Bullock County  AL                    10764       6.3
…     …               …                     …           …
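
The merge, ‘winner’ and ‘na.omit’ steps were done in R; the Python sketch below only illustrates the same logic with the standard library. The vote shares and the helper names (merge_by_fips, drop_missing, add_winner) are hypothetical, and rows with missing values are dropped before ‘winner’ is computed because a missing vote rate cannot be compared.

```python
# Hypothetical rows: the column names follow the slides, the vote shares are made up.
county2 = [
    {"fips": 1001, "area_name": "Autauga County", "population": 55395},
    {"fips": 1003, "area_name": "Baldwin County", "population": 200111},
    {"fips": 1005, "area_name": "Barbour County", "population": 26887},
]
vote2 = [
    {"fips": 1001, "Trump": 0.73, "Clinton": 0.24},
    {"fips": 1003, "Trump": 0.77, "Clinton": 0.20},
    {"fips": 1005, "Trump": None, "Clinton": 0.47},  # missing value
    {"fips": 9999, "Trump": 0.50, "Clinton": 0.45},  # no matching county
]


def merge_by_fips(left, right):
    """Inner join on 'fips': keep only counties present in both tables."""
    right_index = {row["fips"]: row for row in right}
    return [
        {**row, **right_index[row["fips"]]}
        for row in left
        if row["fips"] in right_index
    ]


def drop_missing(rows):
    """Keep only rows without missing values (the role na.omit plays in R)."""
    return [row for row in rows if all(v is not None for v in row.values())]


def add_winner(rows):
    """winner = 1 if Trump's vote rate is higher than Clinton's, otherwise 0."""
    return [{**row, "winner": int(row["Trump"] > row["Clinton"])} for row in rows]


data = add_winner(drop_missing(merge_by_fips(county2, vote2)))
for row in data:
    print(row["fips"], row["area_name"], row["winner"])
```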
Exploratory Data Analysis
Basic summary statistics for all the variables, computed with the stat_fn function.
Exploratory Data Analysis
• Trump’s vote rate is negatively related to the county’s rate of bachelor’s degrees or higher.
• Trump’s vote rate is positively related to the county’s percentage of White residents.
Plots: Y = Trump, X = Bachelor; Y = Trump, X = White
Exploratory Data Analysis
• Clinton’s vote rate is positively related to the county’s rate of bachelor’s degrees or higher.
• Clinton’s vote rate is negatively related to the county’s percentage of White residents.
Plots: Y = Clinton, X = Bachelor; Y = Clinton, X = White
Exploratory Data Analysis
• Trump’s and Romney’s vote rates are strongly correlated, as are Clinton’s and Obama’s.
• Trump’s vote rate is negatively correlated with the bachelor’s education level and with the percentage of Black residents, but positively correlated with the percentage of White residents.
• Clinton’s vote rate is positively correlated with the bachelor’s education level and with the percentage of Black residents, but negatively correlated with the percentage of White residents.
Figure: correlation chart of some representative variables
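
The signs described above are Pearson correlation coefficients. As a minimal sketch of how one such coefficient is computed, the Python below uses made-up county percentages; the numbers and variable names are illustrative only and are not taken from the election data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical county-level percentages, for illustration only.
trump = [73.0, 77.0, 52.0, 76.0, 40.0]
bachelor = [21.0, 28.0, 13.0, 10.0, 45.0]
white = [78.0, 87.0, 50.0, 77.0, 45.0]

print("Trump vs Bachelor:", round(pearson(trump, bachelor), 2))
print("Trump vs White:", round(pearson(trump, white), 2))
```

A negative value means the two rates tend to move in opposite directions across counties, a positive value that they move together.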
Linear Regression Modeling
Sampling and fitting the linear regression on the training data
• Split the data into 20% test data and 80% training data.
• Select the variables using the forward AIC method.
• Train the linear regression model on the set of variables that gives the smallest AIC value.
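
The modeling itself was done in R. As a rough Python sketch of the idea, the code below makes an 80/20 split and performs only the first step of forward selection: it fits one single-predictor regression per candidate and keeps the one with the smallest AIC. The data are synthetic, the function names are hypothetical, and the AIC is the Gaussian form n·ln(RSS/n) + 2k, which is defined up to an additive constant.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and hold out test_fraction of them as test data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def rss_simple_regression(x, y):
    """Residual sum of squares of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


def aic(rss, n, n_params):
    """Gaussian AIC up to an additive constant: n * ln(RSS / n) + 2k."""
    return n * math.log(rss / n) + 2 * n_params


def first_forward_step(rows, target, candidates):
    """Return the candidate whose single-predictor model has the smallest AIC."""
    y = [row[target] for row in rows]
    scores = {
        name: aic(rss_simple_regression([row[name] for row in rows], y), len(rows), 2)
        for name in candidates
    }
    return min(scores, key=scores.get), scores


# Synthetic data (not the election data): Trump depends mostly on Romney.
rng = random.Random(1)
rows = []
for _ in range(50):
    romney = rng.uniform(30, 80)
    bachelor = rng.uniform(5, 50)
    trump_rate = 5 + 0.95 * romney - 0.10 * bachelor + rng.gauss(0, 2)
    rows.append({"Trump": trump_rate, "Romney": romney, "Bachelor": bachelor})

train, test = train_test_split(rows)
best, scores = first_forward_step(train, "Trump", ["Romney", "Bachelor"])
print(len(train), len(test), best)
```

Full forward selection would repeat this step, adding one variable at a time to a multiple regression until no addition lowers the AIC further.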
Linear Regression Modeling – Model Validation
1. Comparing the predicted values with the actual values on the test data gives a correlation coefficient of 0.98, which shows that the model is quite accurate.
2. Model significance test (F-test p-value): the p-value is < 2.2e-16, so the model is significant.
3. Explanatory power of the model: Multiple R-squared = 0.9624 (very strong); Adjusted R-squared = 0.9622.
4. Significance tests of the X variables (Pr, ***)
– Positive coefficients: Romney, Asian, White, Income.capita
– Negative coefficients: Bachelor, household.income, under.18.y, Housing, Black, Foreign, Hawaiian, High.school, Language, Female
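
For reference, R-squared and adjusted R-squared can be computed as in the Python sketch below. The actual and predicted values are made up, and the function names and the sample and predictor counts are hypothetical; the sketch is not meant to reproduce the 0.9624 and 0.9622 figures.

```python
def r_squared(actual, predicted):
    """Share of the variance in actual that the predictions explain."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """R-squared penalized for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical vote rates (percent), for illustration only.
actual = [70.0, 55.0, 62.0, 48.0, 81.0]
predicted = [68.0, 57.0, 60.0, 50.0, 80.0]

r2 = r_squared(actual, predicted)
print(round(r2, 4), round(adjusted_r_squared(r2, n_obs=5, n_predictors=1), 4))
```

The adjusted value is always at most the plain one; the two stay close when there are many more observations than predictors.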
Linear Regression Modeling
> plot(train.lm)
Diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
Decision Tree Analysis
• White > 47.3, Bachelor degree < 27.9, Housing units < 562, Black < 41
• 31.2 < White < 47.3, Bachelor degree < 19, Hawaiian = 0, Black < 14.1
Decision Tree Analysis – Test and Validation
• Accuracy: 0.918
• 95% confidence interval: (0.8936, 0.9383)
• P-value: 1.8e-11
• This decision tree model is significant.
• It classifies and predicts the winner fairly accurately.
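
As a sketch of what these numbers measure, the Python below computes the accuracy of a classifier on a test set and an approximate 95% confidence interval for it. The labels are made up, and the interval uses the normal approximation; the slides do not say how their interval was computed, so this is not expected to reproduce (0.8936, 0.9383).

```python
import math


def accuracy(actual, predicted):
    """Fraction of test cases where the predicted winner matches the actual one."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def normal_approx_interval(p_hat, n, z=1.96):
    """Approximate 95% interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n), clipped to [0, 1]."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Hypothetical 'winner' labels (1 = Trump, 0 = Clinton) for ten test counties.
actual = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
predicted = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]

acc = accuracy(actual, predicted)
low, high = normal_approx_interval(acc, len(actual))
print(acc, round(low, 3), round(high, 3))
```

With only ten test cases the interval is very wide; a narrow interval like the one on the slide requires a much larger test set.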
Insight & Conclusion
• Trump supporters: Romney supporters, White, less educated
• Clinton supporters: Obama supporters, highly educated, people of color
2. Twitter Data Analysis
Hillary Clinton & Donald Trump
(the candidates in the 2016 US election)
• Text Mining
• Sentiment Analysis
Motivation
• The 2016 US election was the hottest issue in America this year.
• Social media plays an important role in a political campaign.
• Analyzing the two candidates’ tweets can give us information that traditional statistical analysis cannot.
Tools / Packages
• R for text mining
– twitteR
– ROAuth
– KoNLP
– plyr
– tm
– SnowballC
– ggplot2
– wordcloud
– topicmodels
– stringr
• Python for sentiment analysis
– NLTK
Datasets
• The Twitter API is a platform for accessing tweets and several of their attributes.
• R provides the “twitteR” package to retrieve and manipulate this data.
• My dataset is 400 tweets with sentiment labels, dating from July 26th to August 21st, in the period before the election.
– Hillary’s 200 tweets are from August 10th to August 21st
– Trump’s 200 tweets are from July 26th to August 10th
Text Mining
1. Calculate the frequency of term occurrences and visualize it as a plot and a word cloud
2. Find associations between some of these words
3. Build a topic model
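
The first two steps can be sketched in Python as follows. The tokenized tweets are made up, and measuring word association as the correlation between the per-document counts of two terms is an assumption of this sketch; the slides do not state which association measure was used in R.

```python
import math
from collections import Counter

# Hypothetical tokenized tweets, for illustration only.
tweets = [
    ["make", "america", "great", "again"],
    ["crooked", "media", "emails"],
    ["america", "great", "jobs"],
    ["crooked", "media", "again"],
]

# Step 1: frequency of term occurrences across all tweets.
term_freq = Counter(word for tweet in tweets for word in tweet)


def association(term_a, term_b, docs):
    """Pearson correlation between the per-document counts of two terms."""
    a = [doc.count(term_a) for doc in docs]
    b = [doc.count(term_b) for doc in docs]
    n = len(docs)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a term that never varies carries no association signal
    return cov / math.sqrt(var_a * var_b)


print(term_freq.most_common(3))
print(association("crooked", "media", tweets))
print(association("crooked", "america", tweets))
```

Terms that always appear in the same tweets score close to 1, and terms that never co-occur score close to -1.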
Pre-process the Data
1. Load and format the data
2. Clean the data
– Stem the data
– Build a corpus and do more cleaning tasks
3. Build a term document matrix (TDM)
Concepts
• A corpus is a collection of documents.
• A term document matrix (TDM) is a matrix that lists all occurrences of words in the corpus by document.
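
The pre-processing was done in R; the Python sketch below only illustrates the concepts. The cleaning rules, the tiny stop-word list and the example texts are assumptions of this sketch, and stemming is left out to keep it short.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "to", "is", "and", "of", "in"}  # tiny illustrative list


def clean(text):
    """Lowercase, drop URLs and non-letters, split into words, remove stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # drop punctuation, digits, symbols
    return [word for word in text.split() if word not in STOP_WORDS]


def term_document_matrix(corpus):
    """Rows are terms, columns are documents, cells are occurrence counts."""
    counts = [Counter(doc) for doc in corpus]
    terms = sorted(set().union(*corpus))
    matrix = [[doc_counts[term] for doc_counts in counts] for term in terms]
    return terms, matrix


# Hypothetical raw tweets; the corpus is the collection of cleaned documents.
raw_tweets = [
    "Jobs, jobs, JOBS! https://example.com",
    "The economy and jobs in America",
]
corpus = [clean(text) for text in raw_tweets]
terms, matrix = term_document_matrix(corpus)

for term, row in zip(terms, matrix):
    print(term, row)
```

Each row of the matrix shows how often one term occurs in each document, which is the input for the frequency plots, word clouds and word associations.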
Implementation in R
Text Mining – 1. WordCloud (Trump)
Text Mining – 2. Word Association (Trump)
Text Mining – 3. Topic Modeling (Trump)
Text Mining – 1. WordCloud (Clinton)
Text Mining – 2. Word Association (Clinton)
Text Mining – 3. Topic Modeling (Clinton)
Sentiment Analysis
• Sentiment analysis is a special case of text mining, generally focused on identifying opinion polarity using NLP, statistics, or machine learning methods.
• It is the process of determining whether a piece of text is positive, negative or neutral.
• Machine learning can be a good tool for this.
– There are various classification methods: the Naïve Bayes algorithm, Maximum Entropy, and SVM (support vector machine).
Sentiment Analysis – Why Python?
• NLTK (Natural Language Toolkit)
– A platform for building Python programs to work with human language data.
– Provides easy-to-use text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, etc.
Sentiment Analysis – Naïve Bayes Classifier
• Built-in module in NLTK
• Supervised learning: training and testing are required.
http://www.nltk.org/_modules/nltk/classify/naivebayes.html
Sentiment Analysis – Train & Test
Sentiment Analysis – How it works
1. Define functions
– Get features from the data and save them as a vector
– Extract features from the feature vector; the result looks like this:
'contains(hi)': False,
'contains(crooked)': True
2. Get the feature list from the train and test data
3. Train and test the classifier
4. Classify the unlabeled tweets
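
The four steps above can be sketched in plain Python as follows. This is not the NLTK code used in the project: TinyNaiveBayes is a hypothetical stand-in for NLTK's Naive Bayes classifier, written here with add-one smoothing so that the example runs without any dependency, and the labeled tweets are made up.

```python
import math
from collections import Counter, defaultdict


def get_word_features(labeled_tweets):
    """Step 2: the feature list is the vocabulary of the labeled tweets."""
    return sorted({word for words, _ in labeled_tweets for word in words})


def extract_features(words, word_features):
    """Step 1: map a tweet to {'contains(word)': True/False} for every feature."""
    present = set(words)
    return {f"contains({w})": (w in present) for w in word_features}


class TinyNaiveBayes:
    """Naive Bayes over boolean features with add-one smoothing (illustrative)."""

    def fit(self, featuresets):
        self.label_counts = Counter(label for _, label in featuresets)
        self.true_counts = defaultdict(Counter)
        for feats, label in featuresets:
            for name, value in feats.items():
                if value:
                    self.true_counts[label][name] += 1
        self.feature_names = sorted({n for feats, _ in featuresets for n in feats})
        return self

    def classify(self, feats):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.label_counts.items():
            score = math.log(n_label / total)  # log prior
            for name in self.feature_names:
                p_true = (self.true_counts[label][name] + 1) / (n_label + 2)
                score += math.log(p_true if feats.get(name, False) else 1 - p_true)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled tweets (already tokenized), for illustration only.
train = [
    (["great", "rally", "tonight"], "positive"),
    (["thank", "you", "great"], "positive"),
    (["crooked", "failing", "bad"], "negative"),
    (["bad", "deal", "disaster"], "negative"),
]
word_features = get_word_features(train)
train_sets = [(extract_features(w, word_features), label) for w, label in train]

# Step 3: train the classifier; step 4: classify unlabeled tweets.
classifier = TinyNaiveBayes().fit(train_sets)
print(classifier.classify(extract_features(["great", "crowd"], word_features)))
print(classifier.classify(extract_features(["bad", "night"], word_features)))
```

Testing works the same way as step 4: classify held-out labeled tweets and compare the predictions with their labels.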
Challenges and Suggestions
• Because the sentiment polarity had to be labeled manually, the dataset was small, which led to low classification accuracy.
• I had some technical issues: there were encoding/decoding problems in both R and Python, so I missed the chance to try other classification methods supported in NLTK, such as the maximum entropy classifier or SVM.
THANK YOU
R라딘

Analysis of 2016 US Election Twitter Data and County Results

  • 1.
  • 2. Content s 1. Analysis Plan 3. Twitter Text Analysis 2. Analysis of Election Results 4. Challenges Suggestion Outline and Purposes Tools/Packages Motivation Exploratory data analysis Data preconditioning Modeling and Test Dataset Twitter Text Analysis Analysis & Conclusion
  • 3. 1. Analysis Plan – Outline and Purpose The purposes of Analysis Identify How Trump win and who support him Analyze what Trump and Hillary mention in Twitter Method of Analysis 1. Linear Regression and Decision Tree analysis 2. Text Mining and Sentimental Analysis. - Modeling dependent variable =Trump vote rates with independent variables = US County facts. - Classify the characteristics of group who support Trump by Decision Tree analysis. - Analyze frequent words in Twitter data and figure out word association each other. - Auto sentimental classification using Naiive Bayes Classification method -2016 & 2012 votes results Data - US County stats facts Data Twitter Data from July 26th to Aug 21st
  • 4. 1. Modeling for Analysis of 2016 Election results How Donald Trump win Hillary Clinton ? Who Support Donald Trump? Linear Regression Decision Tree Analysis
  • 5. Data Preconditioning US 2012 election county-level results US 2016 election county-level results County Facts data Download the datasets 01
  • 6. Removing useless variables and rename the remainders. fips area_name state_abb reviation populati on under. 5.y 0 United States NA 318857056 6.2 1000 Alabama NA 4849377 6.1 1001 Autauga County AL 55395 6 1003 Baldwin County AL 200111 5.6 1005 Barbour County AL 26887 5.7 1007 Bibb County AL 22506 5.3 … … … … … R_Code County2_Data_sets Data Preconditioning County_facts .csv
  • 7. • Also we can select some meaningful variables in ‘votes’ data set and rename them so that we can easily recognize what it means. • Merge ‘county2’ data and ‘vote2’ data by ‘fips’ code . • Add column named ‘winner’ which indicate if Trump’s vote rate is bigger than Clinton’s, the value is ‘1’ otherwise ‘0’. • Delete all the NA value in ‘data’ using ‘na.omit’. Merge the ‘votes.csv’Data Preconditioning fips area_name state_abbre viation.x population under.5.y 1001 Autauga Count y AL 55395 6 1003 Baldwin County AL 200111 5.6 1005 Barbour County AL 26887 5.7 1007 Bibb County AL 22506 5.3 1009 Blount County AL 57719 6.1 1011 Bullock County AL 10764 6.3 … … … … …
  • 8. Exploratory Data Analysis Showing the basic statistical values of all the variables using stat_fn function.
  • 9. 1. Analysis Plan – 데이터 탐색적 자료분석Exploratory Data Analysis vs vs
  • 10. 1. Analysis Plan – 데이터 탐색적 자료분석 • The relationship between Trump vote rates and Bachelor's degree or higher rates in county is negative • The relationship between Trump vote rates and White people percents in county is positive Y=Trump ,X= Bachelor Y=Trump ,X= White Exploratory Data Analysis
  • 11. • The relationship between Clinton vote rates and Bachelor's degree or higher rates in county is positive • The relationship between Clinton vote rates and White people percents in county is negative 1. Analysis Plan – 데이터 탐색적 자료분석 Y=Clinton ,X= Bachelor Y=Clinton ,X= White Exploratory Data Analysis
  • 12. 1. Analysis Plan – 데이터 탐색적 자료분석 • Trump and Romney vote rates have strong correlation and Clinton and Obama have strong correlation. • Trump with Bachelor education level have negative correlation and Black people percents also have negative but with White , Trump has positive correlation. • Clinton with Bachelor education level have pasitive correlation and Black people percents also have pasitive but with White , Clinton has negative correlation. Exploratory Data Analysis Correlation Visualization chart of some representative variables
  • 13. Linear Regression Modeling • Sampling the test data 20% and training data 80%. • Select the variables using Forward AIC method. • Train the linear regression model inputting the variables selected with the smallest AIC value. Sampling and modeling the Linear regression with training data
  • 14. 1. Analysis Plan – Linear Regression 모델검증 1. Test data의 predicted value와 실 제값을 비교했을 때 correlation coefficient값이 0.98로 상당히 정확함을 알 수 있다. 2. 모델의 유의성 검정 : F-검정 p값 - p-value: < 2.2e-16 이므로 모델 이 유의하다. 3. 모델의 설명력 Multiple R-squared= 0.9624 : very strong. Adjusted R-squared: 0.9622 4. X변수들의 유의성 검정 Pr *** -positive coefficients : Romney, Asian, White, Income.capita -Negative coefficients : Bachelor, household.income, under.18.y, Housing, Black, Foreign, Hawaiian, High.school, Language, Female Linear Regression Modeling
  • 15. 1. Analysis Plan – 데이터 탐색적 자료분석 >plot(train.lm) Residuals vs Fitted Normal Q-Q Scale-Location Residuals vs Leverage Linear Regression Modeling
  • 16. Decision Tree Analysis White>47.3, Bachelor degree<27.9, Housing units <562, Black <41 31.2<White<47.3, Bachelor degree<19, Hawaian=0, Black <14.1
  • 17. 1. Analysis Plan – 데이터 탐색적 자료분석 • Accuracy is 0.918 • Confidence Interval of 95% is (0.8936, 0.9383) • P-Value is 1.8*e^(-11) Decision Tree Analysis – Test and Validation • This Decision Tree model is significant. • It classifies and predicts the winner relatively precisely .
  • 18. 1. Analysis Plan – 데이터 탐색적 자료분석1. Analysis Plan – 데이터 탐색적 자료분석Insight & Conclusion Romney Supporter White Person Obama Supporter Highly Educated Colored races Low-educated
  • 19. 2. Twitter Data Analysis Hillary Clinton & Donald Trump (who were the candidates of 2016 US Election) Text Mining Sentiment Analysis
  • 20. • 2016 US election was the hottest issue in America this year • Social Media plays an important role in a political campaign. • Analyzing tweets of two candidates can give us more information that traditional statistical analysis cannot do. 1. Analysis Plan – 데이터 탐색적 자료분석1. Analysis Plan – 데이터 탐색적 자료분석Motivation
  • 21. • R for text mining – twitteR – ROAuth – KoNLP – Plyr – tm • Python for sentiment analysis – NLTK 21 – SnowballC – Ggplot2 – Wordcloud – Topicmodels – stringr 1. Analysis Plan – 데이터 탐색적 자료분석1. Analysis Plan – 데이터 탐색적 자료분석Tools / Packages
  • 22. • Twitter API is a platform where you can interact with its data(tweets) and several attributes about tweets. • R provides the package “twitteR” to get and manipulate data. • My dataset is 400 tweets dating from July 26th to August 21st , the period before election with sentiment labels – Hillary’s 200 tweets are from August 10th to August 21st – Trump’s 200 are from July 26th to August 10th 1. Analysis Plan – 데이터 탐색적 자료분석1. Analysis Plan – 데이터 탐색적 자료분석Datasets
  • 23. 1. Calculate the frequency of term occurrences and visualize plot and word cloud 2. Find associations of some of these words 3. Build a topic model 1. Analysis Plan – 데이터 탐색적 자료분석1. Analysis Plan – 데이터 탐색적 자료분석Text Mining
  • 24. 1. Load and format the data 2. Clean the data – Stem the data – build a corpus and do more cleaning tasks 3. build a term document matrix(TDM) Concepts • Corpus is a collection of documents • Term document matrix (TDM) is a matrix that lists all occurrences of words in the corpus by documents 1. Analysis Plan – 데이터 탐색적 자료분석1. Analysis Plan – 데이터 탐색적 자료분석Pre-process the Data
  • 25. Implementation in R
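The pre-processing steps listed on slide 24 (clean the text, then build a term-document matrix) can be sketched in a few lines. The slides do this in R with tm; what follows is only a stand-in in plain Python with no external packages. The function names (`clean`, `build_tdm`, `term_frequencies`), the tiny stop-word list, and the sample tweets are all made up for illustration, and stemming is left out.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; tm ships a much longer one).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for"}


def clean(text):
    """Lower-case the text, strip URLs, digits and punctuation, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # remove digits and punctuation
    return [word for word in text.split() if word not in STOPWORDS]


def build_tdm(docs):
    """Return (terms, matrix): matrix[i][j] counts term i in document j."""
    counts = [Counter(clean(doc)) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[count[term] for count in counts] for term in terms]
    return terms, matrix


def term_frequencies(terms, matrix):
    """Total occurrences of each term across the whole corpus."""
    return {term: sum(row) for term, row in zip(terms, matrix)}


# Made-up sample tweets, not taken from the dataset in the slides.
docs = [
    "Great rally in Ohio today! https://example.com 2016",
    "Thank you Ohio",
    "Great crowd, great rally",
]
terms, matrix = build_tdm(docs)
freq = term_frequencies(terms, matrix)
print(sorted(freq.items(), key=lambda item: -item[1]))
```

As in the TDM described in the notes, terms are rows and documents are columns; the row sums are the term frequencies that feed the frequency plot and the word cloud.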
  • 26. Text Mining – 1. WordCloud (Trump)
  • 27. Text Mining – 2. Word Association (Trump)
  • 28. Text Mining – 3. Topic Modeling (Trump)
  • 29. Text Mining – 1. WordCloud (Clinton)
  • 30. Text Mining – 2. Word Association (Clinton)
  • 31. Text Mining – 3. Topic Modeling (Clinton)
  • 33. Sentiment Analysis • Sentiment analysis is a special case of text mining, generally focused on identifying opinion polarity using NLP, statistics, or machine learning methods. • It is the process of determining whether a piece of text is positive, negative, or neutral. • Machine learning can be a good tool for this – There are various classification methods: Naïve Bayes, Maximum Entropy, SVM (support vector machine)
  • 34. Sentiment Analysis – Why Python? • NLTK (Natural Language Toolkit) – a platform for building Python programs to work with human language data – provides easy-to-use text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, etc.
  • 35. Sentiment Analysis – Naïve Bayes Classifier • Built-in module in NLTK • Supervised learning: training and testing are required. http://www.nltk.org/_modules/nltk/classify/naivebayes.html
  • 36. Sentiment Analysis – Train & Test
  • 37. Sentiment Analysis – How it works 1. Define functions • Get features from the data and save them as a vector • Extract features from the feature vector; the result looks like this: 'contains(hi)': False, 'contains(crooked)': True
  • 38. Sentiment Analysis – How it works 2. Get the feature list from the train and test data
  • 39. Sentiment Analysis – How it works 3. Train and test the classifier
  • 40. Sentiment Analysis – How it works 4. Classify the unlabeled tweets
  • 41. Sentiment Analysis – How it works 4. Classify the unlabeled tweets
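Steps 1 to 4 of "How it works" can be sketched end to end as follows. The slides use NLTK's built-in Naïve Bayes classifier; this sketch instead hand-rolls a small Bernoulli-style Naïve Bayes with add-one smoothing in plain Python, so it is not NLTK's implementation and its probabilities will differ from NLTK's. The helper names (`extract_features`, `train`, `classify`, `accuracy`) and the labeled example tweets are hypothetical; only the `contains(word)` feature format comes from the slides.

```python
import math
from collections import Counter, defaultdict


def extract_features(words, vocabulary):
    """Step 1: map a tokenized tweet to {'contains(word)': bool} features."""
    present = set(words)
    return {f"contains({word})": (word in present) for word in vocabulary}


def train(labeled):
    """Step 3a: count labels and, per label, how often each feature is True."""
    label_counts = Counter(label for _, label in labeled)
    true_counts = defaultdict(Counter)
    for features, label in labeled:
        for name, value in features.items():
            if value:
                true_counts[label][name] += 1
    return label_counts, true_counts


def classify(model, features):
    """Step 4: return the label with the highest log posterior."""
    label_counts, true_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for name, value in features.items():
            # Add-one (Laplace) smoothing keeps probabilities away from 0 and 1.
            p_true = (true_counts[label][name] + 1) / (n + 2)
            score += math.log(p_true if value else 1.0 - p_true)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, labeled):
    """Step 3b: fraction of labeled examples the model classifies correctly."""
    hits = sum(classify(model, features) == label for features, label in labeled)
    return hits / len(labeled)


# Made-up, manually labeled examples (not the tweets used in the slides).
raw = [
    (["great", "win"], "positive"),
    (["great", "thank"], "positive"),
    (["bad", "wrong"], "negative"),
    (["terrible", "bad"], "negative"),
]
# Step 2: build the feature list (vocabulary) and the labeled feature sets.
vocabulary = sorted({word for words, _ in raw for word in words})
train_set = [(extract_features(words, vocabulary), label) for words, label in raw]

model = train(train_set)
print(accuracy(model, train_set))
print(classify(model, extract_features(["great", "day"], vocabulary)))
```

In practice the accuracy should be measured on a held-out test set rather than on the training examples, as the "Train & Test" slide implies; the training set is reused here only to keep the sketch short.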
  • 42. Challenges and Suggestions • Because the sentiment polarity had to be labeled manually, the amount of data was small, which lowered the classification accuracy. • I had some technical issues: there were encoding/decoding problems in both R and Python, so I missed the chance to try other classification methods supported in NLTK, such as the maximum entropy classifier or SVM.

Editor's notes

  1. Remove the variables that are not helpful for describing the people who support Trump, and rename the remaining ones so that their meaning is easy to recognize.
  3. The plot in the upper left shows the residual errors plotted versus their fitted values. The residuals should be randomly distributed around the horizontal line representing a residual error of zero; that is, there should not be a distinct trend in the distribution of points. The plot in the lower left is a standard Q-Q plot, which should suggest that the residual errors are normally distributed. The scale-location plot in the upper right shows the square root of the standardized residuals (sort of a square root of relative error) as a function of the fitted values. Again, there should be no obvious trend in this plot. Finally, the plot in the lower right shows each point's leverage, which is a measure of its importance in determining the regression result. Superimposed on the plot are contour lines for the Cook’s distance, which is another measure of the importance of each observation to the regression. A smaller distance means that removing the observation has little effect on the regression results. Distances larger than 1 are suspicious and suggest the presence of a possible outlier or a poor model.
  4. Hi, everyone. My name is Jiyeon, and I worked with Sujin as a team on this final project. Our topic is "2016 US election". My part is mostly text analysis: I analyzed the Twitter data of the two presidential candidates, Hillary Clinton and Donald Trump. What worries me is that the code I wrote could be a little hard for you to follow since, technically, we didn’t learn text mining in this class. I am going to walk through the code, but I am not going to explain every single detail.
  5. As you know, the 2016 US election was a big issue for the past few months, and it fascinated lots of data scientists around the world. They had already done a lot of work, so it was relatively easy to get a dataset. I'll skip these sections to save time; they will be mentioned later in this presentation.
  6. • tm – the text mining package • SnowballC – required for stemming • ggplot2 – plotting capabilities • wordcloud – which is self-explanatory
  7. Getting the data was the easiest part of this project. The Twitter APIs provide a platform where you can interact with Twitter data, the so-called tweets, and several attributes of those tweets. You can also use the R package “twitteR” to retrieve tweets from someone's timeline or by things like hashtags. About the data: I used a data set with 400 tweets dating from July 26th to August 21st (Hillary’s 200 tweets are from August 10th to August 21st, and Trump’s 200 are from July 26th to August 10th). I had to label the sentiment manually in order to conduct sentiment analysis.
  8. To be specific, I did such things as calculating… Finding… Building…
  9. These are the steps I followed to refine the data. 1. The dataset contains not only the tweets themselves but also several attributes of the tweets, such as date, ID, etc. I only used the tweets in the text column here. 2. The next thing I did was clean the data, which includes removing numbers, URLs, and punctuation and converting to lower case. Data cleaning can be done both before and after building a corpus. A corpus makes it easy to deal with data in text mining. After building it, I did more cleaning tasks such as removing stop words and whitespace and stemming the data. 3. Finally, I built a term-document matrix. A TDM is a matrix that lists all occurrences of words in the corpus by document. In the TDM, the terms are represented by rows and the documents by columns. It is a way of converting a corpus of text into a mathematical object, which is required for quantitative text analysis. It is needed to calculate the frequency of occurrences of each word in the corpus.
  11. And here are the results. This is a visualization of the terms that occurred frequently in Trump’s tweets. I plotted the result and created a word cloud. The thing that catches my eye is the word “crook”. It seems that Trump intentionally used the word “crook” to put Hillary down.
  12. We can also check the correlations between a given term and the other terms that occur in the corpus. In this context, correlation is a quantitative measure of the co-occurrence of words in multiple documents. I wanted to know Trump’s opinion of Hillary and Obama, so I ran the findAssocs() function in the tm package at a correlation limit of 20%. 1. "hillary" is frequently used together with the word "crook"; my assumption is that he attacked her over the email scandal. 2. The relationship between "obama" and such terms as worst, depress, leadership, terrible, and wrong: I wanted to know the context in which Trump used these words. This result actually reflects reality well. Trump insulted Hillary a lot and attacked her politically, for example over her email scandal.
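The association measure described in this note (the correlation between a term's per-document counts and those of every other term, filtered by a lower limit) can be sketched as below. This is a plain-Python stand-in written in the spirit of tm's findAssocs(), not a port of it; the function names `pearson` and `find_assocs` and the toy term-document matrix are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length count vectors (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def find_assocs(tdm, term, corlimit):
    """Terms whose correlation with `term` is at least `corlimit`, strongest first.

    `tdm` maps each term to its list of per-document counts (one TDM row).
    """
    target = tdm[term]
    scored = [
        (other, round(pearson(target, row), 2))
        for other, row in tdm.items()
        if other != term
    ]
    kept = [pair for pair in scored if pair[1] >= corlimit]
    return sorted(kept, key=lambda pair: -pair[1])


# Toy TDM with four documents; the terms are placeholders, not real results.
tdm = {
    "alpha": [1, 0, 1, 0],
    "beta": [1, 0, 1, 0],   # always appears together with "alpha"
    "gamma": [0, 1, 0, 1],  # never appears together with "alpha"
    "delta": [1, 1, 0, 0],  # unrelated to "alpha"
}
print(find_assocs(tdm, "alpha", 0.2))
```

A correlation limit of 0.2 corresponds to the 20% threshold mentioned in the note: terms that co-occur with the target less consistently than that are dropped from the result.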
  13. This is the result.
  15. Now I am going to classify the tweets of the two candidates into two (and sometimes three) classes: positive or negative (neutral is the optional third class). This is known as sentiment analysis, which is
  16. Sentiment Analysis is … I wanted to build a senti
  17. The feature vector is the most important concept in implementing a classifier. A good feature vector directly determines how successful your classifier will be. The feature vector is used to build a model that the classifier learns from the training data and that can then be used to classify previously unseen data. To explain this, I will take the simple example of "gender identification". Male and female names have some distinctive characteristics. Names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male. So you can build a classifier based on this model, using the ending letter of the names as a feature. Similarly, in tweets, we can use the presence or absence of the words that appear in a tweet as features. In the training data, consisting of positive, negative and neutral tweets, we can split each tweet into words and add each word to the feature vector. Some of the words might not have any say in indicating the sentiment of a tweet, and hence we can filter them out. Adding individual (single) words to the feature vector is referred to as the 'unigrams' approach. Some other feature vectors also add 'bigrams' in combination with 'unigrams'. For example, 'not good' (a bigram) completely changes the sentiment compared to adding 'not' and 'good' individually. Here, for simplicity, we will only consider unigrams. Before adding the words to the feature vector, we need to preprocess them in order to filter them; otherwise, the feature vector will explode. The code extracts the tweets and labels from the CSV file, processes them as outlined above, obtains a feature vector, and stores it in a variable called "tweets".
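The two kinds of feature described in this note can be illustrated with a short sketch: a last-letter feature for the name example, and presence features for a tweet with bigrams as an option. The function names `last_letter_feature` and `ngram_features` are hypothetical and are not taken from the code in the slides.

```python
def last_letter_feature(name):
    """Feature for the gender-identification example: the name's final letter."""
    return {"last_letter": name[-1].lower()}


def ngram_features(words, use_bigrams=False):
    """Presence features for a tokenized tweet: unigrams, optionally bigrams."""
    features = {f"contains({word})": True for word in words}
    if use_bigrams:
        # Pair each word with its successor so 'not good' becomes one feature.
        for first, second in zip(words, words[1:]):
            features[f"contains({first} {second})"] = True
    return features


print(last_letter_feature("Maria"))
print(ngram_features(["not", "good"]))
print(ngram_features(["not", "good"], use_bigrams=True))
```

With unigrams only, 'not good' contributes the same 'good' feature as a positive tweet would; the bigram feature is what lets a classifier tell the two apart, at the cost of a much larger feature vector.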