Using Linear Regression and Decision Tree Analysis to Identify Who Supported Trump in the 2016 US Election
- Course name: Principles and Practice in Data Mining
- Semester: Autumn 2016
- Professor: Yuran SEO
- Sungkyunkwan University
- Department: Data Science
- Name: 정수진, 박지연
Analysis of 2016 US Election Twitter Data and County Results
2. Contents
1. Analysis Plan
- Outline and Purposes
2. Analysis of Election Results
- Exploratory Data Analysis
- Data Preconditioning
- Modeling and Test
3. Twitter Text Analysis
- Motivation
- Tools/Packages
- Dataset
- Analysis & Conclusion
4. Challenges and Suggestions
3. 1. Analysis Plan – Outline and Purposes
Purposes of the analysis
- Identify how Trump won and who supported him
- Analyze what Trump and Hillary mentioned on Twitter
Methods of analysis
1. Linear regression and decision tree analysis
- Model Trump's vote rate (dependent variable) on US county facts (independent variables).
- Classify the characteristics of the groups who supported Trump with decision tree analysis.
2. Text mining and sentiment analysis
- Analyze frequent words in the Twitter data and find associations between them.
- Automatically classify sentiment using the Naïve Bayes classification method.
Data
- 2016 & 2012 vote results data
- US county stats facts data
- Twitter data from July 26th to August 21st
4. 1. Modeling for Analysis of the 2016 Election Results
- How did Donald Trump beat Hillary Clinton? → Linear regression
- Who supported Donald Trump? → Decision tree analysis
5. Data Preconditioning
Download the datasets:
- US 2012 election county-level results
- US 2016 election county-level results
- County facts data
6. Remove useless variables and rename the remainder.
fips   area_name        state_abbreviation   population   under.5.y
0      United States    NA                   318857056    6.2
1000   Alabama          NA                   4849377      6.1
1001   Autauga County   AL                   55395        6
1003   Baldwin County   AL                   200111       5.6
1005   Barbour County   AL                   26887        5.7
1007   Bibb County      AL                   22506        5.3
…      …                …                    …            …
R code: county2 data set
Data Preconditioning – county_facts.csv
7. Data Preconditioning – Merge the 'votes.csv'
• We can also select some meaningful variables in the 'votes' data set and rename them so that it is easy to recognize what they mean.
• Merge the 'county2' and 'vote2' data by 'fips' code.
• Add a column named 'winner' whose value is '1' if Trump's vote rate is higher than Clinton's and '0' otherwise.
• Delete all NA values in 'data' using 'na.omit'.
fips   area_name        state_abbreviation.x   population   under.5.y
1001   Autauga County   AL                     55395        6
1003   Baldwin County   AL                     200111       5.6
1005   Barbour County   AL                     26887        5.7
1007   Bibb County      AL                     22506        5.3
1009   Blount County    AL                     57719        6.1
1011   Bullock County   AL                     10764        6.3
…      …                …                      …            …
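The merge, 'winner' column, and NA removal described above were done in R; as a minimal illustrative sketch of the same logic, here is a pandas version (the data frames and vote shares below are made-up toy values, not the real data):

```python
import pandas as pd

# Toy stand-ins for the real 'county2' and 'vote2' data frames
county2 = pd.DataFrame({
    "fips": [1001, 1003, 1005],
    "area_name": ["Autauga County", "Baldwin County", "Barbour County"],
    "population": [55395, 200111, 26887],
})
vote2 = pd.DataFrame({
    "fips": [1001, 1003, 1005],
    "Trump": [0.73, 0.77, 0.45],      # invented vote shares
    "Clinton": [0.24, 0.19, 0.52],
})

# Merge by 'fips', add a 0/1 'winner' column, drop rows with NA values
data = county2.merge(vote2, on="fips")
data["winner"] = (data["Trump"] > data["Clinton"]).astype(int)
data = data.dropna()
```

In R the same three steps were `merge()`, an `ifelse()` comparison, and `na.omit()`.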
10. 1. Analysis Plan – Exploratory Data Analysis
• The relationship between Trump's vote rate and the percentage of bachelor's degrees or higher in a county is negative.
• The relationship between Trump's vote rate and the percentage of White people in a county is positive.
(Scatter plots: Y = Trump, X = Bachelor; Y = Trump, X = White)
11. 1. Analysis Plan – Exploratory Data Analysis
• The relationship between Clinton's vote rate and the percentage of bachelor's degrees or higher in a county is positive.
• The relationship between Clinton's vote rate and the percentage of White people in a county is negative.
(Scatter plots: Y = Clinton, X = Bachelor; Y = Clinton, X = White)
12. 1. Analysis Plan – Exploratory Data Analysis
• Trump's and Romney's vote rates are strongly correlated, as are Clinton's and Obama's.
• Trump's vote rate is negatively correlated with bachelor's education level and with the percentage of Black people, but positively correlated with the percentage of White people.
• Clinton's vote rate is positively correlated with bachelor's education level and with the percentage of Black people, but negatively correlated with the percentage of White people.
Correlation visualization chart of some representative variables
13. Linear Regression Modeling
• Sample 20% of the data as test data and 80% as training data.
• Select the variables using the forward AIC method.
• Train the linear regression model on the variables selected with the smallest AIC value.
Sampling and modeling the linear regression with the training data
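The forward-AIC variable selection above was done in R (e.g. with `step()`); as an illustration of the idea, here is a from-scratch Python sketch that greedily adds the variable that most lowers AIC until no addition improves it (not the project's actual code):

```python
import numpy as np

def aic(y, yhat, k):
    # Gaussian AIC up to an additive constant: n*log(RSS/n) + 2k
    n = len(y)
    rss = float(np.sum((y - yhat) ** 2))
    return n * np.log(rss / n) + 2 * k

def forward_select(X, y, names):
    """Greedy forward selection: at each step fit OLS with one more
    candidate variable and keep the addition with the lowest AIC;
    stop when no candidate improves the current AIC."""
    chosen, remaining = [], list(range(X.shape[1]))
    best = np.inf
    while remaining:
        scored = []
        for j in remaining:
            A = np.column_stack([np.ones(len(y)), X[:, chosen + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            scored.append((aic(y, A @ beta, A.shape[1]), j))
        score, j = min(scored)
        if score >= best:
            break
        best = score
        chosen.append(j)
        remaining.remove(j)
    return [names[j] for j in chosen]
```

On synthetic data where only two of four predictors matter, the loop picks the strongest predictor first, then the second.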
14. Linear Regression Modeling – Model Validation
1. Comparing the predicted values on the test data with the actual values gives a correlation coefficient of 0.98, so the model is quite accurate.
2. Model significance (F-test): the p-value is < 2.2e-16, so the model is significant.
3. Explanatory power: Multiple R-squared = 0.9624 (very strong); Adjusted R-squared = 0.9622.
4. Significance of the X variables (Pr, ***):
- Positive coefficients: Romney, Asian, White, Income.capita
- Negative coefficients: Bachelor, household.income, under.18.y, Housing, Black, Foreign, Hawaiian, High.school, Language, Female
15. Linear Regression Modeling – Diagnostic Plots
> plot(train.lm)
Residuals vs Fitted
Normal Q-Q
Scale-Location
Residuals vs Leverage
16. Decision Tree Analysis
- White > 47.3, bachelor's degree < 27.9, housing units < 562, Black < 41
- 31.2 < White < 47.3, bachelor's degree < 19, Hawaiian = 0, Black < 14.1
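The two printed paths read as simple if/then rules. As an illustration, here is a hand-coded Python sketch of those two leaves (thresholds copied from the slide; the field names, and the assumption that both leaves predict a Trump win, are mine, since the real model was a fitted R tree):

```python
def tree_predict(county):
    """Hand-coded version of the two tree paths on the slide.
    Returns 1 for the (assumed) Trump-leaning leaves, 0 otherwise."""
    # Leaf 1: White > 47.3, Bachelor < 27.9, Housing units < 562, Black < 41
    if (county["White"] > 47.3 and county["Bachelor"] < 27.9
            and county["Housing"] < 562 and county["Black"] < 41):
        return 1
    # Leaf 2: 31.2 < White <= 47.3, Bachelor < 19, Hawaiian == 0, Black < 14.1
    if (31.2 < county["White"] <= 47.3 and county["Bachelor"] < 19
            and county["Hawaiian"] == 0 and county["Black"] < 14.1):
        return 1
    return 0  # every other path falls through
```

A decision tree is exactly this kind of nested threshold test; the fitting procedure just chooses the split variables and cutoffs automatically.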
17. Decision Tree Analysis – Test and Validation
• Accuracy is 0.918.
• The 95% confidence interval is (0.8936, 0.9383).
• The p-value is 1.8e-11.
• This decision tree model is significant.
• It classifies and predicts the winner relatively precisely.
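For reference, the interval on the slide comes from R's `binom.test` (an exact Clopper-Pearson interval); a normal-approximation sketch in Python gives a similar range. The test-set size used below is a guess for illustration, not a number from the slides:

```python
import math

def accuracy_ci(p_hat, n, z=1.96):
    # Normal-approximation 95% CI for a classification accuracy.
    # (R's binom.test, used for the slide's numbers, computes the
    # exact Clopper-Pearson interval instead.)
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half
```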
18. Insight & Conclusion
- Trump supporters: Romney supporters, White people, the less educated
- Clinton supporters: Obama supporters, the highly educated, people of color
19. 2. Twitter Data Analysis
Hillary Clinton & Donald Trump
(the candidates of the 2016 US Election)
Text Mining
Sentiment Analysis
20. Motivation
• The 2016 US election was the hottest issue in America this year.
• Social media plays an important role in a political campaign.
• Analyzing the two candidates' tweets can give us information that traditional statistical analysis cannot.
21. Tools / Packages
• R for text mining
- twitteR
- ROAuth
- KoNLP
- plyr
- tm
- SnowballC
- ggplot2
- wordcloud
- topicmodels
- stringr
• Python for sentiment analysis
- NLTK
22. Datasets
• The Twitter API is a platform where you can interact with Twitter's data (tweets) and several attributes about tweets.
• R provides the package "twitteR" to retrieve and manipulate the data.
• My dataset is 400 tweets, dated July 26th to August 21st (the period before the election), with sentiment labels:
- Hillary's 200 tweets are from August 10th to August 21st.
- Trump's 200 are from July 26th to August 10th.
23. Text Mining
1. Calculate the frequency of term occurrences and visualize it with a plot and a word cloud.
2. Find associations among some of these words.
3. Build a topic model.
24. Pre-process the Data
1. Load and format the data.
2. Clean the data:
- Stem the data.
- Build a corpus and do more cleaning tasks.
3. Build a term-document matrix (TDM).
Concepts
• A corpus is a collection of documents.
• A term-document matrix (TDM) is a matrix that lists all occurrences of words in the corpus by document.
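The project built its corpus and TDM with R's tm package; the same idea can be sketched in plain Python (the cleaning rules and stopword list here are abbreviated illustrations, not the project's actual pipeline):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "to", "and", "is", "of", "in", "for"}

def clean(tweet):
    # Mirror the slide's cleaning steps: drop URLs, numbers, and
    # punctuation, lowercase, remove stopwords (stemming omitted here)
    tweet = re.sub(r"http\S+", " ", tweet.lower())
    tweet = re.sub(r"[^a-z\s]", " ", tweet)
    return [w for w in tweet.split() if w not in STOPWORDS]

def term_document_matrix(tweets):
    # Rows = terms, columns = documents, entries = occurrence counts
    counts = [Counter(clean(t)) for t in tweets]
    terms = sorted(set().union(*counts))
    return terms, [[c[t] for c in counts] for t in terms]
```

This is the "mathematical object" the slides mention: once the corpus is a matrix, term frequencies are just row sums.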
25. Implementation in R
26. Text Mining – 1. WordCloud (Trump)
27. Text Mining – 2. Word Association (Trump)
28. Text Mining – 3. Topic Modeling (Trump)
29. Text Mining – 1. WordCloud (Clinton)
30. Text Mining – 2. Word Association (Clinton)
31. Text Mining – 3. Topic Modeling (Clinton)
33. Sentiment Analysis
• Sentiment analysis is a special case of text mining, generally focused on identifying opinion polarity using NLP, statistics, or machine learning methods.
• It is the process of determining whether a piece of text is positive, negative, or neutral.
• Machine learning is a good tool for this; there are various classification methods: the Naïve Bayes algorithm, maximum entropy, and SVM (support vector machine).
34. Sentiment Analysis – Why Python?
• NLTK (Natural Language Toolkit)
- A platform for building Python programs to work with human language data.
- Provides easy-to-use text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, plus wrappers for industrial-strength NLP libraries.
35. Sentiment Analysis – Naïve Bayes Classifier
• A built-in module in NLTK.
• Supervised learning: training and testing are required.
http://www.nltk.org/_modules/nltk/classify/naivebayes.html
36. Sentiment Analysis – Train & Test
37. Sentiment Analysis – How it works
1. Define functions
- Get features from the data and save them as a feature vector.
- Extract features from the feature vector; the result looks like this:
'contains(hi)': False,
'contains(crooked)': True
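The feature extractor in step 1 can be sketched like this; the `contains(w)` key format matches the NLTK convention shown above, while the function name and vocabulary are illustrative:

```python
def word_features(tweet_words, vocabulary):
    # One boolean feature per vocabulary word: does the tweet contain it?
    words = set(tweet_words)
    return {f"contains({w})": (w in words) for w in vocabulary}
```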
38. Sentiment Analysis – How it works
2. Get the feature list from the train and test data.
39. Sentiment Analysis – How it works
3. Train and test the classifier.
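The project trained NLTK's `NaiveBayesClassifier` on these features. As a from-scratch illustration of what that classifier learns, here is a minimal presence/absence Naïve Bayes with Laplace smoothing (a sketch, not the project's actual code):

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal word-presence Naive Bayes, standing in for NLTK's
    NaiveBayesClassifier for illustration."""

    def fit(self, docs, labels):
        self.labels = set(labels)
        self.prior = Counter(labels)                  # class counts
        self.counts = {c: Counter() for c in self.labels}
        for words, c in zip(docs, labels):
            self.counts[c].update(set(words))         # presence, not frequency
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, words):
        words = set(words)
        total = sum(self.prior.values())

        def logp(c):
            n = self.prior[c]
            score = math.log(n / total)               # log prior
            for w in self.vocab:
                # Laplace-smoothed P(word present | class)
                p = (self.counts[c][w] + 1) / (n + 2)
                score += math.log(p if w in words else 1 - p)
            return score

        return max(self.labels, key=logp)
```

Training and testing then amount to calling `fit` on the labeled tweets and comparing `predict` against the held-out labels.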
40. Sentiment Analysis – How it works
4. Classify the unlabeled tweets.
41. Sentiment Analysis – How it works
4. Classify the unlabeled tweets.
42. Challenges and Suggestions
• Because the sentiment polarity had to be labeled manually, the amount of data was small, which caused the low classification accuracy.
• I had some technical issues: there were encoding/decoding problems both in R and in Python, so I missed the chance to try the other classification methods supported in NLTK, like the maximum entropy classifier or SVM.
Remove useless variables that are not helpful for describing people who supported Trump, and rename the rest to make their meaning easy to understand.
The plot in the upper left shows the residual errors plotted versus their fitted values. The residuals should be randomly distributed around the horizontal line representing a residual error of zero; that is, there should not be a distinct trend in the distribution of points. The plot in the lower left is a standard Q-Q plot, which should suggest that the residual errors are normally distributed. The scale-location plot in the upper right shows the square root of the standardized residuals (sort of a square root of relative error) as a function of the fitted values. Again, there should be no obvious trend in this plot. Finally, the plot in the lower right shows each point's leverage, which is a measure of its importance in determining the regression result. Superimposed on the plot are contour lines for the Cook's distance, which is another measure of the importance of each observation to the regression. Smaller distances mean that removing the observation has little effect on the regression results. Distances larger than 1 are suspicious and suggest the presence of a possible outlier or a poor model.
Hi, everyone.
My name is Jiyeon, and I worked with Sujin as a team on this final project. Our topic is the 2016 US election.
My part is the text analysis: I analyzed the Twitter data of the two presidential candidates, Hillary Clinton and Donald Trump.
What I am worried about is that it could be a little hard for you to understand the code I wrote, since, technically, we didn't learn text mining in this class. I am going to walk through the code, but I am not going to explain every single detail.
As you know, the 2016 US election was a big issue for the past few months, and it fascinated lots of data scientists around the world.
They have already done so much work, so it was relatively easy to get datasets.
******
I'll skip these sections to save time. They'll be mentioned later in this presentation.
tm – the text mining package (see documentation). Also check out this excellent introductory article on tm.
SnowballC – required for stemming (explained below).
ggplot2 – plotting capabilities (see documentation).
wordcloud – which is self-explanatory (see documentation).
Getting the data was the easiest part of this project.
The Twitter APIs provide a platform where you can interact with Twitter's data, so-called tweets, and several attributes about tweets. You can also use the fascinating R package "twitteR" to retrieve tweets from someone's timeline or by things like hashtags.
******************
About the data
I used a data set of 400 tweets dating from July 26th to August 21st (Hillary's 200 tweets are from August 10th to August 21st, and Trump's 200 are from July 26th to August 10th). I had to label the sentiment manually to conduct sentiment analysis.
To be specific, I did such things as
calculating…
Finding…
Building…
These are the steps I followed to refine the data:
1. The dataset has not only the tweets themselves but also several attributes about them, like date, ID, etc. I only used the tweets in the text column here.
2. The next thing I did was clean the data.
- This includes removing numbers, URLs, and punctuation, and converting to lower case.
- Data cleaning can be done before and after building a corpus.
- A corpus makes it easy to deal with data in text mining.
- Then do more cleaning tasks, like removing stopwords and whitespace, and stemming the data.
3. Finally, I built a term-document matrix.
- A TDM is a matrix that lists all occurrences of words in the corpus by document. In the TDM, the terms are represented by rows and the documents by columns.
- It is a way of converting a corpus of text into a mathematical object, which is needed for quantitative text analysis.
- I needed it to calculate the frequency of occurrences of each word in the corpus.
And here are the results. This is a visualization of the terms that frequently occurred in Trump's tweets.
I plotted the result and created a word cloud.
The thing that catches my eye is the word "crook". It seems that Trump intentionally used the word "crook" to put Hillary down.
We can also check the correlations between terms that occur in the corpus. In this context, correlation is a quantitative measure of the co-occurrence of words in multiple documents.
I wanted to know Trump's opinion of Hillary and Obama, so I ran the findAssocs() function at a correlation limit of 20%.
----------------
1. 'hillary' is frequently used with the word 'crook'.
My assumption is that he attacked her over the email scandal.
2. The association between 'obama' and such terms as …: in what context were they used?
The other thing I wondered about is that the word 'obama' is associated with the terms 'worst', 'depress', 'leadership', 'terrible', 'wrong', etc., and I wanted to know the context in which Trump used these words.
Use the findAssocs() function in the tm package.
This result actually matches reality well: Trump insulted Hillary a lot and attacked her politically, for example about her email scandal.
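findAssocs() reports, for a given term, the other terms whose per-document counts correlate with it above a chosen limit. That correlation can be sketched in plain Python (the count vectors below are invented for illustration):

```python
import math

def term_correlation(x, y):
    # Pearson correlation between two terms' counts across the same
    # documents -- the quantity tm's findAssocs() thresholds on
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# e.g. made-up counts of 'hillary' and 'crooked' over six tweets
hillary = [2, 0, 1, 0, 3, 0]
crooked = [1, 0, 1, 0, 2, 0]
```

Terms that tend to appear in the same tweets get a correlation near 1, which is why 'hillary' and 'crook' surface together.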
This is the result.
Now I am going to classify the tweets of the two candidates into two (and sometimes three) classes: positive or negative (neutral is the optional third).
This is known as sentiment analysis.
I wanted to build a sentiment classifier.
The feature vector is the most important concept in implementing a classifier. A good feature vector directly determines how successful your classifier will be. The feature vector is used to build a model which the classifier learns from the training data, and it can then be used to classify previously unseen data.
To explain this, I will take a simple example: gender identification. Male and female names have some distinctive characteristics. Names ending in a, e, and i are likely to be female, while names ending in k, o, r, s, and t are likely to be male. So you can build a classifier based on this model using the ending letter of a name as a feature.
Similarly, in tweets, we can use the presence/absence of the words that appear in a tweet as features. In the training data, consisting of positive, negative, and neutral tweets, we can split each tweet into words and add each word to the feature vector. Some of the words might not have any say in indicating the sentiment of a tweet, and hence we can filter them out. Adding individual (single) words to the feature vector is referred to as the 'unigrams' approach.
Some other feature vectors also add 'bigrams' in combination with 'unigrams'. For example, 'not good' (a bigram) completely changes the sentiment compared to adding 'not' and 'good' individually. Here, for simplicity, we will only consider unigrams. Before adding the words to the feature vector, we need to preprocess them in order to filter them; otherwise, the feature vector will explode.
--------------------
This code extracts the tweets and labels from the CSV file, processes them as outlined above, obtains a feature vector, and stores it in a variable called "tweets".