Using Linear Regression and Decision Tree Analysis to Identify Who Supported Trump in the 2016 US Election
- Course name: Principles and Practice in Data Mining
- Semester: Autumn 2016
- Professor: Yuran SEO
- Sungkyunkwan University
- Department: Data Science
- Name: 정수진, 박지연
Analysis of 2016 US Election Twitter Data and County Results
2. Contents
1. Analysis Plan
- Outline and Purposes
2. Analysis of Election Results
- Exploratory Data Analysis
- Data Preconditioning
- Modeling and Test
3. Twitter Text Analysis
- Motivation
- Tools/Packages
- Dataset
- Analysis & Conclusion
4. Challenges and Suggestions
3. 1. Analysis Plan – Outline and Purposes
Purposes of the analysis
- Identify how Trump won and who supported him
- Analyze what Trump and Hillary mentioned on Twitter
Methods of analysis
1. Linear regression and decision tree analysis
- Model Trump's vote rate (dependent variable) on US county facts (independent variables).
- Classify the characteristics of the groups who supported Trump with decision tree analysis.
2. Text mining and sentiment analysis
- Analyze frequent words in the Twitter data and find associations between them.
- Automatically classify sentiment using the Naïve Bayes classification method.
Data
- 2016 & 2012 vote results data
- US county stats facts data
- Twitter data from July 26th to August 21st
4. 1. Modeling for Analysis of the 2016 Election Results
- How did Donald Trump beat Hillary Clinton? → Linear regression
- Who supported Donald Trump? → Decision tree analysis
5. Data Preconditioning
Download the datasets:
- US 2012 election county-level results
- US 2016 election county-level results
- County facts data
6. Remove useless variables and rename the remainder.
fips   area_name        state_abbreviation   population   under.5.y
0      United States    NA                   318857056    6.2
1000   Alabama          NA                   4849377      6.1
1001   Autauga County   AL                   55395        6
1003   Baldwin County   AL                   200111       5.6
1005   Barbour County   AL                   26887        5.7
1007   Bibb County      AL                   22506        5.3
…      …                …                    …            …
R code: county2 data set
Data Preconditioning – county_facts.csv
7. Data Preconditioning – Merge the 'votes.csv'
• We can also select some meaningful variables in the 'votes' data set and rename them so that it is easy to recognize what they mean.
• Merge the 'county2' and 'vote2' data by 'fips' code.
• Add a column named 'winner' whose value is '1' if Trump's vote rate is higher than Clinton's and '0' otherwise.
• Delete all NA values in 'data' using 'na.omit'.
fips   area_name        state_abbreviation.x   population   under.5.y
1001   Autauga County   AL                     55395        6
1003   Baldwin County   AL                     200111       5.6
1005   Barbour County   AL                     26887        5.7
1007   Bibb County      AL                     22506        5.3
1009   Blount County    AL                     57719        6.1
1011   Bullock County   AL                     10764        6.3
…      …                …                      …            …
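The merge, 'winner' column, and NA removal described above were done in R; as a minimal illustrative sketch of the same logic, here is a pandas version (the data frames and vote shares below are made-up toy values, not the real data):

```python
import pandas as pd

# Toy stand-ins for the real 'county2' and 'vote2' data frames
county2 = pd.DataFrame({
    "fips": [1001, 1003, 1005],
    "area_name": ["Autauga County", "Baldwin County", "Barbour County"],
    "population": [55395, 200111, 26887],
})
vote2 = pd.DataFrame({
    "fips": [1001, 1003, 1005],
    "Trump": [0.73, 0.77, 0.45],      # invented vote shares
    "Clinton": [0.24, 0.19, 0.52],
})

# Merge by 'fips', add a 0/1 'winner' column, drop rows with NA values
data = county2.merge(vote2, on="fips")
data["winner"] = (data["Trump"] > data["Clinton"]).astype(int)
data = data.dropna()
```

In R the same three steps were `merge()`, an `ifelse()` comparison, and `na.omit()`.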
10. 1. Analysis Plan – Exploratory Data Analysis
• The relationship between Trump's vote rate and the percentage of bachelor's degrees or higher in a county is negative.
• The relationship between Trump's vote rate and the percentage of White people in a county is positive.
(Scatter plots: Y = Trump, X = Bachelor; Y = Trump, X = White)
11. 1. Analysis Plan – Exploratory Data Analysis
• The relationship between Clinton's vote rate and the percentage of bachelor's degrees or higher in a county is positive.
• The relationship between Clinton's vote rate and the percentage of White people in a county is negative.
(Scatter plots: Y = Clinton, X = Bachelor; Y = Clinton, X = White)
12. 1. Analysis Plan – Exploratory Data Analysis
• Trump's and Romney's vote rates are strongly correlated, as are Clinton's and Obama's.
• Trump's vote rate is negatively correlated with bachelor's education level and with the percentage of Black people, but positively correlated with the percentage of White people.
• Clinton's vote rate is positively correlated with bachelor's education level and with the percentage of Black people, but negatively correlated with the percentage of White people.
Correlation visualization chart of some representative variables
13. Linear Regression Modeling
• Sample 20% of the data as test data and 80% as training data.
• Select the variables using the forward AIC method.
• Train the linear regression model on the variables selected with the smallest AIC value.
Sampling and modeling the linear regression with the training data
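The forward-AIC variable selection above was done in R (e.g. with `step()`); as an illustration of the idea, here is a from-scratch Python sketch that greedily adds the variable that most lowers AIC until no addition improves it (not the project's actual code):

```python
import numpy as np

def aic(y, yhat, k):
    # Gaussian AIC up to an additive constant: n*log(RSS/n) + 2k
    n = len(y)
    rss = float(np.sum((y - yhat) ** 2))
    return n * np.log(rss / n) + 2 * k

def forward_select(X, y, names):
    """Greedy forward selection: at each step fit OLS with one more
    candidate variable and keep the addition with the lowest AIC;
    stop when no candidate improves the current AIC."""
    chosen, remaining = [], list(range(X.shape[1]))
    best = np.inf
    while remaining:
        scored = []
        for j in remaining:
            A = np.column_stack([np.ones(len(y)), X[:, chosen + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            scored.append((aic(y, A @ beta, A.shape[1]), j))
        score, j = min(scored)
        if score >= best:
            break
        best = score
        chosen.append(j)
        remaining.remove(j)
    return [names[j] for j in chosen]
```

On synthetic data where only two of four predictors matter, the loop picks the strongest predictor first, then the second.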
14. Linear Regression Modeling – Model Validation
1. Comparing the predicted values on the test data with the actual values gives a correlation coefficient of 0.98, so the model is quite accurate.
2. Model significance (F-test): the p-value is < 2.2e-16, so the model is significant.
3. Explanatory power: Multiple R-squared = 0.9624 (very strong); Adjusted R-squared = 0.9622.
4. Significance of the X variables (Pr, ***):
- Positive coefficients: Romney, Asian, White, Income.capita
- Negative coefficients: Bachelor, household.income, under.18.y, Housing, Black, Foreign, Hawaiian, High.school, Language, Female
15. Linear Regression Modeling – Diagnostic Plots
> plot(train.lm)
Residuals vs Fitted
Normal Q-Q
Scale-Location
Residuals vs Leverage
16. Decision Tree Analysis
- White > 47.3, bachelor's degree < 27.9, housing units < 562, Black < 41
- 31.2 < White < 47.3, bachelor's degree < 19, Hawaiian = 0, Black < 14.1
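The two printed paths read as simple if/then rules. As an illustration, here is a hand-coded Python sketch of those two leaves (thresholds copied from the slide; the field names, and the assumption that both leaves predict a Trump win, are mine, since the real model was a fitted R tree):

```python
def tree_predict(county):
    """Hand-coded version of the two tree paths on the slide.
    Returns 1 for the (assumed) Trump-leaning leaves, 0 otherwise."""
    # Leaf 1: White > 47.3, Bachelor < 27.9, Housing units < 562, Black < 41
    if (county["White"] > 47.3 and county["Bachelor"] < 27.9
            and county["Housing"] < 562 and county["Black"] < 41):
        return 1
    # Leaf 2: 31.2 < White <= 47.3, Bachelor < 19, Hawaiian == 0, Black < 14.1
    if (31.2 < county["White"] <= 47.3 and county["Bachelor"] < 19
            and county["Hawaiian"] == 0 and county["Black"] < 14.1):
        return 1
    return 0  # every other path falls through
```

A decision tree is exactly this kind of nested threshold test; the fitting procedure just chooses the split variables and cutoffs automatically.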
17. Decision Tree Analysis – Test and Validation
• Accuracy is 0.918.
• The 95% confidence interval is (0.8936, 0.9383).
• The p-value is 1.8e-11.
• This decision tree model is significant.
• It classifies and predicts the winner relatively precisely.
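For reference, the interval on the slide comes from R's `binom.test` (an exact Clopper-Pearson interval); a normal-approximation sketch in Python gives a similar range. The test-set size used below is a guess for illustration, not a number from the slides:

```python
import math

def accuracy_ci(p_hat, n, z=1.96):
    # Normal-approximation 95% CI for a classification accuracy.
    # (R's binom.test, used for the slide's numbers, computes the
    # exact Clopper-Pearson interval instead.)
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half
```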
18. Insight & Conclusion
- Trump supporters: Romney supporters, White people, the less educated
- Clinton supporters: Obama supporters, the highly educated, people of color
19. 2. Twitter Data Analysis
Hillary Clinton & Donald Trump
(the candidates of the 2016 US Election)
Text Mining
Sentiment Analysis
20. Motivation
• The 2016 US election was the hottest issue in America this year.
• Social media plays an important role in a political campaign.
• Analyzing the two candidates' tweets can give us information that traditional statistical analysis cannot.
21. Tools / Packages
• R for text mining
- twitteR
- ROAuth
- KoNLP
- plyr
- tm
- SnowballC
- ggplot2
- wordcloud
- topicmodels
- stringr
• Python for sentiment analysis
- NLTK
22. Datasets
• The Twitter API is a platform where you can interact with Twitter's data (tweets) and several attributes about tweets.
• R provides the package "twitteR" to retrieve and manipulate the data.
• My dataset is 400 tweets, dated July 26th to August 21st (the period before the election), with sentiment labels:
- Hillary's 200 tweets are from August 10th to August 21st.
- Trump's 200 are from July 26th to August 10th.
23. Text Mining
1. Calculate the frequency of term occurrences and visualize it with a plot and a word cloud.
2. Find associations among some of these words.
3. Build a topic model.
24. Pre-process the Data
1. Load and format the data.
2. Clean the data:
- Stem the data.
- Build a corpus and do more cleaning tasks.
3. Build a term-document matrix (TDM).
Concepts
• A corpus is a collection of documents.
• A term-document matrix (TDM) is a matrix that lists all occurrences of words in the corpus by document.
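The project built its corpus and TDM with R's tm package; the same idea can be sketched in plain Python (the cleaning rules and stopword list here are abbreviated illustrations, not the project's actual pipeline):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "to", "and", "is", "of", "in", "for"}

def clean(tweet):
    # Mirror the slide's cleaning steps: drop URLs, numbers, and
    # punctuation, lowercase, remove stopwords (stemming omitted here)
    tweet = re.sub(r"http\S+", " ", tweet.lower())
    tweet = re.sub(r"[^a-z\s]", " ", tweet)
    return [w for w in tweet.split() if w not in STOPWORDS]

def term_document_matrix(tweets):
    # Rows = terms, columns = documents, entries = occurrence counts
    counts = [Counter(clean(t)) for t in tweets]
    terms = sorted(set().union(*counts))
    return terms, [[c[t] for c in counts] for t in terms]
```

This is the "mathematical object" the slides mention: once the corpus is a matrix, term frequencies are just row sums.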
25. Implementation in R
26. Text Mining – 1. WordCloud (Trump)
27. Text Mining – 2. Word Association (Trump)
28. Text Mining – 3. Topic Modeling (Trump)
29. Text Mining – 1. WordCloud (Clinton)
30. Text Mining – 2. Word Association (Clinton)
31. Text Mining – 3. Topic Modeling (Clinton)
33. Sentiment Analysis
• Sentiment analysis is a special case of text mining, generally focused on identifying opinion polarity using NLP, statistics, or machine learning methods.
• It is the process of determining whether a piece of text is positive, negative, or neutral.
• Machine learning is a good tool for this; there are various classification methods: the Naïve Bayes algorithm, maximum entropy, and SVM (support vector machine).
34. Sentiment Analysis – Why Python?
• NLTK (Natural Language Toolkit)
- A platform for building Python programs to work with human language data.
- Provides easy-to-use text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, plus wrappers for industrial-strength NLP libraries.
35. Sentiment Analysis – Naïve Bayes Classifier
• A built-in module in NLTK.
• Supervised learning: training and testing are required.
http://www.nltk.org/_modules/nltk/classify/naivebayes.html
36. Sentiment Analysis – Train & Test
37. Sentiment Analysis – How it works
1. Define functions
- Get features from the data and save them as a feature vector.
- Extract features from the feature vector; the result looks like this:
'contains(hi)': False,
'contains(crooked)': True
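The feature extractor in step 1 can be sketched like this; the `contains(w)` key format matches the NLTK convention shown above, while the function name and vocabulary are illustrative:

```python
def word_features(tweet_words, vocabulary):
    # One boolean feature per vocabulary word: does the tweet contain it?
    words = set(tweet_words)
    return {f"contains({w})": (w in words) for w in vocabulary}
```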
38. Sentiment Analysis – How it works
2. Get the feature list from the train and test data.
39. Sentiment Analysis – How it works
3. Train and test the classifier.
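The project trained NLTK's `NaiveBayesClassifier` on these features. As a from-scratch illustration of what that classifier learns, here is a minimal presence/absence Naïve Bayes with Laplace smoothing (a sketch, not the project's actual code):

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal word-presence Naive Bayes, standing in for NLTK's
    NaiveBayesClassifier for illustration."""

    def fit(self, docs, labels):
        self.labels = set(labels)
        self.prior = Counter(labels)                  # class counts
        self.counts = {c: Counter() for c in self.labels}
        for words, c in zip(docs, labels):
            self.counts[c].update(set(words))         # presence, not frequency
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, words):
        words = set(words)
        total = sum(self.prior.values())

        def logp(c):
            n = self.prior[c]
            score = math.log(n / total)               # log prior
            for w in self.vocab:
                # Laplace-smoothed P(word present | class)
                p = (self.counts[c][w] + 1) / (n + 2)
                score += math.log(p if w in words else 1 - p)
            return score

        return max(self.labels, key=logp)
```

Training and testing then amount to calling `fit` on the labeled tweets and comparing `predict` against the held-out labels.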
40. Sentiment Analysis – How it works
4. Classify the unlabeled tweets.
41. Sentiment Analysis – How it works
4. Classify the unlabeled tweets.
42. Challenges and Suggestions
• Because the sentiment polarity had to be labeled manually, the amount of data was small, which caused the low classification accuracy.
• I had some technical issues: there were encoding/decoding problems both in R and in Python, so I missed the chance to try the other classification methods supported in NLTK, like the maximum entropy classifier or SVM.
Remove useless variables that are not helpful for describing people who supported Trump, and rename the rest to make their meaning easy to understand.
The plot in the upper left shows the residual errors plotted versus their fitted values. The residuals should be randomly distributed around the horizontal line representing a residual error of zero; that is, there should not be a distinct trend in the distribution of points. The plot in the lower left is a standard Q-Q plot, which should suggest that the residual errors are normally distributed. The scale-location plot in the upper right shows the square root of the standardized residuals (sort of a square root of relative error) as a function of the fitted values. Again, there should be no obvious trend in this plot. Finally, the plot in the lower right shows each point's leverage, which is a measure of its importance in determining the regression result. Superimposed on the plot are contour lines for the Cook's distance, which is another measure of the importance of each observation to the regression. Smaller distances mean that removing the observation has little effect on the regression results. Distances larger than 1 are suspicious and suggest the presence of a possible outlier or a poor model.
Hi, everyone.
My name is Jiyeon, and I worked with Sujin as a team on this final project. Our topic is the 2016 US election.
My part is the text analysis: I analyzed the Twitter data of the two presidential candidates, Hillary Clinton and Donald Trump.
What I am worried about is that it could be a little hard for you to understand the code I wrote, since, technically, we didn't learn text mining in this class. I am going to walk through the code, but I am not going to explain every single detail.
As you know, the 2016 US election was a big issue for the past few months, and it fascinated lots of data scientists around the world.
They have already done so much work, so it was relatively easy to get datasets.
******
I'll skip these sections to save time. They'll be mentioned later in this presentation.
tm – the text mining package (see documentation). Also check out this excellent introductory article on tm.
SnowballC – required for stemming (explained below).
ggplot2 – plotting capabilities (see documentation).
wordcloud – which is self-explanatory (see documentation).
Getting the data was the easiest part of this project.
The Twitter APIs provide a platform where you can interact with Twitter's data, so-called tweets, and several attributes about tweets. You can also use the fascinating R package "twitteR" to retrieve tweets from someone's timeline or by things like hashtags.
******************
About the data
I used a data set of 400 tweets dating from July 26th to August 21st (Hillary's 200 tweets are from August 10th to August 21st, and Trump's 200 are from July 26th to August 10th). I had to label the sentiment manually to conduct sentiment analysis.
To be specific, I did such things as
calculating…
Finding…
Building…
These are the steps I followed to refine the data:
1. The dataset has not only the tweets themselves but also several attributes about them, like date, ID, etc. I only used the tweets in the text column here.
2. The next thing I did was clean the data.
- This includes removing numbers, URLs, and punctuation, and converting to lower case.
- Data cleaning can be done before and after building a corpus.
- A corpus makes it easy to deal with data in text mining.
- Then do more cleaning tasks, like removing stopwords and whitespace, and stemming the data.
3. Finally, I built a term-document matrix.
- A TDM is a matrix that lists all occurrences of words in the corpus by document. In the TDM, the terms are represented by rows and the documents by columns.
- It is a way of converting a corpus of text into a mathematical object, which is needed for quantitative text analysis.
- I needed it to calculate the frequency of occurrences of each word in the corpus.
And here are the results. This is a visualization of the terms that frequently occurred in Trump's tweets.
I plotted the result and created a word cloud.
The thing that catches my eye is the word "crook". It seems that Trump intentionally used the word "crook" to put Hillary down.
We can also check the correlations between terms that occur in the corpus. In this context, correlation is a quantitative measure of the co-occurrence of words in multiple documents.
I wanted to know Trump's opinion of Hillary and Obama, so I ran the findAssocs() function at a correlation limit of 20%.
----------------
1. 'hillary' is frequently used with the word 'crook'.
My assumption is that he attacked her over the email scandal.
2. The association between 'obama' and such terms as …: in what context were they used?
The other thing I wondered about is that the word 'obama' is associated with the terms 'worst', 'depress', 'leadership', 'terrible', 'wrong', etc., and I wanted to know the context in which Trump used these words.
Use the findAssocs() function in the tm package.
This result actually matches reality well: Trump insulted Hillary a lot and attacked her politically, for example about her email scandal.
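findAssocs() reports, for a given term, the other terms whose per-document counts correlate with it above a chosen limit. That correlation can be sketched in plain Python (the count vectors below are invented for illustration):

```python
import math

def term_correlation(x, y):
    # Pearson correlation between two terms' counts across the same
    # documents -- the quantity tm's findAssocs() thresholds on
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# e.g. made-up counts of 'hillary' and 'crooked' over six tweets
hillary = [2, 0, 1, 0, 3, 0]
crooked = [1, 0, 1, 0, 2, 0]
```

Terms that tend to appear in the same tweets get a correlation near 1, which is why 'hillary' and 'crook' surface together.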
This is the result.
Now I am going to classify the tweets of the two candidates into two (and sometimes three) classes: positive or negative (neutral is the optional third).
This is known as sentiment analysis.
I wanted to build a sentiment classifier.
The feature vector is the most important concept in implementing a classifier. A good feature vector directly determines how successful your classifier will be. The feature vector is used to build a model which the classifier learns from the training data, and it can then be used to classify previously unseen data.
To explain this, I will take a simple example: gender identification. Male and female names have some distinctive characteristics. Names ending in a, e, and i are likely to be female, while names ending in k, o, r, s, and t are likely to be male. So you can build a classifier based on this model using the ending letter of a name as a feature.
Similarly, in tweets, we can use the presence/absence of the words that appear in a tweet as features. In the training data, consisting of positive, negative, and neutral tweets, we can split each tweet into words and add each word to the feature vector. Some of the words might not have any say in indicating the sentiment of a tweet, and hence we can filter them out. Adding individual (single) words to the feature vector is referred to as the 'unigrams' approach.
Some other feature vectors also add 'bigrams' in combination with 'unigrams'. For example, 'not good' (a bigram) completely changes the sentiment compared to adding 'not' and 'good' individually. Here, for simplicity, we will only consider unigrams. Before adding the words to the feature vector, we need to preprocess them in order to filter them; otherwise, the feature vector will explode.
--------------------
This code extracts the tweets and labels from the CSV file, processes them as outlined above, obtains a feature vector, and stores it in a variable called "tweets".