Vinodhine Rajasekar, Cory Swindle, Amal Dev Thomas, Arun Tom, Jason Warnstaff
DATA MINING 5339-002 GROUP #1
REDDIT & DJIA
Contents
1 DATA BACKGROUND
    1.1 YAHOO FINANCE
    1.2 REDDIT
    1.3 DATASET & ATTRIBUTES
    1.4 CLASS ATTRIBUTE
    1.5 HYPOTHESIS
2 DATA CLEANING
    2.1 DATA ISSUES
    2.2 DATA CLEANING PROCESS
3 EXPERIMENT DESIGN
    3.1 FALSE PREDICTORS
    3.2 ZeroR - BENCHMARK
    3.3 CLASSIFIER SELECTION
    3.4 FOUR CELL EXPERIMENT DESIGN
    3.5 DATA REPRESENTATION
    3.6 ASSUMPTIONS
    3.7 TARGET VARIABLES
        Same Day
        Open to Low
        Open to High
4 EXPERIMENT RESULTS
    4.1 RESULTS FOR EACH CLASSIFIER
    4.2 SUMMARY OF RESULTS
5 ANALYSIS & CONCLUSION
    5.1 POSSIBLE ASSUMPTIONS VIOLATED
    5.2 CONCLUSION
APPENDIX
    J48
    META-BAGGING
    PART
    R Script
    Word Cloud for the 500-850 Range
1 DATA BACKGROUND
Our data set is composed of two sets of information that were combined to create a class attribute. The
first piece is from the Reddit World News Channel and the second comes from Yahoo Finance. The data set
covers the dates from August 8th, 2008 to July 1st, 2016.
1.1 YAHOO FINANCE
The section of our data that comes from Yahoo Finance is the Dow Jones Industrial Average values for the
dates listed above. The Dow Jones Industrial Average (DJIA) is a price-weighted average of thirty significant
stocks traded on the New York Stock Exchange (NYSE) and NASDAQ. It is a thirty-component index or
grouping of stocks that is meant to be a gauge or indication of how the overall stock market is performing.
The thirty stocks that comprise the index are intended to track the movement of the broader stock
market. The DJIA is a price-weighted average, which means that stocks with higher share
prices are given a greater weight in the index. The average is scaled in order to make up for the effects of
stock splits and other adjustments, creating a stable value for the index. The DJIA was created by Dow
Jones co-founders Charles Dow and Edward Jones in the late 1800's. There have been many changes since
its original formulation and companies have been added to and taken off the Dow. General Electric is
known as the 'Dow Veteran' because it is the only one of the original Dow components that has been a
part of the index since creation. Some of the more recent additions include Cisco and Travelers which
were added after General Motors was removed from the Dow in 2009. The regular adjustments and
updates keep the Dow relevant and help it remain a reliable gauge of how the stock market is performing.
1.2 REDDIT
Reddit calls itself the “front page of the internet.” It is a social news aggregation, web
content rating, and discussion website where users have the option to post content anonymously or under
their registered user name. Registered users can then vote submissions up or down to determine their
position on the page. The posts with the most positive votes appear on the front page or the top of a
category. Reddit contains millions of channels, each known as ‘subreddits’ with topics ranging from news
to gaming to anything else that you can think of. Our data set was created with the information crawled
from a ‘subreddit’ called World News Channel. The top 25 headlines for each day in our stated date range
were pulled and put into Excel with each headline serving as its own attribute.
1.3 DATASET & ATTRIBUTES
Our data set came from 3 different files. Each is described below.
1. RedditNews.csv: 73,608 instances and 2 attributes.
Date – Corresponding date, ranging from Aug 8, 2008 to July 1, 2016, in the format mm-dd-yyyy.
News – News headlines, ranked from top to bottom based on their votes; hence, there are 25 headline rows for each date.
2. DJIA_table.csv: 1,989 instances and 7 attributes.
Date – Date, ranging from Aug 8, 2008 to July 1, 2016, in the format mm-dd-yyyy
Open – The index value at the opening of the trading day
High – Highest index value of the trading day
Low – Lowest index value of the trading day
Close – The index value at which the market closed on that trading day
Volume – Number of component shares traded during the trading period
Adj Close – Adjusted closing price
3. Combined_News_DJIA.csv: 1,989 instances and 27 attributes.
Date – Corresponding date, ranging from Aug 8, 2008 to July 1, 2016, in the format mm/dd/yyyy
Label – 1 to note that the DJIA Adj Close value rose or stayed the same; 0 to note that it decreased
Top1 – Top1 headline rating for the day as voted on by Reddit users
Top2 – Top2 headline rating for the day as voted on by Reddit users
Top3 – Top3 headline rating for the day as voted on by Reddit users
Top4 – Top4 headline rating for the day as voted on by Reddit users
Top5 – Top5 headline rating for the day as voted on by Reddit users
Top6 – Top6 headline rating for the day as voted on by Reddit users
Top7 – Top7 headline rating for the day as voted on by Reddit users
Top8 – Top8 headline rating for the day as voted on by Reddit users
Top9 – Top9 headline rating for the day as voted on by Reddit users
Top10 – Top10 headline rating for the day as voted on by Reddit users
Top11 – Top11 headline rating for the day as voted on by Reddit users
Top12 – Top12 headline rating for the day as voted on by Reddit users
Top13 – Top13 headline rating for the day as voted on by Reddit users
Top14 – Top14 headline rating for the day as voted on by Reddit users
Top15 – Top15 headline rating for the day as voted on by Reddit users
Top16 – Top16 headline rating for the day as voted on by Reddit users
Top17 – Top17 headline rating for the day as voted on by Reddit users
Top18 – Top18 headline rating for the day as voted on by Reddit users
Top19 – Top19 headline rating for the day as voted on by Reddit users
Top20 – Top20 headline rating for the day as voted on by Reddit users
Top21 – Top21 headline rating for the day as voted on by Reddit users
Top22 – Top22 headline rating for the day as voted on by Reddit users
Top23 – Top23 headline rating for the day as voted on by Reddit users
Top24 – Top24 headline rating for the day as voted on by Reddit users
Top25 – Top25 headline rating for the day as voted on by Reddit users
1.4 CLASS ATTRIBUTE
Our class attribute, as presented in the data set, was a binary value. A value of ‘0’ means that the DJIA
Adj. Close decreased from the previous day. A value of ‘1’ indicates that the DJIA Adj. Close either
increased or stayed the same from the previous day. We will talk further about how we transformed the
Class attribute we were given in the Experiment Design section of this paper.
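As an illustration, this Label definition can be reproduced directly from the Adj Close series. A minimal sketch in R follows; the values and the adj_close vector name are hypothetical, and this is our illustration rather than the data set's original preparation step.
# Minimal sketch: Label is 1 when Adj Close rose or stayed the same vs. the
# prior day, and 0 otherwise. The index values below are hypothetical.
adj_close <- c(11734.32, 11782.35, 11642.47)
label <- as.integer(diff(adj_close) >= 0)   # 1 0 (the first day has no label)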
1.5 HYPOTHESIS
Utilizing the data from the described data set, we intend to predict whether the DJIA will rise or fall on
the following day based off of the top 25 prior day’s world news headlines from Reddit.
2 DATA CLEANING
Data cleaning is the most critical and time consuming piece of any data mining project. Ensuring that the
data that is being tested is accurate and clean will lead to more reliable results and a process that can be
duplicated/replicated by others that wish to test the validity of the results presented. The process of data
cleaning consists of detecting or correcting corrupt or inaccurate records, removing inaccurate or
irrelevant parts of data, and coming up with an overall standardization of the data so that algorithms can
process and assist in determining results.
2.1 DATA ISSUES
The data for our project was pulled from the Reddit website and dumped into an Excel CSV file. Pulling
and dumping of data from the internet into Excel can lead to numerous issues especially when it involves
text. In the case of our data, a preliminary glance showed that there were some ‘trash’ characters leading
some of our headlines. The data set has 1,989 instances with twenty-five headline attributes each, so it
would be very daunting to look through the data individually to try and find these characters
and remove them. Also, without knowing what other potential ‘trash’ was possible in the data set, it would
be very hard to either script it out in R or get rid of it in Excel. Spellcheck was also not an option as any
word that was classified as misspelled would require a judgment from our group about whether it was
intended to be spelled this way on Reddit or if it was actually just an error. In the end, we decided that we
had enough data that we could tolerate some of these ‘trash’ characters and that it wouldn’t affect our
plan for how we intended to use the data anyway.
The next problem we faced was determining how we could best use the headlines to test against our class
attribute. Text mining is still a developing field and there are two main trains of thought about how to
best utilize text data. Our initial plan was to use sentiment analysis which is the process of determining a
sentiment of a sentence based off of a sum score for the sentiment of the words used in that sentence.
As you could imagine, this can lead to some unique headaches. “The accuracy of a sentiment analysis
system is, in principle, how well it agrees with human judgments. This is usually measured by precision
and recall. However, according to research human raters typically agree 79% of the time. Thus, a 70%
accurate program is doing nearly as well as humans, even though such accuracy may not sound
impressive. If a program were "right" 100% of the time, humans would still disagree with it about 20% of
the time, since they disagree that much about any answer.” (Wikipedia.org, n.d.) Other things such as
cultural factors, linguistic nuances and differing contexts make it extremely difficult to turn a string of text
into a simple pro or con statement. Due to the complexity and lack of overall accurate processes
associated with sentiment analysis, we opted for a different approach. We chose to look for key words
occurring with a certain (to be determined later) frequency throughout our data set. Our base assumption
in pursuing this option was that certain words would have a negative/positive effect on the attitudes of
people who invest in the stock market, influencing how/when they traded and thus affecting the DJIA. In conversations at the
beginning of the project we hypothesized that words such as terror, bomb, war, explosion, etc... would
have a negative effect on the DJIA. The thinking was that these headlines were normally associated with
negative events in our minds, and would thus cause people to be less or potentially even more
enthusiastic about investing.
The last major issue that we were facing in using text mining and keywords was how to accurately
determine the correct frequencies of words. As an example, England is referred to by numerous names.
There is Britain, UK, and England, to name a few. How would we best handle a count if we wanted to
determine whether ‘Great Britain’ as a whole was going to be a key attribute in determining class? Additionally,
do ‘British’ and ‘English’ mean the same thing? This also was an issue for regular words with suffixes or
prefixes that we wanted to account for. We began looking for a way to try and break words down to their
individual root word so that we could eliminate as much of this as possible. Stemming or Lemmatization
functions would have to be completed on our data set to try and eliminate this issue. Stemming is the
process of breaking words down to their root word to avoid having multiple words for what is technically
the same word. For example, ‘bomb’ would be the root word for ‘bomber’, ‘bombing’, ‘bombings’,
‘bombarding’, and ‘bombarded’. It would only make sense that all of these words should be counted as a
single word according to our group. The stemming process looks at each word and then, based off of the
assigned dictionary, breaks each word down to its root. This also caused some interesting problems as a
word such as ‘issues’ was broken down to ‘issu’. As we found out, the stemming methods that we used
were not as effective as we had hoped. However, stemming was a better option than leaving the words
as they were, so we ended up pursuing this method.
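As a small illustration, the SnowballC package's Porter stemmer shows both the behavior we wanted and the quirks we ran into; the input words here are our own examples, not taken from the headline data.
# Minimal sketch of Porter stemming with the SnowballC package.
library(SnowballC)
wordStem(c("bombing", "bombings", "bomber", "issues"), language = "english")
# yields "bomb" "bomb" "bomber" "issu" -- note that "bomber" is not reduced
# to "bomb", and "issues" becomes the non-word "issu"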
Once we had identified and developed a plan to mitigate our initial concerns with the data cleaning
process the next step was to select a tool that would allow us to accomplish everything we intended to
accomplish. We chose to utilize Excel for basic, easy calculations and counts and to use R for the more
complex data mining tasks and frequency distributions. R is a language and environment for statistical
computing and graphics. Upon doing some cursory research and speaking to data mining professionals,
we selected R due to its large community of users online that could assist us as we undertook this project
using a language that we were unfamiliar with, a vast library of answered questions available at
StackOverflow.com, and the availability of individuals in our circle of contacts who had used R.
2.2 DATA CLEANING PROCESS
The goal of our data cleaning was to come up with a list of roughly 50 words, based off of their frequency
of occurring throughout the dataset, count the number of times those words appeared on each day, and
place the values for those counts against the date they occurred as well as the class attribute.
Our data when we started was separated in the following manner:
• Date attribute
• Label Attribute (CLASS attribute)
• Headline 1
• Headline 2
• Etc…
• Headline 25
In order to achieve our goal of getting keywords it was necessary to combine all of the headline attributes
into a single attribute for the data cleaning process. In Excel, we concatenated the headlines into a single
attribute. This took us down from twenty-seven attributes to three attributes. The next step was to begin
the cleanup process on the headline attribute. As you can see in the sample of data below, there was
some ‘trash’ that was easily identifiable and could be removed from our file using basic Excel functionality.
We removed b", b', "b", and "b' from our data and then saved the file as .CSV.
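For reference, the same cleanup could have been scripted in R instead; a minimal sketch, assuming the concatenated headlines sit in a character vector named headlines (we actually performed this step with Excel find-and-replace).
# Strip the leading byte-string artifacts (b" and b') from each headline.
headlines <- gsub("b\"|b'", "", headlines)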
The next step was to import the .CSV to our R workspace so we could continue on with the data cleaning.
The view below is a snapshot of what the data looked like once it was imported.
*Full R script is shown in the Appendix.
In order to move forward with our imported data in R, we needed to create a corpus. The main structure
for managing documents in a text mining package in R is a corpus. A corpus is a representation of text
documents that is stored in memory so that it is easily accessible for data manipulation.
Once we had created our corpus we could begin to break the headlines down to individual words that we
would use as our attributes going forward. The first step was to remove all upper-case letters, remove
punctuation, remove numbers, remove common stop words, and stem the words in the document. Below
is a view of our corpus after upper-case letters, punctuation, and numbers had been removed.
Next, a Term Document Matrix (tdm) and a Document Term Matrix (dtm) were created from our corpus
in order to better organize our data for frequency searches.
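These are one constructor call each in the tm package; myCorpusText is the plain-text version of the corpus created in the Appendix script.
tdm <- TermDocumentMatrix(myCorpusText)   # terms as rows, documents as columns
dtm <- DocumentTermMatrix(myCorpusText)   # documents as rows, terms as columns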
Next, we needed to know the overall frequency range of terms in our data so that we could search for
sub-ranges that we thought could potentially lead us to results. To accomplish this, we created a new
variable called ‘freq’ holding the total count of each term in the dtm.
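That step is a single line of R, taken from the full script in the Appendix:
freq <- colSums(as.matrix(dtm))   # total count of each term across all headlines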
Using the ‘freq’ variable that we created in the previous step, we were then able to start doing some
visualizations to determine what range of values would be useful in our tests. After some manipulation of
the visualizations, we came across two different ranges with about 50 terms each that we felt would
provide results. The charts we created in R are below:
Frequency of 500-850
Frequency of 250-300
Once we had our desired lists of frequent terms, we opted to return to Excel for the remaining steps in
the data cleaning process.
For the next step, we re-opened the CSV file that was imported to R and modified it to get rid of the Date
and Label attributes. We were looking for counts of our frequent terms, the Date and Label were not
important in this step. Our headline attribute is now the first attribute with the concatenated headlines
as the instances. For the remaining attributes, we added the key words from each range that we selected.
We were left with two separate files, one for each frequency range.
Next, we needed to determine a formula that would take the attribute name and use it to look at each
instance of the headline attribute and provide a count for how many times it appears. The Excel formula
we created is below:
• =(LEN(B2)-LEN(SUBSTITUTE(B2,$C$1,"")))/LEN($C$1)
The formula strips every occurrence of the key word (cell C1) from the concatenated headline text (cell
B2) and divides the resulting drop in length by the key word’s length, yielding the number of times the
key word appears as a substring. It accomplished what we wanted, but there were some drawbacks that
are discussed in the False Predictors section of this paper.
After the above formula was run against our Headline attribute, the Headline attribute was not needed.
We deleted it and added back the Date and Label attributes. Our data set now had a count for each
attribute on each instance. Our file showed individual words (as attributes) that appeared between either
250-300 or 500-850 times throughout the data set, and how many times that word occurred for each
instance.
At this point, we considered our data ‘clean’ and were ready to move forward with testing. We understood
that further manipulations of our data may be necessary based off of how testing went. However, the
bulk work of determining our attributes and providing counts that we could test against our class variable
was complete.
3 EXPERIMENT DESIGN
3.1 FALSE PREDICTORS
After the data cleaning process, we ran various algorithms on the clean data set and encountered a few
errors with respect to the words learned from our R processing. While the formula used in Excel gave us
counts of our key word attributes against the Headline instances, these counts had issues. Because this
work was performed in Excel, we were bound by Excel’s constraints. The issue we discovered was that
instead of searching for the distinct key word attribute in the headline instance, our formula was looking
for the pattern of the same letters. This means that there was a really strong chance that we would return
a greater count for each of our key word attributes than the original frequency found in R. For example,
the attribute name ‘man’ will return a count for every time ‘man’ appears in the instance. It will also return
a count for the word ‘manage’, ‘mankind’, ‘woman’, or any other word that contains that combination of
letters. This means that our counts for each attribute would likely be exaggerated. To negate this
somewhat, we summed the count for each attribute and made judgment decisions on whether to keep
the attribute or not based on its sum count. In some cases, the count of the attribute, which had originally
been either 250-300 or 500-850, skyrocketed to 1,500 or 2,000. Since there were only 1,989 instances, if
a single attribute had a count greater than the total number of instances then we believed it would serve
more as a false predictor rather than valuable information that we should use to test. So several words
were removed from our keyword attribute list following that process.
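A minimal sketch of the discrepancy in R, using the stringr package on a made-up headline; neither the package nor the example is part of our actual pipeline.
library(stringr)
headline <- "woman and man disagree about mankind"   # hypothetical headline
str_count(headline, fixed("man"))        # 3 -- also matches inside "woman" and
                                         # "mankind", mirroring the Excel formula
str_count(headline, regex("\\bman\\b"))  # 1 -- word boundaries count whole words only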
Another false predictor that we identified very early in our testing process was the date attribute being
included in our data. In our beginning tests we were getting very high accuracy from OneR which made us
very excited. However, when we investigated the rule that was being created, it was based off of the Date
attribute. Since our knowledge of the data set told us that the Date attribute should not, in any way, be a
decision factor in predicting our Class attribute we opted to remove the attribute. Also, because our data
was already laid out in Date order, we knew that we could add the attribute back later if we wanted to
run different tests where it was needed.
3.2 ZeroR - BENCHMARK
ZeroR is the simplest classification method available for data mining. ZeroR looks at the target attribute
and ignores all other attributes. It finds the majority value of the class attribute and then uses it as its prediction
going forward. We ran our clean data sets against ZeroR to set a benchmark for our results and the tables
below have those findings.
Same-day
Open-To-High
Open-To-Low
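Since ZeroR simply predicts the majority class, its benchmark accuracy can be computed directly; a minimal sketch, assuming a data frame named data with the class column Label.
# ZeroR baseline: always predict the most frequent class value.
majority_class    <- names(which.max(table(data$Label)))
baseline_accuracy <- max(table(data$Label)) / nrow(data)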
3.3 CLASSIFIER SELECTION
Due to the number of assumptions in our experiment, we set out to choose our classifiers by running the
whole data set through a variety of different algorithms to see where we would get the best results. Since
our goal was mainly predictive analysis of stock prices on a particular day, we wanted classifiers that
deliver high accuracy and precision on this kind of classification task. Looking at the tables below, the
classifiers J48, PART, and Bagging produced considerably higher results (when run with the full data set)
compared to ZeroR as well as any of the other classifiers that we tested. For this reason, we chose to
move our experiment forward with these classifiers.
* 58 keywords with a frequency of 250 to 300. ‘Date’ Removed.
* 20 keywords with a frequency of 250 to 300. ‘Date’ Removed.
* 57 keywords with a frequency of 500 to 850. ‘Date’ Removed.
* 20 keywords with a frequency of 500 to 850. ‘Date’ Removed.
• J48 (tree) - J48 belongs to the tree group of classifiers. It applies the decision tree learning
(DTL) process, at each split selecting the attribute that most improves the prediction
accuracy. In the data mining field, J48 is well known for its capability of building high-accuracy
models for predictive analysis.
• Bagging (Meta) – Bagging, an ensemble method for classifying, works by combining the decisions
of different models and amalgamating the various outputs into a single prediction. The prediction
is done by taking a cumulative vote of classifications done by each of the classifiers. This model
produces reliable results because predictions made by voting become more accurate as more
votes are taken into account. Decisions rarely deteriorate if new training sets are discovered, trees
are built for them, and their predictions participate in the vote as well. In particular, the combined
classifier will seldom be less accurate than a decision tree constructed from just one of the
datasets.
• PART (rule) – PART adopts a supervised machine learning algorithm, namely partial decision trees,
as a method for feature subset selection. Feature subset selection aims at finding the smallest
feature set having the most beneficial impact on machine learning algorithms, i.e. its prime goal
is to identify a subset of features upon which attention should be centered. More precisely, PART
exploits the partial decision tree learning algorithm for feature space reduction. It uses a separate-
and-conquer method: it builds a partial C4.5 (J48) decision tree in each iteration, makes the
"best" leaf into a rule, and thus derives each rule from a pre-pruned decision tree.
3.4 FOUR CELL EXPERIMENT DESIGN
Two Factor Design: Our experiment design contains two factors:
1. Factor 1 (F1) Number of attributes
a. Using the full set of keyword attributes or the top 20 keyword attributes
2. Factor 2 (F2) Percentage Split
a. 80/20 or 20/80
Four Criteria of the Design: The two factors are divided up into 8 criteria based on the two sets of
key word frequency, keeping one factor constant while varying the other between its two values, and
vice versa. This is illustrated more clearly in the table below.
Location  Test  Frequency & Design       Attributes           Split
F11       C1    250-300, SAME DAY        All Attributes       80% - 20%
F12       C2    500-850, SAME DAY        All Attributes       80% - 20%
F13       C3    250-300, SAME DAY        Selected Attributes  80% - 20%
F14       C4    500-850, SAME DAY        Selected Attributes  80% - 20%
F21       C5    250-300, SAME DAY        All Attributes       20% - 80%
F22       C6    500-850, SAME DAY        All Attributes       20% - 80%
F23       C7    250-300, SAME DAY        Selected Attributes  20% - 80%
F24       C8    500-850, SAME DAY        Selected Attributes  20% - 80%
F11       C9    250-300, OPEN-TO-LOW     All Attributes       80% - 20%
F12       C10   500-850, OPEN-TO-LOW     All Attributes       80% - 20%
F13       C11   250-300, OPEN-TO-LOW     Selected Attributes  80% - 20%
F14       C12   500-850, OPEN-TO-LOW     Selected Attributes  80% - 20%
F21       C13   250-300, OPEN-TO-LOW     All Attributes       20% - 80%
F22       C14   500-850, OPEN-TO-LOW     All Attributes       20% - 80%
F23       C15   250-300, OPEN-TO-LOW     Selected Attributes  20% - 80%
F24       C16   500-850, OPEN-TO-LOW     Selected Attributes  20% - 80%
F11       C17   250-300, OPEN-TO-HIGH    All Attributes       80% - 20%
F12       C18   500-850, OPEN-TO-HIGH    All Attributes       80% - 20%
F13       C19   250-300, OPEN-TO-HIGH    Selected Attributes  80% - 20%
F14       C20   500-850, OPEN-TO-HIGH    Selected Attributes  80% - 20%
F21       C21   250-300, OPEN-TO-HIGH    All Attributes       20% - 80%
F22       C22   500-850, OPEN-TO-HIGH    All Attributes       20% - 80%
F23       C23   250-300, OPEN-TO-HIGH    Selected Attributes  20% - 80%
F24       C24   500-850, OPEN-TO-HIGH    Selected Attributes  20% - 80%
Once we clearly defined our four criteria for testing, we were ready to complete the experiment design
for SAME DAY, OPEN-TO-LOW, and OPEN-TO-HIGH. In order to make our training and test data truly
representative (there was a chance that the data might lose its properties due to sampling when
running the classifiers), we planned on doing ten runs for each criterion, with each run using a distinct
seed value.
• Total number of experiment runs = Number of experiment designs * Number of
criteria * Number of Classifiers * Number of runs
• 3 * 8 * 3 * 10 = 720 runs
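A minimal sketch in R of how one cell of this design could be executed; our actual runs were done in Weka, so the data frame name, the RWeka calls, and the 80/20 ratio here stand in for one specific cell.
# Ten runs of an 80/20 percentage split, each with a distinct seed value.
accs <- sapply(1:10, function(seed) {
  set.seed(seed)
  train_idx <- sample(nrow(data), size = 0.8 * nrow(data))
  model <- RWeka::J48(Label ~ ., data = data[train_idx, ])
  ev    <- RWeka::evaluate_Weka_classifier(model, newdata = data[-train_idx, ])
  ev$details["pctCorrect"]
})
c(average = mean(accs), std_dev = sd(accs))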
3.5 DATA REPRESENTATION
Initially, the Headlines were laid out to represent the instances for our data. After the data cleaning
process, we chose the key words in the selected frequency ranges (500-850 / 250-300) as attributes, with
the counts of those keywords on each corresponding date as the instances. We made the
following assumptions to carry out the experiment.
3.6 ASSUMPTIONS
1. We assume that the top 25 headlines were published only during the stock market hours (8:30 AM to
3:00 PM). Hence, we came up with the experiment design for SAME DAY.
a. We calculated the price change for this criterion as:
b. Same Day = (Market Close – Market Open) / Market Open
2. We assume that the top 25 headlines influence the stock market index either in the positive or
negative direction. Hence, we came up with two additional categories of experiment design.
a. Open-To-High = (Market High – Market Open) / Market Open
b. Open-To-Low = (Market Low – Market Open) / Market Open
3. We assume that we captured all of the possible keywords.
4. We assume that world news stories affect how people react when trading stocks.
5. We assume that there is either a positive or negative effect in the market influenced by positive or
negative connotations of news stories.
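A minimal sketch of these three calculations in R, assuming a data frame named djia with the Open, High, Low, and Close columns from DJIA_table.csv:
djia$SameDay    <- (djia$Close - djia$Open) / djia$Open
djia$OpenToHigh <- (djia$High  - djia$Open) / djia$Open
djia$OpenToLow  <- (djia$Low   - djia$Open) / djia$Open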
3.7 TARGET VARIABLES
Given our assumptions above, we came up with the sets of target variables for our class attributes that
are described below.
Same Day
For the experimental design for Same Day calculations, we decided to name the target variables:
• POSITIVE EFFECT - for an increase in the stock price compared to the market open
• LOW EFFECT - for a decrease in the stock price compared to the market open
• NO EFFECT - for no change between the market open and market close
The values for these target variables were derived by looking at the distribution of our calculations in the
chart below.
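A sketch of how the Same Day change maps onto these classes; the cut points below are hypothetical placeholders, since our actual values were read off the distribution chart.
# Hypothetical thresholds for illustration only -- the real cut points came
# from inspecting the distribution of the Same Day values.
djia$SameDayClass <- cut(djia$SameDay,
                         breaks = c(-Inf, -0.0005, 0.0005, Inf),
                         labels = c("LOW EFFECT", "NO EFFECT", "POSITIVE EFFECT"))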
Open to Low
For the experimental design for the Open-to-Low calculation, we analyzed the negative effect the
headlines can have on the market price. Hence, we decided to name the target variables:
• HIGH EFFECT - for higher negative effect on the market price with respect to the Market open
• MODERATE EFFECT - for a medium decrease in the stock price compared to the Market Open
• NO EFFECT - for a minor decrease in the market price with respect to Market Open
The values for these target variables were derived by looking at the distribution of our calculations in the
chart below.
Open to High
For the experimental design for our Open-to-High calculations, we analyzed the positive effect the
headlines can have on the market price. Hence, we decided to name the target variables:
• HIGH EFFECT - for higher positive effect on the market price with respect to the Market open
• MODERATE EFFECT - for medium increase in the stock price compared to the Market Open
• NO EFFECT - for a minor increase in the market price with respect to Market Open
The values for these target variables were derived by looking at the distribution of our calculations in the
chart below.
4 EXPERIMENT RESULTS
4.1 RESULTS FOR EACH CLASSIFIER
The tables below describe all of the possible combinations of our criteria with the three
selected classifiers. Each of these combinations was run ten times using a different seed value for each
run as explained in the Experiment Design section. The accuracy of each test is recorded below along with
the average and standard deviation of their accuracy.
J48 - Detailed chart explanation in Appendix
Meta-Bagging - Detailed chart explanation in Appendix
PART – Detailed chart explanation in Appendix
As depicted in the tables above, we used three classifiers in the analysis process: J48, Meta-Bagging,
and PART. Each classifier was trained and tested with the attributes based on the following
combinations of test criteria, leading to 24 sets of algorithm runs per classifier:
• Dividing the market based on the stock values in a day:
o Opentolow
o Opentohigh
o Sameday
• Percentage Split:
o 80% training - 20% testing
o 20% training – 80% testing
• Attribute Selection
o Using all attributes (all key words in the selected frequency range)
o Using the top 20 attributes (the 20 key words with the highest total counts in the data set)
• Frequency of keywords
o 250-300 frequency words
o 500-850 frequency words
4.2 SUMMARY OF RESULTS
The tables below show the average accuracy for each of our algorithms based off of the experiment design
that we used to partition our Stock Market data (Sameday, OpentoLow, OpentoHigh). The legend
numbers for each algorithm refer to the corresponding tests identified in charts above. There is also a key
below each chart to help clarify.
1-4 = 250-300 Frequency Range; 5-8 = 500-850 Frequency Range
Odd numbers = Top 20 Words; Even numbers = Full List
1,2,5,6 = 80/20 Split; 3,4,7,8 = 20/80 Split
9-12 = 250-300 Frequency Range; 13-16 = 500-850 Frequency Range
Odd numbers = Top 20 Words; Even numbers = Full List
9,10,13,14 = 80/20 Split; 11,12,15,16 = 20/80 Split
17-20 = 250-300 Frequency Range; 21-24 = 500-850 Frequency Range
Odd numbers = Top 20 Words; Even numbers = Full List
17,18,21,22 = 80/20 Split; 19,20,23,24 = 20/80 Split
A glance at the above tables shows that based off of percentage accuracy, Meta-Bagging appears to be
the most accurate classifier across our experiment design factors overall. J48 was marginally better for
the OpentoLow group but worse for the other two. For the different stock market segments, the Meta-
Bagging algorithm showed averages of 34.29 (Sameday), 33.72 (OpentoLow), and 40.16 (OpentoHigh).
We notice that according to Meta-Bagging and J48, the tests on the ‘Top 20’ group are the least accurate,
while PART does not show matching results for this group. For the 250-300 group of words, PART does
not show the same decrease for the Top 20 factor as the other algorithms. However, when run against
the 500-850 group, its accuracy also drops when exposed only to the Top 20 values.
Almost unanimously, the full list of key words outperformed the top 20 key words for the same factors.
This tells us that the more words we have to predict against, the better chance we have of predicting the
class attribute with higher accuracy. There were a few examples where this was easily noticeable
(OpentoLow and OpentoHigh); the rest of the time it was only a marginal increase. On both the
OpentoLow and OpentoHigh tests, the noticeable difference came in the lower frequency range of words.
This would lead us to believe that including more words at lower frequencies increases the accuracy of
predicting the class attribute.
However, when looking at the 250-300 group of words we can see some strange occurrences when
inspecting the averages across the test criteria. The average percentage accuracy is the same for both the
80/20 and 20/80 split when using the Top 20 key words. This goes against conventional wisdom. We would
expect that the accuracy would increase as the percentage of training data increases. This is not the case
for this split on the Top 20 key words across all three algorithms. For this reason, using the test criteria
and the other experiment design for our project, we would not recommend using the 250-300 grouping
of words for any predictive testing until it is fully understood why this is happening.
Meta-Bagging produced better results while considering all of the key word attributes as well as just using
the top 20 attributes. Meta-Bagging also showed consistent values while taking into account the 250-300
and 500-850 frequency words.
By far the highest accuracy we achieved was with Meta-Bagging on the 250-300 group of words, using
the OpentoHigh design with the full list of words. We achieved greater than 50% accuracy using both the
80/20 split and 20/80 split. The problem with this is that (as explained earlier) both the 80/20 split and
20/80 split have the same values.
5 ANALYSIS & CONCLUSION
5.1 POSSIBLE ASSUMPTIONS VIOLATED
There seem to be inherent assumption violations in our project, as the results we received were at times
illogical. Our first assumption, “the top 25 headlines were published only during the stock market hours
(8:30 AM to 3:00 PM)”, is violated because we do not have pre-trading and post-trading data to capture
all trading hours in a given day. We do not know at what time the headlines were posted, only that they
came from the 12:00 a.m. to 11:59 p.m. window.
It is illogical that we were returning either the same or better results in our 20/80 splits than our 80/20
splits on some of our tests. This seemed to occur mostly with the 250-300 grouping of key words. We did
not investigate the reasoning behind this, but noted the results and would explore further in future
experiments.
5.2 CONCLUSION
We observed that the accuracy for each experiment design was close to our benchmark ZeroR. We can
also conclude that our experimental designs were inefficient given that there were over 900 rules
produced for 1,989 examples. We cannot say that our experiment designs based on our testing algorithms
and key words chosen gave conclusive evidence to predict the stock market changes on the DJIA.
From the experiment, we conclude that the third experiment design (Open-to-High) showed better
accuracy for both the ‘All Key Words’ and ‘Top 20 Words’ factors and produced consistent results for both the 250-
300 frequency words and 500-850 frequency words.
Our goal with this project was to prove that a rise or fall in the DJIA could be predicted based off of the
prior days top 25 headlines from Reddit. This hypothesis was refined as the project went on to: attempting
to predict whether certain key words from the top 25 headlines on Reddit could accurately predict
whether the DJIA would rise or fall at various points throughout the day. Our hypothesis got much
narrower rather than broader as we went along; it is clear that we were chasing results the further we
got into the project. We were convinced that we were on to a bright idea with our hypothesis and we
started chasing proof. Ultimately, we came away unsuccessful in proving our theory. This is not to say
however, that our hypothesis is incorrect. It simply means that the manner in which we tested our
hypothesis was unable to prove it correct. If we were to start this project over again knowing what we
know now, there are a couple of things that we would/could do differently.
• Use different ranges of frequencies of our key words as our attributes
• Use a better tool for searching our key word attributes against the headlines of the day. We got
faulty counts with Excel that massively exceeded our original counts pulled from R.
• Use combinations of words rather than individual key words
• Use sentiment analysis
• Search for trends of rises/falls in the DJIA and then use key words, or groups of key words, which
occurred around those times to see if they were more effective at predicting our class
• Use a different source to get our headlines from
If we were to examine our best results against the original ZeroR baseline results, we see that ZeroR was
a better classifier for each group.
Looking at the table above, we can say that the model we created is not good with regard to determining
whether the DJIA would rise or fall based off of our assumptions. Taking what we have learned, we should
try again with perhaps some of the changes mentioned above.
APPENDIX
J48
META-BAGGING
PART
R Script
# install packages necessary for work below (note: the package name is "SnowballC")
Needed <- c("tm", "SnowballC", "RColorBrewer", "ggplot2", "wordcloud", "biclust",
            "cluster", "igraph", "fpc")
install.packages(Needed, dependencies=TRUE)
# load the libraries used in the steps below
library(tm)
library(SnowballC)
library(RColorBrewer)
library(ggplot2)
library(wordcloud)
# Create corpus
myCorpus <- Corpus(VectorSource(Combined_concatenated_headlines$Headlines))
# lowercase corpus (content_transformer keeps the result a valid tm corpus)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
# remove all punctuation from corpus
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove all numbers from Corpus
myCorpus <- tm_map(myCorpus, removeNumbers)
# Visualization of the corpus if desired
inspect(myCorpus)
# removing common words that usually have no analytical value
# ex. a, and, also, the, but, etc...
myCorpus <- tm_map(myCorpus, removeWords, stopwords("en"))
# Stemming corpus to try and get the roots of words
myCorpus <- tm_map(myCorpus, stemDocument)
# In order to complete the next two steps, the corpus will have to be changed into PlainTextDocument
# Recommend saving the corpus as a different name when done
myCorpusText <- tm_map(myCorpus, PlainTextDocument)
# create a dtm using the corpus that will be used in future steps
dtm <- DocumentTermMatrix(myCorpusText)
# create a new variable off of the dtm that will be used to create visualizations if desired
freq <- colSums(as.matrix(dtm))
length(freq)
# necessary structure for plots
wf <- data.frame(word=names(freq), freq=freq)
# the frequencies below can be adjusted to get a sample of data desired
p <- ggplot(subset(wf, freq>500 & freq<850), aes(word, freq))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
# separate visualization that can be adjusted according to the frequencies desired
# note: max.words caps the number of words plotted, not the maximum frequency
wordcloud(names(freq), freq, min.freq = 500, max.words = 850, random.order = FALSE,
          colors = brewer.pal(6, "Set1"))
# will display the available pallets of colors that can be chosen for the wordcloud above
display.brewer.all()
Word Cloud for the 500-850 Range:

Weitere ähnliche Inhalte

Ähnlich wie Reddit_DJIA_Project

Private Equity Performance Moved Into Positive Territory in Third Quarter 2003
 	Private Equity Performance Moved Into Positive Territory in Third Quarter 2003 	Private Equity Performance Moved Into Positive Territory in Third Quarter 2003
Private Equity Performance Moved Into Positive Territory in Third Quarter 2003
mensa25
 
Private equity-market-analysis-and-sizing-2012
Private equity-market-analysis-and-sizing-2012Private equity-market-analysis-and-sizing-2012
Private equity-market-analysis-and-sizing-2012
CAR FOR YOU
 
Djia presentation rev7
Djia presentation rev7Djia presentation rev7
Djia presentation rev7
eraserhead1982
 
Hyre Weekly Commentary
Hyre Weekly CommentaryHyre Weekly Commentary
Hyre Weekly Commentary
hyrejam
 
Leidos Capabilities Lite Brochure
Leidos Capabilities Lite BrochureLeidos Capabilities Lite Brochure
Leidos Capabilities Lite Brochure
Scott Conte
 
Wohlers report 2013 : Additive Manudacturing and 3D Printing State of the Ind...
Wohlers report 2013 : Additive Manudacturing and 3D Printing State of the Ind...Wohlers report 2013 : Additive Manudacturing and 3D Printing State of the Ind...
Wohlers report 2013 : Additive Manudacturing and 3D Printing State of the Ind...
alain Clapaud
 
Using Market Insights and Sales Data to Optimize Your Distribution Strategy
Using Market Insights and Sales Data to Optimize Your Distribution StrategyUsing Market Insights and Sales Data to Optimize Your Distribution Strategy
Using Market Insights and Sales Data to Optimize Your Distribution Strategy
Broadridge
 

Ähnlich wie Reddit_DJIA_Project (20)

Icron Technologies tactical strategic report
Icron Technologies tactical strategic reportIcron Technologies tactical strategic report
Icron Technologies tactical strategic report
 
Private Equity Performance Moved Into Positive Territory in Third Quarter 2003
 	Private Equity Performance Moved Into Positive Territory in Third Quarter 2003 	Private Equity Performance Moved Into Positive Territory in Third Quarter 2003
Private Equity Performance Moved Into Positive Territory in Third Quarter 2003
 
Enterprise solid state disk
Enterprise solid state diskEnterprise solid state disk
Enterprise solid state disk
 
News Media Metadata - The Current Landscape
News Media Metadata - The Current LandscapeNews Media Metadata - The Current Landscape
News Media Metadata - The Current Landscape
 
Private equity-market-analysis-and-sizing-2012
Private equity-market-analysis-and-sizing-2012Private equity-market-analysis-and-sizing-2012
Private equity-market-analysis-and-sizing-2012
 
Wall Street Letter - 11 08 10
Wall Street Letter - 11 08 10Wall Street Letter - 11 08 10
Wall Street Letter - 11 08 10
 
Bluekai Little Blue Book
Bluekai Little Blue BookBluekai Little Blue Book
Bluekai Little Blue Book
 
Open Data and News Analytics Demo
Open Data and News Analytics DemoOpen Data and News Analytics Demo
Open Data and News Analytics Demo
 
United States IT Asset Disposition Market by Product Type, Distribution Chann...
United States IT Asset Disposition Market by Product Type, Distribution Chann...United States IT Asset Disposition Market by Product Type, Distribution Chann...
United States IT Asset Disposition Market by Product Type, Distribution Chann...
 
Transmission Towers for Electric Power, 2014 Update : Global Market Size, Ave...
Transmission Towers for Electric Power, 2014 Update : Global Market Size, Ave...Transmission Towers for Electric Power, 2014 Update : Global Market Size, Ave...
Transmission Towers for Electric Power, 2014 Update : Global Market Size, Ave...
 
QuoteMediaOverview corporate presentation
QuoteMediaOverview corporate presentationQuoteMediaOverview corporate presentation
QuoteMediaOverview corporate presentation
 
Djia presentation rev7
Djia presentation rev7Djia presentation rev7
Djia presentation rev7
 
Hyre Weekly Commentary
Hyre Weekly CommentaryHyre Weekly Commentary
Hyre Weekly Commentary
 
Powerful Information Discovery with Big Knowledge Graphs –The Offshore Leaks ...
Powerful Information Discovery with Big Knowledge Graphs –The Offshore Leaks ...Powerful Information Discovery with Big Knowledge Graphs –The Offshore Leaks ...
Powerful Information Discovery with Big Knowledge Graphs –The Offshore Leaks ...
 
Leidos Capabilities Lite Brochure
Leidos Capabilities Lite BrochureLeidos Capabilities Lite Brochure
Leidos Capabilities Lite Brochure
 
Wohlers report 2013 : Additive Manudacturing and 3D Printing State of the Ind...
Wohlers report 2013 : Additive Manudacturing and 3D Printing State of the Ind...Wohlers report 2013 : Additive Manudacturing and 3D Printing State of the Ind...
Wohlers report 2013 : Additive Manudacturing and 3D Printing State of the Ind...
 
NIST - определения для Интернета вещей
NIST - определения для Интернета вещейNIST - определения для Интернета вещей
NIST - определения для Интернета вещей
 
IRJET - Forecasting Stock Market Movement Direction using Sentiment Analysis ...
IRJET - Forecasting Stock Market Movement Direction using Sentiment Analysis ...IRJET - Forecasting Stock Market Movement Direction using Sentiment Analysis ...
IRJET - Forecasting Stock Market Movement Direction using Sentiment Analysis ...
 
Using Market Insights and Sales Data to Optimize Your Distribution Strategy
Using Market Insights and Sales Data to Optimize Your Distribution StrategyUsing Market Insights and Sales Data to Optimize Your Distribution Strategy
Using Market Insights and Sales Data to Optimize Your Distribution Strategy
 
Seminar 2
Seminar 2   Seminar 2
Seminar 2
 

Reddit_DJIA_Project

  • 1. Vinodhine Rajasekar, Cory Swindle, Amal Dev Thomas, Arun Tom, Jason Warnstaff DATA MINING 5339-002 GROUP #1 REDDIT & DJIA
  • 2. 1 Contents Contents ........................................................................................................................................................................1 1 DATA BACKGROUND...................................................................................................................................................2 1.1 YAHOO FINANCE .................................................................................................................................................2 1.2 REDDIT.................................................................................................................................................................2 1.3 DATASET & ATTRIBUTES......................................................................................................................................2 1.4 CLASS ATTRIBUTE ................................................................................................................................................3 1.5 HYPOTHESIS.........................................................................................................................................................4 2 DATA CLEANING..........................................................................................................................................................4 2.1 DATA ISSUES........................................................................................................................................................4 2.2 DATA CLEANING PROCESS...................................................................................................................................5 3 Experiment Design......................................................................................................................................................8 3.1 FALSE PREDICTORS..............................................................................................................................................8 3.2 ZeroR - BENCHMARK ..........................................................................................................................................8 3.3 CLASSIFIER SELECTION.......................................................................................................................................10 3.4 FOUR CELL EXPERIMENT DESIGN ......................................................................................................................11 3.5 DATA REPRESENTATION....................................................................................................................................13 3.6 ASSUMPTIONS...................................................................................................................................................13 3.7 TARGET VARIABLES ...........................................................................................................................................13 Same Day ............................................................................................................................................................13 Open to Low........................................................................................................................................................14 Open to High.......................................................................................................................................................15 4 EXPERIMENT RESULTS 
..............................................................................................................................................16 4.1 RESULTS FOR EACH CLASSIFIER.........................................................................................................................16 4.2 SUMMARY OF RESULTS.....................................................................................................................................18 5 ANALYSIS & CONCLUSION ........................................................................................................................................21 5.1 POSSIBLE ASSUMPTIONS VIOLATED..................................................................................................................21 5.2 CONCLUSION.....................................................................................................................................................21 APPENDIX –..................................................................................................................................................................23 J48 ...........................................................................................................................................................................23 META-BAGGING ......................................................................................................................................................24 PART ........................................................................................................................................................................25 R Script ....................................................................................................................................................................26 Word Cloud for the 500-850 Range: .......................................................................................................................27
  • 3. 2 1 DATA BACKGROUND Our data set is comprised of two sets of information that were combined to create a class attribute. The first piece is from Reddit World News Channel and the second comes from Yahoo finance. The data set covers the dates from August 8th , 2008 to July 1st , 2016. 1.1 YAHOO FINANCE The section of our data that comes from Yahoo Finance is the Dow Jones Industrial Average values for the dates listed above. The Dow Jones Industrial Average (DJIA) is a price-weighted average of thirty significant stocks traded on the New York Stock Exchange (NYSE) and NASDAQ. It is a thirty-component index or grouping of stocks that is meant to be a gauge or indication of how the overall stock market is performing. The thirty stocks that comprise the Dow Jones Industrial Average (DJIA) dictate the movement of the entire stock market. The DJIA is a price weighted average, which means that stocks with higher share prices are given a greater weight in the index. The average is scaled in order to make up for the effects of stock splits and other adjustments, creating a stable value for the index. The DJIA was created by Dow Jones co-founders Charles Dow and Edward Jones in the late 1800's. There have been many changes since its original formulation and companies have been added to and taken off the Dow. General Electric is known as the 'Dow Veteran' because it is the only one of the original Dow components that has been a part of the index since creation. Some of the more recent additions include Cisco and Travelers which were added after General Motors was removed from the Dow in 2009. The regular adjustments and updates keep the Dow relevant and help it remain as reliable gauge of what the stock market looks like. 1.2 REDDIT Reddit calls themselves the “front page of the internet” and they are a social news aggregation, web content rating, and discussion website where users have the option to post content anonymously or under their registered user name. Registered users can then vote submissions up or down to determine their position on the page. The posts with the most positive votes appear on the front page or the top of a category. Reddit contains millions of channels, each known as ‘subreddits’ with topics ranging from news to gaming to anything else that you can think of. Our data set was created with the information crawled from a ‘subreddit’ called World News Channel. The top 25 headlines for each day in our stated date range were pulled and put into Excel with each headline serving as its own attribute. 1.3 DATASET & ATTRIBUTES Our data set came from 3 different files. Each is described below. 1. RedditNews.csv: 73,608 instances and 2 attributes. Attribute Name Description Date Corresponding date ranges from Aug 8, 2008 to July 1, 2016 in the format mm-dd-yyyy. News News headlines which are ranked from top to bottom based on their votes. Hence, there are 25 attributes for each date. 2. DJIA_table.csv: 1,989 instances and 7 attributes. Attribute Name Description Date Date ranges from Aug 8, 2008 to July 1, 2016 in the format mm-dd-yyyy Open The index value at the opening of the trade day High Highest index value of the trading day Low Lowest index value of the trading day
   Close - The index value at which the market closed on that trading day
   Volume - Number of component shares traded during the trading day
   Adj Close - Adjusted closing price

3. Combined_News_DJIA.csv: 1,989 instances and 27 attributes.

   Date - Corresponding date, ranging from Aug 8, 2008 to July 1, 2016, in the format mm/dd/yyyy
   Label - 1 to note that the DJIA Adj Close value rose or stayed the same, and 0 to note that the DJIA Adj Close value decreased
   Top1 ... Top25 - The day's 25 top-ranked headlines, as voted on by Reddit users (Top1 is the highest-ranked headline of the day, Top25 the lowest of the top 25)

1.4 CLASS ATTRIBUTE

Our class attribute, as presented in the data set, is a binary value. The value '0' means that the DJIA Adj Close decreased from the previous day. A value of '1' indicates that the DJIA Adj Close either increased or stayed the same from the previous day. We discuss how we transformed the given class attribute in the Experiment Design section of this paper.
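As a sanity check, the Label can be re-derived from DJIA_table.csv. The sketch below is not part of our original pipeline; the column name Adj.Close and the date format are assumptions based on the file description above and may need adjusting.

# Hypothetical sketch: re-derive the Label class attribute from DJIA_table.csv
djia <- read.csv("DJIA_table.csv", stringsAsFactors = FALSE)
# sort oldest-first; change the format string if the file stores dates differently
djia <- djia[order(as.Date(djia$Date, format = "%m-%d-%Y")), ]

adj <- djia$Adj.Close            # read.csv renames the "Adj Close" column to Adj.Close
# 1 = Adj Close rose or stayed the same vs. the prior day, 0 = it fell;
# the first trading day has no prior day and is left as NA
label <- c(NA, as.integer(diff(adj) >= 0))
table(label)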
1.5 HYPOTHESIS

Utilizing the described data set, we intend to predict whether the DJIA will rise or fall on the following day based on the prior day's top 25 world news headlines from Reddit.

2 DATA CLEANING

Data cleaning is the most critical and time-consuming piece of any data mining project. Ensuring that the data being tested is accurate and clean leads to more reliable results and a process that can be replicated by others who wish to test the validity of the results presented. Data cleaning consists of detecting and correcting corrupt or inaccurate records, removing inaccurate or irrelevant parts of the data, and standardizing the data so that algorithms can process it and assist in determining results.

2.1 DATA ISSUES

The data for our project was pulled from the Reddit website and dumped into an Excel CSV file. Pulling data from the internet into Excel can lead to numerous issues, especially when it involves text. A preliminary glance at our data showed that there were some 'trash' characters at the start of some of our headlines. With 1,989 instances and twenty-five headline attributes per instance, it would be daunting to look through the data manually to find and remove these characters. Also, without knowing what other potential 'trash' the data set contained, it would be very hard to script it out in R or remove it in Excel. Spellcheck was not an option either, as any word flagged as misspelled would require a judgment call from our group about whether it was intentionally spelled that way on Reddit or was actually an error. In the end, we decided that we had enough data to tolerate some of these 'trash' characters, and that they would not affect our intended use of the data anyway.

The next problem was determining how best to use the headlines to test against our class attribute. Text mining is still a developing field, and there are two main schools of thought on how best to utilize text data. Our initial plan was to use sentiment analysis, the process of scoring the sentiment of a sentence by aggregating the sentiment scores of the words it contains. As you can imagine, this can lead to some unique headaches. "The accuracy of a sentiment analysis system is, in principle, how well it agrees with human judgments. This is usually measured by precision and recall. However, according to research human raters typically agree 79% of the time. Thus, a 70% accurate program is doing nearly as well as humans, even though such accuracy may not sound impressive. If a program were "right" 100% of the time, humans would still disagree with it about 20% of the time, since they disagree that much about any answer." (Wikipedia.org, n.d.) Cultural factors, linguistic nuances, and differing contexts make it extremely difficult to turn a string of text into a simple pro or con statement. Due to this complexity, and the lack of accurate off-the-shelf processes for sentiment analysis, we opted for a different approach: we chose to look for key words occurring with a certain (to be determined later) frequency throughout our data set. Our base assumption was that certain words would have a positive or negative effect on the attitudes of people who invest in the stock market, influencing how and when they traded and thus affecting the DJIA.
In conversations at the beginning of the project we hypothesized that words such as terror, bomb, war, and explosion would have a negative effect on the DJIA. The thinking was that such headlines are normally associated with negative events in our minds, and would thus make people less (or, counterintuitively, perhaps even more) enthusiastic about investing. The last major issue we faced in using text mining and keywords was how to accurately determine the correct frequencies of words. As an example, England is referred to by numerous names.
There is Britain, the UK, and England, to name a few. How would we best handle a count if we wanted 'Great Britain' as a whole to be a key attribute in determining class? Additionally, do 'British' and 'English' mean the same thing? The same issue arose for regular words with suffixes or prefixes that we wanted to account for. We began looking for a way to break words down to their individual roots so that we could eliminate as much of this as possible. Stemming or lemmatization would have to be applied to our data set.

Stemming is the process of breaking words down to their root, to avoid having multiple surface forms of what is technically the same word. For example, 'bomb' would be the root word for 'bomber', 'bombing', 'bombings', 'bombarding', and 'bombarded', and in our view all of these should be counted as a single word. The stemming process looks at each word and, based on the assigned dictionary, reduces it to its root. This caused some interesting problems: a word such as 'issues' was broken down to 'issu'. As we found out, the stemming methods we used were not as effective as we had hoped (a short illustration follows at the end of this section). However, stemming was a better option than leaving the words as they were, so we pursued this method.

Once we had identified and developed a plan to mitigate our initial concerns, the next step was to select a tool that would let us accomplish everything we intended. We chose Excel for basic calculations and counts, and R for the more complex data mining tasks and frequency distributions. R is a language and environment for statistical computing and graphics. After some cursory research and conversations with data mining professionals, we selected R for its large online community that could assist us with a language we were unfamiliar with, the vast library of answered questions available at StackOverflow.com, and the availability of people in our circle of contacts who had used R.

2.2 DATA CLEANING PROCESS

The goal of our data cleaning was to come up with a list of roughly 50 words, selected by their frequency of occurrence throughout the data set, count the number of times those words appeared on each day, and place those counts against the corresponding date along with the class attribute. Our data, when we started, was separated in the following manner:

• Date attribute
• Label attribute (class attribute)
• Headline 1
• Headline 2
• …
• Headline 25

To extract keywords, it was necessary to combine all of the headline attributes into a single attribute. In Excel, we concatenated the headlines into one attribute, taking us from twenty-seven attributes down to three. The next step was to begin the cleanup of the headline attribute. As you can see in the sample of data below, there was some 'trash' that was easily identifiable and removable with basic Excel functionality. We removed b", b', "b", and "b' (likely remnants of Python byte-string encoding) from our data and then saved the file as .CSV.
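To make the stemming behavior concrete, here is a small illustration using SnowballC's wordStem(), the Porter stemmer that tm's stemDocument() applies; the exact outputs may vary with the stemmer version.

library(SnowballC)
# inflected forms collapse onto a shared root...
wordStem(c("bomb", "bombs", "bombing", "bombings"))   # all become "bomb"
# ...but the root is not always a real word, as we saw with "issues"
wordStem(c("issue", "issues"))                        # both become "issu"
# and some related forms we hoped would merge stay distinct
wordStem(c("bomber", "bombarded"))                    # "bomber", "bombard"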
The next step was to import the .CSV into our R workspace so we could continue the data cleaning. The view below is a snapshot of the data once imported. *The full R script is shown in the Appendix.

To move forward with our imported data in R, we needed to create a corpus, the main structure for managing documents in R's text mining (tm) package. A corpus is a representation of the text documents held in memory so that it is easily accessible for data manipulation. Once we had created our corpus, we could begin breaking the headlines down into the individual words that would serve as our attributes going forward. The first step was to convert all text to lower case, remove punctuation, remove numbers, remove common stop words, and stem the words in the document. Below is a view of our corpus after case-folding and the removal of punctuation and numbers.

Next, a term-document matrix (TDM) and a document-term matrix (DTM) were created from our corpus to better organize the data for frequency searches. We then needed to know the overall range of term frequencies in the data so we could search for sub-ranges that might lead to results. To accomplish this, we created a new variable, 'freq', holding the column sums of the DTM, i.e. the corpus-wide count of each term. Using 'freq', we were able to produce visualizations to determine which ranges of values would be useful in our tests. After some manipulation of the visualizations, we settled on two ranges with about 50 terms each that we felt would provide results.
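For reference, tm can also return the terms in a frequency band directly. A minimal sketch using the dtm built above; findFreqTerms() filters terms by their corpus-wide counts:

library(tm)
# terms appearing between 500 and 850 times across the whole corpus
high.range <- findFreqTerms(dtm, lowfreq = 500, highfreq = 850)
# terms appearing between 250 and 300 times
low.range  <- findFreqTerms(dtm, lowfreq = 250, highfreq = 300)
length(high.range); length(low.range)   # about 50 terms each in our data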
The charts we created in R are below:

[Chart: term frequencies in the 500-850 range]
[Chart: term frequencies in the 250-300 range]

Once we had our desired lists of frequent terms, we returned to Excel for the remaining steps of the data cleaning process. We re-opened the CSV file that had been imported to R and removed the Date and Label attributes; we were after counts of our frequent terms, and Date and Label were not needed for this step. The headline attribute was now the first attribute, with the concatenated headlines as its instances. For the remaining attributes, we added the key words from each selected range, leaving us with two separate files, one per frequency range. Next, we needed a formula that would take each attribute name, scan every instance of the headline attribute, and count how many times the keyword appears. The Excel formula we created is below:

• =(LEN(B2)-LEN(SUBSTITUTE(B2,$C$1,"")))/LEN($C$1)

The formula counts occurrences by measuring how much shorter the headline string in B2 becomes when every occurrence of the keyword in C1 is removed, then dividing that difference by the keyword's length. It accomplished what we wanted, but had drawbacks that are discussed in the False Predictors section of this paper. After the formula was run against the headline attribute, that attribute was no longer needed; we deleted it and added back the Date and Label attributes. Our data set now had a count for each attribute on each instance: each file showed individual words (as attributes) that appeared either 250-300 or 500-850 times throughout the data set, and how many times each word occurred on each day. At this point, we considered our data 'clean' and were ready to move forward with testing. We understood that further manipulation of our data might be necessary depending on how testing went. However, the
bulk of the work of determining our attributes and producing counts to test against our class variable was complete.

3 Experiment Design

3.1 FALSE PREDICTORS

After the data cleaning process, we ran various algorithms on the clean data set and encountered a few errors with respect to the words learned from our R processing. While the Excel formula gave us counts of our keyword attributes against the headline instances, those counts had issues. Because this work was performed in Excel, we were bound by Excel's constraints: instead of searching for the distinct keyword in each headline instance, our formula matched the same letter pattern anywhere in the text. This meant we would often return a greater count for a keyword attribute than its original frequency found in R. For example, the attribute 'man' returns a count every time 'man' appears in an instance, but also for 'manage', 'mankind', 'woman', or any other word containing that combination of letters, so our counts were likely exaggerated. To mitigate this, we summed the count for each attribute and made judgment calls on whether to keep it. In some cases an attribute whose original frequency had been 250-300 or 500-850 skyrocketed to 1,500 or 2,000. Since there were only 1,989 instances, we believed that an attribute with a count greater than the total number of instances would act more as a false predictor than as valuable information, so several words were removed from our keyword attribute list.

Another false predictor, identified very early in our testing process, was the Date attribute. In our first tests we were getting very high accuracy from OneR, which made us very excited; however, when we investigated the rule being created, it was based on the Date attribute. Since our knowledge of the data set told us that Date should not, in any way, be a factor in predicting the class attribute, we removed it. Because our data was already laid out in date order, we knew we could add the attribute back later if a different test needed it.
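Returning to the substring-matching problem above: whole-word matching in R would have avoided the inflated Excel counts. A minimal sketch, where headlines is the concatenated-headline vector and keywords is the stemmed keyword list (both hypothetical names):

# \b word boundaries stop "man" from also matching "manage", "mankind", or "woman".
# Stemmed keywords such as "issu" would need the trailing boundary relaxed
# (e.g. "\\bissu") so they still match "issue" and "issues".
count_word <- function(keyword, headlines) {
  hits <- gregexpr(paste0("\\b", keyword, "\\b"), headlines, ignore.case = TRUE)
  sapply(hits, function(m) if (m[1] == -1) 0L else length(m))  # -1 means no match
}

# one row per trading day, one column of counts per keyword
counts <- sapply(keywords, count_word, headlines = headlines)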
3.2 ZeroR - BENCHMARK

ZeroR is the simplest classification method available for data mining. It ignores all attributes other than the target and simply predicts the majority class value for every instance. We ran our clean data sets against ZeroR to set a benchmark for our results, and the tables below have those findings.

3.3 CLASSIFIER SELECTION

Due to the number of assumptions in our experiment, we chose our classifiers by running the whole data set through a variety of algorithms to see where we would get the best results. Since our goal was predictive analysis of stock prices on a particular day, we wanted classifiers that deliver high accuracy and precision. Looking at the tables below, the classifiers J48, PART, and Bagging produced considerably better results (when run with the full data set) than ZeroR or any of the other classifiers we tested. For this reason, we moved our experiment forward with these classifiers.

* 58 keywords with a frequency of 250 to 300. 'Date' removed.
* 20 keywords with a frequency of 250 to 300. 'Date' removed.
* 57 keywords with a frequency of 500 to 850. 'Date' removed.
* 20 keywords with a frequency of 500 to 850. 'Date' removed.
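We ran these comparisons in the Weka workbench, but the same selection step could be scripted. A minimal sketch using the RWeka interface, assuming data is the cleaned keyword-count data frame with a factor Label column; the seed and split are illustrative, and sample() gives a randomized split rather than Weka's ordered percentage split.

library(RWeka)

set.seed(1)                                  # illustrative seed
n     <- nrow(data)
train <- sample(n, round(0.8 * n))           # 80/20 split

models <- list(
  J48     = J48(Label ~ ., data = data[train, ]),
  PART    = PART(Label ~ ., data = data[train, ]),
  Bagging = Bagging(Label ~ ., data = data[train, ])
)

# percent correct on the held-out 20% for each classifier
sapply(models, function(m)
  evaluate_Weka_classifier(m, newdata = data[-train, ])$details["pctCorrect"])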
• J48 (tree) - J48 belongs to the tree group of classifiers and is Weka's implementation of the C4.5 decision tree learner. It builds a tree by repeatedly selecting the attribute that most improves prediction accuracy at each split. In the data mining field, J48 is well known for building high-accuracy models for predictive analysis.

• Bagging (meta) - Bagging is an ensemble method that trains each base model on a bootstrap resample of the training data and combines their decisions into a single prediction by cumulative vote. This model produces reliable results because predictions made by voting become more accurate as more votes are taken into account. Decisions rarely deteriorate when new training sets are sampled, trees are built for them, and their predictions join the vote; in particular, the combined classifier will seldom be less accurate than a decision tree constructed from just one of the data sets.

• PART (rule) - PART is a rule learner built on partial decision trees. It uses the separate-and-conquer method: in each iteration it builds a pre-pruned partial C4.5 (J48) decision tree, turns the "best" leaf into a rule, removes the instances that rule covers, and repeats until no instances remain.

3.4 FOUR CELL EXPERIMENT DESIGN

Two Factor Design: Our experiment design contains two factors:

1. Factor 1 (F1): number of attributes
   a. Using the full set of keyword attributes or the top 20 keyword attributes
2. Factor 2 (F2): percentage split
   a. 80/20 or 20/80

Criteria of the Design: Crossing the two factors with the two keyword-frequency sets yields 8 criteria per experiment design, keeping one factor constant while varying the other between its two values, and vice versa. This is illustrated in the table below.
Location   Test   Frequency   Experiment Design   Attributes            Split
F11        C1     250-300     SAME DAY            All attributes        80% - 20%
F12        C2     500-850     SAME DAY            All attributes        80% - 20%
F13        C3     250-300     SAME DAY            Selected attributes   80% - 20%
F14        C4     500-850     SAME DAY            Selected attributes   80% - 20%
F21        C5     250-300     SAME DAY            All attributes        20% - 80%
F22        C6     500-850     SAME DAY            All attributes        20% - 80%
F23        C7     250-300     SAME DAY            Selected attributes   20% - 80%
F24        C8     500-850     SAME DAY            Selected attributes   20% - 80%
F11        C9     250-300     OPEN-TO-LOW         All attributes        80% - 20%
F12        C10    500-850     OPEN-TO-LOW         All attributes        80% - 20%
F13        C11    250-300     OPEN-TO-LOW         Selected attributes   80% - 20%
F14        C12    500-850     OPEN-TO-LOW         Selected attributes   80% - 20%
F21        C13    250-300     OPEN-TO-LOW         All attributes        20% - 80%
F22        C14    500-850     OPEN-TO-LOW         All attributes        20% - 80%
F23        C15    250-300     OPEN-TO-LOW         Selected attributes   20% - 80%
F24        C16    500-850     OPEN-TO-LOW         Selected attributes   20% - 80%
F11        C17    250-300     OPEN-TO-HIGH        All attributes        80% - 20%
F12        C18    500-850     OPEN-TO-HIGH        All attributes        80% - 20%
F13        C19    250-300     OPEN-TO-HIGH        Selected attributes   80% - 20%
F14        C20    500-850     OPEN-TO-HIGH        Selected attributes   80% - 20%
F21        C21    250-300     OPEN-TO-HIGH        All attributes        20% - 80%
F22        C22    500-850     OPEN-TO-HIGH        All attributes        20% - 80%
F23        C23    250-300     OPEN-TO-HIGH        Selected attributes   20% - 80%
F24        C24    500-850     OPEN-TO-HIGH        Selected attributes   20% - 80%
Once we had clearly defined our criteria for testing, we were ready to complete the experiment design for SAME DAY, OPEN-TO-LOW, and OPEN-TO-HIGH. To make our training and test data truly representative (there was a chance that the data might lose its properties through sampling and running the classifiers), we planned ten runs for each criterion, with a distinct seed value for each run:

• Total number of experiment runs = number of experiment designs * number of criteria * number of classifiers * number of runs
• 3 * 8 * 3 * 10 = 720 runs

3.5 DATA REPRESENTATION

Initially, the headlines were laid out as the instances of our data. After the data cleaning process, the keywords from the selected frequency ranges (500-850 / 250-300) became the attributes, and each instance held the counts of those keywords on the corresponding date. We made the following assumptions to carry out the experiment.

3.6 ASSUMPTIONS

1. We assume that the top 25 headlines were published only during stock market hours (8:30 AM to 3:00 PM). Hence, we came up with the SAME DAY experiment design.
   a. We calculated the price change for this criterion as:
   b. Same Day = (Market Close - Market Open) / Market Open
2. We assume that the top 25 headlines influence the stock market index in either the positive or the negative direction. Hence, we came up with two additional categories of experiment design:
   a. Open-to-High = (Market High - Market Open) / Market Open
   b. Open-to-Low = (Market Low - Market Open) / Market Open
3. We assume that we captured all of the possible keywords.
4. We assume that world news stories affect how people react when trading stocks.
5. We assume that the market is influenced, positively or negatively, by the positive or negative connotations of news stories.

3.7 TARGET VARIABLES

Given the assumptions above, we came up with the sets of target variables for our class attributes described below; a sketch of the underlying calculations follows.
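A minimal sketch of the three price-change measures and the binning into effect levels, using the DJIA columns from section 1.3; the cut points here are placeholders for the ones we actually read off the distributions shown below.

# price-change measures from assumptions 1 and 2 (djia as read from DJIA_table.csv)
same.day     <- (djia$Close - djia$Open) / djia$Open
open.to.high <- (djia$High  - djia$Open) / djia$Open   # >= 0 by definition
open.to.low  <- (djia$Low   - djia$Open) / djia$Open   # <= 0 by definition

# bin Same Day into the three target levels; the breakpoints are placeholders
target.same <- cut(same.day,
                   breaks = c(-Inf, -0.001, 0.001, Inf),
                   labels = c("LOW EFFECT", "NO EFFECT", "POSITIVE EFFECT"))
table(target.same)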
Same Day

For the Same Day experiment design, we named the target variables:

• POSITIVE EFFECT - an increase in the stock price compared to the market open
• LOW EFFECT - a decrease in the stock price compared to the market open
• NO EFFECT - no change between the market open and market close

The values for these target variables were derived by looking at the distribution of our calculations in the chart below.

[Chart: distribution of Same Day price changes]

Open to Low

For the Open-to-Low experiment design, we analyzed the negative effect the headlines can have on the market price. We named the target variables:

• HIGH EFFECT - a large negative effect on the market price with respect to the market open
• MODERATE EFFECT - a medium decrease in the stock price compared to the market open
• NO EFFECT - a minor decrease in the market price with respect to the market open
The values for these target variables were derived by looking at the distribution of our calculations in the chart below.

[Chart: distribution of Open-to-Low price changes]

Open to High

For the Open-to-High experiment design, we analyzed the positive effect the headlines can have on the market price. We named the target variables:

• HIGH EFFECT - a large positive effect on the market price with respect to the market open
• MODERATE EFFECT - a medium increase in the stock price compared to the market open
• NO EFFECT - a minor increase in the market price with respect to the market open
The values for these target variables were derived by looking at the distribution of our calculations in the chart below.

[Chart: distribution of Open-to-High price changes]

4 EXPERIMENT RESULTS

4.1 RESULTS FOR EACH CLASSIFIER

The tables below describe all of the possible combinations of our criteria with the three selected classifiers. Each combination was run ten times, using a different seed value for each run, as explained in the Experiment Design section. The accuracy of each test is recorded below, along with the average and standard deviation across the ten runs.

J48 - Detailed chart explanation in Appendix
Meta-Bagging - Detailed chart explanation in Appendix

PART - Detailed chart explanation in Appendix
As depicted in the tables above, we used three classifiers in the analysis process: J48, Meta-Bagging, and PART. Each classifier was trained and tested under the following combinations of test criteria, leading to 24 sets of algorithm runs per classifier:

• Dividing the market based on the stock values in a day:
  o Opentolow
  o Opentohigh
  o Sameday
• Percentage split:
  o 80% training - 20% testing
  o 20% training - 80% testing
• Attribute selection:
  o Using all attributes (all words in the selected frequency ranges)
  o Using the top 20 attributes (based on each word's total count of occurrences in the data set)
• Frequency of keywords:
  o 250-300 frequency words
  o 500-850 frequency words

4.2 SUMMARY OF RESULTS

The tables below show the average accuracy for each of our algorithms under the experiment designs used to partition our stock market data (Sameday, OpentoLow, OpentoHigh). The legend numbers for each algorithm refer to the corresponding tests identified in the charts above; there is also a key below each chart to help clarify.
[Chart: average accuracy by test, Sameday]
1-4 = 250-300 frequency range; 5-8 = 500-850 frequency range
Odd numbers = Top 20 words; even numbers = full list
1, 2, 5, 6 = 80/20 split; 3, 4, 7, 8 = 20/80 split

[Chart: average accuracy by test, OpentoLow]
9-12 = 250-300 frequency range; 13-16 = 500-850 frequency range
Odd numbers = Top 20 words; even numbers = full list
9, 10, 13, 14 = 80/20 split; 11, 12, 15, 16 = 20/80 split
[Chart: average accuracy by test, OpentoHigh]
17-20 = 250-300 frequency range; 21-24 = 500-850 frequency range
Odd numbers = Top 20 words; even numbers = full list
17, 18, 21, 22 = 80/20 split; 19, 20, 23, 24 = 20/80 split

A glance at the tables above shows that, by percentage accuracy, Meta-Bagging appears to be the most accurate classifier overall across our experiment design factors. J48 was marginally better for the OpentoLow group but worse for the other two. Across the stock market segments, Meta-Bagging showed average accuracies of 34.29% (Sameday), 33.72% (OpentoLow), and 40.16% (OpentoHigh). For Meta-Bagging and J48, the tests on the 'Top 20' group are the least accurate, while the same group does not show matching results for PART: on the 250-300 group of words, PART shows no comparable drop for the Top 20 subset, but on the 500-850 group its accuracy does drop when restricted to the Top 20 values.

Almost unanimously, the full list of key words outperformed the top 20 key words under the same factors. This tells us that the more words we have to predict with, the better our chance of predicting the class attribute accurately. In a few cases this was easily noticeable (OpentoLow and OpentoHigh); the rest of the time the increase was only marginal. For both the OpentoLow and OpentoHigh tests, the noticeable difference came in the lower frequency range of words, suggesting that including more words at lower frequencies increases accuracy in predicting the class attribute.

However, the 250-300 group of words shows some strange behavior in the averages across the test criteria: the average percentage accuracy is identical for the 80/20 and the 20/80 split when using the Top 20 key words. This goes against conventional wisdom; we would expect accuracy to increase as the share of training data increases. That is not the case for this split on the Top 20 key words, across all three algorithms. For this reason, given our test criteria and experiment design, we would not recommend using the 250-300 grouping of words for any predictive testing until it is fully understood why this happens.
Meta-Bagging produced better results both when considering all of the keyword attributes and when using just the top 20 attributes, and it showed consistent values across the 250-300 and 500-850 frequency words. By far the highest accuracy we achieved was with Meta-Bagging on the 250-300 group of words, using Open-to-High with the full word list: greater than 50% accuracy on both the 80/20 and the 20/80 split. The problem is that, as explained earlier, both splits returned identical values.

5 ANALYSIS & CONCLUSION

5.1 POSSIBLE ASSUMPTIONS VIOLATED

There seem to be inherent assumption violations in our project, as the results we received were at times illogical. Our first assumption, "the top 25 headlines were published only during the stock market hours (8:30 AM to 3:00 PM)", is violated because we have no pre-trading and post-trading data to tie headlines to trading hours: we do not know at what time the headlines were posted, only that they appeared between 12:00 a.m. and 11:59 p.m.

It is also illogical that some of our tests returned the same or better results on the 20/80 splits than on the 80/20 splits. This occurred mostly with the 250-300 grouping of key words. We did not investigate the reason, but noted the results and would explore them further in future experiments.

5.2 CONCLUSION

We observed that the accuracy of each experiment design was close to our ZeroR benchmark. We can also conclude that our experimental designs were inefficient, given that over 900 rules were produced for 1,989 examples. We cannot say that our experiment designs, testing algorithms, and chosen key words gave conclusive evidence for predicting stock market changes on the DJIA. We do conclude that the third experiment design (Open-to-High) showed better accuracy for both the 'All Key Words' and 'Top 20 Words' factors and produced consistent results for both the 250-300 and the 500-850 frequency words.

Our goal with this project was to show that a rise or fall in the DJIA could be predicted from the prior day's top 25 headlines on Reddit. As the project went on, this was refined to: attempting to predict whether certain key words from the top 25 headlines on Reddit could accurately predict whether the DJIA would rise or fall at various points throughout the day. Our hypothesis clearly got narrower rather than broader as we went along; we were convinced we were on to a bright idea and started chasing proof. Ultimately, we were unsuccessful in proving our theory. That is not to say our hypothesis is incorrect; it means the manner in which we tested it was unable to prove it correct. If we were to start this project over knowing what we know now, there are several things we would do differently:

• Use different ranges of key word frequencies as our attributes
• Use a better tool for matching our key word attributes against the headlines of the day; we got faulty counts from Excel that massively exceeded our original counts pulled from R
• Use combinations of words rather than individual key words
• Use sentiment analysis
• Search for trends of rises and falls in the DJIA, and then test whether key words (or groups of key words) occurring around those times were more effective at predicting our class
• Use a different source for our headlines

Examining our best results against the original ZeroR baseline, we see that ZeroR was the better classifier for each group. Looking at the table above, we can say that the model we created is not good at determining whether the DJIA will rise or fall under our assumptions. Taking what we have learned, we should try again, perhaps with some of the changes mentioned above.
R Script

# install packages necessary for the work below
Needed <- c("tm", "SnowballC", "RColorBrewer", "ggplot2", "wordcloud",
            "biclust", "cluster", "igraph", "fpc")
install.packages(Needed, dependencies=TRUE)

# load the libraries used below
library(tm)
library(SnowballC)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)

# Create corpus
myCorpus <- Corpus(VectorSource(Combined_concatenated_headlines$Headlines))

# lowercase corpus (tolower is a base function, so it must be wrapped
# in content_transformer() for tm_map)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))

# remove all punctuation from corpus
myCorpus <- tm_map(myCorpus, removePunctuation)

# remove all numbers from corpus
myCorpus <- tm_map(myCorpus, removeNumbers)

# Visualization of the corpus if desired
inspect(myCorpus)

# remove common stop words that usually have no analytical value
# ex. a, and, also, the, but, etc.
myCorpus <- tm_map(myCorpus, removeWords, stopwords("en"))

# Stem the corpus to try to get the roots of words
myCorpus <- tm_map(myCorpus, stemDocument)

# In order to complete the next steps, the corpus has to be changed into
# a PlainTextDocument; recommend saving it under a different name
myCorpusText <- tm_map(myCorpus, PlainTextDocument)

# create a dtm from the corpus for use in the following steps
dtm <- DocumentTermMatrix(myCorpusText)

# corpus-wide frequency of each term, used for the visualizations below
freq <- colSums(as.matrix(dtm))
length(freq)

# necessary structure for plots
wf <- data.frame(word=names(freq), freq=freq)

# the frequency bounds below can be adjusted to get the desired sample
p <- ggplot(subset(wf, freq>500 & freq<850), aes(word, freq))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p

# separate visualization; min.freq filters by corpus frequency, while
# max.words caps the number of words drawn (it is not a frequency ceiling)
wordcloud(names(freq), freq, min.freq = 500, max.words = 850,
          random.order = FALSE, colors = brewer.pal(6, "Set1"))

# displays the available palettes that can be chosen for the wordcloud above
display.brewer.all()

Word Cloud for the 500-850 Range:

[Word cloud image]