University of Dublin
TRINITY COLLEGE
Modelling Financial Market Forces Using
Regression and Sentiment Analysis
Mark John Lyons
B.A.I. Engineering
Final Year Project April 2016
Supervisor: Professor Khurshid Ahmad
School of Computer Science and Statistics
O’Reilly Institute, Trinity College, Dublin 2, Ireland
Abstract
The aim of the project was to model the dynamics of
financial markets in order to observe and test the
validity of financial theories, specifically mean
reversion, volatility clustering and attribute framing.
Regression and sentiment analysis were combined to
achieve this. The R programming language was used to
compute the statistics involved in the project.
Acknowledgements
I would like to thank my supervisor, Khurshid Ahmad,
for giving me this opportunity. He provided great advice
and helpful nudges throughout the project. I enjoyed the
discussions on a range of topics: finance, computer
science, maths and a small bit of history.
Thank you also to Stephen Kelly, PhD student at Trinity
College, for his help along the way, particularly for
introducing me to the Rocksteady program and
discussing R with me.
Finally, thank you to my family and friends for their
support throughout, in particular my mother and father,
Margaret and Pat Lyons, for all the help they gave me.
Table of Contents:
Abstract
Acknowledgements
1. Introduction
2. Motivation and Lit Review
2.1. Economics
2.2. Financial Markets & Monetary Policy
2.3. Behavioural Finance
2.4. Conclusion
3. Method
3.1. Stochastics & Financial Series
3.2. Stationarity & Returns
3.3. Stylised Facts & Summary Statistics
3.4. Linear Regression
3.5. Ordinary Least Squares
3.6. OLS Assumptions
3.7. Autoregression
3.8. Vector Autoregression
3.9. LexisNexis Corpus
3.10. Sentiment Analysis
3.11. Tableau
4. Case Studies and Results
4.1. Summary Statistics
4.2. Autoregression
4.3. Vector Autoregression
5. Conclusion and Future Work
5.1. Work Completed
5.2. Conclusions
5.3. Future Work
1 Introduction:
This project began from a desire to pursue a project in
an area of personal interest: currency, or forex, markets. The
ever-changing value of money and the effect it can have on
whole countries have always fascinated me. The wish was to
learn Computer Engineering skills to better understand the
market. Statistical research was first done on the forex market
and then extended to financial markets as a whole. A
comparison of the behaviour of three of the major financial
markets (the bond market, the stock market and, as stated, the
forex market) will be made.
Traditional finance theory states that markets should
usually be rational and exhibit mean reversion to reassert an
asset's price. The newer, emerging field of behavioural finance
states that the market is not as efficient as theorised and that
the sentiment of traders will affect it. The project aimed to
model market behaviour after the Global Financial Crisis to see
whether it performed mean reversion over the period, as
theoretically proposed. Behavioural finance theories were
assessed to see whether sentiment affects market movement when
the market acts irrationally. Volatility clustering was also of
interest and was observed in the market. Modelling the nature
of the market is essential for finding investment opportunities
and for risk management in financial portfolios.
The project involved learning data analysis techniques such
as the ETL (extract, transform and load) process. Text analysis
was applied to determine the implicit sentiment on the markets
in major publications around the world. Knowledge of statistical
theory in the context of finance was essential before the data
could be modelled accurately. The R programming environment and
language were learned to warehouse the large data sets and apply
the statistical methods. Lastly, Tableau, an emerging standard
for data visualisation, was used to display the results as
clearly as possible.
2 Motivation and Lit Review:
2.1 Economics
Economics is a social science studying the movement of
goods, services and wealth. BusinessDictionary.com defines
economics as “the theories, principles, and models that deal
with how the market process works”. Economies and financial
markets are intrinsically linked in the capitalist system.
The theoretical relationship between the US economy and its
stock market, for example, is shown in figure 1 below. Also of
note in the figure is the idea of economic cycles: a periodic
cycle through recession and recovery.
Figure 1
The economic crisis of 2007–2008, from which many
countries including the U.S. are still only slowly recovering,
was of historic proportions, involving a financial market
collapse, a rapid rise in unemployment, an unprecedented
decline in world trade and massive government intervention
aimed at reversing the downturn (Heilbroner and Milberg,
2011). This economic crisis, called the Global Financial Crisis,
was the worst seen since the Great Depression of the 1930s. Of
interest to this project was the behaviour of the stock markets
as an economic indicator post-crisis.
The U.S. National Bureau of Economic Research (NBER)
determined the economic trough of the Global Financial Crisis
to be June 2009. This trough is marked in figure 2, a graph of
two stock market indices: the S&P 500 and the NASDAQ Composite.
The figure gives empirical evidence of the theoretical
relationship shown in figure 1. The S&P 500 index will be
used as the indicator when modelling stock market behaviour.
Figure 2
2.2 Financial Markets & Monetary Policy
The U.S. Federal Reserve (Fed) sets monetary policy in
order to control this economic growth and contraction. Before
discussing monetary policy we need to understand how
financial assets are priced in the market. Asset prices in the
market are determined through price discovery. The invisible
hand of the pricing mechanism coordinates supply and demand
in markets in a way that is automatically in the best interests of
society (Scott, 2006). Traditional finance theory would have us
believe that the free market will keep prices fair and balanced
and that arbitrageurs (people who utilise arbitrage*) will take
advantage of any deviations, thus restoring the equilibrium. The
return to equilibrium occurs through mean reversion, which is
defined as “the theory that interest rates, security prices, and
various economic indicators will, over time, return to their
long-term averages after a significant short-term move”. This
view of a market that keeps prices fair is associated with the
efficient market hypothesis.
*Investopedia: Arbitrage is the simultaneous purchase and sale
of an asset in order to profit from a difference in the price.
Figure 3
Intuitively, mean reversion means that a positive
change in price tends to be followed by a negative change and
vice versa. Figure 3 depicts mean reversion.
The three primary methods of implementing monetary
policy are setting interest rates, buying or selling U.S.
treasuries on the open market and changing dollar reserve
requirements. These actions affect the value of the dollar and
the rate of return on treasury bonds. A deeper discussion of
monetary policy and the means by which it is carried out is not
presented here. The rate of return on U.S. treasuries is an
important indicator of the market's trust in the U.S. economy,
and as such we will also assess it.
The Fed discusses monetary policy in Federal Open
Market Committee (FOMC) meetings, 8 of which are scheduled
every year. After each 2 day meeting the Fed issues an
announcement giving its view on economic activity, its forecasts
for future activity and any changes to monetary policy. It is
stated that the volatility of asset prices, such as the S&P 500
and dollar foreign exchange rates, increases on announcement
days and in particular around the release time of the
announcement. To test this statement the minute by minute rate
of change, used here as a measure of volatility, of the EUR/USD
exchange rate was calculated for both an average day and an FOMC
announcement day. This is shown in figure 4. Unfortunately,
intraday data for the S&P 500 and Treasury bonds could not be
obtained freely to do similar calculations. The graphs' y axes
are normalised to the maximum return value of both series.
Figure 4
Note the overall increase in volatility over the
announcement day and the particular increases at the
announcement time and the end of the trading day. Per the
mathematician Mandelbrot (1963), “large changes tend to be
followed by large changes, of either sign, and small changes
tend to be followed by small changes.” This is known as
volatility clustering. The DXY (the U.S. Dollar Index) will be
the final economic indicator modelled.
A paper by Romer and Romer (2000) discusses the Fed's
FOMC forecasts and concludes that “the Fed has information
about future inflation that market participants do not have”.
Future inflation levels determine the future value of the dollar.
The FOMC announcement is therefore heavily dissected by market
participants, who speculate on the future prospects of the
dollar. The asymmetric holding of information by the Fed can
therefore cause movement in the dollar price as arbitrageurs
buy or sell dollars to capitalise on its long term change. As
previously described, traditional finance theory assumes the
supply and demand equilibrium will be restored by the rational
market and the true price rediscovered. However, the speculation
about the future value also has an effect on the current price
beyond supply and demand. This effect falls under the branch of
behavioural finance, which attempts to explain price anomalies
in terms of the biased behaviour of individuals.
2.3 Behavioural Finance
Behavioural finance is “a new approach to financial
markets that has emerged, at least in part, in response to the
difficulties faced by the traditional paradigm. In broad terms, it
argues that some financial phenomena can be better
understood using models in which some agents are not fully
rational” (Barberis and Thaler, 2003). It states that the market
is exuberant rather than rational and that arbitrage may not
always offset shocks to the market. Figure 5 shows empirical
evidence that assets demonstrate high volatility much more
frequently than expected. These instances of irrationality are
of concern to market participants. From figure 4 it can be seen
that volatility on FOMC days regularly exceeds expectation. We
infer that the market is acting exuberantly and turn to
behavioural finance theory to see if it can explain the activity.
Figure 5 (reproduced from Khurshid Ahmad's Behavioural Finance
lectures)
One important observation in behavioural finance is
framing. Entman (1993) summarises framing as “selecting some
aspects of perceived reality and make them more salient in the
communicating text, in such a way as to promote a particular
problem definition, causal interpretation, moral evaluation
and/or treatment recommendation for the item described”. We
hypothesise that the framing of information about the FOMC
announcement can push uncertain speculators towards a
certain bias. This type of framing is called attribute
framing. Panasiak and Terry (2013) say of attribute framing that
“an event can receive different reviews when it is framed in a
positive vs negative light”. To test the hypothesis that positive
or negative framing can sway the bias, and consequently the
activity, of the market, we evaluate the sentiment of major
worldwide publications when reviewing the FOMC meetings. If
there is correlation between market movement and the
sentiment over a long period of time, our hypothesis is
supported.
2.4 Conclusion
The goal is to model mean reversion of the market over
a long period and also to analyse sentiment's effect on market
movement. This is achieved using regression analysis.
Regression analysis is a branch of statistical modelling which
aims to estimate the relationship between variables.
In order to model mean reversion the autoregressive area
of regression analysis was researched. Autoregressive models
of time series estimate the effect of previous values on future
values of a variable. Using the definition of mean reversion
that a positive change will be followed by a negative one (and
vice versa), we see it is necessary to model the relationship
between each price change and the price change before it.
Autoregression is therefore a suitable model for testing for
mean reversion.
Modelling the influence of sentiment on price movement
involves the simultaneous analysis of multiple variables,
termed multivariate analysis. The vector autoregressive model
explains an endogenous variable by a range of exogenous
variables. In the words of Del Negro and Schorfheide (2011), “at
first glance, VARs appear to be straightforward multivariate
generalisations of univariate autoregressive models. At second
sight, they turn out to be one of the key empirical tools in
modern macroeconomics”. The power of vector autoregressive
models comes from the ability to model seemingly unrelated
variables and determine their interdependencies. Vector
autoregression is chosen to model the correlation between
sentiment and price change.
3 Method:
3.1 Stochastic Processes & Financial Series
A stochastic process is a sequence of random variables,
$\{X_t\}$, indexed by $t$, where $t$ ranges over a set $T$ that is
usually a subset of time $[0, \infty)$.
Many natural processes are modelled as stochastic due to their
random behaviour. Since the closing price of an asset
tomorrow, $P_{t+1}$, cannot be predicted today, we regard $P_{t+1}$
as a random variable (Taylor, 2005). The set of prices, $\{P_t\}$,
can then be thought of as a set of random variables or a
realisation of a stochastic process.
Price changes can occur at any point on the time scale
during the trading day; therefore $P_t$ is defined in continuous
time. The financial time series analysed here are sampled
at regular time intervals and so are discrete-time stochastic
processes. Discretising the data makes for easier computation
and analysis of behaviour over specific periods.
3.2 Stationarity & Returns
Stationarity describes a property of the process to achieve a
certain state of statistical equilibrium so that the distribution of
the process does not change much (Rachev et al, 2007). Put
more simply, a stationary series can be defined as one with a
constant mean, constant variance and constant
autocovariances for each given lag.
The probability distribution of a financial time series
depends heavily on the time period chosen, as prices naturally
rise due to inflation. The mean and standard deviation of the
series over a long period of time will therefore not give an
accurate representation of the series' behaviour over the
period. To achieve stationarity of our series we find price
returns. Price returns are the change in price over a time
period. The formula for log returns, denoted by $r_t$, is defined
by Ruppert and Matteson (2011) as:
$$r_t = \log(1 + R_t) = \log\left(\frac{P_t}{P_{t-1}}\right)$$
where $R_t$ is the net return $R_t = (P_t / P_{t-1}) - 1$.
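The log return formula can be sketched in a few lines of code. The project itself used R; Python is used here purely for illustration, and the function names and the price values are hypothetical, not taken from the project data.

```python
import math


def log_returns(prices):
    """Log returns r_t = log(P_t / P_{t-1}) for a list of prices."""
    if any(p <= 0 for p in prices):
        raise ValueError("prices must be strictly positive")
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


def net_returns(prices):
    """Net returns R_t = P_t / P_{t-1} - 1 for a list of prices."""
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


# Hypothetical closing prices, for illustration only.
prices = [100.0, 101.0, 99.5, 100.2]
r = log_returns(prices)
R = net_returns(prices)
```

A series of n prices gives n - 1 returns, and each log return equals log(1 + R_t) as in the formula. Log returns also add up over time: their sum is the log of the last price over the first.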
Taking the return instead of the raw price gives the series
time invariance: a constant mean and a constant
variance. Figure 6 shows the EUR/GBP exchange
rate, from 2013 to 2016, in orange and the return series
generated from it in blue. Note how the EUR/GBP rate has
been detrended to a constant level over time in the return
series. The probability distribution of the return series gives a
more accurate indication of market behaviour than the
distribution of the raw series.
The expected mean of the return distribution is 0 with some
constant variance $\sigma^2$.
Figure 6
Further mention of the economic indicator series will
reference their return series.
3.3 Stylised Facts & Summary Statistics
Stylised facts are “general properties that are expected to
be present in any set of returns” and “are pervasive across
time as well as across markets” (Taylor, 2005). One important
stylised fact Taylor states is “the distribution of returns is not
normal”. The assumption of normality of returns is important for
many financial techniques, so the returns distribution is
analysed.
The summary statistics mean ($\bar{r}$), standard deviation ($s$),
skewness ($b$) and kurtosis ($k$) are used to describe the
characteristics of a distribution. They are defined, for a set of
$n$ returns, to be:
$$\bar{r} = \frac{1}{n}\sum_{t=1}^{n} r_t, \qquad s^2 = \frac{1}{n-1}\sum_{t=1}^{n} (r_t - \bar{r})^2,$$
$$b = \frac{1}{n-1}\sum_{t=1}^{n} \frac{(r_t - \bar{r})^3}{s^3}, \qquad k = \frac{1}{n-1}\sum_{t=1}^{n} \frac{(r_t - \bar{r})^4}{s^4}.$$
The summary statistics, which are based on the moments of the
data, are used to find the closeness of the distribution of
returns to a normal distribution.
The mean, the first moment, and the standard deviation, the square
root of the variance (the second central moment), are elementary
probability measures and it is assumed the reader understands them
already. Briefly, the mean indicates the central
tendency of the distribution and the standard deviation
reveals the dispersion of data points. The standard deviation is
also important for standardising the distribution using z-scores,
particularly in multivariate analysis.
Skewness is a measure of the asymmetry of the distribution
about the central tendency. Outliers produce skewed
distributions. A visual display of skewness measurement is
shown below in figure 7.
Figure 7
Kurtosis is the relative concentration of scores in the center,
the upper and lower ends (tails), and the shoulders (between
the center and the tails) of a distribution (Norusis, 1994).
Kurtosis measures how peaked a distribution is. In a normal
distribution kurtosis is equal to three. To compare a
distribution's kurtosis to the normal, the “excess kurtosis” is
found by subtracting three from the measured kurtosis. A
distribution is called leptokurtic if the excess kurtosis is
positive, mesokurtic if there is no excess kurtosis and
platykurtic if the excess kurtosis is negative. A visual
representation of kurtosis is given in figure 8.
Figure 8
A final summary statistic, the z-statistic, defined as:
$$z = \frac{\bar{r}}{s / \sqrt{n}}$$
is used to “assess the null hypothesis that the expected return
is zero” (Taylor, 2005).
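The summary statistics of this section can be computed directly from their definitions. The following is a minimal sketch in Python rather than the R used in the project; the function name `summary_statistics` and the sample values are hypothetical, and the n - 1 divisor for skewness and kurtosis follows the definitions given above.

```python
import math


def summary_statistics(returns):
    """Mean, standard deviation, skewness, kurtosis and z-statistic of returns.

    Skewness and kurtosis use the 1/(n - 1) divisor, as in the definitions
    in this section.
    """
    n = len(returns)
    if n < 2:
        raise ValueError("at least two returns are needed")
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    if variance == 0:
        raise ValueError("returns have zero variance")
    s = math.sqrt(variance)
    skewness = sum((r - mean) ** 3 for r in returns) / ((n - 1) * s ** 3)
    kurtosis = sum((r - mean) ** 4 for r in returns) / ((n - 1) * s ** 4)
    return {
        "mean": mean,
        "sd": s,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "excess_kurtosis": kurtosis - 3.0,  # zero for a normal distribution
        "z": mean / (s / math.sqrt(n)),     # tests H0: expected return is zero
    }


# Hypothetical, perfectly symmetric sample, for illustration only.
stats = summary_statistics([-2.0, -1.0, 0.0, 1.0, 2.0])
```

For this symmetric sample the mean, skewness and z-statistic are all zero, and the excess kurtosis is negative, so the sample would be classed as platykurtic.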
3.4 Linear Regression
Regression analysis is an area of statistics which aims to
model the effect of a given set of explanatory random variables
$x = \{x_1, \ldots, x_k\}$, also called regressors, on a variable of
primary interest $y$. “A main characteristic of regression models is
that the relationship between the response variable y [and x] is not
a deterministic function f(x) of x (as often is the case in
classical physics), but rather shows random errors” (Fahrmeir
et al, 2013).
Linear regression methods estimate the relationship between
y and x by modelling the best fitting linear relationship between
the response and explanatory variables. The ordinary least
squares method is a popular technique to model the best linear
fit and will be discussed shortly.
Linear regression models are composed of a “systematic (or
deterministic) component, $\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$, and an
idiosyncratic (or stochastic) component, $\varepsilon$” (Miller, 2014). The
deterministic part consists of the vertical axis intercept $\beta_0$ and a
summation of the regressors weighted by a set of matching
regression coefficients $\{\beta_1, \ldots, \beta_k\}$. The regression coefficients
are unknown parameters which weight how much effect each
variable in the set $x$ has on $y$. “More precisely, $\beta_i$ is the partial
derivative of the expected response with respect to the ith
regressor” (Ruppert and Matteson, 2011). The stochastic
component is the error (or residual) term, $\varepsilon$, which accounts
for the difference between the fitted line and the data points.
The linear regression model equation is:
$$y = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \varepsilon$$
Figure 9 shows a plot of the response of y to a variable x
and a line fitted as a linear response of y to x. The lengths of
the thin lines joining the points above and below the line
to the line itself are the error terms for the points.
Figure 9
The unknown regression coefficients are estimated by the
ordinary least squares (OLS) method. Once these are
determined they are substituted back into the equation above to
give the OLS linear regression model.
3.5 Ordinary Least Squares
The OLS method estimates the optimal regression coefficients,
including the intercept, by minimising the difference between the
observed response variable data points, $y_i$, and their linearly
predicted values, $y_i - \varepsilon_i$.
Wooldridge (2000) gives a good description of OLS and so to
detail the process we follow his explanation. Paraphrasing his
discussion of a 2 regressor variable system to a general
system, he states “given n observations on $y, x_1, x_2, \ldots, x_k$,
$\{(x_{i1}, x_{i2}, \ldots, x_{ik}, y_i) : i = 1, 2, \ldots, n\}$, the estimates of $\beta$,
$\{\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k\}$, are chosen simultaneously to make:
$$SSE = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik})^2 = \sum_{i=1}^{n} \varepsilon_i^2$$
as small as possible.” The residuals are squared so that positive
and negative values do not cancel each other out. That is, for all
observation points $i = 1, \ldots, n$ the squared error terms
($\varepsilon_i = y_i - \hat{\beta}_0 - \sum_{j=1}^{k} \hat{\beta}_j x_{ij}$, from the linear
regression model equation evaluated at the estimates) are summed
so that the minimising solution can be found.
Multivariable calculus is used to reduce this minimisation
problem to a system of $k+1$ linear equations in the $k+1$ unknowns
$\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$. We want to find the critical points of the SSE
equation in order to minimise it. Taking the first partial
derivative of the equation with respect to each of the $\hat{\beta}_j$
and setting it equal to zero at the solution gives:
$$-2 \sum_{i=1}^{n} x_{ij} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_k x_{ik}) = 0, \quad \text{for all } j = 0, \ldots, k,$$
where $x_{i0} = 1$ for the intercept. Cancel the $-2$ and we have the
desired system of linear equations. This system, called the OLS
first order conditions, can be very large, but it can be solved
very quickly through standard linear equation methods by R, the
statistics software used in this project.
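To make the first order conditions concrete, the sketch below builds the system of k+1 linear equations and solves it by Gaussian elimination. It is an illustration in Python, not the R routine used in the project; the function names and the data are hypothetical, and a production solver would use a numerically more robust method.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: regressors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for c in range(col, n + 1):
                m[row][c] -= factor * m[col][c]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[row][c] * x[c] for c in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def ols_fit(xs, ys):
    """OLS estimates [b0, b1, ..., bk] from the first order conditions.

    xs is a list of observations, each a list of k regressor values;
    ys is the list of responses.
    """
    design = [[1.0] + list(row) for row in xs]  # x_i0 = 1 for the intercept
    p = len(design[0])
    # Equation j: sum_l (sum_i x_ij x_il) b_l = sum_i x_ij y_i
    lhs = [[sum(d[j] * d[l] for d in design) for l in range(p)] for j in range(p)]
    rhs = [sum(d[j] * y for d, y in zip(design, ys)) for j in range(p)]
    return solve_linear_system(lhs, rhs)


# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2.
xs = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
ys = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in xs]
coefficients = ols_fit(xs, ys)
```

Because the example data contain no error term, the estimates recover the intercept and slopes used to generate them, up to floating point rounding.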
3.6 OLS Assumptions
OLS will resolve the regression coefficients given any
arbitrary response variable or set of explanatory variables.
However in order to uniquely determine the regression
parameters, and be confident of the inferences we make based
on the model, we need to make assumptions about our
variables. Per Miller (2014) these assumptions are:
A1: The relationship between the regressor and the
regressand is linear.
A2: $E[\varepsilon \mid x] = 0$
A3: $\operatorname{Var}[\varepsilon \mid x] = \sigma^2$
A4: $\operatorname{Cov}[\varepsilon_i, \varepsilon_j] = 0 \;\; \forall \, i \neq j$
A5: $\varepsilon_i \sim N(0, \sigma^2) \;\; \forall \, \varepsilon_i$
A6: The regressor is nonstochastic.
It is important to test the model against these assumptions to
be satisfied that it is valid. The images in this section are sourced
from a tutorial on R-bloggers.com called “Graphic Analysis of
Regression Assumptions”.
Assumption 1 should be self-explanatory. To linearly model y
against x there must be a linear relationship between them. To
test this assumption the residuals are plotted against the
values predicted by the model. If the graph shows an even
spread of data points about the x-axis then the linearity
assumption is met. Figure 10 shows an even spread and thus
linearity is confirmed.
Figure 10
Assumption 2 states that the expected value of the model's
error terms should be zero; this refers to the fact that a
well fitted line will have residuals distributed evenly above
and below the line, leading to a mean value of zero. The mean
value of the model's residuals is found to test this.
Assumptions 3 and 4 are sometimes grouped together. They
state the error terms must have constant variance (3) and be
uncorrelated (4). If this is true the error terms are called
spherical errors. Constant variance, also called
homoscedasticity, assumes that the variance of the error terms
does not change over time. To test this we look at the same
plot as for assumption 1 to see if the vertical spread of the
error terms grows consistently in either direction. If it does
not then we have homoscedasticity. Figure 11 shows
homoscedasticity; note how the variance rises and falls but
does not do so persistently.
Figure 11
Error terms should be completely random; correlation
among the terms means the OLS process made a systematic
error in fitting the line. The autocorrelation function is run on
the residuals to determine if there is correlation. If
autocorrelation, or serial correlation, is found this can
undermine trust in the model. The Durbin-Watson statistic, d, is
used to test the significance of this autocorrelation and
consequently the accuracy of the model.
$$d = \frac{\sum_{t=2}^{N} (\varepsilon_t - \varepsilon_{t-1})^2}{\sum_{t=1}^{N} \varepsilon_t^2}$$
The value of d always lies between 0 and 4. If d is 2 there is
no autocorrelation; values below this imply positive
autocorrelation (successive error terms are close in value to
one another) and values above indicate negative
autocorrelation (successive error terms differ markedly,
tending to alternate in sign).
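The behaviour of d can be illustrated with a short sketch. It is written in Python rather than the R used in the project, the function names are hypothetical, and the two residual series are invented to show the two extremes; they are not residuals from the project's models.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: squared successive differences over squared residuals."""
    numerator = sum((curr - prev) ** 2
                    for prev, curr in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    return numerator / denominator


def lag1_autocorrelation(residuals):
    """Sample autocorrelation of the residuals at lag 1."""
    n = len(residuals)
    mean = sum(residuals) / n
    denominator = sum((e - mean) ** 2 for e in residuals)
    numerator = sum((residuals[t] - mean) * (residuals[t - 1] - mean)
                    for t in range(1, n))
    return numerator / denominator


# Invented residual series, for illustration only.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # errors flip sign every step
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # errors stay close to each other

d_alternating = durbin_watson(alternating)
d_persistent = durbin_watson(persistent)
rho_alternating = lag1_autocorrelation(alternating)
rho_persistent = lag1_autocorrelation(persistent)
```

The alternating series gives d above 2 with a negative lag 1 autocorrelation, while the persistent series gives d below 2 with a positive one. For long series d is approximately 2(1 - ρ), where ρ is the lag 1 autocorrelation, which is why 2 corresponds to no autocorrelation.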
Assumption 5 says that the residuals must be normally
distributed. However, as Miller states, “many of the results of
the OLS model are true, regardless of this assumption.” The
assumption is therefore mostly useful for defining confidence
levels for the model parameters. A probability distribution of
the errors is graphed and analysed to determine its closeness
to a normal distribution. Figure 12 shows this comparison.
Figure 12
The summary statistics could also be useful here; they can
be helpful in comparing the distribution to a normal
distribution. We omit the summary statistics for the error terms
and instead present the distribution visually. The less similar
the error terms' distribution is to the normal distribution, the
less reliable the confidence levels derived from the OLS model
will be.
From a practical standpoint the OLS model holds true
regardless of the 6th assumption, so it will not be discussed
here nor will we test for it.
3.7 Autoregression
An autoregression model, notation AR(p), is a form of linear
regression model where the set of regressor variables is p lags
of the response variable. The equation for an AR(p) model is:
$$y_t = \beta_0 + \sum_{i=1}^{p} \beta_i y_{t-i} + \varepsilon_t$$
We wish to use the autoregressive model to detect whether mean
reversion occurred; therefore a lag of 1 is chosen for our model.
Taking the first lag captures the effect on a time period, t, of the
period before. The AR(1) model is:
$$y_t = \beta_0 + \beta_1 y_{t-1} + \varepsilon_t$$
The regression coefficient $\beta_1$ determines whether there is mean
reversion. Rouzet (2010) states that if $|\beta_1| < 1$ in an AR(1), the
process is mean reverting. This can be seen by thinking of a
realisation of $y_{t-1}$ as a non-zero number: the $\beta_1$
coefficient will then “shrink” $y_t$ towards our mean of zero. There is
therefore an inverse relationship between $|\beta_1|$ and mean reversion:
the smaller the absolute value of $\beta_1$, the more reversion has
occurred. If $\beta_1$ is negative then we can say a positive change is
usually followed by a negative one.
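As a rough illustration of how the AR(1) coefficient is read, the sketch below simulates a series with a known negative coefficient and recovers it by regressing each value on the one before it. It is written in Python rather than the R used in the project; the function names, the seed and the parameter values are assumptions made for the example, not the project's data or results.

```python
import random


def simulate_ar1(beta0, beta1, n, sigma, seed):
    """Simulate n values of y_t = beta0 + beta1 * y_{t-1} + e_t, e_t ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n - 1):
        y.append(beta0 + beta1 * y[-1] + rng.gauss(0.0, sigma))
    return y


def fit_ar1(series):
    """OLS estimates (beta0, beta1) from regressing y_t on y_{t-1}."""
    x = series[:-1]  # lagged values y_{t-1}
    y = series[1:]   # current values y_t
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# Hypothetical parameters: a mean reverting return series with beta1 = -0.3.
series = simulate_ar1(beta0=0.0, beta1=-0.3, n=5000, sigma=0.01, seed=42)
beta0_hat, beta1_hat = fit_ar1(series)
mean_reverting = abs(beta1_hat) < 1.0
```

With a long simulated series the estimate of the coefficient lands close to the value used to generate it, and its negative sign reflects a positive change tending to be followed by a negative one.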
3.8 Vector Autoregression
The vector autoregressive model, notation VAR(p), is a
multivariate generalisation of the autoregressive model. It
extends the set of regressors from the lags of the
dependent variable alone to the lags of exogenous variables as
well. The equation for a VAR(p) model with k variables is:
$$y_t = \beta_0 + \sum_{i=1}^{p} \beta_i y_{t-i} + \varepsilon_t$$
This has the same form as the AR(p) model except that $\beta_0$, $\varepsilon_t$ and each
$y_{t-i}$ are vectors of length k and each $\beta_i$ is a $k \times k$ matrix.
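To illustrate, the general example of a VAR(1) in 2 variables is
given:
$$\begin{bmatrix} y_{1,t} \\ y_{2,t} \end{bmatrix} = \begin{bmatrix} \beta_{1,0} \\ \beta_{2,0} \end{bmatrix} + \begin{bmatrix} \beta_{1,1} & \beta_{1,2} \\ \beta_{2,1} & \beta_{2,2} \end{bmatrix} \begin{bmatrix} y_{1,t-1} \\ y_{2,t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \end{bmatrix}$$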
Most importantly for us, the regression coefficients are found for
the lags of both the dependent and the exogenous variables, so that
the effect of changes in the exogenous variables on the
dependent variable can be seen.
To test the effect of sentiment on our series we model the
indicators as the dependent variables and the sentiment as
regressor variables.
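The role of the off-diagonal coefficients can be seen by evaluating the systematic part of a VAR(1) in 2 variables by hand. The sketch below is in Python rather than the R used in the project; the function name `var1_step` and all coefficient values are hypothetical and chosen only to show the mechanics, with the first variable standing for a return and the second for a sentiment z-score.

```python
def var1_step(beta0, beta, y_prev):
    """Systematic part of a VAR(1): beta0 + beta * y_{t-1}, error term omitted."""
    k = len(beta0)
    return [beta0[i] + sum(beta[i][j] * y_prev[j] for j in range(k))
            for i in range(k)]


# Hypothetical coefficients, for illustration only.
beta0 = [0.0, 0.0]
beta = [
    [-0.1, 0.02],  # return equation: own lag, then lagged sentiment
    [0.0, 0.5],    # sentiment equation: lagged return, then own lag
]
y_prev = [0.01, 1.5]  # yesterday's return and yesterday's sentiment z-score

y_hat = var1_step(beta0, beta, y_prev)
```

Here the predicted return is -0.1 × 0.01 + 0.02 × 1.5 = 0.029. The coefficient in row 1, column 2 is the one of interest for this project, since it carries the effect of lagged sentiment into the return.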
3.9 LexisNexis Corpus
Nesselhauf (2005) defines a corpus as “a systematic
collection of naturally occurring texts (of both written and
spoken language).” A corpus of news articles is created to
analyse the sentiment and to try to determine whether the framing
in these articles follows market movement.
Figure 13
LexisNexis is a provider of legal, government, business and
high-tech information sources. Those attending TCD are allowed free
use of the LexisNexis database. To find a corpus specific to
the project, articles were filtered based on the search shown in
figure 13.
3.10 Sentiment Analysis
Sentiment from the corpus was analysed using the
Rocksteady program. Rocksteady is a text analytics system
created in Trinity College Dublin by Khurshid Ahmad and his
postgraduate students.
Rocksteady uses a bag of words approach; it breaks the
corpus down into the words constituting it, regardless of order,
and compares them against a specialised dictionary. The
dictionary contains a weighting for the sentiment inherent in
each word. A z-score based on this weighting is computed for
each type of sentiment expressed in a daily aggregation of
articles. A z-score, or standard score, indicates how many
standard deviations a raw score is from the mean. In the
context of Rocksteady it indicates how much stronger a
sentiment expressed over a day is compared to normal.
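The bag of words idea and the daily z-score can be sketched as follows. This is not Rocksteady's implementation, whose internals are not described here: the tiny dictionary, the equal weighting of every dictionary word, the function names and the sample texts are all assumptions made for illustration, and Python is used rather than the R of the project.

```python
import re
import statistics

# Hypothetical miniature dictionary; a real one is far larger and weighted.
SENTIMENT_WORDS = {
    "positive": {"gain", "growth", "strong", "recovery"},
    "negative": {"loss", "crisis", "weak", "fear"},
}


def sentiment_frequency(text, category):
    """Share of words in the text that belong to the category, ignoring order."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in SENTIMENT_WORDS[category])
    return hits / len(words)


def daily_z_scores(daily_texts, category):
    """Z-score of each day's sentiment frequency relative to all days."""
    freqs = [sentiment_frequency(text, category) for text in daily_texts]
    if len(freqs) < 2:
        raise ValueError("at least two days are needed")
    mean = statistics.mean(freqs)
    sd = statistics.stdev(freqs)
    if sd == 0:
        return [0.0] * len(freqs)
    return [(f - mean) / sd for f in freqs]


# Invented daily aggregations of text, for illustration only.
days = [
    "markets show strong growth and recovery",
    "fear of crisis and loss",
    "the committee met on tuesday",
]
z_positive = daily_z_scores(days, "positive")
z_negative = daily_z_scores(days, "negative")
```

The first day scores above the mean for positive sentiment and the second day scores above the mean for negative sentiment, while the neutral third day sits below the mean in both.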
Figure 14
Rocksteady analyses positive, negative, active, passive,
strong, weak, economic, political and militant sentiment. Of
interest to the project are the first two types of sentiment.
Figure 14 shows a sample section of the Rocksteady output for
the project's corpus. Red boxes indicate extremely high levels
of a sentiment expressed that day and yellow boxes show
moderately high sentiment.
3.11 Tableau
Tableau is an emerging standard for data visualisation. It
offers a variety of options for graphing, enabling users to
display information as intuitively as possible. It also integrates
with the R programming environment, using Rserve to open
communication between the programs. This combines the power
and flexibility available through R to compute advanced
statistical processes with the ease of Tableau's visualisation
process. The original images in this report, Figure 3, Figure 4
and Figure 6, were created in Tableau. Two examples
of the Tableau interface are provided. The data loading
procedure, showing a left join of two datasets, is shown in
figure 15, while the graphing process is shown in figure 16.
Figure 15
4 Case Study & Results
4.1 Stylised facts & Summary Statistics
The stylised facts of the DXY, S&P500 and 10 year Treasury
bond (T10) returns for the period from June 2009 to present
day are presented in the table below.
Indicator   Mean Return (*10^4)   Standard Deviation (*10^2)   Skewness   Excess Kurtosis   Z statistic
T10         -3.89                 2.27                          0.15      0.81              -0.71
DXY          0.59                 0.45                         -0.03      1.54               0.54
S&P500       4.49                 1.02                         -0.42      3.56               1.83
All three series were leptokurtic, meaning the distributions
were quite peaked. The DXY and the S&P 500 have a slight
negative skew about their means. Given a positive mean with
negative skew, we can infer that daily returns over the period
were more likely to be positive. The T10 conversely had a
positive skew and a negative mean. This means there was a
greater proportion of negative returns over the period.
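As a rough consistency check on the table, the z-statistic formula z = mean / (s / √n) can be rearranged to give the number of returns implied by each row, n = (z s / mean)². The sketch below, in Python rather than the R of the project, assumes the column headers mean that the tabulated means are multiplied by 10^4 and the standard deviations by 10^2; the function name is hypothetical and the results are only approximate because the table values are rounded.

```python
# Table values rescaled under the assumption stated above:
# mean = tabulated value / 10^4, sd = tabulated value / 10^2.
table = {
    "T10": {"mean": -3.89e-4, "sd": 2.27e-2, "z": -0.71},
    "DXY": {"mean": 0.59e-4, "sd": 0.45e-2, "z": 0.54},
    "S&P500": {"mean": 4.49e-4, "sd": 1.02e-2, "z": 1.83},
}


def implied_sample_size(mean, sd, z):
    """Number of returns n implied by z = mean / (sd / sqrt(n))."""
    return (z * sd / mean) ** 2


implied_n = {name: implied_sample_size(**row) for name, row in table.items()}
```

All three rows imply a sample of roughly 1,700 returns, which is in line with daily observations over the period from June 2009 covered by the table.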
The standardised distributions of the T10 and the S&P 500
are given in figures 17 and 18, with a normal distribution curve
overlaid, to give context to the summary statistics.
Figure 17
Figure 18
Note in particular how peaked the S&P 500 distribution is.
Standardised distributions are graphed in units of standard
deviation on the x-axis. We can see there are extreme outliers
beyond 4 standard deviations in the S&P 500, unlike in
the T10, and these cause the peakedness.
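The summary statistics in the table can be reproduced from a series of daily returns with a few lines of code. The following is a minimal sketch in Python, given purely for illustration; the project itself used R and this is not the code used to produce the table. The function name `summary_stats` and the sample returns are hypothetical. The Z statistic is assumed here to be the sample mean divided by its standard error, since its formula is not restated in this section.

```python
import math


def summary_stats(returns):
    """Return mean, sample standard deviation, skewness, excess kurtosis
    and z statistic for a list of returns."""
    n = len(returns)
    mean = sum(returns) / n
    deviations = [r - mean for r in returns]

    # Central moments in population form (dividing by n).
    m2 = sum(d ** 2 for d in deviations) / n
    m3 = sum(d ** 3 for d in deviations) / n
    m4 = sum(d ** 4 for d in deviations) / n

    std_dev = math.sqrt(m2 * n / (n - 1))   # sample standard deviation
    skewness = m3 / m2 ** 1.5               # 0 for a symmetric distribution
    excess_kurtosis = m4 / m2 ** 2 - 3.0    # 0 for a normal distribution

    # Assumed definition: sample mean divided by its standard error.
    z_statistic = mean / (std_dev / math.sqrt(n))

    return {
        "mean": mean,
        "std_dev": std_dev,
        "skewness": skewness,
        "excess_kurtosis": excess_kurtosis,
        "z_statistic": z_statistic,
    }


# Hypothetical daily returns, for illustration only.
example_returns = [0.004, -0.012, 0.007, 0.001, -0.003, 0.015, -0.006, 0.002]
for name, value in summary_stats(example_returns).items():
    print(f"{name}: {value:.4f}")
```

A positive excess kurtosis from this function corresponds to the leptokurtic behaviour described above, and dividing each deviation by `std_dev` gives the standardised values plotted on the x-axis of the distribution figures.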
4.2 Autoregression
Assumptions Tests:
Plots of the standardised residuals against fitted values
for each model are shown in figures 19, 20 and 21. These are
used for testing assumptions 1 and 3. A table containing the
mean residual value for each model is also provided for
assumption 2 tests. To validate assumption 4, plots of the
autocorrelation of residuals are given in figures 22, 23 and 24.
Assumption 1: Linearity. In each model there is an even
spread of standardised residuals about the fitted values,
therefore the assumption that a linear relationship exists
between the response and regressor variables holds for each
model.
Figure 19 – DXY model
Assumption 2: Zero mean of errors. As we can see from the
table, the mean, or expected, values of the error terms are all
extremely small numbers. They are close enough to zero to
meet this assumption.
Figure 20 – S&P 500 Model
             T10          DXY         S&P 500
E[ ε | x ]   -4.18E-19    2.22E-17    -9.13E-19
Figure 21 – T10 Model
Assumption 3: Constant variance. There is no directional
growth of variance in any of the figures. The models are
homoscedastic, so assumption 3 is met.
Figure 22 – ACF of DXY model residuals
Figure 23 – ACF of S&P500 model residuals
Assumption 4: No autocorrelation of errors. As we can see
from the autocorrelation plots there is persistent serial
correlation in the models' lags. The correlation of errors and
certain lagged errors, in each model, is big enough to suggest
that autocorrelation may be a problem. The Durbin-Watson test
was run on each model to determine the statistical significance
of this. The results are presented in a table on the next page.
Figure 24 – ACF of T10 Model Residuals
A Durbin-Watson statistic close to 2 indicates no first-order
serial correlation. Comparing the Durbin-Watson test results to
2 we can see that the autocorrelation is statistically
insignificant. Assumption 4 is validated.
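For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. It ranges from 0 to 4: values near 2 indicate no first-order autocorrelation, values towards 0 indicate positive autocorrelation and values towards 4 indicate negative autocorrelation. The following is a minimal sketch in Python, for illustration only and not the R code used in the project; the function name and the sample residuals are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of regression residuals."""
    # Sum of squared differences between each residual and the previous one.
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    # Sum of squared residuals.
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals, for illustration only.
example_residuals = [0.3, -0.1, 0.2, -0.4, 0.1, 0.0, -0.2, 0.1]
print(durbin_watson(example_residuals))
```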
Assumption 5: Normality of errors. Though this
assumption does not need to be fully met for the model
to be valid, it does affect the confidence we can have in the
model. The distribution of errors for the T10 model is shown
below compared to a normal distribution. There is a very close
fit to the normal distribution and so there can be confidence in
the results of the model. The distributions for the other models
are very similar and have been omitted.
Figure 25
          DXY      S&P 500   T10
DW-test   1.999    1.996     1.998
Results:
Indicator   Regression Coefficient   Reversion
T10         -0.0222                  Yes
DXY         -0.0433                  Yes
S&P 500     -0.0614                  Yes
All three series displayed mean reversion from one day to
the next. The magnitude of the returns from day to day shrank
towards zero, as seen by the fractional coefficient. Also of
interest is the negative sign of the coefficient: returns from
day to day tend to be in the opposite direction to each other.
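To illustrate how a regression coefficient of this kind can be estimated, the following is a minimal sketch in Python of a first-order autoregression fitted by ordinary least squares. It is not the R code used in the project. The single lag, the inclusion of an intercept, the function name `fit_ar1` and the sample returns are all assumptions made for the example.

```python
def fit_ar1(returns):
    """Fit r_t = intercept + slope * r_(t-1) + error by ordinary least squares."""
    x = returns[:-1]   # previous day's return (regressor)
    y = returns[1:]    # current day's return (response)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # OLS slope is the sample covariance of x and y over the variance of x.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical daily returns, for illustration only.
example_returns = [0.010, -0.004, 0.006, -0.008, 0.003, 0.001, -0.005, 0.007]
intercept, slope = fit_ar1(example_returns)
print(f"intercept = {intercept:.5f}, slope = {slope:.5f}")
```

A slope between -1 and 0, as in the table above, means that each return is on average followed by a smaller return of the opposite sign.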
4.3 Vector Autoregression
The vector autoregressive model was run between
negative/positive sentiment and the three financial time series.
The estimation results for the coefficients determine the effect
of the sentiment on each series. The t value of each coefficient
is the ratio of the estimated coefficient to its standard error.
More simply, it measures how far the estimate lies from zero in
units of its standard error. The p value, noted
by Pr(>|t|) in the images, is the probability of observing a t
value at least this extreme if the true coefficient were zero. It
determines the significance of the result. Significance in this
case refers to whether a change in the regressor variable,
sentiment, had a detectable effect on the response variable,
the financial indicators.
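The t value and p value of a coefficient can be illustrated with a single-equation example. The following is a minimal sketch in Python that regresses a response on one regressor and reports the slope, its standard error, the t value and a two-sided p value. It is a simplification and not a full VAR, nor the R code used in the project. The p value uses a normal approximation to the t distribution, which is only reasonable for large samples; the function name and the data are hypothetical.

```python
import math


def slope_t_test(x, y):
    """Return slope, standard error, t value and approximate two-sided
    p value for the simple regression y = intercept + slope * x + error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance with n - 2 degrees of freedom.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_error = math.sqrt(rss / (n - 2) / sxx)

    # t value: estimated coefficient divided by its standard error.
    t_value = slope / std_error

    # Assumption: normal approximation to the t distribution (large samples).
    p_value = math.erfc(abs(t_value) / math.sqrt(2.0))
    return slope, std_error, t_value, p_value


# Hypothetical lagged sentiment scores and returns, for illustration only.
lagged_sentiment = [0.2, 0.5, 0.1, 0.7, 0.4, 0.3, 0.6, 0.8]
returns = [0.001, -0.002, 0.003, 0.000, -0.001, 0.002, -0.003, 0.001]
slope, std_error, t_value, p_value = slope_t_test(lagged_sentiment, returns)
print(f"slope={slope:.5f} se={std_error:.5f} t={t_value:.3f} p={p_value:.3f}")
```

A p value above the chosen threshold (commonly 0.05) means the data give no evidence that the coefficient differs from zero.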
Assumptions Tests:
For the sake of brevity we present the assumption test
results for the VAR model run on the DXY and positive
sentiment and omit the tests for the other models, as the results
were very similar to each other and to the results in the
autoregressive section.
Figure 26 shows the model as linear and with constant
variance. The autocorrelation plot in figure 27 shows small
serial correlation; however, the Durbin-Watson test result of
1.999 renders this insignificant. The histogram of the residuals
contains some negative skew but reflects a normal distribution
fairly well. The mean value of errors was -1.03e-21.
Figure 28
Results:
First the DXY:
Figure 29 – DXY and Negative Sentiment
Figure 30 – DXY and Positive Sentiment
The DXY shows no significant correlation with
positive or negative sentiment.
Secondly the T10:
Figure 31 – T10 and Negative Sentiment
Figure 32 – T10 and Positive Sentiment
The T10 shows some effect from negative sentiment of
articles 5 days prior. This is small, though, and may be an
artefact of the modelling.
Finally the S&P500:
Figure 33 – S&P 500 and Negative Sentiment
Figure 34 – S&P 500 and Positive Sentiment
There is no significant correlation between sentiment
expressed about the FOMC meetings and the S&P 500 either.
The hypothesis that major publications can sway bias by
attribute framing, and influence market movement, is not
supported under the parameters of this experiment.
5 Conclusion & Future Work
5.1 Work Completed
The project involved analysing sentiment in financial
markets using regression analysis. Extensive background
research into finance was necessary in order to create and
understand a context to analyse. Concurrently, study was done
on statistical methods. After the more basic statistical measures
were understood, work turned to understanding regression
analysis, particularly autoregression and vector autoregression
analysis. As these methods produce results regardless of
context, a deep understanding of their properties was vital to
ensure the models created accurate results. Throughout the
project the R programming language was learned in order to apply
the statistical methods to big data sets. Once the models were
created in R, property tests were applied to validate them. The
result was the modelling of mean reversion and sentiment in
financial markets.
5.2 Future Work
Reassessing the filters for the corpus may reveal better
results for the VAR models. LexisNexis limited corpus
downloads to 500 articles. Building a larger, more selective
corpus out of several limited corpora would also be beneficial
to the project and would be attempted if more time were
available. Additionally, we viewed volatility clustering in
figure 4. Further modelling of this through a GARCH model
would be desirable.
5.3 Conclusion
The project gave a good grounding in regression and
sentiment analysis and the methods involved. Regression
analysis is a powerful, flexible tool that can be applied to a
wide range of applications. As such it was very beneficial to
learn. The results did not display the expected correlation
between sentiment and financial markets; however, further work
may prove more revealing.
6 References
Del Negro, M. and Schorfheide, F. (2011). Bayesian
Macroeconomics. The Oxford Handbook of Bayesian
Econometrics, vol. 1, pp.293–389.
Entman, R. (1993). Framing: Toward Clarification of a
Fractured Paradigm. Journal of Communication, vol. 43,
pp.51-58.
Fahrmeir, L. (2013). Regression: Models, Methods and
Applications. Berlin, Heidelberg: Springer Berlin Heidelberg.
Heilbroner, R. and Milberg, W. (2012). The making of
economic society. Upper Saddle River, N.J.: Pearson.
Mandelbrot, B. B. (1963) The variation of certain speculative
prices. Journal of Business, vol. 36, pp. 392–41.
Miller, M. (2012). Mathematics and statistics for financial
risk management. Hoboken, N.J.: Wiley.
Nesselhauf, N. (2005). Corpus Linguistics: A Practical
Introduction. Available at:
http://www.as.uni-heidelberg.de/personen/Nesselhauf/files/Corpus%20Linguistics%20Practical%20Introduction.pdf
[Accessed: 04/04/2016]
Nicholas & Thaler, Richard, 2003. A survey of behavioral
finance. Handbook of the Economics of Finance, vol. 1, pp.
1053-1128.
Norusis, M. J. (1994). SPSS 6.1 base system user’s guide,
part 2. Chicago, IL: SPSS.
Panasiak, M. and Terry, E. (2013). Framing Effects and
Financial Decision Making. Proceedings of 8th Annual
London Business Research Conference. Imperial College,
London.
Rachev, S., Mittnik, S. and Fabozzi, F. (2007). Financial
econometrics. Hoboken, New Jersey: John Wiley & Sons.
Romer, C. and Romer, D. (2000). Federal Reserve
Information and the Behavior of Interest Rates. American
Economic Review, vol. 90, pp.429-457. Available at:
http://www.cfapubs.org/doi/pdf/10.2469/dig.v31.n1.805
[Accessed: 06/03/2016]
Rouzet, D. (2010). Lecture slides on: Discounted Dividends
and Asset Prices. Available at:
http://isites.harvard.edu/fs/docs/icb.topic734133.files/Section6.pdf
[Accessed: 15/04/2016]
Ruppert, D. and Matteson, D. S. (2011). Statistics and Data
Analysis for Financial Engineering. New York, NY: Springer
New York.
Scott, B. R. (2006). The Political Economy of Capitalism.
Available at:
http://www.hbs.edu/faculty/Publication%20Files/07-037.pdf
[Accessed: 17/02/2016]
Taylor, S. (2007). Asset price dynamics, volatility, and
prediction. Princeton, N.J.: Princeton University Press.
Wooldridge, J. (2013). Introductory econometrics. Mason,
OH: South-Western Cengage Learning.