Exploring Support Vector Regression for Predictive Data Analysis
Daniel Kuntz∗, Surya Chandra† and Jon Pritchard‡
Department of Electrical Engineering and Computer Science
Colorado School of Mines: Golden, CO
Email: ∗dkuntz@mines.edu, †schandra@mines.edu, ‡jpritcha@mines.edu
Abstract—The purpose of this paper is to demonstrate the use of Support Vector Regression (SVR) in the context of predicting the hourly use of bikes in Washington D.C.'s bike share program. An abridged derivation of the SVR scheme is given, along with an explanation of kernel functions, which are vital to the performance of this method. Bike share data is provided as part of a Kaggle™ competition, giving us a firm quantitative benchmark for its predictive performance against an array of other competitors. We also show a direct comparison between SVR and a naive linear regression to further intuitive comprehension of the concepts. Our results indicate good performance versus linear regression and competitive performance in the overall contest.
I. INTRODUCTION
Advances in predictive modelling are providing new in-
sights into critical data for businesses, governments and indi-
viduals. One of the most popular of these methods is SVR. It
is an efficient, highly configurable, and mathematically sound solution for gaining this insight. In short, it is designed for the task of fitting a non-linear function that approximates an outcome (e.g. number of bikes rented) based on data that the outcome is perceived to depend on (e.g. time, season, weather, ...), usually called "explanatory variables".
As a test case, our team will compete in the Washington
D.C. Bike Share competition hosted by Kaggle. In this contest,
it is of interest to the city to determine when and why people are using its bike share program. This information will allow the city to properly plan for future growth as well as provide an analysis of customer use patterns. Our team decided to use SVR modelling to compete in the competition, and our approach is documented herein.
A. How The Competition Works
Kaggle provides two sets of data. One set, generally referred to as the "training" set, provides a set of explanatory variables along with the outcome for each. For this particular competition, the given variables and outcomes are listed in TABLES I and II respectively. This set of data is used to train a prediction algorithm. The second set of data, referred to as the "test" set, provides explanatory variables but not their outcomes. These outcomes are hidden from the contestants, whose job it is to predict them. Once a prediction is made, its accuracy is scored with equation (1).
TABLE I. EXPLANATORY VARIABLES [1]

Name        Description
datetime    Date and time (YYYY-MM-DD HH:MM:SS)
season      Season (1 = spring, 2 = summer, 3 = fall, 4 = winter)
holiday     Whether the day is considered a holiday
workingday  Whether the day is neither a weekend nor a holiday
weather     1 = Clear, few clouds, partly cloudy; 2 = Mist + cloudy, mist + broken clouds, mist + few clouds, mist; 3 = Light snow, light rain + thunderstorm + scattered clouds, light rain + scattered clouds; 4 = Heavy rain + ice pellets + thunderstorm + mist, snow + fog
temp        Temperature in Celsius
atemp       "Feels like" temperature in Celsius
humidity    Relative humidity
windspeed   Wind speed
TABLE II. OUTCOMES

Name        Description
casual      Number of non-registered user rentals initiated
registered  Number of registered user rentals initiated
count       Number of total rentals
\epsilon = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}    (1)

Where:
\epsilon : Root Mean Squared Logarithmic Error (RMSLE)
n : Number of explanatory vectors in the test data set
p_i : Prediction for vector i
a_i : Actual value for vector i
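The scoring metric translates directly into code; a minimal Python sketch of equation (1):

```python
import math

def rmsle(predictions, actuals):
    """Root Mean Squared Logarithmic Error, equation (1)."""
    n = len(predictions)
    total = sum((math.log(p + 1) - math.log(a + 1)) ** 2
                for p, a in zip(predictions, actuals))
    return math.sqrt(total / n)
```

A perfect prediction scores 0, and the logarithms mean that over- and under-predicting a small count is penalized about as heavily as missing a large one by the same ratio.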
B. Discussion of Parameters
Some of the parameters in TABLE I had to be modified, and some care had to be taken that redundant and unimportant variables were not used. The "datetime" variable was divided into four different variables: year, month, day and hour. This allowed our model to take into account variations by the hour and month, as one would intuitively expect a strongly correlated cyclical pattern associated with these variables. Also, variables such as "season", which is entirely dependent on the month and day, were generally taken out of the model so as to not "over-train" it.
Some experimentation was needed to determine which variables affect the outcome most strongly; one way to do this, which we will not discuss here, is Principal Component Analysis (PCA). When results are discussed, we provide a full list of the explanatory variables used in the model.
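As a concrete illustration of the "datetime" split described above, a short Python sketch (the helper name is ours, not part of the competition code):

```python
from datetime import datetime

def split_datetime(stamp):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into (year, month, day, hour)."""
    dt = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return dt.year, dt.month, dt.day, dt.hour
```

Each returned component then becomes its own explanatory variable, while redundant fields such as "season" are simply dropped from the feature set.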
II. LINEAR REGRESSION
To demonstrate some of the underlying concepts of SVR, we take as inspiration a very simple linear regression for creating a predictive model. Here we assume that each explanatory variable has a weight, and that the sum of each variable times its weight, plus an offset, is a good model of the system.
A. Problem Formulation
We assume that the predictive function we would like to find takes the form (2):

f(x_i) = w_0 + \sum_{j=1}^{m} w_j x_{i,j}    (2)

Where:
x_i : The ith explanatory vector
w_0 : An offset weight
w_j : Weights for each component of x_i
m : The number of variables in x_i

So, in this case, if we determine the weights w = [w_0 \cdots w_m]^T we have found a predictive model. Since we have m + 1 weights, we can use a system of equations of the form (2) to find them. For the training set of data we have the system of equations (3):
\begin{bmatrix}
1 & x_{1,1} & x_{1,2} & \cdots & x_{1,m} \\
1 & x_{2,1} & x_{2,2} & \cdots & x_{2,m} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n,1} & x_{n,2} & \cdots & x_{n,m}
\end{bmatrix}
\begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{bmatrix}
=
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}    (3)
Where:
n : The number of training explanatory vectors
y_i : The outcome for each explanatory vector i
Using matrix notation, we rewrite (3) as (4). We recognize this as a standard over-determined least-squares problem, the solution of which is given by (5), where X^+ denotes the pseudo-inverse of the data matrix X:

X w = y    (4)

w = X^+ y    (5)
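Equations (4) and (5) map directly onto a few lines of NumPy; a sketch on synthetic data (the weight values are invented for illustration, not taken from the bike share model):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
X_raw = rng.random((n, m))                  # n explanatory vectors with m variables
true_w = np.array([2.0, -1.0, 0.5, 3.0])    # [w0, w1, w2, w3], chosen for the demo
X = np.hstack([np.ones((n, 1)), X_raw])     # prepend the column of ones for w0
y = X @ true_w                              # outcomes generated by the linear model
w = np.linalg.pinv(X) @ y                   # w = X+ y, equation (5)
```

Because the synthetic outcomes here are exactly linear in the variables, the recovered weights match the true ones; on real data the pseudo-inverse instead returns the least-squares best fit.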
B. Results
Using this naive linear regression method with the explanatory variables "year", "month", "day", "hour", "holiday", "workingday", "weather", "temp", "humidity" and "windspeed", we achieved the competition results in TABLE III.
TABLE III. KAGGLE SCORE FOR LINEAR REGRESSION PREDICTION
Score (RMSLE) Rank (of approx. 1500)
1.30542 1275
C. Analysis of Linear Regression Results
As we might suspect, the linear regression did not perform very well. The reason is that many variables do not affect the outcome in a linear way. The variable "weather" may reduce the number of riders proportionally to how bad the weather is and, as such, is a good candidate for linear regression; but what about a variable like "hour"? One would intuitively expect this variable to create spikes in the outcome at the hours that represent rush hour. Fig. 1 confirms this: it shows the average bike rentals for each hour over the whole data set compared to a best-fit line. We can easily see that the linear regression is not a good representation of this variable. Hence, we need a non-linear representation of the data.
Fig. 1. Linear Regression Fit to Hourly Average
III. HIGHER DIMENSIONAL MAPPING AND KERNEL FUNCTIONS
Since linear regression fails to accurately model the data, it is obvious that we need a non-linear model to achieve a better approximation. However, non-linear models are much more complex than linear models. One strategy that could work is to map the low-dimensional data into a higher-dimensional space where it is linear. In a simplistic manner, linear regression already performs this kind of mapping by adding an offset term, i.e. the mapping \Phi(x) : \mathbb{R}^m \to \mathbb{R}^{m+1}. This idea can be expanded to include higher-order terms as well; consider the mapping \Phi such that:

\Phi(x) : \mathbb{R}^2 \to \mathbb{R}^6

\Phi([x_1 \; x_2]) = [1 \;\; x_1 \;\; x_2 \;\; x_1^2 \;\; x_2^2 \;\; x_1 x_2]    (6)
The problem with these kinds of mappings is that the linear regression model becomes extremely inefficient, because we could be mapping into a space with a huge number of dimensions. As an example, for an m-dimensional vector, even a simple quadratic mapping produces a transformed vector in an O(m^2)-dimensional space. This becomes computationally expensive very quickly.
A. Definition
A solution to the problem of mapping to a higher-dimensional space is the use of kernel functions. Kernel functions allow us to find the inner products of high-dimensional vectors in a lower-dimensional space, for a very specific set of functions. This means that if we can formulate our minimization problem to depend only on these inner products, we can use kernel functions to drastically improve the performance of our algorithm.
The definition of a kernel function is simply any function that satisfies the following:

K(x_1, x_2) = \langle \Phi(x_1), \Phi(x_2) \rangle    (7)

Where:
K : The kernel function
\Phi : A mapping to a higher-dimensional space
B. Example Kernel Function
In order to illustrate the relationship between mapping functions and kernel functions, a simple kernel function is derived below. Given the column vectors

x = [x_1 \; x_2]^T, \quad z = [z_1 \; z_2]^T

and the mapping

\Phi(x) = [x_1^2 \;\; \sqrt{2}\, x_1 x_2 \;\; x_2^2]^T,

it follows that:

\Phi(x)^T \Phi(z) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2
                  = (x_1 z_1 + x_2 z_2)^2
                  = (x^T z)^2
                  = K(x, z)
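The identity above is easy to verify numerically; a small sketch with arbitrary test vectors:

```python
import numpy as np

def phi(v):
    """The mapping Phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2]^T from the derivation above."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def kernel(x, z):
    """K(x, z) = (x^T z)^2, computed entirely in the original 2-D space."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
mapped = float(np.dot(phi(x), phi(z)))   # inner product in the mapped 3-D space
direct = kernel(x, z)                    # same value, without ever mapping
```

The two quantities agree to machine precision, which is the whole point: the kernel evaluation never forms the higher-dimensional vectors.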
C. Other Types of Kernel Functions
Two of the most commonly used kernel functions are the Gaussian Radial Basis Function (RBF) (8) and the polynomial function (9):

K(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|^2}{2\sigma^2}\right), \quad \sigma \in \mathbb{R}    (8)

K(x_1, x_2) = (\langle x_1, x_2 \rangle + c)^p, \quad c \ge 0, \; p \in \mathbb{N}    (9)
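Both kernels are one-liners in code; a sketch of (8) and (9):

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0):
    """Gaussian RBF kernel, equation (8)."""
    return float(np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2)))

def poly_kernel(x1, x2, c=1.0, p=2):
    """Polynomial kernel, equation (9)."""
    return float((np.dot(x1, x2) + c) ** p)
```

Note that the RBF kernel equals 1 only when the two vectors coincide and decays toward 0 as they move apart, which is what makes it a natural similarity measure.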
D. Discussion
We have defined kernel functions and shown how they can be used to calculate high-dimensional inner products using lower-dimensional vectors. With this knowledge we can move forward to the formulation of support vector regression, using kernel functions to simplify the calculations.
IV. DERIVATION OF THE SUPPORT VECTOR REGRESSION METHOD
A. Primal Formulation
In order to use the efficient properties of kernel functions, we need a regression formulation that can be expressed in terms of inner products of the explanatory vectors x_i. To this end we consider the minimization problem (10):

Minimize:
\frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*)    (10)

Subject to:
y_i - \langle w, x_i \rangle - w_0 \le \varepsilon + \zeta_i    (11)
\langle w, x_i \rangle + w_0 - y_i \le \varepsilon + \zeta_i^*    (12)
\zeta_i \ge 0, \quad \zeta_i^* \ge 0
In this formulation, \zeta_i and \zeta_i^* are the slack variables; they allow the data to vary outside of the band \pm\varepsilon. However, any point that does go outside this band penalizes the minimization term. C > 0 controls how strongly deviations larger than \varepsilon are penalized. As shown in Fig. 2, only the points outside the region contribute to the cost, as we linearly penalize deviations. These penalized data vectors are the support vectors.
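The cost structure just described is the so-called ε-insensitive loss; a one-function sketch:

```python
def eps_insensitive_loss(residual, eps=0.1):
    """Zero inside the +/-eps band, linear outside it (cf. Fig. 2)."""
    return max(0.0, abs(residual) - eps)
```

Points whose residual lies inside the band contribute nothing, which is why only the vectors outside the tube end up as support vectors.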
Fig. 2. Visualization of Support Vectors [2]
B. Lagrangian Minimization
The minimization problem described by (10) has the Lagrangian representation (13):

L := \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*) - \sum_{i=1}^{n} (\eta_i \zeta_i + \eta_i^* \zeta_i^*)
   - \sum_{i=1}^{n} \alpha_i (\varepsilon + \zeta_i + w_0 + \langle w, x_i \rangle - y_i)
   - \sum_{i=1}^{n} \alpha_i^* (\varepsilon + \zeta_i^* - w_0 - \langle w, x_i \rangle + y_i)    (13)
Taking the derivative of L with respect to each of the variables \{w, w_0, \zeta_i, \zeta_i^*\} yields the following expressions:

\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) x_i    (14)

\frac{\partial L}{\partial w_0} = -\sum_{i=1}^{n} (\alpha_i - \alpha_i^*)    (15)

\frac{\partial L}{\partial \zeta_i} = C - (\eta_i + \alpha_i)    (16)

\frac{\partial L}{\partial \zeta_i^*} = C - (\eta_i^* + \alpha_i^*)    (17)

Setting each derivative equal to 0 imparts the following expressions:

(14) = 0 \implies w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) x_i    (18)

(15) = 0 \implies \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0    (19)

(16) = 0 \implies \eta_i = C - \alpha_i    (20)

(17) = 0 \implies \eta_i^* = C - \alpha_i^*    (21)
C. Dual Formulation
Plugging expressions (18), (19), (20) and (21) back into (13) then yields the dual formulation of the minimization problem (10). This formulation is given by (22):

Maximize:
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) y_i - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*)
 - \frac{1}{2} \sum_{i,j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle x_i, x_j \rangle    (22)

Subject to:
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0, \quad \alpha_i, \alpha_i^* \in [0, C]
Notice that the dual formulation is written in terms of inner products of the x_i. This means that we can use the kernel functions described in SECTION III to reduce the dimensionality of a higher-order mapping z_i = \Phi(x_i). This allows us to write (22) as (23):

\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) y_i - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*)
 - \frac{1}{2} \sum_{i,j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j)    (23)
D. Solving for α(∗)
Now the only unknowns left are the variables \alpha and \alpha^*. Solving for these variables is a task that can be accomplished numerically. One such numerical scheme is an interior point algorithm referred to as primal-dual path-following [3]. This technique is described in [4]. It should also be noted that a very nice property of the SVR formulation is that it can be shown to be convex [3], so any numerical technique will converge to the one globally optimal solution.
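In practice one rarely codes such a solver by hand; off-the-shelf libraries solve this convex dual numerically. A minimal sketch using scikit-learn's SVR on toy data (a 1-D sine curve standing in for real data, not our competition pipeline):

```python
import numpy as np
from sklearn.svm import SVR

# Toy data: a smooth non-linear 1-D function.
X = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y = np.sin(X).ravel()

# The library solves the dual problem for alpha and alpha* internally.
model = SVR(kernel="rbf", C=30, epsilon=0.1)
model.fit(X, y)
pred = model.predict(X)
```

Because the problem is convex, re-running the fit on the same data reaches the same solution regardless of the solver's internals.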
E. Final Solution
Once we have solved for \alpha and \alpha^*, all that is left is to compute the prediction function. Plugging (18) into (2) and replacing inner products with the kernel, we obtain (24):

f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(x_i, x) + w_0    (24)

Similarly, the offset term can be solved for by plugging (18) into (11) or (12), which hold with equality for support vectors lying on the boundary of the tube. This yields (25) and (26):

w_0 = y_i - \sum_{j=1}^{n} (\alpha_j - \alpha_j^*) K(x_j, x_i) - \varepsilon \quad \text{for } \alpha_i \in (0, C)    (25)

w_0 = y_i - \sum_{j=1}^{n} (\alpha_j - \alpha_j^*) K(x_j, x_i) + \varepsilon \quad \text{for } \alpha_i^* \in (0, C)    (26)
F. Selection of Parameters
When selecting the parameters C and \varepsilon, it helps to have an understanding of how they affect the regression. The primal minimization problem (10) holds some clues as to how these variables affect the outcome. C penalizes the function being minimized any time a vector goes outside the error-insensitive tube (which is \pm\varepsilon).

We can see from Fig. 3 that a small C favors a smoother function, while a larger C puts more emphasis on getting as close to every point as possible. Thus, the C parameter is a good way to deal with "over-fitting" the data. It can be thought of as a gain applied to the slack variables.
Fig. 3. Effect of the C Parameter
As shown by Fig. 4, the size of \varepsilon controls how much small errors in the predictive function are ignored. A small \varepsilon will penalize most errors, while a larger value will not penalize errors that are close enough. Thus \varepsilon determines the number of support vectors used to calculate f(x_i).
Fig. 4. Effect of the ε Parameter
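This effect of ε is easy to observe empirically; a sketch (using scikit-learn's SVR on noisy synthetic data, not the competition set) counting support vectors as ε widens:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 80).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)

# A wider epsilon-tube ignores more points, leaving fewer support vectors.
n_sv = {eps: SVR(kernel="rbf", C=30, epsilon=eps).fit(X, y).support_.size
        for eps in (0.01, 0.1, 0.5)}
```

With ε far below the noise level, nearly every training point falls outside the tube and becomes a support vector; with ε well above it, only a handful do.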
G. Results
Using the parameters in TABLE IV, we achieved our team's highest Kaggle score, provided in TABLE V. These results show a drastic improvement over the naive Linear Regression method.
TABLE IV. SVR PARAMETERS

Parameter               Value
explanatory variables   month, hour, weather, workingday
kernel                  Gaussian Radial Basis Function (RBF)
ε                       0.1
C                       30
TABLE V. KAGGLE SCORE FOR SVR PREDICTION
Score (RMSLE) Rank (of approx. 1500)
0.55815 847
H. Analysis of SVR Results
We have seen that SVR substantially boosts predictive power over the baseline linear regression. To see why, we again show the plot for a model run with just the hour as an explanatory variable (Fig. 5). Now we can see that the non-linear function fit to the data adheres much more closely to the average. As time of day is one of the most important variables, we can easily imagine our fit in higher dimensions conforming much more closely to the actual data.
Fig. 5. SVR Fit to Hourly Average
V. CONCLUSION
This project has demonstrated how Support Vector Re-
gression can be used to find a functional approximation to
a nonlinear dataset. It extends the idea of linear regression
to higher dimensional spaces, and artfully utilizes kernel
functions in order to reduce the complexity of computing the
result. As our results in the Kaggle competition have shown,
SVR is a far more robust method of prediction than the naive
linear regression.
ACKNOWLEDGMENT
Special thanks to Professor Gongguo Tang for a very well
taught and interesting class this semester.
REFERENCES
[1] "Data - Bike Sharing Demand," https://www.kaggle.com/c/bike-sharing-demand/data, accessed Dec. 10, 2014.
[2] P. S. Yu et al., "Support vector regression for real-time flood stage forecasting," Journal of Hydrology, vol. 328, no. 3-4, pp. 704-716, Sep. 2006.
[3] A. Smola and B. Schölkopf, "A Tutorial on Support Vector Regression," Sep. 30, 2003.
[4] R. J. Vanderbei, "LOQO: An interior point code for quadratic programming," TR SOR-94-15, Statistics and Operations Research, Princeton Univ., NJ, 1994.