Euro 2013 barrow crone - slideshare

Cross-validation aggregation for forecasting
www.lancs.ac.uk
Devon K. Barrow
Sven F. Crone

Outline
1. Motivation
2. Cross-validation and model selection
3. Cross-validation aggregation
4. Empirical evaluation
5. Conclusions and future work

Motivation
• Scenario:
– The statistician constructs a model and wishes to estimate the error rate of this model when used to predict future values

Motivation: bootstrapping versus cross-validation
• Goal
– Bootstrapping (Efron, 1979): estimating generalisation error
– Cross-validation (Stone, 1974): estimating generalisation error
• Procedure
– Bootstrapping: random sampling with replacement from a single learning set (bootstrap samples). The validation set is the same as the original learning set.
– Cross-validation: splits the data into mutually exclusive subsets, using one subset as a set to train each model, and the remaining part as a validation sample (Arlot & Celisse, 2010)
• Properties
– Bootstrapping: low variance but downward biased (Efron and Tibshirani, 1997)
– Cross-validation: generalisation error estimate is nearly unbiased but can be highly variable (Efron …)
• Forecast aggregation
– 1996: Breiman introduces bootstrapping and aggregation
– Bagging (Breiman 1996) aggregates the outputs of models trained on bootstrap samples

Motivation: Bagging for time series forecasting
• Forecasting with many predictors (Watson 2005)
• Macro-economic time series, e.g. consumer price inflation (Inoue & Kilian 2008)
• Volatility prediction (Hillebrand & Medeiros 2010)
• Small datasets with few observations (Langella 2010)
• With other approaches, e.g. feature selection with PCA (Lin and Zhu 2007)
Citation results for publications on bagging for time series: (a) published items in each year, (b) citations in each year

• Research gap: in contrast to bootstrapping, cross-validation has not been used for forecast aggregation
• Research contribution: we propose to combine the benefits of cross-validation and forecast aggregation: Crogging

Motivation: The Bagging algorithm
• Inputs: learning set $S = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)\}$
• Select the number of bootstraps $K$
• For $k = 1$ to $K$ {
– Generate a bootstrap sample $S_k$ from $S$ (using your favorite bootstrap method)
– Using training set $S_k$, estimate a model $\hat{m}_k(\mathbf{x})$ such that $\hat{m}_k(\mathbf{x}_i) \approx y_i$ }
• Combine the models to obtain: $\hat{M}(\mathbf{x}) = \frac{1}{K} \sum_{k=1}^{K} \hat{m}_k(\mathbf{x})$
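The Bagging steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the base learner `fit_line` (a least-squares line), the toy learning set `S`, and the plain i.i.d. bootstrap are all hypothetical choices made for the example. For time series, a bootstrap that respects serial dependence may be preferable.

```python
import random
from statistics import mean

def fit_line(sample):
    """Hypothetical base learner: least-squares line y = a + b*x fitted to (x, y) pairs."""
    xs = [x for x, _ in sample]
    ys = [y for _, y in sample]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:  # degenerate sample with a single distinct x: fall back to the mean
        return lambda x: y_bar
    b = sum((x - x_bar) * (y - y_bar) for x, y in sample) / sxx
    a = y_bar - b * x_bar
    return lambda x: a + b * x

def bagging(learning_set, n_bootstraps, fit=fit_line, seed=0):
    """Train one model per bootstrap sample and return their equally weighted average."""
    rng = random.Random(seed)
    n = len(learning_set)
    models = []
    for _ in range(n_bootstraps):
        # Bootstrap sample S_k: N draws with replacement from S (assumed i.i.d. bootstrap)
        sample = [learning_set[rng.randrange(n)] for _ in range(n)]
        models.append(fit(sample))
    # Combined model M(x) = (1/K) * sum_k m_k(x)
    return lambda x: mean(m(x) for m in models)

# Toy learning set on an exact line, so every bootstrap model recovers y = 2x + 1
S = [(float(x), 2.0 * x + 1.0) for x in range(10)]
M = bagging(S, n_bootstraps=25)
```

Calling `M(12.0)` returns the average of the 25 individual model predictions at that point.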

2. Cross-validation and model selection

Cross-validation: Background
• Cross-validation is a widely used strategy for:
– Estimating the predictive accuracy of a model
– Performing model selection, e.g.:
• Choosing among variables in a regression or the degrees of freedom of a nonparametric model (selection for identification)
• Parameter estimation and tuning (selection for estimation)

• Main features:
– Main idea: test the model on data not used in estimation
– Split the data once or several times
– Part of the data is used for training each model (the training sample), and the remaining part is used for estimating the prediction error of the model (the validation sample)

Cross-validation: How does it work?
• K-fold cross-validation:
– The data are split into K samples (Sample 1, Sample 2, …, Sample K-1, Sample K), each containing one or more observations
– Estimation and validation are repeated K times

Cross-validation strategies
• k-fold cross-validation
– Divides the data into k non-overlapping and mutually exclusive sub-samples of approximately equal size
– If k=2, 2-fold cross-validation
– If k=10, 10-fold cross-validation
– If k=N, leave-one-out cross-validation (LOOCV)

• Monte Carlo cross-validation
– Randomly split the data into two sub-samples (training and validation) multiple times, each time randomly drawing without replacement

• Hold-out method
– A single split into two data sub-samples
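The strategies listed above differ only in how the observation indices are divided into training and validation samples. The following sketch shows one possible way to generate those index sets. The function names are hypothetical, the folds are taken as contiguous blocks, and the hold-out split is assumed to reserve the most recent observations. None of these details are specified in the slides.

```python
import random

def k_fold_splits(n, k):
    """Yield (train, validation) index lists for k contiguous, near-equal folds."""
    indices = list(range(n))
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

def leave_one_out_splits(n):
    """LOOCV is k-fold cross-validation with k = N."""
    return k_fold_splits(n, n)

def monte_carlo_splits(n, n_val, repetitions, seed=0):
    """Yield repeated random splits; each validation sample is drawn without replacement."""
    rng = random.Random(seed)
    for _ in range(repetitions):
        val = sorted(rng.sample(range(n), n_val))
        val_set = set(val)
        train = [i for i in range(n) if i not in val_set]
        yield train, val

def hold_out_split(n, n_val):
    """Single split; assumes the last n_val observations form the validation sample."""
    return list(range(n - n_val)), list(range(n - n_val, n))

folds = list(k_fold_splits(10, 3))
mc = list(monte_carlo_splits(10, n_val=3, repetitions=5))
```

In k-fold cross-validation every observation appears in exactly one validation sample, whereas in Monte Carlo cross-validation an observation may be drawn for validation in several repetitions or in none.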

Cross-validation: model selection
• Goal: select a model having the smallest generalisation error
• Compute an approximation of the generalisation error, defined as follows:
$E_{gen}(m) = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{m}(\mathbf{x}_i) \right)^2$
• Estimate model m on the training set, and calculate the error on the validation set for sample k:
$E_k(m) = \frac{1}{N/K} \sum_{i=1}^{N/K} \left( y_i^{val} - \hat{m}(\mathbf{x}_i^{val}) \right)^2$
• Estimate the generalisation error after K repetitions as the average error across all repetitions:
$\hat{E}_{gen}(m) = \frac{1}{K} \sum_{k=1}^{K} E_k(m)$
• Select the model with the smallest generalisation error
• What about the K models estimated on the different data sets?
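The selection procedure above (per-fold validation error, averaged over K repetitions, then choosing the minimum) can be illustrated as follows. This is a sketch under stated assumptions: the two candidate models (`fit_mean`, `fit_line`), the toy data, and the interleaved fold assignment are hypothetical and chosen only to keep the example short.

```python
from statistics import mean

def fit_mean(sample):
    """Hypothetical candidate 1: always predict the training mean."""
    y_bar = mean(y for _, y in sample)
    return lambda x: y_bar

def fit_line(sample):
    """Hypothetical candidate 2: least-squares line through the training pairs."""
    xs = [x for x, _ in sample]
    x_bar = mean(xs)
    y_bar = mean(y for _, y in sample)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in sample) / sxx
    return lambda x: y_bar + b * (x - x_bar)

def cv_generalisation_error(fit, data, k):
    """Average of the K validation mean squared errors E_k."""
    n = len(data)
    fold_errors = []
    for fold in range(k):
        # Interleaved folds (assumption): observation i is validated in fold i mod k
        train = [data[i] for i in range(n) if i % k != fold]
        val = [data[i] for i in range(n) if i % k == fold]
        model = fit(train)
        fold_errors.append(mean((y - model(x)) ** 2 for x, y in val))
    return mean(fold_errors)

# Toy data: a linear trend with a small alternating disturbance
data = [(float(x), 3.0 * x + (0.5 if x % 2 == 0 else -0.5)) for x in range(12)]
candidates = {"mean": fit_mean, "line": fit_line}
errors = {name: cv_generalisation_error(fit, data, k=3) for name, fit in candidates.items()}
selected = min(errors, key=errors.get)
```

Note that each call to `fit` inside the loop produces a model that is used once for scoring and then discarded, which is the point raised by the final question on the slide.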

3. Cross-validation aggregation

Cross-validation aggregation: Crogging
• In model selection, the model obtained is the one built on all the data (no data reserved for validation)
– However, predictive accuracy is judged on models built on different parts of the data
– These supplementary models are thrown away after they have served their purpose

• The proposed approach:
– We save the predictions made by the K estimated models
– This gives us a prediction for every observation in the training sample derived from a model that was built when that observation was in the validation sample
– We then average across the predictions from the K models to produce a final prediction:
$\hat{M}(\mathbf{x}_t) = \frac{1}{K} \sum_{k=1}^{K} \hat{m}_k(\mathbf{x}_t)$
– In the case of neural networks, we also use the validation samples for early stopping
– We average across multiple initialisations together with cross-validation aggregation (to reduce variance)
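The averaging step of the proposed approach can be sketched as below: the K models fitted during cross-validation are kept and their predictions are combined with equal weights. This is a minimal illustration, not the authors' implementation. The base learner `fit_line`, the interleaved folds, and the toy data are assumptions, and the sketch omits the neural-network specifics mentioned above (early stopping on the validation sample and averaging over multiple initialisations).

```python
from statistics import mean

def fit_line(sample):
    """Hypothetical base learner: least-squares line through (x, y) pairs."""
    xs = [x for x, _ in sample]
    x_bar = mean(xs)
    y_bar = mean(y for _, y in sample)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in sample) / sxx
    return lambda x: y_bar + b * (x - x_bar)

def crogging(fit, data, k):
    """Fit one model per cross-validation fold and average their predictions."""
    n = len(data)
    models = []
    for fold in range(k):
        # Training sample for fold k: everything except the validation fold (assumed interleaved)
        train = [data[i] for i in range(n) if i % k != fold]
        models.append(fit(train))  # keep the model instead of discarding it

    def aggregate(x):
        # M(x_t) = (1/K) * sum_k m_k(x_t)
        return mean(m(x) for m in models)

    return aggregate, models

# Toy data on an exact line, so each of the K models recovers y = 2x + 1
data = [(float(x), 2.0 * x + 1.0) for x in range(12)]
M, models = crogging(fit_line, data, k=4)
```

Compared with the Bagging procedure, the only change is how the K training samples are formed: by leaving out a validation fold rather than by resampling with replacement.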

4. Empirical evaluation

Evaluation: Design and implementation
• Time series data
• NN3 dataset: 111 time series from the NN3 competition (Crone, Hibon, and Nikolopoulos 2011)

Summary description of NN3 competition time series dataset
               Complete Dataset       Reduced Dataset
               Short      Long        Normal    Difficult   SUM
Non-Seasonal   25 (NS)    25 (NL)     4 (NN)    3 (ND)      57
Seasonal       25 (SS)    25 (SL)     4 (SN)    -           54
SUM            50         50          8         3           111

Plot of 10 time series from the NN3 dataset (panel titles include NN3_101 to NN3_108)

• The following experimental setup is used:
– Forecast horizon: 12 months
– Holdout period: 18 months
– Error measures: SMAPE and MASE
– Rolling origin evaluation (Tashman, 2000)
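The error measures and the rolling-origin scheme named in the setup above can be sketched as follows. The slides do not state which variants were used, so the definitions here are assumptions. SMAPE is taken as the mean of |y - f| / ((|y| + |f|) / 2) in percent. MASE is taken as the out-of-sample mean absolute error scaled by the in-sample mean absolute error of a naive forecast. The rolling origin is assumed to advance one step at a time through the holdout period. All function names are hypothetical.

```python
def smape(actual, forecast):
    """Symmetric MAPE in percent (assumed variant; undefined if a pair is 0 and 0)."""
    terms = [abs(a - f) / ((abs(a) + abs(f)) / 2.0) for a, f in zip(actual, forecast)]
    return 100.0 * sum(terms) / len(terms)

def mase(actual, forecast, insample, season=1):
    """Mean absolute error scaled by the in-sample naive (lag `season`) forecast error."""
    naive_errors = [abs(insample[t] - insample[t - season])
                    for t in range(season, len(insample))]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale

def rolling_origins(n, holdout, horizon):
    """Yield (origin, test_indices): the model sees series[:origin] and forecasts
    up to `horizon` steps ahead, truncated at the end of the series."""
    for origin in range(n - holdout, n):
        yield origin, list(range(origin, min(origin + horizon, n)))

smape_value = smape([100.0, 200.0], [110.0, 180.0])
mase_value = mase([10.0, 12.0], [9.0, 15.0], insample=[1.0, 2.0, 4.0, 7.0])
origins = list(rolling_origins(10, holdout=3, horizon=2))
```

A MASE below 1 indicates that the forecasts are, on average, more accurate than the in-sample naive forecast.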

• Neural network specification:
– A univariate Multilayer Perceptron (MLP) with lagged inputs from Yt up to Yt-13
– Each MLP network contains a single hidden layer with two hidden nodes, and a single output node with a linear identity function. The hyperbolic tangent transfer function is used.
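To make the specification above concrete, the sketch below builds lagged input vectors from a univariate series and evaluates a single-hidden-layer network with hyperbolic tangent hidden nodes and a linear output. It is illustrative only: the function names and the weights are hypothetical, training is not shown, and reading "Yt up to Yt-13" as 14 lagged inputs is an assumption.

```python
import math

def make_lagged(series, n_lags):
    """Return (inputs, targets): each input holds the n_lags values preceding its target."""
    inputs, targets = [], []
    for t in range(n_lags, len(series)):
        inputs.append(series[t - n_lags:t])
        targets.append(series[t])
    return inputs, targets

def mlp_forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass: tanh hidden nodes followed by a linear (identity) output node."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(hidden_weights, hidden_biases)]
    return sum(v * h for v, h in zip(output_weights, hidden)) + output_bias

# Small example with 2 lags; the specification above would correspond to n_lags=14 (assumption)
inputs, targets = make_lagged([1, 2, 3, 4, 5], n_lags=2)

# Two hidden nodes with made-up weights, as in the two-node hidden layer described above
prediction = mlp_forward([1.0], [[1.0], [-1.0]], [0.0, 0.0], [1.0, 1.0], 0.0)
```

With 14 inputs, two hidden nodes and one output node, such a network has 2 × 14 + 2 hidden parameters and 2 + 1 output parameters, 33 in total.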

Evaluation: Findings
• Across all time series
– On the validation set, Monte Carlo cross-validation is always best
– On the test set, all Crogging variants outperform the benchmark Bagging algorithm and the hold-out method (NN model averaging)

MASE and SMAPE averaged over all time series on the training, validation and test datasets

MASE
Method      Train    Validation   Test
BESTMLP     1.25     0.96         1.49
HOLDOUT     0.64     0.75         1.20
BAG         0.76     0.70         1.21
MONTECV     0.76     0.41         1.16
10FOLDCV    0.69     0.45         1.07
2FOLDCV     0.73     0.60         1.15

SMAPE
Method      Train    Validation   Test
BESTMLP     12.36    11.10        17.89
HOLDOUT     11.78    12.57        16.08
BAG         12.95    13.17        16.32
MONTECV     13.81    8.29         15.35
10FOLDCV    12.65    8.94         15.52
2FOLDCV     13.68    11.19        15.29

Boxplots of the MASE and SMAPE averaged over all time series for the different methods. The line of reference represents the median value of the distributions.

• Data conditions:
– Long time series: 10-fold cross-validation has the smallest error for medium to long horizons, and over forecast lead times 1-18

SMAPE on test set averaged over long time series for short, medium and long forecast horizons
                    Forecast Horizon
Length   Method     1-3      4-12     13-18    1-18
Long     BESTMLP    10.79    16.59    20.02    16.77
         HOLDOUT    9.34     14.96    16.20    14.43
         BAG        9.74     15.46    16.38    14.81
         MONTECV    10.86    15.16    15.43    14.54
         10FOLDCV   10.39    14.04    14.82    13.69
         2FOLDCV    9.03     14.64    15.69    14.06

– Short time series: 2-fold cross-validation and Monte Carlo cross-validation outperform 10-fold cross-validation for all forecast horizons

SMAPE on test set averaged over short time series for short, medium and long forecast horizons
                    Forecast Horizon
Length   Method     1-3      4-12     13-18    1-18
Short    BESTMLP    16.83    17.03    20.66    18.20
         HOLDOUT    17.59    17.04    20.12    18.16
         BAG        17.20    17.27    20.96    18.49
         MONTECV    15.47    14.71    19.05    16.28
         10FOLDCV   16.00    15.91    20.25    17.37
         2FOLDCV    15.86    14.51    18.95    16.21

Boxplots of the SMAPE averaged across long (left) and short (right) time series

• NN3 Competition:

                      Average errors    Ranking all methods   Ranking NN/CI
ID    Method          SMAPE    MASE     SMAPE    MASE         SMAPE    MASE
B09   Wildi           14.84    1.13     1        2            −        −
B07   Theta           14.89    1.13     2        2            −        −
C27   Illies          15.18    1.25     3        9            1        7
**    2FOLDCV         15.29    1.15     4        3            2        2
**    MONTECV         15.35    1.16     5        4            3        3
B03   ForecastPro     15.44    1.17     6        5            −        −
…     …               …        …        …        …            …        …
**    BAG             16.32    1.21     13       8            7        5
…     …               …        …        …        …            …        …
B00   AutomatANN      16.81    1.21     14       8            8        5
**    MLP             17.89    1.50     15       10           9        6

5. Conclusions and future work

Not a forecasting method! A general method for improving the accuracy of a forecast model.

• Conclusions
– Cross-validation aggregation outperforms model selection, Bagging and the current approaches to model averaging which use a single hold-out (validation sample)
– It is especially effective when the amount of data available for training the model is limited, as shown for short time series
– Improvements in forecast accuracy increase with the forecast horizon
– It offers promising results on the NN3 competition

• Future work
– Perform bias-variance decomposition and analysis
– Consider base model types other than neural networks
– Evaluate forecast accuracy for a larger set of time series: M3 Competition data (3003 time series, established benchmark)

Devon K. Barrow
Lancaster University Management School
Centre for Forecasting
Lancaster, LA1 4YX, UK
Tel.: +44 (0) 7960271368
Email: d.barrow@lancaster.ac.uk
