1. Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection
Gianluca Bontempi, Mauro Birattari, Patrick E. Meyer
{gbonte,mbiro,pmeyer}@ulb.ac.be
ULB, Université Libre de Bruxelles
Boulevard de Triomphe - CP 212
Bruxelles, Belgium
http://www.ulb.ac.be/di/mlg
Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection – p. 1/2
2. Outline
• Local vs. global modeling
• Wrapper feature selection and local modeling
• F-Racing and subsampling
• Experimental results
12. Global models: pros and cons
• Examples of global models are linear regression models and
neural networks.
• PRO: even for huge datasets, a parametric model can be stored
in a small memory.
• CON:
• in the nonlinear case learning procedures are typically slow
and analytically intractable.
• validation methods, which address the problem of assessing a
global model on the basis of a finite amount of noisy samples,
are computationally prohibitive.
13. Local models: pros and cons
• Examples of local models are locally weighted regression and
nearest neighbours.
• We will consider here a Lazy Learning algorithm [2, 5, 4]
published in previous works.
• PRO: fast and easy local linear learning procedures for
parametric identification and validation.
• CON:
• the dataset of observed input/output data must always be kept
in memory.
• Each prediction requires a repetition of the learning procedure.
14. Complexity in global and local modeling
• Consider a nonlinear regression problem where we have N
training samples, n given features and Q query points (i.e. Q
predictions to be performed).
• Let us compare the computational cost of a nonlinear global
learner (e.g. a neural network) and a local learner (with k << N
neighbors).
• Suppose that the nonlinear global learning procedure relies on a
nonlinear parametric identification step (e.g. backpropagation to
compute the weights) and a structural identification step (e.g.
K-fold cross-validation to define the number of hidden nodes).
• Suppose that the local learning relies on a local leave-one-out
linear criterion (PRESS statistic).
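The PRESS statistic mentioned above gives the leave-one-out error of a linear model in closed form, from a single least-squares fit. A minimal sketch in Python for a univariate linear model (all names are illustrative, not from the authors' toolbox):

```python
# Sketch of the PRESS (leave-one-out) statistic for a linear model,
# computed in closed form from ONE least-squares fit:
#   e_loo[i] = e[i] / (1 - h[i])   with h[i] the leverage of sample i.
# Pure-Python illustration for a univariate model y = a + b*x.

def press_loo_errors(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    loo = []
    for xi, yi in zip(x, y):
        resid = yi - (a + b * xi)            # ordinary residual
        h = 1.0 / n + (xi - mx) ** 2 / sxx   # leverage h_ii
        loo.append(resid / (1.0 - h))        # leave-one-out residual
    return loo

# On noiseless linear data the LOO residuals are exactly zero:
errs = press_loo_errors([0.0, 1.0, 2.0, 3.0, 4.0],
                        [1.0, 3.0, 5.0, 7.0, 9.0])   # y = 1 + 2x
print(max(abs(e) for e in errs))   # → 0.0
```

This is why the local learner's validation is cheap: no model is ever refitted N times.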
15. Complexity in global and local modeling
                                          GLOBAL            LOCAL
Parametric identification                 C_NLS             O(Nn) + C_LS
Structural identification (K-fold CV)     K · C_NLS         small
Cost of Q predictions                     (K + 1) · C_NLS   Q · (O(Nn) + C_LS)
where C_NLS and C_LS denote the cost of a nonlinear and of a linear
least-squares fit, respectively.
The global modeling approach is computationally advantageous with
respect to the local one when the same model is expected to be used
for many predictions. Otherwise, a local approach is preferable.
16. Feature selection
• In recent years many applications of data mining (text mining,
bioinformatics, sensor networks) deal with a very large number n
of features (e.g. tens or hundreds of thousands of variables) and
often comparably few samples.
• In these cases, it is common practice to adopt feature selection
algorithms [7] to improve the generalization accuracy.
• Several techniques exist for feature selection: we focus here on
wrapper search techniques.
• Wrapper methods assess subsets of variables according to their
usefulness to a given learning machine. They conduct a search for a
good subset using the learning algorithm itself as part of the
evaluation function. The problem boils down to a stochastic
state-space search.
• A well-known example of a greedy wrapper search is forward
selection.
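As an illustration of a greedy wrapper search, here is a minimal forward-selection sketch; the `assess` callback is a hypothetical stand-in for the wrapper's cross-validated evaluation of a subset:

```python
# Sketch of greedy forward (wrapper) selection: at each step add the
# feature whose inclusion most improves the assessment score, and stop
# when no candidate improves it. `assess(subset)` is a hypothetical
# callback returning an error estimate for that subset (e.g. the
# cross-validated error of the learner restricted to those features).

def forward_selection(n_features, assess, max_size):
    selected, best_err = [], float("inf")
    remaining = list(range(n_features))
    while remaining and len(selected) < max_size:
        cand = {f: assess(selected + [f]) for f in remaining}
        f_best = min(cand, key=cand.get)
        if cand[f_best] >= best_err:        # no improvement: stop
            break
        best_err = cand[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best_err

# Toy assessment: only features 0 and 2 reduce the error.
useful = {0: 0.5, 2: 0.3}
toy_assess = lambda s: 1.0 - sum(useful.get(f, -0.01) for f in s)
subset, err = forward_selection(4, toy_assess, max_size=4)
print(subset)   # → [0, 2]
```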
17. Why being local in feature selection?
• Suppose that we have F candidate feature sets, N training
samples, and that the assessment is performed by leave-one-out.
• The conventional approach is to test all the F leave-one-out
models on all the N samples and choose the best.
• This requires the training of F · N different models, each one
used for a single prediction.
• The use of a global model demands a huge cost of retraining.
• Local approaches appear to be an effective alternative.
18. Racing and subsampling: an analogy
• You are a national football team coach who must select the
goalkeeper among a set of four candidates for the next World Cup,
which starts next month.
• You have only twenty days of training sessions and eight days in
which to let the players play matches.
• Two options:
1. (i) Train all the candidates during the first twenty days, (ii) test
all of them with matches the last eight days, and (iii) make a
decision.
2. (i) Alternate each week of training with two matches; (ii) after
each week, assess the candidates and, if someone is significantly
worse than the others, discard him; (iii) continue with the
remaining candidates.
• In our analogy, the players are the feature subsets, the training
days are the training data, and the matches are the test data.
19. The racing idea
• Suppose that we have F candidate feature sets, N training
samples, and that the assessment is performed by leave-one-out.
• The conventional approach is to test all the F models on all the
N samples and eventually choose the best.
• The racing idea [8] is to test each feature set on one point at a
time.
• After only a small number of points, by using statistical tests, we
can detect that some feature sets are significantly worse than
others.
• We can discard them and keep focusing on the others.
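The racing loop above can be sketched as follows; the elimination rule here is a simple mean-plus-margin heuristic standing in for a proper statistical test, and all names and thresholds are illustrative:

```python
# Sketch of the racing loop: each surviving candidate is assessed on one
# more test point per iteration, and a candidate is discarded as soon as
# it looks clearly worse than the current best. The elimination rule is
# a simple mean-plus-margin heuristic standing in for a statistical test.

def race(candidates, loo_error, n_points, min_points=3, margin=0.4):
    errors = {c: [] for c in candidates}
    alive = set(candidates)
    for i in range(n_points):
        for c in list(alive):
            errors[c].append(loo_error(c, i))   # one more LOO error
        if i + 1 >= min_points and len(alive) > 1:
            means = {c: sum(errors[c]) / len(errors[c]) for c in alive}
            best = min(means.values())
            alive -= {c for c in alive if means[c] > best + margin}
    return alive

# Toy case: constant errors, candidate F4 is uniformly best.
table = {"F1": 1.5, "F2": 2.7, "F3": 2.2, "F4": 1.0, "F5": 2.4}
survivors = race(table, lambda c, i: table[c], n_points=10)
print(survivors)   # → {'F4'}
```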
20. Non racing approach
Consider this simple example: we have F = 5 feature subsets and
N = 10 samples, and we want to select the best feature set by
leave-one-out cross-validation.
Squared error of each candidate on each left-out point i:

 i       F1     F2     F3     F4     F5
 1       0.1    0.3    0.2    0.0    0.05
 2       0.4    0.6    0.5    0.1    0.2
 3       0.3    1.7    0.4    0.1    0.4
 4       0.7    2.5    1.2    0.9    0.8
 5       0.5    2.0    1.0    0.4    0.5
 6       2.0    3.1    2.7    1.9    2.4
 7       0.1    4.0    3.5    0.0    3.0
 8       4.0    5.2    5.3    3.5    8.4
 9       3.2    4.0    3.9    3.4    4.2
10       4.0    4.0    4.0    0.2    3.9
ESTIMATED MSE:
         1.5    2.7    2.2    1.0    2.4

F4, with the lowest estimated MSE, is the WINNER.
After 50 training and test procedures, we have the best candidate.
25. F-racing for feature selection
• We propose a nonparametric multiple test, the Friedman test [6],
to compare different configurations of input variables and to select
the ones to be eliminated from the race.
• The use of the Friedman test for racing was first proposed by one
of the authors in the context of a technique for comparing
metaheuristics for combinatorial optimization problems [3]. This is
the first time the technique is used in a feature selection
setting.
• The main merit of this nonparametric approach is that it does not
require any hypothesis on the distribution of the observations.
• The idea of F-racing is to use blocking and a paired multiple test
to compare different models under similar conditions, and to discard
the worst ones as soon as possible.
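As a minimal sketch of how the statistic behind this test can be computed (ties ignored, pure Python, not the authors' implementation):

```python
# Minimal sketch of the Friedman statistic used by F-racing (ties are
# ignored for simplicity). Each block (test point) ranks the k surviving
# candidates by error; under the null hypothesis of equal performance
# the statistic is approximately chi-squared with k - 1 degrees of
# freedom.

def friedman_statistic(blocks):
    """blocks: list of N error tuples, one per test point, k entries each."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for errs in blocks:
        order = sorted(range(k), key=lambda j: errs[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank              # rank 1 = smallest error
    return (12.0 / (n * k * (k + 1))) * sum(r * r for r in rank_sums) \
           - 3.0 * n * (k + 1)

# Candidate 0 always best, candidate 2 always worst over N = 8 blocks:
blocks = [(0.1, 0.5, 0.9)] * 8
print(friedman_statistic(blocks))   # → 16.0, well above the chi-squared
                                    #   0.01 cutoff (~9.21 for k = 3)
```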
26. Sub-sampling and LL
• The goal of feature selection is to find the best subset in a set of
alternatives.
• Given a set of alternative subsets, what we expect is a correct
ranking of their generalization accuracy (e.g. F2 > F3 > F5 > F1 >
F4).
• By subsampling we mean using a random subset of the training
set to assess the different feature sets.
• The rationale of subsampling is that, by reducing the training set
size N, we deteriorate the accuracy of each single feature subset
without affecting their ranking.
• In LL, reducing the training set size N reduces the computational
cost.
• This makes the LL approach more competitive.
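A toy illustration of this rationale (purely illustrative, not the paper's experimental setup): a 1-NN regressor assessed with a full and a halved training set, on synthetic data where feature 0 carries the signal and feature 1 is pure noise:

```python
import random

# Toy check of the subsampling rationale: halving the training set
# degrades each subset's error estimate, but here the ranking of the
# two candidate features is preserved. 1-NN regression on synthetic
# data with target y = 10 * x0; feature 1 is pure noise.

random.seed(0)
data = []
for _ in range(200):
    x, noise = random.random(), random.random()
    data.append(((x, noise), 10.0 * x))

def mae_1nn(feature, train, test):
    """MAE of a 1-NN predictor that uses a single feature."""
    total = 0.0
    for xq, yq in test:
        _, ynn = min(train, key=lambda p: abs(p[0][feature] - xq[feature]))
        total += abs(ynn - yq)
    return total / len(test)

train, test = data[:150], data[150:]
full = {f: mae_1nn(f, train, test) for f in (0, 1)}        # full N
half = {f: mae_1nn(f, train[:75], test) for f in (0, 1)}   # subsampled
# Feature 0 beats feature 1 with either training-set size.
print(full[0] < full[1], half[0] < half[1])   # → True True
```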
27. RACSAM for feature selection
We propose the RACSAM (RACing + SAMpling) algorithm:
1. Define an initial group of promising feature subsets.
2. Start with small training and test sets.
3. Discard by racing all the feature subsets that appear
significantly worse than the others.
4. Increase the training and test set size until at most W winner
models remain.
5. Update the group with new candidates proposed by the search
strategy and go back to step 3.
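The five steps can be sketched schematically; `assess`, `significantly_worse` and `propose` are hypothetical callbacks for the per-point error estimate, the statistical test, and the search strategy:

```python
# Schematic of the RACSAM loop (steps 1-5 above). The callbacks are
# hypothetical placeholders: `assess(c, size)` estimates the error of
# candidate c with `size` samples, `significantly_worse(errors)` returns
# the set of candidates a statistical test rejects, and `propose(alive)`
# is the search strategy suggesting new candidates.

def racsam(initial, assess, significantly_worse, propose,
           n_total, w=5, start_size=10, rounds=3):
    alive = set(initial)                          # 1. initial group
    for _ in range(rounds):
        size = start_size                         # 2. small train/test sets
        while len(alive) > w and size < n_total:
            errors = {c: assess(c, size) for c in alive}
            alive -= significantly_worse(errors)  # 3. discard by racing
            size = min(2 * size, n_total)         # 4. grow the sample size
        alive |= propose(alive)                   # 5. new candidates, repeat
    return alive

# Toy run: fixed errors per candidate; the test separates {a, b, c}
# from the clearly worse {d, e} but cannot split the three survivors.
errs = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 5.0}
winners = racsam(errs,
                 assess=lambda c, size: errs[c],
                 significantly_worse=lambda e: {c for c, v in e.items()
                                                if v > min(e.values()) + 2.5},
                 propose=lambda alive: set(),
                 n_total=100, w=2)
print(sorted(winners))   # → ['a', 'b', 'c']
```

Note that more than W candidates may survive when the data budget runs out before the test can separate them.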
28. Experimental session
• We compare the accuracy of the LL algorithm enhanced by the
RACSAM procedure to the accuracy of two state-of-the-art
algorithms: a SVM for regression and a regression tree (RTREE).
• Two versions of the RACSAM algorithm were tested: the first
(LL-RAC1) takes as feature set the best one (in terms of estimated
Mean Absolute Error (MAE)) among the W winning candidates; the
second (LL-RAC2) averages the predictions of the best W LL
predictors.
• W = 5, and the p-value of the test is 0.01.
29. Experimental results
Five-fold cross-validation on six real datasets of high dimensionality:
Ailerons (N = 14308, n = 40), Pole (N = 15000, n = 48),
Elevators (N = 16599, n = 18), Triazines (N = 186, n = 60),
Wisconsin (N = 194, n = 32) and Census (N = 22784, n = 137).
Dataset AIL POL ELE TRI WIS CEN
LL-RAC1 9.7e-5 3.12 1.6e-3 0.21 27.39 0.17
LL-RAC2 9.0e-5 3.13 1.5e-3 0.12 27.41 0.16
SVM 1.3e-4 26.5 1.9e-3 0.11 29.91 0.21
RTREE 1.8e-4 8.80 3.1e-3 0.11 33.02 0.17
30. Statistical significance
• LL-RAC1 vs. LL-RAC2:
• LL-RAC2 is significantly better than LL-RAC1 in 3 cases out of 6;
• LL-RAC2 is never significantly worse than LL-RAC1.
• LL-RAC2 vs. state-of-the-art techniques:
• LL-RAC2 is never significantly worse than SVM or RTREE;
• LL-RAC2 is significantly better than SVM in 5 cases out of 6, and
significantly better than RTREE in 6 cases out of 6.
31. Software
• MATLAB toolbox on Lazy Learning [1].
• R contributed packages:
• lazy package.
• racing package.
• Web page: http://iridia.ulb.ac.be/~lazy.
• About 5000 accesses since October 2002.
32. Conclusions
• Wrapper strategies require a huge number of assessments. It is
important to make this process faster and less prone to instability.
• Local strategies reduce the computational cost of training models
that have to be used for only a few predictions.
• Racing speeds up the evaluation by discarding bad candidates
as soon as they appear to be statistically significantly worse than
the others.
• Sub-sampling combined with local learning can speed up the
preliminary phases of the race, when it is important to discard the
largest number of bad candidates.
33. ULB Machine Learning Group (MLG)
• 7 researchers (1 professor, 6 PhD students) and 4 graduate students.
• Research topics: Local learning, Classification, Computational statistics, Data
mining, Regression, Time series prediction, Sensor networks, Bioinformatics.
• Computing facilities: cluster of 16 processors, LEGO Robotics Lab.
• Website: www.ulb.ac.be/di/mlg.
• Scientific collaborations in ULB: IRIDIA (Sciences Appliquées), Physiologie
Moléculaire de la Cellule (IBMM), Conformation des Macromolécules Biologiques
et Bioinformatique (IBMM), CENOLI (Sciences), Microarray Unit (Hôpital Jules
Bordet), Service d'Anesthésie (ERASME).
• Scientific collaborations outside ULB: UCL Machine Learning Group (B),
Politecnico di Milano (I), Universitá del Sannio (I), George Mason University (US).
• The MLG is part of the "Groupe de Contact FNRS" on Machine Learning.
34. ULB-MLG: running projects
1. "Integrating experimental and theoretical approaches to decipher the molecular
networks of nitrogen utilisation in yeast": ARC (Action de Recherche Concertée)
funded by the Communauté Française de Belgique (2004-2009). Partners:
IBMM (Gosselies and La Plaine), CENOLI.
2. "COMP2SYS" (COMPutational intelligence methods for COMPlex SYStems):
MARIE CURIE Early Stage Research Training funded by the European Union
(2004-2008). Main contractor: IRIDIA (ULB).
3. "Predictive data mining techniques in anaesthesia": FIRST Europe Objectif 1
funded by the Région wallonne and the Fonds Social Européen (2004-2009).
Partners: Service d'Anesthésie (ERASME).
4. "AIDAR - Adressage et Indexation de Documents Multimédias Assistés par des
techniques de Reconnaissance Vocale": funded by Région Bruxelles-Capitale
(2004-2006). Partners: Voice Insight, RTBF, Titan.
35. References
[1] M. Birattari and G. Bontempi. The lazy learning toolbox, for use with
MATLAB. Technical Report TR/IRIDIA/99-7, IRIDIA-ULB, Brussels, Belgium, 1999.
[2] M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the
recursive least-squares algorithm. In M. S. Kearns, S. A. Solla, and
D. A. Cohn, editors, NIPS 11, pages 375–381, Cambridge, 1999. MIT Press.
[3] M. Birattari, T. Stützle, L. Paquete, and K. Varrentrapp. A racing
algorithm for configuring metaheuristics. In W. B. Langdon, editor,
GECCO 2002, pages 11–18. Morgan Kaufmann, 2002.
[4] G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for modeling
and control design. International Journal of Control, 72(7/8):643–658, 1999.
[5] G. Bontempi, M. Birattari, and H. Bersini. A model selection approach
for local learning. Artificial Intelligence Communications, 121(1), 2000.
[6] W. J. Conover. Practical Nonparametric Statistics. John Wiley & Sons,
New York, NY, USA, third edition, 1999.
[7] I. Guyon and A. Elisseeff. An introduction to variable and feature
selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
[8] O. Maron and A. Moore. The racing algorithm: Model selection for lazy
learners. Artificial Intelligence Review, 11(1–5):193–225, 1997.