Stopped Training and Other Remedies for Overfitting
                To appear in Proceedings of the 27th Symposium on the Interface, 1995
                                             Warren S. Sarle
                      SAS Institute Inc., SAS Campus Drive, Cary, NC 27513, USA
                                           saswss@unx.sas.com

Abstract

Overfitting is the most serious problem in neural network training; stopped training is the most common solution to this problem. Yet the neural network literature contains little systematic investigation of the properties of stopped training and no comparisons with statistical methods of dealing with overfitting. This paper presents the results of simulations designed to investigate these issues.

Introduction

The most important unresolved problem in practical applications of feedforward neural networks (NNs) is how to determine the appropriate complexity of the network. This problem is similar to the problem of choosing the degree of smoothing in nonparametric estimation. But overfitting in NNs can have much more serious consequences than can undersmoothing in most nonparametric estimation methods. NNs, like high-order polynomials, can produce wild predictions far beyond the range of the data.

There are two main approaches to controlling the complexity of a NN: model selection and regularization.

Model selection for a NN requires choosing the number of hidden units (HUs) and connections thereof. Related statistical problems include choosing the order of an autoregressive model and choosing a subset of predictors in a linear regression model. The usual NN approach to model selection is pruning (Reed 1993), in which you start with a large model and remove connections or units during training by various ad hoc algorithms. Pruning is somewhat similar to backward elimination in stepwise regression. Wald tests for pruning have been introduced in the NN literature under the name Optimal Brain Surgeon by Hassibi and Stork (1993).

The usual statistical approach to model selection is to estimate the generalization error for each model and to choose the model with the minimum such estimate. For nonlinear models, bootstrapping is generally the best way to estimate the generalization error, but Schwarz's Bayesian criterion (SBC) is reasonably effective and much cheaper than bootstrapping. Previous unpublished simulations showed that criteria such as AIC, FPE, and GCV, which choose more complex models than SBC, can overfit badly when applied to NNs.
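For concreteness, here is a sketch of how SBC-based selection of the number of HUs might be computed. The n*log(SSE/n) + k*log(n) form assumes Gaussian errors and least-squares fits; the trained fits themselves are assumed to come from elsewhere and the numbers below are made up.

    import numpy as np

    def sbc(sse, n_obs, n_params):
        """Schwarz's Bayesian criterion for a Gaussian least-squares fit.
        Smaller is better; the k*log(n) term penalizes model complexity."""
        return n_obs * np.log(sse / n_obs) + n_params * np.log(n_obs)

    def choose_hus_by_sbc(fits, n_obs):
        """fits: dict mapping number of HUs -> (sse, n_params) for networks that
        have already been trained (hypothetical inputs; training is not shown)."""
        scores = {h: sbc(sse, n_obs, k) for h, (sse, k) in fits.items()}
        return min(scores, key=scores.get), scores

    # Illustrative numbers for a single-input network (3H + 1 weights for H HUs):
    # 3 HUs wins even though 5 HUs has a slightly lower SSE.
    best, scores = choose_hus_by_sbc({1: (52.0, 4), 3: (31.0, 10), 5: (30.5, 16)},
                                     n_obs=100)
    print(best, scores)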
Regularization involves constraining or penalizing the solution of the estimation problem to improve generalization by smoothing the predictions. Common NN methods for regularization include weight decay and addition of noise to the inputs during training. Both of these methods reduce to ridge regression in the case of linear models. Other forms of statistical regularization include Stein estimation and Bayesian estimation.

Stopped training

The most popular method of regularizing NNs, which is called stopped training or early stopping or optimal stopping, proceeds as follows:

  1. Divide the available data into two separate training and validation sets.

  2. Use a large number of HUs.

  3. Use small random initial values.

  4. Use a slow learning rate.

  5. Compute the validation error periodically during training.

  6. Stop training when the validation error starts to go up (see the sketch below).
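A minimal sketch of this procedure for a single-input network follows. It is illustrative only: the tiny architecture, learning rate, check interval, and the patience rule for deciding that the validation error has started to go up are all assumptions, not specifications from this paper.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict(x, weights):
        a, c, v, b = weights              # HU slopes, HU biases, output weights, output bias
        return sigmoid(np.outer(x, a) + c) @ v + b

    def val_sse(x, y, weights):
        return float(np.sum((y - predict(x, weights)) ** 2))

    def train_early_stopping(x_tr, y_tr, x_val, y_val, n_hu=20, lr=0.01,
                             check_every=100, patience=20, max_iter=100_000, seed=0):
        rng = np.random.default_rng(seed)
        a, c, v = (rng.uniform(-0.1, 0.1, n_hu) for _ in range(3))  # small random initial values
        b = 0.0
        best_err = val_sse(x_val, y_val, (a, c, v, b))
        best = (a.copy(), c.copy(), v.copy(), b)
        bad_checks = 0
        for it in range(max_iter):
            h = sigmoid(np.outer(x_tr, a) + c)        # hidden activations
            r = h @ v + b - y_tr                      # residuals (identity output unit)
            dh = (r[:, None] * v) * h * (1.0 - h)     # backprop through the logistic HUs
            a -= lr * (dh * x_tr[:, None]).sum(axis=0)   # updates use gradients of 0.5 * SSE
            c -= lr * dh.sum(axis=0)
            v -= lr * (h * r[:, None]).sum(axis=0)
            b -= lr * r.sum()
            if (it + 1) % check_every == 0:           # compute the validation error periodically
                err = val_sse(x_val, y_val, (a, c, v, b))
                if err < best_err:
                    best_err, best, bad_checks = err, (a.copy(), c.copy(), v.copy(), b), 0
                else:
                    bad_checks += 1
                if bad_checks >= patience:            # validation error has stopped improving
                    break
        return best

In the simulations reported later, training was actually run to convergence and the weights with the smallest validation error were kept, so the patience rule above is just one common way of deciding when to quit.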
Stopped training is closely related to ridge regression. If the learning rate is sufficiently small, the sequence of weight vectors on each iteration will approximate the path of continuous steepest descent down the error function. Stopped training chooses a point along this path that optimizes an estimate of the generalization error computed from the validation set. Ridge regression also defines a path of weight vectors by varying the ridge value. The ridge value is often chosen by optimizing an estimate of the generalization error computed by cross-validation, generalized cross-validation, or bootstrapping. There always exists a positive ridge value that will improve the expected generalization error in a linear model. A similar result has been obtained for stopped training in linear models (Wang, Venkatesh, and Judd 1994). In linear models, the ridge path lies close to, but does not coincide with, the path of continuous steepest descent; in nonlinear models, the two paths can diverge widely.
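The relationship is easy to see numerically in a linear model. In the sketch below (an illustration, not an analysis from this paper), gradient descent started at zero traces one path of coefficient vectors and the ridge estimator traces another as the ridge value shrinks; both run from the zero vector to the ordinary-least-squares solution and stay close to each other along the way.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

    # Gradient-descent path for 0.5*||y - Xw||^2, started at w = 0 with a small step.
    lr, w, gd_path = 1e-3, np.zeros(3), []
    for _ in range(2000):
        w = w + lr * X.T @ (y - X @ w)
        gd_path.append(w.copy())

    # Ridge path: w(lambda) = (X'X + lambda*I)^(-1) X'y for decreasing lambda.
    ridge_path = [np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
                  for lam in np.logspace(4, -4, 100)]

    # Both paths start near zero and end near the OLS solution, without coinciding.
    ols = np.linalg.lstsq(X, y, rcond=None)[0]
    print("OLS:              ", np.round(ols, 3))
    print("end of GD path:   ", np.round(gd_path[-1], 3))
    print("end of ridge path:", np.round(ridge_path[-1], 3))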
Stopped training has several advantages:

  - It is fast.

  - It can be applied successfully to networks in which the number of weights far exceeds the sample size. For example, Nelson and Illingworth (1991, p. 165) discuss training a network with 16,219 weights on only 50 training cases!

  - It requires only one major decision by the user: what proportion of validation cases to use.

Statisticians tend to be skeptical of stopped training because:

  - It appears to be statistically inefficient due to the use of the split-sample technique; i.e., neither training nor validation makes use of the entire sample.

  - The usual statistical theory does not apply.

However, there has been recent progress addressing both of the above concerns (Wang 1994).

Despite the widespread use of stopped training, there has been very little research on the subject. Most published studies are seriously flawed. For example, Morgan and Bourlard (1990) generated artificial data sets in such a way that the training set and validation set were correlated, thus invalidating the results. Finnoff, Hergert and Zimmermann (1993) generated artificial data with uniformly distributed noise and then trained the networks by least absolute values, a bizarre combination from a statistical point of view (also see Weigend 1994 on least-absolute-value training).

In one of the few useful studies on stopped training, Weigend (1994) claims that:

  1. "... fairly small 4HU networks that never reach good performance also, already, show overfitting."

  2. "... large 15HU networks generalize better than small ones."

These conclusions are based on a single data set with 540 training cases, 540 validation cases, 160 inputs, and a 6-category output, trained using the cross-entropy loss function, initial values uniformly distributed between -.01 and +.01, and a learning rate of .01 with no momentum. A 4HU network has 674 weights (with 160 inputs and 6 outputs: 161 × 4 input-to-hidden weights and biases plus 5 × 6 hidden-to-output weights and biases), which is greater than the number of training cases. Thus a 4HU network is hardly "fairly small" relative to the training set size, so claim (1) is not supported by Weigend's results. Weigend's claim (2) is clearly supported by the results, but generalizing from a single case is risky!

Practical regularization

Ridge regression is the most popular statistical regularization method. Ridge regression can be extended to nonlinear models in a variety of ways:

  - Penalized ridging, in which the estimation function is augmented by a penalty term, usually equal to the sum of squared weights; this method is called weight decay in the NN literature (this and smoothed ridging are sketched after the list).

  - Smoothed ridging, in which noise is added to the inputs, which amounts to an approximate convolution of the training data with the noise distribution; this method is used in the NN literature but does not have a widely recognized name.

  - Constrained ridging, in which a norm of the weights is constrained not to exceed a specified value; this method has not appeared in the NN literature to my knowledge.
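As an illustration of the first two variants (the penalty strength, noise level, and choice of optimizer are assumptions, and predict stands for whatever network evaluation function is already available, such as the early-stopping sketch above):

    import numpy as np

    def penalized_sse(w, x, y, predict, ridge):
        """Penalized ridging / weight decay: SSE plus ridge * sum of squared weights."""
        resid = y - predict(x, w)
        return np.sum(resid ** 2) + ridge * np.sum(np.asarray(w) ** 2)

    def noisy_input_sse(w, x, y, predict, noise_sd, rng):
        """Smoothed ridging: add fresh Gaussian noise to the inputs on each evaluation,
        approximating convolution of the training data with the noise distribution."""
        x_noisy = x + rng.normal(scale=noise_sd, size=x.shape)
        return np.sum((y - predict(x_noisy, w)) ** 2)

    # Either objective can be handed to a generic optimizer, for example:
    #   from scipy.optimize import minimize
    #   fit = minimize(penalized_sse, w0, args=(x, y, predict, 0.1), method="BFGS")
    # where predict(x, w) and the starting weights w0 are assumed to be supplied
    # by the surrounding code; neither is specified in the paper.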
Penalized ridging has the clearest motivation, being a form of maximum-posterior-density (MPD) Bayesian estimation in which the weights are distributed N(0, σ²_ridge I). The Bayesian approach has the great advantage of providing a natural way to estimate the ridge value: you make the ridge value a hyperparameter and estimate it along with everything else. Typically, the ridge value is given a noninformative prior distribution.

Full Bayesian estimation has been found very effective for NN training (Neal 1995) but is prohibitively expensive except for small data sets. Empirical Bayes methods (called the evidence framework in the NN literature; MacKay 199?), in which you first estimate the ridge value by integrating over the weights, are also effective but still expensive. Williams (1995) suggested first integrating over the ridge value and then maximizing with respect to the weights. This method is very efficient since the integration can be done analytically, but the global optimum is infinite, with all the weights equal to zero, a solution that is rarely of much practical use.

Hierarchical Bayesian models with noninformative hyperprior distributions generally have an infinite posterior density along a ridge corresponding to a variance of zero for the hyperparameters. Sometimes there is a local maximum of the posterior density that yields an approximation to a full Bayesian analysis, but there is often only a shoulder rather than a local maximum (O'Hagan 1985).

For Bayesian estimation to be competitive with stopped training for practical applications, a computationally efficient approach is required. One simple and fast approach is to maximize the posterior density with respect to both the weights and the ridge value, using a slightly informative prior for the ridge value. A sufficiently informative prior will convert a shoulder into a local maximum, but how much prior information is required depends on the particular application.
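One way such a joint objective could be written is sketched below. This is an illustration only; the paper does not specify its prior, so the known noise standard deviation and the inverse-gamma hyperprior on the prior variance are assumptions.

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_posterior(params, x, y, predict, noise_sd, a0=2.0, b0=0.1):
        """Joint negative log posterior in (weights, log v), up to an additive constant.
        Assumptions: Gaussian errors with known noise_sd, weights ~ N(0, v*I), and a
        slightly informative inverse-gamma(a0, b0) prior on the prior variance v."""
        w, log_v = params[:-1], params[-1]
        v = np.exp(log_v)
        resid = y - predict(x, w)
        nll = np.sum(resid ** 2) / (2.0 * noise_sd ** 2)            # -log likelihood
        nlp_w = np.sum(w ** 2) / (2.0 * v) + 0.5 * len(w) * log_v   # -log prior of the weights
        nlp_v = a0 * log_v + b0 / v                                  # -log prior of v plus the log-v Jacobian
        return nll + nlp_w + nlp_v

    # Usage (predict, w0, and noise_sd are assumed to be supplied by surrounding code):
    #   fit = minimize(neg_log_posterior, np.append(w0, 0.0),
    #                  args=(x, y, predict, noise_sd), method="BFGS")
    #   w_mpd, v_mpd = fit.x[:-1], np.exp(fit.x[-1])

How informative the prior needs to be, i.e. the choice of a0 and b0 in this sketch, is exactly the application-dependent judgment described above.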
The maximum posterior density approach yields a single model equation, so predictions can be computed quickly. But MPD estimation can be statistically inefficient compared to full Bayesian estimation in small samples when the mode of the posterior density is not representative of the entire posterior distribution. Hence the usefulness of MPD estimation in small samples requires investigation.

Simulations

NNs are typically used with many inputs. But it is difficult to comprehend the nature of high-dimensional functions. The simulations presented in this paper were run using functions of a single input to make it possible to plot the estimated functions.

Six functions were used, named flat, line, curve, wiggle, saw, and step. Each figure at the end of this paper is devoted to one of these functions. Data were generated by selecting an input value from a uniform distribution on (-2, 2) and adding Gaussian noise with standard deviation sigma as shown in the figures. Fifty samples were generated from the saw function, while twenty samples were generated from each of the other functions.
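A sketch of this sampling scheme follows. The step function, noise level, and per-sample size below are illustrative stand-ins, since the true functions and sigmas are given only in the figures.

    import numpy as np

    def generate_sample(f, sigma, n, rng):
        """One simulated sample: inputs uniform on (-2, 2), Gaussian noise on the response."""
        x = rng.uniform(-2.0, 2.0, n)
        y = f(x) + rng.normal(scale=sigma, size=n)
        return x, y

    # Hypothetical stand-in for one of the six test functions; the paper's "step"
    # function is shown only graphically, so this definition is an assumption.
    step = lambda x: np.where(x < 0.0, 0.0, 1.0)

    rng = np.random.default_rng(42)
    samples = [generate_sample(step, sigma=0.2, n=100, rng=rng) for _ in range(20)]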
The figures are composed of four quadrants containing the following plots:

Upper left: The true regression function, pooled data from all samples, and separate data from each of the first two samples.

Upper right: The estimated regression functions for each sample with each of four estimation methods:

  - SBC: Ordinary least squares with the number of HUs chosen by Schwarz's Bayesian criterion.

  - DG: Ordinary least squares with the number of HUs chosen by "Divine Guidance", i.e. the true generalization error.

  - MPD: Maximum posterior density with 10 HUs and a slightly informative prior.

  - Stopped training with 20 HUs and 25% validation cases.

Lower left: The mean MSE of prediction (above) and the median MSE of prediction (below) as functions of the number of HUs (on the left) and the percentage of validation cases (on the right).

Lower right: Box-and-whisker plots of the MSE of prediction for each estimation method.

In all simulations, the HUs had logistic activation functions and the output unit had an identity activation function.
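Concretely, each fitted network has the following form for a single input (a sketch; the flat weight-vector layout is an assumption made for convenience):

    import numpy as np

    def predict(x, w):
        """One-hidden-layer network for a single input: logistic hidden units and an
        identity output unit.  Assumed weight layout: for H hidden units, w holds H
        input-to-hidden slopes, H hidden biases, H hidden-to-output weights, 1 bias."""
        w = np.asarray(w)
        n_hu = (len(w) - 1) // 3
        a, c, v, b = w[:n_hu], w[n_hu:2 * n_hu], w[2 * n_hu:3 * n_hu], w[3 * n_hu]
        h = 1.0 / (1.0 + np.exp(-(np.outer(x, a) + c)))   # logistic activations
        return h @ v + b                                   # identity output unit

This predict function could be plugged into the penalized and MPD objectives sketched earlier.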
Stopped training was done with 5, 10, 20, or 40 HUs. Each sample was randomly split into training and validation sets. The network was trained to convergence by a conjugate gradient algorithm. The validation error was computed on each iteration, and the weights that minimized the validation error were used to compute the estimated functions. This approach circumvents the question of when the validation error starts to go up.

OLS and MPD estimation were done by a Levenberg-Marquardt algorithm. Ten HUs were used for MPD estimation.

Generalization error was estimated by computing the MSE for predicting the mean response for 201 data points evenly spaced from -2 to 2, which gives a close approximation to the mean integrated squared error.
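That estimate is simply the average squared difference between the true regression function and a fitted network over an evenly spaced grid; a minimal version, assuming the predict function sketched above and a true regression function f:

    import numpy as np

    def generalization_mse(f, predict, w, n_grid=201):
        """Approximate mean integrated squared error over the input range (-2, 2)."""
        grid = np.linspace(-2.0, 2.0, n_grid)
        return float(np.mean((f(grid) - predict(grid, w)) ** 2))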
Since the distribution of the MSE of prediction is highly skewed and possibly multimodal, a rank-transform analysis of variance was used to evaluate the effects of the number of HUs and the validation percentage and to compare estimation methods. Since this is a small, exploratory study, no adjustments were made for multiple tests. Results are summarized in terms of the conventional 5% level of significance.
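A rank-transform ANOVA simply replaces the responses by their ranks and runs an ordinary ANOVA on the ranks. A sketch of how such an analysis might be set up (the data-frame layout and the use of statsmodels are assumptions, not the paper's software):

    import pandas as pd
    from scipy.stats import rankdata
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    def rank_anova(df):
        """Two-way ANOVA on ranked MSE values, with factors for the number of HUs
        and the validation percentage.  df is assumed to have columns 'mse',
        'hus', and 'val_pct', one row per fitted network."""
        df = df.assign(rank_mse=rankdata(df["mse"]))
        model = ols("rank_mse ~ C(hus) * C(val_pct)", data=df).fit()
        return sm.stats.anova_lm(model, typ=2)   # main effects and interaction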
There was no significant interaction between the number of HUs and the validation percentage for any of the functions.

The number of HUs had no significant effect on the generalization of stopped training for the flat or step functions. For the line function, 40 HUs generalized significantly better than 5 or 10 HUs. For the other functions, 20 HUs were better than 5 or 10 HUs. For the saw function, 40 HUs generalized significantly worse than 20 HUs. For the other functions, the difference between 20 and 40 HUs was not significant.

The effect of validation percentage on stopped training was not as clear-cut as the effect of the number of HUs. For the flat function, 33% validation cases generalized best but not significantly better than any other percentage except 14%. For the line function, the effect of validation percentage was not significant. For the curve function, 25% was best but was not significantly better than any other percentage except 10%. For the wiggle function, 14% was nonsignificantly better than 20% and significantly better than any other validation percentage. For the saw function, 14% was best but was not significantly better than any other percentage except 50%. For the step function, 50% was significantly worse than any other percentage. Thus the data suggest a trend in which the more complex functions require a smaller validation percentage, but the results are far from conclusive.

To compare stopped training with the other methods, 20 HUs and 25% validation cases were chosen to give good results for all five functions.

Of the three practical methods (DG is not, of course, practical), SBC was significantly best for the flat and line functions, but MPD was best for the other functions, significantly so for the wiggle and saw functions. For the curve and step functions, MPD was significantly better than stopped training but not significantly better than SBC. For the flat function, MPD was significantly worse than SBC or stopped training, while SBC and stopped training did not differ significantly. For the line function, both MPD and stopped training were significantly worse than SBC, but MPD and stopped training did not differ significantly from each other. Thus, for applications in which nonlinear relationships are anticipated, MPD is clearly the method of choice. Ironically, stopped training, which is touted as a fool-proof method for fitting complicated nonlinear functions, works well compared to MPD only for the two linear functions.

DG was significantly better than MPD for the flat and line functions and was extremely close to MPD for the other functions, so improved methods of model selection deserve further investigation. However, DG sometimes yields estimates with near discontinuities for all six functions, and for some applications this tendency could be a disadvantage. It is interesting that for the step function, SBC appears to the human eye to produce better estimates than DG. But even though SBC detects the discontinuities reliably, it does not position them accurately.

References

Bernardo, J.M., DeGroot, M.H., Lindley, D.V. and Smith, A.F.M., eds. (1985), Bayesian Statistics 2, Amsterdam: Elsevier Science Publishers B.V. (North-Holland).

Finnoff, W., Hergert, F., and Zimmermann, H.G. (1993), "Improving model selection by nonconvergent methods," Neural Networks, 6, 771-783.

Hassibi, B. and Stork, D.G. (1993), "Second order derivatives for network pruning: Optimal Brain Surgeon," Advances in Neural Information Processing Systems, 5, 164-171.

MacKay, D.J.C. (199?), "Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks," ftp://mraos.ra.phy.cam.ac.uk/pub/mackay/network.ps.Z

Morgan, N. and Bourlard, H. (1990), "Generalization and parameter estimation in feedforward nets: Some experiments," Advances in Neural Information Processing Systems, 2, 630-637.

Neal, R.M. (1995), Bayesian Learning for Neural Networks, Ph.D. thesis, University of Toronto, ftp://ftp.cs.toronto.edu/pub/radford/thesis.ps.Z

Nelson, M.C. and Illingworth, W.T. (1991), A Practical Guide to Neural Nets, Reading, MA: Addison-Wesley.

O'Hagan, A. (1985), "Shoulders in hierarchical models," in Bernardo et al. (1985), 697-710.

Reed, R. (1993), "Pruning Algorithms - A Survey," IEEE Transactions on Neural Networks, 4, 740-747.

Wang, C. (1994), A Theory of Generalisation in Learning Machines with Neural Network Application, Ph.D. thesis, University of Pennsylvania.

Wang, C., Venkatesh, S.S., and Judd, J.S. (1994), "Optimal Stopping and Effective Machine Complexity in Learning," Advances in Neural Information Processing Systems, 6, 303-310.

Weigend, A. (1994), "On overfitting and the effective number of hidden units," Proceedings of the 1993 Connectionist Models Summer School, 335-342.

Williams, P.M. (1995), "Bayesian regularization and pruning using a Laplace prior," Neural Computation, 7, 117-143.
Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6
