A New Learning Method for Single Layer Neural Networks Based on a Regularized Cost Function
1. A New Learning Method for Single Layer Neural Networks Based on a Regularized Cost Function. Juan A. Suárez-Romero, Óscar Fontenla-Romero, Bertha Guijarro-Berdiñas, Amparo Alonso-Betanzos. Laboratory for Research and Development in Artificial Intelligence, Department of Computer Science, University of A Coruña, Spain
21. Thank you for your attention!
Editor's notes
Thank you very much. I'm going to present here a new learning method for single-layer neural networks based on a regularized cost function.
Let me first outline the main points of this presentation. I'm going to start with a short introduction to single-layer neural networks. Next I'll explain supervised learning with regularization in this kind of network, and show an alternative loss function that allows us to obtain an analytical solution. Finally, I'll show the experimental results, the conclusions and future work.
As you can see, our supervised learning algorithm is applied to a single-layer neural network, with I inputs and J outputs. To train the network we have S examples. Generally, the activation functions used are non-linear. Finally, as can be seen, in this kind of network the outputs are independent of one another, because the set of weights associated with each output is independent of the other sets.
So, in order to simplify the explanation, we'll work with one output. PRESS NEXT KEY The real outputs of the network are obtained through a non-linear function, whose input is the weighted sum of the inputs plus the bias. If the error function used is the MSE, as in our case, then the goal is to obtain the values of the weights and the bias that minimize the MSE between the real and the desired outputs.
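The single-output model and its MSE objective can be sketched as follows; this is a minimal illustration, assuming the logistic activation mentioned later in the talk (the function names and sample data are our own, not from the slides):

```python
import numpy as np

def logistic(z):
    # Logistic (sigmoid) activation, a typical non-linear function f
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, w, b):
    # Real output y_s = f(sum_i w_i * x_is + b) for each of the S samples
    return logistic(X @ w + b)

def mse(X, w, b, d):
    # Mean squared error between real outputs and desired outputs d
    y = forward(X, w, b)
    return np.mean((d - y) ** 2)

# Tiny example: 3 samples, 2 inputs
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
d = np.array([0.0, 1.0, 1.0])
print(mse(X, np.zeros(2), 0.0, d))  # zero weights give outputs of 0.5, so MSE = 0.25
```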
After adding a regularization term to the cost function, our goal is to minimize this cost function, which has two terms. PRESS NEXT KEY The first term is the loss function, here the MSE, which is the squared difference between the desired output and the real output. PRESS NEXT KEY The second term is the regularization term, weighted by the regularization parameter alpha. In our case the regularization term used is weight decay, which tries to smooth the obtained curve. To minimize this cost function, we can differentiate both terms with respect to the weights and the bias and equate the derivatives to zero. PRESS NEXT KEY The problem is that, in the first term, the weights are inside the non-linear function, so a unique minimum isn't guaranteed. Moreover, these minima can't be obtained with an analytical method, only with an iterative one.
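The two-term cost can be written down directly; this sketch assumes, as is common for weight decay, that the penalty is the sum of squared weights and that the bias is not penalized (the slides do not spell out that detail):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(X, w, b, d, alpha):
    # First term: the MSE loss between desired and real outputs
    y = logistic(X @ w + b)
    loss = np.mean((d - y) ** 2)
    # Second term: weight decay, weighted by the regularization parameter alpha
    # (assumption: the bias b is left out of the penalty)
    return loss + alpha * np.sum(w ** 2)

X = np.array([[0.0, 1.0], [1.0, 0.0]])
d = np.array([0.0, 1.0])
w = np.array([1.0, -1.0])
print(regularized_cost(X, w, 0.0, d, 0.1))
```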
In order to solve this problem, we present here an alternative loss function that is based on the following theorem. READ THE THEOREM BRIEFLY
Roughly speaking, the idea is that minimizing the error in the output is equivalent to minimizing the error before the non-linear function, weighting it by a factor.
So now, applying the theorem, we have the new cost function. PRESS NEXT KEY The alternative loss function is the MSE, but measured before the non-linear functions. PRESS NEXT KEY The regularization term is the same as in the previous cost function. Notice that now the weights and the bias are outside the non-linear function.
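A sketch of the alternative cost for the logistic case: the desired output is mapped back through the inverse activation, and the error before the non-linearity is scaled by the derivative of the activation at the desired output. The exact placement of that weighting factor is our reading of the theorem, not a quote from the slides:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(y):
    # Inverse of the logistic function, f^{-1}; requires 0 < y < 1
    return np.log(y / (1.0 - y))

def alternative_cost(X, w, b, d, alpha):
    d_bar = logit(d)            # desired output mapped before the non-linearity
    fprime = d * (1.0 - d)      # logistic derivative evaluated at the desired output
    z = X @ w + b               # weights and bias now act outside f
    loss = np.mean((fprime * (d_bar - z)) ** 2)
    return loss + alpha * np.sum(w ** 2)
```

If the network already reproduces the desired outputs exactly, the loss term vanishes, which is one quick sanity check of the construction.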
So, to minimize the new cost function, we differentiate both terms with respect to the weights and the bias and equate the partial derivatives to zero, obtaining the equations shown in the slide.
We can rewrite the previous system to obtain a new system of (I+1)-by-(I+1) linear equations, where PRESS NEXT KEY we have the variables, which are the weights and the bias, PRESS NEXT KEY the coefficients, PRESS NEXT KEY and the independent terms. PRESS NEXT KEY So we can use an analytical method to solve this system of equations, obtaining the optimal weights and bias. This means training is very fast, with a low computational cost. Also, this system of equations has a unique minimum, except for degenerate systems. Finally, we can do incremental learning, and even parallel learning, where the training process is divided among several distributed neural networks and the results are merged to obtain the global training. In both cases, only the coefficient matrix and the independent-term vector must be stored.
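The analytical solve and the merge step can be sketched as a weighted, regularized least-squares system; the exact coefficient expressions below are our reconstruction under the assumptions of the previous sketches, not the slide's formulas:

```python
import numpy as np

def build_system(X, d_bar, fprime, alpha):
    # Accumulate the (I+1)x(I+1) coefficient matrix A and the independent-term
    # vector c; the last row/column corresponds to the bias.
    S, I = X.shape
    Xt = np.hstack([X, np.ones((S, 1))])   # inputs augmented with a bias column
    q = fprime ** 2                        # per-sample weighting from the theorem
    A = (Xt * q[:, None]).T @ Xt
    c = Xt.T @ (q * d_bar)
    reg = alpha * np.eye(I + 1)
    reg[-1, -1] = 0.0                      # weight decay on the weights only
    return A + reg, c

def solve(A, c):
    # One linear solve yields the optimal weights and bias analytically
    theta = np.linalg.solve(A, c)
    return theta[:-1], theta[-1]

# Incremental / distributed learning: systems built on separate data chunks
# are merged by simply adding their A matrices and c vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true, b_true = np.array([0.5, -1.0, 2.0]), 0.3
d_bar = X @ w_true + b_true                # noiseless linear targets for the demo
ones = np.ones(200)
A1, c1 = build_system(X[:100], d_bar[:100], ones[:100], 0.0)
A2, c2 = build_system(X[100:], d_bar[100:], ones[100:], 0.0)
w, b = solve(A1 + A2, c1 + c2)
print(w, b)  # recovers w_true and b_true on this noiseless demo
```

Only `A` and `c` need to be stored per chunk, which is what makes the incremental and parallel variants cheap.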
In order to test our algorithm, we have applied it to a classification problem and to a regression problem. PRESS NEXT KEY In both cases, we have used the logistic function as the neurons' activation function, PRESS NEXT KEY and the parameter alpha has been constrained to the interval [0, 1].
The first problem, a classification problem, has been extracted from the KDD Cup 99 competition. Each sample summarizes a connection between two hosts and is formed by 41 inputs. The goal is to classify each sample into two classes: attack or normal connection. We have 30,000 samples for training and almost 5,000 for testing.
In order to study the influence of the training set size and the regularization parameter, we have generated several training sets. To do this, we generated an initial training set of 100 samples; each new training set is formed by adding 100 new samples to the previous set, up to 2,500 samples. In this way we have 25 training sets. For each training set, several neural networks have been trained with different alphas, from 0 to 1 in steps of 0.005. This whole process has been repeated 12 times to obtain a better estimate of the true error. Finally, the regularization parameter that provides the minimum test classification error is chosen.
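The alpha sweep described above amounts to a simple grid search; here is a minimal sketch with a hypothetical error curve standing in for the real train-and-test loop:

```python
import numpy as np

# Candidate regularization parameters: 0 to 1 in steps of 0.005 (201 values)
alphas = np.round(np.arange(0.0, 1.0001, 0.005), 3)

def select_alpha(test_error, alphas):
    # Train one network per alpha and keep the value with the lowest test error
    errors = np.array([test_error(a) for a in alphas])
    return alphas[int(np.argmin(errors))]

# Hypothetical error curve with its minimum at alpha = 0.2 (illustration only)
best = select_alpha(lambda a: (a - 0.2) ** 2, alphas)
print(best)
```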
As can be seen in the figure, using regularization produces better results than not using it in all cases, mainly for small training set sizes. To check that this difference really is statistically significant, we applied a statistical test, which confirmed it. PRESS NEXT KEY Also, we only need 400 samples for the error to stabilize using regularization, while without regularization we need 700 samples.
The other problem is a regression problem, specifically the Box-Jenkins problem. It consists of estimating the concentration of CO2 in a gas furnace at a given time instant from the 4 previous concentrations and the 6 previous methane flow rates.
As in the previous problem, we have generated several training set sizes. Initially we performed a 10-fold cross-validation, using 261 samples for training and 29 for testing. In each validation round, several training sets have been generated, from 9 to 261 examples in steps of 9 samples, using the same process as in the intrusion detection problem. Then, for each training set, several neural networks have been trained and tested, varying alpha from 0 to 1 in steps of 0.001. To obtain a better estimate of the true error, mainly with small training sets, we repeated the previous process 10 times. Finally, the alpha that produces the minimum normalized MSE has been chosen.
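The 10-fold split over the 290 Box-Jenkins samples (261 training, 29 testing per round) can be sketched like this; the helper name is ours:

```python
import numpy as np

def kfold_indices(n, k, rng):
    # Shuffle the n sample indices and split them into k validation folds
    return np.array_split(rng.permutation(n), k)

# 290 samples -> 10 folds of 29; in each round one fold (29 samples) is held
# out for testing and the remaining 261 samples are used for training
folds = kfold_indices(290, 10, np.random.default_rng(0))
print([len(f) for f in folds])
```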
The results are shown in the figure. Although it seems that using regularization is worse than not using it, statistically there is no difference, except for small training sets.
In this case the neural network performs very well, and using regularization doesn't enhance the results. In order to check the generalization capability of regularization in the presence of noisy data, we have added two sources of normal random noise: one with a standard deviation that is half the standard deviation of the original time series (gamma 0.5), and the other with the same standard deviation (gamma 1).
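The noise-injection step is straightforward to write down; a minimal sketch, assuming zero-mean Gaussian noise scaled by the gamma factor (the function name is ours):

```python
import numpy as np

def add_noise(series, gamma, rng):
    # Add zero-mean Gaussian noise whose standard deviation is gamma times the
    # standard deviation of the original series (gamma = 0.5 or gamma = 1 here)
    sigma = gamma * np.std(series)
    return series + rng.normal(0.0, sigma, size=series.shape)

rng = np.random.default_rng(0)
series = rng.normal(0.0, 1.0, size=50000)   # stand-in for the CO2 time series
noisy = add_noise(series, 0.5, rng)
```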
Here we show these results together with the previous ones. As we can see, using regularization with noisy data improves the results. In fact, in both cases there is a statistically significant difference between using regularization and not using it, although in the case of gamma 0.5 this difference only exists up to a training set size of 225 samples. PRESS NEXT KEY If we search for the smallest training set from which the error stabilizes: with gamma 0.5 this size is 198, whether using regularization or not, but with gamma 1, that is, with noisier data, this size is 189 with regularization and 207 without regularization.
In conclusion, we have proposed a new supervised learning method for single-layer neural networks using regularization. Among its features, we can highlight that it allows obtaining the global optimum analytically and, hence, faster than current iterative methods. It allows incremental learning and distributed learning. And, thanks to the regularization term, it has a better generalization capability, mainly with small training sets or noisy data. We have applied it to two kinds of problems, a classification problem and a regression problem, obtaining generally better results. As future work, an analytical method to obtain the regularization parameter is being analyzed.