Two algorithms to accelerate training of
back-propagation neural networks
Vincent Vanhoucke
May 23, 1999
http://www.stanford.edu/~nouk/
Abstract
This project is aimed at developing techniques to initialize the weights of multi-
layer back-propagation neural networks, in order to speed up the rate of convergence.
Two algorithms are proposed for a general class of networks.
1 Conventions and notations
Figure 1: General notations: L-layer, N-node-per-layer neural network (inputs X, outputs Y, weights w, layer inputs x).
In this paper we restrict ourselves to back-propagation networks [1] with L layers and N nodes per layer. This restriction will allow us to derive the behaviour of the network from matrix equations.¹
L Number of layers
N Number of nodes per layer
T Size of the training set
η Backpropagation coefficient
φ() Activation function
$x^k_{i,j}$   j-th input of the k-th layer for the i-th training vector
$y_{i,j}$     Output of the j-th node for the i-th training vector
$d_{i,j}$     Desired output of the j-th node for the i-th training vector
$w^k_{i,j}$   i-th weight of the j-th node of the k-th layer
With these conventions, we can write:
$$X_i = \begin{pmatrix} x^i_{1,1} & \cdots & x^i_{1,N} \\ \vdots & \ddots & \vdots \\ x^i_{T,1} & \cdots & x^i_{T,N} \end{pmatrix} \;\; (i \in [1, L]) \qquad W_i = \begin{pmatrix} w^i_{1,1} & \cdots & w^i_{1,N} \\ \vdots & \ddots & \vdots \\ w^i_{N,1} & \cdots & w^i_{N,N} \end{pmatrix} \;\; (i \in [1, L]) \qquad (1)$$
¹ Extension to a wider variety of neural nets has not been studied yet.
$$Y = \begin{pmatrix} y_{1,1} & \cdots & y_{1,N} \\ \vdots & \ddots & \vdots \\ y_{T,1} & \cdots & y_{T,N} \end{pmatrix} \qquad D = \begin{pmatrix} d_{1,1} & \cdots & d_{1,N} \\ \vdots & \ddots & \vdots \\ d_{T,1} & \cdots & d_{T,N} \end{pmatrix} \qquad (2)$$
These notations allow us to simply express the forward propagation of the whole training set:
$$\begin{cases} X_2 = \varphi(X_1 W_1) \\ \quad\vdots \\ X_{i+1} = \varphi(X_i W_i) \\ \quad\vdots \\ Y = \varphi(X_L W_L) \end{cases} \qquad (3)$$
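To make the matrix form of equation (3) concrete, here is a minimal NumPy sketch of this forward propagation; the network size, the random initialization and the choice φ(x) = tanh(4x) (the sigmoid used later in section 3.2) are assumptions of the example, not part of the derivation.

```python
import numpy as np

def phi(x):
    # Sigmoid assumed for the example: tanh(4x), as in the experiments of section 3.2
    return np.tanh(4.0 * x)

def forward(X1, weights):
    """Propagate the whole training set (T x N matrix X1) through all L layers.

    weights is a list [W1, ..., WL] of N x N matrices; returns the list
    [X1, X2, ..., XL, Y] of layer inputs and outputs, following equation (3).
    """
    layers = [X1]
    for W in weights:
        layers.append(phi(layers[-1] @ W))
    return layers

# Toy example: L = 3 layers, N = 4 nodes per layer, T = 4 training vectors
rng = np.random.default_rng(0)
L, N, T = 3, 4, 4
X1 = rng.random((T, N))
weights = [rng.random((N, N)) for _ in range(L)]
Y = forward(X1, weights)[-1]   # T x N matrix of network outputs
```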
2 Mathematical introduction
Consider the recursion in equation 3. Successive application of the sigmoid function φ() drives the input asymptotically to the fixed points of φ(). This statement is rigorously true when no weighting is done (Wi = I), but holds statistically for a wider class of matrices.
Figure 2: Two kinds of sigmoids: left φ′(0) > 1, right φ′(0) ≤ 1.
Figure 2 shows two different behaviors, depending on the slope of the sigmoid:
$$\varphi'(0) > 1: \quad x \longrightarrow \begin{cases} 1 - \epsilon & (x > 0) \\ 0 & (x = 0) \\ -1 + \epsilon & (x < 0) \end{cases} \qquad\qquad \varphi'(0) \le 1: \quad x \longrightarrow 0 \qquad (4)$$
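The fixed-point behaviour summarized in (4) is easy to check numerically; the following sketch simply iterates the two sigmoids of figure 2, using tanh(4x) and tanh(x/4) as in the experiments of sections 3.2 and 3.3.

```python
import numpy as np

def iterate(phi, x, n=50):
    # Apply the sigmoid n times to a scalar input and return the resulting value.
    for _ in range(n):
        x = phi(x)
    return x

steep = lambda x: np.tanh(4.0 * x)   # phi'(0) = 4 > 1
flat  = lambda x: np.tanh(x / 4.0)   # phi'(0) = 1/4 <= 1

print(iterate(steep,  0.3))   # converges to the positive fixed point (about 0.999)
print(iterate(steep, -0.3))   # converges to the negative fixed point
print(iterate(steep,  0.0))   # stays at 0: the origin is an (unstable) fixed point
print(iterate(flat,   0.3))   # decays towards 0
```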
Conclusions:
1. If we fix the initial weights of the network randomly, we are statistically more likely to get an output, before adaptation of the weights, close to these convergence points. As a consequence, we might be able to transform the objectives D of the neural network to make their coefficients close to these convergence points and achieve a lower initial distortion.
2. In the case φ′(0) > 1, there are three convergence points, depending on the sign of the input. Note the instability of the origin: for an output to be zero, the input has to be zero. We will focus on this constraint later.
3. In the case φ′(0) ≤ 1, each node makes every entry it is given tend to zero. There is no natural clustering of the entries induced by the sigmoid. We won't consider this case in the following discussion. Section 3.3 will show that our algorithm performs poorly in that case.
3 Algorithm I: Initialization by Forward Propagated Orthogonalization
3.1 Case N = T
The case N = T, where the number of nodes equals the size of the training set, is not likely
to be encountered in practice. However, it allows us to derive an exact solution that holds
for that case, and extend it to a set of approximate solutions for the more general case.
3.1.1 Exact solution
If N = T, all the matrices are square, and thus we can write²:

Case T = 1:
$$D = \varphi(X_1 W_1) \;\Rightarrow\; W_1 = X_1^{-1}\,\varphi^{-1}(D) \qquad (5)$$
Case T > 1: Let α = 1 − ε be the positive fixed point of φ()³.
$$\left\{\begin{aligned} W_1 &= \alpha X_1^{-1} &&\Rightarrow\; X_2 = \alpha I \\ &\;\;\vdots \\ W_i &= \alpha I &&\Rightarrow\; X_{i+1} = \alpha I \\ &\;\;\vdots \\ W_L &= \alpha^{-1}\varphi^{-1}(D) &&\Rightarrow\; Y = D \end{aligned}\right. \qquad (6)$$
This set of weights ensures that we aim at the exact solution without any adaptation.
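As an illustration, a minimal sketch of the exact construction (6), assuming N = T, an invertible X1, and the sigmoid φ(x) = tanh(4x) of section 3.2, whose inverse is φ⁻¹(y) = arctanh(y)/4 and whose positive fixed point is α ≈ 0.999:

```python
import numpy as np

def phi(x):
    return np.tanh(4.0 * x)

def phi_inv(y):
    # Inverse of tanh(4x); the entries of y must lie strictly inside (-1, 1)
    return np.arctanh(y) / 4.0

def exact_init(X1, D, L, alpha=0.9993):
    """Exact weight set of equation (6) for the case N = T with X1 invertible.

    W1 sends X1 to alpha*I, the middle layers preserve alpha*I through the
    fixed point of phi, and WL maps alpha*I onto phi_inv(D), so that
    Y = phi(alpha*I @ WL) = D.  alpha is the positive fixed point of tanh(4x).
    """
    N = X1.shape[0]
    W1 = alpha * np.linalg.inv(X1)
    middle = [alpha * np.eye(N) for _ in range(L - 2)]
    WL = phi_inv(D) / alpha
    return [W1] + middle + [WL]

# With the forward() sketch given after equation (3),
# forward(X1, exact_init(X1, D, L))[-1] reproduces D up to the small
# saturation error 1 - alpha.
```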
3.1.2 Low complexity asymptotically optimal solution

We might not be able, or not be willing, to invert the matrix X1. In that case, we can derive an approximate solution by letting WL = α⁻¹φ⁻¹(D), which makes XL = αI the new objective.
² Subject to X1 being invertible.
³ This excludes the case φ′(0) ≤ 1.
Remember that in section 2 we arrived at the conclusion that we might be able to take advantage of a change in the objectives D, so that their coefficients are only fixed points of the sigmoid. Here we achieve this by letting the new goal be XL = αI.
Procedure:
• Step 1:
$$W_1 \to \begin{pmatrix} \tilde y_1 & \tilde x_1^1 & \cdots & \tilde x_1^i & \cdots & \tilde x_1^{N-1} \end{pmatrix}, \qquad X_1 = \begin{pmatrix} x_1^1 \\ \vdots \\ x_1^i \\ \vdots \\ x_1^N \end{pmatrix} \;\longrightarrow\; \begin{pmatrix} \alpha & 0 & 0 & \cdots & 0 \\ 0 & * & * & \cdots & * \\ * & * & * & \cdots & * \\ \vdots & \vdots & \vdots & & \vdots \end{pmatrix} \qquad (7)$$

With:
$$\begin{aligned} \tilde x_1^i \text{ given by}^1 &: \; \forall i \in [1, N-1], \;\; \langle x_1^1 \,|\, \tilde x_1^i \rangle = 0 \\ \tilde y_1 \text{ given by}^2 &: \; \langle x_1^2 \,|\, \tilde y_1 \rangle = 0 \;\text{ and }\; \langle x_1^1 \,|\, \tilde y_1 \rangle = \alpha \end{aligned} \qquad (8)$$
• Step 2:
$$W_2 \to \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \tilde x_2^1 & \tilde y_2 & \tilde x_2^2 \;\cdots\; \tilde x_2^{N-1} \end{pmatrix}, \qquad X_2 = \begin{pmatrix} \alpha & 0 \;\cdots\; 0 \\ 0 & x_2^1 \\ * & x_2^2 \\ \vdots & \vdots \end{pmatrix} \;\longrightarrow\; \begin{pmatrix} \alpha & 0 & 0 & \cdots & 0 \\ 0 & \alpha & 0 & \cdots & 0 \\ * & 0 & * & \cdots & * \\ \vdots & \vdots & \vdots & & \vdots \end{pmatrix} \qquad (9)$$

With:
$$\begin{aligned} \tilde x_2^i \text{ given by} &: \; \forall i \in [1, N-1], \;\; \langle x_2^1 \,|\, \tilde x_2^i \rangle = 0 \\ \tilde y_2 \text{ given by}^3 &: \; \langle x_2^2 \,|\, \tilde y_2 \rangle = 0 \;\text{ and }\; \langle x_2^1 \,|\, \tilde y_2 \rangle = \alpha \end{aligned} \qquad (10)$$
¹ Note that the $\tilde x_1^i$ do not need to all be distinct. rank{$\tilde x_1^i$, i ∈ [1, N−1]} > 1 is a necessary condition, as shown in step two, but there might be some more requirements to ensure that the procedure works in all cases.
² The second constraint can be weakened to $\langle x_1^1 \,|\, \tilde y_1 \rangle > 0$. The remarks of section 2 apply, ensuring that the coefficient will naturally converge to α.
³ This can never be achieved if rank{$\tilde x_1^i$, i ∈ [1, N−1]} = 1.

• Step 3:
$$W_3 \to \begin{pmatrix} 1 & 0 & \cdots & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \tilde x_3^1 & \tilde x_3^2 & \tilde y_3 & \tilde x_3^3 \;\cdots\; \tilde x_3^{N-1} \end{pmatrix}, \qquad X_3 = \begin{pmatrix} \alpha & 0 & 0 \;\cdots\; 0 \\ 0 & \alpha & 0 \;\cdots\; 0 \\ \beta & 0 & x_3^1 \\ * & * & x_3^2 \\ \vdots & \vdots & \vdots \end{pmatrix} \;\longrightarrow\; \begin{pmatrix} \alpha & 0 & 0 & 0 & \cdots & 0 \\ 0 & \alpha & 0 & 0 & \cdots & 0 \\ 0 & 0 & \alpha & 0 & \cdots & 0 \\ * & * & 0 & * & \cdots & * \\ \vdots & \vdots & \vdots & \vdots & & \vdots \end{pmatrix} \qquad (11)$$

With:
$$\begin{aligned} \tilde x_3^1 \text{ given by} &: \; \langle [\beta \;\; x_3^1] \,|\, [1 \;\; \tilde x_3^1] \rangle = 0 \\ \tilde x_3^i \;(i > 1) \text{ given by} &: \; \forall i \in [2, N-1], \;\; \langle x_3^1 \,|\, \tilde x_3^i \rangle = 0 \\ \tilde y_3 \text{ given by} &: \; \langle x_3^2 \,|\, \tilde y_3 \rangle = 0 \;\text{ and }\; \langle x_3^1 \,|\, \tilde y_3 \rangle = \alpha \end{aligned} \qquad (12)$$
• And so on...
This procedure does a step-by-step diagonalization of the matrix, which is fully completed in N steps. We will see later that N layers are not required for the algorithm to perform well: even a partial diagonalization reduces the error by a significant amount.
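For concreteness, the sketch below implements Step 1 of this procedure in NumPy. The least-norm solve used for $\tilde y_1$ and the SVD basis used for the orthogonal directions $\tilde x_1^i$ are implementation choices of this sketch rather than prescriptions of the procedure; α ≈ 0.999 again assumes φ(x) = tanh(4x).

```python
import numpy as np

def step_one(X1, alpha=0.9993):
    """Step 1 of Algorithm I (equations (7)-(8)), as a sketch.

    Builds W1 = [y1 | x1_tilde^1 ... x1_tilde^(N-1)] so that the first row of
    X1 @ W1 is (alpha, 0, ..., 0) and its (2, 1) entry is 0, as in equation (7).
    """
    N = X1.shape[1]
    r1, r2 = X1[0], X1[1]                # first two rows of X1

    # y1: least-norm solution of <r1 | y1> = alpha and <r2 | y1> = 0
    A = np.vstack([r1, r2])
    y1 = np.linalg.lstsq(A, np.array([alpha, 0.0]), rcond=None)[0]

    # x1_tilde^i: a basis of the orthogonal complement of r1 (each column
    # satisfies <r1 | x> = 0), taken from the SVD of r1
    _, _, Vt = np.linalg.svd(r1.reshape(1, -1))
    ortho = Vt[1:].T                     # N x (N-1) matrix of orthogonal directions

    return np.column_stack([y1, ortho])

# After W1 = step_one(X1), phi(X1 @ W1) has the structure shown in (7);
# steps 2, 3, ... repeat the construction on the remaining lower-right block.
```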
3.2 Experiments: Case N = 4
3.2.1 Algorithm I applied to a 4-node, L-layer network, N = T
See figure 3. The detailed experimental protocol is:
Number of layers 1 to 6
Number of nodes per layer 4
Size of training set 4
Sigmoid φ(x) = tanh(4x)
Adaptation rule Back propagation
Error measure MSE averaged on all training sequences
Input vectors Random coefficients between 0 and 1
Desired output Random coefficients between 0 and 1
Reference Random weights between 0 and 1
Number of training iterations 200
Number of experiments averaged 100
The adaptation time is greatly reduced by the algorithm. The initial transient phase, during which the network searches for a stable configuration, is removed. However, in this case, a perfect fit without any adaptation can be achieved using the inversion procedure described in section 3.1.1.
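To make the protocol concrete, a minimal batch back-propagation loop matching the table above is sketched below; the learning rate η and the use of full-batch updates are assumptions of the sketch, since the paper does not specify them.

```python
import numpy as np

def phi(x):
    return np.tanh(4.0 * x)

def train_mse(X1, D, weights, eta=0.01, iterations=200):
    """Plain batch back-propagation on the whole training set, returning the
    summed squared error at each iteration (a sketch of the reference protocol)."""
    history = []
    for _ in range(iterations):
        # Forward pass, keeping every layer input as in equation (3)
        layers = [X1]
        for W in weights:
            layers.append(phi(layers[-1] @ W))
        Y = layers[-1]
        history.append(np.sum((Y - D) ** 2))

        # Backward pass; delta = dE/d(pre-activation). The 1/2 factor of the
        # squared error is absorbed into eta, and phi'(z) = 4(1 - phi(z)^2).
        delta = (Y - D) * 4.0 * (1.0 - Y ** 2)
        for i in range(len(weights) - 1, -1, -1):
            grad = layers[i].T @ delta
            if i > 0:
                # propagate through the not-yet-updated weights[i]
                delta = (delta @ weights[i].T) * 4.0 * (1.0 - layers[i] ** 2)
            weights[i] = weights[i] - eta * grad
    return history
```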
Figure 3: MSE as a function of training iterations, for one to six layers; each panel compares random weights with Algorithm I.
3.2.2 Case N < T
When the training set is bigger than N, a solution consists of adapting the initial weights
to a subset of size N of the training set, using one of the two techniques described before.
The results shown in figures 4 and 5 demonstrate that in that case, Algorithm I is the one that performs best. The inversion procedure 'overfits' the reduced portion of the training set, and thus the adaptation time required to train the other elements of the training set is much higher.
Figure 4: MSE as a function of training iterations for a doubled training set: random weights, inversion procedure, and Algorithm I.
Figure 5: MSE averaged over 200 iterations as a function of the size of the training set: random weights and Algorithm I.
3.3 Case φ′(0) ≤ 1
Theoretically, the procedure described should work poorly if φ′(0) ≤ 1 (see section 2). Figure 6 shows that this is indeed the case.
Figure 6: MSE as a function of training iterations with φ(x) = tanh(x/4): random weights and Algorithm I, for φ′(0) > 1 and φ′(0) < 1.
4 Algorithm II: Initialization by Joint Diagonalization
4.1 Description
In the description of the exact procedure in section 3.1.1, the key step was to obtain:
$$X_2 = \alpha I \quad \text{and} \quad X_L = \alpha I \qquad (13)$$
Now, considering a larger training set, we are forced to weaken these constraints in order to be able to work with non-square matrices.
Let us consider a training set of size T = kN, a multiple of the number of nodes in each layer. Xi and D can then be decomposed into square blocks:
$$X_i = \begin{pmatrix} X_i^1 \\ \vdots \\ X_i^k \end{pmatrix} \quad \text{and} \quad D = \begin{pmatrix} D^1 \\ \vdots \\ D^k \end{pmatrix} \qquad (14)$$
We can achieve a fairly good separation of the elements of the training set by transforming the constraints in (13) into:
$$X_2 = X_L = \begin{pmatrix} \alpha I \\ \vdots \\ \alpha I \end{pmatrix} \qquad (15)$$
If we can meet these constraints, each separated bin will contain at most k elements. Using equation 3, we can see that meeting these constraints is equivalent to being able to invert $X_1^1, \dots, X_1^k$ with a common inverse W1, as well as inverting $D^1, \dots, D^k$ with WL. This cannot be done strictly, but by weakening equality (15) one step further, we might be able to achieve joint diagonalization. If X2 and XL are block-diagonal, the remarks made in section 2 apply and the blocks are good approximations of the objectives αI.
The joint diagonalization theorem states that:

Given A1, ..., An a set of normal matrices which commute, there exists an orthogonal matrix P such that $P^\top A_1 P, \dots, P^\top A_n P$ are all diagonal.
If the matrices are not normal, and/or do not commute, we can achieve a close result by minimization of the off-diagonal terms of the set of matrices, as shown in [2]. With:
$$\mathrm{off}(A) = \sum_{1 \le i \ne j \le N} |a_{i,j}|^2 \qquad (16)$$
the minimization criterion⁴ is:
$$P = \arg \inf_{P \,:\, P^\top P = I} \; \sum_{i=1}^{k} \mathrm{off}\!\left(P^\top A_i P\right) \qquad (17)$$
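As a small sketch of how the criterion (16)-(17) is evaluated, the function below measures off(PᵀAᵢP) for a candidate orthogonal P; the Jacobi-angles minimization itself is provided as Matlab code in [3] and is not re-implemented here.

```python
import numpy as np

def off(A):
    # Sum of squared off-diagonal terms, equation (16)
    return np.sum(np.abs(A) ** 2) - np.sum(np.abs(np.diag(A)) ** 2)

def joint_diag_criterion(P, matrices):
    # Criterion of equation (17): sum of off(P^T A_i P) over the set of matrices
    return sum(off(P.T @ A @ P) for A in matrices)

# Example: the common eigenvector basis of two commuting symmetric matrices
# drives the criterion to (numerically) zero.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # random orthogonal basis
A1 = Q @ np.diag(rng.standard_normal(4)) @ Q.T
A2 = Q @ np.diag(rng.standard_normal(4)) @ Q.T     # commutes with A1
print(joint_diag_criterion(Q, [A1, A2]))           # ~ 0
print(joint_diag_criterion(np.eye(4), [A1, A2]))   # > 0
```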
As in equation 3, the objective is to diagonalize:
$$\varphi^{-1}(X_2^i) = X_1^i W_1 \quad \text{and} \quad X_L^i = \varphi^{-1}(D^i)\, W_L^{-1}, \qquad i \in [1, k] \qquad (18)$$
which can be expressed formally as:
$$\underbrace{\Delta_i}_{\varphi^{-1}(X_2^i)} = \underbrace{P^\top}_{W_1} \underbrace{A_i P}_{X_1^i} \qquad \text{and} \qquad \underbrace{\Delta_i}_{X_L^i} = P'^\top \underbrace{A_i}_{\varphi^{-1}(D^i)} \underbrace{P'}_{W_L} \qquad (19)$$
As a consequence, provided we are willing to change the inputs X1 and the desired outputs D by a bijective transform, X1 → X1P and D → φ(P′ᵀφ⁻¹(D)) (applied block-wise), then setting W1 = Pᵀ and WL = P′ provides a good approximation of the ideal case.

⁴ A Matlab function performing this operation is given at the URL [3].
4.2 Algorithm
$$X_i = \begin{pmatrix} X_i^1 \\ \vdots \\ X_i^k \end{pmatrix} \quad \text{and} \quad D = \begin{pmatrix} D^1 \\ \vdots \\ D^k \end{pmatrix}$$

P : best orthogonal matrix diagonalizing $X_1^1, \dots, X_1^k$
P′ : best orthogonal matrix diagonalizing $\varphi^{-1}(D^1), \dots, \varphi^{-1}(D^k)$

$$X_1 \;\longrightarrow\; \cdot P \;\longrightarrow\; \big[\text{$L$-layer, $N$-node neural net: } W_1, \dots, W_L\big] \;\longrightarrow\; P' \cdot \;\longrightarrow\; D$$

$$W_1 = P^\top, \qquad W_i = \alpha I \;\; \forall i \in [2, L-1], \qquad W_L = P'$$
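A sketch of this initialization pipeline is given below. It assumes a helper joint_diag() returning the orthogonal matrix that minimizes criterion (17) for a set of square blocks (for instance a port of the Matlab code of [3]; it is not implemented here), uses the tanh(4x) sigmoid of the experiments, and follows the row-vector convention of equation (3); the exact side on which P and P′ are applied, and the transposes, are choices made here to keep the sketch self-consistent.

```python
import numpy as np

def phi(x):     return np.tanh(4.0 * x)
def phi_inv(y): return np.arctanh(y) / 4.0

def algorithm_two_init(X1, D, L, joint_diag, alpha=0.9993):
    """Sketch of the Algorithm II initialization (section 4.2) for T = k*N.

    joint_diag(blocks) is assumed to return an orthogonal N x N matrix that
    approximately minimizes the off-diagonal criterion (17) for the given
    square blocks.  Returns the transformed inputs, the transformed
    objectives and the initial weights.
    """
    T, N = X1.shape
    k = T // N
    X_blocks = [X1[i * N:(i + 1) * N] for i in range(k)]
    D_blocks = [phi_inv(D[i * N:(i + 1) * N]) for i in range(k)]

    P  = joint_diag(X_blocks)                    # for the input blocks
    Pp = joint_diag(D_blocks)                    # for the phi^-1(D) blocks

    # Bijective transforms of inputs and objectives (transpose placement is a
    # choice of this sketch, made so that X2 and XL come out block-diagonal).
    X1_new = np.vstack([P.T @ Xb for Xb in X_blocks])
    D_new  = np.vstack([phi(Pp.T @ Db) for Db in D_blocks])

    weights = [P] + [alpha * np.eye(N) for _ in range(L - 2)] + [Pp.T]
    return X1_new, D_new, weights
```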
4.3 Experiments
Experiments made using the same protocol (see figure 7) clearly show the gain in performance of Algorithm II. Note that even when the objectives are identical, the gain in speed of convergence is significant.
Figure 7: MSE as a function of training iterations for k = 5 (20 training elements): random weights, random weights with modified objectives, and Algorithm II.
5 Comparison between the two algorithms
Figure 8 shows the difference in performance between Algorithms I and II. The averaged MSE is not a very good quantitative indicator of the speed of convergence, but the orders of magnitude are consistent with the data provided by the learning curves.
Figure 8: MSE averaged over 200 iterations as a function of the size of the training set: random weights, Algorithm I, and Algorithm II.
Algorithm II obviously performs better, but with the restriction of having to apply a transform on the data before feeding the neural network. On the other hand, Algorithm I is computationally very inexpensive, but by design it is unlikely to perform well on very large training sets.
6 Conclusion
The two algorithms proposed speed up the learning rate of multi-layer back-propagation neural networks. Constraints on these networks include:
• A fixed number of nodes per layer
• No constrained weights in the network
• An activation function with slope greater than one at the origin.
Adaptation of these techniques to different classes of networks is still to be explored.
References
[1] Simon Haykin, Neural Networks: A Comprehensive Foundation, Second Edition, Prentice Hall, 1999.

[2] Jean-François Cardoso and Antoine Souloumiac, "Jacobi angles for simultaneous diagonalization", SIAM J. Mat. Anal. Appl., vol. 17, no. 1, pp. 161-164, Jan. 1996. ftp://sig.enst.fr/pub/jfc/Papers/siam_note.ps.gz

[3] Description of the joint diagonalization algorithm and Matlab code: http://www-sig.enst.fr/~cardoso/jointdiag.html