Design of airfoil using backpropagation training with mixed approach
1-s2.0-S092523121401087X-main
1. Multiple optimal learning factors for the multi-layer perceptron
Sanjeev S. Malalur, Michael T. Manry n
, Praveen Jesudhas
University of Texas at Arlington, Department of Electrical Engineering, Nedderman Hall, Room 517, Arlington, TX 76019, USA
a r t i c l e i n f o
Article history:
Received 10 April 2010
Received in revised form
14 June 2014
Accepted 21 August 2014
Communicated by: G. Thimm
Available online 11 September 2014
Keywords:
Multilayer perceptron
Newton's method
Hessian
Orthogonal least squares
Multiple optimal learning factor
Whitening transform
a b s t r a c t
A batch training algorithm is developed for a fully connected multi-layer perceptron, with a single
hidden layer, which uses two-stages per iteration. In the first stage, Newton's method is used to find a
vector of optimal learning factors (OLFs), one for each hidden unit, which is used to update the input
weights. Linear equations are solved for output weights in the second stage. Elements of the new
method's Hessian matrix are shown to be weighted sums of elements from the Hessian of the whole
network. The effects of linearly dependent inputs and hidden units on training are analyzed and an
improved version of the batch training algorithm is developed. In several examples, the improved
method performs better than first order training methods like backpropagation and scaled conjugate
gradient, with minimal computational overhead and performs almost as well as Levenberg–Marquardt, a
second order training method, with several orders of magnitude fewer multiplications.
& 2014 Elsevier B.V. All rights reserved.
1. Introduction
Multi-layer perceptron (MLP) neural networks are widely used
for regression and classification applications in the areas of
parameter estimation [1,2], document analysis and recognition
[3], finance and manufacturing [4] and data mining [5]. Due to its
layered, parallel architecture, the MLP has several favorable
properties such as universal approximation [6] and the ability to
mimic Bayes discriminant [7] and maximum a-posteriori (MAP)
estimates [8]. However, actual MLP performance is adversely
affected by the limitations of the available training algorithms.
Common batch training algorithms for the MLP include first
order methods such as backpropagation (BP) [9] and conjugate
gradient [10] and second order learning methods related to New-
ton's method. First order methods generally use fewer operations
per iteration but require more iterations than second order methods.
Newton's method for training the MLP often has non-positive
definite [11,12] or even singular Hessians. Hence Levenberg–Mar-
quardt (LM) [13,14] and other quasi-Newton methods are used
instead. Unfortunately, second order methods do not scale well.
Although first order methods scale better, they are sensitive to input
means and gains [15], since they lack affine invariance. Layer-by-
layer approaches [16] also exist for optimizing the MLP. Such
approaches (i) divide the network weights into disjoint subsets,
and (ii) train the subsets separately. Some network weight sub-
sets have nonsingular Hessians, allowing them to be trained via
Newton's method. For example, solving linear equations for output
weights [15,17–19] is actually Newton's algorithm for the output
weights. The optimal learning factor (OLF) [20] for BP training is
found using a one by one Hessian. If the learning factor and
momentum term gain in BP are found using a two by two Hessian
[21], the resulting algorithm is very similar to conjugate gradient.
The primary purpose of this paper is to present a family of
learning algorithms targeted towards training a fixed architecture,
fully connected multi-layer perceptron with a single hidden layer,
capable of learning from regression/approximation type application
and data. In [22] a two-stage training method was introduced, which
uses Newton's method to obtain a vector of optimal learning factors,
one for each MLP hidden unit. These learning factors are optimal
hence it was named multiple optimal learning factors (MOLF).
A variation to MOLF called the variable optimal learning factors
(VOLF) was presented in [23]. As a learning method, VOLF looks
promising. However the original MOLF was still better in terms of
training performance. In this paper, we explain in detail the motiva-
tion behind the original MOLF algorithm using the concept of
equivalent networks. We analyze the structure of the MOLF Hessian
matrix and demonstrate how linear dependency among inputs and
hidden units can impact the structure, ultimately affecting training
performance. This leads to the development of an improved version
of the MOLF whose performance is not impacted by the presence of
linear dependencies and has an overall performance better than the
Contents lists available at ScienceDirect
journal homepage: www.elsevier.com/locate/neucom
Neurocomputing
http://dx.doi.org/10.1016/j.neucom.2014.08.043
0925-2312/& 2014 Elsevier B.V. All rights reserved.
n
Corresponding author. Present address: Department of Electrical Engineering, The
University of Texas at Arlington, Arlington, TX 76019, USA. Tel.: þ1 817 272 3483.
E-mail addresses: Sanjeev.malalur@gmail.com (S.S. Malalur),
manry@uta.edu (M.T. Manry).
Neurocomputing 149 (2015) 1490–1501
2. original MOLF. The rest of the paper is organized as follows. Section 2
reviews MLP notation, a simple first order training method, and an
expression for the OLF. The MOLF method is presented in Section 3
motivated by a discussion of equivalent networks. Section 4
analyzes the effect of linearly dependent inputs and hidden units
on MOLF learning. An improvement to the MOLF algorithm is
presented in Section 5 that is immune to the presence of linearly
dependent inputs and hidden units. In Section 6, our methods are
compared to various existing first and second order algorithms.
Section 7 summarizes our conclusions.
2. Review of multi-layer perceptron
First, we introduce the notation used in the rest of the paper
and review a simple two-stage algorithm that makes limited use of
Newton's method. Observations from the two-stage method are
used to motivate the MOLF approach.
2.1. MLP notation
A fully connected feed-forward MLP network with a single
hidden layer is shown in Fig. 1. It consists of a layer for inputs, a
single layer of hidden units and a layer of outputs. The input
weights w(k, n) connect the nth input to the kth hidden unit.
Output weights woh(m, k) connect the kth hidden unit's activation
op(k) to the mth output yp(m), which has a linear activation.
The bypass weight woi(m, n) connects the nth input to the mth
output. The training data, described by the set {xp, tp} consists of
N-dimensional input vectors xp and M-dimensional desired output
vectors, tp. The pattern number p varies from 1 to Nv, where Nv
denotes the number of training vectors present in the data set.
In order to handle thresholds in the hidden and output layers, the
input vectors are augmented by an extra element xp(Nþ1) where,
xp(Nþ1)¼1, so xp¼[xp(1), xp(2),…, xp(Nþ1)]T
. Let Nh denote the
number of hidden units. The vector of hidden layer net functions, np
and the output vector of the network, yp can be written as
np ¼ W Uxp;
yp ¼ Wio U xp þ Woh Uop ð1Þ
where the kth element of the hidden unit activation vector op is
calculated as op(k)¼f(np(k)) and f(∙) denotes the hidden layer
activation function such as the sigmoid activation [9], which is the
one used throughout this paper. In this paper, the cost function used
in MLP training is the classic mean squared error between the
desired and the actual network outputs, defined as
E ¼
1
Nv
∑
Nv
p ¼ 1
∑
M
m ¼ 1
½tp mð ÞÀyp mð ÞŠ2
ð2Þ
2.2. Training using output weight optimization-backpropagation
Here, we introduce output weight optimization-backpropagation
[17] (OWO-BP), a first order algorithm that groups the weights in the
network into two subsets and trains them separately. In a given
iteration of OWO-BP, the training proceeds from (i) finding the
output weight matrices, Woh and Woi connected to the network
outputs to (ii) separately training the input weights W using BP.
Output weight optimization (OWO) is a technique for calculating
the network's output weights [18]. From (1) it is clear that our
MLP's output layer activations are linear. Therefore, finding the
weights connected to the outputs is equivalent to solving a system
of linear equations. The expression for the actual outputs given in
(1) can be re-written as
yp ¼ Wo Uxp ð3Þ
where xp ¼[xT
p, oT
p]T
is the augmented input column vector or basis
vector of size Nu where Nu ¼NþNhþ1. The M by Nu output weight
matrix Wo, formed as [Woi: Woh], contains all the output weights.
The output weights can be found by setting ∂E/∂Wo¼0 which
leads to a set of linear equations given by
Cx ¼ Rx UWT
o ð4Þ
where
Cx ¼
1
Nv
∑
Nv
p ¼ 1
xp UtT
p
Rx ¼
1
Nv
∑
Nv
p ¼ 1
xp UxT
p
Eq. (4) comprises of M sets of Nu linear equations in Nu unknowns and
is most easily solved using orthogonal least squares (OLS) [24]. In the
second half of an OWO-BP iteration, the input weight matrix W is
updated as
W’W þzUG ð5Þ
where G is the generic representation of a direction matrix that
contains information about the direction of learning and the learning
factor, z, contains information about the step length to be taken in the
direction G. For backpropagation, the direction matrix is nothing but
the Nh by (Nþ1) negative input weight Jacobian matrix computed as
G ¼
∂E
∂W
G ¼
1
Nv
∑
Nv
p ¼ 1
δp UxT
p ð6Þ
Here δp¼[δp(1), δp(2),…, δp(Nh)]T
is the Nh by 1 column vector of
hidden unit delta functions [9]. A brief outline of training using
OWO-BP is given below. For every training iteration,
1. In the first stage, solve the system of linear equations in (4)
using OLS and update the output weights, Wo;
2. In the second stage, begin by finding the negative Jacobian
matrix G described in Eq. (6);
3. Then, update the input weights, W, using Eq. (5).
This method is attractive for a couple of reasons. First, the
training is faster than BP, since training weights connected to theFig. 1. A fully connected multi-layer perceptron.
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1491
3. outputs is equivalent to solving linear equations [16] and second, it
helps us avoid some local minima.
2.3. Optimal learning factor
The choice of learning factor z in Eq. (5) has a direct effect on
the convergence rate of OWO-BP. Early steepest descent methods
used a fixed constant, with slow convergence, while later methods
used a heuristic scaling approach to modify the learning factor
between iterations and thus speed up the rate of convergence.
However, using Taylor's series for the error E, in Eq. (2), a non-
heuristic optimal learning factor (OLF) for OWO-BP can be derived
[20] as,
z ¼
À∂E=∂z
∂2
E=∂z2
z ¼ 0
ð7Þ
where the numerator and denominator derivatives are evaluated
at z¼0. The expression for the second derivative of the error with
respect to the OLF is found using (2) as,
∂2
E
∂z2
¼ ∑
Nh
k ¼ 1
∑
Nh
j ¼ 1
∑
N þ 1
n ¼ 1
gðk; nÞU ∑
N þ 1
i ¼ 1
∂2
E
∂w k; nð Þ∂w j; ið Þ
g j; ið Þ ¼ ∑
Nh
k ¼ 1
∑
Nh
j ¼ 1
gT
k Hk;j
R gj
ð8Þ
where column vector gk contains elements g(k, n) of G, for all
values of n. HR is the input weight Hessian with Niw rows and
columns, where Niw ¼(Nþ1) Nh is the number of input weights.
Clearly, HR is reduced in size compared to H, the Hessian for all
weights in the entire network. Hk;j
R contains elements of HR for all
input weights connected to the jth and kth hidden units and has
size (Nþ1) by (Nþ1). When Gauss–Newton [12] updates are used,
elements of HR are computed as
∂2
E
∂wðj; iÞ∂w k; nð Þ
¼ 2
Nv
u j; kð Þ ∑
Nv
p ¼ 1
xp ið Þxp nð Þop' jð Þop' kð Þ
u j; kð Þ ¼ ∑
M
m ¼ 1
woh m; jð Þwohðm; kÞ ð9Þ
o0
p(k) denotes the first partial derivative of op(k) with respect to its
net function. Because (9) represents the Gauss–Newton approx-
imation to the Hessian, it is positive semi-definite. Eq. (8) shows
that (i) the OLF can be obtained from elements of the Hessian HR,
(ii) HR contains useful information even when it is singular, and
(iii) a smaller non-singular Hessian ∂2
E/∂z2
can be constructed
using HR.
2.4. Motivation for further work
Even though the OWO stage of OWO-BP is equivalent to
Newton's algorithm for the output weights, the overall algorithm
is only marginally better than standard BP, and so it is rarely used.
This occurs because the BP stage is first order. The one by one
Hessian used in the OLF calculations is the only one associated
with the input weight matrix W. Therefore, improving the perfor-
mance of OWO-BP may require that we develop a larger Hessian
matrix associated with W.
3. Multiple optimal learning factor algorithm
In this section, we attempt to improve input weight training
by deriving a vector of optimal learning factors, z, which contains
one element for each hidden unit. The resulting improvement to
OWO-BP is called the multiple optimal learning factors (MOLF)
training method. We begin by first introducing the concept of
equivalent networks and identify the conditions under which two
networks can generate the same outputs.
3.1. Discussion of equivalent networks
Let MLP 1 have a net function vector np and output vector yp as
discussed previously in Section 2.1. In a second network called
MLP 2, which is trained similarly, the net function vector and the
output vector are respectively denoted by np' and yp'. The bypass
and output weight matrices along with the hidden unit activation
functions are considered to be equal for both the MLP 1 and MLP 2.
Based on this information the output vectors for MLP 1 and MLP
2 are respectively,
yp ið Þ ¼ ∑
N þ1
n ¼ 1
woi i; nð Þxp nð Þþ ∑
Nh
k ¼ 1
woh i; kð Þf ðnp kð ÞÞ ð10Þ
yp' ið Þ ¼ ∑
N þ 1
n ¼ 1
woi i; nð Þxp nð Þþ ∑
Nh
k ¼ 1
woh i; kð Þf np' kð Þ
À Á
ð11Þ
In order to make MLP 2 strongly equivalent to MLP 1 their
respective output vectors yp and yp' have to be equal. Based on
Eqs. (10) and (11) this can be achieved only when their respective
net function vectors are equal [25]. MLP 2 can be made strongly
equivalent to MLP 1 by linearly transforming its net function
vector np' before passing it as an argument to the activation
function f(). This process is equivalent to directly using the net
function np in MLP 2, denoted as,
np ¼ C Unp' ð12Þ
The linear transformation matrix C used to transform np' in
Eq. (12) is of size Nh-by-Nh. This process of transforming the net
function vector of MLP 2 is represented in Fig. 2.
On applying the linear transformation to the net function
vector n0
p in Eq. (12), MLP 2 is made strongly equivalent to
MLP1. We then get,
yp' ið Þ ¼ yp ið Þ ¼ ∑
N þ1
n ¼ 1
woi i; nð Þxp nð Þþ ∑
Nh
k ¼ 1
woh i; kð Þf ∑
Nh
m ¼ 1
c k; mð Þnp' mð Þ
!
ð13Þ
After making MLP 2 strongly equivalent to MLP 1, the net function
vector np can be related to its input weights W and the net
function vector np' as,
np kð Þ ¼ ∑
N þ1
n ¼ 1
w k; nð Þxp nð Þ ¼ ∑
Nh
m ¼ 1
c k; mð Þnp' ðmÞ ð14Þ
Similarly the MLP 2 net function vector np' can be related to its
weights W0
as
np' mð Þ ¼ ∑
N þ 1
n ¼ 1
w' m; nð Þxp nð Þ ð15Þ
Fig. 2. Graphical representation of net function vectors.
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–15011492
4. On substituting (15) into (14) we get,
npðkÞ ¼ ∑
N þ 1
n ¼ 1
wðk; nÞxpðnÞ ¼ ∑
N þ 1
n ¼ 1
∑
Nh
m ¼ 1
cðk; mÞUw'ðm; nÞxpðnÞ ð16Þ
In (16) we see that the input weight matrices of MLPs 1 and 2 are
linearly related as
W ¼ C UW' ð17Þ
If the elements of the C matrix are found through an optimality
criterion, then optimal input weights W0
can be computed as
W' ¼ C À 1
UW ð18Þ
3.2. Transformation of net function vectors
In Section 3.1 we showed a method for generating a network,
MLP 2, which is strongly equivalent to the original network, MLP 1.
In this section, we show that equivalent networks train differently.
The output of the transformed network is given in Eq. (14).
The elements of the negative Jacobian matrix G' of MLP 2 found
from Eqs. (2) and (14) are defined as,
g' u; vð Þ ¼ À
∂E
∂w' u; vð Þ
¼
2
Nv
∑
Nv
p ¼ 1
∑
M
i ¼ 1
tp ið ÞÀyp ið Þ
h i
U
∂yp ið Þ
∂w' u; vð Þ
∂yp ið Þ
∂w' u; vð Þ
¼ ∑
Nh
k ¼ 1
woh i; kð Þop' kð Þc k; uð ÞxpðvÞ ð19Þ
o0
p(k) denotes the first partial derivative of op(k). Rearranging
terms in (19) results in,
g'ðu; vÞ ¼ ∑
Nh
k ¼ 1
cðk; uÞ
2
Nv
∑
Nv
p ¼ 1
∑
M
i ¼ 1
½tpðiÞÀypðiÞŠU wohði; kÞo'pðkÞxpðvÞ
¼ ∑
Nh
k ¼ 1
cðk; uÞ
˶E
∂wðk; vÞ
ð20Þ
This is abbreviated as
G' ¼ CT
UG ð21Þ
The input weight update equation for MLP 2 based on its Jacobian
vector G0
is given by,
W' ¼ W' þzUG'
On pre-multiplying this equation by C and using Eqs. (18) and (21)
we obtain
W ¼ W þzUG'' ð22Þ
where,
G'' ¼ RUG
R ¼ C UCT
ð23Þ
Lemma 1. For a given R matrix in Eq. (23), the number of C matrices
is infinite.
Continuing, the transformed Jacobian G'' of MLP 1 is given in
terms of the original Jacobian G as,
g''ðu; vÞ ¼ ∑
Nh
k ¼ 1
rðu; kÞgðk; vÞ
gðk; vÞ ¼
˶E
∂wðk; vÞ
¼
2
Nv
∑
Nv
p ¼ 1
∑
M
i ¼ 1
½tpðiÞÀypðiÞŠUwohði; kÞo0
pðkÞxpðvÞ ð24Þ
where o'p(k) denotes the derivative of op(k) with respect to its net
function. Eqs. (22) and (23) show that valid input weight update
matrices G'' can be generated by simply transforming the weight
update matrix G from MLP 1, using an autocorrelation matrix R.
Since uncountably many matrices R exist, it is natural to wonder
which one is optimal.
3.3. Multiple optimal learning factors for input weights
In this subsection, we explore the idea of finding an optimal
transform matrix R for the negative Jacobian G. However, finding a
dense R matrix would result in an algorithm with almost the
computational complexity of Levenberg–Marquardt (LM). We
avoid this by letting the R matrix in Eq. (23) be diagonal. Under
this case Eq. (22) becomes,
w k; nð Þ ¼ w k; nð ÞþzUr kð Þg k; nð Þ ð25Þ
On comparing Eqs. (22) and (25), the expression for the different
optimal learning factors zk could be given as,
zk ¼ zUr kð Þ ð26Þ
Eq. (26) shows that using the MOLF approach for training a MLP is
equivalent to transforming the net function vector using a diag-
onal transform matrix. Next, we use Newton's method develop our
approach for finding z.
Assume that an MLP is being trained using OWO-BP. Also
assume that a separate OLF zk is being used to update each hidden
unit's input weights, w(k, n), where 1rnr(Nþ1). The error
function to be minimized is given by (2). The predicted output
yp(m) is given by,
ypðmÞ ¼ ∑
N þ 1
n ¼ 1
woiðm; nÞxpðnÞ
þ ∑
Nh
k ¼ 1
wohðm; kÞf ∑
N þ 1
n ¼ 1
ðwðk; nÞþzk Á gðk; nÞÞxpðnÞ
where, g(k, n) is an element of the negative Jacobian matrix G and
f() denotes the hidden layer activation function. The negative first
partial of E with respect to zj is
gmolf jð Þ ¼ À
∂E
∂zj
¼
2
Nv
∑
Nv
p ¼ 1
∑
M
m ¼ 1
tp mð ÞÀ ∑
Nh
k ¼ 1
woh m; kð Þop zkð Þ
#
Uwoh m; jð Þop' jð ÞΔnpðjÞ
ð27Þ
where,
tp mð Þ ¼ tp mð ÞÀ ∑
N þ 1
n ¼ 1
woi m; nð Þxp nð Þ;
Δnp jð Þ ¼ ∑
N þ 1
n ¼ 1
xp nð ÞUg j; nð Þ
op zkð Þ ¼ f ∑
N þ 1
n ¼ 1
ðw k; nð Þþzk Ug k; nð ÞÞxp nð Þ
The elements of the Gauss–Newton Hessian Hmolf are
hmolf l; jð Þ %
∂2
E
∂zl∂zj
¼
2
Nv
∑
M
m ¼ 1
woh m; lð Þwoh m; jð Þ ∑
Nv
p ¼ 1
op' lð Þop' jð ÞΔnp lð ÞΔnp jð Þ
¼ ∑
N þ 1
i ¼ 1
∑
N þ 1
n ¼ 1
2
Nv
u l; jð Þ ∑
Nv
p ¼ 1
xp ið Þxp nð Þop' lð Þop' jð Þ
#
g l; ið ÞUg j; nð Þ ð28Þ
where u(l, j) is defined in (9). The Gauss–Newton update guaran-
tees that Hmolf is non-negative definite. Given Hmolf and the
negative gradient vector gmolf, defined as
gmolf ¼ ½À∂E=∂z1; À∂E=∂z2; …; À∂E=∂zNh
ŠT
We minimize E with respect to the vector z using Newton's
method. If Hmolf and gmolf are the Hessian and negative gradient,
respectively, of the error with respect to z, then the multiple
optimal learning factors are computed by solving
Hmolf Uz ¼ gmolf ð29Þ
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1493
5. A brief outline of training using the MOLF algorithm is given
below. In each iteration of the training algorithm:
1. Solve linear equations for all output weights using (4)
2. Calculate the negative input weight Jacobian G using BP as
given by (6)
3. Calculate z using Newton's method as shown in (29) and
update the input weights as
w k; nð Þ’w k; nð Þþzk Ug k; nð Þ ð30Þ
Here, the MOLF procedure has been inserted into the OWO-BP
algorithm, resulting in an algorithm we denote as MOLF-BP.
The MOLF procedure can be inserted into other algorithms as well.
3.4. Analysis of the MOLF Hessian
This subsection presents an analysis of the structure of the
MOLF Hessian and its relation to the reduced Hessian HR of the
input weights. Comparing (28) to (9), we see that the term within
the square brackets is an element of HR. Hence,
hmolf l; jð Þ ¼
∂2
E
∂zl∂zj
¼ ∑
N þ 1
i ¼ 1
∑
N þ1
n ¼ 1
∂2
E
∂w l; ið Þ∂w j; nð Þ
g l; ið Þg j; nð Þ ð31Þ
For fixed (l, j), hmolf(l, j) can be expressed as,
∂2
E
∂zl∂zj
¼ ∑
N þ1
i ¼ 1
gl ið Þ ∑
N þ 1
n ¼ 1
h
l;j
R i; nð ÞUgj nð Þ ð32Þ
In matrix notation,
Hmolf ¼ gT
l Hl;j
R gj ð33Þ
where column vector gl
contains G elements g(l, n) for all values of
n, where the (Nþ1) by (Nþ1) matrix HR
l, j
contains elements of HR
for weights connected to the lth and jth hidden units. The reduced
size Hessian HR
l, j
in (33) uses four indices (l, i, j, n) and can be
viewed as a 4-dimensional array, represented by H4
R whose
elements satisfy
h
l;j
R ði; nÞARðN þ 1ÞxNhxNhxðN þ 1Þ
Now, (33) can be rewritten as
H4
molf ¼ GH4
RGT
ð34Þ
From (32), we see that hmolf(l, j)¼h4
molf(l, j, l, j), i.e., the
4-dimensional matrix H4
molf is transformed into the 2-dimensional
matrix Hmolf, by setting m¼l and u¼j. Each element of the MOLF
Hessian combines the information from (Nþ1) rows and columns of
the reduced Hessian, HR
l, j
. This makes MOLF less sensitive to input
conditions. Note the similarities between (8) and (33).
4. Effect of dependence on the MOLF algorithm
In this section, we analyze the performance of MOLF applied to
OWO-BP, in the presence of linear dependence.
4.1. Dependence in the input layer
A linearly dependent input can be modeled as a linear combi-
nation of any or all inputs as
xp Nþ2ð Þ ¼ ∑
N þ 1
n ¼ 1
b nð Þxp nð Þ ð35Þ
During OWO, the weights from the dependent input, feeding the
outputs will be set to zero and the output weight adaptation
will not be affected. During the input weight adaptation, the
expression for gradient given by (27) can be re-written as,
gmolf jð Þ ¼ À
∂E
∂zj
¼
2
Nv
∑
Nv
p ¼ 1
∑
M
m ¼ 1
tp mð ÞÀ ∑
Nh
k ¼ 1
woh m; kð Þop zkð Þ
#
Uwoh m; jð Þop' jð Þ Δnp jð Þþg k; Nþ2ð Þ ∑
N þ 1
n ¼ 1
b nð Þxp nð Þ
!
ð36Þ
and the expression for an element of the Hessian in (28) can be
re-written as
∂2
E
∂zl∂zj
¼
2
Nv
∑
M
m ¼ 1
woh m; lð Þwoh m; jð Þ ∑
Nv
p ¼ 1
op' lð Þop' jð Þ
 Δnp lð ÞΔnp jð Þþ Δnp lð Þg j; Nþ2ð Þ
 ∑
N þ 1
n ¼ 1
b nð Þxp nð ÞþΔnp jð Þg l; Nþ2ð Þ ∑
N þ 1
i ¼ 1
b ið Þxp ið Þ
þg j; Nþ2ð Þg l; Nþ2ð Þ ∑
N þ1
n ¼ 1
∑
N þ1
i ¼ 1
b ið Þxp ið Þb nð Þxp nð Þ
#
ð37Þ
Let H0
molf be the Hessian, when the extra dependent input xp(Nþ2)
is included. Then,
hmolf' l; jð Þ ¼ hmolf l; jð Þþ
2
Nv
u l; jð Þ ∑
Nv
p ¼ 1
xp Nþ2ð Þop' lð Þop' jð Þ
 Ã
U g l; Nþ2ð Þ ∑
N þ 1
n ¼ 1
xp nð Þg j; nð Þþg j; Nþ2ð Þ ∑
N þ 1
i ¼ 1
xp ið Þg l; ið Þ
þxp Nþ2ð Þg l; Nþ2ð Þg j; Nþ2ð Þ
#
ð38Þ
Comparing (27) with (36) and (28) with (37) and (38), we see
some additional terms that appear within the square brackets in
the expressions for gradient and Hessian in the presence of
linearly dependent input. Clearly, these parasitic cross-terms will
cause the training using MOLF to be different for the case of
linearly dependent inputs. This leads to the following lemma.
Lemma 2. Linearly dependent inputs, when added to the network,
do not force H'molf to be singular.
As seen in (37) and (38), each element h0
molf (m, j) simply gains
some first and second degree terms in the variables b(n).
4.2. Dependence in the hidden layer
Here we look at how a dependent hidden unit affects the
performance of MOLF. Assume that some hidden unit activations
are linearly dependent upon others, as
opðNh þ1Þ ¼ ∑
Nh
k ¼ 1
c kð Þop kð Þ ð39Þ
Further assume that OLS is used to solve for output weights.
The weights in the network are updated during every training
iteration and it is quite possible that this could cause some hidden
units to be linearly dependent upon each other. The dependence
could manifest in hidden units being identical or a linear combi-
nation of other hidden unit outputs.
Consider one of the hidden units to be dependent. The autocorre-
lation matrix, Rx, will be singular and since we are using OLS to solve
for output weights, all the weights connecting the dependent hidden
unit to the outputs will be forced to zero. This will ensure that the
dependent hidden unit does not contribute to the output. To see if this
has any impact on learning in MOLF we can look at the expression for
the gradient and Hessian, given by (27) and (28) respectively. Both
equations have a sum of product terms containing output weights.
Since OWO sets the output weights for the dependent hidden unit to
be zero, this will also set the corresponding gradient and Hessian
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–15011494
6. elements to be zero. In general, any dependence in the hidden layer
will cause the corresponding learning factor to be zero and will not
affect the performance of MOLF.
Lemma 3. For each dependent hidden unit, the corresponding row
and column of Hmolf is zero-valued.
This follows from (28). The zero-valued rows and columns in
Hmolf cause no difficulties if (16) is solved using OLS [26].
5. Improving the MOLF algorithm
Section 4.1 showed that the presence of linearly dependent inputs
affect MOLF training, primarily the input weight adaptation. Depen-
dent inputs manifest themselves as additional rows and columns in
the input weight Jacobian matrix G. If we can find a way to set these
additional rows and columns to zero, then we can ensure that the
weights connected to the dependent inputs will not get updated and
the training will be unaffected by the presence of dependent inputs.
In this section, we present a way to mitigate the effect of linearly
dependent inputs on training input weights.
Without loss of generality, let a linearly dependent input
xp(Nþ2) be modeled as in (35), where the first (Nþ1) inputs are
linearly independent. Using the singular value decomposition
(SVD), the input autocorrelation matrix R of dimension (Nþ2) by
(Nþ2) is written as UΣUT
. This can be rewritten as R¼AT
A, where
the transformation matrix A is defined as
A ¼ ΣÀ1=2
UT
Both the last row and the last column of A are zero. Now, when the
input vector with the dependent inputs is transformed as,
x0
p ¼ A Á xp ð40Þ
the last element, x0
p(Nþ2) will be zero. This type of transformation
is termed as whitening transformation [27] in linear algebra and
statistics literature. Now, let us investigate the effect of dependent
inputs on adjusting input weights using MOLF, when whitening
transform is incorporated into training. We need to verify that the
transformed input vector x0
p, with zero-values for the dependent
inputs, causes no problems. The Nh by (Nþ2) gradient matrix G0
for the transformed data is
G' ¼ G Á AT
ð41Þ
where G is the negative Jacobian matrix before the inputs are
transformed. Although the last column of G is dependent upon the
other columns, the whitening transform causes the last column of
G0
to be zero. The fact that x'p(Nþ2) and g(k, Nþ2) are zero causes
(36) to revert to (27) and H0
molf in (38) to equal Hmolf. When OLS is
used to find z and then the output weights, xp(Nþ2) has no effect,
and weights connected to it are not trained.
Eqs. (40) and (41) now provide us with a way to remove line-
arly dependent inputs during training, so their presence has no
effect. This leads us to develop an improved version of MOLF.
The transformation applied to the Jacobian matrix is a whitening
transformation (WT). Whitening transformation has two desirable
properties: (i) decorrelation, which causes the transformed matrix to
have a diagonal covariance matrix and (ii) whitening, which forces
the transformed matrix to be a unit covariance (identity) matrix. We
call the improved algorithm MOLF-WT. The steps to training an MLP
with MOLF-WT are listed below:
1. Compute the input autocorrelation matrix R as given by (4)
2. Use the SVD to compute the whitening transformation matrix A
as described above
3. In each iteration of training
a. Calculate the negative input weight Jacobian G using BP
as given by (6)
b. Calculate the whitening transformed Jacobian G0
as described
in (41) to eliminate any linear dependency in the inputs
c. Calculate z using Newton's method as before, using (29) and
update the input weights as described in (30), except that
G is replaced by G0
d. Solve linear equations for all output weights as before
using (4)
6. Performance comparison and results
This section compares the performance of MOLF-WT, with
several popular existing feed-forward batch learning algorithms.
We begin by listing the various algorithms being compared.
We then identify the metrics used for comparing the performances
of the various algorithms, provide details of the experimental
procedure along with a description of the data files used for
obtaining the numerical results.
6.1. List of competing algorithms
The performance of MOLF-WT is compared with four other
existing methods, namely:
1. Output weight optimization-backpropagation (OWO-BP)
2. Multiple Learning Factor applied to OWO-BP (MOLF-BP)
3. Levenberg–Marquardt (LM)
4. Scaled conjugate gradient (SCG) [28]
All algorithms employ the fully connected feed-forward net-
work architecture described in Fig. 1. The hidden units have
sigmoid activation functions, while the output units all have linear
activations. SCG and LM employ one-stage learning where all
weights in the network are updated simultaneously in each
iteration. OWO-BP, MOLF-BP, and MOLF-WT employ two-stage
learning where we first update the input weights and subse-
quently solve linear equations for the output weights. A single OLF
was used for training in the OWO-BP and LM algorithms, while
multiple OLFs were used for MOLF-BP and MOLF-WT.
6.2. Experimental procedure
Primarily, we make use of the k-fold methodology for our experi-
ments (k¼10 for our simulations). Each data set is split into k non-
overlapping parts of equal size. Of these, k-2 parts (roughly 80%) are
combined and used for training, while each of the remaining two parts
is used for validation and testing respectively (roughly 10% each).
Validation is performed per training iteration (to prevent over train-
ing) and the network with the minimum validation error is saved.
After the training is completed, the network with the minimum
validation error is used with the testing data to obtain the final testing
error, which measures the networks ability to generalize. The max-
imum number of iteration per training session is fixed at 2500 and is
the same for all algorithms. There is also an early stopping criterion.
Training is repeated till all k combinations have been exhausted (i.e.,10
times).
It is known that the initial network plays a role in the networks
ability to learn. In order to make our performance comparisons
independent of the initial network, we carry out the k-fold
methodology 5 times, each time with a different initial network.
This leads to a total of 50 different training sessions per algorithm,
with 50 different initial networks being used. We call this the
5k-fold method. For a given data set, even though the initial weight
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1495
7. is different in each training session, we use the same 50 initial
weight sets across all algorithms. This is done to allow for a fair
comparison between the various training methods.
We use net control [19] to adjust the weights before training
begins. In net control, the randomly initialized input weights and
hidden unit thresholds are scaled and biased so that each hidden
unit's net function has the same mean and standard deviation.
Specifically, the net function means are equal to 0.5 and the corre-
sponding standard deviations are equal to 1. In addition, OWO is used
to find the initial output weights. This means that all the algorithms
listed above have the same starting point (a different one for each
session) for all data files. This will be evident in the plots presented
later in this section, where we can see that the MSE for the first
iteration for all algorithms is the same.
At the end of the 5k-fold procedure, we compute (i) the average
training error, (ii) the average minimum validation error, (iii) the
average testing error, and (iv) the average number of cumulative
multiplies required for training, for each algorithm and for each
data file. The average is computed over 50 sessions. These
quantities form the metrics for comparison and are subsequently
used to generate the plots and tables to compare the performance
of different learning algorithms.
6.3. Computational cost
The proposed MOLF algorithm involves inversion a Hessian
matrix, which has a general complexity of O(N3
), for a square
matrix of size N. However, compared to Newton's method or LM,
the size of the Hessian is much smaller. Updating weights using
Newton's method or LM, requires a Hessian with Nw rows and
columns, where Nw ¼M Á Nu þ(Nþ1)Nh and Nu ¼NþNh þ1. In com-
parison, the Hessian used in the proposed MOLF has only Nh rows
and columns. The number of multiplies required to solve for
output weights using OLS is given by
Mols ¼ Nu Nu þ1ð Þ Mþ
1
6
Nu 2Nu þ1ð Þþ
3
2
!
ð42Þ
The numbers of multiplies per training iteration for OWO-BP, LM
and MOLF are given below
Mowo Àbp ¼ Nv½ 2Nh Nþ2ð ÞþMðNu þ1Þ þðNu Nu þ1ð ÞÞ=2þM Nþ6Nh þ4ð ÞŠ
þMols þNh Nþ1ð Þ ð43Þ
Mlm ¼ Nv½MNu þ2Nh Nþ1ð ÞþM Nþ6Nh þ4ð ÞþMNu Nu þ3Nh Nþ1ð Þð Þ
þ4N2
h Nþ1ð Þ2
ŠþN3
w þN2
w ð44Þ
Mscg ¼ 4Nv Nh Nþ1ð ÞþMNu½ Šþ10 Nh Nþ1ð ÞþMNu½ Š ð45Þ
Mmolf ¼ MowoÀbp þNv Nh Nþ4ð ÞÀM Nþ6Nh þ4ð Þ½ ŠþN3
h ð46Þ
Note that Mmolf consists of Mowo-bp plus the required multiplies for
calculating optimal learning factors. Similarly, Mlm consists of Mbp
plus the required multiplies for calculating and inverting the
Hessian matrix. We will see later in the plots that despite the
additional complexity of Hessian inversion, the number of cumu-
lative multiplies to train a network using MOLF is almost the same
as OWO-BP or SCG algorithms, which do not involve any matrix
inversion operation. We will also see that LM on the other hand
has a far greater overhead due the inversion of a much larger
Hessian matrix.
6.4. Training data sets
Table 1 lists seven data sets used for simulations and perfor-
mance comparisons.
All the data sets are publicly available. In all data sets, the
inputs have been normalized to be zero-mean and unit variance.
This way, it becomes clear that MOLF is not equivalent to a mere
normalization of the data.
The number of hidden units for each data set is selected by
first training a multi-layer perceptron with a large number of
hidden units followed by a step-wise pruning with validation [29].
The least useful hidden unit is removed at each step till there is only
one hidden unit left. The number of hidden units corresponding to the
smallest validation error is used for training on that data set.
6.5. Results
Numerical results obtained from our training, validation and
testing methodology for the proposed MOLF-WT algorithm are
compared with those of OWO-BP, MOLF-BP, LM and SCG, using the
data files listed in Table 1. Two plots are generated for each data
file: (i) the average mean square error (MSE) for training from
10-fold training is plotted versus the number of iterations (shown
on a log10 scale) and (ii) the average training MSE from 10-fold
training is plotted versus the cumulative number of multiplies
(also shown on a log10 scale) for each algorithm. Together, the two
plots will not only show the final average MSE achieved by
training but also the computations it consumed for getting there.
6.5.1. Prognostics data set
This data file is available at the Image Processing and Neural
Networks Lab repository [30]. It consists of parameters that are
available in the Bell Helicopter health and usage monitoring
system (HUMS), which performs flight load synthesis, which is a
form of prognostics [31]. The data set containing 17 input features
and 9 output parameters.
We trained an MLP having 13 hidden units. In Fig. 3-a, the
average mean square error (MSE) for 10-fold training is plotted
versus the number of iterations for each algorithm. In Fig. 3-b, the
average 10-fold training MSE is plotted versus the required number
of multiplies (shown on a log10 scale). From Fig. 3-a, LM and MOLF
based algorithms have almost identical performance (LM is margin-
ally better than MOLF-WT which in turn is marginally better than
MOLF-BP). However, the performance of LM comes with a signifi-
cantly higher computational demand, as shown in Fig. 3-b. Also,
MOLF based algorithms are better than OWO-BP and SCG.
6.5.2. Federal reserve economic data set
This file contains some economic data for the USA from 01/04/
1980 to 02/04/2000 on a weekly basis. From the given features, the
goal is to predict the 1-Month CD Rate [32]. For this data file, called
TR on its webpage [33], we trained an MLP having 34 hidden units.
From Fig. 4-a, LM has the best overall performance, followed
closely by MOLF-WT. The MOLF-WT algorithm is better than
MOLF-BP. Fig. 4-b shows the computational cost of achieving this
Table 1
Data set description.
Data set name No. of
inputs
No. of
outputs
No. of
patterns
No. of Hidden
Units
Prognostics 17 9 4745 13
Federal reserve 15 1 1049 34
Housing 16 1 22,784 30
Concrete 8 1 1030 13
Remote sensing 16 3 5992 25
White wine
quality
11 1 4898 24
Parkinson's 16 2 5875 17
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–15011496
8. performance. The proposed MOLF algorithms consume slightly
more computation than OWO-BP and SCG. However, all the first
order algorithms utilized about two orders of magnitude fewer
computations than LM.
6.5.3. Housing data set
This data file is available on the DELVE data set repository [34].
The data was collected as part of the 1990 US census. For the
purpose of this data set a level State-Place was used. Data from all
states were obtained. Most of the counts were changed into
appropriate proportions. These are all concerned with predicting
the median price of houses in a region based on demographic
composition and the state of the housing market in the region.
House-16H data was used in our simulation, with ‘H’ standing for
high difficulty. Tasks with high difficulty have had their attributes
chosen to make the modeling more difficult due to higher variance
or lower correlation of the inputs to the target. We trained an MLP
having 30 hidden units. From Figs. 5-a and -b, we see that the LM
had the best performance, followed closely by SCG. The proposed
MOLF-WT algorithm has a slight edge over MOLF-BP.
6.5.4. Concrete compressive strength data set
This data file is available on the UCI Machine Learning Repo-
sitory [35]. It contains the actual concrete compressive strength
(MPa) for a given mixture under a specific age (days) determined
from laboratory. The concrete compressive strength is a highly
nonlinear function of age and ingredients. We trained an MLP
having 13 hidden units. For this data set, Fig. 6 shows that LM has
the best overall average training error, followed very closely by the
improved MOLF-WT, which is marginally better than MOLF-BP.
SCG attains a performance almost identical to MOLF-WT.
6.5.5. Remote sensing data set
This data file is available on the Image Processing and Neural
Networks Lab repository [30]. It consists of 16 inputs and 3 outputs
and represents the training set for inversion of surface permittiv-
ity, the normalized surface rms roughness, and the surface
correlation length found in back scattering models from randomly
rough dielectric surfaces [36]. We trained an MLP having 25
hidden units. From Fig. 7-a, the MOLF-WT has the best overall
average training MSE.
Fig. 4. Federal reserve data: average error vs. (a) iterations and (b) multiplies per iteration.
Fig. 3. Prognostics data set: average error vs. (a) iterations and (b) multiplies per iteration.
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1497
9. 6.5.6. White wine quality data set
The Wine Quality-White data set [37] was created using white
wine samples. The inputs include objective tests (e.g. PH values)
and the output is based on sensory data (median of at least
3 evaluations made by wine experts). Each expert graded the wine
quality between 0 (very bad) and 10 (very excellent). This data set
is available on the UCI data repository [35]. For this data file, the
number of hidden units was chosen to be 24. For this data set,
Fig. 8 shows that LM and the improved MOLF-WT have almost
identical performance and are better than the rest.
6.5.7. Parkinson's disease data set
The Parkinson's data set [38] is available on the UCI data repository
[35]. It is composed of a range of biomedical voice measurements
from 42 people with early-stage Parkinson's disease recruited to a
six-month trial of a telemonitoring device for remote symptom
progression monitoring. The main aim of the data is to predict the
motor and total UPDRS scores (‘motor_UPDRS’ and ‘total_UPDRS’)
from the 16 voice measures. For this data set the number of hidden
units was chosen to be 17. From Fig. 9, LM attains the best overall
training error, followed by the improved MOLF-WT. We can see again
that the improved MOLF-WT performs better than MOLF-BP.
Table 2 is a compilation of the metrics used for comparing the
performance of all algorithms on all data sets. It comprises of the
average minimum testing and validation errors. Based on the results
obtained on the selected data sets, we can make a few observations.
From the training plots and the data from Table 2, we can say that
1. LM is the top performer in 4 out of the 7 data sets. However, LM
being a second order method, this performance comes at a
significant cost of computation – almost two orders of magni-
tude greater than the rest;
2. Both SCG and OWO-BP consistently appear in the last two
places in terms of average training error on 5 out of 7 data sets
(Housing and Concrete data sets being the exception). How-
ever, being first order methods, they set the bar for the lowest
cost of computation;
3. Both MOLF-BP and MOLF-WT always perform better than their
common predecessor, OWO-BP, on all data sets and they are
Fig. 5. Housing data: average error vs. (a) iterations and (b) multiplies per iteration.
Fig. 6. Concrete data: average error vs. (a) iterations and (b) multiplies per iteration.
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–15011498
10. better than SCG except for the housing data set. This perfor-
mance is achieved with minimal computational overhead
compared to both SCG and OWO-BP, as evident in the plots of
error vs. number of cumulative multiplications;
4. MOLF-BP is almost never better than LM, except for the remote
sensing data set where it is marginally better;
5. MOLF-WT always performs better than its immediate prede-
cessor MOLF-BP (although MOLF-BP comes very close on 3 of
the 7 data sets);
6. MOLF-WT achieves a better performance than LM on one data
set (remote sensing) and has almost identical performance on
two other data sets (wine and prognostics). It is also a close
second on 3 of the remaining 4 data sets. The same cannot be
said of MOLF-BP;
7. LM is again consistently the top performer with the best
average testing error on 6 out 7 data sets, while MOLF based
algorithms figure consistently in the top two performers (note
that the network with the smallest validation error is used to
obtain the testing error).
In general, we can see that
1. Both MOLF-BP and the improved MOLF-WT algorithms con-
sistently outperform OWO-BP in all three phases of learning
(i.e. MOLF inserted into OWO-BP does help improve its perfor-
mance in training, validation and testing);
2. Both MOLF-BP and the improved MOLF-WT algorithms con-
sistently outperform SCG in terms of the average minimum
testing error;
3. The MOLF-WT very often has a performance comparable to LM
and it attains that with minimal computational overhead;
4. While MOLF-BP is an improvement over OWO-BP, it is not as
good as MOLF-WT.
7. Conclusions
Starting with the concept of equivalent networks and its
implications, a class of hybrid learning algorithms is presented
Fig. 7. Remote sensing data: average error vs. (a) iterations and (b) multiplies per iteration.
Fig. 8. Wine quality data: average error vs. (a) iterations and (b) multiplies per iteration.
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1499
11. that uses first and second order information to simultaneously
calculate optimal learning factors for all hidden units and train the
input weights. Detailed analysis of the structure of the Hessian
matrix for the proposed algorithm in relation to the full Newton
Hessian matrix is presented, showing how the smaller MOLF
Hessian could potentially be less ill-conditioned. Elements of the
reduced size Hessian are shown to be weighted sums of the
elements of the entire network's Hessian. Limitations of the basic
MOLF procedure in the presence of linearly dependent inputs is
shown mathematically. The limitations are used as a launch pad to
develop a version of MOLF that incorporates a whitening trans-
formation type operation to mitigate the effect of linearly depen-
dent inputs. This version is called MOLF-WT.
A detailed experimental procedure is presented and the perfor-
mance of MOLF-BP and MOLF-WT are compared to 3 other existing
feed-forward batch learning algorithms on seven publicly available
data sets. For the data sets investigated, both MOLF-BP and MOLF-WT
are about as computationally efficient per iteration as OWO-BP and
SCG. MOLF-BP consistently performs better than OWO-BP. In effect
the MOLF procedure improves OWO-BP by using an Nh by Nh Hessian
in place of the normal 1 by 1 Hessian used to calculate a scalar OLF.
MOLF-WT has a performance comparable and in some cases better
than that of LM. From the results, MOLF-WT stays close to the
computational cost of OWO-BP and SCG to achieve a performance
comparable with LM and may be a viable alternative to LM in terms of
efficiency of training, minimum validation error and testing errors.
Since they use smaller Hessian matrices, the MOLF methods
presented in this paper require fewer multiplies per iteration than
traditional second order methods based on Newton's method.
Finally, it is clear from the figures that MOLF-BP and MOLF-WT
decrease the training error much faster than the other algorithms.
In future work, the MOLF approach will be applied to other
training algorithms and to networks having additional hidden layers.
An alternate approach, in which each network layer has its own
optimal learning factor, will also be tried. In addition, the MOLF
algorithm will be extended to handle classification type applications.
References
[1] R.C. Odom, P. Pavlakos, S.S. Diocee, S.M. Bailey, D.M. Zander, J.J. Gillespie, Shaly
sand analysis using density-neutron porosities from a cased-hole pulsed
neutron system, SPE Rocky Mountain Regional Meeting Proceedings: Society
of Petroleum Engineers, 1999, pp. 467–476.
[2] A. Khotanzad, M.H. Davis, A. Abaye, D.J. Maratukulam, An artificial neural
network hourly temperature forecaster with applications in load forecasting,
IEEE Trans. Power Syst. 11 (2) (1996) 870–876.
[3] S. Marinai, M. Gori, G. Soda, Artificial neural networks for document analysis
and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 27 (1) (2005) 23–35.
[4] J. Kamruzzaman, R.A. Sarker, R. Begg, Artificial Neural Networks: Applications
in Finance and Manufacturing, Idea Group Inc (IGI), Hershey, PA, 2006.
Fig. 9. Parkinson's disease data: average error vs. (a) iterations and (b) multiplies per iteration.
Table 2
Statistics of testing and validation errors.
Comparison metric Data file name OWO-BP SCG MOLF-BP MOLF-WT LM
Average minimum
testing error
Prognostics 2.4187E7 2.6845E7 1.5985E7 1.4973E7 1.4509E7
Federal reserve 0.108636974 0.33502068 0.105951897 0.120668845 0.103284696
Housing 1.7329E9 1.7441E9 1.2758E9 1.3599E9 1.1627E9
Concrete 33.48613532 83.78832466 32.57090477 33.72180127 29.73279111
Remote sensing 1.568023633 1.51843944 2.189078998 0.136484007 0.637828985
White wine 0.513739373 1.25867336 0.507925435 0.514729807 0.499294581
Parkinson's 139.0323995 125.5411318 129.4456136 127.3415428 122.2485104
Average minimum
validation error
Prognostics 2.4764E7 2.6675E7 1.5999E7 1.5155E7 1.4329E7
Federal reserve 0.108442923 0.17653648 0.09675038 0.10759092 0.09371866
Housing 1.3512E9 1.1683E9 1.2414E9 1.2109E9 1.0662E9
Concrete 33.43842214 48.63268454 29.52707558 28.72108048 26.4212992
Remote sensing 1.534430917 0.71842796 0.57845174 0.15130356 0.37738708
White wine 0.503644249 0.50386288 0.49644024 0.49550544 0.49058676
Parkinson's 131.2226476 124.3035235 124.0881855 122.3187088 117.3367678
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–15011500
12. [5] L. Wang, X. Fu, Data Mining With Computational Intelligence, Springer-Verlag,
New York, Inc, Seacaucus, NJ, USA, 2005.
[6] G. Cybenko, Approximations by superposition of a sigmoidal function, Math.
Control. Signals Syst. (MCSS) 2 (1989) 303–314.
[7] D.W. Ruck, et al., The multi-layer perceptron as an approximation to a Bayes
optimal discriminant function, IEEE Trans. Neural Netw. 1 (4) (1990) 296–298.
[8] Q. Yu, S.J. Apollo, M.T. Manry, MAP estimation and the multi-layer perceptron,
in: Proceedings of the 1993 IEEE Workshop on Neural Networks for Signal
Processing. Linthicum Heights, Maryland, September 6–9, 1993, pp. 30–39.
[9] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by
error propagation, in: D.E. Rumelhart, J.L. McClelland (Eds.), Parallel Distrib-
uted Processing, vol. I, The MIT Press, Cambridge, Massachusetts, 1986.
[10] J.P. Fitch, S.K. Lehman, F.U. Dowla, S.Y. Lu, E.M. Johansson, D.M. Goodman, Ship
wake-detection procedure using conjugate gradient trained artificial neural
networks, IEEE Trans. Geosci. Remote Sens. 29 (5) (1991) 718–726.
[11] S. McLoone, G. Irwin, A variable memory quasi-Newton training algorithm,
Neural Process. Lett. 9 (1999) 77–89.
[12] A.J. Shepherd, Second-Order Methods for Neural Networks, Springer-Verlag,
New York, Inc., 1997.
[13] K. Levenberg, A method for the solution of certain problems in least squares,
Q. Appl. Math. 2 (1944) 164–168.
[14] D. Marquardt, An algorithm for least-squares estimation of nonlinear para-
meters, SIAM J. Appl. Math. 11 (1963) 431–441.
[15] Changhua Yu, Michael T. Manry, Jiang Li, Effects of nonsingular pre-processing
on feed-forward network training, Int. J. Pattern Recognit. Artif. Intell. 19 (2)
(2005) 217–247.
[16] S. Ergezinger, E. Thomsen, An accelerated learning algorithm for multilayer
perceptrons: optimization layer by layer, IEEE Trans. Neural Netw. 6 (1) (1995)
31–42.
[17] M.T. Manry, S.J. Apollo, L.S. Allen, W.D. Lyle, W. Gong, M.S. Dawson, A.K. Fung,
Fast training of neural networks for remote sensing, Remote Sens. Rev. 9
(1994) 77–96.
[18] S.A. Barton, A matrix method for optimizing a neural network, Neural Comput.
3 (3) (1991) 450–459.
[19] J. Olvera, X. Guan, M.T. Manry, Theory of monomial networks, in: Proceedings
of SINS'92, Automation and Robotics Research Institute, Fort Worth, Texas,
1992, pp. 96–101.
[20] G.D. Magoulas, M.N. Vrahatis, G.S. Androulakis, Improving the convergence of
the backpropagation algorithm using learning rate adaptation methods,
Neural Comput. 11 (7) (1999) 1769–1796 (October 1).
[21] Amit Bhaya, Eugenius Kaszkurewicz, Steepest descent with momentum for
quadratic functions is a version of the conjugate gradient method, Neural
Netw. 17 (1) (2004) 65–71.
[22] S.S. Malalur, M.T. Manry, Multiple optimal learning factors for feed-forward
networks, in: Proceedings of SPIE: Independent Component Analyses, Wave-
lets, Neural Networks, Biosystems, and Nanoengineering VIII, vol. 7703,
Orlando Florida, April 7–9, 2010, 77030F.
[23] P. Jesudhas, M.T. Manry, R. Rawat, S. Malalur, Analysis and improvement of
multiple optimal learning factors for feed-forward networks, in: Proceedings
of International Joint Conference on Neural Networks (IJCNN), San Jose,
California, USA, July 31–August 5, 2011, pp. 2593–2600.
[24] W. Kaminski, P. Strumillo, Kernel orthonormalization in radial basis function
neural networks, IEEE Trans. Neural Netw. 8 (5) (1997) 1177–1183.
[25] R.P. Lippman, An introduction to computing with neural nets, IEEE ASSP
Magazine (1987) (April).
[26] R. Battiti, First- and second-order methods for learning: between steepest
descent and Newton's method, Neural Comput. 4 (1992) 141–166.
[27] S. Raudys, Statistical and Neural Classifiers: An Integrated Approach to Design,
Springer-Verlag, London, 2001.
[28] M.F. Moller, A scaled conjugate gradient algorithm for fast supervised
learning, Neural Netw. 6 (1993) 525–533.
[29] P.L. Narasimha, W.H. Delashmit, M.T. Manry, Jiang Li, and F. Maldonado, An
integrated growing-pruning method for feed-forward network training,
NeuroComputing 71 (2008) 2831–2847.
[30] University of Texas at Arlington, Training Data Files – 〈http://www.uta.edu/
faculty/manry/new_mapping.html〉.
[31] M.T. Manry, H. Chandrasekaran, C.-H. Hsieh, Signal Processing Applications of
the Multilayer Perceptron, book chapter in Handbook of Neural Network
Signal Processing, in: Yu Hen Hu, Jenq-Nenq Hwang (Eds.), CRC Press, Boca
Raton, 2001.
[32] US Census Bureau [〈http://www.census.gov〉] (under Lookup Access [〈http://
www.census.gov/cdrom/lookup〉]: Summary Tape File 1).
[33] Bilkent University, Function Approximation Repository – 〈http://funapp.cs.
bilkent.edu.tr/DataSets/〉.
[34] University of Toronto, Delve Data Sets – 〈http://www.cs.toronto.edu/~delve/
data/datasets.html〉.
[35] A. Frank, A. Asuncion, UCI Machine Learning Repository, University of
California, School of Information and Computer Science, Irvine, CA, 2010
http://archive.ics.uci.edu/ml.
[36] A.K. Fung, Z. Li, K.S. Chen, Back scattering from a randomly rough dielectric
surface, IEEE Trans. Geosci. Remote. Sens. 30 (2) (1992).
[37] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, J. Reis., Modeling wine preferences
by data mining from physicochemical properties, Decision Support Systems,
vol. 47, Elsevier (2009) 547–553 (ISSN: 0167-9236).
[38] A. Tsanas, M.A. Little, P.E. McSharry, L.O. Ramig, Accurate telemonitoring of
Parkinson's disease progression by non-invasive speech tests, IEEE Trans.
Biomed. Eng. 57 (4) (2009) 884–893.
Sanjeev S. Malalur received his M.S. and Ph.D. in
Electrical Engineering from The University of Texas at
Arlington, mentored by Prof. Michael T. Manry. For his
doctoral dissertation, he developed a family of fast and
robust training algorithms for the multi-layer percep-
tron feed-forward neural networks. He developed a
fingerprint verification system as a part of his Master's
thesis. He is currently employed by Hunter Well
Science where he continues to extend his research by
developing techniques for modeling, analyzing and
interpreting well log data. His research interests
include machine learning algorithms, with an emphasis
on neural networks, signal and image processing tech-
niques, statistical pattern recognition and biometrics. He has authored several
peer-reviewed publications in his field of research.
Michael T. Manry is a professor at the University of
Texas at Arlington (UTA) and his experience includes
work in neural networks for the past 25 years. He is
currently the director of the Image Processing and
Neural Networks Lab (IPNNL) at UTA. He has published
more than 150 papers and he is a senior member of the
IEEE. He is an associate editor of Neurocomputing, and
also a referee for several journals. His recent work,
sponsored by the Advanced Technology Program of the
state of Texas, Bell Helicopter Textron, E-Systems, Mobil
Research, and NASA has involved the development of
techniques for the analysis and fast design of neural
networks for image processing, parameter estimation,
and pattern classification. He has served as a consultant for the Office of Missile
Electronic Warfare at White Sands Missile Range, MICOM (Missile Command) at
Redstone Arsenal, NSF, Texas Instruments, Geophysics International, Halliburton
Logging Services, Mobil Research, Williams-Pyro and Verity Instruments. Dr. Manry
is President of Neural Decision Lab LLC, which is a member of the Arlington
Technology Incubator. The company commercializes software developed by
the IPNNL.
Praveen Jesudhas received the B.E in Electrical Engineer-
ing from Anna University, India in 2008 and the M.S.
degree in Electrical Engineering from the University of
Texas at Arlington, USA in 2010. In his university research,
Praveen worked on problems in iris detection, iris feature
extraction, and second order training algorithms for neural
nets. As a Pattern Recognition Intern at FastVDO LLC in
Maryland, he developed particle filter-based vehicle track-
ing algorithms and Adaboost-based vehicle detection soft-
ware. Currently he is employed at Egrabber, India as a
Research Engineer, where he uses machine learning to
solve problems related to automated online information
retrieval. His current interests include neural networks,
machine learning, natural language processing and big data analytics. He is a member of
Tau Beta Pi.
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1501