Neural Networks
Radial Basis Functions Networks
Andres Mendez-Vazquez
December 10, 2015
Outline
1 Introduction
Main Idea
Basic Radial-Basis Functions
2 Separability
Cover’s Theorem on the separability of patterns
Dichotomy
φ-separable functions
The Stochastic Experiment
The XOR Problem
Separating Capacity of a Surface
3 Interpolation Problem
What is gained?
Feedforward Network
Learning Process
Radial-Basis Functions (RBF)
4 Introduction
Description of the Problem
Well-posed or ill-posed
The Main Problem
5 Regularization Theory
Solving the issue
Bias-Variance Dilemma
Measuring the difference between optimal and learned
The Bias-Variance
How can we use this?
Getting a solution
We still need to talk about...
Introduction
Observation
The back-propagation algorithm for the design of a multilayer perceptron
as described in the previous chapter may be viewed as the application of a
recursive technique known in statistics as stochastic approximation.
Now
We take a completely different approach by viewing the design of a neural
network as a curve fitting (approximation) problem in a high-dimensional
space.
Thus
Learning is equivalent to finding a surface in a multidimensional space that provides the best fit to the training data, under a statistical metric.
Thus
In the context of a neural network
The hidden units provide a set of "functions" that constitute a "basis" for the input patterns when they are expanded into the hidden space.
Name of these functions
Radial-Basis Functions.
History
These functions were first introduced
As the solution of the real multivariate interpolation problem
Right now
It is one of the main fields of research in numerical analysis.
A Basic Structure
We have the following structure
1 Input Layer to connect with the environment.
2 Hidden Layer applying a non-linear transformation.
3 Output Layer applying a linear transformation.
Example
[Figure: input nodes feeding nonlinear hidden nodes, followed by a single linear output node]
Why the non-linear transformation?
The justification
In a paper by Cover (1965), it was shown that a pattern-classification problem mapped to a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
Thus
This is a good reason to make the dimension of the hidden space in a Radial-Basis Function (RBF) network high.
Cover’s Theorem
The Summarized Statement
A complex pattern-classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
Actually
It is quite a bit more complex...
Some facts
A fact
Once we know a set of patterns is linearly separable, the problem is easy to solve.
Consider
A family of surfaces that separate the space into two regions.
In addition
We have a set of patterns
$$H = \{x_1, x_2, \ldots, x_N\} \quad (1)$$
Dichotomy (Binary Partition)
Now
The pattern set is split into two classes H1 and H2.
Definition
A dichotomy (binary partition) of the points is said to be separable with
respect to the family of surfaces if a surface exists in the family that
separates the points in the class H1 from those in the class H2.
Define
For each pattern x ∈ H, we define a set of real-valued measurement functions $\{\phi_1(x), \phi_2(x), \ldots, \phi_{d_1}(x)\}$.
Thus
We define the following function (vector of measurements)
$$\phi : H \to \mathbb{R}^{d_1} \quad (2)$$
Defined as
$$\phi(x) = (\phi_1(x), \phi_2(x), \ldots, \phi_{d_1}(x))^T \quad (3)$$
Now
Suppose that the pattern x is a vector in a $d_0$-dimensional input space.
Then...
We have that the mapping φ(x)
It maps points in the $d_0$-dimensional input space into corresponding points in a new space of dimension $d_1$.
Each of these functions $\phi_i(x)$
It is known as a hidden function because it plays a role similar to that of a hidden unit in a feedforward neural network.
Thus
The space spanned by the set of hidden functions $\{\phi_i(x)\}_{i=1}^{d_1}$ is called the hidden space or feature space.
φ-separable functions
Definition
A dichotomy {H1, H2} of H is said to be φ-separable if there exists a $d_1$-dimensional vector w such that
1 $w^T \phi(x) > 0$ if $x \in H_1$.
2 $w^T \phi(x) < 0$ if $x \in H_2$.
Clearly, the separating hyperplane in the hidden space is defined by the equation
$$w^T \phi(x) = 0 \quad (4)$$
Now
The inverse image of this hyperplane,
$$\text{Hyp}^{-1} = \left\{x \mid w^T \phi(x) = 0\right\}, \quad (5)$$
defines the separating surface in the input space.
Now
Taking into consideration
A natural class of mappings is obtained by using a linear combination of r-wise products of the pattern vector coordinates.
They are called
The rth-order rational varieties.
A rational variety of order r in dimension $d_0$ is described by
$$\sum_{0 \le i_1 \le i_2 \le \cdots \le i_r \le d_0} a_{i_1 i_2 \cdots i_r}\, x_{i_1} x_{i_2} \cdots x_{i_r} = 0 \quad (6)$$
where $x_i$ is the ith coordinate of the input vector x, and $x_0$ is set to unity in order to express the equation in homogeneous form.
Homogeneous Functions
Definition
A function f(x) is said to be homogeneous of degree n if, introducing a constant parameter λ and replacing the variable x with λx, we find:
$$f(\lambda x) = \lambda^n f(x) \quad (7)$$
For example, $f(x) = x^3$ is homogeneous of degree 3, since $f(\lambda x) = \lambda^3 x^3$.
Homogeneous Equation
Equation (Eq. 6)
An rth-order product of entries $x_i$ of x, $x_{i_1} x_{i_2} \cdots x_{i_r}$, is called a monomial.
Properties
For an input space of dimensionality $d_0$, there are
$$\binom{d_0}{r} = \frac{d_0!}{(d_0 - r)!\, r!} \quad (8)$$
monomials in (Eq. 6).
Example of these surfaces
Hyperplanes (first-order rational varieties)
Quadrics (second-order rational varieties)
Hyperspheres (quadrics with certain linear constraints on the coefficients)
[Figures: examples of each family of separating surfaces]
The Stochastic Experiment
Suppose
The activation patterns $x_1, x_2, \ldots, x_N$ are chosen independently.
Suppose
That all possible dichotomies of $H = \{x_1, x_2, \ldots, x_N\}$ are equiprobable.
Let $P(N, d_1)$ denote the probability that a particular dichotomy picked at random is φ-separable. Then
$$P(N, d_1) = \left(\frac{1}{2}\right)^{N-1} \sum_{m=0}^{d_1 - 1} \binom{N-1}{m} \quad (9)$$
What?
Basically, (Eq. 9) represents
The essence of Cover's Separability Theorem.
Something Notable
It states that $P(N, d_1)$ equals the cumulative binomial distribution corresponding to the probability that $N - 1$ flips of a fair coin result in $d_1 - 1$ or fewer heads.
Specifically
The higher we make the dimension of the hidden space in the radial-basis-function network, the closer the probability $P(N, d_1)$ is to one.
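To get a feel for (Eq. 9), the probability is easy to evaluate numerically. Here is a minimal sketch in Python (the choice of N and of the $d_1$ values is illustrative, not from the slides):

```python
from math import comb

def P(N, d1):
    # P(N, d1) = (1/2)^(N-1) * sum_{m=0}^{d1-1} C(N-1, m)   (Eq. 9)
    return 0.5 ** (N - 1) * sum(comb(N - 1, m) for m in range(d1))

# For N = 60 random patterns, phi-separability becomes almost certain
# as the hidden-space dimension d1 grows:
for d1 in (2, 10, 30, 50):
    print(d1, round(P(60, d1), 4))
```

Note that $P(N, d_1) = 1$ exactly as soon as $d_1 \ge N$, since the partial binomial sum then covers all $2^{N-1}$ dichotomies.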
Final ingredients of Cover's Theorem
First
Nonlinear formulation of the hidden functions defined by $\phi_i(x)$, where x is the input vector and $i = 1, 2, \ldots, d_1$.
Second
High dimensionality of the hidden space compared to the input space. This dimensionality is determined by the value assigned to $d_1$ (i.e., the number of hidden units).
Then
In general, a complex pattern-classification problem cast nonlinearly in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
There is always an exception to every rule!!!
The XOR Problem
[Figure: the four XOR patterns in the unit square; (0,0) and (1,1) belong to one class, (0,1) and (1,0) to the other]
Now
We define the following radial functions
$$\phi_1(x) = \exp\left(-\|x - t_1\|^2\right), \text{ where } t_1 = (1, 1)^T$$
$$\phi_2(x) = \exp\left(-\|x - t_2\|^2\right), \text{ where } t_2 = (0, 0)^T$$
Then
If we apply our classic mapping $\phi(x) = [\phi_1(x), \phi_2(x)]$:
Original → Mapping
(0, 1) → (0.3678, 0.3678)
(1, 0) → (0.3678, 0.3678)
(0, 0) → (0.1353, 1)
(1, 1) → (1, 0.1353)
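The mapped values above can be checked directly. A short sketch (using numpy, with the centers as given):

```python
import numpy as np

t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])  # the two RBF centers

def phi(x):
    # phi_i(x) = exp(-||x - t_i||^2)
    x = np.asarray(x, dtype=float)
    return np.exp(-np.sum((x - t1) ** 2)), np.exp(-np.sum((x - t2) ** 2))

for x in [(0, 1), (1, 0), (0, 0), (1, 1)]:
    print(x, tuple(round(v, 4) for v in phi(x)))
# (0, 1) -> (0.3679, 0.3679)   (1, 0) -> (0.3679, 0.3679)
# (0, 0) -> (0.1353, 1.0)      (1, 1) -> (1.0, 0.1353)
```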
New Space
We have the following new φ1 − φ2 space
[Figure: the mapped patterns in the φ1–φ2 plane, where the two classes become linearly separable]
Separating Capacity of a Surface
Something Notable
(Eq. 9) has an important bearing on the expected maximum number of randomly assigned patterns that are linearly separable in a multidimensional space.
Now, given our patterns $\{x_i\}_{i=1}^{N}$
Let N be a random variable defined as the largest integer such that the sequence is φ-separable.
We have that
$$\text{Prob}(N = n) = P(n, d_1) - P(n + 1, d_1) \quad (10)$$
Separating Capacity of a Surface
Then
$$\text{Prob}(N = n) = \left(\frac{1}{2}\right)^n \binom{n-1}{d_1 - 1}, \quad n = 0, 1, 2, \ldots \quad (11)$$
Remark (Pascal's rule):
$$\binom{n}{d_1} = \binom{n-1}{d_1 - 1} + \binom{n-1}{d_1}, \quad 0 < d_1 < n$$
To interpret this
Recall the negative binomial distribution.
It is a repeated sequence of Bernoulli trials, with k failures preceding the rth success.
Separating Capacity of a Surface
Thus, we have that
Given p and q, the probabilities of success and failure, respectively, with p + q = 1.
Definition
$$p(K = k \mid p, q) = \binom{r + k - 1}{k} p^r q^k \quad (12)$$
What happens when $p = q = \frac{1}{2}$ and $k + r = n$?
Any idea?
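As a quick numeric answer to the question above: with $p = q = \frac{1}{2}$ and $n = k + r$, (Eq. 12) reduces exactly to (Eq. 11) with $r = d_1$. A short sketch (the values of n and $d_1$ are arbitrary):

```python
from math import comb

d1, n = 4, 10                                     # r = d1 successes, k = n - r failures
r, k = d1, n - d1
lhs = comb(r + k - 1, k) * 0.5 ** r * 0.5 ** k    # Eq. 12 with p = q = 1/2
rhs = 0.5 ** n * comb(n - 1, d1 - 1)              # Eq. 11
print(lhs, rhs, abs(lhs - rhs) < 1e-12)           # the two agree
```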
Separating Capacity of a Surface
Thus
(Eq. 11) is just the negative binomial distribution shifted $d_1$ units to the right, with parameters $d_1$ and $\frac{1}{2}$.
Finally
N corresponds to the "waiting time" for the $d_1$th failure in a sequence of tosses of a fair coin.
We have then
$$E[N] = 2d_1, \quad \text{Median}[N] = 2d_1$$
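The expectation $E[N] = 2d_1$ can likewise be verified numerically from (Eq. 11). A minimal sketch, truncating the infinite sum (which converges quickly):

```python
from math import comb

def prob_N(n, d1):
    # Prob(N = n) = (1/2)^n * C(n-1, d1-1)   (Eq. 11)
    return 0.5 ** n * comb(n - 1, d1 - 1)

d1 = 5
expectation = sum(n * prob_N(n, d1) for n in range(d1, 500))
print(round(expectation, 6))   # approximately 2 * d1 = 10
```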
This allows us to define the Corollary to Cover's Theorem
A celebrated asymptotic result
The expected maximum number of randomly assigned patterns (vectors) that are linearly separable in a space of dimensionality $d_1$ is equal to $2d_1$.
Something Notable
This result suggests that $2d_1$ is a natural definition of the separating capacity of a family of decision surfaces having $d_1$ degrees of freedom.
Given a problem of non-linearly separable patterns
It is possible to see that
There is a benefit to be gained by mapping the input space into a new space of high enough dimension.
For this, we use a non-linear map
This is quite similar to solving a difficult non-linear filtering problem by mapping it into a high-dimensional space and then solving it as a linear filtering problem.
Take into consideration the following architecture
Mapping from input space to hidden space, followed by a linear mapping to output space!!!
[Figure: input nodes feeding nonlinear hidden nodes, followed by a single linear output node]
This can be seen as
We have the following map
$$s : \mathbb{R}^{d_0} \to \mathbb{R} \quad (13)$$
Therefore
We may think of s as a hypersurface (graph) $\Gamma \subset \mathbb{R}^{d_0 + 1}$.
Example
[Figure: the red planes represent the mappings and the gray plane is the linear separator]
General Idea
First
The training phase constitutes the optimization of a fitting procedure for the surface Γ.
It is based on the known data points given as input-output patterns.
Second
The generalization phase is synonymous with interpolation between the data points.
The interpolation is performed along the constrained surface generated by the fitting procedure.
This leads to the theory of multi-variable interpolation
Interpolation Problem
Given a set of N different points $\{x_i \in \mathbb{R}^{d_0} \mid i = 1, 2, \ldots, N\}$ and a corresponding set of N real numbers $\{d_i \in \mathbb{R} \mid i = 1, 2, \ldots, N\}$, find a function $F : \mathbb{R}^{d_0} \to \mathbb{R}$ that satisfies the interpolation condition:
$$F(x_i) = d_i, \quad i = 1, 2, \ldots, N \quad (14)$$
Remark
For strict interpolation as specified here, the interpolating surface is constrained to pass through all the training data points.
Radial-Basis Functions (RBF)
The function F has the following form (Powell, 1988)
$$F(x) = \sum_{i=1}^{N} w_i\, \phi\left(\|x - x_i\|\right) \quad (15)$$
Where
$\{\phi(\|x - x_i\|) \mid i = 1, \ldots, N\}$ is a set of N arbitrary, generally non-linear, functions, known as radial-basis functions, and $\|\cdot\|$ denotes a norm that is usually Euclidean.
In addition
The known data points $x_i \in \mathbb{R}^{d_0}$, $i = 1, 2, \ldots, N$, are taken to be the centers of the radial-basis functions.
A Set of Simultaneous Linear Equations
Given
$$\phi_{ji} = \phi\left(\|x_j - x_i\|\right), \quad (j, i) = 1, 2, \ldots, N \quad (16)$$
Using (Eq. 14) and (Eq. 15), we get
$$\begin{pmatrix} \phi_{11} & \phi_{12} & \cdots & \phi_{1N} \\ \phi_{21} & \phi_{22} & \cdots & \phi_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{N1} & \phi_{N2} & \cdots & \phi_{NN} \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{pmatrix} = \begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{pmatrix} \quad (17)$$
Now
We can create the following vectors
$d = [d_1, d_2, \ldots, d_N]^T$ (response vector).
$w = [w_1, w_2, \ldots, w_N]^T$ (linear weight vector).
Now, we define an N × N matrix called the interpolation matrix
$$\Phi = \{\phi_{ji} \mid (j, i) = 1, 2, \ldots, N\} \quad (18)$$
Thus, we have
$$\Phi w = d \quad (19)$$
From here
Assuming that Φ is a non-singular matrix
$$w = \Phi^{-1} d \quad (20)$$
Question
How can we be sure that the interpolation matrix Φ is non-singular?
Answer
It turns out that, for a large class of radial-basis functions and under certain conditions, the interpolation matrix is indeed non-singular!!!
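Putting (Eq. 15)-(Eq. 20) together, strict interpolation is a few lines of linear algebra. A minimal sketch, assuming a Gaussian radial-basis function and using the XOR patterns as illustrative data (the 0/1 targets are an assumed example, not from the slides):

```python
import numpy as np

# Example data: the four XOR patterns and binary targets (illustrative choice).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([0., 1., 1., 0.])

def phi(r, sigma=1.0):
    # One common choice of RBF: phi(r) = exp(-r^2 / (2 sigma^2)).
    return np.exp(-r ** 2 / (2.0 * sigma ** 2))

# Interpolation matrix Phi[j, i] = phi(||x_j - x_i||)   (Eq. 16, Eq. 17)
Phi = phi(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1))

# Strict interpolation: solve Phi w = d   (Eq. 19, Eq. 20)
w = np.linalg.solve(Phi, d)

def F(x):
    # F(x) = sum_i w_i phi(||x - x_i||)   (Eq. 15)
    return phi(np.linalg.norm(np.asarray(x) - X, axis=-1)) @ w

print([round(float(F(x)), 6) for x in X])   # reproduces the targets d
```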
Introduction
Observation
The strict interpolation procedure described may not be a good strategy
for the training of RBF networks for certain classes of tasks.
Reason
The number of data points may be much larger than the number of degrees of freedom of the underlying physical process.
Thus
The network may end up fitting misleading variations due to idiosyncrasies or noise in the input data.
Well-posed
The Problem
Assume that we have a domain X and a range Y, both metric spaces.
They are related by a mapping
$$f : X \to Y \quad (21)$$
Definition
The problem of reconstructing the mapping f is said to be well-posed if three conditions are satisfied: existence, uniqueness and continuity.
Defining the meaning of this
Existence
For every input vector x ∈ X, there exists an output y = f(x), where y ∈ Y.
Uniqueness
For any pair of input vectors x, t ∈ X, we have f(x) = f(t) if and only if x = t.
Continuity
The mapping is continuous if for any ε > 0 there exists δ > 0 such that the condition $d_X(x, t) < \delta$ implies $d_Y(f(x), f(t)) < \varepsilon$.
Basically
Example
[Figure: a well-posed mapping from the domain X to the range Y]
Ill-Posed
Therefore
If any of these conditions is not satisfied, the problem is said to be
ill-posed.
Basically
An ill-posed problem means that large data sets may contain a
surprisingly small amount of information about the desired solution.
Learning from data
Rebuilding the physical phenomenon using the samples
[Figure: samples drawn from a physical phenomenon, from which the mapping must be rebuilt]
We have the following
Physical Phenomena
Speech, pictures, radar signals, sonar signals, seismic data.
The physical phenomenon itself is well-posed
But learning from such data, i.e. rebuilding the hypersurface, can be an ill-posed inverse problem.
Why
First
The existence criterion may be violated in that a distinct output may not
exist for every input.
Second
There may not be as much information in the training sample as we really
need to reconstruct the input-output mapping uniquely.
Third
The unavoidable presence of noise or imprecision in real-life training data
adds uncertainty to the reconstructed input-output mapping.
The noise problem
Getting out of the range
[Figure: with additive noise, the mapping can fall outside the original range]
How?
This can happen when
There is a lack of information!!!
Lanczos, 1964
“A lack of information cannot be remedied by any mathematical trickery.”
How do we solve the problem?
Something Notable
In 1963, Tikhonov proposed a new method called regularization for solving ill-posed problems.
Tikhonov
He was a Soviet and Russian mathematician known for important contributions to topology, functional analysis, mathematical physics, and ill-posed problems.
Also Known as Ridge Regression
Setup
We have:
Input signal $\{x_i \in \mathbb{R}^{d_0}\}_{i=1}^{N}$.
Output signal $\{d_i \in \mathbb{R}\}_{i=1}^{N}$.
In addition
Note that the output is assumed to be one-dimensional.
Now, assuming that you have an approximation function y = F(x)
Standard Error Term
$$E_s(F) = \frac{1}{2} \sum_{i=1}^{N} (d_i - y_i)^2 = \frac{1}{2} \sum_{i=1}^{N} (d_i - F(x_i))^2 \quad (22)$$
Regularization Term
$$E_c(F) = \frac{1}{2} \|DF\|^2 \quad (23)$$
Where
D is a linear differential operator.
Now
Ordinarily, y = F(x)
Normally, the function space representing the functional F is the $L^2$ space that consists of all real-valued functions f(x) with $x \in \mathbb{R}^{d_0}$.
The quantity to be minimized in regularization theory is
$$E(f) = \frac{1}{2} \sum_{i=1}^{N} (d_i - f(x_i))^2 + \frac{\lambda}{2} \|Df\|^2 \quad (24)$$
Where
λ is a positive real number called the regularization parameter.
E(f) is called the Tikhonov functional.
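In the simplest finite-dimensional setting, minimizing (Eq. 24) with the penalty $\|Df\|^2$ replaced by a plain squared norm on the weights (the classic ridge simplification, not the full differential-operator theory) has a closed-form solution. A minimal sketch under that assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
d = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)  # noisy targets

# Gaussian RBF design matrix with the data points as centers.
sigma = 0.1
Phi = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

lam = 1e-2  # regularization parameter lambda
# Minimize (1/2)||d - Phi w||^2 + (lam/2)||w||^2 in closed form:
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(x.size), Phi.T @ d)

# Strict interpolation (lam = 0) would reproduce the noise exactly;
# lam > 0 trades a little training error for a smoother fit.
print(round(float(np.mean((Phi @ w - d) ** 2)), 6))
```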
Introduction
What did we see until now?
The design of learning machines from two main points of view:
Statistical Point of View
Linear Algebra and Optimization Point of View
Going back to the probability models
We might think of the machine to be learned as a function g(x|D)... something like curve fitting...
Under a data set
$$D = \{(x_i, y_i) \mid i = 1, 2, \ldots, N\} \quad (25)$$
Remark: where the $x_i \sim p(x|\Theta)$!!!
Thus, we have that
Two main functions
A function g(x|D), obtained using some algorithm!!!
E[y|x], the optimal regression...
Important
The key factor here is the dependence of the approximation on D.
Why?
The approximation may be very good for a specific training data set but very bad for another.
This is the reason for studying fusion of information at the decision level...
How do we measure the difference
We have that
$$\text{Var}(X) = E\left[(X - \mu)^2\right]$$
We can do that for our data
$$\text{Var}_D(g(x|D)) = E_D\left[(g(x|D) - E[y|x])^2\right]$$
Now, we add and subtract
$$E_D[g(x|D)] \quad (26)$$
Remark: the expected output of the machine g(x|D).
75 / 96
How do we measure the difference
We have that
Var(X) = E((X − µ)2
)
We can do that for our data
VarD (g (x|D)) = ED (g (x|D) − E [y|x])2
Now, if we add and subtract
ED [g (x|D)] (26)
Remark: The expected output of the machine g (x|D)
75 / 96
How do we measure the difference
We have that
Var(X) = E((X − µ)2
)
We can do that for our data
VarD (g (x|D)) = ED (g (x|D) − E [y|x])2
Now, if we add and subtract
ED [g (x|D)] (26)
Remark: The expected output of the machine g (x|D)
75 / 96
Thus, we have that
Our original variance
$$\begin{aligned} \text{Var}_D(g(x|D)) &= E_D\left[(g(x|D) - E[y|x])^2\right] \\ &= E_D\left[(g(x|D) - E_D[g(x|D)] + E_D[g(x|D)] - E[y|x])^2\right] \\ &= E_D\left[(g(x|D) - E_D[g(x|D)])^2\right] \\ &\quad + 2\, E_D\left[(g(x|D) - E_D[g(x|D)])(E_D[g(x|D)] - E[y|x])\right] \\ &\quad + (E_D[g(x|D)] - E[y|x])^2 \end{aligned}$$
Finally
$$E_D\left[(g(x|D) - E_D[g(x|D)])(E_D[g(x|D)] - E[y|x])\right] = ? \quad (27)$$
(Hint: $E_D[g(x|D)] - E[y|x]$ is constant with respect to D, and $E_D[g(x|D) - E_D[g(x|D)]] = 0$, so this cross term vanishes.)
We have the Bias-Variance
Our Final Equation
E_D\left[(g(x|D) - E[y|x])^2\right] = \underbrace{E_D\left[(g(x|D) - E_D[g(x|D)])^2\right]}_{\text{VARIANCE}} + \underbrace{(E_D[g(x|D)] - E[y|x])^2}_{\text{BIAS}}
Where the variance
It represents the error between our machine g (x|D) and the expected output
of the machine under x_i \sim p(x|\Theta).
Where the bias
It represents the quadratic error between the expected output of the machine
under x_i \sim p(x|\Theta) and the optimal regression E[y|x].
77 / 96
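To make the decomposition concrete, here is a small simulation sketch in Python (the target sin(2πx), the noise level, the polynomial models, and all names are illustrative assumptions, not part of the slides). It draws many training sets D, fits g(·|D) on each, and estimates the variance and squared-bias terms empirically:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_regression(x):
    # E[y|x]: the optimal regression (an assumed target for this illustration)
    return np.sin(2 * np.pi * x)

def sample_dataset(n=25, sigma=0.3):
    # Draw one training set D: x_i ~ Uniform(0,1), y_i = E[y|x_i] + noise
    x = rng.uniform(0.0, 1.0, n)
    y = true_regression(x) + rng.normal(0.0, sigma, n)
    return x, y

def fit_and_predict(x, y, x_test, degree):
    # g(x|D): a polynomial least-squares fit on this particular D
    return np.polyval(np.polyfit(x, y, degree), x_test)

x_test = np.linspace(0.0, 1.0, 50)
for degree in (1, 9):
    # 500 independent draws of D approximate the expectation E_D[.]
    preds = np.array([fit_and_predict(*sample_dataset(), x_test, degree)
                      for _ in range(500)])
    mean_pred = preds.mean(axis=0)                    # estimate of E_D[g(x|D)]
    variance = ((preds - mean_pred) ** 2).mean()      # E_D[(g - E_D[g])^2]
    bias_sq = ((mean_pred - true_regression(x_test)) ** 2).mean()
    print(f"degree {degree}: variance = {variance:.4f}, bias^2 = {bias_sq:.4f}")
```

A low-order model typically shows large bias and small variance; a high-order model shows the reverse, which is exactly the dilemma described above.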
Outline
1 Introduction
Main Idea
Basic Radial-Basis Functions
2 Separability
Cover’s Theorem on the separability of patterns
Dichotomy
φ-separable functions
The Stochastic Experiment
The XOR Problem
Separating Capacity of a Surface
3 Interpolation Problem
What is gained?
Feedforward Network
Learning Process
Radial-Basis Functions (RBF)
4 Introduction
Description of the Problem
Well-posed or ill-posed
The Main Problem
5 Regularization Theory
Solving the issue
Bias-Variance Dilemma
Measuring the difference between optimal and learned
The Bias-Variance
How can we use this?
Getting a solution
We still need to talk about...
78 / 96
Using this in our favor!!!
Something Notable
Introducing bias is equivalent to restricting the range of functions for
which a model can account.
Typically this is achieved by removing degrees of freedom.
Examples
Examples would be lowering the order of a polynomial or reducing the number
of weights in a neural network!!!
Ridge Regression
It does not explicitly remove degrees of freedom, but instead reduces the
effective number of parameters.
79 / 96
Example
In the case of a linear regression model
C(w) = \sum_{i=1}^{N} \left(d_i - w^T x_i\right)^2 + \lambda \sum_{j=1}^{d_0} w_j^2 \quad (28)
Thus
This is ridge regression (weight decay), and the regularization
parameter λ > 0 controls the balance between fitting the data and
avoiding the penalty.
A small value of λ means the data can be fit tightly without causing
a large penalty.
A large value of λ means a tight fit has to be sacrificed if it requires
large weights.
80 / 96
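A minimal sketch of equation (28) in code (Python; the data, true weights, and λ values are made up for illustration), solving the penalized least-squares problem in closed form and showing how larger λ shrinks the weights:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (illustrative): N samples, d0 features, d_i = w_true^T x_i + noise
N, d0 = 50, 5
X = rng.normal(size=(N, d0))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
d = X @ w_true + rng.normal(0.0, 0.5, N)

def ridge(X, d, lam):
    # Minimizer of C(w) = sum_i (d_i - w^T x_i)^2 + lam * sum_j w_j^2,
    # i.e. w = (X^T X + lam * I)^{-1} X^T d
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ d)

for lam in (0.0, 1.0, 100.0):
    w = ridge(X, d, lam)
    print(f"lambda = {lam:6.1f}  ||w||^2 = {np.sum(w**2):7.3f}")
```

As λ grows, ‖w‖² shrinks and the fitted function is smoothed, which is the trade-off described above.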
Important
The Bias
It favors solutions involving small weights and the effect is to smooth the
output function.
81 / 96
Outline
1 Introduction
Main Idea
Basic Radial-Basis Functions
2 Separability
Cover’s Theorem on the separability of patterns
Dichotomy
φ-separable functions
The Stochastic Experiment
The XOR Problem
Separating Capacity of a Surface
3 Interpolation Problem
What is gained?
Feedforward Network
Learning Process
Radial-Basis Functions (RBF)
4 Introduction
Description of the Problem
Well-posed or ill-posed
The Main Problem
5 Regularization Theory
Solving the issue
Bias-Variance Dilemma
Measuring the difference between optimal and learned
The Bias-Variance
How can we use this?
Getting a solution
We still need to talk about...
82 / 96
Now, we can carry out the optimization
First, we rewrite the cost function in the following way
S(w) = \sum_{i=1}^{N} \left(d_i - f(x_i)\right)^2 \quad (29)
And we will use a generalized version for f
f(x_i) = \sum_{j=1}^{d_1} w_j \phi_j(x_i) \quad (30)
Where
The free variables are the weights \{w_j\}_{j=1}^{d_1}.
83 / 96
Where
For φ_j (x_i), in our case, we may take the Gaussian
\phi_j(x_i) = \phi(x_i, x_j) \quad (31)
With
\phi(x, x_j) = \exp\left(-\frac{1}{2\sigma^2}\, \|x - x_j\|^2\right) \quad (32)
84 / 96
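Equations (31)-(32) in code (a sketch in Python; the sample point, centers, and width σ are illustrative choices): each basis function is a Gaussian bump centered on one of the training inputs:

```python
import numpy as np

def gaussian_rbf(x, center, sigma=1.0):
    # phi(x, x_j) = exp(-||x - x_j||^2 / (2 sigma^2)), equation (32)
    diff = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

x = np.array([0.5, 1.0])
centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
print([gaussian_rbf(x, c) for c in centers])  # one activation per basis function
```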
Thus
Final cost function, assuming there is a regularization term per weight
C(w, \lambda) = \sum_{i=1}^{N} \left(d_i - f(x_i)\right)^2 + \sum_{j=1}^{d_1} \lambda_j w_j^2 \quad (33)
What do we do?
1 Differentiate the function with respect to the free variables.
2 Equate the results with zero.
3 Solve the resulting equations.
85 / 96
Differentiate the function with respect to the free variables.
First
\frac{\partial C(w, \lambda)}{\partial w_j} = -2 \sum_{i=1}^{N} \left(d_i - f(x_i)\right) \frac{\partial f(x_i)}{\partial w_j} + 2\lambda_j w_j \quad (34)
We get the differential \frac{\partial f(x_i)}{\partial w_j} from (30)
\frac{\partial f(x_i)}{\partial w_j} = \phi_j(x_i) \quad (35)
86 / 96
Now
We have then, after setting (34) to zero,
\sum_{i=1}^{N} f(x_i)\, \phi_j(x_i) + \lambda_j w_j = \sum_{i=1}^{N} d_i\, \phi_j(x_i) \quad (36)
Something Notable
There are d_1 such equations, one for each 1 ≤ j ≤ d_1, each representing one
constraint on the solution.
Since there are exactly as many constraints as there are unknowns, this
system of equations has, except under certain pathological conditions, a
unique solution.
87 / 96
Using Our Linear Algebra
We have then
\phi_j^T f + \lambda_j w_j = \phi_j^T d \quad (37)
Where
\phi_j = \begin{pmatrix} \phi_j(x_1) \\ \phi_j(x_2) \\ \vdots \\ \phi_j(x_N) \end{pmatrix}, \quad
f = \begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_N) \end{pmatrix}, \quad
d = \begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{pmatrix} \quad (38)
88 / 96
Now
Since there is one of these equations for each j, each relating one scalar
quantity to another, we can stack them
\begin{pmatrix} \phi_1^T f \\ \phi_2^T f \\ \vdots \\ \phi_{d_1}^T f \end{pmatrix}
+ \begin{pmatrix} \lambda_1 w_1 \\ \lambda_2 w_2 \\ \vdots \\ \lambda_{d_1} w_{d_1} \end{pmatrix}
= \begin{pmatrix} \phi_1^T d \\ \phi_2^T d \\ \vdots \\ \phi_{d_1}^T d \end{pmatrix} \quad (39)
Now, if we define
\Phi = \begin{pmatrix} \phi_1 & \phi_2 & \cdots & \phi_{d_1} \end{pmatrix} \quad (40)
Written in full form
\Phi = \begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_{d_1}(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_{d_1}(x_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_1(x_N) & \phi_2(x_N) & \cdots & \phi_{d_1}(x_N)
\end{pmatrix} \quad (41)
89 / 96
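The design matrix of equation (41) is straightforward to build in code (a Python sketch with the same Gaussian basis assumed above; centers are taken at the training inputs, as in equation (31)):

```python
import numpy as np

def design_matrix(X, centers, sigma=1.0):
    # Phi[i, j] = phi_j(x_i) = exp(-||x_i - c_j||^2 / (2 sigma^2)), equation (41)
    X = np.asarray(X, dtype=float)
    C = np.asarray(centers, dtype=float)
    sq_dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # shape (N, d1)
    return np.exp(-sq_dists / (2.0 * sigma**2))

X = np.array([[0.0], [0.5], [1.0]])   # N = 3 training inputs
Phi = design_matrix(X, centers=X)     # d1 = N when centers are the training inputs
print(Phi.shape)                      # (3, 3)
```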
We can then
Define the following matrix equation
\Phi^T f + \Lambda w = \Phi^T d \quad (42)
Where
\Lambda = \begin{pmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_{d_1}
\end{pmatrix} \quad (43)
90 / 96
Now, we have that
The vector f can be decomposed into the product of two terms:
the design matrix and the weight vector.
We have then
f_i = f(x_i) = \sum_{j=1}^{d_1} w_j \phi_j(x_i) = \tilde{\phi}_i^T w \quad (44)
Where \tilde{\phi}_i is the i-th row of \Phi,
\tilde{\phi}_i = \begin{pmatrix} \phi_1(x_i) \\ \phi_2(x_i) \\ \vdots \\ \phi_{d_1}(x_i) \end{pmatrix} \quad (45)
91 / 96
Furthermore
We get that
f = \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_N \end{pmatrix}
= \begin{pmatrix} \tilde{\phi}_1^T w \\ \tilde{\phi}_2^T w \\ \vdots \\ \tilde{\phi}_N^T w \end{pmatrix}
= \Phi w \quad (46)
Finally, we have that
\begin{aligned}
\Phi^T d &= \Phi^T f + \Lambda w \\
&= \Phi^T \Phi w + \Lambda w \\
&= \left(\Phi^T \Phi + \Lambda\right) w
\end{aligned}
92 / 96
Now...
We get finally
w = \left(\Phi^T \Phi + \Lambda\right)^{-1} \Phi^T d \quad (47)
Remember
This equation is the most general form of the normal equation.
We have two cases
In standard ridge regression, \lambda_j = \lambda for 1 ≤ j ≤ d_1.
Ordinary least squares, where there is no weight penalty, i.e. all \lambda_j = 0
for 1 ≤ j ≤ d_1.
93 / 96
Thus, we have
First Case
w = \left(\Phi^T \Phi + \lambda I_{d_1}\right)^{-1} \Phi^T d \quad (48)
Second Case
w = \left(\Phi^T \Phi\right)^{-1} \Phi^T d \quad (49)
94 / 96
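Putting equations (47)-(49) together (a sketch in Python; the toy data, the width σ, and the λ values are illustrative assumptions): build the design matrix, solve the generalized normal equation, and compare the ridge and ordinary-least-squares cases:

```python
import numpy as np

rng = np.random.default_rng(2)

def design_matrix(X, centers, sigma):
    # Phi[i, j] = exp(-||x_i - c_j||^2 / (2 sigma^2)), as in equation (41)
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma**2))

# Toy 1-D data (illustrative): noisy samples of a target curve
X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
d = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0.0, 0.1, 20)

Phi = design_matrix(X, X, sigma=0.1)  # centers at the training inputs
d1 = Phi.shape[1]

def solve_weights(Phi, d, lambdas):
    # w = (Phi^T Phi + Lambda)^{-1} Phi^T d, equation (47)
    return np.linalg.solve(Phi.T @ Phi + np.diag(lambdas), Phi.T @ d)

w_ridge = solve_weights(Phi, d, np.full(d1, 1e-2))  # case (48): all lambda_j = lambda
w_ols = solve_weights(Phi, d, np.zeros(d1))         # case (49): all lambda_j = 0
print(f"||w_ridge||^2 = {np.sum(w_ridge**2):.3f}, ||w_ols||^2 = {np.sum(w_ols**2):.3f}")
```

With Gaussian centers at every training input, Φ^T Φ can be badly conditioned, so the OLS weights tend to blow up while the ridge penalty keeps them small, illustrating the smoothing effect of the bias.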
Outline
1 Introduction
Main Idea
Basic Radial-Basis Functions
2 Separability
Cover’s Theorem on the separability of patterns
Dichotomy
φ-separable functions
The Stochastic Experiment
The XOR Problem
Separating Capacity of a Surface
3 Interpolation Problem
What is gained?
Feedforward Network
Learning Process
Radial-Basis Functions (RBF)
4 Introduction
Description of the Problem
Well-posed or ill-posed
The Main Problem
5 Regularization Theory
Solving the issue
Bias-Variance Dilemma
Measuring the difference between optimal and learned
The Bias-Variance
How can we use this?
Getting a solution
We still need to talk about...
95 / 96
There are still several things that we need to look at...
First
What is the variance of the weight vector? The Variance Matrix.
Second
The prediction of the output at any of the training set inputs - The
Projection Matrix
Finally
The incremental algorithm for the problem!!!
96 / 96

Weitere ähnliche Inhalte

Was ist angesagt?

Machine learning in science and industry — day 1
Machine learning in science and industry — day 1Machine learning in science and industry — day 1
Machine learning in science and industry — day 1arogozhnikov
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Zihui Li
 
Machine learning in science and industry — day 2
Machine learning in science and industry — day 2Machine learning in science and industry — day 2
Machine learning in science and industry — day 2arogozhnikov
 
Machine Learning Algorithms Review(Part 2)
Machine Learning Algorithms Review(Part 2)Machine Learning Algorithms Review(Part 2)
Machine Learning Algorithms Review(Part 2)Zihui Li
 
27 Machine Learning Unsupervised Measure Properties
27 Machine Learning Unsupervised Measure Properties27 Machine Learning Unsupervised Measure Properties
27 Machine Learning Unsupervised Measure PropertiesAndres Mendez-Vazquez
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machinesnextlib
 
Probabilistic PCA, EM, and more
Probabilistic PCA, EM, and moreProbabilistic PCA, EM, and more
Probabilistic PCA, EM, and morehsharmasshare
 
. An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic .... An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic ...butest
 
Machine learning in science and industry — day 3
Machine learning in science and industry — day 3Machine learning in science and industry — day 3
Machine learning in science and industry — day 3arogozhnikov
 
Linear Algebra – A Powerful Tool for Data Science
Linear Algebra – A Powerful Tool for Data ScienceLinear Algebra – A Powerful Tool for Data Science
Linear Algebra – A Powerful Tool for Data SciencePremier Publishers
 
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...Masahiro Suzuki
 
Image Classification And Support Vector Machine
Image Classification And Support Vector MachineImage Classification And Support Vector Machine
Image Classification And Support Vector MachineShao-Chuan Wang
 
2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revisedKrish_ver2
 
Instance based learning
Instance based learningInstance based learning
Instance based learningswapnac12
 
From RNN to neural networks for cyclic undirected graphs
From RNN to neural networks for cyclic undirected graphsFrom RNN to neural networks for cyclic undirected graphs
From RNN to neural networks for cyclic undirected graphstuxette
 
A short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction modelsA short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction modelstuxette
 

Was ist angesagt? (20)

Machine learning in science and industry — day 1
Machine learning in science and industry — day 1Machine learning in science and industry — day 1
Machine learning in science and industry — day 1
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
 
Machine learning in science and industry — day 2
Machine learning in science and industry — day 2Machine learning in science and industry — day 2
Machine learning in science and industry — day 2
 
Machine Learning Algorithms Review(Part 2)
Machine Learning Algorithms Review(Part 2)Machine Learning Algorithms Review(Part 2)
Machine Learning Algorithms Review(Part 2)
 
27 Machine Learning Unsupervised Measure Properties
27 Machine Learning Unsupervised Measure Properties27 Machine Learning Unsupervised Measure Properties
27 Machine Learning Unsupervised Measure Properties
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
Probabilistic PCA, EM, and more
Probabilistic PCA, EM, and moreProbabilistic PCA, EM, and more
Probabilistic PCA, EM, and more
 
Iclr2016 vaeまとめ
Iclr2016 vaeまとめIclr2016 vaeまとめ
Iclr2016 vaeまとめ
 
. An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic .... An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic ...
 
Svm
SvmSvm
Svm
 
Machine learning in science and industry — day 3
Machine learning in science and industry — day 3Machine learning in science and industry — day 3
Machine learning in science and industry — day 3
 
Polynomial Matrix Decompositions
Polynomial Matrix DecompositionsPolynomial Matrix Decompositions
Polynomial Matrix Decompositions
 
Estimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample SetsEstimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample Sets
 
Linear Algebra – A Powerful Tool for Data Science
Linear Algebra – A Powerful Tool for Data ScienceLinear Algebra – A Powerful Tool for Data Science
Linear Algebra – A Powerful Tool for Data Science
 
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
 
Image Classification And Support Vector Machine
Image Classification And Support Vector MachineImage Classification And Support Vector Machine
Image Classification And Support Vector Machine
 
2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised
 
Instance based learning
Instance based learningInstance based learning
Instance based learning
 
From RNN to neural networks for cyclic undirected graphs
From RNN to neural networks for cyclic undirected graphsFrom RNN to neural networks for cyclic undirected graphs
From RNN to neural networks for cyclic undirected graphs
 
A short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction modelsA short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction models
 

Andere mochten auch

Implementation of Back-Propagation Neural Network using Scilab and its Conver...
Implementation of Back-Propagation Neural Network using Scilab and its Conver...Implementation of Back-Propagation Neural Network using Scilab and its Conver...
Implementation of Back-Propagation Neural Network using Scilab and its Conver...IJEEE
 
Radial Basis Function Interpolation
Radial Basis Function InterpolationRadial Basis Function Interpolation
Radial Basis Function InterpolationJesse Bettencourt
 
Introduction to Radial Basis Function Networks
Introduction to Radial Basis Function NetworksIntroduction to Radial Basis Function Networks
Introduction to Radial Basis Function NetworksESCOM
 
Radial Basis Function Network (RBFN)
Radial Basis Function Network (RBFN)Radial Basis Function Network (RBFN)
Radial Basis Function Network (RBFN)ahmad haidaroh
 
Radial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and DhanashriRadial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and Dhanashrisheetal katkar
 
Back propagation network
Back propagation networkBack propagation network
Back propagation networkHIRA Zaidi
 
The Back Propagation Learning Algorithm
The Back Propagation Learning AlgorithmThe Back Propagation Learning Algorithm
The Back Propagation Learning AlgorithmESCOM
 
Back propagation
Back propagationBack propagation
Back propagationNagarajan
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Derek Kane
 

Andere mochten auch (10)

Implementation of Back-Propagation Neural Network using Scilab and its Conver...
Implementation of Back-Propagation Neural Network using Scilab and its Conver...Implementation of Back-Propagation Neural Network using Scilab and its Conver...
Implementation of Back-Propagation Neural Network using Scilab and its Conver...
 
Radial Basis Function Interpolation
Radial Basis Function InterpolationRadial Basis Function Interpolation
Radial Basis Function Interpolation
 
Introduction to Radial Basis Function Networks
Introduction to Radial Basis Function NetworksIntroduction to Radial Basis Function Networks
Introduction to Radial Basis Function Networks
 
Radial Basis Function Network (RBFN)
Radial Basis Function Network (RBFN)Radial Basis Function Network (RBFN)
Radial Basis Function Network (RBFN)
 
Radial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and DhanashriRadial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and Dhanashri
 
Backpropagation algo
Backpropagation  algoBackpropagation  algo
Backpropagation algo
 
Back propagation network
Back propagation networkBack propagation network
Back propagation network
 
The Back Propagation Learning Algorithm
The Back Propagation Learning AlgorithmThe Back Propagation Learning Algorithm
The Back Propagation Learning Algorithm
 
Back propagation
Back propagationBack propagation
Back propagation
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
 

Ähnlich wie RBF Neural Networks: Cover's Theorem and Pattern Separability

Section5 Rbf
Section5 RbfSection5 Rbf
Section5 Rbfkylin
 
02.03 Artificial Intelligence: Search by Optimization
02.03 Artificial Intelligence: Search by Optimization02.03 Artificial Intelligence: Search by Optimization
02.03 Artificial Intelligence: Search by OptimizationAndres Mendez-Vazquez
 
17- Kernels and Clustering.pptx
17- Kernels and Clustering.pptx17- Kernels and Clustering.pptx
17- Kernels and Clustering.pptxssuser2023c6
 
MachineLearning.ppt
MachineLearning.pptMachineLearning.ppt
MachineLearning.pptbutest
 
MachineLearning.ppt
MachineLearning.pptMachineLearning.ppt
MachineLearning.pptbutest
 
MachineLearning.ppt
MachineLearning.pptMachineLearning.ppt
MachineLearning.pptbutest
 
Lecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning RecapLecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning Recapbutest
 
Composing graphical models with neural networks for structured representatio...
Composing graphical models with  neural networks for structured representatio...Composing graphical models with  neural networks for structured representatio...
Composing graphical models with neural networks for structured representatio...Jeongmin Cha
 
Learning to Search Henry Kautz
Learning to Search Henry KautzLearning to Search Henry Kautz
Learning to Search Henry Kautzbutest
 
Learning to Search Henry Kautz
Learning to Search Henry KautzLearning to Search Henry Kautz
Learning to Search Henry Kautzbutest
 
Histogram-Based Method for Effective Initialization of the K-Means Clustering...
Histogram-Based Method for Effective Initialization of the K-Means Clustering...Histogram-Based Method for Effective Initialization of the K-Means Clustering...
Histogram-Based Method for Effective Initialization of the K-Means Clustering...Gingles Caroline
 
Group 9 genetic-algorithms (1)
Group 9 genetic-algorithms (1)Group 9 genetic-algorithms (1)
Group 9 genetic-algorithms (1)lakshmi.ec
 
Certified global minima
Certified global minimaCertified global minima
Certified global minimassuserfa7e73
 
Deep learning Unit1 BasicsAllllllll.pptx
Deep learning Unit1 BasicsAllllllll.pptxDeep learning Unit1 BasicsAllllllll.pptx
Deep learning Unit1 BasicsAllllllll.pptxFreefireGarena30
 

Ähnlich wie RBF Neural Networks: Cover's Theorem and Pattern Separability (20)

Section5 Rbf
Section5 RbfSection5 Rbf
Section5 Rbf
 
02.03 Artificial Intelligence: Search by Optimization
02.03 Artificial Intelligence: Search by Optimization02.03 Artificial Intelligence: Search by Optimization
02.03 Artificial Intelligence: Search by Optimization
 
17- Kernels and Clustering.pptx
17- Kernels and Clustering.pptx17- Kernels and Clustering.pptx
17- Kernels and Clustering.pptx
 
MachineLearning.ppt
MachineLearning.pptMachineLearning.ppt
MachineLearning.ppt
 
MachineLearning.ppt
MachineLearning.pptMachineLearning.ppt
MachineLearning.ppt
 
MachineLearning.ppt
MachineLearning.pptMachineLearning.ppt
MachineLearning.ppt
 
Knapsack problem using fixed tuple
Knapsack problem using fixed tupleKnapsack problem using fixed tuple
Knapsack problem using fixed tuple
 
DNN_M3_Optimization.pdf
DNN_M3_Optimization.pdfDNN_M3_Optimization.pdf
DNN_M3_Optimization.pdf
 
modeling.ppt
modeling.pptmodeling.ppt
modeling.ppt
 
Lecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning RecapLecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning Recap
 
Composing graphical models with neural networks for structured representatio...
Composing graphical models with  neural networks for structured representatio...Composing graphical models with  neural networks for structured representatio...
Composing graphical models with neural networks for structured representatio...
 
Learning to Search Henry Kautz
Learning to Search Henry KautzLearning to Search Henry Kautz
Learning to Search Henry Kautz
 
Learning to Search Henry Kautz
Learning to Search Henry KautzLearning to Search Henry Kautz
Learning to Search Henry Kautz
 
Histogram-Based Method for Effective Initialization of the K-Means Clustering...
Histogram-Based Method for Effective Initialization of the K-Means Clustering...Histogram-Based Method for Effective Initialization of the K-Means Clustering...
Histogram-Based Method for Effective Initialization of the K-Means Clustering...
 
Group 9 genetic-algorithms (1)
Group 9 genetic-algorithms (1)Group 9 genetic-algorithms (1)
Group 9 genetic-algorithms (1)
 
Certified global minima
Certified global minimaCertified global minima
Certified global minima
 
Deep learning Unit1 BasicsAllllllll.pptx
Deep learning Unit1 BasicsAllllllll.pptxDeep learning Unit1 BasicsAllllllll.pptx
Deep learning Unit1 BasicsAllllllll.pptx
 
chap3.pdf
chap3.pdfchap3.pdf
chap3.pdf
 
M3R.FINAL
M3R.FINALM3R.FINAL
M3R.FINAL
 
ppt
pptppt
ppt
 

Mehr von Andres Mendez-Vazquez

01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectorsAndres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issuesAndres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiationAndres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep LearningAndres Mendez-Vazquez
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learningAndres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusAndres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusAndres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesAndres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variationsAndres Mendez-Vazquez
 

Mehr von Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 

Kürzlich hochgeladen

Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgsaravananr517913
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfRajuKanojiya4
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - GuideGOPINATHS437943
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the weldingMuhammadUzairLiaqat
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm Systemirfanmechengr
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdfCaalaaAbdulkerim
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfROCENODodongVILLACER
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating SystemRashmi Bhat
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
Solving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptSolving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptJasonTagapanGulla
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptMadan Karki
 

Kürzlich hochgeladen (20)

Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdf
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the welding
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm System
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdf
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdf
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
Solving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptSolving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.ppt
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 

RBF Neural Networks: Cover's Theorem and Pattern Separability

  • 1. Neural Networks Radial Basis Functions Networks Andres Mendez-Vazquez December 10, 2015 1 / 96
  • 2. Outline 1 Introduction Main Idea Basic Radial-Basis Functions 2 Separability Cover’s Theorem on the separability of patterns Dichotomy φ-separable functions The Stochastic Experiment The XOR Problem Separating Capacity of a Surface 3 Interpolation Problem What is gained? Feedforward Network Learning Process Radial-Basis Functions (RBF) 4 Introduction Description of the Problem Well-posed or ill-posed The Main Problem 5 Regularization Theory Solving the issue Bias-Variance Dilemma Measuring the difference between optimal and learned The Bias-Variance How can we use this? Getting a solution We still need to talk about... 2 / 96
  • 3. Outline 1 Introduction Main Idea Basic Radial-Basis Functions 2 Separability Cover’s Theorem on the separability of patterns Dichotomy φ-separable functions The Stochastic Experiment The XOR Problem Separating Capacity of a Surface 3 Interpolation Problem What is gained? Feedforward Network Learning Process Radial-Basis Functions (RBF) 4 Introduction Description of the Problem Well-posed or ill-posed The Main Problem 5 Regularization Theory Solving the issue Bias-Variance Dilemma Measuring the difference between optimal and learned The Bias-Variance How can we use this? Getting a solution We still need to talk about... 3 / 96
  • 4. Introduction Observation The back-propagation algorithm for the design of a multilayer perceptron as described in the previous chapter may be viewed as the application of a recursive technique known in statistics as stochastic approximation. Now We take a completely different approach by viewing the design of a neural network as a curve fitting (approximation) problem in a high-dimensional space. Thus Learning is equivalent to finding a surface in a multidimensional space that provides a best fit to the training data. Under a statistical metric 4 / 96
  • 5. Introduction Observation The back-propagation algorithm for the design of a multilayer perceptron as described in the previous chapter may be viewed as the application of a recursive technique known in statistics as stochastic approximation. Now We take a completely different approach by viewing the design of a neural network as a curve fitting (approximation) problem in a high-dimensional space. Thus Learning is equivalent to finding a surface in a multidimensional space that provides a best fit to the training data. Under a statistical metric 4 / 96
  • 6. Introduction Observation The back-propagation algorithm for the design of a multilayer perceptron as described in the previous chapter may be viewed as the application of a recursive technique known in statistics as stochastic approximation. Now We take a completely different approach by viewing the design of a neural network as a curve fitting (approximation) problem in a high-dimensional space. Thus Learning is equivalent to finding a surface in a multidimensional space that provides a best fit to the training data. Under a statistical metric 4 / 96
  • 7. Thus In the context of a neural network The hidden units provide a set of "functions" A "basis" for the input patterns when they are expanded into the hidden space. Name of these functions Radial-Basis functions. 5 / 96
  • 8. Thus In the context of a neural network The hidden units provide a set of "functions" A "basis" for the input patterns when they are expanded into the hidden space. Name of these functions Radial-Basis functions. 5 / 96
  • 9. History These functions were first introduced As the solution of the real multivariate interpolation problem Right now It is now one of the main fields of research in numerical analysis. 6 / 96
  • 10. History These functions were first introduced As the solution of the real multivariate interpolation problem Right now It is now one of the main fields of research in numerical analysis. 6 / 96
  • 11. Outline 1 Introduction Main Idea Basic Radial-Basis Functions 2 Separability Cover’s Theorem on the separability of patterns Dichotomy φ-separable functions The Stochastic Experiment The XOR Problem Separating Capacity of a Surface 3 Interpolation Problem What is gained? Feedforward Network Learning Process Radial-Basis Functions (RBF) 4 Introduction Description of the Problem Well-posed or ill-posed The Main Problem 5 Regularization Theory Solving the issue Bias-Variance Dilemma Measuring the difference between optimal and learned The Bias-Variance How can we use this? Getting a solution We still need to talk about... 7 / 96
  • 12. A Basic Structure We have the following structure 1 Input Layer to connect with the environment. 2 Hidden Layer applying a non-linear transformation. 3 Output Layer applying a linear transformation. Example 8 / 96
  • 13. A Basic Structure We have the following structure 1 Input Layer to connect with the environment. 2 Hidden Layer applying a non-linear transformation. 3 Output Layer applying a linear transformation. Example 8 / 96
  • 14. A Basic Structure We have the following structure 1 Input Layer to connect with the environment. 2 Hidden Layer applying a non-linear transformation. 3 Output Layer applying a linear transformation. Example 8 / 96
  • 15. A Basic Structure We have the following structure 1 Input Layer to connect with the environment. 2 Hidden Layer applying a non-linear transformation. 3 Output Layer applying a linear transformation. Example Input Nodes Nonlinear Nodes Linear Node 8 / 96
  • 16. Why the non-linear transformation? The justification In a paper by Cover (1965), a pattern-classification problem mapped to a high dimensional space is more likely to be linearly separable than in a low-dimensional space. Thus A good reason to make the dimension in the hidden space in a Radial-Basis Function (RBF) network high 9 / 96
  • 17. Why the non-linear transformation? The justification In a paper by Cover (1965), a pattern-classification problem mapped to a high dimensional space is more likely to be linearly separable than in a low-dimensional space. Thus A good reason to make the dimension in the hidden space in a Radial-Basis Function (RBF) network high 9 / 96
  • 18. Outline 1 Introduction Main Idea Basic Radial-Basis Functions 2 Separability Cover’s Theorem on the separability of patterns Dichotomy φ-separable functions The Stochastic Experiment The XOR Problem Separating Capacity of a Surface 3 Interpolation Problem What is gained? Feedforward Network Learning Process Radial-Basis Functions (RBF) 4 Introduction Description of the Problem Well-posed or ill-posed The Main Problem 5 Regularization Theory Solving the issue Bias-Variance Dilemma Measuring the difference between optimal and learned The Bias-Variance How can we use this? Getting a solution We still need to talk about... 10 / 96
  • 19. Cover’s Theorem The Resumed Statement A complex pattern-classification problem cast in a high-dimensional space nonlinearly is more likely to be linearly separable than in a low-dimensional space. Actually It is quite more complex... 11 / 96
  • 20. Cover’s Theorem The Resumed Statement A complex pattern-classification problem cast in a high-dimensional space nonlinearly is more likely to be linearly separable than in a low-dimensional space. Actually It is quite more complex... 11 / 96
  • 21. Some facts A fact Once we know a set of patterns are linearly separable, the problem is easy to solve. Consider A family of surfaces that separate the space in two regions. In addition We have a set of patterns H = {x1, x2, ..., xN } (1) 12 / 96
  • 22. Some facts A fact Once we know a set of patterns are linearly separable, the problem is easy to solve. Consider A family of surfaces that separate the space in two regions. In addition We have a set of patterns H = {x1, x2, ..., xN } (1) 12 / 96
  • 23. Some facts A fact Once we know a set of patterns are linearly separable, the problem is easy to solve. Consider A family of surfaces that separate the space in two regions. In addition We have a set of patterns H = {x1, x2, ..., xN } (1) 12 / 96
  • 24. Outline 1 Introduction Main Idea Basic Radial-Basis Functions 2 Separability Cover’s Theorem on the separability of patterns Dichotomy φ-separable functions The Stochastic Experiment The XOR Problem Separating Capacity of a Surface 3 Interpolation Problem What is gained? Feedforward Network Learning Process Radial-Basis Functions (RBF) 4 Introduction Description of the Problem Well-posed or ill-posed The Main Problem 5 Regularization Theory Solving the issue Bias-Variance Dilemma Measuring the difference between optimal and learned The Bias-Variance How can we use this? Getting a solution We still need to talk about... 13 / 96
  • 25. Dichotomy (Binary Partition) Now The pattern set is split into two classes H1 and H2. Definition A dichotomy (binary partition) of the points is said to be separable with respect to the family of surfaces if a surface exists in the family that separates the points in the class H1 from those in the class H2. Define For each pattern x ∈ H, we define a set of real valued measurement functions {φ1 (x) , φ2 (x) , ..., φd1 (x)} 14 / 96
  • 26. Dichotomy (Binary Partition) Now The pattern set is split into two classes H1 and H2. Definition A dichotomy (binary partition) of the points is said to be separable with respect to the family of surfaces if a surface exists in the family that separates the points in the class H1 from those in the class H2. Define For each pattern x ∈ H, we define a set of real valued measurement functions {φ1 (x) , φ2 (x) , ..., φd1 (x)} 14 / 96
  • 27. Dichotomy (Binary Partition) Now The pattern set is split into two classes H1 and H2. Definition A dichotomy (binary partition) of the points is said to be separable with respect to the family of surfaces if a surface exists in the family that separates the points in the class H1 from those in the class H2. Define For each pattern x ∈ H, we define a set of real valued measurement functions {φ1 (x) , φ2 (x) , ..., φd1 (x)} 14 / 96
  • 28. Thus We define the following function (Vector of measurements) φ : H → Rd1 (2) Defined as φ (x) = (φ1 (x) , φ2 (x) , ..., φd1 (x))T (3) Now Suppose that the pattern x is a vector in an d0-dimensional input space. 15 / 96
  • 29. Thus We define the following function (Vector of measurements) φ : H → Rd1 (2) Defined as φ (x) = (φ1 (x) , φ2 (x) , ..., φd1 (x))T (3) Now Suppose that the pattern x is a vector in an d0-dimensional input space. 15 / 96
  • 30. Thus We define the following function (Vector of measurements) φ : H → Rd1 (2) Defined as φ (x) = (φ1 (x) , φ2 (x) , ..., φd1 (x))T (3) Now Suppose that the pattern x is a vector in an d0-dimensional input space. 15 / 96
  • 31. Then... We have that the mapping φ (x) It maps points in d0-dimensional space into corresponding points in a new space of dimension d1. Each of this functions φi (x) It is known as a hidden function because it plays a role similar to the hidden unit in a feed-forward neural network. Thus We have that the space spanned by the set of hidden functions {φi (x)}d1 i=1 is called as the hidden space of feature space. 16 / 96
• 35. φ-separable functions Definition A dichotomy {H_1, H_2} of H is said to be φ-separable if there exists a d_1-dimensional vector w such that 1 w^T φ(x) > 0 if x ∈ H_1. 2 w^T φ(x) < 0 if x ∈ H_2. Clearly the hyperplane is defined by the equation w^T φ(x) = 0 (4) Now The inverse image of this hyperplane Hyp^{−1} = {x | w^T φ(x) = 0} (5) defines the separating surface in the input space. 18 / 96
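To make the definition concrete, here is a minimal sketch (not from the slides) that checks whether a given weight vector w φ-separates a dichotomy; the names phi, H1, H2 are illustrative placeholders, with phi mapping a pattern to its feature vector in R^{d_1}:

```python
import numpy as np

def is_phi_separable(w, phi, H1, H2):
    """Direct check of the definition: w^T phi(x) > 0 for all x in H1
    and w^T phi(x) < 0 for all x in H2."""
    return (all(w @ phi(x) > 0 for x in H1) and
            all(w @ phi(x) < 0 for x in H2))
```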
• 39. Now Taking into consideration A natural class of mappings is obtained by using a linear combination of r-wise products of the pattern vector coordinates. They are called The rth-order rational varieties. A rational variety of order r in a space of dimension d_0 is described by Σ_{0≤i_1≤i_2≤...≤i_r≤d_0} a_{i_1 i_2 ... i_r} x_{i_1} x_{i_2} ··· x_{i_r} = 0 (6) where x_i is the ith coordinate of the input vector x and x_0 is set to unity in order to express the equation in homogeneous form. 19 / 96
• 43. Homogeneous Functions Definition A function f(x) is said to be homogeneous of degree n if, introducing a constant parameter λ and replacing the variable x with λx, we find: f(λx) = λ^n f(x) (7) 20 / 96
• 44. Homogeneous Equation Equation (Eq. 6) An rth-order product of entries x_i of x, x_{i_1} x_{i_2} ··· x_{i_r}, is called a monomial Properties For an input space of dimensionality d_0, there are C(d_0, r) = d_0! / ((d_0 − r)! r!) (8) monomials in (Eq. 6). 21 / 96
  • 46. Example of these surfaces Hyperplanes (first-order rational varieties) 22 / 96
• 48. Example of these surfaces Quadrics (second-order rational varieties) 24 / 96
  • 49. Example of these surfaces Hyperspheres (quadrics with certain linear constraints on the coefficients) 25 / 96
• 51. The Stochastic Experiment Suppose You have the following activation patterns x_1, x_2, ..., x_N, chosen independently. Suppose That all possible dichotomies of H = {x_1, x_2, ..., x_N} are equiprobable. Now, let P(N, d_1) be the probability that a particular dichotomy picked at random is φ-separable P(N, d_1) = (1/2)^{N−1} Σ_{m=0}^{d_1−1} C(N−1, m) (9) 27 / 96
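A minimal numerical sketch of (Eq. 9), not from the slides, that makes the behavior of P(N, d_1) easy to explore; the chosen values of N and d_1 are just illustrative:

```python
from math import comb

def p_separable(N: int, d1: int) -> float:
    """P(N, d1): probability that a random dichotomy of N patterns
    is phi-separable in a hidden space of dimension d1 (Eq. 9)."""
    return 0.5 ** (N - 1) * sum(comb(N - 1, m) for m in range(d1))

# At N = 2*d1 the probability is exactly 1/2; it approaches 1 as d1
# grows relative to N, which is the point of the theorem.
for N in (5, 10, 20, 40):
    print(N, p_separable(N, d1=10))
```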
• 54. What? Basically (Eq. 9) represents The essence of Cover’s Separability Theorem. Something Notable It is the cumulative binomial distribution: the probability that N − 1 flips of a fair coin produce d_1 − 1 or fewer heads. Specifically The higher we make the dimension d_1 of the hidden space, the closer the probability P(N, d_1) gets to one. 28 / 96
• 57. Final ingredients of Cover’s Theorem First Nonlinear formulation of the hidden functions φ_i(x), where x is the input vector and i = 1, 2, ..., d_1. Second High dimensionality of the hidden space compared to the input space. This dimensionality is determined by the value assigned to d_1 (i.e., the number of hidden units). Then In general, a complex pattern-classification problem cast nonlinearly in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space. 29 / 96
• 62. There is always an exception to every rule!!! The XOR Problem [Figure: the four XOR patterns in the unit square, with Class 1 and Class 2 not linearly separable in the input space] 31 / 96
• 63. Now We define the following radial functions φ_1(x) = exp(−‖x − t_1‖²), where t_1 = (1, 1)^T φ_2(x) = exp(−‖x − t_2‖²), where t_2 = (0, 0)^T Then If we apply our classic mapping φ(x) = [φ_1(x), φ_2(x)]: Original → Mapping (0, 1) → (0.3678, 0.3678) (1, 0) → (0.3678, 0.3678) (0, 0) → (0.1353, 1) (1, 1) → (1, 0.1353) 32 / 96
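A short sketch (not from the slides) that reproduces the table above; note the centers t_1 = (1,1)^T and t_2 = (0,0)^T, which is what makes the listed values come out:

```python
import numpy as np

t1 = np.array([1.0, 1.0])
t2 = np.array([0.0, 0.0])

def phi(x):
    """Map an input pattern to the (phi_1, phi_2) hidden space."""
    x = np.asarray(x, dtype=float)
    return np.array([np.exp(-np.sum((x - t1) ** 2)),
                     np.exp(-np.sum((x - t2) ** 2))])

for x in [(0, 1), (1, 0), (0, 0), (1, 1)]:
    print(x, np.round(phi(x), 4))
# (0, 1) -> [0.3679 0.3679]   (the slides truncate e^{-1} to 0.3678)
# (1, 0) -> [0.3679 0.3679]
# (0, 0) -> [0.1353 1.    ]
# (1, 1) -> [1.     0.1353]
```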
• 65. New Space We have the following new φ_1 − φ_2 space [Figure: the mapped patterns in the φ_1 − φ_2 plane, where the two classes become linearly separable] 33 / 96
• 67. Separating Capacity of a Surface Something Notable (Eq. 9) has an important bearing on the expected maximum number of randomly assigned patterns that are linearly separable in a multidimensional space. Now, given our patterns {x_i}_{i=1}^N Let N be a random variable defined as the largest integer such that the sequence is φ-separable. We have that Prob(N = n) = P(n, d_1) − P(n + 1, d_1) (10) 35 / 96
• 70. Separating Capacity of a Surface Then Prob(N = n) = (1/2)^n C(n − 1, d_1 − 1), n = 0, 1, 2, ... (11) Remark: C(n, d_1) = C(n − 1, d_1 − 1) + C(n − 1, d_1), 0 < d_1 < n To interpret this Recall the negative binomial distribution. It is a repeated sequence of Bernoulli trials With k failures preceding the rth success. 36 / 96
• 73. Separating Capacity of a Surface Thus, we have that Given p and q, the probabilities of success and failure, respectively, with p + q = 1. Definition p(K = k | p, q) = C(r + k − 1, k) p^r q^k (12) What happens with p = q = 1/2 and k + r = n? Any idea? 37 / 96
• 76. Separating Capacity of a Surface Thus (Eq. 11) is just the negative binomial distribution shifted d_1 units to the right, with parameters d_1 and 1/2 Finally N corresponds to the “waiting time” for the d_1th failure in a sequence of tosses of a fair coin. We have then E[N] = 2d_1, Median[N] = 2d_1 38 / 96
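As a quick sanity check (a sketch, not from the slides), the expectation E[N] = 2d_1 can be verified numerically from the pmf in (Eq. 11); d_1 = 6 is an arbitrary choice, and the sum is truncated where the tail is negligible:

```python
from math import comb

def prob_N(n: int, d1: int) -> float:
    """Prob(N = n) = (1/2)^n * C(n-1, d1-1)  (Eq. 11); zero for n < d1."""
    return 0.5 ** n * comb(n - 1, d1 - 1) if n >= d1 else 0.0

d1 = 6
mean = sum(n * prob_N(n, d1) for n in range(d1, 400))
print(mean)   # ~12.0, i.e. 2 * d1, matching the corollary below
```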
• 79. This allows us to define the Corollary to Cover’s Theorem A celebrated asymptotic result The expected maximum number of randomly assigned patterns (vectors) that are linearly separable in a space of dimensionality d_1 is equal to 2d_1. Something Notable This result suggests that 2d_1 is a natural definition of the separating capacity of a family of decision surfaces having d_1 degrees of freedom. 39 / 96
• 82. Given a problem of non-linearly separable patterns It is possible to see that There is a benefit to be gained by mapping the input space into a new space of high enough dimension For this, we use a non-linear map This is quite similar to solving a difficult non-linear filtering problem by mapping it into a high-dimensional space and then solving it as a linear filtering problem. 41 / 96
• 85. Take into consideration the following architecture Mapping from input space to hidden space, followed by a linear mapping to output space!!! [Figure: network with input nodes, nonlinear hidden nodes, and a single linear output node] 43 / 96
• 86. This can be seen as We have the following map s : R^{d_0} → R (13) Therefore We may think of s as a hypersurface (graph) Γ ⊂ R^{d_0+1} 44 / 96
  • 88. Example We have that the Red planes represent the mappings and the Gray is the Linear Separator 45 / 96
• 90. General Idea First The training phase constitutes the optimization of a fitting procedure for the surface Γ. It is based on the known data points given as input-output patterns. Second The generalization phase is synonymous with interpolation between the data points. The interpolation is performed along the constrained surface generated by the fitting procedure. 47 / 96
• 94. This leads to the theory of multi-variable interpolation Interpolation Problem Given a set of N different points {x_i ∈ R^{d_0} | i = 1, 2, ..., N} and a corresponding set of N real numbers {d_i ∈ R | i = 1, 2, ..., N}, find a function F : R^{d_0} → R that satisfies the interpolation condition: F(x_i) = d_i, i = 1, 2, ..., N (14) Remark For strict interpolation as specified here, the interpolating surface is constrained to pass through all the training data points. 48 / 96
• 97. Radial-Basis Functions (RBF) The function F has the following form (Powell, 1988) F(x) = Σ_{i=1}^N w_i φ(‖x − x_i‖) (15) Where {φ(‖x − x_i‖) | i = 1, ..., N} is a set of N arbitrary, generally non-linear, functions known as RBFs, and ‖·‖ denotes a norm that is usually Euclidean. In addition The known data points x_i ∈ R^{d_0}, i = 1, 2, ..., N, are taken to be the centers of the radial basis functions. 50 / 96
• 100. A Set of Simultaneous Linear Equations Given φ_{ji} = φ(‖x_j − x_i‖), (j, i) = 1, 2, ..., N (16) Using (Eq. 14) and (Eq. 15), we get [φ_{11} φ_{12} ··· φ_{1N}; φ_{21} φ_{22} ··· φ_{2N}; ⋮ ; φ_{N1} φ_{N2} ··· φ_{NN}] [w_1; w_2; ⋮ ; w_N] = [d_1; d_2; ⋮ ; d_N] (17) 51 / 96
• 102. Now We can create the following vectors d = [d_1, d_2, ..., d_N]^T (response vector). w = [w_1, w_2, ..., w_N]^T (linear weight vector). Now, we define an N × N matrix called the interpolation matrix Φ = {φ_{ji} | (j, i) = 1, 2, ..., N} (18) Thus, we have Φw = d (19) 52 / 96
• 105. From here Assuming that Φ is a non-singular matrix w = Φ^{−1} d (20) Question How can we be sure that the interpolation matrix Φ is non-singular? Answer It turns out that for a large class of radial-basis functions, and under certain conditions, non-singularity is guaranteed!!! 53 / 96
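A minimal sketch of strict interpolation (not from the slides), assuming a Gaussian radial function with σ = 1 and synthetic data; for the Gaussian with distinct centers, Φ is indeed non-singular:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: N points in R^2, which also act as centers.
X = rng.normal(size=(10, 2))          # the x_i
d = rng.normal(size=10)               # the d_i

def rbf(r, sigma=1.0):
    """Gaussian radial function applied to a distance (or array of them)."""
    return np.exp(-r ** 2 / (2 * sigma ** 2))

# Interpolation matrix: Phi[j, i] = phi(||x_j - x_i||)  (Eq. 16)
R = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Phi = rbf(R)

# Solve Phi w = d  (Eq. 19); solve() is preferred over forming Phi^{-1}.
w = np.linalg.solve(Phi, d)

def F(x):
    """Interpolant F(x) = sum_i w_i phi(||x - x_i||)  (Eq. 15)."""
    return w @ rbf(np.linalg.norm(x - X, axis=-1))

print(np.allclose([F(x) for x in X], d))  # strict interpolation: True
```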
  • 109. Introduction Observation The strict interpolation procedure described may not be a good strategy for the training of RBF networks for certain classes of tasks. Reason If the number of data points is much larger than the number of degrees of freedom of the underlying physical process. Thus The network may end up fitting misleading variations due to idiosyncrasies or noise in the input data. 55 / 96
• 113. Well-posed The Problem Assume that we have a domain X and a range Y, both metric spaces. They are related by a mapping f : X → Y (21) Definition The problem of reconstructing the mapping f is said to be well-posed if three conditions are satisfied: existence, uniqueness and continuity. 57 / 96
• 116. Defining the meaning of this Existence For every input vector x ∈ X, there exists an output y = f(x), where y ∈ Y. Uniqueness For any pair of input vectors x, t ∈ X, we have f(x) = f(t) if and only if x = t. Continuity The mapping is continuous if for any ε > 0 there exists δ > 0 such that the condition d_X(x, t) < δ implies d_Y(f(x), f(t)) < ε. 58 / 96
  • 120. Ill-Posed Therefore If any of these conditions is not satisfied, the problem is said to be ill-posed. Basically An ill-posed problem means that large data sets may contain a surprisingly small amount of information about the desired solution. 60 / 96
• 123. Learning from data Rebuilding the physical phenomenon using the samples [Figure: a physical phenomenon generating the observed data samples] 62 / 96
• 124. We have the following Physical Phenomena Speech, pictures, radar signals, sonar signals, seismic data. The data themselves are well posed But learning from such data, i.e., rebuilding the hypersurface, can be an ill-posed inverse problem. 63 / 96
  • 126. Why First The existence criterion may be violated in that a distinct output may not exist for every input Second There may not be as much information in the training sample as we really need to reconstruct the input-output mapping uniquely. Third The unavoidable presence of noise or imprecision in real-life training data adds uncertainty to the reconstructed input-output mapping. 64 / 96
• 129. The noise problem Getting out of the range [Figure: a mapping corrupted by noise, pushing outputs outside the expected range] 65 / 96
  • 130. How? This can happen when There is a lack of information!!! Lanczos, 1964 “A lack of information cannot be remedied by any mathematical trickery.” 66 / 96
• 133. How do we solve the problem? Something Notable In 1963, Tikhonov proposed a new method called regularization for solving ill-posed problems. Tikhonov He was a Soviet and Russian mathematician known for important contributions to topology, functional analysis, mathematical physics, and ill-posed problems. 68 / 96
• 135. Also Known as Ridge Regression Setup We have: Input signal {x_i ∈ R^{d_0}}_{i=1}^N. Output signal {d_i ∈ R}_{i=1}^N. In addition Note that the output is assumed to be one-dimensional. 69 / 96
• 137. Now, assuming that you have an approximation function y = F(x) Standard Error Term E_s(F) = (1/2) Σ_{i=1}^N (d_i − y_i)² = (1/2) Σ_{i=1}^N (d_i − F(x_i))² (22) Regularization Term E_c(F) = (1/2) ‖DF‖² (23) Where D is a linear differential operator. 70 / 96
• 140. Now Ordinarily y = F(x) Normally, the function space representing the functional F is the L_2 space that consists of all real-valued functions f(x) with x ∈ R^{d_0} The quantity to be minimized in regularization theory is E(f) = (1/2) Σ_{i=1}^N (d_i − f(x_i))² + (λ/2) ‖Df‖² (24) Where λ is a positive real number called the regularization parameter. E(f) is called the Tikhonov functional. 71 / 96
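The slides leave the operator D abstract; as a sketch only, here is (Eq. 24) evaluated numerically under the assumption that D is the first-derivative operator and ‖Df‖² is approximated by a Riemann sum on a grid. All names and values are illustrative:

```python
import numpy as np

def tikhonov_functional(f_vals, x_grid, data_idx, d, lam):
    """E(f) = (1/2) sum_i (d_i - f(x_i))^2 + (lam/2) ||Df||^2  (Eq. 24),
    with D assumed to be d/dx and the norm discretized on the grid."""
    data_term = 0.5 * np.sum((d - f_vals[data_idx]) ** 2)
    dx = x_grid[1] - x_grid[0]
    Df = np.gradient(f_vals, dx)
    return data_term + 0.5 * lam * np.sum(Df ** 2) * dx

# A wiggly candidate that still passes through the data pays a larger
# smoothness penalty than a smooth one, so it scores worse.
x_grid = np.linspace(0, 1, 101)
data_idx = np.array([10, 50, 90])          # grid positions of the x_i
d = np.array([0.1, 0.9, 0.2])              # targets d_i
smooth = np.interp(x_grid, x_grid[data_idx], d)
wiggly = smooth + 0.05 * np.sin(60 * np.pi * x_grid)
print(tikhonov_functional(smooth, x_grid, data_idx, d, lam=0.1))
print(tikhonov_functional(wiggly, x_grid, data_idx, d, lam=0.1))
```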
• 145. Introduction What did we see until now? The design of learning machines from two main points of view: the statistical point of view, and the linear algebra and optimization point of view. Going back to the probability models We might think of the machine to be learned as a function g(x|D)... something like curve fitting... under a data set D = {(x_i, y_i) | i = 1, 2, ..., N} (25) Remark: where x_i ∼ p(x|Θ)!!! 73 / 96
• 152. Thus, we have that Two main functions A function g(x|D) obtained using some algorithm!!! E[y|x], the optimal regression... Important The key factor here is the dependence of the approximation on D. Why? The approximation may be very good for a specific training data set but very bad for another. This is the reason for studying fusion of information at the decision level... 74 / 96
• 157. How do we measure the difference We have that Var(X) = E[(X − µ)²] We can do that for our data Var_D(g(x|D)) = E_D[(g(x|D) − E[y|x])²] Now, we add and subtract E_D[g(x|D)] (26) Remark: the expected output of the machine g(x|D) 75 / 96
• 161. Thus, we have that Our original variance expands as Var_D(g(x|D)) = E_D[(g(x|D) − E[y|x])²] = E_D[(g(x|D) − E_D[g(x|D)] + E_D[g(x|D)] − E[y|x])²] = E_D[(g(x|D) − E_D[g(x|D)])²] + 2 E_D[(g(x|D) − E_D[g(x|D)])(E_D[g(x|D)] − E[y|x])] + (E_D[g(x|D)] − E[y|x])² Finally E_D[(g(x|D) − E_D[g(x|D)])(E_D[g(x|D)] − E[y|x])] = ? (27) The second factor is a constant with respect to D, and E_D[g(x|D) − E_D[g(x|D)]] = 0, so this cross term vanishes. 76 / 96
• 165. We have the Bias-Variance Our Final Equation E_D[(g(x|D) − E[y|x])²] = E_D[(g(x|D) − E_D[g(x|D)])²] (VARIANCE) + (E_D[g(x|D)] − E[y|x])² (BIAS²) Where the variance It represents the error between our machine g(x|D) and the expected output of the machine under x_i ∼ p(x|Θ). Where the bias It represents the quadratic error between the expected output of the machine under x_i ∼ p(x|Θ) and the optimal regression E[y|x]. 77 / 96
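A Monte Carlo sketch of this decomposition (not from the slides), assuming a toy sine target standing in for E[y|x] and polynomial fits standing in for g(x|D); both terms are estimated by averaging over many independently drawn data sets D:

```python
import numpy as np

rng = np.random.default_rng(1)

true_f = lambda x: np.sin(2 * np.pi * x)      # stands in for E[y|x]
x_test = np.linspace(0, 1, 50)

def fit_poly(deg, n=20, noise=0.3):
    """Draw one data set D and return g(x_test | D) for a degree-deg fit."""
    x = rng.uniform(0, 1, n)
    y = true_f(x) + noise * rng.normal(size=n)
    return np.polyval(np.polyfit(x, y, deg), x_test)

for deg in (1, 3, 9):
    G = np.array([fit_poly(deg) for _ in range(500)])   # many D's
    g_bar = G.mean(axis=0)                              # E_D[g(x|D)]
    variance = ((G - g_bar) ** 2).mean()
    bias2 = ((g_bar - true_f(x_test)) ** 2).mean()
    print(f"degree {deg}: bias^2 = {bias2:.3f}, variance = {variance:.3f}")
```

As the degree grows, the bias² term falls while the variance term rises, which is exactly the dilemma the next slides exploit.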
• 170. Using this in our favor!!! Something Notable Introducing bias is equivalent to restricting the range of functions for which a model can account. Typically this is achieved by removing degrees of freedom. Examples Lowering the order of a polynomial or reducing the number of weights in a neural network!!! Ridge Regression It does not explicitly remove degrees of freedom but instead reduces the effective number of parameters. 79 / 96
• 174. Example In the case of a linear regression model C(w) = Σ_{i=1}^N (d_i − w^T x_i)² + λ Σ_{j=1}^{d_0} w_j² (28) Thus This is ridge regression (weight decay), and the regularization parameter λ > 0 controls the balance between fitting the data and avoiding the penalty. A small value for λ means the data can be fit tightly without causing a large penalty. A large value for λ means a tight fit has to be sacrificed if it requires large weights. 80 / 96
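A minimal sketch of (Eq. 28) with synthetic data (not from the slides); setting the gradient of the cost to zero gives the standard closed form w = (X^T X + λI)^{−1} X^T d:

```python
import numpy as np

def ridge(X, d, lam):
    """Minimize (Eq. 28): sum_i (d_i - w^T x_i)^2 + lam * sum_j w_j^2.
    Closed form: w = (X^T X + lam * I)^{-1} X^T d."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ d)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
d = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=50)

for lam in (0.0, 1.0, 100.0):
    w = ridge(X, d, lam)
    print(lam, np.round(w, 3), "||w|| =", round(np.linalg.norm(w), 3))
# Larger lam shrinks the weights toward zero, trading fit for smoothness.
```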
  • 178. Important The Bias It favors solutions involving small weights and the effect is to smooth the output function. 81 / 96
• 180. Now, we can carry out the optimization First, we rewrite the cost function in the following way S(w) = Σ_{i=1}^N (d_i − f(x_i))² (29) And we will use a generalized version for f f(x_i) = Σ_{j=1}^{d_1} w_j φ_j(x_i) (30) Where The free variables are the weights {w_j}_{j=1}^{d_1}. 83 / 96
• 183. Where φ_j(x_i) is In our case, we may use the Gaussian function φ_j(x_i) = φ(x_i, x_j) (31) With φ(x, x_j) = exp(−(1/(2σ²)) ‖x − x_j‖²) (32) 84 / 96
• 185. Thus Final cost function, assuming there is a regularization term per weight C(w, λ) = Σ_{i=1}^N (d_i − f(x_i))² + Σ_{j=1}^{d_1} λ_j w_j² (33) What do we do? 1 Differentiate the function with respect to the free variables. 2 Equate the results with zero. 3 Solve the resulting equations. 85 / 96
• 189. Differentiate the function with respect to the free variables First ∂C(w, λ)/∂w_j = −2 Σ_{i=1}^N (d_i − f(x_i)) ∂f(x_i)/∂w_j + 2λ_j w_j (34) We get the differential ∂f(x_i)/∂w_j ∂f(x_i)/∂w_j = φ_j(x_i) (35) 86 / 96
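Because the sign in Eq. (34) is easy to get wrong, a quick sanity check is to compare the analytic gradient against a finite-difference estimate of the cost. The NumPy sketch below does this on random data; all of the names (cost, analytic_grad, Phi, lams) are illustrative assumptions.

```python
import numpy as np

def cost(w, Phi, d, lams):
    """C(w, lambda) of Eq. (33), with f = Phi @ w in matrix form."""
    r = d - Phi @ w
    return np.sum(r ** 2) + np.sum(lams * w ** 2)

def analytic_grad(w, Phi, d, lams):
    """Gradient of Eq. (34), using df(x_i)/dw_j = phi_j(x_i) from Eq. (35)."""
    return -2.0 * Phi.T @ (d - Phi @ w) + 2.0 * lams * w

rng = np.random.default_rng(0)
Phi = rng.normal(size=(10, 3))        # 10 samples, 3 basis functions
d = rng.normal(size=10)
w = rng.normal(size=3)
lams = np.array([0.1, 0.2, 0.3])      # one lambda per weight

eps = 1e-6
numeric = np.array([(cost(w + eps * e, Phi, d, lams)
                     - cost(w - eps * e, Phi, d, lams)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(numeric, analytic_grad(w, Phi, d, lams), atol=1e-4)
```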
• 191. Now
Setting the derivative to zero, we have

\sum_{i=1}^{N} f(x_i)\,\phi_j(x_i) + \lambda_j w_j = \sum_{i=1}^{N} d_i\,\phi_j(x_i)    (36)

Something Notable
There are d_1 such equations, one for each 1 ≤ j ≤ d_1, each representing one constraint on the solution. Since there are exactly as many constraints as there are unknowns, the system has, except under certain pathological conditions, a unique solution.
• 194. Using Our Linear Algebra
We have then

\phi_j^T f + \lambda_j w_j = \phi_j^T d    (37)

Where

\phi_j = \begin{pmatrix} \phi_j(x_1) \\ \phi_j(x_2) \\ \vdots \\ \phi_j(x_N) \end{pmatrix}, \quad f = \begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_N) \end{pmatrix}, \quad d = \begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{pmatrix}    (38)
• 196. Now
Since there is one of these equations for each j, each relating one scalar quantity to another, we can stack them

\begin{pmatrix} \phi_1^T f \\ \phi_2^T f \\ \vdots \\ \phi_{d_1}^T f \end{pmatrix} + \begin{pmatrix} \lambda_1 w_1 \\ \lambda_2 w_2 \\ \vdots \\ \lambda_{d_1} w_{d_1} \end{pmatrix} = \begin{pmatrix} \phi_1^T d \\ \phi_2^T d \\ \vdots \\ \phi_{d_1}^T d \end{pmatrix}    (39)

Now, if we define

\Phi = \begin{pmatrix} \phi_1 & \phi_2 & \cdots & \phi_{d_1} \end{pmatrix}    (40)

Written in full form

\Phi = \begin{pmatrix} \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_{d_1}(x_1) \\ \phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_{d_1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x_N) & \phi_2(x_N) & \cdots & \phi_{d_1}(x_N) \end{pmatrix}    (41)
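Under the common choice that the centres are the training points themselves (so d_1 = N), the design matrix of Eq. (41) with the Gaussian basis of Eq. (32) can be built in a few vectorized lines; the sketch below and its names are illustrative assumptions.

```python
import numpy as np

def design_matrix(X, sigma):
    """Phi[i, j] = phi_j(x_i) of Eq. (41), Gaussian basis centred on x_j."""
    # Pairwise squared distances ||x_i - x_j||^2 between rows of X (N x dim)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```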
• 199. We can then
Define the following matrix equation

\Phi^T f + \Lambda w = \Phi^T d    (42)

Where

\Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{d_1} \end{pmatrix}    (43)
• 201. Now, we have that
The vector f can be decomposed into the product of two terms: the design matrix and the weight vector. We have then

f_i = f(x_i) = \sum_{j=1}^{d_1} w_j \phi_j(x_i) = \tilde{\phi}_i^T w    (44)

Where

\tilde{\phi}_i = \begin{pmatrix} \phi_1(x_i) \\ \phi_2(x_i) \\ \vdots \\ \phi_{d_1}(x_i) \end{pmatrix}    (45)
• 204. Furthermore
We get that

f = \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_N \end{pmatrix} = \begin{pmatrix} \tilde{\phi}_1^T w \\ \tilde{\phi}_2^T w \\ \vdots \\ \tilde{\phi}_N^T w \end{pmatrix} = \Phi w    (46)

Finally, we have that

\Phi^T d = \Phi^T f + \Lambda w = \Phi^T \Phi w + \Lambda w = \left(\Phi^T \Phi + \Lambda\right) w
• 206. Now...
We get finally

w = \left(\Phi^T \Phi + \Lambda\right)^{-1} \Phi^T d    (47)

Remember
This equation is the most general form of the normal equations.

We have two cases
In standard ridge regression, \lambda_j = \lambda for 1 ≤ j ≤ d_1.
In ordinary least squares there is no weight penalty, i.e., \lambda_j = 0 for 1 ≤ j ≤ d_1.
• 209. Thus, we have
First Case (standard ridge regression)

w = \left(\Phi^T \Phi + \lambda I_{d_1}\right)^{-1} \Phi^T d    (48)

Second Case (ordinary least squares)

w = \left(\Phi^T \Phi\right)^{-1} \Phi^T d    (49)
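Putting Eqs. (47)-(49) together, a minimal solver sketch can cover all three cases; np.linalg.solve is used instead of forming the inverse explicitly, which is the usual numerical practice, and the function name and its defaults are assumptions for illustration.

```python
import numpy as np

def solve_weights(Phi, d, lams=None):
    """w = (Phi^T Phi + Lambda)^{-1} Phi^T d, covering Eqs. (47)-(49)."""
    d1 = Phi.shape[1]
    if lams is None:                 # ordinary least squares, Eq. (49)
        Lam = np.zeros((d1, d1))
    elif np.isscalar(lams):          # standard ridge regression, Eq. (48)
        Lam = lams * np.eye(d1)
    else:                            # one lambda per weight, Eq. (47)
        Lam = np.diag(np.asarray(lams))
    # Solve the normal equations rather than inverting Phi^T Phi + Lambda
    return np.linalg.solve(Phi.T @ Phi + Lam, Phi.T @ d)
```

For example, with the Gaussian design matrix sketched earlier, solve_weights(design_matrix(X, sigma), d, lams=0.1) would return the ridge weights.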
• 211. Outline
5 Regularization Theory: We still need to talk about...
• 212. There are still several things that we need to look at...
First
What is the variance of the weight vector? The Variance Matrix.
Second
The prediction of the output at any of the training set inputs: the Projection Matrix.
Finally
The incremental algorithm for the problem!