This document discusses radial basis function networks. It begins by introducing the basic structure of RBF networks, which typically involve an input layer, a hidden layer that applies a nonlinear transformation using radial basis functions, and an output layer with a linear transformation. The document then discusses Cover's theorem, which states that pattern classification problems are more likely to be linearly separable when mapped to a higher-dimensional space through a nonlinear transformation. Several key concepts are introduced, including dichotomies, phi-separable functions, and using hidden functions to map patterns to a hidden feature space.
2. Outline
1 Introduction
Main Idea
Basic Radial-Basis Functions
2 Separability
Cover’s Theorem on the separability of patterns
Dichotomy
φ-separable functions
The Stochastic Experiment
The XOR Problem
Separating Capacity of a Surface
3 Interpolation Problem
What is gained?
Feedforward Network
Learning Process
Radial-Basis Functions (RBF)
4 Introduction
Description of the Problem
Well-posed or ill-posed
The Main Problem
5 Regularization Theory
Solving the issue
Bias-Variance Dilemma
Measuring the difference between optimal and learned
The Bias-Variance
How can we use this?
Getting a solution
We still need to talk about...
4. Introduction
Observation
The back-propagation algorithm for the design of a multilayer perceptron
as described in the previous chapter may be viewed as the application of a
recursive technique known in statistics as stochastic approximation.
Now
We take a completely different approach by viewing the design of a neural
network as a curve fitting (approximation) problem in a high-dimensional
space.
Thus
Learning is equivalent to finding a surface in a multidimensional space that
provides a best fit to the training data.
Under a statistical metric
7. Thus
In the context of a neural network
The hidden units provide a set of "functions"
A "basis" for the input patterns when they are expanded into the
hidden space.
Name of these functions
Radial-Basis functions.
9. History
These functions were first introduced
As the solution of the real multivariate interpolation problem
Right now
It is now one of the main fields of research in numerical analysis.
15. A Basic Structure
We have the following structure
1 Input Layer to connect with the environment.
2 Hidden Layer applying a non-linear transformation.
3 Output Layer applying a linear transformation.
Example
[Figure: an RBF network with input nodes, nonlinear (radial-basis) hidden nodes, and a single linear output node]
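To make the three-layer structure concrete, here is a minimal Python sketch of such a network. The Gaussian form of the hidden units and the particular centers, width, and weights are illustrative assumptions, not values from the slides.

```python
# A minimal sketch of the structure just described: an input layer, a hidden
# layer of Gaussian radial-basis units, and a linear output layer.
# Centers, width, and weights below are illustrative assumptions.
import numpy as np

def rbf_network(x, centers, width, weights, bias=0.0):
    """Evaluate an RBF network: a linear combination of Gaussian hidden units."""
    # Hidden layer: non-linear transformation phi_i(x) = exp(-||x - c_i||^2 / (2*width^2))
    phi = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * width ** 2))
    # Output layer: linear transformation of the hidden activations
    return weights @ phi + bias

# Example with two hidden units in a 2-D input space (hypothetical values).
centers = np.array([[1.0, 1.0], [0.0, 0.0]])
weights = np.array([0.5, -0.5])
print(rbf_network(np.array([0.0, 1.0]), centers, width=1.0, weights=weights))
```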
16. Why the non-linear transformation?
The justification
In a paper by Cover (1965), a pattern-classification problem mapped nonlinearly
to a high-dimensional space is more likely to be linearly separable than in a
low-dimensional space.
Thus
A good reason to make the dimension of the hidden space in a Radial-Basis
Function (RBF) network high
19. Cover’s Theorem
The Summarized Statement
A complex pattern-classification problem cast in a high-dimensional space
nonlinearly is more likely to be linearly separable than in a low-dimensional
space.
Actually
It is actually quite a bit more complex...
21. Some facts
A fact
Once we know a set of patterns is linearly separable, the problem is easy
to solve.
Consider
A family of surfaces that separates the space into two regions.
In addition
We have a set of patterns
H = {x1, x2, ..., xN } (1)
25. Dichotomy (Binary Partition)
Now
The pattern set is split into two classes H1 and H2.
Definition
A dichotomy (binary partition) of the points is said to be separable with
respect to the family of surfaces if a surface exists in the family that
separates the points in the class H1 from those in the class H2.
Define
For each pattern x ∈ H, we define a set of real-valued measurement
functions {φ1 (x) , φ2 (x) , ..., φd1 (x)}
28. Thus
We define the following function (Vector of measurements)
φ : H → Rd1 (2)
Defined as
φ (x) = (φ1 (x) , φ2 (x) , ..., φd1 (x))T (3)
Now
Suppose that the pattern x is a vector in a d0-dimensional input space.
31. Then...
We have that the mapping φ (x)
It maps points in d0-dimensional space into corresponding points in a new
space of dimension d1.
Each of these functions φi (x)
It is known as a hidden function because it plays a role similar to that of a
hidden unit in a feed-forward neural network.
Thus
The space spanned by the set of hidden functions {φi (x)}, i = 1, ..., d1, is
called the hidden space or feature space.
35. φ-separable functions
Definition
A dichotomy {H1, H2} of H is said to be φ-separable if there exists a
d1-dimensional vector w such that
1 wT φ (x) > 0 if x ∈ H1.
2 wT φ (x) < 0 if x ∈ H2.
Clearly the hyperplane is defined by the equation
wT φ (x) = 0 (4)
Now
The inverse image of this hyperplane,
Hyp−1 = {x | wT φ (x) = 0} (5)
defines the separating surface in the input space.
39. Now
Taking in consideration
A natural class of mappings obtained by using a linear combination of
r-wise products of the pattern vector coordinates.
They are called
As the rth-order rational varieties.
A rational variety of order r in dimension d0 is described by
\sum_{0 \le i_1 \le i_2 \le \cdots \le i_r \le d_0} a_{i_1 i_2 \cdots i_r}\, x_{i_1} x_{i_2} \cdots x_{i_r} = 0 \qquad (6)
where xi is the ith coordinate of the input vector x and x0 is set to unity
in order to express the previous equation in homogeneous form.
43. Homogeneous Functions
Definition
A function f (x) is said to be homogeneous of degree n if, by introducing a
constant parameter λ and replacing the variable x with λx, we find:
f (λx) = λn f (x) (7)
44. Homogeneous Equation
Equation (Eq. 6)
An rth-order product of entries xi of x, xi1 xi2 · · · xir , is called a monomial
Properties
For an input space of dimensionality d0, there are
\binom{d_0}{r} = \frac{d_0!}{(d_0 - r)!\, r!} \qquad (8)
monomials in (Eq. 6).
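As a quick arithmetic check of (Eq. 8), a one-line sketch with illustrative values of d0 and r (these particular numbers are an assumption, not from the slides):

```python
# Check of Eq. 8: number of monomials for an assumed d0 = 4, r = 2.
from math import factorial

d0, r = 4, 2
print(factorial(d0) // (factorial(d0 - r) * factorial(r)))  # prints 6
```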
46. Example of these surfaces
Hyperplanes (first-order rational varieties)
Quadrics (second-order rational varieties)
Hyperspheres (quadrics with certain linear constraints on the coefficients)
51. The Stochastic Experiment
Suppose
The activation patterns x1, x2, ..., xN are chosen independently.
Suppose
That all possible dichotomies of H = {x1, x2, ..., xN } are equiprobable.
Now, let P (N, d1) be the probability that a particular dichotomy picked at
random is φ-separable. Then
P(N, d_1) = \left(\frac{1}{2}\right)^{N-1} \sum_{m=0}^{d_1 - 1} \binom{N-1}{m} \qquad (9)
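A small numerical check of (Eq. 9), assuming nothing beyond a direct evaluation of the formula: the probability climbs toward one as d1 grows, which is the point made on the next slide.

```python
# Direct evaluation of Eq. 9 for a fixed number of patterns N and growing d1.
from math import comb

def p_separable(N, d1):
    """Probability that a random dichotomy of N patterns is phi-separable (Eq. 9)."""
    return 0.5 ** (N - 1) * sum(comb(N - 1, m) for m in range(d1))

for d1 in (2, 5, 10, 20):
    print(d1, p_separable(20, d1))   # approaches 1.0 as d1 increases
```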
54. What?
Basically (Eq. 9) represents
The essence of Cover’s Separability Theorem.
Something Notable
It is a statement of the fact that P (N, d1) is the cumulative binomial
distribution corresponding to the probability that N − 1 flips of a fair coin
will result in d1 − 1 or fewer heads.
Specifically
The larger we make the dimension d1 of the hidden space in the radial-basis
function network, the closer the probability P (N, d1) gets to one.
57. Final ingredients of Cover’s Theorem
First
Nonlinear formulation of the hidden functions defined by φi (x), where x is
the input vector and i = 1, 2, ..., d1.
Second
High dimensionality of the hidden space compared to the input space.
This dimensionality is determined by the value assigned to d1 (i.e.,
the number of hidden units).
Then
In general, a complex pattern-classification problem cast nonlinearly in a
high-dimensional space is more likely to be linearly separable than in a
low-dimensional space.
62. There is always an exception to every rule!!!
The XOR Problem
[Figure: the four XOR patterns plotted in the unit square, labeled as two classes
that cannot be separated by a single line in the input space]
63. Now
We define the following radial functions
φ1 (x) = exp(−‖x − t1‖²), where t1 = (1, 1)T
φ2 (x) = exp(−‖x − t2‖²), where t2 = (0, 0)T
Then
If we apply our classic mapping φ (x) = [φ1 (x) , φ2 (x)]:
Original Mapping
(0, 1) → (0.3678, 0.3678)
(1, 0) → (0.3678, 0.3678)
(0, 0) → (0.1353, 1)
(1, 1) → (1, 0.1353)
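A quick numerical check of this mapping (a sketch; the printed values are rounded to four decimals, so exp(−1) appears as 0.3679 rather than the truncated 0.3678 above):

```python
# Verify the Gaussian RBF mapping of the four XOR patterns with
# t1 = (1, 1) and t2 = (0, 0), i.e. phi_i(x) = exp(-||x - t_i||^2).
import numpy as np

t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])
phi = lambda x, t: np.exp(-np.sum((np.asarray(x, dtype=float) - t) ** 2))

for x in [(0, 1), (1, 0), (0, 0), (1, 1)]:
    print(x, "->", (round(phi(x, t1), 4), round(phi(x, t2), 4)))
# (0, 1) and (1, 0) map to the same point, and the two classes become
# linearly separable in the phi1-phi2 plane.
```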
65. New Space
We have the following new φ1 − φ2 space
[Figure: the transformed patterns in the φ1–φ2 plane, where the two classes
are now linearly separable]
67. Separating Capacity of a Surface
Something Notable
(Eq. 9) has an important bearing on the expected maximum number of
randomly assigned patterns that are linearly separable in a
multidimensional space.
Now, given our patterns {xi}, i = 1, ..., N
Let N be a random variable defined as the largest integer such that the
sequence x1, x2, ..., xN is φ-separable.
We have that
Prob (N = n) = P (n, d1) − P (n + 1, d1) (10)
70. Separating Capacity of a Surface
Then
\mathrm{Prob}(N = n) = \left(\frac{1}{2}\right)^{n} \binom{n-1}{d_1 - 1}, \quad n = 0, 1, 2, \ldots \qquad (11)
Remark:
\binom{n}{d_1} = \binom{n-1}{d_1 - 1} + \binom{n-1}{d_1}, \quad 0 < d_1 < n
To interpret this
Recall the negative binomial distribution.
It is a repeated sequence of Bernoulli Trials
With k failures preceding the rth success.
73. Separating Capacity of a Surface
Thus, we have that
Given p and q the probabilities of success and failure, respectively, with
p + q = 1.
Definition
p(K = k \mid p, q) = \binom{r + k - 1}{k}\, p^{r} q^{k} \qquad (12)
What happens when p = q = 1/2 and k + r = n?
Any idea?
76. Separating Capacity of a Surface
Thus
(Eq. 11) is just the negative binomial distribution shifted d1 units to the
right, with parameters d1 and 1/2.
Finally
N corresponds to the “waiting time” for the d1-th failure in a sequence of
tosses of a fair coin.
We have then
E [N] = 2d1
Median [N] = 2d1
79. This allows to define the Corollary to Cover’s Theorem
A celebrated asymptotic result
The expected maximum number of randomly assigned patterns (vectors)
that are linearly separable in a space of dimensionality d1 is equal to 2d1 .
Something Notable
This result suggests that 2d1 is a natural definition of the separating
capacity of a family of decision surfaces having d1 degrees of freedom.
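A numerical sanity check of this corollary, assuming only a direct evaluation of (Eq. 11) with the infinite sum truncated:

```python
# The distribution of Eq. 11 has mean 2*d1, matching the corollary's
# separating capacity. Truncate the infinite sum at a large n.
from math import comb

def prob_N(n, d1):
    """Prob(N = n) = (1/2)^n * C(n-1, d1-1), the shifted negative binomial."""
    return 0.5 ** n * comb(n - 1, d1 - 1) if n >= d1 else 0.0

d1 = 5
expected = sum(n * prob_N(n, d1) for n in range(d1, 400))
print(expected)   # approximately 10.0 == 2 * d1
```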
82. Given a problem of non-linearly separable patterns
It is possible to see that
There is a benefit to be gained by mapping the input space into a new
space of high enough dimension
For this, we use a non-linear map
This is quite similar to solving a difficult non-linear filtering problem by first
mapping it to a high-dimensional space and then solving it as a linear
filtering problem.
85. Take in consideration the following architecture
Mapping from input space to hidden space, followed by a linear
mapping to output space!!!
[Figure: an RBF network with input nodes, nonlinear hidden nodes, and a linear output node]
86. This can be seen as
We have the following map
s : Rd0 → R (13)
Therefore
We may think of s as a hypersurface (graph) Γ ⊂ Rd0+1
88. Example
The red planes represent the mappings and the gray plane is the linear
separator
90. General Idea
First
The training phase constitutes the optimization of a fitting procedure
for the surface Γ.
It is based on the known data points presented as input-output patterns.
Second
The generalization phase is synonymous with interpolation between
the data points.
The interpolation being performed along the constrained surface
generated by the fitting procedure.
94. This leads to the theory of multi-variable interpolation
Interpolation Problem
Given a set of N different points {xi ∈ Rd0 | i = 1, 2, ..., N} and a
corresponding set of N real numbers {di ∈ R | i = 1, 2, ..., N}, find a
function F : Rd0 → R that satisfies the interpolation condition:
F (xi) = di, i = 1, 2, ..., N (14)
Remark
For strict interpolation as specified here, the interpolating surface is
constrained to pass through all the training data points.
97. Radial-Basis Functions (RBF)
The function F has the following form (Powell, 1988)
F(x) = \sum_{i=1}^{N} w_i\, \varphi(\lVert x - x_i \rVert) \qquad (15)
Where
{φ (‖x − xi‖) | i = 1, ..., N}
is a set of N arbitrary, generally non-linear, functions known as radial-basis
functions, and ‖·‖ denotes a norm that is usually Euclidean.
In addition
The known data points xi ∈ Rd0 , i = 1, 2, ..., N are taken to be the centers
of the radial basis functions.
100. A Set of Simultaneous Linear Equations
Given
φji = φ (‖xj − xi‖) , (j, i) = 1, 2, ..., N (16)
Using (Eq. 14) and (Eq. 15), we get
\begin{bmatrix}
\varphi_{11} & \varphi_{12} & \cdots & \varphi_{1N} \\
\varphi_{21} & \varphi_{22} & \cdots & \varphi_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
\varphi_{N1} & \varphi_{N2} & \cdots & \varphi_{NN}
\end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix}
=
\begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{bmatrix}
\qquad (17)
102. Now
We can create the following vectors
d = [d1, d2, ..., dN ]T (response vector)
w = [w1, w2, ..., wN ]T (linear weight vector)
Now, we define an N × N matrix called the interpolation matrix
Φ = {φji | (j, i) = 1, 2, ..., N} (18)
Thus, we have
Φw = d (19)
105. From here
Assuming that Φ is a non-singular matrix
w = Φ−1 d (20)
Question
How can we be sure that the interpolation matrix Φ is non-singular?
Answer
It turns out that for a large class of radial-basis functions and under
certain conditions the non-singularity happens!!!
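A minimal sketch of strict interpolation with Gaussian basis functions and a toy one-dimensional data set (both are assumptions made only for illustration; Gaussians with distinct centers are one of the classes for which the matrix is non-singular): build Φ from (Eq. 16), solve (Eq. 19) for w, and evaluate (Eq. 15).

```python
# Strict RBF interpolation: Phi w = d, w = Phi^{-1} d, F(x) = sum_i w_i phi(||x - x_i||).
import numpy as np

def gaussian(r, width=0.2):
    return np.exp(-(r ** 2) / (2.0 * width ** 2))

def fit_rbf(X, d, width=0.2):
    """Build the interpolation matrix (Eq. 16-17) and solve for the weights (Eq. 20)."""
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    Phi = gaussian(r, width)
    return np.linalg.solve(Phi, d)

def predict_rbf(x, X, w, width=0.2):
    """Evaluate F(x) = sum_i w_i phi(||x - x_i||) (Eq. 15)."""
    return w @ gaussian(np.linalg.norm(X - x, axis=-1), width)

X = np.linspace(0, 1, 8).reshape(-1, 1)       # centers = the training points
d = np.sin(2 * np.pi * X).ravel()
w = fit_rbf(X, d)
print(predict_rbf(np.array([0.0]), X, w), d[0])   # surface passes through the data point
```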
109. Introduction
Observation
The strict interpolation procedure described may not be a good strategy
for the training of RBF networks for certain classes of tasks.
Reason
If the number of data points is much larger than the number of degrees of
freedom of the underlying physical process.
Thus
The network may end up fitting misleading variations due to idiosyncrasies
or noise in the input data.
113. Well-posed
The Problem
Assume that we have a domain X and a range Y, both metric spaces.
They are related by a mapping
f : X → Y (21)
Definition
The problem of reconstructing the mapping f is said to be well-posed if
three conditions are satisfied: Existence, Uniqueness and Continuity.
116. Defining the meaning of this
Existence
For every input vector x ∈ X, there does exist an output y = f (x), where
y ∈ Y .
Uniqueness
For any pair of input vectors x, t ∈ X, we have f (x) = f (t) if and only if
x = t.
Continuity
The mapping is continuous if, for any ε > 0, there exists δ > 0 such that the
condition dX (x, t) < δ implies dY (f (x) , f (t)) < ε.
120. Ill-Posed
Therefore
If any of these conditions is not satisfied, the problem is said to be
ill-posed.
Basically
An ill-posed problem means that large data sets may contain a
surprisingly small amount of information about the desired solution.
124. We have the following
Physical Phenomena
Speech, pictures, radar signals, sonar signals, seismic data.
The data themselves are well posed
But learning from such data, i.e., rebuilding the hypersurface, can be an
ill-posed inverse problem.
126. Why
First
The existence criterion may be violated in that a distinct output may not
exist for every input
Second
There may not be as much information in the training sample as we really
need to reconstruct the input-output mapping uniquely.
Third
The unavoidable presence of noise or imprecision in real-life training data
adds uncertainty to the reconstructed input-output mapping.
130. How?
This can happen when
There is a lack of information!!!
Lanczos, 1964
“A lack of information cannot be remedied by any mathematical trickery.”
133. How do we solve the problem?
Something Notable
In 1963, Tikhonov proposed a new method called regularization for solving
ill-posed problems.
Tikhonov
He was a Soviet and Russian mathematician known for important
contributions to topology, functional analysis, mathematical physics, and
ill-posed problems.
135. Also Known as Ridge Regression
Setup
We have:
Input signal {xi ∈ Rd0 | i = 1, ..., N}.
Output signal {di ∈ R | i = 1, ..., N}.
In addition
Note that the output is assumed to be one-dimensional.
137. Now, assuming that you have an approximation function
y = F (x)
Standard Error Term
E_s(F) = \frac{1}{2}\sum_{i=1}^{N} (d_i - y_i)^2 = \frac{1}{2}\sum_{i=1}^{N} \bigl(d_i - F(x_i)\bigr)^2 \qquad (22)
Regularization Term
E_c(F) = \frac{1}{2}\,\lVert \mathbf{D}F \rVert^2 \qquad (23)
Where
D is a linear differential operator.
140. Now
Ordinarily y = F (x)
Normally, the function space representing the functional F is the L2 space
that consists of all real-valued functions f (x) with x ∈ Rd0 .
The quantity to be minimized in regularization theory is
E(f) = \frac{1}{2}\sum_{i=1}^{N} \bigl(d_i - f(x_i)\bigr)^2 + \frac{1}{2}\lambda \lVert \mathbf{D}f \rVert^2 \qquad (24)
Where
λ is a positive real number called the regularization parameter.
E (f ) is called the Tikhonov functional.
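As a rough illustration of the effect of λ, here is a sketch under an assumption made purely for illustration: with Gaussian basis functions, the regularized weights are taken from (Φ + λI) w = d, a regularized counterpart of (Eq. 20). The slides develop the actual solution later (under "Getting a solution"), so this is not the author's derivation.

```python
# Effect of the regularization parameter lambda on the interpolation weights
# (assumption: solve (Phi + lambda*I) w = d instead of Phi w = d).
import numpy as np

def gaussian_matrix(X, width=0.1):
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.exp(-(r ** 2) / (2.0 * width ** 2))

def fit_regularized(X, d, lam, width=0.1):
    Phi = gaussian_matrix(X, width)
    return np.linalg.solve(Phi + lam * np.eye(len(d)), d)

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 30).reshape(-1, 1)
d = np.sin(2 * np.pi * X).ravel() + 0.1 * rng.standard_normal(30)   # noisy targets
for lam in (1e-8, 1e-2, 1.0):
    w = fit_regularized(X, d, lam)
    Phi = gaussian_matrix(X)
    print(lam, np.linalg.norm(w), np.mean((Phi @ w - d) ** 2))
    # larger lambda -> smaller weight norm (smoother fit) but larger training error
```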
145. Introduction
What did we see until now?
The design of learning machines from two main points of view:
Statistical Point of View
Linear Algebra and Optimization Point of View
Going back to the probability models
We might think of the machine to be learned as a function g (x|D)...
Something as curve fitting...
Under a data set
D = {(xi, yi) |i = 1, 2, ..., N} (25)
Remark: Where the xi ∼ p (x|Θ)!!!
152. Thus, we have that
Two main functions
A function g (x|D) obtained using some algorithm!!!
E [y|x] the optimal regression...
Important
The key factor here is the dependence of the approximation on D.
Why?
The approximation may be very good for a specific training data set but
very bad for another.
This is the reason for studying fusion of information at the decision level...
157. How do we measure the difference
We have that
Var(X) = E[(X − µ)2]
We can do that for our data
VarD (g (x|D)) = ED[(g (x|D) − E [y|x])2]
Now, if we add and subtract
ED [g (x|D)] (26)
Remark: The expected output of the machine g (x|D)
161. Thus, we have that
Our original variance
Var_D(g(x|D)) = E_D[(g(x|D) − E[y|x])^2]
= E_D[(g(x|D) − E_D[g(x|D)] + E_D[g(x|D)] − E[y|x])^2]
= E_D[(g(x|D) − E_D[g(x|D)])^2] + 2 E_D[(g(x|D) − E_D[g(x|D)]) (E_D[g(x|D)] − E[y|x])] + (E_D[g(x|D)] − E[y|x])^2
Finally
E_D[(g(x|D) − E_D[g(x|D)]) (E_D[g(x|D)] − E[y|x])] = ?   (27)
Remark: This cross term is zero, since E_D[g(x|D) − E_D[g(x|D)]] = 0 while the factor E_D[g(x|D)] − E[y|x] does not depend on D.
76 / 96
165. We have the Bias-Variance
Our Final Equation
E_D[(g(x|D) − E[y|x])^2] = E_D[(g(x|D) − E_D[g(x|D)])^2]   (VARIANCE)
+ (E_D[g(x|D)] − E[y|x])^2   (BIAS)
Where the variance
It represents the measure of the error between our machine g(x|D) and the expected output of the machine under x_i ∼ p(x|Θ).
Where the bias
It represents the quadratic error between the expected output of the machine under x_i ∼ p(x|Θ) and the optimal regression E[y|x].
77 / 96
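To make the decomposition concrete, here is a minimal Monte Carlo sketch (an illustration, not part of the slides) that estimates both terms for a simple machine g(x|D): a polynomial least-squares fit to noisy samples of a known curve, so that E[y|x] is available exactly. The helper names (true_regression, sample_dataset, fit_machine), the polynomial degree and the noise level are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_regression(x):
    # E[y|x]: the optimal regression, known here because we generate the data
    return np.sin(2 * np.pi * x)

def sample_dataset(n=25, sigma=0.3):
    # D = {(x_i, d_i)}, with x_i drawn uniformly and d_i = E[y|x_i] + noise
    x = rng.uniform(0.0, 1.0, n)
    d = true_regression(x) + rng.normal(0.0, sigma, n)
    return x, d

def fit_machine(x, d, degree=9):
    # g(x|D): a polynomial least-squares fit to the data set D
    return np.polynomial.Polynomial.fit(x, d, degree)

# Evaluate g(x|D) on a fixed grid for many independent data sets D
x_grid = np.linspace(0.0, 1.0, 200)
preds = np.array([fit_machine(*sample_dataset())(x_grid) for _ in range(500)])

mean_g = preds.mean(axis=0)                        # E_D[g(x|D)]
variance = ((preds - mean_g) ** 2).mean(axis=0)    # E_D[(g - E_D[g])^2]
bias_sq = (mean_g - true_regression(x_grid)) ** 2  # (E_D[g] - E[y|x])^2
total = ((preds - true_regression(x_grid)) ** 2).mean(axis=0)

# The decomposition says total = variance + bias_sq at every x
print(np.allclose(total, variance + bias_sq))
```

Making the machine more flexible (a higher degree) pushes the variance term up and the bias term down, and vice versa: this is exactly the dilemma described above.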
169. Outline
1 Introduction
Main Idea
Basic Radial-Basis Functions
2 Separability
Cover’s Theorem on the separability of patterns
Dichotomy
φ-separable functions
The Stochastic Experiment
The XOR Problem
Separating Capacity of a Surface
3 Interpolation Problem
What is gained?
Feedforward Network
Learning Process
Radial-Basis Functions (RBF)
4 Introduction
Description of the Problem
Well-posed or ill-posed
The Main Problem
5 Regularization Theory
Solving the issue
Bias-Variance Dilemma
Measuring the difference between optimal and learned
The Bias-Variance
How can we use this?
Getting a solution
We still need to talk about...
78 / 96
170. Using this in our favor!!!
Something Notable
Introducing bias is equivalent to restricting the range of functions for
which a model can account.
Typically this is achieved by removing degrees of freedom.
Examples
Lowering the order of a polynomial or reducing the number of weights in a neural network!!!
Ridge Regression
It does not explicitly remove degrees of freedom but instead reduces the
effective number of parameters.
79 / 96
174. Example
In the case of a linear regression model
C(w) = \sum_{i=1}^{N} (d_i − w^T x_i)^2 + λ \sum_{j=1}^{d_0} w_j^2   (28)
Thus
This is ridge regression (weight decay) and the regularization
parameter λ > 0 controls the balance between fitting the data and
avoiding the penalty.
A small value for λ means the data can be fit tightly without causing
a large penalty.
A large value for λ means a tight fit has to be sacrificed if it requires
large weights.
80 / 96
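As a quick numerical illustration of Eq. (28) (not from the slides): for the same synthetic data, a larger λ shrinks the weight vector at the price of a looser fit. The data, the helper ridge_fit and the chosen λ values are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data for the linear model of Eq. (28): d_i = w_true^T x_i + noise
N, d0 = 50, 5
X = rng.normal(size=(N, d0))                 # rows are x_i^T
w_true = rng.normal(size=d0)
d = X @ w_true + rng.normal(scale=0.5, size=N)

def ridge_fit(X, d, lam):
    # Minimizer of C(w) = ||d - X w||^2 + lam * ||w||^2
    d0 = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d0), X.T @ d)

for lam in (0.0, 1.0, 100.0):
    w = ridge_fit(X, d, lam)
    fit_err = np.sum((d - X @ w) ** 2)
    print(f"lambda={lam:6.1f}  fit error={fit_err:8.3f}  ||w||={np.linalg.norm(w):.3f}")
# A larger lambda sacrifices some fit in exchange for smaller weights.
```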
178. Important
The Bias
It favors solutions involving small weights and the effect is to smooth the
output function.
81 / 96
179. Outline
1 Introduction
Main Idea
Basic Radial-Basis Functions
2 Separability
Cover’s Theorem on the separability of patterns
Dichotomy
φ-separable functions
The Stochastic Experiment
The XOR Problem
Separating Capacity of a Surface
3 Interpolation Problem
What is gained?
Feedforward Network
Learning Process
Radial-Basis Functions (RBF)
4 Introduction
Description of the Problem
Well-posed or ill-posed
The Main Problem
5 Regularization Theory
Solving the issue
Bias-Variance Dilemma
Measuring the difference between optimal and learned
The Bias-Variance
How can we use this?
Getting a solution
We still need to talk about...
82 / 96
180. Now, we can carry out the optimization
First, we rewrite the cost function in the following way
S(w) = \sum_{i=1}^{N} (d_i − f(x_i))^2   (29)
And we will use a generalized version for f
f(x_i) = \sum_{j=1}^{d_1} w_j φ_j(x_i)   (30)
Where
The free variables are the weights {w_j}_{j=1}^{d_1}.
83 / 96
183. Where
For φ_j(x_i), in our case we may use the Gaussian kernel
φ_j(x_i) = φ(x_i, x_j)   (31)
With
φ(x, x_j) = exp(−‖x − x_j‖^2 / (2σ^2))   (32)
84 / 96
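A minimal sketch of the kernel in Eqs. (31)–(32), assuming the squared Euclidean norm and a shared width σ (standard choices, not spelled out on the slide):

```python
import numpy as np

def gaussian_rbf(x, center, sigma=1.0):
    # phi(x, x_j) = exp(-||x - x_j||^2 / (2 sigma^2))
    diff = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

x_j = np.array([0.0, 0.0])
print(gaussian_rbf([0.0, 0.0], x_j))   # 1.0 at the center
print(gaussian_rbf([1.0, 1.0], x_j))   # decays with distance from x_j
```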
185. Thus
Final cost function assuming there is a regularization term per weight
C(w, λ) = \sum_{i=1}^{N} (d_i − f(x_i))^2 + \sum_{j=1}^{d_1} λ_j w_j^2   (33)
What do we do?
1 Differentiate the function with respect to the free variables.
2 Equate the results with zero.
3 Solve the resulting equations.
85 / 96
189. Differentiate the function with respect to the free variables.
First
∂C(w, λ)/∂w_j = −2 \sum_{i=1}^{N} (d_i − f(x_i)) ∂f(x_i)/∂w_j + 2 λ_j w_j   (34)
We get the derivative of f(x_i) with respect to w_j
∂f(x_i)/∂w_j = φ_j(x_i)   (35)
86 / 96
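As a sanity check on Eqs. (34)–(35) (an illustration, not from the slides), the following sketch compares the analytic gradient of the cost in Eq. (33) with a central finite-difference estimate on a random test problem; Phi, d, lam and w are made-up values.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d1 = 20, 4
Phi = rng.normal(size=(N, d1))     # Phi[i, j] = phi_j(x_i)
d = rng.normal(size=N)
lam = rng.uniform(0.1, 1.0, d1)    # one lambda_j per weight
w = rng.normal(size=d1)

def cost(w):
    # C(w, lambda) = sum_i (d_i - f(x_i))^2 + sum_j lambda_j w_j^2, with f = Phi w
    r = d - Phi @ w
    return r @ r + lam @ (w**2)

# Analytic gradient from Eqs. (34)-(35): -2 Phi^T (d - Phi w) + 2 lambda_j w_j
grad_analytic = -2.0 * Phi.T @ (d - Phi @ w) + 2.0 * lam * w

# Central finite differences for comparison
eps = 1e-6
grad_numeric = np.array([
    (cost(w + eps * e) - cost(w - eps * e)) / (2 * eps)
    for e in np.eye(d1)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))
```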
191. Now
We have then
\sum_{i=1}^{N} f(x_i) φ_j(x_i) + λ_j w_j = \sum_{i=1}^{N} d_i φ_j(x_i)   (36)
Something Notable
There are d_1 such equations, one for each 1 ≤ j ≤ d_1, each representing one
constraint on the solution.
Since there are exactly as many constraints as there are unknowns, this
system of equations has, except under certain pathological conditions, a
unique solution.
87 / 96
194. Using Our Linear Algebra
We have then
φ_j^T f + λ_j w_j = φ_j^T d   (37)
Where
φ_j = (φ_j(x_1), φ_j(x_2), ..., φ_j(x_N))^T,   f = (f(x_1), f(x_2), ..., f(x_N))^T,   d = (d_1, d_2, ..., d_N)^T   (38)
88 / 96
196. Now
Since there is one such equation for each j, each relating one scalar
quantity to another, we can stack them
(φ_1^T f, φ_2^T f, ..., φ_{d_1}^T f)^T + (λ_1 w_1, λ_2 w_2, ..., λ_{d_1} w_{d_1})^T = (φ_1^T d, φ_2^T d, ..., φ_{d_1}^T d)^T   (39)
Now, if we define
Φ = [φ_1  φ_2  · · ·  φ_{d_1}]   (40)
Written in full form, Φ is the N × d_1 matrix with entries Φ_{ij} = φ_j(x_i):
Φ = [ φ_1(x_1)  φ_2(x_1)  · · ·  φ_{d_1}(x_1)
      φ_1(x_2)  φ_2(x_2)  · · ·  φ_{d_1}(x_2)
      ...
      φ_1(x_N)  φ_2(x_N)  · · ·  φ_{d_1}(x_N) ]   (41)
89 / 96
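A minimal sketch of how the design matrix Φ in Eq. (41) can be assembled, assuming the Gaussian basis of Eq. (32) with a subset of the data points used as centers:

```python
import numpy as np

def design_matrix(X, centers, sigma=1.0):
    # Phi[i, j] = phi_j(x_i) = exp(-||x_i - c_j||^2 / (2 sigma^2)), as in Eq. (41)
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma**2))

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 2))        # N = 6 input points x_i
centers = X[:4]                    # d_1 = 4 centers x_j (here: a subset of the data)
Phi = design_matrix(X, centers)
print(Phi.shape)                   # (N, d_1) = (6, 4)
```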
199. We can then
Define the following matrix equation
Φ^T f + Λw = Φ^T d   (42)
Where
Λ = diag(λ_1, λ_2, ..., λ_{d_1})   (43)
90 / 96
201. Now, we have that
The vector f can be decomposed into the product of two terms
The design matrix and the weight vector
We have then
f_i = f(x_i) = \sum_{j=1}^{d_1} w_j φ_j(x_i) = φ_i^T w   (44)
Where
φ_i = (φ_1(x_i), φ_2(x_i), ..., φ_{d_1}(x_i))^T   (45)
Note: φ_i here denotes the i-th row of Φ, whereas φ_j in Eq. (38) denoted its j-th column.
91 / 96
204. Furthermore
We get that
f = (f_1, f_2, ..., f_N)^T = (φ_1^T w, φ_2^T w, ..., φ_N^T w)^T = Φw   (46)
Finally, we have that
Φ^T d = Φ^T f + Λw = Φ^T Φw + Λw = (Φ^T Φ + Λ) w
92 / 96
206. Now...
We get finally
w = (Φ^T Φ + Λ)^{-1} Φ^T d   (47)
Remember
This equation is the most general form of the normal equation.
We have two cases
In standard ridge regression, λ_j = λ for 1 ≤ j ≤ d_1.
Ordinary least squares, where there is no weight penalty, i.e. all λ_j = 0 for 1 ≤ j ≤ d_1.
93 / 96
209. Thus, we have
First Case
w = (Φ^T Φ + λ I_{d_1})^{-1} Φ^T d   (48)
Second Case
w = (Φ^T Φ)^{-1} Φ^T d   (49)
94 / 96
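Putting the pieces together, a minimal end-to-end sketch (with made-up one-dimensional data, a subset of the inputs as Gaussian centers, and a shared λ as in Eq. (48)) of solving for the weights:

```python
import numpy as np

rng = np.random.default_rng(4)

# Training set D = {(x_i, d_i)}: noisy samples of an unknown curve
N = 30
x = np.sort(rng.uniform(-3, 3, N))
d = np.sinc(x) + rng.normal(scale=0.1, size=N)

# Gaussian RBF design matrix, Eq. (41), using a subset of the inputs as centers
centers = x[::3]                                                       # d_1 = 10 centers
sigma = 0.7
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma**2))   # N x d_1

def solve_weights(Phi, d, lam):
    # Eq. (48): w = (Phi^T Phi + lam * I)^(-1) Phi^T d  (lam = 0 gives Eq. (49))
    d1 = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d1), Phi.T @ d)

for lam in (0.0, 1e-2, 1.0):
    w = solve_weights(Phi, d, lam)
    f = Phi @ w   # fitted outputs at the training inputs, f = Phi w, Eq. (46)
    print(f"lambda={lam:5.2f}  training error={np.sum((d - f) ** 2):7.4f}  "
          f"||w||={np.linalg.norm(w):7.3f}")
```

A positive λ trades a slightly larger training error for smaller weights, i.e. a smoother fitted function, which is exactly the smoothing effect of the bias discussed earlier.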
211. Outline
1 Introduction
Main Idea
Basic Radial-Basis Functions
2 Separability
Cover’s Theorem on the separability of patterns
Dichotomy
φ-separable functions
The Stochastic Experiment
The XOR Problem
Separating Capacity of a Surface
3 Interpolation Problem
What is gained?
Feedforward Network
Learning Process
Radial-Basis Functions (RBF)
4 Introduction
Description of the Problem
Well-posed or ill-posed
The Main Problem
5 Regularization Theory
Solving the issue
Bias-Variance Dilemma
Measuring the difference between optimal and learned
The Bias-Variance
How can we use this?
Getting a solution
We still need to talk about...
95 / 96
212. There are still several things that we need to look at...
First
What is the variance of the weight vector? The Variance Matrix.
Second
The prediction of the output at any of the training set inputs - The
Projection Matrix
Finally
The incremental algorithm for the problem!!!
96 / 96