R. Rao, Week 6: Neural Networks
CSE 599 Lecture 6: Neural Networks and Models
Last Lecture:
Neurons, membranes, channels
Structure of Neurons: dendrites, cell body and axon
Input: Synapses on dendrites and cell body (soma)
Output: Axon, myelin for fast signal propagation
Function of Neurons:
Internal voltage controlled by Na/K/Cl… channels
Channels gated by chemicals (neurotransmitters) and membrane
voltage
Inputs at synapses release neurotransmitter, causing increase or
decrease in membrane voltages (via currents in channels)
Large positive change in voltage → membrane voltage exceeds
threshold → action potential or spike generated
McCulloch–Pitts “neuron” (1943)
Attributes of neuron
m binary inputs and 1 output (0 or 1)
Synaptic weights wij
Threshold μi
McCulloch–Pitts Neural Networks
Synchronous discrete time operation
Time quantized in units of synaptic delay
Output is 1 if and only if the weighted
sum of inputs reaches the threshold
Θ(x) = 1 if x ≥ 0 and 0 if x < 0
Behavior of network can be simulated by a finite automaton
Any FA can be simulated by a McCulloch-Pitts Network
n_i(t+1) = Θ( Σ_j w_ij n_j(t) − μ_i )

where
n_i = output of unit i
Θ = step function
w_ij = weight from unit j to unit i
μ_i = threshold of unit i
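A McCulloch-Pitts unit is small enough to check in a few lines. A minimal sketch in Python (the function name and the AND-gate weights/threshold are illustrative choices, not values from the lecture):

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: fire (1) iff the weighted input sum reaches the threshold."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else 0

# With weights (1, 1) and threshold 2, the unit computes logical AND
print(mp_neuron([1, 1], [1, 1], 2))  # 1
print(mp_neuron([1, 0], [1, 1], 2))  # 0
```

Wiring such units together (each update taking one synaptic-delay time step) gives the synchronous networks described above.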
Properties of Artificial Neural Networks
High level abstraction of neural input-output transformation
Inputs → weighted sum of inputs → nonlinear function → output
Typically no spikes
Typically use implausible constraints or learning rules
Often used where data or functions are uncertain
Goal is to learn from a set of training data
And to generalize from learned instances to new unseen data
Key attributes
Parallel computation
Distributed representation and storage of data
Learning (networks adapt themselves to solve a problem)
Fault tolerance (insensitive to component failures)
Topologies of Neural Networks
completely connected
feedforward (directed, acyclic)
recurrent (feedback connections)
Network Types
Feedforward versus recurrent networks
Feedforward: No loops, input → hidden layers → output
Recurrent: Use feedback (positive or negative)
Continuous versus spiking
Continuous networks model mean spike rate (firing rate)
Assume spikes are integrated over time
Consistent with rate-code model of neural coding
Supervised versus unsupervised learning
Supervised networks use a “teacher”
The desired output for each input is provided by user
Unsupervised networks find hidden statistical patterns in input data
Clustering, principal component analysis
History
1943: McCulloch–Pitts “neuron”
Started the field
1962: Rosenblatt’s perceptron
Learned its own weight values; convergence proof
1969: Minsky & Papert book on perceptrons
Proved limitations of single-layer perceptron networks
1982: Hopfield and convergence in symmetric networks
Introduced energy-function concept
1986: Backpropagation of errors
Method for training multilayer networks
Present: Probabilistic interpretations, Bayesian and spiking networks
Perceptrons
Attributes
Layered feedforward networks
Supervised learning
Hebbian: Adjust weights to enforce correlations
Parameters: weights wij
Binary output = Θ(weighted sum of inputs)
Take w0 to be the threshold, with fixed input −1.
Output_i = Θ( Σ_j w_ij ξ_j )

[Figure: single-layer and multilayer perceptron architectures]
Training Perceptrons to Compute a Function
Given inputs ξ_j to neuron i and desired output Y_i, find its weight values
by iterative improvement:
1. Feed an input pattern
2. Is the binary output correct?
Yes: Go to the next pattern
No: Modify the connection weights using error signal (Yi – Oi)
Increase weight if neuron didn’t fire when it should have and vice versa
Learning rule is Hebbian (based on input/output correlation)
converges in a finite number of steps if a solution exists
Used in ADALINE (adaptive linear neuron) networks
Δw_ij = η (Y_i − O_i) ξ_j
O_i = Θ( Σ_j w_ij ξ_j )

where
η = learning rate
ξ_j = input j
Y_i = desired output
O_i = actual output
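The iterative procedure above can be sketched as a short training loop; the learning rate, epoch cap, and AND training set are illustrative choices, not values from the lecture:

```python
def step(x):
    return 1 if x >= 0 else 0

def train_perceptron(patterns, eta=0.5, epochs=20):
    """Perceptron rule: w_ij <- w_ij + eta * (Y - O) * xi_j.
    w[0] plays the role of the threshold w0, fed the fixed input -1."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, target in patterns:
            xs = [-1.0] + list(x)                   # prepend the fixed -1 input
            out = step(sum(wi * xi for wi, xi in zip(w, xs)))
            err = target - out                      # error signal (Y_i - O_i)
            w = [wi + eta * err * xi for wi, xi in zip(w, xs)]
    return w

# AND is linearly separable, so the rule converges to a solution
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(AND)
for x, target in AND:
    xs = [-1.0] + list(x)
    assert step(sum(wi * xi for wi, xi in zip(w, xs))) == target
```

The weight is only changed when the output is wrong, exactly as in steps 1-2 above.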
Computational Power of Perceptrons
Consider a single-layer perceptron
Assume threshold units
Assume binary inputs and outputs
Weighted sum forms a linear hyperplane
Consider a single output network with two inputs
Only functions that are linearly separable can be computed
Example: AND is linearly separable
Decision boundary: Σ_j w_ij ξ_j = 0
[Figure: a line separates the inputs with output 0 from the input with output 1]
Linear inseparability
Single-layer perceptron with threshold units fails if problem
is not linearly separable
Example: XOR
Can use other tricks (e.g.
complicated threshold
functions) but complexity
blows up
Minsky and Papert’s book
showing these negative results
was very influential
Solution in 1980s: Multilayer perceptrons
Removes many limitations of single-layer networks
Can solve XOR
Exercise: Draw a two-layer perceptron that computes the
XOR function
2 binary inputs ξ1 and ξ2
1 binary output
One “hidden” layer
Find the appropriate
weights and threshold
Solution in 1980s: Multilayer perceptrons
Examples of two-layer perceptrons that compute XOR
E.g. the network on the right (inputs x and y):
Output is 1 if and only if x + y − 2·Θ(x + y − 1.5) − 0.5 > 0,
where Θ is the step function
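That condition can be checked directly; `step` and `xor_net` are hypothetical names for this sketch:

```python
def step(x):
    """Step function used in the output condition (fires on strictly positive input)."""
    return 1 if x > 0 else 0

def xor_net(x, y):
    """Two-layer perceptron for XOR, following the slide's formula."""
    hidden = step(x + y - 1.5)             # fires only when both inputs are 1
    return step(x + y - 2 * hidden - 0.5)  # the hidden unit inhibits the (1,1) case

print([xor_net(x, y) for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```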
Multilayer Perceptron
Input nodes
Output neurons
One or more
layers of
hidden units
(hidden layers)
The most common output function (sigmoid):

g(a) = 1 / (1 + e^(−a))

(non-linear squashing function)
Example: Perceptrons as Constraint Satisfaction Networks
[Figure: a two-layer network with inputs x and y and output "out"; the diagram shows the hidden- and output-unit weights and thresholds]
Example: Perceptrons as Constraint Satisfaction Networks
[Figure: the first hidden unit's threshold condition defines a line in the (x, y) plane, splitting it into a region where the unit outputs 0 and a region where it outputs 1]
Example: Perceptrons as Constraint Satisfaction Networks
[Figure: the second hidden unit defines a second line, giving another =0 / =1 split of the (x, y) plane]
Example: Perceptrons as Constraint Satisfaction Networks
[Figure: the output unit combines the two hidden-unit outputs with a weighted difference; the network's output is 1 where that combination is > 0]
Perceptrons as Constraint Satisfaction Networks
[Figure: summary — the lines defined by the hidden units carve the (x, y) plane into regions, and the output unit picks out the region satisfying both constraints]
Learning networks
We want networks that configure themselves
Learn from the input data or from training examples
Generalize from learned data
Can this network configure itself
to solve a problem?
How do we train it?
Gradient-descent learning
Use a differentiable activation function
Try a continuous function f(·) instead of the step function Θ(·)
First guess: Use a linear unit
Define an error function (cost function or “energy” function)
Changes weights in the direction of smaller errors
Minimizes the mean-squared error over input patterns
Called Delta rule = adaline rule = Widrow-Hoff rule = LMS rule
E = (1/2) Σ_u Σ_i [ Y_i^u − Σ_j w_ij ξ_j^u ]²

Then Δw_ij = −η ∂E/∂w_ij = η Σ_u [ Y_i^u − Σ_j w_ij ξ_j^u ] ξ_j^u
Cost function measures the
network’s performance as a
differentiable function of the
weights
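The delta rule above can be sketched for a single linear unit; the learning rate, epoch count, and toy data set below are illustrative:

```python
def lms_train(patterns, eta=0.1, epochs=200):
    """Delta (LMS / Widrow-Hoff) rule for one linear unit:
    w <- w + eta * (Y - w.x) * x, i.e. gradient descent on the squared error."""
    n = len(patterns[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in patterns:
            out = sum(wi * xi for wi, xi in zip(w, x))   # linear unit output
            w = [wi + eta * (y - out) * xi for wi, xi in zip(w, x)]
    return w

# Learn y = 2*x1 - x2 from consistent samples; the rule recovers the weights
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
w = lms_train(data)
assert abs(w[0] - 2.0) < 1e-3 and abs(w[1] + 1.0) < 1e-3
```

Because the data here are exactly linear, the mean-squared error can be driven to zero; with noisy data the rule converges toward the least-squares weights instead.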
Backpropagation of errors
Use a nonlinear, differentiable activation function
Such as a sigmoid
Use a multilayer feedforward network
Outputs are differentiable functions of the inputs
Result: Can propagate credit/blame back to internal nodes
Chain rule (calculus) gives Δw_ij for internal “hidden” nodes
Based on gradient-descent learning
f(h) = 1 / (1 + exp(−2βh)),  where  h = Σ_j w_ij ξ_j
Backpropagation of errors (con’t)
[Figure: network diagram labeling the output of hidden unit j as V_j]
Backpropagation of errors (con’t)
Let Ai be the activation (weighted sum of inputs) of neuron i
Let Vj = g(Aj) be output of hidden unit j
Learning rule for hidden-output connection weights:
ΔW_ij = −η ∂E/∂W_ij = η Σ_u [d_i − a_i] g′(A_i) V_j = η Σ_u δ_i V_j
Learning rule for input-hidden connection weights:
Δw_jk = −η ∂E/∂w_jk = −η (∂E/∂V_j)(∂V_j/∂w_jk)   {chain rule}
= η Σ_u Σ_i ( [d_i − a_i] g′(A_i) W_ij ) (g′(A_j) ξ_k) = η Σ_u δ_j ξ_k
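One way to see the chain rule at work is to compare the backpropagated gradient with a finite-difference estimate of the cost. A minimal sketch, assuming a 2-2-1 network with sigmoid units and squared-error cost E = ½(d − out)² for a single pattern; all weight values are illustrative:

```python
import math

def g(a):
    """Sigmoid output function."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w_hid, w_out):
    A = [sum(wk * xk for wk, xk in zip(row, x)) for row in w_hid]  # hidden activations A_j
    V = [g(a) for a in A]                                          # hidden outputs V_j = g(A_j)
    a_out = sum(wj * vj for wj, vj in zip(w_out, V))               # output activation
    return A, V, a_out, g(a_out)

def backprop_grads(x, d, w_hid, w_out):
    """Deltas for one pattern: dW_j = delta_o * V_j, dw_jk = delta_j * x_k."""
    A, V, a_out, out = forward(x, w_hid, w_out)
    gp = lambda a: g(a) * (1.0 - g(a))            # g'(a) for the sigmoid
    delta_o = (d - out) * gp(a_out)               # output-layer delta
    dW = [delta_o * vj for vj in V]               # hidden -> output gradients
    dw = [[delta_o * Wj * gp(Aj) * xk for xk in x]  # input -> hidden, via the chain rule
          for Wj, Aj in zip(w_out, A)]
    return dW, dw

# sanity check: dW[0] matches a finite-difference gradient of E = 0.5*(d - out)^2
x, d = [1.0, 0.5], 1.0
w_hid, w_out = [[0.3, -0.2], [0.1, 0.4]], [0.5, -0.3]
E = lambda wh, wo: 0.5 * (d - forward(x, wh, wo)[3]) ** 2
dW, dw = backprop_grads(x, d, w_hid, w_out)
eps = 1e-6
num = (E(w_hid, [w_out[0] + eps, w_out[1]]) - E(w_hid, w_out)) / eps
assert abs(num + dW[0]) < 1e-4   # dE/dW_0 = -dW[0]
```

Summing these per-pattern deltas over the training set gives the batch rules above.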
Backpropagation
Can be extended to arbitrary number of layers but three is most
commonly used
Can approximate arbitrary functions: crucial issues are
generalization to examples not in the training data set
number of hidden units
number of samples
speed of convergence to a stable set of weights (sometimes a momentum
term α Δw_pq is added to the learning rule to speed up learning)
In your homework, you will use backpropagation and the delta rule in a
simple pattern recognition task: classifying noisy images of the digits 0
through 9
C Code for the networks is already given – you will only need to modify
the input and output
Hopfield networks
Act as “autoassociative” memories to store patterns
McCulloch-Pitts neurons with outputs −1 or +1, and thresholds θ_i
All neurons connected to each other
Symmetric weights (wij = wji) and wii = 0
Asynchronous updating of outputs
Let si be the state of unit i
At each time step, pick a random unit
Set s_i to +1 if Σ_j w_ij s_j ≥ θ_i; otherwise, set s_i to −1
completely
connected
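The asynchronous update can be sketched directly; the two-unit network and its weights are illustrative:

```python
import random

def hopfield_step(s, w, theta):
    """One asynchronous update: pick a random unit i and set
    s_i = +1 if sum_j w_ij s_j >= theta_i, else -1."""
    i = random.randrange(len(s))
    h = sum(w[i][j] * s[j] for j in range(len(s)))
    s[i] = 1 if h >= theta[i] else -1
    return s

random.seed(0)
w = [[0, 1], [1, 0]]        # symmetric weights (w_ij = w_ji), zero diagonal
s = [1, -1]
for _ in range(10):
    hopfield_step(s, w, [0, 0])
assert s[0] == s[1]         # the positive coupling aligns the two units
```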
Hopfield networks
Hopfield showed that asynchronous updating in symmetric
networks minimizes an “energy” function and leads to a
stable final state for a given initial state
Define an energy function (analogous to the gradient descent
error function)
E = −(1/2) Σ_{i,j} w_ij s_i s_j + Σ_i s_i θ_i
Suppose a random unit i was updated: E always decreases!
If s_i is initially −1 and Σ_j w_ij s_j > θ_i, then s_i becomes +1
Change in E = −(1/2) Σ_j (w_ij s_j + w_ji s_j) + θ_i = −Σ_j w_ij s_j + θ_i < 0 !!
If s_i is initially +1 and Σ_j w_ij s_j < θ_i, then s_i becomes −1
Change in E = (1/2) Σ_j (w_ij s_j + w_ji s_j) − θ_i = Σ_j w_ij s_j − θ_i < 0 !!
Hopfield networks
Note: Network converges to local minima which store
different patterns.
Store p N-dimensional pattern vectors x_1, …, x_p using the
Hebbian learning rule:
w_ji = (1/N) Σ_{m=1..p} x_{m,j} x_{m,i}  for all j ≠ i;  0 for j = i
In matrix form: W = (1/N) Σ_{m=1..p} x_m x_m^T
(outer product of vectors; diagonal set to zero; T denotes vector transpose)
[Figure: stored patterns x_1, …, x_4 shown as attractors]
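Storage and recall can be sketched together: store one ±1 pattern with the outer-product rule, then run deterministic update sweeps (thresholds taken as 0 here, and the pattern and cue are illustrative):

```python
def store(patterns):
    """Hebbian outer-product rule: w_ij = (1/N) sum_m x_m[i] x_m[j], zero diagonal."""
    N = len(patterns[0])
    w = [[0.0] * N for _ in range(N)]
    for x in patterns:
        for i in range(N):
            for j in range(N):
                if i != j:
                    w[i][j] += x[i] * x[j] / N
    return w

def recall(w, s, sweeps=5):
    """Deterministic sweeps of the threshold update (all thresholds 0 here)."""
    N = len(s)
    s = list(s)
    for _ in range(sweeps):
        for i in range(N):
            h = sum(w[i][j] * s[j] for j in range(N))
            s[i] = 1 if h >= 0 else -1
    return s

stored = [1, 1, 1, 1, -1, -1, -1, -1]
w = store([stored])
cue = [1, 1, 1, -1, -1, -1, -1, -1]   # stored pattern with one bit flipped
assert recall(w, cue) == stored        # the network completes the pattern
```

This is exactly the pattern completion shown on the next slide: the corrupted cue falls into the basin of the stored attractor.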
Pattern Completion in a Hopfield Network
Local minimum
(“attractor”)
of energy function
stores pattern
Radial Basis Function Networks
input nodes
output neurons
one layer of
hidden neurons
Radial Basis Function Networks
propagation function (Euclidean distance between the input and the
weight vector of hidden neuron j):

a_j(x, w_j) = ( Σ_{i=1..n} (x_i − w_{ij})² )^{1/2}
Radial Basis Function Networks
output function (Gaussian bell-shaped function):

h(a) = e^(−a² / (2σ²))
Radial Basis Function Networks
output of network:

out_j = Σ_i w_{i,j} · h(a_i)
RBF networks
Radial basis functions
Hidden units store means and
variances
Hidden units compute a
Gaussian function of inputs
x1,…xn that constitute the
input vector x
Learn weights w_i, means μ_i, and variances σ_i by minimizing the
squared error function (gradient descent learning)
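Putting the three pieces together (distance, Gaussian, weighted sum) gives a short forward pass; the centres, widths, and weights below are illustrative:

```python
import math

def rbf_forward(x, centers, sigmas, weights):
    """RBF sketch: hidden unit j computes the distance a_j = ||x - w_j||,
    applies the Gaussian h(a) = exp(-a^2 / (2*sigma^2)), and the network
    output is the weighted sum of the h values."""
    hs = []
    for mu, sig in zip(centers, sigmas):
        a = math.sqrt(sum((xi - mi) ** 2 for xi, mi in zip(x, mu)))
        hs.append(math.exp(-a ** 2 / (2.0 * sig ** 2)))
    return sum(w * h for w, h in zip(weights, hs))

# the response is largest when the input sits on a hidden unit's centre
centers, sigmas, weights = [[0.0, 0.0], [1.0, 1.0]], [0.5, 0.5], [1.0, 1.0]
assert rbf_forward([0.0, 0.0], centers, sigmas, weights) > \
       rbf_forward([3.0, 3.0], centers, sigmas, weights)
```

This local, prototype-centred activation is what distinguishes RBF hidden units from the global hyperplanes of MLP hidden units.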
RBF Networks and Multilayer Perceptrons
RBF: MLP:
[Figure: RBF and MLP architectures side by side — input nodes, one hidden layer, output neurons]
Recurrent networks
Employ feedback (positive, negative, or both)
Not necessarily stable
Symmetric connections can ensure stability
Why use recurrent networks?
Can learn temporal patterns (time series or oscillations)
Biologically realistic
Majority of connections to neurons in cerebral cortex are
feedback connections from local or distant neurons
Examples
Hopfield network
Boltzmann machine (Hopfield-like net with input & output units)
Recurrent backpropagation networks: for small sequences, unfold
network in time dimension and use backpropagation learning
Recurrent networks (con’t)
Example
Elman networks
Partially recurrent
Context units keep an internal memory of past inputs
Fixed context weights
Backpropagation for
learning
E.g. Can disambiguate
ABC and CBA
Elman network
Unsupervised Networks
No feedback to say how output differs from desired output
(no error signal) or even whether output was right or wrong
Network must discover patterns in the input data by itself
Only works if there are redundancies in the input data
Network self-organizes to find these redundancies
Clustering: Decide which group an input belongs to
Synaptic weights of one neuron represent one group
Principal Component Analysis: Finds the principal eigenvector of
data covariance matrix
Hebb rule performs PCA! (Oja, 1982)
Δw_i = η ξ_i y
Output y = Σ_i w_i ξ_i
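As stated, the plain Hebb rule lets the weights grow without bound; Oja's 1982 result relies on a normalized variant with an extra decay term. A sketch of that variant (the learning rate and toy data set are illustrative):

```python
import random

def oja_pca(data, eta=0.05, epochs=100):
    """Oja's normalized Hebb rule: dw_i = eta * y * (x_i - y * w_i),
    with y = sum_i w_i x_i. The extra -y^2 w term keeps ||w|| bounded,
    and w converges to the principal eigenvector of the data covariance."""
    random.seed(1)
    w = [random.uniform(-1, 1), random.uniform(-1, 1)]
    for _ in range(epochs):
        for x in data:
            y = sum(wi * xi for wi, xi in zip(w, x))   # neuron output y
            w = [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w

# zero-mean data stretched along the (1, 1) direction
data = [(2.0, 2.0), (-2.0, -2.0), (1.0, 0.9), (-1.0, -0.9), (0.1, -0.1), (-0.1, 0.1)]
w = oja_pca(data)
assert abs(abs(w[0]) - abs(w[1])) < 0.1          # aligned with (1, 1), up to sign
assert abs(w[0] ** 2 + w[1] ** 2 - 1.0) < 0.1    # roughly unit length at the fixed point
```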
Self-Organizing Maps (Kohonen Maps)
Feature maps
Competitive networks
Neurons have locations
For each input, winner is
the unit with largest output
Weights of winner and
nearby units modified to
resemble input pattern
Nearby inputs are thus
mapped topographically
Biological relevance
Retinotopic map
Somatosensory map
Tonotopic map
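The competitive update above can be sketched as follows; the grid size, learning rate, and fixed neighborhood radius are illustrative simplifications (practical SOMs shrink η and the radius over time):

```python
import random

def som_train(data, rows=5, cols=5, eta=0.5, radius=1, epochs=30):
    """Kohonen map sketch: winner-take-all plus a neighborhood update.
    Units sit on a rows x cols grid; each unit stores a 2-D weight vector."""
    w = {(r, c): [random.random(), random.random()]
         for r in range(rows) for c in range(cols)}
    for _ in range(epochs):
        for x in data:
            # winner = unit whose weight vector is closest to the input
            win = min(w, key=lambda u: sum((wi - xi) ** 2
                                           for wi, xi in zip(w[u], x)))
            for u in w:
                # move the winner and its grid neighbours toward the input
                if abs(u[0] - win[0]) + abs(u[1] - win[1]) <= radius:
                    w[u] = [wi + eta * (xi - wi) for wi, xi in zip(w[u], x)]
    return w

random.seed(0)
inputs = [(random.random(), random.random()) for _ in range(40)]
weights = som_train(inputs)
# each update is a convex combination of points in the unit square,
# so the learned weights stay inside it
assert all(0.0 <= v <= 1.0 for vec in weights.values() for v in vec)
```

Because nearby grid units are dragged toward the same inputs, nearby inputs end up mapped to nearby units, which is the topographic ordering shown in the following slides.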
Example of a 2D Self-Organizing Map
• 10 x 10 array of
neurons
• 2D inputs (x,y)
• Initial weights w1
and w2 random
as shown on right;
lines connect
neighbors
Example of a 2D Self-Organizing Map
• 10 x 10 array of
neurons
• 2D inputs (x,y)
• Weights after 10
iterations
Example of a 2D Self-Organizing Map
• 10 x 10 array of
neurons
• 2D inputs (x,y)
• Weights after 20
iterations
Example of a 2D Self-Organizing Map
• 10 x 10 array of
neurons
• 2D inputs (x,y)
• Weights after 40
iterations
Example of a 2D Self-Organizing Map
• 10 x 10 array of
neurons
• 2D inputs (x,y)
• Weights after 80
iterations
Example of a 2D Self-Organizing Map
• 10 x 10 array of
neurons
• 2D inputs (x,y)
• Final Weights (after
160 iterations)
• Topography of inputs
has been captured
Summary: Biology and Neural Networks
So many similarities
Information is contained in synaptic connections
Network learns to perform specific functions
Network generalizes to new inputs
But NNs are woefully inadequate compared with biology
Simplistic model of neuron and synapse, implausible learning rules
Hard to train large networks
Network construction (structure, learning rate etc.) is a heuristic art
One obvious difference: Spike representation
Recent models explore spikes and spike-timing dependent plasticity
Other Recent Trends: Probabilistic approach
NNs as Bayesian networks (allows principled derivation of dynamics,
learning rules, and even structure of network)
Not clear how neurons encode probabilities in spikes
Editor's notes
Possible topologies for neural networks:
- completely connected: autoassociative networks (Hopfield)
- feedforward: many-parameter approximator
- recurrent: for modeling time-varying signals
For approximating time-invariant signals, feedforward neural networks are usually used. We restrict ourselves to these in what follows...
Clear structure, information flows forward.
With just one hidden layer, any function can be approximated (you may simply need many neurons…)
Other output functions lead to other network types; more on this later.
Transition to the step function for
Example:
2 inputs, one output. Step function in the neurons. A "1" neuron for the threshold of the step function:
What does the output of the MLP look like as a function of the two inputs x and y?
OK, so above the line the output of the red neuron is 1, below it 0.
-> decision line (decision plane in 3D)
Likewise for the second (blue) neuron
Delimiting a region, depending on the position of the hyperplanes spanned by the hidden neurons.
Message:
Any region can be delimited, provided enough neurons are spent on the hidden layer.
So what can a training algorithm set automatically, i.e. driven by example data?
Other network types can be obtained, e.g., through other activation functions.
MLP: global partitioning of the feature space
RBF: local partitioning
Propagation function to the hidden layer: Euclidean distance to the weight vector.
Output of the hidden layer lies in [0,1]; the closer the input is to the weight vector, the larger the output.
-> local activation (weight vector = prototype)
Output: sum over the hidden neurons.
The output is thus a weighted sum of basis functions, each parameterized by its weight vector.
That is, RBFs too can approximate arbitrary functions given enough neurons in the hidden layer.
Comparison with the MLP: hyperplanes partition the space globally; with RBFs it is hyperspheres and a local partitioning.
Problem again, interpretation: how should one picture the region "around" a prototype?
Retinotopic map: from retina to visual cortex
Somatosensory map: from skin to somatosensory cortex
Tonotopic map: from ear to auditory cortex
Hard problem: try to map a circle to a line