"The statistical physics of learning revisted: Phase transitions in layered neural networks"
Physics Colloquium at the University of Leipzig/Germany, June 29, 2021
24 slides, ca. 45 minutes
The statistical physics of learning revisited:
Phase transitions in layered neural networks
Elisa Oostwal
Michiel Straat
Michael Biehl
Bernoulli Institute for Mathematics,
Computer Science and Artificial Intelligence
University of Groningen / NL
Physica A Vol. 564, 2021, 125517 (open access)
Hidden unit specialization in layered neural networks: ReLU vs. sigmoidal activation
the revival of neural networks
success of multi-layered neural networks (Deep Learning)
• availability of large amounts of training data
• increased computational power
• improved training procedures and set-ups
• task specific network designs, e.g. activation functions
many open questions / lack of theoretical understanding
statistical physics of learning
statistical physics of neural networks:
training of feed-forward neural networks:
Elizabeth Gardner (1957-1988). The space of interactions in neural network models. J. Phys. A 21:257-270, 1988
dynamics of attractor neural networks:
John Hopfield. Neural networks and physical systems with emergent collective computational abilities. PNAS 79(8):2554-2558, 1982
a successful branch of learning theory: [figure: textbook covers from 1991, 2001, 2011]
example: a shallow neural network
N units: high-dimensional input ξ ∈ R^N
K hidden units with activation g, adaptive weights w_k ∈ R^N
linear output: soft committee machine with σ(ξ) = Σ_{k=1..K} g(w_k · ξ)
input/output function defined by
• architecture, connectivity, activation functions
• adaptive weights
• regression: learning from example data, e.g. D = {ξ^μ, τ(ξ^μ)}, μ = 1, ..., P, with target function τ
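A minimal numerical sketch of such a soft committee machine (assuming the standard definition σ(ξ) = Σ_k g(w_k · ξ) and a quadratic regression error; all names and parameter values are illustrative, not taken from the paper):

import numpy as np

def scm_output(W, xi, g):
    # soft committee machine: K hidden units with activation g, fixed linear output
    # W: (K, N) matrix of adaptive weight vectors, xi: input vector of dimension N
    return np.sum(g(W @ xi))

def quadratic_error(W, xi, tau, g):
    # per-example regression error with respect to the target value tau
    return 0.5 * (scm_output(W, xi, g) - tau) ** 2

# example usage with ReLU hidden units
rng = np.random.default_rng(0)
N, K = 100, 3
W = rng.standard_normal((K, N)) / np.sqrt(N)
xi = rng.standard_normal(N)
relu = lambda z: np.maximum(z, 0.0)
print(scm_output(W, xi, relu), quadratic_error(W, xi, tau=1.0, g=relu))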
statistical physics of learning in a nutshell
objective / cost / energy function E(w), with w comprising all adaptive weights
• training by stochastic optimization of all adaptive weights,
  e.g. Metropolis algorithm or noisy gradient descent (Langevin),
  with equilibrium (Gibbs-Boltzmann) density P_eq(w) ∝ exp[-β E(w)];
  control parameter: "inverse temperature" β = 1/T
• equilibrium state: compromise / competition between minimal energy (ground state)
  vs. the number (volume) of available states with higher energy
• "thermal averages" over P_eq, e.g. ⟨E⟩
minima of the free energy; the microcanonical entropy counts states at a given energy
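In formulas (standard equilibrium statistical mechanics; w denotes the set of all adaptive weights):

P_{eq}(w) = \frac{e^{-\beta E(w)}}{Z}, \qquad Z = \int d\mu(w)\, e^{-\beta E(w)}, \qquad F = -\frac{1}{\beta} \ln Z, \qquad \langle A \rangle = \int d\mu(w)\, A(w)\, P_{eq}(w)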
machine learning specifics
• the energy function is given for a specific set of example data:
  E is defined w.r.t. a data set D = {ξ^μ, τ(ξ^μ)}, μ = 1, ..., P
• typical properties: an additional average of the free energy over D is required;
  difficult, even for the simplest model density:
  inputs ξ^μ drawn independent identically distributed (i.i.d.) from an unstructured input density
• the data set plays the role of frozen (quenched) disorder;
  the disorder-average of the free energy requires (e.g.) the replica trick
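The quenched average over the data set D and the replica identity take the textbook form:

\langle F \rangle_D = -\frac{1}{\beta} \langle \ln Z \rangle_D, \qquad \langle \ln Z \rangle_D = \lim_{n \to 0} \frac{\langle Z^n \rangle_D - 1}{n}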
machine learning at high temperatures
• a simplifying limit: high (formal) temperature, β → 0,
  with a suitably rescaled number of examples (∝ β P) kept finite
“learn almost nothing... (high T )
...from very many examples”
• independent i.i.d. examples: the training energy becomes proportional to the generalization error ε_g
limitations:
- training error and generalization error cannot be distinguished
- number of examples and training temperature are coupled
- (at best) qualitative agreement with low temperature results
• large number of examples: P ∝ 1/β → ∞ in the high-temperature limit
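Schematically, in this limit the (quenched) free energy per degree of freedom reduces to a competition of energy and entropy of the macroscopic order parameters (precise scaling and prefactors as in the Physica A paper):

\beta f = \tilde{\alpha}\, \epsilon_g(\{R, Q, \dots\}) - s(\{R, Q, \dots\}), \qquad \tilde{\alpha} \propto \beta P \ \text{(rescaled number of examples, kept finite)}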
modelling: student teacher scenario
[figure: adaptive student network with K hidden units and N inputs, trained on data labelled by a teacher network with M hidden units; the teacher weights are unknown ("?") to the student]
training: minimization of E(W) = Σ_{μ=1..P} ½ [σ(ξ^μ) − τ(ξ^μ)]²
here: learnable rules, reliable data (outputs provided by the teacher),
perfectly matching complexity K = M
two prototypical activation functions:
sigmoidal / ReLU in student and teacher
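The two activation functions, as considered in this study (the sigmoidal is an error function; scaling conventions as in the paper):

g_{\mathrm{sigmoidal}}(z) = \mathrm{erf}\!\left( z / \sqrt{2} \right), \qquad g_{\mathrm{ReLU}}(z) = \max(0, z)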
thermodynamic limit, Central Limit Theorem
large N: CLT for the local fields x_i = w_i · ξ (student) and x_j* = w_j* · ξ (teacher):
jointly normally distributed with zero mean and covariance matrix given by
order parameters R_ij = w_i · w_j*, Q_ik = w_i · w_k and model parameters T_jl = w_j* · w_l*
they characterize the macroscopic properties of the system;
the corresponding entropy term is (+ constant) independent of details (e.g. the activation)
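An illustrative numerical check of this CLT statement (a sketch assuming i.i.d. standard-normal input components and the dot-product overlap definitions above; not part of the original analysis):

import numpy as np

N, n_samples = 10_000, 20_000
rng = np.random.default_rng(0)

# one student and one teacher weight vector of order-one norm (illustrative choice)
w  = rng.standard_normal(N) / np.sqrt(N)
ws = rng.standard_normal(N) / np.sqrt(N)

xi = rng.standard_normal((n_samples, N))   # i.i.d. unstructured inputs
x, xs = xi @ w, xi @ ws                    # student field x_i and teacher field x_j*

# CLT prediction: zero-mean Gaussian fields with covariances Q, R, T
print("Q =", w @ w,   "  <x x>   =", np.mean(x * x))
print("R =", w @ ws,  "  <x x*>  =", np.mean(x * xs))
print("T =", ws @ ws, "  <x* x*> =", np.mean(xs * xs))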
generalization error
ε_g = ⟨ ½ [σ(ξ) − τ(ξ)]² ⟩, on average over the joint Gaussian density P({x_i, x_j*}) of the fields
closed-form results in terms of the order parameters are available for
sigmoidal activation [D. Saad, S. Solla, 1995]
and rectified linear units [M. Straat, 2019]
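Expanding the quadratic deviation, ε_g becomes a sum of two-point averages of g over correlated Gaussian fields; the required averages have well-known closed forms (conventions and prefactors as in the cited works):

\epsilon_g = \tfrac{1}{2} \sum_{i,k} \langle g(x_i) g(x_k) \rangle - \sum_{i,j} \langle g(x_i) g(x_j^*) \rangle + \tfrac{1}{2} \sum_{j,l} \langle g(x_j^*) g(x_l^*) \rangle

g(x) = \mathrm{erf}(x/\sqrt{2}): \quad \langle g(x) g(y) \rangle = \frac{2}{\pi} \arcsin \frac{\langle x y \rangle}{\sqrt{(1 + \langle x^2 \rangle)(1 + \langle y^2 \rangle)}}

g(x) = \max(0, x): \quad \langle g(x) g(y) \rangle = \frac{1}{2\pi} \left[ \sqrt{\langle x^2 \rangle \langle y^2 \rangle - \langle x y \rangle^2} + \langle x y \rangle \left( \pi - \arccos \frac{\langle x y \rangle}{\sqrt{\langle x^2 \rangle \langle y^2 \rangle}} \right) \right]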
site symmetry
simplification: orthonormal teacher vectors, isotropic input density
site-symmetric ansatz for the order parameters: equal diagonal (R) and equal off-diagonal (S) student-teacher overlaps
reflects the permutation symmetry of the hidden units, yet allows for hidden unit specialization (R ≠ S)
explicit site-symmetric expressions follow for ε_g with sigmoidal and with ReLU activations, and for the entropy (+ constant)
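In formulas, the site-symmetric ansatz reads (a sketch; the exact parametrization of the student-student overlaps, e.g. an additional off-diagonal value C, follows the paper):

R_{ij} = R\, \delta_{ij} + S\, (1 - \delta_{ij}), \qquad Q_{ik} = Q\, \delta_{ik} + C\, (1 - \delta_{ik}), \qquad w_j^* \cdot w_l^* = \delta_{jl}

Specialization corresponds to R > S (each student unit aligns with one particular teacher unit), while R = S describes unspecialized configurations.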
typical learning curves
given: the size of the training data set (scaled: α), the number of hidden units K, the activation g(z)
given α, determine the (global and local) minima of the free energy
solve: the stationarity conditions for the order parameters (or minimize numerically)
obtain learning curves: order parameters and generalization error as a function of the (scaled) training set size
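Schematically, with the scaled training set size α as control parameter (exact definition as in the paper):

\frac{\partial (\beta f)}{\partial R} = \frac{\partial (\beta f)}{\partial S} = \dots = 0, \qquad \epsilon_g(\alpha) = \epsilon_g\big( R^*(\alpha), S^*(\alpha), \dots \big) \ \text{evaluated in the global minimum}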
sigmoidal ( K = 2 )
invariance under exchange of the two hidden units
R = S: both units ~ (w_1* + w_2*) + noise
symmetry breaking phase transition (second order, continuous) ...
... results in a kink in the typical learning curve
ReLU ( K = 2 )
qualitatively identical behavior
Note: the numerical values as such are irrelevant; the scale depends, among other things, on the pre-factor of g(z)
Physica A Vol. 564, 2021, 125517
sigmoidal ( K > 2 ), here K = 5
permutation symmetry of hidden units: initial R = S phase
discontinuous jump in ε_g: coexistence of poor and good generalization
first order transition: a local minimum with R > S competes with R = S
R > S becomes the global minimum, facilitates perfect learning
"anti-specialization" S > R (overlooked in 1998!)
weak/no effect of the additional anti-specialization on the generalization error
ReLU ( K > 2 ), here K = 10
permutation symmetry of hidden units: initial R = S phase
continuous kink in ε_g
competing minima of poor* vs. good generalization (* actually: pretty good)
continuous phase transition
global minimum: R > S; local minimum: R < S
Physica A Vol. 564, 2021, 125517
ReLU ( large K )
permutation symmetry of hidden units: initial R = S phase
specialized and anti-specialized branches achieve perfect generalization, asymptotically!
(due to the partial linearity of ReLU)
continuous phase transition at a finite critical value of the training set size
degenerate minima: R > S, R < S
Monte Carlo simulations
continuous Metropolis updates, ReLU activation, K = 4, N = 50, β = 1 (i.e. T = 1)
[figure: histogram of observed overlaps R_ij, distinguishing unspecialized (R = S), specialized, and anti-specialized states]
[figure: generalization error vs. time for specialized and unspecialized initialization]
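A minimal Metropolis sketch along these lines (assumptions: quadratic per-example error, ReLU soft committee machine, single-unit Gaussian proposal moves; K, N, β follow the slide, while the number of examples, steps and step size are illustrative choices, not the authors' exact protocol):

import numpy as np

rng = np.random.default_rng(1)
N, K, beta = 50, 4, 1.0
P, steps, step_size = 500, 20_000, 0.05   # illustrative values

relu = lambda z: np.maximum(z, 0.0)

def outputs(W, X):
    # soft committee machine outputs for all rows of X; W has shape (K, N)
    return relu(X @ W.T).sum(axis=1)

B = np.linalg.qr(rng.standard_normal((N, K)))[0].T   # orthonormal teacher vectors, M = K
X = rng.standard_normal((P, N))                      # i.i.d. training inputs
tau = outputs(B, X)                                  # reliable teacher outputs

W = rng.standard_normal((K, N)) / np.sqrt(N)         # random (unspecialized) initialization
E = 0.5 * np.sum((outputs(W, X) - tau) ** 2)

for t in range(steps):
    k = rng.integers(K)                              # propose a move of one hidden unit
    W_new = W.copy()
    W_new[k] += step_size * rng.standard_normal(N) / np.sqrt(N)
    E_new = 0.5 * np.sum((outputs(W_new, X) - tau) ** 2)
    dE = E_new - E
    if dE <= 0 or rng.random() < np.exp(-beta * dE): # Metropolis acceptance
        W, E = W_new, E_new

R = W @ B.T   # student-teacher overlaps R_ij, cf. the histograms on the slide
print(R)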
Monte Carlo simulations
sigmoidal activation vs. ReLU, K = 4
[figure: stationary generalization error for both activations]
a large gap / high barrier between specialized and unspecialized states delays the success of learning
anti-specialized states display near optimal performance for large K
Summary
• formal equilibrium of training at high temperature in student/teacher model situations of supervised learning
• unspecialized and partially or anti-specialized configurations compete as local/global minima of the free energy
• phase transitions with the scaled number of examples:
  K = 2: continuous symmetry-breaking transitions with equivalent competing states
  K > 2, sigmoidal activations: first order transition with competing states of distinct generalization ability
  K > 2, ReLU networks: continuous transition with competing states of similar performance
Outlook
most important question: which is the decisive property of the activation?
• consider various activation functions (leaky ReLU, swish, ...)
• piece-wise linear activations:
  [figure: piece-wise linear "sigmoidal" activation vs. ReLU; with increasing slope the transition changes from discontinuous to continuous]
• study more complex solutions beyond site-symmetry
outlook (selected topics)
• replica trick / annealed approximation
  - low temperatures, varying the number of examples and T independently
  - mismatched student/teacher networks K ≠ M
  - overfitting / underfitting effects
• complementary approach:
  - dynamics of stochastic gradient descent
  - description in terms of ODEs for the order parameters
• deep networks
  - many hidden layers
  - tree-like architectures with uncorrelated branches
• realistic input data
  - clustered / correlated data
  - recent developments: Zdeborová, Mézard, Goldt et al.
www.cs.rug.nl/~biehl m.biehl@rug.nl
Questions ?
see also for: algorithm development in machine learning
and applications in medicine, life sciences, astronomy …