Statistical Mechanics Methods for Discovering
Knowledge from Production-Scale Neural Networks
Charles H. Martin∗ and Michael W. Mahoney†
Tutorial at ACM-KDD, August 2019
∗ Calculation Consulting, charles@calculationconsulting.com
† ICSI and Dept of Statistics, UC Berkeley, https://www.stat.berkeley.edu/~mmahoney/
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 1 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Statistical Physics & Neural Networks: A Long History
60s:
J. D. Cowan, Statistical Mechanics of Neural Networks, 1967.
70s:
W. A. Little, “The existence of persistent states in the brain,” Math.
Biosci., 1974.
80s:
H. Sompolinsky, “Statistical mechanics of neural networks,” Physics
Today, 1988.
90s:
D. Haussler, M. Kearns, H. S. Seung, and N. Tishby, “Rigorous learning
curve bounds from statistical mechanics,” Machine Learning, 1996.
00s:
A. Engel and C. P. L. Van den Broeck, Statistical mechanics of
learning, 2001.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 4 / 98
Hopfield model
Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” PNAS 1982.
Hopfield model:
Recurrent artificial neural network model
Equivalence between behavior of NNs with symmetric connections and the
equilibrium statistical mechanics behavior of certain magnetic systems.
Can design NNs for associative memory and other computational tasks
Phase diagram with three kinds of phases (α is load parameter):
Very low α regime: model has so much capacity that it acts as a prototype method
Intermediate α: spin glass phase, which is “pathologically non-convex”
Higher α: generalization phase
But:
Lots of subsequent work focusing on spin glasses, replica theory, etc.
Let’s go back to the basics!
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 5 / 98
Restricted Boltzmann Machines and Variational Methods
RBMs = Hopfield + temperature + backprop:
RBMs and other more sophisticated variational free energy methods
They have an intractable partition function.
Goal: try to approximate partition function / free energy.
Also, recent work on their phase diagram.
We do NOT do this.
Memorization, then and now.
Three (then) versus two (now) phases.
Modern “memorization” is probably more like spin glass phase.
Let’s go back to the basics!
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 6 / 98
Some other signposts
Cowan's introduction of the sigmoid into neuroscience.
Parisi’s replica theory computations.
Solla’s statistical analysis.
Gardner’s analysis of annealed versus quenched entropy.
Saad’s analysis of dynamics of SGD.
More recent work on dynamics, energy landscapes, etc.
Lots more . . .
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 7 / 98
Important: Our Methodological Approach
Most people like training: they validate their ideas by training models.
We will use pre-trained models.
Many state-of-the-art models are publicly available.
They are “machine learning models that work” . . . so analyze them.
Selection bias: you can’t talk with deceased patients.
Of course, one could use these methods to improve training . . . we won’t.
Benefits of this methodological approach.
Can develop a practical theory.
(Current theory isn’t . . . loose bounds and convergence rates.)
Can evaluate theory on state-of-the-art models.
(Big models are different than small . . . easily-trainable models.)
Can be more reproducible.
(Training isn’t reproducible . . . too many knobs.)
You can “pip install weightwatcher”
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 8 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
PAC/VC versus Statistical Mechanics Approaches (1 of 2)
Basic Student-Teacher Learning Setup:
Classify elements of input space 𝒳 into {0, 1}
Target rule / teacher T; and hypothesis space F of possible mappings
Given T on a training set X ⊂ 𝒳, select a student f* ∈ F, and evaluate how well f* approximates T on 𝒳
Generalization error (ε): probability of disagreement between student and teacher on 𝒳
Training error (ε_t): fraction of disagreements between student and teacher on X
Learning curve: behavior of |ε_t − ε| as a function of control parameters
PAC/VC Approach:
Related to statistical problem of convergence of frequencies to probabilities
Statistical Mechanics Approach:
Exploit the thermodynamic limit from statistical mechanics
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 10 / 98
PAC/VC versus Statistical Mechanics Approaches (2 of 2)
PAC/VC: get bounds on worst-case results
View m = |X| as the main control parameter; fix the function class F; and ask how |ε_t − ε| varies
Natural to consider γ = P[|ε_t − ε| > δ]
Related to problem of convergence of frequencies to probabilities
Hoeffding-type approach not appropriate (f* depends on training data)
Fix F and construct uniform bound P[max_{h∈F} |ε_t(h) − ε(h)| > δ] ≤ 2|F| e^(−2mδ²)
Straightforward if |F| < ∞; use VC dimension (etc.) otherwise
Statistical Mechanics: get precise results for typical configurations
Function class F = F_N varies with m; let m and N (size of F) vary in a well-defined manner
Thermodynamic limit: m, N → ∞ s.t. α = m/N is fixed (like the load in associative memory models)
Limit s.t. (when it exists) certain quantities get sharply peaked around their most probable values
Describe the learning curve as a competition between error (energy) and the log of the number of functions with that energy (entropy)
Get precise results for typical (most probable in that limit) quantities
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 11 / 98
Rethinking generalization requires revisiting old ideas
Martin and Mahoney https://arxiv.org/abs/1710.09553
Very Simple Deep Learning (VSDL) model:
DNN is a black box, load-like parameters α, & temperature-like parameters τ
Adding noise to training data decreases α
Early stopping increases τ
Nearly any non-trivial model‡
exhibits “phase diagrams,” with qualitatively
different generalization properties, for different parameter values.
(a) Training/generalization error. (b) Learning phases in the τ-α plane. (c) Noisifying data and adjusting knobs.
‡ when analyzed via the Statistical Mechanics Theory of Generalization (SMToG)
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 12 / 98
Remembering Regularization
Martin and Mahoney https://arxiv.org/abs/1710.09553
Statistical Mechanics (1990s): (this) Overtraining → Spin Glass Phase
Binary Classifier with N Random Labelings:
2^N over-trained solutions: locally (ruggedly) convex, very high barriers, all unable to generalize
implication: solutions inside basins should be more similar than solutions in different basins
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 13 / 98
Stat Mech Setup: Student Teacher Model
Martin and Mahoney https://arxiv.org/abs/1710.09553
Given N labeled data points
Imagine a Teacher Network T that maps data to labels
Learning finds a Student J similar to the Teacher T
Consider all possible Student Networks J for all possible teachers T
The Generalization error is related to the phase space volume Ω of all possible
Student-Teacher overlaps for all possible J, T
ε = (1/π) arccos R,   R = (1/N) J†T
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 14 / 98
Stat Mech Setup: Student Teacher Model
Martin and Mahoney https://arxiv.org/abs/1710.09553
Statistical Mechanics Foundations:
Spherical (Energy) Constraints:  δ(Tr[J²] − N)
Teacher Overlap (Potential):  δ((1/N) Tr[J†T] − cos(πε))
Teacher Phase Space Volume (Density of States):
Ω_T(ε) = ∫ dJ δ(Tr[J²] − N) δ((1/N) Tr[J†T] − cos(πε))
Comparison to traditional Statistical Mechanics:
Phase Space Volume, free particles:
Ω_E = ∫ d^N r d^N p δ(Σ_i p_i²/(2m_i) − E) ∼ V^N
Canonical Ensemble: Legendre Transform in R = cos(πε):
(actually more technical, and must choose a sign convention on Tr[J†T], H)
Ω_β(R) ∼ ∫ dµ(J) e^(−λ Tr[J†T]) ∼ ∫ dq^N dp^N e^(−βH(p,q))
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 15 / 98
Stat Mech Setup: Student Teacher Model
Martin and Mahoney https://arxiv.org/abs/1710.09553
Early Models: Perceptron: J, T are N-dim vectors
Continuous Perceptron: J_i ∈ ℝ (not so interesting)
Ising Perceptron: J_i = ±1 (sharp transitions, requires Replica theory)
Our Proposal: J, T (N × M) Real (possibly Heavy Tailed) matrices
Practical Applications: Hinton, Bengio, etc.
Related to complexity of (Levy) spin glasses (Bouchaud)
Our Expectation:
Heavy-tailed structure means there is less capacity/entropy available for
integrals, which will affect generalization properties non-trivially
Multi-class classification is very different than binary classification
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 16 / 98
Student Teacher: Recent Practical Application
“Similarity of Neural Network Representations Revisited”
Kornblith, Norouzi, Lee, Hinton; https://arxiv.org/abs/1905.00414
Examined different Weight matrix similarity metrics
Best method: Canonical Correlation Analysis (CCA): ‖Y†X‖²_F
Figure: Diagnostic Tool for both individual and comparative DNNs
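For intuition, here is a rough numpy sketch of a ‖YᵀX‖²_F-style similarity between two representation matrices; the centering and normalization below (essentially a linear-CKA-style score) are assumptions, not the exact procedure of Kornblith et al.

import numpy as np

# Rough sketch of a ||Y^T X||_F^2 - style similarity between two representations
# X, Y (n_examples x n_features). Centering/normalization here are assumptions,
# not the exact procedure of Kornblith et al.
def rep_similarity(X, Y):
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, 'fro') ** 2
    den = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return num / den                      # 1.0 when the representations coincide

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))
print(rep_similarity(X, X), rep_similarity(X, rng.normal(size=(500, 32))))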
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 17 / 98
Student Teacher: Recent Generalization vs. Memorization
“Insights on representational similarity in neural networks with canonical correlation”
Morcos, Raghu, Bengio; https://arxiv.org/pdf/1806.05759.pdf
Compare NN representations and how they evolve during training
Projection weighted Canonical Correlation Analysis (PWCCA)
Figure: Generalizing networks converge to more similar solutions than memorizing
networks.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 18 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Motivations: Theoretical AND Practical
Theoretical: deeper insight into Why Deep Learning Works?
convex versus non-convex optimization?
explicit/implicit regularization?
is / why is / when is deep better?
VC theory versus Statistical Mechanics theory?
. . .
Practical: use insights to improve engineering of DNNs?
when is a network fully optimized?
can we use labels and/or domain knowledge more efficiently?
large batch versus small batch in optimization?
designing better ensembles?
. . .
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 20 / 98
Motivations: towards a Theory of Deep Learning
DNNs as spin glasses (Choromanska et al. 2015)
Looks exactly like old protein folding results (late 90s): Energy Landscape Theory
Completely different picture of DNNs
Raises broad questions about Why Deep Learning Works
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 21 / 98
Motivations: regularization in DNNs?
ICLR 2017 Best paper
Large neural network models can easily overtrain/overfit on randomly
labeled data
Popular ways to regularize (basically min_x f(x) + λg(x), with "control parameter" λ) may or may not help.
Understanding deep learning requires rethinking generalization??
https://arxiv.org/abs/1611.03530
Rethinking generalization requires revisiting old ideas: statistical
mechanics approaches and complex learning behavior!!
https://arxiv.org/abs/1710.09553 (Martin & Mahoney)
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 22 / 98
Motivations: stochastic optimization DNNs?
Theory (from convex problems):
First order (SGD, e.g., Bottou 2010)
larger “batches” are better (at least up to statistical noise)
Second order (SSN, e.g., Roosta and Mahoney 2016)
larger “batches” are better (at least up to statistical noise)
Large batch sizes have better computational properties!
So, people just increase batch size (and compensate with other parameters)
Practice (from non-convex problems):
SGD-like methods “saturate”
(https://arxiv.org/abs/1811.12941)
SSN-like methods “saturate”
(https://arxiv.org/abs/1903.06237)
Small batch sizes have better statistical properties!
Is batch size a computational parameter, or a statistical parameter, or what?
How should batch size be chosen?
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 23 / 98
Set up: the Energy Landscape
Energy/Optimization function:
E_DNN = h_L(W_L × h_{L−1}(W_{L−1} × h_{L−2}(⋯) + b_{L−1}) + b_L)
Train this on labeled data {d_i, y_i} ∈ D, using Backprop, by minimizing the loss L:
min_{W_l, b_l} Σ_i L(E_DNN(d_i) − y_i)
E_DNN is "the" Energy Landscape:
The part of the optimization problem parameterized by the heretofore unknown elements of the weight matrices and bias vectors, and as defined by the data {d_i, y_i} ∈ D
Pass the data through the Energy function EDNN multiple times, as we run
Backprop training
The Energy Landscape§ is changing at each epoch
§ i.e., the optimization function that is nominally being optimized
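As a concrete illustration of the nested form of E_DNN, here is a minimal numpy sketch; the layer sizes and the ReLU activation h are illustrative assumptions, not taken from any particular model discussed here.

import numpy as np

# Minimal sketch of the nested form E_DNN = h_L(W_L h_{L-1}(...) + b_L) for a small net.
# Layer sizes and the ReLU activation are illustrative choices only.
rng = np.random.default_rng(0)
sizes = [10, 8, 6, 2]                      # input dim, two hidden dims, output dim
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
h = lambda x: np.maximum(x, 0.0)           # example activation

def E_DNN(d):
    """Evaluate the nested energy function on one input d."""
    x = d
    for W, b in zip(Ws, bs):
        x = h(W @ x + b)
    return x

print(E_DNN(rng.normal(size=sizes[0])))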
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 24 / 98
Problem: How can this possibly work?
Expected
Highly non-convex?
Observed
Apparently not!
It has been known for a long time that local minima are not the issue.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 25 / 98
Problem: Local Minima?
Duda, Hart and Stork, 2000
Solution: add more capacity and regularize, i.e., over-parameterization
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 26 / 98
Motivations: what is regularization?
(a) Dropout. (b) Early Stopping.
(c) Batch Size. (d) Noisify Data.
Every adjustable knob and switch—and there are many¶—is regularization.
¶ https://arxiv.org/pdf/1710.10686.pdf
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 27 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Basics of Regularization
Ridge Regression / Tikhonov-Phillips Regularization
Ŵx = y
x = (ŴᵀŴ + αI)⁻¹ Ŵᵀ y
Moore-Penrose pseudoinverse (1955)
Ridge regularization (Phillips, 1962)
min_x ‖Ŵx − y‖²₂ + α‖x‖²₂    (familiar optimization problem)
Softens the rank of Ŵ to focus on large eigenvalues.
Related to Truncated SVD, which does hard truncation on the rank of Ŵ
Early stopping, truncated random walks, etc. often implicitly solve regularized optimization problems.
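A minimal numpy sketch of the closed-form ridge solution above, compared against the pseudoinverse; the matrix sizes and the value of α are arbitrary illustrative choices.

import numpy as np

# Sketch: closed-form Tikhonov/ridge solution x = (W^T W + alpha I)^{-1} W^T y,
# compared against the Moore-Penrose pseudoinverse (the alpha -> 0 limit).
rng = np.random.default_rng(0)
W = rng.normal(size=(100, 20))
y = rng.normal(size=100)
alpha = 0.1

x_ridge = np.linalg.solve(W.T @ W + alpha * np.eye(W.shape[1]), W.T @ y)
x_pinv = np.linalg.pinv(W) @ y            # hard (rank-based) solution

print(np.linalg.norm(x_ridge - x_pinv))   # small alpha => close to the pseudoinverse solution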
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 30 / 98
How we will study regularization
The Energy Landscape is determined by the layer weight matrices W_L:
E_DNN = h_L(W_L × h_{L−1}(W_{L−1} × h_{L−2}(⋯) + b_{L−1}) + b_L)
Traditional regularization is applied to the W_L:
min_{W_l, b_l} Σ_i L(E_DNN(d_i) − y_i) + α Σ_l ‖W_l‖
Different types of regularization, e.g., different norms ‖·‖, leave different empirical signatures on W_L.
What we do:
Turn off “all” regularization.
Systematically turn it back on, explicitly with α or implicitly with
knobs/switches.
Study empirical properties of WL.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 31 / 98
Lots of DNNs Analyzed
Question: What happens to the layer weight matrices WL?
(Don’t evaluate your method on one/two/three NN, evaluate it on a dozen/hundred.)
Retrained LeNet5 on MNIST using Keras.
Two other small models:
3-Layer MLP
Mini AlexNet (Conv2D → MaxPool → Conv2D → MaxPool → FC1 → FC2 → FC)
Wide range of state-of-the-art pre-trained models:
AlexNet, Inception, etc.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 32 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Matrix complexity: Matrix Entropy and Stable Rank
W = UΣVᵀ,   ν_i = Σ_ii,   p_i = ν_i² / Σ_i ν_i²
Matrix Entropy:  S(W) = (−1 / log R(W)) Σ_i p_i log p_i
Stable Rank:  R_s(W) = ‖W‖²_F / ‖W‖²₂ = Σ_i ν_i² / ν²_max
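A small numpy sketch of these two complexity metrics, following the definitions above; the small constant inside the log is only a numerical safeguard.

import numpy as np

# Sketch: matrix entropy S(W) and stable rank R_s(W) from the singular values,
# following the definitions above (R(W) = rank of W).
def matrix_entropy_and_stable_rank(W):
    nu = np.linalg.svd(W, compute_uv=False)              # singular values nu_i
    p = nu**2 / np.sum(nu**2)                            # p_i = nu_i^2 / sum_i nu_i^2
    rank = np.linalg.matrix_rank(W)
    S = -np.sum(p * np.log(p + 1e-12)) / np.log(rank)    # S(W)
    Rs = np.sum(nu**2) / nu.max()**2                      # R_s(W) = ||W||_F^2 / ||W||_2^2
    return S, Rs

W = np.random.default_rng(0).normal(size=(300, 100))
print(matrix_entropy_and_stable_rank(W))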
A warm-up: train a 3-Layer MLP:
(e) MLP3 Entropies. (f) MLP3 Stable Ranks.
Figure: Matrix Entropy & Stable Rank show transition during Backprop training.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 34 / 98
Matrix complexity: Scree Plots
W = UΣVᵀ,   ν_i = Σ_ii,   p_i = ν_i² / Σ_i ν_i²
Matrix Entropy:  S(W) = (−1 / log R(W)) Σ_i p_i log p_i
Stable Rank:  R_s(W) = ‖W‖²_F / ‖W‖²₂ = Σ_i ν_i² / ν²_max
A warm-up: train a 3-Layer MLP:
(a) Initial Scree Plot. (b) Final Scree Plot.
Figure: Scree plots for initial and final configurations for MLP3.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 35 / 98
Matrix complexity: Singular/Eigen Value Densities
W = UΣVᵀ,   ν_i = Σ_ii,   p_i = ν_i² / Σ_i ν_i²
Matrix Entropy:  S(W) = (−1 / log R(W)) Σ_i p_i log p_i
Stable Rank:  R_s(W) = ‖W‖²_F / ‖W‖²₂ = Σ_i ν_i² / ν²_max
A warm-up: train a 3-Layer MLP:
(a) Singular val. density (b) Eigenvalue density
Figure: Histograms of the Singular Values ν_i and associated Eigenvalues λ_i = ν_i².
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 36 / 98
ESD: detailed insight into WL
Empirical Spectral Density (ESD): eigenvalues of X = W_Lᵀ W_L
import keras
import numpy as np
import matplotlib.pyplot as plt
…
W = model.layers[i].get_weights()[0]     # layer weight matrix W_L
…
X = np.dot(W.T, W)                       # correlation matrix X = W_L^T W_L
evals, evecs = np.linalg.eig(X)          # eigenvalues of X form the ESD
plt.hist(evals, bins=100, density=True)  # histogram of the ESD
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 37 / 98
ESD: detailed insight into WL
Empirical Spectral Density (ESD): eigenvalues of X = W_Lᵀ W_L
Epoch 0: Random Matrix
Epoch 36: Random + Spikes
Entropy decrease corresponds to:
modification (later, breakdown) of random structure and
onset of a new kind of self-regularization.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 38 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Random Matrix Theory 101: Wigner and Tracy-Widom
Wigner: global bulk statistics approach universal semi-circular form
Tracy-Widom: local edge statistics fluctuate in universal way
Problems with Wigner and Tracy-Widom:
Weight matrices usually not square
Typically do only a single training run
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 40 / 98
Random Matrix Theory 102: Marchenko-Pastur
Let W be an N × M random matrix, with elements W_ij ∼ N(0, σ²_mp).
Then, the ESD of X = WᵀW converges to a deterministic function:
ρ_N(λ) := (1/N) Σ_{i=1}^{M} δ(λ − λ_i)  →  (Q / 2πσ²_mp) √((λ⁺ − λ)(λ − λ⁻)) / λ   if λ ∈ [λ⁻, λ⁺], and 0 otherwise
(as N → ∞ with Q fixed)
with well-defined edges (which depend on Q, the aspect ratio):
λ± = σ²_mp (1 ± 1/√Q)²,    Q = N/M ≥ 1.
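A minimal numerical check of the MP edges; this sketch assumes the conventional 1/N normalization of X = WᵀW/N, consistent with σ_mp being the element-wise standard deviation.

import numpy as np

# Sketch: compare the ESD of X = W^T W / N (W Gaussian, N x M) with the MP edges
# lambda_{+/-} = sigma^2 (1 +/- 1/sqrt(Q))^2, Q = N/M.
N, M, sigma = 3000, 1000, 1.0
Q = N / M
W = np.random.default_rng(0).normal(scale=sigma, size=(N, M))
evals = np.linalg.eigvalsh(W.T @ W / N)            # ESD of the (scaled) correlation matrix

lam_minus = sigma**2 * (1 - 1 / np.sqrt(Q))**2
lam_plus = sigma**2 * (1 + 1 / np.sqrt(Q))**2
print(evals.min(), lam_minus, evals.max(), lam_plus)   # empirical edges ~ MP edges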
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 41 / 98
Random Matrix Theory 102’: Marchenko-Pastur
(a) Vary aspect ratios (b) Vary variance parameters
Figure: Marchenko-Pastur (MP) distributions.
Important points:
Global bulk stats: The overall shape is deterministic, fixed by Q and σ.
Local edge stats: The edge λ⁺ is very crisp, i.e., Δλ_M = |λ_max − λ⁺| ∼ O(M^(−2/3)), plus Tracy-Widom fluctuations.
We use both global bulk statistics as well as local edge statistics in our theory.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 42 / 98
Random Matrix Theory 103: Heavy-tailed RMT
Go beyond the (relatively easy) Gaussian Universality class:
model strongly-correlated systems (“signal”) with heavy-tailed random matrices.
Generative Model      | Elements from Universality class             | Finite-N global shape ρ_N(λ) | Limiting global shape ρ(λ), N → ∞ | Bulk edge local stats, λ ≈ λ⁺ | (far) Tail local stats, λ ≈ λ_max
Basic MP              | Gaussian                                     | MP distribution              | MP                                | TW                            | No tail.
Spiked-Covariance     | Gaussian, + low-rank perturbations           | MP + Gaussian spikes         | MP                                | TW                            | Gaussian
Heavy tail, 4 < µ     | (Weakly) Heavy-Tailed                        | MP + PL tail                 | MP                                | Heavy-Tailed∗                 | Heavy-Tailed∗
Heavy tail, 2 < µ < 4 | (Moderately) Heavy-Tailed (or "fat tailed")  | PL∗∗ ∼ λ^(−(aµ+b))           | PL ∼ λ^(−(µ/2+1))                 | No edge.                      | Frechet
Heavy tail, 0 < µ < 2 | (Very) Heavy-Tailed                          | PL∗∗ ∼ λ^(−(µ/2+1))          | PL ∼ λ^(−(µ/2+1))                 | No edge.                      | Frechet
Basic MP theory, and the spiked and Heavy-Tailed extensions we use, including known, empirically-observed, and conjectured relations between them. Boxes marked "∗" are best described as following "TW with large finite size corrections" that are likely Heavy-Tailed, leading to bulk edge statistics and far tail statistics that are indistinguishable. Boxes marked "∗∗" are phenomenological fits, describing large (2 < µ < 4) or small (0 < µ < 2) finite-size corrections on N → ∞ behavior.
Fitting Heavy-tailed Distributions
Figure: The log-log histogram plots of the ESD for three Heavy-Tailed random
matrices M with same aspect ratio Q = 3, with µ = 1.0, 3.0, 5.0, corresponding to
the three Heavy-Tailed Universality classes (0 < µ < 2 vs 2 < µ < 4 and 4 < µ).
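A sketch of how such heavy-tailed ESDs and their power-law fits can be generated; the Pareto entries, the sign-symmetrization, and the use of the powerlaw Python package for the fit are illustrative assumptions.

import numpy as np
import powerlaw   # pip install powerlaw; a commonly used package for such PL fits

# Sketch: build a heavy-tailed random matrix with Pareto entries (tail exponent mu),
# and fit a power law to the eigenvalue spectrum of X = W^T W / N.
# Q = 3 and mu = 3.0 mirror the figure above; the fitting details are illustrative.
N, M, mu = 3000, 1000, 3.0
rng = np.random.default_rng(0)
W = rng.pareto(mu, size=(N, M)) * rng.choice([-1, 1], size=(N, M))
evals = np.linalg.eigvalsh(W.T @ W / N)

fit = powerlaw.Fit(evals)
print(fit.power_law.alpha, fit.power_law.xmin)   # fitted PL exponent and lower cutoff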
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 44 / 98
Non-negligible finite size effects
(a) M = 1000, N = 2000. (b) Fixed M. (c) Fixed N.
Figure: Dependence of α (the fitted PL parameter) on µ (the hypothesized
limiting PL parameter).
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 45 / 98
Heavy Tails (!) and Heavy-Tailed Universality (?)
Universality: large-scale properties are independent of small-scale details
Mathematicians: justify proving theorems in random matrix theory
Physicists: derive new phenomenological relations and predict things
Gaussian Universality is most common, but there are many other types.
Heavy-Tailed Phenomenon
Rare events are not extraordinarily rare, i.e., are heavier than Gaussian tails
Modeled with power law and related functions
Seen in finance, structural glass theory, etc.
Heavy-Tailed Random Matrix Theory
Phenomenological work by physicists (Bouchaud, Potters, Sornette, 90s)
Theorem proving by mathematicians (Auffinger, Ben Arous, Burda, Peche, 00s)
Universality of Power Laws, Levy-based dynamics, finite-size attractors, etc.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 46 / 98
Heavy-Tailed Universality: Earthquake prediction
“Complex Critical Exponents from Renormalization Group Theory of Earthquakes . . . ” Sornette et al. (1985)
Power law fit of the regional strain (a measure of seismic release) before the critical time t_c (of the earthquake):
dε/dt = A + B(t − t_c)^m
Figure: (a) Cumulative Benioff strain released by magnitude 5 and greater earthquakes in the San Francisco Bay area prior to the 1989 Loma Prieta earthquake. (b) Fit of Power Law exponent (m).
More sophisticated Renormalization Group (RG) analysis uses complex critical exponents, giving log-periodic corrections.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 47 / 98
Heavy-Tailed Universality: Market Crashes
“Why Stock Markets Crash: Critical Events in Complex Financial Systems” by D. Sornette (book, 2003)
Simple Power Law:
log p(t) = A + B(t − t_c)^β
Complex Power Law (RG Log Periodic corrections):
log p(t) = A + B(t − t_c)^β + C(t − t_c)^β cos(ω log(t − t_c) − φ)
(a) Dow Jones 1929 crash (b) Universal parameters, fit to RG model
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 48 / 98
Heavy-Tailed Universality: Neuronal Avalanches
Neuronal avalanche dynamics indicates different universality classes in neuronal cultures; Scientific Reports 3417 (2018)
(c) Spiking activity of cultured neurons
(d) Critical exponents, fit to scaling model
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 49 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Experiments: just apply this to pre-trained models
https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-...
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 52 / 98
Experiments: just apply this to pre-trained models
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 53 / 98
RMT: LeNet5 (an old/small NN example)
Figure: Full and zoomed-in ESD for LeNet5, Layer FC1.
Marchenko-Pastur Bulk + Spikes
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 54 / 98
RMT: AlexNet (a typical modern/large DNN example)
Figure: Zoomed-in ESD for Layer FC1 and FC3 of AlexNet.
Marchenko-Pastur Bulk-decay + Heavy-tailed
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 55 / 98
RMT: InceptionV3 (a particularly unusual example)
Figure: ESD for Layers L226 and L302 in InceptionV3, as distributed w/ pyTorch.
Marchenko-Pastur bulk decay, onset of Heavy Tails
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 56 / 98
Convolutional 2D Feature Maps
We analyze Conv2D layers by extracting the feature maps individually,
i.e., A (3 × 3 × 64 × 64) Conv2D layer yields 9 (64 × 64) Feature Maps
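A sketch of the slicing described above; the Keras-style (k_h, k_w, C_in, C_out) layout is an assumption (PyTorch stores kernels as (C_out, C_in, k_h, k_w)), and the tensor here is random, just to show the shapes.

import numpy as np

# Sketch: slice a Conv2D weight tensor of shape (k_h, k_w, C_in, C_out) into
# k_h * k_w individual (C_in x C_out) matrices, one per spatial position,
# then compute the ESD of each.
W_conv = np.random.default_rng(0).normal(size=(3, 3, 64, 64))   # Keras-style kernel layout

feature_maps = [W_conv[i, j, :, :] for i in range(W_conv.shape[0])
                                   for j in range(W_conv.shape[1])]
print(len(feature_maps), feature_maps[0].shape)                 # 9 matrices of shape (64, 64)

esds = [np.linalg.eigvalsh(W.T @ W) for W in feature_maps]       # one ESD per feature map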
(a) α = 1.38 (b) α = 2.74 (c) α = 3.02
Figure: Select Feature Maps from different Conv2D layers of VGG16.
Fits shown with PDF (blue) and CDF (red)
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 57 / 98
Open AI GPT2 Attention Matrices
NLP Embedding and Attention Matrices are Dense/Linear, but generally have
large aspect ratios
(a) α = (b) α = (c) α =
Figure: Selected ESDs from Open AI GPT2 (Huggingface implementation)
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 58 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
RMT-based 5+1 Phases of Training
(a) Random-like. (b) Bleeding-out. (c) Bulk+Spikes.
(d) Bulk-decay. (e) Heavy-Tailed. (f) Rank-collapse.
Figure: The 5+1 phases of learning we identified in DNN training.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 60 / 98
RMT-based 5+1 Phases of Training
We model “noise” and also “signal” with random matrices:
W ≃ W_rand + Δ_sig.   (1)
Phase         | Operational Definition                                             | Informal Description via Eqn. (1)                               | Edge/tail Fluctuation Comments          | Illustration and Description
Random-like   | ESD well-fit by MP with appropriate λ⁺                             | W_rand random; Δ_sig zero or small                              | λ_max ≈ λ⁺ is sharp, with TW statistics | Fig. 15(a)
Bleeding-out  | ESD Random-like, excluding eigenmass just above λ⁺                 | W has eigenmass at bulk edge as spikes "pull out"; Δ_sig medium | BBP transition; λ_max and λ⁺ separate   | Fig. 15(b)
Bulk+Spikes   | ESD Random-like plus ≥ 1 spikes well above λ⁺                      | W_rand well-separated from low-rank Δ_sig; Δ_sig larger         | λ⁺ is TW, λ_max is Gaussian             | Fig. 15(c)
Bulk-decay    | ESD less Random-like; Heavy-Tailed eigenmass above λ⁺; some spikes | Complex Δ_sig with correlations that don't fully enter spike    | Edge above λ⁺ is not concave            | Fig. 15(d)
Heavy-Tailed  | ESD better-described by Heavy-Tailed RMT than Gaussian RMT         | W_rand is small; Δ_sig is large and strongly-correlated         | No good λ⁺; λ_max ≫ λ⁺                  | Fig. 15(e)
Rank-collapse | ESD has large-mass spike at λ = 0                                  | W very rank-deficient; over-regularization                      | —                                       | Fig. 15(f)
The 5+1 phases of learning we identified in DNN training.
RMT-based 5+1 Phases of Training
Lots of technical issues ...
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 62 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Bulk+Spikes: Small Models ∼ Tikhonov regularization
Low-rank perturbation:
W_l ≃ W_l^rand + Δ_large
Perturbative correction:
λ_max = σ² (1/Q + |Δ|²/N)(1 + N/|Δ|²),   for |Δ| > Q^(−1/4)
λ⁺ is a simple scale threshold:
x = (X̂ + αI)⁻¹ Ŵᵀ y
eigenvalues > α (Spikes) carry most of the signal/information
Bulk → Spikes
Smaller, older models like LeNet5 exhibit traditional regularization and can
be described perturbatively with Gaussian RMT
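A small numerical sketch of this Bulk+Spikes picture: a Gaussian bulk plus one rank-1 perturbation whose eigenvalue separates from λ⁺; the perturbation strength θ is an arbitrary illustrative choice.

import numpy as np

# Sketch: add a rank-1 "signal" perturbation to a Gaussian random matrix and watch
# one eigenvalue pop out past the MP bulk edge lambda_+ (a Bulk+Spikes ESD).
N, M, sigma = 3000, 1000, 1.0
Q = N / M
rng = np.random.default_rng(0)
u = rng.normal(size=N); u /= np.linalg.norm(u)
v = rng.normal(size=M); v /= np.linalg.norm(v)
theta = 3.0 * sigma * np.sqrt(N)                     # strong enough to leave the bulk
W = rng.normal(scale=sigma, size=(N, M)) + theta * np.outer(u, v)

evals = np.linalg.eigvalsh(W.T @ W / N)
lam_plus = sigma**2 * (1 + 1 / np.sqrt(Q))**2
print(lam_plus, np.sort(evals)[-3:])                 # one spike well above lambda_+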
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 64 / 98
Heavy-tailed Self-regularization
W is strongly-correlated and highly non-random
We model strongly-correlated systems by heavy-tailed random matrices
I.e., we model signal (not noise) by heavy-tailed random matrices
Then RMT/MP ESD will also have heavy tails
Known results from RMT / polymer theory (Bouchaud, Potters, etc)
AlexNet
ResNet50
Inception V3
DenseNet201
...
“All” larger, modern DNNs exhibit novel Heavy-tailed self-regularization
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 65 / 98
Heavy-tailed Self-regularization
Summary of what we “suspect” today
No single scale threshold.
No simple low rank approximation for WL.
Contributions from correlations at all scales.
Can not be treated perturbatively.
“All” larger, modern DNNs exhibit novel Heavy-tailed self-regularization
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 66 / 98
Spikes: carry more “information” than the Bulk
Spikes have less entropy, are more localized than bulk.
(a) Vector Entropies. (b) Localization Ratios. (c) Participation Ratios.
Figure: Eigenvector localization metrics for the FC1 layer of MiniAlexNet.
Information begins to concentrate in the spikes.
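As a sketch, common definitions of these localization metrics are below (spike eigenvectors come out more localized, i.e., these metrics are smaller); the exact conventions used for the MiniAlexNet figure may differ slightly.

import numpy as np

# Sketch of common eigenvector localization metrics; standard definitions that
# may differ in detail from the exact metrics used for the figure above.
def vector_entropy(v):
    p = v**2 / np.sum(v**2)
    return -np.sum(p * np.log(p + 1e-12)) / np.log(len(v))

def localization_ratio(v):
    return np.linalg.norm(v, 1) / np.linalg.norm(v, np.inf)

def participation_ratio(v):
    return np.sum(v**2) ** 2 / np.sum(v**4)

v_delocalized = np.ones(100) / 10.0          # spread-out (bulk-like) vector
v_localized = np.eye(100)[0]                 # concentrated (spike-like) vector
for f in (vector_entropy, localization_ratio, participation_ratio):
    print(f.__name__, f(v_delocalized), f(v_localized))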
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 67 / 98
Power Law Universality: ImageNet and AllenNLP
All these models display remarkable Heavy Tailed Universality
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 68 / 98
Power Law Universality: ImageNet
7500 matrices (and Conv2D feature maps)
over 50 architectures
Linear layers and Conv2D feature maps
80−90% of the fitted exponents have α < 5
All these models display remarkable Heavy Tailed Universality
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 69 / 98
Power Law Universality: Open AI GPT versus GPT2
GPT versus GPT2: (Huggingface implementation)
example of a class of models that “improves” over time.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 70 / 98
Mechanisms?
Spiked-Covariance Model
Statistical model: Johnstone, “On the distribution . . . ” 2001.
Simple self-organization model: Malevergne and Sornette, “Collective
Origin of the Coexistence of Apparent RMT Noise and Factors in
Large Sample Correlation Matrices,” 2002.
Low-rank perturbative variant of Gaussian randomness modeling noise
Heavy-tailed Models: Self-organized criticality (and others ...)
Johansen, Sornette, and Ledoit, “Predicting financial crashes using
discrete scale invariance,” 1998.
Markovic and Gros, “Power laws and self-organized criticality in
theory and nature,” 2013.
Non-perturbative model where heavy-tails and power laws are used to
model strongly correlated systems
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 71 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Self-regularization: Batch size experiments
A theory should make predictions:
We predict the existence of 5+1 phases of increasing implicit
self-regularization
We characterize their properties in terms of HT-RMT
Do these phases exist? Can we find them?
There are many knobs. Let’s vary one—batch size.
Tune the batch size from very large to very small
A small (i.e., retrainable) model exhibits all 5+1 phases
Large batch sizes => decrease generalization accuracy
Large batch sizes => decrease implicit self-regularization
Generalization Gap Phenomena: all else being equal, small batch sizes lead to
more implicitly self-regularized models.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 74 / 98
Batch size and the Generalization Gap
Large versus small batches?
Larger is better:
Convex theory: SGD is closer to gradient descent
Implementations: Better parallelism, etc.
(But see Golmant et al. (arxiv:1811.12941) and Ma et al.
(arxiv:1903.06237) for “inefficiency” of SGD and KFAC.)
Smaller is better:
Empirical: Hoffer et al. (arXiv:1705.08741) and Keskar et al.
(arXiv:1609.04836)
Information: Schwartz-Ziv and Tishby (arxiv:1703.00810)
(This is like a “supervised” version of our approach.)
Connection with weight norms?
Older: Bartlett, 1997; Mahoney and Narayanan, 2009.
Newer: Liao et al., 2018; Soudry et al., 2017; Poggio et al., 2018;
Neyshabur et al., 2014; 2015; 2017a; Bartlett et al., 2017; Yoshida and
Miyato, 2017; Kawaguchi et al., 2017; Neyshabur et al., 2017b; Arora et al.,
2018b;a; Zhou and Feng, 2018.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 75 / 98
Batch Size Tuning: Generalization Gap
Figure: Varying Batch Size: Stable Rank and MP Softrank for FC1 and FC2; Training and Test Accuracies versus Batch Size for MiniAlexNet.
Decreasing batch size leads to better results—it induces strong
correlations in W.
Increasing batch size leads to worse results—it washes out strong
correlations in W.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 76 / 98
Batch Size Tuning: Generalization Gap
(a) Batch Size 500. (b) Batch Size 250. (c) Batch Size 100. (d) Batch Size 32.
(e) Batch Size 16. (f) Batch Size 8. (g) Batch Size 4. (h) Batch Size 2.
Figure: Varying Batch Size. ESD for Layer FC1 of MiniAlexNet. We exhibit all 5
of the main phases of training by varying only the batch size.
Decreasing batch size induces strong correlations in W, leading to a more
implicitly-regularized model.
Increasing batch size washes out strong correlations in W, leading to a less
implicitly-regularized model.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 77 / 98
Summary so far
applied Random Matrix Theory (RMT)
self-regularization ∼ entropy / information decrease
5+1 phases of learning
small models ∼ Tikhonov-like regularization
modern DNNs ∼ heavy-tailed self-regularization
Remarkably ubiquitous
How can this be used?
Why does deep learning work?
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 78 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Open source tool: weightwatcher
https://github.com/CalculatedContent/WeightWatcher
A python tool to analyze weight matrices in Deep Neural Networks.
All our results can be reproduced by anyone on a basic laptop
using widely available, open source, pretrained models:
(keras, pytorch, osmr/imgclsmob, huggingface, allennlp, distiller, modelzoo, etc.)
and without even needing the test or training data!
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 80 / 98
WeightWatcher
WeightWatcher is an open source tool to analyze DNNs with our heavy-tailed α and weighted α̂ metrics (and other useful theories)
goal: to develop a useful, open source tool
supports: Keras, PyTorch, some custom libs (i.e. huggingface)
implements: various norm and rank metrics
pip install weightwatcher
current version: 0.1.2
latest from source: 0.1.3
looking for: users and contributors
https://github.com/CalculatedContent/WeightWatcher
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 81 / 98
WeightWatcher: Usage
Usage
import weightwatcher as ww
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze()
watcher.get_summary()
watcher.print_results()
Advanced Usage
def analyze(self, model=None, layers=[],
            min_size=50, max_size=0,
            compute_alphas=True,
            compute_lognorms=True,
            compute_spectralnorms=True,
            ...
            plot=True):
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 82 / 98
WeightWatcher: example VGG19_BN
import weightwatcher as ww
import torchvision.models as models
model = models.vgg19_bn(pretrained=True)
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze(compute_alphas=True)
data.append({"name": "vgg19bntorch", "summary": watcher.get_summary()})

{'name': 'vgg19bntorch',
 'summary': {'lognorm': 0.8185,
             'lognorm_compound': 0.9365,
             'alpha': 2.9646,
             'alpha_compound': 2.8479,
             'alpha_weighted': 1.1588,
             'alpha_weighted_compound': 1.5002}}
We also provide a pandas dataframe with detailed results for each layer
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 83 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Bounding Generalization Error
BTW, Bounding Generalization Error = Predicting Test Accuracies
A lot of recent interest, e.g.:
Bartlett et al: (arxiv:1706.08498): bounds based on ratio of output margin distribution
and spectral complexity measure
Neyshabur et al. (arxiv:1707.09564,arxiv:1706.08947): bounds based on the product
norms of the weights across layers
Arora et al. (arxiv:1802.05296): bounds based on compression and noise stability
properties
Liao et al. (arxiv:1807.09659): normalized cross-entropy measure that correlates well
with test loss
Jiang et al. (arxiv:1810.00113): measure based on distribution of margins at multiple
layers that correlates well with test loss∗∗
These use/develop learning theory bounds and then apply to training of
MNIST/CIFAR10/etc.
Question: How do these norm-based metrics perform on state-of-the-art pre-trained
models?
∗∗ and released DEMOGEN pretrained models (after our 1901.08278 paper).
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 85 / 98
Predicting test accuracies (at scale): Product norms
M&M: “Heavy-Tailed Universality Predicts Trends in Test Accuracies ... Pre-Trained ...” https://arxiv.org/abs/1901.08278
The product norm is a VC-like data-dependent capacity metric for DNNs.
People prove theorems and then may use it to guide training.
But how does it perform on state-of-the-art production-quality models?
We can predict trends in the test accuracy in state-of-the-art production-quality
models—without peeking at the test data!
“pip install weightwatcher”
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 86 / 98
Universality, capacity control, and norm-powerlaw relations
M&M: “Heavy-Tailed Universality Predicts Trends in Test Accuracies ... Pre-Trained ...” https://arxiv.org/abs/1901.08278
“Universality” suggests the power law exponent α would make a good, Universal,
DNN capacity control metric.
For a multi-layer NN, consider a weighted average:
α̂ = (1/N) Σ_{l,i} b_{l,i} α_{l,i}
To get the weights b_{l,i}, relate the Frobenius norm and the Power Law exponent.
Create a random Heavy-Tailed (Pareto) matrix:
Pr[W^rand_{i,j}] ∼ 1/x^(1+µ)
Examine norm-powerlaw relations:
log ‖W‖²_F / log λ_max   versus   α
Argue†† that:
PL–Norm Relation:  α log λ_max ≈ log ‖W‖²_F.
†† Open problem: make the "heuristic" argument more "rigorous."
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 87 / 98
Predicting test accuracies better: Weighted Power Laws
M&M: “Heavy-Tailed Universality Predicts Trends in Test Accuracies ... Pre-Trained ...” https://arxiv.org/abs/1901.08278
Use the weighted PL metric α̂ = (1/N) Σ_{l,i} log(λ^max_{l,i}) α_{l,i}, instead of the product norm.
We can predict trends in the test accuracy—without peeking at the test data!
“pip install weightwatcher”
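A minimal sketch of computing the weighted metric α̂ from per-layer fits; the (α, λ_max) values here are made up for illustration (with weightwatcher, the per-layer values come from the detailed per-layer results of watcher.analyze()).

import numpy as np

# Sketch: alpha_hat = (1/N) * sum_l log(lambda_max_l) * alpha_l
# from hypothetical per-layer fits (alpha_l, lambda_max_l).
layer_fits = [  # (alpha, lambda_max) per layer -- made-up numbers for illustration
    (3.2, 12.0),
    (2.8, 20.5),
    (4.1, 7.3),
]
alphas = np.array([a for a, _ in layer_fits])
lam_max = np.array([lm for _, lm in layer_fits])
alpha_hat = np.sum(np.log(lam_max) * alphas) / len(layer_fits)
print(alpha_hat)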
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 88 / 98
Predicting test accuracies better: Distilled Models
M&M: “Heavy-Tailed Universality Predicts Trends in Test Accuracies ... Pre-Trained ...” https://arxiv.org/abs/1901.08278
Question: Is the weighted PL metric simply a repackaging of the product norm?
Answer: No!
For some Intel Distiller models, the Spectral Norm behaves atypically, and α does not change noticeably
(a) PL Exponents (α) (b) Spectral Norm (λmax )
Figure: (a) PL exponents α and (b) Spectral Norm (λmax ) for each layer of ResNet20,
with Intel Distiller Group Regularization method applied (before and after).
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 89 / 98
WeighWatcher: ESDs of GANs
GANs also display Heavy Tailed ESDs, however:
There are more exceptions
Many ESDs appear to display significant rank collapse and only weak
correlations
(a) Histogram of αs (b) Anomalous ESD.
Figure: (a) Distribution of all power law exponents α for DeepMind's BigGAN
(Huggingface implementation). (b) Example of an anomalous ESD.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 90 / 98
Summary of “Validating and Using”
Some things so far:
We can: explain the generalization gap.
We can: "pip install weightwatcher" and use this open source tool.
We can: predict trends in test accuracies on production-quality models.
Some things we are starting to look at:
Better metrics for monitoring and/or improving training.
Better metrics for robustness to, e.g., adversarial perturbation, model
compression, etc., that don’t involve looking at data.
Better phenomenological load-like and temperature-like metrics to guide
data collection, algorithm/parameter/hyperparameter selection, etc.
What else?
Join us:
“pip install weightwatcher”—contribute to the repo.‡‡
‡‡ Don't do everything from scratch in a non-reproducible way. Make it reproducible!
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 91 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Implications: RMT and Deep Learning
Where are the local minima?
How is the Hessian behaved?
Are simpler models misleading?
Can we design better learning
strategies?
(tradeoff between Energy and Entropy minimization)
How can RMT be used to understand the Energy Landscape?
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 93 / 98
Implications: Minimizing Frustration and Energy Funnels
As simple as can be?, Wolynes, 1997
Energy Landscape Theory: “random heteropolymer” versus “natural protein” folding
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 94 / 98
Implications: The Spin Glass of Minimal Frustration
https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/
low lying Energy state in Spin Glass ∼ spikes in RMT
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 95 / 98
Implications: Rugged Energy Landscapes of Heavy-tailed
Models
Martin and Mahoney https://arxiv.org/abs/1710.09553
Spin Glasses with Heavy Tails?
Local minima do not concentrate
near the ground state
(Cizeau and Bouchaud 1993)
Configuration space with a “rugged
convexity”
Contrast with (Gaussian) Spin Glass
model of Choromanska et al. 2015
If Energy Landscape is ruggedly funneled, then no “problems” with local minima!
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 96 / 98
Conclusions: “pip install weightwatcher”
Statistical mechanics and neural networks
Long history
Revisit due to recent “crisis” in why deep learning works
Use ideas from statistical mechanics of strongly-correlated systems
Develop a theory that is designed to be used
Main Empirical/Theoretical Results
Use Heavy-tailed RMT to construct an operational theory of DNN learning
Evaluate effect of implicit versus explicit regularization
Exhibit all 5+1 phases by adjusting batch size: explain the generalization gap
Methodology: Observations → Hypotheses → Build a Theory → Test the Theory.
Many Implications:
Explain the generalization gap
Rationalize claims about rugged convexity of Energy Landscape
Predict test accuracies in state-of-the-art models
“pip install weightwatcher”
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 97 / 98
If you want more ... “pip install weightwatcher” ...
Background paper:
Rethinking generalization requires revisiting old ideas: statistical mechanics approaches
and complex learning behavior
(https://arxiv.org/abs/1710.09553)
Main paper (full):
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix
Theory and Implications for Learning
(https://arxiv.org/abs/1810.01075)
Code: https://github.com/CalculatedContent/ImplicitSelfRegularization
Main paper (abridged):
Traditional and Heavy-Tailed Self Regularization in Neural Network Models
(https://arxiv.org/abs/1901.08276)
Code: https://github.com/CalculatedContent/ImplicitSelfRegularization
Applying the theory paper:
Heavy-Tailed Universality Predicts Trends in Test Accuracies for Very Large Pre-Trained
Deep Neural Networks
(https://arxiv.org/abs/1901.08278)
Code: https://github.com/CalculatedContent/PredictingTestAccuracies
https://github.com/CalculatedContent/WeightWatcher
“pip install weightwatcher”
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 98 / 98

  • 7. Some other signposts Cowen’s introduction of sigmoid into neuroscience. Parisi’s replica theory computations. Solla’s statistical analysis. Gardner’s analysis of annealed versus quenched entropy. Saad’s analysis of dynamics of SGD. More recent work on dynamics, energy langscapes, etc. Lots more . . . Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 7 / 98
  • 8. Important: Our Methodological Approach Most people like training and validating ideas by training. We will use pre-trained models. Many state-of-the-art models are publicly available. They are “machine learning models that work” . . . so analyze them. Selection bias: you can’t talk with deceased patients. Of course, one could use these methods to improve training . . . we won’t. Benefits of this methodological approach. Can develop a practical theory. (Current theory isn’t . . . loose bounds and convergence rates.) Can evaluate theory on state-of-the-art models. (Big models are different than small . . . easily-trainable models.) Can be more reproducible. (Training isn’t reproducible . . . too many knobs.) You can “pip install weightwatcher” Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 8 / 98
  • 9. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 10. PAC/VC versus Statistical Mechanics Approaches (1 of 2) Basic Student-Teacher Learning Setup: Classify elements of input space X into {0, 1} Target rule / teacher T; and hypothesis space F of possible mappings Given T for X ⊂ X, the training set, select a student f ∗ ∈ F, and evaluate how well f ∗ approximates T on X Generalization error ( ): probability of disagreement bw student and teacher on X Training error ( t ): fraction of disagreement bw student and teacher on X Learning curve: behavior of | t − | as a function of control parameters PAC/VC Approach: Related to statistical problem of convergence of frequencies to probabilities Statistical Mechanics Approach: Exploit the thermodynamic limit from statistical mechanics Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 10 / 98
  • 11. PAC/VC versus Statistical Mechanics Approaches (2 of 2) PAC/VC: get bounds on worst-case results View m = |X| as the main control parameter; fix the function class F; and ask how | t − | varies Natural to consider γ = P [| t − | > δ] Related to problem of convergence of frequencies to probabilities Hoeffding-type approach not appropriate (f ∗ depends on training data) Fix F and construct uniform bound P [maxh∈F | t (h) − (h)| > δ] ≤ 2 |F| e−2mδ2 Straightforward if |F| < ∞; use VC dimension (etc.) otherwise Statistical Mechanics: get precise results for typical configurations Function class F = FN varies with m; and let m and (size of F) vary in well-defined manner Thermodynamic limit: m, N → ∞ s.t. α = m N (like load in associative memory models). Limit s.t. (when it exists) certain quantities get sharply peaked around their most probable value. Describe learning curve as competition between error (energy) and log of number of functions with that energy (entropy) Get precise results for typical (most probably in that limit) quantities Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 11 / 98
  • 12. Rethinking generalization requires revisiting old ideas Martin and Mahoney https://arxiv.org/abs/1710.09553 Very Simple Deep Learning (VSDL) model: DNN is a black box, load-like parameters α, & temperature-like parameters τ Adding noise to training data decreases α Early stopping increases τ Nearly any non-trivial model‡ exhibits “phase diagrams,” with qualitatively different generalization properties, for different parameter values. (a) Training/general- ization error. (b) Learning phases in τ-α plane. (c) Noisifying data and adjusting knobs. ‡ when analyzed via the Statistical Mechanics Theory of Generalization (SMToG) Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 12 / 98
  • 13. Remembering Regularization Martin and Mahoney https://arxiv.org/abs/1710.09553 Statistical Mechanics (1990s): (this) Overtraining → Spin Glass Phase Binary Classifier with N Random Labelings: 2N over-trained solutions: locally (ruggedly) convex, very high barriers, all unable to generalize implication: solutions inside basins should be more similar than solutions in different basins Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 13 / 98
  • 14. Stat Mech Setup: Student Teacher Model Martin and Mahoney https://arxiv.org/abs/1710.09553 Given N labeled data points Imagine a Teacher Network T that maps data to labels Learning finds a Student J similar to the Teacher T Consider all possible Student Networks J for all possible teachers T The Generalization error is related to the phase space volume Ω of all possible Student-Teacher overlaps for all possible J, T = arccos R, R = 1 N J† T Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 14 / 98
  • 15. Stat Mech Setup: Student Teacher Model Martin and Mahoney https://arxiv.org/abs/1710.09553 Statistical Mechanics Foundations: Spherical (Energy) Constraints: δ(Tr[J2 ] − N) Teacher Overlap (Potential): δ( 1 N Tr[J† T] − cos(π )) Teacher Phase Space Volume (Density of States): ΩT ( ) = dJδ(Tr[J2 ] − N)δ( 1 N Tr[J† T] − cos(π )) Comparison to traditional Statistical Mechanics: Phase Space Volume, free particles: ΩE = dN r dN pδ N i p2 i 2mi − E ∼ V N Canonical Ensemble: Legendre Transform in R = cos(π ): actually more technical, and must choose sign convention on Tr[J† T], H Ωβ(R) ∼ dµ(J)e−λTr[J† T] ∼ dqN dpN e−βH(p,q) Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 15 / 98
  • 16. Stat Mech Setup: Student Teacher Model Martin and Mahoney https://arxiv.org/abs/1710.09553 Early Models: Perception: J, T N-dim vectors Continuous Perception Ji ∈ R (not so intersting) Ising Perception Ji = ±1 (sharp transitions, requires Replica theory) Our Proposal: J, T (N × M) Real (possibly Heavy Tailed) matrices Practical Applications: Hinton, Bengio, etc. Related to complexity of (Levy) spin glasses (Bouchaud) Our Expectation: Heavy-tailed structure means there is less capacity/entropy available for integrals, which will affect generalization properties non-trivially Multi-class classification is very different than binary classification Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 16 / 98
  • 17. Student Teacher: Recent Practical Application “Similarity of Neural Network Representations Revisited” Kornblith, Norouzi, Lee, Hinton; https://arxiv.org/abs/1905.00414 Examined different Weight matrix similarity metrics Best method: Canonical Correlation Analysis (CCA): Y† X 2 F Figure: Diagnostic Tool for both individual and comparative DNNs Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 17 / 98
  • 18. Student Teacher: Recent Generalization vs. Memorization “Insights on representational similarity in neural networks with canonical correlation” Morcos, Raghu, Bengio; https://arxiv.org/pdf/1806.05759.pdf Compare NN representations and how they evolve during training Projection weighted Canonical Correlation Analysis (PWCCA) Figure: Generalizing networks converge to more similar solutions than memorizing networks. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 18 / 98
  • 19. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 20. Motivations: Theoretical AND Practical Theoretical: deeper insight into Why Deep Learning Works? convex versus non-convex optimization? explicit/implicit regularization? is / why is / when is deep better? VC theory versus Statistical Mechanics theory? . . . Practical: use insights to improve engineering of DNNs? when is a network fully optimized? can we use labels and/or domain knowledge more efficiently? large batch versus small batch in optimization? designing better ensembles? . . . Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 20 / 98
  • 21. Motivations: towards a Theory of Deep Learning DNNs as spin glasses, Choromanska et al. 2015 Looks exactly like old protein folding results (late 90s) Energy Landscape Theory Completely different picture of DNNs Raises broad questions about Why Deep Learning Works Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 21 / 98
  • 22. Motivations: regularization in DNNs? ICLR 2017 Best paper Large neural network models can easily overtrain/overfit on randomly labeled data Popular ways to regularize (basically minx f (x) + λg(x), with “control parameter” λ) may or may not help. Understanding deep learning requires rethinking generalization?? https://arxiv.org/abs/1611.03530 Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior!! https://arxiv.org/abs/1710.09553 (Martin & Mahoney) Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 22 / 98
  • 23. Motivations: stochastic optimization DNNs? Theory (from convex problems): First order (SGD, e.g., Bottou 2010) larger “batches” are better (at least up to statistical noise) Second order (SSN, e.g., Roosta and Mahoney 2016) larger “batches” are better (at least up to statistical noise) Large batch sizes have better computational properties! So, people just increase batch size (and compensate with other parameters) Practice (from non-convex problems): SGD-like methods “saturate” (https://arxiv.org/abs/1811.12941) SSN-like methods “saturate” (https://arxiv.org/abs/1903.06237) Small batch sizes have better statistical properties! Is batch size a computational parameter, or a statistical parameter, or what? How should batch size be chosen? Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 23 / 98
  • 24. Set up: the Energy Landscape Energy/Optimization function: EDNN = hL(WL × hL−1(WL−1 × hL−2(· · · ) + bL−1) + bL) Train this on labeled data {di , yi } ∈ D, using Backprop, by minimizing loss L: min Wl ,bl L i EDNN(di ) − yi EDNN is “the” Energy Landscape: The part of the optimization problem parameterized by the heretofore unknown elements of the weight matrices and bias vectors, and as defined by the data {di , yi } ∈ D Pass the data through the Energy function EDNN multiple times, as we run Backprop training The Energy Landscape§ is changing at each epoch § i.e., the optimization function that is nominally being optimized Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 24 / 98
  • 25. Problem: How can this possibly work? Expected Highly non-convex? Observed Apparently not! It has been known for a long time that local minima are not the issue. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 25 / 98
  • 26. Problem: Local Minima? Duda, Hart and Stork, 2000 Solution: add more capacity and regularize, i.e., over-parameterization Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 26 / 98
  • 27. Motivations: what is regularization? (a) Dropout. (b) Early Stopping. (c) Batch Size. (d) Noisify Data. Every adjustable knob and switch—and there are many¶—is regularization. ¶ https://arxiv.org/pdf/1710.10686.pdf Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 27 / 98
  • 28. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 29. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 30. Basics of Regularization Ridge Regression / Tikhonov-Phillips Regularization ˆWx = y x = ˆWT ˆW + αI −1 ˆWT y Moore-Penrose pseudoinverse (1955) Ridge regularization (Phillips, 1962) min x ˆWx − y 2 2 + α ˆx 2 2 familiar optimization problem Softens the rank of ˆW to focus on large eigenvalues. Related to Truncated SVD, which does hard truncation on rank of ˆW Early stopping, truncated random walks, etc. often implicitly solve regularized optimiation problems. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 30 / 98
  • 31. How we will study regularization The Energy Landscape is determined by layer weight matrices WL: EDNN = hL(WL × hL−1(WL−1 × hL−2(· · · ) + bL−1) + bL) Traditional regularization is applied to WL: min Wl ,bl L i EDNN(di ) − yi + α l Wl Different types of regularization, e.g., different norms · , leave different empirical signatures on WL. What we do: Turn off “all” regularization. Systematically turn it back on, explicitly with α or implicitly with knobs/switches. Study empirical properties of WL. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 31 / 98
  • 32. Lots of DNNs Analyzed Question: What happens to the layer weight matrices WL? (Don’t evaluate your method on one/two/three NN, evaluate it on a dozen/hundred.) Retrained LeNet5 on MINST using Keras. Two other small models: 3-Layer MLP Mini AlexNet Conv2D MaxPool Conv2D MaxPool FC1 FC2 FC Wide range of state-of-the-art pre-trained models: AlexNet, Inception, etc. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 32 / 98
  • 33. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 34. Matrix complexity: Matrix Entropy and Stable Rank W = UΣVT νi = Σii pi = ν2 i / i ν2 i S(W) = −1 log(R(W)) i pi log pi Rs(W) = W 2 F W 2 2 = i ν2 i ν2 max A warm-up: train a 3-Layer MLP: (e) MLP3 Entropies. (f) MLP3 Stable Ranks. Figure: Matrix Entropy & Stable Rank show transition during Backprop training. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 34 / 98
  • 35. Matrix complexity: Scree Plots W = UΣVT νi = Σii pi = ν2 i / i ν2 i S(W) = −1 log(R(W)) i pi log pi Rs(W) = W 2 F W 2 2 = i ν2 i ν2 max A warm-up: train a 3-Layer MLP: (a) Initial Scree Plot. (b) Final Scree Plot. Figure: Scree plots for initial and final configurations for MLP3. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 35 / 98
  • 36. Matrix complexity: Singular/Eigen Value Densities W = UΣVT νi = Σii pi = ν2 i / i ν2 i S(W) = −1 log(R(W)) i pi log pi Rs(W) = W 2 F W 2 2 = i ν2 i ν2 max A warm-up: train a 3-Layer MLP: (a) Singular val. density (b) Eigenvalue density Figure: Histograms of the Singular Values νi and associated Eigenvalues λi = ν2 i . Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 36 / 98
  • 37. ESD: detailed insight into WL Empirical Spectral Density (ESD: eigenvalues of X = WT L WL) import keras import numpy as np import matplotlib.pyplot as plt … W = model.layers[i].get_weights()[0] … X = np.dot(W, W.T) evals, evecs = np.linalg.eig(W, W.T) plt.hist(X, bin=100, density=True) Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 37 / 98
  • 38. ESD: detailed insight into WL Empirical Spectral Density (ESD: eigenvalues of X = WT L WL) Eopch 0: Random Matrix Eopch 36: Random + Spiles Entropy decrease corresponds to: modification (later, breakdown) of random structure and onset of a new kind of self-regularization. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 38 / 98
  • 39. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 40. Random Matrix Theory 101: Wigner and Tracy-Widom Wigner: global bulk statistics approach universal semi-circular form Tracy-Widom: local edge statistics fluctuate in universal way Problems with Wigner and Tracy-Widom: Weight matrices usually not square Typically do only a single training run Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 40 / 98
  • 41. Random Matrix Theory 102: Marchenko-Pastur Let W be an N × M random matrix, with elements Wij ∼ N(0, σ2 mp). Then, the ESD of X = WT W, converges to a deterministic function: ρN(λ) := 1 N M i=1 δ (λ − λi ) N→∞ −−−−→ Q fixed    Q 2πσ2 mp (λ+ − λ)(λ − λ−) λ if λ ∈ [λ−, λ+] 0 otherwise. with well-defined edges (which depend on Q, the aspect ratio): λ± = σ2 mp 1 ± 1 √ Q 2 Q = N/M ≥ 1. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 41 / 98
  • 42. Random Matrix Theory 102’: Marchenko-Pastur (a) Vary aspect ratios (b) Vary variance parameters Figure: Marchenko-Pastur (MP) distributions. Important points: Global bulk stats: The overall shape is deterministic, fixed by Q and σ. Local edge stats: The edge λ+ is very crisp, i.e., ∆λM = |λmax − λ+ | ∼ O(M−2/3 ), plus Tracy-Widom fluctuations. We use both global bulk statistics as well as local edge statistics in our theory. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 42 / 98
  • 43. Random Matrix Theory 103: Heavy-tailed RMT Go beyond the (relatively easy) Gaussian Universality class: model strongly-correlated systems (“signal”) with heavy-tailed random matrices. Generative Model w/ elements from Universality class Finite-N Global shape ρN (λ) Limiting Global shape ρ(λ), N → ∞ Bulk edge Local stats λ ≈ λ+ (far) Tail Local stats λ ≈ λmax Basic MP Gaussian MP distribution MP TW No tail. Spiked- Covariance Gaussian, + low-rank perturbations MP + Gaussian spikes MP TW Gaussian Heavy tail, 4 < µ (Weakly) Heavy-Tailed MP + PL tail MP Heavy-Tailed∗ Heavy-Tailed∗ Heavy tail, 2 < µ < 4 (Moderately) Heavy-Tailed (or “fat tailed”) PL∗∗ ∼ λ−(aµ+b) PL ∼ λ −( 1 2 µ+1) No edge. Frechet Heavy tail, 0 < µ < 2 (Very) Heavy-Tailed PL∗∗ ∼ λ −( 1 2 µ+1) PL ∼ λ −( 1 2 µ+1) No edge. Frechet Basic MP theory, and the spiked and Heavy-Tailed extensions we use, including known, empirically-observed, and conjectured relations between them. Boxes marked “∗ ” are best described as following “TW with large finite size corrections” that are likely Heavy-Tailed, leading to bulk edge statistics and far tail statistics that are indistinguishable. Boxes marked “∗∗ ” are phenomenological fits, describing large (2 < µ < 4) or small (0 < µ < 2) finite-size corrections on N → ∞ behavior.
  • 44. Fitting Heavy-tailed Distributions Figure: The log-log histogram plots of the ESD for three Heavy-Tailed random matrices M with same aspect ratio Q = 3, with µ = 1.0, 3.0, 5.0, corresponding to the three Heavy-Tailed Universality classes (0 < µ < 2 vs 2 < µ < 4 and 4 < µ). Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 44 / 98
  • 45. Non-negligible finite size effects (a) M = 1000, N = 2000. (b) Fixed M. (c) Fixed N. Figure: Dependence of α (the fitted PL parameter) on µ (the hypothesized limiting PL parameter). Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 45 / 98
  • 46. Heavy Tails (!) and Heavy-Tailed Universality (?) Universality: large-scale properties are independent of small-scale details Mathematicians: justify proving theorems in random matrix theory Physicists: derive new phenomenological relations and predict things Gaussian Universality is most common, but there are many other types. Heavy-Tailed Phenomenon Rare events are not extraordinarily rare, i.e., are heavier than Gaussian tails Modeled with power law and related functions Seen in finance, structural glass theory, etc. Heavy-Tailed Random Matrix Theory Phenomenological work by physicists (Bouchard, Potters, Sornette, 90s) Theorem proving by mathematicians (Auffinger, Ben Arous, Burda, Peche, 00s) Universality of Power Laws, Levy-based dynamics, finite-size attractors, etc. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 46 / 98
  • 47. Heavy-Tailed Universality: Earthquake prediction “Complex Critical Exponents from Renormalization Group Theory of Earthquakes . . . ” Sornette et al. (1985) Power law fit of the regional strain (a measure of seismic release) before the critical time tc (of the earthquake) d dt = A + B(t − tc )m (a) (b) Figure: (a) Cumulative Beniolf strain released by magnitude 5 and greater earthquakes in the San Francisco Bay area prior to the 1989 Loma Prieta eaerthquake. (b) Fit of Power Law exponent (m). More sophisticated Renormalization Group (RG) analysys uses complex critical exponents, giving log-peripdic corrections. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 47 / 98
  • 48. Heavy-Tailed Universality: Market Crashes “Why Stock Markets Crash: Critical Events in Complex Financial Systems” by D. Sornette (book, 2003) Simple Power Law log p(t) = A + B(t − tc )β Complex Power Law (RG Log Periodic corrections) log p(t) = A + B(t − tc )β + C(t − tc )β (cos(ωlog(t − tc ) − φ) (a) Dow Jones 1929 crash (b) Universal parameters, fit to RG model Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 48 / 98
  • 49. Heavy-Tailed Universality: Neuronal Avalanches Neuronal avalanche dynamics indicates different universality classes in neuronal cultures; Scienfic Reports 3417 (2018) (c) Spiking activity of cultured neurons (d) Critical exponents, fit to scalaing model Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 49 / 98
  • 50. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 51. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 52. Experiments: just apply this to pre-trained models https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-... Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 52 / 98
  • 53. Experiments: just apply this to pre-trained models Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 53 / 98
  • 54. RMT: LeNet5 (an old/small NN example) Figure: Full and zoomed-in ESD for LeNet5, Layer FC1. Marchenko-Pastur Bulk + Spikes Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 54 / 98
  • 55. RMT: AlexNet (a typical modern/large DNN example) Figure: Zoomed-in ESD for Layer FC1 and FC3 of AlexNet. Marchenko-Pastur Bulk-decay + Heavy-tailed Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 55 / 98
  • 56. RMT: InceptionV3 (a particularly unusual example) Figure: ESD for Layers L226 and L302 in InceptionV3, as distributed w/ pyTorch. Marchenko-Pastur bulk decay, onset of Heavy Tails Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 56 / 98
  • 57. Convolutional 2D Feature Maps We analyze Conv2D layers by extracting the feature maps individually, i.e., A (3 × 3 × 64 × 64) Conv2D layer yields 9 (64 × 64) Feature Maps (a) α = 1.38 (b) α = 2.74 (c) α = 3.02 Figure: Select Feature Maps from different Conv2D layers of VGG16. Fits shown with PDF (blue) and CDF (red) Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 57 / 98
  • 58. Open AI GPT2 Attention Matrices NLP Embedding and Attention Matrices are Dense/Linear, but generally have large aspect ratios (a) α = (b) α = (c) α = Figure: Selected ESDs from Open AI GPT2 (Huggingface implementation) Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 58 / 98
  • 59. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 60. RMT-based 5+1 Phases of Training (a) Random-like. (b) Bleeding-out. (c) Bulk+Spikes. (d) Bulk-decay. (e) Heavy-Tailed. (f) Rank-collapse. Figure: The 5+1 phases of learning we identified in DNN training. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 60 / 98
  • 61. RMT-based 5+1 Phases of Training We model “noise” and also “signal” with random matrices: W ≃ W_rand + ∆_sig. (1)
For each phase: operational definition; informal description via Eqn. (1); edge/tail fluctuation comments; illustration.
Random-like: ESD well-fit by MP with appropriate λ+; W_rand random, ∆_sig zero or small; λmax ≈ λ+ is sharp, with TW statistics. Fig. 15(a)
Bleeding-out: ESD Random-like, excluding eigenmass just above λ+; W has eigenmass at bulk edge as spikes “pull out”, ∆_sig medium; BPP transition, λmax and λ+ separate. Fig. 15(b)
Bulk+Spikes: ESD Random-like plus ≥ 1 spikes well above λ+; W_rand well-separated from low-rank ∆_sig, ∆_sig larger; λ+ is TW, λmax is Gaussian. Fig. 15(c)
Bulk-decay: ESD less Random-like, Heavy-Tailed eigenmass above λ+, some spikes; complex ∆_sig with correlations that don’t fully enter spike; edge above λ+ is not concave. Fig. 15(d)
Heavy-Tailed: ESD better-described by Heavy-Tailed RMT than Gaussian RMT; W_rand is small, ∆_sig is large and strongly-correlated; no good λ+, λmax ≫ λ+. Fig. 15(e)
Rank-collapse: ESD has large-mass spike at λ = 0; W very rank-deficient, over-regularization. Fig. 15(f)
The 5+1 phases of learning we identified in DNN training.
  • 62. RMT-based 5+1 Phases of Training Lots of technical issues ... Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 62 / 98
  • 63. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 64. Bulk+Spikes: Small Models ∼ Tikhonov regularization
Low-rank perturbation: W_l ≃ W_rand,l + ∆_large
Perturbative correction: λmax = σ² (1/Q + |∆|²/N)(1 + N/|∆|²), valid for |∆| > (Q)^(−1/4)
λ+ acts as a simple scale threshold: x = (X̂ + αI)^(−1) Ŵᵀ y; eigenvalues > α (Spikes) carry most of the signal/information
Bulk → Spikes
Smaller, older models like LeNet5 exhibit traditional regularization and can be described perturbatively with Gaussian RMT (see the numerical sketch below).
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 64 / 98
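A minimal numerical sketch of the Bulk+Spikes picture under Eqn. (1), using a rank-one perturbation of a Gaussian W_rand: as the perturbation strength grows, λmax detaches from the MP bulk edge λ+. The matrix sizes and perturbation strengths below are illustrative assumptions, not values from the tutorial.
# Sketch: rank-one perturbation of a Gaussian random matrix. When the perturbation
# is strong enough, one eigenvalue (the "spike") pulls out above the MP bulk edge
# lambda_+ = sigma^2 (1 + 1/sqrt(Q))^2.
import numpy as np

N, M, sigma = 4000, 1000, 1.0
Q = N / M
W_rand = np.random.normal(0, sigma, size=(N, M))

u = np.random.normal(size=(N, 1)) / np.sqrt(N)   # unit-norm vectors
v = np.random.normal(size=(M, 1)) / np.sqrt(M)

lam_plus = sigma**2 * (1 + 1 / np.sqrt(Q))**2
for strength in [0.0, 20.0, 80.0]:               # illustrative perturbation strengths
    W = W_rand + strength * (u @ v.T)            # W = W_rand + Delta_sig
    eigs = np.linalg.eigvalsh(W.T @ W / N)
    print(f"|Delta| ~ {strength:5.1f}: lambda_max = {eigs.max():.3f}  "
          f"(MP bulk edge lambda_+ = {lam_plus:.3f})")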
  • 65. Heavy-tailed Self-regularization W is strongly-correlated and highly non-random We model strongly-correlated systems by heavy-tailed random matrices I.e., we model signal (not noise) by heavy-tailed random matrices Then the RMT/MP ESD will also have heavy tails Known results from RMT / polymer theory (Bouchaud, Potters, etc.) AlexNet ResNet50 InceptionV3 DenseNet201 ... “All” larger, modern DNNs exhibit novel Heavy-tailed self-regularization Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 65 / 98
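For a concrete example of the heavy tails seen in modern DNNs, here is a minimal sketch, assuming torchvision and the open source powerlaw package, that fits a power-law exponent α to the ESD of a pretrained AlexNet fully-connected layer; the layer choice and the fitting details are assumptions about the workflow, not the authors' exact code.
# Sketch: fit a heavy-tail (power-law) exponent alpha to the ESD of a pretrained
# AlexNet fully-connected layer. Assumes torchvision and the powerlaw package.
import numpy as np
import powerlaw
import torchvision.models as models

model = models.alexnet(pretrained=True)
W = model.classifier[6].weight.data.numpy()   # FC3 weight matrix, shape (1000, 4096)
if W.shape[0] < W.shape[1]:
    W = W.T                                   # ensure N >= M, so Q = N/M >= 1
N, M = W.shape
eigs = np.linalg.eigvalsh(W.T @ W / N)        # ESD of the layer correlation matrix
fit = powerlaw.Fit(eigs[eigs > 0])            # MLE fit of the tail exponent
print(f"AlexNet FC3: fitted power-law exponent alpha = {fit.power_law.alpha:.2f}")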
  • 66. Heavy-tailed Self-regularization Summary of what we “suspect” today No single scale threshold. No simple low-rank approximation for W_L. Contributions from correlations at all scales. Cannot be treated perturbatively. “All” larger, modern DNNs exhibit novel Heavy-tailed self-regularization Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 66 / 98
  • 67. Spikes: carry more “information” than the Bulk Spikes have less entropy, are more localized than bulk. (a) Vector Entropies. (b) Localization Ratios. (c) Participation Ratios. Figure: Eigenvector localization metrics for the FC1 layer of MiniAlexNet. Information begins to concentrate in the spikes. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 67 / 98
  • 68. Power Law Universality: ImageNet and AllenNLP All these models display remarkable Heavy Tailed Universality Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 68 / 98
  • 69. Power Law Universality: ImageNet 7500 matrices (and Conv2D feature maps) over 50 architectures Linear layers and Conv2D feature maps: 80–90% have power-law exponent α < 5 All these models display remarkable Heavy Tailed Universality Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 69 / 98
  • 70. Power Law Universality: OpenAI GPT versus GPT2 GPT versus GPT2 (Huggingface implementation): an example of a class of models that “improves” over time. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 70 / 98
  • 71. Mechanisms? Spiked-Covariance Model Statistical model: Johnstone, “On the distribution . . . ” 2001. Simple self-organization model: Malevergne and Sornette, “Collective Origin of the Coexistence of Apparent RMT Noise and Factors in Large Sample Correlation Matrices,” 2002. Low-rank perturbative variant of Gaussian randomness modeling noise Heavy-tailed Models: Self-organized criticality (and others ...) Johansen, Sornette, and Ledoit, “Predicting financial crashes using discrete scale invariance,” 1998. Markovic and Gros, “Power laws and self-organized criticality in theory and nature,” 2013. Non-perturbative model where heavy-tails and power laws are used to model strongly correlated systems Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 71 / 98
  • 72. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 73. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 74. Self-regularization: Batch size experiments A theory should make predictions: We predict the existence of 5+1 phases of increasing implicit self-regularization We characterize their properties in terms of HT-RMT Do these phases exist? Can we find them? There are many knobs. Let’s vary one—batch size. Tune the batch size from very large to very small A small (i.e., retrainable) model exhibits all 5+1 phases Large batch sizes => decrease generalization accuracy Large batch sizes => decrease implicit self-regularization Generalization Gap Phenomena: all else being equal, small batch sizes lead to more implicitly self-regularized models. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 74 / 98
  • 75. Batch size and the Generalization Gap Large versus small batches? Larger is better: Convex theory: SGD is closer to gradient descent Implementations: Better parallelism, etc. (But see Golmant et al. (arxiv:1811.12941) and Ma et al. (arxiv:1903.06237) for “inefficiency” of SGD and KFAC.) Smaller is better: Empirical: Hoffer et al. (arXiv:1705.08741) and Keskar et al. (arXiv:1609.04836) Information: Shwartz-Ziv and Tishby (arxiv:1703.00810) (This is like a “supervised” version of our approach.) Connection with weight norms? Older: Bartlett, 1997; Mahoney and Narayanan, 2009. Newer: Liao et al., 2018; Soudry et al., 2017; Poggio et al., 2018; Neyshabur et al., 2014; 2015; 2017a; Bartlett et al., 2017; Yoshida and Miyato, 2017; Kawaguchi et al., 2017; Neyshabur et al., 2017b; Arora et al., 2018b;a; Zhou and Feng, 2018. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 75 / 98
  • 76. Batch Size Tuning: Generalization Gap Figure: Varying batch size: Stable Rank and MP SoftRank for FC1 and FC2, and training and test accuracies versus batch size, for MiniAlexNet. Decreasing batch size leads to better results: it induces strong correlations in W. Increasing batch size leads to worse results: it washes out strong correlations in W. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 76 / 98
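For reference, minimal sketches of the two rank metrics in the figure as we read them: the stable rank ||W||²_F / ||W||²_2, and an MP soft rank taken here as the ratio of the MP bulk edge λ+ to λmax. The exact normalizations used in the figure may differ, so treat these definitions as assumptions.
# Sketch: two rank metrics (definitions as we read them; normalizations may differ).
import numpy as np

def stable_rank(W):
    # ||W||_F^2 / ||W||_2^2: a smooth surrogate for the matrix rank
    svals = np.linalg.svd(W, compute_uv=False)
    return (svals ** 2).sum() / svals[0] ** 2

def mp_soft_rank(W, sigma2=None):
    # Ratio of the MP bulk edge lambda_+ to lambda_max: ~1 for pure noise,
    # decreasing toward 0 as correlations concentrate in spikes.
    if W.shape[0] < W.shape[1]:
        W = W.T
    N, M = W.shape
    Q = N / M
    sigma2 = np.var(W) if sigma2 is None else sigma2
    lam_plus = sigma2 * (1 + 1 / np.sqrt(Q)) ** 2
    lam_max = np.linalg.eigvalsh(W.T @ W / N).max()
    return lam_plus / lam_max

W = np.random.normal(0, 0.02, size=(4096, 1024))
print(f"stable rank = {stable_rank(W):.1f}, MP soft rank = {mp_soft_rank(W):.2f}")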
  • 77. Batch Size Tuning: Generalization Gap (a) Batch Size 500. (b) Batch Size 250. (c) Batch Size 100. (d) Batch Size 32. (e) Batch Size 16. (f) Batch Size 8. (g) Batch Size 4. (h) Batch Size 2. Figure: Varying Batch Size. ESD for Layer FC1 of MiniAlexNet. We exhibit all 5 of the main phases of training by varying only the batch size. Decreasing batch size induces strong correlations in W, leading to a more implicitly-regularized model. Increasing batch size washes out strong correlations in W, leading to a less implicitly-regularized model. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 77 / 98
  • 78. Summary so far applied Random Matrix Theory (RMT) self-regularization ∼ entropy / information decrease 5+1 phases of learning small models ∼ Tikhonov-like regularization modern DNNs ∼ heavy-tailed self-regularization Remarkably ubiquitous How can this be used? Why does deep learning work? Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 78 / 98
  • 79. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 80. Open source tool: weightwatcher https://github.com/CalculatedContent/WeightWatcher A Python tool to analyze weight matrices in Deep Neural Networks. All our results can be reproduced by anyone on a basic laptop using widely available, open source, pretrained models (keras, pytorch, osmr/imgclsmob, huggingface, allennlp, distiller, modelzoo, etc.) and without even needing the test or training data! Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 80 / 98
  • 81. WeightWatcher WeightWatcher is an open source tool to analyze DNNs with our heavy-tailed α and weighted α̂ metrics (and other useful theories) goal: to develop a useful, open source tool supports: Keras, PyTorch, some custom libs (e.g., huggingface) implements: various norm and rank metrics pip install weightwatcher current version: 0.1.2 latest from source: 0.1.3 looking for: users and contributors https://github.com/CalculatedContent/WeightWatcher Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 81 / 98
  • 82. WeightWatcher: Usage
Usage:
import weightwatcher as ww
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze()
watcher.get_summary()
watcher.print_results()
Advanced Usage:
def analyze(self, model=None, layers=[], min_size=50, max_size=0,
            compute_alphas=True, compute_lognorms=True,
            compute_spectralnorms=True, ..., plot=True):
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 82 / 98
  • 83. WeightWatcher: example VGG19_BN
import weightwatcher as ww
import torchvision.models as models
model = models.vgg19_bn(pretrained=True)
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze(compute_alphas=True)
data.append({"name": "vgg19bntorch", "summary": watcher.get_summary()})
Example summary:
{'name': 'vgg19bntorch',
 'summary': {'lognorm': 0.8185, 'lognorm_compound': 0.9365,
             'alpha': 2.9646, 'alpha_compound': 2.8479,
             'alpha_weighted': 1.1588, 'alpha_weighted_compound': 1.5002}}
We also provide a pandas dataframe with detailed results for each layer.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 83 / 98
  • 84. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 85. Bounding Generalization Error BTW, Bounding Generalization Error = Predicting Test Accuracies A lot of recent interest, e.g.: Bartlett et al. (arxiv:1706.08498): bounds based on ratio of output margin distribution and spectral complexity measure Neyshabur et al. (arxiv:1707.09564, arxiv:1706.08947): bounds based on the product norms of the weights across layers Arora et al. (arxiv:1802.05296): bounds based on compression and noise stability properties Liao et al. (arxiv:1807.09659): normalized cross-entropy measure that correlates well with test loss Jiang et al. (arxiv:1810.00113): measure based on distribution of margins at multiple layers that correlates well with test loss∗∗ These use/develop learning theory bounds and then apply them to training of MNIST/CIFAR10/etc. Question: How do these norm-based metrics perform on state-of-the-art pre-trained models? ∗∗ and released DEMOGEN pretrained models (after our 1901.08278 paper). Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 85 / 98
  • 86. Predicting test accuracies (at scale): Product norms M&M: “Heavy-Tailed Universality Predicts Trends in Test Accuracies ... Pre-Trained ...” https://arxiv.org/abs/1901.08278 The product norm is a VC-like data-dependent capacity metric for DNNs. People prove theorems and then may use it to guide training. But how does it perform on state-of-the-art production-quality models? We can predict trends in the test accuracy in state-of-the-art production-quality models—without peeking at the test data! “pip install weightwatcher” Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 86 / 98
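To make the product-norm baseline concrete, here is a minimal sketch of one such VC-like capacity metric, the log of the product of layer Frobenius norms, accumulated over the Linear and Conv2d layers of a pretrained torchvision model; the layer selection and the specific norm are illustrative assumptions, not the exact definition from any one of the papers above.
# Sketch: a simple product-norm capacity metric, log prod_l ||W_l||_F,
# accumulated over the Linear/Conv2d layers of a pretrained model.
import numpy as np
import torch.nn as nn
import torchvision.models as models

def log_product_frobenius_norm(model):
    log_prod = 0.0
    for layer in model.modules():
        if isinstance(layer, (nn.Linear, nn.Conv2d)):
            W = layer.weight.data.cpu().numpy()
            log_prod += np.log(np.linalg.norm(W))   # Frobenius norm of the layer weights
    return log_prod

model = models.vgg16(pretrained=True)
print(f"log product of Frobenius norms: {log_product_frobenius_norm(model):.2f}")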
  • 87. Universality, capacity control, and norm-powerlaw relations M&M: “Heavy-Tailed Universality Predicts Trends in Test Accuracies ... Pre-Trained ...” https://arxiv.org/abs/1901.08278
“Universality” suggests the power law exponent α would make a good, Universal, DNN capacity control metric.
For a multi-layer NN, consider a weighted average: α̂ = (1/N) Σ_{l,i} b_{l,i} α_{l,i}
To get the weights b_{l,i}, relate the Frobenius norm and the Power Law exponent.
Create a random Heavy-Tailed (Pareto) matrix: Pr[W^rand_{i,j} = x] ∼ 1/x^{1+µ}
Examine norm-powerlaw relations: log ||W||²_F and log λ^max versus α
Argue†† the PL–Norm Relation: α log λ^max ≈ log ||W||²_F.
†† Open problem: make the “heuristic” argument more “rigorous.”
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 87 / 98
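A rough sketch of the experiment described above: draw Pareto random matrices for a few values of µ, fit α to each ESD, and record log λmax and log ||W||²_F alongside it. This toy does not reproduce the normalizations used in the paper, so it is only for exploring how the quantities co-vary, not a verification of the PL-Norm relation; all parameter values are illustrative assumptions.
# Sketch: examine norm-powerlaw relations on synthetic heavy-tailed (Pareto) matrices:
# record the fitted alpha, log(lambda_max), and log(||W||_F^2) for a few tail exponents mu.
import numpy as np
import powerlaw  # pip install powerlaw

def norm_powerlaw_quantities(N=2000, M=1000, mu=3.0, seed=0):
    rng = np.random.default_rng(seed)
    # Pareto entries with random signs: Pr[|W_ij| = x] ~ 1 / x^(1 + mu)
    W = (rng.pareto(mu, size=(N, M)) + 1) * rng.choice([-1.0, 1.0], size=(N, M))
    eigs = np.linalg.eigvalsh(W.T @ W / N)
    alpha = powerlaw.Fit(eigs[eigs > 0]).power_law.alpha
    return alpha, np.log(eigs.max()), np.log((W ** 2).sum())

for mu in [2.5, 3.0, 3.5, 4.0]:
    alpha, log_lam_max, log_fro2 = norm_powerlaw_quantities(mu=mu)
    print(f"mu={mu}: alpha={alpha:.2f}  log(lambda_max)={log_lam_max:.2f}  "
          f"log||W||_F^2={log_fro2:.2f}")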
  • 88. Predicting test accuracies better: Weighted Power Laws M&M: “Heavy-Tailed Universality Predicts Trends in Test Accuracies ... Pre-Trained ...” https://arxiv.org/abs/1901.08278 Use the weighted PL metric α̂ = (1/N) Σ_{l,i} log(λ^max_{l,i}) α_{l,i} instead of the product norm. We can predict trends in the test accuracy—without peeking at the test data! “pip install weightwatcher” Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 88 / 98
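A minimal sketch of the weighted PL metric defined above, computed from per-layer (α, λmax) pairs; in practice these pairs come from the per-layer weightwatcher results, and the numbers below are purely hypothetical.
# Sketch: weighted power-law metric alpha_hat = (1/N) * sum_i log(lambda_max_i) * alpha_i,
# computed from per-layer (alpha, lambda_max) pairs (illustrative values below).
import numpy as np

def weighted_alpha(alphas, lambda_maxes):
    alphas = np.asarray(alphas, dtype=float)
    lambda_maxes = np.asarray(lambda_maxes, dtype=float)
    return float(np.mean(np.log(lambda_maxes) * alphas))

# Hypothetical per-layer fits for a small model:
alphas = [2.3, 3.1, 2.8, 4.0]
lambda_maxes = [14.2, 9.7, 22.5, 6.1]
print(f"alpha_hat = {weighted_alpha(alphas, lambda_maxes):.3f}")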
  • 89. Predicting test accuracies better: Distilled Models M&M: “Heavy-Tailed Universality Predicts Trends in Test Accuracies ... Pre-Trained ...” https://arxiv.org/abs/1901.08278 Question: Is the weighted PL metric simply a repackaging of the product norm? Answer: No! For some Intel Distiller models, the Spectral Norm behaves atypically, and α does not change noticeably (a) PL Exponents (α) (b) Spectral Norm (λmax) Figure: (a) PL exponents α and (b) Spectral Norm (λmax) for each layer of ResNet20, with the Intel Distiller Group Regularization method applied (before and after). Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 89 / 98
  • 90. WeightWatcher: ESDs of GANs GANs also display Heavy Tailed ESDs; however: There are more exceptions Many ESDs appear to display significant rank collapse and only weak correlations (a) Histogram of αs (b) Anomalous ESD. Figure: (a) Distribution of all power law exponents α for DeepMind’s BigGAN (Huggingface implementation). (b) Example of anomalous ESD. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 90 / 98
  • 91. Summary of “Validating and Using” Some things so far: We can: explain the generalization gap. We can: “pip install weightwatcher” and use this open source tool. We can: predict trends in test accuracies on production-quality models. Some things we are starting to look at: Better metrics for monitoring and/or improving training. Better metrics for robustness to, e.g., adversarial perturbation, model compression, etc., that don’t involve looking at data. Better phenomenological load-like and temperature-like metrics to guide data collection, algorithm/parameter/hyperparameter selection, etc. What else? Join us: “pip install weightwatcher” and contribute to the repo.‡‡ ‡‡ Don’t do everything from scratch in a non-reproducible way. Make it reproducible! Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 91 / 98
  • 92. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 93. Implications: RMT and Deep Learning Where are the local minima? How does the Hessian behave? Are simpler models misleading? Can we design better learning strategies? (tradeoff between Energy and Entropy minimization) How can RMT be used to understand the Energy Landscape? Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 93 / 98
  • 94. Implications: Minimizing Frustration and Energy Funnels As simple as can be?, Wolynes, 1997 Energy Landscape Theory: “random heteropolymer” versus “natural protein” folding Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 94 / 98
  • 95. Implications: The Spin Glass of Minimal Frustration https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/ low lying Energy state in Spin Glass ∼ spikes in RMT Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 95 / 98
  • 96. Implications: Rugged Energy Landscapes of Heavy-tailed Models Martin and Mahoney https://arxiv.org/abs/1710.09553 Spin Glasses with Heavy Tails? Local minima do not concentrate near the ground state (Cizeau and Bouchaud 1993) Configuration space with a “rugged convexity” Contrast with (Gaussian) Spin Glass model of Choromanska et al. 2015 If Energy Landscape is ruggedly funneled, then no “problems” with local minima! Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 96 / 98
  • 97. Conclusions: “pip install weightwatcher” Statistical mechanics and neural networks Long history Revisit due to recent “crisis” in why deep learning works Use ideas from statistical mechanics of strongly-correlated systems Develop a theory that is designed to be used Main Empirical/Theoretical Results Use Heavy-tailed RMT to construct an operational theory of DNN learning Evaluate effect of implicit versus explicit regularization Exhibit all 5+1 phases by adjusting batch size: explain the generalization gap Methodology: Observations → Hypotheses → Build a Theory → Test the Theory. Many Implications: Explain the generalization gap Rationalize claims about rugged convexity of Energy Landscape Predict test accuracies in state-of-the-art models “pip install weightwatcher” Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 97 / 98
  • 98. If you want more ... “pip install weightwatcher” ... Background paper: Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior (https://arxiv.org/abs/1710.09553) Main paper (full): Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning (https://arxiv.org/abs/1810.01075) Code: https://github.com/CalculatedContent/ImplicitSelfRegularization Main paper (abridged): Traditional and Heavy-Tailed Self Regularization in Neural Network Models (https://arxiv.org/abs/1901.08276) Code: https://github.com/CalculatedContent/ImplicitSelfRegularization Applying the theory paper: Heavy-Tailed Universality Predicts Trends in Test Accuracies for Very Large Pre-Trained Deep Neural Networks (https://arxiv.org/abs/1901.08278) Code: https://github.com/CalculatedContent/PredictingTestAccuracies https://github.com/CalculatedContent/WeightWatcher “pip install weightwatcher” Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 98 / 98