Statistical Mechanics Methods for Discovering
Knowledge from Production-Scale Neural Networks
Charles H. Martin∗ and Michael W. Mahoney†
Tutorial at ACM-KDD, August 2019
∗ Calculation Consulting, charles@calculationconsulting.com
† ICSI and Dept of Statistics, UC Berkeley, https://www.stat.berkeley.edu/~mmahoney/
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 1 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Statistical Physics & Neural Networks: A Long History
60s:
J. D. Cowan, Statistical Mechanics of Neural Networks, 1967.
70s:
W. A. Little, “The existence of persistent states in the brain,” Math.
Biosci., 1974.
80s:
H. Sompolinsky, “Statistical mechanics of neural networks,” Physics
Today, 1988.
90s:
D. Haussler, M. Kearns, H. S. Seung, and N. Tishby, “Rigorous learning
curve bounds from statistical mechanics,” Machine Learning, 1996.
00s:
A. Engel and C. P. L. Van den Broeck, Statistical mechanics of
learning, 2001.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 4 / 98
Hopfield model
Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” PNAS 1982.
Hopfield model:
Recurrent artificial neural network model
Equivalence between behavior of NNs with symmetric connections and the
equilibrium statistical mechanics behavior of certain magnetic systems.
Can design NNs for associative memory and other computational tasks
Phase diagram with three kinds of phases (α is load parameter):
Very low α regime: model has so much capacity that it acts as a prototype method
Intermediate α: spin glass phase, which is “pathologically non-convex”
Higher α: generalization phase
But:
Lots of subsequent work focusing on spin glasses, replica theory, etc.
Let’s go back to the basics!
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 5 / 98
Restricted Boltzmann Machines and Variational Methods
RBMs = Hopfield + temperature + backprop:
RBMs and other more sophisticated variational free energy methods
They have an intractable partition function.
Goal: try to approximate partition function / free energy.
Also, recent work on their phase diagram.
We do NOT do this.
Memorization, then and now.
Three (then) versus two (now) phases.
Modern “memorization” is probably more like spin glass phase.
Let’s go back to the basics!
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 6 / 98
Some other signposts
Cowan's introduction of the sigmoid into neuroscience.
Parisi’s replica theory computations.
Solla’s statistical analysis.
Gardner’s analysis of annealed versus quenched entropy.
Saad’s analysis of dynamics of SGD.
More recent work on dynamics, energy landscapes, etc.
Lots more . . .
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 7 / 98
Important: Our Methodological Approach
Most people like training: they validate their ideas by training models.
We will use pre-trained models.
Many state-of-the-art models are publicly available.
They are “machine learning models that work” . . . so analyze them.
Selection bias: you can’t talk with deceased patients.
Of course, one could use these methods to improve training . . . we won’t.
Benefits of this methodological approach.
Can develop a practical theory.
(Current theory isn’t . . . loose bounds and convergence rates.)
Can evaluate theory on state-of-the-art models.
(Big models are different than small . . . easily-trainable models.)
Can be more reproducible.
(Training isn’t reproducible . . . too many knobs.)
You can “pip install weightwatcher”
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 8 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
PAC/VC versus Statistical Mechanics Approaches (1 of 2)
Basic Student-Teacher Learning Setup:
Classify elements of input space 𝒳 into {0, 1}
Target rule / teacher T; and hypothesis space F of possible mappings
Given T on a training set X ⊂ 𝒳, select a student f* ∈ F, and evaluate how well f* approximates T on 𝒳
Generalization error (ε): probability of disagreement between student and teacher on 𝒳
Training error (ε_t): fraction of disagreements between student and teacher on X
Learning curve: behavior of |ε_t − ε| as a function of control parameters
PAC/VC Approach:
Related to statistical problem of convergence of frequencies to probabilities
Statistical Mechanics Approach:
Exploit the thermodynamic limit from statistical mechanics
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 10 / 98
PAC/VC versus Statistical Mechanics Approaches (2 of 2)
PAC/VC: get bounds on worst-case results
View m = |X| as the main control parameter; fix the function class F; and ask how |ε_t − ε| varies
Natural to consider γ = P[|ε_t − ε| > δ]
Related to problem of convergence of frequencies to probabilities
Hoeffding-type approach not appropriate (f* depends on training data)
Fix F and construct uniform bound P[max_{h∈F} |ε_t(h) − ε(h)| > δ] ≤ 2|F| e^(−2mδ²)
Straightforward if |F| < ∞; use VC dimension (etc.) otherwise
Statistical Mechanics: get precise results for typical configurations
Function class F = F_N varies with m; let m and N (size of F) vary in a well-defined manner
Thermodynamic limit: m, N → ∞ s.t. α = m/N is fixed (like the load in associative memory models)
Limit s.t. (when it exists) certain quantities get sharply peaked around their most probable values
Describe the learning curve as a competition between error (energy) and the log of the number of functions with that energy (entropy)
Get precise results for typical (most probable in that limit) quantities
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 11 / 98
Rethinking generalization requires revisiting old ideas
Martin and Mahoney https://arxiv.org/abs/1710.09553
Very Simple Deep Learning (VSDL) model:
DNN is a black box, load-like parameters α, & temperature-like parameters τ
Adding noise to training data decreases α
Early stopping increases τ
Nearly any non-trivial model‡
exhibits “phase diagrams,” with qualitatively
different generalization properties, for different parameter values.
(a) Training/generalization error. (b) Learning phases in the τ-α plane. (c) Noisifying data and adjusting knobs.
‡ when analyzed via the Statistical Mechanics Theory of Generalization (SMToG)
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 12 / 98
Remembering Regularization
Martin and Mahoney https://arxiv.org/abs/1710.09553
Statistical Mechanics (1990s): (this) Overtraining → Spin Glass Phase
Binary Classifier with N Random Labelings:
2^N over-trained solutions: locally (ruggedly) convex, very high barriers, all unable to generalize
implication: solutions inside basins should be more similar than solutions in different basins
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 13 / 98
Stat Mech Setup: Student Teacher Model
Martin and Mahoney https://arxiv.org/abs/1710.09553
Given N labeled data points
Imagine a Teacher Network T that maps data to labels
Learning finds a Student J similar to the Teacher T
Consider all possible Student Networks J for all possible teachers T
The Generalization error is related to the phase space volume Ω of all possible
Student-Teacher overlaps for all possible J, T
ε = (1/π) arccos R,   R = (1/N) J†T
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 14 / 98
Stat Mech Setup: Student Teacher Model
Martin and Mahoney https://arxiv.org/abs/1710.09553
Statistical Mechanics Foundations:
Spherical (Energy) Constraints:  δ(Tr[J²] − N)
Teacher Overlap (Potential):  δ((1/N) Tr[J†T] − cos(πε))
Teacher Phase Space Volume (Density of States):
Ω_T(ε) = ∫ dJ δ(Tr[J²] − N) δ((1/N) Tr[J†T] − cos(πε))
Comparison to traditional Statistical Mechanics:
Phase Space Volume, free particles:
Ω_E = ∫ d^N r d^N p δ(Σ_i p_i²/(2m_i) − E) ∼ V^N
Canonical Ensemble: Legendre Transform in R = cos(πε):
(actually more technical, and must choose a sign convention on Tr[J†T], H)
Ω_β(R) ∼ ∫ dµ(J) e^(−λ Tr[J†T]) ∼ ∫ dq^N dp^N e^(−βH(p,q))
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 15 / 98
Stat Mech Setup: Student Teacher Model
Martin and Mahoney https://arxiv.org/abs/1710.09553
Early Models: Perceptron: J, T are N-dim vectors
Continuous Perceptron: J_i ∈ ℝ (not so interesting)
Ising Perceptron: J_i = ±1 (sharp transitions, requires Replica theory)
Our Proposal: J, T (N × M) Real (possibly Heavy Tailed) matrices
Practical Applications: Hinton, Bengio, etc.
Related to complexity of (Levy) spin glasses (Bouchaud)
Our Expectation:
Heavy-tailed structure means there is less capacity/entropy available for
integrals, which will affect generalization properties non-trivially
Multi-class classification is very different than binary classification
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 16 / 98
Student Teacher: Recent Practical Application
“Similarity of Neural Network Representations Revisited”
Kornblith, Norouzi, Lee, Hinton; https://arxiv.org/abs/1905.00414
Examined different Weight matrix similarity metrics
Best method: Canonical Correlation Analysis (CCA): ‖Y†X‖²_F
Figure: Diagnostic Tool for both individual and comparative DNNs
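For intuition, here is a rough numpy sketch of a ‖YᵀX‖²_F-style similarity between two representation matrices; the centering and normalization below (essentially a linear-CKA-style score) are assumptions, not the exact procedure of Kornblith et al.

import numpy as np

# Rough sketch of a ||Y^T X||_F^2 - style similarity between two representations
# X, Y (n_examples x n_features). Centering/normalization here are assumptions,
# not the exact procedure of Kornblith et al.
def rep_similarity(X, Y):
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, 'fro') ** 2
    den = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return num / den                      # 1.0 when the representations coincide

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))
print(rep_similarity(X, X), rep_similarity(X, rng.normal(size=(500, 32))))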
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 17 / 98
Student Teacher: Recent Generalization vs. Memorization
“Insights on representational similarity in neural networks with canonical correlation”
Morcos, Raghu, Bengio; https://arxiv.org/pdf/1806.05759.pdf
Compare NN representations and how they evolve during training
Projection weighted Canonical Correlation Analysis (PWCCA)
Figure: Generalizing networks converge to more similar solutions than memorizing
networks.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 18 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Motivations: Theoretical AND Practical
Theoretical: deeper insight into Why Deep Learning Works?
convex versus non-convex optimization?
explicit/implicit regularization?
is / why is / when is deep better?
VC theory versus Statistical Mechanics theory?
. . .
Practical: use insights to improve engineering of DNNs?
when is a network fully optimized?
can we use labels and/or domain knowledge more efficiently?
large batch versus small batch in optimization?
designing better ensembles?
. . .
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 20 / 98
Motivations: towards a Theory of Deep Learning
DNNs as spin glasses (Choromanska et al. 2015)
Looks exactly like old protein folding results (late 90s): Energy Landscape Theory
Completely different picture of DNNs
Raises broad questions about Why Deep Learning Works
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 21 / 98
Motivations: regularization in DNNs?
ICLR 2017 Best paper
Large neural network models can easily overtrain/overfit on randomly
labeled data
Popular ways to regularize (basically min_x f(x) + λg(x), with "control parameter" λ) may or may not help.
Understanding deep learning requires rethinking generalization??
https://arxiv.org/abs/1611.03530
Rethinking generalization requires revisiting old ideas: statistical
mechanics approaches and complex learning behavior!!
https://arxiv.org/abs/1710.09553 (Martin & Mahoney)
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 22 / 98
Motivations: stochastic optimization DNNs?
Theory (from convex problems):
First order (SGD, e.g., Bottou 2010)
larger “batches” are better (at least up to statistical noise)
Second order (SSN, e.g., Roosta and Mahoney 2016)
larger “batches” are better (at least up to statistical noise)
Large batch sizes have better computational properties!
So, people just increase batch size (and compensate with other parameters)
Practice (from non-convex problems):
SGD-like methods “saturate”
(https://arxiv.org/abs/1811.12941)
SSN-like methods “saturate”
(https://arxiv.org/abs/1903.06237)
Small batch sizes have better statistical properties!
Is batch size a computational parameter, or a statistical parameter, or what?
How should batch size be chosen?
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 23 / 98
Set up: the Energy Landscape
Energy/Optimization function:
E_DNN = h_L(W_L × h_{L−1}(W_{L−1} × h_{L−2}(⋯) + b_{L−1}) + b_L)
Train this on labeled data {d_i, y_i} ∈ D, using Backprop, by minimizing the loss L:
min_{W_l, b_l} Σ_i L(E_DNN(d_i) − y_i)
E_DNN is "the" Energy Landscape:
The part of the optimization problem parameterized by the heretofore unknown elements of the weight matrices and bias vectors, and as defined by the data {d_i, y_i} ∈ D
Pass the data through the Energy function EDNN multiple times, as we run
Backprop training
The Energy Landscape§ is changing at each epoch
§ i.e., the optimization function that is nominally being optimized
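As a concrete illustration of the nested form of E_DNN, here is a minimal numpy sketch; the layer sizes and the ReLU activation h are illustrative assumptions, not taken from any particular model discussed here.

import numpy as np

# Minimal sketch of the nested form E_DNN = h_L(W_L h_{L-1}(...) + b_L) for a small net.
# Layer sizes and the ReLU activation are illustrative choices only.
rng = np.random.default_rng(0)
sizes = [10, 8, 6, 2]                      # input dim, two hidden dims, output dim
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
h = lambda x: np.maximum(x, 0.0)           # example activation

def E_DNN(d):
    """Evaluate the nested energy function on one input d."""
    x = d
    for W, b in zip(Ws, bs):
        x = h(W @ x + b)
    return x

print(E_DNN(rng.normal(size=sizes[0])))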
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 24 / 98
Problem: How can this possibly work?
Expected
Highly non-convex?
Observed
Apparently not!
It has been known for a long time that local minima are not the issue.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 25 / 98
Problem: Local Minima?
Duda, Hart and Stork, 2000
Solution: add more capacity and regularize, i.e., over-parameterization
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 26 / 98
Motivations: what is regularization?
(a) Dropout. (b) Early Stopping.
(c) Batch Size. (d) Noisify Data.
Every adjustable knob and switch—and there are many¶—is regularization.
¶ https://arxiv.org/pdf/1710.10686.pdf
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 27 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Basics of Regularization
Ridge Regression / Tikhonov-Phillips Regularization
Ŵx = y
x = (ŴᵀŴ + αI)⁻¹ Ŵᵀ y
Moore-Penrose pseudoinverse (1955)
Ridge regularization (Phillips, 1962)
min_x ‖Ŵx − y‖²₂ + α‖x‖²₂    (familiar optimization problem)
Softens the rank of Ŵ to focus on large eigenvalues.
Related to Truncated SVD, which does hard truncation on the rank of Ŵ
Early stopping, truncated random walks, etc. often implicitly solve regularized optimization problems.
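A minimal numpy sketch of the closed-form ridge solution above, compared against the pseudoinverse; the matrix sizes and the value of α are arbitrary illustrative choices.

import numpy as np

# Sketch: closed-form Tikhonov/ridge solution x = (W^T W + alpha I)^{-1} W^T y,
# compared against the Moore-Penrose pseudoinverse (the alpha -> 0 limit).
rng = np.random.default_rng(0)
W = rng.normal(size=(100, 20))
y = rng.normal(size=100)
alpha = 0.1

x_ridge = np.linalg.solve(W.T @ W + alpha * np.eye(W.shape[1]), W.T @ y)
x_pinv = np.linalg.pinv(W) @ y            # hard (rank-based) solution

print(np.linalg.norm(x_ridge - x_pinv))   # small alpha => close to the pseudoinverse solution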
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 30 / 98
How we will study regularization
The Energy Landscape is determined by the layer weight matrices W_L:
E_DNN = h_L(W_L × h_{L−1}(W_{L−1} × h_{L−2}(⋯) + b_{L−1}) + b_L)
Traditional regularization is applied to the W_L:
min_{W_l, b_l} Σ_i L(E_DNN(d_i) − y_i) + α Σ_l ‖W_l‖
Different types of regularization, e.g., different norms ‖·‖, leave different empirical signatures on W_L.
What we do:
Turn off “all” regularization.
Systematically turn it back on, explicitly with α or implicitly with
knobs/switches.
Study empirical properties of WL.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 31 / 98
Lots of DNNs Analyzed
Question: What happens to the layer weight matrices WL?
(Don’t evaluate your method on one/two/three NN, evaluate it on a dozen/hundred.)
Retrained LeNet5 on MNIST using Keras.
Two other small models:
3-Layer MLP
Mini AlexNet (Conv2D → MaxPool → Conv2D → MaxPool → FC1 → FC2 → FC)
Wide range of state-of-the-art pre-trained models:
AlexNet, Inception, etc.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 32 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Matrix complexity: Matrix Entropy and Stable Rank
W = UΣVᵀ,   ν_i = Σ_ii,   p_i = ν_i² / Σ_i ν_i²
Matrix Entropy:  S(W) = (−1 / log R(W)) Σ_i p_i log p_i
Stable Rank:  R_s(W) = ‖W‖²_F / ‖W‖²₂ = Σ_i ν_i² / ν²_max
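A small numpy sketch of these two complexity metrics, following the definitions above; the small constant inside the log is only a numerical safeguard.

import numpy as np

# Sketch: matrix entropy S(W) and stable rank R_s(W) from the singular values,
# following the definitions above (R(W) = rank of W).
def matrix_entropy_and_stable_rank(W):
    nu = np.linalg.svd(W, compute_uv=False)              # singular values nu_i
    p = nu**2 / np.sum(nu**2)                            # p_i = nu_i^2 / sum_i nu_i^2
    rank = np.linalg.matrix_rank(W)
    S = -np.sum(p * np.log(p + 1e-12)) / np.log(rank)    # S(W)
    Rs = np.sum(nu**2) / nu.max()**2                      # R_s(W) = ||W||_F^2 / ||W||_2^2
    return S, Rs

W = np.random.default_rng(0).normal(size=(300, 100))
print(matrix_entropy_and_stable_rank(W))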
A warm-up: train a 3-Layer MLP:
(e) MLP3 Entropies. (f) MLP3 Stable Ranks.
Figure: Matrix Entropy & Stable Rank show transition during Backprop training.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 34 / 98
Matrix complexity: Scree Plots
W = UΣVᵀ,   ν_i = Σ_ii,   p_i = ν_i² / Σ_i ν_i²
Matrix Entropy:  S(W) = (−1 / log R(W)) Σ_i p_i log p_i
Stable Rank:  R_s(W) = ‖W‖²_F / ‖W‖²₂ = Σ_i ν_i² / ν²_max
A warm-up: train a 3-Layer MLP:
(a) Initial Scree Plot. (b) Final Scree Plot.
Figure: Scree plots for initial and final configurations for MLP3.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 35 / 98
Matrix complexity: Singular/Eigen Value Densities
W = UΣVᵀ,   ν_i = Σ_ii,   p_i = ν_i² / Σ_i ν_i²
Matrix Entropy:  S(W) = (−1 / log R(W)) Σ_i p_i log p_i
Stable Rank:  R_s(W) = ‖W‖²_F / ‖W‖²₂ = Σ_i ν_i² / ν²_max
A warm-up: train a 3-Layer MLP:
(a) Singular val. density (b) Eigenvalue density
Figure: Histograms of the Singular Values ν_i and associated Eigenvalues λ_i = ν_i².
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 36 / 98
ESD: detailed insight into WL
Empirical Spectral Density (ESD): eigenvalues of X = W_Lᵀ W_L
import keras
import numpy as np
import matplotlib.pyplot as plt
…
W = model.layers[i].get_weights()[0]     # layer weight matrix W_L
…
X = np.dot(W.T, W)                       # correlation matrix X = W_L^T W_L
evals, evecs = np.linalg.eig(X)          # eigenvalues of X form the ESD
plt.hist(evals, bins=100, density=True)  # histogram of the ESD
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 37 / 98
ESD: detailed insight into WL
Empirical Spectral Density (ESD): eigenvalues of X = W_Lᵀ W_L
Epoch 0: Random Matrix
Epoch 36: Random + Spikes
Entropy decrease corresponds to:
modification (later, breakdown) of random structure and
onset of a new kind of self-regularization.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 38 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Random Matrix Theory 101: Wigner and Tracy-Widom
Wigner: global bulk statistics approach universal semi-circular form
Tracy-Widom: local edge statistics fluctuate in universal way
Problems with Wigner and Tracy-Widom:
Weight matrices usually not square
Typically do only a single training run
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 40 / 98
Random Matrix Theory 102: Marchenko-Pastur
Let W be an N × M random matrix, with elements W_ij ∼ N(0, σ²_mp).
Then, the ESD of X = WᵀW converges to a deterministic function:
ρ_N(λ) := (1/N) Σ_{i=1}^{M} δ(λ − λ_i)  →  (Q / 2πσ²_mp) √((λ⁺ − λ)(λ − λ⁻)) / λ   if λ ∈ [λ⁻, λ⁺], and 0 otherwise
(as N → ∞ with Q fixed)
with well-defined edges (which depend on Q, the aspect ratio):
λ± = σ²_mp (1 ± 1/√Q)²,    Q = N/M ≥ 1.
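A minimal numerical check of the MP edges; this sketch assumes the conventional 1/N normalization of X = WᵀW/N, consistent with σ_mp being the element-wise standard deviation.

import numpy as np

# Sketch: compare the ESD of X = W^T W / N (W Gaussian, N x M) with the MP edges
# lambda_{+/-} = sigma^2 (1 +/- 1/sqrt(Q))^2, Q = N/M.
N, M, sigma = 3000, 1000, 1.0
Q = N / M
W = np.random.default_rng(0).normal(scale=sigma, size=(N, M))
evals = np.linalg.eigvalsh(W.T @ W / N)            # ESD of the (scaled) correlation matrix

lam_minus = sigma**2 * (1 - 1 / np.sqrt(Q))**2
lam_plus = sigma**2 * (1 + 1 / np.sqrt(Q))**2
print(evals.min(), lam_minus, evals.max(), lam_plus)   # empirical edges ~ MP edges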
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 41 / 98
Random Matrix Theory 102’: Marchenko-Pastur
(a) Vary aspect ratios (b) Vary variance parameters
Figure: Marchenko-Pastur (MP) distributions.
Important points:
Global bulk stats: The overall shape is deterministic, fixed by Q and σ.
Local edge stats: The edge λ⁺ is very crisp, i.e., Δλ_M = |λ_max − λ⁺| ∼ O(M^(−2/3)), plus Tracy-Widom fluctuations.
We use both global bulk statistics as well as local edge statistics in our theory.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 42 / 98
Random Matrix Theory 103: Heavy-tailed RMT
Go beyond the (relatively easy) Gaussian Universality class:
model strongly-correlated systems (“signal”) with heavy-tailed random matrices.
Generative Model      | Elements from Universality class             | Finite-N global shape ρ_N(λ) | Limiting global shape ρ(λ), N → ∞ | Bulk edge local stats, λ ≈ λ⁺ | (far) Tail local stats, λ ≈ λ_max
Basic MP              | Gaussian                                     | MP distribution              | MP                                | TW                            | No tail.
Spiked-Covariance     | Gaussian, + low-rank perturbations           | MP + Gaussian spikes         | MP                                | TW                            | Gaussian
Heavy tail, 4 < µ     | (Weakly) Heavy-Tailed                        | MP + PL tail                 | MP                                | Heavy-Tailed∗                 | Heavy-Tailed∗
Heavy tail, 2 < µ < 4 | (Moderately) Heavy-Tailed (or "fat tailed")  | PL∗∗ ∼ λ^(−(aµ+b))           | PL ∼ λ^(−(µ/2+1))                 | No edge.                      | Frechet
Heavy tail, 0 < µ < 2 | (Very) Heavy-Tailed                          | PL∗∗ ∼ λ^(−(µ/2+1))          | PL ∼ λ^(−(µ/2+1))                 | No edge.                      | Frechet
Basic MP theory, and the spiked and Heavy-Tailed extensions we use, including known, empirically-observed, and conjectured relations between them. Boxes marked "∗" are best described as following "TW with large finite size corrections" that are likely Heavy-Tailed, leading to bulk edge statistics and far tail statistics that are indistinguishable. Boxes marked "∗∗" are phenomenological fits, describing large (2 < µ < 4) or small (0 < µ < 2) finite-size corrections on N → ∞ behavior.
Fitting Heavy-tailed Distributions
Figure: The log-log histogram plots of the ESD for three Heavy-Tailed random
matrices M with same aspect ratio Q = 3, with µ = 1.0, 3.0, 5.0, corresponding to
the three Heavy-Tailed Universality classes (0 < µ < 2 vs 2 < µ < 4 and 4 < µ).
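A sketch of how such heavy-tailed ESDs and their power-law fits can be generated; the Pareto entries, the sign-symmetrization, and the use of the powerlaw Python package for the fit are illustrative assumptions.

import numpy as np
import powerlaw   # pip install powerlaw; a commonly used package for such PL fits

# Sketch: build a heavy-tailed random matrix with Pareto entries (tail exponent mu),
# and fit a power law to the eigenvalue spectrum of X = W^T W / N.
# Q = 3 and mu = 3.0 mirror the figure above; the fitting details are illustrative.
N, M, mu = 3000, 1000, 3.0
rng = np.random.default_rng(0)
W = rng.pareto(mu, size=(N, M)) * rng.choice([-1, 1], size=(N, M))
evals = np.linalg.eigvalsh(W.T @ W / N)

fit = powerlaw.Fit(evals)
print(fit.power_law.alpha, fit.power_law.xmin)   # fitted PL exponent and lower cutoff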
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 44 / 98
Non-negligible finite size effects
(a) M = 1000, N = 2000. (b) Fixed M. (c) Fixed N.
Figure: Dependence of α (the fitted PL parameter) on µ (the hypothesized
limiting PL parameter).
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 45 / 98
Heavy Tails (!) and Heavy-Tailed Universality (?)
Universality: large-scale properties are independent of small-scale details
Mathematicians: justify proving theorems in random matrix theory
Physicists: derive new phenomenological relations and predict things
Gaussian Universality is most common, but there are many other types.
Heavy-Tailed Phenomenon
Rare events are not extraordinarily rare, i.e., are heavier than Gaussian tails
Modeled with power law and related functions
Seen in finance, structural glass theory, etc.
Heavy-Tailed Random Matrix Theory
Phenomenological work by physicists (Bouchaud, Potters, Sornette, 90s)
Theorem proving by mathematicians (Auffinger, Ben Arous, Burda, Peche, 00s)
Universality of Power Laws, Levy-based dynamics, finite-size attractors, etc.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 46 / 98
Heavy-Tailed Universality: Earthquake prediction
“Complex Critical Exponents from Renormalization Group Theory of Earthquakes . . . ” Sornette et al. (1985)
Power law fit of the regional strain (a measure of seismic release) before the critical time t_c (of the earthquake):
dε/dt = A + B(t − t_c)^m
Figure: (a) Cumulative Benioff strain released by magnitude 5 and greater earthquakes in the San Francisco Bay area prior to the 1989 Loma Prieta earthquake. (b) Fit of Power Law exponent (m).
More sophisticated Renormalization Group (RG) analysis uses complex critical exponents, giving log-periodic corrections.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 47 / 98
Heavy-Tailed Universality: Market Crashes
“Why Stock Markets Crash: Critical Events in Complex Financial Systems” by D. Sornette (book, 2003)
Simple Power Law:
log p(t) = A + B(t − t_c)^β
Complex Power Law (RG Log Periodic corrections):
log p(t) = A + B(t − t_c)^β + C(t − t_c)^β cos(ω log(t − t_c) − φ)
(a) Dow Jones 1929 crash (b) Universal parameters, fit to RG model
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 48 / 98
Heavy-Tailed Universality: Neuronal Avalanches
Neuronal avalanche dynamics indicates different universality classes in neuronal cultures; Scientific Reports 3417 (2018)
(c) Spiking activity of cultured neurons
(d) Critical exponents, fit to scaling model
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 49 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Experiments: just apply this to pre-trained models
https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-...
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 52 / 98
Experiments: just apply this to pre-trained models
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 53 / 98
RMT: LeNet5 (an old/small NN example)
Figure: Full and zoomed-in ESD for LeNet5, Layer FC1.
Marchenko-Pastur Bulk + Spikes
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 54 / 98
RMT: AlexNet (a typical modern/large DNN example)
Figure: Zoomed-in ESD for Layer FC1 and FC3 of AlexNet.
Marchenko-Pastur Bulk-decay + Heavy-tailed
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 55 / 98
RMT: InceptionV3 (a particularly unusual example)
Figure: ESD for Layers L226 and L302 in InceptionV3, as distributed w/ pyTorch.
Marchenko-Pastur bulk decay, onset of Heavy Tails
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 56 / 98
Convolutional 2D Feature Maps
We analyze Conv2D layers by extracting the feature maps individually,
i.e., A (3 × 3 × 64 × 64) Conv2D layer yields 9 (64 × 64) Feature Maps
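A sketch of the slicing described above; the Keras-style (k_h, k_w, C_in, C_out) layout is an assumption (PyTorch stores kernels as (C_out, C_in, k_h, k_w)), and the tensor here is random, just to show the shapes.

import numpy as np

# Sketch: slice a Conv2D weight tensor of shape (k_h, k_w, C_in, C_out) into
# k_h * k_w individual (C_in x C_out) matrices, one per spatial position,
# then compute the ESD of each.
W_conv = np.random.default_rng(0).normal(size=(3, 3, 64, 64))   # Keras-style kernel layout

feature_maps = [W_conv[i, j, :, :] for i in range(W_conv.shape[0])
                                   for j in range(W_conv.shape[1])]
print(len(feature_maps), feature_maps[0].shape)                 # 9 matrices of shape (64, 64)

esds = [np.linalg.eigvalsh(W.T @ W) for W in feature_maps]       # one ESD per feature map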
(a) α = 1.38 (b) α = 2.74 (c) α = 3.02
Figure: Select Feature Maps from different Conv2D layers of VGG16.
Fits shown with PDF (blue) and CDF (red)
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 57 / 98
Open AI GPT2 Attention Matrices
NLP Embedding and Attention Matrices are Dense/Linear, but generally have
large aspect ratios
(a) α = (b) α = (c) α =
Figure: Selected ESDs from Open AI GPT2 (Huggingface implementation)
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 58 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
RMT-based 5+1 Phases of Training
(a) Random-like. (b) Bleeding-out. (c) Bulk+Spikes.
(d) Bulk-decay. (e) Heavy-Tailed. (f) Rank-collapse.
Figure: The 5+1 phases of learning we identified in DNN training.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 60 / 98
RMT-based 5+1 Phases of Training
We model “noise” and also “signal” with random matrices:
W ≃ W_rand + Δ_sig.   (1)
Phase         | Operational Definition                                             | Informal Description via Eqn. (1)                               | Edge/tail Fluctuation Comments          | Illustration and Description
Random-like   | ESD well-fit by MP with appropriate λ⁺                             | W_rand random; Δ_sig zero or small                              | λ_max ≈ λ⁺ is sharp, with TW statistics | Fig. 15(a)
Bleeding-out  | ESD Random-like, excluding eigenmass just above λ⁺                 | W has eigenmass at bulk edge as spikes "pull out"; Δ_sig medium | BBP transition; λ_max and λ⁺ separate   | Fig. 15(b)
Bulk+Spikes   | ESD Random-like plus ≥ 1 spikes well above λ⁺                      | W_rand well-separated from low-rank Δ_sig; Δ_sig larger         | λ⁺ is TW, λ_max is Gaussian             | Fig. 15(c)
Bulk-decay    | ESD less Random-like; Heavy-Tailed eigenmass above λ⁺; some spikes | Complex Δ_sig with correlations that don't fully enter spike    | Edge above λ⁺ is not concave            | Fig. 15(d)
Heavy-Tailed  | ESD better-described by Heavy-Tailed RMT than Gaussian RMT         | W_rand is small; Δ_sig is large and strongly-correlated         | No good λ⁺; λ_max ≫ λ⁺                  | Fig. 15(e)
Rank-collapse | ESD has large-mass spike at λ = 0                                  | W very rank-deficient; over-regularization                      | —                                       | Fig. 15(f)
The 5+1 phases of learning we identified in DNN training.
RMT-based 5+1 Phases of Training
Lots of technical issues ...
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 62 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Bulk+Spikes: Small Models ∼ Tikhonov regularization
Low-rank perturbation:
W_l ≃ W_l^rand + Δ_large
Perturbative correction:
λ_max = σ² (1/Q + |Δ|²/N)(1 + N/|Δ|²),   for |Δ| > Q^(−1/4)
λ⁺ is a simple scale threshold:
x = (X̂ + αI)⁻¹ Ŵᵀ y
eigenvalues > α (Spikes) carry most of the signal/information
Bulk → Spikes
Smaller, older models like LeNet5 exhibit traditional regularization and can
be described perturbatively with Gaussian RMT
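A small numerical sketch of this Bulk+Spikes picture: a Gaussian bulk plus one rank-1 perturbation whose eigenvalue separates from λ⁺; the perturbation strength θ is an arbitrary illustrative choice.

import numpy as np

# Sketch: add a rank-1 "signal" perturbation to a Gaussian random matrix and watch
# one eigenvalue pop out past the MP bulk edge lambda_+ (a Bulk+Spikes ESD).
N, M, sigma = 3000, 1000, 1.0
Q = N / M
rng = np.random.default_rng(0)
u = rng.normal(size=N); u /= np.linalg.norm(u)
v = rng.normal(size=M); v /= np.linalg.norm(v)
theta = 3.0 * sigma * np.sqrt(N)                     # strong enough to leave the bulk
W = rng.normal(scale=sigma, size=(N, M)) + theta * np.outer(u, v)

evals = np.linalg.eigvalsh(W.T @ W / N)
lam_plus = sigma**2 * (1 + 1 / np.sqrt(Q))**2
print(lam_plus, np.sort(evals)[-3:])                 # one spike well above lambda_+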
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 64 / 98
Heavy-tailed Self-regularization
W is strongly-correlated and highly non-random
We model strongly-correlated systems by heavy-tailed random matrices
I.e., we model signal (not noise) by heavy-tailed random matrices
Then RMT/MP ESD will also have heavy tails
Known results from RMT / polymer theory (Bouchaud, Potters, etc)
AlexNet
ResNet50
Inception V3
DenseNet201
...
“All” larger, modern DNNs exhibit novel Heavy-tailed self-regularization
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 65 / 98
Heavy-tailed Self-regularization
Summary of what we “suspect” today
No single scale threshold.
No simple low rank approximation for WL.
Contributions from correlations at all scales.
Can not be treated perturbatively.
“All” larger, modern DNNs exhibit novel Heavy-tailed self-regularization
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 66 / 98
Spikes: carry more “information” than the Bulk
Spikes have less entropy, are more localized than bulk.
(a) Vector Entropies. (b) Localization Ratios. (c) Participation Ratios.
Figure: Eigenvector localization metrics for the FC1 layer of MiniAlexNet.
Information begins to concentrate in the spikes.
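As a sketch, common definitions of these localization metrics are below (spike eigenvectors come out more localized, i.e., these metrics are smaller); the exact conventions used for the MiniAlexNet figure may differ slightly.

import numpy as np

# Sketch of common eigenvector localization metrics; standard definitions that
# may differ in detail from the exact metrics used for the figure above.
def vector_entropy(v):
    p = v**2 / np.sum(v**2)
    return -np.sum(p * np.log(p + 1e-12)) / np.log(len(v))

def localization_ratio(v):
    return np.linalg.norm(v, 1) / np.linalg.norm(v, np.inf)

def participation_ratio(v):
    return np.sum(v**2) ** 2 / np.sum(v**4)

v_delocalized = np.ones(100) / 10.0          # spread-out (bulk-like) vector
v_localized = np.eye(100)[0]                 # concentrated (spike-like) vector
for f in (vector_entropy, localization_ratio, participation_ratio):
    print(f.__name__, f(v_delocalized), f(v_localized))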
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 67 / 98
Power Law Universality: ImageNet and AllenNLP
All these models display remarkable Heavy Tailed Universality
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 68 / 98
Power Law Universality: ImageNet
7500 matrices (and Conv2D feature maps)
over 50 architectures
Linear layers and Conv2D feature maps
80−90% of the fitted exponents have α < 5
All these models display remarkable Heavy Tailed Universality
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 69 / 98
Power Law Universality: Open AI GPT versus GPT2
GPT versus GPT2: (Huggingface implementation)
example of a class of models that “improves” over time.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 70 / 98
Mechanisms?
Spiked-Covariance Model
Statistical model: Johnstone, “On the distribution . . . ” 2001.
Simple self-organization model: Malevergne and Sornette, “Collective
Origin of the Coexistence of Apparent RMT Noise and Factors in
Large Sample Correlation Matrices,” 2002.
Low-rank perturbative variant of Gaussian randomness modeling noise
Heavy-tailed Models: Self-organized criticality (and others ...)
Johansen, Sornette, and Ledoit, “Predicting financial crashes using
discrete scale invariance,” 1998.
Markovic and Gros, “Power laws and self-organized criticality in
theory and nature,” 2013.
Non-perturbative model where heavy-tails and power laws are used to
model strongly correlated systems
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 71 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Self-regularization: Batch size experiments
A theory should make predictions:
We predict the existence of 5+1 phases of increasing implicit
self-regularization
We characterize their properties in terms of HT-RMT
Do these phases exist? Can we find them?
There are many knobs. Let’s vary one—batch size.
Tune the batch size from very large to very small
A small (i.e., retrainable) model exhibits all 5+1 phases
Large batch sizes => decrease generalization accuracy
Large batch sizes => decrease implicit self-regularization
Generalization Gap Phenomena: all else being equal, small batch sizes lead to
more implicitly self-regularized models.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 74 / 98
Batch size and the Generalization Gap
Large versus small batches?
Larger is better:
Convex theory: SGD is closer to gradient descent
Implementations: Better parallelism, etc.
(But see Golmant et al. (arxiv:1811.12941) and Ma et al.
(arxiv:1903.06237) for “inefficiency” of SGD and KFAC.)
Smaller is better:
Empirical: Hoffer et al. (arXiv:1705.08741) and Keskar et al.
(arXiv:1609.04836)
Information: Schwartz-Ziv and Tishby (arxiv:1703.00810)
(This is like a “supervised” version of our approach.)
Connection with weight norms?
Older: Bartlett, 1997; Mahoney and Narayanan, 2009.
Newer: Liao et al., 2018; Soudry et al., 2017; Poggio et al., 2018;
Neyshabur et al., 2014; 2015; 2017a; Bartlett et al., 2017; Yoshida and
Miyato, 2017; Kawaguchi et al., 2017; Neyshabur et al., 2017b; Arora et al.,
2018b;a; Zhou and Feng, 2018.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 75 / 98
Batch Size Tuning: Generalization Gap
Figure: Varying Batch Size: Stable Rank and MP Softrank for FC1 and FC2; Training and Test Accuracies versus Batch Size for MiniAlexNet.
Decreasing batch size leads to better results—it induces strong
correlations in W.
Increasing batch size leads to worse results—it washes out strong
correlations in W.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 76 / 98
Batch Size Tuning: Generalization Gap
(a) Batch Size 500. (b) Batch Size 250. (c) Batch Size 100. (d) Batch Size 32.
(e) Batch Size 16. (f) Batch Size 8. (g) Batch Size 4. (h) Batch Size 2.
Figure: Varying Batch Size. ESD for Layer FC1 of MiniAlexNet. We exhibit all 5
of the main phases of training by varying only the batch size.
Decreasing batch size induces strong correlations in W, leading to a more
implicitly-regularized model.
Increasing batch size washes out strong correlations in W, leading to a less
implicitly-regularized model.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 77 / 98
Summary so far
applied Random Matrix Theory (RMT)
self-regularization ∼ entropy / information decrease
5+1 phases of learning
small models ∼ Tikhonov-like regularization
modern DNNs ∼ heavy-tailed self-regularization
Remarkably ubiquitous
How can this be used?
Why does deep learning work?
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 78 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Open source tool: weightwatcher
https://github.com/CalculatedContent/WeightWatcher
A python tool to analyze weight matrices in Deep Neural Networks.
All our results can be reproduced by anyone on a basic laptop
using widely available, open source, pretrained models:
(keras, pytorch, osmr/imgclsmob, huggingface, allennlp, distiller, modelzoo, etc.)
and without even needing the test or training data!
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 80 / 98
WeightWatcher
WeightWatcher is an open source tool to analyze DNNs with our heavy-tailed α and weighted α̂ metrics (and other useful theories)
goal: to develop a useful, open source tool
supports: Keras, PyTorch, some custom libs (i.e. huggingface)
implements: various norm and rank metrics
pip install weightwatcher
current version: 0.1.2
latest from source: 0.1.3
looking for: users and contributors
https://github.com/CalculatedContent/WeightWatcher
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 81 / 98
WeightWatcher: Usage
Usage
import weightwatcher as ww
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze()
watcher.get_summary()
watcher.print_results()
Advanced Usage
def analyze(self, model=None, layers=[],
            min_size=50, max_size=0,
            compute_alphas=True,
            compute_lognorms=True,
            compute_spectralnorms=True,
            ...
            plot=True):
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 82 / 98
WeightWatcher: example VGG19_BN
import weightwatcher as ww
import torchvision.models as models
model = models.vgg19_bn(pretrained=True)
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze(compute_alphas=True)
data.append({"name": "vgg19bntorch", "summary": watcher.get_summary()})

{'name': 'vgg19bntorch',
 'summary': {'lognorm': 0.8185,
             'lognorm_compound': 0.9365,
             'alpha': 2.9646,
             'alpha_compound': 2.8479,
             'alpha_weighted': 1.1588,
             'alpha_weighted_compound': 1.5002}}
We also provide a pandas dataframe with detailed results for each layer
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 83 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Bounding Generalization Error
BTW, Bounding Generalization Error = Predicting Test Accuracies
A lot of recent interest, e.g.:
Bartlett et al: (arxiv:1706.08498): bounds based on ratio of output margin distribution
and spectral complexity measure
Neyshabur et al. (arxiv:1707.09564,arxiv:1706.08947): bounds based on the product
norms of the weights across layers
Arora et al. (arxiv:1802.05296): bounds based on compression and noise stability
properties
Liao et al. (arxiv:1807.09659): normalized cross-entropy measure that correlates well
with test loss
Jiang et al. (arxiv:1810.00113): measure based on distribution of margins at multiple
layers that correlates well with test loss∗∗
These use/develop learning theory bounds and then apply to training of
MNIST/CIFAR10/etc.
Question: How do these norm-based metrics perform on state-of-the-art pre-trained
models?
∗∗ and released DEMOGEN pretrained models (after our 1901.08278 paper).
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 85 / 98
Predicting test accuracies (at scale): Product norms
M&M: “Heavy-Tailed Universality Predicts Trends in Test Accuracies ... Pre-Trained ...” https://arxiv.org/abs/1901.08278
The product norm is a VC-like data-dependent capacity metric for DNNs.
People prove theorems and then may use it to guide training.
But how does it perform on state-of-the-art production-quality models?
We can predict trends in the test accuracy in state-of-the-art production-quality
models—without peeking at the test data!
“pip install weightwatcher”
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 86 / 98
Universality, capacity control, and norm-powerlaw relations
M&M: “Heavy-Tailed Universality Predicts Trends in Test Accuracies ... Pre-Trained ...” https://arxiv.org/abs/1901.08278
“Universality” suggests the power law exponent α would make a good, Universal,
DNN capacity control metric.
For a multi-layer NN, consider a weighted average:
α̂ = (1/N) Σ_{l,i} b_{l,i} α_{l,i}
To get the weights b_{l,i}, relate the Frobenius norm and the Power Law exponent.
Create a random Heavy-Tailed (Pareto) matrix:
Pr[W^rand_{i,j}] ∼ 1/x^(1+µ)
Examine norm-powerlaw relations:
log ‖W‖²_F / log λ_max   versus   α
Argue†† that:
PL–Norm Relation:  α log λ_max ≈ log ‖W‖²_F.
†† Open problem: make the "heuristic" argument more "rigorous."
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 87 / 98
Predicting test accuracies better: Weighted Power Laws
M&M: “Heavy-Tailed Universality Predicts Trends in Test Accuracies ... Pre-Trained ...” https://arxiv.org/abs/1901.08278
Use the weighted PL metric α̂ = (1/N) Σ_{l,i} log(λ^max_{l,i}) α_{l,i}, instead of the product norm.
We can predict trends in the test accuracy—without peeking at the test data!
“pip install weightwatcher”
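A minimal sketch of computing the weighted metric α̂ from per-layer fits; the (α, λ_max) values here are made up for illustration (with weightwatcher, the per-layer values come from the detailed per-layer results of watcher.analyze()).

import numpy as np

# Sketch: alpha_hat = (1/N) * sum_l log(lambda_max_l) * alpha_l
# from hypothetical per-layer fits (alpha_l, lambda_max_l).
layer_fits = [  # (alpha, lambda_max) per layer -- made-up numbers for illustration
    (3.2, 12.0),
    (2.8, 20.5),
    (4.1, 7.3),
]
alphas = np.array([a for a, _ in layer_fits])
lam_max = np.array([lm for _, lm in layer_fits])
alpha_hat = np.sum(np.log(lam_max) * alphas) / len(layer_fits)
print(alpha_hat)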
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 88 / 98
Predicting test accuracies better: Distilled Models
M&M: “Heavy-Tailed Universality Predicts Trends in Test Accuracies ... Pre-Trained ...” https://arxiv.org/abs/1901.08278
Question: Is the weighted PL metric simply a repackaging of the product norm?
Answer: No!
For some Intel Distiller models, the Spectral Norm behaves atypically, and α does not change noticeably
(a) PL Exponents (α) (b) Spectral Norm (λmax )
Figure: (a) PL exponents α and (b) Spectral Norm (λmax ) for each layer of ResNet20,
with Intel Distiller Group Regularization method applied (before and after).
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 89 / 98
WeighWatcher: ESDs of GANs
GANs also display Heavy Tailed ESDs, however:
There are more exceptions
Many ESDs appear to display significant rank collapse and only weak
correlations
(a) Histogram of αs (b) Anomalous ESD.
Figure: (a) Distribution of all power law exponents α for DeepMind's BigGAN
(Huggingface implementation). (b) Example of an anomalous ESD.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 90 / 98
Summary of “Validating and Using”
Some things so far:
We can: explain the generalization gap.
We can: "pip install weightwatcher" and use this open source tool.
We can: predict trends in test accuracies on production-quality models.
Some things we are starting to look at:
Better metrics for monitoring and/or improving training.
Better metrics for robustness to, e.g., adversarial perturbation, model
compression, etc., that don’t involve looking at data.
Better phenomenological load-like and temperature-like metrics to guide
data collection, algorithm/parameter/hyperparameter selection, etc.
What else?
Join us:
“pip install weightwatcher”—contribute to the repo.‡‡
‡‡ Don't do everything from scratch in a non-reproducible way. Make it reproducible!
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 91 / 98
Outline
1 Prehistory and History
Older Background
A Very Simple Deep Learning Model
More Immediate Background
2 Preliminary Results
Regularization and the Energy Landscape
Preliminary Empirical Results
Gaussian and Heavy-tailed Random Matrix Theory
3 Developing a Theory for Deep Learning
More Detailed Empirical Results
An RMT-based Theory for Deep Learning
Tikhonov Regularization versus Heavy-tailed Regularization
4 Validating and Using the Theory
Varying the Batch Size: Explaining the Generalization Gap
Using the Theory: pip install weightwatcher
Diagnostics at Scale: Predicting Test Accuracies
5 More General Implications and Conclusions
Implications: RMT and Deep Learning
Where are the local minima?
How is the Hessian behaved?
Are simpler models misleading?
Can we design better learning
strategies?
(tradeoff between Energy and Entropy minimization)
How can RMT be used to understand the Energy Landscape?
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 93 / 98
Implications: Minimizing Frustration and Energy Funnels
As simple as can be?, Wolynes, 1997
Energy Landscape Theory: “random heteropolymer” versus “natural protein” folding
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 94 / 98
Implications: The Spin Glass of Minimal Frustration
https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/
low lying Energy state in Spin Glass ∼ spikes in RMT
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 95 / 98
Implications: Rugged Energy Landscapes of Heavy-tailed
Models
Martin and Mahoney https://arxiv.org/abs/1710.09553
Spin Glasses with Heavy Tails?
Local minima do not concentrate
near the ground state
(Cizeau and Bouchaud 1993)
Configuration space with a “rugged
convexity”
Contrast with (Gaussian) Spin Glass
model of Choromanska et al. 2015
If Energy Landscape is ruggedly funneled, then no “problems” with local minima!
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 96 / 98
Conclusions: “pip install weightwatcher”
Statistical mechanics and neural networks
Long history
Revisit due to recent “crisis” in why deep learning works
Use ideas from statistical mechanics of strongly-correlated systems
Develop a theory that is designed to be used
Main Empirical/Theoretical Results
Use Heavy-tailed RMT to construct an operational theory of DNN learning
Evaluate effect of implicit versus explicit regularization
Exhibit all 5+1 phases by adjusting batch size: explain the generalization gap
Methodology: Observations → Hypotheses → Build a Theory → Test the Theory.
Many Implications:
Explain the generalization gap
Rationalize claims about rugged convexity of Energy Landscape
Predict test accuracies in state-of-the-art models
“pip install weightwatcher”
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 97 / 98
If you want more ... “pip install weightwatcher” ...
Background paper:
Rethinking generalization requires revisiting old ideas: statistical mechanics approaches
and complex learning behavior
(https://arxiv.org/abs/1710.09553)
Main paper (full):
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix
Theory and Implications for Learning
(https://arxiv.org/abs/1810.01075)
Code: https://github.com/CalculatedContent/ImplicitSelfRegularization
Main paper (abridged):
Traditional and Heavy-Tailed Self Regularization in Neural Network Models
(https://arxiv.org/abs/1901.08276)
Code: https://github.com/CalculatedContent/ImplicitSelfRegularization
Applying the theory paper:
Heavy-Tailed Universality Predicts Trends in Test Accuracies for Very Large Pre-Trained
Deep Neural Networks
(https://arxiv.org/abs/1901.08278)
Code: https://github.com/CalculatedContent/PredictingTestAccuracies
https://github.com/CalculatedContent/WeightWatcher
“pip install weightwatcher”
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 98 / 98

  • 7. Some other signposts Cowen’s introduction of sigmoid into neuroscience. Parisi’s replica theory computations. Solla’s statistical analysis. Gardner’s analysis of annealed versus quenched entropy. Saad’s analysis of dynamics of SGD. More recent work on dynamics, energy langscapes, etc. Lots more . . . Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 7 / 98
  • 8. Important: Our Methodological Approach Most people like training and validating ideas by training. We will use pre-trained models. Many state-of-the-art models are publicly available. They are “machine learning models that work” . . . so analyze them. Selection bias: you can’t talk with deceased patients. Of course, one could use these methods to improve training . . . we won’t. Benefits of this methodological approach. Can develop a practical theory. (Current theory isn’t . . . loose bounds and convergence rates.) Can evaluate theory on state-of-the-art models. (Big models are different than small . . . easily-trainable models.) Can be more reproducible. (Training isn’t reproducible . . . too many knobs.) You can “pip install weightwatcher” Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 8 / 98
  • 9. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 10. PAC/VC versus Statistical Mechanics Approaches (1 of 2) Basic Student-Teacher Learning Setup: Classify elements of input space X into {0, 1} Target rule / teacher T; and hypothesis space F of possible mappings Given T for X ⊂ X, the training set, select a student f ∗ ∈ F, and evaluate how well f ∗ approximates T on X Generalization error ( ): probability of disagreement bw student and teacher on X Training error ( t ): fraction of disagreement bw student and teacher on X Learning curve: behavior of | t − | as a function of control parameters PAC/VC Approach: Related to statistical problem of convergence of frequencies to probabilities Statistical Mechanics Approach: Exploit the thermodynamic limit from statistical mechanics Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 10 / 98
  • 11. PAC/VC versus Statistical Mechanics Approaches (2 of 2) PAC/VC: get bounds on worst-case results View m = |X| as the main control parameter; fix the function class F; and ask how | t − | varies Natural to consider γ = P [| t − | > δ] Related to problem of convergence of frequencies to probabilities Hoeffding-type approach not appropriate (f ∗ depends on training data) Fix F and construct uniform bound P [maxh∈F | t (h) − (h)| > δ] ≤ 2 |F| e−2mδ2 Straightforward if |F| < ∞; use VC dimension (etc.) otherwise Statistical Mechanics: get precise results for typical configurations Function class F = FN varies with m; and let m and (size of F) vary in well-defined manner Thermodynamic limit: m, N → ∞ s.t. α = m N (like load in associative memory models). Limit s.t. (when it exists) certain quantities get sharply peaked around their most probable value. Describe learning curve as competition between error (energy) and log of number of functions with that energy (entropy) Get precise results for typical (most probably in that limit) quantities Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 11 / 98
  • 12. Rethinking generalization requires revisiting old ideas Martin and Mahoney https://arxiv.org/abs/1710.09553 Very Simple Deep Learning (VSDL) model: DNN is a black box, load-like parameters α, & temperature-like parameters τ Adding noise to training data decreases α Early stopping increases τ Nearly any non-trivial model‡ exhibits “phase diagrams,” with qualitatively different generalization properties, for different parameter values. (a) Training/general- ization error. (b) Learning phases in τ-α plane. (c) Noisifying data and adjusting knobs. ‡ when analyzed via the Statistical Mechanics Theory of Generalization (SMToG) Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 12 / 98
  • 13. Remembering Regularization Martin and Mahoney https://arxiv.org/abs/1710.09553 Statistical Mechanics (1990s): (this) Overtraining → Spin Glass Phase Binary Classifier with N Random Labelings: 2N over-trained solutions: locally (ruggedly) convex, very high barriers, all unable to generalize implication: solutions inside basins should be more similar than solutions in different basins Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 13 / 98
  • 14. Stat Mech Setup: Student Teacher Model Martin and Mahoney https://arxiv.org/abs/1710.09553 Given N labeled data points Imagine a Teacher Network T that maps data to labels Learning finds a Student J similar to the Teacher T Consider all possible Student Networks J for all possible teachers T The Generalization error is related to the phase space volume Ω of all possible Student-Teacher overlaps for all possible J, T = arccos R, R = 1 N J† T Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 14 / 98
  • 15. Stat Mech Setup: Student Teacher Model Martin and Mahoney https://arxiv.org/abs/1710.09553 Statistical Mechanics Foundations: Spherical (Energy) Constraints: δ(Tr[J2 ] − N) Teacher Overlap (Potential): δ( 1 N Tr[J† T] − cos(π )) Teacher Phase Space Volume (Density of States): ΩT ( ) = dJδ(Tr[J2 ] − N)δ( 1 N Tr[J† T] − cos(π )) Comparison to traditional Statistical Mechanics: Phase Space Volume, free particles: ΩE = dN r dN pδ N i p2 i 2mi − E ∼ V N Canonical Ensemble: Legendre Transform in R = cos(π ): actually more technical, and must choose sign convention on Tr[J† T], H Ωβ(R) ∼ dµ(J)e−λTr[J† T] ∼ dqN dpN e−βH(p,q) Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 15 / 98
  • 16. Stat Mech Setup: Student Teacher Model Martin and Mahoney https://arxiv.org/abs/1710.09553 Early Models: Perception: J, T N-dim vectors Continuous Perception Ji ∈ R (not so intersting) Ising Perception Ji = ±1 (sharp transitions, requires Replica theory) Our Proposal: J, T (N × M) Real (possibly Heavy Tailed) matrices Practical Applications: Hinton, Bengio, etc. Related to complexity of (Levy) spin glasses (Bouchaud) Our Expectation: Heavy-tailed structure means there is less capacity/entropy available for integrals, which will affect generalization properties non-trivially Multi-class classification is very different than binary classification Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 16 / 98
  • 17. Student Teacher: Recent Practical Application “Similarity of Neural Network Representations Revisited” Kornblith, Norouzi, Lee, Hinton; https://arxiv.org/abs/1905.00414 Examined different Weight matrix similarity metrics Best method: Canonical Correlation Analysis (CCA): Y† X 2 F Figure: Diagnostic Tool for both individual and comparative DNNs Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 17 / 98
  • 18. Student Teacher: Recent Generalization vs. Memorization “Insights on representational similarity in neural networks with canonical correlation” Morcos, Raghu, Bengio; https://arxiv.org/pdf/1806.05759.pdf Compare NN representations and how they evolve during training Projection weighted Canonical Correlation Analysis (PWCCA) Figure: Generalizing networks converge to more similar solutions than memorizing networks. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 18 / 98
  • 19. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 20. Motivations: Theoretical AND Practical Theoretical: deeper insight into Why Deep Learning Works? convex versus non-convex optimization? explicit/implicit regularization? is / why is / when is deep better? VC theory versus Statistical Mechanics theory? . . . Practical: use insights to improve engineering of DNNs? when is a network fully optimized? can we use labels and/or domain knowledge more efficiently? large batch versus small batch in optimization? designing better ensembles? . . . Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 20 / 98
  • 21. Motivations: towards a Theory of Deep Learning DNNs as spin glasses, Choromanska et al. 2015 Looks exactly like old protein folding results (late 90s) Energy Landscape Theory Completely different picture of DNNs Raises broad questions about Why Deep Learning Works Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 21 / 98
  • 22. Motivations: regularization in DNNs? ICLR 2017 Best paper Large neural network models can easily overtrain/overfit on randomly labeled data Popular ways to regularize (basically minx f (x) + λg(x), with “control parameter” λ) may or may not help. Understanding deep learning requires rethinking generalization?? https://arxiv.org/abs/1611.03530 Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior!! https://arxiv.org/abs/1710.09553 (Martin & Mahoney) Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 22 / 98
  • 23. Motivations: stochastic optimization DNNs? Theory (from convex problems): First order (SGD, e.g., Bottou 2010) larger “batches” are better (at least up to statistical noise) Second order (SSN, e.g., Roosta and Mahoney 2016) larger “batches” are better (at least up to statistical noise) Large batch sizes have better computational properties! So, people just increase batch size (and compensate with other parameters) Practice (from non-convex problems): SGD-like methods “saturate” (https://arxiv.org/abs/1811.12941) SSN-like methods “saturate” (https://arxiv.org/abs/1903.06237) Small batch sizes have better statistical properties! Is batch size a computational parameter, or a statistical parameter, or what? How should batch size be chosen? Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 23 / 98
  • 24. Set up: the Energy Landscape Energy/Optimization function: EDNN = hL(WL × hL−1(WL−1 × hL−2(· · · ) + bL−1) + bL) Train this on labeled data {di , yi } ∈ D, using Backprop, by minimizing loss L: min Wl ,bl L i EDNN(di ) − yi EDNN is “the” Energy Landscape: The part of the optimization problem parameterized by the heretofore unknown elements of the weight matrices and bias vectors, and as defined by the data {di , yi } ∈ D Pass the data through the Energy function EDNN multiple times, as we run Backprop training The Energy Landscape§ is changing at each epoch § i.e., the optimization function that is nominally being optimized Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 24 / 98
  • 25. Problem: How can this possibly work? Expected Highly non-convex? Observed Apparently not! It has been known for a long time that local minima are not the issue. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 25 / 98
  • 26. Problem: Local Minima? Duda, Hart and Stork, 2000 Solution: add more capacity and regularize, i.e., over-parameterization Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 26 / 98
  • 27. Motivations: what is regularization? (a) Dropout. (b) Early Stopping. (c) Batch Size. (d) Noisify Data. Every adjustable knob and switch—and there are many¶—is regularization. ¶ https://arxiv.org/pdf/1710.10686.pdf Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 27 / 98
  • 28. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 29. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 30. Basics of Regularization Ridge Regression / Tikhonov-Phillips Regularization ˆWx = y x = ˆWT ˆW + αI −1 ˆWT y Moore-Penrose pseudoinverse (1955) Ridge regularization (Phillips, 1962) min x ˆWx − y 2 2 + α ˆx 2 2 familiar optimization problem Softens the rank of ˆW to focus on large eigenvalues. Related to Truncated SVD, which does hard truncation on rank of ˆW Early stopping, truncated random walks, etc. often implicitly solve regularized optimiation problems. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 30 / 98
  • 31. How we will study regularization The Energy Landscape is determined by layer weight matrices WL: EDNN = hL(WL × hL−1(WL−1 × hL−2(· · · ) + bL−1) + bL) Traditional regularization is applied to WL: min Wl ,bl L i EDNN(di ) − yi + α l Wl Different types of regularization, e.g., different norms · , leave different empirical signatures on WL. What we do: Turn off “all” regularization. Systematically turn it back on, explicitly with α or implicitly with knobs/switches. Study empirical properties of WL. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 31 / 98
  • 32. Lots of DNNs Analyzed Question: What happens to the layer weight matrices WL? (Don’t evaluate your method on one/two/three NN, evaluate it on a dozen/hundred.) Retrained LeNet5 on MINST using Keras. Two other small models: 3-Layer MLP Mini AlexNet Conv2D MaxPool Conv2D MaxPool FC1 FC2 FC Wide range of state-of-the-art pre-trained models: AlexNet, Inception, etc. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 32 / 98
  • 33. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 34. Matrix complexity: Matrix Entropy and Stable Rank W = UΣVT νi = Σii pi = ν2 i / i ν2 i S(W) = −1 log(R(W)) i pi log pi Rs(W) = W 2 F W 2 2 = i ν2 i ν2 max A warm-up: train a 3-Layer MLP: (e) MLP3 Entropies. (f) MLP3 Stable Ranks. Figure: Matrix Entropy & Stable Rank show transition during Backprop training. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 34 / 98
  • 35. Matrix complexity: Scree Plots W = UΣVT νi = Σii pi = ν2 i / i ν2 i S(W) = −1 log(R(W)) i pi log pi Rs(W) = W 2 F W 2 2 = i ν2 i ν2 max A warm-up: train a 3-Layer MLP: (a) Initial Scree Plot. (b) Final Scree Plot. Figure: Scree plots for initial and final configurations for MLP3. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 35 / 98
  • 36. Matrix complexity: Singular/Eigen Value Densities W = UΣVT νi = Σii pi = ν2 i / i ν2 i S(W) = −1 log(R(W)) i pi log pi Rs(W) = W 2 F W 2 2 = i ν2 i ν2 max A warm-up: train a 3-Layer MLP: (a) Singular val. density (b) Eigenvalue density Figure: Histograms of the Singular Values νi and associated Eigenvalues λi = ν2 i . Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 36 / 98
  • 37. ESD: detailed insight into WL Empirical Spectral Density (ESD: eigenvalues of X = WT L WL) import keras import numpy as np import matplotlib.pyplot as plt … W = model.layers[i].get_weights()[0] … X = np.dot(W, W.T) evals, evecs = np.linalg.eig(W, W.T) plt.hist(X, bin=100, density=True) Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 37 / 98
  • 38. ESD: detailed insight into WL Empirical Spectral Density (ESD: eigenvalues of X = WT L WL) Eopch 0: Random Matrix Eopch 36: Random + Spiles Entropy decrease corresponds to: modification (later, breakdown) of random structure and onset of a new kind of self-regularization. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 38 / 98
  • 39. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 40. Random Matrix Theory 101: Wigner and Tracy-Widom Wigner: global bulk statistics approach universal semi-circular form Tracy-Widom: local edge statistics fluctuate in universal way Problems with Wigner and Tracy-Widom: Weight matrices usually not square Typically do only a single training run Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 40 / 98
  • 41. Random Matrix Theory 102: Marchenko-Pastur Let W be an N × M random matrix, with elements Wij ∼ N(0, σ2 mp). Then, the ESD of X = WT W, converges to a deterministic function: ρN(λ) := 1 N M i=1 δ (λ − λi ) N→∞ −−−−→ Q fixed    Q 2πσ2 mp (λ+ − λ)(λ − λ−) λ if λ ∈ [λ−, λ+] 0 otherwise. with well-defined edges (which depend on Q, the aspect ratio): λ± = σ2 mp 1 ± 1 √ Q 2 Q = N/M ≥ 1. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 41 / 98
  • 42. Random Matrix Theory 102’: Marchenko-Pastur (a) Vary aspect ratios (b) Vary variance parameters Figure: Marchenko-Pastur (MP) distributions. Important points: Global bulk stats: The overall shape is deterministic, fixed by Q and σ. Local edge stats: The edge λ+ is very crisp, i.e., ∆λM = |λmax − λ+ | ∼ O(M−2/3 ), plus Tracy-Widom fluctuations. We use both global bulk statistics as well as local edge statistics in our theory. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 42 / 98
  • 43. Random Matrix Theory 103: Heavy-tailed RMT Go beyond the (relatively easy) Gaussian Universality class: model strongly-correlated systems (“signal”) with heavy-tailed random matrices. Generative Model w/ elements from Universality class Finite-N Global shape ρN (λ) Limiting Global shape ρ(λ), N → ∞ Bulk edge Local stats λ ≈ λ+ (far) Tail Local stats λ ≈ λmax Basic MP Gaussian MP distribution MP TW No tail. Spiked- Covariance Gaussian, + low-rank perturbations MP + Gaussian spikes MP TW Gaussian Heavy tail, 4 < µ (Weakly) Heavy-Tailed MP + PL tail MP Heavy-Tailed∗ Heavy-Tailed∗ Heavy tail, 2 < µ < 4 (Moderately) Heavy-Tailed (or “fat tailed”) PL∗∗ ∼ λ−(aµ+b) PL ∼ λ −( 1 2 µ+1) No edge. Frechet Heavy tail, 0 < µ < 2 (Very) Heavy-Tailed PL∗∗ ∼ λ −( 1 2 µ+1) PL ∼ λ −( 1 2 µ+1) No edge. Frechet Basic MP theory, and the spiked and Heavy-Tailed extensions we use, including known, empirically-observed, and conjectured relations between them. Boxes marked “∗ ” are best described as following “TW with large finite size corrections” that are likely Heavy-Tailed, leading to bulk edge statistics and far tail statistics that are indistinguishable. Boxes marked “∗∗ ” are phenomenological fits, describing large (2 < µ < 4) or small (0 < µ < 2) finite-size corrections on N → ∞ behavior.
  • 44. Fitting Heavy-tailed Distributions Figure: The log-log histogram plots of the ESD for three Heavy-Tailed random matrices M with same aspect ratio Q = 3, with µ = 1.0, 3.0, 5.0, corresponding to the three Heavy-Tailed Universality classes (0 < µ < 2 vs 2 < µ < 4 and 4 < µ). Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 44 / 98
  • 45. Non-negligible finite size effects (a) M = 1000, N = 2000. (b) Fixed M. (c) Fixed N. Figure: Dependence of α (the fitted PL parameter) on µ (the hypothesized limiting PL parameter). Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 45 / 98
  • 46. Heavy Tails (!) and Heavy-Tailed Universality (?) Universality: large-scale properties are independent of small-scale details Mathematicians: justify proving theorems in random matrix theory Physicists: derive new phenomenological relations and predict things Gaussian Universality is most common, but there are many other types. Heavy-Tailed Phenomenon Rare events are not extraordinarily rare, i.e., are heavier than Gaussian tails Modeled with power law and related functions Seen in finance, structural glass theory, etc. Heavy-Tailed Random Matrix Theory Phenomenological work by physicists (Bouchard, Potters, Sornette, 90s) Theorem proving by mathematicians (Auffinger, Ben Arous, Burda, Peche, 00s) Universality of Power Laws, Levy-based dynamics, finite-size attractors, etc. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 46 / 98
  • 47. Heavy-Tailed Universality: Earthquake prediction “Complex Critical Exponents from Renormalization Group Theory of Earthquakes . . . ” Sornette et al. (1985) Power law fit of the regional strain (a measure of seismic release) before the critical time tc (of the earthquake) d dt = A + B(t − tc )m (a) (b) Figure: (a) Cumulative Beniolf strain released by magnitude 5 and greater earthquakes in the San Francisco Bay area prior to the 1989 Loma Prieta eaerthquake. (b) Fit of Power Law exponent (m). More sophisticated Renormalization Group (RG) analysys uses complex critical exponents, giving log-peripdic corrections. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 47 / 98
  • 48. Heavy-Tailed Universality: Market Crashes “Why Stock Markets Crash: Critical Events in Complex Financial Systems” by D. Sornette (book, 2003) Simple Power Law log p(t) = A + B(t − tc )β Complex Power Law (RG Log Periodic corrections) log p(t) = A + B(t − tc )β + C(t − tc )β (cos(ωlog(t − tc ) − φ) (a) Dow Jones 1929 crash (b) Universal parameters, fit to RG model Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 48 / 98
  • 49. Heavy-Tailed Universality: Neuronal Avalanches Neuronal avalanche dynamics indicates different universality classes in neuronal cultures; Scienfic Reports 3417 (2018) (c) Spiking activity of cultured neurons (d) Critical exponents, fit to scalaing model Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 49 / 98
  • 50. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 51. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 52. Experiments: just apply this to pre-trained models https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-... Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 52 / 98
  • 53. Experiments: just apply this to pre-trained models Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 53 / 98
  • 54. RMT: LeNet5 (an old/small NN example) Figure: Full and zoomed-in ESD for LeNet5, Layer FC1. Marchenko-Pastur Bulk + Spikes Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 54 / 98
  • 55. RMT: AlexNet (a typical modern/large DNN example) Figure: Zoomed-in ESD for Layer FC1 and FC3 of AlexNet. Marchenko-Pastur Bulk-decay + Heavy-tailed Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 55 / 98
  • 56. RMT: InceptionV3 (a particularly unusual example) Figure: ESD for Layers L226 and L302 in InceptionV3, as distributed w/ pyTorch. Marchenko-Pastur bulk decay, onset of Heavy Tails Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 56 / 98
  • 57. Convolutional 2D Feature Maps We analyze Conv2D layers by extracting the feature maps individually, i.e., A (3 × 3 × 64 × 64) Conv2D layer yields 9 (64 × 64) Feature Maps (a) α = 1.38 (b) α = 2.74 (c) α = 3.02 Figure: Select Feature Maps from different Conv2D layers of VGG16. Fits shown with PDF (blue) and CDF (red) Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 57 / 98
  • 58. Open AI GPT2 Attention Matrices NLP Embedding and Attention Matrices are Dense/Linear, but generally have large aspect ratios (a) α = (b) α = (c) α = Figure: Selected ESDs from Open AI GPT2 (Huggingface implementation) Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 58 / 98
  • 59. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 60. RMT-based 5+1 Phases of Training (a) Random-like. (b) Bleeding-out. (c) Bulk+Spikes. (d) Bulk-decay. (e) Heavy-Tailed. (f) Rank-collapse. Figure: The 5+1 phases of learning we identified in DNN training. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 60 / 98
  • 61. RMT-based 5+1 Phases of Training We model “noise” and also “signal” with random matrices: W ≃ W_rand + ∆_sig. (1)
For each phase: operational definition; informal description via Eqn. (1); edge/tail fluctuation comments; illustration.
Random-like: ESD well-fit by MP with appropriate λ+; W_rand random, ∆_sig zero or small; λmax ≈ λ+ is sharp, with TW statistics. Fig. 15(a)
Bleeding-out: ESD Random-like, excluding eigenmass just above λ+; W has eigenmass at bulk edge as spikes “pull out”, ∆_sig medium; BPP transition, λmax and λ+ separate. Fig. 15(b)
Bulk+Spikes: ESD Random-like plus ≥ 1 spikes well above λ+; W_rand well-separated from low-rank ∆_sig, ∆_sig larger; λ+ is TW, λmax is Gaussian. Fig. 15(c)
Bulk-decay: ESD less Random-like, Heavy-Tailed eigenmass above λ+, some spikes; complex ∆_sig with correlations that don’t fully enter spike; edge above λ+ is not concave. Fig. 15(d)
Heavy-Tailed: ESD better-described by Heavy-Tailed RMT than Gaussian RMT; W_rand is small, ∆_sig is large and strongly-correlated; no good λ+, λmax ≫ λ+. Fig. 15(e)
Rank-collapse: ESD has large-mass spike at λ = 0; W very rank-deficient, over-regularization. Fig. 15(f)
The 5+1 phases of learning we identified in DNN training.
  • 62. RMT-based 5+1 Phases of Training Lots of technical issues ... Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 62 / 98
  • 63. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 64. Bulk+Spikes: Small Models ∼ Tikhonov regularization
Low-rank perturbation: W_l ≃ W_rand,l + ∆_large
Perturbative correction: λmax = σ² (1/Q + |∆|²/N)(1 + N/|∆|²), valid for |∆| > (Q)^(−1/4)
λ+ acts as a simple scale threshold: x = (X̂ + αI)^(−1) Ŵᵀ y; eigenvalues > α (Spikes) carry most of the signal/information
Bulk → Spikes
Smaller, older models like LeNet5 exhibit traditional regularization and can be described perturbatively with Gaussian RMT (see the numerical sketch below).
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 64 / 98
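A minimal numerical sketch of the Bulk+Spikes picture under Eqn. (1), using a rank-one perturbation of a Gaussian W_rand: as the perturbation strength grows, λmax detaches from the MP bulk edge λ+. The matrix sizes and perturbation strengths below are illustrative assumptions, not values from the tutorial.
# Sketch: rank-one perturbation of a Gaussian random matrix. When the perturbation
# is strong enough, one eigenvalue (the "spike") pulls out above the MP bulk edge
# lambda_+ = sigma^2 (1 + 1/sqrt(Q))^2.
import numpy as np

N, M, sigma = 4000, 1000, 1.0
Q = N / M
W_rand = np.random.normal(0, sigma, size=(N, M))

u = np.random.normal(size=(N, 1)) / np.sqrt(N)   # unit-norm vectors
v = np.random.normal(size=(M, 1)) / np.sqrt(M)

lam_plus = sigma**2 * (1 + 1 / np.sqrt(Q))**2
for strength in [0.0, 20.0, 80.0]:               # illustrative perturbation strengths
    W = W_rand + strength * (u @ v.T)            # W = W_rand + Delta_sig
    eigs = np.linalg.eigvalsh(W.T @ W / N)
    print(f"|Delta| ~ {strength:5.1f}: lambda_max = {eigs.max():.3f}  "
          f"(MP bulk edge lambda_+ = {lam_plus:.3f})")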
  • 65. Heavy-tailed Self-regularization W is strongly-correlated and highly non-random We model strongly-correlated systems by heavy-tailed random matrices I.e., we model signal (not noise) by heavy-tailed random matrices Then the RMT/MP ESD will also have heavy tails Known results from RMT / polymer theory (Bouchaud, Potters, etc.) AlexNet ResNet50 InceptionV3 DenseNet201 ... “All” larger, modern DNNs exhibit novel Heavy-tailed self-regularization Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 65 / 98
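For a concrete example of the heavy tails seen in modern DNNs, here is a minimal sketch, assuming torchvision and the open source powerlaw package, that fits a power-law exponent α to the ESD of a pretrained AlexNet fully-connected layer; the layer choice and the fitting details are assumptions about the workflow, not the authors' exact code.
# Sketch: fit a heavy-tail (power-law) exponent alpha to the ESD of a pretrained
# AlexNet fully-connected layer. Assumes torchvision and the powerlaw package.
import numpy as np
import powerlaw
import torchvision.models as models

model = models.alexnet(pretrained=True)
W = model.classifier[6].weight.data.numpy()   # FC3 weight matrix, shape (1000, 4096)
if W.shape[0] < W.shape[1]:
    W = W.T                                   # ensure N >= M, so Q = N/M >= 1
N, M = W.shape
eigs = np.linalg.eigvalsh(W.T @ W / N)        # ESD of the layer correlation matrix
fit = powerlaw.Fit(eigs[eigs > 0])            # MLE fit of the tail exponent
print(f"AlexNet FC3: fitted power-law exponent alpha = {fit.power_law.alpha:.2f}")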
  • 66. Heavy-tailed Self-regularization Summary of what we “suspect” today No single scale threshold. No simple low-rank approximation for W_L. Contributions from correlations at all scales. Cannot be treated perturbatively. “All” larger, modern DNNs exhibit novel Heavy-tailed self-regularization Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 66 / 98
  • 67. Spikes: carry more “information” than the Bulk Spikes have less entropy, are more localized than bulk. (a) Vector Entropies. (b) Localization Ratios. (c) Participation Ratios. Figure: Eigenvector localization metrics for the FC1 layer of MiniAlexNet. Information begins to concentrate in the spikes. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 67 / 98
  • 68. Power Law Universality: ImageNet and AllenNLP All these models display remarkable Heavy Tailed Universality Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 68 / 98
  • 69. Power Law Universality: ImageNet 7500 matrices (and Conv2D feature maps) over 50 architectures Linear layers and Conv2D feature maps: 80–90% have power-law exponent α < 5 All these models display remarkable Heavy Tailed Universality Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 69 / 98
  • 70. Power Law Universality: OpenAI GPT versus GPT2 GPT versus GPT2 (Huggingface implementation): an example of a class of models that “improves” over time. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 70 / 98
  • 71. Mechanisms? Spiked-Covariance Model Statistical model: Johnstone, “On the distribution . . . ” 2001. Simple self-organization model: Malevergne and Sornette, “Collective Origin of the Coexistence of Apparent RMT Noise and Factors in Large Sample Correlation Matrices,” 2002. Low-rank perturbative variant of Gaussian randomness modeling noise Heavy-tailed Models: Self-organized criticality (and others ...) Johansen, Sornette, and Ledoit, “Predicting financial crashes using discrete scale invariance,” 1998. Markovic and Gros, “Power laws and self-organized criticality in theory and nature,” 2013. Non-perturbative model where heavy-tails and power laws are used to model strongly correlated systems Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 71 / 98
  • 72. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 73. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 74. Self-regularization: Batch size experiments A theory should make predictions: We predict the existence of 5+1 phases of increasing implicit self-regularization We characterize their properties in terms of HT-RMT Do these phases exist? Can we find them? There are many knobs. Let’s vary one—batch size. Tune the batch size from very large to very small A small (i.e., retrainable) model exhibits all 5+1 phases Large batch sizes => decrease generalization accuracy Large batch sizes => decrease implicit self-regularization Generalization Gap Phenomena: all else being equal, small batch sizes lead to more implicitly self-regularized models. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 74 / 98
  • 75. Batch size and the Generalization Gap Large versus small batches? Larger is better: Convex theory: SGD is closer to gradient descent Implementations: Better parallelism, etc. (But see Golmant et al. (arxiv:1811.12941) and Ma et al. (arxiv:1903.06237) for “inefficiency” of SGD and KFAC.) Smaller is better: Empirical: Hoffer et al. (arXiv:1705.08741) and Keskar et al. (arXiv:1609.04836) Information: Shwartz-Ziv and Tishby (arxiv:1703.00810) (This is like a “supervised” version of our approach.) Connection with weight norms? Older: Bartlett, 1997; Mahoney and Narayanan, 2009. Newer: Liao et al., 2018; Soudry et al., 2017; Poggio et al., 2018; Neyshabur et al., 2014; 2015; 2017a; Bartlett et al., 2017; Yoshida and Miyato, 2017; Kawaguchi et al., 2017; Neyshabur et al., 2017b; Arora et al., 2018b;a; Zhou and Feng, 2018. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 75 / 98
  • 76. Batch Size Tuning: Generalization Gap Figure: Varying batch size: Stable Rank and MP SoftRank for FC1 and FC2, and training and test accuracies versus batch size, for MiniAlexNet. Decreasing batch size leads to better results: it induces strong correlations in W. Increasing batch size leads to worse results: it washes out strong correlations in W. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 76 / 98
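For reference, minimal sketches of the two rank metrics in the figure as we read them: the stable rank ||W||²_F / ||W||²_2, and an MP soft rank taken here as the ratio of the MP bulk edge λ+ to λmax. The exact normalizations used in the figure may differ, so treat these definitions as assumptions.
# Sketch: two rank metrics (definitions as we read them; normalizations may differ).
import numpy as np

def stable_rank(W):
    # ||W||_F^2 / ||W||_2^2: a smooth surrogate for the matrix rank
    svals = np.linalg.svd(W, compute_uv=False)
    return (svals ** 2).sum() / svals[0] ** 2

def mp_soft_rank(W, sigma2=None):
    # Ratio of the MP bulk edge lambda_+ to lambda_max: ~1 for pure noise,
    # decreasing toward 0 as correlations concentrate in spikes.
    if W.shape[0] < W.shape[1]:
        W = W.T
    N, M = W.shape
    Q = N / M
    sigma2 = np.var(W) if sigma2 is None else sigma2
    lam_plus = sigma2 * (1 + 1 / np.sqrt(Q)) ** 2
    lam_max = np.linalg.eigvalsh(W.T @ W / N).max()
    return lam_plus / lam_max

W = np.random.normal(0, 0.02, size=(4096, 1024))
print(f"stable rank = {stable_rank(W):.1f}, MP soft rank = {mp_soft_rank(W):.2f}")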
  • 77. Batch Size Tuning: Generalization Gap (a) Batch Size 500. (b) Batch Size 250. (c) Batch Size 100. (d) Batch Size 32. (e) Batch Size 16. (f) Batch Size 8. (g) Batch Size 4. (h) Batch Size 2. Figure: Varying Batch Size. ESD for Layer FC1 of MiniAlexNet. We exhibit all 5 of the main phases of training by varying only the batch size. Decreasing batch size induces strong correlations in W, leading to a more implicitly-regularized model. Increasing batch size washes out strong correlations in W, leading to a less implicitly-regularized model. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 77 / 98
  • 78. Summary so far applied Random Matrix Theory (RMT) self-regularization ∼ entropy / information decrease 5+1 phases of learning small models ∼ Tikhonov-like regularization modern DNNs ∼ heavy-tailed self-regularization Remarkably ubiquitous How can this be used? Why does deep learning work? Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 78 / 98
  • 79. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 80. Open source tool: weightwatcher https://github.com/CalculatedContent/WeightWatcher A Python tool to analyze weight matrices in Deep Neural Networks. All our results can be reproduced by anyone on a basic laptop using widely available, open source, pretrained models (keras, pytorch, osmr/imgclsmob, huggingface, allennlp, distiller, modelzoo, etc.) and without even needing the test or training data! Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 80 / 98
  • 81. WeightWatcher WeightWatcher is an open source tool to analyze DNNs with our heavy-tailed α and weighted α̂ metrics (and other useful theories) goal: to develop a useful, open source tool supports: Keras, PyTorch, some custom libs (e.g., huggingface) implements: various norm and rank metrics pip install weightwatcher current version: 0.1.2 latest from source: 0.1.3 looking for: users and contributors https://github.com/CalculatedContent/WeightWatcher Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 81 / 98
  • 82. WeightWatcher: Usage
Usage:
import weightwatcher as ww
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze()
watcher.get_summary()
watcher.print_results()
Advanced Usage:
def analyze(self, model=None, layers=[], min_size=50, max_size=0,
            compute_alphas=True, compute_lognorms=True,
            compute_spectralnorms=True, ..., plot=True):
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 82 / 98
  • 83. WeightWatcher: example VGG19_BN
import weightwatcher as ww
import torchvision.models as models
model = models.vgg19_bn(pretrained=True)
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze(compute_alphas=True)
data.append({"name": "vgg19bntorch", "summary": watcher.get_summary()})
Example summary:
{'name': 'vgg19bntorch',
 'summary': {'lognorm': 0.8185, 'lognorm_compound': 0.9365,
             'alpha': 2.9646, 'alpha_compound': 2.8479,
             'alpha_weighted': 1.1588, 'alpha_weighted_compound': 1.5002}}
We also provide a pandas dataframe with detailed results for each layer.
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 83 / 98
  • 84. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 85. Bounding Generalization Error BTW, Bounding Generalization Error = Predicting Test Accuracies A lot of recent interest, e.g.: Bartlett et al. (arxiv:1706.08498): bounds based on ratio of output margin distribution and spectral complexity measure Neyshabur et al. (arxiv:1707.09564, arxiv:1706.08947): bounds based on the product norms of the weights across layers Arora et al. (arxiv:1802.05296): bounds based on compression and noise stability properties Liao et al. (arxiv:1807.09659): normalized cross-entropy measure that correlates well with test loss Jiang et al. (arxiv:1810.00113): measure based on distribution of margins at multiple layers that correlates well with test loss∗∗ These use/develop learning theory bounds and then apply them to training of MNIST/CIFAR10/etc. Question: How do these norm-based metrics perform on state-of-the-art pre-trained models? ∗∗ and released DEMOGEN pretrained models (after our 1901.08278 paper). Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 85 / 98
  • 86. Predicting test accuracies (at scale): Product norms M&M: “Heavy-Tailed Universality Predicts Trends in Test Accuracies ... Pre-Trained ...” https://arxiv.org/abs/1901.08278 The product norm is a VC-like data-dependent capacity metric for DNNs. People prove theorems and then may use it to guide training. But how does it perform on state-of-the-art production-quality models? We can predict trends in the test accuracy in state-of-the-art production-quality models—without peeking at the test data! “pip install weightwatcher” Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 86 / 98
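To make the product-norm baseline concrete, here is a minimal sketch of one such VC-like capacity metric, the log of the product of layer Frobenius norms, accumulated over the Linear and Conv2d layers of a pretrained torchvision model; the layer selection and the specific norm are illustrative assumptions, not the exact definition from any one of the papers above.
# Sketch: a simple product-norm capacity metric, log prod_l ||W_l||_F,
# accumulated over the Linear/Conv2d layers of a pretrained model.
import numpy as np
import torch.nn as nn
import torchvision.models as models

def log_product_frobenius_norm(model):
    log_prod = 0.0
    for layer in model.modules():
        if isinstance(layer, (nn.Linear, nn.Conv2d)):
            W = layer.weight.data.cpu().numpy()
            log_prod += np.log(np.linalg.norm(W))   # Frobenius norm of the layer weights
    return log_prod

model = models.vgg16(pretrained=True)
print(f"log product of Frobenius norms: {log_product_frobenius_norm(model):.2f}")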
  • 87. Universality, capacity control, and norm-powerlaw relations M&M: “Heavy-Tailed Universality Predicts Trends in Test Accuracies ... Pre-Trained ...” https://arxiv.org/abs/1901.08278
“Universality” suggests the power law exponent α would make a good, Universal, DNN capacity control metric.
For a multi-layer NN, consider a weighted average: α̂ = (1/N) Σ_{l,i} b_{l,i} α_{l,i}
To get the weights b_{l,i}, relate the Frobenius norm and the Power Law exponent.
Create a random Heavy-Tailed (Pareto) matrix: Pr[W^rand_{i,j} = x] ∼ 1/x^{1+µ}
Examine norm-powerlaw relations: log ||W||²_F and log λ^max versus α
Argue†† the PL–Norm Relation: α log λ^max ≈ log ||W||²_F.
†† Open problem: make the “heuristic” argument more “rigorous.”
Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 87 / 98
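A rough sketch of the experiment described above: draw Pareto random matrices for a few values of µ, fit α to each ESD, and record log λmax and log ||W||²_F alongside it. This toy does not reproduce the normalizations used in the paper, so it is only for exploring how the quantities co-vary, not a verification of the PL-Norm relation; all parameter values are illustrative assumptions.
# Sketch: examine norm-powerlaw relations on synthetic heavy-tailed (Pareto) matrices:
# record the fitted alpha, log(lambda_max), and log(||W||_F^2) for a few tail exponents mu.
import numpy as np
import powerlaw  # pip install powerlaw

def norm_powerlaw_quantities(N=2000, M=1000, mu=3.0, seed=0):
    rng = np.random.default_rng(seed)
    # Pareto entries with random signs: Pr[|W_ij| = x] ~ 1 / x^(1 + mu)
    W = (rng.pareto(mu, size=(N, M)) + 1) * rng.choice([-1.0, 1.0], size=(N, M))
    eigs = np.linalg.eigvalsh(W.T @ W / N)
    alpha = powerlaw.Fit(eigs[eigs > 0]).power_law.alpha
    return alpha, np.log(eigs.max()), np.log((W ** 2).sum())

for mu in [2.5, 3.0, 3.5, 4.0]:
    alpha, log_lam_max, log_fro2 = norm_powerlaw_quantities(mu=mu)
    print(f"mu={mu}: alpha={alpha:.2f}  log(lambda_max)={log_lam_max:.2f}  "
          f"log||W||_F^2={log_fro2:.2f}")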
  • 88. Predicting test accuracies better: Weighted Power Laws M&M: “Heavy-Tailed Universality Predicts Trends in Test Accuracies ... Pre-Trained ...” https://arxiv.org/abs/1901.08278 Use the weighted PL metric α̂ = (1/N) Σ_{l,i} log(λ^max_{l,i}) α_{l,i} instead of the product norm. We can predict trends in the test accuracy—without peeking at the test data! “pip install weightwatcher” Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 88 / 98
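A minimal sketch of the weighted PL metric defined above, computed from per-layer (α, λmax) pairs; in practice these pairs come from the per-layer weightwatcher results, and the numbers below are purely hypothetical.
# Sketch: weighted power-law metric alpha_hat = (1/N) * sum_i log(lambda_max_i) * alpha_i,
# computed from per-layer (alpha, lambda_max) pairs (illustrative values below).
import numpy as np

def weighted_alpha(alphas, lambda_maxes):
    alphas = np.asarray(alphas, dtype=float)
    lambda_maxes = np.asarray(lambda_maxes, dtype=float)
    return float(np.mean(np.log(lambda_maxes) * alphas))

# Hypothetical per-layer fits for a small model:
alphas = [2.3, 3.1, 2.8, 4.0]
lambda_maxes = [14.2, 9.7, 22.5, 6.1]
print(f"alpha_hat = {weighted_alpha(alphas, lambda_maxes):.3f}")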
  • 89. Predicting test accuracies better: Distilled Models M&M: “Heavy-Tailed Universality Predicts Trends in Test Accuracies ... Pre-Trained ...” https://arxiv.org/abs/1901.08278 Question: Is the weighted PL metric simply a repackaging of the product norm? Answer: No! For some Intel Distiller models, the Spectral Norm behaves atypically, and α does not change noticeably (a) PL Exponents (α) (b) Spectral Norm (λmax) Figure: (a) PL exponents α and (b) Spectral Norm (λmax) for each layer of ResNet20, with the Intel Distiller Group Regularization method applied (before and after). Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 89 / 98
  • 90. WeightWatcher: ESDs of GANs GANs also display Heavy Tailed ESDs; however: There are more exceptions Many ESDs appear to display significant rank collapse and only weak correlations (a) Histogram of αs (b) Anomalous ESD. Figure: (a) Distribution of all power law exponents α for DeepMind’s BigGAN (Huggingface implementation). (b) Example of anomalous ESD. Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 90 / 98
  • 91. Summary of “Validating and Using” Some things so far: We can: explain the generalization gap. We can: “pip install weightwatcher” and use this open source tool. We can: predict trends in test accuracies on production-quality models. Some things we are starting to look at: Better metrics for monitoring and/or improving training. Better metrics for robustness to, e.g., adversarial perturbation, model compression, etc., that don’t involve looking at data. Better phenomenological load-like and temperature-like metrics to guide data collection, algorithm/parameter/hyperparameter selection, etc. What else? Join us: “pip install weightwatcher” and contribute to the repo.‡‡ ‡‡ Don’t do everything from scratch in a non-reproducible way. Make it reproducible! Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 91 / 98
  • 92. Outline 1 Prehistory and History Older Background A Very Simple Deep Learning Model More Immediate Background 2 Preliminary Results Regularization and the Energy Landscape Preliminary Empirical Results Gaussian and Heavy-tailed Random Matrix Theory 3 Developing a Theory for Deep Learning More Detailed Empirical Results An RMT-based Theory for Deep Learning Tikhonov Regularization versus Heavy-tailed Regularization 4 Validating and Using the Theory Varying the Batch Size: Explaining the Generalization Gap Using the Theory: pip install weightwatcher Diagnostics at Scale: Predicting Test Accuracies 5 More General Implications and Conclusions
  • 93. Implications: RMT and Deep Learning Where are the local minima? How does the Hessian behave? Are simpler models misleading? Can we design better learning strategies? (tradeoff between Energy and Entropy minimization) How can RMT be used to understand the Energy Landscape? Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 93 / 98
  • 94. Implications: Minimizing Frustration and Energy Funnels As simple as can be?, Wolynes, 1997 Energy Landscape Theory: “random heteropolymer” versus “natural protein” folding Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 94 / 98
  • 95. Implications: The Spin Glass of Minimal Frustration https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/ low lying Energy state in Spin Glass ∼ spikes in RMT Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 95 / 98
  • 96. Implications: Rugged Energy Landscapes of Heavy-tailed Models Martin and Mahoney https://arxiv.org/abs/1710.09553 Spin Glasses with Heavy Tails? Local minima do not concentrate near the ground state (Cizeau and Bouchaud 1993) Configuration space with a “rugged convexity” Contrast with (Gaussian) Spin Glass model of Choromanska et al. 2015 If Energy Landscape is ruggedly funneled, then no “problems” with local minima! Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 96 / 98
  • 97. Conclusions: “pip install weightwatcher” Statistical mechanics and neural networks Long history Revisit due to recent “crisis” in why deep learning works Use ideas from statistical mechanics of strongly-correlated systems Develop a theory that is designed to be used Main Empirical/Theoretical Results Use Heavy-tailed RMT to construct an operational theory of DNN learning Evaluate effect of implicit versus explicit regularization Exhibit all 5+1 phases by adjusting batch size: explain the generalization gap Methodology: Observations → Hypotheses → Build a Theory → Test the Theory. Many Implications: Explain the generalization gap Rationalize claims about rugged convexity of Energy Landscape Predict test accuracies in state-of-the-art models “pip install weightwatcher” Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 97 / 98
  • 98. If you want more ... “pip install weightwatcher” ... Background paper: Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior (https://arxiv.org/abs/1710.09553) Main paper (full): Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning (https://arxiv.org/abs/1810.01075) Code: https://github.com/CalculatedContent/ImplicitSelfRegularization Main paper (abridged): Traditional and Heavy-Tailed Self Regularization in Neural Network Models (https://arxiv.org/abs/1901.08276) Code: https://github.com/CalculatedContent/ImplicitSelfRegularization Applying the theory paper: Heavy-Tailed Universality Predicts Trends in Test Accuracies for Very Large Pre-Trained Deep Neural Networks (https://arxiv.org/abs/1901.08278) Code: https://github.com/CalculatedContent/PredictingTestAccuracies https://github.com/CalculatedContent/WeightWatcher “pip install weightwatcher” Martin and Mahoney (CC & ICSI/UCB) Statistical Mechanics Methods August 2019 98 / 98