"The statistical physics of learning revisted: Phase transitions in layered neural networks" Physics Colloquium at the University of Leipzig/Germany, June 29, 2021 24 slides, ca 45 minutes


- 1. Title: "The statistical physics of learning revisited: Phase transitions in layered neural networks". Elisa Oostwal, Michiel Straat, Michael Biehl; Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence, University of Groningen, NL. Based on: "Hidden unit specialization in layered neural networks: ReLU vs. sigmoidal activation", Physica A Vol. 564, 2021, 125517 (open access).
- 2. The revival of neural networks. The success of multi-layered neural networks (Deep Learning) rests on:
  • availability of large amounts of training data
  • increased computational power
  • improved training procedures and set-ups
  • task-specific network designs, e.g. activation functions
  Yet many open questions remain, and there is a lack of theoretical understanding.
- 3. Statistical physics of learning: a successful branch of learning theory. Statistical physics of neural networks:
  • training of feed-forward neural networks: Elizabeth Gardner (1957-1988), "The space of interactions in neural networks", J. Phys. A 21:257-270, 1988
  • dynamics of attractor neural networks: John Hopfield, "Neural networks and physical systems with emergent collective computational abilities", PNAS 79(8):2554-2558, 1982
  (illustrations on the slide: 1991, 2001, 2011)
- 4. Statistical physics of learning (illustration-only slide).
- 5. Example: a shallow neural network, the soft committee machine: N input units (high-dimensional input), K hidden units with activation g, and a linear output. Its input/output function is defined by:
  • architecture, connectivity, activation functions
  • adaptive weights
  The task is regression, i.e. learning a target function from example data (the network function is sketched below).
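
For concreteness, the soft committee machine in its standard form, as used throughout this literature (a sketch; the normalization of the output sum varies between papers):

```latex
% soft committee machine: fixed hidden-to-output weights, adaptive w_k
\sigma(\boldsymbol{\xi}) \;=\; \sum_{k=1}^{K} g\!\left(\mathbf{w}_k \cdot \boldsymbol{\xi}\right),
\qquad \boldsymbol{\xi},\, \mathbf{w}_k \in \mathbb{R}^{N},
```

with, e.g., sigmoidal g(x) = erf(x/√2) or ReLU g(x) = max(0, x).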
- 6. Statistical physics of learning in a nutshell: training is viewed as stochastic optimization of all adaptive weights with respect to an objective/cost/energy function, e.g. by the Metropolis algorithm or noisy gradient descent (Langevin), with Gibbs-Boltzmann equilibrium P_eq and the "inverse temperature" β = 1/T as control parameter. The equilibrium state reflects a compromise/competition between minimal energy (ground state) and the number (volume) of available states with higher energy, i.e. it is given by the minima of the free energy; macroscopic properties are obtained as "thermal averages" over P_eq, and the microcanonical entropy measures the volume of states at a given energy (standard relations sketched below).
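
The textbook relations behind this slide (standard statistical mechanics, not specific to the talk):

```latex
% Gibbs-Boltzmann equilibrium at inverse temperature beta = 1/T
P_{\mathrm{eq}}(\mathbf{w}) \;=\; \frac{e^{-\beta E(\mathbf{w})}}{Z},
\qquad
Z \;=\; \int \! d\mathbf{w}\; e^{-\beta E(\mathbf{w})},
\qquad
-\beta F \;=\; \ln Z,
\\[6pt]
% microcanonical entropy: log-volume of weight configurations with energy E
s(E) \;=\; \ln \int \! d\mathbf{w}\; \delta\big(E(\mathbf{w}) - E\big).
```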
- 7. Machine learning specifics:
  • the energy function is defined w.r.t. a specific set of example data, which plays the role of frozen disorder
  • typical properties require an additional average of the free energy over the data, which is difficult even for the simplest model densities; here: independent identically distributed (i.i.d.) examples from an unstructured input density
  • the disorder-average of the free energy requires (e.g.) the replica trick (see below)
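
The replica trick mentioned here is the standard identity for the quenched average of ln Z over the example data D:

```latex
% quenched disorder average via the replica identity
\left\langle \ln Z \right\rangle_{D}
\;=\; \lim_{n \to 0} \frac{\left\langle Z^{n} \right\rangle_{D} - 1}{n},
```

where ⟨Z^n⟩_D is computed for integer n (n interacting replicas of the system) and then continued to n → 0.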
- 8. Machine learning at high temperatures:
  • a simplifying limit: high (formal) training temperature combined with a large number of examples, their appropriately scaled product kept finite: "learn almost nothing (high T) ... from very many examples"
  • for independent i.i.d. examples, the averaged training energy reduces to a term proportional to the generalization error (sketch below)
  Limitations:
  - training error and generalization error cannot be distinguished
  - number of examples and training temperature are coupled
  - (at best) qualitative agreement with low-temperature results
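
A sketch of this limit as it is usually taken in the statistical physics of learning (the precise scaling conventions are an assumption here): with P = αN examples, let β → 0 and α → ∞ at fixed α̃ = βα; then

```latex
% high-temperature limit: training energy replaced by the generalization error
\beta f \;\simeq\; \tilde{\alpha}\,\varepsilon_g(\{\text{order parameters}\})
\;-\; s(\{\text{order parameters}\}),
\qquad \tilde{\alpha} \;=\; \beta\,\alpha,
```

so that the free energy per weight is minimized over a few macroscopic order parameters, and quenched and annealed averages coincide in this limit.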
- 9. Modelling: the student-teacher scenario. An adaptive student network (N inputs, K hidden units) is trained to reproduce the outputs provided by a teacher network with M hidden units; training corresponds to the minimization of the training energy over the example data. Here: learnable rules and reliable data (outputs provided by the teacher), perfectly matching complexity K = M, and two prototypical activation functions, sigmoidal and ReLU, in both student and teacher (setup sketched below).
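
A sketch of the setup consistent with the slides; the quadratic per-example error is the standard choice in this literature and an assumption here:

```latex
% teacher with fixed weight vectors B_m; student sigma as on slide 5
\tau(\boldsymbol{\xi}) \;=\; \sum_{m=1}^{M} g\!\left(\mathbf{B}_m \cdot \boldsymbol{\xi}\right),
\qquad
% training energy over P examples labelled by the teacher
E \;=\; \frac{1}{2}\sum_{\mu=1}^{P}\left[\sigma(\boldsymbol{\xi}^{\mu}) - \tau(\boldsymbol{\xi}^{\mu})\right]^{2}.
```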
- 10. Thermodynamic limit and CLT: for large N, by the Central Limit Theorem the hidden unit fields become normally distributed with zero mean and a covariance matrix given by a small number of order parameters. These macroscopic quantities, rather than the microscopic model parameters, determine the properties of the system; the entropy is (+ constant) independent of details such as the activation function (definitions sketched below).
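
The standard definitions (a sketch; normalization conventions, e.g. factors of 1/N, vary between papers): for i.i.d. inputs with zero mean and unit component variance,

```latex
% student and teacher fields: jointly Gaussian for N -> infinity (CLT)
x_i = \mathbf{w}_i \cdot \boldsymbol{\xi}, \qquad y_n = \mathbf{B}_n \cdot \boldsymbol{\xi},
\\[6pt]
% zero means; covariances given by the macroscopic order parameters
\langle x_i x_k \rangle = Q_{ik} = \mathbf{w}_i \cdot \mathbf{w}_k, \qquad
\langle x_i y_n \rangle = R_{in} = \mathbf{w}_i \cdot \mathbf{B}_n, \qquad
\langle y_n y_m \rangle = T_{nm} = \mathbf{B}_n \cdot \mathbf{B}_m .
```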
- 11. Generalization error: ε_g is the deviation between student and teacher output on average over the joint Gaussian density of the student and teacher fields; it can be evaluated in closed form both for sigmoidal activation [D. Saad, S. Solla, 1995] and for rectified linear units [M. Straat, 2019] (the sigmoidal result is sketched below).
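
The definition and, for g(x) = erf(x/√2), the closed form first derived by Saad and Solla (quoted from the literature as a sketch; prefactors depend on the output normalization):

```latex
% generalization error: average quadratic deviation over the input density
\varepsilon_g \;=\; \frac{1}{2}\,\Big\langle \left[\sigma(\boldsymbol{\xi}) - \tau(\boldsymbol{\xi})\right]^{2} \Big\rangle,
\\[6pt]
% for sigmoidal g(x) = erf(x / sqrt(2)):
\varepsilon_g \;=\; \frac{1}{\pi}\Bigg[
\sum_{i,k} \arcsin\frac{Q_{ik}}{\sqrt{(1+Q_{ii})(1+Q_{kk})}}
+ \sum_{n,m} \arcsin\frac{T_{nm}}{\sqrt{(1+T_{nn})(1+T_{mm})}}
- 2\sum_{i,n} \arcsin\frac{R_{in}}{\sqrt{(1+Q_{ii})(1+T_{nn})}}
\Bigg].
```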
- 12. Site-symmetry simplification: for orthonormal teacher vectors and an isotropic input density, the order parameters can be restricted to a site-symmetric ansatz; it reflects the permutation symmetry of the hidden units while still allowing for hidden unit specialization (sketched below). On this ansatz, ε_g was evaluated for sigmoidal hidden units and for ReLU activations, together with the entropy (+ constant).
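
The usual form of such an ansatz (an assumption consistent with the slide; the paper may additionally fix the norms Q_ii):

```latex
% orthonormal teacher: T_nm = delta_nm; site-symmetric student overlaps
R_{in} \;=\; R\,\delta_{in} + S\,(1-\delta_{in}),
\qquad
Q_{ik} \;=\; Q\,\delta_{ik} + C\,(1-\delta_{ik}),
```

where specialization of student unit i towards teacher unit n = i corresponds to R > S, and anti-specialization to S > R.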
- 13. Typical learning curves: given the scaled size α of the training data set, K, and g(z), determine the (global and local) minima of the free energy; solving the resulting conditions yields the order parameters and the generalization error as functions of the (scaled) training set size, i.e. the typical learning curves (sketched below).
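
In the high-temperature formulation sketched earlier, this amounts to (same assumptions as above):

```latex
% learning curves: minimize beta*f over the site-symmetric order parameters
\big(R, S, Q, C\big)(\tilde{\alpha})
\;=\; \arg\min\Big[\, \tilde{\alpha}\,\varepsilon_g(R,S,Q,C) - s(R,S,Q,C) \,\Big],
\qquad
\varepsilon_g(\tilde{\alpha}) \;=\; \varepsilon_g\big(R(\tilde{\alpha}), S(\tilde{\alpha}), Q(\tilde{\alpha}), C(\tilde{\alpha})\big),
```

tracking global and local minima separately in order to detect phase transitions and metastable states.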
- 14. Sigmoidal (K = 2): invariance under the exchange of the two hidden units; in the unspecialized R = S state, both student units align with (w1* + w2*) plus noise. A symmetry-breaking phase transition (second order, continuous) results in a kink in the typical learning curve.
- 15. ReLU (K = 2): qualitatively identical behavior. Note: the precise numerical values are irrelevant; the scale depends, among other things, on the pre-factor of g(z). [Physica A Vol. 564, 2021, 125517]
- 16. Sigmoidal (K > 2), shown for K = 5: the permutation symmetry of the hidden units produces an initial R = S phase; a local minimum with R > S competes with R = S and eventually becomes the global minimum, which facilitates perfect learning. The result is a first-order transition with a discontinuous jump in ε_g and coexistence of poor and good generalization. An additional "anti-specialized" solution with S > R exists (overlooked in 1998!), with weak or no effect on the generalization error.
- 17. ReLU (K > 2), shown for K = 10: again an initial R = S phase due to the permutation symmetry of the hidden units, followed by a continuous phase transition and a kink in ε_g; competing minima of poor* and good generalization, with the global minimum at R > S and a local minimum at R < S (* in fact pretty good). [Physica A Vol. 564, 2021, 125517]
- 18. ReLU (large K): after the initial R = S phase, a continuous phase transition leads to degenerate minima with R > S and R < S; both the specialized and the anti-specialized branch achieve perfect generalization asymptotically (due to the partial linearity of the ReLU)!
- 19. Monte Carlo simulations: continuous Metropolis updates, ReLU activation, K = 4, N = 50, β = 1/T = 1. Shown: a histogram of the observed R_in, with unspecialized (R = S), specialized, and anti-specialized states, and the generalization error vs. time for specialized and unspecialized initialization (a minimal sketch of such a simulation follows below).
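
A minimal sketch of such a Metropolis simulation (not the authors' code; the proposal width, the number of examples, and the teacher construction are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

N, K, beta = 50, 4, 1.0      # input dimension, hidden units, inverse temperature
P = 20 * N                   # number of training examples (assumption)

def relu(x):
    return np.maximum(0.0, x)

def output(W, X):
    """Soft committee machine: sum of ReLU hidden units, linear output."""
    return relu(W @ X.T).sum(axis=0)

# orthonormal teacher vectors, scaled so that B_n . B_m / N = delta_nm
B = np.eye(K, N) * np.sqrt(N)

# i.i.d. standard normal inputs and teacher-provided labels
X = rng.standard_normal((P, N))
tau = output(B, X)

def energy(W):
    """Quadratic training error, summed over all examples."""
    return 0.5 * np.sum((output(W, X) - tau) ** 2)

W = rng.standard_normal((K, N))      # unspecialized random initialization
E = energy(W)
for step in range(50_000):
    # propose a small continuous update of one randomly chosen weight vector
    k = rng.integers(K)
    W_new = W.copy()
    W_new[k] = W[k] + 0.05 * rng.standard_normal(N)
    E_new = energy(W_new)
    # Metropolis acceptance at inverse temperature beta
    if E_new <= E or rng.random() < np.exp(-beta * (E_new - E)):
        W, E = W_new, E_new

# student-teacher order parameters R_in: specialization shows up as a
# dominant diagonal (R > S), anti-specialization as dominant off-diagonals
R = W @ B.T / N
print(np.round(R, 2))
```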
- 20. Monte Carlo simulations, stationary generalization error: sigmoidal activation vs. ReLU at K = 4. For the sigmoidal network, the large gap / high barrier between specialized and unspecialized states delays the success of learning; for ReLU, the anti-specialized states display near-optimal performance for large K.
- 21. Summary:
  • formal equilibrium of training at high temperature in student/teacher model situations of supervised learning
  • unspecialized and partially or anti-specialized configurations compete as local/global minima of the free energy
  • phase transitions with the scaled number of examples:
    K = 2: continuous symmetry-breaking transitions with equivalent competing states
    K > 2, sigmoidal activations: first-order transition with competing states of distinct generalization ability
    K > 2, ReLU networks: continuous transition with competing states of similar performance
- 22. Outlook. Most important question: which is the decisive property of the activation? (figure: piece-wise linear "sigmoidal" activation vs. ReLU; with increasing slope, the transition changes from discontinuous to continuous)
  • consider various activation functions (leaky ReLU, swish, ...), in particular piece-wise linear activations
  • study more complex solutions beyond site symmetry
- 23. Outlook (selected topics):
  • replica trick / annealed approximation: low temperatures; vary the number of examples and T independently; mismatched student/teacher networks K ≠ M; overfitting/underfitting effects
  • complementary approach: dynamics of stochastic gradient descent, described in terms of ODEs for the order parameters
  • deep networks: many hidden layers; tree-like architectures with uncorrelated branches
  • realistic input data: clustered/correlated data; recent developments by Zdeborova, Mezard, Goldt et al.
- 24. Questions? www.cs.rug.nl/~biehl, m.biehl@rug.nl. See also for: algorithm development in machine learning; applications in medicine, life sciences, astronomy, ...