IDS Lab
Understanding deep learning requires
rethinking generalization
Does deep learning really do any generalization?

presented by Jamie Seol
Motivation
• Normally, we measure generalization by:

• generalization error = |training error - test error|

• if we overfit, the training error will be low while the test error
becomes large = high generalization error!

• However, a complex neural network is prone to overfitting!

• for example, let's train a human baby on a randomly labeled
CIFAR-10 dataset

• then, show them some samples from the training set (a 2nd+ epoch)

• they will say "what the…" to any question

• because it's impossible to generalize any kind of
abstract concept from random labels!

• what about a neural network?
CIFAR-10
• This is the CIFAR-10 dataset

• The goal of this task is to classify a given image into one of 10
classes

• The CNNs that we know well solve this rather easily
Randomized CIFAR-10
• When we randomize the labels (or even the pixels) of CIFAR-10's
training set, the resulting accuracy becomes:
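For concreteness, here is a minimal sketch of the label-randomization experiment, assuming PyTorch and torchvision are available; the small CNN and all hyperparameters are illustrative stand-ins, not the paper's exact setup:

```python
# A minimal sketch of the random-label experiment (assumes PyTorch and
# torchvision; the CNN and hyperparameters are illustrative only).
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor())

# Replace every label with a uniformly random class: anything above
# 10% training accuracy can now only come from memorization.
train_set.targets = torch.randint(0, 10, (len(train_set),)).tolist()

loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(128 * 8 * 8, 10))

opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(100):        # given enough epochs, accuracy -> 100%
    correct = 0
    for xb, yb in loader:
        opt.zero_grad()
        out = model(xb)
        loss_fn(out, yb).backward()
        opt.step()
        correct += (out.argmax(1) == yb).sum().item()
    print(epoch, correct / len(train_set))
```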
Randomized CIFAR-10
• This is nothing more than over-overfitting!

• What's the problem then?

• the neural network memorized the dataset

• even though it should have no meaning!

• it's random! raaaaandddddddommm!!!

• aaaaarrrrrrrr!!!

• it did not generalize any concept

• it just memorized!!!!
Randomized CIFAR-10
• Even if you didn't intend it, neural nets can just memorize things
rather than generalize!

• According to the experiments,

• the effective capacity of neural networks is sufficient for
memorizing the entire data set

• randomizing (corrupting) the data set makes the task harder only by a
small constant factor compared to the original task!

• Again, even if you didn't want it to, a neural network naturally
tends to overfit!!

• "You don’t have to explain the meanings. I’ll just memorize it" - Chatur,
from the movie "3 Idiots"
Regularization
• However, we do know that there are a lot of techniques for
regularization, which support generalization!

• dropout, batch norm, early stopping, weight decay… (a sketch of how
these are typically wired up follows this list)

• They do seem to help, but wait….

• can someone prove that regularization fundamentally
improves generalization?

• does it really work that well? really???
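As promised, a minimal sketch of how these explicit regularizers are usually wired up, assuming PyTorch; the architecture, all hyperparameters, and the dummy data below are illustrative assumptions, not any particular paper's setup:

```python
# Sketch of the usual explicit regularizers in PyTorch; the model,
# hyperparameters, and dummy data are illustrative assumptions.
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 512),
    nn.BatchNorm1d(512),            # batch norm
    nn.ReLU(),
    nn.Dropout(p=0.5),              # dropout
    nn.Linear(512, 10))

# weight decay = l2 penalty on the parameters, applied by the optimizer
opt = optim.SGD(model.parameters(), lr=0.01, weight_decay=5e-4)
loss_fn = nn.CrossEntropyLoss()

# dummy stand-ins for real train/validation sets, to keep this runnable
x_tr, y_tr = torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,))
x_va, y_va = torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))

# early stopping: track the best validation accuracy, stop once it stalls
best_acc, patience = 0.0, 0
for epoch in range(200):
    model.train()
    opt.zero_grad()
    loss_fn(model(x_tr), y_tr).backward()
    opt.step()

    model.eval()
    with torch.no_grad():
        val_acc = (model(x_va).argmax(1) == y_va).float().mean().item()
    if val_acc > best_acc:
        best_acc, patience = val_acc, 0
    else:
        patience += 1
    if patience >= 10:              # no improvement for 10 epochs: stop
        break
```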
Regularization
• Isn’t data augmentation significantly more important than weight
decay?

• Even with regularization, neural networks are good memorizers

• Just changing the model increased test accuracy
Regularization
• Early stopping helps

• but not necessarily…
Regularization
• Well… these techniques do seem helpful, but suspicion
remains…
Rademacher complexity
• By the way, what's the big deal about memorizing everything?

• The following measurement is called Rademacher complexity

• Detailed math is omitted here (a standard definition is sketched
after this list)

• The point is, if some model can memorize everything (more precisely, if
the hypothesis class has the power to fit a randomized dataset), then the
theoretical upper bound on the generalization error is just 1

• which is useless!!!!

• actually, using a regularization scheme can lower the bound, but this
is not true for ReLU networks, and we'll show that there are situations
where regularization helps nothing
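For reference, the omitted definition in its standard form (the notation here is mine, not the slide's):

```latex
% Empirical Rademacher complexity of a hypothesis class H on a sample
% {x_1, ..., x_n}, where the sigma_i are i.i.d. uniform in {-1, +1}:
\hat{\mathfrak{R}}_n(\mathcal{H})
  = \mathbb{E}_{\sigma}\!\left[\,
      \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, h(x_i)
    \right]
% If H can fit arbitrary random labelings, the supremum is close to 1
% for every draw of sigma, so \hat{R}_n(H) ~ 1 and the standard bound
%   generalization error <= 2 \hat{R}_n(H) + O(\sqrt{\log(1/\delta)/n})
% becomes vacuous.
```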
Finite-sample expressivity
• Remember the Universal Approximation Theorem?

• the finite-sample expressivity theorem is a more practical version of it

• note that this statement shows that the UAT does not guarantee
generalization!

• Theorem 1: there exists a 2-layer ReLU NN with 2n+d weights that
can represent any function on a sample of size n in d dimensions

• This is not a hard theorem to prove, so let’s do it
Lemma 1
• Lemma 1: for b1 < x1 < b2 < … < bn < xn, the matrix A = [ReLU(xi - bj)]ij has
full rank
• Proof: xi - bj > 0 exactly when j ≤ i, so A is lower triangular with
strictly positive diagonal entries, hence full rank
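A quick numerical sanity check of the lemma (a sketch in numpy, not a proof):

```python
# Numerical check of Lemma 1: for interleaved b_1 < x_1 < ... < b_n < x_n,
# A[i, j] = ReLU(x_i - b_j) is lower triangular with a strictly positive
# diagonal, hence full rank.
import numpy as np

n = 8
pts = np.sort(np.random.rand(2 * n))
b, x = pts[0::2], pts[1::2]                    # b_1 < x_1 < b_2 < x_2 < ...

A = np.maximum(x[:, None] - b[None, :], 0.0)   # A[i, j] = ReLU(x_i - b_j)
assert np.linalg.matrix_rank(A) == n
```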
Theorem 1
• Theorem 1: there exists a 2-layer ReLU NN with 2n+d weights that can
represent any function on a sample of size n in d dimensions
• Proof: note that a 2-layer neural network with ReLU can be expressed as
NN2(z) = Σj wj · ReLU(⟨a, z⟩ - bj)
• where w, b ∈ ℝn and a ∈ ℝd
• for data S = {z1, …, zn} and labels y ∈ ℝn where zi ∈ ℝd, we want to
show yi = NN2(zi) for all i from 1 to n
• choose a, b so that xi = ⟨a, zi⟩ meets the condition for Lemma 1 (a
generic a makes all xi distinct; then interleave the bj between them)
• then this becomes y = Aw, and Lemma 1 says that A is invertible
• done
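The proof is constructive, so we can carry it out directly; a sketch in numpy (the dimensions and data are arbitrary illustrations):

```python
# Sketch of the Theorem 1 construction: a 2-layer ReLU network with
# 2n + d weights (a in R^d, b and w in R^n) that exactly fits n
# arbitrary labels y on n samples z_i in R^d.
import numpy as np

n, d = 16, 5
Z = np.random.randn(n, d)                      # n samples in d dimensions
y = np.random.randn(n)                         # arbitrary (random) labels

a = np.random.randn(d)                         # generic a: all <a, z_i> distinct
order = np.argsort(Z @ a)
Z, y = Z[order], y[order]                      # sort so x_1 < x_2 < ... < x_n
x = Z @ a

b = np.empty(n)                                # interleave: b_i < x_i < b_{i+1}
b[0] = x[0] - 1.0
b[1:] = (x[:-1] + x[1:]) / 2.0

A = np.maximum(x[:, None] - b[None, :], 0.0)   # the Lemma 1 matrix
w = np.linalg.solve(A, y)                      # solvable: A is invertible

def nn2(z):
    """NN2(z) = sum_j w_j * ReLU(<a, z> - b_j)."""
    return np.maximum(z @ a - b, 0.0) @ w

assert np.allclose([nn2(z) for z in Z], y)     # memorizes all n labels
```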
Finite-sample expressivity
• What does it mean?

• It means that once you have more than about 2n + d parameters, your
model already possesses the power to super-overfit and simply remember
everything instead of generalizing any concept; it therefore only has the
trivial bound on generalization error and risks being nothing more than a
memorizer

• long story short: we can't speak formally about generalization in
deep learning yet

• an aside: for a deeper network, use the intermediate layers to
choose the split intervals rather than the target, so a similar
O(n + k) parameter count suffices
Stochastic Gradient Descent
• Let's think about linear optimization

• If d is large (an underdetermined problem with n < d), then we
can have multiple global minima

• But hey, can we determine which optimum gives the best
generalization?

• in non-linear systems, peeking at the curvature helped

• but there's no such thing as curvature in a linear system!
Stochastic Gradient Descent
• The funny thing about SGD is that, for an underdetermined system, it
finds the minimum l2-norm solution of the l2 loss (started from zero, the
iterates never leave the span of the data points), and it is known to act
as a regularizer itself
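A sketch of this claim on a toy underdetermined least-squares problem (numpy; the sizes, step size, and iteration count are arbitrary choices):

```python
# Sketch: SGD started at w = 0 on an underdetermined least-squares
# problem stays in the row space of X, so it converges to the minimum
# l2-norm interpolating solution, pinv(X) @ y.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                        # n < d: underdetermined, many minima
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)                       # zero init: w stays in rowspace(X)
for _ in range(50000):
    i = rng.integers(n)               # pick one sample
    w -= 0.005 * (X[i] @ w - y[i]) * X[i]   # SGD step on (x_i.w - y_i)^2 / 2

w_min = np.linalg.pinv(X) @ y         # minimum-norm solution, for comparison
print(np.linalg.norm(X @ w - y))      # ~ 0: w fits the data exactly
print(np.linalg.norm(w - w_min))      # ~ 0: and it is the min-norm fit
```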
Stochastic Gradient Descent
• However… the results show that the minimum l2-norm solution isn't
always the global optimum in the sense of generalization

• furthermore, it is possible to construct a dataset on which the
minimum l2-norm solution is not optimal: a constructive
counterexample!

• adding l2 regularization to the parameters didn't help a bit (not
shown in the table)
(the two solutions compared in the table have l2 norms of 220 and 390)
Conclusion
• "Be careful whenever you speak 'generalization' in deep learning"

• Contributions of this paper:

• an experimental framework for suspecting the suspicious activities of
generalization techniques

• a proof that the generalization error lacks a useful theoretical bound
in deep learning (since a network of modest effective capacity can just
memorize it all)

• optimization does not necessarily mean generalization

• "beware of the light" - Caliban, from the movie "Logan"
References
• Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking
generalization." arXiv preprint arXiv:1611.03530 (2016).
• https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-
requires-rethinking-generalization-2017-12
• https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-
requires-rethinking-generalization-2017-2-22
