IDS Lab
Understanding deep learning requires
rethinking generalization
Does deep learning really do any generalization?

presented by Jamie Seol
Motivation
• Normally, we measure generalization by:

• generalization error = |training error - test error|

• if we overfit, the training error will be low while the test error
becomes large = high generalization error!

• However, a complex neural network is prone to overfitting!

• for example, let's train a human baby on a randomly labeled
CIFAR-10 dataset

• then, show them some samples from the training set (a 2nd+ epoch)

• they will say "what the…" to any question

• because it's impossible to generalize any kind of
abstract concept from random labels!

• what about a neural network?
CIFAR-10
• This is the CIFAR-10 dataset

• The goal of this task is to classify a given image into one of 10
classes

• The CNNs that we know well solve this rather easily
Randomized CIFAR-10
• When we randomize the labels (or even the pixels) of CIFAR-10's
training set, the resulting accuracy becomes:
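For concreteness, here is a minimal sketch of the label-randomization experiment, assuming PyTorch and torchvision are available; the small CNN and all hyperparameters are illustrative stand-ins, not the paper's exact setup:

```python
# A minimal sketch of the random-label experiment (assumes PyTorch and
# torchvision; the CNN and hyperparameters are illustrative only).
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor())

# Replace every label with a uniformly random class: anything above
# 10% training accuracy can now only come from memorization.
train_set.targets = torch.randint(0, 10, (len(train_set),)).tolist()

loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(128 * 8 * 8, 10))

opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(100):        # given enough epochs, accuracy -> 100%
    correct = 0
    for xb, yb in loader:
        opt.zero_grad()
        out = model(xb)
        loss_fn(out, yb).backward()
        opt.step()
        correct += (out.argmax(1) == yb).sum().item()
    print(epoch, correct / len(train_set))
```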
Randomized CIFAR-10
• This is nothing more than over-overfitting!

• What's the problem then?

• the neural network memorized the dataset

• even though it should have no meaning!

• it's random! raaaaandddddddommm!!!

• aaaaarrrrrrrr!!!

• it did not generalize any concept

• it just memorized!!!!
Randomized CIFAR-10
• Even if you didn't intend it, neural nets can just memorize things
rather than generalize!

• According to the experiments,

• the effective capacity of neural networks is sufficient for
memorizing the entire data set

• randomizing (corrupting) the data set makes the task harder only by a
small constant factor compared to the original task!

• Again, even if you didn't want it to, a neural network naturally
tends to overfit!!

• "You don’t have to explain the meanings. I’ll just memorize it" - Chatur,
from the movie "3 Idiots"
Regularization
• However, we do know that there are a lot of techniques for
regularization, which support generalization!

• dropout, batch norm, early stopping, weight decay… (a sketch of how
these are typically wired up follows this list)

• They do seem to help, but wait….

• can someone prove that regularization fundamentally
improves generalization?

• does it really work that well? really???
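As promised, a minimal sketch of how these explicit regularizers are usually wired up, assuming PyTorch; the architecture, all hyperparameters, and the dummy data below are illustrative assumptions, not any particular paper's setup:

```python
# Sketch of the usual explicit regularizers in PyTorch; the model,
# hyperparameters, and dummy data are illustrative assumptions.
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 512),
    nn.BatchNorm1d(512),            # batch norm
    nn.ReLU(),
    nn.Dropout(p=0.5),              # dropout
    nn.Linear(512, 10))

# weight decay = l2 penalty on the parameters, applied by the optimizer
opt = optim.SGD(model.parameters(), lr=0.01, weight_decay=5e-4)
loss_fn = nn.CrossEntropyLoss()

# dummy stand-ins for real train/validation sets, to keep this runnable
x_tr, y_tr = torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,))
x_va, y_va = torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))

# early stopping: track the best validation accuracy, stop once it stalls
best_acc, patience = 0.0, 0
for epoch in range(200):
    model.train()
    opt.zero_grad()
    loss_fn(model(x_tr), y_tr).backward()
    opt.step()

    model.eval()
    with torch.no_grad():
        val_acc = (model(x_va).argmax(1) == y_va).float().mean().item()
    if val_acc > best_acc:
        best_acc, patience = val_acc, 0
    else:
        patience += 1
    if patience >= 10:              # no improvement for 10 epochs: stop
        break
```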
Regularization
• Isn’t data augmentation significantly more important than weight
decay?

• Even with regularization, neural networks are good memorizers

• Just changing the model increased test accuracy
Regularization
• Early stopping helps

• but not necessarily…
Regularization
• Well… these techniques do seem helpful, but suspicion
remains…
Rademacher complexity
• By the way, what's the big deal about memorizing everything?

• The following measurement is called Rademacher complexity

• Detailed math is omitted here (a standard definition is sketched
after this list)

• The point is, if some model can memorize everything (more precisely, if
the hypothesis class has the power to fit a randomized dataset), then the
theoretical upper bound on the generalization error is just 1

• which is useless!!!!

• actually, using a regularization scheme can lower the bound, but this
is not true for ReLU networks, and we'll show that there are situations
where regularization helps nothing
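For reference, the omitted definition in its standard form (the notation here is mine, not the slide's):

```latex
% Empirical Rademacher complexity of a hypothesis class H on a sample
% {x_1, ..., x_n}, where the sigma_i are i.i.d. uniform in {-1, +1}:
\hat{\mathfrak{R}}_n(\mathcal{H})
  = \mathbb{E}_{\sigma}\!\left[\,
      \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, h(x_i)
    \right]
% If H can fit arbitrary random labelings, the supremum is close to 1
% for every draw of sigma, so \hat{R}_n(H) ~ 1 and the standard bound
%   generalization error <= 2 \hat{R}_n(H) + O(\sqrt{\log(1/\delta)/n})
% becomes vacuous.
```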
Finite-sample expressivity
• Remember the Universal Approximation Theorem?

• the finite-sample expressivity theorem is a more practical version of it

• note that this statement shows that the UAT does not guarantee
generalization!

• Theorem 1: there exists a 2-layer ReLU NN with 2n+d weights that
can represent any function on a sample of size n in d dimensions

• This is not a hard theorem to prove, so let’s do it
Lemma 1
• Lemma 1: for b1 < x1 < b2 < … < bn < xn, the matrix A = [ReLU(xi - bj)]ij has
full rank
• Proof: xi - bj > 0 exactly when j ≤ i, so A is lower triangular with
strictly positive diagonal entries, hence full rank
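A quick numerical sanity check of the lemma (a sketch in numpy, not a proof):

```python
# Numerical check of Lemma 1: for interleaved b_1 < x_1 < ... < b_n < x_n,
# A[i, j] = ReLU(x_i - b_j) is lower triangular with a strictly positive
# diagonal, hence full rank.
import numpy as np

n = 8
pts = np.sort(np.random.rand(2 * n))
b, x = pts[0::2], pts[1::2]                    # b_1 < x_1 < b_2 < x_2 < ...

A = np.maximum(x[:, None] - b[None, :], 0.0)   # A[i, j] = ReLU(x_i - b_j)
assert np.linalg.matrix_rank(A) == n
```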
Theorem 1
• Theorem 1: there exists a 2-layer ReLU NN with 2n+d weights that can
represent any function on a sample of size n in d dimensions
• Proof: note that a 2-layer neural network with ReLU can be expressed as
NN2(z) = Σj wj · ReLU(⟨a, z⟩ - bj)
• where w, b ∈ ℝn and a ∈ ℝd
• for data S = {z1, …, zn} and labels y ∈ ℝn where zi ∈ ℝd, we want to
show yi = NN2(zi) for all i from 1 to n
• choose a, b so that xi = ⟨a, zi⟩ meets the condition for Lemma 1 (a
generic a makes all xi distinct; then interleave the bj between them)
• then this becomes y = Aw, and Lemma 1 says that A is invertible
• done
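The proof is constructive, so we can carry it out directly; a sketch in numpy (the dimensions and data are arbitrary illustrations):

```python
# Sketch of the Theorem 1 construction: a 2-layer ReLU network with
# 2n + d weights (a in R^d, b and w in R^n) that exactly fits n
# arbitrary labels y on n samples z_i in R^d.
import numpy as np

n, d = 16, 5
Z = np.random.randn(n, d)                      # n samples in d dimensions
y = np.random.randn(n)                         # arbitrary (random) labels

a = np.random.randn(d)                         # generic a: all <a, z_i> distinct
order = np.argsort(Z @ a)
Z, y = Z[order], y[order]                      # sort so x_1 < x_2 < ... < x_n
x = Z @ a

b = np.empty(n)                                # interleave: b_i < x_i < b_{i+1}
b[0] = x[0] - 1.0
b[1:] = (x[:-1] + x[1:]) / 2.0

A = np.maximum(x[:, None] - b[None, :], 0.0)   # the Lemma 1 matrix
w = np.linalg.solve(A, y)                      # solvable: A is invertible

def nn2(z):
    """NN2(z) = sum_j w_j * ReLU(<a, z> - b_j)."""
    return np.maximum(z @ a - b, 0.0) @ w

assert np.allclose([nn2(z) for z in Z], y)     # memorizes all n labels
```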
Finite-sample expressivity
• What does it mean?

• It means that once you have more than about 2n + d parameters, your
model already possesses the power to super-overfit and simply remember
everything instead of generalizing any concept; it therefore only has the
trivial bound on generalization error and risks being nothing more than a
memorizer

• long story short: we can't speak formally about generalization in
deep learning yet

• an aside: for a deeper network, use the intermediate layers to
choose the split intervals rather than the target, so a similar
O(n + k) parameter count suffices
Stochastic Gradient Descent
• Let's think about linear optimization

• If d is large (an underdetermined problem with n < d), then we
can have multiple global minima

• But hey, can we determine which optimum gives the best
generalization?

• in non-linear systems, peeking at the curvature helped

• but there's no such thing as curvature in a linear system!
Stochastic Gradient Descent
• The funny thing about SGD is that, for an underdetermined system, it
finds the minimum l2-norm solution of the l2 loss (started from zero, the
iterates never leave the span of the data points), and it is known to act
as a regularizer itself
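A sketch of this claim on a toy underdetermined least-squares problem (numpy; the sizes, step size, and iteration count are arbitrary choices):

```python
# Sketch: SGD started at w = 0 on an underdetermined least-squares
# problem stays in the row space of X, so it converges to the minimum
# l2-norm interpolating solution, pinv(X) @ y.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                        # n < d: underdetermined, many minima
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)                       # zero init: w stays in rowspace(X)
for _ in range(50000):
    i = rng.integers(n)               # pick one sample
    w -= 0.005 * (X[i] @ w - y[i]) * X[i]   # SGD step on (x_i.w - y_i)^2 / 2

w_min = np.linalg.pinv(X) @ y         # minimum-norm solution, for comparison
print(np.linalg.norm(X @ w - y))      # ~ 0: w fits the data exactly
print(np.linalg.norm(w - w_min))      # ~ 0: and it is the min-norm fit
```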
Stochastic Gradient Descent
• However… the results show that the minimum l2-norm solution isn't
always the global optimum in the sense of generalization

• furthermore, it is possible to construct a dataset on which the
minimum l2-norm solution is not optimal: a constructive
counterexample!

• adding l2 regularization to the parameters didn't help a bit (not
shown in the table)
(the two solutions compared in the table have l2 norms of 220 and 390)
Conclusion
• "Be careful whenever you speak 'generalization' in deep learning"

• Contributions of this paper:

• an experimental framework for suspecting the suspicious activities of
generalization techniques

• a proof that the generalization error lacks a useful theoretical bound
in deep learning (since a network of modest effective capacity can just
memorize it all)

• optimization does not necessarily mean generalization

• "beware of the light" - Caliban, from the movie "Logan"
References
• Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking
generalization." arXiv preprint arXiv:1611.03530 (2016).
• https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-
requires-rethinking-generalization-2017-12
• https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-
requires-rethinking-generalization-2017-2-22
