Stochastic Gradient Descent with
Exponential Convergence Rates of
Expected Classification Errors
Atsushi Nitanda and Taiji Suzuki
AISTATS
April 18th, 2019
Naha, Okinawa
RIKEN AIP
Overview
• Topic
Convergence analysis of (averaged) SGD for binary classification
problems.
• Key assumption
Strongest version of low noise condition (margin condition) on the
conditional label probability.
• Result
Exponential convergence rates of expected classification errors
2
Background
• Stochastic Gradient Descent (SGD)
Simple and effective method for training machine learning models.
Significantly faster than vanilla gradient descent.
• Convergence Rates
Expected risk: sublinear convergence O(1/n^α), α ∈ [1/2, 1].
Expected classification error: how fast does it converge?
SGD: g_{t+1} ← g_t − η G_λ(g_t, Z_t), Z_t ∼ ρ,
GD:  g_{t+1} ← g_t − η 𝔼_{Z_t∼ρ}[G_λ(g_t, Z_t)].
Cost per iteration: 1 (SGD) vs. #data examples (GD)
3
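As a concrete illustration of the two update rules above (an addition, not part of the original slides), here is a minimal sketch comparing one SGD step with one full-gradient GD step for a regularized logistic model, using a finite sample as a stand-in for the expectation; the data, step size, and parameter names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, eta = 1000, 5, 1e-2, 0.1
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))

def grad(w, Xb, yb):
    # Regularized logistic-loss gradient G_lambda(w) averaged over a (mini)batch.
    margins = yb * (Xb @ w)
    return -(Xb * (yb / (1.0 + np.exp(margins)))[:, None]).mean(axis=0) + lam * w

w_sgd, w_gd = np.zeros(d), np.zeros(d)
for t in range(n):
    i = rng.integers(n)
    w_sgd -= eta * grad(w_sgd, X[i:i + 1], y[i:i + 1])  # one example per step: cost O(1)
    w_gd  -= eta * grad(w_gd, X, y)                     # full data pass per step: cost O(n)
```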
Background
Common way to bound classification error.
• Classification error bound via consistency of loss functions:
[T. Zhang(2004), P. Bartlett+(2006)]
ℙ[sgn(g(X)) ≠ Y] − ℙ[sgn(2ρ(1|X) − 1) ≠ Y] ≲ ( ℒ(g) − ℒ* )^p,
(left-hand side: excess classification error; right-hand side: excess risk)
g: predictor,  ℒ*: Bayes-optimal risk for ℒ,
ρ(1|X): conditional probability of the label Y = 1.
𝑝 = 1/2 for logistic, exponential, and squared losses.
• Sublinear convergence O(1/n^{αp}) of the excess classification error.
4
Background
Faster convergence rates of excess classification error.
• Low noise condition on ρ(Y = 1|X)
[A.B. Tsybakov(2004), P. Bartlett+(2006)]
improves the consistency property,
resulting in faster rates: O(1/n^α). (still sublinear convergence)
• Low noise condition (strongest version)
[V. Koltchinskii & O. Beznosova(2005), J-Y. Audibert & A.B. Tsybakov(2007)]
accelerates the rates for ERM to linear rates O(exp(−n)).
5
Background
Faster convergence rates of excess classification error for SGD.
• Linear convergence rate
[L. Pillaud-Vivien, A. Rudi, & F. Bach(2018)]
has been shown for the squared loss function under the strong low
noise condition.
• This work
shows the linear convergence for more suitable loss functions (e.g.,
logistic loss) under the strong low noise condition.
6
Outline
• Problem Settings and Assumptions
• (Averaged) Stochastic Gradient Descent
• Main Results: Linear Convergence Rates of SGD and ASGD
• Proof Idea
• Toy Experiment
7
Problem Setting
• Regularized expected risk minimization problems
min_{g∈ℋ_k}  ℒ_λ(g) = 𝔼_{(X,Y)}[ l(g(X), Y) ] + (λ/2)‖g‖_k²,
(ℋ_k, ⟨·, ·⟩_k): Reproducing kernel Hilbert space,
l: Differentiable loss,
(X, Y): random variables on the feature space and the label set {−1, 1},
λ: Regularization parameter.
8
Loss Function
Example  There exists a convex φ: ℝ → ℝ_{≥0} s.t. l(ζ, y) = φ(yζ), with
φ(v) = log(1 + exp(−v))  (logistic loss),
φ(v) = exp(−v)           (exponential loss),
φ(v) = (1 − v)²          (squared loss).
9
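For concreteness (an addition, not from the slides), the three margin losses above in code; the squared-loss form (1 − v)² is an assumed reading, since only the loss name is legible on the slide.

```python
import numpy as np

# phi(v) for the margin losses; the full loss is l(zeta, y) = phi(y * zeta).
def logistic_loss(v):
    return np.log1p(np.exp(-v))

def exponential_loss(v):
    return np.exp(-v)

def squared_loss(v):
    return (1.0 - v) ** 2   # assumed form of the squared loss

v = np.linspace(-2.0, 2.0, 5)
for phi in (logistic_loss, exponential_loss, squared_loss):
    print(phi.__name__, np.round(phi(v), 3))
```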
Assumption
- sup_{x∈𝒳} k(x, x) ≤ R²,
- ∃M > 0: |∂_ζ l(ζ, y)| ≤ M,
- ∃L > 0, ∀g, h ∈ ℋ_k: ℒ(g + h) − ℒ(g) − ⟨∇ℒ(g), h⟩_k ≤ (L/2)‖h‖_k²,
- ρ(Y = 1|X) ∈ (0, 1) a.e.,
- h*: increasing function on (0, 1),
- sgn(μ − 0.5) = sgn(h*(μ)),
- g* := arg min_{g: measurable} ℒ(g) ∈ ℋ_k.
Remark Logistic loss satisfies these assumptions.
The other loss functions also satisfy them after restricting the hypothesis space.
10
Link function:
h*(μ) = arg min_{h∈ℝ} { μφ(h) + (1 − μ)φ(−h) }.
Strongest Low Noise Condition
Assumption  ∃δ ∈ (0, 1/2) s.t., for X a.e. w.r.t. ρ_𝒳,
|ρ(Y = 1|X) − 0.5| > δ.
11
(Figure: ρ(Y = 1|x) as a function of x ∈ 𝒳; it stays outside the band [0.5 − δ, 0.5 + δ], lying above it where Y = +1 and below it where Y = −1.)
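A minimal numerical check of this condition (an illustrative addition; the conditional probability rho1 below is assumed, not taken from the slides): estimate the largest admissible δ from sampled points.

```python
import numpy as np

rng = np.random.default_rng(0)

def rho1(X):
    # Assumed conditional probability rho(Y = 1 | x): 0.9 on one side of x1 = 1, 0.1 on the other.
    return np.where(X[:, 0] > 1.0, 0.9, 0.1)

X = rng.uniform(0.0, 2.0, size=(10_000, 2))
delta_max = np.min(np.abs(rho1(X) - 0.5))   # |rho(1|x) - 0.5| > delta holds for any delta < delta_max
print(delta_max)                            # 0.4 for this rho1
```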
Strongest Low Noise Condition
Assumption  ∃δ ∈ (0, 1/2) s.t., for X a.e. w.r.t. ρ_𝒳,
|ρ(Y = 1|X) − 0.5| > δ.
Example (figures): MNIST; the toy data used in the experiment.
12
(Averaged) Stochastic Gradient Descent
13
• Stochastic gradient in RKHS:
  G_λ(g, X, Y) = ∂_ζ l(g(X), Y) k(X, ·) + λg.
• Step size and averaging weights:
  η_t = 2/(λ(γ + t)),   α_t = 2(γ + t − 1)/((2γ + T)(T + 1)).
Note: averaging can be updated iteratively.
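A minimal sketch of these updates (an illustrative addition, not the authors' code): kernel SGD with the step size η_t above and an iteratively updated weighted average, using a Gaussian kernel, the logistic loss, and assumed toy data; the averaging indexing is simplified for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, gamma, T, delta = 1e-3, 8.0, 500, 0.4

def kernel(x, B):
    # Gaussian kernel k(x, b) against the rows b of B.
    return np.exp(-np.sum((B - x) ** 2, axis=1) / 2.0)

# Assumed streaming toy data: Bayes boundary x1 = 1, labels flipped with probability 0.5 - delta.
Xs = rng.uniform(0.0, 2.0, size=(T, 2))
ys = np.where(rng.random(T) < np.where(Xs[:, 0] > 1.0, 0.5 + delta, 0.5 - delta), 1.0, -1.0)

coef = np.zeros(T)       # g_t = sum_j coef[j] * k(Xs[j], .)
coef_avg = np.zeros(T)   # coefficients of the averaged iterate
for t in range(1, T + 1):
    x, y = Xs[t - 1], ys[t - 1]
    eta = 2.0 / (lam * (gamma + t))                  # step size eta_t
    g_x = coef[:t - 1] @ kernel(x, Xs[:t - 1])       # g_t(x)
    dl = -y / (1.0 + np.exp(y * g_x))                # logistic-loss derivative at (g_t(x), y)
    coef[:t - 1] *= 1.0 - eta * lam                  # shrinkage from the lam * g term
    coef[t - 1] = -eta * dl                          # new coefficient on k(x, .)
    alpha = 2.0 * (gamma + t - 1) / ((2 * gamma + T) * (T + 1))
    coef_avg[:t] += alpha * coef[:t]                 # running weighted average (sketch)
```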
Convergence Analyses
• For simplicity, we focus on the following case:
𝑔1 = 0,
𝑘: Gaussian kernel,
φ(v) = log(1 + exp(−v)): logistic loss.
• We analyze convergence rates of excess classification error:
ℛ(g) − ℛ* := ℙ[sgn(g(X)) ≠ Y] − ℙ[sgn(g*(X)) ≠ Y].
14
Main Result 1: Linear Convergence of SGD
Theorem  There exists λ > 0 s.t. the following holds.
Assume η₁ ≤ min{ 1/(L + λ), 1/λ } and ‖∂_ζ l(g(X), Y) k(X, ·)‖_k ≤ σ.
Set ν := (c₁/λ²) max{ (L + λ)σ², (1 + γ)(ℒ_λ(g₁) − ℒ_λ(g_λ)) }  (c₁: a numerical constant).
Then, for T ≥ (ν/λ) log⁻¹((1 + 2δ)/(1 − 2δ)) − γ, we have
𝔼[ℛ(g_{T+1})] − ℛ* ≤ 2 exp( −c₂ λ(γ + T) log²((1 + 2δ)/(1 − 2δ)) )  (c₂: a numerical constant).
#iterations for an ε-solution: O( (1/λ²) log(1/ε) log⁻²((1 + 2δ)/(1 − 2δ)) ).
15
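To make the iteration counts concrete (an illustrative addition, with all hidden constants dropped), a small calculation of the O((1/λ²) · log(1/ε) · log⁻²((1 + 2δ)/(1 − 2δ))) scaling shared by both results:

```python
import numpy as np

def m_delta(delta):
    # Margin of the logistic link under the strong low noise condition: log((1+2d)/(1-2d)).
    return np.log((1 + 2 * delta) / (1 - 2 * delta))

def iteration_scaling(lam, delta, eps):
    # (1/lam^2) * log(1/eps) * log^{-2}((1+2d)/(1-2d)), numerical constants omitted.
    return np.log(1.0 / eps) / (lam ** 2 * m_delta(delta) ** 2)

for delta in (0.1, 0.25, 0.4):
    print(delta, f"{iteration_scaling(lam=1e-2, delta=delta, eps=1e-3):.3g}")
```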
Main Result 2: Linear Convergence of ASGD
Theorem  There exists λ > 0 s.t. the following holds.
Assume η₁ ≤ min{ 1/L, 1/λ }. Then, if
max{ ν/(λ²(γ + T)), γ(γ − 1)‖g₁‖_k²/((γ + T)(T + 1)) } ≤ 32 log((1 + 2δ)/(1 − 2δ)),
we have
𝔼[ℛ(ḡ_{T+1})] − ℛ* ≤ 2 exp( −c λ(2γ + T) log²((1 + 2δ)/(1 − 2δ)) )  (c: a numerical constant).
#iterations for an ε-solution: O( (1/λ²) log(1/ε) log⁻²((1 + 2δ)/(1 − 2δ)) ).
Remark  The condition on T is much improved; its dominant term can be satisfied even for fairly small ε.
16
Toy Experiment
• 2-dim toy dataset.
• 𝛿 ∈ 0.1, 0.25, 0.4 .
• Linearly separable.
• Logistic loss.
• 𝜆 was determined by validation.
Right Figure
Generated samples for 𝛿 = 0.4.
x₁ = 1 is the Bayes-optimal decision boundary.
17
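A minimal sketch of how such a dataset can be generated (an assumed reading of the setup, not the authors' generator): 2-d points with Bayes boundary x₁ = 1 and |ρ(Y = 1|x) − 0.5| = δ everywhere.

```python
import numpy as np

def make_toy_data(n, delta, rng):
    # 2-d points; the Bayes-optimal boundary is x1 = 1 and |rho(Y=1|x) - 0.5| = delta everywhere.
    X = rng.uniform(0.0, 2.0, size=(n, 2))
    p_pos = np.where(X[:, 0] > 1.0, 0.5 + delta, 0.5 - delta)   # rho(Y = 1 | x)
    y = np.where(rng.random(n) < p_pos, 1.0, -1.0)
    return X, y

rng = np.random.default_rng(0)
X, y = make_toy_data(2000, delta=0.4, rng=rng)
```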
18
From top to bottom:
1. Risk value
2. Class. error
3. Excess class. error / excess risk value
Purple line: SGD
Blue line : ASGD
ASGD is much faster
especially when 𝛿 = 0.4.
Summary
• We explained that convergence rates of expected classification
errors for (A)SGD are sublinear, O(1/n^α), in general.
• We showed that these rates can be accelerated to linear rates
O(exp(−n)) under the strong low noise condition.
Future Work
• Faster convergence under additional assumptions.
• Variants of SGD (acceleration, variance reduction).
• Non-convex models such as deep neural networks.
• Random Fourier features (ongoing work with collaborators).
19
References
- T. Zhang. Statistical behavior and consistency of classification methods based on convex risk
minimization. The Annals of Statistics, 2004.
- P. Bartlett, M. Jordan, & J. McAuliffe. Convexity, classification, and risk bounds. Journal of the
American Statistical Association, 2006.
- A.B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 2004.
- V. Koltchinskii & O. Beznosova. Exponential convergence rates in classification. In International
Conference on Computational Learning Theory, 2005.
- J-Y. Audibert & A.B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 2007.
- L. Bottou & O. Bousquet. The Tradeoffs of Large Scale Learning, Advances in Neural Information
Processing Systems, 2008.
- L. Pillaud-Vivien, A. Rudi, & F. Bach. Exponential convergence of testing error for stochastic
gradient methods. In International Conference on Computational Learning Theory, 2018.
20
Appendix
21
Link Function
Definition (Link function)  h*: (0, 1) → ℝ,
h*(μ) = arg min_{h∈ℝ} { μφ(h) + (1 − μ)φ(−h) }.
h* connects the conditional probability of the label to the model output.
Example (Logistic loss)
h*(μ) = log( μ/(1 − μ) ),   (h*)⁻¹(a) = 1/(1 + exp(−a)).
22
(Figure: the expected risk determined by the conditional probability μ, plotted as a function of h; its minimizer is h*(μ).)
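A quick numerical check (an illustrative addition) that the closed-form logistic link above minimizes the conditional risk μφ(h) + (1 − μ)φ(−h):

```python
import numpy as np

def phi(v):
    return np.log1p(np.exp(-v))                 # logistic loss

def conditional_risk(h, mu):
    return mu * phi(h) + (1 - mu) * phi(-h)

hs = np.linspace(-6.0, 6.0, 200_001)
for mu in (0.2, 0.5, 0.9):
    h_star = np.log(mu / (1 - mu))              # closed-form link h*(mu)
    h_grid = hs[np.argmin(conditional_risk(hs, mu))]
    print(mu, round(h_star, 3), round(float(h_grid), 3))
```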
Proof Idea
Set m(δ) := max{ |h*(0.5 + δ)|, |h*(0.5 − δ)| }.
Example (logistic loss)  m(δ) = log((1 + 2δ)/(1 − 2δ)).
Through h*, the noise condition is converted to: |g*(X)| ≥ m(δ).
Set g_λ := arg min_{g∈ℋ_k} ℒ_λ(g).
When 𝜆 is sufficiently small, 𝑔6 is close to 𝑔∗. Moreover,
Proposition
There exists λ s.t.  ‖g − g_λ‖_k ≤ m(δ)/R  ⟹  ℛ(g) = ℛ*.
23
24
Proof Idea
(Figure: in the space of conditional probabilities, a small ball around ρ(1|X) provides the Bayes rule; through h*, it is mapped to a small ball around g* (and g_λ) in the RKHS of predictors, which SGD must enter.)
Analyze the convergence speed and the probability of getting into this ball in the RKHS.
Recall  h*(μ) = arg min_{h∈ℝ} { μφ(h) + (1 − μ)φ(−h) }.
Proof Sketch
1. Let Z₁, …, Z_T ∼ ρ be i.i.d. random variables,
D_t := 𝔼[ḡ_{T+1} | Z₁, …, Z_t] − 𝔼[ḡ_{T+1} | Z₁, …, Z_{t−1}],
ḡ_{T+1} = 𝔼[ḡ_{T+1}] + Σ_{t=1}^T D_t.
2. Convergence of 𝔼[ḡ_{T+1}] can be analyzed via
‖𝔼[ḡ_{T+1}] − g_λ‖_k² ≤ (2/λ)( ℒ_λ(𝔼[ḡ_{T+1}]) − ℒ_λ(g_λ) ).
3. Bound Σ_{t=1}^T D_t by a martingale inequality: for c_T s.t. Σ_{t=1}^T ‖D_t‖² ≤ c_T²,
ℙ( ‖Σ_{t=1}^T D_t‖_k ≥ ε ) ≤ 2 exp( −ε²/c_T² ).
4. Bound c_T by the stability of (A)SGD.
5. Combining the above, the probability that ḡ_{T+1} gives the Bayes rule is obtained.
6. Finally, 𝔼[ℛ(ḡ_{T+1})] − ℛ* ≤ ℙ[ ḡ_{T+1} is not the Bayes rule ].
25
