Sufficient decrease is all you need
A simple condition to forget about the step-size, with
applications to the Frank-Wolfe algorithm.
Fabian Pedregosa
June 4th, 2018. Google Brain Montreal
Where I Come From
ML/Optimization/Software Guy
Engineer (2010–2012)
First contact with ML: developing an ML library (scikit-learn).
ML and Neuroscience (2012–2015)
PhD applying ML to neuroscience.
ML and Optimization (2015–)
Stochastic, Parallel, Constrained, Hyperparameter optimization.
Outline
Motivation: eliminate the step-size parameter.
1. Frank-Wolfe. A method for constrained optimization.
2. Adaptive Frank-Wolfe. Frank-Wolfe without the step-size.
3. Perspectives. Other applications: proximal splitting, stochastic optimization.
With a little help from my collaborators
Armin Askari (UC Berkeley), Geoffrey Négiar (UC Berkeley), Martin Jaggi (EPFL), Gauthier Gidel (UdeM)
The Frank-Wolfe algorithm
The Frank-Wolfe (FW) algorithm, aka conditional gradient
Problem: smooth f, compact D
  arg min_{x∈D} f(x)
Algorithm 1: Frank-Wolfe (FW)
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   Find γt by line-search: γt ∈ arg min_{γ∈[0,1]} f((1 − γ)xt + γst)
4   xt+1 = (1 − γt)xt + γtst
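To make the iteration concrete, a minimal Python sketch of the loop above; the linear minimization oracle lmo_l1 (here the one for an ℓ1 ball of radius alpha) and the classical γt = 2/(t + 2) default step are illustrative choices, not something prescribed by the slides.

import numpy as np

def lmo_l1(grad, alpha):
    # Linear minimization oracle for the l1 ball of radius alpha:
    # argmin_{||s||_1 <= alpha} <grad, s> is attained at a signed vertex.
    s = np.zeros_like(grad)
    i = np.argmax(np.abs(grad))
    s[i] = -alpha * np.sign(grad[i])
    return s

def frank_wolfe(f_grad, x0, alpha, max_iter=100):
    # Vanilla Frank-Wolfe with the default gamma_t = 2 / (t + 2) step-size.
    x = x0.copy()
    for t in range(max_iter):
        s = lmo_l1(f_grad(x), alpha)     # linear subproblem over the domain D
        gamma = 2.0 / (t + 2.0)          # convergent but slow default step-size
        x = (1 - gamma) * x + gamma * s  # convex combination keeps x in D
    return x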
Why people ♥ Frank-Wolfe
• Projection-free. Only linear subproblems arise, vs quadratic subproblems for projection.
• The solution of the linear subproblem is always an extremal element of D.
• Iterates admit a sparse representation: xt is a convex combination of at most t elements.
Recent applications of Frank-Wolfe
• Learning the structure of a neural network.¹
• Attention mechanisms that enforce sparsity.²
• ℓ1-constrained problems with an extreme number of features.³
¹ Wei Ping, Qiang Liu, and Alexander T Ihler (2016). “Learning Infinite RBMs with Frank-Wolfe”. In: Advances in Neural Information Processing Systems.
² Vlad Niculae et al. (2018). “SparseMAP: Differentiable Sparse Structured Inference”. In: International Conference on Machine Learning.
³ Thomas Kerdreux, Fabian Pedregosa, and Alexandre d’Aspremont (2018). “Frank-Wolfe with Subsampling Oracle”. In: Proceedings of the 35th International Conference on Machine Learning.
A practical issue
• Line-search is only efficient when a closed form exists (quadratic objective).
• The step-size γt = 2/(t + 2) is convergent, but extremely slow.
Algorithm 2: Frank-Wolfe (FW)
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   Find γt by line-search: γt ∈ arg min_{γ∈[0,1]} f((1 − γ)xt + γst)
4   xt+1 = (1 − γt)xt + γtst
Can we do better?
A sufficient decrease condition
Down the citation rabbit hole
Vladimir Demyanov, Aleksandr Rubinov⁴
⁴ Vladimir Demyanov and Aleksandr Rubinov (1970). Approximate methods in optimization problems (translated from Russian).
The Demyanov-Rubinov (DR) Frank-Wolfe variant
Problem: smooth objective, compact domain
  arg min_{x∈D} f(x), where f is L-smooth
(L-smooth ≡ differentiable with L-Lipschitz gradient).
• Step-size depends on the correlation between −∇f(xt) and the descent direction st − xt.
Algorithm 3: FW, DR variant
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   γt = min{ ⟨−∇f(xt), st − xt⟩ / (L‖st − xt‖²), 1 }
4   xt+1 = (1 − γt)xt + γtst
The Demyanov-Rubinov (DR) Frank-Wolfe variant
Where does γt = min{ ⟨−∇f(xt), st − xt⟩ / (L‖st − xt‖²), 1 } come from?
L-smooth inequality
Any L-smooth function f verifies
  f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖²  =: Qx(y),
for all x, y in the domain.
• The right-hand side is a quadratic upper bound: Qx(y) ≥ f(y).
Justification of the step-size
• The L-smooth inequality at y = xt+1(γ) = (1 − γ)xt + γst, x = xt gives
  f(xt+1(γ)) ≤ f(xt) + γ⟨∇f(xt), st − xt⟩ + (γ²L/2)‖st − xt‖²
• Minimizing the right-hand side over γ ∈ [0, 1] gives
  γ = min{ ⟨−∇f(xt), st − xt⟩ / (L‖st − xt‖²), 1 }
  = the Demyanov-Rubinov step-size!
• ≡ exact line search on the quadratic upper bound.
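Spelled out, a short derivation in LaTeX following the definitions above (the clipping to [0, 1] handles the case where the unconstrained minimizer exceeds 1):

\begin{align*}
q(\gamma) &= f(x_t) + \gamma\,\langle \nabla f(x_t),\, s_t - x_t\rangle + \tfrac{\gamma^2 L}{2}\,\|s_t - x_t\|^2,\\
q'(\gamma^\star) = 0 \;&\Longrightarrow\; \gamma^\star = \frac{\langle -\nabla f(x_t),\, s_t - x_t\rangle}{L\,\|s_t - x_t\|^2},\\
\gamma_t &= \min\{\gamma^\star,\, 1\} \quad\text{(clip to $[0,1]$; $\gamma^\star \ge 0$ since $s_t$ minimizes the linear model).}
\end{align*}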
Towards an Adaptive FW
Quadratic upper bound
The Demyanov-Rubinov step-size makes use of a quadratic upper bound, but it is only evaluated at xt, xt+1.
Sufficient decrease is all you need
The L-smooth inequality can be replaced by
  f(xt+1) ≤ f(xt) + γt⟨∇f(xt), st − xt⟩ + (γt²Lt/2)‖st − xt‖²
with γt = min{ ⟨−∇f(xt), st − xt⟩ / (Lt‖st − xt‖²), 1 }
Key difference with DR: L is replaced by Lt. Potentially Lt ≪ L.
The Adaptive FW algorithm
New FW variant with adaptive step-size.⁵
Algorithm 4: The Adaptive Frank-Wolfe algorithm (AdaFW)
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   Find Lt that verifies the sufficient decrease condition (1), with
4   γt = min{ ⟨−∇f(xt), st − xt⟩ / (Lt‖st − xt‖²), 1 }
5   xt+1 = (1 − γt)xt + γtst

  f(xt+1) ≤ f(xt) + γt⟨∇f(xt), st − xt⟩ + (γt²Lt/2)‖st − xt‖²   (1)

⁵ Fabian Pedregosa, Armin Askari, Geoffrey Negiar, and Martin Jaggi (2018). “Step-Size Adaptivity in Projection-Free Optimization”. In: Submitted.
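A minimal Python sketch of how step 3 could be carried out by backtracking on Lt; the warm start and the factors 0.9 and 2 are illustrative assumptions, not the constants used in the paper.

import numpy as np

def adafw_step(f, grad_f, x, s, lipschitz_prev):
    # One AdaFW update: find L_t verifying the sufficient decrease (1),
    # then take the step x_{t+1} = (1 - gamma_t) x_t + gamma_t s_t.
    d = s - x
    d_norm2 = np.dot(d, d)
    if d_norm2 == 0:                    # x_t is already the FW vertex
        return x, lipschitz_prev
    g_d = np.dot(grad_f(x), d)          # <grad f(x_t), s_t - x_t>, non-positive
    lipschitz = 0.9 * lipschitz_prev    # optimistic warm start (assumed heuristic)
    while True:
        gamma = min(-g_d / (lipschitz * d_norm2), 1.0)
        x_next = x + gamma * d
        # sufficient decrease condition (1)
        if f(x_next) <= f(x) + gamma * g_d + 0.5 * gamma**2 * lipschitz * d_norm2:
            return x_next, lipschitz
        lipschitz *= 2.0                # (1) failed: increase the estimate and retry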
The Adaptive FW algorithm
[Figure: the quadratic model f(xt) + γ⟨∇f(xt), st − xt⟩ + (γ²Lt/2)‖st − xt‖² plotted against f((1 − γ)xt + γst) on γ ∈ [0, 1]; γt marks the minimizer of the model.]
• Worst case, Lt = L. Often Lt ≪ L ⟹ larger step-size.
• Adaptivity to local geometry.
• Two extra function evaluations per iteration. Often given as a byproduct of the gradient.
Extension to other FW variants
Zig-zagging phenomenon in FW
The Frank-Wolfe algorithm zig-zags when the solution lies on a face of the boundary.
Some FW variants have been developed to address this issue.
Away-steps FW, informal
The Away-steps FW algorithm (Wolfe, 1970; Guélat and Marcotte, 1986) adds the possibility to move away from an active atom.
Away-steps FW algorithm
Keep an active set St = vertices that have been previously selected and have non-zero weight.
Algorithm 5: Away-Steps FW
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   vt ∈ arg max_{v∈St} ⟨∇f(xt), v⟩
4   if ⟨−∇f(xt), st − xt⟩ ≥ ⟨−∇f(xt), xt − vt⟩ then
5     dt = st − xt  (FW step)
6   else
7     dt = xt − vt  (Away step)
8   Find γt by line-search: γt ∈ arg min_{γ∈[0,γt^max]} f(xt + γdt)
9   xt+1 = xt + γtdt
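To illustrate the FW-step / away-step choice in lines 2–7, a small Python sketch assuming the active set is stored as a dict mapping vertices (as tuples) to their weights; the away-step cap alpha_v / (1 − alpha_v) is the usual convention from the literature, not something stated on the slide.

import numpy as np

def away_or_fw_direction(grad, x, lmo, active_set):
    # active_set: dict {vertex (tuple): weight > 0}, the support of x.
    s = lmo(grad)                                           # FW vertex
    v_key = max(active_set, key=lambda a: np.dot(grad, a))  # away vertex: most aligned with grad
    v = np.asarray(v_key)
    if np.dot(-grad, s - x) >= np.dot(-grad, x - v):
        return s - x, 1.0                                   # FW step, gamma_max = 1
    alpha_v = active_set[v_key]
    gamma_max = alpha_v / (1.0 - alpha_v) if alpha_v < 1 else np.inf
    return x - v, gamma_max                                 # away step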
Pairwise FW
Key idea
Move weight mass between two atoms in each step.
Proposed by Lacoste-Julien and Jaggi (2015), inspired by the MDM algorithm (Mitchell, Demyanov, and Malozemov, 1974).
Algorithm 6: Pairwise FW
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   vt ∈ arg max_{s∈St} ⟨∇f(xt), s⟩
4   dt = st − vt
5   Find γt by line-search: γt ∈ arg min_{γ∈[0,γt^max]} f(xt + γdt)
6   xt+1 = xt + γtdt
Away-steps FW and Pairwise FW
Convergence of Away-steps and Pairwise FW
• Linear convergence for strongly convex functions on polytopes (Lacoste-Julien and Jaggi, 2015).
• Can we design variants with sufficient decrease?
Introducing Adaptive Away-steps and Adaptive Pairwise
Choose Lt such that it verifies
  f(xt + γtdt) ≤ f(xt) + γt⟨∇f(xt), dt⟩ + (γt²Lt/2)‖dt‖²
with γt = min{ ⟨−∇f(xt), dt⟩ / (Lt‖dt‖²), 1 }
Adaptive Pairwise FW
Algorithm 7: Adaptive Pairwise FW
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   vt ∈ arg max_{s∈St} ⟨∇f(xt), s⟩
4   dt = st − vt
5   Find Lt that verifies the sufficient decrease condition (2), with
6   γt = min{ ⟨−∇f(xt), dt⟩ / (Lt‖dt‖²), 1 }
7   xt+1 = xt + γtdt

  f(xt + γtdt) ≤ f(xt) + γt⟨∇f(xt), dt⟩ + (γt²Lt/2)‖dt‖²   (2)
Theory for Adaptive Step-size variants
Strongly convex f
Pairwise and Away-steps converge linearly on a polytope. For each “good step” we have:
  f(xt+1) − f(x*) ≤ (1 − ρ)(f(xt) − f(x*))
Convex f
For all FW variants, f(xt) − f(x*) ≤ O(1/t)
Non-convex f
For all FW variants, max_{s∈D} ⟨∇f(xt), xt − s⟩ ≤ O(1/√t)
Experiments
Experiments RCV1
Problem: ℓ1-constrained logistic regression
  arg min_{‖x‖₁≤α} (1/n) Σᵢ₌₁ⁿ ϕ(aᵢᵀx, bᵢ)  with ϕ = logistic loss.

Dataset | dimension | density | Lt/L
RCV1    | 47,236    | 10⁻³    | 1.3 × 10⁻²

[Figure: objective minus optimum vs time (in seconds), for ℓ1 ball radius 100, 200 and 300; methods: AdaFW, AdaPFW, AdaAFW, FW, PFW, AFW, D-FW.]
Experiments Madelon
Problem: ℓ1-constrained logistic regression
  arg min_{‖x‖₁≤α} (1/n) Σᵢ₌₁ⁿ ϕ(aᵢᵀx, bᵢ)  with ϕ = logistic loss.

Dataset | dimension | density | Lt/L
Madelon | 500       | 1.0     | 3.3 × 10⁻³

[Figure: objective minus optimum vs time (in seconds), for ℓ1 ball radius 13, 20 and 30; methods: AdaFW, AdaPFW, AdaAFW, FW, PFW, AFW, D-FW.]
Experiments MovieLens 1M
Problem: trace-norm constrained robust matrix completion
  arg min_{‖X‖*≤α} (1/|B|) Σ_{(i,j)∈B} h(Xᵢⱼ, Aᵢⱼ)  with h = Huber loss.

Dataset      | dimension  | density | Lt/L
MovieLens 1M | 22,393,987 | 0.04    | 1.1 × 10⁻²

[Figure: objective minus optimum vs time (in seconds), for trace ball radius 300, 350 and 400; methods: Adaptive FW, FW, D-FW.]
Other applications
Proximal Splitting
Building a quadratic upper bound is common in proximal gradient descent (Beck and Teboulle, 2009) (Nesterov, 2013).
Recently extended to the Davis-Yin three operator splitting⁶ for problems of the form
  minimize f(x) + g(x) + h(x)
with access to ∇f, prox_{γg}, prox_{γh}.
Key insight: verify a sufficient decrease condition of the form
  f(xt+1) ≤ f(zt) + ⟨∇f(zt), xt+1 − zt⟩ + (1/(2γt))‖xt+1 − zt‖²
⁶ Fabian Pedregosa and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”. In: Proceedings of the 35th International Conference on Machine Learning.
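The same sufficient-decrease idea in a simpler proximal setting: a generic backtracking proximal gradient step, not the full adaptive three operator splitting of the paper. The prox_g signature and the 1.1 / 0.5 factors are assumptions for the sketch.

import numpy as np

def prox_gradient_step(f, grad_f, prox_g, x, step_prev):
    # One forward-backward step on f + g with backtracking on the step-size gamma_t.
    step = 1.1 * step_prev                   # optimistic warm start (assumed)
    g = grad_f(x)
    while True:
        x_next = prox_g(x - step * g, step)  # forward step on f, backward step on g
        diff = x_next - x
        # sufficient decrease: f(x_next) <= f(x) + <grad f(x), diff> + ||diff||^2 / (2 gamma_t)
        if f(x_next) <= f(x) + np.dot(g, diff) + np.dot(diff, diff) / (2 * step):
            return x_next, step
        step *= 0.5                          # bound violated: shrink gamma_t and retry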
Nearly-isotonic penalty
Problem
  arg min_x loss(x) + λ Σᵢ₌₁^{p−1} max{xᵢ − xᵢ₊₁, 0}
[Figure: estimated coefficients vs ground truth for λ = 10⁻⁶, 10⁻³, 0.01, 0.1, and objective minus optimum vs time (in seconds) for Adaptive TOS (variants 1 and 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG and Adaptive PDHG.]
Overlapping group lasso penalty
Problem
  arg min_x loss(x) + λ Σ_{g∈G} ‖[x]g‖₂
[Figure: estimated coefficients vs ground truth for λ = 10⁻⁶, 10⁻³, 0.01, 0.1, and objective minus optimum vs time (in seconds) for the same methods as above.]
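For concreteness, the two penalties used in these experiments can be evaluated as follows (a small numpy sketch; representing the groups as a list of index arrays is an assumption):

import numpy as np

def nearly_isotonic_penalty(x, lam):
    # lam * sum_i max(x_i - x_{i+1}, 0): penalizes decreases between consecutive coefficients
    return lam * np.maximum(x[:-1] - x[1:], 0.0).sum()

def overlapping_group_lasso_penalty(x, groups, lam):
    # lam * sum over groups g of ||x_g||_2; the groups may overlap
    return lam * sum(np.linalg.norm(x[g]) for g in groups)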
Perspectives
Stochastic optimization
Problem
  arg min_{x∈ℝᵈ} (1/n) Σᵢ₌₁ⁿ fᵢ(x)
Heuristic from⁷ to estimate L by verifying at each iteration t
  fᵢ(xt − (1/L)∇fᵢ(xt)) ≤ fᵢ(xt) − (1/(2L))‖∇fᵢ(xt)‖²
with i a random index sampled at iteration t.
This is the L-smooth inequality with y = xt − (1/L)∇fᵢ(xt), x = xt.
⁷ Mark Schmidt, Nicolas Le Roux, and Francis Bach (2017). “Minimizing finite sums with the stochastic average gradient”. In: Mathematical Programming.
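A minimal Python sketch of this heuristic; doubling L when the check fails and slowly decreasing it between iterations (by 2^(−1/n)) are the choices commonly used with SAG, kept here as assumptions.

import numpy as np

def update_lipschitz_estimate(f_i, grad_f_i, x, lipschitz, n_samples):
    # For one sampled f_i, increase L until
    # f_i(x - (1/L) grad f_i(x)) <= f_i(x) - ||grad f_i(x)||^2 / (2L) holds.
    g = grad_f_i(x)
    g_norm2 = np.dot(g, g)
    while f_i(x - g / lipschitz) > f_i(x) - g_norm2 / (2.0 * lipschitz):
        lipschitz *= 2.0                       # estimate too small: double it
    # slowly forget past increases so the estimate can also decrease over time
    return lipschitz * 2.0 ** (-1.0 / n_samples)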
Experiments stochastic line search
Can we prove convergence for such (or similar) stochastic adaptive step-sizes?
Conclusion
• Sufficient decrease condition to set step-size in FW and
variants.
• (Mostly) Hyperparameter-free, adaptivity to local geometry.
• Applications in proximal splitting and stochastic optimization.
Thanks for your attention
References
Beck, Amir and Marc Teboulle (2009). “Gradient-based algorithms with applications to signal recovery”. In: Convex optimization in signal processing and communications.
Demyanov, Vladimir and Aleksandr Rubinov (1970). Approximate methods in optimization problems (translated from Russian).
Guélat, Jacques and Patrice Marcotte (1986). “Some comments on Wolfe’s away step”. In: Mathematical Programming.
Kerdreux, Thomas, Fabian Pedregosa, and Alexandre d’Aspremont (2018). “Frank-Wolfe with Subsampling Oracle”. In: Proceedings of the 35th International Conference on Machine Learning.
Lacoste-Julien, Simon and Martin Jaggi (2015). “On the global linear convergence of Frank-Wolfe optimization variants”. In: Advances in Neural Information Processing Systems.
Mitchell, BF, Vladimir Fedorovich Demyanov, and VN Malozemov (1974). “Finding the point of a polyhedron closest to the origin”. In: SIAM Journal on Control.
Nesterov, Yu (2013). “Gradient methods for minimizing composite functions”. In: Mathematical Programming.
Niculae, Vlad et al. (2018). “SparseMAP: Differentiable Sparse Structured Inference”. In: International Conference on Machine Learning.
Pedregosa, Fabian et al. (2018). “Step-Size Adaptivity in Projection-Free Optimization”. In: Submitted.
Pedregosa, Fabian and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”. In: Proceedings of the 35th International Conference on Machine Learning.
Ping, Wei, Qiang Liu, and Alexander T Ihler (2016). “Learning Infinite RBMs with Frank-Wolfe”. In: Advances in Neural Information Processing Systems.
Schmidt, Mark, Nicolas Le Roux, and Francis Bach (2017). “Minimizing finite sums with the stochastic average gradient”. In: Mathematical Programming.
Wolfe, Philip (1970). “Convergence theory in nonlinear programming”. In: Integer and nonlinear programming.

Sufficient decrease is all you need

  • 1. Sufficient decrease is all you need A simple condition to forget about the step-size, with applications to the Frank-Wolfe algorithm. Fabian Pedregosa June 4th, 2018. Google Brain Montreal
  • 2. Where I Come From ML/Optimization/Software Guy Engineer (2010–2012) First contact with ML: develop ML library (scikit-learn). ML and NeuroScience (2012–2015) PhD applying ML to neuroscience. ML and Optimization (2015–) Stochastic, Parallel, Constrained, Hyperparameter optimization. 1/30
  • 4. Outline Motivation: eliminate step-size parameter. 1. Frank-Wolfe, A method for constrained optimization. 2. Adaptive Frank-Wolfe. Frank-Wolfe without the step-size. 3. Perspectives. Other applications: proximal splitting, stochastic optimization. 2/30
  • 5. Outline Motivation: eliminate step-size parameter. 1. Frank-Wolfe, A method for constrained optimization. 2. Adaptive Frank-Wolfe. Frank-Wolfe without the step-size. 3. Perspectives. Other applications: proximal splitting, stochastic optimization. With a little help from my collaborators Armin Askari (UC Berkeley) Geoffrey N´egiar (UC Berkeley) Martin Jaggi (EPFL) Gauthier Gidel (UdeM) 2/30
  • 7. The Frank-Wolfe (FW) algorithm, aka conditional gradient Problem: smooth f , compact D arg min x∈D f (x) Algorithm 1: Frank-Wolfe (FW) 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 Find γt by line-search: γt ∈ arg minγ∈[0,1] f ((1−γ)xt +γst) 4 xt+1 = (1 − γt)xt + γtst 3/30
  • 8. The Frank-Wolfe (FW) algorithm, aka conditional gradient Problem: smooth f , compact D arg min x∈D f (x) Algorithm 1: Frank-Wolfe (FW) 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 Find γt by line-search: γt ∈ arg minγ∈[0,1] f ((1−γ)xt +γst) 4 xt+1 = (1 − γt)xt + γtst 3/30
  • 9. The Frank-Wolfe (FW) algorithm, aka conditional gradient Problem: smooth f , compact D arg min x∈D f (x) Algorithm 1: Frank-Wolfe (FW) 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 Find γt by line-search: γt ∈ arg minγ∈[0,1] f ((1−γ)xt +γst) 4 xt+1 = (1 − γt)xt + γtst 3/30
  • 10. The Frank-Wolfe (FW) algorithm, aka conditional gradient Problem: smooth f , compact D arg min x∈D f (x) Algorithm 1: Frank-Wolfe (FW) 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 Find γt by line-search: γt ∈ arg minγ∈[0,1] f ((1−γ)xt +γst) 4 xt+1 = (1 − γt)xt + γtst 3/30
  • 11. The Frank-Wolfe (FW) algorithm, aka conditional gradient Problem: smooth f , compact D arg min x∈D f (x) Algorithm 1: Frank-Wolfe (FW) 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 Find γt by line-search: γt ∈ arg minγ∈[0,1] f ((1−γ)xt +γst) 4 xt+1 = (1 − γt)xt + γtst 3/30
  • 12. Why people ♥ Frank-Wolfe • Projection-free. Only linear subproblems arise vs quadratic for projection. 4/30
  • 13. Why people ♥ Frank-Wolfe • Projection-free. Only linear subproblems arise vs quadratic for projection. • Solution of linear subproblem is always extremal element of D. 4/30
  • 14. Why people ♥ Frank-Wolfe • Projection-free. Only linear subproblems arise vs quadratic for projection. • Solution of linear subproblem is always extremal element of D. • Iterates admit sparse representation = xt convex combination of at most t elements. 4/30
  • 15. Recent applications of Frank-Wolfe • Learning the structure of a neural network.1 • Attention mechanisms that enforce sparsity.2 • 1-constrained problems with extreme number of features.3 1 Wei Ping, Qiang Liu, and Alexander T Ihler (2016). “Learning Infinite RBMs with Frank-Wolfe”. In: Advances in Neural Information Processing Systems. 2 Vlad Niculae et al. (2018). “SparseMAP: Differentiable Sparse Structured Inference”. In: International Conference on Machine Learning. 3 Thomas Kerdreux, Fabian Pedregosa, and Alexandre d’Aspremont (2018). “Frank-Wolfe with Subsampling Oracle”. In: Proceedings of the 35th International Conference on Machine Learning. 5/30
  • 16. A practical issue • Line-search only efficient when closed form exists (quadratic objective). • Step-size γt = 2/(t + 2) is convergent, but extremely slow. Algorithm 2: Frank-Wolfe (FW) 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 Find γt by line-search: γt ∈ arg minγ∈[0,1] f ((1−γ)xt +γst) 4 xt+1 = (1 − γt)xt + γtst 6/30
  • 17. A practical issue • Line-search only efficient when closed form exists (quadratic objective). • Step-size γt = 2/(t + 2) is convergent, but extremely slow. Algorithm 2: Frank-Wolfe (FW) 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 Find γt by line-search: γt ∈ arg minγ∈[0,1] f ((1−γ)xt +γst) 4 xt+1 = (1 − γt)xt + γtst Can we do better? 6/30
  • 19. Down the citation rabbit hole 4 Vladimir Demyanov Alexsandr Rubinov 4 Vladimir Demyanov and Aleksandr Rubinov (1970). Approximate methods in optimization problems (translated from Russian). 7/30
  • 20. Down the citation rabbit hole 4 Vladimir Demyanov Alexsandr Rubinov 4 Vladimir Demyanov and Aleksandr Rubinov (1970). Approximate methods in optimization problems (translated from Russian). 7/30
  • 21. The Demyanov-Rubinov (DR) Frank-Wolfe variant Problem: smooth objective, compact domain arg min x∈D f (x), where f is L-smooth . (L-smooth ≡ differentiable with L-Lipschitz gradient). • Step-size depends on the correlation between − f (xt) and the descent direction st − xt. Algorithm 3: FW, DR variant 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 γt =min − f (xt), st − xt L st − xt 2 , 1 4 xt+1 = (1 − γt)xt + γtst 8/30
  • 22. The Demyanov-Rubinov (DR) Frank-Wolfe variant Where does γt =min − f (xt), st − xt L st − xt 2 , 1 come from? 9/30
  • 23. The Demyanov-Rubinov (DR) Frank-Wolfe variant Where does γt =min − f (xt), st − xt L st − xt 2 , 1 come from? L-smooth inequality Any L-smooth function f verifies f (y) ≤ f (x) + f (x), y − x + L 2 x − y 2 , for all x, y in the domain. 9/30
  • 24. The Demyanov-Rubinov (DR) Frank-Wolfe variant Where does γt =min − f (xt), st − xt L st − xt 2 , 1 come from? L-smooth inequality Any L-smooth function f verifies f (y) ≤ f (x) + f (x), y − x + L 2 x − y 2 :=Qx (y) , for all x, y in the domain. • The right hand side is a quadratic upper bound Qx (y) f (y) 9/30
  • 25. Justification of the step-size • L-smooth inequality at xt+1(γ) = (1 − γ)xt + γst, x = xt gives f (xt+1(γ)) ≤ f (xt) − γ f (xt), st − xt + γ2L 2 st − xt 2 10/30
  • 26. Justification of the step-size • L-smooth inequality at xt+1(γ) = (1 − γ)xt + γst, x = xt gives f (xt+1(γ)) ≤ f (xt) − γ f (xt), st − xt + γ2L 2 st − xt 2 • Minimizing right hand side on γ ∈ [0, 1] gives γ =min − f (xt), st − xt L st − xt 2 , 1 , = Demyanov-Rubinov step-size! 10/30
  • 27. Justification of the step-size • L-smooth inequality at xt+1(γ) = (1 − γ)xt + γst, x = xt gives f (xt+1(γ)) ≤ f (xt) − γ f (xt), st − xt + γ2L 2 st − xt 2 • Minimizing right hand side on γ ∈ [0, 1] gives γ =min − f (xt), st − xt L st − xt 2 , 1 , = Demyanov-Rubinov step-size! • ≡ exact line search on the quadratic upper bound. 10/30
  • 28. Towards an Adaptive FW Quadratic upper bound The Demyanov-Rubinov makes use of a quadratic upper bound, but it is only evaluated at xt, xt+1. 11/30
  • 29. Towards an Adaptive FW Quadratic upper bound The Demyanov-Rubinov makes use of a quadratic upper bound, but it is only evaluated at xt, xt+1. Sufficient decrease is all you need L-smooth inequality can be replaced by f (xt+1) ≤ f (xt) − γt f (xt), st − xt + γ2 t Lt 2 st − xt 2 with γt =min − f (xt), st − xt Lt st − xt 2 , 1 11/30
  • 30. Towards an Adaptive FW Quadratic upper bound The Demyanov-Rubinov makes use of a quadratic upper bound, but it is only evaluated at xt, xt+1. Sufficient decrease is all you need L-smooth inequality can be replaced by f (xt+1) ≤ f (xt) − γt f (xt), st − xt + γ2 t Lt 2 st − xt 2 with γt =min − f (xt), st − xt Lt st − xt 2 , 1 Key difference with DR: L is replaced by Lt. Potentially Lt L. 11/30
  • 31. The Adaptive FW algorithm New FW variant with adaptive step-size.5 Algorithm 4: The Adaptive Frank-Wolfe algorithm (AdaFW) 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 Find Lt that verifies sufficient decrease (1), with 4 γt =min − f (xt), st − xt Lt st − xt 2 , 1 5 xt+1 = (1 − γt)xt + γtst f (xt+1) ≤ f (xt) + γt f (xt), st − xt + γ2 t Lt 2 st − xt 2 (1) 5 Fabian Pedregosa, Armin Askari, Geoffrey Negiar, and Martin Jaggi (2018). “Step-Size Adaptivity in Projection-Free Optimization”. In: Submitted. 12/30
  • 32. The Adaptive FW algorithm γ =0 γ = 1γt f (xt) + γ f (xt), st − xt + γ2Lt 2 st − xt 2 f ((1 − γ)xt + γst) • Worst-case, Lt = L. Often Lt L =⇒ larger step-size. 13/30
  • 33. The Adaptive FW algorithm γ =0 γ = 1γt f (xt) + γ f (xt), st − xt + γ2Lt 2 st − xt 2 f ((1 − γ)xt + γst) • Worst-case, Lt = L. Often Lt L =⇒ larger step-size. • Adaptivity to local geometry. 13/30
  • 34. The Adaptive FW algorithm γ =0 γ = 1γt f (xt) + γ f (xt), st − xt + γ2Lt 2 st − xt 2 f ((1 − γ)xt + γst) • Worst-case, Lt = L. Often Lt L =⇒ larger step-size. • Adaptivity to local geometry. • Two extra function evaluations per iteration. Often given as byproduct of gradient. 13/30
  • 35. Extension to other FW variants
  • 36. Zig-Zagging phenomena in FW The Frank-Wolfe algorithm zig-zags when the solution lies in a face of the boundary. Some FW variants have been developed to address this issue. 14/30
  • 37. Away-steps FW, informal Away-steps FW algorithm (Wolfe, 1970) (Gu´elat and Marcotte, 1986) adds the possibility to move away from an active atom. 15/30
  • 38. Away-steps FW algorithm Keep active set St = active vertices that have been previously selected and have non-zero weight. 16/30
  • 39. Away-steps FW algorithm Keep active set St = active vertices that have been previously selected and have non-zero weight. Algorithm 5: Away-Steps FW 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 vt ∈ arg maxv∈St f (xt), v 4 if − f (xt), st − xt ≥ − f (xt), xt − vt then 5 dt = st − xt, FW step 6 else 7 dt = xt − vt, Away step 8 Find γt by line-search: γt ∈ arg minγ∈[0,γmax t ] f (xt +γdt) 9 xt+1 = xt + γtdt 16/30
  • 40. Away-steps FW algorithm Keep active set St = active vertices that have been previously selected and have non-zero weight. Algorithm 5: Away-Steps FW 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 vt ∈ arg maxv∈St f (xt), v 4 if − f (xt), st − xt ≥ − f (xt), xt − vt then 5 dt = st − xt, FW step 6 else 7 dt = xt − vt, Away step 8 Find γt by line-search: γt ∈ arg minγ∈[0,γmax t ] f (xt +γdt) 9 xt+1 = xt + γtdt 16/30
  • 41. Away-steps FW algorithm Keep active set St = active vertices that have been previously selected and have non-zero weight. Algorithm 5: Away-Steps FW 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 vt ∈ arg maxv∈St f (xt), v 4 if − f (xt), st − xt ≥ − f (xt), xt − vt then 5 dt = st − xt, FW step 6 else 7 dt = xt − vt, Away step 8 Find γt by line-search: γt ∈ arg minγ∈[0,γmax t ] f (xt +γdt) 9 xt+1 = xt + γtdt 16/30
  • 42. Away-steps FW algorithm Keep active set St = active vertices that have been previously selected and have non-zero weight. Algorithm 5: Away-Steps FW 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 vt ∈ arg maxv∈St f (xt), v 4 if − f (xt), st − xt ≥ − f (xt), xt − vt then 5 dt = st − xt, FW step 6 else 7 dt = xt − vt, Away step 8 Find γt by line-search: γt ∈ arg minγ∈[0,γmax t ] f (xt +γdt) 9 xt+1 = xt + γtdt 16/30
  • 43. Away-steps FW algorithm Keep active set St = active vertices that have been previously selected and have non-zero weight. Algorithm 5: Away-Steps FW 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 vt ∈ arg maxv∈St f (xt), v 4 if − f (xt), st − xt ≥ − f (xt), xt − vt then 5 dt = st − xt, FW step 6 else 7 dt = xt − vt, Away step 8 Find γt by line-search: γt ∈ arg minγ∈[0,γmax t ] f (xt +γdt) 9 xt+1 = xt + γtdt 16/30
• 44. Pairwise FW
Key idea: move weight mass between two atoms in each step.
Proposed by (Lacoste-Julien and Jaggi, 2015), inspired by the MDM algorithm (Mitchell, Demyanov, and Malozemov, 1974).
Algorithm 6: Pairwise FW
1 for t = 0, 1, . . . do
2   s_t ∈ arg min_{s∈D} ⟨∇f(x_t), s⟩
3   v_t ∈ arg max_{s∈S_t} ⟨∇f(x_t), s⟩
4   d_t = s_t − v_t
5   Find γ_t by line-search: γ_t ∈ arg min_{γ∈[0, γ_t^max]} f(x_t + γ d_t)
6   x_{t+1} = x_t + γ_t d_t
17/30
• 47. Away-steps FW and Pairwise FW
Convergence of Away-steps and Pairwise FW
• Linear convergence for strongly convex functions on polytopes (Lacoste-Julien and Jaggi, 2015).
• Can we design variants with sufficient decrease?
Introducing Adaptive Away-steps and Adaptive Pairwise
Choose L_t such that it verifies
  f(x_t + γ_t d_t) ≤ f(x_t) + γ_t ⟨∇f(x_t), d_t⟩ + (γ_t² L_t / 2) ‖d_t‖²,
with γ_t = min{ ⟨−∇f(x_t), d_t⟩ / (L_t ‖d_t‖²), 1 }
18/30
• 48. Adaptive Pairwise FW
Algorithm 7: Adaptive Pairwise FW
1 for t = 0, 1, . . . do
2   s_t ∈ arg min_{s∈D} ⟨∇f(x_t), s⟩
3   v_t ∈ arg max_{s∈S_t} ⟨∇f(x_t), s⟩
4   d_t = s_t − v_t
5   Find L_t that verifies sufficient decrease (2), with
6   γ_t = min{ ⟨−∇f(x_t), d_t⟩ / (L_t ‖d_t‖²), 1 }
7   x_{t+1} = x_t + γ_t d_t
Sufficient decrease condition:
  f(x_t + γ_t d_t) ≤ f(x_t) + γ_t ⟨∇f(x_t), d_t⟩ + (γ_t² L_t / 2) ‖d_t‖²   (2)
19/30
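A minimal sketch of one adaptive pairwise iteration follows, including the weight transfer between the two atoms. The active-set bookkeeping (keys, the clipping of the step at the away atom's weight) is an assumption made for illustration; it is the natural bound for a pairwise step but is not spelled out on the slide.

```python
import numpy as np

def ada_pairwise_step(x, f, grad, lmo, active, L, tau=2.0, eta=0.9):
    """One iteration of Adaptive Pairwise FW (sketch under assumptions).

    `active` maps a hashable key of each atom in S_t to (atom, weight) and is
    assumed to satisfy x = sum_k weight_k * atom_k.
    """
    g = grad(x)
    s = lmo(g)                                          # FW atom
    k_away = max(active, key=lambda k: g @ active[k][0])
    v, alpha_v = active[k_away]
    d = s - v                                           # pairwise direction
    gap = -(g @ d)
    fx = f(x)

    L *= eta                                            # backtracking on L_t
    while True:
        gamma = min(gap / (L * (d @ d)), alpha_v)       # step clipped at v's weight
        x_next = x + gamma * d
        if f(x_next) <= fx + gamma * (g @ d) + 0.5 * L * gamma**2 * (d @ d):
            break
        L *= tau
    # transfer `gamma` units of weight from the away atom to the FW atom
    active[k_away] = (v, alpha_v - gamma)
    if active[k_away][1] <= 1e-12:
        del active[k_away]                              # drop exhausted atoms
    k_s = tuple(np.round(s, 12))                        # illustrative hashable key
    atom_s, alpha_s = active.get(k_s, (s, 0.0))
    active[k_s] = (atom_s, alpha_s + gamma)
    return x_next, L
```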
• 51. Theory for Adaptive Step-size variants
Strongly convex f: Pairwise and Away-steps converge linearly on a polytope. For each “good step” we have
  f(x_{t+1}) − f(x*) ≤ (1 − ρ)(f(x_t) − f(x*))
Convex f: for all FW variants,
  f(x_t) − f(x*) ≤ O(1/t)
Non-convex f: for all FW variants,
  max_{s∈D} ⟨∇f(x_t), x_t − s⟩ ≤ O(1/√t)
20/30
• 53. Experiments: RCV1
Problem: ℓ1-constrained logistic regression
  arg min_{‖x‖₁ ≤ α} (1/n) Σ_{i=1}^{n} ϕ(a_iᵀ x, b_i),  with ϕ = logistic loss.
Dataset: RCV1 | dimension: 47,236 | density: 10⁻³ | L_t/L: 1.3 × 10⁻²
[Figure: objective minus optimum vs. time (s) for ℓ1-ball radius 100, 200 and 300; methods: AdaFW, AdaPFW, AdaAFW, FW, PFW, AFW, D-FW.]
21/30
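For this problem the linear subproblem is closed-form, which is what makes FW-type methods attractive here. The sketch below shows the standard ℓ1-ball oracle; the function name is mine, not from the slides.

```python
import numpy as np

def lmo_l1_ball(g, alpha):
    """Linear minimization oracle over the l1 ball of radius alpha (sketch).

    argmin_{||s||_1 <= alpha} <g, s> is attained at a signed, scaled coordinate
    vector: -alpha * sign(g_i) * e_i with i = argmax_j |g_j|, which is why the
    FW iterates on this problem stay sparse (at most t nonzeros after t steps).
    """
    i = np.argmax(np.abs(g))
    s = np.zeros_like(g)
    s[i] = -alpha * np.sign(g[i])
    return s
```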
• 54. Experiments: Madelon
Problem: ℓ1-constrained logistic regression
  arg min_{‖x‖₁ ≤ α} (1/n) Σ_{i=1}^{n} ϕ(a_iᵀ x, b_i),  with ϕ = logistic loss.
Dataset: Madelon | dimension: 500 | density: 1.0 | L_t/L: 3.3 × 10⁻³
[Figure: objective minus optimum vs. time (s) for ℓ1-ball radius 13, 20 and 30; methods: AdaFW, AdaPFW, AdaAFW, FW, PFW, AFW, D-FW.]
22/30
• 55. Experiments: MovieLens 1M
Problem: trace-norm constrained robust matrix completion
  arg min_{‖X‖_* ≤ α} (1/|B|) Σ_{(i,j)∈B} h(X_{i,j}, A_{i,j}),  with h = Huber loss.
Dataset: MovieLens 1M | dimension: 22,393,987 | density: 0.04 | L_t/L: 1.1 × 10⁻²
[Figure: objective minus optimum vs. time (s) for trace ball radius 300, 350 and 400; methods: Adaptive FW, FW, D-FW.]
23/30
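On the trace-norm ball the linear subproblem only needs the leading singular pair of the gradient, which is why FW scales to this matrix size. A sketch of that standard oracle, assuming SciPy's truncated SVD (names are mine):

```python
import numpy as np
from scipy.sparse.linalg import svds

def lmo_trace_ball(G, alpha):
    """Linear minimization oracle over the trace-norm ball (sketch).

    argmin_{||S||_* <= alpha} <G, S> = -alpha * u1 v1^T, with (u1, v1) the
    leading singular vectors of the gradient G, so each FW step costs a
    rank-1 truncated SVD instead of a projection onto the trace-norm ball.
    """
    u, _, vt = svds(G, k=1)                  # leading singular triplet
    return -alpha * np.outer(u[:, 0], vt[0])
```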
• 57. Proximal Splitting
Building a quadratic upper bound is common in proximal gradient descent (Beck and Teboulle, 2009) (Nesterov, 2013). Recently extended to the Davis-Yin three operator splitting:6
  minimize f(x) + g(x) + h(x),  with access to ∇f, prox_{γg}, prox_{γh}.
Key insight: verify a sufficient decrease condition of the form
  f(x_{t+1}) ≤ f(z_t) + ⟨∇f(z_t), x_{t+1} − z_t⟩ + (1 / (2γ_t)) ‖x_{t+1} − z_t‖²
6 Fabian Pedregosa and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”. In: Proceedings of the 35th International Conference on Machine Learning.
24/30
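As a rough illustration, the sketch below augments a standard Davis-Yin step with the decrease test above: the step-size is shrunk until the quadratic bound between z_t and x_{t+1} holds. The grow/shrink schedule and variable names are assumptions; this is not the exact update rule of the paper.

```python
import numpy as np

def adaptive_tos_step(y, f, grad_f, prox_g, prox_h, gamma, shrink=0.5, grow=1.02):
    """One Davis-Yin step with the sufficient-decrease test (sketch).

    prox_g(v, gamma) and prox_h(v, gamma) are the proximal operators of g and h.
    """
    gamma *= grow                           # tentatively try a larger step-size
    while True:
        z = prox_h(y, gamma)
        u = (y - z) / gamma                 # subgradient of h selected by prox_h
        x = prox_g(z - gamma * (u + grad_f(z)), gamma)
        diff = x - z
        # sufficient decrease: quadratic bound on f between z_t and x_{t+1}
        if f(x) <= f(z) + grad_f(z) @ diff + (diff @ diff) / (2 * gamma):
            break
        gamma *= shrink                     # condition violated: shrink and redo
    return y + x - z, x, gamma              # auxiliary variable, iterate, step-size
```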
• 58. Nearly-isotonic penalty
Problem:
  arg min_x loss(x) + λ Σ_{i=1}^{p−1} max{x_i − x_{i+1}, 0}
[Figure: estimated coefficients vs. ground truth and objective minus optimum vs. time (s) for λ = 10⁻⁶, 10⁻³, 0.01, 0.1; methods: Adaptive TOS (variants 1 and 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG, Adaptive PDHG.]
25/30
• 59. Overlapping group lasso penalty
Problem:
  arg min_x loss(x) + λ Σ_{g∈G} ‖[x]_g‖₂
[Figure: estimated coefficients vs. ground truth and objective minus optimum vs. time (s) for λ = 10⁻⁶, 10⁻³, 0.01, 0.1; methods: Adaptive TOS (variants 1 and 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG, Adaptive PDHG.]
26/30
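One common way to fit this penalty into a three-term splitting is to partition the overlapping groups into two non-overlapping collections and let g and h each be a group-lasso penalty, whose prox is plain block soft-thresholding. That setup is an assumption about how such experiments are usually run, not a description taken from the slides; the sketch below only shows the non-overlapping prox.

```python
import numpy as np

def prox_group_lasso(v, gamma, groups, lam):
    """Prox of x -> lam * sum_g ||x_g||_2 for NON-overlapping groups (sketch).

    `groups` is a list of index arrays covering disjoint coordinates.
    """
    x = v.copy()
    for g in groups:
        norm = np.linalg.norm(v[g])
        scale = max(0.0, 1.0 - gamma * lam / norm) if norm > 0 else 0.0
        x[g] = scale * v[g]                 # block soft-thresholding
    return x
```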
• 62. Stochastic optimization
Problem:
  arg min_{x∈ℝᵈ} (1/n) Σ_{i=1}^{n} f_i(x)
Heuristic from7 to estimate L by verifying at each iteration t:
  f_i(x_t − (1/L) ∇f_i(x_t)) ≤ f_i(x_t) − (1/(2L)) ‖∇f_i(x_t)‖²,
with i a random index sampled at iteration t. This is the L-smoothness inequality instantiated at y = x_t − (1/L) ∇f_i(x_t), x = x_t.
7 Mark Schmidt, Nicolas Le Roux, and Francis Bach (2017). “Minimizing finite sums with the stochastic average gradient”. In: Mathematical Programming.
27/30
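A short sketch of this heuristic, in the spirit of Schmidt et al. (2017): double L until the sampled inequality holds, and slowly forget past estimates between iterations. The exact constants (`tau`, `forget`) and function signatures are illustrative assumptions.

```python
import numpy as np

def estimate_L_stochastic(x, f_i, grad_i, i, L, tau=2.0, forget=0.999):
    """SAG-style stochastic estimate of the smoothness constant (sketch).

    f_i(x, i) and grad_i(x, i) evaluate a single sampled component.
    """
    L *= forget                              # slowly forget past, possibly too large, estimates
    g = grad_i(x, i)
    sq_norm = g @ g
    # double L until the sampled L-smoothness inequality holds
    while f_i(x - g / L, i) > f_i(x, i) - sq_norm / (2.0 * L):
        L *= tau
    return L
```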
• 64. Experiments: stochastic line search Can we prove convergence for such (or similar) stochastic adaptive step-sizes? 28/30
  • 67. Conclusion • Sufficient decrease condition to set step-size in FW and variants. • (Mostly) Hyperparameter-free, adaptivity to local geometry. • Applications in proximal splitting and stochastic optimization. Thanks for your attention 29/30
• 68. References
Beck, Amir and Marc Teboulle (2009). “Gradient-based algorithms with applications to signal recovery”. In: Convex Optimization in Signal Processing and Communications.
Demyanov, Vladimir and Aleksandr Rubinov (1970). Approximate Methods in Optimization Problems (translated from Russian).
Guélat, Jacques and Patrice Marcotte (1986). “Some comments on Wolfe’s away step”. In: Mathematical Programming.
Kerdreux, Thomas, Fabian Pedregosa, and Alexandre d’Aspremont (2018). “Frank-Wolfe with Subsampling Oracle”. In: Proceedings of the 35th International Conference on Machine Learning.
Lacoste-Julien, Simon and Martin Jaggi (2015). “On the global linear convergence of Frank-Wolfe optimization variants”. In: Advances in Neural Information Processing Systems.
Mitchell, B.F., Vladimir Fedorovich Demyanov, and V.N. Malozemov (1974). “Finding the point of a polyhedron closest to the origin”. In: SIAM Journal on Control.
Nesterov, Yu (2013). “Gradient methods for minimizing composite functions”. In: Mathematical Programming.
Niculae, Vlad et al. (2018). “SparseMAP: Differentiable Sparse Structured Inference”. In: International Conference on Machine Learning.
Pedregosa, Fabian et al. (2018). “Step-Size Adaptivity in Projection-Free Optimization”. In: Submitted.
Pedregosa, Fabian and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”. In: Proceedings of the 35th International Conference on Machine Learning.
Ping, Wei, Qiang Liu, and Alexander T. Ihler (2016). “Learning Infinite RBMs with Frank-Wolfe”. In: Advances in Neural Information Processing Systems.
Schmidt, Mark, Nicolas Le Roux, and Francis Bach (2017). “Minimizing finite sums with the stochastic average gradient”. In: Mathematical Programming.
Wolfe, Philip (1970). “Convergence theory in nonlinear programming”. In: Integer and Nonlinear Programming.
30/30