1. Sufficient decrease is all you need
A simple condition to forget about the step-size, with
applications to the Frank-Wolfe algorithm.
Fabian Pedregosa
June 4th, 2018. Google Brain Montreal
2. Where I Come From
ML/Optimization/Software Guy
Engineer (2010–2012)
First contact with ML: developing the ML library scikit-learn.
ML and Neuroscience (2012–2015)
PhD applying ML to neuroscience.
ML and Optimization (2015–)
Stochastic, Parallel, Constrained,
Hyperparameter optimization.
5. Outline
Motivation: eliminate step-size parameter.
1. Frank-Wolfe: a method for constrained optimization.
2. Adaptive Frank-Wolfe: Frank-Wolfe without the step-size.
3. Perspectives: other applications (proximal splitting, stochastic optimization).
With a little help from my collaborators:
Armin Askari (UC Berkeley), Geoffrey Négiar (UC Berkeley), Martin Jaggi (EPFL), Gauthier Gidel (UdeM)
7. The Frank-Wolfe (FW) algorithm, aka conditional gradient
Problem: smooth f , compact D
arg min_{x∈D} f(x)

Algorithm 1: Frank-Wolfe (FW)
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   Find γt by line-search: γt ∈ arg min_{γ∈[0,1]} f((1−γ)xt + γst)
4   xt+1 = (1 − γt)xt + γtst
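To make the loop concrete, here is a minimal Python sketch of Algorithm 1, assuming (as an illustration, not from the talk) a least-squares objective and an ℓ1-ball constraint; for this pair the linear minimization oracle is a signed basis vector and exact line-search has a closed form.

import numpy as np

def frank_wolfe_l1(A, b, alpha, n_iter=100):
    # Frank-Wolfe for min_x 0.5 * ||A x - b||^2 subject to ||x||_1 <= alpha.
    n_features = A.shape[1]
    x = np.zeros(n_features)
    for _ in range(n_iter):
        residual = A @ x - b
        grad = A.T @ residual
        # Linear minimization oracle over the l1 ball: a signed, scaled basis vector.
        i = int(np.argmax(np.abs(grad)))
        s = np.zeros(n_features)
        s[i] = -alpha * np.sign(grad[i])
        d = s - x
        # Exact line-search in closed form for the quadratic objective.
        Ad = A @ d
        gamma = np.clip(-(residual @ Ad) / (Ad @ Ad + 1e-18), 0.0, 1.0)
        x = x + gamma * d
    return x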
14. Why people ♥ Frank-Wolfe
• Projection-free. Only linear subproblems arise vs quadratic
for projection.
• Solution of linear subproblem is always extremal element of
D.
• Iterates admit a sparse representation: xt is a convex combination of at most t elements.
15. Recent applications of Frank-Wolfe
• Learning the structure of a neural network.1
• Attention mechanisms that enforce sparsity.2
• ℓ1-constrained problems with an extreme number of features.3
1 Wei Ping, Qiang Liu, and Alexander T. Ihler (2016). “Learning Infinite RBMs with Frank-Wolfe”. In: Advances in Neural Information Processing Systems.
2 Vlad Niculae et al. (2018). “SparseMAP: Differentiable Sparse Structured Inference”. In: International Conference on Machine Learning.
3 Thomas Kerdreux, Fabian Pedregosa, and Alexandre d’Aspremont (2018). “Frank-Wolfe with Subsampling Oracle”. In: Proceedings of the 35th International Conference on Machine Learning.
17. A practical issue
• Line-search is only efficient when a closed form exists (e.g., quadratic objective).
• The step-size γt = 2/(t + 2) is convergent, but extremely slow.

Algorithm 2: Frank-Wolfe (FW)
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   Find γt by line-search: γt ∈ arg min_{γ∈[0,1]} f((1−γ)xt + γst)
4   xt+1 = (1 − γt)xt + γtst

Can we do better?
19. Down the citation rabbit hole
Vladimir Demyanov · Aleksandr Rubinov
4 Vladimir Demyanov and Aleksandr Rubinov (1970). Approximate methods in optimization problems (translated from Russian).
21. The Demyanov-Rubinov (DR) Frank-Wolfe variant
Problem: smooth objective, compact domain
arg min_{x∈D} f(x), where f is L-smooth
(L-smooth ≡ differentiable with L-Lipschitz gradient).

• Step-size depends on the correlation between −∇f(xt) and the descent direction st − xt.

Algorithm 3: FW, DR variant
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   γt = min{⟨−∇f(xt), st − xt⟩ / (L‖st − xt‖²), 1}
4   xt+1 = (1 − γt)xt + γtst
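In code, the Demyanov-Rubinov step-size is essentially a one-liner; a sketch assuming numpy arrays and that L (the global Lipschitz constant of ∇f) is supplied by the user.

import numpy as np

def dr_step_size(grad, x, s, L):
    # Minimizer over [0, 1] of the quadratic upper bound along the FW direction.
    d = s - x
    # <-grad, d> is non-negative when s comes from the linear minimization oracle.
    return min(max(-(grad @ d), 0.0) / (L * (d @ d) + 1e-18), 1.0)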
24. The Demyanov-Rubinov (DR) Frank-Wolfe variant
Where does γt = min{⟨−∇f(xt), st − xt⟩ / (L‖st − xt‖²), 1} come from?

L-smooth inequality
Any L-smooth function f verifies
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖²  =: Qx(y),
for all x, y in the domain.

• The right-hand side Qx(y) is a quadratic upper bound on f(y).
27. Justification of the step-size
• The L-smooth inequality at y = xt+1(γ) = (1 − γ)xt + γst, x = xt gives
f(xt+1(γ)) ≤ f(xt) + γ⟨∇f(xt), st − xt⟩ + (γ²L/2)‖st − xt‖²
• Minimizing the right-hand side over γ ∈ [0, 1] gives
γ = min{⟨−∇f(xt), st − xt⟩ / (L‖st − xt‖²), 1},
= the Demyanov-Rubinov step-size!
• ≡ exact line search on the quadratic upper bound.
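Making the minimization step explicit: the right-hand side is a quadratic in γ, so setting its derivative to zero and clipping to [0, 1] gives
d/dγ [f(xt) + γ⟨∇f(xt), st − xt⟩ + (γ²L/2)‖st − xt‖²] = ⟨∇f(xt), st − xt⟩ + γL‖st − xt‖² = 0
⟹ γ* = ⟨−∇f(xt), st − xt⟩ / (L‖st − xt‖²), and γ = min{γ*, 1}.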
30. Towards an Adaptive FW
Quadratic upper bound
The Demyanov-Rubinov step-size makes use of a quadratic upper bound, but it is only ever evaluated at xt, xt+1.

Sufficient decrease is all you need
The L-smooth inequality can be replaced by
f(xt+1) ≤ f(xt) + γt⟨∇f(xt), st − xt⟩ + (γt²Lt/2)‖st − xt‖²
with γt = min{⟨−∇f(xt), st − xt⟩ / (Lt‖st − xt‖²), 1}

Key difference with DR: L is replaced by Lt. Potentially Lt ≪ L.
31. The Adaptive FW algorithm
New FW variant with adaptive step-size.5
Algorithm 4: The Adaptive Frank-Wolfe algorithm (AdaFW)
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   Find Lt that verifies the sufficient decrease condition (1), with
4   γt = min{⟨−∇f(xt), st − xt⟩ / (Lt‖st − xt‖²), 1}
5   xt+1 = (1 − γt)xt + γtst

f(xt+1) ≤ f(xt) + γt⟨∇f(xt), st − xt⟩ + (γt²Lt/2)‖st − xt‖²   (1)

5 Fabian Pedregosa, Armin Askari, Geoffrey Negiar, and Martin Jaggi (2018). “Step-Size Adaptivity in Projection-Free Optimization”. In: Submitted.
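One way to realize line 3 is a backtracking loop that starts from a slightly decreased previous estimate and grows Lt until condition (1) holds. The sketch below is only indicative: the shrink/growth factors eta and tau and the function names are assumptions, not necessarily the rule used in the paper.

import numpy as np

def find_lt(f, grad_xt, x, s, lt_prev, eta=0.9, tau=2.0):
    # Return (Lt, gamma_t, x_{t+1}) with Lt satisfying the sufficient decrease condition (1).
    d = s - x
    d_sq = d @ d
    if d_sq == 0.0:
        return lt_prev, 0.0, x          # xt already is the FW vertex
    gap = -(grad_xt @ d)                # <-grad f(xt), st - xt>, non-negative
    f_x = f(x)
    lt = eta * lt_prev                  # allow Lt to decrease between iterations
    while True:
        gamma = min(gap / (lt * d_sq), 1.0)
        x_next = x + gamma * d
        bound = f_x - gamma * gap + 0.5 * gamma ** 2 * lt * d_sq
        if f(x_next) <= bound:          # condition (1) holds for this Lt
            return lt, gamma, x_next
        lt *= tau                       # otherwise increase Lt and retry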
34. The Adaptive FW algorithm
[Figure: the quadratic model f(xt) + γ⟨∇f(xt), st − xt⟩ + (γ²Lt/2)‖st − xt‖² compared with f((1 − γ)xt + γst) for γ ∈ [0, 1], with the chosen step-size γt marked.]
• Worst-case, Lt = L. Often Lt ≪ L ⟹ larger step-size.
• Adaptivity to the local geometry.
• Two extra function evaluations per iteration, often available as a byproduct of the gradient computation.
36. Zig-Zagging phenomena in FW
The Frank-Wolfe algorithm zig-zags when the solution lies on a face of the boundary of D.
Some FW variants have been developed to address this issue.
37. Away-steps FW, informal
The Away-steps FW algorithm (Wolfe, 1970; Guélat and Marcotte, 1986) adds the possibility to move away from an active atom.
39. Away-steps FW algorithm
Keep an active set St: the vertices that have been previously selected and still have non-zero weight.

Algorithm 5: Away-Steps FW
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   vt ∈ arg max_{v∈St} ⟨∇f(xt), v⟩
4   if ⟨−∇f(xt), st − xt⟩ ≥ ⟨−∇f(xt), xt − vt⟩ then
5     dt = st − xt   (FW step)
6   else
7     dt = xt − vt   (away step)
8   Find γt by line-search: γt ∈ arg min_{γ∈[0, γt^max]} f(xt + γdt)
9   xt+1 = xt + γtdt
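A small sketch of the direction choice in lines 2–7, assuming the active set is available as a list of vertex arrays; the weight bookkeeping and the cap γt^max are omitted here.

import numpy as np

def away_fw_direction(grad, x, s, active_vertices):
    # Return the chosen direction and its type ("FW" or "away").
    v = max(active_vertices, key=lambda a: float(grad @ a))   # worst active atom
    if -(grad @ (s - x)) >= -(grad @ (x - v)):
        return s - x, "FW"
    return x - v, "away"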
44. Pairwise FW
Key idea
Move weight mass between two atoms in each step.
Proposed by Lacoste-Julien and Jaggi (2015), inspired by the MDM algorithm (Mitchell, Demyanov, and Malozemov, 1974).

Algorithm 6: Pairwise FW
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   vt ∈ arg max_{s∈St} ⟨∇f(xt), s⟩
4   dt = st − vt
5   Find γt by line-search: γt ∈ arg min_{γ∈[0, γt^max]} f(xt + γdt)
6   xt+1 = xt + γtdt
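A sketch of one pairwise step for a polytope given explicitly by its vertices, with the iterate tracked through a dictionary of convex weights; it uses a DR-style step-size capped at γt^max (the weight of the away atom). The names and the explicit-vertex representation are illustrative assumptions.

import numpy as np

def pairwise_fw_step(vertices, weights, x, grad, L):
    # One pairwise-FW update; `weights` maps vertex index -> convex weight of x.
    scores = vertices @ grad
    s_idx = int(np.argmin(scores))                     # FW atom (linear minimization oracle)
    active = [i for i, w in weights.items() if w > 0]
    v_idx = max(active, key=lambda i: scores[i])       # away atom: worst active vertex
    d = vertices[s_idx] - vertices[v_idx]
    gamma_max = weights[v_idx]                         # cannot move more mass than v carries
    gamma = min(-(grad @ d) / (L * (d @ d) + 1e-18), gamma_max)
    gamma = max(gamma, 0.0)
    weights[v_idx] -= gamma
    weights[s_idx] = weights.get(s_idx, 0.0) + gamma
    return x + gamma * d, weights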
47. Away-steps FW and Pairwise FW
Convergence of Away-steps and Pairwise FW
• Linear convergence for strongly convex functions on polytopes
(Lacoste-Julien and Jaggi, 2015).
• Can we design variants with sufficient decrease?
Introducing Adaptive Away-steps and Adaptive Pairwise
Choose Lt such that it verifies
f(xt + γtdt) ≤ f(xt) + γt⟨∇f(xt), dt⟩ + (γt²Lt/2)‖dt‖²
with γt = min{⟨−∇f(xt), dt⟩ / (Lt‖dt‖²), 1}
48. Adaptive Pairwise FW
Algorithm 7: Adaptive Pairwise FW
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   vt ∈ arg max_{s∈St} ⟨∇f(xt), s⟩
4   dt = st − vt
5   Find Lt that verifies the sufficient decrease condition (2), with
6   γt = min{⟨−∇f(xt), dt⟩ / (Lt‖dt‖²), 1}
7   xt+1 = xt + γtdt

f(xt + γtdt) ≤ f(xt) + γt⟨∇f(xt), dt⟩ + (γt²Lt/2)‖dt‖²   (2)
51. Theory for Adaptive Step-size variants
Strongly convex f
Pairwise and Away-steps converge linearly on a polytope. For each “good step” we have:
f(xt+1) − f(x*) ≤ (1 − ρ)(f(xt) − f(x*))

Convex f
For all FW variants, f(xt) − f(x*) ≤ O(1/t)

Non-convex f
For all FW variants, max_{s∈D} ⟨∇f(xt), xt − s⟩ ≤ O(1/√t)
57. Proximal Splitting
Building a quadratic upper bound is common in proximal gradient methods (Beck and Teboulle, 2009; Nesterov, 2013).
Recently extended to the Davis-Yin three operator splitting,6 which solves
arg min_x f(x) + g(x) + h(x)
with access to ∇f, prox_{γg}, prox_{γh}.
Key insight: verify a sufficient decrease condition of the form
f(xt+1) ≤ f(zt) + ⟨∇f(zt), xt+1 − zt⟩ + (1/(2γt))‖xt+1 − zt‖²

6 Fabian Pedregosa and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”. In: Proceedings of the 35th International Conference on Machine Learning.
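For comparison, the same kind of test in plain proximal gradient descent on f + g looks as follows; a sketch assuming `prox_g(z, gamma)` computes prox_{γg}(z). The adaptive three operator splitting of Pedregosa and Gidel (2018) uses an analogous test, with the extra proximal operator and an auxiliary iterate zt.

import numpy as np

def prox_grad_step(f, grad_f, prox_g, x, gamma, shrink=0.5):
    # Backtracking proximal-gradient step: shrink gamma until the quadratic
    # upper bound (sufficient decrease) holds at the candidate point.
    g = grad_f(x)
    f_x = f(x)
    while True:
        x_next = prox_g(x - gamma * g, gamma)
        diff = x_next - x
        if f(x_next) <= f_x + g @ diff + (diff @ diff) / (2.0 * gamma):
            return x_next, gamma
        gamma *= shrink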
58. Nearly-isotonic penalty
Problem
arg min_x loss(x) + λ Σ_{i=1}^{p−1} max{xi − xi+1, 0}
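The penalty only charges for decreases between consecutive coefficients; a one-line numpy sketch:

import numpy as np

def nearly_isotonic_penalty(x, lam):
    # lam * sum_i max(x_i - x_{i+1}, 0): penalizes only decreasing steps.
    return lam * np.maximum(x[:-1] - x[1:], 0.0).sum()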
[Figure: estimated coefficients vs ground truth for λ ∈ {10⁻⁶, 10⁻³, 0.01, 0.1}, and objective minus optimum vs time (in seconds) for Adaptive TOS (variants 1 and 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG and Adaptive PDHG.]
59. Overlapping group lasso penalty
Problem
arg min_x loss(x) + λ Σ_{g∈G} ‖[x]g‖₂
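The penalty sums the Euclidean norms of (possibly overlapping) coordinate groups; a short sketch where `groups` is an assumed list of index arrays:

import numpy as np

def overlapping_group_lasso_penalty(x, groups, lam):
    # lam * sum over groups g of ||x[g]||_2; groups may overlap.
    return lam * sum(np.linalg.norm(x[g]) for g in groups)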
[Figure: estimated coefficients vs ground truth for λ ∈ {10⁻⁶, 10⁻³, 0.01, 0.1}, and objective minus optimum vs time (in seconds) for Adaptive TOS (variants 1 and 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG and Adaptive PDHG.]
62. Stochastic optimization
Problem
arg min_{x∈ℝᵈ} (1/n) Σ_{i=1}^{n} fi(x)

Heuristic from7 to estimate L by verifying at each iteration t
fi(xt − (1/L)∇fi(xt)) ≤ fi(xt) − (1/(2L))‖∇fi(xt)‖²
with i a random index sampled at iteration t.
This is the L-smooth inequality with y = xt − (1/L)∇fi(xt), x = xt.

7 Mark Schmidt, Nicolas Le Roux, and Francis Bach (2017). “Minimizing finite sums with the stochastic average gradient”. In: Mathematical Programming.
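A sketch of this heuristic for one sampled component fi; the doubling factor (and the common trick of slowly decreasing L between iterations) are assumptions on my part, the test itself is the inequality above.

import numpy as np

def estimate_lipschitz(f_i, grad_f_i, x, L, growth=2.0):
    # Increase L until f_i(x - grad/L) <= f_i(x) - ||grad||^2 / (2 L).
    g = grad_f_i(x)
    g_sq = g @ g
    while f_i(x - g / L) > f_i(x) - g_sq / (2.0 * L):
        L *= growth
    return L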
67. Conclusion
• Sufficient decrease condition to set step-size in FW and
variants.
• (Mostly) Hyperparameter-free, adaptivity to local geometry.
• Applications in proximal splitting and stochastic optimization.
Thanks for your attention
68. References
Beck, Amir and Marc Teboulle (2009). “Gradient-based algorithms with applications to
signal recovery”. In: Convex optimization in signal processing and communications.
Demyanov, Vladimir and Aleksandr Rubinov (1970). Approximate methods in
optimization problems (translated from Russian).
Guélat, Jacques and Patrice Marcotte (1986). “Some comments on Wolfe’s away
step”. In: Mathematical Programming.
Kerdreux, Thomas, Fabian Pedregosa, and Alexandre d’Aspremont (2018).
“Frank-Wolfe with Subsampling Oracle”. In: Proceedings of the 35th International
Conference on Machine Learning.
Lacoste-Julien, Simon and Martin Jaggi (2015). “On the global linear convergence of
Frank-Wolfe optimization variants”. In: Advances in Neural Information Processing
Systems.
Mitchell, BF, Vladimir Fedorovich Demyanov, and VN Malozemov (1974). “Finding
the point of a polyhedron closest to the origin”. In: SIAM Journal on Control.
Nesterov, Yu (2013). “Gradient methods for minimizing composite functions”. In:
Mathematical Programming.
Niculae, Vlad et al. (2018). “SparseMAP: Differentiable Sparse Structured Inference”.
In: International Conference on Machine Learning.
Pedregosa, Fabian et al. (2018). “Step-Size Adaptivity in Projection-Free
Optimization”. In: Submitted.
Pedregosa, Fabian and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”.
In: Proceedings of the 35th International Conference on Machine Learning.
Ping, Wei, Qiang Liu, and Alexander T Ihler (2016). “Learning Infinite RBMs with
Frank-Wolfe”. In: Advances in Neural Information Processing Systems.
Schmidt, Mark, Nicolas Le Roux, and Francis Bach (2017). “Minimizing finite sums
with the stochastic average gradient”. In: Mathematical Programming.
Wolfe, Philip (1970). “Convergence theory in nonlinear programming”. In: Integer and
nonlinear programming.