1. Sufficient decrease is all you need
A simple condition to forget about the step-size, with
applications to the Frank-Wolfe algorithm.
Fabian Pedregosa
June 4th, 2018. Google Brain Montreal
2. Where I Come From
ML/Optimization/Software Guy
Engineer (2010–2012)
First contact with ML: developing the ML library scikit-learn.
ML and Neuroscience (2012–2015)
PhD applying ML to neuroscience.
ML and Optimization (2015–)
Stochastic, Parallel, Constrained,
Hyperparameter optimization.
5. Outline
Motivation: eliminate step-size parameter.
1. Frank-Wolfe: a method for constrained optimization.
2. Adaptive Frank-Wolfe: Frank-Wolfe without the step-size.
3. Perspectives: other applications (proximal splitting, stochastic optimization).
With a little help from my collaborators:
Armin Askari (UC Berkeley), Geoffrey Négiar (UC Berkeley), Martin Jaggi (EPFL), Gauthier Gidel (UdeM)
7. The Frank-Wolfe (FW) algorithm, aka conditional gradient
Problem: smooth f , compact D
arg min_{x∈D} f(x)

Algorithm 1: Frank-Wolfe (FW)
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   Find γt by line-search: γt ∈ arg min_{γ∈[0,1]} f((1−γ)xt + γst)
4   xt+1 = (1 − γt)xt + γtst
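To make the loop concrete, here is a minimal Python sketch of Algorithm 1, assuming (as an illustration, not from the talk) a least-squares objective and an ℓ1-ball constraint; for this pair the linear minimization oracle is a signed basis vector and exact line-search has a closed form.

import numpy as np

def frank_wolfe_l1(A, b, alpha, n_iter=100):
    # Frank-Wolfe for min_x 0.5 * ||A x - b||^2 subject to ||x||_1 <= alpha.
    n_features = A.shape[1]
    x = np.zeros(n_features)
    for _ in range(n_iter):
        residual = A @ x - b
        grad = A.T @ residual
        # Linear minimization oracle over the l1 ball: a signed, scaled basis vector.
        i = int(np.argmax(np.abs(grad)))
        s = np.zeros(n_features)
        s[i] = -alpha * np.sign(grad[i])
        d = s - x
        # Exact line-search in closed form for the quadratic objective.
        Ad = A @ d
        gamma = np.clip(-(residual @ Ad) / (Ad @ Ad + 1e-18), 0.0, 1.0)
        x = x + gamma * d
    return x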
14. Why people ♥ Frank-Wolfe
• Projection-free. Only linear subproblems arise vs quadratic
for projection.
• Solution of linear subproblem is always extremal element of
D.
• Iterates admit a sparse representation: xt is a convex combination of at most t elements.
15. Recent applications of Frank-Wolfe
• Learning the structure of a neural network.1
• Attention mechanisms that enforce sparsity.2
• ℓ1-constrained problems with an extreme number of features.3
1 Wei Ping, Qiang Liu, and Alexander T. Ihler (2016). “Learning Infinite RBMs with Frank-Wolfe”. In: Advances in Neural Information Processing Systems.
2 Vlad Niculae et al. (2018). “SparseMAP: Differentiable Sparse Structured Inference”. In: International Conference on Machine Learning.
3 Thomas Kerdreux, Fabian Pedregosa, and Alexandre d’Aspremont (2018). “Frank-Wolfe with Subsampling Oracle”. In: Proceedings of the 35th International Conference on Machine Learning.
17. A practical issue
• Line-search is only efficient when a closed form exists (e.g., quadratic objective).
• The step-size γt = 2/(t + 2) is convergent, but extremely slow.

Algorithm 2: Frank-Wolfe (FW)
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   Find γt by line-search: γt ∈ arg min_{γ∈[0,1]} f((1−γ)xt + γst)
4   xt+1 = (1 − γt)xt + γtst

Can we do better?
19. Down the citation rabbit hole
Vladimir Demyanov · Aleksandr Rubinov
4 Vladimir Demyanov and Aleksandr Rubinov (1970). Approximate methods in optimization problems (translated from Russian).
21. The Demyanov-Rubinov (DR) Frank-Wolfe variant
Problem: smooth objective, compact domain
arg min_{x∈D} f(x), where f is L-smooth
(L-smooth ≡ differentiable with L-Lipschitz gradient).

• Step-size depends on the correlation between −∇f(xt) and the descent direction st − xt.

Algorithm 3: FW, DR variant
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   γt = min{⟨−∇f(xt), st − xt⟩ / (L‖st − xt‖²), 1}
4   xt+1 = (1 − γt)xt + γtst
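In code, the Demyanov-Rubinov step-size is essentially a one-liner; a sketch assuming numpy arrays and that L (the global Lipschitz constant of ∇f) is supplied by the user.

import numpy as np

def dr_step_size(grad, x, s, L):
    # Minimizer over [0, 1] of the quadratic upper bound along the FW direction.
    d = s - x
    # <-grad, d> is non-negative when s comes from the linear minimization oracle.
    return min(max(-(grad @ d), 0.0) / (L * (d @ d) + 1e-18), 1.0)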
24. The Demyanov-Rubinov (DR) Frank-Wolfe variant
Where does γt = min{⟨−∇f(xt), st − xt⟩ / (L‖st − xt‖²), 1} come from?

L-smooth inequality
Any L-smooth function f verifies
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖²  =: Qx(y),
for all x, y in the domain.

• The right-hand side Qx(y) is a quadratic upper bound on f(y).
27. Justification of the step-size
• The L-smooth inequality at y = xt+1(γ) = (1 − γ)xt + γst, x = xt gives
f(xt+1(γ)) ≤ f(xt) + γ⟨∇f(xt), st − xt⟩ + (γ²L/2)‖st − xt‖²
• Minimizing the right-hand side over γ ∈ [0, 1] gives
γ = min{⟨−∇f(xt), st − xt⟩ / (L‖st − xt‖²), 1},
= the Demyanov-Rubinov step-size!
• ≡ exact line search on the quadratic upper bound.
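Making the minimization step explicit: the right-hand side is a quadratic in γ, so setting its derivative to zero and clipping to [0, 1] gives
d/dγ [f(xt) + γ⟨∇f(xt), st − xt⟩ + (γ²L/2)‖st − xt‖²] = ⟨∇f(xt), st − xt⟩ + γL‖st − xt‖² = 0
⟹ γ* = ⟨−∇f(xt), st − xt⟩ / (L‖st − xt‖²), and γ = min{γ*, 1}.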
30. Towards an Adaptive FW
Quadratic upper bound
The Demyanov-Rubinov step-size makes use of a quadratic upper bound, but it is only ever evaluated at xt, xt+1.

Sufficient decrease is all you need
The L-smooth inequality can be replaced by
f(xt+1) ≤ f(xt) + γt⟨∇f(xt), st − xt⟩ + (γt²Lt/2)‖st − xt‖²
with γt = min{⟨−∇f(xt), st − xt⟩ / (Lt‖st − xt‖²), 1}

Key difference with DR: L is replaced by Lt. Potentially Lt ≪ L.
31. The Adaptive FW algorithm
New FW variant with adaptive step-size.5
Algorithm 4: The Adaptive Frank-Wolfe algorithm (AdaFW)
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   Find Lt that verifies the sufficient decrease condition (1), with
4   γt = min{⟨−∇f(xt), st − xt⟩ / (Lt‖st − xt‖²), 1}
5   xt+1 = (1 − γt)xt + γtst

f(xt+1) ≤ f(xt) + γt⟨∇f(xt), st − xt⟩ + (γt²Lt/2)‖st − xt‖²   (1)

5 Fabian Pedregosa, Armin Askari, Geoffrey Negiar, and Martin Jaggi (2018). “Step-Size Adaptivity in Projection-Free Optimization”. In: Submitted.
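One way to realize line 3 is a backtracking loop that starts from a slightly decreased previous estimate and grows Lt until condition (1) holds. The sketch below is only indicative: the shrink/growth factors eta and tau and the function names are assumptions, not necessarily the rule used in the paper.

import numpy as np

def find_lt(f, grad_xt, x, s, lt_prev, eta=0.9, tau=2.0):
    # Return (Lt, gamma_t, x_{t+1}) with Lt satisfying the sufficient decrease condition (1).
    d = s - x
    d_sq = d @ d
    if d_sq == 0.0:
        return lt_prev, 0.0, x          # xt already is the FW vertex
    gap = -(grad_xt @ d)                # <-grad f(xt), st - xt>, non-negative
    f_x = f(x)
    lt = eta * lt_prev                  # allow Lt to decrease between iterations
    while True:
        gamma = min(gap / (lt * d_sq), 1.0)
        x_next = x + gamma * d
        bound = f_x - gamma * gap + 0.5 * gamma ** 2 * lt * d_sq
        if f(x_next) <= bound:          # condition (1) holds for this Lt
            return lt, gamma, x_next
        lt *= tau                       # otherwise increase Lt and retry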
34. The Adaptive FW algorithm
[Figure: the quadratic model f(xt) + γ⟨∇f(xt), st − xt⟩ + (γ²Lt/2)‖st − xt‖² compared with f((1 − γ)xt + γst) for γ ∈ [0, 1], with the chosen step-size γt marked.]
• Worst-case, Lt = L. Often Lt ≪ L ⟹ larger step-size.
• Adaptivity to the local geometry.
• Two extra function evaluations per iteration, often available as a byproduct of the gradient computation.
36. Zig-Zagging phenomena in FW
The Frank-Wolfe algorithm zig-zags when the solution lies on a face of the boundary of D.
Some FW variants have been developed to address this issue.
37. Away-steps FW, informal
The Away-steps FW algorithm (Wolfe, 1970; Guélat and Marcotte, 1986) adds the possibility to move away from an active atom.
39. Away-steps FW algorithm
Keep an active set St: the vertices that have been previously selected and still have non-zero weight.

Algorithm 5: Away-Steps FW
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   vt ∈ arg max_{v∈St} ⟨∇f(xt), v⟩
4   if ⟨−∇f(xt), st − xt⟩ ≥ ⟨−∇f(xt), xt − vt⟩ then
5     dt = st − xt   (FW step)
6   else
7     dt = xt − vt   (away step)
8   Find γt by line-search: γt ∈ arg min_{γ∈[0, γt^max]} f(xt + γdt)
9   xt+1 = xt + γtdt
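A small sketch of the direction choice in lines 2–7, assuming the active set is available as a list of vertex arrays; the weight bookkeeping and the cap γt^max are omitted here.

import numpy as np

def away_fw_direction(grad, x, s, active_vertices):
    # Return the chosen direction and its type ("FW" or "away").
    v = max(active_vertices, key=lambda a: float(grad @ a))   # worst active atom
    if -(grad @ (s - x)) >= -(grad @ (x - v)):
        return s - x, "FW"
    return x - v, "away"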
44. Pairwise FW
Key idea
Move weight mass between two atoms in each step.
Proposed by Lacoste-Julien and Jaggi (2015), inspired by the MDM algorithm (Mitchell, Demyanov, and Malozemov, 1974).

Algorithm 6: Pairwise FW
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   vt ∈ arg max_{s∈St} ⟨∇f(xt), s⟩
4   dt = st − vt
5   Find γt by line-search: γt ∈ arg min_{γ∈[0, γt^max]} f(xt + γdt)
6   xt+1 = xt + γtdt
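A sketch of one pairwise step for a polytope given explicitly by its vertices, with the iterate tracked through a dictionary of convex weights; it uses a DR-style step-size capped at γt^max (the weight of the away atom). The names and the explicit-vertex representation are illustrative assumptions.

import numpy as np

def pairwise_fw_step(vertices, weights, x, grad, L):
    # One pairwise-FW update; `weights` maps vertex index -> convex weight of x.
    scores = vertices @ grad
    s_idx = int(np.argmin(scores))                     # FW atom (linear minimization oracle)
    active = [i for i, w in weights.items() if w > 0]
    v_idx = max(active, key=lambda i: scores[i])       # away atom: worst active vertex
    d = vertices[s_idx] - vertices[v_idx]
    gamma_max = weights[v_idx]                         # cannot move more mass than v carries
    gamma = min(-(grad @ d) / (L * (d @ d) + 1e-18), gamma_max)
    gamma = max(gamma, 0.0)
    weights[v_idx] -= gamma
    weights[s_idx] = weights.get(s_idx, 0.0) + gamma
    return x + gamma * d, weights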
47. Away-steps FW and Pairwise FW
Convergence of Away-steps and Pairwise FW
• Linear convergence for strongly convex functions on polytopes
(Lacoste-Julien and Jaggi, 2015).
• Can we design variants with sufficient decrease?
Introducing Adaptive Away-steps and Adaptive Pairwise
Choose Lt such that it verifies
f(xt + γtdt) ≤ f(xt) + γt⟨∇f(xt), dt⟩ + (γt²Lt/2)‖dt‖²
with γt = min{⟨−∇f(xt), dt⟩ / (Lt‖dt‖²), 1}
48. Adaptive Pairwise FW
Algorithm 7: Adaptive Pairwise FW
1 for t = 0, 1, ... do
2   st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
3   vt ∈ arg max_{s∈St} ⟨∇f(xt), s⟩
4   dt = st − vt
5   Find Lt that verifies the sufficient decrease condition (2), with
6   γt = min{⟨−∇f(xt), dt⟩ / (Lt‖dt‖²), 1}
7   xt+1 = xt + γtdt

f(xt + γtdt) ≤ f(xt) + γt⟨∇f(xt), dt⟩ + (γt²Lt/2)‖dt‖²   (2)
51. Theory for Adaptive Step-size variants
Strongly convex f
Pairwise and Away-steps converge linearly on a polytope. For each “good step” we have:
f(xt+1) − f(x*) ≤ (1 − ρ)(f(xt) − f(x*))

Convex f
For all FW variants, f(xt) − f(x*) ≤ O(1/t)

Non-convex f
For all FW variants, max_{s∈D} ⟨∇f(xt), xt − s⟩ ≤ O(1/√t)
57. Proximal Splitting
Building a quadratic upper bound is common in proximal gradient methods (Beck and Teboulle, 2009; Nesterov, 2013).
Recently extended to the Davis-Yin three operator splitting,6 which solves
arg min_x f(x) + g(x) + h(x)
with access to ∇f, prox_{γg}, prox_{γh}.
Key insight: verify a sufficient decrease condition of the form
f(xt+1) ≤ f(zt) + ⟨∇f(zt), xt+1 − zt⟩ + (1/(2γt))‖xt+1 − zt‖²

6 Fabian Pedregosa and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”. In: Proceedings of the 35th International Conference on Machine Learning.
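For comparison, the same kind of test in plain proximal gradient descent on f + g looks as follows; a sketch assuming `prox_g(z, gamma)` computes prox_{γg}(z). The adaptive three operator splitting of Pedregosa and Gidel (2018) uses an analogous test, with the extra proximal operator and an auxiliary iterate zt.

import numpy as np

def prox_grad_step(f, grad_f, prox_g, x, gamma, shrink=0.5):
    # Backtracking proximal-gradient step: shrink gamma until the quadratic
    # upper bound (sufficient decrease) holds at the candidate point.
    g = grad_f(x)
    f_x = f(x)
    while True:
        x_next = prox_g(x - gamma * g, gamma)
        diff = x_next - x
        if f(x_next) <= f_x + g @ diff + (diff @ diff) / (2.0 * gamma):
            return x_next, gamma
        gamma *= shrink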
58. Nearly-isotonic penalty
Problem
arg min_x loss(x) + λ Σ_{i=1}^{p−1} max{xi − xi+1, 0}
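The penalty only charges for decreases between consecutive coefficients; a one-line numpy sketch:

import numpy as np

def nearly_isotonic_penalty(x, lam):
    # lam * sum_i max(x_i - x_{i+1}, 0): penalizes only decreasing steps.
    return lam * np.maximum(x[:-1] - x[1:], 0.0).sum()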
[Figure: estimated coefficients vs ground truth for λ ∈ {10⁻⁶, 10⁻³, 0.01, 0.1}, and objective minus optimum vs time (in seconds) for Adaptive TOS (variants 1 and 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG and Adaptive PDHG.]
59. Overlapping group lasso penalty
Problem
arg min_x loss(x) + λ Σ_{g∈G} ‖[x]g‖₂
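The penalty sums the Euclidean norms of (possibly overlapping) coordinate groups; a short sketch where `groups` is an assumed list of index arrays:

import numpy as np

def overlapping_group_lasso_penalty(x, groups, lam):
    # lam * sum over groups g of ||x[g]||_2; groups may overlap.
    return lam * sum(np.linalg.norm(x[g]) for g in groups)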
[Figure: estimated coefficients vs ground truth for λ ∈ {10⁻⁶, 10⁻³, 0.01, 0.1}, and objective minus optimum vs time (in seconds) for Adaptive TOS (variants 1 and 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG and Adaptive PDHG.]
62. Stochastic optimization
Problem
arg min_{x∈ℝᵈ} (1/n) Σ_{i=1}^{n} fi(x)

Heuristic from7 to estimate L by verifying at each iteration t
fi(xt − (1/L)∇fi(xt)) ≤ fi(xt) − (1/(2L))‖∇fi(xt)‖²
with i a random index sampled at iteration t.
This is the L-smooth inequality with y = xt − (1/L)∇fi(xt), x = xt.

7 Mark Schmidt, Nicolas Le Roux, and Francis Bach (2017). “Minimizing finite sums with the stochastic average gradient”. In: Mathematical Programming.
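A sketch of this heuristic for one sampled component fi; the doubling factor (and the common trick of slowly decreasing L between iterations) are assumptions on my part, the test itself is the inequality above.

import numpy as np

def estimate_lipschitz(f_i, grad_f_i, x, L, growth=2.0):
    # Increase L until f_i(x - grad/L) <= f_i(x) - ||grad||^2 / (2 L).
    g = grad_f_i(x)
    g_sq = g @ g
    while f_i(x - g / L) > f_i(x) - g_sq / (2.0 * L):
        L *= growth
    return L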
67. Conclusion
• Sufficient decrease condition to set step-size in FW and
variants.
• (Mostly) Hyperparameter-free, adaptivity to local geometry.
• Applications in proximal splitting and stochastic optimization.
Thanks for your attention
68. References
Beck, Amir and Marc Teboulle (2009). “Gradient-based algorithms with applications to
signal recovery”. In: Convex optimization in signal processing and communications.
Demyanov, Vladimir and Aleksandr Rubinov (1970). Approximate methods in
optimization problems (translated from Russian).
Guélat, Jacques and Patrice Marcotte (1986). “Some comments on Wolfe’s away
step”. In: Mathematical Programming.
Kerdreux, Thomas, Fabian Pedregosa, and Alexandre d’Aspremont (2018).
“Frank-Wolfe with Subsampling Oracle”. In: Proceedings of the 35th International
Conference on Machine Learning.
Lacoste-Julien, Simon and Martin Jaggi (2015). “On the global linear convergence of
Frank-Wolfe optimization variants”. In: Advances in Neural Information Processing
Systems.
Mitchell, BF, Vladimir Fedorovich Demyanov, and VN Malozemov (1974). “Finding
the point of a polyhedron closest to the origin”. In: SIAM Journal on Control.
Nesterov, Yu (2013). “Gradient methods for minimizing composite functions”. In:
Mathematical Programming.
Niculae, Vlad et al. (2018). “SparseMAP: Differentiable Sparse Structured Inference”.
In: International Conference on Machine Learning.
Pedregosa, Fabian et al. (2018). “Step-Size Adaptivity in Projection-Free
Optimization”. In: Submitted.
Pedregosa, Fabian and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”.
In: Proceedings of the 35th International Conference on Machine Learning.
Ping, Wei, Qiang Liu, and Alexander T Ihler (2016). “Learning Infinite RBMs with
Frank-Wolfe”. In: Advances in Neural Information Processing Systems.
Schmidt, Mark, Nicolas Le Roux, and Francis Bach (2017). “Minimizing finite sums
with the stochastic average gradient”. In: Mathematical Programming.
Wolfe, Philip (1970). “Convergence theory in nonlinear programming”. In: Integer and
nonlinear programming.