Convex optimization problem

convex optimization problem:

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1, . . . , m
                Ax = b

variable x ∈ R^n
equality constraints are linear
f_0, . . . , f_m are convex: for θ ∈ [0, 1],

    f_i(θx + (1 − θ)y) ≤ θ f_i(x) + (1 − θ) f_i(y)

i.e., f_i have nonnegative (upward) curvature
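A minimal CVXPY sketch of this standard form, with made-up data; the particular f_0 and f_1 below are illustrative assumptions, not from the slides:

    import cvxpy as cp
    import numpy as np

    np.random.seed(0)
    p, n = 2, 4
    A = np.random.randn(p, n)           # equality constraint data
    b = 0.1 * np.random.randn(p)

    x = cp.Variable(n)
    f0 = cp.sum_squares(x)              # a convex objective f_0
    f1 = cp.norm(x, 1) - 1              # one convex inequality f_1(x) <= 0
    prob = cp.Problem(cp.Minimize(f0), [f1 <= 0, A @ x == b])
    prob.solve()
    print(prob.value, x.value)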
Why convex optimization?
we can solve convex optimization problems effectively
there are lots of applications
Application areas
machine learning, statistics
finance
supply chain, revenue management, advertising
control
signal and image processing, vision
networking
circuit design
and many others . . .
Convex optimization solvers

medium scale (1000s–10000s variables, constraints):
    interior-point methods on a single machine
large scale (100k–1B variables, constraints):
    custom (often problem-specific) methods, e.g., SGD
lots of ongoing research
growing list of open-source solvers
Convex optimization modeling languages

(new) high-level language support for convex optimization:
    describe the problem in a high-level language
    the problem is compiled to standard form and solved
implementations:
YALMIP, CVX (Matlab)
CVXPY (Python)
Convex.jl (Julia)
CVXPY

(Diamond & Boyd, 2013)

    minimize    ‖Ax − b‖₂² + γ‖x‖₁
    subject to  ‖x‖∞ ≤ 1
    import cvxpy as cp

    # modern CVXPY syntax; assumes A, b, n, and gamma > 0 are already defined
    x = cp.Variable(n)
    cost = cp.sum_squares(A @ x - b) + gamma * cp.norm(x, 1)
    prob = cp.Problem(cp.Minimize(cost),
                      [cp.norm(x, "inf") <= 1])
    opt_val = prob.solve()   # optimal objective value
    solution = x.value       # optimal x as a NumPy array
Example: Image in-painting

guess pixel values in obscured/corrupted parts of an image
total variation in-painting: choose pixel values x_ij ∈ R³ to minimize the total variation

    TV(x) = Σ_ij ‖ (x_{i+1,j} − x_ij, x_{i,j+1} − x_ij) ‖₂

a convex problem
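A minimal grayscale CVXPY sketch of TV in-painting (the slide's example is RGB); the image U and the known-pixel mask K here are random stand-ins:

    import cvxpy as cp
    import numpy as np

    np.random.seed(0)
    U = np.random.rand(32, 32)                        # stand-in corrupted image
    K = (np.random.rand(32, 32) < 0.5).astype(float)  # 1 where pixel is known

    X = cp.Variable(U.shape)
    constraints = [cp.multiply(K, X) == cp.multiply(K, U)]  # keep known pixels
    prob = cp.Problem(cp.Minimize(cp.tv(X)), constraints)
    prob.solve()
    recovered = X.value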
Example

512 × 512 color image (n ≈ 800000 variables)

[figure: original and corrupted images]
Predictor

given data (x_i, y_i), i = 1, . . . , m
x is the feature vector, y is the outcome or label
find a predictor ψ so that y ≈ ŷ = ψ(x) for data (x, y) that you haven't seen
ψ is a regression model for y ∈ R
ψ is a classifier for y ∈ {−1, 1}
Loss minimization predictor

predictor parametrized by θ ∈ R^n
loss function L(x_i, y_i, θ) gives the misfit for data point (x_i, y_i)
for given θ, the predictor is

    ψ(x) = argmin_y L(x, y, θ)
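for instance (a step added here for concreteness), with square loss L(x, y, θ) = (θᵀx − y)² the inner minimization is attained at y = θᵀx, so

    ψ(x) = argmin_y (θᵀx − y)² = θᵀx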
how do we choose the parameter θ?
Model fitting via regularized loss minimization

choose θ by minimizing the regularized loss

    (1/m) Σ_{i=1}^m L(x_i, y_i, θ) + λ r(θ)

regularization r(θ) penalizes model complexity, enforces constraints, or represents a prior
λ > 0 scales the regularization
for many useful cases, this is a convex problem
Examples

| predictor           | L(x, y, θ)          | ψ(x)      | r(θ)  |
|---------------------|---------------------|-----------|-------|
| least-squares       | (θᵀx − y)²          | θᵀx       | 0     |
| ridge regression    | (θᵀx − y)²          | θᵀx       | ‖θ‖₂² |
| lasso               | (θᵀx − y)²          | θᵀx       | ‖θ‖₁  |
| logistic classifier | log(1 + exp(−yθᵀx)) | sign(θᵀx) | 0     |
| SVM                 | (1 − yθᵀx)₊         | sign(θᵀx) | ‖θ‖₂² |

can mix and match, e.g., r(θ) = ‖θ‖₁ sparsifies
all lead to convex fitting problems; a lasso sketch follows
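A minimal CVXPY sketch of one row of the table (lasso), on synthetic data; the value of λ is an arbitrary assumption:

    import cvxpy as cp
    import numpy as np

    np.random.seed(0)
    m, n, lam = 100, 20, 0.1
    X = np.random.randn(m, n)                        # rows are the x_i
    y = X @ np.random.randn(n) + 0.1 * np.random.randn(m)

    theta = cp.Variable(n)
    loss = cp.sum_squares(X @ theta - y) / m         # average square loss
    prob = cp.Problem(cp.Minimize(loss + lam * cp.norm(theta, 1)))
    prob.solve()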
Robust (Huber) regression

loss L(x, y, θ) = φ_hub(θᵀx − y)
φ_hub is the Huber function (with threshold M > 0):

    φ_hub(u) = u²           if |u| ≤ M
               2M|u| − M²   if |u| > M

same as least-squares for small residuals, but allows (some) large residuals, and so is robust to outliers
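A sketch of Huber regression in CVXPY; CVXPY's huber atom matches φ_hub above, and the data and threshold M are stand-ins:

    import cvxpy as cp
    import numpy as np

    np.random.seed(1)
    m, n, M = 100, 20, 1.0
    X = np.random.randn(m, n)
    y = X @ np.random.randn(n) + np.random.randn(m)

    theta = cp.Variable(n)
    prob = cp.Problem(cp.Minimize(cp.sum(cp.huber(X @ theta - y, M))))
    prob.solve()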
Example

m = 450 measurements, n = 300 regressors
choose θ^true; x_i ∼ N(0, I)
set y_i = (θ^true)ᵀ x_i + ε_i, with ε_i ∼ N(0, 1)
with probability p, replace y_i with −y_i
data has a fraction p of (non-obvious) wrong measurements: the distributions of 'good' and 'bad' y_i are the same
try to recover θ^true ∈ R^n from the measurements y ∈ R^m
'prescient' version: we know which measurements are wrong
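A sketch of the data-generating process just described; the value of p is an assumed example:

    import numpy as np

    np.random.seed(0)
    m, n, p = 450, 300, 0.1
    theta_true = np.random.randn(n)
    X = np.random.randn(m, n)                  # rows x_i ~ N(0, I)
    y = X @ theta_true + np.random.randn(m)    # y_i = (theta_true)^T x_i + eps_i
    flip = np.random.rand(m) < p               # with probability p ...
    y[flip] = -y[flip]                         # ... replace y_i with -y_i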
Quantile regression

quantile regression: use the tilted ℓ₁ loss

    L(x, y, θ) = τ(r)₊ + (1 − τ)(r)₋,  with r = θᵀx − y, τ ∈ (0, 1)

τ = 0.5: equal penalty for over- and under-estimating
τ = 0.1: 9× more penalty for under-estimating
τ = 0.9: 9× more penalty for over-estimating
the τ-quantile of the residuals is zero
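A minimal CVXPY sketch of the tilted ℓ₁ loss, using the pos and neg atoms for (r)₊ and (r)₋; the data are stand-ins:

    import cvxpy as cp
    import numpy as np

    np.random.seed(0)
    m, n, tau = 200, 5, 0.9
    X = np.random.randn(m, n)
    y = X @ np.random.randn(n) + np.random.randn(m)

    theta = cp.Variable(n)
    r = X @ theta - y                          # residuals
    loss = cp.sum(tau * cp.pos(r) + (1 - tau) * cp.neg(r))
    cp.Problem(cp.Minimize(loss)).solve()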
Example

time series x_t, t = 0, 1, 2, . . .
auto-regressive predictor:

    x̂_{t+1} = θᵀ (1, x_t, . . . , x_{t−M})

M = 10 is the memory of the predictor
use quantile regression for τ = 0.1, 0.5, 0.9
at each time t, this gives three one-step-ahead predictions: x̂_{t+1}^{0.1}, x̂_{t+1}^{0.5}, x̂_{t+1}^{0.9}
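A sketch of how the lagged feature matrix could be assembled (the series here is a random stand-in); the tilted ℓ₁ fit from the previous sketch can then be run once per τ:

    import numpy as np

    np.random.seed(0)
    M = 10
    series = np.random.randn(500)              # stand-in time series x_t
    rows = [np.concatenate(([1.0], series[t - M:t + 1][::-1]))
            for t in range(M, len(series) - 1)]
    X = np.array(rows)                         # rows are (1, x_t, ..., x_{t-M})
    y = series[M + 1:]                         # targets x_{t+1}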
Consensus optimization

want to solve a problem with N objective terms

    minimize Σ_{i=1}^N f_i(x)

e.g., f_i is the loss function for the ith block of training data
consensus form:

    minimize    Σ_{i=1}^N f_i(x_i)
    subject to  x_i − z = 0,  i = 1, . . . , N

x_i are the local variables
z is the global variable
x_i − z = 0 are the consistency or consensus constraints
Consensus optimization via ADMM

with x̄^k = (1/N) Σ_{i=1}^N x_i^k (the average over local variables):

    x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + (ρ/2) ‖x_i − x̄^k + u_i^k‖₂² )

    u_i^{k+1} := u_i^k + (x_i^{k+1} − x̄^{k+1})

gets the global minimum, under very general conditions
u^k is a running sum of the inconsistencies (PI control)
the minimizations are carried out independently and in parallel
coordination is via averaging of the local variables x_i; a sketch follows
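A NumPy/CVXPY sketch of these updates for quadratic terms f_i(x) = ‖A_i x − b_i‖₂²; the sizes, ρ, and iteration count are assumptions:

    import cvxpy as cp
    import numpy as np

    np.random.seed(0)
    N, n, rho = 5, 10, 1.0
    A = [np.random.randn(20, n) for _ in range(N)]
    b = [np.random.randn(20) for _ in range(N)]

    x = np.zeros((N, n))                       # local variables x_i
    u = np.zeros((N, n))                       # scaled dual variables u_i
    xbar = x.mean(axis=0)                      # global average

    for k in range(50):
        # x-updates are independent and could run in parallel
        for i in range(N):
            xi = cp.Variable(n)
            obj = cp.sum_squares(A[i] @ xi - b[i]) \
                  + (rho / 2) * cp.sum_squares(xi - xbar + u[i])
            cp.Problem(cp.Minimize(obj)).solve()
            x[i] = xi.value
        xbar = x.mean(axis=0)                  # coordination: averaging
        u += x - xbar                          # running sum of inconsistencies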
Consensus model fitting

the variable is θ, the parameter in the predictor
f_i(θ_i) is the loss + (share of) regularizer for the ith data block
θ_i^{k+1} minimizes the local loss + an additional quadratic term
the local parameters converge to consensus, the same as if the whole data set were handled together
privacy preserving: agents don't reveal their data to each other
Example

SVM:
    hinge loss l(u) = (1 − u)₊
    sum-square regularization r(θ) = ‖θ‖₂²
baby problem with n = 2, m = 400 to illustrate
examples split into N = 20 groups, in the worst possible way: each group contains only positive or only negative examples
CVXPY implementation

(Steven Diamond)

N = 10⁵ samples, n = 10³ (dense) features
hinge (SVM) loss with ℓ₁ regularization
data split into 100 chunks
100 processes on 32 cores
26 sec per ADMM iteration
100 iterations for the objective to converge
10 iterations (5 minutes) to get a good model
H2O implementation

(Tomas Nykodym)

click-through data derived from a Kaggle data set
20000 features, 20M examples
logistic loss, elastic net regularization
examples divided into 100 chunks (of different sizes)
run on 100 H2O instances
5 iterations to get a good global model
Summary

ADMM consensus:
    can do machine learning across distributed data sources
    the data never moves
    you get the same model as if you had collected all the data in one place
Resources

many researchers have worked on the topics covered

    Convex Optimization (book)
    Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers (paper)
    EE364a (course slides, videos, code, homework, . . . )
    software: CVX, CVXPY, Convex.jl

all available online