Abstract: Introducing the Metric Optimization Engine (MOE), an open-source, black-box Bayesian global optimization engine for optimal experimental design.
In this talk we will introduce MOE, the Metric Optimization Engine. MOE is an efficient way to optimize a system’s parameters when evaluating them is time-consuming or expensive. It can be used to help tackle a myriad of problems, including optimizing a system’s click-through or conversion rate via A/B testing, tuning parameters of a machine learning prediction method or expensive batch job, designing an engineering system, or finding the optimal parameters of a real-world experiment.
MOE is ideal for problems in which the optimization problem’s objective function is a black box, not necessarily convex or concave, derivatives are unavailable, and we seek a global optimum, rather than just a local one. This ability to handle black-box objective functions allows us to use MOE to optimize nearly any system, without requiring any internal knowledge or access. To use MOE, we simply need to specify some objective function, some set of parameters, and any historical data we may have from previous evaluations of the objective function. MOE then finds the set of parameters that maximize (or minimize) the objective function, while evaluating the objective function as few times as possible. This is done internally using Bayesian Global Optimization on a Gaussian Process model of the underlying system and finding the points of highest Expected Improvement to sample next. MOE provides easy to use Python, C++, CUDA and REST interfaces to accomplish these goals and is fully open source. We will present the motivation and background, discuss the implementation and give real-world examples.
Scott Clark, Software Engineer, Yelp at MLconf SF
1. Optimal Learning
for Fun and Profit with MOE
Scott Clark
MLconf SF 2014
11/14/14
Joint work with: Eric Liu, Peter Frazier, Norases Vesdapunt, Deniz Oktay, Jialei Wang
sclark@yelp.com @DrScottClark
2. Outline of Talk
● Optimal Learning
○ What is it?
○ Why do we care?
● Multi-armed bandits
○ Definition and motivation
○ Examples
● Bayesian global optimization
○ Optimal experiment design
○ Uses to extend traditional A/B testing
○ Examples
● MOE: Metric Optimization Engine
○ Examples and Features
3. What is optimal learning?
Optimal learning addresses the challenge of
how to collect information as efficiently as
possible, primarily for settings where
collecting information is time consuming
and expensive.
Prof. Warren Powell - optimallearning.princeton.edu
What is the most efficient way to collect
information?
Prof. Peter Frazier - people.orie.cornell.edu/pfrazier
How do we make the most money, as fast
as possible?
Me - @DrScottClark
5. What are multi-armed bandits?
THE SETUP
● Imagine you are in front of K slot machines.
● Each one is set to "free play" (but you can still win $$$)
● Each has a possibly different, unknown payout rate
● You have a fixed amount of time to maximize payout
GO!
7. Real World Bandits
Why do we care?
● Maps well onto Click Through Rate (CTR)
○ Each arm is an ad or search result
○ Each click is a success
○ Want to maximize clicks
● Can be used in experiments (A/B testing)
○ Want to find the best solutions, fast
○ Want to limit how often bad solutions are used
8. Tradeoffs
Exploration vs. Exploitation
Gaining more knowledge about the system
vs.
Getting largest payout with current knowledge
9. Naive Example
Epsilon First Policy
● Sample arms sequentially for the first εT pulls (εT < T)
○ only explore
● Pick the best arm observed so far and pull it for t = εT+1, ..., T
○ only exploit
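The epsilon-first policy can be sketched as a short Python simulation (a toy sketch, not MOE code; the payout rates, round-robin exploration order, and tie-breaking rule are illustrative assumptions):

```python
import random

def epsilon_first(payout_rates, T, epsilon, rng=random.Random(42)):
    """Epsilon-first: explore for the first epsilon*T pulls, then only exploit."""
    K = len(payout_rates)
    pulls = [0] * K
    wins = [0] * K
    explore_steps = int(epsilon * T)
    total_payout = 0
    for t in range(T):
        if t < explore_steps:
            arm = t % K  # explore: cycle through the arms in order
        else:
            # exploit: arm with the best observed win ratio (ties -> lowest index)
            arm = max(range(K),
                      key=lambda k: wins[k] / pulls[k] if pulls[k] else 0.0)
        reward = 1 if rng.random() < payout_rates[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
        total_payout += reward
    return total_payout, pulls

# Three arms with the (unknown to the policy) payout rates from the example
payout, pulls = epsilon_first([0.5, 0.8, 0.2], T=1000, epsilon=0.09)
```

After the εT = 90 exploration pulls, the policy commits to whichever arm looks best, which is exactly the failure mode the next slides walk through.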
10. Example (K = 3, t = 0)
Unknown payout rates: p = 0.5 | p = 0.8 | p = 0.2

Observed Information
PULLS:   0      0      0
WINS:    0      0      0
RATIO:   -      -      -

11. Example (K = 3, t = 1)
Unknown payout rates: p = 0.5 | p = 0.8 | p = 0.2

Observed Information
PULLS:   1      0      0
WINS:    1      0      0
RATIO:   1      -      -

12. Example (K = 3, t = 2)
Unknown payout rates: p = 0.5 | p = 0.8 | p = 0.2

Observed Information
PULLS:   1      1      0
WINS:    1      1      0
RATIO:   1      1      -

13. Example (K = 3, t = 3)
Unknown payout rates: p = 0.5 | p = 0.8 | p = 0.2

Observed Information
PULLS:   1      1      1
WINS:    1      1      0
RATIO:   1      1      0

14. Example (K = 3, t = 4)
Unknown payout rates: p = 0.5 | p = 0.8 | p = 0.2

Observed Information
PULLS:   2      1      1
WINS:    1      1      0
RATIO:   0.5    1      0

15. Example (K = 3, t = 5)
Unknown payout rates: p = 0.5 | p = 0.8 | p = 0.2

Observed Information
PULLS:   2      2      1
WINS:    1      2      0
RATIO:   0.5    1      0

16. Example (K = 3, t = 6)
Unknown payout rates: p = 0.5 | p = 0.8 | p = 0.2

Observed Information
PULLS:   2      2      2
WINS:    1      2      0
RATIO:   0.5    1      0

17. Example (K = 3, t = 7)
Unknown payout rates: p = 0.5 | p = 0.8 | p = 0.2

Observed Information
PULLS:   3      2      2
WINS:    2      2      0
RATIO:   0.66   1      0

18. Example (K = 3, t = 8)
Unknown payout rates: p = 0.5 | p = 0.8 | p = 0.2

Observed Information
PULLS:   3      3      2
WINS:    2      3      0
RATIO:   0.66   1      0

19. Example (K = 3, t = 9)
Unknown payout rates: p = 0.5 | p = 0.8 | p = 0.2

Observed Information
PULLS:   3      3      3
WINS:    2      3      1
RATIO:   0.66   1      0.33

21. What if our observed ratio is a poor approx?
Unknown payout rates: p = 0.5 | p = 0.8 | p = 0.2

Observed Information
PULLS:   3      3      3
WINS:    2      3      1
RATIO:   0.66   1      0.33

22. What if our observed ratio is a poor approx?
Unknown payout rates: p = 0.9 | p = 0.5 | p = 0.5

Observed Information
PULLS:   3      3      3
WINS:    2      3      1
RATIO:   0.66   1      0.33
23. Fixed exploration fails
Regret is unbounded!
Amount of exploration
needs to depend on data
We need better policies!
24. What should we do?
Many different policies
● Weighted random choice (another naive approach)
● Epsilon-greedy
○ Best arm so far with P=1-ε, random otherwise
● Epsilon-decreasing*
○ Best arm so far with P=1-(ε * exp(-rt)), random otherwise
● UCB-exp*
● UCB-tuned*
● BLA*
● SoftMax*
● etc, etc, etc (60+ years of research)
*Regret bounded as t->infinity
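As an illustration, the epsilon-greedy arm choice above might look like this in Python (a hedged sketch; the toy pull/win counts are invented for the example):

```python
import random

def epsilon_greedy_choose(pulls, wins, epsilon, rng):
    """Pick the best arm so far with prob 1 - epsilon, a uniform random arm otherwise."""
    K = len(pulls)
    if rng.random() < epsilon:
        return rng.randrange(K)  # explore: any arm, uniformly
    # exploit: best observed win ratio; unpulled arms are tried first
    ratios = [wins[k] / pulls[k] if pulls[k] else float('inf') for k in range(K)]
    return max(range(K), key=lambda k: ratios[k])

rng = random.Random(0)
pulls, wins = [3, 3, 3], [2, 3, 1]  # observed counts from the running example
arm = epsilon_greedy_choose(pulls, wins, epsilon=0.1, rng=rng)
```

Because every arm keeps a small probability of being sampled, the policy can recover when the observed ratios are a poor approximation of the true payout rates.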
25. Bandits in the Wild
What if...
● Hardware constraints limit real-time knowledge? (batching)
● Payoff noisy? Non-binary? Changes in time? (dynamic content)
● Parallel sampling? (many concurrent users)
● Arms expire? (events, news stories, etc)
● You have knowledge of the user? (logged in, contextual history)
● The number of arms increases? Continuous? (parameter search)
Every problem is different.
This is an active area of research.
27. THE GOAL
● Optimize some objective function
○ CTR, revenue, delivery time, or some combination thereof
● given some parameters
○ config values, cutoffs, ML parameters
● CTR = f(parameters)
○ Find best parameters
● We want to sample the underlying function as few times as possible
28. Metric Optimization Engine
A global, black box method for parameter optimization
History of how past parameters have performed
MOE
New, optimal parameters
29. What does MOE do?
● MOE optimizes a metric (like CTR) given some
parameters as inputs (like scoring weights)
● Given the past performance of different parameters
MOE suggests new, optimal parameters to test
Results of A/B
tests run so far
MOE
New, optimal
values to A/B test
30. Example Experiment
Biz details distance in ad
● For each category we define a maximum distance to show “X miles away” text in the biz_details ad
● Setting a different distance cutoff for each category

Parameters + Obj Func:
distance_cutoffs = {
    'shopping': 20.0,
    'food': 14.0,
    'auto': 15.0,
    …}
objective_function = {
    'value': 0.012,
    'std': 0.00013
}

MOE → New Parameters:
distance_cutoffs = {
    'shopping': 22.1,
    'food': 7.3,
    'auto': 12.6,
    …}

Run A/B Test
31. Why do we need MOE?
● Parameter optimization is hard
○ Finding the perfect set of parameters takes a long time
○ Hope it is well behaved and try to move in the right direction
○ Not possible as number of parameters increases
● Intractable to find best set of parameters in all situations
○ Thousands of combinations of program type, flow, category
○ Finding the best parameters manually is impossible
● Heuristics quickly break down in the real world
○ Dependent parameters (changes to one affect all the others)
○ Many parameters at once (location, category, map, place, ...)
○ Non-linear (complexity and chaos break assumptions)
MOE solves all of these problems in an optimal way
32. How does it work?
MOE
1. Build Gaussian Process (GP)
with points sampled so far
2. Optimize covariance
hyperparameters of GP
3. Find point(s) of highest
Expected Improvement
within parameter domain
4. Return optimal next best
point(s) to sample
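Step 1 above can be sketched with a few lines of numpy (a minimal toy sketch assuming a squared-exponential kernel and made-up 1-D data; MOE's actual implementation lives in C++/CUDA):

```python
import numpy as np

def gp_posterior(X, y, Xs, length_scale=0.3, signal_var=1.0, noise_var=1e-6):
    """Posterior mean and variance of a GP with a squared-exponential kernel."""
    def k(A, B):
        d = A[:, None] - B[None, :]
        return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)
    K = k(X, X) + noise_var * np.eye(len(X))  # covariance of sampled points
    Ks = k(X, Xs)                             # cross-covariance to candidates
    mu = Ks.T @ np.linalg.solve(K, y)         # posterior mean
    var = signal_var - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.maximum(var, 0.0)           # clamp tiny negative round-off

X = np.array([0.1, 0.5, 0.9])      # parameters sampled so far (toy data)
y = np.array([0.2, 1.0, 0.4])      # observed objective values
grid = np.linspace(0.0, 1.0, 101)  # candidate points in the parameter domain
mu, var = gp_posterior(X, y, grid)
```

The posterior mean interpolates the sampled points, while the posterior variance grows in the unexplored regions between them, which is what the Expected Improvement step exploits.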
35. Optimizing Covariance Hyperparameters
Finding the GP model that fits best
● All of these GPs are created with the same initial data
○ with different hyperparameters (length scales)
● Need to find the model that is most likely given the data
○ Maximum likelihood, cross validation, priors, etc
Rasmussen and Williams Gaussian Processes for Machine Learning
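Selecting the most likely model can be sketched as maximizing the GP log marginal likelihood over length scales (toy data and a simple grid search for illustration; MOE optimizes this with gradient-based methods):

```python
import numpy as np

def log_marginal_likelihood(X, y, length_scale, signal_var=1.0, noise_var=1e-4):
    """GP log marginal likelihood under a squared-exponential kernel."""
    d = X[:, None] - X[None, :]
    K = signal_var * np.exp(-0.5 * (d / length_scale) ** 2) \
        + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha               # data-fit term
            - np.sum(np.log(np.diag(L)))   # complexity penalty (log det)
            - 0.5 * len(X) * np.log(2 * np.pi))

X = np.array([0.1, 0.4, 0.6, 0.9])  # toy sampled parameters
y = np.array([0.0, 0.8, 0.9, 0.1])  # toy observed values
scales = np.linspace(0.05, 1.0, 50)
best_scale = max(scales, key=lambda s: log_marginal_likelihood(X, y, s))
```

The likelihood trades off data fit against model complexity, so very short and very long length scales both score poorly.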
37. Find point(s) of highest expected improvement
We want to find the point(s) that are expected to beat the best point seen so far, by the most.
[Jones, Schonlau, Welch 1998]
[Clark, Frazier 2012]
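For a single candidate point with Gaussian posterior, Expected Improvement has a closed form, sketched below (the posterior means/standard deviations are hypothetical numbers for illustration):

```python
from math import erf, exp, pi, sqrt

def expected_improvement(mu, sigma, best_so_far):
    """Closed-form EI for maximization: E[max(f - best_so_far, 0)], f ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(mu - best_so_far, 0.0)
    z = (mu - best_so_far) / sigma
    pdf = exp(-0.5 * z * z) / sqrt(2 * pi)    # standard normal density at z
    cdf = 0.5 * (1.0 + erf(z / sqrt(2)))      # standard normal CDF at z
    return (mu - best_so_far) * cdf + sigma * pdf

# Hypothetical posteriors at two candidate points; best observed value = 1.0
ei_a = expected_improvement(mu=0.9, sigma=0.3, best_so_far=1.0)   # uncertain point
ei_b = expected_improvement(mu=0.7, sigma=0.05, best_so_far=1.0)  # confident, worse point
```

Note that the uncertain point wins even though both means are below the best value seen so far: EI rewards uncertainty, which is how the method balances exploration and exploitation.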
38. Tying it all Together #1: A/B Testing
● Optimally assign traffic fractions for experiments (Multi-Armed Bandits)
● Optimally suggest new cohorts to be run (Bayesian Global Optimization)
[Diagram: Users → Experiment Framework (users -> cohorts; cohorts -> % traffic, params) → App (cohorts -> params; params -> objective function, which is time consuming and expensive) → Metric System (batch: Logs, Metrics, Results) → daily/hourly batch → MOE (Multi-Armed Bandits, Bayesian Global Opt) → optimal cohort % traffic and optimal new params fed back to the Experiment Framework]
39. Tying it all Together #2: Expensive Batch Systems
● Optimally suggest new hyperparameters for the framework to minimize loss (Bayesian Global Optimization)
[Diagram: Big Data → Machine Learning Framework (complex regression, deep learning system, etc.; time consuming and expensive) → framework output → Metrics (Error, Loss, Likelihood, etc.) → MOE (Bayesian Global Opt) → optimal hyperparameters fed back into the framework]
40. What is MOE doing right now?
MOE is now live in production
● MOE is informing active experiments
● MOE is successfully optimizing towards all given metrics
● MOE treats the underlying system it is optimizing as a black box,
allowing it to be easily extended to any system
44.
● Multi-Armed Bandits
○ Many policies implemented and more on the way
● Global Optimization
○ Bayesian Global Optimization via Expected Improvement on GPs
45. MOE is Easy to Install
● yelp.github.io/MOE/install.html#install-in-docker
● registry.hub.docker.com/u/yelpmoe/latest
A MOE server is now running at http://localhost:6543
47. References
Gaussian Processes for Machine Learning
Carl Edward Rasmussen and Christopher K. I. Williams. 2006.
The MIT Press. 55 Hayward St., Cambridge, MA 02142.
http://www.gaussianprocess.org/gpml/ (free electronic copy)
Parallel Machine Learning Algorithms In Bioinformatics and Global Optimization
(PhD Dissertation)
Part II, EPI: Expected Parallel Improvement
Scott Clark. 2012.
Cornell University, Center for Applied Mathematics. Ithaca, NY.
https://github.com/sc932/Thesis
Differentiation of the Cholesky Algorithm
S. P. Smith. 1995.
Journal of Computational and Graphical Statistics. Volume 4. Number 2. p134-147
A Multi-points Criterion for Deterministic Parallel Global Optimization based on
Gaussian Processes.
David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. 2008.
Département 3MI, École Nationale Supérieure des Mines. 158 cours Fauriel, Saint-Étienne, France.
{ginsbourger, leriche, carraro}@emse.fr
Efficient Global Optimization of Expensive Black-Box Functions
Jones, D.R., Schonlau, M., Welch, W.J. 1998.
Journal of Global Optimization, 13, 455-492.
48. Use Cases
● Optimizing a system's click-through or conversion rate (CTR).
○ MOE is useful when evaluating CTR requires running an A/B test on real user traffic, and
getting statistically significant results requires running this test for a substantial amount of time
(hours, days, or even weeks). Examples include setting distance thresholds, ad unit properties,
or internal configuration values.
○ http://engineeringblog.yelp.com/2014/10/using-moe-the-metric-optimization-engine-to-optimize-an-ab-testing-experiment-framework.html
● Optimizing tunable parameters of a machine-learning prediction method.
○ MOE can be used when calculating the prediction error for one choice of the parameters takes a
long time, which might happen because the prediction method is complex and takes a long
time to train, or because the data used to evaluate the error is huge. Examples include deep
learning methods or hyperparameters of features in logistic regression.
49. More Use Cases
● Optimizing the design of an engineering system.
○ MOE helps when evaluating a design requires running a complex physics-based numerical
simulation on a supercomputer. Examples include designing and modeling airplanes, the
traffic network of a city, a combustion engine, or a hospital.
● Optimizing the parameters of a real-world experiment.
○ MOE can help guide design when every experiment needs to be physically created in a lab or
very few experiments can be run in parallel. Examples include chemistry, biology, or physics
experiments or a drug trial.
● Any time sampling a tunable, unknown function is time consuming or
expensive.