Learning with Exploration
Alina Beygelzimer
Yahoo Labs, New York
(based on work by many)
Interactive Learning
Repeatedly:
1 A user comes to Yahoo
2 Yahoo chooses content to present (urls, ads, news stories)
3 The user reacts to the presented information (clicks on something)
Making good content decisions requires learning from user feedback.
Abstracting the Setting
For t = 1, . . . , T:
1 The world produces some context x ∈ X
2 The learner chooses an action a ∈ A
3 The world reacts with reward r(a, x)
Goal: Learn a good policy for choosing actions given context
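To make the abstraction concrete, here is a minimal sketch of that interaction loop in Python; the `world` and `learner` objects and their method names are illustrative placeholders, not part of the talk.

```python
# Minimal sketch of the contextual interaction loop described above.
def run(world, learner, T):
    for t in range(T):
        x = world.get_context()        # 1. the world produces some context x
        a = learner.choose_action(x)   # 2. the learner chooses an action a
        r = world.get_reward(a, x)     # 3. the world reacts with reward r(a, x)
        learner.update(x, a, r)        # the learner only sees the reward of the chosen action
```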
Dominant Solution
1 Deploy some initial system
2 Collect data using this system
3 Use machine learning to build a reward predictor r̂(a, x) from the collected data (sketched below)
4 Evaluate new system = arg max_a r̂(a, x), via
  offline evaluation on past data
  bucket test
5 If metrics improve, switch to this new system and repeat
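As a rough illustration of steps 3–4 (the sketch referenced above), one could fit a reward regressor on the logged (context, action, reward) triples and deploy the arg max policy. The feature encoding and the scikit-learn regressor are assumptions made for this example, not the system described in the talk.

```python
# Sketch of the "dominant solution": learn r_hat from logged data, deploy argmax.
import numpy as np
from sklearn.linear_model import Ridge

def fit_reward_predictor(X, actions, rewards, n_actions):
    """Fit r_hat(a, x) by regressing reward on (context, one-hot action)."""
    A = np.eye(n_actions)[actions]            # one-hot encode the logged actions
    model = Ridge(alpha=1.0)
    model.fit(np.hstack([X, A]), rewards)     # only logged (x, a) pairs are ever seen
    return model

def greedy_policy(model, x, n_actions):
    """New system: pick arg max_a r_hat(a, x) for context x."""
    A = np.eye(n_actions)
    X = np.tile(x, (n_actions, 1))
    return int(np.argmax(model.predict(np.hstack([X, A]))))
```

Note that the regressor only ever sees the (x, a) pairs the deployed system chose to show, which is exactly what the bagels-vs-pizza example below exploits.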
Example: Bagels vs. Pizza for New York and Chicago users
Initial system: NY gets bagels, Chicago gets pizza.
Example: Bagels vs. Pizza for New York and Chicago users
Initial system: NY gets bagels, Chicago gets pizza.
Observed CTR (columns: Pizza, Bagels)
New York    ?        0.6
Chicago     0.4      ?
Example: Bagels vs. Pizza for New York and Chicago users
Initial system: NY gets bagels, Chicago gets pizza.
Observed CTR / Estimated CTR (columns: Pizza, Bagels)
New York    ?/0.5      0.6/0.6
Chicago     0.4/0.4    ?/0.5
Bagels win. Switch to serving bagels for all and update the model based on the new data.
Example: Bagels vs. Pizza for New York and Chicago users
Initial system: NY gets bagels, Chicago gets pizza.
Observed CTR / Estimated CTR (columns: Pizza, Bagels)
New York    ?/0.5      0.6/0.6
Chicago     0.4/0.4    0.7/0.5
Bagels win. Switch to serving bagels for all and update the model based on the new data.
Example: Bagels vs. Pizza for New York and Chicago users
Initial system: NY gets bagels, Chicago gets pizza.
Observed CTR / Estimated CTR (columns: Pizza, Bagels)
New York    ?/0.4595    0.6/0.6
Chicago     0.4/0.4     0.7/0.7
Bagels win. Switch to serving bagels for all and update the model based on the new data.
Example: Bagels vs. Pizza for New York and Chicago users
Initial system: NY gets bagels, Chicago gets pizza.
Observed CTR / Estimated CTR / True CTR (columns: Pizza, Bagels)
New York    ?/0.4595/1     0.6/0.6/0.6
Chicago     0.4/0.4/0.4    0.7/0.7/0.7
Yikes! Missed out big in NY!
Basic Observations
1 Standard machine learning is not enough. The model fits the collected data perfectly.
2 More data doesn't help: Observed = True where data was collected.
3 Better data helps! Exploration is required.
4 Prediction errors are not a proxy for controlled exploration.
Attempt to fix
New policy: bagels in the morning, pizza at night for both cities.
This will overestimate the CTR for both!
Solution: The deployed system should be randomized, with the action probabilities recorded.
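A minimal sketch of what "randomized with probabilities recorded" can look like in practice, assuming the serving policy exposes an explicit distribution over actions for each context; the function and log format here are illustrative, not the deployed system.

```python
# Serve from an explicit distribution over actions and log the propensity
# of the action that was actually shown.
import numpy as np

rng = np.random.default_rng(0)

def serve_and_log(x, action_probs, get_reward, log):
    """action_probs: array of length |A|, summing to 1 for this context x."""
    a = rng.choice(len(action_probs), p=action_probs)   # randomized choice
    r = get_reward(a, x)                                 # observed click / no click
    log.append((x, a, r, action_probs[a]))               # record the propensity p_a
    return a
```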
Offline Evaluation
Evaluating a new system on data collected by the deployed system may mislead badly:
Observed CTR / Estimated CTR / True CTR (columns: Pizza, Bagels)
New York    ?/1/1          0.6/0.6/0.5
Chicago     0.4/0.4/0.4    0.7/0.7/0.7
The new system appears worse than the deployed system on the collected data, although its true loss may be much lower.
The Evaluation Problem
Given a new policy, how do we evaluate it?
One possibility: Deploy it in the world.
Very Expensive! Need a bucket for every candidate policy.
A/B testing for evaluating two policies
Policy 1: use the "pizza for New York, bagels for Chicago" rule
Policy 2: use the "bagels for everyone" rule
Segment users randomly into Policy 1 and Policy 2 groups:
  Policy 2 → no click,  Policy 1 (NY) → no click,  Policy 2 → click,  Policy 1 (Chicago) → no click,  . . .
Two weeks later, evaluate which is better.
Instead randomize every transaction
(at least for transactions you plan to use for learning and/or evaluation)
Simplest strategy: ε-greedy. Go with the empirically best policy, but always choose a random action with probability ε > 0.
Logged exploration events (context, action, reward, probability):
  (x, b, 0, p_b)  no click on bagel
  (x, p, 0, p_p)  no click on pizza
  (x, p, 1, p_p)  click on pizza
  (x, b, 0, p_b)  no click on bagel
  · · ·
Offline evaluation
Later evaluate any policy using the same events. Each evaluation is cheap and immediate.
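A small sketch of the ε-greedy choice and the propensity that gets logged with each event; the uniform exploration over all actions and the specific log format are assumptions of this illustration.

```python
# epsilon-greedy: follow the empirically best policy, but explore uniformly
# with probability epsilon, and keep the propensity for later evaluation.
import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy(best_action, n_actions, epsilon):
    """Return (action, propensity). best_action is what the current policy picks."""
    probs = np.full(n_actions, epsilon / n_actions)
    probs[best_action] += 1.0 - epsilon          # the greedy action gets the rest
    a = rng.choice(n_actions, p=probs)
    return a, probs[a]

# Usage: a, p_a = epsilon_greedy(best_action=policy(x), n_actions=2, epsilon=0.1)
# After observing the reward r, log the tuple (x, a, r, p_a).
```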
The Importance Weighting Trick
Let π : X → A be a policy. How do we evaluate it?
Collect exploration samples of the form (x, a, r_a, p_a), where
  x = context
  a = action
  r_a = reward for action a
  p_a = probability of action a
then evaluate
  Value(π) = Average[ r_a · 1(π(x) = a) / p_a ]
The Importance Weighting Trick
Theorem
Value(π) is an unbiased estimate of the expected reward of π:
  E_{(x,r)∼D}[ r_{π(x)} ] = E[ Value(π) ],
with deviations bounded by O( 1 / (√T · min_x p_{π(x)}) ).
Example:
  Action        1        2
  Reward        0.5      1
  Probability   1/4      3/4
  Estimate      2 | 0    0 | 4/3
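A sketch of the importance-weighted estimator above, applied to logged (x, a, r_a, p_a) tuples; the comment at the end works through the slide's two-action example.

```python
# Importance-weighted (IPS) value estimate of a policy pi over logged data.
import numpy as np

def ips_value(policy, logged):
    """logged: iterable of (x, a, r, p) tuples; policy: function x -> action."""
    terms = [r * (policy(x) == a) / p for (x, a, r, p) in logged]
    return float(np.mean(terms))

# Slide example: action 1 has reward 0.5 and probability 1/4, action 2 has
# reward 1 and probability 3/4. A policy that always picks action 1 gets a
# per-sample estimate of 0.5 / (1/4) = 2 when action 1 was logged and 0
# otherwise, averaging to 0.5; a policy that picks action 2 gets 0 or
# 1 / (3/4) = 4/3, averaging to 1. Both match the true rewards.
```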
Can we do better?
Suppose we have a (possibly bad) reward estimator r̂(a, x). How can we use it?
  Value′(π) = Average[ (r_a − r̂(a, x)) · 1(π(x) = a) / p_a + r̂(π(x), x) ]
Why does this work?
  E_{a∼p}[ r̂(a, x) · 1(π(x) = a) / p_a ] = r̂(π(x), x),
so the estimate stays unbiased, and it helps because a small r_a − r̂(a, x) reduces variance.
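A sketch of the improved estimator Value′(π); r_hat can be any reward model, and plugging in a constant zero recovers the plain importance-weighted estimate. The names here are illustrative.

```python
# Value'(pi): subtract the reward model inside the importance-weighted term
# and add it back for the action the policy would choose.
import numpy as np

def dr_value(policy, r_hat, logged):
    """logged: iterable of (x, a, r, p); policy: x -> action; r_hat: (a, x) -> float."""
    terms = [
        (r - r_hat(a, x)) * (policy(x) == a) / p + r_hat(policy(x), x)
        for (x, a, r, p) in logged
    ]
    return float(np.mean(terms))
```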
How do you directly optimize based on past exploration data?
1 Learn r̂(a, x).
2 Compute for each x and a′ ∈ A:
    (r_a − r̂(a, x)) · 1(a′ = a) / p_a + r̂(a′, x)
3 Learn π using a cost-sensitive multiclass classifier (a sketch of step 2 follows below).
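As referenced in step 3, here is a sketch of how the quantities in step 2 turn one logged event into a per-action reward vector for a cost-sensitive multiclass learner; the helper name and the negation to costs are conventions assumed for this illustration.

```python
# Turn a logged (x, a, r, p) tuple into estimated rewards for every action;
# r_hat is the learned reward estimator from step 1.
import numpy as np

def per_action_rewards(x, a, r, p, r_hat, n_actions):
    """Estimated reward of every action a_prime for this single event."""
    rewards = np.array([r_hat(a_prime, x) for a_prime in range(n_actions)])
    rewards[a] += (r - r_hat(a, x)) / p        # correction only for the logged action
    return rewards

# Training set for the cost-sensitive classifier: pairs (x, per-action costs),
# with costs = -per_action_rewards(...); the learned classifier is the policy pi.
```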
Take home summary
Using exploration data
1 There are techniques for using past exploration data to evaluate any policy.
2 You can reliably measure performance offline, and hence experiment much faster, shifting from guess-and-check (A/B testing) to direct optimization.
Doing exploration
1 There has been much recent progress on practical regret-optimal algorithms.
2 ε-greedy has suboptimal regret but is a reasonable choice in practice.
Comparison of Approaches

                Supervised                 ε-greedy                           Optimal CB algorithms
Feedback        full                       bandit                             bandit
Regret          O(√(ln(|Π|/δ) / T))        O((|A| ln(|Π|/δ) / T)^(1/3))       O(√(|A| ln(|Π|/δ) / T))
Running time    O(T)                       O(T)                               O(T^1.5)

A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, R. Schapire: Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits, 2014.
M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, T. Zhang: Efficient Optimal Learning for Contextual Bandits, 2011.
A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R. Schapire: Contextual Bandit Algorithms with Supervised Learning Guarantees, 2011.