Learning with Exploration
Alina Beygelzimer
Yahoo Labs, New York
(based on work by many)
Interactive Learning
Repeatedly:
1 A user comes to Yahoo
2 Yahoo chooses content to present (urls, ads, news stories)
3 The user reacts to the presented information (clicks on something)
Making good content decisions requires learning from user feedback.
Abstracting the Setting
For t = 1, . . . , T:
1 The world produces some context x ∈ X
2 The learner chooses an action a ∈ A
3 The world reacts with reward r(a, x)
Goal: Learn a good policy for choosing actions given context
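To make the abstraction concrete, here is a minimal sketch of that interaction loop in Python; the `world` and `learner` objects and their method names are illustrative placeholders, not part of the talk.

```python
# Minimal sketch of the contextual interaction loop described above.
def run(world, learner, T):
    for t in range(T):
        x = world.get_context()        # 1. the world produces some context x
        a = learner.choose_action(x)   # 2. the learner chooses an action a
        r = world.get_reward(a, x)     # 3. the world reacts with reward r(a, x)
        learner.update(x, a, r)        # the learner only sees the reward of the chosen action
```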
Dominant Solution
1 Deploy some initial system
2 Collect data using this system
3 Use machine learning to build a reward predictor r̂(a, x) from the collected data (sketched below)
4 Evaluate new system = arg max_a r̂(a, x), via
  offline evaluation on past data
  bucket test
5 If metrics improve, switch to this new system and repeat
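As a rough illustration of steps 3–4 (the sketch referenced above), one could fit a reward regressor on the logged (context, action, reward) triples and deploy the arg max policy. The feature encoding and the scikit-learn regressor are assumptions made for this example, not the system described in the talk.

```python
# Sketch of the "dominant solution": learn r_hat from logged data, deploy argmax.
import numpy as np
from sklearn.linear_model import Ridge

def fit_reward_predictor(X, actions, rewards, n_actions):
    """Fit r_hat(a, x) by regressing reward on (context, one-hot action)."""
    A = np.eye(n_actions)[actions]            # one-hot encode the logged actions
    model = Ridge(alpha=1.0)
    model.fit(np.hstack([X, A]), rewards)     # only logged (x, a) pairs are ever seen
    return model

def greedy_policy(model, x, n_actions):
    """New system: pick arg max_a r_hat(a, x) for context x."""
    A = np.eye(n_actions)
    X = np.tile(x, (n_actions, 1))
    return int(np.argmax(model.predict(np.hstack([X, A]))))
```

Note that the regressor only ever sees the (x, a) pairs the deployed system chose to show, which is exactly what the bagels-vs-pizza example below exploits.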
Example: Bagels vs. Pizza for New York and Chicago users
Initial system: NY gets bagels, Chicago gets pizza.
Example: Bagels vs. Pizza for New York and Chicago users
Initial system: NY gets bagels, Chicago gets pizza.
Observed CTR (columns: Pizza, Bagels)
New York    ?        0.6
Chicago     0.4      ?
Example: Bagels vs. Pizza for New York and Chicago users
Initial system: NY gets bagels, Chicago gets pizza.
Observed CTR / Estimated CTR (columns: Pizza, Bagels)
New York    ?/0.5      0.6/0.6
Chicago     0.4/0.4    ?/0.5
Bagels win. Switch to serving bagels for all and update the model based on the new data.
Example: Bagels vs. Pizza for New York and Chicago users
Initial system: NY gets bagels, Chicago gets pizza.
Observed CTR / Estimated CTR (columns: Pizza, Bagels)
New York    ?/0.5      0.6/0.6
Chicago     0.4/0.4    0.7/0.5
Bagels win. Switch to serving bagels for all and update the model based on the new data.
Example: Bagels vs. Pizza for New York and Chicago users
Initial system: NY gets bagels, Chicago gets pizza.
Observed CTR / Estimated CTR (columns: Pizza, Bagels)
New York    ?/0.4595    0.6/0.6
Chicago     0.4/0.4     0.7/0.7
Bagels win. Switch to serving bagels for all and update the model based on the new data.
Example: Bagels vs. Pizza for New York and Chicago users
Initial system: NY gets bagels, Chicago gets pizza.
Observed CTR / Estimated CTR / True CTR (columns: Pizza, Bagels)
New York    ?/0.4595/1     0.6/0.6/0.6
Chicago     0.4/0.4/0.4    0.7/0.7/0.7
Yikes! Missed out big in NY!
Basic Observations
1 Standard machine learning is not enough. The model fits the collected data perfectly.
2 More data doesn't help: Observed = True where data was collected.
3 Better data helps! Exploration is required.
4 Prediction errors are not a proxy for controlled exploration.
Attempt to fix
New policy: bagels in the morning, pizza at night for both cities.
This will overestimate the CTR for both!
Solution: The deployed system should be randomized, with the action probabilities recorded.
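A minimal sketch of what "randomized with probabilities recorded" can look like in practice, assuming the serving policy exposes an explicit distribution over actions for each context; the function and log format here are illustrative, not the deployed system.

```python
# Serve from an explicit distribution over actions and log the propensity
# of the action that was actually shown.
import numpy as np

rng = np.random.default_rng(0)

def serve_and_log(x, action_probs, get_reward, log):
    """action_probs: array of length |A|, summing to 1 for this context x."""
    a = rng.choice(len(action_probs), p=action_probs)   # randomized choice
    r = get_reward(a, x)                                 # observed click / no click
    log.append((x, a, r, action_probs[a]))               # record the propensity p_a
    return a
```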
Offline Evaluation
Evaluating a new system on data collected by the deployed system may mislead badly:
Observed CTR / Estimated CTR / True CTR (columns: Pizza, Bagels)
New York    ?/1/1          0.6/0.6/0.5
Chicago     0.4/0.4/0.4    0.7/0.7/0.7
The new system appears worse than the deployed system on the collected data, although its true loss may be much lower.
The Evaluation Problem
Given a new policy, how do we evaluate it?
One possibility: Deploy it in the world.
Very Expensive! Need a bucket for every candidate policy.
A/B testing for evaluating two policies
Policy 1: use the "pizza for New York, bagels for Chicago" rule
Policy 2: use the "bagels for everyone" rule
Segment users randomly into Policy 1 and Policy 2 groups:
  Policy 2 → no click,  Policy 1 (NY) → no click,  Policy 2 → click,  Policy 1 (Chicago) → no click,  . . .
Two weeks later, evaluate which is better.
Instead randomize every transaction
(at least for transactions you plan to use for learning and/or evaluation)
Simplest strategy: ε-greedy. Go with the empirically best policy, but always choose a random action with probability ε > 0.
Logged exploration events (context, action, reward, probability):
  (x, b, 0, p_b)  no click on bagel
  (x, p, 0, p_p)  no click on pizza
  (x, p, 1, p_p)  click on pizza
  (x, b, 0, p_b)  no click on bagel
  · · ·
Offline evaluation
Later evaluate any policy using the same events. Each evaluation is cheap and immediate.
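A small sketch of the ε-greedy choice and the propensity that gets logged with each event; the uniform exploration over all actions and the specific log format are assumptions of this illustration.

```python
# epsilon-greedy: follow the empirically best policy, but explore uniformly
# with probability epsilon, and keep the propensity for later evaluation.
import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy(best_action, n_actions, epsilon):
    """Return (action, propensity). best_action is what the current policy picks."""
    probs = np.full(n_actions, epsilon / n_actions)
    probs[best_action] += 1.0 - epsilon          # the greedy action gets the rest
    a = rng.choice(n_actions, p=probs)
    return a, probs[a]

# Usage: a, p_a = epsilon_greedy(best_action=policy(x), n_actions=2, epsilon=0.1)
# After observing the reward r, log the tuple (x, a, r, p_a).
```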
The Importance Weighting Trick
Let π : X → A be a policy. How do we evaluate it?
Collect exploration samples of the form (x, a, r_a, p_a), where
  x = context
  a = action
  r_a = reward for action a
  p_a = probability of action a
then evaluate
  Value(π) = Average[ r_a · 1(π(x) = a) / p_a ]
The Importance Weighting Trick
Theorem
Value(π) is an unbiased estimate of the expected reward of π:
  E_{(x,r)∼D}[ r_{π(x)} ] = E[ Value(π) ],
with deviations bounded by O( 1 / (√T · min_x p_{π(x)}) ).
Example:
  Action        1        2
  Reward        0.5      1
  Probability   1/4      3/4
  Estimate      2 | 0    0 | 4/3
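A sketch of the importance-weighted estimator above, applied to logged (x, a, r_a, p_a) tuples; the comment at the end works through the slide's two-action example.

```python
# Importance-weighted (IPS) value estimate of a policy pi over logged data.
import numpy as np

def ips_value(policy, logged):
    """logged: iterable of (x, a, r, p) tuples; policy: function x -> action."""
    terms = [r * (policy(x) == a) / p for (x, a, r, p) in logged]
    return float(np.mean(terms))

# Slide example: action 1 has reward 0.5 and probability 1/4, action 2 has
# reward 1 and probability 3/4. A policy that always picks action 1 gets a
# per-sample estimate of 0.5 / (1/4) = 2 when action 1 was logged and 0
# otherwise, averaging to 0.5; a policy that picks action 2 gets 0 or
# 1 / (3/4) = 4/3, averaging to 1. Both match the true rewards.
```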
Can we do better?
Suppose we have a (possibly bad) reward estimator r̂(a, x). How can we use it?
  Value′(π) = Average[ (r_a − r̂(a, x)) · 1(π(x) = a) / p_a + r̂(π(x), x) ]
Why does this work?
  E_{a∼p}[ r̂(a, x) · 1(π(x) = a) / p_a ] = r̂(π(x), x),
so the estimate stays unbiased, and it helps because a small r_a − r̂(a, x) reduces variance.
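A sketch of the improved estimator Value′(π); r_hat can be any reward model, and plugging in a constant zero recovers the plain importance-weighted estimate. The names here are illustrative.

```python
# Value'(pi): subtract the reward model inside the importance-weighted term
# and add it back for the action the policy would choose.
import numpy as np

def dr_value(policy, r_hat, logged):
    """logged: iterable of (x, a, r, p); policy: x -> action; r_hat: (a, x) -> float."""
    terms = [
        (r - r_hat(a, x)) * (policy(x) == a) / p + r_hat(policy(x), x)
        for (x, a, r, p) in logged
    ]
    return float(np.mean(terms))
```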
How do you directly optimize based on past exploration data?
1 Learn r̂(a, x).
2 Compute for each x and a′ ∈ A:
    (r_a − r̂(a, x)) · 1(a′ = a) / p_a + r̂(a′, x)
3 Learn π using a cost-sensitive multiclass classifier (a sketch of step 2 follows below).
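As referenced in step 3, here is a sketch of how the quantities in step 2 turn one logged event into a per-action reward vector for a cost-sensitive multiclass learner; the helper name and the negation to costs are conventions assumed for this illustration.

```python
# Turn a logged (x, a, r, p) tuple into estimated rewards for every action;
# r_hat is the learned reward estimator from step 1.
import numpy as np

def per_action_rewards(x, a, r, p, r_hat, n_actions):
    """Estimated reward of every action a_prime for this single event."""
    rewards = np.array([r_hat(a_prime, x) for a_prime in range(n_actions)])
    rewards[a] += (r - r_hat(a, x)) / p        # correction only for the logged action
    return rewards

# Training set for the cost-sensitive classifier: pairs (x, per-action costs),
# with costs = -per_action_rewards(...); the learned classifier is the policy pi.
```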
Take home summary
Using exploration data
1 There are techniques for using past exploration data to evaluate any policy.
2 You can reliably measure performance offline, and hence experiment much faster, shifting from guess-and-check (A/B testing) to direct optimization.
Doing exploration
1 There has been much recent progress on practical regret-optimal algorithms.
2 ε-greedy has suboptimal regret but is a reasonable choice in practice.
Comparison of Approaches

                Supervised                 ε-greedy                           Optimal CB algorithms
Feedback        full                       bandit                             bandit
Regret          O(√(ln(|Π|/δ) / T))        O((|A| ln(|Π|/δ) / T)^(1/3))       O(√(|A| ln(|Π|/δ) / T))
Running time    O(T)                       O(T)                               O(T^1.5)

A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, R. Schapire: Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits, 2014.
M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, T. Zhang: Efficient Optimal Learning for Contextual Bandits, 2011.
A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R. Schapire: Contextual Bandit Algorithms with Supervised Learning Guarantees, 2011.