The document discusses optimization techniques for online experiments known as multi-armed bandits. It introduces the multi-armed bandit problem of balancing exploration of new options with exploitation of existing knowledge. It then describes several techniques for approaching this problem, including A/B testing, epsilon greedy, and upper confidence bound approaches. The document provides examples of how these techniques work and notes that multi-armed bandit methods can provide more efficient learning than A/B testing alone.
1. Bandit Basics – A Different Take on Online Optimization
Conductrics twitter: @mgershoff
2. Who is this guy?
Matt Gershoff
CEO: Conductrics
Many Years in Database Marketing (New York
and Paris)
and a bit of Web Analytics
www.conductrics.com
twitter: @mgershoff
Email: matt@conductrics.com
4. What Are We Going to Hear?
• Optimization Basics
• Multi-Armed Bandit
• It's a Problem, Not a Method
• Some Methods
• A/B Testing
• Epsilon Greedy
• Upper Confidence Bound (UCB)
• Some Results
6. OPTIMIZATION
If THIS Then THAT
ITTT brings together:
1. Decision Rules
2. Predictive Analytics
3. Choice Optimization
7. OPTIMIZATION
Find and Apply the Rule with the most Value
[Grid of candidate "If THIS Then THAT" rules]
8. OPTIMIZATION
[Diagram: "THIS" – variables whose values are given to you (e.g. Facebook, High Spend, Urban GEO, Home Page, App Use) are the inputs to a predictive model (features F1 … Fm producing a value score), which selects among "THAT" – variables you control (Offer A, Offer B, Offer C, … Offer Y, Offer Z) as the outputs.]
9. But …
1. We Don't Have Data on 'THAT'
2. Need to Collect – Sample 'THAT'
3. How to Sample Efficiently?
[Diagram: each option, Offer A through Offer Z, marked with a '?']
11. What is a Multi-Armed Bandit?
One-Armed Bandit → Slot Machine
The problem:
How to pick between Slot Machines so that
you walk out with the most $$$ from the Casino at the
end of the Night?
[Image: two slot machines, A or B]
12. Objective
Pick so as to get as much
return/profit as you can
over time
Technical term: Minimize Regret
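"Minimize regret" can be written down precisely. A standard formalization (assumed here; the slides don't give the formula) compares your cumulative payoff to always playing the best arm:

```latex
% Regret after T selections: the payoff given up by not always
% playing the best arm, where \mu^{*} = \max_a \mu_a is the mean
% payoff of the best arm and a_t is the arm chosen at step t.
R_T \;=\; T\,\mu^{*} \;-\; \sum_{t=1}^{T} \mu_{a_t}
```

A good bandit policy keeps \(R_T\) growing slowly (sub-linearly) in \(T\).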
13. Sequential Selection
… but how to Pick?
[Image: two slot machines, A or B]
Need to Sample, but do it efficiently
14. Explore – Collect Data
• Data Collection is costly – an Investment
• Be Efficient – Balance the potential value of
collecting new data with exploiting what
you currently know.
15. Multi-Armed Bandits
“Bandit problems embody in essential
form a conflict evident in all human
action: choosing actions which yield
immediate reward vs. choosing actions
… whose benefit will come only later.”*
- Peter Whittle
*Source: Qing Zhao, UC Davis. Plenary talk at SPAWC, June, 2010.
16. Exploration vs. Exploitation
1) Explore/Learn – Try out different actions
to learn how they perform over time – This is
a data collection task.
2) Exploit/Earn – Take advantage of what
you have learned to get highest payoff –
Your current best guess
17. Not A New Problem
1933 – first work on competing
options
1940s – WWII problem the Allies
attempted to tackle
1953 – Bellman formulates it as a
Dynamic Programming problem
Source: http://www.lancs.ac.uk/~winterh/GRhist.html
18. Testing
• Explore First
– All actions have an equal chance of
selection (uniform random).
– Use hypothesis testing to select a
‘Winner’.
• Then Exploit - Keep only ‘Winner’
for selection
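The explore-first structure above can be sketched in a few lines. This is a simplified illustration, not Conductrics' implementation: the hypothesis test is omitted (the 'Winner' is just the empirical best after the explore phase), and the conversion rates are made up:

```python
import random

def explore_then_exploit(reward_probs, explore_n=1000, total_n=10000, seed=0):
    """Test-then-roll: sample all arms uniformly at random during an
    explore phase, then play the empirical winner for the rest."""
    rng = random.Random(seed)
    k = len(reward_probs)
    counts = [0] * k   # plays per arm
    wins = [0] * k     # successes per arm
    total_reward = 0
    for t in range(total_n):
        if t < explore_n:
            arm = rng.randrange(k)  # Explore: equal chance for every arm
        else:
            # Exploit: keep only the 'Winner' (highest observed rate)
            arm = max(range(k), key=lambda a: wins[a] / counts[a])
        reward = 1 if rng.random() < reward_probs[arm] else 0
        counts[arm] += 1
        wins[arm] += reward
        total_reward += reward
    return arm, total_reward

# Hypothetical 4% vs 5% conversion rates
winner, earned = explore_then_exploit([0.04, 0.05])
```

Note the weakness the later slides address: everything learned during exploration is thrown away except a single winner, and exploration never resumes.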
20. P-Values: A Digression
P-Values:
• NOT the probability that the Null is
True: P(Null = True | DATA)
• But rather P(DATA (or more extreme) | Null = True)
• Not a great tool for deciding when to
stop sampling
See:
http://andrewgelman.com/2010/09/noooooooooooooo_1/
http://www.stat.duke.edu/~berger/papers/02-01.pdf
21. A Couple Other Methods
1. Epsilon Greedy
Nice and Simple
2. Upper Confidence Bound (UCB)
Adapts to Uncertainty
23. Greedy
What do you mean by ‘Greedy’?
Make whatever choice seems
best at the moment.
24. Epsilon Greedy
What do you mean by ‘Epsilon
Greedy’?
• Explore – randomly select an action
ε percent of the time (say 20%)
• Exploit – Play greedy (pick the
current best) 1 − ε of the time (say 80%)
25. Epsilon Greedy
User
Explore/Learn (20%): Select Randomly – Like A/B Testing
Exploit/Earn (80%): Select Current Best (Be Greedy)
26. Epsilon Greedy
20% Random 80% Select Best
Action Value
A $5.00
B $4.00
C $3.00
D $2.00
E $1.00
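The selection rule in the table above fits in a few lines. A minimal sketch (not Conductrics' implementation), using the A=$5.00 … E=$1.00 example values:

```python
import random

def epsilon_greedy(values, epsilon=0.2, rng=None):
    """One epsilon-greedy selection over estimated action values.
    `values` maps action -> current estimated value."""
    rng = rng or random.Random()
    if rng.random() < epsilon:
        return rng.choice(list(values))    # Explore (20%): pick at random
    return max(values, key=values.get)     # Exploit (80%): pick current best

values = {"A": 5.00, "B": 4.00, "C": 3.00, "D": 2.00, "E": 1.00}
choice = epsilon_greedy(values, epsilon=0.2, rng=random.Random(42))
```

In a live system the value estimates would be updated after each observed reward; here they are fixed to keep the selection step visible.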
27. Continuous Sampling
[Chart: share of traffic devoted to Explore/Learn vs Exploit/Earn held constant over Time]
28. Epsilon Greedy
– Super Simple/low cost to implement
– Tends to be surprisingly effective
– Less affected by ‘Seasonality’
– Not optimal (hard to pick the best ε)
– Doesn't use a measure of variance
– Should/How to decrease Exploration over
time?
29. Upper Confidence Bound
Basic Idea:
1) Calculate both mean and a measure
of uncertainty (variance) for each
action.
2) Make Greedy selections based on
mean + uncertainty bonus
31. Upper Confidence
Score each option using the upper
portion of the interval as a Bonus
Mean + Bonus
32. Upper Confidence Bound
1) Use upper portion of CI as ‘Bonus’
2) Make Greedy Selections
[Chart: estimated reward with confidence intervals for actions A, B, C on a $0–$10 scale; A has the highest upper bound, so Select A]
33. Upper Confidence Bound
1) Selecting Action ‘A’ reduces uncertainty
bonus (because more data)
2) Action ‘C’ now has highest score
[Chart: same three actions; A's interval has narrowed after more data, C now has the highest upper bound, so Select C]
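The mean + bonus selection can be sketched with the standard UCB1 bonus, sqrt(2 ln t / n_a). The slides don't specify a formula, so this particular bonus (and the example numbers) are assumptions, not Conductrics' implementation:

```python
import math

def ucb_select(means, counts, total_plays):
    """UCB1: score each arm by mean + uncertainty bonus, pick the
    highest. The bonus sqrt(2 ln t / n_a) shrinks as an arm is
    played more, exactly the behavior shown in the slides."""
    def score(arm):
        if counts[arm] == 0:
            return float("inf")  # untried arms get selected first
        return means[arm] + math.sqrt(2 * math.log(total_plays) / counts[arm])
    return max(means, key=score)

# A has the best observed mean, but C is barely sampled, so C's
# large uncertainty bonus can push its score above A's.
means = {"A": 5.0, "B": 3.0, "C": 4.5}
counts = {"A": 200, "B": 150, "C": 5}
arm = ucb_select(means, counts, total_plays=355)
```

Here `arm` is "C": its bonus of roughly $1.53 outweighs A's $0.24, mirroring the "Select C" step above. Once C accumulates data, its bonus shrinks and A wins again.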
34. Upper Confidence Bound
• Like A/B Test – uses variance
measure
• Unlike A/B Test – no hypothesis
test
• Automatically Balances
Exploration with Exploitation