In this talk Ilias will discuss some variations of Multi-Armed Bandits (MABs), a less popular but important area of Machine Learning. MABs enable us to build adaptive systems that find solutions to tasks by interacting with their environment. A MAB solves a task by acquiring useful knowledge at every step of an iterative process, while balancing the exploration-exploitation dilemma. MABs are used to tackle practical problems such as selecting appropriate online ads and personalised content for users, assigning people to cohorts in controlled trials, supporting decision making, and more. In these kinds of problems, solutions need to be identified as quickly as possible, since errors can be costly. Ilias will discuss examples from industry and academia, as well as some of the related work at Atlassian.
2. Motivation
Increase awareness of some very useful but lesser-known techniques
Demo some current work at Atlassian
Connect it with some research from my past
Hopefully, there will be something useful for everybody — apologies for the few equations and loose notation
6. Many solutions… (a minimal sketch of strategies 1 and 5 follows this list)
1. ε-greedy: the best arm is selected for a proportion 1−ε of the trials, and a random arm in a proportion ε of the trials.
2. ε-greedy with variable ε
3. Pure exploration first, then pure exploitation.
4. …
5. Thompson sampling (draw from the estimated Beta distributions)
6. Upper Confidence Bound (UCB)
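As a concrete illustration of strategies 1 and 5, here is a minimal sketch for Bernoulli-reward arms. The arm probabilities, ε value, and step counts are invented for the example; this is not code from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.3, 0.5, 0.7]   # hypothetical Bernoulli arms; arm 2 is best
n_arms, n_steps, eps = len(true_probs), 5000, 0.1

# epsilon-greedy: explore with probability eps, otherwise exploit
counts, values = np.zeros(n_arms), np.zeros(n_arms)
for t in range(n_steps):
    if rng.random() < eps:
        a = rng.integers(n_arms)              # explore: random arm
    else:
        a = int(np.argmax(values))            # exploit: best estimate so far
    r = rng.random() < true_probs[a]          # Bernoulli reward
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]  # incremental mean update

# Thompson sampling: one Beta(successes+1, failures+1) posterior per arm
wins, losses = np.zeros(n_arms), np.zeros(n_arms)
for t in range(n_steps):
    a = int(np.argmax(rng.beta(wins + 1, losses + 1)))  # draw from each posterior
    r = rng.random() < true_probs[a]
    wins[a] += r
    losses[a] += 1 - r

print("eps-greedy pulls:", counts, " Thompson pulls:", wins + losses)
```

Both strategies concentrate their pulls on the best arm over time; ε-greedy keeps exploring at a fixed rate, while Thompson sampling explores less as the posteriors sharpen.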
15. Disadvantages
Reaching significance for non-winning arms takes longer
Unclear stopping criteria
Hard to order non-winning arms and reliably assess their impact
Advantages
Reaching significance for the winning arm is faster
Best arm can change over time
There are no false positives in the long term
19. Contextual Multi-Armed Bandits
We introduce a notion of proximity or similarity between arms, by associating each arm with context vectors:
$A \to \{x_{A,1}, x_{A,2}, x_{A,3}, \ldots\}$
$B \to \{x_{B,1}, x_{B,2}, x_{B,3}, \ldots\}$
20. LinUCB
L. Li, W. Chu, J. Langford, R. E. Schapire, "A Contextual-Bandit Approach to Personalized News Article Recommendation", WWW, 2010.
The UCB is some expectation plus some confidence level:
$\mu_a(t) + \sigma_a(t)$
We assume there is some unknown vector $\theta^*$, the same for each arm, for which:
$E[r_{a,t} \mid x_{a,t}] = x_{a,t}^\top \theta^*$
21. $E[r_{a,t} \mid x_{a,t}] = x_{a,t}^\top \theta^*$
$\mu_a(t) + \sigma_a(t)$
Using least squares:
$X_t := \{x_{a(1),1}, x_{a(2),2}, \ldots, x_{a(t),t}\}^\top$
$y_t := \{r_{a(1),1}, r_{a(2),2}, \ldots, r_{a(t),t}\}^\top$
$C_t := X_t^\top X_t$
$\hat{\theta}_t := C_t^{-1} X_t^\top y_t$
$\hat{\mu}_a(t) := x_{a,t}^\top \hat{\theta}_t$
so that
$\hat{\mu}_a(t) = x_{a,t}^\top C_t^{-1} X_t^\top y_t$
22. The upper confidence bound is some expectation plus some confidence level:
$\mu_a(t) + \sigma_a(t)$
$\hat{\mu}_a(t) := x_{a,t}^\top C_t^{-1} X_t^\top y_t$
$\hat{\sigma}_a(t) := \sqrt{x_{a,t}^\top C_t^{-1} x_{a,t}}$
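Putting the formulas above together, here is a minimal numpy sketch of LinUCB. The ridge term `lam` (added so $C_t$ is invertible from the first step) and the simulated linear-reward environment are assumptions, not part of the original slides or paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_arms, n_steps, lam = 5, 4, 2000, 1.0
theta_star = rng.normal(size=d)          # the unknown theta*, shared by all arms

# C accumulates X^T X (plus a small ridge); b accumulates X^T y
C = lam * np.eye(d)
b = np.zeros(d)

for t in range(n_steps):
    contexts = rng.normal(size=(n_arms, d))          # x_{a,t} for each arm
    C_inv = np.linalg.inv(C)
    theta_hat = C_inv @ b                            # theta_hat = C^{-1} X^T y
    mu_hat = contexts @ theta_hat                    # mu_hat_a = x^T theta_hat
    sigma_hat = np.sqrt(np.einsum("ad,dk,ak->a", contexts, C_inv, contexts))
    a = int(np.argmax(mu_hat + sigma_hat))           # UCB: expectation + confidence
    x = contexts[a]
    r = x @ theta_star + rng.normal(scale=0.1)       # noisy linear reward
    C += np.outer(x, x)                              # C_t = X_t^T X_t (+ ridge)
    b += r * x                                       # X_t^T y_t

print("theta*   ", np.round(theta_star, 2))
print("theta_hat", np.round(np.linalg.solve(C, b), 2))
```

The confidence width shrinks in directions of the context space that have been sampled often, so the algorithm keeps exploring directions it is still uncertain about.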
23. L. Li, W. Chu, J. Langford, R. E. Schapire, "A Contextual-Bandit Approach to Personalized News Article Recommendation", WWW, 2010.
25. • How can we locate the city of Bristol from tweets?
• 10K candidate locations organised in a 100×100 grid
• At every step we get tweets from one location and count mentions of "Bristol"
• Challenge: find the target in sub-linear time complexity!
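The slides show this demo as figures; as a toy stand-in, the setup could be simulated like this. The Poisson reward model and its decay scale are invented for illustration; only the 100×100 grid of candidate locations comes from the slide.

```python
import numpy as np

rng = np.random.default_rng(2)
side = 100                                  # 100x100 grid -> 10K candidate locations
grid = np.stack(np.meshgrid(np.arange(side), np.arange(side)), -1).reshape(-1, 2)
target = grid[rng.integers(len(grid))]      # the unknown location of Bristol

def pull(location):
    """Simulated feedback: number of 'Bristol' mentions in tweets from this
    location, modelled as a Poisson count decaying with distance to the target."""
    dist = np.linalg.norm(location - target)
    return rng.poisson(10.0 * np.exp(-dist / 10.0))

print(pull(target), pull(grid[0]))          # many mentions at the target, few far away
```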
27. The kernel trick! (No, it's not just for SVMs.)
John Shawe-Taylor & Nello Cristianini, "Kernel Methods for Pattern Analysis", Cambridge University Press, 2004.
28. LinUCB:
$\hat{\mu}_a(t) := x_{a,t}^\top \hat{\theta}_t$
$\hat{\sigma}_a(t) := \sqrt{x_{a,t}^\top C_t^{-1} x_{a,t}}$
$C_t := X_t^\top X_t$

KernelUCB:
$\hat{\mu}_a(t) = k_{x,t}^\top K_t^{-1} y_t$
$\hat{\sigma}_a(t) = \sqrt{k_{x,t}^\top K_t^{-2} k_{x,t}}$
$K_t = X_t X_t^\top$

M. Valko, N. Korda, R. Munos, I. Flaounas, N. Cristianini, "Finite-Time Analysis of Kernelised Contextual Bandits", UAI, 2013.
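A minimal sketch of these KernelUCB updates on a toy version of the grid search. The grid size, RBF bandwidth, regularisation `eps` (added so $K_t$ is invertible), and the simulated reward are all assumptions; this is not the CompLACS implementation linked later.

```python
import numpy as np

rng = np.random.default_rng(3)
side = 30                                     # toy 30x30 grid (instead of 100x100), for speed
grid = np.stack(np.meshgrid(np.arange(side), np.arange(side)), -1).reshape(-1, 2).astype(float)
target = grid[rng.integers(len(grid))]

def rbf(A, B, bw=5.0):
    """RBF kernel matrix between rows of A and rows of B (bandwidth is an assumption)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw**2))

X, y = [], []                                 # contexts pulled so far, observed rewards
eps = 1e-6                                    # small ridge so K_t is invertible
for t in range(150):
    if not X:
        a = rng.integers(len(grid))           # first pull: arbitrary arm
    else:
        Xt = np.array(X)
        K_inv = np.linalg.inv(rbf(Xt, Xt) + eps * np.eye(len(Xt)))
        k = rbf(grid, Xt)                     # k_{x,t} for every candidate arm at once
        mu_hat = k @ K_inv @ np.array(y)      # mu_hat = k^T K^{-1} y
        sigma_hat = np.sqrt(((k @ K_inv) ** 2).sum(1))  # k^T K^{-2} k, since K^{-1} is symmetric
        a = int(np.argmax(mu_hat + sigma_hat))
    x = grid[a]
    reward = np.exp(-np.linalg.norm(x - target) / 5.0) + rng.normal(scale=0.01)
    X.append(x); y.append(reward)

print("target:", target, " last pull:", X[-1])
```

The only change from LinUCB is that inner products of contexts are replaced by kernel evaluations, so arms can be "similar" in a nonlinear sense, e.g. nearby cells of the grid.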
29. • The last few steps of the algorithm before it locates Bristol.
• KernelUCB with an RBF kernel converges after ~300 iterations (instead of >>10K).
30. Target is the red dot.
We locate it using KernelUCB with RBF kernel.
KernelUCB code: http://www.complacs.org/pmwiki.php/CompLACS/KernelUCB
31. What if we have a high-dimensional space?
Hashing trick
Implementation in Vowpal Wabbit, by J. Langford, et al.
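The idea of the hashing trick, in a minimal form: hash each raw feature into a fixed-size vector instead of maintaining an explicit vocabulary. The dimension and feature strings below are made up, and Python's built-in `hash` stands in for the fast dedicated hash (MurmurHash) that Vowpal Wabbit actually uses.

```python
import numpy as np

def hash_features(tokens, dim=2**10):
    """Map arbitrarily many string features into a fixed-size vector.
    Collisions are tolerated; dim trades memory for collision rate."""
    v = np.zeros(dim)
    for tok in tokens:
        h = hash(tok)                     # illustration only; use a stable hash in practice
        sign = 1 if (h >> 31) & 1 == 0 else -1   # signed hashing reduces collision bias
        v[h % dim] += sign
    return v

x = hash_features(["word=bristol", "user=alice", "hour=17"])
print(x.shape)                            # fixed dimension, regardless of vocabulary size
```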
33. References
M. Valko, N. Korda, R. Munos, I. Flaounas, N. Cristianini, “Finite-Time Analysis of
Kernelised Contextual Bandits”, UAI, 2013.
L. Li, W. Chu, J. Langford, R. E. Schapire, “A Contextual-Bandit Approach to
Personalized News Article Recommendation”, WWW, 2010.
John Shawe-Taylor & Nello Cristianini, "Kernel Methods for Pattern Analysis", Cambridge University Press, 2004.
Implementation of KernelUCB in the CompLACS toolkit:
http://www.complacs.org/pmwiki.php/CompLACS/KernelUCB
https://en.wikipedia.org/wiki/Multi-armed_bandit
https://github.com/JohnLangford/vowpal_wabbit/wiki/Contextual-Bandit-Example
34. Thank you - We are hiring!
Dr Ilias Flaounas
Senior Data Scientist
<first>.<last>@atlassian.com