1. Two Characters: Exploration and Exploitation
- Need to balance exploration and exploitation
- or, experimentation and profit-maximization
- or, learning new ideas and taking advantage of the best of old ideas,
- or, gathering data and acting on that data
2. Why Use Multiarmed Bandit Algorithms?
- Measurable achievements examples
- Traffic, Conversions, Sales, CTRs
- Definitions
- Reward: Measure of success
- Arms: List of potential changes
- Standard A/B testing framed as an exploration-exploitation tradeoff (sketched in code below)
- Short period of pure exploration (assigning equal numbers of users to A and B)
- Long period of pure exploitation (sending all users to the successful option)
- Why A/B testing might be a bad strategy
- Abrupt transition
- Wastes resources exploring inferior options
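A minimal sketch of this explore-then-exploit strategy, assuming each arm is modeled as a function returning a reward; the function name and the n_explore/n_exploit budgets are illustrative, not from the source:

```python
import random

def ab_test(arms, n_explore, n_exploit):
    """Classic A/B testing as a bandit strategy: pure exploration,
    an abrupt transition, then pure exploitation."""
    counts = [0] * len(arms)
    rewards = [0.0] * len(arms)

    # Pure exploration: assign users to arms uniformly at random.
    for _ in range(n_explore):
        arm = random.randrange(len(arms))
        counts[arm] += 1
        rewards[arm] += arms[arm]()  # each arm is a function returning a reward

    # Abrupt transition: commit to the empirically best arm.
    best = max(range(len(arms)), key=lambda a: rewards[a] / max(counts[a], 1))

    # Pure exploitation: send all remaining users to the "winner",
    # even if the exploration-phase estimate happened to be wrong.
    total = sum(arms[best]() for _ in range(n_exploit))
    return best, total
```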
3. The ϵ-Greedy Algorithm
- Tries to be fair to the two opposite goals of exploration & exploitation
- ϵ=0: Pure exploitation
- ϵ=1: Pure exploration
- Problem of a fixed ϵ
- More exploration is needed at the start and more exploitation later, but a fixed ϵ cannot adapt.
- Explores arms completely at random, without any regard for their estimated merits.
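A minimal Python sketch of ϵ-Greedy as described above; the class name and the select_arm/update interface are illustrative choices:

```python
import random

class EpsilonGreedy:
    """With probability epsilon explore a random arm,
    otherwise exploit the arm with the highest estimated value."""
    def __init__(self, epsilon, n_arms):
        self.epsilon = epsilon        # fixed exploration rate
        self.counts = [0] * n_arms    # pulls per arm
        self.values = [0.0] * n_arms  # running average reward per arm

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))  # explore, ignoring merits
        return self.values.index(max(self.values))     # exploit the current best

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the running average reward.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```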
4. Debugging Bandit Algorithms
- Bandit algorithms are not black-box functions.
- A bandit algorithm has to actively select which data it acquires (active learning)
and analyze that data in real time (online learning).
- Bandit data and bandit analysis are inseparable: the algorithm and its data form a feedback cycle.
- Use Monte Carlo simulations to feed the algorithm simulated data in real time (a harness is sketched below).
- Analyzing results
- Tracking the probability of choosing the best arm,
as both bandit algorithms and rewards are probabilistic.
- Tracking the average reward at each point in time.
- Tracking the cumulative reward at each point in time,
to look at the bigger picture of the lifetime performance.
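A rough sketch of such a Monte Carlo harness, assuming Bernoulli arms and an algorithm object with the select_arm/update interface from the ϵ-Greedy sketch above; all names are illustrative:

```python
import random

def bernoulli_arm(p):
    """Simulated arm: pays reward 1.0 with probability p, else 0.0."""
    return lambda: 1.0 if random.random() < p else 0.0

def simulate(make_algo, probs, horizon, n_runs=1000):
    """Average the three metrics above over many simulated runs."""
    arms = [bernoulli_arm(p) for p in probs]
    best = probs.index(max(probs))
    prob_best = [0.0] * horizon    # probability of choosing the best arm
    avg_reward = [0.0] * horizon   # average reward at each point in time
    cum_reward = [0.0] * horizon   # cumulative reward at each point in time

    for _ in range(n_runs):
        algo = make_algo()
        total = 0.0
        for t in range(horizon):
            arm = algo.select_arm()
            reward = arms[arm]()
            algo.update(arm, reward)
            total += reward
            prob_best[t] += (arm == best) / n_runs
            avg_reward[t] += reward / n_runs
            cum_reward[t] += total / n_runs
    return prob_best, avg_reward, cum_reward

# Example: simulate(lambda: EpsilonGreedy(0.1, 2), [0.10, 0.12], horizon=250)
```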
5. The Softmax Algorithm
- Problem of a fixed ϵ, revisited
- If the difference in rewards between two arms is small,
more exploration is needed, and vice versa.
- ϵ-Greedy never gets past the intrinsic errors caused by its purely random exploration strategy.
- Set the probability of choosing arm A with estimated reward rA as
- P(A) = exp(rA / τ) / Σi exp(ri / τ)
- Temperature parameter τ shifts the behavior along a continuum
between pure exploration (τ = ∞) and pure exploitation (τ = 0).
- Negative rewards are okay thanks to exponential rescaling.
- Annealing: encourage the algorithm to explore less over time by slowly decreasing τ (a sketch follows).
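A small sketch of softmax selection with annealing; subtracting the max value is a standard numerical-stability trick, and the annealing schedule shown is one illustrative choice, not the only one:

```python
import math
import random

def softmax_select(values, temperature):
    """Choose an arm with probability proportional to exp(value / temperature)."""
    m = max(values)  # subtract the max so exp() cannot overflow
    weights = [math.exp((v - m) / temperature) for v in values]
    total = sum(weights)
    r = random.random() * total
    for arm, w in enumerate(weights):
        r -= w
        if r <= 0:
            return arm
    return len(weights) - 1

def annealed_temperature(t, tau0=1.0):
    """Annealing: tau shrinks as the number of plays t grows,
    shifting behavior from exploration toward exploitation."""
    return tau0 / math.log(t + 1.0001)
```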
6. UCB: The Upper Confidence Bound Algorithm
- Problems with the softmax algorithm
- It pays attention only to how much reward it has gotten from each arm.
- Gullible: easily misled by a few negative experiences, because the algorithm does not keep
track of how much it knows about each arm (how confident its estimates are).
- UCB avoids being gullible by keeping track of its confidence in the
estimated values of all the arms.
- UCB doesn't use randomness and has no free parameters to configure.
- UCB1 (one of the UCB variants) chooses the arm i that maximizes the sum of its
average reward ri and a confidence bonus bi:
- choose the arm i maximizing ri + bi, where bi = √(2 · ln N / ni),
ni is the number of times arm i has been played, and N is the total number of plays.
- Cold start is prevented: an unplayed arm (ni = 0) gets bi = ∞, so every arm is tried at least once.
- UCB is an explicitly curious algorithm.
- Curiosity is implemented through the bonus bi, which gets bigger when ni is small
relative to the total number of plays.
- As a result, even the worst arms are occasionally revisited (see the sketch below).
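A minimal UCB1 sketch following the formula above; the interface matches the earlier sketches and is illustrative:

```python
import math

class UCB1:
    """Deterministically choose the arm maximizing
    average reward plus a confidence bonus."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def select_arm(self):
        # Cold start: an unplayed arm has an infinite bonus, so try each arm once.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        n_total = sum(self.counts)
        # The bonus grows when an arm has been played rarely relative to the total.
        ucb = [v + math.sqrt(2 * math.log(n_total) / n)
               for v, n in zip(self.values, self.counts)]
        return ucb.index(max(ucb))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```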
- Comparing bandit algorithms side-by-side
- UCB1 is much noisier than ϵ-Greedy or Softmax.
- ϵ-Greedy doesn't converge as quickly as Softmax.
- UCB1 takes a while to catch up with Softmax.
- UCB1 finds the best arm quickly,
but the backpedaling it does causes it to underperform Softmax.
7. Bandits in the Real World: Complexity and Complications
- A/A Testing
- Testing the bandit system itself by running it on two identical arms
- Gives an estimate of the actual variability in your real-time data.
- Running concurrent experiments
- There may be strange interactions between experiments (e.g., different logos and fonts).
- Continuous experimentation vs. Periodic testing
- Bandit algorithms look much better than A/B testing when you are willing to let them run for a
very long time.
- Metrics of Success
- Optimizing short-term CTR may destroy long-term retention.
- Rescaling metrics into the 0-1 range helps the algorithms work well.
- Moving worlds
- Arms with changing rewards raise serious problems.
- Plain average (no parameter to tune) vs. weighted average (flexible in a moving world); both update rules are sketched below.
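The two update rules compared above, as a short sketch; alpha is an assumed smoothing parameter you would have to tune, not a value from the source:

```python
def average_update(value, reward, n):
    """Plain running average: every observation counts equally (no parameter to tune)."""
    return value + (reward - value) / n

def weighted_update(value, reward, alpha=0.1):
    """Exponentially weighted average: recent rewards count more, so the
    estimate can track arms whose rewards drift in a moving world.
    alpha is a hypothetical tuning parameter, not from the source."""
    return value + alpha * (reward - value)
```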
8. Conclusion
- There is no universal bandit algorithm that will always do the best job.
- Domain expertise and good judgement will always be necessary.
- There is always a trade-off between exploration & exploitation.
- Initialization of an algorithm matters a lot; biases may either help or hurt.
- Make sure you explore less over time.