4. Let’s say you are running an online bookstore and want to
build a recommendation engine. You might have collected the
following interaction information from the website:
{user_a, item_a} = purchase
{user_a, item_b} = no purchase
{user_b, item_d} = purchase
{user_c, item_e} = purchase
{user_a, item_a} = purchase
{user_c, item_f} = no purchase
{user_f, item_h} = purchase
...
Before going ahead and making a matrix factorization
recommender, let’s think a little bit about the process that
generated this data.
MOTIVATION
5. What a user is likely to purchase obviously depends on the
user in question:
f(user context)
It may also depend on time of day & past purchases:
f(user context, time context, past purchase context)
And a bunch of other stuff besides:
f(user context, time context, past purchase context, ...)
This is pretty standard fare; in supervised learning, we usually
build recommenders that predict purchases or clicks based on
such contextual information.
MOTIVATION
6. What we usually don’t take into account is the process.
- What if a particular book was promoted heavily for
half a year?
- What if some books aren’t displayed on the front page,
but under some sub-menu?
- What if the entire UI was subject to a redesign six
months ago?
- What if some popular book was out of print for two
months?
The list of possibilities is endless. It's clear that, in practice,
business logic plays a very large role in shaping the data that
gets generated.
MOTIVATION
7. We could theoretically try to add features that capture
information about the business process:
f(user context, time context, past purchase context, business process context, ...)
However, this is implausible, if not impossible, to do in
practice. And it's a maintenance nightmare.
MOTIVATION
8. Implication: supervised learning won't work “optimally” (for
lack of a better word). Some user & item pairs may simply
never manifest themselves in our dataset in a way that allows
us to learn the true best recommendation for each user.
Business logic contaminating generated data is a big issue. For
some problems, it means you aren’t optimising to the best of
your ability. For others, it may even make it impossible to
optimise anything at all*.
* e.g. optimising price when the same item has only ever had a
single fixed price.
MOTIVATION
9. Key takeaway: business logic influences and introduces bias
into future data (note: deployed ML models become part of this
process).
What can we do about it?
MOTIVATION
11. Better-than-theoretical solution: use a learning paradigm that
randomly tries different things in a controlled fashion, a.k.a.
Reinforcement Learning
MOTIVATION
12. Practical solution: use a variant of reinforcement learning that
works for a large portion of business problems, a.k.a.
Contextual Bandits.
MOTIVATION
15. RL
In the beginning, a reinforcement learning policy knows
nothing about the world. It must explore different options to
learn what works and what doesn’t.
In addition, a policy must also exploit its knowledge in order
to actually maximise rewards over time.
In RL, you only ever get to see data (rewards) from
actions you took. The rest is hidden from you.
16. RL
RL agents are typically trained and evaluated against a
simulator. Learn by interacting with the simulator, deploy to
production, repeat.
[Diagram: a Policy learns from and acts on a Simulator (e.g. OpenAI Gym for video games)]
17. RL
In most real world situations, you don’t have a simulator.
How can we evaluate the goodness of a new policy, based on
the data collected from some past policy?
18. RL
Spoiler: offline policy evaluation is largely an unsolved
problem in RL.
19. RL
If we relax the requirements of RL to contextual bandits, we
can evaluate offline pretty easily, before production
deployment.
21. CBs
Assuming your policy is a black box that always has some
non-zero probability of exploring, bandit data is a quadruple:
(chosen_action, action_probability, context, reward). A minimal
code representation follows the table below.
Action | Prob. | Context | Reward
Item A | 0.95 | Max, 24, Monday | 1
Item A | 0.9 | Anna, 34, Wednesday | 1
Item C | 0.1 | John, 28, Saturday | -1
Item B | 0.3 | Mike, 56, Saturday | -1
Item D | 0.22 | Mary, 34, Tuesday | 1
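For concreteness, here is one (illustrative, non-standard) way such a log could be represented in Python; the field names simply mirror the quadruple above.

    # Hedged sketch: logged bandit data as typed records.
    from typing import NamedTuple

    class BanditRecord(NamedTuple):
        chosen_action: str
        action_probability: float  # probability assigned by the logging policy
        context: tuple             # e.g. (user, age, weekday)
        reward: float

    log = [
        BanditRecord("Item A", 0.95, ("Max", 24, "Monday"), 1),
        BanditRecord("Item C", 0.10, ("John", 28, "Saturday"), -1),
    ]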
22. CBs
Given that this data was generated by some past policy or
policies that you deployed, how can we use it to evaluate some
new policy we’re working on? We only see rewards for actions
previously taken, by some possibly bad policy.
Action | Prob. | Context | Reward
Item A | 0.95 | Max, 24, Monday | 1
Item A | 0.9 | Anna, 34, Wednesday | 1
Item C | 0.1 | John, 28, Saturday | -1
Item B | 0.3 | Mike, 56, Saturday | -1
Item D | 0.22 | Mary, 34, Tuesday | 1
23. CBs
It turns out we can use the probabilities to correct for the
imbalance in the data and solve the problem! If our new policy
agrees with the logged action, set Rhat = observed reward
/ prob; otherwise, set Rhat = 0.
Action | Prob. | Context | Reward
Item A | 0.95 | Max, 24, Monday | 1
Item A | 0.9 | Anna, 34, Wednesday | 1
Item C | 0.1 | John, 28, Saturday | -1
Item B | 0.3 | Mike, 56, Saturday | -1
Item D | 0.22 | Mary, 34, Tuesday | 1
24. CBs
Applying that correction to each logged row yields the
importance-weighted rewards (Rhat):
Action | Prob. | Context | Agrees | Rhat
Item A | 0.95 | Max, 24, Monday | Yes | 1.0526...
Item A | 0.9 | Anna, 34, Wednesday | No | 0
Item C | 0.1 | John, 28, Saturday | Yes | -10
Item B | 0.3 | Mike, 56, Saturday | No | 0
Item D | 0.22 | Mary, 34, Tuesday | Yes | 4.5455...
25. CBs
Expected value of the new policy in this example:
-4.4019138756 / 5 = -0.8803827751
This method is called inverse propensity scoring (IPS, also
known as inverse probability weighting), and it gives an
unbiased estimator of the true policy value:
v_IPS = (1/n) * sum_i 1[new_policy(x_i) = a_i] * r_i / p_i
A minimal code sketch follows the table below.
Action | Prob. | Context | Agrees | Rhat
Item A | 0.95 | Max, 24, Monday | Yes | 1.0526...
Item A | 0.9 | Anna, 34, Wednesday | No | 0
Item C | 0.1 | John, 28, Saturday | Yes | -10
Item B | 0.3 | Mike, 56, Saturday | No | 0
Item D | 0.22 | Mary, 34, Tuesday | Yes | 4.5455...
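Below is a minimal, self-contained sketch of this computation in Python. The logged data matches the table above; new_policy is a stand-in (here a dictionary lookup) for whatever candidate policy you want to evaluate.

    # Minimal sketch of offline policy evaluation with the IPS estimator.
    # Logged bandit data: (action, probability, context, reward) tuples.
    logged_data = [
        ("Item A", 0.95, ("Max", 24, "Monday"), 1),
        ("Item A", 0.90, ("Anna", 34, "Wednesday"), 1),
        ("Item C", 0.10, ("John", 28, "Saturday"), -1),
        ("Item B", 0.30, ("Mike", 56, "Saturday"), -1),
        ("Item D", 0.22, ("Mary", 34, "Tuesday"), 1),
    ]

    def ips_estimate(data, new_policy):
        """Average importance-weighted reward over the log."""
        total = 0.0
        for action, prob, context, reward in data:
            if new_policy(context) == action:   # policies agree
                total += reward / prob          # reweight by inverse propensity
            # on disagreement, the row contributes 0
        return total / len(data)

    # Illustrative candidate policy matching the table's "Agrees" column.
    choices = {
        ("Max", 24, "Monday"): "Item A",
        ("Anna", 34, "Wednesday"): "Item B",
        ("John", 28, "Saturday"): "Item C",
        ("Mike", 56, "Saturday"): "Item A",
        ("Mary", 34, "Tuesday"): "Item D",
    }
    print(ips_estimate(logged_data, choices.get))  # ≈ -0.8804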
26. CBs
Takeaway: with CBs, we can easily evaluate a new policy in an
unbiased way before it goes into production.
27. CBs
So how do we actually learn a policy?
Train a regression model to predict Rhat directly: (x, a) ->
Rhat. Then play argmax(), or explore from time to time based
on some strategy (a sketch follows the table below).
Action (a) | Prob. | Context (x) | Agrees | Rhat
Item A | 0.95 | Max, 24, Monday | Yes | 1.0526...
Item A | 0.9 | Anna, 34, Wednesday | No | 0
Item C | 0.1 | John, 28, Saturday | Yes | -10
Item B | 0.3 | Mike, 56, Saturday | No | 0
Item D | 0.22 | Mary, 34, Tuesday | Yes | 4.5455...
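As an illustration of the "argmax or explore" step, here is a hedged epsilon-greedy sketch; model is a hypothetical regressor with a predict(context, action) -> Rhat method, and ACTIONS/EPSILON are assumptions, not values from the slides.

    # Epsilon-greedy action selection on top of a reward-regression model.
    import random

    ACTIONS = ["Item A", "Item B", "Item C", "Item D"]
    EPSILON = 0.1  # fraction of traffic used for exploration (assumption)

    def choose_action(model, context):
        """Return (action, probability) so the probability can be logged
        for later off-policy evaluation with IPS."""
        greedy = max(ACTIONS, key=lambda a: model.predict(context, a))
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)  # explore uniformly at random
        else:
            action = greedy                  # exploit the model's best guess
        # Overall probability the strategy assigns to the chosen action.
        prob = EPSILON / len(ACTIONS) + (1 - EPSILON) * (action == greedy)
        return action, prob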
28. CBs
Offline estimators, learning algorithms, and exploration
strategies need not be hand-made. They can be found in Vowpal
Wabbit, a library with first-class bandit support, and in the
Open Bandit Pipeline (a short usage sketch follows below):
http://vowpalwabbit.org
https://zr-obp.readthedocs.io/en/latest/
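As a taste of Vowpal Wabbit, here is a hedged sketch using its Python bindings (pip install vowpalwabbit; the Workspace API is from the 9.x package, older versions use pyvw.vw). Note that VW works with costs, i.e. negative rewards, and its --cb label format is action:cost:probability.

    import vowpalwabbit

    # Contextual bandit over 4 actions with epsilon-greedy exploration.
    vw = vowpalwabbit.Workspace("--cb_explore 4 --epsilon 0.1 --quiet")

    # One logged interaction: action 1 chosen with probability 0.95,
    # observed reward 1 (cost -1), context "Max, 24, Monday".
    vw.learn("1:-1:0.95 | user=Max age=24 day=Monday")

    # predict() returns a probability distribution over the 4 actions;
    # sample from it so the logged probabilities stay exploration-aware.
    probs = vw.predict("| user=Anna age=34 day=Wednesday")
    print(probs)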
29. CBs
If you want a ready-made system, Azure Personalizer provides
contextual bandits-as-a-service:
https://azure.microsoft.com/en-us/services/cognitive-services/personalizer/
31. DYNAMIC PRICING
● Case study: fourkind.com/work/forenom-pricing
● Context: pricing of aparthotel rooms to maximise RevPAR (revenue per available room)
● Results: 13% increase in RevPAR in the group of locations (23% of total capacity) included in A/B testing
● Context: subscription pricing
● Results: (as per A/B test) 12% increase in total revenue for products offered as part of the CB system
● Context: parking pricing
● Results: (as per A/B test) 3% increase in total revenue
37. REFERENCES
1. IPS and other estimators (Dudík, Langford et al.): https://arxiv.org/abs/1103.4601
2. Real-world reinforcement learning (SlideShare): https://www.slideshare.net/MaxPagels/realworld-reinforcement-learning-234276181
3. Bandit Algorithms (SlideShare): https://www.slideshare.net/SC5/practical-ai-for-business-bandit-algorithms
4. Real world interactive learning (Vimeo): https://vimeo.com/240429210
5. A Survey on Practical Applications of Multi-Armed and Contextual Bandits (Bouneffouf, Rish): https://arxiv.org/abs/1904.10040
A special thanks to John Langford for inspiration and patience
in answering my questions, and to the entire VW team for
answering my questions on implementations.