4. Let’s say you are running an online bookstore and want to
build a recommendation engine. You might have collected the
following interaction information from the website:
{user_a, item_a} = purchase
{user_a, item_b} = no purchase
{user_b, item_d} = purchase
{user_c, item_e} = purchase
{user_a, item_a} = purchase
{user_c, item_f} = no purchase
{user_f, item_h} = purchase
...
Before going ahead and making a matrix factorization
recommender, let’s think a little bit about the process that
generated this data.
MOTIVATION
5. What a user is likely to purchase obviously depends on the
user in question:
f(user context)
It may also depend on time of day & past purchases:
f(user context, time context, past purchase context)
And a bunch of other stuff besides:
f(user context, time context, past purchase context, ...)
This is pretty standard fare; in supervised learning, we usually
build recommenders that predict purchases or clicks based on
such contextual information.
MOTIVATION
6. What we usually don’t take into account is the process.
- What if a particular book was promoted heavily for
half a year?
- What if some books aren’t displayed on the front page,
but under some sub-menu?
- What if the entire UI was subject to a redesign six
months ago?
- What if some popular book was out of print for two
months?
The list of possibilities is endless. It's clear that, in practice,
business logic plays a very large role in shaping the data that
gets generated.
MOTIVATION
7. We could theoretically try to add features that capture
information about the business process:
f(user context, time context, past purchase context, business process context, ...)
However, this is implausible, if not impossible, to do in
practice. And it's a maintenance nightmare.
MOTIVATION
8. Implication: supervised learning won't work “optimally” (for
lack of a better word). Some user & item pairs may simply
never manifest themselves in our dataset in a way that allows
us to learn the true best recommendation for each user.
Business logic contaminating generated data is a big issue. For
some problems, it means you aren’t optimising to the best of
your ability. For others, it may even make it impossible to
optimise anything at all*.
* e.g. optimising price when the same item has only ever had a
single fixed price.
MOTIVATION
9. Key takeaway: business logic influences and introduces bias
into future data (note: deployed ML models become part of this
process).
What can we do about it?
MOTIVATION
11. Better-than-theoretical solution: use a learning paradigm that
randomly tries different things in a controlled fashion, a.k.a.
Reinforcement Learning
MOTIVATION
12. Practical solution: use a variant of reinforcement learning that
works for a large portion of business problems, a.k.a.
Contextual Bandits.
MOTIVATION
15. RL
In the beginning, a reinforcement learning policy knows
nothing about the world. It must explore different options to
learn what works and what doesn’t.
In addition, a policy must also exploit its knowledge in order
to actually maximise rewards over time.
In RL, you only ever get to see data (rewards) from
actions you took. The rest is hidden from you.
16. RL
RL agents are typically trained and evaluated against a
simulator. Learn by interacting with the simulator, deploy to
production, repeat.
[Diagram: a Policy learns from and acts on a Simulator (e.g. OpenAI Gym for video games)]
17. RL
In most real world situations, you don’t have a simulator.
How can we evaluate the goodness of a new policy, based on
the data collected from some past policy?
18. RL
Spoiler: offline policy evaluation is largely an unsolved
problem in RL.
19. RL
If we relax the requirements of RL to contextual bandits, we
can evaluate offline pretty easily, before production
deployment.
21. CBs
Assuming your policy is a black box that always has some
non-zero probability of exploring, bandit data is a quadruple:
(chosen_action, action_probability, context, reward). A minimal
code representation follows the table below.
Action | Prob. | Context | Reward
Item A | 0.95 | Max, 24, Monday | 1
Item A | 0.9 | Anna, 34, Wednesday | 1
Item C | 0.1 | John, 28, Saturday | -1
Item B | 0.3 | Mike, 56, Saturday | -1
Item D | 0.22 | Mary, 34, Tuesday | 1
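For concreteness, here is one (illustrative, non-standard) way such a log could be represented in Python; the field names simply mirror the quadruple above.

    # Hedged sketch: logged bandit data as typed records.
    from typing import NamedTuple

    class BanditRecord(NamedTuple):
        chosen_action: str
        action_probability: float  # probability assigned by the logging policy
        context: tuple             # e.g. (user, age, weekday)
        reward: float

    log = [
        BanditRecord("Item A", 0.95, ("Max", 24, "Monday"), 1),
        BanditRecord("Item C", 0.10, ("John", 28, "Saturday"), -1),
    ]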
22. CBs
Given that this data was generated by some past policy or
policies that you deployed, how can we use it to evaluate some
new policy we’re working on? We only see rewards for actions
previously taken, by some possibly bad policy.
Action | Prob. | Context | Reward
Item A | 0.95 | Max, 24, Monday | 1
Item A | 0.9 | Anna, 34, Wednesday | 1
Item C | 0.1 | John, 28, Saturday | -1
Item B | 0.3 | Mike, 56, Saturday | -1
Item D | 0.22 | Mary, 34, Tuesday | 1
23. CBs
It turns out we can use the probabilities to correct for the
imbalance in the data and solve the problem! If our new policy
agrees with the logged action, set Rhat = observed reward
/ prob; otherwise, set Rhat = 0.
Action | Prob. | Context | Reward
Item A | 0.95 | Max, 24, Monday | 1
Item A | 0.9 | Anna, 34, Wednesday | 1
Item C | 0.1 | John, 28, Saturday | -1
Item B | 0.3 | Mike, 56, Saturday | -1
Item D | 0.22 | Mary, 34, Tuesday | 1
24. CBs
Applying that correction to each logged row yields the
importance-weighted rewards (Rhat):
Action | Prob. | Context | Agrees | Rhat
Item A | 0.95 | Max, 24, Monday | Yes | 1.0526...
Item A | 0.9 | Anna, 34, Wednesday | No | 0
Item C | 0.1 | John, 28, Saturday | Yes | -10
Item B | 0.3 | Mike, 56, Saturday | No | 0
Item D | 0.22 | Mary, 34, Tuesday | Yes | 4.5455...
25. CBs
Expected value of the new policy in this example:
-4.4019138756 / 5 = -0.8803827751
This method is called inverse propensity scoring (IPS, also
known as inverse probability weighting), and it gives an
unbiased estimator of the true policy value:
v_IPS = (1/n) * sum_i 1[new_policy(x_i) = a_i] * r_i / p_i
A minimal code sketch follows the table below.
Action | Prob. | Context | Agrees | Rhat
Item A | 0.95 | Max, 24, Monday | Yes | 1.0526...
Item A | 0.9 | Anna, 34, Wednesday | No | 0
Item C | 0.1 | John, 28, Saturday | Yes | -10
Item B | 0.3 | Mike, 56, Saturday | No | 0
Item D | 0.22 | Mary, 34, Tuesday | Yes | 4.5455...
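Below is a minimal, self-contained sketch of this computation in Python. The logged data matches the table above; new_policy is a stand-in (here a dictionary lookup) for whatever candidate policy you want to evaluate.

    # Minimal sketch of offline policy evaluation with the IPS estimator.
    # Logged bandit data: (action, probability, context, reward) tuples.
    logged_data = [
        ("Item A", 0.95, ("Max", 24, "Monday"), 1),
        ("Item A", 0.90, ("Anna", 34, "Wednesday"), 1),
        ("Item C", 0.10, ("John", 28, "Saturday"), -1),
        ("Item B", 0.30, ("Mike", 56, "Saturday"), -1),
        ("Item D", 0.22, ("Mary", 34, "Tuesday"), 1),
    ]

    def ips_estimate(data, new_policy):
        """Average importance-weighted reward over the log."""
        total = 0.0
        for action, prob, context, reward in data:
            if new_policy(context) == action:   # policies agree
                total += reward / prob          # reweight by inverse propensity
            # on disagreement, the row contributes 0
        return total / len(data)

    # Illustrative candidate policy matching the table's "Agrees" column.
    choices = {
        ("Max", 24, "Monday"): "Item A",
        ("Anna", 34, "Wednesday"): "Item B",
        ("John", 28, "Saturday"): "Item C",
        ("Mike", 56, "Saturday"): "Item A",
        ("Mary", 34, "Tuesday"): "Item D",
    }
    print(ips_estimate(logged_data, choices.get))  # ≈ -0.8804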
26. CBs
Takeaway: with CBs, we can easily evaluate a new policy in an
unbiased way before it goes into production.
27. CBs
So how do we actually learn a policy?
Train a regression model to predict Rhat directly: (x, a) ->
Rhat. Then play argmax(), or explore from time to time based
on some strategy (a sketch follows the table below).
Action (a) | Prob. | Context (x) | Agrees | Rhat
Item A | 0.95 | Max, 24, Monday | Yes | 1.0526...
Item A | 0.9 | Anna, 34, Wednesday | No | 0
Item C | 0.1 | John, 28, Saturday | Yes | -10
Item B | 0.3 | Mike, 56, Saturday | No | 0
Item D | 0.22 | Mary, 34, Tuesday | Yes | 4.5455...
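As an illustration of the "argmax or explore" step, here is a hedged epsilon-greedy sketch; model is a hypothetical regressor with a predict(context, action) -> Rhat method, and ACTIONS/EPSILON are assumptions, not values from the slides.

    # Epsilon-greedy action selection on top of a reward-regression model.
    import random

    ACTIONS = ["Item A", "Item B", "Item C", "Item D"]
    EPSILON = 0.1  # fraction of traffic used for exploration (assumption)

    def choose_action(model, context):
        """Return (action, probability) so the probability can be logged
        for later off-policy evaluation with IPS."""
        greedy = max(ACTIONS, key=lambda a: model.predict(context, a))
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)  # explore uniformly at random
        else:
            action = greedy                  # exploit the model's best guess
        # Overall probability the strategy assigns to the chosen action.
        prob = EPSILON / len(ACTIONS) + (1 - EPSILON) * (action == greedy)
        return action, prob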
28. CBs
Offline estimators, learning algorithms, and exploration
strategies need not be hand-made. They can be found in Vowpal
Wabbit, a library with first-class bandit support, and in the
Open Bandit Pipeline (a short usage sketch follows below):
http://vowpalwabbit.org
https://zr-obp.readthedocs.io/en/latest/
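As a taste of Vowpal Wabbit, here is a hedged sketch using its Python bindings (pip install vowpalwabbit; the Workspace API is from the 9.x package, older versions use pyvw.vw). Note that VW works with costs, i.e. negative rewards, and its --cb label format is action:cost:probability.

    import vowpalwabbit

    # Contextual bandit over 4 actions with epsilon-greedy exploration.
    vw = vowpalwabbit.Workspace("--cb_explore 4 --epsilon 0.1 --quiet")

    # One logged interaction: action 1 chosen with probability 0.95,
    # observed reward 1 (cost -1), context "Max, 24, Monday".
    vw.learn("1:-1:0.95 | user=Max age=24 day=Monday")

    # predict() returns a probability distribution over the 4 actions;
    # sample from it so the logged probabilities stay exploration-aware.
    probs = vw.predict("| user=Anna age=34 day=Wednesday")
    print(probs)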
29. CBs
If you want a ready-made system, Azure Personalizer provides
contextual bandits-as-a-service:
https://azure.microsoft.com/en-us/services/cognitive-services/personalizer/
31. DYNAMIC PRICING
● Case study: fourkind.com/work/forenom-pricing
● Context: pricing of aparthotel rooms to maximise RevPAR (revenue per available room)
● Results: 13% increase in RevPAR in the group of locations (23% of total capacity) included in A/B testing
● Context: subscription pricing
● Results: (as per A/B test) 12% increase in total revenue for products offered as part of the CB system
● Context: parking pricing
● Results: (as per A/B test) 3% increase in total revenue
37. REFERENCES
1. IPS and other estimators (Dudík, Langford et al.): https://arxiv.org/abs/1103.4601
2. Real-world reinforcement learning (SlideShare): https://www.slideshare.net/MaxPagels/realworld-reinforcement-learning-234276181
3. Bandit Algorithms (SlideShare): https://www.slideshare.net/SC5/practical-ai-for-business-bandit-algorithms
4. Real world interactive learning (Vimeo): https://vimeo.com/240429210
5. A Survey on Practical Applications of Multi-Armed and Contextual Bandits (Bouneffouf, Rish): https://arxiv.org/abs/1904.10040
A special thanks to John Langford for inspiration and patience
in answering my questions, and to the entire VW team for
answering my questions on implementations.