REINFORCEMENT LEARNING IN PRACTICE
HELSINKI RL MEETUP
Max Pagels, Machine Learning Partner
@maxpagels
www.linkedin.com/in/maxpagels
Job: Fourkind
Education: BSc & MSc comp. sci, University of Helsinki
Background: CS researcher, full-stack dev, front-end dev,
data scientist
Interests: Immediate-reward RL, ML reductions,
incremental/online learning, representation learning,
soft-constraint learning
Some sectors: maritime, healthcare, insurance, ecommerce,
gaming, telecommunications, transportation, media,
education, logistics, consumer goods
MOTIVATION
Let’s say you are running an online bookstore and want to
build a recommendation engine. You might have collected the
following interaction information from the website:
{user_a, item_a} = purchase
{user_a, item_b} = no purchase
{user_b, item_d} = purchase
{user_c, item_e} = purchase
{user_a, item_a} = purchase
{user_c, item_f} = no purchase
{user_f, item_h} = purchase
..
Before going ahead and making a matrix factorization
recommender, let’s think a little bit about the process that
generated this data.
MOTIVATION
What a user is likely to purchase obviously depends on the
user in question:
f(user context)
It may also depend on time of day & past purchases:
f(user context, time context, past purchase context)
And a bunch of other stuff besides:
f(user context, time context, past purchase context, ...)
This is pretty standard fare; in supervised learning, we usually
make recommenders that predict purchases or clicks based on
such contextual information.
MOTIVATION
What we usually don’t take into account is the process.
- What if a particular book was promoted heavily for
half a year?
- What if some books aren’t displayed on the front page,
but under some sub-menu?
- What if the entire UI was subject to a redesign six
months ago?
- What if some popular book was out of print for two
months?
The list of possibilities is endless. It’s clear that, in practice,
business logic plays a very large role in shaping the data that
gets generated.
MOTIVATION
We could theoretically try to add features that capture
information about the business process:
f(user context, time context, past purchase context, business
process context...)
However, this is implausible, if not impossible, to do in
practice. And it’s a maintenance nightmare.
MOTIVATION
Implication: supervised learning won’t work “optimally” (for
lack of a better word). Some combinations of user & item
pairs may simply never manifest themselves in our dataset in a
way that allows us to learn the true best recommendation for
each user.
Business logic contaminating generated data is a big issue. For
some problems, it means you aren’t optimising to the best of
your ability. For others, it may even make it impossible to
optimise anything at all*.
* e.g. optimising price when the same item has only ever had a
single fixed price.
MOTIVATION
Key takeaway: business logic influences and introduces bias into
future data (note: deployed ML models become part of this
process).
What can we do about it?
MOTIVATION
Theoretical solution: randomise everything.
MOTIVATION
Better-than-theoretical solution: use a learning paradigm that
randomly tries different things in a controlled fashion, a.k.a.
Reinforcement Learning
MOTIVATION
Practical solution: use a variant of reinforcement learning that
works for a large portion of business problems, a.k.a.
Contextual Bandits.
MOTIVATION
RL vs. CONTEXTUAL BANDITS
Environment
Policy
Goal: learn to act so as to maximise reward over time.
Action / Reward / State
RL
RL
In the beginning, a reinforcement learning policy knows
nothing about the world. It must explore different options to
learn what works and what doesn’t.
In addition, a policy must also exploit its knowledge in order
to actually maximise rewards over time.
In RL, you only ever get to see data (rewards) from
actions you took. The rest is hidden from you.
RL
RL agents are typically trained and evaluated against a
simulator. Learn by interacting with the simulator, deploy to
production, repeat.
Policy
Simulator (e.g. OpenAI Gym for video games)
Learn / Act
RL
In most real world situations, you don’t have a simulator.
How can we evaluate the goodness of a new policy, based on
the data collected from some past policy?
RL
Spoiler: offline policy evaluation is largely an unsolved
problem in RL.
RL
If we relax the requirements of RL to contextual bandits, we
can evaluate offline pretty easily, before production
deployment.
Environment
Policy
Goal: learn to act so as to maximise reward over time.
Action / Immediate reward / State
CBs
CBs
Assuming your policy is a black box that always has some
non-zero probability of exploring, bandit data is a quad
(chosen_action, action_probability, context, reward), as in the
table and the sketch below.
Action Prob. Context Reward
Item A 0.95 Max, 24, Monday 1
Item A 0.9 Anna, 34, Wednesday 1
Item C 0.1 John, 28, Saturday -1
Item B 0.3 Mike, 56, Saturday -1
Item D 0.22 Mary, 34, Tuesday 1
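To make the logging format concrete, here is a minimal sketch of one way to record the quadruple per interaction. The class and field names are hypothetical, not from the slides; the rows mirror the table above.

```python
from typing import NamedTuple

class LoggedInteraction(NamedTuple):
    """One unit of bandit data: what was shown, with what probability, to whom, and what happened."""
    chosen_action: str         # e.g. "Item A"
    action_probability: float  # probability the logging policy gave to the chosen action
    context: dict              # features available at decision time (user, time, ...)
    reward: float              # observed reward for the chosen action only

log = [
    LoggedInteraction("Item A", 0.95, {"user": "Max",  "age": 24, "day": "Monday"},     1),
    LoggedInteraction("Item A", 0.90, {"user": "Anna", "age": 34, "day": "Wednesday"},  1),
    LoggedInteraction("Item C", 0.10, {"user": "John", "age": 28, "day": "Saturday"},  -1),
    LoggedInteraction("Item B", 0.30, {"user": "Mike", "age": 56, "day": "Saturday"},  -1),
    LoggedInteraction("Item D", 0.22, {"user": "Mary", "age": 34, "day": "Tuesday"},    1),
]
```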
CBs
Given that this data was generated by some past policy or
policies that you deployed, how can we use it to evaluate some
new policy we’re working on? We only see rewards for actions
previously taken, by some possibly bad policy.
Action Prob. Context Reward
Item A 0.95 Max, 24, Monday 1
Item A 0.9 Anna, 34, Wednesday 1
Item C 0.1 John, 28, Saturday -1
Item B 0.3 Mike, 56, Saturday -1
Item D 0.22 Mary, 34, Tuesday 1
CBs
Turns out we can use the probabilities to correct for the
imbalance in data and solve the problem! If our new policy
agrees with the logged action, set Rhat = observed reward
/ prob, else 0.
Action Prob. Context Reward
Item A 0.95 Max, 24, Monday 1
Item A 0.9 Anna, 34, Wednesday 1
Item C 0.1 John, 28, Saturday -1
Item B 0.3 Mike, 56, Saturday -1
Item D 0.22 Mary, 34, Tuesday 1
CBs
Turns out we can use the probabilities to correct for the
imbalance in data and solve the problem! If our new policy
agrees with the logged action, set Rhat = observed reward
/ prob, else 0.
Action Prob. Context Agrees Rhat
Item A 0.95 Max, 24, Monday Yes 1.0526...
Item A 0.9 Anna, 34, Wednesday No 0
Item C 0.1 John, 28, Saturday Yes -10
Item B 0.3 Mike, 56, Saturday No 0
Item D 0.22 Mary, 34, Tuesday Yes 4.5455...
CBs
Expected value of the new policy in this example:
-4.4019138756 / 5 = -0.8803827751
This method is called inverse propensity scoring (IPS, also
known as inverse probability weighting) and it yields an
unbiased estimator vIPS of the true policy value.
Action Prob. Context Agrees Rhat
Item A 0.95 Max, 24, Monday Yes 1.0526...
Item A 0.9 Anna, 34, Wednesday No 0
Item C 0.1 John, 28, Saturday Yes -10
Item B 0.3 Mike, 56, Saturday No 0
Item D 0.22 Mary, 34, Tuesday Yes 4.5455...
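As a sanity check, the IPS arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the logged rows come from the table, and new_policy is a hypothetical stand-in that simply hard-codes the "Agrees" column.

```python
# Logged bandit data from the table: (chosen_action, probability, context, reward).
log = [
    ("Item A", 0.95, {"user": "Max",  "age": 24, "day": "Monday"},     1),
    ("Item A", 0.90, {"user": "Anna", "age": 34, "day": "Wednesday"},  1),
    ("Item C", 0.10, {"user": "John", "age": 28, "day": "Saturday"},  -1),
    ("Item B", 0.30, {"user": "Mike", "age": 56, "day": "Saturday"},  -1),
    ("Item D", 0.22, {"user": "Mary", "age": 34, "day": "Tuesday"},    1),
]

def new_policy(context):
    """Hypothetical new policy: hard-codes the 'Agrees' column for this example."""
    return {"Max": "Item A", "Anna": "Item B", "John": "Item C",
            "Mike": "Item A", "Mary": "Item D"}[context["user"]]

def ips_value(log, policy):
    """IPS estimate of a policy's value: Rhat = reward / probability when the
    policy agrees with the logged action, 0 otherwise; return the mean Rhat."""
    rhats = [reward / prob if policy(ctx) == action else 0.0
             for action, prob, ctx, reward in log]
    return sum(rhats) / len(rhats)

print(ips_value(log, new_policy))  # ≈ -0.8804, matching the slide
```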
CBs
Takeaway: with CBs, we can easily evaluate a new policy in an
unbiased way before it goes into production.
Action Prob. Context Agrees Rhat
Item A 0.95 Max, 24, Monday Yes 1.0526...
Item A 0.9 Anna, 34, Wednesday No 0
Item C 0.1 John, 28, Saturday Yes -10
Item B 0.3 Mike, 56, Saturday No 0
Item D 0.22 Mary, 34, Tuesday Yes 4.5455...
CBs
So how do we actually learn a policy?
Train a regression model on the IPS-weighted rewards, (x, a) ->
Rhat, then play argmax() or explore from time to time based on
some exploration strategy (a sketch follows the table below).
Action(a) Prob. Context (x) Agrees Rhat
Item A 0.95 Max, 24, Monday Yes 1.0526...
Item A 0.9 Anna, 34, Wednesday No 0
Item C 0.1 John, 28, Saturday Yes -10
Item B 0.3 Mike, 56, Saturday No 0
Item D 0.22 Mary, 34, Tuesday Yes 4.5455...
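Below is a minimal sketch of that recipe, assuming scikit-learn is available; the featurize encoding, the action set, and the epsilon value are illustrative assumptions, not part of the original slides. The regressor learns (x, a) -> Rhat online, and the policy plays argmax most of the time while exploring uniformly with probability epsilon, logging the action probability so the next policy can again be evaluated offline.

```python
import random
import numpy as np
from sklearn.linear_model import SGDRegressor

ACTIONS = ["Item A", "Item B", "Item C", "Item D"]  # assumed action set
EPSILON = 0.1                                       # assumed exploration rate

def featurize(context, action):
    """Hypothetical encoding of a (context, action) pair as a numeric feature vector."""
    days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
    x = [context["age"] / 100.0]
    x += [1.0 if context["day"] == d else 0.0 for d in days]
    x += [1.0 if action == a else 0.0 for a in ACTIONS]
    return np.array(x)

model = SGDRegressor()

def update(context, action, rhat):
    """One online learning step: regress (x, a) -> Rhat."""
    model.partial_fit([featurize(context, action)], [rhat])

def choose(context):
    """Epsilon-greedy acting: play argmax of predicted Rhat, explore uniformly otherwise.
    Returns the action and the probability the policy assigned to it (log both)."""
    if hasattr(model, "coef_"):  # the model has seen at least one update
        scores = [model.predict([featurize(context, a)])[0] for a in ACTIONS]
        greedy = ACTIONS[int(np.argmax(scores))]
    else:
        greedy = random.choice(ACTIONS)
    action = random.choice(ACTIONS) if random.random() < EPSILON else greedy
    prob = 1 - EPSILON + EPSILON / len(ACTIONS) if action == greedy else EPSILON / len(ACTIONS)
    return action, prob

# Example round-trip: act, observe reward 1, update with the IPS-weighted reward.
action, prob = choose({"age": 24, "day": "Monday"})
update({"age": 24, "day": "Monday"}, action, rhat=1.0 / prob)
```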
CBs
Offline estimators, learning algorithms, and exploration
strategies need not be hand-made. They are found in Vowpal
Wabbit, a library with first-class bandit support, and in the
Open Bandit Pipeline:
http://vowpalwabbit.org
https://zr-obp.readthedocs.io/en/latest/
CBs
If you want a ready-made system, Azure Personalizer provides
contextual bandits-as-a-service:
https://azure.microsoft.com/en-us/services/cognitive-services/personalizer/
USE CASES
DYNAMIC
PRICING
● Case study: fourkind.com/work/forenom-pricing
● Context: pricing of aparthotel rooms to maximise RevPAR
(revenue per available room)
● Results: 13% increase in RevPAR in the group of locations (23%
of total capacity) included in A/B-testing
● Context: Subscription pricing
● Results: (as per A/B-test) 12% increase in total revenue for
products offered as part of the CB system
● Context: Parking pricing
● Results: (as per A/B-test) 3% increase in total revenue
ARTWORK
PERSONALISATION
Context: Artwork personalisation in Yle Areena,
https://areena.yle.fi/1-50499272
Results: (as per A/B-test) 2.3% increase in average minutes
(viewing time), 4.83% increase in conversion
SELFIE
PERSONALISATION
Context: Tinder Smart Photos,
https://tinderengineering.ghost.io/smart-photos-2/
RECOMMENDATIONS
Context: Xbox (Top of Home),
https://www.microsoft.com/en-us/research/podcast/reinforcement-learning-for-the-real-world-with-dr-john-langford-and-rafah-hosn/
Image credit:
https://www.kotaku.com.au/2020/02/the-new-xbox-one-home-screen-is-a-lot-cleaner/
CODEC SELECTION
LOAD BALANCING
DRUG OPTIMISATION
RECOMMENDATIONS
UI OPTIMISATION
PORTFOLIO ALLOCATION
DYNAMIC PRICING
COLD-START LEARNING
GAME MATCHMAKING
THANK YOU!
QUESTIONS?
REFERENCES
1. IPS and other estimators (Dudik, Langford et al):
https://arxiv.org/abs/1103.4601
2. Real-world reinforcement learning (SlideShare):
https://www.slideshare.net/MaxPagels/realworld-reinforcement-learning-234276181
3. Bandit Algorithms (SlideShare):
https://www.slideshare.net/SC5/practical-ai-for-business-bandit-algorithms
4. Real world interactive learning (Vimeo): https://vimeo.com/240429210
5. A Survey on Practical Applications of Multi-Armed and Contextual
Bandits (Bouneffouf, Rish): https://arxiv.org/abs/1904.10040
A special thanks to John Langford for inspiration and for patiently answering
my questions, and to the entire VW team for answering my questions on
implementations.
The Hands-on Advisory.
max.pagels@fourkind.com
www.fourkind.com
@fourkindnow