Reward Innovation for long-term member satisfaction

Gary Tang, Jiangwei Pan, Henry Wang and Justin Basilico
Goal
Create a personalized homepage that helps members find content to watch and enjoy, and that maximizes long-term member satisfaction.
Long-term satisfaction for Netflix
Member enjoys watching Netflix, so continues the subscription and tells their friends about it
Batch learning from bandit feedback
Production policy
● show recommendations to the member
Member gives feedback on recommendations
● immediate: skip/play a show
● long-term: cancel/renew subscription
Goal
● train a policy to maximize the long-term reward
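As a rough illustration of what one logged record might hold in this setting, here is a minimal sketch. The class name, field names, and the `observed_reward` helper are all hypothetical and not taken from the talk; the point is only that feedback is observed for the item that was shown, and that the long-term signal may still be missing when training starts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoggedImpression:
    """One logged recommendation (hypothetical schema, for illustration only)."""
    context: dict                    # member/session features at recommendation time
    action: str                      # item the production policy showed
    propensity: float                # probability the production policy showed it
    played: bool                     # immediate feedback: play (True) or skip (False)
    renewed: Optional[bool] = None   # long-term feedback; None until it is observed


def observed_reward(rec: LoggedImpression, kind: str) -> Optional[float]:
    """Return the reward of the requested kind, or None if not yet observed."""
    if kind == "immediate":
        return 1.0 if rec.played else 0.0
    if kind == "long_term":
        if rec.renewed is None:
            return None
        return 1.0 if rec.renewed else 0.0
    raise ValueError(f"unknown reward kind: {kind}")


rec = LoggedImpression({"member_id": "m1"}, "show_a", 0.25, played=True)
print(observed_reward(rec, "immediate"))   # immediate reward is available
print(observed_reward(rec, "long_term"))   # long-term reward is still unknown
```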
Challenges with long-term retention
● Noisy signal: influenced by external factors
● Insensitive signal: only sensitive for “borderline” members
● Delayed signal: need to wait a long time, and hard to attribute
Proxy reward
Train policy to optimize proxy reward
● highly correlated with long-term satisfaction
● sensitive to individual recommendations
Immediate feedback as proxy
● e.g. starting to play a show
Delayed feedback could be more aligned
● e.g. completing a show
Delayed proxy reward
Need to wait a long time to observe the delayed reward
● cannot be used in training immediately
● can hurt cold-starting of the model
Don’t want to wait? Predict the delayed reward
● use all user actions up to policy training time
Train policy to maximize predicted reward
Note: the predicted reward cannot be used online because it uses post-recommendation user actions
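One way to picture the predicted delayed reward is the sketch below. It assumes the delayed reward is "completed the show" and that the only post-recommendation signal is the fraction watched so far; both are illustrative assumptions, as are the function names and the bucketed-frequency predictor, which stands in for whatever model is actually used.

```python
from collections import defaultdict


def _bucket(progress: float, n_buckets: int) -> int:
    """Map a watch fraction in [0, 1] to a bucket index."""
    return min(int(progress * n_buckets), n_buckets - 1)


def fit_completion_rates(history, n_buckets=4):
    """Estimate P(complete | progress bucket) from records whose outcome is known.

    history: iterable of (progress_at_training_time, completed) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [completions, total]
    for progress, completed in history:
        b = _bucket(progress, n_buckets)
        counts[b][0] += int(completed)
        counts[b][1] += 1
    return {b: done / total for b, (done, total) in counts.items()}


def proxy_reward(progress, completed, rates, n_buckets=4, prior=0.5):
    """Use the observed outcome if it exists, otherwise the predicted one."""
    if completed is not None:
        return float(completed)
    return rates.get(_bucket(progress, n_buckets), prior)


history = [(0.1, False), (0.1, False), (0.6, True), (0.6, False), (0.9, True)]
rates = fit_completion_rates(history)
print(proxy_reward(0.95, None, rates))  # outcome not yet observed: predicted
print(proxy_reward(0.20, True, rates))  # outcome observed: used directly
```

Because `progress` is only known after the recommendation was shown, a reward built this way can serve as a training label but not as an input at serving time, which matches the note on the slide.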
The ideas are not new
Related work:
● Long-term optimization in recommenders
● Reward shaping in reinforcement learning
● Online reward optimization
We focus on reward innovation as an important product development workstream at Netflix
Integrating reward in bandit
Reward component provides the objective of bandit policy
training data
Reward innovation
● Ideation: what aspect of long-term satisfaction hasn’t been captured as a reward?
○ Requires balancing perspectives of ML, business, and psychology. Not easy!
● Development: how do we compute this new reward for every recommended item?
○ Collecting immediate feedback, predicting delayed feedback
● Evaluation: is this a good reward for the bandit policy to maximize?
Reward infrastructure
Common development patterns
● Computing immediate proxy rewards
● Predicting delayed proxy rewards
● Combining multiple rewards
● Sharing rewards across multiple recommender components
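The "combining multiple rewards" pattern can be sketched as a weighted sum of named reward components. The linear form, the component names, and the weights below are assumptions for illustration; the slides do not say how rewards are actually combined.

```python
def combine_rewards(components: dict, weights: dict) -> float:
    """Weighted sum of named reward components (illustrative linear combination).

    components: reward name -> value for one recommended item
    weights:    reward name -> weight; every weighted name must be present
    """
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing reward components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)


# Hypothetical components: an immediate play signal and a predicted completion.
item_rewards = {"play": 1.0, "predicted_completion": 0.5}
weights = {"play": 0.3, "predicted_completion": 0.7}
print(combine_rewards(item_rewards, weights))
```

Keeping the components named and the weights separate from the data makes it easier to reuse the same components across recommender components with different weightings.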
Reward evaluation
● Online testing is expensive (time, resources)
● Use offline evaluation to determine promising reward candidates for online testing
● Challenge: hard to compare policies trained with different rewards
Offline reward evaluation
● compare policies along multiple reward axes using off-policy evaluation (OPE)
● choose a small number of candidate policies on the Pareto front
● compare candidate policies online using long-term user satisfaction metrics
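The first two steps above can be sketched as follows. The sketch assumes inverse propensity scoring (IPS) as the OPE estimator, which is one standard choice and not necessarily the one used in the talk; the log format, function names, and policy names are hypothetical.

```python
def ips_value(logs, target_prob, reward_key):
    """IPS estimate of a target policy's mean reward on one reward axis.

    logs: list of dicts with keys "context", "action", "logging_prob", "rewards"
    target_prob: function (context, action) -> probability under the target policy
    """
    if not logs:
        raise ValueError("logs must not be empty")
    total = 0.0
    for rec in logs:
        weight = target_prob(rec["context"], rec["action"]) / rec["logging_prob"]
        total += weight * rec["rewards"][reward_key]
    return total / len(logs)


def pareto_front(scores):
    """Return the names of policies not dominated on any reward axis.

    scores: policy name -> tuple of estimated values (higher is better).
    """
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return sorted(
        name for name, s in scores.items()
        if not any(dominates(o, s) for other, o in scores.items() if other != name)
    )


# Hypothetical OPE estimates for three policies on two reward axes.
scores = {"A": (0.5, 0.2), "B": (0.4, 0.4), "C": (0.3, 0.3)}
print(pareto_front(scores))  # "C" is dominated by "B" on both axes
```

Only the policies returned by `pareto_front` would go forward to the online comparison, which keeps the number of expensive online tests small.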
Practical learnings
● Reward normalization: dynamic range of reward can affect SGD training dynamics
● Reward features: pairing a reward with a correlated feature tends to improve the model’s ability to optimize that reward
● Reward alignment: make different parts of the overall recommender system “point in the same direction”
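For the reward normalization point, one common option is z-score standardization, sketched below. This is an assumed choice for illustration; the slides do not say which normalization is used, and the function name is hypothetical.

```python
import statistics


def standardize(rewards):
    """Rescale rewards to zero mean and unit (population) standard deviation.

    Returns all zeros when every reward is identical, to avoid dividing by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


# A reward measured in minutes watched has a much wider range than a 0/1 play
# signal; after standardization both live on a comparable scale.
minutes_watched = [0.0, 10.0, 20.0]
print(standardize(minutes_watched))
```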
Summary
● Proxy rewards: we can train bandit policies using proxy rewards to optimize long-term member satisfaction
● Art and science: coming up with good reward hypotheses takes both
● Supporting infrastructure: we develop infrastructure to help iterate on new hypotheses quickly
Open challenges
● Proxy rewards: how can we identify proxy rewards that are aligned with long-term satisfaction in a more principled way?
● Reinforcement learning: can we use reinforcement learning to optimize long-term reward directly for recommendations?
Thank you!
Questions?
Jiangwei Pan
jpan@netflix.com