8. Everything is a Recommendation!*
* Xavier Amatriain and Justin Basilico - https://medium.com/netflix-techblog/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429
13. Recommendation as a Sequential Decision-Making Problem
[Diagram: the learner interacts with the environment by taking an action; the environment returns a reward and a context.]
14. Why is it challenging?
… Because we don’t know!
● The current environment or state: We may not have full knowledge of the state we are in.
● Reward: Taking an action in some state yields a reward that is not known beforehand.
● The transition dynamics: We do not know how the environment or state changes in response to a particular action.
16. Multi-Armed Bandits
● A gambler playing multiple slot machines with unknown reward distributions
● Which machine to play to maximize reward?
17. Multi-Armed Bandit For Recommendation
Exploration-exploitation tradeoff:
Recommend the optimal title given the evidence, i.e. exploit,
OR
Recommend other titles to gather feedback, i.e. explore.
18. Numerous Variants
● Different Strategies: ε-Greedy, Thompson Sampling (TS), Upper Confidence Bound (UCB), etc. (a Thompson Sampling sketch follows this list)
● Different Environments:
○ Stochastic and stationary: Reward is generated i.i.d. from a distribution
specific to the action. No payoff drift.
○ Adversarial: No assumptions on how rewards are generated.
● Different objectives: Cumulative regret, tracking the best expert
● Continuous or discrete set of actions, finite vs infinite
● Extensions: Varying set of arms, Contextual Bandits, etc.
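A minimal sketch of one of these strategies, Thompson Sampling for a stationary Bernoulli bandit (the arm count and play probabilities below are illustrative assumptions, not numbers from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stationary Bernoulli bandit: each arm pays 1 with a fixed probability.
true_ctr = np.array([0.04, 0.06, 0.10])   # assumed per-title play probabilities
n_arms = len(true_ctr)

# Beta(1, 1) priors on each arm's success probability.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

for t in range(10_000):
    # Sample a plausible success rate per arm from its posterior; play the argmax.
    theta = rng.beta(alpha, beta)
    arm = int(np.argmax(theta))
    reward = rng.random() < true_ctr[arm]
    # Beta is conjugate to the Bernoulli likelihood, so the update is a counter bump.
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))   # should concentrate near true_ctr
```

Exploration falls out of the posterior sampling: uncertain arms occasionally produce high draws and get tried, while clearly bad arms are sampled less and less.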
19. Epsilon-Greedy for MABs
[Diagram: with probability ε, explore (collecting unbiased training data); with probability 1-ε, exploit (greedily select the optimal action).]
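A minimal sketch of the policy in this diagram, assuming Bernoulli rewards and sample-mean value estimates (the arm setup is the same illustrative assumption as above):

```python
import numpy as np

rng = np.random.default_rng(1)
true_ctr = np.array([0.04, 0.06, 0.10])   # assumed per-title play probabilities
n_arms, eps = len(true_ctr), 0.1

counts = np.zeros(n_arms)   # plays per arm
values = np.zeros(n_arms)   # sample-mean reward per arm

for t in range(10_000):
    if rng.random() < eps:
        arm = int(rng.integers(n_arms))   # explore: uniform choice -> unbiased training data
    else:
        arm = int(np.argmax(values))      # exploit: greedy w.r.t. current estimates
    reward = float(rng.random() < true_ctr[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental sample mean

print("estimated values:", values)
```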
21. Considerations for the greedy policy
● Explore
○ Bandwidth allocation and cost of exploration
○ New vs existing titles
● Exploit
○ Title availability
○ Frequency of model update
○ Incremental updates vs batch training (see the sketch below)
■ Non-stationarity of title popularities
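One standard way to handle non-stationary popularities with incremental updates (a textbook technique, not necessarily what Netflix uses) is to replace the 1/n sample mean with a constant step size, which exponentially forgets old rewards:

```python
def incremental_update(value, reward, step_size=0.05):
    """Exponential-recency-weighted average: with a constant step size the
    estimate keeps adapting when title popularity drifts, unlike a 1/n
    sample mean, whose updates shrink toward zero."""
    return value + step_size * (reward - value)
```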
22. Opportunity Cost
The Netflix homepage is expensive real estate:
- so many titles to promote
- so few opportunities to win a “moment of truth”
[Chart: probability of play over days D1-D5, with a “Promote?” decision (▶) at each day.]
25. Some approaches
● Bandit approaches (with caveats)
● Counterfactual Risk Minimization [Swaminathan & Joachims, 2015]
● IPS Estimator for MF [Schnabel et al., 2016]
○ Train a debiasing model and reweight the data (see the sketch after this list)
● Causal Embeddings [Bonner & Vasile, 2018]
○ Jointly learn debiasing model and task model
○ Regularize the two towards each other
● Doubly-Robust MF [Wang et al., 2019]
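A minimal sketch of the IPS idea from Schnabel et al.: weight each observed rating's error by the inverse of its estimated observation propensity, so popular (often-observed) items do not dominate the loss. The data and propensities below are placeholders:

```python
import numpy as np

def ips_mse(preds, ratings, propensities):
    """Inverse-propensity-scored squared error over observed entries;
    unbiased (up to a constant) for the full-matrix risk under the
    propensity model."""
    return np.mean((preds - ratings) ** 2 / propensities)

# Hypothetical observed entries: model predictions, true ratings, and the
# estimated probability that each entry was observed (from a debiasing model).
preds = np.array([3.8, 2.1, 4.5])
ratings = np.array([4.0, 2.0, 5.0])
propensities = np.array([0.9, 0.2, 0.5])   # popular items are observed more often

print(ips_mse(preds, ratings, propensities))
```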
29. Policy Gradients
● Learn a policy that maximizes the cumulative
future reward from time t.
● The maximization is solved by taking the gradient w.r.t. the policy parameters.
● E.g. REINFORCE*
*Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, Ed H. Chi: Top-K Off-Policy
Correction for a REINFORCE Recommender System. WSDM 2019: 456-464
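A minimal single-step (bandit-style) REINFORCE sketch with a softmax policy over items; it is contextless and on-policy, so it omits the top-K and off-policy corrections that are the point of the paper above:

```python
import numpy as np

rng = np.random.default_rng(2)
true_ctr = np.array([0.04, 0.06, 0.10])   # assumed play probabilities, as before
n_items, lr = len(true_ctr), 0.5
theta = np.zeros(n_items)                  # policy logits

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for t in range(20_000):
    pi = softmax(theta)
    a = int(rng.choice(n_items, p=pi))
    reward = float(rng.random() < true_ctr[a])
    # REINFORCE: for a softmax policy, grad log pi(a) = one_hot(a) - pi.
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += lr * reward * grad_log_pi     # ascend the reward-weighted score function

print("learned policy:", softmax(theta))   # probability mass shifts to the best item
```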
30. Deep Q-Learning
● Q-value: Optimal value for a state-action pair.
● Off-policy algorithm
● Directly learn a function to approximate the Q-value.
● Challenges in training and making it work in practice.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin A. Riedmiller: Playing Atari with Deep Reinforcement Learning. CoRR abs/1312.5602 (2013)
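The core off-policy update behind DQN, shown in tabular form for clarity (the deep version replaces the table with a network and adds experience replay and a target network):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy TD update: the target bootstraps from the max over next
    actions, independent of what the behavior policy actually does next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Tiny illustrative table: 2 states x 2 actions.
Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q)
```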
31. Many more challenges...
● High-dimensional action space: Recommending a single item is
O(|C|); typically want to do ranking or page construction, which is
combinatorial, e.g. Marginal Slates [Dimakopoulou et al., 2019] or SlateQ [Ie et al., 2019]
● Off-policy correction: Need to learn & evaluate from existing system
actions, e.g. [Chen et al., 2019] or ReCap [More et al., 2019]
● Good simulators: Require knowing the user's feedback on recommended items, e.g. [Rohde et al., 2018]
● Changing rewards: Every action may change our ‘ground truth’
● Changing action space: New actions (items) become available and
need to be cold-started.