8. Everything is a Recommendation!*
* Xavier Amatriain and Justin Basilico - https://medium.com/netflix-techblog/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429
13. Recommendation as a Sequential Decision-Making Problem
[Diagram: the learner interacts with the environment by taking an action; the environment returns a reward and a context.]
14. Why is it challenging?
… Because we don’t know!
● The current environment or state: We may not have full knowledge of the state we are in.
● Reward: Taking an action in some state yields a reward that is not known beforehand.
● The transition dynamics: We do not know how the environment or state changes in response to a particular action.
16. Multi-Armed Bandits
● A gambler playing multiple slot machines with unknown reward distributions
● Which machine to play to maximize reward?
17. Multi-Armed Bandit For Recommendation
Exploration-exploitation tradeoff:
Recommend the optimal title given the evidence, i.e. exploit,
OR
Recommend other titles to gather feedback, i.e. explore.
18. Numerous Variants
● Different Strategies: ε-Greedy, Thompson Sampling (TS), Upper Confidence Bound (UCB), etc. (a Thompson Sampling sketch follows this list)
● Different Environments:
○ Stochastic and stationary: Reward is generated i.i.d. from a distribution
specific to the action. No payoff drift.
○ Adversarial: No assumptions on how rewards are generated.
● Different objectives: Cumulative regret, tracking the best expert
● Continuous or discrete set of actions, finite vs infinite
● Extensions: Varying set of arms, Contextual Bandits, etc.
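A minimal sketch of one of these strategies, Thompson Sampling for a stationary Bernoulli bandit (the arm count and play probabilities below are illustrative assumptions, not numbers from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stationary Bernoulli bandit: each arm pays 1 with a fixed probability.
true_ctr = np.array([0.04, 0.06, 0.10])   # assumed per-title play probabilities
n_arms = len(true_ctr)

# Beta(1, 1) priors on each arm's success probability.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

for t in range(10_000):
    # Sample a plausible success rate per arm from its posterior; play the argmax.
    theta = rng.beta(alpha, beta)
    arm = int(np.argmax(theta))
    reward = rng.random() < true_ctr[arm]
    # Beta is conjugate to the Bernoulli likelihood, so the update is a counter bump.
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))   # should concentrate near true_ctr
```

Exploration falls out of the posterior sampling: uncertain arms occasionally produce high draws and get tried, while clearly bad arms are sampled less and less.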
19. Epsilon-Greedy for MABs
[Diagram: with probability ε, explore (collecting unbiased training data); with probability 1-ε, exploit (greedily select the optimal action).]
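A minimal sketch of the policy in this diagram, assuming Bernoulli rewards and sample-mean value estimates (the arm setup is the same illustrative assumption as above):

```python
import numpy as np

rng = np.random.default_rng(1)
true_ctr = np.array([0.04, 0.06, 0.10])   # assumed per-title play probabilities
n_arms, eps = len(true_ctr), 0.1

counts = np.zeros(n_arms)   # plays per arm
values = np.zeros(n_arms)   # sample-mean reward per arm

for t in range(10_000):
    if rng.random() < eps:
        arm = int(rng.integers(n_arms))   # explore: uniform choice -> unbiased training data
    else:
        arm = int(np.argmax(values))      # exploit: greedy w.r.t. current estimates
    reward = float(rng.random() < true_ctr[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental sample mean

print("estimated values:", values)
```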
21. Considerations for the greedy policy
● Explore
○ Bandwidth allocation and cost of exploration
○ New vs existing titles
● Exploit
○ Title availability
○ Frequency of model update
○ Incremental updates vs batch training (see the sketch below)
■ Non-stationarity of title popularities
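One standard way to handle non-stationary popularities with incremental updates (a textbook technique, not necessarily what Netflix uses) is to replace the 1/n sample mean with a constant step size, which exponentially forgets old rewards:

```python
def incremental_update(value, reward, step_size=0.05):
    """Exponential-recency-weighted average: with a constant step size the
    estimate keeps adapting when title popularity drifts, unlike a 1/n
    sample mean, whose updates shrink toward zero."""
    return value + step_size * (reward - value)
```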
22. Opportunity Cost
The Netflix homepage is expensive real estate:
- so many titles to promote
- so few opportunities to win a “moment of truth”
[Chart: probability of play over days D1-D5, with a “Promote?” decision (▶) at each day.]
25. Some approaches
● Bandit approaches (with caveats)
● Counterfactual Risk Minimization [Swaminathan & Joachims, 2015]
● IPS Estimator for MF [Schnabel et al., 2016]
○ Train a debiasing model and reweight the data (see the sketch after this list)
● Causal Embeddings [Bonner & Vasile, 2018]
○ Jointly learn debiasing model and task model
○ Regularize the two towards each other
● Doubly-Robust MF [Wang et al., 2019]
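A minimal sketch of the IPS idea from Schnabel et al.: weight each observed rating's error by the inverse of its estimated observation propensity, so popular (often-observed) items do not dominate the loss. The data and propensities below are placeholders:

```python
import numpy as np

def ips_mse(preds, ratings, propensities):
    """Inverse-propensity-scored squared error over observed entries;
    unbiased (up to a constant) for the full-matrix risk under the
    propensity model."""
    return np.mean((preds - ratings) ** 2 / propensities)

# Hypothetical observed entries: model predictions, true ratings, and the
# estimated probability that each entry was observed (from a debiasing model).
preds = np.array([3.8, 2.1, 4.5])
ratings = np.array([4.0, 2.0, 5.0])
propensities = np.array([0.9, 0.2, 0.5])   # popular items are observed more often

print(ips_mse(preds, ratings, propensities))
```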
29. Policy Gradients
● Learn a policy that maximizes the cumulative
future reward from time t.
● The maximization is solved by taking the gradient w.r.t. the policy parameters.
● E.g. REINFORCE*
*Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, Ed H. Chi: Top-K Off-Policy
Correction for a REINFORCE Recommender System. WSDM 2019: 456-464
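A minimal single-step (bandit-style) REINFORCE sketch with a softmax policy over items; it is contextless and on-policy, so it omits the top-K and off-policy corrections that are the point of the paper above:

```python
import numpy as np

rng = np.random.default_rng(2)
true_ctr = np.array([0.04, 0.06, 0.10])   # assumed play probabilities, as before
n_items, lr = len(true_ctr), 0.5
theta = np.zeros(n_items)                  # policy logits

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for t in range(20_000):
    pi = softmax(theta)
    a = int(rng.choice(n_items, p=pi))
    reward = float(rng.random() < true_ctr[a])
    # REINFORCE: for a softmax policy, grad log pi(a) = one_hot(a) - pi.
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += lr * reward * grad_log_pi     # ascend the reward-weighted score function

print("learned policy:", softmax(theta))   # probability mass shifts to the best item
```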
30. Deep Q-Learning
● Q-value: Optimal value for a state-action pair.
● Off-policy algorithm
● Directly learn a function to approximate the Q-value.
● Challenges in training and making it work in practice.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin A. Riedmiller: Playing Atari with Deep Reinforcement Learning. CoRR abs/1312.5602 (2013)
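The core off-policy update behind DQN, shown in tabular form for clarity (the deep version replaces the table with a network and adds experience replay and a target network):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy TD update: the target bootstraps from the max over next
    actions, independent of what the behavior policy actually does next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Tiny illustrative table: 2 states x 2 actions.
Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q)
```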
31. Many more challenges...
● High-dimensional action space: Recommending a single item is
O(|C|); typically want to do ranking or page construction, which is
combinatorial, e.g. Marginal Slates [Dimakopoulou et al., 2019] or SlateQ [Ie et al., 2019]
● Off-policy correction: Need to learn & evaluate from existing system
actions, e.g. [Chen et al., 2019] or ReCap [More et al., 2019]
● Good simulators: Require knowing the user's feedback on recommended items, e.g. [Rohde et al., 2018]
● Changing rewards: Every action may change our ‘ground truth’
● Changing action space: New actions (items) become available and
need to be cold-started.