Recommendation systems are now used across many applications, including multimedia content platforms, social networks, and e-commerce, to provide suggestions that are most likely to fulfill users’ needs, thereby improving the user experience. Academic research to date has largely focused on the performance of recommendation models in terms of ranking quality or accuracy measures, which often don’t translate directly into real-world improvements. In this talk, we present some of the most interesting challenges that we face in the personalization efforts at Netflix. The goal of this talk is to shine a light on challenging research problems in industrial recommendation systems and start a conversation about exciting areas of future research.
13. ○ Every person is unique with a variety of interests
… and sometimes multiple people use the same profile
○ Help people find what they want when they’re not sure what they want
○ Non-stationary, context-dependent, mood-dependent, ...
○ Large datasets but small data per member
… and potentially biased by the output of your system
○ Cold-start problems on all sides
○ More than just accuracy: diversity, novelty, freshness, fairness, ...
○ ...
No, personalization is hard!
17. Timeline
~2012: Deep Learning becomes popular in Machine Learning
~2017: Deep Learning becomes popular in Recommender Systems
What took so long?
~2019: Traditional methods do as well or better than Deep Learning for Recommender Systems
… Wait, what?
22. … isn’t always the best
[Figure: matrix factorization R ≈ U·V trained with a mean squared loss]
Also see [Dacrema et al., 2019], [Rendle et al., 2019], [Rendle et al., 2021]. Make sure you tune your baselines.
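To make the “tune your baselines” point concrete, here is a minimal sketch of the kind of plain matrix-factorization baseline the slide refers to: R ≈ U·V trained by SGD on a squared loss. The hyperparameters (`k`, `lr`, `reg`, `epochs`) are illustrative, not tuned values from the talk.

```python
import random

def factorize(ratings, n_users, n_items, k=8, lr=0.01, reg=0.05, epochs=50, seed=0):
    """SGD matrix factorization: learn U (users x k) and V (items x k)
    so that dot(U[u], V[i]) approximates rating r for each (u, i, r)."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(U[u][f] * V[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                # gradient step on squared error with L2 regularization
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V
```

Well-tuned variants of exactly this kind of model are what [Rendle et al., 2019] found competitive with far more complex approaches.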
24. EASE: Embarrassingly Shallow Auto-Encoders [Steck, 2019]
[Figure: user–item matrix R multiplied by an item-by-item matrix X, with zeros forced on the diagonal of X, approximating R]
● Super-efficient model to train in a collaborative filtering setting, inspired by SLIM
● Learn an item-by-item matrix X such that R·X is close to R and diag(X) = 0
○ Avoids the trivial solution of the identity
● Closed-form solution
● More on that: auto-encoders that don’t overfit towards the identity
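The closed-form solution from [Steck, 2019] fits in a few lines of NumPy: invert the regularized Gram matrix, rescale its columns, and zero the diagonal. A minimal sketch (the value of `lam` is illustrative; in practice it needs tuning):

```python
import numpy as np

def ease(R, lam=100.0):
    """Closed-form EASE [Steck, 2019]: item-by-item weight matrix B
    minimizing ||R - R.B||^2 + lam*||B||^2 subject to diag(B) = 0."""
    G = R.T @ R + lam * np.eye(R.shape[1])  # regularized Gram matrix
    P = np.linalg.inv(G)
    B = -P / np.diag(P)       # B_ij = -P_ij / P_jj for the off-diagonal
    np.fill_diagonal(B, 0.0)  # enforce the zero-diagonal constraint
    return B
```

Scoring is then a single matrix product, `R @ B`, which is why training and serving are so cheap compared to iterative methods like SLIM.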
31. From Correlation to Causation
● Most recommendation algorithms are correlational
○ Some early recommendation algorithms literally computed correlations between users and items
● Did you watch a movie because we recommended it to you? Or because you liked it? Or both?
● If you had to watch a movie, would you like it? [Wang et al., 2020]
● Moving from p(Y|X) to p(Y|X, do(R))
(Figure from http://www.tylervigen.com/spurious-correlations)
32. Feedback loops
[Figure: a cycle — impression bias inflates plays → inflated item popularity → more impressions → more plays]
Feedback loops can cause biases to be reinforced by the recommendation system!
[Chaney et al., 2018]: simulations showing that this can reduce the usefulness of the system
They’re real: we have seen oscillations in the distribution of genre recommendations.
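The mechanics of the cycle are easy to reproduce in a toy simulation, in the spirit of (though far simpler than) the simulations in [Chaney et al., 2018]: a recommender that always shows the most-played item, whose impressions in turn generate plays. All parameters here are made up for illustration.

```python
import random

def simulate(num_items=50, rounds=200, seed=0):
    """Toy popularity feedback loop: each round, the impression goes to the
    currently most-played item; plays then reinforce its popularity."""
    rng = random.Random(seed)
    plays = [1] * num_items  # start from a uniform play count
    for _ in range(rounds):
        top = max(range(num_items), key=lambda i: plays[i])  # impression bias
        if rng.random() < 0.5:   # member plays the impressed item
            plays[top] += 1
        if rng.random() < 0.1:   # a little organic discovery elsewhere
            plays[rng.randrange(num_items)] += 1
    return plays
```

After a few hundred rounds, a single item dominates the play counts even though all items started out identical, which is the bias-reinforcement pattern the slide describes.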
38. Challenges in Causal Recommendations
● Handling unobserved confounders
● Coming up with the right causal graph
● High variance (especially for propensity-based estimators)
● Computational challenges (e.g. [Wong, 2020])
● Off-policy evaluation
● When and how to introduce exploration
40. Why contextual bandits for recommendations?
● Break feedback loops
● Want to explore to learn
● Uncertainty around member interests and new items
● Sparse and indirect feedback
● Changing trends
Early news example: [Li et al., 2010]
42. Recommendation as Contextual Bandit
● Environment: Netflix homepage
● Context: Member
● Arm: Display video at top of page
● Policy: Selects a video to recommend
● Reward: Member plays and enjoys video
[Figure: the video selector choosing which title to display at the top of the page]
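A drastically simplified sketch of that loop: contexts are members, arms are videos, and the reward is whether the member plays and enjoys the video. This is a tabular epsilon-greedy policy for illustration only; production systems use learned models over rich context features, and the class and parameter names here are invented.

```python
import random
from collections import defaultdict

class EpsilonGreedyBandit:
    """Tabular contextual bandit: per-(context, arm) reward averages,
    with epsilon-greedy exploration."""
    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # (context, arm) -> pulls
        self.means = defaultdict(float)   # (context, arm) -> mean reward

    def select(self, context):
        if self.rng.random() < self.epsilon:  # explore
            return self.rng.choice(self.arms)
        # exploit: arm with the highest estimated reward for this context
        return max(self.arms, key=lambda a: self.means[(context, a)])

    def update(self, context, arm, reward):
        key = (context, arm)
        self.counts[key] += 1
        # incremental mean update
        self.means[key] += (reward - self.means[key]) / self.counts[key]
```

The `epsilon` exploration is what breaks the feedback loop from slide 32: some impressions go to arms the current estimates would never pick.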
44. Causality & Bandits [Dimakopoulou et al., 2021]
● Data collected from bandits is not IID
○ Bandits collect data adaptively
○ Initial noise may mean choosing an arm less often, which can keep its sample mean low
● Inverse Propensity Weighting? High variance
○ Take inspiration from Doubly Robust estimators
● Doubly Adaptive Thompson Sampling (DATS)
○ Thompson Sampling using the distribution of the Adaptive Doubly Robust estimator in place of the posterior
○ DATS performs better in practice and matches the TS regret bound
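For reference, this is the vanilla Beta-Bernoulli Thompson Sampling that DATS modifies: sample a plausible reward rate for each arm from its posterior, and pull the arm whose sample is highest. This sketch is the standard TS baseline, not the DATS estimator from [Dimakopoulou et al., 2021].

```python
import random

def thompson_sampling(true_probs, rounds=2000, seed=0):
    """Beta-Bernoulli Thompson Sampling over len(true_probs) arms;
    returns how many times each arm was pulled."""
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1.0] * k  # Beta(1, 1) uniform prior
    beta = [1.0] * k
    pulls = [0] * k
    for _ in range(rounds):
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_probs[arm] else 0
        alpha[arm] += reward      # posterior update on success
        beta[arm] += 1 - reward   # posterior update on failure
        pulls[arm] += 1
    return pulls
```

DATS keeps this sample-and-pull structure but replaces the Beta posterior with the distribution of an adaptively weighted doubly robust estimator, which corrects for the non-IID way the data was collected.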
45. Challenges with bandits in the real world
● Designing good exploration is an art
○ Especially to support future algorithm innovation
○ Challenging to do member-level A/B tests comparing fully on-policy bandits at high scale
● Bandits over large action spaces: rankings and slates
● Layers of bandits that influence each other
● Handling delayed rewards
46. Going Long-Term
● Want to maximize long-term member joy
● Involves many member visits, recommendation actions and delayed reward
● … sounds like Reinforcement Learning
47. How long?
● Within a page: RL to optimize a ranking or slate
● Within a session: RL to optimize multiple interactions in a session
● Across sessions: RL to optimize interactions across multiple sessions
48. Building simulators for evaluating recommenders [McInerney et al., 2021]
[Figure: simulator scopes — ranking, page-level, and whole system (Accordion)]
49. Many potential directions
● Embeddings for actions: List-wise [Zhao et al., 2017] or Page-wise recommendation [Zhao et al., 2018] based on [Dulac-Arnold et al., 2016]
● Adversarial model for user simulator: GAN-like model [Chen et al., 2019]
● Policy Gradient: Candidate generator using REINFORCE and TRPO [Chen et al., 2019]
● Multi-task: Additional model head or Actor-Critic [Xin et al., 2020], Auxiliary tasks for REINFORCE [Chen et al., 2021]
● Handling Diversity [Hansen et al., 2021], Slates [Ie et al., 2020], & Multiple Recommenders [Zhao et al., 2020]
● ...
51. What is your recommender trying to optimize?
● We want to optimize long-term member joy
● While accounting for:
○ Avoiding “trust busters”
○ Cold-starting
○ Fairness
○ Findability
○ ...
53. Layers of Metrics
Example case: Misaligned Metrics
● Training Objective: RMSE
● Offline Metric: NDCG on historical data
● Online Metric: Member Engagement in A/B test
● Goal: Joy
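The first two layers are easy to state precisely, which is part of why they get optimized even when misaligned with the goal. A minimal sketch of both (single-list NDCG; production metrics average over many lists and truncate at some rank k):

```python
import math

def rmse(preds, targets):
    """Root mean squared error: the training-objective layer."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))

def ndcg(relevances):
    """NDCG for one ranked list: the offline-metric layer.
    `relevances` are true relevance grades listed in predicted rank order."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

The remaining layers, engagement in an A/B test and ultimately joy, have no such closed form, which is exactly where the misalignment creeps in.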
55. Recap [More et al., 2019]
● Bandit replay-style metrics can have high variance due to a low number of matches with large action spaces
● Use a ranking approach: good to rank high-reward arms near the top and low-reward arms near the bottom
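To see where the variance comes from, here is a sketch of a replay-style estimator in the spirit of [Li et al., 2010]: only logged events where the candidate policy agrees with the logged action contribute, so with a large action space very few events match.

```python
def replay_evaluate(policy, logs):
    """Replay-style offline evaluation: average reward over logged
    (context, action, reward) events where policy(context) matches the
    logged action. Returns (average reward, number of matches)."""
    total, matches = 0.0, 0
    for context, logged_action, reward in logs:
        if policy(context) == logged_action:
            total += reward
            matches += 1
    return (total / matches if matches else 0.0), matches
```

With K possible actions and uniform logging, only about 1/K of events match, so the estimate rests on a tiny sample — the high-variance problem the ranking-based metric of [More et al., 2019] is designed to avoid.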
56. Challenges in objectives
● Nuanced metrics:
○ Differences between what you want and what you can encapsulate in a metric
○ Where does enjoyment come from? How does that vary by person?
○ How do you measure that at scale?
● What about effects beyond the typical A/B time horizon?
● Incorporating fairness
○ Calibration to the distribution of user tastes [Steck, 2018]
○ Item cold-start [Zhu et al., 2021]
● Beyond algorithms: Ensuring a positive impact on society