Our objective for the Netflix recommendation engine is to create a personalized experience for our members, making it easier for them to find a video to watch and enjoy. When a member logs on to the service, they may be in one or a combination of different watching modes: discovering new content to watch, continuing a partially watched movie or a TV show they have been binge-watching, playing a title they added to their list during an earlier session, and so on. If, for example, we can reasonably predict when a member is more likely to be in continuation mode, and which videos they are more likely to resume, it makes sense to place those videos in more prominent positions on the home page. In this talk we focus on understanding discovery vs. continuation behavior and explain how we have used machine learning to improve the member experience by learning a personalized balance between those two modes. As a case study, we focus on a recent change to the personalization of a row of recommendations called “Continue Watching,” which appears on the member homepage on the website and in the app and currently drives a significant proportion of member streaming hours.
4. Netflix Scale
§ > 83M members
§ > 190 countries
§ > 1000 device types
§ > 3.7B hours of content
streamed every month
§ 36% of peak US
downstream traffic
5. § Recommendations through
predicted star rating
§ Contest:
§ Accuracy measured by root
mean squared error (RMSE)
§ Improve by 10% = $1 million!
§ Data size:
§ 100M ratings (back then
“almost massive”)
6. Recommendation System: Ideal State
Turn on Netflix, and the absolute best content for you
would automatically start playing
7. Meanwhile…
Create a page of recommendations where the titles you are
most likely to watch and enjoy are shown on the most visible
parts of the page
8. Everything is a Recommendation
[Figure labels: Title Ranking; Row Selection & Ordering]
Recommendations are
driven by machine
learning algorithms
Over 80% of what
members watch comes
from our
recommendations
9. How the Homepage is Built
§ The titles are organized as rows
§ Ordering of titles within rows depends on the row type
§ Selection and ordering of rows:
§ Personalized page generation
algorithm
§ Also some business rules and
constraints
§ Balance thematic coherence,
relevance, and diversity
10. Various Types of Member Interactions/Feedback
§ Plays
§ How long, pause, rewind, skip, etc.
§ Rating and social
§ Rate, like, share
§ Context
§ Time, location, device, language
§ Interactions
§ Scrolling, opening a title page,
search, list add
11. Building the Recommendations is Data Driven
§ Try an idea offline using historical
data to see if it would have made
better recommendations
§ Offline metrics: AUC, nDCG, Recall, …
§ If it did, deploy a live A/B test to see
if it performs well in Production
§ Primary metric: Member retention
[Figure: workflow stages: Idea / Problem, Data, Algorithm, Model, Metrics, A/B Testing]
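
As a hedged illustration of one of the offline metrics named on this slide, the following is a minimal sketch of nDCG for a single ranked list. The function names and the use of binary relevance labels (1 = played, 0 = not played) are assumptions made for illustration; the talk does not describe how the metrics were actually computed.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for relevance labels given in ranked order."""
    # Position 0 is the top of the list; its discount is log2(2) = 1.
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A list with the played title on top scores 1.0, and the score drops as the played title moves down the ranking.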
12. For More Reading
§ Netflix tech blog:
§ bit.ly/beyondfivestars
§ bit.ly/learnapage
§ bit.ly/sparktimetravel
14. What Is the Most Likely Title You Will Watch?
The same one you watched last time!
§ A large portion of watching hours is spent in continue
watching mode
15. Different Modes of Watching
§ Continuation: Resume a
recently-watched TV/Movie
§ List: Play a title previously
added to My List
§ Rewatch: Rewatch a title
enjoyed in the past
§ Discovery: Discover a new
title to watch
16. Recommending for Different Modes:
Approach 1
§ Build one unified model for ranking the titles in each row
and one for ranking rows
§ Optimized for the likelihood of play/enjoyment from the page
§ Benefits:
§ Fewer models to maintain
§ Fewer A/B tests
17. Approach 1: Challenges
§ Members behave differently in different modes
§ Different row types are designed for different behaviors
§ Hard to capture and balance all that in one objective
§ E.g., simply ranking titles by likelihood of play will fill the page with
already-watched titles → Poor member experience
§ Recommendations for different modes have different
sensitivities to member actions
§ Continuation recs may react immediately to watching activities,
My List recs may react to My List add/remove activities, etc.
18. Approach 2: Dedicated Models + Blend
§ Build separate models for each mode
§ Blend the results on the page
§ Blending can be done through a model trained offline, or a
parameter tuned online
§ E.g., one or more dedicated rows for each mode
§ Pro:
§ More modular, provides more intuitive knobs for balancing
§ Con:
§ Less elegant, more maintenance
20. Continue Watching Row: The Past
§ CW row was shown on some devices
§ Videos sorted by recency of last watch
§ Row appearance on page by business rules
§ On the website, only a single CW title
§ A very significant fraction of plays are continuations
§ CW deserved a better treatment
21. Objective
§ Unify the CW row across devices
§ Optimize the row in two dimensions:
§ Row position on page
§ Place it higher when the member is more
likely to resume a video
§ Re-order the titles within the CW row
§ By their likelihood to be resumed in the
current session
22. Some Intuitive Patterns
§ Member may be more likely to want to
§ Resume a video if:
§ In the middle of binging a TV show
§ Partially watched a movie recently
§ Often watched it around this time of day, at this location, or on the
current device
§ Discover a new title if:
§ Just finished a movie or completed all episodes of a show
§ Hasn’t watched anything recently
§ Is a relatively new member
23. Building a Recommendation Model for CW
§ Feature Brainstorm
§ Training Data
§ Models and Metrics
§ Implementation
24. Feature Ideas
§ Member-level:
§ Member’s subscription: tenure, country, language
§ How active has the member been recently
§ Member past ratings, genre preferences, etc.
25. Feature Ideas
§ Video and member’s previous interactions with it:
§ How recently was the video added to the catalog, watched, ...
§ How much of the movie/show was watched
§ Video metadata:
§ Type and genre of video, # episodes
§ E.g., kids titles may be re-watched more
§ What else is in the catalog
§ Popularity and relevance of the video
§ How often do members resume this video
27. Title Ranking Model
§ Training data
§ Continuation sessions
§ Look at which of the recently-watched titles were played
§ Model
§ Learn-to-rank: Linear/ensembles/…
§ Optimize for how well we rank the played title among other titles
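
One possible way to quantify "how well we rank the played title among other titles" is the reciprocal rank of the played title in each continuation session. The sketch below is an assumption for illustration only: the talk does not say which ranking metric was used, and the scores and the title "Other Show" are made up.

```python
def reciprocal_rank(scores, played_title):
    """Return 1/rank of the played title when titles are sorted by descending score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return 1.0 / (ranked.index(played_title) + 1)

def mean_reciprocal_rank(sessions):
    """Average reciprocal rank over (scores, played_title) pairs."""
    return sum(reciprocal_rank(s, p) for s, p in sessions) / len(sessions)

# Hypothetical continuation sessions: model scores for the recently watched
# titles, paired with the title that was actually resumed.
sessions = [
    ({"Narcos": 0.9, "Sid the Science Kid": 0.4}, "Narcos"),
    ({"Narcos": 0.3, "Sid the Science Kid": 0.6, "Other Show": 0.5}, "Other Show"),
]
mrr = mean_reciprocal_rank(sessions)
```

The same function can score the recency baseline by passing last-watched timestamps in place of model scores, which gives a like-for-like comparison.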
28. Title Ranking Model: Performance
§ Baseline: Ranking by recency of
last play
§ Recency rank was also an
important feature in the model
§ Metrics significantly higher than
the baseline
§ E.g. Significant lift in precision
§ A/B testing also showed
improvements
29. Row Placement Model
§ Objective
§ Estimate the likelihood of continuation vs. discovery
§ Map that likelihood to a position on the page
§ Simplification:
§ Fix two candidate positions on the page and apply a threshold
§ Tune the threshold to optimize some accuracy metric
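
The simplification above can be sketched as a single thresholding step. The function name, the default threshold, and the two candidate row indices are hypothetical values chosen for illustration; the talk does not give the actual positions or threshold.

```python
def cw_row_position(p_continuation, threshold=0.5, high_position=0, low_position=3):
    """Map the predicted continuation likelihood to one of two fixed row positions.

    high_position and low_position are row indices on the page (0 = top row).
    All default values here are illustrative assumptions.
    """
    return high_position if p_continuation >= threshold else low_position
```

For example, `cw_row_position(0.8)` places the row at the top, while `cw_row_position(0.2)` places it lower on the page.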
30. Row Placement Model: Training
§ Training data
§ Randomly select sessions with plays globally
§ Model
§ Binary classification of continuation vs. discovery sessions
§ Evaluated using classification and ranking metrics
31. Row Placement Model: Performance
§ Metrics
§ Achieved high classification metrics for predicting continuation vs
discovery
§ Error types:
§ False positives → CW occupies top of the page unnecessarily
§ False negatives → Difficult for member to find the CW title
§ Placing the row
§ Threshold trades off FP and FN → Hard to tune offline
§ Tuned the threshold by A/B testing
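
To make the FP/FN trade-off concrete, here is a minimal sketch that counts both error types at a given threshold. The probabilities and labels are invented for illustration (label 1 = continuation session, 0 = discovery session); they are not data from the talk.

```python
def errors_at_threshold(probabilities, labels, threshold):
    """Return (false_positives, false_negatives) when predicting continuation
    for every session whose probability is at or above the threshold."""
    pairs = list(zip(probabilities, labels))
    fp = sum(1 for p, y in pairs if p >= threshold and y == 0)
    fn = sum(1 for p, y in pairs if p < threshold and y == 1)
    return fp, fn

# Hypothetical predicted probabilities and true session labels.
probs = [0.9, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 0]
low = errors_at_threshold(probs, labels, 0.3)   # lenient threshold
high = errors_at_threshold(probs, labels, 0.8)  # strict threshold
```

Lowering the threshold trades false negatives for false positives and vice versa. Offline counts alone do not say which error hurts members more, which is consistent with tuning the threshold through A/B testing.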
32. Reusing the Title Ranking Model
§ Use the title-level scores
§ Calibrate scores to get probability $P_t$ of continuation for each CW
title $t$
§ Aggregate into an overall probability of continuation
§ E.g., assuming independence:
$P_{CW} = 1 - \prod_{t \in CW} (1 - P_t)$
§ Pro: Avoids maintaining two separate models
§ Con: Not as accurate as a dedicated model
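
The independence-based aggregation on this slide can be sketched in a few lines. The function name is hypothetical, and the inputs are assumed to be already-calibrated per-title probabilities.

```python
import math

def overall_continuation_probability(title_probs):
    """Probability that at least one CW title is resumed, assuming the
    per-title continuation events are independent: 1 - prod(1 - P_t)."""
    return 1.0 - math.prod(1.0 - p for p in title_probs)
```

With two titles at 0.5 each the aggregate is 0.75, and an empty CW row yields 0.0 because the empty product is 1.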
33. Context Awareness
§ Title ranks highest at the same time of day and on the same device
as its last play
§ Experiment:
§ Played “Sid the Science Kid” on iPhone
§ Played “Narcos” on the website
→ Different ranking on iPhone and Web
34. Serving the CW Row in Production
§ Score cannot be precomputed → Real- or near real-time
§ Some features are context dependent
§ Row should refresh each time a member watches a title
§ Need to push updates to clients to keep the row fresh
§ Latency bottleneck: Data transfers from the cache to
computation backend
§ Requires careful backend engineering
§ Fallback strategy: If computation fails, can use recency ranking
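
The fallback strategy can be sketched as a try/except around the model scoring step. Everything named here (`rank_cw_titles`, `score_fn`, the `last_watched` field) is a hypothetical stand-in; the actual serving code is not described in the talk.

```python
def rank_cw_titles(titles, score_fn):
    """Order CW titles by model score, falling back to recency on failure.

    titles: list of dicts with an 'id' and a 'last_watched' timestamp.
    score_fn: callable returning a model score for a title; it may raise
    if the scoring computation fails.
    """
    try:
        return sorted(titles, key=score_fn, reverse=True)
    except Exception:
        # Fallback: most recently watched title first.
        return sorted(titles, key=lambda t: t["last_watched"], reverse=True)
```

Because `sorted` builds a new list, a failure partway through scoring leaves the input untouched and the recency ordering is computed from the original titles.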
36. Conclusions
§ Important to understand different modes of behavior
§ Continuation is a key driver of streaming hours
§ Improving CW recommendations improves member experience
§ A/B testing showed significant boost in user engagement
§ Future:
§ Incorporate the placement of CW row (and others) into the main
page construction model
§ When can we automatically start resuming a title?