1. A/B Testing at Netflix: Experimentation Platform
Steve Urban
experimentation@netflix.com
2. Importance of A/B Testing at Netflix
• Technology is just one part of the equation: a culture of experimentation is the other essential part
• All product ideas are subjected to the scientific method, with actual data supporting a change before it is rolled out to all users
• The effectiveness of any idea is measured without bias - the seniority of the person proposing the idea is irrelevant
3. Our Users
A/B testing enables product decisions throughout Netflix; our users span all departments:
• Data Scientists: Does this new ranking algorithm result in more plays?
• Product Managers: Does this new UI reduce the time for users to find content?
• Marketing: Which email campaign resulted in more new subscribers?
• Content: Which thumbnail image resulted in more streams of Daredevil?
• Engineers: Is the new implementation of this streaming algorithm more performant when internet connectivity is spotty?
• and so on...
4. A/B Testing Platform Objectives
• Being an internal tool is not an excuse for poor UX
• Given the diverse expertise of our users, workflows must be simple and effective while providing value
• Cover all generic test management scenarios
• Easily accommodate unique experimentation needs as they come up
• Ingest and combine real-time behavioral and batch metadata from numerous sources
5. We’re Hiring
We’re looking for a Full-Stack Engineer to help across the board:
• Collaborate with users across Netflix to understand their UI needs
• Be part of a team of engineers and UX experts
• Tech stack: Java, React, Node
• Data visualization experience is a plus
Netflix has a unique culture. Read about it here.
We need a Server-Side Engineer with expertise designing distributed systems:
• Help design and rebuild our allocation engine
• Experience processing large datasets - including efficient incorporation of near real-time data
• Expertise with various Big Data databases
• Machine learning experience is a plus
8. Which set of recommendations is better?
Given that I Watched House of Cards...
[Figure: recommendation set A or recommendation set B]
9. Hard to Answer Without Disciplined Experimentation
[Figure: recommendation set A? or B?]
10. A/B Testing Process
Target Population
Hypothesis: Retention and/or engagement will improve with the new recommendation algorithm.
Process: Randomly group users into different buckets. Other than the test experience itself, all other factors are held constant.
Control Group: Continue to experience the current version (A)
Test Group B: Experience version B
Test Group C: Experience version C
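To make the randomization concrete, here is a minimal sketch of deterministic bucket assignment, assuming a hash-based scheme; the cell names and hashing approach are illustrative, not Netflix’s actual implementation:

import java.nio.charset.StandardCharsets;
import java.util.UUID;

public final class BucketAssigner {
    private static final String[] CELLS = {"A (control)", "B", "C"};

    // Hash (testId + userId) so a given user lands in the same cell on
    // every request, while users overall spread evenly across cells.
    public static String assign(String testId, String userId) {
        long hash = UUID.nameUUIDFromBytes(
                (testId + ":" + userId).getBytes(StandardCharsets.UTF_8))
                .getMostSignificantBits();
        return CELLS[(int) Math.floorMod(hash, (long) CELLS.length)];
    }

    public static void main(String[] args) {
        System.out.println(assign("new-ranker-test", "user-42")); // stable across calls
    }
}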
11. A/B Testing Process Continued
Analyze & Compare Key Results
Algorithm A (Control): Viewing hours delta: N/A (N/A as this is the baseline we measure the other options against)
Algorithm B: Viewing hours delta: +2.3%, Statistically Significant: Yes (2.3% better than the control, and we’re confident about it)
Algorithm C: Viewing hours delta: -5.7%, Statistically Significant: Yes (Ouch! Don’t use this algorithm.)
...
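For illustration, the “Statistically Significant: Yes” call could come from a plain two-sample z-test on mean viewing hours; this sketch assumes that method, which is not necessarily the methodology Netflix actually uses:

// Two-sample z-test: is the difference in mean viewing hours between
// a test cell and the control statistically significant?
public final class SignificanceCheck {
    public static boolean isSignificant(double meanControl, double varControl, long nControl,
                                        double meanTest, double varTest, long nTest) {
        // Standard error of the difference between the two sample means.
        double se = Math.sqrt(varControl / nControl + varTest / nTest);
        double z = (meanTest - meanControl) / se;
        return Math.abs(z) > 1.96; // two-sided test at the 95% confidence level
    }

    public static void main(String[] args) {
        // Illustrative numbers only, echoing the +2.3% example above.
        System.out.println(isSignificant(10.0, 25.0, 1_000_000, 10.23, 25.0, 1_000_000)); // true
    }
}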
14. Allocation & Stratification
● Randomly distribute and assign customers to a variant in the experiment utilizing Stratified Sampling
● Start, Stop, and Track allocations in near real-time

All US Regions, Percentage of Users*:
North East   22%
South East   13%
South West   17%
...          ...
*Numerical values are for illustrative purposes only and are totally made up

“Random sampling” with enforcement of sample proportions across regions
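A simplified sketch of the stratified sampling described above: shuffle users within each region (stratum), then split each stratum round-robin across cells, so every cell roughly preserves the regional proportions. The regions and user IDs are as made up as the percentages:

import java.util.*;

public final class StratifiedAllocator {
    // Returns cell index -> users allocated to that cell.
    public static Map<Integer, List<String>> allocate(Map<String, List<String>> usersByRegion,
                                                      int numCells, long seed) {
        Map<Integer, List<String>> cells = new HashMap<>();
        for (int c = 0; c < numCells; c++) cells.put(c, new ArrayList<>());
        Random rng = new Random(seed);
        for (List<String> stratum : usersByRegion.values()) {
            List<String> shuffled = new ArrayList<>(stratum);
            Collections.shuffle(shuffled, rng);
            // Round-robin within the stratum keeps per-region proportions
            // (approximately, when strata don't divide evenly).
            for (int i = 0; i < shuffled.size(); i++) {
                cells.get(i % numCells).add(shuffled.get(i));
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        Map<String, List<String>> byRegion = Map.of(
                "North East", List.of("u1", "u2", "u3", "u4"),
                "South East", List.of("u5", "u6"));
        System.out.println(allocate(byRegion, 2, 42L)); // each cell gets half of each region
    }
}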
15. Segmentation
● Divide a broad target population into subsets with similar properties
● Some tests are meant to measure impact on specific populations
● Must maintain scale and low latencies

Segmentation of the Target Population by specific properties, e.g.:
● Haven’t used a tablet to access Netflix in n days
● Used a game console to access Netflix within the last n days
● Smart TV users
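One way to express segments like those above is as composable predicates over user attributes; the User record and its fields here are hypothetical, purely to illustrate the idea:

import java.time.Duration;
import java.time.Instant;
import java.util.function.Predicate;

public final class Segments {
    // Hypothetical user attributes relevant to the example segments.
    record User(Instant lastTabletUse, Instant lastConsoleUse, boolean usesSmartTv) {}

    static Predicate<User> noTabletInDays(int n) {
        return u -> u.lastTabletUse() == null
                || u.lastTabletUse().isBefore(Instant.now().minus(Duration.ofDays(n)));
    }

    static Predicate<User> consoleWithinDays(int n) {
        return u -> u.lastConsoleUse() != null
                && u.lastConsoleUse().isAfter(Instant.now().minus(Duration.ofDays(n)));
    }

    static Predicate<User> smartTvUser() {
        return User::usesSmartTv;
    }

    public static void main(String[] args) {
        // Segments compose: smart TV users who also used a console in the last 7 days.
        Predicate<User> segment = smartTvUser().and(consoleWithinDays(7));
        User u = new User(null, Instant.now().minus(Duration.ofDays(2)), true);
        System.out.println(segment.test(u)); // true
    }
}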
16. Test Health
● Not all test experiences will perform equally, but we must ensure differences aren't due to buggy implementations
● Issues can be device-specific, so we must monitor at device, test, and experience granularity
● The example below is super-simplified - we need to create visualizations that effectively convey test health, internationally, across thousands of devices
Control Cell: No errors/fallbacks
Experience A: Issue on TV UI detected
Experience B: No errors/fallbacks
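A toy illustration of monitoring at (experience, device) granularity, mirroring the simplified example above: flag an experience on a device when its error rate exceeds the control’s by more than a threshold. All names and numbers are assumptions:

import java.util.Map;

public final class TestHealth {
    // errorRates maps experience -> (device -> observed error rate).
    public static void checkHealth(Map<String, Map<String, Double>> errorRates,
                                   String controlCell, double threshold) {
        Map<String, Double> control = errorRates.get(controlCell);
        errorRates.forEach((experience, byDevice) -> {
            if (experience.equals(controlCell)) return; // skip the baseline itself
            byDevice.forEach((device, rate) -> {
                double baseline = control.getOrDefault(device, 0.0);
                if (rate - baseline > threshold) {
                    System.out.printf("ALERT: %s on %s: %.1f%% errors vs control %.1f%%%n",
                            experience, device, rate * 100, baseline * 100);
                }
            });
        });
    }

    public static void main(String[] args) {
        checkHealth(Map.of(
                        "Control", Map.of("TV UI", 0.01, "Mobile", 0.01),
                        "Experience A", Map.of("TV UI", 0.08, "Mobile", 0.01),
                        "Experience B", Map.of("TV UI", 0.01, "Mobile", 0.01)),
                "Control", 0.05);
        // Alerts only for Experience A on the TV UI, matching the diagram.
    }
}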
17. ABlaze UI: Test Lifecycle Management
Initial Planning: Test Configuration Screens
● Determine hypothesis
● Implement each test experience
Schedule Test: Scheduler View
● Define real-time rules & conditions
● Consider potential conflicts
Monitor Test: Dashboard and Alert Views
● Monitor test health over time
○ Real-time analysis and alerting on metrics and allocations
● Pull test if bugs/issues present themselves
Hypothesis Evaluation: Comparison Views
● Interactive filtering, analysis, & visualization of data
● Call success or failure of test
Implement or Re-Test
● Devise plan to roll winning experience (if any) out to production
● Else, potentially revise hypothesis and retest
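To tie the lifecycle stages together, here is a hypothetical sketch of the kind of test definition such screens might manage; the field names are illustrative assumptions, not ABlaze’s actual schema:

import java.time.Instant;
import java.util.List;

public final class TestDefinition {
    record Experience(String cellId, String description) {}

    record AbTest(
            String name,
            String hypothesis,               // Initial Planning
            List<Experience> experiences,    // one cell per experience, including control
            Instant start, Instant end,      // Schedule Test
            List<String> healthAlertMetrics, // Monitor Test
            List<String> successMetrics) {}  // Hypothesis Evaluation

    public static void main(String[] args) {
        AbTest test = new AbTest(
                "new-ranker",
                "Retention and/or engagement will improve with the new recommendation algorithm",
                List.of(new Experience("A", "current ranker (control)"),
                        new Experience("B", "new ranker")),
                Instant.parse("2016-03-01T00:00:00Z"),
                Instant.parse("2016-04-01T00:00:00Z"),
                List.of("error_rate", "fallback_rate"),
                List.of("viewing_hours_delta"));
        System.out.println(test.name() + ": " + test.hypothesis());
    }
}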
18. Some Challenges
• Operate resiliently and at low latencies, despite:
• Customer allocations taking place in real-time
• Need for near real-time insights into test health over massive datasets
• Data that is distributed across multiple clusters
• Data processing:
• Joins across billions of rows of data from many sources can cause a massive increase in the number of rows (e.g., joining allocation records against per-user playback events fans out to one row per event)
• Efficient management of datasets to support interactive analysis, dashboards, etc.
• Rich and flexible filtering to support interactive analysis
• Extract forecasts and insights
• Oh, and make it as easy to use as possible for the users...