Tactical: We optimize for month-over-month retention, and measure engagement via time spent streaming. We typically run each experiment for 2-3 months.
Measurable
Tied to a key business goal
Agree on the detailed calculation, too
Ron Kohavi calls this the "Overall Evaluation Criterion" (OEC)
People come to you all the time with metrics they like, many of which seem like great ideas. Our approach at Netflix is to measure all sorts of stuff, but be stringent about which metrics are appropriate for decision-making about the test outcome. Specifically, we study metrics' relationships with our core metrics.
Metric is relevant to retention/LTV (via predictive modeling)
Metric is not just highly correlated with another, slightly better metric
Metric is actionable
Metric shows differentiation across test cells
Modeling approach is a combination of elastic net and random forest techniques, to rank variables by their ability to predict retention.
Elastic net: differentiate between highly-correlated variables
Random forest: rank variables according to importance
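As a rough illustration (not Netflix's actual implementation), this ranking step could be sketched in Python with scikit-learn; the candidate-metric names and data below are invented placeholders:

```python
# Sketch: rank candidate metrics by their ability to predict retention,
# combining an elastic net (to separate highly-correlated variables)
# with a random forest (to rank variables by importance).
# All data and feature names here are illustrative placeholders.
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.ensemble import RandomForestRegressor

# Placeholder data: one row per member, candidate metrics as columns,
# and observed retention (e.g., months retained) as the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 4)),
                 columns=["streaming_hours", "search_sessions",
                          "titles_rated", "rows_browsed"])
y = 0.8 * X["streaming_hours"] + 0.1 * X["search_sessions"] + rng.normal(size=1000)

# Elastic net: the mixed L1/L2 penalty helps differentiate between
# highly-correlated candidate metrics.
enet = ElasticNetCV(cv=5).fit(X, y)
enet_rank = pd.Series(np.abs(enet.coef_), index=X.columns)

# Random forest: impurity-based importances rank variables by their
# contribution to predicting retention.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
rf_rank = pd.Series(rf.feature_importances_, index=X.columns)

print(pd.DataFrame({"elastic_net_abs_coef": enet_rank,
                    "rf_importance": rf_rank})
      .sort_values("rf_importance", ascending=False))
```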
Netflix_Controlled Experimentation_Panel_The Hive
Experimentation Panel, 3-20-13
Some Insights from Netflix Experimentation
Experimentation at Netflix
Core to our culture
Goal is to maximize our customers' viewing enjoyment
New and existing global members participate in multiple tests
We experiment in all areas (personalization algorithms, product features, acquisition, streaming optimization, etc.)
Clarity on key metric(s) is critical
Netflix's goal with our members: continually improve member enjoyment. Key metric: retention.
Netflix's goal with our visitors: optimize the visitor experience to entice people to try Netflix. Key metric: free trial conversion.
What about other great metrics that you believe to be a positive measure?
Determining the appropriate use of a metric:
1. Brainstorm potential metrics, with PMs and past experiments
2. Predictive modeling (of core metric)
3. Vet any "winners"; collect new data
4. Productize successful metrics
[Chart: Example ranking of some possible metrics; y-axis is variable importance measure, 0 to 0.012]
We predict customer tenure from streaming hours
[Chart: retention at N days of tenure vs. total hours consumed during 22 days of membership; shows the probability of retaining at each future billing cycle based on streaming S hours]
[Chart: cumulative % of members in each test cell vs. streaming hours]
We leverage the retention-hours curves above to measure the full distribution of hours in each test cell and predict tenure.
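A minimal sketch of the tenure arithmetic the slide implies: given a per-billing-cycle retention probability as a function of early streaming hours, expected tenure is the sum of cumulative survival probabilities across cycles. The retention curve and member hours below are made-up placeholders, not Netflix's model:

```python
# Sketch: predict expected tenure (in billing cycles) from streaming hours,
# using a retention-vs-hours curve like the one on the slide.
# The curve below is an illustrative placeholder.
import numpy as np

def retention_prob(hours, cycle):
    """Placeholder probability of surviving billing cycle `cycle`,
    given total hours streamed in the first 22 days of membership."""
    base = 1.0 - np.exp(-hours / 20.0)            # more hours -> higher retention
    return np.clip(base * (0.98 ** cycle), 0, 1)  # mild decay over cycles

def expected_tenure(hours, max_cycles=48):
    """Expected billing cycles retained: the sum over cycles of the
    cumulative probability of surviving every cycle up to that point."""
    survival = 1.0
    tenure = 0.0
    for cycle in range(1, max_cycles + 1):
        survival *= retention_prob(hours, cycle)
        tenure += survival
    return tenure

# Compare test cells by pushing each cell's full distribution of member
# hours through the curve, then averaging predicted tenure per cell.
cell_a_hours = np.array([5, 12, 30, 45, 60])   # placeholder member hours
cell_b_hours = np.array([8, 15, 33, 50, 70])
print("cell A:", np.mean([expected_tenure(h) for h in cell_a_hours]))
print("cell B:", np.mean([expected_tenure(h) for h in cell_b_hours]))
```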
Percent of streaming hours from search-based rows
Filtered measurement
Activity filtering: filter to a subset of activity, e.g. streaming hours from one row. Controversial for decision-making; the risk increases as the interaction potential (or cannibalization potential) increases.
Allocation filtering: filter to a subset of members in the test, e.g. streaming hours for the subset of customers who performed a search. Good for decision-making as long as:
1. The segment incorporates the full set of members who were exposed to the experience being tested
2. The segment is large enough to care about (or strategically important)
3. The segment holds up to a controlled experiment (members comprising the segment are not selected in a way that could have been influenced by the test experience)
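A small sketch of the two filtering styles, assuming invented pandas column names (member_id, cell, row_type, hours, performed_search):

```python
# Sketch of activity filtering vs. allocation filtering; all data is invented.
import pandas as pd

# Per-member test assignment and per-event streaming activity.
members = pd.DataFrame({
    "member_id": [1, 2, 3, 4],
    "cell": ["A", "A", "B", "B"],
    "performed_search": [True, True, True, False],
})
streams = pd.DataFrame({
    "member_id": [1, 1, 2, 3, 3, 4],
    "row_type":  ["search", "genre", "search", "genre", "search", "genre"],
    "hours":     [2.0, 1.5, 3.0, 0.5, 4.0, 1.0],
})
df = streams.merge(members, on="member_id")

# Activity filtering: a subset of ACTIVITY (hours from search-based rows only).
# Riskier for decision-making when rows can cannibalize one another.
activity_filtered = df[df["row_type"] == "search"].groupby("cell")["hours"].sum()

# Allocation filtering: a subset of MEMBERS (everyone who performed a search),
# but then all of their streaming hours, not just the search-based ones.
allocation_filtered = df[df["performed_search"]].groupby("cell")["hours"].sum()

print(activity_filtered)
print(allocation_filtered)
```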
Unintended threats to a controlled experiment
Engineering bug (A and B don't work as intended)
Control cell is not engineered like a true test cell ("fixed"), and instead uses the standard production experience
Unplanned interaction with other experiments, campaigns, etc. that is differential across test cells
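One common guardrail for catching allocation and engineering bugs like these, not described on the slide itself, is a sample-ratio-mismatch check: if the observed split across cells deviates significantly from the intended split, something upstream is likely broken. A sketch with SciPy:

```python
# Sketch: a sample-ratio-mismatch (SRM) check, a standard guardrail for
# the kinds of allocation/engineering bugs listed above. Counts are invented.
from scipy.stats import chisquare

observed = [50_440, 49_320]   # members actually assigned to cells A and B
intended = [0.5, 0.5]         # intended 50/50 split
total = sum(observed)
expected = [p * total for p in intended]

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:
    print(f"Possible allocation bug: p = {p_value:.2e}")
else:
    print(f"Allocation is consistent with the intended split (p = {p_value:.3f})")
```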
Experiment on the minimum number of qualifying titles in order for a "genre row" to appear
Discovered that the test cells were not working properly
[Chart: cumulative distribution of page views by test cell; x-axis is number of genre rows on the page]
Customers in the test cell using 15 as the minimum were seeing fewer rows altogether.
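The kind of distributional comparison that surfaces such a bug can be sketched with a two-sample Kolmogorov-Smirnov test on rows-per-page across cells; the samples below are invented, where a real check would use logged page views:

```python
# Sketch: how a broken test cell can show up in a distributional comparison,
# like the cumulative distribution of genre rows per page on the slide.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
control_rows = rng.poisson(lam=20, size=5000)  # genre rows per page, control
test_rows = rng.poisson(lam=16, size=5000)     # test cell showing fewer rows

stat, p_value = ks_2samp(control_rows, test_rows)
print(f"KS statistic = {stat:.3f}, p = {p_value:.2e}")
# A large KS statistic with a tiny p-value flags that the test cell's
# row-count distribution differs from control: here, fewer rows altogether,
# matching the bug discovered in the minimum-qualifying-titles experiment.
```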