6. Agenda, framed as an "A/B Testing Playbook": Opening (Hypotheses), Mid-game (Outcomes & Error Rates), Mid-game (Fundamental Tradeoff), Closing (Confidence Intervals).
8. At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
9. The answers:
1. A good hypothesis has a variation and a clearly defined goal which you crafted ahead of time.
2. False Positive Rates have hidden sources of error when testing many goals and variations. False Discovery Rates correct this by increasing runtime when you have many low-signal goals.
3. All three levers are inversely related. For example, running my tests longer can get me lower error rates or let me detect smaller effects.
11. • Control and Variation
A control is the original, or baseline, version of content that you test a variation against.
• Goal
The metric used to measure the impact of the variation against the control.
• Baseline conversion rate
The control group's expected conversion rate.
• Effect size
The improvement (positive or negative) of your variation over the baseline.
• Sample size
The number of visitors in your test.
12. • A hypothesis test is a control and variation that you want to show improves a goal
• An experiment is a collection of hypotheses (goal & variation pairs) that all share the same control
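Since an experiment is just hypotheses sharing one control, a minimal sketch can make the terms concrete (illustrative Python with hypothetical names, not Optimizely's actual data model):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    variation: str  # e.g. "remove header image"
    goal: str       # e.g. "clicks on 'the finals'"

@dataclass
class Experiment:
    control: str                  # the single shared baseline
    hypotheses: list[Hypothesis]  # every goal & variation pair under test

exp = Experiment(
    control="original header",
    hypotheses=[
        Hypothesis("remove header", "engagement"),
        Hypothesis("remove header", "total revenue"),
        Hypothesis("grow header", "engagement"),
    ],
)
# Each extra pair raises the runtime cost of the experiment (see slide 18).
print(f"{len(exp.hypotheses)} hypotheses share one control")
```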
13. Playbook roadmap: Opening (Hypotheses).
16. Why is this not actionable?
Bad hypothesis: "I think changing the header image will make my site better."
It could mean any of:
• Removing the header will increase engagement?
• Removing the header will increase total revenue?
• Removing the header will increase "the finals" clicks?
• Growing the header will increase engagement?
• Growing the header will increase "the finals" clicks?
• …
Test creep!
17. Bad hypothesis: "I think changing the header image will make my site better."
Good hypotheses (organized and clear):
• Removing the header will increase engagement
• Removing the header will increase total revenue
• Removing the header will increase "the finals" clicks
• Growing the header will increase engagement
• Growing the header will increase "the finals" clicks
• …
18. The more relationships (hypotheses) you test, the longer (more visitors) it will take to achieve the same outcome (error rate).
Hypotheses also set the cost of your experiment.
19. Questions to check for a good hypothesis
• What are you trying to show with your idea?
• What key metrics should it drive?
• Are all your goals and variations necessary given your testing limits?
20. At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
21. At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
Answer: A good hypothesis has a variation and a clearly defined goal which you crafted ahead of time.
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
22. Playbook roadmap: Mid-game (Outcomes & Error Rates).
25. The four possible outcomes of a hypothesis test:

                        "True" value of hypothesis
Result of test          Improvement        No effect
Winner / Loser          True positive      False positive
Inconclusive            False negative     True negative

(no effect, winner / loser) = false positive
(+/- improvement, inconclusive) = false negative
(+/- improvement, winner / loser) = true positive
(no effect, inconclusive) = true negative
30. The 2x2 table will help us to
1. Keep track of different error rates we care about
2. Explore the consequences of controlling false positives vs false
discoveries
32. • False positive rate (Type I error)
= "Chance of a false positive from a variation with no effect on a goal"
= #(False positives) / #(No effect)
• Thresholding the FPR
= "When I have a variation with no effect on a goal, I'll find an effect less than 10% of the time."
(2x2 table as above, false-positive cell highlighted)
36. How can we ever compute a False Positive Rate if we don't know whether a hypothesis is true or not?
Statistical tests (the fixed-horizon t-test, Stats Engine) are designed to threshold an error rate.
Example:
"Calling winners & losers only when the p-value is below .05 guarantees a False Positive Rate below 5%."
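To make that guarantee tangible, here is a minimal simulation (my sketch, not Stats Engine) of repeated A/A comparisons, where every winner or loser is a false positive by construction:

```python
# Simulate many A/A tests (no true effect) and check that a p < .05
# cutoff produces false positives about 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, visitors, baseline_cr = 2000, 10_000, 0.10

false_positives = 0
for _ in range(n_tests):
    control = rng.binomial(visitors, baseline_cr)
    variation = rng.binomial(visitors, baseline_cr)  # identical, no effect
    table = [[control, visitors - control],
             [variation, visitors - variation]]
    _, p, _, _ = stats.chi2_contingency(table, correction=False)
    if p < 0.05:
        false_positives += 1  # declared a winner/loser despite no effect

print(f"Observed False Positive Rate: {false_positives / n_tests:.3f}")  # ~0.05
```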
42. False positive rates are only useful in the context of all hypotheses.
(2x2 table as above)
44. • False discovery rate (FDR)
= "Chance of a false positive from a conclusive result"
= #(False positives) / #(Conclusives: winners & losers called)
• Thresholding the FDR
= "When you see a winning or losing goal on a variation, it's wrong less than 10% of the time."
(2x2 table as above, conclusive row highlighted)
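For intuition on how an FDR threshold is enforced, here is the classical fixed-horizon Benjamini-Hochberg procedure (my sketch for illustration; Stats Engine controls FDR with a sequential variant rather than this exact recipe):

```python
# Benjamini-Hochberg step-up procedure: call conclusives while keeping
# the expected False Discovery Rate below `fdr`.
import numpy as np

def benjamini_hochberg(p_values, fdr=0.10):
    """Return a boolean array: which hypotheses to call winners/losers."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    # Largest k such that the k-th smallest p-value <= (k/m) * fdr.
    thresholds = (np.arange(1, m + 1) / m) * fdr
    below = p[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# Ten goal/variation pairs: two with real effects, eight noise.
p_vals = [0.001, 0.004, 0.20, 0.35, 0.41, 0.52, 0.63, 0.74, 0.88, 0.95]
print(benjamini_hochberg(p_vals))  # only the first two come back conclusive
```

Note how adding more noisy p-values to the list tightens the per-hypothesis thresholds, which is exactly the "longer runtime with many low-signal goals" cost described above.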
49. False discovery rates remain useful regardless of the number of hypotheses.
(2x2 table as above)
50. What's the catch?
The more hypotheses (goals & variations) in your experiment, the longer it takes to find conclusive results.
53. What's the catch?
The more low-signal hypotheses (goals & variations) in your experiment, the longer it takes to find conclusive results.
54. Recap
• False Positive Rate thresholding
- controls the chance of a false positive when you have a hypothesis with no effect
- misrepresents your error rate with multiple goals and variations
• False Discovery Rate thresholding
- controls the chance of a false positive when you have a winning or losing hypothesis
- is accurate regardless of how many hypotheses you run
- can take longer to reach significance with more low-signal variations on goals
55. Tips & Tricks for running experiments with False Discovery Rates
• Ask: which goal is most important to me?
- Make it your primary goal (so it is not impacted by all the other goals)
• Run large tests, or large multivariate tests, without fear of finding spurious results, but be prepared for the cost of exploration
• A little human intuition and prior knowledge can go a long way towards reducing the runtime of your experiments
56. At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
57. At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
Answer: False Positive Rates have hidden sources of error when testing many goals and variations. False Discovery Rates correct this by increasing runtime when you have many noisy goals.
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
58. Playbook roadmap: Mid-game (Fundamental Tradeoff).
59. The "3 Levers" of A/B Testing
1. Threshold an error rate
• "I want no more than a 10% false discovery rate"
2. Detect smaller effect sizes (set a minimum detectable effect, MDE)
• "I'm OK with only detecting greater than 5% improvement"
3. Run tests longer
• "I can afford to run this test for 3 weeks, or 50,000 visitors"
60. Fundamental Tradeoff of A/B Testing
Error rates, runtime, and effect size / baseline conversion rate are all inversely related.
61. At any number of visitors, the less you threshold your error rate, the smaller the effect sizes you can detect.
62. At any error rate threshold, stopping your test earlier means you can only detect larger effect sizes.
63. For any effect size, the lower the error rate you want, the longer you need to run your test.
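A standard fixed-horizon sample size approximation (my addition, not from the deck; Stats Engine's sequential math differs) makes all three inverse relationships explicit:

```latex
n \;\approx\; \frac{2\,\big(z_{1-\alpha/2} + z_{1-\beta}\big)^{2}\, p\,(1-p)}{\delta^{2}}
```

Here n is visitors per branch, α the false positive threshold, β the false negative rate, p the baseline conversion rate, and δ the absolute effect size. Tightening either error rate grows the z-terms and hence the runtime, and halving δ roughly quadruples n.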
64. What does this look like in practice?
Average visitors needed to reach significance with Stats Engine (baseline conversion rate = 10%; improvement is relative):

Significance threshold     5%        10%       25%
95%                        62,400    13,500    1,800
90%                        59,100    12,800    1,700
80%                        52,600    11,400    1,500
65. All A/B testing platforms address the fundamental tradeoff …
1. Choose a minimum detectable effect (MDE) and false positive rate threshold
2. Find the required minimum sample size with a sample size calculator
3. Wait until the minimum sample size is reached
4. Look at your results once and only once
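A sketch of steps 1 and 2 with statsmodels (my example; the 80% power setting is a common convention, and the fixed-horizon result differs from the Stats Engine averages in the table above):

```python
# Classical fixed-horizon sample size for a two-proportion test
# (illustrative sketch; not Stats Engine's sequential calculation).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_cr = 0.10
relative_mde = 0.05  # detect a 5% relative lift: 10% -> 10.5%
alpha = 0.05         # false positive rate threshold
power = 0.80         # 1 - false negative rate (conventional choice)

effect = proportion_effectsize(baseline_cr * (1 + relative_mde), baseline_cr)
n_per_branch = NormalIndPower().solve_power(
    effect_size=effect, alpha=alpha, power=power, ratio=1.0,
)
print(f"~{n_per_branch:,.0f} visitors per branch needed")  # roughly 57,000
```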
66. Optimizely is the only platform
that lets you pull the levers in
real time
68. In the beginning, we make an educated guess …
Error rate threshold: 5%. Expected improvement: +5% on a 10% baseline conversion rate. Estimated runtime: 52,600 visitors.
69. … but then the improvement turns out to be better …
Error rate threshold: 5%. Observed improvement: +13% on a 16% baseline conversion rate. Remaining runtime: 1,600 visitors, instead of 52,600 - 7,200 = 45,400.
70. … or a lot worse.
Error rate threshold: 5%. Observed improvement: +2% on an 8% baseline conversion rate. Remaining runtime: more than 100,000 visitors.
71. Recap
• The Fundamental Tradeoff of A/B Testing affects you no matter what testing platform you use.
- If you want to detect a 5% improvement on a 10% baseline conversion rate, be prepared to wait for at least 50,000 visitors.
• Optimizely's Stats Engine is the only one that lets you adjust the tradeoff in real time while still reporting valid error rates.
72. At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
73. At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
Answer: All three are inversely related. For example, running my tests longer can get me lower error rates or let me detect smaller effects.
74. Playbook roadmap: Closing (Confidence Intervals).
78. Statistics in 40 Minutes: A/B Testing Fundamentals
Leo Pekelis
Statistician, Optimizely
@lpekelis
leonid@optimizely.com
#opticon2015
81. Definition:
A confidence interval is a range of values for your metric (revenue, conversion rate, etc.)
that is 90%* likely to contain the true difference between your variation and baseline.
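A minimal sketch of computing such an interval under a normal approximation (my example with made-up counts; Stats Engine's always-valid intervals are computed differently):

```python
# 90% confidence interval for the difference in conversion rates
# between variation and control, via the normal approximation.
import math

def diff_ci(conv_c, n_c, conv_v, n_v, z=1.645):  # z = 1.645 for 90% two-sided
    p_c, p_v = conv_c / n_c, conv_v / n_v
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    d = p_v - p_c
    return d - z * se, d + z * se

lo, hi = diff_ci(conv_c=1000, n_c=10_000, conv_v=1080, n_v=10_000)
print(f"90% CI for the difference: [{lo:+.3%}, {hi:+.3%}]")
# If the interval excludes zero, the variation is conclusive at this level.
```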
87. A confidence interval is the mirror image of statistical significance.
Mathematical definition:
The set of parameter values X such that a hypothesis test with the null hypothesis
H0: "Removing a distracting header will result in X more revenue per visitor"
is not yet rejected.
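In symbols, a standard restatement (my notation: δ is the true difference, α the significance threshold):

```latex
\mathrm{CI}_{1-\alpha} \;=\; \big\{\, x \;:\; H_0\!:\ \delta = x \ \text{ is not rejected at level } \alpha \,\big\}
```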
88. Tuning your testing strategy for your traffic and business
(Chart: a run of tests with observed effects of 6%, 4%, 3%, 1%, and 18%, most of them inconclusive.)
Looking for small effects?
• Tests take longer to reach significance
• Might find more winners, if you are willing to wait long enough
• Can you accept a higher error rate?
Testing for larger effects?
• Run more tests, faster
• It's time to move on if the effect size is not worth the visitors
• A confidence interval may be enough
• Change allocation to 10% to lessen the impact on new experiments
90. • False negative rate (Type II error)
= "Rate of false negatives from all hypotheses that could have been false negatives."
(2x2 table as above, false-negative cell highlighted)
93. • False negative rate (Type II error)
= "Rate of false negatives from all variations with an improvement on a goal."
= #(False negatives) / #(Improvement)
• Thresholding Type II error
= "When you have a goal on a variation with an effect, you miss it less than 10% of the time."
(2x2 table as above)
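Tying this back to the levers, a quick fixed-horizon sketch (my example, same statsmodels tools and assumptions as before) of the false negative rate implied by a fixed runtime:

```python
# Power (1 - false negative rate) of a classical fixed-horizon test at a
# fixed runtime; illustrative, not Stats Engine's sequential calculation.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# A +5% relative lift on a 10% baseline, with 26,300 visitors per branch
# (52,600 total, the 80%-threshold runtime from the earlier table).
effect = proportion_effectsize(0.105, 0.10)
power = NormalIndPower().solve_power(
    effect_size=effect, nobs1=26_300, alpha=0.05, ratio=1.0,
)
print(f"Power: {power:.0%} -> False Negative Rate: {1 - power:.0%}")
# Roughly 48% power: at this traffic, a real 5% lift is missed about half
# the time, which is why small effects demand long runtimes.
```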