6. Agenda, framed as an "A/B Testing Playbook": Opening (Hypotheses), Mid-game (Outcomes & Error Rates), Mid-game (Fundamental Tradeoff), Closing (Confidence Intervals).
8. At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
9. The answers:
1. A good hypothesis has a variation and a clearly defined goal which you crafted ahead of time.
2. False Positive Rates have hidden sources of error when testing many goals and variations. False Discovery Rates correct this by increasing runtime when you have many low-signal goals.
3. All three levers are inversely related. For example, running my tests longer can get me lower error rates or let me detect smaller effects.
11. • Control and Variation
A control is the original, or baseline, version of content that you test a variation against.
• Goal
The metric used to measure the impact of the variation against the control.
• Baseline conversion rate
The control group's expected conversion rate.
• Effect size
The improvement (positive or negative) of your variation over the baseline.
• Sample size
The number of visitors in your test.
12. • A hypothesis test is a control and variation that you want to show improves a goal
• An experiment is a collection of hypotheses (goal & variation pairs) that all share the same control
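Since an experiment is just hypotheses sharing one control, a minimal sketch can make the terms concrete (illustrative Python with hypothetical names, not Optimizely's actual data model):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    variation: str  # e.g. "remove header image"
    goal: str       # e.g. "clicks on 'the finals'"

@dataclass
class Experiment:
    control: str                  # the single shared baseline
    hypotheses: list[Hypothesis]  # every goal & variation pair under test

exp = Experiment(
    control="original header",
    hypotheses=[
        Hypothesis("remove header", "engagement"),
        Hypothesis("remove header", "total revenue"),
        Hypothesis("grow header", "engagement"),
    ],
)
# Each extra pair raises the runtime cost of the experiment (see slide 18).
print(f"{len(exp.hypotheses)} hypotheses share one control")
```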
13. Playbook roadmap: Opening (Hypotheses).
16. Why is this not actionable?
Bad hypothesis: "I think changing the header image will make my site better."
It could mean any of:
• Removing the header will increase engagement?
• Removing the header will increase total revenue?
• Removing the header will increase "the finals" clicks?
• Growing the header will increase engagement?
• Growing the header will increase "the finals" clicks?
• …
Test creep!
17. Bad hypothesis: "I think changing the header image will make my site better."
Good hypotheses (organized and clear):
• Removing the header will increase engagement
• Removing the header will increase total revenue
• Removing the header will increase "the finals" clicks
• Growing the header will increase engagement
• Growing the header will increase "the finals" clicks
• …
18. The more relationships (hypotheses) you test, the longer (more visitors) it will take to achieve the same outcome (error rate).
Hypotheses also set the cost of your experiment.
19. Questions to check for a good hypothesis
• What are you trying to show with your idea?
• What key metrics should it drive?
• Are all your goals and variations necessary given your testing limits?
20. At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
21. At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
Answer: A good hypothesis has a variation and a clearly defined goal which you crafted ahead of time.
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
22. Playbook roadmap: Mid-game (Outcomes & Error Rates).
25. The four possible outcomes of a hypothesis test:

                        "True" value of hypothesis
Result of test          Improvement        No effect
Winner / Loser          True positive      False positive
Inconclusive            False negative     True negative

(no effect, winner / loser) = false positive
(+/- improvement, inconclusive) = false negative
(+/- improvement, winner / loser) = true positive
(no effect, inconclusive) = true negative
30. The 2x2 table will help us to
1. Keep track of different error rates we care about
2. Explore the consequences of controlling false positives vs false
discoveries
32. • False positive rate (Type I error)
= "Chance of a false positive from a variation with no effect on a goal"
= #(False positives) / #(No effect)
• Thresholding the FPR
= "When I have a variation with no effect on a goal, I'll find an effect less than 10% of the time."
(2x2 table as above, false-positive cell highlighted)
36. How can we ever compute a False Positive Rate if we don't know whether a hypothesis is true or not?
Statistical tests (the fixed-horizon t-test, Stats Engine) are designed to threshold an error rate.
Example:
"Calling winners & losers only when the p-value is below .05 guarantees a False Positive Rate below 5%."
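To make that guarantee tangible, here is a minimal simulation (my sketch, not Stats Engine) of repeated A/A comparisons, where every winner or loser is a false positive by construction:

```python
# Simulate many A/A tests (no true effect) and check that a p < .05
# cutoff produces false positives about 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, visitors, baseline_cr = 2000, 10_000, 0.10

false_positives = 0
for _ in range(n_tests):
    control = rng.binomial(visitors, baseline_cr)
    variation = rng.binomial(visitors, baseline_cr)  # identical, no effect
    table = [[control, visitors - control],
             [variation, visitors - variation]]
    _, p, _, _ = stats.chi2_contingency(table, correction=False)
    if p < 0.05:
        false_positives += 1  # declared a winner/loser despite no effect

print(f"Observed False Positive Rate: {false_positives / n_tests:.3f}")  # ~0.05
```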
42. False positive rates are only useful in the context of all hypotheses.
(2x2 table as above)
44. • False discovery rate (FDR)
= "Chance of a false positive from a conclusive result"
= #(False positives) / #(Conclusives: winners & losers called)
• Thresholding the FDR
= "When you see a winning or losing goal on a variation, it's wrong less than 10% of the time."
(2x2 table as above, conclusive row highlighted)
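For intuition on how an FDR threshold is enforced, here is the classical fixed-horizon Benjamini-Hochberg procedure (my sketch for illustration; Stats Engine controls FDR with a sequential variant rather than this exact recipe):

```python
# Benjamini-Hochberg step-up procedure: call conclusives while keeping
# the expected False Discovery Rate below `fdr`.
import numpy as np

def benjamini_hochberg(p_values, fdr=0.10):
    """Return a boolean array: which hypotheses to call winners/losers."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    # Largest k such that the k-th smallest p-value <= (k/m) * fdr.
    thresholds = (np.arange(1, m + 1) / m) * fdr
    below = p[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# Ten goal/variation pairs: two with real effects, eight noise.
p_vals = [0.001, 0.004, 0.20, 0.35, 0.41, 0.52, 0.63, 0.74, 0.88, 0.95]
print(benjamini_hochberg(p_vals))  # only the first two come back conclusive
```

Note how adding more noisy p-values to the list tightens the per-hypothesis thresholds, which is exactly the "longer runtime with many low-signal goals" cost described above.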
49. False discovery rates remain useful regardless of the number of hypotheses.
(2x2 table as above)
50. What's the catch?
The more hypotheses (goals & variations) in your experiment, the longer it takes to find conclusive results.
53. What's the catch?
The more low-signal hypotheses (goals & variations) in your experiment, the longer it takes to find conclusive results.
54. Recap
• False Positive Rate thresholding
- controls the chance of a false positive when you have a hypothesis with no effect
- misrepresents your error rate with multiple goals and variations
• False Discovery Rate thresholding
- controls the chance of a false positive when you have a winning or losing hypothesis
- is accurate regardless of how many hypotheses you run
- can take longer to reach significance with more low-signal variations on goals
55. Tips & Tricks for running experiments with False Discovery Rates
• Ask: which goal is most important to me?
- Make it your primary goal (so it is not impacted by all the other goals)
• Run large tests, or large multivariate tests, without fear of finding spurious results, but be prepared for the cost of exploration
• A little human intuition and prior knowledge can go a long way towards reducing the runtime of your experiments
56. At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
57. At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
Answer: False Positive Rates have hidden sources of error when testing many goals and variations. False Discovery Rates correct this by increasing runtime when you have many noisy goals.
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
58. Playbook roadmap: Mid-game (Fundamental Tradeoff).
59. The "3 Levers" of A/B Testing
1. Threshold an error rate
• "I want no more than a 10% false discovery rate"
2. Detect smaller effect sizes (set a minimum detectable effect, MDE)
• "I'm OK with only detecting greater than 5% improvement"
3. Run tests longer
• "I can afford to run this test for 3 weeks, or 50,000 visitors"
60. Fundamental Tradeoff of A/B Testing
Error rates, runtime, and effect size / baseline conversion rate are all inversely related.
61. At any number of visitors, the less you threshold your error rate, the smaller the effect sizes you can detect.
62. At any error rate threshold, stopping your test earlier means you can only detect larger effect sizes.
63. For any effect size, the lower the error rate you want, the longer you need to run your test.
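A standard fixed-horizon sample size approximation (my addition, not from the deck; Stats Engine's sequential math differs) makes all three inverse relationships explicit:

```latex
n \;\approx\; \frac{2\,\big(z_{1-\alpha/2} + z_{1-\beta}\big)^{2}\, p\,(1-p)}{\delta^{2}}
```

Here n is visitors per branch, α the false positive threshold, β the false negative rate, p the baseline conversion rate, and δ the absolute effect size. Tightening either error rate grows the z-terms and hence the runtime, and halving δ roughly quadruples n.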
64. What does this look like in practice?
Average visitors needed to reach significance with Stats Engine (baseline conversion rate = 10%; improvement is relative):

Significance threshold     5%        10%       25%
95%                        62,400    13,500    1,800
90%                        59,100    12,800    1,700
80%                        52,600    11,400    1,500
65. All A/B testing platforms address the fundamental tradeoff …
1. Choose a minimum detectable effect (MDE) and false positive rate threshold
2. Find the required minimum sample size with a sample size calculator
3. Wait until the minimum sample size is reached
4. Look at your results once and only once
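A sketch of steps 1 and 2 with statsmodels (my example; the 80% power setting is a common convention, and the fixed-horizon result differs from the Stats Engine averages in the table above):

```python
# Classical fixed-horizon sample size for a two-proportion test
# (illustrative sketch; not Stats Engine's sequential calculation).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_cr = 0.10
relative_mde = 0.05  # detect a 5% relative lift: 10% -> 10.5%
alpha = 0.05         # false positive rate threshold
power = 0.80         # 1 - false negative rate (conventional choice)

effect = proportion_effectsize(baseline_cr * (1 + relative_mde), baseline_cr)
n_per_branch = NormalIndPower().solve_power(
    effect_size=effect, alpha=alpha, power=power, ratio=1.0,
)
print(f"~{n_per_branch:,.0f} visitors per branch needed")  # roughly 57,000
```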
66. Optimizely is the only platform
that lets you pull the levers in
real time
68. In the beginning, we make an educated guess …
Error rate threshold: 5%. Expected improvement: +5% on a 10% baseline conversion rate. Estimated runtime: 52,600 visitors.
69. … but then the improvement turns out to be better …
Error rate threshold: 5%. Observed improvement: +13% on a 16% baseline conversion rate. Remaining runtime: 1,600 visitors, instead of 52,600 - 7,200 = 45,400.
70. … or a lot worse.
Error rate threshold: 5%. Observed improvement: +2% on an 8% baseline conversion rate. Remaining runtime: more than 100,000 visitors.
71. Recap
• The Fundamental Tradeoff of A/B Testing affects you no matter what testing platform you use.
- If you want to detect a 5% improvement on a 10% baseline conversion rate, be prepared to wait for at least 50,000 visitors.
• Optimizely's Stats Engine is the only one that lets you adjust the tradeoff in real time while still reporting valid error rates.
72. At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
73. At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
Answer: All three are inversely related. For example, running my tests longer can get me lower error rates or let me detect smaller effects.
74. Playbook roadmap: Closing (Confidence Intervals).
78. Statistics in 40 Minutes: A/B Testing Fundamentals
Leo Pekelis
Statistician, Optimizely
@lpekelis
leonid@optimizely.com
#opticon2015
81. Definition:
A confidence interval is a range of values for your metric (revenue, conversion rate, etc.)
that is 90%* likely to contain the true difference between your variation and baseline.
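A minimal sketch of computing such an interval under a normal approximation (my example with made-up counts; Stats Engine's always-valid intervals are computed differently):

```python
# 90% confidence interval for the difference in conversion rates
# between variation and control, via the normal approximation.
import math

def diff_ci(conv_c, n_c, conv_v, n_v, z=1.645):  # z = 1.645 for 90% two-sided
    p_c, p_v = conv_c / n_c, conv_v / n_v
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    d = p_v - p_c
    return d - z * se, d + z * se

lo, hi = diff_ci(conv_c=1000, n_c=10_000, conv_v=1080, n_v=10_000)
print(f"90% CI for the difference: [{lo:+.3%}, {hi:+.3%}]")
# If the interval excludes zero, the variation is conclusive at this level.
```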
87. A confidence interval is the mirror image of statistical significance.
Mathematical definition:
The set of parameter values X such that a hypothesis test with the null hypothesis
H0: "Removing a distracting header will result in X more revenue per visitor"
is not yet rejected.
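In symbols, a standard restatement (my notation: δ is the true difference, α the significance threshold):

```latex
\mathrm{CI}_{1-\alpha} \;=\; \big\{\, x \;:\; H_0\!:\ \delta = x \ \text{ is not rejected at level } \alpha \,\big\}
```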
88. Tuning your testing strategy for your traffic and business
(Chart: a run of tests with observed effects of 6%, 4%, 3%, 1%, and 18%, most of them inconclusive.)
Looking for small effects?
• Tests take longer to reach significance
• Might find more winners, if you are willing to wait long enough
• Can you accept a higher error rate?
Testing for larger effects?
• Run more tests, faster
• It's time to move on if the effect size is not worth the visitors
• A confidence interval may be enough
• Change allocation to 10% to lessen the impact on new experiments
90. • False negative rate (Type II error)
= "Rate of false negatives from all hypotheses that could have been false negatives."
(2x2 table as above, false-negative cell highlighted)
93. • False negative rate (Type II error)
= "Rate of false negatives from all variations with an improvement on a goal."
= #(False negatives) / #(Improvement)
• Thresholding Type II error
= "When you have a goal on a variation with an effect, you miss it less than 10% of the time."
(2x2 table as above)
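Tying this back to the levers, a quick fixed-horizon sketch (my example, same statsmodels tools and assumptions as before) of the false negative rate implied by a fixed runtime:

```python
# Power (1 - false negative rate) of a classical fixed-horizon test at a
# fixed runtime; illustrative, not Stats Engine's sequential calculation.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# A +5% relative lift on a 10% baseline, with 26,300 visitors per branch
# (52,600 total, the 80%-threshold runtime from the earlier table).
effect = proportion_effectsize(0.105, 0.10)
power = NormalIndPower().solve_power(
    effect_size=effect, nobs1=26_300, alpha=0.05, ratio=1.0,
)
print(f"Power: {power:.0%} -> False Negative Rate: {1 - power:.0%}")
# Roughly 48% power: at this traffic, a real 5% lift is missed about half
# the time, which is why small effects demand long runtimes.
```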