Optimizely Stats Engine: An overview and practical tips for running experiments

Optimizely Stats Engine
Leo Pekelis
Darwish Gani
Robin Pam

Housekeeping notes
• Chat box is available for questions
• There will be time for Q&A at the end
• We will be recording the webinar for future viewing
• All attendees will receive a copy of slides after the
webinar

Your speakers
Darwish Gani
Product manager
Robin Pam
Product marketing
Leo Pekelis
Statistician

Objectives
Understand why Optimizely built Stats Engine
Introduce the methods Stats Engine uses to
calculate results
Get practical recommendations for how to test with
Stats Engine

Meet Joe: A farmer who uses a traditional t-test

Joe wants to try a new fertilizer this year

With his original fertilizer, 10% of plants
survive the winter
He thinks that this new fertilizer might help
more survive.
Joe has a hypothesis

Joe calculates a sample size for his experiment in advance,
given how much better he thinks the new fertilizer might be

He waits through the winter…

10% of plants survive the winter
15% of plants survive
96% statistical signiﬁcance!
…and he is rewarded for his patience

Meet Kyle: Head of Optimization at Optimizely

Kyle doesn’t know what improvement to expect

Kyle also gets data from Optimizely all the time

Kyle wants to test many goals and variations at
once, instead of just one hypothesis
vs.

Actually that’s a lot of work.
It’s cumbersome and error-prone.

What’s your chance of making incorrect decision?

How we did it
• Partnered with Stanford
statisticians
• Talked with customers
• Examined historical
experiments
• Found the best methods
for real-time data

What does Stats Engine do?
Provides a principled and mathematical way to calculate your
chance of making an incorrect decision.

Sequential Testing False Discovery Rate
• First used in 1940s for military
weapons testing
• Sample size is not fixed in
advance
• Data is evaluated as it’s collected
• First used in 1990s in genetics
• Correct error rates for
multiple goals and variations
• Expected number of false
discoveries

Sequential Testing
False Discovery Rate control
+
=
Statistical Significance for Digital Experimentation

Statistical Significance for Digital Experimentation
Continuously Evaluate Test Results
Run many goals and variations
Don’t worry about estimating a
MDE upfront

Sequential Testing
Finding the right stopping rule

Variation #1
Variation #2
Declare a winner?
500
Visitors
50%
65%
Is this lift big enough
for the visitors I saw?
?

Desired Stopping Rule:
I will be “wrong” only 5% of the time.

Variation #1
Variation #2
1000
Visitors
55%
59%
Traditional Error Rates 5%
Find a stopping rule, so I declare
a winner incorrectly <5% of the
time

Variation #1
Variation #2
Traditional Error Rates
Visitors
500
50%
65%
5%
1000
55%
59%
5%
5000
52%
57%
5%
10000
54%
59%
5%

Variation #1
Variation #2
500 1000 5000 10000
Visitors
50% 55% 52% 54%
65% 59% 57% 59%
5% 5% 5% 5%
Look only once: 5% Error rate

Variation #1
Variation #2
500 1000 5000 10000
Visitors
50% 55% 52% 54%
65% 59% 57% 59%
5% 5% 5% 5%
Look only once: 5% Error rate
Look more than once: >5% Error rate

Variation #1
Variation #2
500 1000 5000 10000
Visitors
50% 55% 52% 54%
65% 59% 57% 59%
5% 5% 5% 5%
<5% Error rate for the entire test!
Sequential Testing
Error Rate
1% .5% 1.5% 1.5% < 5%

“P-Hacking”
“Continuous Monitoring”
“Repeated Significance Testing”
These are not new problems!

Steven Goodman, Stanford Physician & Statistician, nature.com
“The P value was never meant to be used the
way it's used today.”

Source: Evan Miller, How not to Run an A/B Test

Sample Size + Power Calculations
Focus on creating and running tests

Sequential Testing
• Continuously Evaluate Test Results
• Don’t worry about estimating a
MDE upfront
Framework of hypothesis testing that was created to allow the
experimenter to evaluate test results as they come in

False discovery rate control
Error rates for a world with many goals and
variations

Variations Goal 1 Goal 2 Goal 3 Goal 4 Goal 5
Control
Variation 1
Variation 2

Control
Variation 1
Variation 2
Significance Level 90 (False Positive Rate 10%)

Control
Variation 1
Variation 2
1 False Positive!

Control
Variation 1
Variation 2
1 False Positive!
1 other variation x goal has a large improvement.

Control
Variation 1
Variation 2
1 False Positive!
1 other variation x goal has a large improvement.
1 True Positive!

My Report
• Variation 2 is improving on Goal 1
“10% of what I
report could be wrong.”
X
50%
Furthermore, I found the following results.
This leads me to conclude that …

Statistical Significance:
The chance that any variation on any goal
that is reported as a winner or loser
is correct.

• “New York Times has a feature in its Tuesday science section, Take a Number … Today’s
column is in error … This is the old, old error of confusing p(A|B) with p(B|A).”
• Andrew Gelman, Misunderstanding the p-value
• “If I were to randomly select a drug out of the lot of 100, run it through my tests, and
discover a p<0.05 statistically significant benefit, there is only a 62% chance that the
drug is actually effective.”
• Alex Reinhart, The p value and the base rate fallacy
• “In this article I’ll show that badly performed A/B tests can produce winning results
which are more likely to be false than true. At best, this leads to the needless
modification of websites; at worst, to modification which damages profits.”
• Martin Goodson, Most Winning A/B Test Results are Illusory
• “An unguarded use of single-inference procedures results in a greatly increased false
positive (significance) rate”
• Benjamini, Yoav, and Yosef Hochberg. "Controlling the false discovery rate: a practical and powerful approach to
multiple testing." Journal of the Royal Statistical Society. Series B (Methodological) (1995): 289-300. APA

• “New York Times has a feature in its Tuesday science section, Take a Number … Today’s
column is in error … This is the old, old error of confusing p(A|B) with p(B|A).”
• Andrew Gelman, Misunderstanding the p-value
• “If I were to randomly select a drug out of the lot of 100, run it through my
tests, and discover a p<0.05 statistically significant benefit, there is only a
62% chance that the drug is actually effective.”
• Alex Reinhart, The p value and the base rate fallacy
• “In this article I’ll show that badly performed A/B tests can produce winning results
which are more likely to be false than true. At best, this leads to the needless
modification of websites; at worst, to modification which damages profits.”
• Martin Goodson, Most Winning A/B Test Results are Illusory
• “An unguarded use of single-inference procedures results in a greatly increased false
positive (significance) rate”
• Benjamini, Yoav, and Yosef Hochberg. "Controlling the false discovery rate: a practical and powerful approach to
multiple testing." Journal of the Royal Statistical Society. Series B (Methodological) (1995): 289-300. APA

False Discovery Rate control
Framework for controlling errors that arise from running multiple
experiments at once.
Run many goals & variations

What Stats Engine means for you
• You see
fewer, but more accurate conclusive results.
• You can
implement winners as soon as significance is reached.
• You get
• easy experiment workflow.
• reduced unforeseen, and hidden errors.

And now, for some practical guidance

First, some vocabulary
• Baseline conversion rate
The control group’s expected conversion rate.
• Minimum detectable effect
The smallest conversion rate difference it is
possible to detect in an A/B Test.
• Statistical significance
The likelihood that the observed difference in
conversion rates is not due to chance.
• Minimum sample size
The smallest number of visitors required to
reliably detect a given conversion rate
difference

Sample size calculators and statistical power

How many visitors do you need to see significant
results?
Visitors needed to reach signiﬁcance
with Stats Engine
Improvement
5% 10% 25%
Baseline
conversion rate
1% 458,900 101,600 13,000
5% 69,500 15,000 1,800
10% 29,200 6,200 700
25% 8,100 1,700 200
Lower conversion rate, lower effects = more visitors

One example of calculating your opportunity cost
12% minimum
detectable effect

Should you stop or continue a test?

Is my test
signiﬁcant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
signiﬁcance
Concede
inconclusive
Yes
No
No
Yes

Difference intervals
15.4
11.6Middle Ground
Best Case
Worst case
7.3

Seasonality
• We DO take into account seasonality while a test is
running.
• We DO NOT take into account future seasonality after an
experiment is stopped.

Is my test
signiﬁcant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
signiﬁcance
Concede
inconclusive
Yes
No
No
Yes
• Use Difference
Intervals to understand
the types of lifts you
could see.

A reasonable number of visitors left, relative to
your traffic?

Visitors Remaining Explained
Improvement
Increases in
Magnitude
Improvement
Stays the Same
Improvement
Decreases in
Magnitude

A good idea to wait
5761 + 3200
Note!

Is my test
signiﬁcant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
signiﬁcance
Concede
inconclusive
Yes
No
No
Yes
• Use Visitors Remaining
to evaluate if waiting
makes sense.

If you’re an organization that can
• Iterate quickly on new variations
• Run lots of experiments
• Have little downside risk of implementing non-winning
variations
then you can likely tolerate a higher error rate.

Difference intervals can guide your decision

Is my test
signiﬁcant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
signiﬁcance
Concede
inconclusive
Yes
No
No
Yes • Use Difference
Intervals to measure
risk you take on

> 100,000 visitors? What next?

Time to move on to the next idea

Recap
Is my test
signiﬁcant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
signiﬁcance
Concede
inconclusive
Yes
No
No
Yes
• Use Difference
could see.

Recap
Is my test
signiﬁcant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
signiﬁcance
Concede
inconclusive
Yes
No
No
Yes
• Use Difference
could see.
Can I afford
to wait?
Continue
Stop
Concede
inconclusive
Yes
No
No
makes sense.

Recap
Is my test
significant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
No
Yes
• Use Difference
could see.
Can I afford
to wait?
Continue
Stop
Concede
inconclusive
Yes
No
No
makes sense.
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
Can I afford
to wait?
• Use Difference
risk you take on

Recap
Is my test
significant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
No
Yes
• Use Difference
could see.
Can I afford
to wait?
Continue
Stop
Concede
inconclusive
Yes
No
No
makes sense.
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
Can I afford
to wait?
• Use Difference
risk you take on
makes sense.

Tuning your testing strategy for your traffic and
business
1%6% 4% 3%
inconclusive inconclusive inconclusiveinconclusive 18%
Looking for small effects?
• Tests take longer to reach
significance
• Might find more winners, if you
are willing to wait long enough
inconclusive
Testing for larger effects?
• Run more tests, faster
• Know when it’s time to move on
to the next idea

Do I need to re-run my historical tests?

Is this a one-tailed or two-tailed test? Why
did you switch?

Why does Stats Engine report 0%
Statistical Signiﬁcance when other tools
report higher values?

Why does Statistical Signiﬁcance increase
step-wise?

If I take the results I see in Optimizely and
plug them into any other statistics
calculator, the statistical signiﬁcance is
different. Why?

How does Stats Engine handle revenue
calculations?

Optimizely Stats Engine: An overview and practical tips for running experiments

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Mehr von Optimizely

Mehr von Optimizely (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Optimizely Stats Engine: An overview and practical tips for running experiments