6. The study followed 1,291 participants for 10 years.
No exercise: 438 with 128 deaths (29%)
Light exercise: 576 with 7 deaths (1%)
Moderate exercise: 262 with 8 deaths (3%)
Heavy exercise: 40 with 2 deaths (5%)
7. “Thank goodness a third person didn't die, or public health authorities would be banning jogging.”
– Alex Hutchinson, Runner’s World
11. The T-test
(a.k.a. “NHST”, a.k.a. “Student’s t-test”)
The T-test in a nutshell
1. Run your experiment until you have reached the required sample size, and then stop.
2. Ask “What are the chances I’d have gotten these results in an A/A test?” (p-value)
3. If p-value < 5%, your results are significant.
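The three steps above can be sketched as code. This is a minimal illustration using a large-sample z-test on conversion rates (which the t-test approaches at big sample sizes); all the numbers are made up for the example:

```python
import math

def z_test_conversions(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates
    (large-sample approximation to the t-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

# Step 1: stop at the pre-committed sample size (say 5,000 per arm).
# Step 2: compute the p-value. Step 3: compare against 5%.
p = z_test_conversions(conv_a=500, n_a=5000, conv_b=600, n_b=5000)
print(f"p-value: {p:.4f}, significant: {p < 0.05}")
```

The key discipline is in step 1: the p-value is only valid if you commit to the sample size up front and look exactly once.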
12. 1908: Data is expensive. Data is slow. Practitioners are trained.
2017: Data is cheap. Data is real-time. Practitioners are everyone.
The T-test was designed for the world of 1908.
16. Why is this a problem?
There is a ~5% chance of seeing a false
positive each time you peek.
17. [Timeline: the experiment starts, and at each of several peeks before the minimum sample size is reached, the p-value is > 5% (“Inconclusive”), until one peek shows p < 5% (“Significant!”).]
4 peeks —> ~18% chance of seeing a false positive
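The inflation from peeking is easy to demonstrate with a quick Monte Carlo simulation. This is a sketch with made-up peek sizes, not Optimizely's methodology; the exact inflated rate depends on how the peeks are spaced, since consecutive peeks are correlated:

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def aa_test_with_peeks(peeks, rng):
    """Simulate an A/A test, peeking at each sample size in `peeks`.
    Returns True if any peek shows p < 0.05 (a false positive)."""
    a, b = [], []
    n_prev = 0
    for n in peeks:
        a.extend(rng.gauss(0, 1) for _ in range(n - n_prev))
        b.extend(rng.gauss(0, 1) for _ in range(n - n_prev))
        n_prev = n
        diff = sum(a) / n - sum(b) / n
        z = diff / math.sqrt(2.0 / n)  # known unit variance per arm
        if two_sided_p(z) < 0.05:
            return True
    return False

rng = random.Random(0)
peeks = [250, 500, 750, 1000]  # four looks at the data
sims = 2000
fp = sum(aa_test_with_peeks(peeks, rng) for _ in range(sims)) / sims
print(f"false positive rate with 4 peeks: {fp:.1%}")  # well above 5%
```

Even though both arms are identical, the chance of declaring a winner at some peek is far above the nominal 5%.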
18. The T-test
(a.k.a. “NHST”, a.k.a. “Student’s t-test”)
The T-test in a nutshell
1. Run your experiment until you have reached the required sample size, and then stop.
2. Ask “What are the chances I’d have gotten these results in an A/A test?” (p-value)
3. If p-value < 5%, your results are significant.
24. [Grid: one cell per metric × variation — Metrics 1–5 across Variations A, B, C, D, and Control.]
25. False Positive Rate = P( 10% Lift | No Real Improvement )
“How likely are my results if I assume there is no underlying difference between my variation and control?”
False Discovery Rate = P( No Real Improvement | 10% Lift )
“How likely is it that my results are a fluke?”
Solution: Stats Engine controls the False Discovery Rate by becoming more conservative when more metrics and variations are added to a test.
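The distinction between the two rates can be made concrete with back-of-the-envelope arithmetic. The numbers here are hypothetical, not from the talk:

```python
# Hypothetical program: 100 metric/variation combinations, of which
# 10 have a real underlying improvement. Standard settings:
# 5% false positive rate, 80% power to detect real effects.
tests, real = 100, 10
alpha, power = 0.05, 0.80

false_positives = (tests - real) * alpha   # 90 * 0.05 = 4.5
true_positives = real * power              # 10 * 0.80 = 8.0
winners = false_positives + true_positives # 12.5 declared winners

fdr = false_positives / winners
print(f"False positive rate: {alpha:.0%}")            # 5%
print(f"False discovery rate among winners: {fdr:.0%}")  # 36%
```

Even with a 5% false positive rate per test, more than a third of the declared winners are flukes, because most tested combinations had no real effect. That is why controlling the false discovery rate, not just the false positive rate, matters as tests grow.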
26. Opticon 2017
How to make decisions with Stats Engine
When should I stop an experiment?
Understanding resets
How do additional variations and metrics affect my experiment?
How do I trade off between risk and velocity?
31. “Peeking at A/B Tests: Why it matters, and what to do about it” KDD 2017
👍 Statistical Significance rises whenever there is strong
evidence of a difference between variation and control
32. [Chart from “Peeking at A/B Tests: Why it matters, and what to do about it,” KDD 2017.]
34. [Confidence interval for Variation: −19.3% to −2.58%.]
👍 If your point estimate is near the edge of its confidence interval, consider running the experiment longer.
36. False Positive Rate = P( 10% Lift | No Real Improvement )
“How likely are my results if I assume there is no underlying difference between my variation and control?”
False Discovery Rate = P( No Real Improvement | 10% Lift )
“How likely is it that my results are a fluke?”
Solution: Stats Engine controls the False Discovery Rate by becoming more conservative when more metrics and variations are added to a test.
37. Stats Engine treats each metric as a “signal”.
High-signal metrics are directly affected by the experiment.
Low-signal metrics are indirectly or not at all affected by the experiment.
38. False Positive Rate = P( 10% Lift | No Real Improvement )
“How likely are my results if I assume there is no underlying difference between my variation and control?”
False Discovery Rate = P( No Real Improvement | 10% Lift )
“How likely is it that my results are a fluke?”
Solution: Stats Engine controls the False Discovery Rate by becoming more conservative when more low-signal metrics and variations are added to a test.
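One classical way an FDR procedure becomes more conservative as metrics are added is the Benjamini-Hochberg procedure. This sketch illustrates that general idea only; it is not Optimizely's actual sequential algorithm (described in the KDD 2017 paper):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of metrics declared significant while
    controlling the false discovery rate at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])

# With two metrics, both clear the bar.
print(benjamini_hochberg([0.001, 0.04]))             # [0, 1]
# Add eight low-signal metrics (p = 0.5 each): the bar per
# metric tightens, and only the strongest result survives.
print(benjamini_hochberg([0.001, 0.04] + [0.5] * 8))  # [0]
```

The p = 0.04 metric is significant on its own but not once eight uninformative metrics join the test, which mirrors the slide's point: piling on low-signal metrics makes the whole experiment more conservative.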
40. 👍 For maximum velocity, use “high signal” primary and secondary metrics.
👍 Track “low signal” metrics as monitoring metrics.
42. Max False Discovery Rate
👍 Use your Statistical Significance threshold to control risk vs. velocity.
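The trade-off can be made concrete with simple arithmetic, assuming (as the “Max False Discovery Rate” relationship on the slide suggests) that a significance threshold of T caps the false discovery rate at roughly 1 − T. The winner count is hypothetical:

```python
# Expected fluke winners among declared winners, by threshold.
winners_per_quarter = 20  # hypothetical experimentation program
for threshold in (0.80, 0.90, 0.95, 0.99):
    max_fdr = 1 - threshold           # e.g. 90% threshold -> 10% FDR cap
    expected_flukes = winners_per_quarter * max_fdr
    print(f"threshold {threshold:.0%}: up to ~{expected_flukes:.1f} "
          f"of {winners_per_quarter} winners may be flukes")
```

A lower threshold ships winners faster (velocity) at the cost of more flukes among them (risk); a higher threshold does the reverse.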
44. Risk vs. Velocity for Experimentation Programs
👍 Define “risk classes” for your team’s experiments
👍 Keep low-risk experiments “low touch”
👍 Save data-science analysis resources for high-risk experiments
👍 Run high-risk experiments for 1+ conversion cycles to control for seasonality
👍 Rerun high-risk experiments
45. Getting organizational buy-in
👍 Decide how and when you’ll share experiment results with your organization.
👍 Write down your “decision process” and socialize it with the team.