6. The study followed 1,291 participants for 10 years.
No exercise: 438 with 128 deaths (29%)
Light exercise: 576 with 7 deaths (1%)
Moderate exercise: 262 with 8 deaths (3%)
Heavy exercise: 40 with 2 deaths (5%)
7. “Thank goodness a third person didn't die, or public health authorities would be banning jogging.”
– Alex Hutchinson, Runner’s World
11. The T-test
(a.k.a. “NHST”, a.k.a. “Student’s t-test”)
The T-test in a nutshell
1. Run your experiment until you have reached the required sample size, and then stop.
2. Ask “What are the chances I’d have gotten these results in an A/A test?” (p-value)
3. If p-value < 5%, your results are significant.
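The three steps above can be sketched as code. This is a minimal illustration using a large-sample z-test on conversion rates (which the t-test approaches at big sample sizes); all the numbers are made up for the example:

```python
import math

def z_test_conversions(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates
    (large-sample approximation to the t-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

# Step 1: stop at the pre-committed sample size (say 5,000 per arm).
# Step 2: compute the p-value. Step 3: compare against 5%.
p = z_test_conversions(conv_a=500, n_a=5000, conv_b=600, n_b=5000)
print(f"p-value: {p:.4f}, significant: {p < 0.05}")
```

The key discipline is in step 1: the p-value is only valid if you commit to the sample size up front and look exactly once.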
12. 1908: Data is expensive. Data is slow. Practitioners are trained.
2017: Data is cheap. Data is real-time. Practitioners are everyone.
The T-test was designed for the world of 1908.
16. Why is this a problem?
There is a ~5% chance of seeing a false
positive each time you peek.
17. [Timeline: the experiment starts, and at each of several peeks before the minimum sample size is reached, the p-value is > 5% (“Inconclusive”), until one peek shows p < 5% (“Significant!”).]
4 peeks —> ~18% chance of seeing a false positive
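The inflation from peeking is easy to demonstrate with a quick Monte Carlo simulation. This is a sketch with made-up peek sizes, not Optimizely's methodology; the exact inflated rate depends on how the peeks are spaced, since consecutive peeks are correlated:

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def aa_test_with_peeks(peeks, rng):
    """Simulate an A/A test, peeking at each sample size in `peeks`.
    Returns True if any peek shows p < 0.05 (a false positive)."""
    a, b = [], []
    n_prev = 0
    for n in peeks:
        a.extend(rng.gauss(0, 1) for _ in range(n - n_prev))
        b.extend(rng.gauss(0, 1) for _ in range(n - n_prev))
        n_prev = n
        diff = sum(a) / n - sum(b) / n
        z = diff / math.sqrt(2.0 / n)  # known unit variance per arm
        if two_sided_p(z) < 0.05:
            return True
    return False

rng = random.Random(0)
peeks = [250, 500, 750, 1000]  # four looks at the data
sims = 2000
fp = sum(aa_test_with_peeks(peeks, rng) for _ in range(sims)) / sims
print(f"false positive rate with 4 peeks: {fp:.1%}")  # well above 5%
```

Even though both arms are identical, the chance of declaring a winner at some peek is far above the nominal 5%.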
18. The T-test
(a.k.a. “NHST”, a.k.a. “Student’s t-test”)
The T-test in a nutshell
1. Run your experiment until you have reached the required sample size, and then stop.
2. Ask “What are the chances I’d have gotten these results in an A/A test?” (p-value)
3. If p-value < 5%, your results are significant.
24. [Grid: one cell per metric × variation — Metrics 1–5 across Variations A, B, C, D, and Control.]
25. False Positive Rate = P( 10% Lift | No Real Improvement )
“How likely are my results if I assume there is no underlying difference between my variation and control?”
False Discovery Rate = P( No Real Improvement | 10% Lift )
“How likely is it that my results are a fluke?”
Solution: Stats Engine controls the False Discovery Rate by becoming more conservative when more metrics and variations are added to a test.
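The distinction between the two rates can be made concrete with back-of-the-envelope arithmetic. The numbers here are hypothetical, not from the talk:

```python
# Hypothetical program: 100 metric/variation combinations, of which
# 10 have a real underlying improvement. Standard settings:
# 5% false positive rate, 80% power to detect real effects.
tests, real = 100, 10
alpha, power = 0.05, 0.80

false_positives = (tests - real) * alpha   # 90 * 0.05 = 4.5
true_positives = real * power              # 10 * 0.80 = 8.0
winners = false_positives + true_positives # 12.5 declared winners

fdr = false_positives / winners
print(f"False positive rate: {alpha:.0%}")            # 5%
print(f"False discovery rate among winners: {fdr:.0%}")  # 36%
```

Even with a 5% false positive rate per test, more than a third of the declared winners are flukes, because most tested combinations had no real effect. That is why controlling the false discovery rate, not just the false positive rate, matters as tests grow.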
26. Opticon 2017
How to make decisions with Stats Engine
When should I stop an experiment?
Understanding resets
How do additional variations and metrics affect my experiment?
How do I trade off between risk and velocity?
31. “Peeking at A/B Tests: Why it matters, and what to do about it” KDD 2017
👍 Statistical Significance rises whenever there is strong
evidence of a difference between variation and control
32. [Chart from “Peeking at A/B Tests: Why it matters, and what to do about it,” KDD 2017.]
34. [Confidence interval for Variation: −19.3% to −2.58%.]
👍 If your point estimate is near the edge of its confidence interval, consider running the experiment longer.
36. False Positive Rate = P( 10% Lift | No Real Improvement )
“How likely are my results if I assume there is no underlying difference between my variation and control?”
False Discovery Rate = P( No Real Improvement | 10% Lift )
“How likely is it that my results are a fluke?”
Solution: Stats Engine controls the False Discovery Rate by becoming more conservative when more metrics and variations are added to a test.
37. Stats Engine treats each metric as a “signal”.
High-signal metrics are directly affected by the experiment.
Low-signal metrics are indirectly or not at all affected by the experiment.
38. False Positive Rate = P( 10% Lift | No Real Improvement )
“How likely are my results if I assume there is no underlying difference between my variation and control?”
False Discovery Rate = P( No Real Improvement | 10% Lift )
“How likely is it that my results are a fluke?”
Solution: Stats Engine controls the False Discovery Rate by becoming more conservative when more low-signal metrics and variations are added to a test.
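One classical way an FDR procedure becomes more conservative as metrics are added is the Benjamini-Hochberg procedure. This sketch illustrates that general idea only; it is not Optimizely's actual sequential algorithm (described in the KDD 2017 paper):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of metrics declared significant while
    controlling the false discovery rate at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])

# With two metrics, both clear the bar.
print(benjamini_hochberg([0.001, 0.04]))             # [0, 1]
# Add eight low-signal metrics (p = 0.5 each): the bar per
# metric tightens, and only the strongest result survives.
print(benjamini_hochberg([0.001, 0.04] + [0.5] * 8))  # [0]
```

The p = 0.04 metric is significant on its own but not once eight uninformative metrics join the test, which mirrors the slide's point: piling on low-signal metrics makes the whole experiment more conservative.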
40. 👍 For maximum velocity, use “high signal” primary and secondary metrics.
👍 Track “low signal” metrics as monitoring metrics.
42. Max False Discovery Rate
👍 Use your Statistical Significance threshold to control risk vs. velocity.
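The trade-off can be made concrete with simple arithmetic, assuming (as the “Max False Discovery Rate” relationship on the slide suggests) that a significance threshold of T caps the false discovery rate at roughly 1 − T. The winner count is hypothetical:

```python
# Expected fluke winners among declared winners, by threshold.
winners_per_quarter = 20  # hypothetical experimentation program
for threshold in (0.80, 0.90, 0.95, 0.99):
    max_fdr = 1 - threshold           # e.g. 90% threshold -> 10% FDR cap
    expected_flukes = winners_per_quarter * max_fdr
    print(f"threshold {threshold:.0%}: up to ~{expected_flukes:.1f} "
          f"of {winners_per_quarter} winners may be flukes")
```

A lower threshold ships winners faster (velocity) at the cost of more flukes among them (risk); a higher threshold does the reverse.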
44. Risk vs. Velocity for Experimentation Programs
👍 Define “risk classes” for your team’s experiments
👍 Keep low-risk experiments “low touch”
👍 Save data-science analysis resources for high-risk experiments
👍 Run high-risk experiments for 1+ conversion cycles to control for seasonality
👍 Rerun high-risk experiments
45. Getting organizational buy-in
👍 Decide how and when you’ll share experiment results with your organization.
👍 Write down your “decision process” and socialize it with the team.