Since it was introduced in 2014, Stats Engine has served as a fast, powerful, and easy-to-use foundation for tens of thousands of digital experiments. But how exactly does it work?
In this session, we will explain the key differences and advantages of Stats Engine by comparing and contrasting it with a familiar old friend: the t-test.
1. Tale of Two Tests
2. Tale of Two Tests
Jimmy Jin
Statistician
Mei Luo
Strategic Customer Success Manager
3. Experimentation in the Digital Age
• You want to run an experiment on the background image on the
homepage of your e-commerce clothing site, Attic & Button
• As a practitioner, you would want...
4. Results in real time
• Evaluate impact on multiple KPIs
• Run experiment with minimal data inputs
6. Review of basic terms
• p-value: The probability of observing a result at least as extreme as
the one you saw, assuming there is no true difference
• False positive rate (Type 1 error rate): “How often will
the test detect an illusion?”
• Power (1 minus Type 2 error rate): “How often will the test
detect the real thing?”
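The definitions above can be made concrete with a small sketch. The example below is illustrative only (it is not Optimizely's implementation, and the visitor counts are made up); it computes a two-sided p-value for a two-proportion z-test on conversion counts:

```python
from math import erf, sqrt

def norm_cdf(x):
    # Standard normal CDF, built from the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    # Two-sided p-value: chance of a difference at least this extreme
    # if both variations truly convert at the same rate.
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2.0 * (1.0 - norm_cdf(abs(z)))

# Hypothetical data: 10% vs. 13% conversion over 2,000 visitors each.
p = two_proportion_p_value(200, 2000, 260, 2000)
print(round(p, 4))
```

A small p-value here says the observed gap would be rare if the two variations were truly identical; it says nothing yet about how often such "detections" are illusions, which is where the error rates above come in.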
7. Steps to doing a t-test
1. Calculate the required sample size for your A/B test
   • Depends on the minimum detectable effect (MDE)
2. Collect your data
3. Make a decision
Continuing past the prescribed sample size or
stopping early is NOT allowed.
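Step 1 is where the guesswork lives. A minimal sketch of the classical fixed-horizon calculation, using the standard normal-approximation formula; the constants 1.96 and 0.84 correspond to a 5% two-sided false positive rate and 80% power, which are conventional assumptions rather than values from this deck:

```python
from math import ceil

def sample_size_per_variation(baseline_rate, mde_abs, z_alpha=1.96, z_beta=0.84):
    # z_alpha = 1.96 -> 5% two-sided false positive rate
    # z_beta  = 0.84 -> 80% power
    p = baseline_rate
    variance = 2.0 * p * (1.0 - p)  # approximate pooled variance of the difference
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Halving the MDE roughly quadruples the required sample size:
print(sample_size_per_variation(0.10, 0.02))  # detect a 2-point absolute lift
print(sample_size_per_variation(0.10, 0.01))  # detect a 1-point absolute lift
```

The quadratic dependence on the MDE is the root of the guessing problem discussed later: guess the MDE too small and the test drags on; guess it too large and smaller real improvements go undetected.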
16. Summary – The Peeking Problem
t-test:
• Peeking during a t-test increases the chance you’ll find a winning
result when none actually exists (a false positive)
Stats Engine:
• Sequential testing enables evaluation of experiment data as it is
collected. Tests can be stopped at any time with valid results.
28. Summary – The Guessing Problem
t-test:
• If you set a small MDE, tests will take longer to conclude. If you set a
large MDE, you may miss smaller improvements.
Stats Engine:
• When the true lift exceeds your MDE, you’ll be able to call your test
faster.
31. Built-in protections in Stats Engine
Ordinarily, corrections for multiple comparisons happen
after all tests have concluded.
In Optimizely, we perform these corrections in real time,
so your results are protected no matter when you
check your experiment.
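For contrast, here is the classical one-shot style of correction mentioned above: the Benjamini-Hochberg step-up procedure, a standard offline method for controlling the false discovery rate. This is shown purely for illustration and is not a description of how Stats Engine's real-time corrections work:

```python
def benjamini_hochberg(p_values, q=0.05):
    # Step-up procedure: find the largest rank k with p_(k) <= q*k/m,
    # then declare the k smallest p-values significant.
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])  # indices of significant hypotheses

# Five hypothetical metric p-values from one experiment:
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
```

Note how adding more comparisons shrinks the per-test threshold q·k/m, which is the same "more metrics, more conservative" behavior described in the summary slides.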
32. Activity 3: false positives
You conduct an experiment with many variations. Under
which scenario would you suspect more false positives?
1. You obtain 5 significant results.
2. You obtain 50 significant results.
33. False positives vs. false discoveries
False positive rate
P( significant | no true effect )
False discovery rate
P( no true effect | significant )
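The gap between these two quantities is easiest to see with Bayes' rule. A worked sketch with assumed, purely illustrative inputs (10% of tested variations truly have an effect, α = 0.05, 80% power):

```python
def false_discovery_rate(prior_true=0.10, alpha=0.05, power=0.80):
    # P(no true effect | significant), via Bayes' rule.
    true_positives = prior_true * power            # real effects we detect
    false_positives = (1.0 - prior_true) * alpha   # illusions we "detect"
    return false_positives / (true_positives + false_positives)

print(round(false_discovery_rate(), 3))
```

Even with the false positive rate held at 5%, more than a third of the "significant" results in this scenario are illusory, which is why the false discovery rate is the quantity that tracks the risk of a wrong business decision.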
36. Example: FDR tiering in an actual experiment
Let’s walk through an actual experiment!
37. Summary – The Multiple Comparisons Problem
t-test:
• Traditional statistics control the false positive rate, which does not
equate to the probability of making an incorrect business decision
Stats Engine:
• Stats Engine controls for false discovery rate; as you add more metrics
to your experiment, Optimizely will become more conservative in
calling a winner or loser
39. 3 Takeaways
Stats Engine allows you to...
• Monitor results in real time for faster experimentation,
without increased error rates
• Run fully powered experiments without guessing at
sample size calculations
• Evaluate impact on many metrics without
sacrificing accuracy