Optimizely Stats Engine
Leo Pekelis
Darwish Gani
Robin Pam
Housekeeping notes
• Chat box is available for questions
• There will be time for Q&A at the end
• We will be recording the webinar for future viewing
• All attendees will receive a copy of slides after the
webinar
Your speakers
Darwish Gani
Product manager
Robin Pam
Product marketing
Leo Pekelis
Statistician
Objectives
Understand why Optimizely built Stats Engine
Introduce the methods Stats Engine uses to
calculate results
Get practical recommendations for how to test with
Stats Engine
Why make a new Stats Engine?
Meet Joe: A farmer who uses a traditional t-test
Joe wants to try a new fertilizer this year
With his original fertilizer, 10% of plants
survive the winter
He thinks that this new fertilizer might help
more survive.
Joe has a hypothesis
Joe calculates a sample size for his experiment in advance,
given how much better he thinks the new fertilizer might be
He waits through the winter…
10% of plants survive the winter
15% of plants survive
96% statistical significance!
…and he is rewarded for his patience
Does that work today?
Meet Kyle: Head of Optimization at Optimizely
Kyle doesn’t know what improvement to expect
Kyle also gets data from Optimizely all the time
Kyle wants to test many goals and variations at
once, instead of just one hypothesis
vs.
Actually that’s a lot of work.
It’s cumbersome and error-prone.
What’s your chance of making an incorrect decision?
30%
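A 30% figure is consistent with the familywise chance of at least one false positive when several independent goal/variation combinations are each checked at a 5% significance level. The count of seven combinations below is a hypothetical choice that reproduces the number, not something stated in the deck:

```python
# Chance of at least one false positive across m independent
# goal x variation checks, each run at a 5% significance level.
# m = 7 is a hypothetical count chosen to illustrate the ~30% figure.
alpha, m = 0.05, 7
p_any_false = 1 - (1 - alpha) ** m
print(f"{p_any_false:.0%}")  # → 30%
```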
Objectives
Understand why Optimizely built Stats Engine
Introduce the methods Stats Engine uses to
calculate results
Get practical recommendations for how to test with
Stats Engine
Introducing Stats Engine
How we did it
• Partnered with Stanford
statisticians
• Talked with customers
• Examined historical
experiments
• Found the best methods
for real-time data
What does Stats Engine do?
Provides a principled and mathematical way to calculate your
chance of making an incorrect decision.
Sequential Testing False Discovery Rate
• First used in 1940s for military
weapons testing
• Sample size is not fixed in
advance
• Data is evaluated as it’s collected
• First used in 1990s in genetics
• Correct error rates for
multiple goals and variations
• Expected number of false
discoveries
Sequential Testing
False Discovery Rate control
+
=
Statistical Significance for Digital Experimentation
Statistical Significance for Digital Experimentation
Continuously Evaluate Test Results
Run many goals and variations
Don’t worry about estimating an MDE upfront
Sequential Testing
Finding the right stopping rule
Variation #1
Variation #2
Declare a winner?
500
Visitors
50%
65%
Is this lift big enough
for the visitors I saw?
?
Desired Stopping Rule:
I will be “wrong” only 5% of the time.
Variation #1
Variation #2
1000
Visitors
55%
59%
Traditional Error Rates 5%
Find a stopping rule so that I incorrectly
declare a winner less than 5% of the
time
Variation #1
Variation #2
Traditional Error Rates
Visitors
500
50%
65%
5%
1000
55%
59%
5%
5000
52%
57%
5%
10000
54%
59%
5%
Variation #1
Variation #2
Traditional Error Rates
500 1000 5000 10000
Visitors
50% 55% 52% 54%
65% 59% 57% 59%
5% 5% 5% 5%
Look only once: 5% Error rate
Variation #1
Variation #2
Traditional Error Rates
500 1000 5000 10000
Visitors
50% 55% 52% 54%
65% 59% 57% 59%
5% 5% 5% 5%
Look only once: 5% Error rate
Look more than once: >5% Error rate
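The inflation from looking more than once shows up in a quick A/A simulation (illustrative only, not Stats Engine’s method): both variations convert at the same 10% rate, yet repeatedly applying a fixed-horizon two-proportion z-test at each peek declares a “winner” far more often than the nominal 5%.

```python
import random
from statistics import NormalDist

def z_pvalue(c1, n1, c2, n2):
    # Two-sided p-value for a pooled two-proportion z-test.
    p_pool = (c1 + c2) / (n1 + n2)
    se = (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
    if se == 0:
        return 1.0
    z = (c2 / n2 - c1 / n1) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(7)
SIMS, N, PEEK = 500, 4000, 500   # A/A tests: both arms convert at 10%
ever_sig = final_sig = 0
for _ in range(SIMS):
    c1 = c2 = 0
    hit = False
    for n in range(1, N + 1):
        c1 += random.random() < 0.10
        c2 += random.random() < 0.10
        if n % PEEK == 0 and z_pvalue(c1, n, c2, n) < 0.05:
            hit = True        # declared significant at some peek
    ever_sig += hit
    final_sig += z_pvalue(c1, N, c2, N) < 0.05
print(f"look once: {final_sig/SIMS:.1%}, "
      f"peek every {PEEK}: {ever_sig/SIMS:.1%}")
```

Looking only at the final sample stays near 5%; counting any peek that crossed the threshold pushes the error rate well above it.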
Variation #1
Variation #2
Traditional Error Rates
500 1000 5000 10000
Visitors
50% 55% 52% 54%
65% 59% 57% 59%
5% 5% 5% 5%
<5% Error rate for the entire test!
Sequential Testing
Error Rate
1% + 0.5% + 1.5% + 1.5% < 5%
“P-Hacking”
“Continuous Monitoring”
“Repeated Significance Testing”
These are not new problems!
Steven Goodman, Stanford Physician & Statistician, nature.com
“The P value was never meant to be used the
way it's used today.”
Source: Evan Miller, How Not to Run an A/B Test
Sample Size + Power Calculations
Focus on creating and running tests
Sequential Testing
• Continuously Evaluate Test Results
• Don’t worry about estimating an MDE upfront
A hypothesis-testing framework created to allow the
experimenter to evaluate test results as they come in
False discovery rate control
Error rates for a world with many goals and
variations
Variations Goal 1 Goal 2 Goal 3 Goal 4 Goal 5
Control
Variation 1
Variation 2
Variations Goal 1 Goal 2 Goal 3 Goal 4 Goal 5
Control
Variation 1
Variation 2
Significance Level 90 (False Positive Rate 10%)
Variations Goal 1 Goal 2 Goal 3 Goal 4 Goal 5
Control
Variation 1
Variation 2
Significance Level 90 (False Positive Rate 10%)
1 False Positive!
Variations Goal 1 Goal 2 Goal 3 Goal 4 Goal 5
Control
Variation 1
Variation 2
1 False Positive!
Significance Level 90 (False Positive Rate 10%)
1 other variation x goal has a large improvement.
Variations Goal 1 Goal 2 Goal 3 Goal 4 Goal 5
Control
Variation 1
Variation 2
1 False Positive!
Significance Level 90 (False Positive Rate 10%)
1 other variation x goal has a large improvement.
1 True Positive!
My Report
• Variation 2 is improving on Goal 1
• Variation 1 is improving on Goal 4
“10% of what I
report could be wrong.”
X
50%
• Variation 2 is improving on Goal 1
• Variation 1 is improving on Goal 4
Furthermore, I found the following results.
This leads me to conclude that …
Statistical Significance:
The chance that any variation on any goal
that is reported as a winner or loser
is correct.
• “New York Times has a feature in its Tuesday science section, Take a Number … Today’s
column is in error … This is the old, old error of confusing p(A|B) with p(B|A).”
• Andrew Gelman, Misunderstanding the p-value
• “If I were to randomly select a drug out of the lot of 100, run it through my tests, and
discover a p<0.05 statistically significant benefit, there is only a 62% chance that the
drug is actually effective.”
• Alex Reinhart, The p value and the base rate fallacy
• “In this article I’ll show that badly performed A/B tests can produce winning results
which are more likely to be false than true. At best, this leads to the needless
modification of websites; at worst, to modification which damages profits.”
• Martin Goodson, Most Winning A/B Test Results are Illusory
• “An unguarded use of single-inference procedures results in a greatly increased false
positive (significance) rate”
• Benjamini, Yoav, and Yosef Hochberg. "Controlling the false discovery rate: a practical and powerful approach to
multiple testing." Journal of the Royal Statistical Society. Series B (Methodological) (1995): 289-300. APA
False Discovery Rate control
Framework for controlling errors that arise from running multiple
experiments at once.
Run many goals & variations
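The classic fixed-sample version of FDR control is the Benjamini-Hochberg step-up procedure from the 1995 paper quoted earlier; Stats Engine adapts FDR control to streaming data, so this sketch shows only the textbook procedure, with made-up p-values:

```python
def benjamini_hochberg(pvalues, q=0.10):
    """Return indices of discoveries under BH FDR control at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * q.
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            k = rank
    return sorted(order[:k])

# 15 hypothetical goal x variation p-values, two real effects in noise
pvals = [0.001, 0.64, 0.23, 0.009, 0.48, 0.81, 0.37, 0.12,
         0.55, 0.91, 0.30, 0.72, 0.04, 0.66, 0.19]
print(benjamini_hochberg(pvals, q=0.10))  # → [0, 3]
```

Only the two smallest p-values survive; a naive per-comparison 10% cutoff would also have reported 0.04 as a winner.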
What Stats Engine means for you
• You see
fewer, but more accurate conclusive results.
• You can
implement winners as soon as significance is reached.
• You get
• an easy experiment workflow.
• fewer unforeseen and hidden errors.
Objectives
Understand why Optimizely built Stats Engine
Introduce the methods Stats Engine uses to
calculate results
Get practical recommendations for how to test with
Stats Engine
And now, for some practical guidance
First, some vocabulary
• Baseline conversion rate
The control group’s expected conversion rate.
• Minimum detectable effect
The smallest conversion rate difference it is
possible to detect in an A/B Test.
• Statistical significance
The likelihood that the observed difference in
conversion rates is not due to chance.
• Minimum sample size
The smallest number of visitors required to
reliably detect a given conversion rate
difference.
Sample size calculators and statistical power
How many visitors do you need to see significant
results?
Visitors needed to reach significance
with Stats Engine
Improvement
5% 10% 25%
Baseline
conversion rate
1% 458,900 101,600 13,000
5% 69,500 15,000 1,800
10% 29,200 6,200 700
25% 8,100 1,700 200
Lower conversion rates and smaller effects = more visitors
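The table’s visitor counts come from Stats Engine’s sequential calculations; for comparison, the classical fixed-horizon sample size for a two-proportion test can be sketched as below (stdlib only; the exact numbers differ from the table because the methods differ):

```python
from math import ceil
from statistics import NormalDist

def fixed_horizon_sample_size(baseline, mde_rel, alpha=0.05, power=0.8):
    """Visitors per variation for a classical two-proportion z-test.
    baseline: control conversion rate; mde_rel: relative lift to detect."""
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided alpha
    z_b = NormalDist().inv_cdf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_a + z_b) ** 2 * var / (p2 - p1) ** 2)

# 10% baseline, 10% relative improvement, 95% significance, 80% power
print(fixed_horizon_sample_size(0.10, 0.10))
```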
One example of calculating your opportunity cost
12% minimum
detectable effect
Original
Variation
Should you stop or continue a test?
Should you stop or continue a test?
Is my test
significant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
No
Yes
Should you stop or continue a test?
Is my test
significant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
No
Yes
Difference intervals
Best case: 15.4
Middle ground: 11.6
Worst case: 7.3
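A fixed-horizon analogue of a difference interval is the normal-approximation confidence interval for the difference in conversion rates; a sketch with hypothetical counts (Stats Engine’s sequential intervals are wider, so they stay valid under continuous monitoring):

```python
from statistics import NormalDist

def diff_interval(c1, n1, c2, n2, level=0.95):
    """Normal-approximation CI for the difference in conversion rates."""
    p1, p2 = c1 / n1, c2 / n2
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    d = p2 - p1
    return d - z * se, d + z * se

# Hypothetical: 100/1000 conversions on control vs 130/1000 on variation
lo, hi = diff_interval(100, 1000, 130, 1000)
print(f"lift is between {lo:+.1%} and {hi:+.1%}")
```

The worst-case end of the interval is what matters when deciding whether you can afford to implement now versus wait for more data.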
Seasonality
• We DO take into account seasonality while a test is
running.
• We DO NOT take into account future seasonality after an
experiment is stopped.
Should you stop or continue a test?
Is my test
significant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
No
Yes
• Use Difference
Intervals to understand
the types of lifts you
could see.
Should you stop or continue a test?
Is my test
significant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
No
Yes
A reasonable number of visitors left, relative to
your traffic?
Visitors Remaining Explained
Improvement
Increases in
Magnitude
Improvement
Stays the Same
Improvement
Decreases in
Magnitude
A good idea to wait
Should you stop or continue a test?
Is my test
significant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
No
Yes
• Use Visitors Remaining
to evaluate if waiting
makes sense.
Should you stop or continue a test?
Is my test
significant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
No
Yes
If you’re an organization that can
• Iterate quickly on new variations
• Run lots of experiments
• Absorb the downside of implementing non-winning
variations
then you can likely tolerate a higher error rate.
Difference intervals can guide your decision
Should you stop or continue a test?
Is my test
significant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
No
Yes • Use Difference
Intervals to measure
risk you take on
> 100,000 visitors? What next?
Time to move on to the next idea
Should you stop or continue a test?
Is my test
significant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
No
Yes
• Use Visitors Remaining
to evaluate if waiting
makes sense.
Recap
Is my test
significant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
No
Yes
• Use Difference
Intervals to understand
the types of lifts you
could see.
Recap
Is my test
significant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
No
Yes
• Use Difference
Intervals to understand
the types of lifts you
could see.
Can I afford
to wait?
Continue
Stop
Concede
inconclusive
Yes
No
No
• Use Visitors Remaining
to evaluate if waiting
makes sense.
Recap
Is my test
significant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
No
Yes
• Use Difference
Intervals to understand
the types of lifts you
could see.
Can I afford
to wait?
Continue
Stop
Concede
inconclusive
Yes
No
No
• Use Visitors Remaining
to evaluate if waiting
makes sense.
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
Can I afford
to wait?
• Use Difference
Intervals to measure
risk you take on
Recap
Is my test
significant?
Congrats
Can I afford
to wait?
Continue
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
No
Yes
• Use Difference
Intervals to understand
the types of lifts you
could see.
Can I afford
to wait?
Continue
Stop
Concede
inconclusive
Yes
No
No
• Use Visitors Remaining
to evaluate if waiting
makes sense.
Stop
Accept lower
significance
Concede
inconclusive
Yes
No
Can I afford
to wait?
• Use Difference
Intervals to measure
risk you take on
• Use Visitors Remaining
to evaluate if waiting
makes sense.
Tuning your testing strategy for your traffic and
business
Looking for small effects?
• Tests take longer to reach
significance
• Might find more winners, if you
are willing to wait long enough
Testing for larger effects?
• Run more tests, faster
• Know when it’s time to move on
to the next idea
Frequently asked questions
Do I need to re-run my historical tests?
Is this a one-tailed or two-tailed test? Why
did you switch?
Why does Stats Engine report 0%
Statistical Significance when other tools
report higher values?
Why does Statistical Significance increase
step-wise?
If I take the results I see in Optimizely and
plug them into any other statistics
calculator, the statistical significance is
different. Why?
How does Stats Engine handle revenue
calculations?
Questions?
Optimizely Stats Engine: An overview and practical tips for running experiments

  • 1. Optimizely Stats Engine Leo Pekelis Darwish Gani Robin Pam
  • 2. Housekeeping notes • Chat box is available for questions • There will be time for Q&A at the end • We will be recording the webinar for future viewing • All attendees will receive a copy of slides after the webinar
  • 3. Your speakers Darwish Gani Product manager Robin Pam Product marketing Leo Pekelis Statistician
  • 4. Objectives Understand why Optimizely built Stats Engine Introduce the methods Stats Engine uses to calculate results Get practical recommendations for how to test with Stats Engine
  • 5. Why make a new Stats Engine?
  • 6. Meet Joe: A farmer who uses a traditional t-test
  • 7. Joe wants to try a new fertilizer this year
  • 8. With his original fertilizer, 10% of plants survive the winter He thinks that this new fertilizer might help more survive. Joe has a hypothesis
  • 9. Joe calculates a sample size for his experiment in advance, given how much better he thinks the new fertilizer might be
  • 10. He waits through the winter…
  • 11. 10% of plants survive the winter 15% of plants survive 96% statistical significance! …and he is rewarded for his patience
  • 12. Does that work today?
  • 13. Meet Kyle: Head of Optimization at Optimizely
  • 14. Kyle doesn’t know what improvement to expect
  • 15. Kyle also gets data from Optimizely all the time
  • 16. Kyle wants to test many goals and variations at once, instead of just one hypothesis vs.
  • 17. Actually that’s a lot of work. It’s cumbersome and error-prone.
  • 18. What’s your chance of making incorrect decision?
  • 19. 30%
  • 20. Objectives Understand why Optimizely built Stats Engine Introduce the methods Stats Engine uses to calculate results Get practical recommendations for how to test with Stats Engine
  • 22. How we did it • Partnered with Stanford statisticians • Talked with customers • Examined historical experiments • Found the best methods for real-time data
  • 23. What does Stats Engine do? Provides a principled and mathematical way to calculate your chance of making an incorrect decision.
  • 24. Sequential Testing False Discovery Rate • First used in 1940s for military weapons testing • Sample size is not fixed in advance • Data is evaluated as it’s collected • First used in 1990s in genetics • Correct error rates for multiple goals and variations • Expected number of false discoveries
  • 25. Sequential Testing False Discovery Rate control + = Statistical Significance for Digital Experimentation
  • 26. Statistical Significance for Digital Experimentation Continuously Evaluate Test Results Run many goals and variations Don’t worry about estimating a MDE upfront
  • 27. Sequential Testing Finding the right stopping rule
  • 28. Variation #1 Variation #2 Declare a winner? 500 Visitors 50% 65% Is this lift big enough for the visitors I saw? ?
  • 29. Desired Stopping Rule: I will be “wrong” only 5% of the time.
  • 30. Variation #1 Variation #2 1000 Visitors 55% 59% Traditional Error Rates 5% Find a stopping rule, so I declare a winner incorrectly <5% of the time
  • 31. Variation #1 Variation #2 Traditional Error Rates Visitors 500 50% 65% 5% 1000 55% 59% 5% 5000 52% 57% 5% 10000 54% 59% 5%
  • 32. Variation #1 Variation #2 Traditional Error Rates 500 1000 5000 10000 Visitors 50% 55% 52% 54% 65% 59% 57% 59% 5% 5% 5% 5% Look only once: 5% Error rate
  • 33. Variation #1 Variation #2 Traditional Error Rates 500 1000 5000 10000 Visitors 50% 55% 52% 54% 65% 59% 57% 59% 5% 5% 5% 5% Look only once: 5% Error rate Look more than once: >5% Error rate
  • 34. Variation #1 Variation #2 Traditional Error Rates 500 1000 5000 10000 Visitors 50% 55% 52% 54% 65% 59% 57% 59% 5% 5% 5% 5% <5% Error rate for the entire test! Sequential Testing Error Rate 1% .5% 1.5% 1.5% < 5%
  • 36. Steven Goodman, Stanford Physician & Statistician, nature.com “The P value was never meant to be used the way it's used today.”
  • 37. Source: Evan Miller, How not to Run an A/B Test
  • 38.
  • 39. Sample Size + Power Calculations Focus on creating and running tests
  • 40. Sequential Testing • Continuously Evaluate Test Results • Don’t worry about estimating a MDE upfront Framework of hypothesis testing that was created to allow the experimenter to evaluate test results as they come in
  • 41. False discovery rate control Error rates for a world with many goals and variations
  • 42. Variations Goal 1 Goal 2 Goal 3 Goal 4 Goal 5 Control Variation 1 Variation 2
  • 43. Variations Goal 1 Goal 2 Goal 3 Goal 4 Goal 5 Control Variation 1 Variation 2 Significance Level 90 (False Positive Rate 10%)
  • 44. Variations Goal 1 Goal 2 Goal 3 Goal 4 Goal 5 Control Variation 1 Variation 2 Significance Level 90 (False Positive Rate 10%) 1 False Positive!
  • 45. Variations Goal 1 Goal 2 Goal 3 Goal 4 Goal 5 Control Variation 1 Variation 2 1 False Positive! Significance Level 90 (False Positive Rate 10%) 1 other variation x goal has a large improvement.
  • 46. Variations Goal 1 Goal 2 Goal 3 Goal 4 Goal 5 Control Variation 1 Variation 2 1 False Positive! Significance Level 90 (False Positive Rate 10%) 1 other variation x goal has a large improvement. 1 True Positive!
  • 47. My Report • Variation 2 is improving on Goal 1 • Variation 1 is improving on Goal 4 “10% of what I report could be wrong.” X 50% • Variation 2 is improving on Goal 1 • Variation 1 is improving on Goal 4 Furthermore, I found the following results. This leads me to conclude that …
  • 48. Statistical Significance: The chance that any variation on any goal that is reported as a winner or loser is correct.
  • 49. • “New York Times has a feature in its Tuesday science section, Take a Number … Today’s column is in error … This is the old, old error of confusing p(A|B) with p(B|A).” • Andrew Gelman, Misunderstanding the p-value • “If I were to randomly select a drug out of the lot of 100, run it through my tests, and discover a p<0.05 statistically significant benefit, there is only a 62% chance that the drug is actually effective.” • Alex Reinhart, The p value and the base rate fallacy • “In this article I’ll show that badly performed A/B tests can produce winning results which are more likely to be false than true. At best, this leads to the needless modification of websites; at worst, to modification which damages profits.” • Martin Goodson, Most Winning A/B Test Results are Illusory • “An unguarded use of single-inference procedures results in a greatly increased false positive (significance) rate” • Benjamini, Yoav, and Yosef Hochberg. "Controlling the false discovery rate: a practical and powerful approach to multiple testing." Journal of the Royal Statistical Society. Series B (Methodological) (1995): 289-300. APA
  • 51. False Discovery Rate control Framework for controlling errors that arise from running multiple experiments at once. Run many goals & variations
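The classic procedure in this framework is Benjamini-Hochberg, from the paper cited above. The sketch below illustrates the concept only; Stats Engine's actual implementation differs, since it controls false discovery rate on streaming, always-valid p-values rather than a fixed batch.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Benjamini-Hochberg step-up procedure.

    Returns the indices of hypotheses declared significant while
    controlling the expected false discovery rate at `fdr`.
    """
    m = len(p_values)
    # Sort p-values ascending, remembering original positions
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * fdr
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            k_max = rank
    # Everything up to and including that rank is a discovery
    return sorted(order[:k_max])

# Hypothetical p-values for 10 goal x variation combinations
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.24, 0.37, 0.50, 0.82]
print(benjamini_hochberg(pvals, fdr=0.10))  # → [0, 1, 2, 3, 4, 5]
```

Note the step-up behavior: 0.039 and 0.041 are rejected even though they exceed their individual per-rank thresholds, because a larger p-value further down the sorted list clears its threshold.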
  • 52. What Stats Engine means for you • You see fewer, but more accurate, conclusive results. • You can implement winners as soon as significance is reached. • You get an easier experiment workflow with fewer unforeseen and hidden errors.
  • 53. Objectives Understand why Optimizely built Stats Engine Introduce the methods Stats Engine uses to calculate results Get practical recommendations for how to test with Stats Engine
  • 54. And now, for some practical guidance
  • 55. First, some vocabulary • Baseline conversion rate: the control group’s expected conversion rate. • Minimum detectable effect: the smallest conversion rate difference an A/B test can reliably detect. • Statistical significance: the likelihood that the observed difference in conversion rates is not due to chance. • Minimum sample size: the smallest number of visitors required to reliably detect a given conversion rate difference.
  • 56. Sample size calculators and statistical power
  • 57. How many visitors do you need to see significant results? Visitors needed to reach significance with Stats Engine, by baseline conversion rate and improvement:

    Improvement:             5%        10%       25%
    Baseline  1%:       458,900   101,600    13,000
    Baseline  5%:        69,500    15,000     1,800
    Baseline 10%:        29,200     6,200       700
    Baseline 25%:         8,100     1,700       200

  Lower conversion rates and smaller effects require more visitors.
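The pattern in the table can be reproduced with a classical fixed-horizon, two-proportion sample size formula, sketched below. Stats Engine itself uses sequential testing, so its visitor counts differ from this formula's; the sketch only shows why lower baselines and smaller effects need many more visitors.

```python
import math

def sample_size_per_variation(baseline, mde):
    """Classical two-proportion sample size per variation (a textbook
    approximation, not Stats Engine's sequential calculation).

    baseline: control conversion rate, e.g. 0.05 for 5%
    mde:      minimum detectable effect as a relative lift, e.g. 0.10 for 10%
    """
    p1 = baseline
    p2 = baseline * (1 + mde)
    z_alpha = 1.96  # two-sided, alpha = 0.05
    z_beta = 0.84   # power = 80%
    pooled = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * pooled * (1 - pooled))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p2 - p1) ** 2)
    return math.ceil(n)

# 5% baseline, 10% relative improvement
print(sample_size_per_variation(0.05, 0.10))
```

Halving the baseline or the effect size roughly quadruples the required sample, which is the relationship the table above illustrates.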
  • 58. One example of calculating your opportunity cost 12% minimum detectable effect
  • 60. Should you stop or continue a test?
  • 61. Should you stop or continue a test? Is my test significant? Congrats Can I afford to wait? Continue Stop Accept lower significance Concede inconclusive Yes No No Yes
  • 66. Seasonality • We DO take into account seasonality while a test is running. • We DO NOT take into account future seasonality after an experiment is stopped.
  • 67. Should you stop or continue a test? Is my test significant? Congrats Can I afford to wait? Continue Stop Accept lower significance Concede inconclusive Yes No No Yes • Use Difference Intervals to understand the types of lifts you could see.
  • 68. Should you stop or continue a test? Is my test significant? Congrats Can I afford to wait? Continue Stop Accept lower significance Concede inconclusive Yes No No Yes
  • 69. A reasonable number of visitors left, relative to your traffic?
  • 70. Visitors Remaining, explained • Improvement increases in magnitude • Improvement stays the same • Improvement decreases in magnitude
  • 71. A good idea to wait: 5,761 + 3,200 visitors remaining.
  • 72. Should you stop or continue a test? Is my test significant? Congrats Can I afford to wait? Continue Stop Accept lower significance Concede inconclusive Yes No No Yes • Use Visitors Remaining to evaluate if waiting makes sense.
  • 73. Should you stop or continue a test? Is my test significant? Congrats Can I afford to wait? Continue Stop Accept lower significance Concede inconclusive Yes No No Yes
  • 75. If you’re an organization that • iterates quickly on new variations • runs lots of experiments • has little downside risk from implementing non-winning variations then you can likely tolerate a higher error rate.
  • 76. Difference intervals can guide your decision
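A difference interval is an interval estimate of the absolute lift between variation and baseline. The sketch below is the classical fixed-horizon Wald interval, not Stats Engine's sequential, always-valid interval, which would be wider early in a test; the interpretation is the same either way: if the whole interval sits above zero, even the worst plausible case is a win. The traffic numbers are made up for illustration.

```python
import math

def difference_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """Classical 95% Wald interval for the difference in conversion
    rates (variation minus baseline). conv_* are conversion counts,
    n_* are visitor counts."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return (diff - z * se, diff + z * se)

# Baseline: 500/10,000 = 5%; variation: 600/10,000 = 6%
lo, hi = difference_interval(500, 10_000, 600, 10_000)
print(f"[{lo:+.4f}, {hi:+.4f}]")  # both ends above zero: a win
```

Here the interval runs from about +0.4 to +1.6 percentage points, so stopping early risks overestimating the lift but not implementing a loser.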
  • 77. Should you stop or continue a test? Is my test significant? Congrats Can I afford to wait? Continue Stop Accept lower significance Concede inconclusive Yes No No Yes • Use Difference Intervals to measure the risk you take on
  • 78. > 100,000 visitors? What next?
  • 79. Time to move on to the next idea
  • 80. Should you stop or continue a test? Is my test significant? Congrats Can I afford to wait? Continue Stop Accept lower significance Concede inconclusive Yes No No Yes • Use Visitors Remaining to evaluate if waiting makes sense.
  • 81. Recap: Is my test significant? Yes → Congrats. No → Can I afford to wait? Yes → Continue. No → Stop: accept lower significance or concede inconclusive. • Use Difference Intervals to understand the types of lifts you could see and to measure the risk you take on. • Use Visitors Remaining to evaluate if waiting makes sense.
  • 85. Tuning your testing strategy for your traffic and business. Looking for small effects? • Tests take longer to reach significance • Might find more winners, if you are willing to wait long enough. Testing for larger effects? • Run more tests, faster • Know when it’s time to move on to the next idea
  • 87. Do I need to re-run my historical tests?
  • 88. Is this a one-tailed or two-tailed test? Why did you switch?
  • 89. Why does Stats Engine report 0% Statistical Significance when other tools report higher values?
  • 90. Why does Statistical Significance increase step-wise?
  • 91. If I take the results I see in Optimizely and plug them into any other statistics calculator, the statistical significance is different. Why?
  • 92. How does Stats Engine handle revenue calculations?