Since it was introduced in 2014, Stats Engine has served as a fast, powerful, and easy-to-use foundation for tens of thousands of digital experiments. But how exactly does it work?
In this session, we will explain the key differences and advantages of Stats Engine by comparing and contrasting it with a familiar old friend: the t-test.
1. Tale of Two Tests
2. Tale of Two Tests
Jimmy Jin
Statistician
Mei Luo
Strategic Customer Success Manager
3. Experimentation in the Digital Age
• You want to run an experiment on the background image on the
homepage of your e-commerce clothing site, Attic & Button
• As a practitioner, you would want...
4. Results in real time
• Evaluate impact on multiple KPIs
• Run experiment with minimal data inputs
6. Review of basic terms
• p-value: The probability of observing a result at least as extreme as
the one you saw, assuming there is no true difference
• False positive rate (Type 1 error rate): “How often will
the test detect an illusion?”
• Power (1 minus Type 2 error rate): “How often will the test
detect the real thing?”
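The definitions above can be made concrete with a small sketch. The example below is illustrative only (it is not Optimizely's implementation, and the visitor counts are made up); it computes a two-sided p-value for a two-proportion z-test on conversion counts:

```python
from math import erf, sqrt

def norm_cdf(x):
    # Standard normal CDF, built from the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    # Two-sided p-value: chance of a difference at least this extreme
    # if both variations truly convert at the same rate.
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2.0 * (1.0 - norm_cdf(abs(z)))

# Hypothetical data: 10% vs. 13% conversion over 2,000 visitors each.
p = two_proportion_p_value(200, 2000, 260, 2000)
print(round(p, 4))
```

A small p-value here says the observed gap would be rare if the two variations were truly identical; it says nothing yet about how often such "detections" are illusions, which is where the error rates above come in.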
7. Steps to doing a t-test
1. Calculate the required sample size for your A/B test
   • Depends on the minimum detectable effect (MDE)
2. Collect your data
3. Make a decision
Continuing past the prescribed sample size or
stopping early is NOT allowed.
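Step 1 is where the guesswork lives. A minimal sketch of the classical fixed-horizon calculation, using the standard normal-approximation formula; the constants 1.96 and 0.84 correspond to a 5% two-sided false positive rate and 80% power, which are conventional assumptions rather than values from this deck:

```python
from math import ceil

def sample_size_per_variation(baseline_rate, mde_abs, z_alpha=1.96, z_beta=0.84):
    # z_alpha = 1.96 -> 5% two-sided false positive rate
    # z_beta  = 0.84 -> 80% power
    p = baseline_rate
    variance = 2.0 * p * (1.0 - p)  # approximate pooled variance of the difference
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Halving the MDE roughly quadruples the required sample size:
print(sample_size_per_variation(0.10, 0.02))  # detect a 2-point absolute lift
print(sample_size_per_variation(0.10, 0.01))  # detect a 1-point absolute lift
```

The quadratic dependence on the MDE is the root of the guessing problem discussed later: guess the MDE too small and the test drags on; guess it too large and smaller real improvements go undetected.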
16. Summary – The Peeking Problem
t-test:
• Peeking during a t-test increases the chance you’ll find a winning
result when none actually exists (a false positive)
Stats Engine:
• Sequential testing enables evaluation of experiment data as it is
collected. Tests can be stopped at any time with valid results.
28. Summary – The Guessing Problem
t-test:
• If you set a small MDE, tests will take longer to conclude. If you set a
large MDE, you may miss smaller improvements.
Stats Engine:
• When the true lift exceeds your MDE, you’ll be able to call your test
faster.
31. Built-in protections in Stats Engine
Ordinarily, corrections for multiple comparisons happen
after all tests have concluded.
In Optimizely, we perform these corrections in real time,
so your results are protected no matter when you
check your experiment.
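For contrast, here is the classical one-shot style of correction mentioned above: the Benjamini-Hochberg step-up procedure, a standard offline method for controlling the false discovery rate. This is shown purely for illustration and is not a description of how Stats Engine's real-time corrections work:

```python
def benjamini_hochberg(p_values, q=0.05):
    # Step-up procedure: find the largest rank k with p_(k) <= q*k/m,
    # then declare the k smallest p-values significant.
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])  # indices of significant hypotheses

# Five hypothetical metric p-values from one experiment:
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
```

Note how adding more comparisons shrinks the per-test threshold q·k/m, which is the same "more metrics, more conservative" behavior described in the summary slides.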
32. Activity 3: false positives
You conduct an experiment with many variations. Under
which scenario would you suspect more false positives?
1. You obtain 5 significant results.
2. You obtain 50 significant results.
33. False positives vs. false discoveries
False positive rate
P( significant | no true effect )
False discovery rate
P( no true effect | significant )
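The gap between these two quantities is easiest to see with Bayes' rule. A worked sketch with assumed, purely illustrative inputs (10% of tested variations truly have an effect, α = 0.05, 80% power):

```python
def false_discovery_rate(prior_true=0.10, alpha=0.05, power=0.80):
    # P(no true effect | significant), via Bayes' rule.
    true_positives = prior_true * power            # real effects we detect
    false_positives = (1.0 - prior_true) * alpha   # illusions we "detect"
    return false_positives / (true_positives + false_positives)

print(round(false_discovery_rate(), 3))
```

Even with the false positive rate held at 5%, more than a third of the "significant" results in this scenario are illusory, which is why the false discovery rate is the quantity that tracks the risk of a wrong business decision.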
36. Example: FDR tiering in an actual experiment
Let’s walk through an actual experiment!
37. Summary – The Multiple Comparisons Problem
t-test:
• Traditional statistics control the false positive rate, which does not
equate to the probability of making an incorrect business decision
Stats Engine:
• Stats Engine controls for false discovery rate; as you add more metrics
to your experiment, Optimizely will become more conservative in
calling a winner or loser
39. 3 Takeaways
Stats Engine allows you to...
• Monitor results in real time for faster experimentation,
without increased error rates
• Run fully powered experiments without guessing at
sample size calculations
• Evaluate impact on many metrics without
sacrificing accuracy