Statistics in 40 Minutes:
A/B Testing
Fundamentals
Leo Pekelis
Statistician, Optimizely
@lpekelis
leonid@optimizely.com
#opticon2015
You have your own unique approach to A/B Testing
The goal of this talk is to break
down A/B Testing to its
fundamentals.
A/B Testing Platform:
1) Create an experiment
2) Read the results page
“A/B Testing Playbook”
Opening: Hypotheses
Mid-game: Outcomes & Error Rates
Mid-game: Fundamental Tradeoff
Closing: Confidence Intervals
At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
The answers:
1. A good hypothesis has a variation and a clearly defined goal which you crafted ahead of time.
2. False Positive Rates have hidden sources of error when testing many goals and variations. False Discovery Rates correct this by increasing runtime when you have many low signal goals.
3. All three levers are inversely related. For example, running my tests longer can get me lower error rates, or detect smaller effects.
First, some vocabulary (yay!)
• Control and Variation: A control is the original, or baseline, version of content that you are testing against a variation.
• Goal: The metric used to measure the impact of the control and variation.
• Baseline conversion rate: The control group’s expected conversion rate.
• Effect size: The improvement (positive or negative) of your variation over the baseline.
• Sample size: The number of visitors in your test.
• A hypothesis test is a control and a variation that you want to show improves a goal.
• An experiment is a collection of hypotheses (goal & variation pairs) that all have the same control.
“A/B Testing Playbook”
Opening: Hypotheses
Mid-game: Outcomes & Error Rates
Mid-game: Fundamental Tradeoff
Closing: Confidence Intervals
Imagine we are the NBA (http://www.nba.com/)
What is a good hypothesis (test)?
Bad hypothesis: “I think changing the header image will make my site better.”
Why is this not actionable? It invites test creep:
• Removing the header will increase engagement?
• Removing the header will increase total revenue?
• Removing the header will increase “the finals” clicks?
• Growing the header will increase engagement?
• Growing the header will increase “the finals” clicks?
• …
Written out explicitly ahead of time, those same statements are good hypotheses: organized and clear, in contrast to the vague “I think changing the header image will make my site better.”
Hypotheses also give the cost of your experiment:
the more relationships (hypotheses) you test, the longer (in visitors) it will take to achieve the same outcome (error rate).
Questions to check for a good hypothesis
What are you trying to show with your idea?
What key metrics should it drive?
Are all my goals and variations necessary given my testing limits?
At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
Answer: A good hypothesis has a variation and a clearly defined goal which you crafted ahead of time.
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
“A/B Testing Playbook”
Opening: Hypotheses
Mid-game: Outcomes & Error Rates
Mid-game: Fundamental Tradeoff
Closing: Confidence Intervals
What are the possible outcomes?
The 2x2 table of test outcomes:

Result of test \ “True” value of hypothesis | Improvement    | No effect
Winner / Loser                              | True positive  | False positive
Inconclusive                                | False negative | True negative

The four possible (truth, result) combinations:
• (no effect, winner / loser): false positive
• (+/- improvement, inconclusive): false negative
• (+/- improvement, winner / loser): true positive
• (no effect, inconclusive): true negative
The 2x2 table will help us to
1. Keep track of different error rates we care about
2. Explore the consequences of controlling false positives vs false
discoveries
Error rate 1: False positive rate
• False positive rate (Type I error) = “Chance of a false positive from a variation with no effect on a goal” (the ‘No effect’ column of the 2x2 table)
• Thresholding the FPR = “When I have a variation with no effect on a goal, I’ll find an effect less than 10% of the time.”
How can we ever compute a False Positive Rate if we
don’t know whether a hypothesis is true or not?
Statistical tests (fixed horizon t-test, Stats Engine) are designed to
threshold an error rate.
Example:
“Calling winners & losers when a p-value is below .05 will guarantee a
False Positive Rate below 5%”
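To make that guarantee concrete, here is a minimal simulation sketch (our illustration, not from the deck; it assumes the platform’s test behaves like a two-sided two-proportion z-test): run many A/A-style tests where the variation truly has no effect and count how often p < .05 would call a winner or loser.

```python
# Minimal sketch (illustrative assumption: a two-proportion z-test stands in for the platform's test).
# Simulate many A/A tests (no true effect) and count how often p < .05.
from statistics import NormalDist
import numpy as np

rng = np.random.default_rng(0)

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test p-value for a difference in conversion rates."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

n_tests, n_visitors, baseline = 10_000, 10_000, 0.10
false_positives = 0
for _ in range(n_tests):
    conv_a = rng.binomial(n_visitors, baseline)  # control at a 10% true rate
    conv_b = rng.binomial(n_visitors, baseline)  # variation with NO true effect
    if two_proportion_p_value(conv_a, n_visitors, conv_b, n_visitors) < 0.05:
        false_positives += 1

print(f"Observed false positive rate: {false_positives / n_tests:.3f} (should be near 0.05)")
```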
False Positive Rates with multiple tests
(xkcd “Significant”, the green jelly bean comic: https://xkcd.com/882/)
What happened?
21 tests × 5% FPR ≈ 1 false positive on average
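The arithmetic behind that statement, plus the chance of seeing at least one false positive, assuming the 21 tests behave roughly independently (an illustrative assumption):

```python
# Sketch: expected false positives and chance of at least one, for 21 independent tests at a 5% FPR.
n_tests, fpr = 21, 0.05

expected_false_positives = n_tests * fpr       # 21 * 0.05 = 1.05
prob_at_least_one = 1 - (1 - fpr) ** n_tests   # 1 - 0.95^21 ~= 0.66

print(f"Expected false positives: {expected_false_positives:.2f}")
print(f"P(at least one false positive): {prob_at_least_one:.2f}")
```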
False positive rates are only useful in the context of all hypotheses.
Error rate 2: False discovery rate
• False discovery rate (FDR) = “Chance of a false positive from a conclusive result”
• Thresholding the FDR = “When you see a winning or losing goal on a variation, it’s wrong less than 10% of the time.”
1 conclusive result (winner or loser) × 5% FDR = 0.05 false positives on average
False discovery rates are useful no matter how many hypotheses you test.
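The deck does not show how an FDR threshold is enforced. One classical way to do it (shown purely for illustration, not as a description of Stats Engine internals) is the Benjamini-Hochberg procedure applied to the p-values of all goal & variation pairs:

```python
# Sketch: Benjamini-Hochberg procedure to control the false discovery rate at level q.
# Classical method shown for illustration; not a description of Stats Engine.
import numpy as np

def benjamini_hochberg(p_values, q=0.10):
    """Return a boolean array marking which hypotheses are declared conclusive at FDR level q."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                       # sort p-values ascending
    thresholds = q * (np.arange(1, m + 1) / m)  # BH critical values q*k/m
    below = p[order] <= thresholds
    conclusive = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])          # largest k with p_(k) <= q*k/m
        conclusive[order[: k + 1]] = True       # call the k smallest p-values conclusive
    return conclusive

# Example: 5 goal & variation pairs from one experiment (made-up p-values).
p_vals = [0.001, 0.008, 0.07, 0.20, 0.75]
print(benjamini_hochberg(p_vals, q=0.10))       # -> [ True  True False False False]
```

Note how adding more hypotheses increases m and tightens every threshold q·k/m, which mirrors the deck’s point that many low signal hypotheses increase runtime.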
What’s the catch?
Is it that the more hypotheses (goals & variations) in your experiment, the longer it takes to find conclusive results? Not quite (see the figure contrasting low signal and high signal goals): the more low signal hypotheses (goals & variations) in your experiment, the longer it takes to find conclusive results.
Recap
• False Positive Rate thresholding
-controls the chance of a false positive when you have a hypothesis with no
effect
-misrepresents your error rate with multiple goals and variations
• False Discovery Rate thresholding
-controls the chance of a false positive when you have a winning or losing
hypothesis
-is accurate regardless of how many hypotheses you run
-can take longer to reach significance with more low signal variations on goals
Tips & Tricks for running experiments with False
Discovery Rates
• Ask: Which goal is most important to me?
-This should be my primary goal (not impacted by all other
goals)
• Run large, or large multivariate tests without fear of finding
spurious results, but be prepared for the cost of exploration
• A little human intuition and prior knowledge can go a long way
towards reducing the runtime of your experiments
At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
Answer: False Positive Rates have hidden sources of error when testing many goals and variations. False Discovery Rates correct this by increasing runtime when you have many noisy goals.
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
“A/B Testing Playbook”
Opening: Hypotheses
Mid-game: Outcomes & Error Rates
Mid-game: Fundamental Tradeoff
Closing: Confidence Intervals
“3 Levers” of A/B Testing
1. Thresholding an error rate
• “I want no more than 10% false discovery rate”
2. Detecting effect sizes (setting an MDE)
• “I’m OK with only detecting greater than 5% improvement”
3. Running tests longer
• “I can afford to run this test for 3 weeks, or 50,000 visitors”
Fundamental Tradeoff of A/B Testing
Error rates, runtime, and effect size / baseline CR are all inversely related:
• At any number of visitors, the less you threshold your error rate, the smaller the effect sizes you can detect.
• At any error rate threshold, stopping your test earlier means you can only detect larger effect sizes.
• For any effect size, the lower the error rate you want, the longer you need to run your test.
What does this look like in practice?
Average visitors needed to reach significance with Stats Engine (baseline conversion rate = 10%, improvement is relative):

Significance threshold | 5% improvement | 10% improvement | 25% improvement
95%                    | 62,400         | 13,500          | 1,800
90%                    | 59,100         | 12,800          | 1,700
80%                    | 52,600         | 11,400          | 1,500
All A/B Testing platforms address the fundamental tradeoff …
1. Choose a minimum detectable effect (MDE) and false
positive rate threshold
2. Find the required minimum sample size with a sample size calculator (a sketch of such a calculation follows this list)
3. Wait until the minimum sample size is reached
4. Look at your results once and only once
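Step 2 usually comes down to the standard two-proportion, fixed-horizon formula. A minimal sketch (our illustration; its output will not match the Stats Engine table above, because Stats Engine uses sequential testing and FDR control):

```python
# Sketch: classical fixed-horizon sample size per arm for a two-proportion test.
# Standard normal-approximation formula; illustrative only, not the Stats Engine calculation.
from statistics import NormalDist

def sample_size_per_arm(baseline_cr, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed in EACH of control and variation."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_mde)          # e.g. 10% baseline, +5% relative -> 10.5%
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided false positive threshold
    z_beta = NormalDist().inv_cdf(power)           # 1 - false negative rate
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ((z_alpha + z_beta) ** 2 * variance) / (p2 - p1) ** 2

# Example: detect a 5% relative improvement on a 10% baseline at alpha = 5%, power = 80%.
n = sample_size_per_arm(0.10, 0.05)
print(f"~{n:,.0f} visitors per arm, ~{2 * n:,.0f} total")
```

With a 10% baseline and a 5% relative MDE, this sketch gives roughly 58,000 visitors per arm at 80% power, which is why small improvements on low baselines are expensive to detect.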
Optimizely is the only platform
that lets you pull the levers in
real time
In the beginning, we make an educated guess …
Error rates: 5% | Effect size / Baseline CR: +5% on 10% | Runtime: 52,600 visitors (?)
… but then the improvement turns out to be better …
Error rates: 5% | Effect size / Baseline CR: +13% on 16% | Runtime (remaining): 1,600 visitors, instead of 52,600 − 7,200 = 45,400
… or a lot worse.
Error rates: 5% | Effect size / Baseline CR: +2% on 8% | Runtime (remaining): > 100,000 visitors
Recap
• The Fundamental Tradeoff of A/B Testing affects you no matter
what testing platform you use.
-If you want to detect a 5% Improvement on a 10% baseline
conversion rate, you should be prepared to wait for at least
50,000 visitors
• Optimizely’s Stats Engine is the only platform that allows you to
adjust the trade-off in real time while still reporting valid error
rates
At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
Answer: All three are inversely related. For example, running my tests longer can get me lower error rates, or detect smaller effects.
“A/B Testing Playbook”
Opening: Hypotheses
Mid-game: Outcomes & Error Rates
Mid-game: Fundamental Tradeoff
Closing: Confidence Intervals
Definition:
A confidence interval is a range of values for your metric (revenue, conversion rate, etc.) that is 90%* likely to contain the true difference between your variation and baseline.
(Results-page example: worst case 7.29, middle ground 11.4, best case 15.41.)
This is true regardless of your significance.
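As an illustration of where such an interval comes from, here is a minimal sketch for a difference in conversion rates using a plain normal approximation (made-up counts; platforms such as Stats Engine compute their intervals differently, e.g. with sequential adjustments):

```python
# Sketch: 90% confidence interval for the difference in conversion rates (normal approximation).
# Counts are made up for illustration.
from statistics import NormalDist
from math import sqrt

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, level=0.90):
    """CI for (variation rate - control rate)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Control: 1,000 of 10,000 converted; variation: 1,100 of 10,000 converted.
low, high = diff_confidence_interval(1_000, 10_000, 1_100, 10_000)
print(f"90% CI for the difference: [{low:+.2%}, {high:+.2%}]")
```

With these made-up counts the interval runs from roughly +0.3 to +1.7 percentage points.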
We can’t wait for significance
The confidence interval tells us what we need to know
A confidence interval is the mirror image of statistical significance.
Mathematical definition: the set of parameter values X such that a hypothesis test with null hypothesis
H0: “Removing a distracting header will result in X more revenue per visitor”
is not yet rejected.
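Written symbolically, with Δ for the true difference and α for the significance threshold (our notation, not the deck’s), the same test-inversion idea reads:

```latex
\mathrm{CI}_{1-\alpha} \;=\; \left\{\, x \;:\; \text{the test of } H_0\colon \Delta = x \text{ is not rejected at level } \alpha \,\right\}
```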
Tuning your testing strategy for your traffic and business
(figure: example test results: improvements of 1%, 3%, 4%, 6%, several inconclusive results, and an 18% winner)

Looking for small effects?
• Tests take longer to reach significance
• Might find more winners, if you are willing to wait long enough
• Can you accept a higher error rate?

Testing for larger effects?
• Run more tests, faster
• It’s time to move on if the effect size is not worth the visitors
• A confidence interval may be enough
• Change allocation to 10% to lessen impact on new experiments
Error rate 3: False negative rate
• False negative rate (Type II error) = “Rate of false negatives from all variations with an improvement on a goal” = #(False negatives) / #(Improvements)
• Thresholding Type II error = “When you have a goal on a variation with an effect, you miss it less than 10% of the time.”
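A minimal sketch of the complementary calculation to the sample-size formula earlier: for a fixed number of visitors, what false negative rate should you expect? Same normal-approximation assumptions and made-up numbers as before:

```python
# Sketch: expected false negative rate (Type II error) for a fixed sample size per arm.
# Normal-approximation illustration, not a description of any platform's internals.
from statistics import NormalDist
from math import sqrt

def false_negative_rate(baseline_cr, relative_mde, n_per_arm, alpha=0.05):
    """Chance of missing a real improvement of the given size."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_mde)
    se = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Power = P(the observed z clears the threshold when the true difference is p2 - p1),
    # ignoring the negligible chance of significance in the wrong direction.
    power = 1 - NormalDist().cdf(z_alpha - (p2 - p1) / se)
    return 1 - power

# 10% baseline, 5% relative improvement, 30,000 visitors per arm.
print(f"False negative rate: {false_negative_rate(0.10, 0.05, 30_000):.0%}")
```

With these numbers the sketch reports a false negative rate of roughly 48%, i.e. about half of real 5% relative improvements would be missed at 30,000 visitors per arm.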