A-Z of A/B testing
Dr. Shrividya Ravi
Data Scientist at Metail
Data Insights Meetup 3 December 2015
Overview
• Introduction
– What is A/B testing?
– Comparing web and field tests
• Analysis
• Effects that can affect a test
– Denominator issues
– Temporal effects
– Hidden bias
• Bootstrapping
• From validation to understanding mechanisms
A/B testing
• Essentially a randomized trial
• Split traffic 50:50
• One group sees ‘normal’ site
• The other group sees the variant or ‘treatment’
• After a set period of time, calculate the difference in KPIs between the two groups
• Generally, you can attribute the difference to the treatment
http://www.smashingmagazine.com/2010/06/the-ultimate-guide-to-a-b-testing/
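One common way to get a stable 50:50 split is to hash a user identifier into a bucket. The sketch below is only an illustration of that idea, not the method used in the talk; the function name `assign_group`, the test name `homepage-test` and the `user-…` IDs are all hypothetical.

```python
import hashlib

def assign_group(user_id: str, test_name: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Hash the user ID together with the test name so that different
    # tests split the same users independently of each other.
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a position in [0, 1).
    position = int(digest[:8], 16) / 16**8
    return "treatment" if position < treatment_share else "control"

# Hypothetical user IDs, used only to check that the split is roughly even.
users = [f"user-{i}" for i in range(10_000)]
groups = [assign_group(u, "homepage-test") for u in users]
treatment_fraction = groups.count("treatment") / len(groups)
print(f"treatment share: {treatment_fraction:.3f}")
```

Because the assignment depends only on the ID and the test name, a returning user always lands in the same group.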
Web vs. field trials
• Data
– Quantity
– Quality
– Type
• Web data: large quantities, low quality until aggregation and cohort creation, observational.
• Field trials: small to medium quantities, high quality information about participants, combination of direct responses, tests and observations.
Events
• Launching the widget
• Adding item to Bag
• Rotating MeModel
• Pressing Share button
• Adding garment to try on
A single event
web 2015-01-14 04:41:20.000 2015-01-14 04:41:53.480 struct
0e833b00-d2cb-436b-ad1d-21fa47474b80 primary
js-2.2.0cloudfront hadoop-0.5.0-common-0.4.0 XX.XX.XX.X
2091617875 aca45a2fbc191e7b 3 BR
-XX.XXXX -XX.XXXXX https://live-cdn.me-tail.net/wanda-
ui/5a180420-2416-11e2-81c1-0800200c9a66/pt-
BR/?xdm_e=http%3A%2F%2Fwww.dafiti.com.br&xdm_c=default4031&xdm_p=
1#init-
data/%7B%22retailerPageType%22%3A%22productListing%22%2C%22open%22
%3Afalse%7D http://www.dafiti.com.br/roupas-femininas/casacos-e-
jaquetas/ https live-cdn.me-tail.net 80 /wanda-ui/5a180420-2416-
11e2-81c1-0800200c9a66/pt-BR/
xdm_e=http%3A%2F%2Fwww.dafiti.com.br&xdm_c=default4031&xdm_
p=1 init-
data/%7B%22retailerPageType%22%3A%22productListing%22%2C%22open%22
%3Afalse%7D http www.dafiti.com.br 80 /roupas-femininas/casacos-
e-jaquetas/ unknown
TabBar OpenTab productListing
Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/39.0.2171.95 Safari/537.36 Chrome Chrome
39.0.2171.95 Browser WEBKIT pt-BR 1 1 1
0 0 0 1 0 0 1 24 216 0
Windows Windows Microsoft Corporation
America/Sao_Paulo Computer 0 1366 768 UTF-8 216
0
• Widget launched by clicking on tab
• Selected information:
– Timestamp
– UserID (cookie ID)
– Geolocation: country code, longitude & latitude (usually of ISP), timezone
– IP address
– URLs: host, current, referrer
– Event hierarchy
• Others:
– Browser information
– Device & OS information
– Session counter
Logs of millions of events
• Store all raw logs in the cloud
• Create aggregates of specific events every day
– Use aggregates to create cohorts
UserID            Retailer  Engaged?  Order value  OrderID   Group  Test Ratio  User Type
0157dab05efbef6f  XX        null      68.11        68137749  out    50          ExistingBin
0158ee5980cc75ad  XX        null      null         null      in     90          NewBin
015ab3acaba4c770  XX        TRUE      null         null      in     90          NewBin
015e3a8e1d5ad181  XX        null      null         null      in     90          NewBin
015e3da4002e861a  XX        null      null         null      in     90          NewBin
0160ae8d4465773b  XX        TRUE      null         null      in     90          NewBin
0161f081a2c51d9f  XX        null      null         null      out    50          ExistingBin
01647bcd7185da9d  XX        null      96.27        27342749  out    50          NewBin
Aggregated slice over some time period, by user
Analysing A/B tests
Basic A/B test
• Change in homepage
• Measure difference in Average Order Value (AOV) between control and treatment.
https://www.google.co.uk/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=0ahUKEwjip4_nxr3JAhWFJw8KHTbdAaoQjRwIBw&url=http%3A%2F%2Fpadic
ode.com%2Fblog%2Femail-marketing-2%2Fab-testing-resources%2F&psig=AFQjCNEnv6hj7n9VW-z0tx54iLl5c3srIA&ust=1449158617165351
Results
AOV = Average Order Value (the mean order value)
Group    AOV (monetary units)
Control  114
Variant  103.4
Data distribution
• Skewed distribution
• Prices are often log-normally distributed.
• So, depending on the skew and extreme values, the mean can fluctuate without indicating a real effect.
Significance testing
• z-test of lognormal means between groups
• Critical threshold: 0.05
• H0: difference in means = 0; Ha: difference > 0
• One-tailed p-value: 0.00682
• Two-tailed p-value: 0.0137
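The slides do not show how the z-test was computed. A minimal sketch, assuming the test compares the means of log-transformed order values and that the one-tailed alternative is "control is higher than variant", could look like this. The function name and the synthetic log-normal data are illustrative assumptions, not the talk's data.

```python
import math
import random
from statistics import NormalDist, mean, variance

def z_test_log_means(control, variant):
    """z-test on the means of log-transformed order values.

    Returns (z, one_tailed_p, two_tailed_p). The one-tailed alternative
    is 'control mean is greater than variant mean' (an assumption).
    """
    log_c = [math.log(v) for v in control]
    log_v = [math.log(v) for v in variant]
    diff = mean(log_c) - mean(log_v)
    # Standard error of the difference between two independent means.
    se = math.sqrt(variance(log_c) / len(log_c) + variance(log_v) / len(log_v))
    z = diff / se
    one_tailed = 1.0 - NormalDist().cdf(z)
    two_tailed = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, one_tailed, two_tailed

# Synthetic order values: log-normal with slightly different location.
rng = random.Random(42)
control = [rng.lognormvariate(4.5, 0.7) for _ in range(2000)]
variant = [rng.lognormvariate(4.4, 0.7) for _ in range(2000)]
z, p_one, p_two = z_test_log_means(control, variant)
print(f"z = {z:.2f}, one-tailed p = {p_one:.4g}, two-tailed p = {p_two:.4g}")
```

Working on the log scale makes the data closer to normal, so a few extreme orders have less influence on the test than they would on the raw means.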
Simulate beforehand
Pitfalls
• Temporal effects
– Novelty
– Spikes
• Noise
– Dilution
• Bias
– Bucketing bias
– Asymmetric cohorts
• Bugs
Temporal effects
• Novelty: a strong effect at the beginning wanes over time.
• Spikes/spurious data: atypical mechanisms.
• Monitoring only whether the p-value drops below a critical threshold can lead to misinterpreting the effect.
Spikes
• Toy Scenario
Property                  Group A    Group B
Stable conversion         0.2        0.22
Sale conversion           0.3        0.5
Visitor rate (per group)  N(200,20)  N(200,20)
Here N(𝜇, 𝜎) denotes a normal distribution with parameters 𝜇 and 𝜎. Conversion data are generated using a binomial distribution.
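The toy scenario can be sketched as a small simulation. This is an illustration under stated assumptions: both groups get the same daily visitor count, 𝜎 = 20 is treated as a standard deviation, the test runs for 18 days, and the sale falls on day 6. The seed and function names are arbitrary.

```python
import random

def binomial(n, p, rng):
    """Number of successes in n independent trials with success probability p."""
    return sum(rng.random() < p for _ in range(n))

def simulate_toy_test(days=18, sale_day=6, seed=1):
    """Return one row per day: (day, visitors, orders_a, orders_b), all cumulative."""
    rng = random.Random(seed)
    visitors = orders_a = orders_b = 0
    rows = []
    for day in range(1, days + 1):
        # Conversion rates from the table: stable days vs. the sale day.
        rate_a, rate_b = (0.3, 0.5) if day == sale_day else (0.2, 0.22)
        # Daily visitors per group, drawn from N(200, 20).
        todays_visitors = max(1, round(rng.gauss(200, 20)))
        visitors += todays_visitors
        orders_a += binomial(todays_visitors, rate_a, rng)
        orders_b += binomial(todays_visitors, rate_b, rng)
        rows.append((day, visitors, orders_a, orders_b))
    return rows

rows = simulate_toy_test()
for day, visitors, a, b in rows:
    # Cumulative conversion per group and the difference B - A.
    print(f"{day:2d} {visitors:5d} {a / visitors:.3f} {b / visitors:.3f} {(b - a) / visitors:+.3f}")
```

Changing `seed` gives a different random history, which is a quick way to see how much the cumulative difference moves around before and after the sale day.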
Spikes
• Spurious effects like flash sales can quickly push p-value below critical threshold.
• Sale on Day 6 increases conversion in both groups but difference is higher in Group B.
• Increase in cumulative conversion reduces p-value dramatically.
Spikes
Spurious effects like flash sales can quickly push p-value below
critical threshold.
days all_visitors all_orders_a all_orders_b conversion_a conversion_b difference p_value
1 194 32 34 0.165 0.175 0.010 0.393
2 355 63 76 0.177 0.214 0.037 0.109
3 566 110 126 0.194 0.223 0.028 0.121
4 772 155 168 0.201 0.218 0.017 0.208
5 953 185 210 0.194 0.220 0.026 0.079
6 1154 244 300 0.211 0.260 0.049 0.003
7 1331 279 346 0.210 0.260 0.050 0.001
8 1537 323 386 0.210 0.251 0.041 0.003
9 1713 362 422 0.211 0.246 0.035 0.007
10 1931 404 467 0.209 0.242 0.033 0.008
11 2135 451 511 0.211 0.239 0.028 0.014
12 2341 497 555 0.212 0.237 0.025 0.021
13 2553 544 606 0.213 0.237 0.024 0.019
14 2770 595 666 0.215 0.240 0.026 0.011
15 2956 633 707 0.214 0.239 0.025 0.011
16 3143 670 749 0.213 0.238 0.025 0.009
17 3388 713 803 0.210 0.237 0.027 0.004
18 3590 757 843 0.211 0.235 0.024 0.007
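The slides do not say which test produced the p_value column. As an assumption, a pooled, one-sided two-proportion z-test (alternative: conversion in B is higher than in A), with all_visitors read as the cumulative visitor count per group, gives values close to the tabulated ones for the rows checked here. A minimal sketch:

```python
import math
from statistics import NormalDist

def one_sided_p_value(orders_a, orders_b, visitors_per_group):
    """Pooled two-proportion z-test; alternative: rate B > rate A."""
    pooled = (orders_a + orders_b) / (2 * visitors_per_group)
    se = math.sqrt(pooled * (1 - pooled) * 2 / visitors_per_group)
    z = (orders_b - orders_a) / visitors_per_group / se
    return 1.0 - NormalDist().cdf(z)

# (day, cumulative visitors per group, orders A, orders B), copied from the table.
table_rows = [
    (1, 194, 32, 34),
    (5, 953, 185, 210),
    (6, 1154, 244, 300),
    (18, 3590, 757, 843),
]
p_values = {}
for day, visitors, a, b in table_rows:
    p_values[day] = one_sided_p_value(a, b, visitors)
    print(f"day {day:2d}: p = {p_values[day]:.3f}")
```

The drop between day 5 and day 6 comes entirely from the sale-day orders being added to the cumulative counts.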
Real-world data
Even when the true effect is null, the measured effect size will fluctuate around some non-zero value.
Temporal variability
• Instability in effect size.
• Depending on the A/B test, the long term instability of effect size can be debilitating after roll-out.
• But can also provide a source of insight.
Dilution
• Users who can see treatment are a small fraction of population.
• Create a counterfactual cohort from control group for correct measurement.
http://www.infoq.com/presentations/ab-testing-pinterest
Dilution
• Treatment is only made available in the ‘Variant’ group, and only a small fraction of that group actually goes through the treatment.
• Can use Instrumental Variables to scale the overall effect
Instrumental variables
• Bin = Instrument (Z)
• Instrument is able to “predict” actual treatment (T)
• Two-stage linear regression
– 𝑌 = 𝛼 + 𝛽𝑇 + 𝜖 (ideal equation)
– 𝑌 = 𝜌 + 𝜎𝑍 + 𝜃 (measured values at bin level)
– 𝑇 = 𝛾 + 𝛿𝑍 + 𝜀 (isolating treatment effect)
– 𝑌 = 𝛼 + 𝛽(𝛾 + 𝛿𝑍) + 𝜖 (re-stated ideal equation, with the fitted 𝑇 substituted in)
– 𝛽 = 𝜎/𝛿 (matching the coefficients of 𝑍 gives 𝜎 = 𝛽𝛿)
• When treatment is only possible in one group, 𝛿 is a proportion and the effect size of ‘true’ treatment vs. control is the effect at the bin level scaled by 1/𝛿.
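With a binary instrument, the regression slopes 𝜎 and 𝛿 reduce to differences in group means, so 𝛽 = 𝜎/𝛿 can be computed directly. The sketch below assumes per-user lists for Y, Z and T; the function name and the 20-user example data are made up for illustration.

```python
from statistics import mean

def wald_estimate(outcome, in_variant_bin, treated):
    """Effect of the actual treatment, estimated as sigma / delta.

    outcome:        Y, one value per user (e.g. 1 if converted, else 0)
    in_variant_bin: Z, 1 if the user was bucketed into the variant bin
    treated:        T, 1 if the user actually went through the treatment
    """
    y_in = [y for y, z in zip(outcome, in_variant_bin) if z == 1]
    y_out = [y for y, z in zip(outcome, in_variant_bin) if z == 0]
    t_in = [t for t, z in zip(treated, in_variant_bin) if z == 1]
    t_out = [t for t, z in zip(treated, in_variant_bin) if z == 0]
    sigma = mean(y_in) - mean(y_out)  # bin-level effect (slope of Y on Z)
    delta = mean(t_in) - mean(t_out)  # adoption difference (slope of T on Z)
    return sigma / delta

# Made-up example: 10 users per bin, 2 of the 10 variant users adopt.
Z = [1] * 10 + [0] * 10
T = [1, 1] + [0] * 8 + [0] * 10
Y = [1, 1, 1, 1] + [0] * 6 + [1, 1, 1] + [0] * 7
beta = wald_estimate(Y, Z, T)
print(f"bin-level effect scaled by 1/delta: {beta:.2f}")
```

In this example the bin-level difference is 0.4 - 0.3 = 0.1 and adoption is 0.2, so the scaled effect is 0.1 / 0.2 = 0.5.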
Simulate beforehand
• Know that adoption is important to see the difference at the level of Treatment (In) vs. Control (Out).
• Estimate how long it will take to see a statistically significant effect given: 1000 digitised garments, 10% adoption and a higher conversion rate for engaged users.
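A rough way to do this kind of estimate is a Monte Carlo power simulation. The sketch below is an assumption-laden illustration: the base and engaged conversion rates (0.05 and 0.15) are hypothetical, only the 10% adoption figure comes from the slide, and significance is judged with a one-sided pooled two-proportion z-test.

```python
import math
import random
from statistics import NormalDist

def simulated_power(n_per_group, base_rate, engaged_rate, adoption,
                    alpha=0.05, n_sims=100, seed=0):
    """Fraction of simulated tests in which In beats Out at level alpha."""
    rng = random.Random(seed)
    # Only a fraction `adoption` of the In group converts at the engaged rate.
    diluted_rate = (1 - adoption) * base_rate + adoption * engaged_rate
    z_crit = NormalDist().inv_cdf(1 - alpha)
    hits = 0
    for _ in range(n_sims):
        orders_out = sum(rng.random() < base_rate for _ in range(n_per_group))
        orders_in = sum(rng.random() < diluted_rate for _ in range(n_per_group))
        pooled = (orders_in + orders_out) / (2 * n_per_group)
        se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
        if se > 0 and (orders_in - orders_out) / n_per_group / se > z_crit:
            hits += 1
    return hits / n_sims

# Hypothetical rates; 10% adoption as on the slide.
power_small = simulated_power(1_000, base_rate=0.05, engaged_rate=0.15, adoption=0.10)
power_large = simulated_power(10_000, base_rate=0.05, engaged_rate=0.15, adoption=0.10)
print(f"power with 1,000 users per group:  {power_small:.2f}")
print(f"power with 10,000 users per group: {power_large:.2f}")
```

Dividing the number of users needed for acceptable power by the expected daily traffic gives a rough test duration.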
Bucketing bias
• Non-representative population
– Bugs in bucketing
– Bugs where the treatment does not show up on certain devices
• Run A/A test
• Examine cohorts carefully
Bucketing bias
• Asymmetric bucketing
– Smaller group gets values from high density regions
– Larger group gets the full range
– Smaller group becomes a non-representative sample
• Run A/A test
• Symmetric bucketing
Other analyses of A/B tests
Bootstrapping
• No assumptions about data distribution
• Can calculate any metric
• Explicitly performs the assumptions of hypothesis testing, so it is easier to explain
• Can also be interpreted from a Bayesian perspective
Bootstrapping
http://www.texample.net/tikz/examples/bootstrap-resampling/
Bootstrapping
• Resultant Gaussian distribution of metric
• Need to check convergence of bootstrap samples.
• Can be used to get a distribution of differences.
http://rosetta.ahmedmoustafa.io/bootstrap/
Bayesian models
• Explicit mechanistic modelling of important
parameters.
• Answers the question: “What is the range of conversion rates that could result in the observed data?”
– Use the knowledge that conversions can be modelled with a binomial distribution and that the parameter space to be explored is p.
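Exploring the parameter space of p can be sketched with a grid approximation: evaluate the binomial likelihood at many candidate values of p and normalise. The uniform prior, the grid size and the function names are assumptions; the counts are the day-18 row of the Spikes table.

```python
import math

def grid_posterior(orders, visitors, grid_size=2000):
    """Posterior over conversion rate p on a grid, assuming a uniform prior."""
    # Midpoints of equal-width cells, so p is never exactly 0 or 1.
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    log_like = [orders * math.log(p) + (visitors - orders) * math.log(1 - p)
                for p in grid]
    peak = max(log_like)  # subtract the peak to avoid numerical underflow
    weights = [math.exp(v - peak) for v in log_like]
    total = sum(weights)
    return grid, [w / total for w in weights]

def credible_interval(grid, posterior, level=0.95):
    """Central credible interval from the cumulative posterior mass."""
    lower_tail = (1 - level) / 2
    cumulative, lower, upper = 0.0, grid[0], grid[-1]
    found_lower = False
    for p, w in zip(grid, posterior):
        cumulative += w
        if not found_lower and cumulative >= lower_tail:
            lower, found_lower = p, True
        if cumulative >= 1 - lower_tail:
            upper = p
            break
    return lower, upper

def prob_b_beats_a(post_a, post_b):
    """P(p_B > p_A) for two posteriors on the same grid (ties ignored)."""
    total, cdf_a = 0.0, 0.0
    for weight_a, weight_b in zip(post_a, post_b):
        total += weight_b * cdf_a  # posterior mass of A strictly below this p
        cdf_a += weight_a
    return total

grid, post_a = grid_posterior(orders=757, visitors=3590)
_, post_b = grid_posterior(orders=843, visitors=3590)
mean_a = sum(p * w for p, w in zip(grid, post_a))
low_a, high_a = credible_interval(grid, post_a)
p_b_better = prob_b_beats_a(post_a, post_b)
print(f"Group A: mean {mean_a:.3f}, 95% interval [{low_a:.3f}, {high_a:.3f}]")
print(f"P(conversion B > conversion A) = {p_b_better:.3f}")
```

With a uniform prior the exact posterior is a Beta distribution, so the grid is only a transparent stand-in that makes the "range of conversion rates" explicit.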
Conclusions
• How to analyse an A/B test
• Understanding different problems, with some solutions highlighted
• Techniques that allow for understanding mechanisms
