LivePerson Developers is proud to host a meetup about A/B testing by Shlomo Lahav, Chief Scientist at LivePerson.
The lecture will focus on testing and the ability to draw conclusions, especially on the web.
- What is an A/B test?
- How to construct an A/B test properly?
- What are the metrics that can be used?
- Can the results be misleading?
- Errors: bias and statistical errors
- First and second type errors
- Measuring lift: why lift is a biased measure
- Is it possible to change the test settings during the test?
- How to run multivariate testing effectively?
4. Possible solutions
• A model that describes the results and evaluates the marginal effect of the alternatives
• Test the alternatives side by side while all the rest is equal
5. Example
• The problem: testing two different layouts of a web page (A and B)
• Population: visitors/visits
• Performance: conversion rate
• Alternatives: two different layouts
• Objective: to find the better layout and assess the performance difference
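The setup on this slide can be sketched as a small simulation. The conversion rates, visit count, and seed below are hypothetical, chosen only to resemble the example, not taken from the talk:

```python
import random

def run_ab_test(p_a, p_b, n_visits, seed=0):
    """Simulate an A/B test: each visit is randomly allocated to
    layout A or B and converts with that layout's true rate."""
    rng = random.Random(seed)
    counts = {"A": 0, "B": 0}       # visits per alternative
    conversions = {"A": 0, "B": 0}  # conversions per alternative
    for _ in range(n_visits):
        arm = "A" if rng.random() < 0.5 else "B"  # fair 50/50 split
        counts[arm] += 1
        p = p_a if arm == "A" else p_b
        if rng.random() < p:
            conversions[arm] += 1
    # Observed conversion rate per alternative
    return {arm: conversions[arm] / counts[arm] for arm in counts}

rates = run_ab_test(p_a=0.09, p_b=0.12, n_visits=20_000)
```

With enough visits, the observed rates land near the true rates, but never exactly on them; that sampling noise is what the rest of the talk is about.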
6. What does "all the rest being equal" mean?
• Fairness: for every member of the population, the probability of being allocated to A is the same.
• For each member, any other decision is independent of the test allocation (A/B).
• Observations are independent.
7. Population: Visitor vs. visit

Population | Measurement                      | Issues
Visitor    | Visit conversion rate            | Independence between observations is violated
Visitor    | Lifetime conversions per visitor |
Visit      | Visit conversion rate            | A visitor may be exposed to both A and B (in different visits)
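A common way to avoid the last issue, a visitor seeing both A and B across visits, is deterministic hash-based allocation. This is a general technique, not necessarily what the talk proposes; the function name and split parameter are illustrative:

```python
import hashlib

def allocate(visitor_id: str, share_a: float = 0.5) -> str:
    """Deterministically allocate a visitor so that repeat visits
    always see the same layout (no mixing of A and B per visitor)."""
    digest = hashlib.sha256(visitor_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "A" if bucket < share_a else "B"

# The same visitor gets the same arm on every visit
assert allocate("visitor-42") == allocate("visitor-42")
```

Because the hash is effectively uniform, fairness is preserved across the population while each individual's allocation stays fixed.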
8. Errors
• When we compare a test alternative to the control alternative:
• False positive – declaring the test alternative the winner by mistake
• False negative – declaring the control the winner by mistake
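These two error types are usually controlled jointly when sizing the test. A standard two-proportion sample-size approximation can make the trade-off concrete; this is textbook material, not from the talk, and the baseline rate and lift below are hypothetical:

```python
from statistics import NormalDist

def required_sample_size(p_base, lift_abs, alpha=0.05, power=0.8):
    """Approximate visits per arm so the false-positive rate is alpha
    and the false-negative rate is 1 - power, when detecting an
    absolute difference lift_abs over baseline rate p_base."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # controls false positives
    z_b = NormalDist().inv_cdf(power)          # controls false negatives
    p_avg = p_base + lift_abs / 2
    var = 2 * p_avg * (1 - p_avg)              # pooled-variance shortcut
    return int((z_a + z_b) ** 2 * var / lift_abs ** 2) + 1

n = required_sample_size(p_base=0.10, lift_abs=0.02)
```

Tightening either error rate (smaller alpha, higher power) pushes the required sample size up; the two errors cannot both be driven to zero with finite traffic.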
9. When do we end the test?
• After a predefined period/number of observations.
• When the difference is significant.
11. Example
• We want to test two alternatives and select the better one.
• The results are: CR(A)=9.21%, CR(B)=11.93%. The win of B is statistically significant (p-value<5%).
• We need to estimate the gain of B vs. A.
• Is our estimate of 2.72% a fair estimate?
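The significance claim can be checked with a standard two-proportion z-test. The slide gives only the rates, so the sample sizes below (10,000 visits per arm) are hypothetical, chosen to reproduce CR(A)=9.21% and CR(B)=11.93% exactly:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test for CR(B) - CR(A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, p_value

# Hypothetical counts matching the slide's rates
diff, p = two_proportion_z_test(conv_a=921, n_a=10_000,
                                conv_b=1193, n_b=10_000)
```

The test answers "is there a difference?"; whether 2.72% is a fair *estimate* of that difference is a separate question, which the next slide takes up.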
13. Selection bias
• An A/B test is conducted between A1, A2, …, An.
• After the test is completed, we select Ak.
• Should we expect Ak to perform as it did during the test?
• Does the test outcome (the rank of k) affect our expectation?
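A small Monte Carlo makes the bias visible. All arms below share the same true rate, yet the arm we select for having the best *observed* rate still looks better than 10%; every number in this sketch is hypothetical:

```python
import random

def winners_observed_rate(n_arms=5, visits=500, reps=500,
                          p_true=0.10, seed=1):
    """All arms have identical true rates; each replication we still
    pick the best observed arm and average its observed rate."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        observed = []
        for _ in range(n_arms):
            conversions = sum(rng.random() < p_true for _ in range(visits))
            observed.append(conversions / visits)
        total += max(observed)  # rate of the selected "winner"
    return total / reps

winner_rate = winners_observed_rate()
# winner_rate exceeds the true rate of 0.10: selection bias
```

The winner was partly selected *because* its noise happened to be favorable, so its in-test performance overstates what it will do after the test.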
14. What else can go wrong?
• Independence is not maintained (traffic, changes, etc.)
• Fairness is handled by random allocation, which can be biased due to chance.
• The significance level is usually higher than planned (continuous evaluation), which results in a higher false-positive rate.
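The last point ("peeking") can also be demonstrated by simulation. In an A/A setting with no real difference, checking significance after every batch and stopping at the first hit inflates the false-positive rate well above the nominal 5%; all parameters below are hypothetical:

```python
import random
from math import sqrt

def peeking_false_positive_rate(reps=200, batches=20, batch_size=200,
                                p=0.10, z_crit=1.96, seed=2):
    """A/A test: peek after every batch and stop at the first
    'significant' z score. Returns the observed false-positive rate."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(reps):
        conv_a = conv_b = n = 0
        for _ in range(batches):
            for _ in range(batch_size):
                n += 1
                if rng.random() < p:
                    conv_a += 1
                if rng.random() < p:
                    conv_b += 1
            p_a, p_b = conv_a / n, conv_b / n
            pool = (conv_a + conv_b) / (2 * n)
            se = sqrt(max(pool * (1 - pool) * (2 / n), 1e-12))
            if abs(p_b - p_a) / se > z_crit:
                false_positives += 1  # declared a winner that does not exist
                break
    return false_positives / reps

rate = peeking_false_positive_rate()
```

Each peek is a fresh chance to cross the 1.96 threshold by luck, so the more often we look, the more often a non-existent winner is declared.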
15. How to control the traffic split?
• By percentage or round robin?
• Can we change the split?
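The two splitting strategies on this slide can be sketched side by side; both functions are illustrative, not an implementation from the talk:

```python
import itertools
import random

def percentage_split(visits, share_a=0.5, seed=0):
    """Allocate each visit to A independently with probability share_a."""
    rng = random.Random(seed)
    return ["A" if rng.random() < share_a else "B" for _ in visits]

def round_robin_split(visits):
    """Alternate A, B, A, B, ... across visits (deterministic split)."""
    arms = itertools.cycle(["A", "B"])
    return [next(arms) for _ in visits]
```

Round robin guarantees an exact split but is predictable and can correlate with traffic patterns; a random percentage split only hits the target share in expectation, which is one source of the chance bias mentioned on the previous slide.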
16. Another example
• We need to test two design layouts in multiple locations, while each location has a different conversion rate.
• Different populations – use lifts and accumulate the lifts.
• How do we calculate the lift: A over B or B over A?
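The direction question matters because relative lift is not symmetric: B over A and A over B give different magnitudes. The rates below are hypothetical, chosen only to show the asymmetry:

```python
import math

def lift(p_new, p_base):
    """Relative lift of p_new over p_base."""
    return p_new / p_base - 1.0

cr_a, cr_b = 0.08, 0.10  # hypothetical conversion rates for one location

lift_b_over_a = lift(cr_b, cr_a)  # +25%
lift_a_over_b = lift(cr_a, cr_b)  # -20%: not the mirror of +25%
# Log-lift is symmetric: log(cr_b/cr_a) == -log(cr_a/cr_b)
log_lift = math.log(cr_b / cr_a)
```

One common workaround is accumulating log-lifts instead of raw lifts, since the logarithm treats the two directions symmetrically.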