Ab testing 101

Google’s infamous AB test: testing 41 variants of mildly different shades of blue

Longitudinal or pre-post testing is difficult since little variance is explained by product features. Other factors
impacting conversion are:
Price
Weekend/Weekday
Seasonality
Source of Traffic
Availability
Mix of users (distribution bias)
Clarity of product thinking & avoiding snowballing of incorrect insights
Why was conversion for new android better than older version for the first 3 days?
(Hint: Early adoptor bias- users with stable wifi, automated app upgrade cycle and loyal to app convert higher
than all users)
Why is AB Testing needed?

Choosing Alia Bhatt as brand ambassador
A recommended hotel on the top of the listing
Impact of a fix for latency
Increase sign-in rate by increasing the size of the login button
Impact of showing packing list as a notification a day before the flight date
Quiz: What can or cannot be AB tested
AB testing is for lower hanging fruits not quantum leaps: for those user testing,
interviews and FGDs as well as analysis of existing data are better.

Choosing Alia Bhatt as brand ambassador: No
A recommended hotel on the top of the listing: Yes
Impact of a fix for latency: Yes
Increase sign-in rate by increasing the size of the login button: Yes
Impact of showing packing list as a notification a day before the flight date: Tough, but theoretically yes
Quiz: What can or cannot be AB tested
AB testing is for lower hanging fruits not quantum leaps: for those user testing,
interviews and FGDs as well as analysis of existing data are better.

Key Stages of AB Testing
Hypothesis Definition
Metric Identification
Determining Size & Duration
Tooling & Distribution
Invariance Testing
Analyzing Results

Almost all AB experiment hypotheses should look something like below:
Eg. 1
H0 (Null/Control): A big login button will not impact user login percentage
H1 (Test): A big login button will significantly increase user login percentage
Eg: 2
H0 (Control): Putting higher user rating hotels at the top of the listing doesn’t change conversion
H1 (Test): Putting higher user rating hotels at the top of the listing changes conversion significantly
Good to articulate the hypothesis you’re testing in simple English at the start of the experiment. The
hypothesis should have a user verbiage and not a feature verbiage. It’s okay if you skip this too as long as
you get the idea.
Hypotheses Definition

Counts, eg.
#Shoppers
#Users buying
#Orders
Rates, eg.
Click through Rate
Search to Shopper Rate
Bounce Rate
Probability (a user completes a task), eg.
User Conversion in the funnel
Metric identification (1/2)

Consider the following metrics for conversion:
1. #Order/#Visits to listing page
2. #Visitors to TY Page/#Visitors to Listing Page
3. #Visits to TY Page/#Visits to listing page
4. #Orders/#PageViews of listing page
Metric identification (2/2): Quiz
1 2 3 4
User refreshes the listing page
User breaks the booking into 2
User’s TY page gets refreshed
User does a browser back and the page is served from cache
User drops off on details and comes back via drop-off
notification
Omniture is not firing properly on listing page

1. If showing a summary of hotel USPs on the details page is improving conversion?
2. If a user who purchased with MMT will come back again?
3. If we are sending too many or too few notifications to users?
How can you measure?

1 .If showing a summary of hotel USPs on the details page is improving conversion?
A simple A/B set-up with and without the feature will help in evaluation
2. If a user who purchased with MMT will come back again?
A. An secondary metric captured by asking buyers this question or an NPS survey and comparing results
should give some idea
3. If we are sending too many or too few notifications to users?
A. An indirect metric measured as retained users on the app across the two variants
How can you measure?

Size & Duration
Reality Test Output Error
Control is better Control is better 1- α (confidence level)
Control is better Test is better α (significance)
Test is better Test is better 1-β (power)
Test is better Control is better β
α or type-I error is the probability of rejecting null when it is true (Downside Error)
β or type-II error is the probability of accepting null when control is better (Opportunity Cost Error)
Target values to test significance is at α = 5% and 1-β=80%

Size & Duration
Size:
• To figure out the size of the samples required to get the 80% power for the test, here
• These many users need to be targeted with the smallest of the test variant being examined
Duration:
• Is an outcome of what % of traffic can you direct to the test + some minimum duration considerations
• You might want to limit the %age exposure of the experiment due to:
• Revenue impacts
• Leaving room for other people to experiment
• Even if the sample size for the required power can be reached in a shorter duration good to reduce the exposure of
the experiment to include:
• At-least 1 weekend/weekdays
• low & high discounting periods (if possible)
• Low & high availability periods (if possible)

No Peeking
• It is important to not reduce power of the test by changing decision with insufficient data
• Best explained in the blog. Primary idea being that taking duration clues from early data introduces human error in
the measurement
• In-case the sample size is turning out to be very high, a few ways to reduce it are:
• Use this sequential sampling approach (reduces size by as high as 50% in some scenarios)
• Use this Bayesian sampling approach (mathematically intensive)
• Try matching the lowest unit of measurement with lowest unit of distribution (eg instead of measuring
latency/user measure latency per hit and distribute the experiment on hit)
• Try moving the experiment allocation closer to the step where there is an actual change (eg assign payment
experiment to payment page users)

Distribution Metric
1. Page Views
2. Cookies
3. Login-ID
4. Device ID
5. IP Address
Tooling & Distribution (1/2)
Which will not be hampered by the following 1 2 3 4 5
User shortlists 2-3 hotels and comes back after a day
User starts search on mobile and books on desktop
User changes browsers on the machine
User logs out and continues with another ID

Typical requirements for an AB system are:
Each experiment should support multiple variants (A/B/C..) and each variant can be defined using a combination of
experiment variables
Each user is randomly assigned a variant (as per the distribution percentage). System ensures users are served a
consistent experience basis their device ID or cookie (other distribution parameters like page view or visit might be
used but cookie/device-id is the most stable)
Auto-logs the variant that the users are being exposed to in an analytics system
There are multiple AB testing systems available by several vendors or one can be easily created internally using a tag
manager like Google tags
Tooling & Distribution (2/2)

A/A Testing:
Ideally, it is good to run 1 or many A/A test to measure the same metric you’re planning to measure in A/B tests before
and after your test period
Even if the above is not feasible, do try to run A/A test regularly to test the underlying system
Things to test during A/A Tests:
Key metrics you measure (like conversion, counts, page-views, etc) and their statistical difference between the
two cohorts at different ratios of test & control
A/A & Invariance Testing

Invariance Testing
Identify Invariance metrics- metrics that should not change between control & experiment
One of the basic metrics that will be the invariant will be the count of the users assigned to each group. Very
important to test these
Each of the invariants should be within statistical bounds between population and control
A/A & Invariance Testing

1. Remember the threshold practical significance threshold used in sample size calculator. That is going to be
the least change that we care about, so a statistically significant change < the practical significance
threshold is useless.
2. Choose the distribution & test:
1. Counts: poisson distribution or poisson-mean
2. Rates: poisson distribution or possison-mean
3. Click-through-probability: binomial distribution & t-test (or chi-square test).
Analyzing Results (1/3)

Analyzing Results (2/3): Taking
Decision Launch
Don’t Launch or
Keep Testing

Analyzing Results (2/3): Taking
Decision Launch
Don’t Launch or
Keep Testing
Yes
No Keep Testing
No Don’t Launch
No Keep Testing

Analyzing Results (3/3): Taking Decision

A/B/C Setup
A particular type of experiment set-up that is beneficial where there might be server & client side affects that
introduce bias. A few examples
Measure impact of persuasion shown (say last room left)
User might be positively impacted to convert higher, v/s
Higher latency to fetch persuasion might reduce conversion
Showing a message “Cheaper than Rajdhani” on flights > 75 mins duration and fare <3000
User might be positively impacted to convert, v/s
Conversion for cheaper flight (<3000) is generally higher
Showing a USP of the hotel generated from user reviews, eg. guests love this because: “great neighborhood to
stay”
User might be positively impacted to convert, v/s
Feature might only be visible on hotels with > X reviews (and hence bookings). There is an innate hotel bias.
In these scenarios, it is best to setup 3 variants:
A= Feature Off or Control
B= Feature On but not shown to users
C= Feature on but shown to users.
A/B/C Setup

AB testing in an organization typically goes through the following stages:
Would encourage you all to help your organization move to the
next stage in the AB testing journey
Recommended to reach a state where the company culture supports quick prototyping and testing with real
users
Maintain high standards of experiment analysis and responsible reporting
Things to Improve
Sanity Checks
Testing for
conflict
resolution
Testing for
impact
measurement
Testing for
hypothesis
Rapid
prototyping &
testing

Definitely read the Evan Miller blog. It basically summarizes everything you need to know.
If keen on getting in more detail of techniques and best practices, take the course on Udacity. Just doing the first chapter
would be good enough
Further Reading

Ab testing 101

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Ab testing 101

Ähnlich wie Ab testing 101 (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Ab testing 101