We have been trained and encouraged to focus on p-values and statistical significance in every aspect of testing, from PPC to CRO. In this talk, Richard is going to challenge your preconceptions, show how scientific accuracy isn't necessarily the same as commercial success, and demonstrate strategies that are better than waiting for your variation to be declared a winner by your testing platforms. The way you approach data-driven decision-making will never be the same.
22. The game:
1. Players can add new adverts, pause adverts and reactivate adverts each turn
2. In between turns, active adverts get impressions and clicks
3. http://www.eanalytica.com/ad-testing/
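The rules above can be sketched as a tiny simulation. This is only my reading of steps 1–2, not the talk's actual game code (see the notebook link): the `Advert` class, the even impression split, and the example CTRs are all assumptions.

```python
import random

class Advert:
    """One advert: the player sees only observed impressions and clicks,
    never the hidden true CTR."""
    def __init__(self, true_ctr):
        self.true_ctr = true_ctr   # hidden from the player
        self.impressions = 0
        self.clicks = 0
        self.active = True

def run_turn(adverts, impressions_per_turn=1000, rng=random):
    """Step 2 of the game: split the turn's impressions evenly between the
    active adverts and draw clicks from each advert's hidden CTR."""
    active = [ad for ad in adverts if ad.active]
    if not active:
        return
    share = impressions_per_turn // len(active)
    for ad in active:
        ad.impressions += share
        # each impression is an independent Bernoulli trial
        ad.clicks += sum(rng.random() < ad.true_ctr for _ in range(share))

ads = [Advert(0.02), Advert(0.05)]
run_turn(ads)
```

A strategy then plays by pausing/adding adverts between calls to `run_turn`.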
23. Things that might be important that are not modelled:
1. That you might be an ad writing or landing page genius
2. Getting it wrong might lead you down a blind alley
24. This is useful because it is way simpler than real life
But maybe it is too simple and misses something important?
25. Different strategies:
1. Do nothing
2. Pause a random advert and add a new advert
3. Cheat
4. Run a chi-squared test
5. Run a g-test
6. Just pick the advert with the best observed CTR
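Strategies 4 and 5 only need the observed click/no-click counts for two adverts. A stdlib-only sketch of both tests on made-up counts (in practice you would likely reach for `scipy.stats.chi2_contingency`, which also computes the g-test via `lambda_="log-likelihood"`):

```python
import math

def two_ctr_tests(clicks_a, imps_a, clicks_b, imps_b):
    """Pearson chi-squared test and g-test (both df=1, no continuity
    correction) comparing the CTRs of two adverts.
    Returns (chi2_p_value, g_p_value)."""
    table = [[clicks_a, imps_a - clicks_a],
             [clicks_b, imps_b - clicks_b]]
    total = imps_a + imps_b
    chi2 = g = 0.0
    for j in range(2):
        col_total = table[0][j] + table[1][j]
        for row in table:
            expected = sum(row) * col_total / total
            chi2 += (row[j] - expected) ** 2 / expected
            g += 2 * row[j] * math.log(row[j] / expected)
    # with one degree of freedom, the chi-squared survival function
    # reduces to the complementary error function
    sf = lambda stat: math.erfc(math.sqrt(stat / 2))
    return sf(chi2), sf(g)

# hypothetical counts: 30 vs 50 clicks from 1000 impressions each
p_chi2, p_g = two_ctr_tests(30, 1000, 50, 1000)
```

Both p-values land around 0.02 here, so the two tests usually agree; the interesting question in the game is whether acting on them beats simpler strategies.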
31. I was really surprised by this and had to go back and double-check my code
32. A very quick introduction to multi-armed bandits
33. Explore vs Exploit
You can exploit what appears to be the best option at the time
Or you can explore to see if something else is a better option
34. There are lots of clever ways of optimising this balance.
A simple way: explore X% of the time, and exploit the best option the remaining (100-X)% of the time.
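The X% / (100-X)% rule described here is usually called epsilon-greedy. A minimal sketch, assuming each advert is a plain dict of observed counts (the field names are my own, not from the talk's code):

```python
import random

def epsilon_greedy(adverts, epsilon=0.1, rng=random):
    """Return the advert that should get the next impression.
    With probability epsilon, explore (pick uniformly at random);
    otherwise, exploit the advert with the best observed CTR."""
    if rng.random() < epsilon:
        return rng.choice(adverts)
    def observed_ctr(ad):
        # unseen adverts get +inf so each gets tried at least once
        return ad["clicks"] / ad["impressions"] if ad["impressions"] else float("inf")
    return max(adverts, key=observed_ctr)

# two adverts with made-up observed stats
ads = [{"impressions": 100, "clicks": 2},
       {"impressions": 100, "clicks": 5}]
best = epsilon_greedy(ads, epsilon=0.0)   # epsilon=0: pure exploitation
```

With epsilon=0 this degenerates into the "pick the best observed CTR" strategy; raising epsilon keeps some traffic flowing to the other variations.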
36. And we can use this to crank up the number of variations in the test!
38. All of this was done with 1000 impressions per week, i.e. 1000 impressions split between the variations before running any test.
This isn’t a huge number, but budgets this small (or smaller) are common.
41. All this is based on the idea of continuous testing, where creating new variations is cheap.
This is mostly true for PPC text ads.
If it is true for landing pages and conversion rate optimisation, then your organisation is doing very well!
42. Suppose you have a fixed window for testing, after which the winning variation will be around for a long time
43. Then the most important thing is to end up with the best-performing variation at the end.
This is not the same as getting the best performance during the test.
45. Just picking the best-performing variation is again a strong strategy
46. But if we imagine that creating new variations is costly, then maybe it isn’t so good.
47. For a 10-week test, the “Pick Best” strategy requires 11 different test variations.
Other methods get nearly the same end result with 8 or fewer test variations.
48. New game:
There is no limit to how long tests can run
There is a limit to how many new variations can be used
65. Areas for further investigation:
1. Changing click through rates
2. Weekly seasonality
3. Cost of creating new variations
66. Key Actions
Don’t worry too much about statistical significance testing
Do worry about how you can generate and deploy more test variations
Think more about decision theory than trying to mimic what a scientist does
67. Game code and example strategies:
http://www.eanalytica.com/SearchLove-Notebook/