A/B testing is a great technique for experimenting with changes to your product. At Etsy we make extensive use of A/B tests to try out ideas; we’ve got 30+ running right now. Although the concept is simple, the execution is a bit trickier than you’d think. In this talk I will cover the common, and a few of the not-so-common, mistakes that can skew your results.
2. - Hi, I’m Corey, from Etsy (@coreyloose)
- Marketplace where people around the world connect to buy and sell unique goods (not all that different from the art fair going on right now)
- We like to run a lot of A/B tests
4. - Have a theory on something that will make your product better
- Show it to a random subset of visitors (but keep it consistent): “bucketing”
- Try both for a bit and see which one does better
- Not only does this test if your idea is good, it also tests your implementation and all sorts of complex interactions
- Would this one cause an increased error rate in variation selection?
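A minimal sketch of "random but consistent" bucketing, assuming each visitor has a stable ID. Hashing the ID is one common way to do this and is not necessarily how Etsy's tooling works; the function name, the experiment name, and the weights are hypothetical.

```python
import hashlib

def bucket(visitor_id, experiment, weights):
    """Pick a variant for a visitor; the same inputs always give the same answer.

    `weights` maps variant name -> share of traffic and is assumed to sum to 1.0.
    """
    # Hash experiment + visitor so one visitor can land in different
    # buckets for different experiments.
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    # First 8 hex characters -> integer in [0, 2**32), scaled to [0, 1).
    point = int(digest[:8], 16) / 2**32
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return variant
    # Fallback for float rounding when the weights sum to slightly under 1.0.
    return list(weights)[-1]

weights = {"control": 0.5, "new_design": 0.5}  # hypothetical experiment
print(bucket("visitor-123", "bigger_buttons", weights))
```

Because the variant depends only on the visitor ID and the experiment name, a returning visitor keeps seeing the same version without any server-side state.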
5. - As I just explained it, A/B testing sounds simple + awesome
- And it is, but as always the devil is in the details
- I’m going to tell a bunch of stories of stuff that we did wrong, not to be negative but it’s just more interesting than spraying champagne around
- Let’s start with a really common no-no
7. - Alluring because it doesn’t require you to have rich metric gathering or bucketing
- You’re going to need some tooling
- We built Feature and Catapult
8. - (only code in the presentation)
- Plenty of other options out there, but we’re happy with this
- Open source
- Easy enough that PMs can change experiment weights
- Uses a cookie to ensure the user experience stays consistent
- You’ll need your own logging to do analysis
9. - Internal tool that does data analysis of A/B tests based on data processing from feature event logs
- For this experiment: more pages but fewer add-to-carts
- No statistical significance for conversion rate
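To make "no statistical significance for conversion rate" concrete, here is a sketch of a two-proportion z-test, one standard way to compare conversion rates. It is an illustration only, not necessarily what Etsy's internal tool computes, and the counts are made up.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions) in both groups
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: 10.0% vs 10.5% conversion on 1,000 visitors each.
p = two_proportion_p_value(100, 1000, 105, 1000)
print(f"p-value: {p:.2f}")  # well above 0.05, so not significant
```

A half-point difference on a thousand visitors per variant is indistinguishable from noise, which is why the amount of traffic matters so much.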
10. - A bit sobering, but you gotta have a lot of traffic, or make a big change, to do this
11. - Written by an Etsy alumnus
- To detect a small change you need a lot of time
12. - The good news is if you can make a bigger effect, it gets much easier to detect (1% => 5%)
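A rough sketch of why effect size matters so much, using the textbook sample-size approximation for comparing two proportions. The 10% baseline conversion rate, the reading of "1% => 5%" as relative lifts, and the alpha and power defaults are all assumptions for illustration; this is not the calculator shown in the talk.

```python
from math import ceil
from statistics import NormalDist

def visitors_per_variant(baseline, relative_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed in each variant to detect a relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

small = visitors_per_variant(0.10, 0.01)  # detect a 1% relative lift
large = visitors_per_variant(0.10, 0.05)  # detect a 5% relative lift
print(small, large, round(small / large, 1))
```

The required sample grows with the inverse square of the effect, so under these assumptions a 5x bigger effect needs roughly 25x fewer visitors.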
13. - Have a hypothesis going in, no fishing (let’s just pump some people full of this new chemical)
- Let’s get into some more interesting failures
14. - Going to tell a few stories about a first type of failure
- Mechanical
15. - All users get bucketed but only Australian users are eligible for an experiment
16. - This is what really happens, since the rest of the world isn’t eligible
- Going to under-represent any effects
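The dilution can be worked through with a little arithmetic. In this sketch the eligible share, the conversion rates, and the size of the true lift are invented numbers, chosen only to show the shape of the problem.

```python
def diluted_lift(eligible_share, rate_eligible, rate_other, true_lift):
    """Relative lift you would measure if ineligible users are bucketed too.

    Only the eligible share of the treatment group actually sees the change;
    everyone else behaves like control but still counts in the totals.
    """
    other_share = 1 - eligible_share
    control = eligible_share * rate_eligible + other_share * rate_other
    treatment = (eligible_share * rate_eligible * (1 + true_lift)
                 + other_share * rate_other)
    return treatment / control - 1

# Hypothetical: 2% of bucketed users are eligible, true lift among them is 10%.
print(f"{diluted_lift(0.02, 0.05, 0.05, 0.10):.2%}")  # measured lift
```

With equal baseline rates the measured lift is just the true lift times the eligible share, so bucketing (or at least analyzing) only eligible users keeps the effect visible.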
18. - If your experiment causes the page to be a lot bigger, weirdness can happen
- Page loads slower
19. - This ensures the user actually saw the page + we have access to more information
20. - Slow network speed on mobile
- The combo led to experiments being under-reported
- Noticed because experiment group would appear to have far fewer people in it
- Lesson: Watch page weight
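A lopsided group size like this can be caught automatically. Below is a sketch of a sample ratio mismatch check using a chi-square goodness-of-fit test with one degree of freedom; the counts and the intended 50/50 split are assumptions for illustration.

```python
from math import erfc, sqrt

def srm_p_value(control_n, treatment_n, expected_treatment_share=0.5):
    """p-value for 'the observed group sizes match the intended split'."""
    total = control_n + treatment_n
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((treatment_n - expected_treatment) ** 2 / expected_treatment
            + (control_n - expected_control) ** 2 / expected_control)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))

# Hypothetical: the heavier treatment page logs 10% fewer visitors than control.
print(srm_p_value(5000, 4500) < 0.001)  # True: the split is suspicious
```

A tiny p-value here says nothing about which variant is better; it says the logging or bucketing is broken and the other metrics cannot be trusted yet.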
21. - We don’t support IE7
- We ran an experiment once that looked like this in IE7
- Was still enough traffic to tank experiment
- Lesson: Slice by user groups in the analysis
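Slicing can be as simple as grouping the logged events by segment and variant before computing rates. A sketch, assuming each logged event is a dict with hypothetical `segment`, `variant`, and `converted` fields:

```python
from collections import defaultdict

def conversion_by_segment(events):
    """Return {(segment, variant): conversion rate} for a list of event dicts."""
    totals = defaultdict(lambda: [0, 0])  # key -> [visitors, conversions]
    for event in events:
        key = (event["segment"], event["variant"])
        totals[key][0] += 1
        totals[key][1] += int(event["converted"])
    return {key: conversions / visitors
            for key, (visitors, conversions) in totals.items()}

# Made-up events: the treatment is broken in one browser only.
events = [
    {"segment": "chrome", "variant": "control", "converted": True},
    {"segment": "chrome", "variant": "treatment", "converted": True},
    {"segment": "ie7", "variant": "control", "converted": True},
    {"segment": "ie7", "variant": "treatment", "converted": False},
]
for key, rate in sorted(conversion_by_segment(events).items()):
    print(key, rate)
```

A segment where the treatment collapses while every other segment looks fine points to a rendering or compatibility bug rather than a bad idea.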
22. - (HAL 9000)
- Ran an experiment on our activity feed, small %
- All the metrics tanked
- Turned out a bot we have to monitor page times was bucketed in
- Lesson: make your A/B tooling ignore your bots
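One way to keep bots out is to check for them before bucketing at all. This is only a sketch: the user-agent markers are hypothetical, substring matching is crude, and a real setup would more likely key off the known identity of your own monitoring traffic.

```python
# Hypothetical markers; replace with whatever identifies your own bots.
KNOWN_BOT_MARKERS = ("bot", "crawler", "spider", "monitor")

def is_bot(user_agent):
    """Crude check: does the user agent contain a known bot marker?"""
    ua = user_agent.lower()
    return any(marker in ua for marker in KNOWN_BOT_MARKERS)

def assign_variant(user_agent, pick_variant):
    """Return a variant for humans, or None so bots are never bucketed or logged."""
    if is_bot(user_agent):
        return None
    return pick_variant()

print(assign_variant("InternalPageTimeMonitor/1.0", lambda: "treatment"))  # None
print(assign_variant("Mozilla/5.0 (Windows NT 10.0)", lambda: "treatment"))
```

Returning `None` rather than silently putting bots in control matters, because a bot in either group still distorts the per-group metrics.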
23. - Previous stories were mechanical, but the real power of A/B testing is seeing how your idea interacts with the world
24. - Implemented as a monolithic release
- A/B test kept as a hurdle at the end
26. - It failed terribly, purchases down over 20%
- Since we built it all at once, we had nothing to pin it on
- What if we had tested something simple first: are more items better? 40 v. 80 items on a page
- Lesson: test ideas in isolation
27. - Here’s a story about an A/B test telling us something our product intuition didn’t
- Seems like an obvious, simple win
- Logins are way down
- Turns out average users use way worse passwords than employees
- Ended up being a no-go for other reasons
- Lesson: unintended consequences
28. - You can’t measure everything that matters
- Can iron out the mechanical issues
- Can run tightly scoped tests that allow you to make confident decisions
- What if you asked ½ of the people you met for the rest of the day for $1?
- You’d end up with more money
29. - That’s what you’re doing with this
- If you A/B test it, you’ll get more signups + probably better time-on-page
- Maybe a few more bounces
- But goodwill & brand impression are hard to measure