1. When in doubt, go live
Techniques for decision making
based on real user behavior
© 2020 ThoughtWorks
Irene Torres
Klaus Fleerkötter
2. You save time and make better decisions
by establishing shorter feedback loops
from feature idea to feature usage.
3. Who’s talking?
Irene Torres: Developer @ TW, PhD Neuroscience (science perspective)
Klaus Fleerkötter: Developer @ TW, Information Systems (techie perspective)
4. What is this talk about?
Specific use cases that worked for us: tech & research.
And what is it not...
● Extensive coverage of user research
● Software testing
5. Examples
One of Germany’s biggest online retailers
Top 5 highest-traffic e-commerce sites in Germany
Orders: up to 10 per second
Qualified visits: avg. 1.6 million per day
11. Services that can be built independently by cross-functional
teams that are structured around business domains
(Cross-functional roles: Dev, PO, QA, Ops, UX, DA)
12. The Delivery Pipeline
(Building blocks so far: Independent Teams, Iterative and Incremental Development, Delivery Pipeline)
15. Feature Toggles
(Building blocks so far: Independent Teams, Iterative and Incremental Development, Delivery Pipeline, Feature Toggle)
16. Feature Toggles
Decouple go-live from deployment
if (toggleIsOn) {
    executeNewBehavior()
} else {
    executeOldBehavior()
}
17. Feature Toggles
Flip for experimentation
Without Recompile?
Without Restart?
Per Request?
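The three questions above (no recompile, no restart, per request) can be sketched in code. This is a minimal illustration, not the talk's actual implementation: here the toggle state lives in an external JSON file that is re-read on every request, so flipping it requires no redeploy. All names (`toggle_is_on`, `new_checkout`, the behavior functions) are invented for the sketch.

```python
# Hypothetical sketch: a feature toggle evaluated per request, read from an
# external source (a JSON file here) so it can be flipped without a
# recompile or a restart of the service.
import json

def toggle_is_on(name, config_path="toggles.json"):
    # Re-read the toggle state on every request: flipping the file
    # takes effect immediately, no redeploy needed.
    try:
        with open(config_path) as f:
            return json.load(f).get(name, False)
    except FileNotFoundError:
        return False  # fail safe: default to the old behavior

def execute_new_behavior():
    return "new"

def execute_old_behavior():
    return "old"

def handle_request():
    if toggle_is_on("new_checkout"):
        return execute_new_behavior()
    return execute_old_behavior()
```

Defaulting a missing or unreadable toggle to the old behavior is a deliberate safety choice: a broken config then degrades to the known-good path instead of exposing the unfinished feature.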
19. Shadow Traffic
(Building blocks so far: Independent Teams, Iterative and Incremental Development, Delivery Pipeline, Feature Toggle, Shadow Traffic)
20. Shadow Traffic
Not just for testing
(Diagram: the team runs both the old and the new behavior on each request; the user sees no difference.)
21. Shadow Traffic
Get early feedback
(Example checks against live traffic: Min 3 items? Mostly fashion? Not sold out? Max 1 of each kind? Maximize! Observed split: 60% / 40%)
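The shadow-traffic idea above can be sketched in a few lines. This is an illustration with invented names, using a toy "max 1 of each kind" rule as the new behavior: both implementations run on every request, the user always gets the old result, and the team collects mismatches as early feedback.

```python
# Shadow-traffic sketch (illustrative names): run old and new behavior side
# by side, return only the old result, and record any disagreements.
mismatches = []

def old_behavior(basket):
    return sorted(basket)

def new_behavior(basket):
    # Toy new rule for the sketch: "max 1 of each kind".
    return sorted(set(basket))

def handle(basket):
    old = old_behavior(basket)
    new = new_behavior(basket)   # shadow call; its result is never shown
    if new != old:
        mismatches.append((basket, old, new))  # feedback for the team
    return old                   # the user sees no difference
```

Because the new code never influences the response, it can be wrong, slow, or incomplete without any user-facing risk; the mismatch log tells the team how often and where the behaviors diverge before the switch-over.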
22. Visual Report
(Building blocks so far: Independent Teams, Iterative and Incremental Development, Delivery Pipeline, Feature Toggle, Shadow Traffic, Visual Report)
26. Visual Report
Quality of a feature: assess that the MVP has the correct business rules
● Visual report (e.g. an HTML page)
(Example table: items such as Beach pants, Leather bags, Jackets, with "manual" vs. "auto" classification columns)
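The "visual report (e.g. an HTML page)" bullet can be illustrated with a short sketch. The items and the manual/auto values below are invented; a real report would pull them from live shadow-traffic data.

```python
# Hypothetical sketch of a visual report: render the feature's automatic
# output next to a manual expectation as plain HTML the team can eyeball.
def render_report(rows):
    cells = "".join(
        f"<tr><td>{name}</td><td>{manual}</td><td>{auto}</td>"
        f"<td>{'OK' if manual == auto else 'CHECK'}</td></tr>"
        for name, manual, auto in rows
    )
    return (
        "<table><tr><th>Item</th><th>manual</th><th>auto</th><th></th></tr>"
        + cells + "</table>"
    )

html = render_report([
    ("Beach pants", "fashion", "fashion"),    # agreement -> OK
    ("Leather bags", "fashion", "accessory"), # disagreement -> CHECK
])
```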
28. A/B Testing
(Building blocks so far: Independent Teams, Iterative and Incremental Development, Delivery Pipeline, Feature Toggle, Shadow Traffic, Visual Report, A/B Test)
29. A/B testing
“You want your data to inform, to guide, to improve your business model, to help
you decide on a course of action.” Lean Analytics
30. A/B testing
Focus on understanding the underlying statistics that drive the calculation of a sample size.
31. A/B testing
A/B testing ≡ a set of statistical tests that evaluate two independent groups (i.e. variants): a control group and a test group.
“Independent groups” -> a between-subjects design
33. A/B testing
A/B testing mostly uses statistical hypothesis testing to assess how likely it is that a change to your website had a meaningful effect.
Null hypothesis (H0): the state of the world. There is no effect, no difference, when you apply changes.
H0: Our <KPIs> remained the “same” in the control group and in the test group.
Alternative hypothesis (H1): the changes in the test group had a real effect.
H1: Our users actively engage in clicking the button, and therefore our A2B increases by a relative 5%.
36. A/B testing
Source: https://abtestguide.com/abtestsize/
Metrics we know
37. A/B testing
We decide from previous data or knowledge about this variable [effect size]
38. A/B testing
Dependent on the variable and what we are looking for [normally two-sided]
39. A/B testing
We can play with it, but mostly by convention, and dependent on traffic [accuracy]
40. A/B testing
Effect size: the magnitude of the effect, i.e. how important the difference is.
41. A/B testing
An improvement that is meaningful for your business.

Relative improvement (%) = 100 × (Test conversion rate − Control conversion rate) / Control conversion rate

Example: with a control conversion rate of 2% and a relative improvement of 15%,
Test conversion rate = 2% + 2% × 15/100 = 2.3% (an absolute difference of 0.3%).
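The sample size these metrics lead to can be estimated with the common normal-approximation formula for comparing two proportions. This sketch uses the example numbers from this slide (control 2%, test 2.3%); the calculator linked above presumably applies a similar formula with more refinement.

```python
# Back-of-the-envelope sample size per variant for comparing two conversion
# rates (normal approximation, two-sided test, independent groups).
from math import sqrt
from statistics import NormalDist

def sample_size(p_control, p_test, alpha=0.05, power=0.80):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_b = NormalDist().inv_cdf(power)           # power = 1 - beta
    p_bar = (p_control + p_test) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p_control * (1 - p_control)
                        + p_test * (1 - p_test)))
    return (num / (p_control - p_test)) ** 2    # visitors per group

# Control 2%, test 2.3% (a 15% relative lift): tens of thousands per variant.
n = sample_size(0.02, 0.023)
```

Note how a small absolute difference (0.3%) on a low baseline rate drives the required sample into the tens of thousands per variant; this is why the slides stress choosing the effect size according to your traffic.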
42. A/B testing
One-sided or two-sided?
Is the difference in means significant enough to reject the null hypothesis?
H0: μt = μc (μt: mean of the test group, μc: mean of the control group)
(Figure: the control and test distributions, their means, and the difference in means)
43. A/B testing
One-sided or two-sided?
H1: μt > μc (one-sided, directional)
H1: μt ≠ μc (two-sided)
Two-sided tends to be the best option. (μt: mean of the test group, μc: mean of the control group)
45. A/B testing
Power, significance level & confidence level
Power of a test: the probability of finding an effect when it is really there; the complement of the type II error rate (false negatives). The typical value is 80% (a convention).
Higher power -> lower chance of missing a true effect -> larger required sample size.
Source: https://towardsdatascience.com/a-guide-for-selecting-an-appropriate-metric-for-your-a-b-test-9068cccb7fb
46. A/B testing
Source: https://www.youtube.com/watch?v=CSBCKVQLf8c
Our study vs. the real world:
● Effect present in reality, study detects it (reject H0): correct
● Effect present in reality, study misses it: Type II error (a miss)
● Effect absent in reality, study “detects” one: Type I error (a false alarm)
● Effect absent in reality, study finds none (reject H1): correct
Type II error: the probability of missing an effect that is really there (the odds of not detecting it).
47. A/B testing
In the table above, the Type II error (a miss) carries the β risk, and power is 1 − β.
Type II error: the probability of a miss is kept below 20% (β), so power is 1 − β -> 80%.
Higher power -> lower chance of missing a true effect -> larger required sample size.
48. A/B testing
Significance level (α): the probability of detecting an effect that is really not there. The typical confidence level is 95%, i.e. α = 5% (a convention).
49. A/B testing
Type I error (a false alarm, the α risk): its probability is kept below 5% (α); the confidence level is 1 − α = 95%.
The significance level α relates to the p-value: the result is significant when the p-value < α.
50. A/B testing
Confidence level: the complement of the significance level (1 − α); the probability that the value of a parameter falls within a specified range of values. The typical value is 95% (a convention).
The significance level α tells you about the probability that the effect you found was just chance; with α = 0.05 (5%), the result is significant when the p-value < 0.05.
Higher confidence level -> larger required sample size.
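The significance check described above is commonly carried out with a two-proportion z-test once the experiment has run. The counts below are invented for illustration.

```python
# Two-proportion z-test sketch: is the test variant's conversion rate
# significantly different from the control's? Reject H0 when p-value < alpha.
from math import sqrt
from statistics import NormalDist

def ab_p_value(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)        # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))       # two-sided p-value

# Invented counts: 200/10,000 conversions in control, 260/10,000 in test.
p = ab_p_value(200, 10_000, 260, 10_000)
significant = p < 0.05   # reject H0 at the 95% confidence level
```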
51. A/B testing
(Recap: the effect size must be meaningful for your business; power and confidence level influence your sample size and the probability of finding a true effect.)
52. A/B testing
Important points
High traffic:
● Choose KPIs wisely; even a low effect size (e.g. +0.5%) can be detected
● Accuracy: minimise risk
● Preferably A/B, but MVT (multivariate testing) is also an option
Low traffic:
● Choose KPIs with a high expected increase (a large effect size, e.g. +5%)
● A/B only
● Run qualitative tests
Never stop an experiment before its planned end, even if you “find” significant results (danger: the risk of false positives rises!).
Source: https://www.evanmiller.org/how-not-to-run-an-ab-test.html
https://vwo.com/blog/ab-split-testing-low-traffic-sites/
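The warning about stopping early can be demonstrated with a toy simulation (all numbers illustrative): both variants share the exact same conversion rate, so any "significant" result is a false positive, yet repeatedly peeking and stopping at the first p < 0.05 fires far more often than the nominal 5%.

```python
# Peeking simulation: same true rate in both variants (no real effect),
# ten interim looks per experiment, stop at the first "significant" p-value.
import random
from math import sqrt
from statistics import NormalDist

random.seed(42)

def p_value(c_a, c_b, n):
    # Two-proportion z-test with equal group sizes n.
    pool = (c_a + c_b) / (2 * n)
    se = sqrt(max(pool * (1 - pool) * 2 / n, 1e-12))
    z = (c_b / n - c_a / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def experiment(peeks=10, step=500, rate=0.05):
    c_a = c_b = n = 0
    for _ in range(peeks):
        c_a += sum(random.random() < rate for _ in range(step))
        c_b += sum(random.random() < rate for _ in range(step))
        n += step
        if p_value(c_a, c_b, n) < 0.05:
            return True          # stopped early: a false positive
    return False

# Fraction of 200 no-effect experiments declared "significant" by peeking;
# it comes out well above the nominal 5%.
false_positive_rate = sum(experiment() for _ in range(200)) / 200
```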
57. Focus Group Survey
(Building blocks so far: Independent Teams, Iterative and Incremental Development, Delivery Pipeline, Feature Toggle, Shadow Traffic, Visual Report, A/B Test, Focus Group Survey)
58. Focus Group Survey
What is it?
A study using inferential statistics to verify a hypothesis.
When?
As part of the discovery of a feature, during development.
Why?
Short feedback loops; data-driven decisions.
Caution! You need experience designing and analysing statistical tests.
60. Focus Group Survey
The shopteaser survey
Likert scale [a categorical variable]: Strongly disagree / Disagree / Neutral / Agree / Strongly agree
Your research question will drive the design of the experiment and also the analysis of your data.
(Figure: a sequence of survey trials)
61. Focus Group Survey
Likert scale [a categorical variable that can be transformed to a continuous one, on a scale of 1–5]
Things that could go wrong:
- Familiarity bias
Methodology examples:
- Gave 5 s per trial so the answers would be spontaneous
- The first trials were discarded
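The transformation from Likert categories to a 1–5 scale can be sketched as follows. The scale labels come from the slide; the response data is invented.

```python
# Mapping Likert categories to 1-5 treats the scale as (approximately)
# continuous, so standard summary statistics apply.
LIKERT = {"Strongly disagree": 1, "Disagree": 2, "Neutral": 3,
          "Agree": 4, "Strongly agree": 5}

def summarize(responses):
    scores = [LIKERT[r] for r in responses]
    return sum(scores) / len(scores)

# Invented responses for one trial of the survey.
mean_score = summarize(["Agree", "Strongly agree", "Neutral", "Agree"])
```

Treating ordinal Likert data as continuous is a modeling choice, which is why the slide flags it explicitly; non-parametric tests on the raw categories are the stricter alternative.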
62. Focus Group Survey: the shopteaser survey
During the design phase we also took into account:
● Collect demographic data: there is no such thing as too much data
● Collect feedback at the end of the survey: did they understand the task, did something go wrong?
● Write clear instructions: if you are not there, they cannot ask and will “assume”
63. Insights from a focus group
The shopteaser survey
(Chart: “selected” vs. “manual” results)
64. Lab test
(Building blocks so far: Independent Teams, Iterative and Incremental Development, Delivery Pipeline, Feature Toggle, Shadow Traffic, Visual Report, A/B Test, Focus Group Survey, Lab Test)
65. UX Lab Tests
UX designers test the design and usability of a feature on a test group.
● A small group of people, in person (~5–10 people)
● Remote, web-based testing of users
● Qualitative questions, e.g. “Did you like it?”, “Was it easy to find?”
68. When is your next release? Could it be earlier?
Do you have a solid hypothesis and measurable KPIs for it?
Which measurements could you be using instead of
assuming the user’s preference?
Which of your meetings in the next 2 weeks could be
replaced by a lean experiment?