1. When in doubt, go live
Techniques for decision making
based on real user behavior
© 2020 ThoughtWorks
Irene Torres
Klaus Fleerkötter
2. You save time and make better decisions
by establishing shorter feedback loops
from feature idea to feature usage.
3. Who’s talking?
Irene Torres: Developer @ TW, PhD Neuroscience (science perspective)
Klaus Fleerkötter: Developer @ TW, Information Systems (techie perspective)
4. What is this talk about?
Specific use cases that worked for us: tech & research.
And what is it not...
● Extensive coverage of user research
● Software testing
5. Examples
One of Germany’s biggest online retailers
Top 5 highest-traffic e-commerce sites in Germany
Orders: up to 10 per second
Qualified visits: avg. 1.6 million per day
11. Services that can be built independently by cross-functional
teams that are structured around business domains
(Cross-functional roles: Dev, PO, QA, Ops, UX, DA)
12. The Delivery Pipeline
(Building blocks so far: Independent Teams, Iterative and Incremental Development, Delivery Pipeline)
15. Feature Toggles
(Building blocks so far: Independent Teams, Iterative and Incremental Development, Delivery Pipeline, Feature Toggle)
16. Feature Toggles
Decouple go-live from deployment
if (toggleIsOn) {
    executeNewBehavior()
} else {
    executeOldBehavior()
}
17. Feature Toggles
Flip for experimentation
Without Recompile?
Without Restart?
Per Request?
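The three questions above (no recompile, no restart, per request) can be sketched in code. This is a minimal illustration, not the talk's actual implementation: here the toggle state lives in an external JSON file that is re-read on every request, so flipping it requires no redeploy. All names (`toggle_is_on`, `new_checkout`, the behavior functions) are invented for the sketch.

```python
# Hypothetical sketch: a feature toggle evaluated per request, read from an
# external source (a JSON file here) so it can be flipped without a
# recompile or a restart of the service.
import json

def toggle_is_on(name, config_path="toggles.json"):
    # Re-read the toggle state on every request: flipping the file
    # takes effect immediately, no redeploy needed.
    try:
        with open(config_path) as f:
            return json.load(f).get(name, False)
    except FileNotFoundError:
        return False  # fail safe: default to the old behavior

def execute_new_behavior():
    return "new"

def execute_old_behavior():
    return "old"

def handle_request():
    if toggle_is_on("new_checkout"):
        return execute_new_behavior()
    return execute_old_behavior()
```

Defaulting a missing or unreadable toggle to the old behavior is a deliberate safety choice: a broken config then degrades to the known-good path instead of exposing the unfinished feature.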
19. Shadow Traffic
(Building blocks so far: Independent Teams, Iterative and Incremental Development, Delivery Pipeline, Feature Toggle, Shadow Traffic)
20. Shadow Traffic
Not just for testing
(Diagram: the team runs both the old and the new behavior on each request; the user sees no difference.)
21. Shadow Traffic
Get early feedback
(Example checks against live traffic: Min 3 items? Mostly fashion? Not sold out? Max 1 of each kind? Maximize! Observed split: 60% / 40%)
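The shadow-traffic idea above can be sketched in a few lines. This is an illustration with invented names, using a toy "max 1 of each kind" rule as the new behavior: both implementations run on every request, the user always gets the old result, and the team collects mismatches as early feedback.

```python
# Shadow-traffic sketch (illustrative names): run old and new behavior side
# by side, return only the old result, and record any disagreements.
mismatches = []

def old_behavior(basket):
    return sorted(basket)

def new_behavior(basket):
    # Toy new rule for the sketch: "max 1 of each kind".
    return sorted(set(basket))

def handle(basket):
    old = old_behavior(basket)
    new = new_behavior(basket)   # shadow call; its result is never shown
    if new != old:
        mismatches.append((basket, old, new))  # feedback for the team
    return old                   # the user sees no difference
```

Because the new code never influences the response, it can be wrong, slow, or incomplete without any user-facing risk; the mismatch log tells the team how often and where the behaviors diverge before the switch-over.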
22. Visual Report
(Building blocks so far: Independent Teams, Iterative and Incremental Development, Delivery Pipeline, Feature Toggle, Shadow Traffic, Visual Report)
26. Visual Report
Quality of a feature: assess that the MVP has the correct business rules
● Visual report (e.g. an HTML page)
(Example table: items such as Beach pants, Leather bags, Jackets, with "manual" vs. "auto" classification columns)
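The "visual report (e.g. an HTML page)" bullet can be illustrated with a short sketch. The items and the manual/auto values below are invented; a real report would pull them from live shadow-traffic data.

```python
# Hypothetical sketch of a visual report: render the feature's automatic
# output next to a manual expectation as plain HTML the team can eyeball.
def render_report(rows):
    cells = "".join(
        f"<tr><td>{name}</td><td>{manual}</td><td>{auto}</td>"
        f"<td>{'OK' if manual == auto else 'CHECK'}</td></tr>"
        for name, manual, auto in rows
    )
    return (
        "<table><tr><th>Item</th><th>manual</th><th>auto</th><th></th></tr>"
        + cells + "</table>"
    )

html = render_report([
    ("Beach pants", "fashion", "fashion"),    # agreement -> OK
    ("Leather bags", "fashion", "accessory"), # disagreement -> CHECK
])
```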
28. A/B Testing
(Building blocks so far: Independent Teams, Iterative and Incremental Development, Delivery Pipeline, Feature Toggle, Shadow Traffic, Visual Report, A/B Test)
29. A/B testing
“You want your data to inform, to guide, to improve your business model, to help
you decide on a course of action.” Lean Analytics
30. A/B testing
Focus on understanding the underlying statistics that drive the calculation of a sample size.
31. A/B testing
A/B testing ≡ a set of statistical tests that evaluate two independent groups (i.e. variants): a control group and a test group.
“Independent groups” -> a between-subjects design
33. A/B testing
A/B testing mostly uses statistical hypothesis testing to assess how likely it is that a change to your website had a meaningful effect.
Null hypothesis (H0): the state of the world. There is no effect, no difference, when you apply changes.
H0: Our <KPIs> remained the “same” in the control group and in the test group.
Alternative hypothesis (H1): the changes in the test group had a real effect.
H1: Our users actively engage in clicking the button, and therefore our A2B increases by a relative 5%.
36. A/B testing
Source: https://abtestguide.com/abtestsize/
Metrics we know
37. A/B testing
We decide from previous data or knowledge about this variable [effect size]
38. A/B testing
Dependent on the variable and what we are looking for [normally two-sided]
39. A/B testing
We can play with it, but mostly by convention, and dependent on traffic [accuracy]
40. A/B testing
Effect size: the magnitude of the effect, i.e. how important the difference is.
41. A/B testing
An improvement that is meaningful for your business.

Relative improvement (%) = 100 × (Test conversion rate − Control conversion rate) / Control conversion rate

Example: with a control conversion rate of 2% and a relative improvement of 15%,
Test conversion rate = 2% + 2% × 15/100 = 2.3% (an absolute difference of 0.3%).
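The sample size these metrics lead to can be estimated with the common normal-approximation formula for comparing two proportions. This sketch uses the example numbers from this slide (control 2%, test 2.3%); the calculator linked above presumably applies a similar formula with more refinement.

```python
# Back-of-the-envelope sample size per variant for comparing two conversion
# rates (normal approximation, two-sided test, independent groups).
from math import sqrt
from statistics import NormalDist

def sample_size(p_control, p_test, alpha=0.05, power=0.80):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_b = NormalDist().inv_cdf(power)           # power = 1 - beta
    p_bar = (p_control + p_test) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p_control * (1 - p_control)
                        + p_test * (1 - p_test)))
    return (num / (p_control - p_test)) ** 2    # visitors per group

# Control 2%, test 2.3% (a 15% relative lift): tens of thousands per variant.
n = sample_size(0.02, 0.023)
```

Note how a small absolute difference (0.3%) on a low baseline rate drives the required sample into the tens of thousands per variant; this is why the slides stress choosing the effect size according to your traffic.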
42. A/B testing
One-sided or two-sided?
Is the difference in means significant enough to reject the null hypothesis?
H0: μt = μc (μt: mean of the test group, μc: mean of the control group)
(Figure: the control and test distributions, their means, and the difference in means)
43. A/B testing
One-sided or two-sided?
H1: μt > μc (one-sided, directional)
H1: μt ≠ μc (two-sided)
Two-sided tends to be the best option. (μt: mean of the test group, μc: mean of the control group)
45. A/B testing
Power, significance level & confidence level
Power of a test: the probability of finding an effect when it is really there; the complement of the type II error rate (false negatives). The typical value is 80% (a convention).
Higher power -> lower chance of missing a true effect -> larger required sample size.
Source: https://towardsdatascience.com/a-guide-for-selecting-an-appropriate-metric-for-your-a-b-test-9068cccb7fb
46. A/B testing
Source: https://www.youtube.com/watch?v=CSBCKVQLf8c
Our study vs. the real world:
● Effect present in reality, study detects it (reject H0): correct
● Effect present in reality, study misses it: Type II error (a miss)
● Effect absent in reality, study “detects” one: Type I error (a false alarm)
● Effect absent in reality, study finds none (reject H1): correct
Type II error: the probability of missing an effect that is really there (the odds of not detecting it).
47. A/B testing
In the table above, the Type II error (a miss) carries the β risk, and power is 1 − β.
Type II error: the probability of a miss is kept below 20% (β), so power is 1 − β -> 80%.
Higher power -> lower chance of missing a true effect -> larger required sample size.
48. A/B testing
Significance level (α): the probability of detecting an effect that is really not there. The typical confidence level is 95%, i.e. α = 5% (a convention).
49. A/B testing
Type I error (a false alarm, the α risk): its probability is kept below 5% (α); the confidence level is 1 − α = 95%.
The significance level α relates to the p-value: the result is significant when the p-value < α.
50. A/B testing
Confidence level: the complement of the significance level (1 − α); the probability that the value of a parameter falls within a specified range of values. The typical value is 95% (a convention).
The significance level α tells you about the probability that the effect you found was just chance; with α = 0.05 (5%), the result is significant when the p-value < 0.05.
Higher confidence level -> larger required sample size.
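The significance check described above is commonly carried out with a two-proportion z-test once the experiment has run. The counts below are invented for illustration.

```python
# Two-proportion z-test sketch: is the test variant's conversion rate
# significantly different from the control's? Reject H0 when p-value < alpha.
from math import sqrt
from statistics import NormalDist

def ab_p_value(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)        # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))       # two-sided p-value

# Invented counts: 200/10,000 conversions in control, 260/10,000 in test.
p = ab_p_value(200, 10_000, 260, 10_000)
significant = p < 0.05   # reject H0 at the 95% confidence level
```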
51. A/B testing
(Recap: the effect size must be meaningful for your business; power and confidence level influence your sample size and the probability of finding a true effect.)
52. A/B testing
Important points
High traffic:
● Choose KPIs wisely; even a low effect size (e.g. +0.5%) can be detected
● Accuracy: minimise risk
● Preferably A/B, but MVT (multivariate testing) is also an option
Low traffic:
● Choose KPIs with a high expected increase (a large effect size, e.g. +5%)
● A/B only
● Run qualitative tests
Never stop an experiment before its planned end, even if you “find” significant results (danger: the risk of false positives rises!).
Source: https://www.evanmiller.org/how-not-to-run-an-ab-test.html
https://vwo.com/blog/ab-split-testing-low-traffic-sites/
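The warning about stopping early can be demonstrated with a toy simulation (all numbers illustrative): both variants share the exact same conversion rate, so any "significant" result is a false positive, yet repeatedly peeking and stopping at the first p < 0.05 fires far more often than the nominal 5%.

```python
# Peeking simulation: same true rate in both variants (no real effect),
# ten interim looks per experiment, stop at the first "significant" p-value.
import random
from math import sqrt
from statistics import NormalDist

random.seed(42)

def p_value(c_a, c_b, n):
    # Two-proportion z-test with equal group sizes n.
    pool = (c_a + c_b) / (2 * n)
    se = sqrt(max(pool * (1 - pool) * 2 / n, 1e-12))
    z = (c_b / n - c_a / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def experiment(peeks=10, step=500, rate=0.05):
    c_a = c_b = n = 0
    for _ in range(peeks):
        c_a += sum(random.random() < rate for _ in range(step))
        c_b += sum(random.random() < rate for _ in range(step))
        n += step
        if p_value(c_a, c_b, n) < 0.05:
            return True          # stopped early: a false positive
    return False

# Fraction of 200 no-effect experiments declared "significant" by peeking;
# it comes out well above the nominal 5%.
false_positive_rate = sum(experiment() for _ in range(200)) / 200
```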
57. Focus Group Survey
(Building blocks so far: Independent Teams, Iterative and Incremental Development, Delivery Pipeline, Feature Toggle, Shadow Traffic, Visual Report, A/B Test, Focus Group Survey)
58. Focus Group Survey
What is it?
A study using inferential statistics to verify a hypothesis.
When?
As part of the discovery of a feature, during development.
Why?
Short feedback loops; data-driven decisions.
Caution! You need experience designing and analysing statistical tests.
60. Focus Group Survey
The shopteaser survey
Likert scale [a categorical variable]: Strongly disagree / Disagree / Neutral / Agree / Strongly agree
Your research question will drive the design of the experiment and also the analysis of your data.
(Figure: a sequence of survey trials)
61. Focus Group Survey
Likert scale [a categorical variable that can be transformed to a continuous one, on a scale of 1–5]
Things that could go wrong:
- Familiarity bias
Methodology examples:
- Gave 5 s per trial so the answers would be spontaneous
- The first trials were discarded
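The transformation from Likert categories to a 1–5 scale can be sketched as follows. The scale labels come from the slide; the response data is invented.

```python
# Mapping Likert categories to 1-5 treats the scale as (approximately)
# continuous, so standard summary statistics apply.
LIKERT = {"Strongly disagree": 1, "Disagree": 2, "Neutral": 3,
          "Agree": 4, "Strongly agree": 5}

def summarize(responses):
    scores = [LIKERT[r] for r in responses]
    return sum(scores) / len(scores)

# Invented responses for one trial of the survey.
mean_score = summarize(["Agree", "Strongly agree", "Neutral", "Agree"])
```

Treating ordinal Likert data as continuous is a modeling choice, which is why the slide flags it explicitly; non-parametric tests on the raw categories are the stricter alternative.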
62. Focus Group Survey: the shopteaser survey
During the design phase we also took into account:
● Collect demographic data: there is no such thing as too much data
● Collect feedback at the end of the survey: did they understand the task, did something go wrong?
● Write clear instructions: if you are not there, they cannot ask and will “assume”
63. Insights from a focus group
The shopteaser survey
(Chart: “selected” vs. “manual” results)
64. Lab test
(Building blocks so far: Independent Teams, Iterative and Incremental Development, Delivery Pipeline, Feature Toggle, Shadow Traffic, Visual Report, A/B Test, Focus Group Survey, Lab Test)
65. UX Lab Tests
UX designers test the design and usability of a feature on a test group.
● A small group of people, in person (~5–10 people)
● Remote, web-based testing of users
● Qualitative questions, e.g. “Did you like it?”, “Was it easy to find?”
68. When is your next release? Could it be earlier?
Do you have a solid hypothesis and measurable KPIs for it?
Which measurements could you be using instead of
assuming the user’s preference?
Which of your meetings in the next 2 weeks could be
replaced by a lean experiment?