11. 1. Split population 2. Show variations 3. Measure response 4. Choose winner 5. Deploy winner
12. [Architecture diagram: the Game syncs state with a Game Server backed by a State DB; an A/B Test Server serves the A or B variant from a Content Management System; events (tagged with A or B) flow into a Data Warehouse for analysis.]
19. [Scatterplot: value of first purchase ($2, $5, $15, $30, $70) vs. total spend, with A/B test users, VIPs and repeat purchases picked out. Caption: "(f) Zoomed in version of the previous scatterplot. Note that jitter and transparency …"]
31. p-Value

Under the null, the sampling distribution is (approximately) normal: 95% of its mass lies between µ − 1.96σ and µ + 1.96σ, with 2.5% in each tail.

f(x) = (1 / √(2πσ²)) e^(−(x − µ)² / (2σ²))

C(n, r) = n! / ((n − r)! r!)        µ = np        σ = √(np(1 − p))

z = (x̄ − µ) / σ        z = (x̄ − µ) / (σ̂ / √n)        SE(x̄) = σ̂ / √n

n = (z_α + z_β)² σ² / δ²
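The sample-size formula n = (z_α + z_β)² σ² / δ² is easy to sanity-check; a sketch in Python, where the z values (1.96 for a two-sided 5% test, 0.84 for 80% power) and the σ and δ numbers are assumptions for illustration:

```python
import math

def sample_size(sigma, delta, z_alpha=1.96, z_beta=0.84):
    # n = (z_alpha + z_beta)^2 * sigma^2 / delta^2, rounded up to whole users
    # z_alpha = 1.96: two-sided 5% significance; z_beta = 0.84: 80% power
    return math.ceil((z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# e.g. detecting a shift of delta = 1.5 when sigma = 5.0 (illustrative numbers)
n = sample_size(sigma=5.0, delta=1.5)
```

Note how the required n grows with the square of σ/δ: halving the detectable effect quadruples the sample.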
32. p-Value

The probability of observing as extreme a result, assuming the null hypothesis is true.

OR: the probability of the data given the model.
33. Null Hypothesis: H0

Observation \ Truth   H0 true                         H0 false
Accept H0             correct                         Type-II Error (False Negative)
Reject H0             Type-I Error (False Positive)   correct
34. p-Value

All we can ever say with p < 0.05 is either:

‣ not enough evidence that retention rates are the same
‣ the retention rates are different, 95% of the time
35. actually...

The evidence supports a rejection of the null hypothesis, i.e. the probability of seeing a result as extreme as this, assuming the retention rate is actually 30%, is less than 5% (p < 0.05).
40. Family-wise Error

p = P(Type-I Error) = 0.05, so 5% of the time we will get a false positive for one treatment.

P(no Type-I Error) = 0.95
P(no Type-I Error for 2 treatments) = (0.95)(0.95) = 0.9025
P(at least 1 Type-I Error for 2 treatments) = 1 − 0.9025 = 0.0975
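The same calculation generalises to k treatments; a small sketch:

```python
def familywise_error(k, alpha=0.05):
    # P(at least one Type-I error across k independent tests at level alpha)
    return 1 - (1 - alpha) ** k

fwe_2 = familywise_error(2)    # the two-treatment case from the slide
fwe_10 = familywise_error(10)  # the error rate grows quickly with k
```

With ten treatments at α = 0.05 the chance of at least one false positive is already about 40%, which is why multiple-comparison corrections exist.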
46. Tossing the Coin

[Plot: proportion of heads (0.0–1.0) against number of flips (0–50) for the sequence THTHTTTHHTHH…]
47. Tossing the Coin

[Plot: proportion of heads against number of flips on a log scale (1 to 500); the long run average converges to 0.5.]
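The long-run convergence can be reproduced with a few lines of Python (the seed and flip count are arbitrary):

```python
import random

random.seed(1)  # fixed seed so the run is reproducible
flips = [random.random() < 0.5 for _ in range(10_000)]  # True = heads

# running proportion of heads after 10, 100 and 10,000 flips
proportions = {n: sum(flips[:n]) / n for n in (10, 100, 10_000)}
```

Early proportions wander widely; by 10,000 flips the proportion sits close to the long-run average of 0.5.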
48. Terminology

p(x): probability of x
p(x, y): probability of x and y (joint)
p(x|y): probability of x given y (conditional)

p(y|x) = p(y, x) / p(x)        p(x|y) = p(x, y) / p(y)

p(y, x) = p(x, y) = p(y|x) p(x) = p(x|y) p(y)

and so (Bayes' rule):

p(y|x) = p(x|y) p(y) / p(x)
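A quick numeric check of these identities, with a made-up joint distribution over weather:

```python
# joint distribution over x (sky) and y (rain); the numbers are invented
p_joint = {("cloudy", "rain"): 0.20, ("cloudy", "dry"): 0.30,
           ("clear", "rain"): 0.05, ("clear", "dry"): 0.45}

p_cloudy = sum(v for (x, _), v in p_joint.items() if x == "cloudy")  # p(x)
p_rain = sum(v for (_, y), v in p_joint.items() if y == "rain")      # p(y)

p_rain_given_cloudy = p_joint[("cloudy", "rain")] / p_cloudy  # p(y|x)
p_cloudy_given_rain = p_joint[("cloudy", "rain")] / p_rain    # p(x|y)

# Bayes' rule recovers p(y|x) from p(x|y), p(y) and p(x)
bayes = p_cloudy_given_rain * p_rain / p_cloudy
```

The two conditionals differ (0.4 vs 0.8 here), yet Bayes' rule converts one into the other exactly.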
49. The Bernoulli distribution

Head (H) = 1, Tails (T) = 0. For a "fair" coin, θ = 0.5.

A single toss:

p(x|θ) = θ^x (1 − θ)^(1−x)

p(heads = 1 | 0.5) = 0.5^1 (1 − 0.5)^(1−1) = 0.5
p(tails = 0 | 0.5) = 0.5^0 (1 − 0.5)^(1−0) = 0.5
50. The Binomial

Probability of heads in a single throw:

p(x|θ) = θ^x (1 − θ)^(1−x)

p(heads = 1 | 0.5) = 0.5^1 (1 − 0.5)^(1−1) = 0.5
p(tails = 0 | 0.5) = 0.5^0 (1 − 0.5)^(1−0) = 0.5

Probability of x heads in n throws:

p(x|θ, n) = C(n, x) θ^x (1 − θ)^(n−x)
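A direct translation of the binomial pmf into Python (`math.comb` supplies the C(n, x) term):

```python
from math import comb

def binom_pmf(x, n, theta):
    # p(x|θ, n) = C(n, x) θ^x (1 − θ)^(n − x)
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)

p_single_head = binom_pmf(1, 1, 0.5)  # the Bernoulli case, n = 1
p_10_of_20 = binom_pmf(10, 20, 0.5)   # 10 heads in 20 tosses of a fair coin
```

Even the most likely outcome, 10 heads in 20, has probability below 0.18, which is why single experiments are noisy.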
51. 20 tosses of a fair coin

[Plot: the binomial pmf p(x|0.5, 20) over the number of heads x.]
52. 20 tosses of an "un-fair" coin

[Plot: the binomial pmfs p(x|0.2, 20) and p(x|0.5, 20) over the number of heads x.]
53. "Binomial Likelihood"

Likelihood of θ given observation i of x heads in n throws:

L(θ|x_i, n_i) = C(n_i, x_i) θ^(x_i) (1 − θ)^(n_i − x_i)
54. The Likelihood

Increasing likelihood of θ with more observations…

[Plots of the likelihood against θ for 1 heads in 2 throws, 5 heads in 10 throws, and 500 heads in 1000 throws; the curve narrows around θ = 0.5 as observations accumulate.]
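The narrowing can be checked numerically by evaluating the binomial likelihood on a grid of θ values (the grid resolution is an arbitrary choice):

```python
from math import comb

def binom_likelihood(theta, x, n):
    # L(θ | x heads in n throws) = C(n, x) θ^x (1 − θ)^(n − x)
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)

grid = [i / 100 for i in range(1, 100)]
# the peak of L(θ | 5 heads in 10 throws) sits at θ = x/n = 0.5
best = max(grid, key=lambda t: binom_likelihood(t, 5, 10))

# relative support for θ = 0.4 vs the peak shrinks as data accumulates
ratio_small = binom_likelihood(0.4, 5, 10) / binom_likelihood(0.5, 5, 10)
ratio_big = binom_likelihood(0.4, 500, 1000) / binom_likelihood(0.5, 500, 1000)
```

With 10 throws, θ = 0.4 retains over 80% of the peak's likelihood; with 1000 throws it is effectively ruled out.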
55. A recap…

We want to know p(θ|x): the probability of the model parameter θ (e.g. fair coin) given the observations x (#heads):

p(θ|x) = p(x|θ) p(θ) / p(x)

p(x|θ) is the probability of the data given the model:

p(x|θ) = θ^x (1 − θ)^(1−x)        p(x|θ, n) = θ^x (1 − θ)^(n−x)

The beta function and the beta distribution:

B(a, b) = Γ(a)Γ(b) / Γ(a + b) = (a − 1)!(b − 1)! / (a + b − 1)!

beta(θ|a, b) = θ^(a−1) (1 − θ)^(b−1) / B(a, b)
56. Note that

p(x, y) = p(x|y) p(y) = p(y|x) p(x)

but

p(x|θ) ≠ p(θ|x)
p(cloudy|raining) ≠ p(raining|cloudy)
61. The prior

p(θ|x) = p(x|θ) p(θ) / p(x),        p(x) = ∫ p(x|θ) p(θ) dθ

The prior p(θ) captures our "belief" in the model, based on prior experience, observations or knowledge.
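The evidence integral p(x) can be approximated numerically; with a flat prior on θ it has the known closed form 1/(n + 1), which makes a handy check (the midpoint rule and step count are arbitrary choices):

```python
from math import comb

def evidence(x, n, steps=100_000):
    # p(x) = ∫ p(x|θ) p(θ) dθ with a flat prior p(θ) = 1, midpoint rule
    c = comb(n, x)
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) / steps
        total += c * t ** x * (1 - t) ** (n - x)
    return total / steps

# with a flat prior the evidence works out to 1/(n + 1) for every x
px = evidence(x=7, n=20)
```

That every head count is equally likely a priori is exactly what "no belief about θ" should imply, so the check is also a sanity test of the flat prior.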
63. Selecting a prior

We'd like the product of prior and likelihood to be "like" the likelihood, and we'd like the integral to be easily evaluated.

p(x, n|θ) = θ^x (1 − θ)^(n−x)

p(θ|x) = θ^x (1 − θ)^(n−x) p(θ) / p(x) = θ^x (1 − θ)^(n−x) p(θ) / ∫ θ^x (1 − θ)^(n−x) p(θ) dθ
67. Putting it together…

Beta prior × binomial likelihood (x heads, n − x tails):

p(x, n|θ) = θ^x (1 − θ)^(n−x)

p(θ|x, n) = θ^x (1 − θ)^(n−x) · θ^(a−1) (1 − θ)^(b−1) / (B(a, b) p(x, n))
          = θ^(x+a−1) (1 − θ)^(n−x+b−1) / B(x + a, n − x + b)

The posterior is again a beta distribution, and the quantity we care about is

p(θ_a > θ_b) = ∫∫_{θ_a > θ_b} p(θ_a|x_a) p(θ_b|x_b) dθ_a dθ_b
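The double integral p(θ_a > θ_b) is usually evaluated by sampling; a Monte-Carlo sketch using the Beta(x + a, n − x + b) posteriors (the counts, priors and draw budget are illustrative):

```python
import random

random.seed(0)

def prob_a_beats_b(xa, na, xb, nb, a=1, b=1, draws=100_000):
    # Monte-Carlo estimate of p(theta_a > theta_b): draw from both
    # Beta(x + a, n - x + b) posteriors and count how often A wins
    wins = 0
    for _ in range(draws):
        theta_a = random.betavariate(xa + a, na - xa + b)
        theta_b = random.betavariate(xb + a, nb - xb + b)
        wins += theta_a > theta_b
    return wins / draws

# illustrative counts: 60/100 conversions for A vs 50/100 for B
p = prob_a_beats_b(60, 100, 50, 100)
```

With these counts the estimate lands around 0.92: a direct statement of how probable it is that variant A is genuinely better.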
68. Putting it together…

1. Decide on a prior, which captures your belief
2. Run experiment and observe data for heads, tails
3. Determine your posterior based on the data
4. Use posterior as your new belief and re-run experiment
5. Rinse, repeat until you hit an actionable certainty
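For the beta-binomial case this loop reduces to simple counting; a minimal sketch (the observed counts are invented):

```python
def update(belief, heads, tails):
    # Beta(a, b) belief + observed heads/tails -> Beta(a + heads, b + tails)
    a, b = belief
    return (a + heads, b + tails)

belief = (1, 1)                   # step 1: a flat Beta(1, 1) prior
belief = update(belief, 7, 3)     # steps 2-3: observe 7 heads, 3 tails
belief = update(belief, 55, 45)   # step 4: posterior becomes the new prior
post_mean = belief[0] / (belief[0] + belief[1])  # posterior mean of θ
```

Because the posterior is again a beta distribution, "re-run the experiment" is literally just adding the new counts to the old ones.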
77. The Prior

Captures our prior belief, expertise, opinion.

Strong prior belief means:
‣ we need lots of evidence to contradict it
‣ results converge more quickly (if the prior is "relevant")

Provides inertia. With enough samples, the prior's impact diminishes rapidly.
79. Multiple variant tests

With more than one variant we have a multi-dimensional problem, and we need to evaluate volumes under the posterior. In general this requires numerical integration, e.g. Markov Chain Monte-Carlo (MCMC).
95. Benefits / Features

‣ Continuously observable
‣ No need to fix population size in advance
‣ Incorporate prior knowledge / expertise
‣ Result is a "true" probability
‣ A measure of the difference magnitude is given
‣ Consistent framework for lots of different scenarios
99. Multi-arm bandits

Thompson Sampling:
1. Draw a sample from each "arm's" posterior distribution (initially the prior)
2. Pick the largest & deploy
3. Evaluate success
4. Update the posterior for the chosen "arm"

Thompson, William R. "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples". Biometrika, 25(3–4): 285–294.
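A minimal sketch of this loop (the conversion rates and number of rounds are invented for illustration):

```python
import random

random.seed(2)

true_rates = [0.02, 0.05, 0.12]            # hidden conversion rates (invented)
posteriors = [[1, 1] for _ in true_rates]  # Beta(successes+1, failures+1) per arm

for _ in range(5_000):
    # 1. draw a sample from each arm's posterior
    draws = [random.betavariate(a, b) for a, b in posteriors]
    # 2. pick the largest & deploy
    arm = draws.index(max(draws))
    # 3. evaluate success
    success = random.random() < true_rates[arm]
    # 4. update the chosen arm's posterior
    posteriors[arm][0 if success else 1] += 1

pulls = [a + b - 2 for a, b in posteriors]  # how often each arm was played
```

Unlike a fixed-horizon A/B test, the bandit shifts traffic toward the better arm while the experiment is still running.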