5. Learning by example
• How did you solve this problem?
• Can you make this process explicit (e.g. write code to do so)?
28. Diagnoses a la Bayes¹
• You’re testing for a rare disease:
  • 1% of the population is infected
• You have a highly sensitive and specific test:
  • 99% of sick patients test positive
  • 99% of healthy patients test negative
• Given that a patient tests positive, what is the probability that the patient is sick?
¹ Wiggins, SciAm 2006
29. Diagnoses a la Bayes
Population: 10,000 ppl
  1% Sick: 100 ppl
    99% Test +: 99 ppl
    1% Test −: 1 ppl
  99% Healthy: 9,900 ppl
    1% Test +: 99 ppl
    99% Test −: 9,801 ppl
So given that a patient tests positive (198 ppl), there is a 50% chance the patient is sick (99 ppl)!
The small error rate on the large healthy population produces many false positives.
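As a quick check of this tree, here is a minimal Python sketch of the same natural-frequency arithmetic (the 10,000-person population and the 1%/99% rates are taken straight from the slide):

    # Natural-frequency version of the rare-disease calculation
    population = 10_000
    prevalence = 0.01            # 1% of the population is infected
    sensitivity = 0.99           # 99% of sick patients test positive
    specificity = 0.99           # 99% of healthy patients test negative

    sick = population * prevalence                   # 100 ppl
    healthy = population - sick                      # 9,900 ppl
    true_positives = sick * sensitivity              # 99 ppl
    false_positives = healthy * (1 - specificity)    # 99 ppl

    p_sick_given_positive = true_positives / (true_positives + false_positives)
    print(p_sick_given_positive)                     # ≈ 0.5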
32. Natural frequencies a la Gigerenzer²
² http://bit.ly/ggbbc
33. Inverting conditional probabilities
Bayes’ Theorem
Equate the far right- and left-hand sides of the product rule
p(y|x) p(x) = p(x, y) = p(x|y) p(y)
and divide to get the probability of y given x from the probability of x given y:
p(y|x) = p(x|y) p(y) / p(x)
where p(x) = Σ_{y ∈ Ω_Y} p(x|y) p(y) is the normalization constant.
34. Diagnoses a la Bayes
Given that a patient tests positive, what is the probability that the patient is sick?
p(sick|+) = p(+|sick) p(sick) / p(+)
          = (99/100 · 1/100) / (99/100² + 99/100²)
          = (99/100²) / (198/100²)
          = 99/198 = 1/2
where p(+) = p(+|sick) p(sick) + p(+|healthy) p(healthy).
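The same computation in probability form, as a small Python sketch (only the rates stated above are used):

    # Bayes' rule with the total-probability normalization constant
    p_sick = 0.01
    p_healthy = 1 - p_sick
    p_pos_given_sick = 0.99
    p_pos_given_healthy = 0.01    # healthy patients test positive 1% of the time

    p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * p_healthy
    p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
    print(p_sick_given_pos)       # 0.5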
35. (Super) Naive Bayes
We can use Bayes’ rule to build a one-word spam classifier:
p(spam|word) = p(word|spam) p(spam) / p(word)
where we estimate these probabilities with ratios of counts:
p̂(word|spam) = (# spam docs containing word) / (# spam docs)
p̂(word|ham) = (# ham docs containing word) / (# ham docs)
p̂(spam) = (# spam docs) / (# docs)
p̂(ham) = (# ham docs) / (# docs)
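A minimal Python sketch of this one-word classifier, using count ratios exactly as above; the toy corpus and the helper name p_spam_given_word are made up for illustration:

    # One-word spam classifier: estimate p(spam | word) from document counts
    def p_spam_given_word(word, spam_docs, ham_docs):
        n_spam, n_ham = len(spam_docs), len(ham_docs)
        p_spam = n_spam / (n_spam + n_ham)
        p_ham = n_ham / (n_spam + n_ham)
        p_word_spam = sum(word in doc for doc in spam_docs) / n_spam   # p̂(word|spam)
        p_word_ham = sum(word in doc for doc in ham_docs) / n_ham      # p̂(word|ham)
        p_word = p_word_spam * p_spam + p_word_ham * p_ham             # p̂(word)
        return p_word_spam * p_spam / p_word

    # toy corpus: each document is the set of words it contains
    spam = [{"free", "money", "offer"}, {"free", "pills"}, {"money", "now"}]
    ham = [{"meeting", "tomorrow"}, {"free", "lunch"}, {"project", "update"}]
    print(p_spam_given_word("free", spam, ham))   # ≈ 0.67

In practice the counts are smoothed (e.g., add-one) so that unseen words do not produce zero probabilities.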
39. Naive Bayes
Represent each document by a binary vector ~x where x_j = 1 if the j-th word appears in the document (x_j = 0 otherwise).
Modeling each word as an independent Bernoulli random variable, the probability of observing a document ~x of class c is:
p(~x|c) = ∏_j θ_jc^{x_j} (1 − θ_jc)^{1 − x_j}
where θ_jc denotes the probability that the j-th word occurs in a document of class c.
40. Naive Bayes
Using this likelihood in Bayes’ rule and taking a logarithm, we have:
log p(c|~x) = log [ p(~x|c) p(c) / p(~x) ]
            = Σ_j x_j log [θ_jc / (1 − θ_jc)] + Σ_j log(1 − θ_jc) + log [θ_c / p(~x)]
where θ_c is the probability of observing a document of class c.
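A minimal NumPy sketch of this log-posterior score (the array names theta and log_prior are mine; the log p(~x) term is dropped since it is the same for every class):

    import numpy as np

    def log_posterior_scores(x, theta, log_prior):
        # x: binary word-indicator vector, shape (n_words,)
        # theta[c, j] = probability that word j appears in a document of class c
        # log_prior[c] = log probability of class c
        # returns log p(c|x) up to the additive constant -log p(x)
        return ((x * np.log(theta / (1 - theta))).sum(axis=1)
                + np.log(1 - theta).sum(axis=1)
                + log_prior)

    # toy example: 2 classes, 3 words
    theta = np.array([[0.8, 0.1, 0.5],
                      [0.2, 0.6, 0.5]])
    log_prior = np.log([0.5, 0.5])
    x = np.array([1, 0, 1])
    print(log_posterior_scores(x, theta, log_prior).argmax())   # predicted class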
45. tangent: logistic function as surrogate loss function
◮ define f(x) ≡ log [p(y = 1|x) / p(y = −1|x)] ∈ R
◮ p(y = 1|x) + p(y = −1|x) = 1 → p(y|x) = 1/(1 + exp(−y f(x)))
◮ −log₂ p({y}_1^N) = Σ_i log₂(1 + e^{−y_i f(x_i)}) ≡ Σ_i ℓ(y_i f(x_i))
◮ ℓ′′ > 0 and ℓ(µ) > 1[µ < 0] ∀µ ∈ R
◮ ∴ maximizing the log-likelihood is minimizing a surrogate convex loss function for classification
◮ but Σ_i log₂(1 + e^{−y_i wᵀh(x_i)}) is not as easy as Σ_i e^{−y_i wᵀh(x_i)}
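A small numeric check of the surrogate-loss claim, as a sketch (the grid of margins is arbitrary):

    import numpy as np

    mu = np.linspace(-4, 4, 9)                    # margins y * f(x)
    logistic_loss = np.log2(1 + np.exp(-mu))      # l(mu)
    zero_one_loss = (mu < 0).astype(float)        # 1[mu < 0]

    # the convex logistic loss upper-bounds the 0-1 loss at every margin
    assert np.all(logistic_loss > zero_one_loss)
    print(np.column_stack([mu, logistic_loss, zero_one_loss]))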
50. boosting 1
L: exponential surrogate loss function, summed over examples:
◮ L[F] = Σ_i exp(−y_i F(x_i))
◮ = Σ_i exp(−y_i Σ_{t′=1}^{t} w_{t′} h_{t′}(x_i)) ≡ L_t(w_t)
◮ Draw h_t ∈ H, a large space of rules s.t. h(x) ∈ {−1, +1}
◮ label y ∈ {−1, +1}
54. boosting 1
L: exponential surrogate loss function, summed over examples:
◮ L_{t+1}(w_t; w) ≡ Σ_i d_i^t exp(−y_i w h_{t+1}(x_i))
◮ = Σ_{y_i = h_{t+1}(x_i)} d_i^t e^{−w} + Σ_{y_i ≠ h_{t+1}(x_i)} d_i^t e^{+w} ≡ e^{−w} D_+ + e^{+w} D_−
◮ ∴ w_{t+1} = argmin_w L_{t+1}(w) = (1/2) log(D_+/D_−)
◮ L_{t+1}(w_{t+1}) = 2√(D_+ D_−) = 2√(ν_+(1 − ν_+)) D, where 0 ≤ ν_+ ≡ D_+/D = D_+/L_t ≤ 1
◮ update example weights d_i^{t+1} = d_i^t e^{∓w}
Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, flexible hypothesis spaces, L1, L∞⁴, . . .
⁴ Duchi + Singer, “Boosting with structural sparsity”, ICML ’09
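A minimal Python sketch of the single boosting step above, assuming the weak rule's predictions h_{t+1}(x_i) are already available (the function and array names are mine):

    import numpy as np

    def boosting_step(d, y, h_pred):
        # d: current example weights d_i^t
        # y, h_pred: labels and weak-rule predictions, both in {-1, +1}
        correct = (y == h_pred)
        D_plus = d[correct].sum()                 # weight on correctly classified examples
        D_minus = d[~correct].sum()               # weight on misclassified examples
        w = 0.5 * np.log(D_plus / D_minus)        # argmin of e^{-w} D+ + e^{+w} D-
        loss = 2 * np.sqrt(D_plus * D_minus)      # L_{t+1} at the optimal w
        d_new = d * np.exp(-w * y * h_pred)       # e^{-w} if correct, e^{+w} if not
        return w, loss, d_new

    # toy example: five examples, one mistake by the weak rule
    y = np.array([+1, +1, -1, -1, +1])
    h_pred = np.array([+1, -1, -1, -1, +1])
    d = np.ones(5)
    w, loss, d = boosting_step(d, y, h_pred)
    print(w, loss, d)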