3. Class-conditional independence assumption
Often called simple, naive, or even idiot*
* Idiot's Bayes - not so stupid after all? Hand, D.J. & Yu, K. (2001).
International Statistical Review, 69(3):385-399. ISSN 0306-7734.
\( \arg\max_y \, p(y \mid x_1, \dots, x_K) \;=\; \arg\max_y \, p(y) \prod_k p(x_k \mid y) \)
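A minimal sketch of the naive Bayes decision rule (argmax over classes of the prior times the product of per-feature likelihoods), computed in log space to avoid underflow; all names and the toy probability tables are hypothetical:

```python
import math

def predict(priors, cond, x):
    """Naive Bayes decision rule: argmax_y p(y) * prod_k p(x_k | y).

    priors: dict y -> p(y)
    cond:   dict y -> list of dicts, cond[y][k][v] = p(x_k = v | y)
    x:      observed feature values (x_1, ..., x_K)
    """
    best_y, best_score = None, -math.inf
    for y, p_y in priors.items():
        # Log space: sums instead of products, no underflow with many features.
        score = math.log(p_y) + sum(math.log(cond[y][k][v]) for k, v in enumerate(x))
        if score > best_score:
            best_y, best_score = y, score
    return best_y

# Toy example: one binary feature, two classes (made-up numbers).
priors = {"spam": 0.4, "ham": 0.6}
cond = {"spam": [{"$": 0.8, "no$": 0.2}], "ham": [{"$": 0.1, "no$": 0.9}]}
print(predict(priors, cond, ["$"]))  # "spam": 0.4 * 0.8 beats 0.6 * 0.1
```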
5. Winning binning?
Outliers
Missing values
Stability
* MODL: a Bayes optimal discretization method for continuous attributes. Boullé, M., (2006).
Machine Learning, 65(1):131-165.
No parameter to validate
O(n log n)
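MODL searches for a Bayes-optimal discretization; as a rough illustration of the mechanics only (not the MODL criterion), here is an equal-frequency stand-in showing why binning is robust to outliers, accommodates missing values, and costs O(n log n) (the sort dominates). All names are hypothetical:

```python
def equal_frequency_bins(values, n_bins=4):
    """Equal-frequency binning: a simple stand-in, NOT the MODL criterion.

    Sorting dominates the cost, hence O(n log n). Missing values (None)
    get a dedicated extra bin, and an extreme outlier merely lands in
    the last bin instead of distorting the other bin boundaries.
    """
    observed = sorted(v for v in values if v is not None)
    # Bin edges taken at the empirical quantiles of the observed values.
    edges = [observed[(i * len(observed)) // n_bins] for i in range(1, n_bins)]
    def bin_of(v):
        if v is None:
            return n_bins  # dedicated "missing" bin
        return sum(v >= e for e in edges)
    return [bin_of(v) for v in values]

# The outlier 1000.0 falls in the last bin; None gets its own bin.
print(equal_frequency_bins([1.0, 2.0, 3.0, 4.0, 1000.0, None], n_bins=2))
# [0, 0, 1, 1, 1, 2]
```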
6. Selective Naive Bayes
On predictive distributions and Bayesian networks. Kontkanen, P., Myllymäki, P., Silander, T., Tirri, H. & Grünwald, P. (2000).
Statistics and Computing, 10:39-54.
\( s_k \in \{0, 1\} \)
\( \arg\max_y \, p(y \mid x_1, \dots, x_K) \;=\; \arg\max_y \, p(y) \prod_k p(x_k \mid y)^{s_k} \)
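With hard selection, a feature with s_k = 0 contributes p(x_k | y)^0 = 1, i.e. nothing in log space, so it drops out of the product entirely. A minimal sketch (names and numbers hypothetical):

```python
import math

def snb_score(p_y, cond_probs, s):
    """Score one class: log p(y) + sum_k s_k * log p(x_k | y).

    s_k in {0, 1}: a deselected feature (s_k = 0) contributes
    p(x_k | y)^0 = 1, i.e. nothing in log space.
    """
    return math.log(p_y) + sum(sk * math.log(p) for sk, p in zip(s, cond_probs))

# Deselecting feature 1 removes its (very unfavorable) contribution.
print(snb_score(0.5, [0.9, 0.01], [1, 0]))  # only feature 0 counts
print(snb_score(0.5, [0.9, 0.01], [1, 1]))  # feature 1 drags the score down
```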
7. Select features
An introduction to variable and feature selection. Guyon, I. & Elisseeff, A. (2003).
Journal of Machine Learning Research, 3:1157-1182.
Wrapper approach?
→ Greedy optimization
→ Nested subsets
Embedded approach?
→ Direct objective optimization
Filter approach?
→ Mutual information
→ Weak learner
→ Cross-validation
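The mutual-information filter criterion can be computed directly from empirical counts, I(X;Y) = Σ p(x,y) log( p(x,y) / (p(x)p(y)) ); a self-contained sketch with toy data:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) )."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))           # joint counts
    px, py = Counter(xs), Counter(ys)    # marginal counts
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# A feature identical to the class carries maximal information (log 2 nats)...
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))
# ...an independent one carries none.
print(mutual_information([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0
```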
8. Forward Feature Selection
[Diagram: candidate features (A, B, C, D, E) are drawn independently from the pool of future candidates into the pool of actual candidates; a drawn feature is included in the model iff the AUROCC is improved, and kept safe otherwise.]
10. Soft selection
\( w_k \in [0, 1] \)
\( \arg\max_y \, p(y \mid x_1, \dots, x_K) \;=\; \arg\max_y \, p(y) \prod_k p(x_k \mid y)^{w_k} \)
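In log space the soft exponent w_k ∈ [0, 1] simply scales each feature's contribution, interpolating between dropping the feature (w_k = 0) and using it at full strength (w_k = 1). A minimal sketch (names hypothetical):

```python
import math

def soft_nb_score(p_y, cond_probs, w):
    """log p(y) + sum_k w_k * log p(x_k | y), with w_k in [0, 1].

    w_k interpolates between dropping feature k (w_k = 0) and using
    it at full strength (w_k = 1).
    """
    return math.log(p_y) + sum(wk * math.log(p) for wk, p in zip(w, cond_probs))

# w_k = 0.5 is exactly halfway between "dropped" and "full strength".
full = soft_nb_score(0.5, [0.9], [1.0])
half = soft_nb_score(0.5, [0.9], [0.5])
none = soft_nb_score(0.5, [0.9], [0.0])
print(full, half, none)
```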
11. The averaging trick
\( w_k = \dfrac{\sum_{s \in S} s_k \, p(s \mid d)}{\sum_{s \in S} p(s \mid d)} \)
* A Parameter-Free Classification Method for Large Scale Learning. Boullé, M., (2009).
Journal of Machine Learning Research, 10:1367-1385.
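Given a set S of explored selection vectors s and their (unnormalized) posteriors p(s | d), the averaged weights follow directly from the formula above; a minimal sketch (names and numbers hypothetical):

```python
def averaged_weights(models):
    """w_k = sum_{s in S} s_k * p(s|d) / sum_{s in S} p(s|d).

    models: list of (s, posterior) pairs, where s is a 0/1 selection
    vector and posterior is the (unnormalized) p(s | d) of that model.
    """
    total = sum(p for _, p in models)
    n_features = len(models[0][0])
    return [sum(s[k] * p for s, p in models) / total for k in range(n_features)]

# Two explored models: feature 0 appears in both, feature 1 only in
# the less probable one, so its averaged weight is small.
models = [([1, 0], 0.75), ([1, 1], 0.25)]
print(averaged_weights(models))  # [1.0, 0.25]
```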
12. The averaging trick
Explored models only
\( w_k = \dfrac{\sum_{s \in S} s_k \, p(s \mid d)}{\sum_{s \in S} p(s \mid d)} \)
13. The averaging trick
Nonparametric prior
\( w_k = \dfrac{\sum_{s \in S} s_k \, p(s \mid d)}{\sum_{s \in S} p(s \mid d)} \)
14. + / -
+ Performance
+ Low algorithm complexity
+ Nonparametric and stable (bye bye cross-validation!)
+ Numeric / categorical (bye bye dummy-encoding!)
+ Interpretable*
- It's up to the user to find 'composite' features and capture correlational relationships, but... it's where the fun is, ain't it?
* https://www.quora.com/What-makes-a-model-interpretable/answer/Claudia-Perlich