3. What's / why feature selection
• A procedure in machine learning to find a subset of features that produces a 'better' model for a given dataset
– Avoid overfitting and achieve better generalization ability
– Reduce the storage requirement and training time
– Interpretability
4. When feature selection is important
• Noisy data
• Lots of low-frequency features
• Use of multi-type features
• Too many features compared to the number of samples
• Complex model
• Samples in the real scenario are inhomogeneous with the training & test samples
5. When No.(samples)/No.(features) is large
• Feature selection with Gini indexing
• Algorithm: logistic regression
• Training samples: 640K; test samples: 49K
• Feature: watch behavior of audiences; show level (11,327 features)
[Figure: test AUC of L1-LR vs. L2-LR (approx. 0.80–0.83) against the ratio of features used, from all features down to 10%]
6. When No.(samples) equals No.(features)
• L1 logistic regression
• Training samples: 50K; test samples: 49K
• Feature: watch behavior of audiences; video level (49,166 features)
[Figure: how AUC changes with the number of features selected; AUC approx. 0.728–0.736 against the ratio of features used, from all features down to 10%]
7. Typical methods for feature selection
• Categories

             Single feature evaluation                Subset selection
  filter     MI, IG, KL-D, GI, CHI                    Category distance, …
  wrapper    Ranking accuracy using single feature    For LR (SFO, Grafting)

• Single feature evaluation
– Frequency based, mutual information, KL divergence, Gini indexing, information gain, Chi-square statistic
• Subset selection method
– Sequential forward selection
– Sequential backward selection
8. Single feature evaluation
• Measure the quality of features with various metrics
– Frequency based
– Dependence of feature and label (co-occurrence)
• mutual information, Chi-square statistic
– Information theory
• KL divergence, information gain
– Gini indexing
9. Frequency based
• Remove features according to the frequency of the feature, or the number of instances that contain the feature
• Typical scenario
– Text mining
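As a minimal illustration of the frequency-based filter (not from the slides; the function name and the `min_df` cutoff are illustrative), this sketch keeps only features whose document frequency reaches a threshold:

```python
from collections import Counter

def frequent_features(docs, min_df=5):
    """Keep features that appear in at least min_df documents.
    docs: iterable of token lists; min_df: illustrative cutoff."""
    df = Counter(f for doc in docs for f in set(doc))
    return {f for f, count in df.items() if count >= min_df}
```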
10. Mutual information
• Measure the dependence of two random variables
• Definition: I(X; Y) = Σ_x Σ_y P(x, y) ln [ P(x, y) / (P(x) P(y)) ]
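A small sketch of this definition computed from empirical counts (the function and toy example are illustrative, not from the slides):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) = sum over (x, y) of P(x, y) * ln(P(x, y) / (P(x) * P(y)))."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # P(x, y) / (P(x) P(y)) simplifies to count(x, y) * n / (count(x) * count(y)).
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# A feature that perfectly predicts the label attains maximal MI:
print(mutual_information([1, 1, 0, 0], [1, 1, 0, 0]))  # ln(2) ~ 0.693
```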
11. Chi-square statistic
• Measure the dependence of two variables
– A: number of times feature t and category c co-occur
– B: number of times t occurs without c
– C: number of times c occurs without t
– D: number of times neither c nor t occurs
– N: total number of instances
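The formula itself did not survive extraction; assuming the standard 2×2 form used in the ICML'97 study cited in the references, a sketch:

```python
def chi_square(A, B, C, D):
    """chi2(t, c) = N * (A*D - C*B)^2 / ((A+C) * (B+D) * (A+B) * (C+D)),
    with A, B, C, D as defined on the slide and N = A + B + C + D."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

# t and c always co-occur in 100 instances and never apart:
print(chi_square(A=50, B=0, C=0, D=50))  # 100.0 (= N): maximal dependence
```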
12. Entropy
• Characterize the (im)purity of a collection of examples
Entropy(S) = − Σ_i P_i ln P_i
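A direct transcription of the formula (natural log, as written):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i P_i * ln(P_i), P_i = fraction of class i in S."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

print(entropy([1, 1, 0, 0]))       # ln(2) ~ 0.693: maximally impure
print(entropy([1, 1, 1, 1]) == 0)  # True: a pure collection
```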
13. Information gain
• Reduction in entropy caused by partitioning the examples according to the attribute
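The slide's formula was lost in extraction; the standard definition is Gain(S, A) = Entropy(S) − Σ_v (|S_v|/|S|) · Entropy(S_v), sketched below with illustrative names:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(attribute_values):
        s_v = [y for y, x in zip(labels, attribute_values) if x == v]
        gain -= (len(s_v) / n) * entropy(s_v)
    return gain

# A perfectly informative attribute recovers all of the entropy:
print(information_gain([1, 1, 0, 0], ["a", "a", "b", "b"]))  # ln(2) ~ 0.693
```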
14. KL divergence
• Measure the difference between two probability distributions
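A sketch of the definition D_KL(P ‖ Q) = Σ_i P(i) ln(P(i)/Q(i)); as the notes at the end observe, it is not symmetric and, by Gibbs' inequality, never negative:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) * ln(P(i) / Q(i)); terms with P(i) = 0 vanish."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # ~0.368
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # ~0.511, != the above: asymmetric
```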
15. Gini indexing
• Calculate the conditional probability of f given the class label
• Normalize across all classes
• Calculate the Gini coefficient
• For the two-category case
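One plausible reading of the three steps above (the exact scoring variant in the cited JMLR'03 paper may differ; the names and the pure/uniform interpretation are illustrative assumptions):

```python
import numpy as np

def gini_score(feature_counts, class_totals):
    """feature_counts[c]: instances of class c containing feature f;
    class_totals[c]: total instances of class c."""
    # Step 1: conditional probability of f given each class label.
    p_f_given_c = np.asarray(feature_counts) / np.asarray(class_totals)
    # Step 2: normalize across all classes.
    p = p_f_given_c / p_f_given_c.sum()
    # Step 3: Gini coefficient: 1 for a feature seen in only one class,
    # 1/K for a feature spread uniformly over K classes.
    return float((p ** 2).sum())

print(gini_score([90, 0], [100, 100]))   # 1.0: class-specific feature
print(gini_score([50, 50], [100, 100]))  # 0.5: uninformative feature
```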
16. Comparison in text categorization (1)
• A comparative study on feature selection in text categorization (ICML'97)
17. Comparison in text categorization (2)
• Feature selection for text classification based on Gini Coefficient of Inequality (JMLR'03)
18. Shortcomings of single feature evaluation
• Relevance between features is ignored
– Features could be redundant
– A feature that is completely useless by itself can provide a significant performance improvement when taken with others
– Two features that are useless by themselves can be useful together
19. Shortcomings of single feature evaluation (2)
• A feature that is completely useless by itself can provide a significant performance improvement when taken with others
20. Shortcomings of single feature evaluation (3)
• Two features that are useless by themselves can be useful together (see the XOR sketch below)
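A synthetic XOR illustration of this point (not from the slides): each feature alone predicts nothing, but a depth-2 tree on both is perfect.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# XOR data: the label depends on x0 and x1 jointly, on neither alone.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2))
y = X[:, 0] ^ X[:, 1]

for cols in ([0], [1], [0, 1]):
    acc = cross_val_score(DecisionTreeClassifier(max_depth=2),
                          X[:, cols], y, cv=5).mean()
    print(cols, round(acc, 2))  # ~0.5, ~0.5, 1.0
```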
21. Subset selection methods
• Select subsets of features that together have good predictive power, as opposed to ranking features individually
• Usually done by adding new features to the existing set or removing features from it
– Sequential forward selection (sketched below)
– Sequential backward selection
• Evaluation
– Category distance measurement
– Classification error
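A minimal sketch of sequential forward selection; the slide lists category distance or classification error as the evaluation, and cross-validated accuracy stands in for it here (all names and data are illustrative):

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(model, X, y, k):
    """Greedy SFS: repeatedly add the feature whose inclusion gives
    the best cross-validated score on the enlarged subset."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        scores = {j: cross_val_score(clone(model), X[:, selected + [j]],
                                     y, cv=3).mean() for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)
print(forward_selection(LogisticRegression(max_iter=1000), X, y, k=3))
```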
23. Wrapper methods for logistic regression
• Forward feature selection
– Naïve method
• needs to build a number of models quadratic in the number of features
– Grafting
– Single feature optimization (SFO)
24. SFO (Singh et al., 2009)
• Only optimize the coefficient of the new feature
• Only need to iterate over instances that contain the new feature
• Afterwards, fully relearn one new model with the selected feature included
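A rough sketch of the SFO idea under the two properties above (an illustrative reconstruction, not the paper's implementation): hold the current model's outputs fixed and fit only the new feature's weight, touching only the instances where the feature fires.

```python
import numpy as np

def sfo_weight(base_logits, x_new, y, n_iter=10):
    """Fit only the new feature's coefficient w by 1-D Newton steps on the
    logistic log-likelihood, with the rest of the model frozen."""
    idx = np.nonzero(x_new)[0]                 # instances containing the feature
    z, xs, ys = base_logits[idx], x_new[idx], y[idx]
    w = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(z + w * xs)))
        grad = np.dot(xs, ys - p)              # d loglik / dw
        hess = -np.dot(xs * xs, p * (1 - p))   # d2 loglik / dw2 (negative)
        w -= grad / hess                       # Newton ascent step
    return w
```

Candidate features can then be ranked by the log-likelihood gain their fitted weight yields, and, per the slide, the full model is relearned once a feature is actually accepted.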
25. Grafting (Perkins, 2003)
• Use the loss function's gradient with respect to the new feature to decide whether to add it
• At each step, the feature with the largest gradient is added
• The model is fully relearned after each feature is added
– Need only build D models overall
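A compact sketch of the grafting loop as described above (an illustrative reconstruction; for logistic regression the log-loss gradient with respect to a candidate weight w_j is X_j · (p − y)):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def grafting(X, y, k):
    selected, remaining = [], set(range(X.shape[1]))
    p = np.full(len(y), y.mean())      # predictions of the empty model
    for _ in range(k):
        # Gradient magnitude of the log-loss w.r.t. each unused feature's weight.
        grads = {j: abs(np.dot(X[:, j], p - y)) for j in remaining}
        best = max(grads, key=grads.get)
        selected.append(best)
        remaining.remove(best)
        # Fully relearn on the selected set: D models overall for D features.
        model = LogisticRegression(max_iter=1000).fit(X[:, selected], y)
        p = model.predict_proba(X[:, selected])[:, 1]
    return selected

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)
print(grafting(X, y, k=3))
```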
26. Experimentation
• Percent improvement of log-likelihood on the test set
• Both SFO and Grafting are easily parallelized
27. Summarization
• Categories

             Single feature evaluation                Subset selection
  filter     MI, IG, KL-D, GI, CHI                    Category distance, …
  wrapper    Ranking accuracy using single feature    For LR (SFO, Grafting)

• Filter + single feature evaluation
– Less time-consuming; usually works well
• Wrapper + subset selection
– Higher accuracy, but prone to overfitting
28. Tips about feature selection
• Remove features that could not occur in the real scenario
• If a feature contributes nothing, the fewer features the better
• Use L1 regularization for logistic regression (see the sketch below)
• Use the random subspace method
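For the L1 tip, a short sketch with scikit-learn (synthetic data; the penalty strength C is an illustrative choice): the L1 penalty drives irrelevant weights exactly to zero, so the surviving coefficients double as a feature selection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=50,
                           n_informative=5, random_state=0)
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(model.coef_)  # indices of features with nonzero weight
print(f"{kept.size} of {X.shape[1]} features kept:", kept)
```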
29. References
• Feature selection for classification (IDA'97)
• An introduction to variable and feature selection (JMLR'03)
• Feature selection for text classification based on Gini Coefficient of Inequality (JMLR'03)
• A comparative study on feature selection in text categorization (ICML'97)
• Scaling Up Machine Learning
Editor's notes
Why samples of different categories can be separated. Well separated -> smaller classification error. Different features make different contributions.
Noisy data. Lots of low-frequency features: e.g., using ad-id as a feature easily overfits. Multi-type features. Too many features compared to samples: feature number > sample number; feature combination. Complex model: ANN. Samples to be predicted are inhomogeneous with the training & test samples: demographic targeting; time-series related.
Key points: "how to measure the quality of features" and "whether and how to use the underlying algorithms". 1. The optimal feature set can only be found by exhaustive search. 2. In all existing feature selection methods, the feature set is generated by adding or removing features from the set of the previous step.
Decision tree
Not a true distance metric, because it is not symmetric. Cannot be negative (Gibbs' inequality). Used in topic models.
Features could be redundant: videoId, contentId.
With 1,000 features, at an average cost of 1 second to build one model, the naïve quadratic method would take about 1 week.