3. What's / why feature selection
• A procedure in machine learning to find a subset of features that produces a 'better' model for a given dataset
– Avoid overfitting and achieve better generalization ability
– Reduce the storage requirement and training time
– Interpretability
4. When feature selection is important
• Noisy data
• Lots of low-frequency features
• Use of multi-type features
• Too many features compared to the number of samples
• Complex model
• Samples in the real scenario are inhomogeneous with the training & test samples
5. When No.(samples)/No.(features) is large
• Feature selection with Gini indexing
• Algorithm: logistic regression
• Training samples: 640K; test samples: 49K
• Feature: watch behavior of audiences; show level (11,327 features)
[Figure: test AUC of L1-LR vs. L2-LR (approx. 0.80–0.83) against the ratio of features used, from all features down to 10%]
6. When No.(samples) equals No.(features)
• L1 logistic regression
• Training samples: 50K; test samples: 49K
• Feature: watch behavior of audiences; video level (49,166 features)
[Figure: how AUC changes with the number of features selected; AUC approx. 0.728–0.736 against the ratio of features used, from all features down to 10%]
7. Typical methods for feature selection
• Categories

             Single feature evaluation                Subset selection
  filter     MI, IG, KL-D, GI, CHI                    Category distance, …
  wrapper    Ranking accuracy using single feature    For LR (SFO, Grafting)

• Single feature evaluation
– Frequency based, mutual information, KL divergence, Gini indexing, information gain, Chi-square statistic
• Subset selection method
– Sequential forward selection
– Sequential backward selection
8. Single feature evaluation
• Measure the quality of features with various metrics
– Frequency based
– Dependence of feature and label (co-occurrence)
• mutual information, Chi-square statistic
– Information theory
• KL divergence, information gain
– Gini indexing
9. Frequency based
• Remove features according to the frequency of the feature, or the number of instances that contain the feature
• Typical scenario
– Text mining
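As a minimal illustration of the frequency-based filter (not from the slides; the function name and the `min_df` cutoff are illustrative), this sketch keeps only features whose document frequency reaches a threshold:

```python
from collections import Counter

def frequent_features(docs, min_df=5):
    """Keep features that appear in at least min_df documents.
    docs: iterable of token lists; min_df: illustrative cutoff."""
    df = Counter(f for doc in docs for f in set(doc))
    return {f for f, count in df.items() if count >= min_df}
```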
10. Mutual information
• Measure the dependence of two random variables
• Definition: I(X; Y) = Σ_x Σ_y P(x, y) ln [ P(x, y) / (P(x) P(y)) ]
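A small sketch of this definition computed from empirical counts (the function and toy example are illustrative, not from the slides):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) = sum over (x, y) of P(x, y) * ln(P(x, y) / (P(x) * P(y)))."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # P(x, y) / (P(x) P(y)) simplifies to count(x, y) * n / (count(x) * count(y)).
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# A feature that perfectly predicts the label attains maximal MI:
print(mutual_information([1, 1, 0, 0], [1, 1, 0, 0]))  # ln(2) ~ 0.693
```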
11. Chi-square statistic
• Measure the dependence of two variables
– A: number of times feature t and category c co-occur
– B: number of times t occurs without c
– C: number of times c occurs without t
– D: number of times neither c nor t occurs
– N: total number of instances
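The formula itself did not survive extraction; assuming the standard 2×2 form used in the ICML'97 study cited in the references, a sketch:

```python
def chi_square(A, B, C, D):
    """chi2(t, c) = N * (A*D - C*B)^2 / ((A+C) * (B+D) * (A+B) * (C+D)),
    with A, B, C, D as defined on the slide and N = A + B + C + D."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

# t and c always co-occur in 100 instances and never apart:
print(chi_square(A=50, B=0, C=0, D=50))  # 100.0 (= N): maximal dependence
```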
12. Entropy
• Characterize the (im)purity of a collection of examples
Entropy(S) = − Σ_i P_i ln P_i
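A direct transcription of the formula (natural log, as written):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i P_i * ln(P_i), P_i = fraction of class i in S."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

print(entropy([1, 1, 0, 0]))       # ln(2) ~ 0.693: maximally impure
print(entropy([1, 1, 1, 1]) == 0)  # True: a pure collection
```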
13. Information gain
• Reduction in entropy caused by partitioning the examples according to the attribute
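The slide's formula was lost in extraction; the standard definition is Gain(S, A) = Entropy(S) − Σ_v (|S_v|/|S|) · Entropy(S_v), sketched below with illustrative names:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(attribute_values):
        s_v = [y for y, x in zip(labels, attribute_values) if x == v]
        gain -= (len(s_v) / n) * entropy(s_v)
    return gain

# A perfectly informative attribute recovers all of the entropy:
print(information_gain([1, 1, 0, 0], ["a", "a", "b", "b"]))  # ln(2) ~ 0.693
```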
14. KL divergence
• Measure the difference between two probability distributions
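A sketch of the definition D_KL(P ‖ Q) = Σ_i P(i) ln(P(i)/Q(i)); as the notes at the end observe, it is not symmetric and, by Gibbs' inequality, never negative:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) * ln(P(i) / Q(i)); terms with P(i) = 0 vanish."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # ~0.368
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # ~0.511, != the above: asymmetric
```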
15. Gini indexing
• Calculate the conditional probability of f given the class label
• Normalize across all classes
• Calculate the Gini coefficient
• For the two-category case
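One plausible reading of the three steps above (the exact scoring variant in the cited JMLR'03 paper may differ; the names and the pure/uniform interpretation are illustrative assumptions):

```python
import numpy as np

def gini_score(feature_counts, class_totals):
    """feature_counts[c]: instances of class c containing feature f;
    class_totals[c]: total instances of class c."""
    # Step 1: conditional probability of f given each class label.
    p_f_given_c = np.asarray(feature_counts) / np.asarray(class_totals)
    # Step 2: normalize across all classes.
    p = p_f_given_c / p_f_given_c.sum()
    # Step 3: Gini coefficient: 1 for a feature seen in only one class,
    # 1/K for a feature spread uniformly over K classes.
    return float((p ** 2).sum())

print(gini_score([90, 0], [100, 100]))   # 1.0: class-specific feature
print(gini_score([50, 50], [100, 100]))  # 0.5: uninformative feature
```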
16. Comparison in text categorization (1)
• A comparative study on feature selection in text categorization (ICML'97)
17. Comparison in text categorization (2)
• Feature selection for text classification based on Gini Coefficient of Inequality (JMLR'03)
18. Shortcomings of single feature evaluation
• Relevance between features is ignored
– Features could be redundant
– A feature that is completely useless by itself can provide a significant performance improvement when taken with others
– Two features that are useless by themselves can be useful together
19. Shortcomings of single feature evaluation (2)
• A feature that is completely useless by itself can provide a significant performance improvement when taken with others
20. Shortcomings of single feature evaluation (3)
• Two features that are useless by themselves can be useful together (see the XOR sketch below)
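A synthetic XOR illustration of this point (not from the slides): each feature alone predicts nothing, but a depth-2 tree on both is perfect.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# XOR data: the label depends on x0 and x1 jointly, on neither alone.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2))
y = X[:, 0] ^ X[:, 1]

for cols in ([0], [1], [0, 1]):
    acc = cross_val_score(DecisionTreeClassifier(max_depth=2),
                          X[:, cols], y, cv=5).mean()
    print(cols, round(acc, 2))  # ~0.5, ~0.5, 1.0
```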
21. Subset selection methods
• Select subsets of features that together have good predictive power, as opposed to ranking features individually
• Usually done by adding new features to the existing set or removing features from it
– Sequential forward selection (sketched below)
– Sequential backward selection
• Evaluation
– Category distance measurement
– Classification error
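A minimal sketch of sequential forward selection; the slide lists category distance or classification error as the evaluation, and cross-validated accuracy stands in for it here (all names and data are illustrative):

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(model, X, y, k):
    """Greedy SFS: repeatedly add the feature whose inclusion gives
    the best cross-validated score on the enlarged subset."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        scores = {j: cross_val_score(clone(model), X[:, selected + [j]],
                                     y, cv=3).mean() for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)
print(forward_selection(LogisticRegression(max_iter=1000), X, y, k=3))
```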
23. Wrapper methods for logistic regression
• Forward feature selection
– Naïve method
• needs to build a number of models quadratic in the number of features
– Grafting
– Single feature optimization (SFO)
24. SFO (Singh et al., 2009)
• Only optimize the coefficient of the new feature
• Only need to iterate over instances that contain the new feature
• Afterwards, fully relearn one new model with the selected feature included
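A rough sketch of the SFO idea under the two properties above (an illustrative reconstruction, not the paper's implementation): hold the current model's outputs fixed and fit only the new feature's weight, touching only the instances where the feature fires.

```python
import numpy as np

def sfo_weight(base_logits, x_new, y, n_iter=10):
    """Fit only the new feature's coefficient w by 1-D Newton steps on the
    logistic log-likelihood, with the rest of the model frozen."""
    idx = np.nonzero(x_new)[0]                 # instances containing the feature
    z, xs, ys = base_logits[idx], x_new[idx], y[idx]
    w = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(z + w * xs)))
        grad = np.dot(xs, ys - p)              # d loglik / dw
        hess = -np.dot(xs * xs, p * (1 - p))   # d2 loglik / dw2 (negative)
        w -= grad / hess                       # Newton ascent step
    return w
```

Candidate features can then be ranked by the log-likelihood gain their fitted weight yields, and, per the slide, the full model is relearned once a feature is actually accepted.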
25. Grafting (Perkins, 2003)
• Use the loss function's gradient with respect to the new feature to decide whether to add it
• At each step, the feature with the largest gradient is added
• The model is fully relearned after each feature is added
– Need only build D models overall
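A compact sketch of the grafting loop as described above (an illustrative reconstruction; for logistic regression the log-loss gradient with respect to a candidate weight w_j is X_j · (p − y)):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def grafting(X, y, k):
    selected, remaining = [], set(range(X.shape[1]))
    p = np.full(len(y), y.mean())      # predictions of the empty model
    for _ in range(k):
        # Gradient magnitude of the log-loss w.r.t. each unused feature's weight.
        grads = {j: abs(np.dot(X[:, j], p - y)) for j in remaining}
        best = max(grads, key=grads.get)
        selected.append(best)
        remaining.remove(best)
        # Fully relearn on the selected set: D models overall for D features.
        model = LogisticRegression(max_iter=1000).fit(X[:, selected], y)
        p = model.predict_proba(X[:, selected])[:, 1]
    return selected

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)
print(grafting(X, y, k=3))
```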
26. Experimentation
• Percent improvement of log-likelihood on the test set
• Both SFO and Grafting are easily parallelized
27. Summarization
• Categories

             Single feature evaluation                Subset selection
  filter     MI, IG, KL-D, GI, CHI                    Category distance, …
  wrapper    Ranking accuracy using single feature    For LR (SFO, Grafting)

• Filter + single feature evaluation
– Less time-consuming; usually works well
• Wrapper + subset selection
– Higher accuracy, but prone to overfitting
28. Tips about feature selection
• Remove features that could not occur in the real scenario
• If a feature contributes nothing, the fewer features the better
• Use L1 regularization for logistic regression (see the sketch below)
• Use the random subspace method
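For the L1 tip, a short sketch with scikit-learn (synthetic data; the penalty strength C is an illustrative choice): the L1 penalty drives irrelevant weights exactly to zero, so the surviving coefficients double as a feature selection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=50,
                           n_informative=5, random_state=0)
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(model.coef_)  # indices of features with nonzero weight
print(f"{kept.size} of {X.shape[1]} features kept:", kept)
```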
29. References
• Feature selection for classification (IDA'97)
• An introduction to variable and feature selection (JMLR'03)
• Feature selection for text classification based on Gini Coefficient of Inequality (JMLR'03)
• A comparative study on feature selection in text categorization (ICML'97)
• Scaling Up Machine Learning
Editor's notes
Why samples of different categories can be separated. Well separated -> smaller classification error. Different features make different contributions.
Noisy data. Lots of low-frequency features: e.g., using ad-id as a feature easily overfits. Multi-type features. Too many features compared to samples: feature number > sample number; feature combination. Complex model: ANN. Samples to be predicted are inhomogeneous with the training & test samples: demographic targeting; time-series related.
Key points: "how to measure the quality of features" and "whether and how to use the underlying algorithms". 1. The optimal feature set can only be found by exhaustive search. 2. In all existing feature selection methods, the feature set is generated by adding or removing features from the set of the previous step.
Decision tree
Not a true distance metric, because it is not symmetric. Cannot be negative (Gibbs' inequality). Used in topic models.
Features could be redundant: videoId, contentId.
With 1,000 features, at an average cost of 1 second to build one model, the naïve quadratic method would take about 1 week.