C4.5 enhances ID3 by making it more robust to noise, handling continuous attributes, dealing with missing data, and converting decision trees to rules. It avoids overfitting through pre-pruning and post-pruning techniques. For continuous attributes, it evaluates all possible split points and chooses the optimal one. Missing data can be treated as a separate value, but this is not always appropriate; C4.5 instead splits such instances into weighted fractions. It generates rules from trees in a greedy manner, pruning conditions to reduce the estimated error. The next topic will be instance-based classifiers.
1. Introduction to Machine Learning
Lecture 6
Albert Orriols i Puig
aorriols@salle.url.edu
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
2. Recap of Lecture 4
ID3 is a strong system that
- Uses hill-climbing search based on the information gain measure to search through the space of decision trees
- Outputs a single hypothesis
- Never backtracks: it converges to locally optimal solutions
- Uses all training examples at each step, contrary to methods that make decisions incrementally
- Uses statistical properties of all examples: the search is less sensitive to errors in individual training examples
- Can handle noisy data by modifying its termination criterion to accept hypotheses that imperfectly fit the data
3. Recap of Lecture 4
However, ID3 has some drawbacks:
- It can only deal with nominal data
- It is not able to deal with noisy data sets
- It may not be robust in the presence of noise
4. Today's Agenda
Going from ID3 to C4.5: how C4.5 enhances ID3 to
- Be robust in the presence of noise and avoid overfitting
- Deal with continuous attributes
- Deal with missing data
- Convert trees to rules
5. What's Overfitting?
Overfitting: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that:
1. h has smaller error than h' over the training examples, but
2. h' has a smaller error than h over the entire distribution of instances.
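To make the definition concrete, here is a minimal sketch, assuming scikit-learn and NumPy are available (the synthetic dataset, noise level, and depth settings are illustrative assumptions, not lecture material), comparing an unrestricted decision tree with a depth-limited one on noisy data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Illustrative noisy dataset: the class depends on x[0], with 15% label noise
rng = np.random.RandomState(0)
X = rng.rand(400, 2)
y = (X[:, 0] > 0.5).astype(int)
noise = rng.rand(400) < 0.15
y[noise] = 1 - y[noise]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in [None, 2]:  # unrestricted tree vs. a very simple one
    h = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train acc={h.score(X_tr, y_tr):.2f}, "
          f"test acc={h.score(X_te, y_te):.2f}")
```

The unrestricted tree typically scores near-perfectly on the training set but worse than the simple tree on held-out data: by the definition above, it overfits.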
6. Why May My System Overfit?
In domains with noise or uncertainty, the system may try to decrease the training error by completely fitting all the training examples. The learner overfits: it grows extra structure just to classify the noisy instances correctly.
Occam's razor: Prefer the simplest hypothesis that fits the data with high accuracy.
7. How to Avoid Overfitting?
Ok, my system may overfit… Can I avoid it?
Sure! Do not include branches that fit the data too specifically. How?
1. Pre-prune: Stop growing a branch when information becomes unreliable
2. Post-prune: Take a fully-grown decision tree and discard unreliable parts
8. Pre-pruning
Based on a statistical significance test:
- Stop growing the tree when there is no statistically significant association between any attribute and the class at a particular node
- Use all available data for training and apply the statistical test to estimate whether expanding/pruning a node will produce an improvement beyond the training set
- Most popular test: the chi-squared test
- ID3 used the chi-squared test in addition to information gain: only statistically significant attributes were allowed to be selected by the information gain procedure
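As a minimal sketch of this test, assuming SciPy is available (the contingency table and the 0.05 threshold are illustrative assumptions), one can measure the association between an attribute and the class at a node and refuse to split when it is not significant:

```python
from scipy.stats import chi2_contingency

# Illustrative contingency table at a node:
# rows = attribute values, columns = class counts (yes, no)
table = [[20, 5],
         [8, 17]]

chi2, p, dof, expected = chi2_contingency(table)
if p < 0.05:  # the significance threshold is a free parameter
    print(f"Split: association is significant (p={p:.4f})")
else:
    print(f"Pre-prune: stop growing this branch (p={p:.4f})")
```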
9. Pre-pruning
Early stopping: Pre-pruning may stop the growth process prematurely.
Classic example: the XOR/parity problem
- No individual attribute exhibits any significant association with the class
- The structure is only visible in the fully expanded tree
- Pre-pruning won't expand the root node
But: XOR-type problems are rare in practice.
And: pre-pruning is faster than post-pruning.

  #  x1  x2  Class
  1  0   0   0
  2  0   1   1
  3  1   0   1
  4  1   1   0
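A quick check confirms the point; this is a sketch (the entropy helper and data layout are assumptions, not lecture code) showing that on XOR each attribute alone has zero information gain, so a significance-based pre-pruner never expands the root:

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class-count list such as [2, 2]."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

# XOR truth table: (x1, x2, class)
data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

for attr in (0, 1):
    gain = entropy([2, 2])  # class entropy before the split
    for value in (0, 1):
        subset = [row[2] for row in data if row[attr] == value]
        counts = [subset.count(0), subset.count(1)]
        gain -= len(subset) / len(data) * entropy(counts)
    print(f"information gain of x{attr + 1}: {gain}")  # 0.0 for both
```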
10. Post-pruning
First, build the full tree. Then, prune it.
- The fully-grown tree shows all attribute interactions
- Problem: some subtrees might be due to chance effects
Two pruning operations:
1. Subtree replacement
2. Subtree raising
Possible strategies:
- error estimation
- significance testing
- MDL principle
11. Subtree Replacement
Bottom-up approach: consider replacing a tree only after considering all its subtrees.
Example: labor negotiations
12. Subtree Replacement
Algorithm:
1. Split the data into a training and a validation set
2. Do until further pruning is harmful:
   a. Evaluate the impact on the validation set of pruning each possible node
   b. Select the node whose removal most increases the validation set accuracy
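This algorithm is reduced-error pruning. Below is a minimal sketch under the assumption that a tree is a nested dict with 'attr' (attribute index), 'branches' (value → subtree), and a precomputed 'majority' class per internal node; all names are illustrative, not lecture code:

```python
def classify(tree, x):
    """Walk a nested-dict tree; a leaf is any non-dict value."""
    while isinstance(tree, dict):
        tree = tree['branches'][x[tree['attr']]]
    return tree

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def nodes(tree, path=()):
    """Yield the branch-value path to every internal node."""
    if isinstance(tree, dict):
        yield path
        for v, sub in tree['branches'].items():
            yield from nodes(sub, path + (v,))

def pruned(tree, path):
    """Copy of the tree with the node at `path` replaced by its majority leaf."""
    if not path:
        return tree['majority']
    copy = {**tree, 'branches': dict(tree['branches'])}
    copy['branches'][path[0]] = pruned(copy['branches'][path[0]], path[1:])
    return copy

def reduced_error_prune(tree, validation):
    while isinstance(tree, dict):
        base = accuracy(tree, validation)
        best = max(nodes(tree),
                   key=lambda p: accuracy(pruned(tree, p), validation))
        if accuracy(pruned(tree, best), validation) < base:
            break  # further pruning is harmful
        tree = pruned(tree, best)
    return tree
```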
13. Subtree Raising
- Delete a node
- Redistribute its instances
- Slower than subtree replacement (worthwhile?)
14. Estimating Error Rates
Ok, we can prune. But when?
- Prune only if it reduces the estimated error
- Error on the training data is NOT a useful estimator
  - Q: Why would it result in very little pruning? (The tree was grown to fit the training data, so pruning almost never looks beneficial there.)
Use a hold-out set for pruning:
- Separate a validation set from the training data
- Use this validation set to test the improvement
C4.5's method:
- Derive a confidence interval from the training data
- Use a heuristic limit, derived from this interval, for pruning
- Standard Bernoulli-process-based method
- Shaky statistical assumptions (based on training data)
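As a sketch of C4.5's estimate, assuming the usual formulation of the Bernoulli upper confidence bound (the helper and the 25% default confidence level follow common descriptions of C4.5, not its source code):

```python
from math import sqrt
from scipy.stats import norm

def pessimistic_error(errors, n, confidence=0.25):
    """Upper confidence bound on the true error rate of a node that
    misclassifies `errors` of its `n` training instances (Bernoulli model)."""
    f = errors / n                    # observed error rate
    z = norm.ppf(1 - confidence)      # ~0.6745 for the 25% default
    return (f + z * z / (2 * n)
            + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))) \
           / (1 + z * z / n)

# A subtree is pruned when the weighted estimate over its leaves exceeds
# the estimate of the single leaf that would replace it, e.g.:
print(pessimistic_error(2, 6))   # leaf with 2 errors out of 6 instances
```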
15. Deal with Continuous Attributes
With nominal data, we evaluated the gain for each possible value. With continuous data, we have infinite values. What should we do?
- A continuous-valued attribute may take infinite values, but we have a limited number of values in our instances (at most N if we have N instances)
- Therefore, simulate that you have N nominal values
- Evaluate the information gain for every possible split point of the attribute
- Choose the best split point
- The information gain of the attribute is the information gain of the best split
16. Deal with Continuous Attributes
Example (Temperature and Humidity are continuous attributes):

  Outlook   Temperature  Humidity  Windy  Play
  Sunny     85           85        False  No
  Sunny     80           90        True   No
  Overcast  83           86        False  Yes
  Rainy     75           80        False  Yes
  …         …            …         …      …
17. Deal with Continuous Attributes
Split on the temperature attribute:

  value: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
  class: Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

E.g.: temperature < 71.5: yes/4, no/2
      temperature ≥ 71.5: yes/5, no/3
Info([4,2],[5,3]) = 6/14 info([4,2]) + 8/14 info([5,3]) = 0.939 bits
- Place split points halfway between values
- Can evaluate all split points in one pass!
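The one-pass evaluation can be sketched as follows (the info() helper mirrors the slide's notation; the code itself is an illustrative assumption):

```python
from math import log2

def info(counts):
    """Entropy in bits of a class-count list, e.g. info([4, 2])."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

values = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ['Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'No',
          'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No']

n = len(values)
total = [labels.count('Yes'), labels.count('No')]
best = None
below = [0, 0]                       # running class counts for the left side
for i in range(n - 1):
    below[0 if labels[i] == 'Yes' else 1] += 1
    if values[i] == values[i + 1]:
        continue                     # identical values cannot be separated
    split = (values[i] + values[i + 1]) / 2     # halfway between values
    above = [total[0] - below[0], total[1] - below[1]]
    expected = (i + 1) / n * info(below) + (n - i - 1) / n * info(above)
    gain = info(total) - expected
    if best is None or gain > best[0]:
        best = (gain, split)

print(f"best split at {best[1]} with gain {best[0]:.3f} bits")
```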
18. Deal with Continuous Attributes
To speed up: entropy only needs to be evaluated between points of different classes.

  value: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
  class: Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

Only the marked positions are potential optimal breakpoints: breakpoints between values of the same class cannot be optimal.
19. Deal with Missing Data
Treat missing values as a separate value:
- Missing values are denoted "?" in C4.x
- Simple idea: treat missing as a separate value
- Q: When is this not appropriate?
- A: When values are missing due to different reasons
  - Example 1: gene expression could be missing when it is very high or very low
  - Example 2: the field IsPregnant=missing for a male patient should be treated differently (no) than for a female patient of age 25 (unknown)
20. Deal with Missing Data
Split instances with missing values into pieces:
- A piece going down a branch receives a weight proportional to the popularity of the branch
- Weights sum to 1
- Info gain works with fractional instances: use sums of weights instead of counts
- During classification, split the instance into pieces in the same way and merge the probability distributions using the weights
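The classification side of this scheme can be sketched as follows, reusing the nested-dict tree from earlier and assuming each node stores a 'popularity' map with the fraction of training instances that went down each branch (an illustrative assumption):

```python
def class_distribution(tree, x, weight=1.0):
    """Return {class: probability} for instance x, splitting it into
    fractional pieces at nodes whose attribute value is missing ('?')."""
    if not isinstance(tree, dict):            # leaf: all weight to its class
        return {tree: weight}
    value = x[tree['attr']]
    dist = {}
    if value == '?':                          # missing: follow every branch
        for v, sub in tree['branches'].items():
            # 'popularity' = fraction of training instances down branch v
            part = class_distribution(sub, x, weight * tree['popularity'][v])
            for c, w in part.items():
                dist[c] = dist.get(c, 0.0) + w
    else:
        dist = class_distribution(tree['branches'][value], x, weight)
    return dist
```

The predicted class is then the one with the largest merged weight.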
21. From Trees to Rules
I finally got a tree from domains with
- Noisy instances
- Missing values
- Continuous attributes
But I prefer rules… they are not context-dependent.
Procedure:
- Generate a rule for each path from the root to a leaf
- Get context-independent rules
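The path-to-rule conversion can be sketched as follows (the dict layout and the example tree are illustrative assumptions, not lecture material):

```python
def tree_to_rules(tree, conditions=()):
    """Yield (conditions, class) pairs, one rule per root-to-leaf path."""
    if not isinstance(tree, dict):
        yield conditions, tree
        return
    for value, sub in tree['branches'].items():
        yield from tree_to_rules(sub, conditions + ((tree['attr'], value),))

# Hypothetical tree: attribute 0 is Outlook, attribute 1 is Windy
tree = {'attr': 0, 'branches': {
    'Sunny': {'attr': 1, 'branches': {True: 'No', False: 'Yes'}},
    'Overcast': 'Yes',
}}
for conds, label in tree_to_rules(tree):
    body = ' AND '.join(f"attr{a} = {v}" for a, v in conds)
    print(f"IF {body} THEN {label}")
```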
22. From Trees to Rules
A slightly more sophisticated procedure: C4.5rules
- C4.5rules greedily prunes conditions from each rule if this reduces its estimated error
  - This can produce duplicate rules: check for them at the end
- Then
  - look at each class in turn
  - consider the rules for that class
  - find a "good" subset (guided by the MDL principle)
- Then rank the subsets to avoid conflicts
- Finally, remove rules (greedily) if this decreases the error on the training data
23. Next Class
Instance-based Classifiers