Week02
The Author
November 24, 2009
1 Exercise 1
2 Exercise 2
• Intuitively, a1 has the higher information gain. The values of a2 are distributed
equally across both classes, so a2 has no discriminative ability.
• $H(\text{class}) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$
• Gain for a1
$H(a_1, \text{true}) = -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} = 0.9183$
$H(a_1, \text{false}) = -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} = 0.9183$
$H(a_1) = \frac{1}{2} \cdot 0.9183 + \frac{1}{2} \cdot 0.9183 = 0.9183$
$\mathrm{Gain}(a_1) = 1 - 0.9183 = 0.0817$
studentID   score   class
st1         9       yes
st2         4       no
st3         7       yes
...
Table 1: Example of overfitting
• Gain for a2
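$H(a_2, \text{true}) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$
$H(a_2, \text{false}) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$
$H(a_2) = \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot 1 = 1$
$\mathrm{Gain}(a_2) = 1 - 1 = 0$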
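The arithmetic above can be checked with a short Python sketch. The exercise data set is not reproduced in this solution, so the class counts below are assumptions read off the fractions in the entropies: [1, 2] and [2, 1] for the two values of a1, and a hypothetical balanced split [1, 1] and [1, 1] for a2 (any 50/50 branches give the same gain). The function names are illustrative only.

```python
from math import log2

def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(branches):
    """Gain of a split, given one list of class counts per attribute value."""
    # Parent class counts are the element-wise sums over all branches.
    parent = [sum(col) for col in zip(*branches)]
    n = sum(parent)
    # Weighted average of the branch entropies.
    remainder = sum(sum(b) / n * entropy(b) for b in branches)
    return entropy(parent) - remainder

# a1: assumed counts [1, 2] for "true" and [2, 1] for "false".
gain_a1 = information_gain([[1, 2], [2, 1]])
# a2: hypothetical balanced counts in both branches.
gain_a2 = information_gain([[1, 1], [1, 1]])
print(round(gain_a1, 4), round(gain_a2, 4))
```

Under these assumed counts the sketch should reproduce the two gains derived by hand, 0.0817 for a1 and 0 for a2.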
3 Exercise 3
Assume we have the training examples shown in Table 1. The attribute studentID is unique
for each instance, so in the training data the target class can be read off directly once
the studentID is known. However, this does not generalize to unseen data: given a new
studentID, we cannot predict its class label.
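As a minimal illustration, a tree that splits only on studentID behaves like a lookup table. The sketch below uses just the three rows visible in Table 1 (the elided rows are not reproduced); the names `by_id` and `predict` and the unseen ID "st99" are made up for the example.

```python
# Training rows visible in Table 1: (studentID, score, class).
train = [("st1", 9, "yes"), ("st2", 4, "no"), ("st3", 7, "yes")]

# Splitting on studentID alone amounts to memorising one label per ID.
by_id = {student_id: label for student_id, _score, label in train}

def predict(student_id):
    """Return the memorised label, or None for an ID never seen in training."""
    return by_id.get(student_id)

# Every training row is classified correctly by construction.
train_accuracy = sum(predict(sid) == label for sid, _score, label in train) / len(train)
print(train_accuracy)
# "st99" is a hypothetical unseen student: the lookup has nothing to return.
print(predict("st99"))
```

The training accuracy is perfect, yet the model has no basis for any prediction on a new ID, which is the overfitting described above.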
4 Exercise 4
Example: if an attribute has n values, then in the extreme case we can have a data set of n
instances in which each instance has a different value. Assume a binary target. Each subset
S_v then contains a single instance, so its entropy is (using the convention 0 log2 0 = 0)
$H(S_v) = -0 \cdot \log_2 0 - 1 \cdot \log_2 1 = 0$
and the information gain of the attribute is
$\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \cdot 0 = H(S)$ (1)
Since Gain(S, A) <= H(S), H(S) is the maximum gain possible, so the attribute in this
extreme case will always be selected by the information gain criterion. However, this is
not a good choice (consider the overfitting problem discussed in Exercise 3).
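Equation (1) can be checked numerically. The sketch below assumes only the three studentID rows visible in the overfitting table as data; the helper names `entropy` and `gain` are illustrative.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(values, labels):
    """Information gain of splitting `labels` by the attribute `values`."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

# One distinct attribute value per instance (the extreme case).
ids = ["st1", "st2", "st3"]
labels = ["yes", "no", "yes"]

print(gain(ids, labels), entropy(labels))
```

Because every subset holds a single instance, each subset entropy is zero and the gain equals H(S), the largest value any attribute can reach on this data.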
5 Exercise 5
• Assign the most common value among the examples to the missing value, i.e., “true” for
attribute a1 at instance 2. In this case, we have
$\mathrm{Gain}(a_1) = H(\text{class}) - H(\text{class}, a_1) = H(\text{class}) - \left(\frac{3}{4}H([2, 1]) + \frac{1}{4}H([1, 0])\right)$
• A new value “missing” can be assigned to attribute a1 for instance 2. In this case,
we have
$\mathrm{Gain}(a_1) = H(\text{class}) - \left(\frac{1}{2}H([1, 1]) + \frac{1}{4}H([1, 0]) + \frac{1}{4}H([1, 0])\right)$
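The two strategies can be compared numerically. H(class) is not evaluated in this solution, so the sketch below computes only the subtracted term (the weighted branch entropy) for each strategy, taking the class counts exactly as written in the two formulas; the names `remainder`, `most_common`, and `as_new_value` are illustrative.

```python
from math import log2

def entropy(counts):
    """Shannon entropy in bits of a list of class counts (0 log 0 taken as 0)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def remainder(branches):
    """Weighted average entropy over branches, each given as class counts."""
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * entropy(b) for b in branches)

# Strategy 1: fill in the most common value ("true").
most_common = remainder([[2, 1], [1, 0]])
# Strategy 2: treat "missing" as a third attribute value.
as_new_value = remainder([[1, 1], [1, 0], [1, 0]])

print(round(most_common, 4), round(as_new_value, 4))
```

Whichever strategy yields the smaller subtracted term gives the larger gain for a1, since H(class) is the same in both formulas.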