Data mining
‘Baseline & Decision Trees’




         COMPUTER ASSIGNMENT 2

         BARRY KOLLEE

         10349863
1. Evaluation in Machine Learning
Copy the file weather.arff to your home directory. This file contains data for deciding when to
play a certain sport given weather conditions. Run the J48 classifier using "weather.arff" as
the training set.

         1. Report how many instances are correctly and incorrectly classified on the training set.
         2. The classifier weka.classifiers.rules.ZeroR simply assigns the most common
            classification in a training set to any new classifications and can be used as a
            baseline for evaluating other machine learning schemes. Invoke the ZeroR classifier
            using weather.arff. Report the number of correctly classified and misclassified
            instances both for the training set and cross-validation.
         3. What are baselines used for? Is ZeroR a reasonable baseline? Can you think of other
            types of baselines?
         4. What is the difference between a development set and a test set?
         5. What is the difference between accuracy and precision? Give small examples for
            which both precision and accuracy scores differ greatly.

1. I loaded the supplied data file “weather.arff” into Weka and ran the J48 classifier, evaluating it on
the training set itself. J48 is one of Weka's tree classifiers. Besides the textual description of the
model, Weka also shows the decision tree built from weather.arff. The output of the J48 classifier is
listed below; the relevant numbers are the correctly and incorrectly classified instances in the Summary.
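For reference, this run should be reproducible from the command line as well (a sketch, using the invocation style shown later in the Decision Trees section; as described there, Weka then reports the evaluation on the training data followed by a 10-fold cross-validation):

       java weka.classifiers.trees.J48 -t weather.arff -C 0.25 -M 2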


       Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
       Relation:     weather
       Instances:    14
       Attributes:   5
                     outlook
                     temperature
                     humidity
                     windy
                     play
       Test mode:evaluate on training data

       === Classifier model (full training set) ===

       J48 pruned tree
       ------------------

       outlook = sunny
       |   humidity <= 75: yes (2.0)
       |   humidity > 75: no (3.0)
       outlook = overcast: yes (4.0)
       outlook = rainy
       |   windy = TRUE: no (2.0)
       |   windy = FALSE: yes (3.0)

       Number of Leaves      : 5

       Size of the tree :        8


       Time taken to build model: 0.01 seconds

       === Evaluation on training set ===
       === Summary ===

       Correctly Classified Instances                14                  100        %
       Incorrectly Classified Instances               0                    0        %
       Kappa statistic                                1
       Mean absolute error                            0
       Root mean squared error                        0
       Relative absolute error                        0        %
       Root relative squared error                    0        %
       Total Number of Instances                     14

       === Detailed Accuracy By Class ===




                         TP Rate        FP Rate    Precision        Recall     F-Measure        ROC Area      Class
                           1              0           1              1            1               1            yes
                           1              0           1              1            1               1            no
       Weighted Avg.       1              0           1              1            1               1

       === Confusion Matrix ===

        a b   <-- classified as
        9 0 | a = yes
        0 5 | b = no




2. Next, the ZeroR classifier was used on “weather.arff”. First I performed a 10-fold cross-validation.
The correctly and incorrectly classified instances are shown in the Summary.
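The corresponding command line would presumably be (with only a training file, Weka also runs the 10-fold cross-validation automatically):

       java weka.classifiers.rules.ZeroR -t weather.arff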


       Scheme:weka.classifiers.rules.ZeroR
       Relation:     weather
       Instances:    14
       Attributes:   5
                     outlook
                     temperature
                     humidity
                     windy
                     play
       Test mode:10-fold cross-validation

       === Classifier model (full training set) ===

       ZeroR predicts class value: yes

       Time taken to build model: 0 seconds

       === Stratified cross-validation ===
       === Summary ===

       Correctly Classified Instances                      9                      64.2857 %
       Incorrectly Classified Instances                    5                      35.7143 %
       Kappa statistic                                     0
       Mean absolute error                                 0.4762
       Root mean squared error                             0.4934
       Relative absolute error                           100      %
       Root relative squared error                       100      %
       Total Number of Instances                          14

       === Detailed Accuracy By Class ===

                         TP Rate        FP Rate    Precision        Recall     F-Measure        ROC Area      Class
                           1              1           0.643          1            0.783           0.178        yes
                           0              0           0              0            0               0.178        no
       Weighted Avg.       0.643          0.643       0.413          0.643        0.503           0.178

       === Confusion Matrix ===





        a b   <-- classified as
        9 0 | a = yes
        5 0 | b = no




Then I used the option ‘Use training set’ instead of 10-fold cross-validation. The correctly and incorrectly
classified instances are shown in the Summary of the model listed below:


       Scheme:weka.classifiers.rules.ZeroR
       Relation:     weather
       Instances:    14
       Attributes:   5
                     outlook
                     temperature
                     humidity
                     windy
                     play
       Test mode:evaluate on training data

       === Classifier model (full training set) ===

       ZeroR predicts class value: yes

       Time taken to build model: 0 seconds

       === Evaluation on training set ===
       === Summary ===

       Correctly Classified Instances                    9                     64.2857 %
       Incorrectly Classified Instances                  5                     35.7143 %
       Kappa statistic                                   0
       Mean absolute error                               0.4643
       Root mean squared error                           0.4795
       Relative absolute error                         100      %
       Root relative squared error                     100      %
       Total Number of Instances                        14

       === Detailed Accuracy By Class ===

                         TP Rate        FP Rate   Precision       Recall    F-Measure        ROC Area     Class
                           1              1          0.643         1           0.783           0.5         yes
                           0              0          0             0           0               0.5         no
       Weighted Avg.       0.643          0.643      0.413         0.643       0.503           0.5

       === Confusion Matrix ===

        a b   <-- classified as
        9 0 | a = yes
        5 0 | b = no



3. a) A baseline is a simple approach to a given problem against which other approaches are compared,
in order to see whether those approaches actually perform better.

The term is also common outside data mining, for instance in business, where a company can define a
baseline for its goals or strategy and measure different approaches against it.

b) No, it is not a very strong one. ZeroR simply predicts the most common class (or the mean in the case
of a numeric target) and ignores all attributes; it only tests how well a class can be predicted without
considering any attribute information.

c) An example of a stronger baseline is NaiveBayes. This classifier does take the attributes into
account: instead of only looking at the most common class, it combines evidence from every attribute of
an instance, which gives a considerably more informative baseline. You can see that in the snippet below:
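This run can presumably be reproduced on the command line with:

       java weka.classifiers.bayes.NaiveBayes -t weather.arff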





       === Run information ===

       Scheme:weka.classifiers.bayes.NaiveBayes
       Relation:     weather
       Instances:    14
       Attributes:   5
                     outlook
                     temperature
                     humidity
                     windy
                     play
       Test mode:10-fold cross-validation

       === Classifier model (full training set) ===

       Naive Bayes Classifier

                        Class
       Attribute         yes      no
                       (0.63) (0.38)
       ===============================
       outlook
         sunny             3.0     4.0
         overcast          5.0     1.0
         rainy             4.0     3.0
         [total]          12.0     8.0

       temperature
         mean             72.9697 74.8364
         std. dev.         5.2304   7.384
         weight sum             9       5
         precision         1.9091 1.9091

       humidity
         mean             78.8395 86.1111
         std. dev.         9.8023 9.2424
         weight sum             9       5
         precision         3.4444 3.4444

       windy
         TRUE                   4.0         4.0
         FALSE                  7.0         3.0
         [total]               11.0         7.0



       Time taken to build model: 0 seconds

       === Stratified cross-validation ===
       === Summary ===

       Correctly Classified Instances                   9                64.2857 %
       Incorrectly Classified Instances                 5                35.7143 %
       Kappa statistic                                  0.1026
       Mean absolute error                              0.4649
       Root mean squared error                          0.543
       Relative absolute error                         97.6254 %
       Root relative squared error                    110.051 %
       Total Number of Instances                       14

       === Detailed Accuracy By Class ===

                         TP Rate        FP Rate   Precision   Recall   F-Measure     ROC Area   Class
                           0.889          0.8        0.667     0.889      0.762        0.444     yes
                           0.2            0.111      0.5       0.2        0.286        0.444     no
       Weighted Avg.       0.643          0.554      0.607     0.643      0.592        0.444

       === Confusion Matrix ===

        a b   <-- classified as
        8 1 | a = yes
        4 1 | b = no





4. The training set is used to build the classifier that we apply to the data. Generally, the more data
we train on, the more accurate the resulting model will be.

The other two sets are used to evaluate the performance of the classifier. The development set is used to
evaluate the accuracy of different configurations of the classifier; it is called the development set
because we evaluate classification performance on it repeatedly while developing the model.

In the end we have a model that performs well on the development data. The test set is then used only
once, to estimate how well the final model will deal with genuinely new data.


5. Accuracy means that a result is close to the actual value (the true answer or data point). Precision
means that repeated measurements or predictions give (nearly) the same result every time.

For example, when playing darts we play accurately by hitting close to the bullseye, whereas playing
precisely means hitting the exact same spot with every throw, even if that spot is far from the bullseye.
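The two scores can also diverge in classifier evaluation. As a small made-up illustration (my own numbers, not from the assignment data): out of 100 instances, 10 are positive, and a classifier produces TP = 5, FP = 5, FN = 5 and TN = 85. Its accuracy is (5 + 85)/100 = 90 %, while its precision is only 5/(5 + 5) = 50 %.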







2. Decision Trees
This assignment uses the WEKA implementation of C4.5, a decision tree learner. To invoke
this learner e.g. using a file called "train.arff" as a training set you can type:

java weka.classifiers.trees.J48 -t train.arff

This will construct a decision tree from train.arff and then apply it to train.arff. After that it will
perform a 10-fold cross-validation on train.arff.

2.1. Copy the files zoo.arff, zoo.train.arff and zoo.test.arff from Blackboard. This data includes
instances of animals described by their features (hairy, feathered, etc.) and classifications of
those animals (e.g. mammal, bird, reptile). Invoke the J48 classifier using zoo.train.arff and
zoo.test.arff as the training and testing files respectively. Note that zoo.{train,test}.arff together
contain the same data as zoo.arff, i.e. the latter was split to create the training and testing
sets.

        1. Report the number of correctly and incorrectly classified instances for the test data for
           Decision Trees.
        2. Include in your report a description of the decision tree constructed by the J48
           classifier and explain how the decision tree is used to classify a new instance.

1. I opened the zoo training file (zoo.train.arff) in Weka, ran the J48 classifier and evaluated it on the
supplied test set zoo.test.arff. The numbers of correctly and incorrectly classified instances are shown
in the Summary. This is the run on the test set:
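On the command line this would presumably correspond to supplying the test file with the -T option:

       java weka.classifiers.trees.J48 -t zoo.train.arff -T zoo.test.arff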


       === Run information ===

       Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
       Relation:     zoo
       Instances:    81
       Attributes:   18
                     animal
                     hair
                     feathers
                     eggs
                     milk
                     airborne
                     aquatic
                     predator
                     toothed
                     backbone
                     breathes
                     venomous
                     fins
                     legs
                     tail
                     domestic
                     catsize
                     type
       Test mode:user supplied test set: size unknown (reading incrementally)

       === Classifier model (full training set) ===

       J48 pruned tree
       ------------------

       feathers = false
       |   milk = false
       |   |   toothed = false
       |   |   |   airborne = false: invertebrate (8.0/1.0)
       |   |   |   airborne = true: insect (5.0)
       |   |   toothed = true
       |   |   |   fins = false
       |   |   |   |    legs <= 2: reptile (3.0)
       |   |   |   |    legs > 2: amphibian (3.0)
       |   |   |   fins = true: fish (10.0)
       |   milk = true: mammal (36.0)




       feathers = true: bird (16.0)

       Number of Leaves                  : 7

       Size of the tree :                     13


       Time taken to build model: 0.01 seconds

       === Evaluation on test set ===
       === Summary ===

       Correctly Classified Instances                         17               85      %
       Incorrectly Classified Instances                        3               15      %
       Kappa statistic                                         0.8187
       Mean absolute error                                     0.0464
       Root mean squared error                                 0.1965
       Relative absolute error                                20.0843 %
       Root relative squared error                            55.849 %
       Total Number of Instances                              20

       === Detailed Accuracy By Class ===
                        TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                        1         0         1           1        1           1          mammal
                        1         0         1           1        1           1          bird
                        0         0         0           0        0           0.5        reptile
                        1         0         1           1        1           1          fish
                        1         0.053     0.5         1        0.667       0.974      amphibian
                        0.5       0         1           0.5      0.667       0.944      insect
                        1         0.118     0.6         1        0.75        0.941      invertebrate
       Weighted Avg.    0.85      0.02      0.815       0.85     0.813       0.934

       === Confusion Matrix ===

        a   b   c   d   e   f   g       <--   classified as
        5   0   0   0   0   0   0   |   a =   mammal
        0   4   0   0   0   0   0   |   b =   bird
        0   0   0   0   1   0   1   |   c =   reptile
        0   0   0   3   0   0   0   |   d =   fish
        0   0   0   0   1   0   0   |   e =   amphibian
        0   0   0   0   0   1   1   |   f =   insect
        0   0   0   0   0   0   3   |   g =   invertebrate



We can also evaluate the classifier with a 10-fold cross-validation on the training set (instead of the
separate test set). The numbers of correctly and incorrectly classified instances are shown in the Summary:

       === Run information ===

       Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
       Relation:     zoo
       Instances:    81
       Attributes:   18
                     animal
                     hair
                     feathers
                     eggs
                     milk
                     airborne
                     aquatic
                     predator
                     toothed
                     backbone
                     breathes
                     venomous
                     fins
                     legs
                     tail
                     domestic
                     catsize
                     type
       Test mode:10-fold cross-validation

       === Classifier model (full training set) ===

       J48 pruned tree
       ------------------





       feathers = false
       |   milk = false
       |   |   toothed = false
       |   |   |   airborne = false: invertebrate (8.0/1.0)
       |   |   |   airborne = true: insect (5.0)
       |   |   toothed = true
       |   |   |   fins = false
       |   |   |   |    legs <= 2: reptile (3.0)
       |   |   |   |    legs > 2: amphibian (3.0)
       |   |   |   fins = true: fish (10.0)
       |   milk = true: mammal (36.0)
       feathers = true: bird (16.0)

       Number of Leaves                        : 7

       Size of the tree :                          13


       Time taken to build model: 0.03 seconds

       === Stratified cross-validation ===
       === Summary ===

       Correctly Classified Instances                                     75          92.5926 %
       Incorrectly Classified Instances                                    6           7.4074 %
       Kappa statistic                                                     0.8987
       Mean absolute error                                                 0.0232
       Root mean squared error                                             0.1465
       Relative absolute error                                            10.882 %
       Root relative squared error                                        45.1077 %
       Total Number of Instances                                          81

       === Detailed Accuracy By Class ===

                        TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                        1         0         1           1        1           1          mammal
                        1         0         1           1        1           1          bird
                        0.333     0.013     0.5         0.333    0.4         0.66       reptile
                        1         0.014     0.909       1        0.952       0.993      fish
                        0.667     0.013     0.667       0.667    0.667       0.827      amphibian
                        0.667     0.013     0.8         0.667    0.727       0.818      insect
                        0.857     0.027     0.75        0.857    0.8         0.907      invertebrate
       Weighted Avg.    0.926     0.006     0.921       0.926    0.922       0.959

       === Confusion Matrix ===

         a  b  c  d  e  f  g   <-- classified as
        36  0  0  0  0  0  0 | a = mammal
         0 16  0  0  0  0  0 | b = bird
         0  0  1  1  1  0  0 | c = reptile
         0  0  0 10  0  0  0 | d = fish
         0  0  1  0  2  0  0 | e = amphibian
         0  0  0  0  0  4  2 | f = insect
         0  0  0  0  0  1  6 | g = invertebrate



2. Weka produces the following decision tree:



       feathers = false
       |   milk = false
       |   |   toothed = false
       |   |   |   airborne = false: invertebrate (8.0/1.0)
       |   |   |   airborne = true: insect (5.0)
       |   |   toothed = true
       |   |   |   fins = false
       |   |   |   |    legs <= 2: reptile (3.0)
       |   |   |   |    legs > 2: amphibian (3.0)
       |   |   |   fins = true: fish (10.0)
       |   milk = true: mammal (36.0)
       feathers = true: bird (16.0)








The decision tree is listed above. It tests up to six attributes:

       1.   Does the creature have feathers? (true/false)
       2.   Does the creature produce milk? (true/false)
       3.   Is the creature toothed? (true/false)
       4.   Is the creature airborne? (true/false)
       5.   Does the creature have fins? (true/false)
       6.   Does the creature have more than two legs? (numeric test)
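To classify a new instance, we start at the root and follow the branch matching each attribute value until a leaf is reached; the leaf's label is the predicted class. For example (my own instance): an animal with feathers = false and milk = true follows the feathers = false branch, then the milk = true branch, and is immediately classified as mammal without any of the remaining tests being evaluated.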

2.2. If you invoke a WEKA classifier with a training set but no testing set, WEKA will
automatically perform a 10-fold cross-validation on the training set and report how many
instances are correctly and incorrectly classified when those instances are used as test data
during the cross-validation.

       1. When you ran the J48 classifier with the zoo.arff file, a 10-fold cross validation was
          performed. Report the number of instances correctly and incorrectly classified during
          the cross-validation.

I loaded the supplied zoo.arff file and performed a 10-fold cross-validation with the J48 classifier. The
numbers of correctly and incorrectly classified instances are shown in the Summary.

=== Run information ===

Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     zoo
Instances:    101
Attributes:   18
              animal
              hair
              feathers
              eggs
              milk
              airborne
              aquatic
              predator
              toothed
              backbone
              breathes
              venomous
              fins




              legs
              tail
              domestic
              catsize
              type
Test mode:10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

feathers = false
|   milk = false
|   |   backbone = false
|   |   |   airborne = false
|   |   |   |    predator = false
|   |   |   |    |   legs <= 2: invertebrate (2.0)
|   |   |   |    |   legs > 2: insect (2.0)
|   |   |   |    predator = true: invertebrate (8.0)
|   |   |   airborne = true: insect (6.0)
|   |   backbone = true
|   |   |   fins = false
|   |   |   |    tail = false: amphibian (3.0)
|   |   |   |    tail = true: reptile (6.0/1.0)
|   |   |   fins = true: fish (13.0)
|   milk = true: mammal (41.0)
feathers = true: bird (20.0)

Number of Leaves                :       9

Size of the tree :                      17


Time taken to build model: 0.02 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances                              93                92.0792 %
Incorrectly Classified Instances                             8                 7.9208 %
Kappa statistic                                              0.8955
Mean absolute error                                          0.0225
Root mean squared error                                      0.14
Relative absolute error                                     10.2478 %
Root relative squared error                                 42.4398 %
Total Number of Instances                                  101

=== Detailed Accuracy By Class ===

                 TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 1         0         1           1        1           1          mammal
                 1         0         1           1        1           1          bird
                 0.6       0.01      0.75        0.6      0.667       0.793      reptile
                 1         0.011     0.929       1        0.963       0.994      fish
                 0.75      0         1           0.75     0.857       0.872      amphibian
                 0.625     0.032     0.625       0.625    0.625       0.92       insect
                 0.8       0.033     0.727       0.8      0.762       0.986      invertebrate
Weighted Avg.    0.921     0.008     0.922       0.921    0.92        0.976

=== Confusion Matrix ===

         a  b  c  d  e  f  g   <-- classified as
        41  0  0  0  0  0  0 | a = mammal
         0 20  0  0  0  0  0 | b = bird
         0  0  3  1  0  1  0 | c = reptile
         0  0  0 13  0  0  0 | d = fish
         0  0  1  0  3  0  0 | e = amphibian
         0  0  0  0  0  5  3 | f = insect
         0  0  0  0  0  2  8 | g = invertebrate





       2. With the WEKA classifiers, the "-x [value]" option can be used to specify how many
          folds to use in a cross validation -- e.g. "-x 5", will specify a 5-fold cross-validation.
          Suppose you wish to perform a "leave one out" cross-validation on the zoo.arff data.
          How many folds must you specify to achieve this?

Because zoo.arff contains 101 instances, we must specify 101 folds, so that each fold holds out exactly one instance.
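The corresponding invocation would presumably be:

       java weka.classifiers.trees.J48 -t zoo.arff -x 101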

2.3. Make a copy of the weather.arff file and modify it to train a classifier with multiple
(discrete) target values. The classifier is supposed to assign your favorite sport according to
weather conditions (take, for example, "swimming", "badminton" and "none"). Modify the
training data according to your settings (create at least 20 training examples).

       1. Train the J48 classifier on the original data in weather.arff and on your modified
          version and report the number of correctly and incorrectly classified instances during
          the 10-fold cross-validation.


First I performed the 10-fold cross-validation on the original weather.arff. The numbers of correctly and
incorrectly classified instances are shown in the Summary.


         === Run information ===

         Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
         Relation:     weather
         Instances:    14
         Attributes:   5
                       outlook
                       temperature
                       humidity
                       windy
                       play
         Test mode:10-fold cross-validation

         === Classifier model (full training set) ===

         J48 pruned tree
         ------------------

         outlook = sunny
         |   humidity <= 75: yes (2.0)
         |   humidity > 75: no (3.0)
         outlook = overcast: yes (4.0)
         outlook = rainy
         |   windy = TRUE: no (2.0)
         |   windy = FALSE: yes (3.0)

         Number of Leaves          : 5

         Size of the tree :            8


         Time taken to build model: 0 seconds

         === Stratified cross-validation ===
         === Summary ===

         Correctly Classified Instances                       9                64.2857 %
         Incorrectly Classified Instances                     5                35.7143 %
         Kappa statistic                                      0.186
         Mean absolute error                                  0.2857
         Root mean squared error                              0.4818
         Relative absolute error                             60      %
         Root relative squared error                         97.6586 %
         Total Number of Instances                           14

         === Detailed Accuracy By Class ===

                           TP Rate          FP Rate   Precision   Recall   F-Measure   ROC Area    Class
                             0.778            0.6        0.7       0.778      0.737      0.789      yes
                             0.4              0.222      0.5       0.4        0.444      0.789      no



        Weighted Avg.         0.643         0.465        0.629        0.643        0.632          0.789

          === Confusion Matrix ===

           a b   <-- classified as
           7 2 | a = yes
           3 2 | b = no




            1.   We first check the discrete attribute ‘outlook’, which has three values: sunny, overcast
                 and rainy.
            2.   If the outlook is sunny we check the humidity.
                      •   If the humidity is less than or equal to 75 we play.
                      •   If the humidity is greater than 75 we don't play.
            3.   If the outlook is rainy we check whether it is windy. If it is windy we don't play; if it
                 is not windy we do play.
            4.   If the outlook is overcast we play.
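For example (my own instance): outlook = rainy with windy = TRUE ends in the leaf ‘no’, without the temperature attribute ever being consulted.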

Next I edited the weather.arff file to use several discrete target values instead of a simple yes/no. I
chose the sports football (FB) and indoor tennis (TN), plus ‘none’ for staying at home. The correctly and
incorrectly classified instances are shown in the Summary.


       === Run information ===

       Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
       Relation:     weather
       Instances:    24
       Attributes:   5
                     outlook
                     temperature
                     humidity
                     windy
                     play
       Test mode:10-fold cross-validation

       === Classifier model (full training set) ===

       J48 pruned tree
       ------------------

       outlook = sunny
       |   humidity <= 90: FB (5.0)
       |   humidity > 90: TN (3.0)
       outlook = overcast
       |   humidity <= 79




       |   |   temperature <= 85: FB (4.0)
       |   |   temperature > 85: TN (2.0)
       |   humidity > 79: TN (3.0)
       outlook = rainy
       |   humidity <= 79: TN (2.0)
       |   humidity > 79: none (5.0)

       Number of Leaves        : 7

       Size of the tree :        12


       Time taken to build model: 0 seconds

       === Stratified cross-validation ===
       === Summary ===

       Correctly Classified Instances                  16                  66.6667 %
       Incorrectly Classified Instances                 8                  33.3333 %
       Kappa statistic                                  0.4947
       Mean absolute error                              0.212
       Root mean squared error                          0.4295
       Relative absolute error                         48.9316 %
       Root relative squared error                     92.0147 %
       Total Number of Instances                       24

       === Detailed Accuracy By Class ===

                            TP Rate     FP Rate   Precision   Recall   F-Measure       ROC Area   Class
                              0.667       0.2        0.667     0.667      0.667          0.763     FB
                              0.5         0.214      0.625     0.5        0.556          0.754     TN
                              1           0.105      0.714     1          0.833          0.984     none
       Weighted Avg.          0.667       0.186      0.659     0.667      0.655          0.805

       === Confusion Matrix ===

        a   b   c   <--   classified as
        6   3   0 | a =   FB
        3   5   2 | b =   TN
        0   0   5 | c =   none




        2. Include your training data and the corresponding decision tree in your report and
           comment on its structure.

The training data I used for the model listed above is:


       @relation weather

       @attribute     outlook {sunny, overcast, rainy}
       @attribute     temperature real
       @attribute     humidity real
       @attribute     windy {TRUE, FALSE}
       @attribute     play {FB, TN, none}


       @data
       sunny, 95, 95, FALSE, TN
       sunny, 65, 65, FALSE, FB
       sunny, 65, 65, TRUE, FB
       sunny, 76, 78, TRUE, FB
       sunny, 80, 85, TRUE, FB
       sunny, 75, 95, FALSE, TN
       sunny, 80, 80, TRUE, FB
       sunny, 60, 95, FALSE, TN
       overcast, 65,65,TRUE, FB
       overcast, 75,65, TRUE, FB
       overcast, 85, 90, FALSE, TN
       overcast, 85, 90, TRUE, TN
       overcast, 65,68, FALSE, FB
       overcast, 90,65, TRUE, TN
       overcast, 65,95, FALSE, TN
       overcast, 90,65, FALSE, TN
       overcast, 85,65, TRUE, FB



       rainy,70,96,FALSE,none
       rainy,68,80,FALSE,none
       rainy,65,70,TRUE, TN
       rainy,76,79,FALSE, TN
       rainy,74,96,TRUE,none
       rainy,60,80,TRUE,none
       rainy,74,96,TRUE,none



Running the J48 classifier on this data, Weka produced the decision tree shown in the run output above.
Its structure can be read as follows:
            1.   We first check the discrete attribute ‘outlook’, which has three values: sunny, overcast
                 and rainy.
            2.   If the outlook is sunny we check the humidity.
                      •   If the humidity is less than or equal to 90 we play football.
                      •   If the humidity is greater than 90 we play indoor tennis.
            3.   If the outlook is overcast we may play either football or indoor tennis.
                      •   If the humidity is greater than 79 we play indoor tennis.
                      •   If the humidity is less than or equal to 79 the choice depends on the
                          temperature:
                                   §  If the temperature is less than or equal to 85 we play football.
                                   §  If the temperature is greater than 85 we play indoor tennis.
            4.   If the outlook is rainy we check the humidity; we either play indoor tennis or stay at
                 home.
                      •   If the humidity is less than or equal to 79 we play indoor tennis.
                      •   If the humidity is greater than 79 we stay at home.
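For example (my own instance): outlook = overcast, temperature = 80 and humidity = 70 follows overcast → humidity <= 79 → temperature <= 85 and is classified as FB (football).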

        3. The "-U" option can be used to turn pruning off. What does pruning do and why?
           What happens to the tree learned from your data when pruning is "off"? Comment on
           the differences or explain why there is no difference.

Pruning removes branches from the tree in order to generalize over parts of the training data: subtrees
that add little predictive value are collapsed into leaves, which usually improves accuracy on unseen
data. When pruning is turned off, every branch grown from the training data remains in the tree.

In my own model nothing changed when turning pruning on or off. I think this is because the tree is
small: there are not enough branches (or sub-branches) for the pruning step to find a subtree worth
collapsing, so the pruned and unpruned trees coincide. With more attributes, and therefore more branches,
the model would have more to gain from pruning.
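To compare, the unpruned tree can presumably be obtained with the -U option mentioned in the question (the file name for my modified data is illustrative):

       java weka.classifiers.trees.J48 -U -t weather-sports.arff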




More Related Content

Similar to Data mining: Baseline & Decision Trees

Workshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with RWorkshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with RShirin Elsinghorst
 
ML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptxML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptxbelay41
 
MS Word.doc
MS Word.docMS Word.doc
MS Word.docbutest
 
Data Mining With A Simulated Annealing Based Fuzzy Classification System
Data Mining With A Simulated Annealing Based Fuzzy Classification SystemData Mining With A Simulated Annealing Based Fuzzy Classification System
Data Mining With A Simulated Annealing Based Fuzzy Classification SystemJamie (Taka) Wang
 
Classification by Machine Learning Approaches
Classification by Machine Learning Approaches Classification by Machine Learning Approaches
Classification by Machine Learning Approaches butest
 
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationThomas Ploetz
 
Performance and Availability Tradeoffs in Replicated File Systems
Performance and Availability Tradeoffs in Replicated File SystemsPerformance and Availability Tradeoffs in Replicated File Systems
Performance and Availability Tradeoffs in Replicated File Systemspeterhoneyman
 
06-00-ACA-Evaluation.pdf
06-00-ACA-Evaluation.pdf06-00-ACA-Evaluation.pdf
06-00-ACA-Evaluation.pdfAlexanderLerch4
 
IRE major project group 22 IIITH
IRE major project group 22 IIITHIRE major project group 22 IIITH
IRE major project group 22 IIITHAkhil Jindal
 
How to make fewer errors at the stage of code writing. Part N4.
How to make fewer errors at the stage of code writing. Part N4.How to make fewer errors at the stage of code writing. Part N4.
How to make fewer errors at the stage of code writing. Part N4.PVS-Studio
 
Huong dan cu the svm
Huong dan cu the svmHuong dan cu the svm
Huong dan cu the svmtaikhoan262
 
DATA MINING - EVALUATING CLUSTERING ALGORITHM
DATA MINING - EVALUATING CLUSTERING ALGORITHMDATA MINING - EVALUATING CLUSTERING ALGORITHM
DATA MINING - EVALUATING CLUSTERING ALGORITHMTochukwu Udeh
 
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docx
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docxDr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docx
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docxmadlynplamondon
 

Similar to Data mining: Baseline & Decision Trees (17)

Workshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with RWorkshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with R
 
wk5ppt2_Iris
wk5ppt2_Iriswk5ppt2_Iris
wk5ppt2_Iris
 
ML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptxML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptx
 
evaluation and credibility-Part 2
evaluation and credibility-Part 2evaluation and credibility-Part 2
evaluation and credibility-Part 2
 
MS Word.doc
MS Word.docMS Word.doc
MS Word.doc
 
Data Mining With A Simulated Annealing Based Fuzzy Classification System
Data Mining With A Simulated Annealing Based Fuzzy Classification SystemData Mining With A Simulated Annealing Based Fuzzy Classification System
Data Mining With A Simulated Annealing Based Fuzzy Classification System
 
Classification by Machine Learning Approaches
Classification by Machine Learning Approaches Classification by Machine Learning Approaches
Classification by Machine Learning Approaches
 
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
 
Performance and Availability Tradeoffs in Replicated File Systems
Performance and Availability Tradeoffs in Replicated File SystemsPerformance and Availability Tradeoffs in Replicated File Systems
Performance and Availability Tradeoffs in Replicated File Systems
 
06-00-ACA-Evaluation.pdf
06-00-ACA-Evaluation.pdf06-00-ACA-Evaluation.pdf
06-00-ACA-Evaluation.pdf
 
IRE major project group 22 IIITH
IRE major project group 22 IIITHIRE major project group 22 IIITH
IRE major project group 22 IIITH
 
Comp102 lec 5.1
Comp102   lec 5.1Comp102   lec 5.1
Comp102 lec 5.1
 
How to make fewer errors at the stage of code writing. Part N4.
How to make fewer errors at the stage of code writing. Part N4.How to make fewer errors at the stage of code writing. Part N4.
How to make fewer errors at the stage of code writing. Part N4.
 
Guide
GuideGuide
Guide
 
Huong dan cu the svm
Huong dan cu the svmHuong dan cu the svm
Huong dan cu the svm
 
DATA MINING - EVALUATING CLUSTERING ALGORITHM
DATA MINING - EVALUATING CLUSTERING ALGORITHMDATA MINING - EVALUATING CLUSTERING ALGORITHM
DATA MINING - EVALUATING CLUSTERING ALGORITHM
 
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docx
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docxDr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docx
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docx
 

More from BarryK88

Data mining test notes (back)
Data mining test notes (back)Data mining test notes (back)
Data mining test notes (back)BarryK88
 
Data mining test notes (front)
Data mining test notes (front)Data mining test notes (front)
Data mining test notes (front)BarryK88
 
Data mining assignment 2
Data mining assignment 2Data mining assignment 2
Data mining assignment 2BarryK88
 
Data mining assignment 4
Data mining assignment 4Data mining assignment 4
Data mining assignment 4BarryK88
 
Data mining assignment 3
Data mining assignment 3Data mining assignment 3
Data mining assignment 3BarryK88
 
Data mining assignment 5
Data mining assignment 5Data mining assignment 5
Data mining assignment 5BarryK88
 
Data mining assignment 6
Data mining assignment 6Data mining assignment 6
Data mining assignment 6BarryK88
 
Data mining assignment 1
Data mining assignment 1Data mining assignment 1
Data mining assignment 1BarryK88
 
Semantic web final assignment
Semantic web final assignmentSemantic web final assignment
Semantic web final assignmentBarryK88
 
Semantic web assignment 3
Semantic web assignment 3Semantic web assignment 3
Semantic web assignment 3BarryK88
 
Semantic web assignment 2
Semantic web assignment 2Semantic web assignment 2
Semantic web assignment 2BarryK88
 
Semantic web assignment1
Semantic web assignment1Semantic web assignment1
Semantic web assignment1BarryK88
 

More from BarryK88 (12)

Data mining test notes (back)
Data mining test notes (back)Data mining test notes (back)
Data mining test notes (back)
 
Data mining test notes (front)
Data mining test notes (front)Data mining test notes (front)
Data mining test notes (front)
 
Data mining assignment 2
Data mining assignment 2Data mining assignment 2
Data mining assignment 2
 
Data mining assignment 4
Data mining assignment 4Data mining assignment 4
Data mining assignment 4
 
Data mining assignment 3
Data mining assignment 3Data mining assignment 3
Data mining assignment 3
 
Data mining assignment 5
Data mining assignment 5Data mining assignment 5
Data mining assignment 5
 
Data mining assignment 6
Data mining assignment 6Data mining assignment 6
Data mining assignment 6
 
Data mining assignment 1
Data mining assignment 1Data mining assignment 1
Data mining assignment 1
 
Semantic web final assignment
Semantic web final assignmentSemantic web final assignment
Semantic web final assignment
 
Semantic web assignment 3
Semantic web assignment 3Semantic web assignment 3
Semantic web assignment 3
 
Semantic web assignment 2
Semantic web assignment 2Semantic web assignment 2
Semantic web assignment 2
 
Semantic web assignment1
Semantic web assignment1Semantic web assignment1
Semantic web assignment1
 

Data mining: Baseline & Decision Trees

  • 1. Data mining ‘Baseline & Decision Trees’ COMPUTER ASSIGNMENT 2 BARRY KOLLEE 10349863
  • 2. Regression  |  CPU  performance        1. Evaluation in Machine Learning Copy the file weather.arff to your home directory. This file contains data for deciding when to play a certain sport given weather conditions. Run the J48 classifier using "weather.arff" as the training set. 1. Report how many instances are correctly and incorrectly classified on the training set. 2. The classifier weka.classifiers.rules.ZeroR simply assigns the most common classification in a training set to any new classifications and can be used as a baseline for evaluating other machine learning schemes. Invoke the ZeroR classifier using weather.arff. Report the number of correctly classified and misclassified instances both for the training set and cross-validation. 3. What are baselines used for? Is ZeroR a reasonable baseline? Can you think of other types of baselines? 4. What is the difference between a development set and a test set? 5. What is the difference between accuracy and precision? Give small examples for which both precision and accuracy scores differ greatly. 1. I’ve loaded up the delivered database file “weather.arff” into weka and run the J48 classifier by using the delivered training set. The J48 is one of the ‘tree classifiers’. Next to the results UTF8-description of the model, we’re also able to see the decision tree of our weather.arff file. The results are listed below of the J48 classifier is listed below. The correct and incorrect instances are given in red. Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2 Relation: weather Instances: 14 Attributes: 5 outlook temperature humidity windy play Test mode:evaluate on training data === Classifier model (full training set) === J48 pruned tree ------------------ outlook = sunny | humidity <= 75: yes (2.0) | humidity > 75: no (3.0) outlook = overcast: yes (4.0) outlook = rainy | windy = TRUE: no (2.0) | windy = FALSE: yes (3.0) Number of Leaves : 5 Size of the tree : 8 Time taken to build model: 0.01 seconds === Evaluation on training set === === Summary === Correctly Classified Instances 14 100 % Incorrectly Classified Instances 0 0 % Kappa statistic 1 Mean absolute error 0 Root mean squared error 0 Relative absolute error 0 % Root relative squared error 0 % Total Number of Instances 14 === Detailed Accuracy By Class === 2
  • 3. Regression  |  CPU  performance     TP Rate FP Rate Precision Recall F-Measure ROC Area Class 1 0 1 1 1 1 yes 1 0 1 1 1 1 no Weighted Avg. 1 0 1 1 1 1 === Confusion Matrix === a b <-- classified as 9 0 | a = yes 0 5 | b = no 2. Now the ZeroR classifier has been used to classify the “weather.arff”. First I performed a 10-fold crossing- validation. The correct and incorrect instances are given in red. Scheme:weka.classifiers.rules.ZeroR Relation: weather Instances: 14 Attributes: 5 outlook temperature humidity windy play Test mode:10-fold cross-validation === Classifier model (full training set) === ZeroR predicts class value: yes Time taken to build model: 0 seconds === Stratified cross-validation === === Summary === Correctly Classified Instances 9 64.2857 % Incorrectly Classified Instances 5 35.7143 % Kappa statistic 0 Mean absolute error 0.4762 Root mean squared error 0.4934 Relative absolute error 100 % Root relative squared error 100 % Total Number of Instances 14 === Detailed Accuracy By Class === TP Rate FP Rate Precision Recall F-Measure ROC Area Class 1 1 0.643 1 0.783 0.178 yes 0 0 0 0 0 0.178 no Weighted Avg. 0.643 0.643 0.413 0.643 0.503 0.178 === Confusion Matrix === 3
  • 4. Regression  |  CPU  performance     a b <-- classified as 9 0 | a = yes 5 0 | b = no Now I used the option ‘use training set’ in stead of 10-fold cross-validation. The correct and incorrect instances are given in red. The model is listed below: Scheme:weka.classifiers.rules.ZeroR Relation: weather Instances: 14 Attributes: 5 outlook temperature humidity windy play Test mode:evaluate on training data === Classifier model (full training set) === ZeroR predicts class value: yes Time taken to build model: 0 seconds === Evaluation on training set === === Summary === Correctly Classified Instances 9 64.2857 % Incorrectly Classified Instances 5 35.7143 % Kappa statistic 0 Mean absolute error 0.4643 Root mean squared error 0.4795 Relative absolute error 100 % Root relative squared error 100 % Total Number of Instances 14 === Detailed Accuracy By Class === TP Rate FP Rate Precision Recall F-Measure ROC Area Class 1 1 0.643 1 0.783 0.5 yes 0 0 0 0 0 0.5 no Weighted Avg. 0.643 0.643 0.413 0.643 0.503 0.5 === Confusion Matrix === a b <-- classified as 9 0 | a = yes 5 0 | b = no 3. a) A baseline is a simple approach to a given problem, which is often used to compare other approaches to, in order to see whether the other approaches perform better. Next to datamining this is a common term in businesses. Where a business can define a certain baseline for their company goals and/or strategy. Several approaches could deal this goal and/or strategy. b) No it is not. It just determines the most common class or the median (in case of numeric values). It tests how well a class can be predicted without considering any attributes. c) An example of a better type of baseline is NaiveBayes. This classifier does take attributes into account which makes our created model better. That’s because it’s not only checking for the most common class but also on every attributes the instances have. With these attributes we can give a way more accurate insight of the baseline. You can see that in the snippet below: 4
  • 5. Regression  |  CPU  performance     === Run information === Scheme:weka.classifiers.bayes.NaiveBayes Relation: weather Instances: 14 Attributes: 5 outlook temperature humidity windy play Test mode:10-fold cross-validation === Classifier model (full training set) === Naive Bayes Classifier Class Attribute yes no (0.63) (0.38) =============================== outlook sunny 3.0 4.0 overcast 5.0 1.0 rainy 4.0 3.0 [total] 12.0 8.0 temperature mean 72.9697 74.8364 std. dev. 5.2304 7.384 weight sum 9 5 precision 1.9091 1.9091 humidity mean 78.8395 86.1111 std. dev. 9.8023 9.2424 weight sum 9 5 precision 3.4444 3.4444 windy TRUE 4.0 4.0 FALSE 7.0 3.0 [total] 11.0 7.0 Time taken to build model: 0 seconds === Stratified cross-validation === === Summary === Correctly Classified Instances 9 64.2857 % Incorrectly Classified Instances 5 35.7143 % Kappa statistic 0.1026 Mean absolute error 0.4649 Root mean squared error 0.543 Relative absolute error 97.6254 % Root relative squared error 110.051 % Total Number of Instances 14 === Detailed Accuracy By Class === TP Rate FP Rate Precision Recall F-Measure ROC Area Class 0.889 0.8 0.667 0.889 0.762 0.444 yes 0.2 0.111 0.5 0.2 0.286 0.444 no Weighted Avg. 0.643 0.554 0.607 0.643 0.592 0.444 === Confusion Matrix === a b <-- classified as 8 1 | a = yes 4 1 | b = no 5
  • 6. Regression  |  CPU  performance     4. We create our training set to increase the accuracy of the classifier, which we use on the data. The more data we train the more accurate the resulting model will be. The other two sets are used to evaluate the performance of the classifier we use. The development set is used to evaluate the accuracy of different configurations of our classifier. It’s called the development set because we continuously need to evaluate the classification performance. In the end we’ve got a model, which has a great performance on the test data. To get estimates on how good the new model will deal with new data we use the test data. 5.  With accuracy we are getting a result which is close to the actual value/answer/datapoint. With precision we target to have an equal result on every new prediction on every new datapoint.   I.e. if we play darts we can be playing accurately by hitting the bulls eye. But precise means that we should throw a dart on the exact same spot every time.   6
  • 7. Regression  |  CPU  performance     2. Decision Trees This assignment uses the WEKA implementation of C4.5, a decision tree learner. To invoke this learner e.g. using a file called "train.arff" as a training set you can type: java weka.classifiers.trees.J48 -t train.arff This will construct a decision tree from train.arff and then apply it to train.arff. After that it will perform a 10-fold cross-validation on train.arff. 2.1. Copy the file zoo.arff zoo.train.arff zoo.test.arff from Blackboard. This data includes instances of animals described by their features (hairy, feathered, etc) and classifications of those animals (e.g. mammal, bird, reptile). Invoke the J48 classifer using zoo.train.arff and zoo.test.arff as the training and testing files respectively. Note that zoo.{train,test}.arff together contain the same data as zoo.arff, i.e. the latter was split to create the training and testing sets.w 1. Report the number of correctly and incorrectly classified instances for the test data for Decision Trees. 2. Include in your report a description of the decision tree constructed by the J48 classifier and explain how the decision tree is used to classify a new instance. 1. I’ve opened the zootest-file within weka and run the J48 classifier. The number of correct and incorrect instances are given in red. This is the use of the test set: === Run information === Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2 Relation: zoo Instances: 81 Attributes: 18 animal hair feathers eggs milk airborne aquatic predator toothed backbone breathes venomous fins legs tail domestic catsize type Test mode:user supplied test set: size unknown (reading incrementally) === Classifier model (full training set) === J48 pruned tree ------------------ feathers = false | milk = false | | toothed = false | | | airborne = false: invertebrate (8.0/1.0) | | | airborne = true: insect (5.0) | | toothed = true | | | fins = false | | | | legs <= 2: reptile (3.0) | | | | legs > 2: amphibian (3.0) | | | fins = true: fish (10.0) | milk = true: mammal (36.0) 7
  • 8. Regression  |  CPU  performance     feathers = true: bird (16.0) Number of Leaves : 7 Size of the tree : 13 Time taken to build model: 0.01 seconds === Evaluation on test set === === Summary === Correctly Classified Instances 17 85 % Incorrectly Classified Instances 3 15 % Kappa statistic 0.8187 Mean absolute error 0.0464 Root mean squared error 0.1965 Relative absolute error 20.0843 % Root relative squared error 55.849 % Total Number of Instances 20 === Detailed Accuracy By Class === TP Rate FP Rate Precision Recall F-Measure ROC Area Class 1 0 1 1 1 1 mammal 1 0 1 1 1 1 bird 0 0 0 0 0 0.5 reptile 1 0 1 1 1 1 fish 1 0.053 0.5 1 0.667 0.974 amphibian 0.5 0 1 0.5 0.667 0.944 insect 1 0.118 0.6 1 0.75 0.941 invertebrate Weighted Avg. 0.85 0.02 0.815 0.85 0.813 0.934 === Confusion Matrix === a b c d e f g <-- classified as 5 0 0 0 0 0 0 | a = mammal 0 4 0 0 0 0 0 | b = bird 0 0 0 0 1 0 1 | c = reptile 0 0 0 3 0 0 0 | d = fish 0 0 0 0 1 0 0 | e = amphibian 0 0 0 0 0 1 1 | f = insect 0 0 0 0 0 0 3 | g = invertebrate We can also perform a classifier based on the training set (and not the test set). The number of correct and incorrect instances is given in red: === Run information === Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2 Relation: zoo Instances: 81 Attributes: 18 animal hair feathers eggs milk airborne aquatic predator toothed backbone breathes venomous fins legs tail domestic catsize type Test mode:10-fold cross-validation === Classifier model (full training set) === J48 pruned tree ------------------ 8
  • 9. Regression  |  CPU  performance     feathers = false | milk = false | | toothed = false | | | airborne = false: invertebrate (8.0/1.0) | | | airborne = true: insect (5.0) | | toothed = true | | | fins = false | | | | legs <= 2: reptile (3.0) | | | | legs > 2: amphibian (3.0) | | | fins = true: fish (10.0) | milk = true: mammal (36.0) feathers = true: bird (16.0) Number of Leaves : 7 Size of the tree : 13 Time taken to build model: 0.03 seconds === Stratified cross-validation === === Summary === Correctly Classified Instances 75 92.5926 % Incorrectly Classified Instances 6 7.4074 % Kappa statistic 0.8987 Mean absolute error 0.0232 Root mean squared error 0.1465 Relative absolute error 10.882 % Root relative squared error 45.1077 % Total Number of Instances 81 === Detailed Accuracy By Class === TP Rate FP Rate Precision Recall F-Measure ROC Area Class 1 0 1 1 1 1 mammal 1 0 1 1 1 1 bird 0.333 0.013 0.5 0.333 0.4 0.66 reptile 1 0.014 0.909 1 0.952 0.993 fish 0.667 0.013 0.667 0.667 0.667 0.827 amphibian 0.667 0.013 0.8 0.667 0.727 0.818 insect 0.857 0.027 0.75 0.857 0.8 0.907 invertebrate Weighted Avg. 0.926 0.006 0.921 0.926 0.922 0.959 === Confusion Matrix === a b c d e f g <-- classified as 36 0 0 0 0 0 0 | a = mammal 0 16 0 0 0 0 0 | b = bird 0 0 1 1 1 0 0 | c = reptile 0 0 0 10 0 0 0 | d = fish 0 0 1 0 2 0 0 | e = amphibian 0 0 0 0 0 4 2 | f = insect 0 0 0 0 0 1 6 | g = invertebrate 2. Weka produces the following decision tree: feathers = false | milk = false | | toothed = false | | | airborne = false: invertebrate (8.0/1.0) | | | airborne = true: insect (5.0) | | toothed = true | | | fins = false | | | | legs <= 2: reptile (3.0) | | | | legs > 2: amphibian (3.0) | | | fins = true: fish (10.0) | milk = true: mammal (36.0) feathers = true: bird (16.0) 9
The decision tree is listed above. It asks up to six questions about a creature:

    1. Does the creature have feathers? (true/false)
    2. Does the creature produce milk? (true/false)
    3. Is the creature toothed? (true/false)
    4. Is the creature airborne? (true/false)
    5. Does the creature have fins? (true/false)
    6. Does the creature have more than 2 legs? (numerical check)

To classify a new instance we start at the root ("feathers") and, at every node, follow the branch that matches the instance's value for that attribute, until a leaf is reached; the label of that leaf is the predicted class. For example, a creature without feathers that produces milk is classified as a mammal after only two tests.
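This classification walk can also be reproduced programmatically. The following is a minimal sketch using the WEKA Java API; the class name is my own choice, and it assumes zoo.train.arff and zoo.test.arff are in the working directory:

    import weka.classifiers.trees.J48;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ZooClassify {
        public static void main(String[] args) throws Exception {
            // Load training and test data; the class ("type") is the last attribute
            Instances train = DataSource.read("zoo.train.arff");
            Instances test = DataSource.read("zoo.test.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // Build the pruned C4.5 tree with WEKA's defaults (-C 0.25 -M 2)
            J48 tree = new J48();
            tree.buildClassifier(train);

            // Classify one unseen instance: J48 starts at the root ("feathers")
            // and follows the matching branch at every node until it reaches a
            // leaf, whose label is the prediction
            Instance animal = test.instance(0);
            double prediction = tree.classifyInstance(animal);
            System.out.println("predicted: " + test.classAttribute().value((int) prediction));
        }
    }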
2.2. If you invoke a WEKA classifier with a training set but no testing set, WEKA will automatically perform a 10-fold cross-validation on the training set and report how many instances are correctly and incorrectly classified when those instances are used as test data during the cross-validation.

    1. When you ran the J48 classifier with the zoo.arff file, a 10-fold cross-validation was performed. Report the number of instances correctly and incorrectly classified during the cross-validation.

I loaded the delivered zoo.arff file and performed a 10-fold cross-validation with the J48 classifier. The numbers of correctly and incorrectly classified instances are given in red.

    === Run information ===

    Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
    Relation:     zoo
    Instances:    101
    Attributes:   18
                  animal
                  hair
                  feathers
                  eggs
                  milk
                  airborne
                  aquatic
                  predator
                  toothed
                  backbone
                  breathes
                  venomous
                  fins
                  legs
                  tail
                  domestic
                  catsize
                  type
    Test mode:10-fold cross-validation

    === Classifier model (full training set) ===

    J48 pruned tree
    ------------------

    feathers = false
    |   milk = false
    |   |   backbone = false
    |   |   |   airborne = false
    |   |   |   |   predator = false
    |   |   |   |   |   legs <= 2: invertebrate (2.0)
    |   |   |   |   |   legs > 2: insect (2.0)
    |   |   |   |   predator = true: invertebrate (8.0)
    |   |   |   airborne = true: insect (6.0)
    |   |   backbone = true
    |   |   |   fins = false
    |   |   |   |   tail = false: amphibian (3.0)
    |   |   |   |   tail = true: reptile (6.0/1.0)
    |   |   |   fins = true: fish (13.0)
    |   milk = true: mammal (41.0)
    feathers = true: bird (20.0)

    Number of Leaves  :     9

    Size of the tree :      17

    Time taken to build model: 0.02 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances          93               92.0792 %
    Incorrectly Classified Instances         8                7.9208 %
    Kappa statistic                          0.8955
    Mean absolute error                      0.0225
    Root mean squared error                  0.14
    Relative absolute error                 10.2478 %
    Root relative squared error             42.4398 %
    Total Number of Instances              101

    === Detailed Accuracy By Class ===

                   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                     1         0          1          1         1          1         mammal
                     1         0          1          1         1          1         bird
                     0.6       0.01       0.75       0.6       0.667      0.793     reptile
                     1         0.011      0.929      1         0.963      0.994     fish
                     0.75      0          1          0.75      0.857      0.872     amphibian
                     0.625     0.032      0.625      0.625     0.625      0.92      insect
                     0.8       0.033      0.727      0.8       0.762      0.986     invertebrate
    Weighted Avg.    0.921     0.008      0.922      0.921     0.92       0.976

    === Confusion Matrix ===

       a  b  c  d  e  f  g   <-- classified as
      41  0  0  0  0  0  0 |  a = mammal
       0 20  0  0  0  0  0 |  b = bird
       0  0  3  1  0  1  0 |  c = reptile
       0  0  0 13  0  0  0 |  d = fish
       0  0  1  0  3  0  0 |  e = amphibian
       0  0  0  0  0  5  3 |  f = insect
       0  0  0  0  0  2  8 |  g = invertebrate
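For reference, the same 10-fold cross-validation can be run through the WEKA Java API instead of the GUI or command line. This is a sketch under the assumption that zoo.arff is in the working directory (class name is my own choice):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ZooCrossValidation {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("zoo.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Evaluate a fresh J48 tree with 10-fold cross-validation;
            // passing data.numInstances() as the fold count instead would
            // give the leave-one-out setup of the next question
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }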
    2. With the WEKA classifiers, the "-x [value]" option can be used to specify how many folds to use in a cross-validation -- e.g. "-x 5" will specify a 5-fold cross-validation. Suppose you wish to perform a "leave one out" cross-validation on the zoo.arff data. How many folds must you specify to achieve this?

Because zoo.arff contains 101 instances, we must specify 101 folds: each fold then holds out exactly one instance for testing while the tree is trained on the remaining 100.
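Concretely, the invocation (using the -t and -x options introduced above) would be:

    java weka.classifiers.trees.J48 -t zoo.arff -x 101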
2.3. Make a copy of the weather.arff file and modify it to train a classifier with multiple (discrete) target values. The classifier is supposed to assign your favorite sport according to weather conditions (take, for example, "swimming", "badminton" and "none"). Modify the training data according to your settings (create at least 20 training examples).

    1. Train the J48 classifier on the original data in weather.arff and on your modified version and report the number of correctly and incorrectly classified instances during the 10-fold cross-validation.

First I performed the 10-fold cross-validation on weather.arff. The numbers of correctly and incorrectly classified instances are given in red.

    === Run information ===

    Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
    Relation:     weather
    Instances:    14
    Attributes:   5
                  outlook
                  temperature
                  humidity
                  windy
                  play
    Test mode:10-fold cross-validation

    === Classifier model (full training set) ===

    J48 pruned tree
    ------------------

    outlook = sunny
    |   humidity <= 75: yes (2.0)
    |   humidity > 75: no (3.0)
    outlook = overcast: yes (4.0)
    outlook = rainy
    |   windy = TRUE: no (2.0)
    |   windy = FALSE: yes (3.0)

    Number of Leaves  :     5

    Size of the tree :      8

    Time taken to build model: 0 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances           9               64.2857 %
    Incorrectly Classified Instances         5               35.7143 %
    Kappa statistic                          0.186
    Mean absolute error                      0.2857
    Root mean squared error                  0.4818
    Relative absolute error                 60      %
    Root relative squared error             97.6586 %
    Total Number of Instances               14

    === Detailed Accuracy By Class ===

                   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                     0.778     0.6        0.7        0.778     0.737      0.789     yes
                     0.4       0.222      0.5        0.4       0.444      0.789     no
    Weighted Avg.    0.643     0.465      0.629      0.643     0.632      0.789

    === Confusion Matrix ===

     a b   <-- classified as
     7 2 | a = yes
     3 2 | b = no

The structure of this tree is as follows:

    1. We first check the discrete attribute 'outlook', which has three values: sunny, overcast and rainy.
    2. If the outlook is sunny we check the value of the humidity.
        • If the humidity is less than or equal to 75, we play.
        • If the humidity is greater than 75, we don't play.
    3. If the outlook is rainy we check whether it is windy. If it's windy we don't play; if it's not windy we do play.
    4. If the outlook is overcast we always play.

Next I edited the weather.arff file to use multiple discrete class values instead of a simple yes/no. I've chosen to include the sports football (FB), indoor tennis (TN) and 'none' (no sport at all). The numbers of correctly and incorrectly classified instances are given in red.

    === Run information ===

    Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
    Relation:     weather
    Instances:    24
    Attributes:   5
                  outlook
                  temperature
                  humidity
                  windy
                  play
    Test mode:10-fold cross-validation

    === Classifier model (full training set) ===

    J48 pruned tree
    ------------------

    outlook = sunny
    |   humidity <= 90: FB (5.0)
    |   humidity > 90: TN (3.0)
    outlook = overcast
    |   humidity <= 79
    |   |   temperature <= 85: FB (4.0)
    |   |   temperature > 85: TN (2.0)
    |   humidity > 79: TN (3.0)
    outlook = rainy
    |   humidity <= 79: TN (2.0)
    |   humidity > 79: none (5.0)

    Number of Leaves  :     7

    Size of the tree :      12

    Time taken to build model: 0 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances          16               66.6667 %
    Incorrectly Classified Instances         8               33.3333 %
    Kappa statistic                          0.4947
    Mean absolute error                      0.212
    Root mean squared error                  0.4295
    Relative absolute error                 48.9316 %
    Root relative squared error             92.0147 %
    Total Number of Instances               24

    === Detailed Accuracy By Class ===

                   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                     0.667     0.2        0.667      0.667     0.667      0.763     FB
                     0.5       0.214      0.625      0.5       0.556      0.754     TN
                     1         0.105      0.714      1         0.833      0.984     none
    Weighted Avg.    0.667     0.186      0.659      0.667     0.655      0.805

    === Confusion Matrix ===

     a b c   <-- classified as
     6 3 0 | a = FB
     3 5 2 | b = TN
     0 0 5 | c = none

    2. Include your training data and the corresponding decision tree in your report and comment on its structure.

The training data I used to build the model listed above is:

    @relation weather

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature real
    @attribute humidity real
    @attribute windy {TRUE, FALSE}
    @attribute play {FB, TN, none}

    @data
    sunny, 95, 95, FALSE, TN
    sunny, 65, 65, FALSE, FB
    sunny, 65, 65, TRUE, FB
    sunny, 76, 78, TRUE, FB
    sunny, 80, 85, TRUE, FB
    sunny, 75, 95, FALSE, TN
    sunny, 80, 80, TRUE, FB
    sunny, 60, 95, FALSE, TN
    overcast, 65, 65, TRUE, FB
    overcast, 75, 65, TRUE, FB
    overcast, 85, 90, FALSE, TN
    overcast, 85, 90, TRUE, TN
    overcast, 65, 68, FALSE, FB
    overcast, 90, 65, TRUE, TN
    overcast, 65, 95, FALSE, TN
    overcast, 90, 65, FALSE, TN
    overcast, 85, 65, TRUE, FB
    rainy, 70, 96, FALSE, none
    rainy, 68, 80, FALSE, none
    rainy, 65, 70, TRUE, TN
    rainy, 76, 79, FALSE, TN
    rainy, 74, 96, TRUE, none
    rainy, 60, 80, TRUE, none
    rainy, 74, 96, TRUE, none

Running the J48 classifier on this data, WEKA produced the decision tree listed above. Its structure is as follows:

    1. We first check the discrete attribute 'outlook', which has three values: sunny, overcast and rainy.
    2. If the outlook is sunny we check the value of the humidity.
        • If the humidity is less than or equal to 90 we play football.
        • If the humidity is greater than 90 we play indoor tennis.
    3. If the outlook is overcast we can play either indoor tennis or football.
        • If the humidity is greater than 79 we play indoor tennis.
        • If the humidity is less than or equal to 79, the choice depends on the temperature:
            § If the temperature is less than or equal to 85 we play football.
            § If the temperature is greater than 85 we play indoor tennis.
    4. If the outlook is rainy we check how high the humidity is; we either play indoor tennis or stay at home.
        • If the humidity is less than or equal to 79 we play indoor tennis.
        • If the humidity is greater than 79 we stay at home.

    3. The "-U" option can be used to turn pruning off. What does pruning do and why? What happens to the tree learned from your data when pruning is "off"? Comment on the differences or explain why there is no difference.

Pruning collapses branches of the tree that cover only a few, possibly noisy, training examples and replaces them by more general subtrees or leaves; the pruned version is kept when this generalization does not hurt the estimated accuracy. Pruning is used because an unpruned tree keeps every branch induced by the training data, however specific, and therefore tends to overfit: it describes the training data too well at the expense of new data. When pruning is turned off ("-U"), every branch of the model remains visible.

For my own model there was no difference between the trees learned with pruning on and off. I think this is because the tree is small and contains no branches that cover only one or two examples, so the pruning step finds nothing to collapse. To profit from pruning we would need more (and deeper) branches; adding more attributes to the data could achieve this.
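One way to verify this is to build the tree twice, once with and once without pruning, and compare the printed models. A minimal sketch with the WEKA Java API; the file name weather-sports.arff for my modified copy and the class name are hypothetical:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PruningComparison {
        public static void main(String[] args) throws Exception {
            // Hypothetical file name for the modified weather data listed above
            Instances data = DataSource.read("weather-sports.arff");
            data.setClassIndex(data.numAttributes() - 1);

            J48 pruned = new J48();        // WEKA default: pruning on (-C 0.25 -M 2)
            pruned.buildClassifier(data);

            J48 unpruned = new J48();
            unpruned.setUnpruned(true);    // same effect as the -U command-line option
            unpruned.buildClassifier(data);

            // Printing a J48 object renders its tree; for the data above the two
            // printouts came out identical, i.e. pruning collapsed nothing
            System.out.println("--- pruned ---\n" + pruned);
            System.out.println("--- unpruned ---\n" + unpruned);
        }
    }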