Data mining
‘Baseline & Decision Trees’




         COMPUTER ASSIGNMENT 2

         BARRY KOLLEE

         10349863
1. Evaluation in Machine Learning
Copy the file weather.arff to your home directory. This file contains data for deciding when to
play a certain sport given weather conditions. Run the J48 classifier using "weather.arff" as
the training set.

         1. Report how many instances are correctly and incorrectly classified on the training set.
         2. The classifier weka.classifiers.rules.ZeroR simply assigns the most common
            classification in a training set to any new classifications and can be used as a
            baseline for evaluating other machine learning schemes. Invoke the ZeroR classifier
            using weather.arff. Report the number of correctly classified and misclassified
            instances both for the training set and cross-validation.
         3. What are baselines used for? Is ZeroR a reasonable baseline? Can you think of other
            types of baselines?
         4. What is the difference between a development set and a test set?
         5. What is the difference between accuracy and precision? Give small examples for
            which both precision and accuracy scores differ greatly.

1. I loaded the supplied data file “weather.arff” into Weka and ran the J48 classifier, evaluating it on
the training set itself. J48 is one of Weka's tree classifiers. Besides the textual description of the
model, Weka also shows the decision tree built from weather.arff. The output of the J48 classifier is
listed below; the relevant numbers are the correctly and incorrectly classified instances in the Summary.
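For reference, this run should be reproducible from the command line as well (a sketch, using the invocation style shown later in the Decision Trees section; as described there, Weka then reports the evaluation on the training data followed by a 10-fold cross-validation):

       java weka.classifiers.trees.J48 -t weather.arff -C 0.25 -M 2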


       Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
       Relation:     weather
       Instances:    14
       Attributes:   5
                     outlook
                     temperature
                     humidity
                     windy
                     play
       Test mode:evaluate on training data

       === Classifier model (full training set) ===

       J48 pruned tree
       ------------------

       outlook = sunny
       |   humidity <= 75: yes (2.0)
       |   humidity > 75: no (3.0)
       outlook = overcast: yes (4.0)
       outlook = rainy
       |   windy = TRUE: no (2.0)
       |   windy = FALSE: yes (3.0)

       Number of Leaves      : 5

       Size of the tree :        8


       Time taken to build model: 0.01 seconds

       === Evaluation on training set ===
       === Summary ===

       Correctly Classified Instances                14                  100        %
       Incorrectly Classified Instances               0                    0        %
       Kappa statistic                                1
       Mean absolute error                            0
       Root mean squared error                        0
       Relative absolute error                        0        %
       Root relative squared error                    0        %
       Total Number of Instances                     14

       === Detailed Accuracy By Class ===




                         TP Rate        FP Rate    Precision        Recall     F-Measure        ROC Area      Class
                           1              0           1              1            1               1            yes
                           1              0           1              1            1               1            no
       Weighted Avg.       1              0           1              1            1               1

       === Confusion Matrix ===

        a b   <-- classified as
        9 0 | a = yes
        0 5 | b = no




2. Next, the ZeroR classifier was used on “weather.arff”. First I performed a 10-fold cross-validation.
The correctly and incorrectly classified instances are shown in the Summary.
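The corresponding command line would presumably be (with only a training file, Weka also runs the 10-fold cross-validation automatically):

       java weka.classifiers.rules.ZeroR -t weather.arff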


       Scheme:weka.classifiers.rules.ZeroR
       Relation:     weather
       Instances:    14
       Attributes:   5
                     outlook
                     temperature
                     humidity
                     windy
                     play
       Test mode:10-fold cross-validation

       === Classifier model (full training set) ===

       ZeroR predicts class value: yes

       Time taken to build model: 0 seconds

       === Stratified cross-validation ===
       === Summary ===

       Correctly Classified Instances                      9                      64.2857 %
       Incorrectly Classified Instances                    5                      35.7143 %
       Kappa statistic                                     0
       Mean absolute error                                 0.4762
       Root mean squared error                             0.4934
       Relative absolute error                           100      %
       Root relative squared error                       100      %
       Total Number of Instances                          14

       === Detailed Accuracy By Class ===

                         TP Rate        FP Rate    Precision        Recall     F-Measure        ROC Area      Class
                           1              1           0.643          1            0.783           0.178        yes
                           0              0           0              0            0               0.178        no
       Weighted Avg.       0.643          0.643       0.413          0.643        0.503           0.178

       === Confusion Matrix ===





        a b   <-- classified as
        9 0 | a = yes
        5 0 | b = no




Then I used the option ‘Use training set’ instead of 10-fold cross-validation. The correctly and incorrectly
classified instances are shown in the Summary of the model listed below:


       Scheme:weka.classifiers.rules.ZeroR
       Relation:     weather
       Instances:    14
       Attributes:   5
                     outlook
                     temperature
                     humidity
                     windy
                     play
       Test mode:evaluate on training data

       === Classifier model (full training set) ===

       ZeroR predicts class value: yes

       Time taken to build model: 0 seconds

       === Evaluation on training set ===
       === Summary ===

       Correctly Classified Instances                    9                     64.2857 %
       Incorrectly Classified Instances                  5                     35.7143 %
       Kappa statistic                                   0
       Mean absolute error                               0.4643
       Root mean squared error                           0.4795
       Relative absolute error                         100      %
       Root relative squared error                     100      %
       Total Number of Instances                        14

       === Detailed Accuracy By Class ===

                         TP Rate        FP Rate   Precision       Recall    F-Measure        ROC Area     Class
                           1              1          0.643         1           0.783           0.5         yes
                           0              0          0             0           0               0.5         no
       Weighted Avg.       0.643          0.643      0.413         0.643       0.503           0.5

       === Confusion Matrix ===

        a b   <-- classified as
        9 0 | a = yes
        5 0 | b = no



3. a) A baseline is a simple approach to a given problem against which other approaches are compared,
in order to see whether those approaches actually perform better.

The term is also common outside data mining, for instance in business, where a company can define a
baseline for its goals or strategy and measure different approaches against it.

b) No, it is not a very strong one. ZeroR simply predicts the most common class (or the mean in the case
of a numeric target) and ignores all attributes; it only tests how well a class can be predicted without
considering any attribute information.

c) An example of a stronger baseline is NaiveBayes. This classifier does take the attributes into
account: instead of only looking at the most common class, it combines evidence from every attribute of
an instance, which gives a considerably more informative baseline. You can see that in the snippet below:
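This run can presumably be reproduced on the command line with:

       java weka.classifiers.bayes.NaiveBayes -t weather.arff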





       === Run information ===

       Scheme:weka.classifiers.bayes.NaiveBayes
       Relation:     weather
       Instances:    14
       Attributes:   5
                     outlook
                     temperature
                     humidity
                     windy
                     play
       Test mode:10-fold cross-validation

       === Classifier model (full training set) ===

       Naive Bayes Classifier

                        Class
       Attribute         yes      no
                       (0.63) (0.38)
       ===============================
       outlook
         sunny             3.0     4.0
         overcast          5.0     1.0
         rainy             4.0     3.0
         [total]          12.0     8.0

       temperature
         mean             72.9697 74.8364
         std. dev.         5.2304   7.384
         weight sum             9       5
         precision         1.9091 1.9091

       humidity
         mean             78.8395 86.1111
         std. dev.         9.8023 9.2424
         weight sum             9       5
         precision         3.4444 3.4444

       windy
         TRUE                   4.0         4.0
         FALSE                  7.0         3.0
         [total]               11.0         7.0



       Time taken to build model: 0 seconds

       === Stratified cross-validation ===
       === Summary ===

       Correctly Classified Instances                   9                64.2857 %
       Incorrectly Classified Instances                 5                35.7143 %
       Kappa statistic                                  0.1026
       Mean absolute error                              0.4649
       Root mean squared error                          0.543
       Relative absolute error                         97.6254 %
       Root relative squared error                    110.051 %
       Total Number of Instances                       14

       === Detailed Accuracy By Class ===

                         TP Rate        FP Rate   Precision   Recall   F-Measure     ROC Area   Class
                           0.889          0.8        0.667     0.889      0.762        0.444     yes
                           0.2            0.111      0.5       0.2        0.286        0.444     no
       Weighted Avg.       0.643          0.554      0.607     0.643      0.592        0.444

       === Confusion Matrix ===

        a b   <-- classified as
        8 1 | a = yes
        4 1 | b = no





4. The training set is used to build the classifier that we apply to the data. Generally, the more data
we train on, the more accurate the resulting model will be.

The other two sets are used to evaluate the performance of the classifier. The development set is used to
evaluate the accuracy of different configurations of the classifier; it is called the development set
because we evaluate classification performance on it repeatedly while developing the model.

In the end we have a model that performs well on the development data. The test set is then used only
once, to estimate how well the final model will deal with genuinely new data.


5. Accuracy means that a result is close to the actual value (the true answer or data point). Precision
means that repeated measurements or predictions give (nearly) the same result every time.

For example, when playing darts we play accurately by hitting close to the bullseye, whereas playing
precisely means hitting the exact same spot with every throw, even if that spot is far from the bullseye.
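The two scores can also diverge in classifier evaluation. As a small made-up illustration (my own numbers, not from the assignment data): out of 100 instances, 10 are positive, and a classifier produces TP = 5, FP = 5, FN = 5 and TN = 85. Its accuracy is (5 + 85)/100 = 90 %, while its precision is only 5/(5 + 5) = 50 %.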







2. Decision Trees
This assignment uses the WEKA implementation of C4.5, a decision tree learner. To invoke
this learner e.g. using a file called "train.arff" as a training set you can type:

java weka.classifiers.trees.J48 -t train.arff

This will construct a decision tree from train.arff and then apply it to train.arff. After that it will
perform a 10-fold cross-validation on train.arff.

2.1. Copy the files zoo.arff, zoo.train.arff and zoo.test.arff from Blackboard. This data includes
instances of animals described by their features (hairy, feathered, etc.) and classifications of
those animals (e.g. mammal, bird, reptile). Invoke the J48 classifier using zoo.train.arff and
zoo.test.arff as the training and testing files respectively. Note that zoo.{train,test}.arff together
contain the same data as zoo.arff, i.e. the latter was split to create the training and testing
sets.

        1. Report the number of correctly and incorrectly classified instances for the test data for
           Decision Trees.
        2. Include in your report a description of the decision tree constructed by the J48
           classifier and explain how the decision tree is used to classify a new instance.

1. I opened the zoo training file (zoo.train.arff) in Weka, ran the J48 classifier and evaluated it on the
supplied test set zoo.test.arff. The numbers of correctly and incorrectly classified instances are shown
in the Summary. This is the run on the test set:
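On the command line this would presumably correspond to supplying the test file with the -T option:

       java weka.classifiers.trees.J48 -t zoo.train.arff -T zoo.test.arff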


       === Run information ===

       Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
       Relation:     zoo
       Instances:    81
       Attributes:   18
                     animal
                     hair
                     feathers
                     eggs
                     milk
                     airborne
                     aquatic
                     predator
                     toothed
                     backbone
                     breathes
                     venomous
                     fins
                     legs
                     tail
                     domestic
                     catsize
                     type
       Test mode:user supplied test set: size unknown (reading incrementally)

       === Classifier model (full training set) ===

       J48 pruned tree
       ------------------

       feathers = false
       |   milk = false
       |   |   toothed = false
       |   |   |   airborne = false: invertebrate (8.0/1.0)
       |   |   |   airborne = true: insect (5.0)
       |   |   toothed = true
       |   |   |   fins = false
       |   |   |   |    legs <= 2: reptile (3.0)
       |   |   |   |    legs > 2: amphibian (3.0)
       |   |   |   fins = true: fish (10.0)
       |   milk = true: mammal (36.0)




       feathers = true: bird (16.0)

       Number of Leaves                  : 7

       Size of the tree :                     13


       Time taken to build model: 0.01 seconds

       === Evaluation on test set ===
       === Summary ===

       Correctly Classified Instances                         17               85      %
       Incorrectly Classified Instances                        3               15      %
       Kappa statistic                                         0.8187
       Mean absolute error                                     0.0464
       Root mean squared error                                 0.1965
       Relative absolute error                                20.0843 %
       Root relative squared error                            55.849 %
       Total Number of Instances                              20

       === Detailed Accuracy By Class ===
                        TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                        1         0         1           1        1           1          mammal
                        1         0         1           1        1           1          bird
                        0         0         0           0        0           0.5        reptile
                        1         0         1           1        1           1          fish
                        1         0.053     0.5         1        0.667       0.974      amphibian
                        0.5       0         1           0.5      0.667       0.944      insect
                        1         0.118     0.6         1        0.75        0.941      invertebrate
       Weighted Avg.    0.85      0.02      0.815       0.85     0.813       0.934

       === Confusion Matrix ===

        a   b   c   d   e   f   g       <--   classified as
        5   0   0   0   0   0   0   |   a =   mammal
        0   4   0   0   0   0   0   |   b =   bird
        0   0   0   0   1   0   1   |   c =   reptile
        0   0   0   3   0   0   0   |   d =   fish
        0   0   0   0   1   0   0   |   e =   amphibian
        0   0   0   0   0   1   1   |   f =   insect
        0   0   0   0   0   0   3   |   g =   invertebrate



We can also evaluate the classifier with a 10-fold cross-validation on the training set (instead of the
separate test set). The numbers of correctly and incorrectly classified instances are shown in the Summary:

       === Run information ===

       Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
       Relation:     zoo
       Instances:    81
       Attributes:   18
                     animal
                     hair
                     feathers
                     eggs
                     milk
                     airborne
                     aquatic
                     predator
                     toothed
                     backbone
                     breathes
                     venomous
                     fins
                     legs
                     tail
                     domestic
                     catsize
                     type
       Test mode:10-fold cross-validation

       === Classifier model (full training set) ===

       J48 pruned tree
       ------------------





       feathers = false
       |   milk = false
       |   |   toothed = false
       |   |   |   airborne = false: invertebrate (8.0/1.0)
       |   |   |   airborne = true: insect (5.0)
       |   |   toothed = true
       |   |   |   fins = false
       |   |   |   |    legs <= 2: reptile (3.0)
       |   |   |   |    legs > 2: amphibian (3.0)
       |   |   |   fins = true: fish (10.0)
       |   milk = true: mammal (36.0)
       feathers = true: bird (16.0)

       Number of Leaves                        : 7

       Size of the tree :                          13


       Time taken to build model: 0.03 seconds

       === Stratified cross-validation ===
       === Summary ===

       Correctly Classified Instances                                     75          92.5926 %
       Incorrectly Classified Instances                                    6           7.4074 %
       Kappa statistic                                                     0.8987
       Mean absolute error                                                 0.0232
       Root mean squared error                                             0.1465
       Relative absolute error                                            10.882 %
       Root relative squared error                                        45.1077 %
       Total Number of Instances                                          81

       === Detailed Accuracy By Class ===

                        TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                        1         0         1           1        1           1          mammal
                        1         0         1           1        1           1          bird
                        0.333     0.013     0.5         0.333    0.4         0.66       reptile
                        1         0.014     0.909       1        0.952       0.993      fish
                        0.667     0.013     0.667       0.667    0.667       0.827      amphibian
                        0.667     0.013     0.8         0.667    0.727       0.818      insect
                        0.857     0.027     0.75        0.857    0.8         0.907      invertebrate
       Weighted Avg.    0.926     0.006     0.921       0.926    0.922       0.959

       === Confusion Matrix ===

         a  b  c  d  e  f  g   <-- classified as
        36  0  0  0  0  0  0 | a = mammal
         0 16  0  0  0  0  0 | b = bird
         0  0  1  1  1  0  0 | c = reptile
         0  0  0 10  0  0  0 | d = fish
         0  0  1  0  2  0  0 | e = amphibian
         0  0  0  0  0  4  2 | f = insect
         0  0  0  0  0  1  6 | g = invertebrate



2. Weka produces the following decision tree:



       feathers = false
       |   milk = false
       |   |   toothed = false
       |   |   |   airborne = false: invertebrate (8.0/1.0)
       |   |   |   airborne = true: insect (5.0)
       |   |   toothed = true
       |   |   |   fins = false
       |   |   |   |    legs <= 2: reptile (3.0)
       |   |   |   |    legs > 2: amphibian (3.0)
       |   |   |   fins = true: fish (10.0)
       |   milk = true: mammal (36.0)
       feathers = true: bird (16.0)








The decision tree is listed above. It tests up to six attributes:

       1.   Does the creature have feathers? (true/false)
       2.   Does the creature produce milk? (true/false)
       3.   Is the creature toothed? (true/false)
       4.   Is the creature airborne? (true/false)
       5.   Does the creature have fins? (true/false)
       6.   Does the creature have more than two legs? (numeric test)
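To classify a new instance, we start at the root and follow the branch matching each attribute value until a leaf is reached; the leaf's label is the predicted class. For example (my own instance): an animal with feathers = false and milk = true follows the feathers = false branch, then the milk = true branch, and is immediately classified as mammal without any of the remaining tests being evaluated.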

2.2. If you invoke a WEKA classifier with a training set but no testing set, WEKA will
automatically perform a 10-fold cross-validation on the training set and report how many
instances are correctly and incorrectly classified when those instances are used as test data
during the cross-validation.

       1. When you ran the J48 classifier with the zoo.arff file, a 10-fold cross validation was
          performed. Report the number of instances correctly and incorrectly classified during
          the cross-validation.

I loaded the supplied zoo.arff file and performed a 10-fold cross-validation with the J48 classifier. The
numbers of correctly and incorrectly classified instances are shown in the Summary.

=== Run information ===

Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     zoo
Instances:    101
Attributes:   18
              animal
              hair
              feathers
              eggs
              milk
              airborne
              aquatic
              predator
              toothed
              backbone
              breathes
              venomous
              fins




              legs
              tail
              domestic
              catsize
              type
Test mode:10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

feathers = false
|   milk = false
|   |   backbone = false
|   |   |   airborne = false
|   |   |   |    predator = false
|   |   |   |    |   legs <= 2: invertebrate (2.0)
|   |   |   |    |   legs > 2: insect (2.0)
|   |   |   |    predator = true: invertebrate (8.0)
|   |   |   airborne = true: insect (6.0)
|   |   backbone = true
|   |   |   fins = false
|   |   |   |    tail = false: amphibian (3.0)
|   |   |   |    tail = true: reptile (6.0/1.0)
|   |   |   fins = true: fish (13.0)
|   milk = true: mammal (41.0)
feathers = true: bird (20.0)

Number of Leaves                :       9

Size of the tree :                      17


Time taken to build model: 0.02 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances                              93                92.0792 %
Incorrectly Classified Instances                             8                 7.9208 %
Kappa statistic                                              0.8955
Mean absolute error                                          0.0225
Root mean squared error                                      0.14
Relative absolute error                                     10.2478 %
Root relative squared error                                 42.4398 %
Total Number of Instances                                  101

=== Detailed Accuracy By Class ===

                 TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 1         0         1           1        1           1          mammal
                 1         0         1           1        1           1          bird
                 0.6       0.01      0.75        0.6      0.667       0.793      reptile
                 1         0.011     0.929       1        0.963       0.994      fish
                 0.75      0         1           0.75     0.857       0.872      amphibian
                 0.625     0.032     0.625       0.625    0.625       0.92       insect
                 0.8       0.033     0.727       0.8      0.762       0.986      invertebrate
Weighted Avg.    0.921     0.008     0.922       0.921    0.92        0.976

=== Confusion Matrix ===

         a  b  c  d  e  f  g   <-- classified as
        41  0  0  0  0  0  0 | a = mammal
         0 20  0  0  0  0  0 | b = bird
         0  0  3  1  0  1  0 | c = reptile
         0  0  0 13  0  0  0 | d = fish
         0  0  1  0  3  0  0 | e = amphibian
         0  0  0  0  0  5  3 | f = insect
         0  0  0  0  0  2  8 | g = invertebrate





       2. With the WEKA classifiers, the "-x [value]" option can be used to specify how many
          folds to use in a cross validation -- e.g. "-x 5", will specify a 5-fold cross-validation.
          Suppose you wish to perform a "leave one out" cross-validation on the zoo.arff data.
          How many folds must you specify to achieve this?

Because zoo.arff contains 101 instances, we must specify 101 folds, so that each fold holds out exactly one instance.
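The corresponding invocation would presumably be:

       java weka.classifiers.trees.J48 -t zoo.arff -x 101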

2.3. Make a copy of the weather.arff file and modify it to train a classifier with multiple
(discrete) target values. The classifier is supposed to assign your favorite sport according to
weather conditions (take, for example, "swimming", "badminton" and "none"). Modify the
training data according to your settings (create at least 20 training examples).

       1. Train the J48 classifier on the original data in weather.arff and on your modified
          version and report the number of correctly and incorrectly classified instances during
          the 10-fold cross-validation.


First I performed the 10-fold cross-validation on the original weather.arff. The numbers of correctly and
incorrectly classified instances are shown in the Summary.


         === Run information ===

         Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
         Relation:     weather
         Instances:    14
         Attributes:   5
                       outlook
                       temperature
                       humidity
                       windy
                       play
         Test mode:10-fold cross-validation

         === Classifier model (full training set) ===

         J48 pruned tree
         ------------------

         outlook = sunny
         |   humidity <= 75: yes (2.0)
         |   humidity > 75: no (3.0)
         outlook = overcast: yes (4.0)
         outlook = rainy
         |   windy = TRUE: no (2.0)
         |   windy = FALSE: yes (3.0)

         Number of Leaves          : 5

         Size of the tree :            8


         Time taken to build model: 0 seconds

         === Stratified cross-validation ===
         === Summary ===

         Correctly Classified Instances                       9                64.2857 %
         Incorrectly Classified Instances                     5                35.7143 %
         Kappa statistic                                      0.186
         Mean absolute error                                  0.2857
         Root mean squared error                              0.4818
         Relative absolute error                             60      %
         Root relative squared error                         97.6586 %
         Total Number of Instances                           14

         === Detailed Accuracy By Class ===

                           TP Rate          FP Rate   Precision   Recall   F-Measure   ROC Area    Class
                             0.778            0.6        0.7       0.778      0.737      0.789      yes
                             0.4              0.222      0.5       0.4        0.444      0.789      no



        Weighted Avg.         0.643         0.465        0.629        0.643        0.632          0.789

          === Confusion Matrix ===

           a b   <-- classified as
           7 2 | a = yes
           3 2 | b = no




            1.   We first check the discrete attribute ‘outlook’, which has three values: sunny, overcast
                 and rainy.
            2.   If the outlook is sunny we check the humidity.
                      •   If the humidity is less than or equal to 75 we play.
                      •   If the humidity is greater than 75 we don't play.
            3.   If the outlook is rainy we check whether it is windy. If it is windy we don't play; if it
                 is not windy we do play.
            4.   If the outlook is overcast we play.
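For example (my own instance): outlook = rainy with windy = TRUE ends in the leaf ‘no’, without the temperature attribute ever being consulted.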

Next I edited the weather.arff file to use several discrete target values instead of a simple yes/no. I
chose the sports football (FB) and indoor tennis (TN), plus ‘none’ for staying at home. The correctly and
incorrectly classified instances are shown in the Summary.


       === Run information ===

       Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
       Relation:     weather
       Instances:    24
       Attributes:   5
                     outlook
                     temperature
                     humidity
                     windy
                     play
       Test mode:10-fold cross-validation

       === Classifier model (full training set) ===

       J48 pruned tree
       ------------------

       outlook = sunny
       |   humidity <= 90: FB (5.0)
       |   humidity > 90: TN (3.0)
       outlook = overcast
       |   humidity <= 79




       |   |   temperature <= 85: FB (4.0)
       |   |   temperature > 85: TN (2.0)
       |   humidity > 79: TN (3.0)
       outlook = rainy
       |   humidity <= 79: TN (2.0)
       |   humidity > 79: none (5.0)

       Number of Leaves        : 7

       Size of the tree :        12


       Time taken to build model: 0 seconds

       === Stratified cross-validation ===
       === Summary ===

       Correctly Classified Instances                  16                  66.6667 %
       Incorrectly Classified Instances                 8                  33.3333 %
       Kappa statistic                                  0.4947
       Mean absolute error                              0.212
       Root mean squared error                          0.4295
       Relative absolute error                         48.9316 %
       Root relative squared error                     92.0147 %
       Total Number of Instances                       24

       === Detailed Accuracy By Class ===

                            TP Rate     FP Rate   Precision   Recall   F-Measure       ROC Area   Class
                              0.667       0.2        0.667     0.667      0.667          0.763     FB
                              0.5         0.214      0.625     0.5        0.556          0.754     TN
                              1           0.105      0.714     1          0.833          0.984     none
       Weighted Avg.          0.667       0.186      0.659     0.667      0.655          0.805

       === Confusion Matrix ===

        a   b   c   <--   classified as
        6   3   0 | a =   FB
        3   5   2 | b =   TN
        0   0   5 | c =   none




        2. Include your training data and the corresponding decision tree in your report and
           comment on its structure.

The training data I used for the model listed above is:


       @relation weather

       @attribute     outlook {sunny, overcast, rainy}
       @attribute     temperature real
       @attribute     humidity real
       @attribute     windy {TRUE, FALSE}
       @attribute     play {FB, TN, none}


       @data
       sunny, 95, 95, FALSE, TN
       sunny, 65, 65, FALSE, FB
       sunny, 65, 65, TRUE, FB
       sunny, 76, 78, TRUE, FB
       sunny, 80, 85, TRUE, FB
       sunny, 75, 95, FALSE, TN
       sunny, 80, 80, TRUE, FB
       sunny, 60, 95, FALSE, TN
       overcast, 65,65,TRUE, FB
       overcast, 75,65, TRUE, FB
       overcast, 85, 90, FALSE, TN
       overcast, 85, 90, TRUE, TN
       overcast, 65,68, FALSE, FB
       overcast, 90,65, TRUE, TN
       overcast, 65,95, FALSE, TN
       overcast, 90,65, FALSE, TN
       overcast, 85,65, TRUE, FB



       rainy,70,96,FALSE,none
       rainy,68,80,FALSE,none
       rainy,65,70,TRUE, TN
       rainy,76,79,FALSE, TN
       rainy,74,96,TRUE,none
       rainy,60,80,TRUE,none
       rainy,74,96,TRUE,none



Running the J48 classifier on this data, Weka produced the decision tree shown in the run output above.
Its structure can be read as follows:
            1.   We first check the discrete attribute ‘outlook’, which has three values: sunny, overcast
                 and rainy.
            2.   If the outlook is sunny we check the humidity.
                      •   If the humidity is less than or equal to 90 we play football.
                      •   If the humidity is greater than 90 we play indoor tennis.
            3.   If the outlook is overcast we may play either football or indoor tennis.
                      •   If the humidity is greater than 79 we play indoor tennis.
                      •   If the humidity is less than or equal to 79 the choice depends on the
                          temperature:
                                   §  If the temperature is less than or equal to 85 we play football.
                                   §  If the temperature is greater than 85 we play indoor tennis.
            4.   If the outlook is rainy we check the humidity; we either play indoor tennis or stay at
                 home.
                      •   If the humidity is less than or equal to 79 we play indoor tennis.
                      •   If the humidity is greater than 79 we stay at home.
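For example (my own instance): outlook = overcast, temperature = 80 and humidity = 70 follows overcast → humidity <= 79 → temperature <= 85 and is classified as FB (football).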

        3. The "-U" option can be used to turn pruning off. What does pruning do and why?
           What happens to the tree learned from your data when pruning is "off"? Comment on
           the differences or explain why there is no difference.

Pruning removes branches from the tree in order to generalize over parts of the training data: subtrees
that add little predictive value are collapsed into leaves, which usually improves accuracy on unseen
data. When pruning is turned off, every branch grown from the training data remains in the tree.

In my own model nothing changed when turning pruning on or off. I think this is because the tree is
small: there are not enough branches (or sub-branches) for the pruning step to find a subtree worth
collapsing, so the pruned and unpruned trees coincide. With more attributes, and therefore more branches,
the model would have more to gain from pruning.
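To compare, the unpruned tree can presumably be obtained with the -U option mentioned in the question (the file name for my modified data is illustrative):

       java weka.classifiers.trees.J48 -U -t weather-sports.arff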




More Related Content

Similar to Data mining: Baseline & Decision Trees

Workshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with RWorkshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with RShirin Elsinghorst
 
ML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptxML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptxbelay41
 
MS Word.doc
MS Word.docMS Word.doc
MS Word.docbutest
 
Data Mining With A Simulated Annealing Based Fuzzy Classification System
Data Mining With A Simulated Annealing Based Fuzzy Classification SystemData Mining With A Simulated Annealing Based Fuzzy Classification System
Data Mining With A Simulated Annealing Based Fuzzy Classification SystemJamie (Taka) Wang
 
Classification by Machine Learning Approaches
Classification by Machine Learning Approaches Classification by Machine Learning Approaches
Classification by Machine Learning Approaches butest
 
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationThomas Ploetz
 
Performance and Availability Tradeoffs in Replicated File Systems
Performance and Availability Tradeoffs in Replicated File SystemsPerformance and Availability Tradeoffs in Replicated File Systems
Performance and Availability Tradeoffs in Replicated File Systemspeterhoneyman
 
06-00-ACA-Evaluation.pdf
06-00-ACA-Evaluation.pdf06-00-ACA-Evaluation.pdf
06-00-ACA-Evaluation.pdfAlexanderLerch4
 
IRE major project group 22 IIITH
IRE major project group 22 IIITHIRE major project group 22 IIITH
IRE major project group 22 IIITHAkhil Jindal
 
How to make fewer errors at the stage of code writing. Part N4.
How to make fewer errors at the stage of code writing. Part N4.How to make fewer errors at the stage of code writing. Part N4.
How to make fewer errors at the stage of code writing. Part N4.PVS-Studio
 
Huong dan cu the svm
Huong dan cu the svmHuong dan cu the svm
Huong dan cu the svmtaikhoan262
 
DATA MINING - EVALUATING CLUSTERING ALGORITHM
DATA MINING - EVALUATING CLUSTERING ALGORITHMDATA MINING - EVALUATING CLUSTERING ALGORITHM
DATA MINING - EVALUATING CLUSTERING ALGORITHMTochukwu Udeh
 
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docx
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docxDr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docx
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docxmadlynplamondon
 

Similar to Data mining: Baseline & Decision Trees (17)

Workshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with RWorkshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with R
 
wk5ppt2_Iris
wk5ppt2_Iriswk5ppt2_Iris
wk5ppt2_Iris
 
ML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptxML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptx
 
evaluation and credibility-Part 2
evaluation and credibility-Part 2evaluation and credibility-Part 2
evaluation and credibility-Part 2
 
MS Word.doc
MS Word.docMS Word.doc
MS Word.doc
 
Data Mining With A Simulated Annealing Based Fuzzy Classification System
Data Mining With A Simulated Annealing Based Fuzzy Classification SystemData Mining With A Simulated Annealing Based Fuzzy Classification System
Data Mining With A Simulated Annealing Based Fuzzy Classification System
 
Classification by Machine Learning Approaches
Classification by Machine Learning Approaches Classification by Machine Learning Approaches
Classification by Machine Learning Approaches
 
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
 
Performance and Availability Tradeoffs in Replicated File Systems
Performance and Availability Tradeoffs in Replicated File SystemsPerformance and Availability Tradeoffs in Replicated File Systems
Performance and Availability Tradeoffs in Replicated File Systems
 
06-00-ACA-Evaluation.pdf
06-00-ACA-Evaluation.pdf06-00-ACA-Evaluation.pdf
06-00-ACA-Evaluation.pdf
 
IRE major project group 22 IIITH
IRE major project group 22 IIITHIRE major project group 22 IIITH
IRE major project group 22 IIITH
 
Comp102 lec 5.1
Comp102   lec 5.1Comp102   lec 5.1
Comp102 lec 5.1
 
How to make fewer errors at the stage of code writing. Part N4.
How to make fewer errors at the stage of code writing. Part N4.How to make fewer errors at the stage of code writing. Part N4.
How to make fewer errors at the stage of code writing. Part N4.
 
Guide
GuideGuide
Guide
 
Huong dan cu the svm
Huong dan cu the svmHuong dan cu the svm
Huong dan cu the svm
 
DATA MINING - EVALUATING CLUSTERING ALGORITHM
DATA MINING - EVALUATING CLUSTERING ALGORITHMDATA MINING - EVALUATING CLUSTERING ALGORITHM
DATA MINING - EVALUATING CLUSTERING ALGORITHM
 
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docx
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docxDr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docx
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docx
 

More from BarryK88

Data mining test notes (back)
Data mining test notes (back)Data mining test notes (back)
Data mining test notes (back)BarryK88
 
Data mining test notes (front)
Data mining test notes (front)Data mining test notes (front)
Data mining test notes (front)BarryK88
 
Data mining assignment 2
Data mining assignment 2Data mining assignment 2
Data mining assignment 2BarryK88
 
Data mining assignment 4
Data mining assignment 4Data mining assignment 4
Data mining assignment 4BarryK88
 
Data mining assignment 3
Data mining assignment 3Data mining assignment 3
Data mining assignment 3BarryK88
 
Data mining assignment 5
Data mining assignment 5Data mining assignment 5
Data mining assignment 5BarryK88
 
Data mining assignment 6
Data mining assignment 6Data mining assignment 6
Data mining assignment 6BarryK88
 
Data mining assignment 1
Data mining assignment 1Data mining assignment 1
Data mining assignment 1BarryK88
 
Semantic web final assignment
Semantic web final assignmentSemantic web final assignment
Semantic web final assignmentBarryK88
 
Semantic web assignment 3
Semantic web assignment 3Semantic web assignment 3
Semantic web assignment 3BarryK88
 
Semantic web assignment 2
Semantic web assignment 2Semantic web assignment 2
Semantic web assignment 2BarryK88
 
Semantic web assignment1
Semantic web assignment1Semantic web assignment1
Semantic web assignment1BarryK88
 

More from BarryK88 (12)

Data mining test notes (back)
Data mining test notes (back)Data mining test notes (back)
Data mining test notes (back)
 
Data mining test notes (front)
Data mining test notes (front)Data mining test notes (front)
Data mining test notes (front)
 
Data mining assignment 2
Data mining assignment 2Data mining assignment 2
Data mining assignment 2
 
Data mining assignment 4
Data mining assignment 4Data mining assignment 4
Data mining assignment 4
 
Data mining assignment 3
Data mining assignment 3Data mining assignment 3
Data mining assignment 3
 
Data mining assignment 5
Data mining assignment 5Data mining assignment 5
Data mining assignment 5
 
Data mining assignment 6
Data mining assignment 6Data mining assignment 6
Data mining assignment 6
 
Data mining assignment 1
Data mining assignment 1Data mining assignment 1
Data mining assignment 1
 
Semantic web final assignment
Semantic web final assignmentSemantic web final assignment
Semantic web final assignment
 
Semantic web assignment 3
Semantic web assignment 3Semantic web assignment 3
Semantic web assignment 3
 
Semantic web assignment 2
Semantic web assignment 2Semantic web assignment 2
Semantic web assignment 2
 
Semantic web assignment1
Semantic web assignment1Semantic web assignment1
Semantic web assignment1
 

Data mining: Baseline & Decision Trees

  • 1. Data mining ‘Baseline & Decision Trees’ COMPUTER ASSIGNMENT 2 BARRY KOLLEE 10349863
  • 2. Regression  |  CPU  performance        1. Evaluation in Machine Learning Copy the file weather.arff to your home directory. This file contains data for deciding when to play a certain sport given weather conditions. Run the J48 classifier using "weather.arff" as the training set. 1. Report how many instances are correctly and incorrectly classified on the training set. 2. The classifier weka.classifiers.rules.ZeroR simply assigns the most common classification in a training set to any new classifications and can be used as a baseline for evaluating other machine learning schemes. Invoke the ZeroR classifier using weather.arff. Report the number of correctly classified and misclassified instances both for the training set and cross-validation. 3. What are baselines used for? Is ZeroR a reasonable baseline? Can you think of other types of baselines? 4. What is the difference between a development set and a test set? 5. What is the difference between accuracy and precision? Give small examples for which both precision and accuracy scores differ greatly. 1. I’ve loaded up the delivered database file “weather.arff” into weka and run the J48 classifier by using the delivered training set. The J48 is one of the ‘tree classifiers’. Next to the results UTF8-description of the model, we’re also able to see the decision tree of our weather.arff file. The results are listed below of the J48 classifier is listed below. The correct and incorrect instances are given in red. Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2 Relation: weather Instances: 14 Attributes: 5 outlook temperature humidity windy play Test mode:evaluate on training data === Classifier model (full training set) === J48 pruned tree ------------------ outlook = sunny | humidity <= 75: yes (2.0) | humidity > 75: no (3.0) outlook = overcast: yes (4.0) outlook = rainy | windy = TRUE: no (2.0) | windy = FALSE: yes (3.0) Number of Leaves : 5 Size of the tree : 8 Time taken to build model: 0.01 seconds === Evaluation on training set === === Summary === Correctly Classified Instances 14 100 % Incorrectly Classified Instances 0 0 % Kappa statistic 1 Mean absolute error 0 Root mean squared error 0 Relative absolute error 0 % Root relative squared error 0 % Total Number of Instances 14 === Detailed Accuracy By Class === 2
  • 3. Regression  |  CPU  performance     TP Rate FP Rate Precision Recall F-Measure ROC Area Class 1 0 1 1 1 1 yes 1 0 1 1 1 1 no Weighted Avg. 1 0 1 1 1 1 === Confusion Matrix === a b <-- classified as 9 0 | a = yes 0 5 | b = no 2. Now the ZeroR classifier has been used to classify the “weather.arff”. First I performed a 10-fold crossing- validation. The correct and incorrect instances are given in red. Scheme:weka.classifiers.rules.ZeroR Relation: weather Instances: 14 Attributes: 5 outlook temperature humidity windy play Test mode:10-fold cross-validation === Classifier model (full training set) === ZeroR predicts class value: yes Time taken to build model: 0 seconds === Stratified cross-validation === === Summary === Correctly Classified Instances 9 64.2857 % Incorrectly Classified Instances 5 35.7143 % Kappa statistic 0 Mean absolute error 0.4762 Root mean squared error 0.4934 Relative absolute error 100 % Root relative squared error 100 % Total Number of Instances 14 === Detailed Accuracy By Class === TP Rate FP Rate Precision Recall F-Measure ROC Area Class 1 1 0.643 1 0.783 0.178 yes 0 0 0 0 0 0.178 no Weighted Avg. 0.643 0.643 0.413 0.643 0.503 0.178 === Confusion Matrix === 3
  • 4. Regression  |  CPU  performance     a b <-- classified as 9 0 | a = yes 5 0 | b = no Now I used the option ‘use training set’ in stead of 10-fold cross-validation. The correct and incorrect instances are given in red. The model is listed below: Scheme:weka.classifiers.rules.ZeroR Relation: weather Instances: 14 Attributes: 5 outlook temperature humidity windy play Test mode:evaluate on training data === Classifier model (full training set) === ZeroR predicts class value: yes Time taken to build model: 0 seconds === Evaluation on training set === === Summary === Correctly Classified Instances 9 64.2857 % Incorrectly Classified Instances 5 35.7143 % Kappa statistic 0 Mean absolute error 0.4643 Root mean squared error 0.4795 Relative absolute error 100 % Root relative squared error 100 % Total Number of Instances 14 === Detailed Accuracy By Class === TP Rate FP Rate Precision Recall F-Measure ROC Area Class 1 1 0.643 1 0.783 0.5 yes 0 0 0 0 0 0.5 no Weighted Avg. 0.643 0.643 0.413 0.643 0.503 0.5 === Confusion Matrix === a b <-- classified as 9 0 | a = yes 5 0 | b = no 3. a) A baseline is a simple approach to a given problem, which is often used to compare other approaches to, in order to see whether the other approaches perform better. Next to datamining this is a common term in businesses. Where a business can define a certain baseline for their company goals and/or strategy. Several approaches could deal this goal and/or strategy. b) No it is not. It just determines the most common class or the median (in case of numeric values). It tests how well a class can be predicted without considering any attributes. c) An example of a better type of baseline is NaiveBayes. This classifier does take attributes into account which makes our created model better. That’s because it’s not only checking for the most common class but also on every attributes the instances have. With these attributes we can give a way more accurate insight of the baseline. You can see that in the snippet below: 4
  • 5. Regression  |  CPU  performance     === Run information === Scheme:weka.classifiers.bayes.NaiveBayes Relation: weather Instances: 14 Attributes: 5 outlook temperature humidity windy play Test mode:10-fold cross-validation === Classifier model (full training set) === Naive Bayes Classifier Class Attribute yes no (0.63) (0.38) =============================== outlook sunny 3.0 4.0 overcast 5.0 1.0 rainy 4.0 3.0 [total] 12.0 8.0 temperature mean 72.9697 74.8364 std. dev. 5.2304 7.384 weight sum 9 5 precision 1.9091 1.9091 humidity mean 78.8395 86.1111 std. dev. 9.8023 9.2424 weight sum 9 5 precision 3.4444 3.4444 windy TRUE 4.0 4.0 FALSE 7.0 3.0 [total] 11.0 7.0 Time taken to build model: 0 seconds === Stratified cross-validation === === Summary === Correctly Classified Instances 9 64.2857 % Incorrectly Classified Instances 5 35.7143 % Kappa statistic 0.1026 Mean absolute error 0.4649 Root mean squared error 0.543 Relative absolute error 97.6254 % Root relative squared error 110.051 % Total Number of Instances 14 === Detailed Accuracy By Class === TP Rate FP Rate Precision Recall F-Measure ROC Area Class 0.889 0.8 0.667 0.889 0.762 0.444 yes 0.2 0.111 0.5 0.2 0.286 0.444 no Weighted Avg. 0.643 0.554 0.607 0.643 0.592 0.444 === Confusion Matrix === a b <-- classified as 8 1 | a = yes 4 1 | b = no 5
  • 6. Regression  |  CPU  performance     4. We create our training set to increase the accuracy of the classifier, which we use on the data. The more data we train the more accurate the resulting model will be. The other two sets are used to evaluate the performance of the classifier we use. The development set is used to evaluate the accuracy of different configurations of our classifier. It’s called the development set because we continuously need to evaluate the classification performance. In the end we’ve got a model, which has a great performance on the test data. To get estimates on how good the new model will deal with new data we use the test data. 5.  With accuracy we are getting a result which is close to the actual value/answer/datapoint. With precision we target to have an equal result on every new prediction on every new datapoint.   I.e. if we play darts we can be playing accurately by hitting the bulls eye. But precise means that we should throw a dart on the exact same spot every time.   6
  • 7. Regression  |  CPU  performance     2. Decision Trees This assignment uses the WEKA implementation of C4.5, a decision tree learner. To invoke this learner e.g. using a file called "train.arff" as a training set you can type: java weka.classifiers.trees.J48 -t train.arff This will construct a decision tree from train.arff and then apply it to train.arff. After that it will perform a 10-fold cross-validation on train.arff. 2.1. Copy the file zoo.arff zoo.train.arff zoo.test.arff from Blackboard. This data includes instances of animals described by their features (hairy, feathered, etc) and classifications of those animals (e.g. mammal, bird, reptile). Invoke the J48 classifer using zoo.train.arff and zoo.test.arff as the training and testing files respectively. Note that zoo.{train,test}.arff together contain the same data as zoo.arff, i.e. the latter was split to create the training and testing sets.w 1. Report the number of correctly and incorrectly classified instances for the test data for Decision Trees. 2. Include in your report a description of the decision tree constructed by the J48 classifier and explain how the decision tree is used to classify a new instance. 1. I’ve opened the zootest-file within weka and run the J48 classifier. The number of correct and incorrect instances are given in red. This is the use of the test set: === Run information === Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2 Relation: zoo Instances: 81 Attributes: 18 animal hair feathers eggs milk airborne aquatic predator toothed backbone breathes venomous fins legs tail domestic catsize type Test mode:user supplied test set: size unknown (reading incrementally) === Classifier model (full training set) === J48 pruned tree ------------------ feathers = false | milk = false | | toothed = false | | | airborne = false: invertebrate (8.0/1.0) | | | airborne = true: insect (5.0) | | toothed = true | | | fins = false | | | | legs <= 2: reptile (3.0) | | | | legs > 2: amphibian (3.0) | | | fins = true: fish (10.0) | milk = true: mammal (36.0) 7
  • 8. Regression  |  CPU  performance     feathers = true: bird (16.0) Number of Leaves : 7 Size of the tree : 13 Time taken to build model: 0.01 seconds === Evaluation on test set === === Summary === Correctly Classified Instances 17 85 % Incorrectly Classified Instances 3 15 % Kappa statistic 0.8187 Mean absolute error 0.0464 Root mean squared error 0.1965 Relative absolute error 20.0843 % Root relative squared error 55.849 % Total Number of Instances 20 === Detailed Accuracy By Class === TP Rate FP Rate Precision Recall F-Measure ROC Area Class 1 0 1 1 1 1 mammal 1 0 1 1 1 1 bird 0 0 0 0 0 0.5 reptile 1 0 1 1 1 1 fish 1 0.053 0.5 1 0.667 0.974 amphibian 0.5 0 1 0.5 0.667 0.944 insect 1 0.118 0.6 1 0.75 0.941 invertebrate Weighted Avg. 0.85 0.02 0.815 0.85 0.813 0.934 === Confusion Matrix === a b c d e f g <-- classified as 5 0 0 0 0 0 0 | a = mammal 0 4 0 0 0 0 0 | b = bird 0 0 0 0 1 0 1 | c = reptile 0 0 0 3 0 0 0 | d = fish 0 0 0 0 1 0 0 | e = amphibian 0 0 0 0 0 1 1 | f = insect 0 0 0 0 0 0 3 | g = invertebrate We can also perform a classifier based on the training set (and not the test set). The number of correct and incorrect instances is given in red: === Run information === Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2 Relation: zoo Instances: 81 Attributes: 18 animal hair feathers eggs milk airborne aquatic predator toothed backbone breathes venomous fins legs tail domestic catsize type Test mode:10-fold cross-validation === Classifier model (full training set) === J48 pruned tree ------------------ 8
  • 9. Regression  |  CPU  performance     feathers = false | milk = false | | toothed = false | | | airborne = false: invertebrate (8.0/1.0) | | | airborne = true: insect (5.0) | | toothed = true | | | fins = false | | | | legs <= 2: reptile (3.0) | | | | legs > 2: amphibian (3.0) | | | fins = true: fish (10.0) | milk = true: mammal (36.0) feathers = true: bird (16.0) Number of Leaves : 7 Size of the tree : 13 Time taken to build model: 0.03 seconds === Stratified cross-validation === === Summary === Correctly Classified Instances 75 92.5926 % Incorrectly Classified Instances 6 7.4074 % Kappa statistic 0.8987 Mean absolute error 0.0232 Root mean squared error 0.1465 Relative absolute error 10.882 % Root relative squared error 45.1077 % Total Number of Instances 81 === Detailed Accuracy By Class === TP Rate FP Rate Precision Recall F-Measure ROC Area Class 1 0 1 1 1 1 mammal 1 0 1 1 1 1 bird 0.333 0.013 0.5 0.333 0.4 0.66 reptile 1 0.014 0.909 1 0.952 0.993 fish 0.667 0.013 0.667 0.667 0.667 0.827 amphibian 0.667 0.013 0.8 0.667 0.727 0.818 insect 0.857 0.027 0.75 0.857 0.8 0.907 invertebrate Weighted Avg. 0.926 0.006 0.921 0.926 0.922 0.959 === Confusion Matrix === a b c d e f g <-- classified as 36 0 0 0 0 0 0 | a = mammal 0 16 0 0 0 0 0 | b = bird 0 0 1 1 1 0 0 | c = reptile 0 0 0 10 0 0 0 | d = fish 0 0 1 0 2 0 0 | e = amphibian 0 0 0 0 0 4 2 | f = insect 0 0 0 0 0 1 6 | g = invertebrate 2. Weka produces the following decision tree: feathers = false | milk = false | | toothed = false | | | airborne = false: invertebrate (8.0/1.0) | | | airborne = true: insect (5.0) | | toothed = true | | | fins = false | | | | legs <= 2: reptile (3.0) | | | | legs > 2: amphibian (3.0) | | | fins = true: fish (10.0) | milk = true: mammal (36.0) feathers = true: bird (16.0) 9
The decision tree is listed above. It asks up to six questions about a creature:

    1. Does the creature have feathers? (true/false)
    2. Does the creature produce milk? (true/false)
    3. Is the creature toothed? (true/false)
    4. Is the creature airborne? (true/false)
    5. Does the creature have fins? (true/false)
    6. Does the creature have more than 2 legs? (numerical check)

To classify a new instance we start at the root ("feathers") and, at every node, follow the branch that matches the instance's value for that attribute, until a leaf is reached; the label of that leaf is the predicted class. For example, a creature without feathers that produces milk is classified as a mammal after only two tests.
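This classification walk can also be reproduced programmatically. The following is a minimal sketch using the WEKA Java API; the class name is my own choice, and it assumes zoo.train.arff and zoo.test.arff are in the working directory:

    import weka.classifiers.trees.J48;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ZooClassify {
        public static void main(String[] args) throws Exception {
            // Load training and test data; the class ("type") is the last attribute
            Instances train = DataSource.read("zoo.train.arff");
            Instances test = DataSource.read("zoo.test.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // Build the pruned C4.5 tree with WEKA's defaults (-C 0.25 -M 2)
            J48 tree = new J48();
            tree.buildClassifier(train);

            // Classify one unseen instance: J48 starts at the root ("feathers")
            // and follows the matching branch at every node until it reaches a
            // leaf, whose label is the prediction
            Instance animal = test.instance(0);
            double prediction = tree.classifyInstance(animal);
            System.out.println("predicted: " + test.classAttribute().value((int) prediction));
        }
    }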
2.2. If you invoke a WEKA classifier with a training set but no testing set, WEKA will automatically perform a 10-fold cross-validation on the training set and report how many instances are correctly and incorrectly classified when those instances are used as test data during the cross-validation.

    1. When you ran the J48 classifier with the zoo.arff file, a 10-fold cross-validation was performed. Report the number of instances correctly and incorrectly classified during the cross-validation.

I loaded the delivered zoo.arff file and performed a 10-fold cross-validation with the J48 classifier. The numbers of correctly and incorrectly classified instances are given in red.

    === Run information ===

    Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
    Relation:     zoo
    Instances:    101
    Attributes:   18
                  animal
                  hair
                  feathers
                  eggs
                  milk
                  airborne
                  aquatic
                  predator
                  toothed
                  backbone
                  breathes
                  venomous
                  fins
                  legs
                  tail
                  domestic
                  catsize
                  type
    Test mode:10-fold cross-validation

    === Classifier model (full training set) ===

    J48 pruned tree
    ------------------

    feathers = false
    |   milk = false
    |   |   backbone = false
    |   |   |   airborne = false
    |   |   |   |   predator = false
    |   |   |   |   |   legs <= 2: invertebrate (2.0)
    |   |   |   |   |   legs > 2: insect (2.0)
    |   |   |   |   predator = true: invertebrate (8.0)
    |   |   |   airborne = true: insect (6.0)
    |   |   backbone = true
    |   |   |   fins = false
    |   |   |   |   tail = false: amphibian (3.0)
    |   |   |   |   tail = true: reptile (6.0/1.0)
    |   |   |   fins = true: fish (13.0)
    |   milk = true: mammal (41.0)
    feathers = true: bird (20.0)

    Number of Leaves  :     9

    Size of the tree :      17

    Time taken to build model: 0.02 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances          93               92.0792 %
    Incorrectly Classified Instances         8                7.9208 %
    Kappa statistic                          0.8955
    Mean absolute error                      0.0225
    Root mean squared error                  0.14
    Relative absolute error                 10.2478 %
    Root relative squared error             42.4398 %
    Total Number of Instances              101

    === Detailed Accuracy By Class ===

                   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                     1         0          1          1         1          1         mammal
                     1         0          1          1         1          1         bird
                     0.6       0.01       0.75       0.6       0.667      0.793     reptile
                     1         0.011      0.929      1         0.963      0.994     fish
                     0.75      0          1          0.75      0.857      0.872     amphibian
                     0.625     0.032      0.625      0.625     0.625      0.92      insect
                     0.8       0.033      0.727      0.8       0.762      0.986     invertebrate
    Weighted Avg.    0.921     0.008      0.922      0.921     0.92       0.976

    === Confusion Matrix ===

       a  b  c  d  e  f  g   <-- classified as
      41  0  0  0  0  0  0 |  a = mammal
       0 20  0  0  0  0  0 |  b = bird
       0  0  3  1  0  1  0 |  c = reptile
       0  0  0 13  0  0  0 |  d = fish
       0  0  1  0  3  0  0 |  e = amphibian
       0  0  0  0  0  5  3 |  f = insect
       0  0  0  0  0  2  8 |  g = invertebrate
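For reference, the same 10-fold cross-validation can be run through the WEKA Java API instead of the GUI or command line. This is a sketch under the assumption that zoo.arff is in the working directory (class name is my own choice):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ZooCrossValidation {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("zoo.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Evaluate a fresh J48 tree with 10-fold cross-validation;
            // passing data.numInstances() as the fold count instead would
            // give the leave-one-out setup of the next question
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }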
    2. With the WEKA classifiers, the "-x [value]" option can be used to specify how many folds to use in a cross-validation -- e.g. "-x 5" will specify a 5-fold cross-validation. Suppose you wish to perform a "leave one out" cross-validation on the zoo.arff data. How many folds must you specify to achieve this?

Because zoo.arff contains 101 instances, we must specify 101 folds: each fold then holds out exactly one instance for testing while the tree is trained on the remaining 100.
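Concretely, the invocation (using the -t and -x options introduced above) would be:

    java weka.classifiers.trees.J48 -t zoo.arff -x 101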
2.3. Make a copy of the weather.arff file and modify it to train a classifier with multiple (discrete) target values. The classifier is supposed to assign your favorite sport according to weather conditions (take, for example, "swimming", "badminton" and "none"). Modify the training data according to your settings (create at least 20 training examples).

    1. Train the J48 classifier on the original data in weather.arff and on your modified version and report the number of correctly and incorrectly classified instances during the 10-fold cross-validation.

First I performed the 10-fold cross-validation on weather.arff. The numbers of correctly and incorrectly classified instances are given in red.

    === Run information ===

    Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
    Relation:     weather
    Instances:    14
    Attributes:   5
                  outlook
                  temperature
                  humidity
                  windy
                  play
    Test mode:10-fold cross-validation

    === Classifier model (full training set) ===

    J48 pruned tree
    ------------------

    outlook = sunny
    |   humidity <= 75: yes (2.0)
    |   humidity > 75: no (3.0)
    outlook = overcast: yes (4.0)
    outlook = rainy
    |   windy = TRUE: no (2.0)
    |   windy = FALSE: yes (3.0)

    Number of Leaves  :     5

    Size of the tree :      8

    Time taken to build model: 0 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances           9               64.2857 %
    Incorrectly Classified Instances         5               35.7143 %
    Kappa statistic                          0.186
    Mean absolute error                      0.2857
    Root mean squared error                  0.4818
    Relative absolute error                 60      %
    Root relative squared error             97.6586 %
    Total Number of Instances               14

    === Detailed Accuracy By Class ===

                   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                     0.778     0.6        0.7        0.778     0.737      0.789     yes
                     0.4       0.222      0.5        0.4       0.444      0.789     no
    Weighted Avg.    0.643     0.465      0.629      0.643     0.632      0.789

    === Confusion Matrix ===

     a b   <-- classified as
     7 2 | a = yes
     3 2 | b = no

The structure of this tree is as follows:

    1. We first check the discrete attribute 'outlook', which has three values: sunny, overcast and rainy.
    2. If the outlook is sunny we check the value of the humidity.
        • If the humidity is less than or equal to 75, we play.
        • If the humidity is greater than 75, we don't play.
    3. If the outlook is rainy we check whether it is windy. If it's windy we don't play; if it's not windy we do play.
    4. If the outlook is overcast we always play.

Next I edited the weather.arff file to use multiple discrete class values instead of a simple yes/no. I've chosen to include the sports football (FB), indoor tennis (TN) and 'none' (no sport at all). The numbers of correctly and incorrectly classified instances are given in red.

    === Run information ===

    Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
    Relation:     weather
    Instances:    24
    Attributes:   5
                  outlook
                  temperature
                  humidity
                  windy
                  play
    Test mode:10-fold cross-validation

    === Classifier model (full training set) ===

    J48 pruned tree
    ------------------

    outlook = sunny
    |   humidity <= 90: FB (5.0)
    |   humidity > 90: TN (3.0)
    outlook = overcast
    |   humidity <= 79
    |   |   temperature <= 85: FB (4.0)
    |   |   temperature > 85: TN (2.0)
    |   humidity > 79: TN (3.0)
    outlook = rainy
    |   humidity <= 79: TN (2.0)
    |   humidity > 79: none (5.0)

    Number of Leaves  :     7

    Size of the tree :      12

    Time taken to build model: 0 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances          16               66.6667 %
    Incorrectly Classified Instances         8               33.3333 %
    Kappa statistic                          0.4947
    Mean absolute error                      0.212
    Root mean squared error                  0.4295
    Relative absolute error                 48.9316 %
    Root relative squared error             92.0147 %
    Total Number of Instances               24

    === Detailed Accuracy By Class ===

                   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                     0.667     0.2        0.667      0.667     0.667      0.763     FB
                     0.5       0.214      0.625      0.5       0.556      0.754     TN
                     1         0.105      0.714      1         0.833      0.984     none
    Weighted Avg.    0.667     0.186      0.659      0.667     0.655      0.805

    === Confusion Matrix ===

     a b c   <-- classified as
     6 3 0 | a = FB
     3 5 2 | b = TN
     0 0 5 | c = none

    2. Include your training data and the corresponding decision tree in your report and comment on its structure.

The training data I used to build the model listed above is:

    @relation weather

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature real
    @attribute humidity real
    @attribute windy {TRUE, FALSE}
    @attribute play {FB, TN, none}

    @data
    sunny, 95, 95, FALSE, TN
    sunny, 65, 65, FALSE, FB
    sunny, 65, 65, TRUE, FB
    sunny, 76, 78, TRUE, FB
    sunny, 80, 85, TRUE, FB
    sunny, 75, 95, FALSE, TN
    sunny, 80, 80, TRUE, FB
    sunny, 60, 95, FALSE, TN
    overcast, 65, 65, TRUE, FB
    overcast, 75, 65, TRUE, FB
    overcast, 85, 90, FALSE, TN
    overcast, 85, 90, TRUE, TN
    overcast, 65, 68, FALSE, FB
    overcast, 90, 65, TRUE, TN
    overcast, 65, 95, FALSE, TN
    overcast, 90, 65, FALSE, TN
    overcast, 85, 65, TRUE, FB
    rainy, 70, 96, FALSE, none
    rainy, 68, 80, FALSE, none
    rainy, 65, 70, TRUE, TN
    rainy, 76, 79, FALSE, TN
    rainy, 74, 96, TRUE, none
    rainy, 60, 80, TRUE, none
    rainy, 74, 96, TRUE, none

Running the J48 classifier on this data, WEKA produced the decision tree listed above. Its structure is as follows:

    1. We first check the discrete attribute 'outlook', which has three values: sunny, overcast and rainy.
    2. If the outlook is sunny we check the value of the humidity.
        • If the humidity is less than or equal to 90 we play football.
        • If the humidity is greater than 90 we play indoor tennis.
    3. If the outlook is overcast we can play either indoor tennis or football.
        • If the humidity is greater than 79 we play indoor tennis.
        • If the humidity is less than or equal to 79, the choice depends on the temperature:
            § If the temperature is less than or equal to 85 we play football.
            § If the temperature is greater than 85 we play indoor tennis.
    4. If the outlook is rainy we check how high the humidity is; we either play indoor tennis or stay at home.
        • If the humidity is less than or equal to 79 we play indoor tennis.
        • If the humidity is greater than 79 we stay at home.

    3. The "-U" option can be used to turn pruning off. What does pruning do and why? What happens to the tree learned from your data when pruning is "off"? Comment on the differences or explain why there is no difference.

Pruning collapses branches of the tree that cover only a few, possibly noisy, training examples and replaces them by more general subtrees or leaves; the pruned version is kept when this generalization does not hurt the estimated accuracy. Pruning is used because an unpruned tree keeps every branch induced by the training data, however specific, and therefore tends to overfit: it describes the training data too well at the expense of new data. When pruning is turned off ("-U"), every branch of the model remains visible.

For my own model there was no difference between the trees learned with pruning on and off. I think this is because the tree is small and contains no branches that cover only one or two examples, so the pruning step finds nothing to collapse. To profit from pruning we would need more (and deeper) branches; adding more attributes to the data could achieve this.
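One way to verify this is to build the tree twice, once with and once without pruning, and compare the printed models. A minimal sketch with the WEKA Java API; the file name weather-sports.arff for my modified copy and the class name are hypothetical:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PruningComparison {
        public static void main(String[] args) throws Exception {
            // Hypothetical file name for the modified weather data listed above
            Instances data = DataSource.read("weather-sports.arff");
            data.setClassIndex(data.numAttributes() - 1);

            J48 pruned = new J48();        // WEKA default: pruning on (-C 0.25 -M 2)
            pruned.buildClassifier(data);

            J48 unpruned = new J48();
            unpruned.setUnpruned(true);    // same effect as the -U command-line option
            unpruned.buildClassifier(data);

            // Printing a J48 object renders its tree; for the data above the two
            // printouts came out identical, i.e. pruning collapsed nothing
            System.out.println("--- pruned ---\n" + pruned);
            System.out.println("--- unpruned ---\n" + unpruned);
        }
    }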