Lecture No. 3

       Ravi Gupta
 AU-KBC Research Centre,
MIT Campus, Anna University




                              Date: 12.3.2008
Today’s Agenda


•   Recap of ID3 Algorithm
•   Machine Learning Bias
•   Occam’s razor principle
•   Handling ID3 problems
Decision Trees


• Decision tree learning is a method for approximating
discrete-valued target functions, in which the learned function
is represented by a decision tree.

• Decision trees can also be represented by if-then-else rules.

• Decision tree learning is one of the most widely used
approaches to inductive inference.
Decision Trees
(Diagram: a generic decision tree. Intermediate nodes are attributes (A1, A2, A3), edges are attribute values, and leaf nodes are output values.)
Decision Trees Representation

(Each path from the root to a leaf is a conjunction of attribute tests; the tree as a whole represents a disjunction of these conjunctions.)
Decision Trees as If-then-else rule
(Each rule below is a conjunction of attribute tests; the rule set is their disjunction.)




   •If (Outlook = Sunny AND humidity = Normal) then PlayTennis = Yes
   •If (Outlook = Overcast) then PlayTennis = Yes
   •If (Outlook = Rain AND Wind = Weak) then PlayTennis = Yes
Problems Suitable for Decision Trees



    • Instances are represented by attribute-value pairs

    • The target function has discrete output values

    • Disjunctive descriptions may be required

    • The training data may contain errors

    • The training data may contain missing attribute values
Building Decision Tree

(Diagram: a decision tree under construction. The root tests attribute A1; one branch leads directly to an output value, while the other branches lead to subtrees rooted at attributes A2 and A3, whose branches end in output values.)
Building Decision Tree

Candidate attributes for the root node: Outlook, Temperature, Humidity, Wind. Which attribute should we select?
Entropy

Given a collection S, containing positive and negative examples of
some target concept, the entropy of S relative to this boolean
classification (yes/no) is

       Entropy(S) = − p+ log2 p+ − p− log2 p−

       where p+ is the proportion of positive examples in S and p− is the
       proportion of negative examples in S. In all calculations involving
       entropy we define 0 log 0 to be 0.
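
As an illustration only (not part of the slides), a minimal Python sketch of this entropy computation over a list of boolean labels:

import math

def entropy(labels):
    # Entropy of a boolean-labelled collection S; 0 * log2(0) is treated as 0.
    n = len(labels)
    if n == 0:
        return 0.0
    p_pos = sum(labels) / n        # proportion of positive examples
    p_neg = 1.0 - p_pos            # proportion of negative examples
    return sum(-p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(entropy([True] * 9 + [False] * 5))   # prints roughly 0.940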
Information Gain Measure

 Information gain is simply the expected reduction in entropy
 caused by partitioning the examples according to this attribute.

 More precisely, the information gain, Gain(S, A), of an attribute A,
 relative to a collection of examples S, is defined as

       Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)

 where Values(A) is the set of all possible values for attribute A,
 and Sv is the subset of S for which attribute A has value v, i.e.,
 Sv = {s ∈ S | A(s) = v}.
Information Gain Measure



The first term is the entropy of S; the second (summation) term is the expected entropy of S after the partition on A.

Gain(S, A) is the expected reduction in entropy caused by knowing the value of
attribute A.

Gain(S, A) is the information provided about the target function value, given the
value of some other attribute A. The value of Gain(S, A) is the number of bits
saved when encoding the target value of an arbitrary member of S, by knowing
the value of attribute A.
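
A small Python sketch of Gain(S, A), built on the entropy function sketched above; the data layout (examples as dicts, boolean labels) is an assumption for illustration:

import math
from collections import defaultdict

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

def information_gain(examples, labels, attribute):
    # Gain(S, A) = Entropy(S) - sum over v of |Sv|/|S| * Entropy(Sv)
    subsets = defaultdict(list)
    for example, label in zip(examples, labels):
        subsets[example[attribute]].append(label)    # build Sv for each value v
    expected = sum(len(sv) / len(labels) * entropy(sv) for sv in subsets.values())
    return entropy(labels) - expected

# Illustrative call: examples are dicts such as {'Wind': 'Weak', ...},
# labels are booleans (PlayTennis = Yes).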
Example




There are 14 examples: 9 positive and 5 negative [9+, 5−].

The entropy of S relative to this boolean (yes/no) classification is

       Entropy(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) ≈ 0.940
Gain (S, Attribute = Wind)
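For concreteness — assuming the standard PlayTennis split for Wind, i.e. Wind = Weak covers [6+, 2−] and Wind = Strong covers [3+, 3−] (an assumption here, since the data table is not reproduced on this slide) — the calculation would run:

       Gain(S, Wind) = Entropy(S) − (8/14)·Entropy(S_Weak) − (6/14)·Entropy(S_Strong)
                     ≈ 0.940 − (8/14)·0.811 − (6/14)·1.000
                     ≈ 0.048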
Final Decision Tree
Some Insights into Capabilities and
       Limitations of ID3 Algorithm
•   ID3 searches a complete hypothesis space. [Advantage]

•   ID3 maintains only a single current hypothesis as it searches through
    the space of decision trees. By committing to a single hypothesis, ID3
    loses the capabilities that follow from explicitly representing all consistent
    hypotheses. [Disadvantage]

•   ID3 in its pure form performs no backtracking in its search. Once it
    selects an attribute to test at a particular level in the tree, it never backtracks
    to reconsider this choice. Therefore, it is susceptible to the usual risks of
    hill-climbing search without backtracking: converging to locally
    optimal solutions that are not globally optimal. [Disadvantage]
Some Insights into Capabilities and
       Limitations of ID3 Algorithm

•   ID3 uses all training examples at each step in the search to make
    statistically based decisions regarding how to refine its current
    hypothesis. This contrasts with methods that make decisions
    incrementally, based on individual training examples (e.g., FIND-S or
    CANDIDATE-ELIMINATION). One advantage of using statistical
    properties of all the examples (e.g., information gain) is that the
    resulting search is much less sensitive to errors in individual training
    examples. [Advantage]
Machine Learning Biases


• Language Bias/Restriction Bias: A restriction on the
  type of hypotheses that can be learned (limits the set of
  hypotheses that can be expressed).

• Preference Bias/Search Bias: A preference for certain
  hypotheses over others (e.g., shorter hypotheses), with no
  hard restriction on the hypothesis space.
CANDIDATE-ELIMINATION Algorithm
CANDIDATE-ELIMINATION Algorithm




   Hypothesis was assumed to be conjunction of Attributes
CANDIDATE-ELIMINATION Algorithm




    Candidate-Elimination algorithm is Language biased
CANDIDATE-ELIMINATION Algorithm




  The problem is that the algorithm considers (is biased toward) only the conjunctive hypothesis space.

  The following example requires a more expressive hypothesis space.
Building Decision Tree

(Diagram: the same generic decision tree as before — root attribute A1, with subtrees rooted at A2 and A3 and leaves holding output values.)
Decision Tree




ID3 algorithm has Preference/Search Bias
ID3 Strategy for Selecting Hypothesis


 • Selects trees that place the attributes with highest
   information gain closest to the root.

 • Selects in favor of shorter trees over longer ones.
Preference Bias or Restriction Bias ?


   A preference bias is more desirable than a restriction bias,
   because it allows the learner to work within a complete
   hypothesis space that is assured to contain the unknown
   target function.

   In contrast, a restriction bias that strictly limits the set of
   potential hypotheses is generally less desirable, because it
   introduces the possibility of excluding the unknown target
   function altogether.
Preference Bias or Restriction Bias ?


  While ID3 exhibits a purely preference bias and CANDIDATE-ELIMINATION
  a purely restriction bias, some learning systems combine both.
Preference Bias AND Restriction Bias ?
Preference Bias AND Restriction Bias ?

  •   Task T: playing checkers
  •   Performance measure P: % of games won in the world
      tournament
  •   Training experience E: games played against itself
  •   Target function: F : Board → R
  •   Target function representation
           F'(b) = w0 + w1x1+ w2x2 + w3x3 + w4x4 + w5x5 + w6x6




        A linear combination of variables
        (Language Bias/Restriction Bias)
Preference Bias AND Restriction Bias ?



 E(Error) ≡ Σ_{⟨b, Ftrain(b)⟩ ∈ training examples} (Ftrain(b) − F'(b))²




      Preference Bias (because the weights are found using the Least
      Mean Squares technique)
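
As an aside, a minimal sketch (not from the lecture) of the LMS gradient-descent update that realizes this preference bias; the learning rate and the feature layout are assumptions:

def lms_update(weights, features, target, lr=0.01):
    # One LMS step on the squared error (Ftrain(b) - F'(b))^2:
    #   w_i <- w_i + lr * (Ftrain(b) - F'(b)) * x_i
    # features[0] is assumed to be the constant 1, so weights[0] plays the role of w0.
    predicted = sum(w * x for w, x in zip(weights, features))
    error = target - predicted
    return [w + lr * error * x for w, x in zip(weights, features)]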
Issues in Decision Tree Learning

 •   Determining how deeply to grow the decision tree
 •   Handling continuous attributes
 •   Choosing an appropriate attribute
 •   Selection measure
 •   Handling training data with missing attribute values
 •   Handling attributes with differing costs, and improving
     computational efficiency
Occam’s Razor
Occam's razor (sometimes
spelled Ockham's razor) is a
principle attributed to the 14th-
century English logician and
Franciscan friar William of
Ockham.


The principle states that the
explanation of any
phenomenon should make as
few assumptions as possible,
eliminating those that make no
difference in the observable
predictions of the explanatory
hypothesis or theory.
Occam’s Razor

This is often paraphrased as "All other things being equal, the simplest
solution is the best."


In other words, when multiple competing theories are equal in other
respects, the principle recommends selecting the theory that introduces
the fewest assumptions and postulates the fewest entities. It is in this
sense that Occam's razor is usually understood.




      Prefer the simplest hypothesis that fits the data
Why it’s called Occam’s Razor

    Tom M. Mitchell says… Occam got this idea while shaving.




  Wikipedia says… The term razor refers to the act of shaving
  away unnecessary assumptions to get to the simplest
  explanation.
ID3 Strategy for Selecting Hypothesis


 • Selects trees that place the attributes with highest
   information gain closest to the root.

 • Selects in favor of shorter trees over longer ones.
Problem with Occam’s Razor

Why should the simplest hypothesis that fits the data be the best solution?
Why not the second simplest or third simplest hypothesis?


The size of a hypothesis is determined by the particular
representation used internally by the learner. Two learners using
different internal representations could therefore arrive at different
hypotheses, both justifying their contradictory conclusions by
Occam's razor!
Training and Testing

For classification problems, a classifier’s performance is
measured in terms of the error rate.

The classifier predicts the class of each instance: if it is correct,
that is counted as a success; if not, it is an error.

The error rate is just the proportion of errors made over a whole
set of instances, and it measures the overall performance of the
classifier.
Training and Testing

What we are interested in is the likely future performance on new
data, not the past performance on old data. We already know the
classification of each instance in the training set, which after all
is why we can use it for training.

We are not generally interested in learning about those
classifications—although we might be if our purpose is data
cleansing rather than prediction.

So the question is, is the error rate on old data likely to be a good
indicator of the error rate on new data?
    The answer is a resounding no—not if the old data was used
    during the learning process to train the classifier.
Training and Testing




Error rate on the training set is not likely to be a good
indicator of future performance.
Training and Testing

Self-consistency test: the training dataset and the test dataset are the same.



The error rate on the training data is called the resubstitution error,
because it is calculated by resubstituting the training instances into a
classifier that was constructed from them.
Training and Testing
Holdout strategy: the holdout method reserves a certain amount of data for
testing and uses the remainder for training (setting part of that aside
for validation, if required).




             In practice, however, we usually have only a limited
             number of examples available…
Training and Testing

K-fold cross-validation technique:

In k-fold cross-validation, the dataset is partitioned randomly into k
equal-sized sets. Training and testing are carried out k times, each time
using one distinct set for testing and the other k−1 sets for training.
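
A bare-bones Python sketch of this procedure, assuming caller-supplied train_fn and eval_fn helpers (both hypothetical names, not part of the lecture):

import random

def k_fold_cross_validation(examples, k, train_fn, eval_fn, seed=0):
    # Randomly partition the data into k roughly equal-sized sets,
    # then train on k-1 sets and test on the remaining one, k times.
    data = list(examples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test_set = folds[i]
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        classifier = train_fn(train_set)
        accuracies.append(eval_fn(classifier, test_set))
    return sum(accuracies) / k      # e.g. ACC = (ACC1 + ACC2 + ... + ACCk) / k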
4-Fold Cross-validation
(Figures: each of the four folds in turn serves as the test dataset while the remaining three folds form the training dataset, yielding accuracies ACC1, ACC2, ACC3 and ACC4.)
4-Fold Cross-validation




 ACC = (ACC1 + ACC2 + ACC3 + ACC4) / 4
Issues in Decision Tree Learning

 •   Determining how deeply to grow the decision tree
 •   Handling continuous attributes
 •   Choosing an appropriate attribute
 •   Selection measure
 •   Handling training data with missing attribute values
 •   Handling attributes with differing costs, and improving
     computational efficiency
Avoiding Overfitting in Decision
          Trees…..




• A hypothesis is said to be over-fitting the training
  examples if some other hypothesis that fits the
  training examples less well actually performs better
  over the entire distribution of instances (i.e., including
  instances beyond the training set).
Overfitting




H: Hypothesis Space
Overfitting



(Figure: positive and negative training examples plotted in the instance space.)
Overfitting




(Figure: two candidate hypotheses, h1 and h2, drawn over the same examples.)
Overfitting




h1 is more accurate than h2 on the training examples.
Overfitting




h1 is less accurate than h2 on the unseen (test) examples.
Overfitting

Is h1 more accurate than h2 on the training examples?

• Yes, and h1 is also more accurate on the test examples → no over-fitting.
• Yes, but h1 is less accurate on the test examples → over-fitting.
• No, but h1 is more accurate on the test examples → over-fitting.
• No, and h1 is also less accurate on the test examples → no over-fitting.
Overfitting




Overfitting in decision tree learning. As ID3 adds new nodes to grow the decision tree, the
accuracy of the tree measured over the training examples increases monotonically. However,
when measured over a set of test examples independent of the training examples, accuracy
first increases, then decreases.
Overfitting in Decision Tree




Overfitting in decision tree learning. As ID3 adds new nodes to grow the decision tree, the
accuracy of the tree measured over the training examples increases monotonically. However,
when measured over a set of test examples independent of the training examples, accuracy
first increases, then decreases.
Why Overfitting Happens in
   Decision Tree Learning?

• Presence of errors in the training examples.
  (A cause of overfitting in machine learning in general.)

• When only a small number of examples is associated
  with a leaf node.
Presence of Error and Over-fitting
Presence of Error and Over-fitting
Presence of Error and Over-fitting




                     (The tree learned from the data containing the erroneous
                     example is more complex and deeper.)
Presence of Error and Over-fitting
How to avoid Overfitting…

• Stop growing the tree earlier, before it
  reaches the point where it perfectly
  classifies the training data

• Allow the tree to overfit the data, and then
  post-prune the tree.
How to avoid Overfitting…

• Post-pruning overfit trees has been found
  to be more successful in practice. This is
  due to the difficulty in the first approach of
  estimating precisely when to stop growing
  the tree.
How to avoid Overfitting…

• Regardless of whether the correct tree size
  is found by stopping early or by post-
  pruning, a key question is what criterion is
  to be used to determine the correct final
  tree size.
Determining correct final tree size
• Use a separate set of examples for training and testing.
  [Training and Validation] <for pruning method>

• Use all the available data for training, but apply a
  statistical test (e.g., the chi-square test) to estimate
  whether expanding (or pruning) a particular node is
  likely to produce an improvement beyond the training
  set. <for pruning method>

• Use an explicit measure of the complexity for encoding
  the training examples and the decision tree, halting
  growth of the tree when this encoding size is
  minimized. This approach is based on a heuristic called
  the Minimum Description Length (MDL) principle.
Pruning Methods


• Reduced-error pruning (Quinlan 1987)

• Rule post-pruning (Quinlan 1993)
Reduced Error Pruning

• Pruning a decision node consists of removing the
  subtree rooted at that node, making it a leaf node,
  and assigning it the most common classification of
  the training examples affiliated with that node.

• Nodes are removed only if the resulting pruned
  tree performs no worse than the original over the
  validation set.
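
A simplified sketch of this idea, assuming a hypothetical tree representation (internal node = {'attribute': a, 'children': {value: subtree}}, leaf = class label, example = dict with a 'label' key); it is illustrative only, not ID3's actual data structures:

from collections import Counter

def classify(node, example):
    while isinstance(node, dict):
        node = node['children'][example[node['attribute']]]
    return node

def accuracy(node, examples):
    if not examples:
        return 1.0
    return sum(classify(node, e) == e['label'] for e in examples) / len(examples)

def reduced_error_prune(node, train_examples, val_examples):
    # Bottom-up: prune the children first, then try to replace this subtree by a
    # leaf holding the most common training classification at this node.
    if not isinstance(node, dict) or not train_examples:
        return node
    attr = node['attribute']
    for value, child in node['children'].items():
        node['children'][value] = reduced_error_prune(
            child,
            [e for e in train_examples if e[attr] == value],
            [e for e in val_examples if e[attr] == value])
    majority_leaf = Counter(e['label'] for e in train_examples).most_common(1)[0][0]
    # Keep the pruned leaf only if it performs no worse over the validation examples.
    if accuracy(majority_leaf, val_examples) >= accuracy(node, val_examples):
        return majority_leaf
    return node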
Reduced Error Pruning
Reduced Error Pruning
Drawback of Training and
     Validation Method


Using a separate set of data to guide pruning is an
effective approach provided a large amount of data is
available. The major drawback of this approach arises
when data is limited: holding examples out for validation
leaves even less data for training.
Rule Post-Pruning

In practice, rule post-pruning is one of the most successful methods
for finding high-accuracy hypotheses by post-pruning a decision
tree.
Rule Post-Pruning (Step 1)


 1: Grow the decision tree from the training data, allowing it to overfit. (The slide shows the resulting tree.)
Rule Post-Pruning (Step 2)

2

    1: IF (Outlook = sunny and Temperature = Hot) THEN PlayTennis = No
    2: IF (Outlook = sunny and Temperature = Cold) THEN PlayTennis = Yes
    3: IF (Outlook = sunny and Temperature = Mild and Humidity=High) THEN PlayTennis = No
    4: IF (Outlook = sunny and Temperature = Mild and Humidity=Normal) THEN PlayTennis = Yes
    5: IF (Outlook = overcast) THEN PlayTennis = Yes
    6: IF (Outlook = rain and Wind = Strong) THEN PlayTennis = No
    7: IF (Outlook = rain and Wind = Weak) THEN PlayTennis = Yes
Rule Post-Pruning (Step 3)

    3
                    1: IF (Outlook = sunny and Temperature = Hot) THEN PlayTennis = No




IF (Outlook = sunny and Temperature = Hot) THEN PlayTennis = No




IF (Outlook = sunny) THEN PlayTennis = No                                       Test Dataset
                                                                                 (Validation
                                                                                  examples)
IF (Temperature = Hot) THEN PlayTennis = No
Rule Post-Pruning (Step 3)

  3

IF (Outlook = sunny and Temperature = Hot) THEN PlayTennis = No
                                                                     Acc1


IF (Outlook = sunny) THEN PlayTennis = No
                                                    Acc2                       Test Dataset
                                                                                (Validation
                                                    Acc3                         examples)
IF (Temperature = Hot) THEN PlayTennis = No



                            If Acc3 > Acc1 and Acc3 > Acc2, rule 1

           1: IF (Outlook = sunny and Temperature = Hot) THEN PlayTennis = No

                     is replaced by the pruned rule

                     IF (Temperature = Hot) THEN PlayTennis = No
Rule Post-Pruning (Step 4)

 4

Sort the pruned rules R1 … R14 in descending order of their accuracy (Acc1 … Acc14) on the test dataset (validation examples), giving the sorted sequence S1 … S14 with

S1: Acc1 >= S2: Acc2 >= S3: Acc3 >= S4: Acc4 >= … >= S11: Acc11 >= S12: Acc12 >= S13: Acc13 >= S14: Acc14
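
A small sketch of steps 3–4, assuming rules of the hypothetical form (preconditions, conclusion), where preconditions is a list of (attribute, value) pairs and each validation example is a dict with a 'label' key:

def rule_accuracy(rule, examples):
    preconditions, conclusion = rule
    covered = [e for e in examples if all(e[a] == v for a, v in preconditions)]
    if not covered:
        return 0.0
    return sum(e['label'] == conclusion for e in covered) / len(covered)

def prune_rule(rule, validation):
    # Greedily drop any precondition whose removal strictly improves
    # accuracy on the validation examples (mirroring "Acc3 > Acc1 and Acc2").
    preconditions, conclusion = rule
    best = list(preconditions)
    improved = True
    while improved and best:
        improved = False
        for i in range(len(best)):
            candidate = best[:i] + best[i + 1:]
            if rule_accuracy((candidate, conclusion), validation) > \
               rule_accuracy((best, conclusion), validation):
                best, improved = candidate, True
                break
    return (best, conclusion)

def sort_rules(rules, validation):
    # Step 4: order the pruned rules by descending validation accuracy.
    return sorted(rules, key=lambda r: rule_accuracy(r, validation), reverse=True)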
Handling Continuous-Valued
         Attribute
Handling Continuous-Valued
         Attribute
Handling Continuous-Valued
         Attribute


We dynamically define new discrete-valued attributes that partition
the continuous attribute values into a discrete set of intervals.
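
A minimal sketch of one common way to do this (candidate thresholds at midpoints between adjacent examples of different classes, scored by information gain); this helper is illustrative, not the exact C4.5 procedure:

import math

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

def best_threshold(values, labels):
    # Returns the threshold t maximizing the gain of the derived boolean
    # attribute "value > t", e.g. Temperature > 54.
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = -1.0, None
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2:
            continue                      # only class boundaries are candidates
        t = (v1 + v2) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - (len(left) * entropy(left) +
                       len(right) * entropy(right)) / len(pairs)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain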
Alternative Measures for Selecting
            Attributes
  There is a natural bias in the information gain measure that
  favors attributes with many values over those with few values.

 Consider the attribute Date, which has a very large number of possible
 values (e.g., March 11, 2008).

 If we were to add this as an attribute to the data, it would have the
 highest information gain of any of the attributes. This is because Date
 alone perfectly predicts the target attribute over the training data. Thus,
 it would be selected as the decision attribute for the root node of the
 tree and lead to a (quite broad) tree of depth one, which perfectly
 classifies the training data.

 However, this decision tree would fare poorly on subsequent examples,
 because it is not a useful predictor despite the fact that it perfectly
 separates the training data.
Alternative Measures for Selecting
            Attributes

 What is wrong with the attribute Date?
    It has so many possible values that it is bound to separate the
    training examples into very small subsets. Because of this, it will
    have a very high information gain relative to the training
    examples, despite being a very poor predictor of the target
    function over unseen instances.

 One way to avoid this difficulty is to select decision attributes based
 on some measure other than information gain. One alternative
 measure that has been used successfully is the gain ratio (Quinlan
 1986). The gain ratio measure penalizes attributes such as Date by
 incorporating a term, called split information, that is sensitive to how
 broadly and uniformly the attribute splits the data.
Alternative Measures for Selecting
            Attributes




       GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

       SplitInformation(S, A) = − Σ_{i=1..c} (|Si| / |S|) · log2(|Si| / |S|)

 where S1 through Sc are the c subsets of examples resulting from
 partitioning S by the c-valued attribute A.

 SplitInformation is actually the entropy of S with respect to the values of
 attribute A. This is in contrast to our previous uses of entropy, in which we
 considered only the entropy of S with respect to the target attribute whose
 value is to be predicted by the learned tree.
Alternative Measures for Selecting
            Attributes




 The SplitInformation term discourages the selection of attributes with
 many uniformly distributed values.

 For example, consider a collection of n examples that are completely
 separated by attribute A (e.g., Date). In this case, the SplitInformation
 value will be log2 n. In contrast, a boolean attribute B that splits the same n
 examples exactly in half will have SplitInformation of 1. If attributes A and
 B produce the same information gain, then clearly B will score higher
 according to the Gain Ratio measure.
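
A short sketch of these two quantities in Python, assuming examples are dicts keyed by attribute name and that Gain(S, A) is computed as sketched earlier:

import math
from collections import Counter

def split_information(examples, attribute):
    # Entropy of S with respect to the values of attribute A.
    counts = Counter(e[attribute] for e in examples)
    n = len(examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(gain, examples, attribute):
    # GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A); guard against a
    # near-zero denominator, which occurs when nearly all examples share one value.
    si = split_information(examples, attribute)
    return gain / si if si > 0 else 0.0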
Handling Missing Attributes


In certain cases, the available data may be missing values for some
attributes. For example, in a medical domain in which we wish to
predict patient outcome based on various laboratory tests, it may be
that the lab test Blood-Test-Result is available only for a subset of
the patients. In such cases, it is common to estimate the missing
attribute value based on other examples for which this attribute has a
known value.
Handling Missing Attributes

• One strategy for dealing with a missing attribute value is to assign
it the value that is most common among the training examples at node n.

• Alternatively, we might assign it the most common value among the
examples at node n that have the classification c(x).

• A more complex procedure is to assign a probability to each of the
possible values of A rather than simply assigning the most common
value to A(x). These probabilities can be estimated based on the
observed frequencies of the various values for A among the examples
at node n. This method for handling missing attribute values is used
in C4.5 (Quinlan 1993).
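
A tiny sketch of the simplest strategy (most common observed value at the node); the probabilistic, fractional-example scheme of C4.5 is not shown:

from collections import Counter

def fill_most_common(examples, attribute):
    # Replace missing values (represented here as None) of the given attribute
    # with the most common value observed among the examples at this node.
    observed = [e[attribute] for e in examples if e[attribute] is not None]
    if not observed:
        return list(examples)
    most_common = Counter(observed).most_common(1)[0][0]
    return [dict(e, **{attribute: most_common}) if e[attribute] is None else e
            for e in examples]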
Handling Attributes with Different
              Cost

 In some learning tasks the instance attributes may have associated
 costs. For example, in learning to classify medical diseases we might
 describe patients in terms of attributes such as Temperature,
 BiopsyResult, Pulse, BloodTestResults, etc.

 These attributes vary significantly in their costs, both in terms of
 monetary cost and cost to patient comfort.

 In such tasks, we would prefer decision trees that use low-cost
 attributes where possible, relying on high-cost attributes only when
 needed to produce reliable classifications.
Handling Attributes with Different
              Cost

 ID3 can be modified to take into account attribute costs by
 introducing a cost term into the attribute selection measure. For
 example, we might divide the Gain by the cost of the attribute, so that
 lower-cost attributes would be preferred.

 However, although such cost-sensitive measures do not guarantee finding an
 optimal cost-sensitive decision tree, they do bias the search in favor
 of low-cost attributes.



                            Gain(S, A) / Cost(A)
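
A one-line sketch of attribute selection under this measure, assuming a gain_fn helper (hypothetical) and a dict of attribute costs:

def best_attribute_by_cost(examples, attributes, costs, gain_fn):
    # Choose the attribute maximizing Gain(S, A) / Cost(A).
    return max(attributes, key=lambda a: gain_fn(examples, a) / costs[a])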
Handling Attributes with Different
              Cost
 Tan and Schlimmer (1990) and Tan (1993) describe one such approach
 and apply it to a robot perception task in which the robot must learn to
 classify different objects according to how they can be grasped by the
 robot's manipulator. In this case the attributes correspond to different
 sensor readings obtained by a movable sonar on the robot.

 Attribute cost is measured by the number of seconds required to obtain
 the attribute value by positioning and operating the sonar. They
 demonstrate that more efficient recognition strategies are learned,
 without sacrificing classification accuracy, by replacing the information
 gain attribute selection measure with the following measure:

                            Gain²(S, A) / Cost(A)
Handling Attributes with Different
              Cost

Nunez (1988) describes a related approach and its application to
learning medical diagnosis rules. Here the attributes are different
symptoms and laboratory tests with differing costs. His system uses a
somewhat different attribute selection measure:

                  (2^Gain(S, A) − 1) / (Cost(A) + 1)^w

where w ∈ [0, 1] is a constant that determines the relative importance
of cost versus information gain.

More Related Content

What's hot

Artificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rulesArtificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rulesMohammed Bennamoun
 
Presentation on supervised learning
Presentation on supervised learningPresentation on supervised learning
Presentation on supervised learningTonmoy Bhagawati
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.butest
 
Hough Transform By Md.Nazmul Islam
Hough Transform By Md.Nazmul IslamHough Transform By Md.Nazmul Islam
Hough Transform By Md.Nazmul IslamNazmul Islam
 
Feature Extraction
Feature ExtractionFeature Extraction
Feature Extractionskylian
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and RegressionMegha Sharma
 
Evaluating hypothesis
Evaluating  hypothesisEvaluating  hypothesis
Evaluating hypothesisswapnac12
 
Machine learning and decision trees
Machine learning and decision treesMachine learning and decision trees
Machine learning and decision treesPadma Metta
 
Machine Learning
Machine LearningMachine Learning
Machine LearningShrey Malik
 
Machine learning Lecture 2
Machine learning Lecture 2Machine learning Lecture 2
Machine learning Lecture 2Srinivasan R
 
Andrew NG machine learning
Andrew NG machine learningAndrew NG machine learning
Andrew NG machine learningShareDocView.com
 
Machine Learning lecture6(regularization)
Machine Learning lecture6(regularization)Machine Learning lecture6(regularization)
Machine Learning lecture6(regularization)cairo university
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree LearningMilind Gokhale
 
Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning Chandra Meena
 
Probabilistic Reasoning
Probabilistic ReasoningProbabilistic Reasoning
Probabilistic ReasoningJunya Tanaka
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationSara Hooker
 

What's hot (20)

Artificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rulesArtificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rules
 
Presentation on supervised learning
Presentation on supervised learningPresentation on supervised learning
Presentation on supervised learning
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.
 
Hough Transform By Md.Nazmul Islam
Hough Transform By Md.Nazmul IslamHough Transform By Md.Nazmul Islam
Hough Transform By Md.Nazmul Islam
 
Neural networks
Neural networksNeural networks
Neural networks
 
Feature Extraction
Feature ExtractionFeature Extraction
Feature Extraction
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and Regression
 
Evaluating hypothesis
Evaluating  hypothesisEvaluating  hypothesis
Evaluating hypothesis
 
Machine learning and decision trees
Machine learning and decision treesMachine learning and decision trees
Machine learning and decision trees
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Machine learning Lecture 2
Machine learning Lecture 2Machine learning Lecture 2
Machine learning Lecture 2
 
Andrew NG machine learning
Andrew NG machine learningAndrew NG machine learning
Andrew NG machine learning
 
Machine Learning lecture6(regularization)
Machine Learning lecture6(regularization)Machine Learning lecture6(regularization)
Machine Learning lecture6(regularization)
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
Concept learning
Concept learningConcept learning
Concept learning
 
Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning
 
Scaling and Normalization
Scaling and NormalizationScaling and Normalization
Scaling and Normalization
 
Neural Networks
Neural NetworksNeural Networks
Neural Networks
 
Probabilistic Reasoning
Probabilistic ReasoningProbabilistic Reasoning
Probabilistic Reasoning
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and Evaluation
 

Viewers also liked

Machine learning Lecture 4
Machine learning Lecture 4Machine learning Lecture 4
Machine learning Lecture 4Srinivasan R
 
Zeromq - Pycon India 2013
Zeromq - Pycon India 2013Zeromq - Pycon India 2013
Zeromq - Pycon India 2013Srinivasan R
 
Machine learning Lecture 1
Machine learning Lecture 1Machine learning Lecture 1
Machine learning Lecture 1Srinivasan R
 
Machine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision TreesMachine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision Treesananth
 
Generative Adversarial Networks and Their Applications
Generative Adversarial Networks and Their ApplicationsGenerative Adversarial Networks and Their Applications
Generative Adversarial Networks and Their ApplicationsArtifacia
 

Viewers also liked (6)

Candidate elimination example
Candidate elimination exampleCandidate elimination example
Candidate elimination example
 
Machine learning Lecture 4
Machine learning Lecture 4Machine learning Lecture 4
Machine learning Lecture 4
 
Zeromq - Pycon India 2013
Zeromq - Pycon India 2013Zeromq - Pycon India 2013
Zeromq - Pycon India 2013
 
Machine learning Lecture 1
Machine learning Lecture 1Machine learning Lecture 1
Machine learning Lecture 1
 
Machine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision TreesMachine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision Trees
 
Generative Adversarial Networks and Their Applications
Generative Adversarial Networks and Their ApplicationsGenerative Adversarial Networks and Their Applications
Generative Adversarial Networks and Their Applications
 

Recently uploaded

ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptshraddhaparab530
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxRosabel UA
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...JojoEDelaCruz
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataBabyAnnMotar
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptxmary850239
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSMae Pangan
 
Dust Of Snow By Robert Frost Class-X English CBSE
Dust Of Snow By Robert Frost Class-X English CBSEDust Of Snow By Robert Frost Class-X English CBSE
Dust Of Snow By Robert Frost Class-X English CBSEaurabinda banchhor
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Millenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxMillenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxJanEmmanBrigoli
 

Recently uploaded (20)

ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptxINCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptx
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped data
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHS
 
Dust Of Snow By Robert Frost Class-X English CBSE
Dust Of Snow By Robert Frost Class-X English CBSEDust Of Snow By Robert Frost Class-X English CBSE
Dust Of Snow By Robert Frost Class-X English CBSE
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Millenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxMillenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptx
 

Machine learning Lecture 3

  • 1. Lecture No. 3 Ravi Gupta AU-KBC Research Centre, MIT Campus, Anna University Date: 12.3.2008
  • 2. Today’s Agenda • Recap of ID3 Algorithm • Machine Learning Bias • Occam’s razor principle • Handling ID3 problems
  • 3. Decision Trees • Decision tree learning is a method for approximating discrete value target functions, in which the learned function is represented by a decision tree. • Decision trees can also be represented by if-then-else rule. • Decision tree learning is one of the most widely used approach for inductive inference .
  • 4. Decision Trees Edges: Attribute value Intermediate Nodes: Attributes Attribute: A1 Attribute Attribute value Attribute value value Attribute: A2 Output Attribute: A3 value Attribute Attribute Attribute Attribute value value value value Output Output Output Output value value value value Leave node: Output value
  • 5. Decision Trees Representation conjunction disjunction
  • 6. Decision Trees as If-then-else rule conjunction disjunction •If (Outlook = Sunny AND humidity = Normal) then PlayTennis = Yes •If (Outlook = Overcast) then PlayTennis = Yes •If (Outlook = Rain AND Wind = Weak) then PlayTennis = Yes
  • 7. Problems Suitable for Decision Trees • Instances are represented by attribute-value pairs • The target function has discrete output values • Disjunctive descriptions may be required • The training data may contain errors • The training data may contain missing attribute values
  • 8. Building Decision Tree Attribute: A1 Attribute value Attribute value Attribute value Output value Attribute: A2 Attribute: A3 Attribute value Attribute value Attribute value Attribute value Output value Output value Output value Output value
  • 9. Building Decision Tree Outlook Temperature Which attribute to select ????? Humidity Wind Root node
  • 10. Entropy Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification (yes/no) is where is the proportion of positive examples in S and pӨ, is the proportion of negative examples in S. In all calculations involving entropy we define 0 log 0 to be 0.
  • 11. Information Gain Measure Information gain, is simply the expected reduction in entropy caused by partitioning the examples according to this attribute. More precisely, the information gain, Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as where Values(A) is the set of all possible values for attribute A, and Sv, is the subset of S for which attribute A has value v, i.e.,
  • 12. Information Gain Measure Entropy of S after Entropy of S partition Gain(S, A) is the expected reduction in entropy caused by knowing the value of attribute A. Gain(S, A) is the information provided about the target &action value, given the value of some other attribute A. The value of Gain(S, A) is the number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of attribute A.
  • 13. Example There are 14 examples. 9 positive and 5 negative examples [9+, 5-]. The entropy of S relative to this boolean (yes/no) classification is
  • 16. Some Insights into Capabilities and Limitations of ID3 Algorithm • ID3’s algorithm searches complete hypothesis space. [Advantage] • ID3 maintain only a single current hypothesis as it searches through the space of decision trees. By determining only as single hypothesis, ID3 loses the capabilities that follows explicitly representing all consistent hypothesis. [Disadvantage] • ID3 in its pure form performs no backtracking in its search. Once it selects an attribute to test at a particular level in the tree, it never backtracks to reconsider this choice. Therefore, it is susceptible to the usual risks of hill-climbing search without backtracking: converging to locally optimal solutions that are not globally optimal. [Disadvantage]
  • 17. Some Insights into Capabilities and Limitations of ID3 Algorithm • ID3 uses all training examples at each step in the search to make statistically based decisions regarding how to refine its current hypothesis. This contrasts with methods that make decisions incrementally, based on individual training examples (e.g., FIND-S or CANDIDATE-ELIMINATION). One advantage of using statistical properties of all the examples (e.g., information gain) is that the resulting search is much less sensitive to errors in individual training examples. [Advantage]
  • 18. Machine Learning Biases • Language Bias/Restriction Bias: Restriction on the type of hypothesis to be learned. (Limits the set of hypothesis to be learned/expressed). • Preference Bias/Search Bias: A preference for certain hypothesis over others (e.g., shorter hypothesis), with no hard restriction on the hypothesis space.
  • 20. CANDIDATE-ELIMINATION Algorithm Hypothesis was assumed to be conjunction of Attributes
  • 21. CANDIDATE-ELIMINATION Algorithm Candidate-Elimination algorithm is Language biased
  • 22. CANDIDATE-ELIMINATION Algorithm The problem is the algorithm considers (biased) only conjunctive space. The following example requires a more expressive hypothesis space
  • 23. Building Decision Tree Attribute: A1 Attribute value Attribute value Attribute value Output value Attribute: A2 Attribute: A3 Attribute value Attribute value Attribute value Attribute value Output value Output value Output value Output value
  • 24. Decision Tree ID3 algorithm has Preference/Search Bias
  • 25. ID3 Strategy for Selecting Hypothesis • Selects trees that place the attributes with highest information gain closest to the root. • Selects in favor of shorter trees over longer ones.
  • 26. Preference Bias or Restriction Bias ? A preference bias is more desirable than a restriction bias, because it allows the learner to work within a complete hypothesis space that is assured to contain the unknown target function. In contrast, a restriction bias that strictly limits the set of potential hypotheses is generally less desirable, because it introduces the possibility of excluding the unknown target function altogether.
  • 27. Preference Bias or Restriction Bias ? ID3 exhibits a purely preference bias and CANDIDATE-ELIMINATION a purely restriction bias, some learning systems combine both.
  • 28. Preference Bias AND Restriction Bias ?
  • 29. Preference Bias AND Restriction Bias ? • Task T: playing checkers • Performance measure P: % of games won in the world tournament • Training experience E: games played against itself • Target function: F : Board → R • Target function representation F'(b) = w0 + w1x1+ w2x2 + w3x3 + w4x4 + w5x5 + w6x6 A linear combination of variables (Language Bias/Restriction Bias)
  • 30. Preference Bias AND Restriction Bias ? E(Error) ≡ ∑ < b , Ftrain ( b ) >∈ training examples (Ftrain (b) − F '(b)) 2 Preference Bias (Because weights are found based on Least Mean Square technique)
  • 31. Issues in Decision Tree Learning • Determining how deeply to grow the decision tree • Handling continuous attributes • Choosing an appropriate attribute • Selection measure • Handling training data with missing attribute values • Handling attributes with differing costs, and improving computational efficiency
  • 32. Occam’s Razor Occam's razor (sometimes spelled Ockham's razor) is a principle attributed to the 14th- century English logician and Franciscan friar William of Ockham. The principle states that the explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference in the observable predictions of the explanatory hypothesis or theory.
  • 33. Occam’s Razor This is often paraphrased as quot;All other things being equal, the simplest solution is the best.quot; In other words, when multiple competing theories are equal in other respects, the principle recommends selecting the theory that introduces the fewest assumptions and postulates the fewest entities. It is in this sense that Occam's razor is usually understood. Prefer the simplest hypothesis that fits the data
  • 34. Why it’s called Occam’s Razor Tom M. Mitchell say’s…. Occam got this idea during shaving Wikipedia say’s….. The term razor refers to the act of shaving away unnecessary assumptions to get to the simplest explanation.
  • 35. ID3 Strategy for Selecting Hypothesis • Selects trees that place the attributes with highest information gain closest to the root. • Selects in favor of shorter trees over longer ones.
  • 36. Problem with Occam’s Razor Why should simplest hypothesis that fits the data is best solution. Why not second simplest or third simplest hypothesis. The size of a hypothesis is determined by the particular representation used internally by the learner. Two learners using different internal representations could therefore arrive at different hypotheses, both justifying their contradictory conclusions by Occam's razor!
  • 37. Training and Testing For classification problems, a classifier’s performance is measured in terms of the error rate. The classifier predicts the class of each instance: if it is correct, that is counted as a success; if not, it is an error. The error rate is just the proportion of errors made over a whole set of instances, and it measures the overall performance of the classifier.
  • 38. Training and Testing We are interested in is the likely future performance on new data, not the past performance on old data. We already know the classifications of each instance in the training set, which after all is why we can use it for training. We are not generally interested in learning about those classifications—although we might be if our purpose is data cleansing rather than prediction. So the question is, is the error rate on old data likely to be a good indicator of the error rate on new data? The answer is a resounding no—not if the old data was used during the learning process to train the classifier.
  • 39. Training and Testing Error rate on the training set is not likely to be a good indicator of future performance.
  • 40. Training and Testing Self-consistency Test: When training and test dataset are same The error rate on the training data is called the resubstitution error, because it is calculated by resubstituting the training instances into a classifier that was constructed from them.
  • 41. Training and Testing Hold out Strategy: Holdout method reserves a certain amount for testing and uses the remainder for training (and sets part of that aside for validation, if required). In practical scenario we have limited number of example with us…….
  • 42. Training and Testing K-fold Cross validation technique: In the k-fold cross-validation, the dataset was partitioned randomly into k equal-sized sets. The training and testing of each classifier were carried out k times using one distinct set for testing and other k-1 sets for training.
  • 44. 4-Fold Cross-validation ACC1 Test Dataset Training Dataset
  • 45. 4-Fold Cross-validation ACC2 Test Dataset Training Dataset
  • 46. 4-Fold Cross-validation ACC3 Test Dataset Training Dataset
  • 47. 4-Fold Cross-validation ACC4 Test Dataset Training Dataset
  • 48. 4-Fold Cross-validation ACC = (ACC1 + ACC2 + ACC3 + ACC4) / 4
  • 49. Issues in Decision Tree Learning • Determining how deeply to grow the decision tree • Handling continuous attributes • Choosing an appropriate attribute • Selection measure • Handling training data with missing attribute values • Handling attributes with differing costs, and improving computational efficiency
  • 50. Avoiding Overfitting in Decision Trees….. • A hypothesis is said to be over-fitting the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e., including instances beyond the training set).
  • 52. Overfitting Negative Positive example example
  • 54. Overfitting h1 is more accurate h1 than h2 on the training h2 examples
  • 55. Overfitting h1 is less accurate h1 than h2 on the unseen h2 (test) examples
  • 56. Overfitting Is h1 more accurate than h2 on training examples no yes Is h1 more accurate Is h1 more accurate than h2 on test than h2 on test examples examples yes No No yes No over-fitting Over-fitting No over-fitting Over-fitting
  • 57. Overfitting Overfitting in decision tree learning. As ID3 adds new nodes to grow the decision tree, the accuracy of the tree measured over the training examples increases monotonically. However, when measured over a set of test examples independent of the training examples, accuracy first increases, then decreases.
• 59. Why Overfitting Happens in Decision Tree Learning? • Presence of errors in the training examples (an issue for machine learning in general). • When a small number of examples is associated with a leaf node.
• 60-63. Presence of Error and Over-fitting [Diagrams: after an erroneous training example is added, ID3 grows a more complex tree, and the tree depth increases, in order to fit the error.]
  • 64. How to avoid Overfitting… • Stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data • Allow the tree to overfit the data, and then post-prune the tree.
  • 65. How to avoid Overfitting… • Post-pruning overfit trees has been found to be more successful in practice. This is due to the difficulty in the first approach of estimating precisely when to stop growing the tree.
• 66. How to avoid Overfitting… • Regardless of whether the correct tree size is found by stopping early or by post-pruning, a key question is what criterion is to be used to determine the correct final tree size.
• 67. Determining correct final tree size • Use a separate set of examples, distinct from those used for training, to evaluate the utility of pruning (the training-and-validation-set approach). • Use all the available data for training, but apply a statistical test (e.g., the chi-square test) to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set. • Use an explicit measure of the complexity for encoding the training examples and the decision tree, halting growth of the tree when this encoding size is minimized. This approach is based on a heuristic called the Minimum Description Length (MDL) principle.
  • 68. Pruning Methods • Reduced-error pruning (Quinlan 1987) • Rule post-pruning (Quinlan 1993)
• 69. Reduced Error Pruning • Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node. • Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set.
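The following is a minimal sketch of this idea on a toy tree encoding, not the exact algorithm of Quinlan (1987): a node is either a leaf label or a tuple (attribute index, branches, majority class), and a subtree is collapsed to its majority-class leaf whenever the leaf does no worse on the validation examples that reach that node. The tree and validation data are invented for illustration.

    # Hedged sketch: reduced-error pruning on a toy tree structure.
    # A node is either a leaf (a class label) or a tuple:
    #   (attribute_index, {attribute_value: subtree}, majority_class_at_node)
    def classify(node, x):
        if not isinstance(node, tuple):                 # leaf
            return node
        attr, branches, majority = node
        child = branches.get(x[attr])
        return classify(child, x) if child is not None else majority

    def accuracy(node, data):
        return sum(classify(node, x) == y for x, y in data) / len(data)

    def reduced_error_prune(node, val_data):
        """Bottom-up: each subtree sees only the validation examples that reach it,
        and is replaced by a majority-class leaf when the leaf does no worse."""
        if not isinstance(node, tuple) or not val_data:
            return node                                 # keep nodes no validation data reaches
        attr, branches, majority = node
        new_branches = {value: reduced_error_prune(child,
                            [(x, y) for x, y in val_data if x[attr] == value])
                        for value, child in branches.items()}
        pruned = (attr, new_branches, majority)
        return majority if accuracy(majority, val_data) >= accuracy(pruned, val_data) else pruned

    # Toy tree: attribute 0 is useful, but the subtree under value "b" fits noise.
    tree = (0, {"a": "Yes",
                "b": (1, {"x": "No", "y": "Yes"}, "No")},
            "Yes")
    validation = [(("a", "x"), "Yes"), (("b", "x"), "No"), (("b", "y"), "No")]
    print(reduced_error_prune(tree, validation))        # the "b" subtree collapses to the leaf "No"

Because pruning a node only changes the predictions for the examples routed to it, comparing accuracies on that subset is equivalent to comparing whole-tree accuracy over the full validation set.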
• 72. Drawback of Training and Validation Method Using a separate set of data to guide pruning is an effective approach provided a large amount of data is available. The major drawback of this approach is that when data is limited, withholding part of it for the validation set reduces even further the number of examples available for training.
• 73. Rule Post-Pruning In practice, rule post-pruning is one quite successful method for finding high-accuracy hypotheses. It proceeds in four steps: (1) grow the decision tree until the training data are fit as well as possible, allowing overfitting; (2) convert the tree into an equivalent set of rules, one per path from the root to a leaf; (3) prune each rule by removing any precondition whose removal does not worsen its estimated accuracy; (4) sort the pruned rules by their estimated accuracy and consider them in that order when classifying new instances.
• 75. Rule Post-Pruning (Step 2)
1: IF (Outlook = sunny and Temperature = Hot) THEN PlayTennis = No
2: IF (Outlook = sunny and Temperature = Cold) THEN PlayTennis = Yes
3: IF (Outlook = sunny and Temperature = Mild and Humidity = High) THEN PlayTennis = No
4: IF (Outlook = sunny and Temperature = Mild and Humidity = Normal) THEN PlayTennis = Yes
5: IF (Outlook = overcast) THEN PlayTennis = Yes
6: IF (Outlook = rain and Wind = Strong) THEN PlayTennis = No
7: IF (Outlook = rain and Wind = Weak) THEN PlayTennis = Yes
• 76. Rule Post-Pruning (Step 3) Consider rule 1: IF (Outlook = sunny and Temperature = Hot) THEN PlayTennis = No. Its candidate prunings each drop one precondition: IF (Outlook = sunny) THEN PlayTennis = No, and IF (Temperature = Hot) THEN PlayTennis = No. Each candidate is evaluated on the test dataset (validation examples).
• 77. Rule Post-Pruning (Step 3) The original rule and its pruned candidates are scored on the test dataset (validation examples): IF (Outlook = sunny and Temperature = Hot) THEN PlayTennis = No gets Acc1, IF (Outlook = sunny) THEN PlayTennis = No gets Acc2, and IF (Temperature = Hot) THEN PlayTennis = No gets Acc3. If Acc3 > Acc2 and Acc3 > Acc1, rule 1 is replaced by IF (Temperature = Hot) THEN PlayTennis = No.
• 78. Rule Post-Pruning (Step 4) The pruned rules R1: Acc1, R2: Acc2, ..., R14: Acc14 are sorted in descending order of their accuracy on the test dataset (validation examples), so that Acc1 >= Acc2 >= Acc3 >= ... >= Acc13 >= Acc14; a small code sketch of the whole procedure follows.
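As a rough illustration of steps 2-4 (simplified, and not Quinlan's exact estimate of rule accuracy), the sketch below represents each rule as a list of (attribute, value) preconditions plus a predicted class, greedily drops any precondition whose removal does not lower the rule's accuracy on the validation examples, and finally sorts the pruned rules by that accuracy. The rules, attribute indices, and validation examples are made up.

    # Hedged sketch: rule post-pruning on hand-written rules.
    def rule_matches(preconds, x):
        return all(x[attr] == value for attr, value in preconds)

    def rule_accuracy(preconds, label, val_data):
        covered = [(x, y) for x, y in val_data if rule_matches(preconds, x)]
        if not covered:
            return 0.0                                   # a rule covering nothing gets no credit
        return sum(y == label for _, y in covered) / len(covered)

    def post_prune_rule(preconds, label, val_data):
        """Greedily drop any single precondition whose removal does not lower
        the rule's accuracy on the validation examples."""
        best = list(preconds)
        best_acc = rule_accuracy(best, label, val_data)
        improved = True
        while improved and best:
            improved = False
            for i in range(len(best)):
                candidate = best[:i] + best[i + 1:]      # rule with one precondition removed
                acc = rule_accuracy(candidate, label, val_data)
                if acc >= best_acc:
                    best, best_acc, improved = candidate, acc, True
                    break
        return best, best_acc

    # Attribute 0 = Outlook, attribute 1 = Temperature (made-up validation data).
    rules = [([(0, "Sunny"), (1, "Hot")], "No"),
             ([(0, "Overcast")], "Yes")]
    validation = [(("Sunny", "Hot"), "No"), (("Sunny", "Mild"), "No"), (("Overcast", "Hot"), "Yes")]

    pruned = [(post_prune_rule(preconds, label, validation), label) for preconds, label in rules]
    pruned.sort(key=lambda item: item[0][1], reverse=True)   # Step 4: sort by validation accuracy
    for (preconds, acc), label in pruned:
        print(preconds, "->", label, f"(accuracy = {acc:.2f})")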
• 81. Handling Continuous-Valued Attribute We dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
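A common way to do this is to sort the examples by the continuous attribute, place a candidate threshold midway between adjacent examples whose class labels differ, and keep the threshold with the highest information gain, turning the attribute into a boolean test A < t. A minimal sketch, with made-up Temperature values and my own function names:

    # Hedged sketch: choosing a threshold for a continuous attribute by information gain.
    from collections import Counter
    from math import log2

    def entropy(labels):
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

    def best_threshold(values, labels):
        """Candidate thresholds sit between adjacent sorted values whose class
        labels differ; return the (threshold, information gain) pair that is best."""
        pairs = sorted(zip(values, labels))
        base = entropy(labels)
        best = (None, -1.0)
        for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
            if y1 == y2:
                continue
            t = (v1 + v2) / 2
            left = [y for v, y in pairs if v < t]
            right = [y for v, y in pairs if v >= t]
            gain = base - (len(left) / len(pairs)) * entropy(left) \
                        - (len(right) / len(pairs)) * entropy(right)
            if gain > best[1]:
                best = (t, gain)
        return best

    temps = [40, 48, 60, 72, 80, 90]                    # made-up Temperature readings
    labels = ["No", "No", "Yes", "Yes", "Yes", "No"]    # corresponding PlayTennis values
    print(best_threshold(temps, labels))                # picks the threshold 54.0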
• 82. Alternative Measures for Selecting Attributes There is a natural bias in the information gain measure that favors attributes with many values over those with few values. Consider the attribute Date, which has a very large number of possible values (e.g., March 11, 2008). If we were to add this as an attribute to the data, it would have the highest information gain of any of the attributes. This is because Date alone perfectly predicts the target attribute over the training data. Thus, it would be selected as the decision attribute for the root node of the tree and lead to a (quite broad) tree of depth one, which perfectly classifies the training data. However, this decision tree would fare poorly on subsequent examples, because it is not a useful predictor despite the fact that it perfectly separates the training data.
  • 83. Alternative Measures for Selecting Attributes What is wrong with the attribute Date? It has so many possible values that it is bound to separate the training examples into very small subsets. Because of this, it will have a very high information gain relative to the training examples, despite being a very poor predictor of the target function over unseen instances. One way to avoid this difficulty is to select decision attributes based on some measure other than information gain. One alternative measure that has been used successfully is the gain ratio (Quinlan 1986). The gain ratio measure penalizes attributes such as Date by incorporating a term, called split information, that is sensitive to how broadly and uniformly the attribute splits the data.
• 84. Alternative Measures for Selecting Attributes GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A), where SplitInformation(S, A) = - sum over i = 1..c of (|Si| / |S|) * log2(|Si| / |S|).
• 85. Alternative Measures for Selecting Attributes where S1 through Sc are the c subsets of examples resulting from partitioning S by the c-valued attribute A. SplitInformation is actually the entropy of S with respect to the values of attribute A. This is in contrast to our previous uses of entropy, in which we considered only the entropy of S with respect to the target attribute whose value is to be predicted by the learned tree.
• 86. Alternative Measures for Selecting Attributes The SplitInformation term discourages the selection of attributes with many uniformly distributed values. For example, consider a collection of n examples that are completely separated by attribute A (e.g., Date). In this case, the SplitInformation value will be log2 n. In contrast, a boolean attribute B that splits the same n examples exactly in half will have SplitInformation of 1. If attributes A and B produce the same information gain, then clearly B will score higher according to the GainRatio measure.
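A short sketch of SplitInformation and GainRatio (function names and data are my own): a Date-like attribute that separates every example and a boolean attribute both achieve the same information gain on this made-up dataset, but the boolean attribute wins under the gain ratio, as the slide argues.

    # Hedged sketch: SplitInformation and GainRatio for discrete attributes.
    from collections import Counter
    from math import log2

    def entropy(labels):
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

    def split_information(values):
        """Entropy of S with respect to the values of the attribute itself."""
        total = len(values)
        return -sum((c / total) * log2(c / total) for c in Counter(values).values())

    def gain_ratio(values, labels):
        gain = entropy(labels) - sum(
            (count / len(values)) *
            entropy([y for v, y in zip(values, labels) if v == value])
            for value, count in Counter(values).items())
        si = split_information(values)
        return gain / si if si > 0 else 0.0             # guard against single-valued attributes

    labels = ["Yes", "Yes", "No", "No", "Yes", "No", "Yes", "No"]
    date = [f"d{i}" for i in range(8)]                  # a Date-like attribute: unique per example
    wind = ["Weak", "Weak", "Strong", "Strong", "Weak", "Strong", "Weak", "Strong"]

    print("Date:", split_information(date), gain_ratio(date, labels))   # SplitInformation = log2(8) = 3
    print("Wind:", split_information(wind), gain_ratio(wind, labels))   # SplitInformation = 1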
  • 87. Handling Missing Attributes In certain cases, the available data may be missing values for some attributes. For example, in a medical domain in which we wish to predict patient outcome based on various laboratory tests, it may be that the lab test Blood-Test-Result is available only for a subset of the patients. In such cases, it is common to estimate the missing attribute value based on other examples for which this attribute has a known value.
• 88. Handling Missing Attributes • One strategy for dealing with the missing attribute value is to assign it the value that is most common among training examples at node n. • Alternatively, we might assign it the most common value among examples at node n that have the classification c(x). • A more complex procedure is to assign a probability to each of the possible values of A rather than simply assigning the most common value to A(x). These probabilities can be estimated based on the observed frequencies of the various values for A among the examples at node n. This method for handling missing attribute values is used in C4.5 (Quinlan 1993).
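The three strategies can be sketched on a toy set of examples at a single node; the attribute, class labels, and frequencies below are invented for illustration.

    # Hedged sketch: three ways to handle a missing attribute value at a node.
    from collections import Counter

    # (Outlook, PlayTennis) pairs at some node n; None marks a missing Outlook value.
    examples = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"),
                ("Rain", "Yes"), ("Rain", "Yes"), (None, "Yes")]
    known = [(a, y) for a, y in examples if a is not None]

    # 1. Most common value of the attribute among all examples at the node.
    most_common = Counter(a for a, _ in known).most_common(1)[0][0]

    # 2. Most common value among examples at the node with the same class
    #    as the incomplete example (here PlayTennis = Yes).
    most_common_same_class = Counter(a for a, y in known if y == "Yes").most_common(1)[0][0]

    # 3. Fractional counts: spread the incomplete example over the attribute's
    #    values with probabilities estimated from the observed frequencies
    #    (the C4.5-style treatment).
    total = len(known)
    fractions = {a: count / total for a, count in Counter(a for a, _ in known).items()}

    print(most_common, most_common_same_class, fractions)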
  • 89. Handling Attributes with Different Cost In some learning tasks the instance attributes may have associated costs. For example, in learning to classify medical diseases we might describe patients in terms of attributes such as Temperature, BiopsyResult, Pulse, BloodTestResults, etc. These attributes vary significantly in their costs, both in terms of monetary cost and cost to patient comfort. In such tasks, we would prefer decision trees that use low-cost attributes where possible, relying on high-cost attributes only when needed to produce reliable classifications.
• 90. Handling Attributes with Different Cost ID3 can be modified to take into account attribute costs by introducing a cost term into the attribute selection measure. For example, we might divide the Gain by the cost of the attribute, so that lower-cost attributes would be preferred: Gain(S, A) / Cost(A). Such cost-sensitive measures do not guarantee finding an optimal cost-sensitive decision tree, but they do bias the search in favor of low-cost attributes.
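A minimal sketch of this cost-sensitive selection, with invented attributes, costs, and patient data: the expensive BiopsyResult test has the higher raw information gain, but the cheap Temperature test wins once Gain(S, A) is divided by Cost(A).

    # Hedged sketch: dividing information gain by attribute cost when selecting the test.
    from collections import Counter
    from math import log2

    def entropy(labels):
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

    def info_gain(values, labels):
        return entropy(labels) - sum(
            (count / len(values)) *
            entropy([y for v, y in zip(values, labels) if v == value])
            for value, count in Counter(values).items())

    # Made-up patient data: an expensive test with perfect gain vs. a cheap, imperfect one.
    labels      = ["Sick", "Sick", "Healthy", "Healthy", "Sick", "Healthy"]
    temperature = ["High", "High", "Normal", "Normal", "High", "High"]   # cheap test
    biopsy      = ["Pos", "Pos", "Neg", "Neg", "Pos", "Neg"]             # expensive test
    attributes  = {"Temperature": (temperature, 1.0), "BiopsyResult": (biopsy, 50.0)}

    scores = {name: info_gain(values, labels) / cost
              for name, (values, cost) in attributes.items()}
    print(scores)
    print("selected attribute:", max(scores, key=scores.get))            # Temperature wins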
• 91. Handling Attributes with Different Cost Tan and Schlimmer (1990) and Tan (1993) describe one such approach and apply it to a robot perception task in which the robot must learn to classify different objects according to how they can be grasped by the robot's manipulator. In this case the attributes correspond to different sensor readings obtained by a movable sonar on the robot. Attribute cost is measured by the number of seconds required to obtain the attribute value by positioning and operating the sonar. They demonstrate that more efficient recognition strategies are learned, without sacrificing classification accuracy, by replacing the information gain attribute selection measure with Gain(S, A)^2 / Cost(A).
• 92. Handling Attributes with Different Cost Nunez (1988) describes a related approach and its application to learning medical diagnosis rules. Here the attributes are different symptoms and laboratory tests with differing costs. His system uses a somewhat different attribute selection measure, (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, where w in [0, 1] is a constant that determines the relative importance of cost versus information gain.