1. Enlarging the Margins in Perceptron Decision Trees
Kristin P. Bennett, Donghui Wu
Department of Mathematical Sciences
Rensselaer Polytechnic Institute
bennek@rpi.edu, wud2@rpi.edu
Nello Cristianini
Dept of Engineering Mathematics
University of Bristol
nello.cristianini@bristol.ac.uk

John Shawe-Taylor
Dept of Computer Science
Royal Holloway, University of London
jst@dcs.rhbnc.ac.uk
2. Perceptron Decision Trees
Definition 0.1 Perceptron Decision Trees (PDTs) are decision trees in which each internal node is associated with a hyperplane in general position in the input space, i.e. the decision at a node is based on a linear combination of the attributes instead of a single attribute. (A traversal sketch follows the figure below.)
[Figure: a PDT with root node w1 and child nodes w2, w3 (left), and the corresponding partition of a two-class sample (x, o) by the hyperplanes w1, w2, w3 (right).]
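A PDT classifies a point by following the linear tests from the root to a leaf. Below is a minimal Python sketch of that traversal; the Node structure and the convention that w^T x − b ≥ 0 sends a point to the right child are illustrative assumptions of ours, not code from the paper.

import numpy as np

class Node:
    """One PDT node: internal nodes carry a hyperplane (w, b); leaves carry a label."""
    def __init__(self, w=None, b=0.0, left=None, right=None, label=None):
        self.w, self.b = w, b              # decision hyperplane w^T x - b = 0
        self.left, self.right = left, right
        self.label = label                 # class label, set only at leaves

def pdt_classify(node, x):
    """Follow the linear tests from the root to a leaf and return its label."""
    while node.label is None:
        # the test combines all attributes linearly, not a single attribute
        node = node.right if np.dot(node.w, x) - node.b >= 0 else node.left
    return node.label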
3. Which Decision Is Better?
– Capacity Control of Linear Classifier
Construct a linear classifier f(x) = w^T x − b, i.e. find a vector w ∈ R^n and a scalar b such that
f(x_i) = w^T x_i − b ≥ 1 if y_i = 1,
f(x_i) = w^T x_i − b ≤ −1 if y_i = −1,
i = 1, . . . , m.
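The constraints require every training point to have functional margin at least 1, so the geometric margin is 1/||w||. A small check of these conditions, with numpy and helper names of our own choosing:

import numpy as np

def satisfies_margin_constraints(w, b, X, y):
    """True iff w^T x_i - b >= 1 whenever y_i = +1 and <= -1 whenever y_i = -1,
    written compactly as y_i (w^T x_i - b) >= 1 for all i."""
    return bool(np.all(y * (X @ w - b) >= 1.0))

def geometric_margin(w):
    """Under the constraints above, the margin of the classifier is 1/||w||."""
    return 1.0 / np.linalg.norm(w)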
4. Large Margin PDTs Generalize Better
Theorem 0.1 Suppose we are able to classify an m sample of labeled examples using a perceptron decision tree, and suppose that the tree obtained contains K decision nodes with margin γ_i at node i. Then we can bound the generalization error with probability greater than 1 − δ to be less than

(130 R^2 / m) · ( D log(4em) log(4m) + log( (4m)^(K+1) binom(2K, K) / ((K+1) δ) ) )

where D = Σ_{i=1}^{K} 1/γ_i^2 and binom(2K, K) denotes the binomial coefficient.
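The bound is easy to evaluate numerically. The sketch below is a plain transcription of the formula as reconstructed above; it assumes, as is standard for such bounds, that R is the radius of a ball containing the data.

import math

def pdt_error_bound(margins, R, m, delta):
    """Evaluate the Theorem 0.1 bound for a PDT with K = len(margins)
    decision nodes, margins gamma_i, data radius R and sample size m."""
    K = len(margins)
    D = sum(1.0 / g**2 for g in margins)        # D = sum_{i=1}^K 1/gamma_i^2
    log_term = math.log((4 * m) ** (K + 1) * math.comb(2 * K, K)
                        / ((K + 1) * delta))
    return (130 * R**2 / m) * (D * math.log(4 * math.e * m) * math.log(4 * m)
                               + log_term)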
5. The Baseline Algorithms for Comparison
Algorithm 0.1 (Basic OC1 Algorithm) Start with the root node; while a node remains to be split:
• Optimize the decision based on some splitting criterion.
– Randomized search for a local minimum.
– Restarts to jump out of local minima.
• Partition the node into two or more child nodes based on the decision.
Prune the tree if necessary.
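A hedged sketch of the per-node search: the perturbation scheme and stopping rule below are simplified stand-ins for OC1's deterministic coordinate search with random jumps, not the actual OC1 code.

import numpy as np

def oc1_split_search(X, y, score, n_restarts=10, n_steps=200, step=0.1, rng=None):
    """Randomized local search for a splitting hyperplane (w, b) maximizing
    `score(w, b, X, y)`, e.g. the twoing value of the induced partition."""
    rng = rng if rng is not None else np.random.default_rng(0)
    best = (None, 0.0, -np.inf)
    for _ in range(n_restarts):                      # restarts escape local optima
        w, b = rng.standard_normal(X.shape[1]), 0.0
        val = score(w, b, X, y)
        for _ in range(n_steps):                     # random local perturbations
            w2 = w + step * rng.standard_normal(X.shape[1])
            b2 = b + step * rng.standard_normal()
            v2 = score(w2, b2, X, y)
            if v2 > val:                             # keep only improving moves
                w, b, val = w2, b2, v2
        if val > best[2]:
            best = (w, b, val)
    return best                                      # (w, b, score) of best split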
Splitting Criterion: Twoing Rule (goodness measure)

TwoingValue = (|TL|/n) · (|TR|/n) · ( Σ_{i=1}^{k} | |Li|/|TL| − |Ri|/|TR| | )^2
6. Twoing Rule
Splitting criterion - Twoing Rule (Breiman et al., 1984):

TwoingValue = (|TL|/n) · (|TR|/n) · ( Σ_{i=1}^{k} | |Li|/|TL| − |Ri|/|TR| | )^2
where
n = |TL| + |TR| - total number of instances at the current node
k - number of classes (k = 2 for two-class problems)
|TL| - number of instances on the left of the split, i.e. w^T x − b ≥ 0
|TR| - number of instances on the right of the split, i.e. w^T x − b < 0
|Li| - number of instances of class i on the left of the split
|Ri| - number of instances of class i on the right of the split
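In code, the twoing value takes only the per-class counts on each side of the split; counts_left and counts_right are our own names for the vectors (|L1|, ..., |Lk|) and (|R1|, ..., |Rk|):

import numpy as np

def twoing_value(counts_left, counts_right):
    """Twoing value from per-class counts |L_i| and |R_i| on the two sides."""
    L = np.asarray(counts_left, dtype=float)
    R = np.asarray(counts_right, dtype=float)
    TL, TR = L.sum(), R.sum()
    n = TL + TR
    if TL == 0 or TR == 0:                 # degenerate split: one side is empty
        return 0.0
    diff = np.abs(L / TL - R / TR).sum()   # sum_i | |L_i|/|TL| - |R_i|/|TR| |
    return (TL / n) * (TR / n) * diff**2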
7. Enlarging the Margins of PDTs
Algorithms for producing large-margin PDTs:
• Post-process existing trees (FAT).
Find the optimal separating hyperplane at each node of the existing tree.
• Incorporate a large margin into the splitting criterion (MOC1):
max TwoingValue + C · CurrentMargin
• Incorporate a large margin into the goodness measure (MOC2):
Modified Twoing Rule.
8. The OC1-PDT and FAT-PDT
IDEA: Post-process the existing OC1-PDT, replacing the OC1 decision with the optimal separating hyperplane at each decision node.
[Figure: the same two-class sample (x, o) split by the original OC1 hyperplane and, after post-processing, by the maximal-margin FAT hyperplane.]
9. The FAT Algorithm
1. Construct a decision tree using OC1; call it OC1-PDT.
2. Starting from the root of OC1-PDT, traverse all the non-leaf nodes. At each node,
• Relabel the points at the node with ω^T x − b ≥ 0 as superclass right, and the other points at this node as superclass left.
• Find the perceptron (optimal separating hyperplane) f(x) = ω*^T x − b* which separates superclasses right and left perfectly with maximal margin.
• Replace the original perceptron with the new one.
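A sketch of step 2 using an off-the-shelf SVM for the optimal separating hyperplane; scikit-learn's LinearSVC with a very large C approximates the hard-margin separator the algorithm calls for (the library choice and the helper name are ours, not the paper's; Node is the structure from the earlier sketch).

import numpy as np
from sklearn.svm import LinearSVC   # assumed available; not part of the paper

def fatten_node(node, X):
    """FAT step 2 at one internal node: relabel points by the current split,
    then replace (w, b) with a maximal-margin separating hyperplane.
    A hard margin is approximated with a very large C."""
    side = (X @ node.w - node.b >= 0).astype(int)   # superclass right = 1, left = 0
    if side.min() == side.max():                    # all points on one side: skip
        return
    svm = LinearSVC(C=1e6).fit(X, side)
    node.w = svm.coef_[0]
    node.b = -svm.intercept_[0]   # sklearn's f(x) = w^T x + b0, so b = -b0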
10. FAT Generalizes Better Than OC1
10-fold cross-validation results: FAT vs. OC1
[Scatter plot: FAT 10-CV average accuracy vs. OC1 10-CV average accuracy, both axes 65-100, with the x = y line shown; statistically significant differences are marked.]
11. MOC1: Margin OC1
IDEA: Use a multi-objective splitting criterion to maximize both the TwoingValue and the margin:

max TwoingValue + C · CurrentMargin

A code sketch of this score follows the figure below.
[Figure: a split of the x and o classes chosen to have both a high twoing value and a wide margin.]
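The MOC1 score in code, reusing twoing_value from the earlier sketch; taking CurrentMargin to be the smallest geometric distance from any point to the candidate hyperplane is our simplifying assumption:

import numpy as np

def moc1_score(w, b, X, counts_left, counts_right, C=0.1):
    """MOC1 criterion: TwoingValue + C * CurrentMargin, with the margin taken
    as the smallest geometric distance from any point to w^T x - b = 0."""
    margin = np.min(np.abs(X @ w - b)) / np.linalg.norm(w)
    return twoing_value(counts_left, counts_right) + C * margin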
12. MOC1 Generalizes Better Than OC1
10-fold cross-validation results: MOC1 vs. OC1
[Scatter plot: MOC1 10-CV average accuracy vs. OC1 10-CV average accuracy, both axes 65-100, with the x = y line shown; significant and non-significant differences are marked.]
13. The MOC2 Algorithm
IDEA: Modify the twoing criterion to allow a "soft margin".
We want both high accuracy and strong separation (a wide margin).

TwoingValue = (|MTL|/n) · (|MTR|/n) · ( Σ_{i=1}^{k} | |Li|/|TL| − |Ri|/|TR| | ) · ( Σ_{i=1}^{k} | |MLi|/|MTL| − |MRi|/|MTR| | )
[Figure: a soft-margin split of the x and o classes, showing the hyperplanes w^T x = b − 1, w^T x = b, and w^T x = b + 1, with some points falling inside the margin.]
14. The Modified Twoing Rule
TwoingValue = (|MTL|/n) · (|MTR|/n) · ( Σ_{i=1}^{k} | |Li|/|TL| − |Ri|/|TR| | ) · ( Σ_{i=1}^{k} | |MLi|/|MTL| − |MRi|/|MTR| | )
where n = |TL| + |TR| - total number of instances at the current node
k - number of classes (k = 2 for two-class problems)
|TL| - number of instances on the left of the split, i.e. w^T x − b ≥ 0
|TR| - number of instances on the right of the split, i.e. w^T x − b < 0
|Li| - number of instances of class i on the left of the split
|Ri| - number of instances of class i on the right of the split
|MTL| - number of instances on the left of the split with w^T x − b ≥ 1
|MTR| - number of instances on the right of the split with w^T x − b ≤ −1
|MLi| - number of instances of class i with w^T x − b ≥ 1
|MRi| - number of instances of class i with w^T x − b ≤ −1
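The modified rule as a function of the plain and margin-restricted class counts; the argument names are ours (the M-prefixed counts are restricted to points beyond the margin hyperplanes):

import numpy as np

def modified_twoing_value(L, R, ML, MR):
    """MOC2 soft-margin twoing rule from the plain per-class counts (L, R)
    and the margin-restricted per-class counts (ML, MR)."""
    L, R, ML, MR = (np.asarray(a, dtype=float) for a in (L, R, ML, MR))
    TL, TR, MTL, MTR = L.sum(), R.sum(), ML.sum(), MR.sum()
    n = TL + TR
    if min(TL, TR, MTL, MTR) == 0:     # an empty side or empty margin region
        return 0.0
    purity = np.abs(L / TL - R / TR).sum()
    margin_purity = np.abs(ML / MTL - MR / MTR).sum()
    return (MTL / n) * (MTR / n) * purity * margin_purity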
15. MOC2 Generalizes Better Than OC1
10-fold cross-validation results: MOC2 vs. OC1
[Scatter plot: MOC2 10-CV average accuracy vs. OC1 10-CV average accuracy, both axes 65-100, with the x = y line shown; significant and non-significant differences are marked.]
17. Conclusions
• The generalization error of a PDT is bounded by a function of the margins, the tree size, and the training set size.
• Three algorithms for controlling the capacity of PDTs were investigated:
– Post-processing existing trees (FAT)
– Incorporating margins into the splitting criterion:
∗ Multi-criteria splitting rule (MOC1)
∗ Soft-margin modified twoing rule (MOC2)
• Both theoretically and empirically, enlarged-margin PDTs perform better.