3. Logic of KNN
Find the historical records that look as similar as
possible to the new record.
Which group will I be classified into?
4. KNN instances and distance measure
Each instance/sample is represented as a vector of
numbers, so all instances correspond to points in an
n-dimensional Euclidean space.
North Carolina state bird: p = (p1, p2, ..., pn)
Dinosaur: q = (q1, q2, ..., qn)
How do we measure the distance between instances?
Euclidean distance: d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
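As a sketch, the Euclidean distance between two instances can be computed in a few lines of plain Python (no library assumed beyond the standard `math` module):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two n-dimensional instances p and q."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean((0, 0), (3, 4)))  # 5.0
```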
5. K nearest neighbors
Find the k nearest neighbors and take a vote among them to
get the classification – k = 1, 3, or 5 is what people often pick.
Question: Why is the number of nearest neighbors often an odd number?
Answer: Because the classification is decided by majority vote – an odd k avoids ties!
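The whole KNN logic fits in a short sketch; the `train` data below is a made-up toy example, not from the slides:

```python
import math
from collections import Counter

def knn_classify(train, new_point, k=3):
    """train: list of (vector, label) pairs. Classify new_point by
    majority vote among its k nearest neighbors (odd k avoids ties)."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], new_point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B")]
print(knn_classify(train, (0.5, 0.5), k=3))  # A
```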
7. Example of a Decision Tree
Training data:
Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         |  70K           | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       |  95K           | Yes
 6  | No     | Married        |  60K           | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         |  85K           | Yes
 9  | No     | Married        |  75K           | No
10  | No     | Single         |  90K           | Yes
Decision tree (splitting attributes: Refund, MarSt, TaxInc):
- Refund = Yes -> NO
- Refund = No -> MarSt
  - MarSt = Married -> NO
  - MarSt = Single, Divorced -> TaxInc
    - TaxInc < 80K -> NO
    - TaxInc > 80K -> YES
http://www.scribd.com/doc/56167859/7/Decision-Tree-Classification-Task
8. Apply Model to Test Data
Test data:
Refund | Marital Status | Taxable Income | Cheat
No     | Married        | 80K            | ?
Start from the root of the tree.
9. Apply Model to Test Data
The test record has Refund = No, so take the No branch from the root.
10. Apply Model to Test Data
Following the No branch leads to the MarSt node.
11. Apply Model to Test Data
At MarSt, the test record is Married, so take the Married branch.
12. Apply Model to Test Data
The Married branch ends at a NO leaf; the TaxInc node is never reached.
13. Apply Model to Test Data
The test record (Refund = No, Married, 80K) lands on the NO leaf.
Assign Cheat to "No".
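The walk-through above can be written out as a small function. The handling of a record at exactly 80K is an assumption here, since the slides only label the branches < 80K and > 80K:

```python
def predict_cheat(refund, marital_status, taxable_income):
    """The example tree: Refund -> MarSt -> TaxInc."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income at 80K
    # (assumption: exactly 80K goes to the "No" side)
    return "Yes" if taxable_income > 80_000 else "No"

print(predict_cheat("No", "Married", 80_000))  # No
```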
14. Special feature of decision trees in a random forest
Trees should not be pruned.
Each individual tree overfits (does not generalize
well), but this is okay after taking the majority
vote (which will be explained later).
Pruning ("persecuting") a tree is NOT allowed
in the random forest world!
15. Logic of ensemble
A high-dimensional pattern recognition problem is as
complicated as an elephant is to a blind man – too many
perspectives to touch and to know!
A single decision tree is like a single blind man: it is subject to overfitting and is unstable.
"Unstable" means that small changes in the training set lead to large changes in
predictions.
16. The logic of ensemble - continued
A single blind man is limited. Why
not send many blind men, let them
investigate the elephant from
different perspectives, and then
aggregate their opinions?
The MANY-blind-men approach is
like a random forest, an ensemble
of many trees!
In a random forest, each tree is like a blind man: it uses its training set
(the part of the elephant it touched) to draw conclusions (build the model)
and then to make predictions.
17. Translating it into a bit of jargon…
A random forest is an ensemble classifier of many
decision trees.
Each tree casts a vote at its terminal node. (For a
binary endpoint, the vote will be "YES" or "NO".)
The final prediction depends on the majority vote
of the trees.
The motivation for generating multiple trees is to
increase predictive accuracy.
18. Need to get some ensemble rules…
To keep a blind man from announcing that an elephant is
like a carpet, there must be some rules so that the votes
make as much sense as possible in aggregation.
(Image: elephant hair vs. a carpet)
19. Bootstrap (randomness by the samples)
Bootstrap sampling: create new training sets by random sampling from the
original data WITH replacement.
From one dataset, draw several bootstrap datasets; the samples never drawn
into a given bootstrap set are its out-of-bag (OOB) samples (around 1/3 of the data).
The bootstrap data (about 2/3 of the training data) is used to grow each tree, and
the OOB samples are used for self-testing – to evaluate the performance of each
tree and to get an unbiased estimate of the classification error.
Bootstrapping is the mainstream approach in random forests; people sometimes
use sampling without replacement instead.
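A minimal sketch of bootstrap sampling and the ~1/3 OOB fraction (NumPy is an assumption of this example, not something the slides use):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
indices = np.arange(n)

# Bootstrap sample: draw n indices WITH replacement.
boot = rng.choice(indices, size=n, replace=True)

# Out-of-bag samples: the indices never drawn (about 1/3, i.e. ~1/e).
oob = np.setdiff1d(indices, boot)
print(f"OOB fraction: {len(oob) / n:.2f}")  # about 0.37
```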
20. Random subspace (randomness by features)
For a bootstrap sample with M
predictors, at each node m (m < M)
variables are selected at random,
and only those m features are
considered for splitting. This lets
trees grow using different
features, like letting each blind
man see the data from a different
perspective.
Find the best split among the selected
m variables.
The value of m is fixed while the
forest is grown.
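The per-node feature selection can be sketched like this (the function name `node_features` is made up for illustration):

```python
import numpy as np

def node_features(n_features: int, mtry: int, rng) -> np.ndarray:
    """At each node, pick mtry of the M predictors at random (without
    replacement); only these are candidates for the split."""
    return rng.choice(n_features, size=mtry, replace=False)

rng = np.random.default_rng(42)
print(node_features(10, 3, rng))  # 3 distinct feature indices in 0..9
```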
21. How to classify new objects using a random forest?
Put the input vector down each of the trees in the forest. Each tree gives a
classification (a vote), and the forest chooses the classification having the
majority of votes (over all the trees in the forest).
(Diagram: the same new sample goes to Tree 1, Tree 2, ..., Tree n;
final decision – majority vote.)
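The aggregation step itself is tiny; here is a sketch of the majority vote over the per-tree classifications:

```python
from collections import Counter

def forest_vote(tree_votes):
    """Aggregate per-tree classifications into the forest's final decision."""
    return Counter(tree_votes).most_common(1)[0][0]

# e.g. five trees voting on one new sample:
print(forest_vote(["YES", "NO", "YES", "YES", "NO"]))  # YES
```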
22. Review in stats language
Definition: a random forest is a learning ensemble consisting of bagging
(or another type of re-sampling) of un-pruned decision tree learners,
with a randomized selection of features at each split.
Random forest algorithm:
Let Ntrees be the number of trees to build. For each of the Ntrees iterations:
1. Select a new bootstrap (or other re-sampled) sample from the
training set.
2. Grow an un-pruned tree on this bootstrap sample.
3. At each internal node, randomly select mtry predictors and
determine the best split using only these predictors.
4. Do not perform cost-complexity pruning. Save the tree as is, alongside
those built thus far.
Output the overall prediction as the average response (regression) or the
majority vote (classification) from all individually trained trees.
http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
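The algorithm above can be sketched from scratch. This is not the referenced implementation: for brevity it uses one-level stumps as a stand-in for full un-pruned trees, but the bootstrap (step 1), the random mtry predictors per split (step 3), the absence of pruning (step 4), and the majority vote are all there:

```python
import numpy as np

def fit_stump(X, y, feats):
    """Best single-feature threshold split among the candidate features
    (a one-level stand-in for an un-pruned tree, kept short on purpose)."""
    best = None
    for j in feats:
        for t in np.unique(X[:, j]):
            lo, hi = y[X[:, j] <= t], y[X[:, j] > t]
            if len(lo) == 0 or len(hi) == 0:
                continue
            lo_lab, hi_lab = np.bincount(lo).argmax(), np.bincount(hi).argmax()
            err = np.sum(lo != lo_lab) + np.sum(hi != hi_lab)
            if best is None or err < best[0]:
                best = (err, j, t, lo_lab, hi_lab)
    if best is None:  # degenerate bootstrap sample: only one class present
        maj = np.bincount(y).argmax()
        return (0, -np.inf, maj, maj)
    return best[1:]

def fit_forest(X, y, n_trees=25, mtry=1, seed=0):
    rng = np.random.default_rng(seed)
    n, M = X.shape
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                 # step 1: bootstrap
        feats = rng.choice(M, size=mtry, replace=False)  # step 3: mtry predictors
        forest.append(fit_stump(X[idx], y[idx], feats))  # steps 2 & 4: no pruning
    return forest

def forest_predict(forest, x):
    votes = [lo if x[j] <= t else hi for j, t, lo, hi in forest]
    return max(set(votes), key=votes.count)              # majority vote
```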
23. Pattern recognition is fun
(Images: a lunar mining robot and the Mars Rover)
"Give me a place to stand on, and I will move the Earth with a lever." – Archimedes
Give the machine enough data and the right algorithm, and it will behave much like you.