July 23, 2014 Data Mining: Concepts and Techniques1
Data Mining:
Concepts and
Techniques
— Chapter 6 —
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved
July 23, 2014 Data Mining: Concepts and Techniques2
Chapter 6. Classification and
Prediction
 What is classification? What is
prediction?
 Issues regarding classification
and prediction
 Classification by decision tree
induction
 Bayesian classification
 Rule-based classification
 Classification by back
propagation
 Support Vector Machines
(SVM)
 Associative classification
 Lazy learners (or learning from
your neighbors)
 Other classification methods
 Prediction
 Accuracy and error measures
 Ensemble methods
 Model selection

July 23, 2014 Data Mining: Concepts and Techniques3
 Classification
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new data
 Prediction
 models continuous-valued functions, i.e., predicts
unknown or missing values
 Typical applications
 Credit approval
 Target marketing
 Medical diagnosis
 Fraud detection
Classification vs. Prediction
July 23, 2014 Data Mining: Concepts and Techniques4
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
 The set of tuples used for model construction is the training set
 The model is represented as classification rules, decision trees,
or mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model

The known label of test sample is compared with the
classified result from the model

Accuracy rate is the percentage of test set samples that are
correctly classified by the model

Test set is independent of training set, otherwise over-fitting
will occur
 If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
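As an illustration of the two-step process, here is a minimal sketch using scikit-learn on a toy dataset (an assumed, illustrative choice; the slides do not prescribe any library or data).

```python
# Minimal sketch of the two-step process (illustrative library and data).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction on the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on an independent test set,
# then classify tuples whose class labels are not known
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("prediction for an unseen tuple:", clf.predict(X_test[:1]))
```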
July 23, 2014 Data Mining: Concepts and Techniques5
Process (1): Model Construction
Training
Data
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Classification
Algorithms
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Classifier
(Model)
July 23, 2014 Data Mining: Concepts and Techniques6
Process (2): Using the Model in
Prediction
Classifier
Testing
Data
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
July 23, 2014 Data Mining: Concepts and Techniques7
Supervised vs. Unsupervised Learning
 Supervised learning (classification)
 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
July 23, 2014 Data Mining: Concepts and Techniques8
Chapter 6. Classification and
Prediction
 What is classification? What is
prediction?
 Issues regarding classification
and prediction
 Classification by decision tree
induction
 Bayesian classification
 Rule-based classification
 Classification by back
propagation
 Support Vector Machines
(SVM)
 Associative classification
 Lazy learners (or learning from
your neighbors)
 Other classification methods
 Prediction
 Accuracy and error measures
 Ensemble methods
 Model selection

July 23, 2014 Data Mining: Concepts and Techniques9
Issues: Data Preparation
 Data cleaning
 Preprocess data in order to reduce noise and handle
missing values
 Relevance analysis (feature selection)
 Remove the irrelevant or redundant attributes
 Data transformation
 Generalize and/or normalize data
July 23, 2014 Data Mining: Concepts and Techniques10
Issues: Evaluating Classification Methods
 Accuracy
 classifier accuracy: predicting class label
 predictor accuracy: guessing value of predicted
attributes
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as
decision tree size or compactness of classification rules
July 23, 2014 Data Mining: Concepts and Techniques11
Chapter 6. Classification and
Prediction
 What is classification? What is
prediction?
 Issues regarding classification
and prediction
 Classification by decision tree
induction
 Bayesian classification
 Rule-based classification
 Classification by back
propagation
 Support Vector Machines
(SVM)
 Associative classification
 Lazy learners (or learning from
your neighbors)
 Other classification methods
 Prediction
 Accuracy and error measures
 Ensemble methods
 Model selection

July 23, 2014 Data Mining: Concepts and Techniques12
Example of a Decision Tree
Tid Refund Marital Status Taxable Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
(Refund and Marital Status are categorical attributes, Taxable Income is continuous, and Cheat is the class attribute.)
The table above is the training data; the model is a decision tree with Refund and MarSt as splitting attributes: Refund = Yes → NO; Refund = No → test MarSt; MarSt = Married → NO; MarSt = Single or Divorced → test TaxInc (< 80K → NO, > 80K → YES).
July 23, 2014 Data Mining: Concepts and Techniques13
Another Example of Decision Tree
Tid Refund Marital Status Taxable Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
(Refund and Marital Status are categorical attributes, Taxable Income is continuous, and Cheat is the class attribute.)
Alternative model that fits the same training data: MarSt = Married → NO; MarSt = Single or Divorced → test Refund; Refund = Yes → NO; Refund = No → test TaxInc (< 80K → NO, > 80K → YES).
There could be more than one tree that
fits the same data!
July 23, 2014 Data Mining: Concepts and Techniques14
Decision Tree Classification Task
(Framework figure: a tree-induction algorithm learns a decision tree from the training set; the model is then applied to the test set.)
July 23, 2014 Data Mining: Concepts and Techniques15
Apply Model to Test Data
Model (decision tree): Refund = Yes → NO; Refund = No → test MarSt; MarSt = Married → NO; MarSt = Single or Divorced → test TaxInc (< 80K → NO, > 80K → YES).
Test Data:
Refund Marital Status Taxable Income Cheat
No Married 80K ?
Start from the root of the tree and follow the branches that match the record.
Following the branches that match the test record: Refund = No → test MarSt; MarSt = Married → leaf NO.
Assign Cheat to “No”
July 23, 2014 Data Mining: Concepts and Techniques21
Decision Tree Classification Task
(Framework figure: the induced decision tree is applied to classify the test records.)
July 23, 2014 Data Mining: Concepts and Techniques22
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-conquer
manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are
discretized in advance)
 Examples are partitioned recursively based on selected attributes
 Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
 There are no samples left
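A minimal sketch of this greedy, top-down, divide-and-conquer procedure for categorical attributes, assuming an information_gain(rows, attribute, target) helper such as the one sketched later under “Information Gain in a Nutshell”; the names and the list-of-dicts data layout are illustrative assumptions.

```python
from collections import Counter

def majority_class(rows, target):
    # majority voting over the class label attribute
    return Counter(r[target] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, target):
    classes = {r[target] for r in rows}
    if len(classes) == 1:              # all samples belong to the same class -> leaf
        return classes.pop()
    if not attributes:                 # no remaining attributes -> majority voting
        return majority_class(rows, target)
    # select the test attribute by a heuristic measure (information gain here)
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    node = {best: {}}
    # partition the examples recursively on the selected attribute; iterating over
    # the values actually present means no recursive call receives an empty subset
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        node[best][value] = build_tree(subset, remaining, target)
    return node
```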
July 23, 2014 Data Mining: Concepts and Techniques23
Decision Tree Induction
 Many Algorithms:
 Hunt’s Algorithm (one of the earliest)
 CART
 ID3, C4.5
 SLIQ,SPRINT
July 23, 2014 Data Mining: Concepts and Techniques24
General Structure of Hunt’s
Algorithm
 Let Dt be the set of training records
that reach a node t
 General Procedure:
 If Dt contains records that belong to
the same class yt, then t is a leaf
node labeled as yt
 If Dt is an empty set, then t is a
leaf node labeled by the default
class, yd
 If Dt contains records that belong
to more than one class, use an
attribute test to split the data into
smaller subsets. Recursively
apply the procedure to each
subset.
Tid Refund Marital Status Taxable Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
Dt
?
July 23, 2014 Data Mining: Concepts and Techniques25
Hunt’s Algorithm
Step 1: a single node predicting the default class, Don’t Cheat.
Step 2: split on Refund: Yes → Don’t Cheat; No → Don’t Cheat.
Step 3: on the Refund = No branch, split on Marital Status: Single or Divorced → Cheat; Married → Don’t Cheat.
Step 4: on the Single/Divorced branch, split further on Taxable Income: < 80K → Don’t Cheat; >= 80K → Cheat.
July 23, 2014 Data Mining: Concepts and Techniques26
Tree Induction
 Greedy strategy.
 Split the records based on an attribute test that
optimizes a certain criterion.
 Issues
 Determine how to split the records

How to specify the attribute test condition?

How to determine the best split?
 Determine when to stop splitting
July 23, 2014 Data Mining: Concepts and Techniques27
Tree Induction
 Greedy strategy.
 Split the records based on an attribute test that
optimizes a certain criterion.
 Issues
 Determine how to split the records

How to specify the attribute test condition?

How to determine the best split?
 Determine when to stop splitting
July 23, 2014 Data Mining: Concepts and Techniques28
How to Specify Test Condition?
 Depends on attribute types
 Nominal
 Ordinal
 Continuous
 Depends on number of ways to split
 2-way split
 Multi-way split
July 23, 2014 Data Mining: Concepts and Techniques29
Splitting Based on Nominal Attributes
 Multi-way split: Use as many partitions as distinct
values.
 Binary split: Divides values into two subsets.
Need to find optimal partitioning.
Example (CarType, a nominal attribute): a multi-way split gives one partition per value (Family, Sports, Luxury); binary splits group the values into two subsets, e.g. {Family, Luxury} vs. {Sports} or {Sports, Luxury} vs. {Family}.
July 23, 2014 Data Mining: Concepts and Techniques30
 Multi-way split: Use as many partitions as distinct
values.
 Binary split: Divides values into two subsets.
Need to find optimal partitioning.
 What about this split?
Splitting Based on Ordinal
Attributes
Example (Size, an ordinal attribute): a multi-way split gives Small, Medium, Large; order-preserving binary splits are {Small, Medium} vs. {Large} or {Medium, Large} vs. {Small}. The split {Small, Large} vs. {Medium} groups non-adjacent values and therefore violates the ordering.
July 23, 2014 Data Mining: Concepts and Techniques31
Splitting Based on Continuous Attributes
 Different ways of handling
 Discretization to form an ordinal categorical
attribute

Static – discretize once at the beginning

Dynamic – ranges can be found by equal interval
bucketing, equal frequency bucketing
(percentiles), or clustering.
 Binary Decision: (A < v) or (A ≥ v)

consider all possible splits and find the best cut

can be more compute intensive
July 23, 2014 Data Mining: Concepts and Techniques32
Splitting Based on Continuous Attributes (figure examples of binary and multi-way splits omitted)
July 23, 2014 Data Mining: Concepts and Techniques33
Tree Induction
 Greedy strategy.
 Split the records based on an attribute test that
optimizes a certain criterion.
 Issues
 Determine how to split the records

How to specify the attribute test condition?

How to determine the best split?
 Determine when to stop splitting
July 23, 2014 Data Mining: Concepts and Techniques34
How to determine the Best Split
Before Splitting: 10 records of class 0,
10 records of class 1
Which test condition is the best?
July 23, 2014 Data Mining: Concepts and Techniques35
How to determine the Best Split
 Greedy approach:
 Nodes with homogeneous class distribution are
preferred
 Need a measure of node impurity:
Non-homogeneous,
High degree of impurity
Homogeneous,
Low degree of impurity
July 23, 2014 Data Mining: Concepts and Techniques36
Measures of Node Impurity
 Gini Index
 Entropy
 Misclassification error
July 23, 2014 Data Mining: Concepts and Techniques37
How to Find the Best Split
Before splitting, the parent node has class counts C0: N00 and C1: N01 and impurity M0.
Splitting on attribute A? (Yes/No) yields Node N1 with counts (N10, N11) and Node N2 with counts (N20, N21); their impurities M1 and M2 combine (weighted by node size) into M12.
Splitting on attribute B? (Yes/No) yields Node N3 with counts (N30, N31) and Node N4 with counts (N40, N41); their impurities M3 and M4 combine into M34.
Gain = M0 – M12 vs. M0 – M34: choose the split with the larger gain.
July 23, 2014 Data Mining: Concepts and Techniques38
Examples
 Using Information Gain
July 23, 2014 Data Mining: Concepts and Techniques39
Information Gain in a Nutshell
InformationGain(A) = Entropy(S) − Σ v ∈ Values(A) (|S_v| / |S|) · Entropy(S_v)
Entropy(S) = − Σ d ∈ Decisions p(d) · log2 p(d)
(the decisions d are typically yes/no)
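A small sketch of these two formulas in Python, operating on rows represented as dicts (an assumed layout); the distinct values of the attribute in S play the role of Values(A).

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum p(d) * log2 p(d)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy(S) minus the size-weighted entropy of each subset S_v."""
    labels = [r[target] for r in rows]
    gain = entropy(labels)
    n = len(rows)
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain
```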
July 23, 2014 Data Mining: Concepts and Techniques40
Playing Tennis
July 23, 2014 Data Mining: Concepts and Techniques41
Choosing an Attribute
 We want to split our decision tree on one of the
attributes
 There are four attributes to choose from:
 Outlook
 Temperature
 Humidity
 Wind
July 23, 2014 Data Mining: Concepts and Techniques42
How to Choose an Attribute
 We want to calculate the information gain of each
attribute
 Let us start with Outlook
 What is Entropy(S)?
 -5/14*log2(5/14) – 9/14*log2(9/14)
= Entropy(5/14,9/14) = 0.9403
July 23, 2014 Data Mining: Concepts and Techniques43
Outlook Continued
 The expected conditional entropy is:
5/14 * Entropy(3/5,2/5) +
4/14 * Entropy(1,0) +
5/14 * Entropy(3/5,2/5) = 0.6935
 So IG(Outlook) = 0.9403 – 0.6935 = 0.2468
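A quick numeric check of the figures above (purely illustrative arithmetic):

```python
import math

def H(*p):  # entropy of a discrete distribution given as probabilities
    return -sum(x * math.log2(x) for x in p if x > 0)

S = H(5/14, 9/14)                                                # ~0.9403
cond = 5/14 * H(3/5, 2/5) + 4/14 * H(1.0) + 5/14 * H(3/5, 2/5)   # ~0.6935
print(round(S - cond, 4))                                        # IG(Outlook) ~= 0.2468
```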
July 23, 2014 Data Mining: Concepts and Techniques44
Temperature
 Now let us look at the attribute Temperature
 The expected conditional entropy is:
4/14 * Entropy(2/4,2/4) +
6/14 * Entropy(4/6,2/6) +
4/14 * Entropy(3/4,1/4) = 0.9111
 So IG(Temperature) = 0.9403 – 0.9111 = 0.0292
July 23, 2014 Data Mining: Concepts and Techniques45
Humidity
 Now let us look at attribute Humidity
 What is the expected conditional entropy?
 7/14 * Entropy(4/7,3/7) +
7/14 * Entropy(6/7,1/7) = 0.7885
 So IG(Humidity) = 0.9403 – 0.7885
= 0.1518
July 23, 2014 Data Mining: Concepts and Techniques46
Wind
 What is the information gain for wind?
 Expected conditional entropy:
8/14 * Entropy(6/8,2/8) +
6/14 * Entropy(3/6,3/6) = 0.8922
 IG(Wind) = 0.9403 – 0.8922 = 0.048
July 23, 2014 Data Mining: Concepts and Techniques47
Information Gains
 Outlook 0.2468
 Temperature 0.0292
 Humidity 0.1518
 Wind 0.0481
 We choose Outlook since it has the highest
information gain
July 23, 2014 Data Mining: Concepts and Techniques48
Decision Tree So Far
 Now must decide what to do when Outlook is:
 Sunny
 Overcast
 Rain
July 23, 2014 Data Mining: Concepts and Techniques49
Sunny Branch
 Examples to classify:
 Temperature, Humidity, Wind, Tennis
 Hot, High, Weak, no
 Hot, High, Strong, no
 Mild, High, Weak, no
 Cool, Normal, Weak, yes
 Mild, Normal, Strong, yes
July 23, 2014 Data Mining: Concepts and Techniques50
Splitting Sunny on Temperature
 What is the Entropy of Sunny?
 Entropy(2/5,3/5) = 0.9710
 What is the expected conditional entropy?
 2/5 * Entropy(1,0) +
2/5 * Entropy(1/2,1/2) +
1/5 * Entropy(1,0) = 0.4000
 IG(Temperature) = 0.9710 – 0.4000
= 0.5710
July 23, 2014 Data Mining: Concepts and Techniques51
Splitting Sunny on Humidity
 The expected conditional entropy is
 3/5 * Entropy(1,0) +
2/5 * Entropy(1,0) = 0
 IG(Humidity) = 0.9710 – 0 = 0.9710
July 23, 2014 Data Mining: Concepts and Techniques52
Considering Wind?
 Do we need to consider wind as an attribute?
 No – it is not possible to do any better than an
expected entropy of 0; i.e. humidity must
maximize the information gain
July 23, 2014 Data Mining: Concepts and Techniques53
New Tree
July 23, 2014 Data Mining: Concepts and Techniques54
What if it is Overcast?
 All examples indicate yes
 So there is no need to further split on an attribute
 The information gain for any attribute would have
to be 0
 Just write yes at this node
July 23, 2014 Data Mining: Concepts and Techniques55
New Tree
July 23, 2014 Data Mining: Concepts and Techniques56
What about Rain?
 Let us consider attribute temperature
 First, what is the entropy of the data?
 Entropy(3/5,2/5) = 0.9710
 Second, what is the expected conditional entropy?
 3/5 * Entropy(2/3,1/3) + 2/5 * Entropy(1/2,1/2) = 0.9510
 IG(Temperature) = 0.9710 – 0.9510 = 0.020
July 23, 2014 Data Mining: Concepts and Techniques57
Or perhaps humidity?
 What is the expected conditional entropy?
 3/5 * Entropy(2/3,1/3) + 2/5 * Entropy(1/2,1/2) =
0.9510 (the same)
 IG(Humidity) = 0.9710 – 0.9510 = 0.020 (again,
the same)
July 23, 2014 Data Mining: Concepts and Techniques58
Now consider wind
 Expected conditional entropy:
 3/5*Entropy(1,0) + 2/5*Entropy(1,0) = 0
 IG(Wind) = 0.9710 – 0 = 0.9710
 Thus, we split on Wind
July 23, 2014 Data Mining: Concepts and Techniques59
Split Further?
July 23, 2014 Data Mining: Concepts and Techniques60
Final Tree
July 23, 2014 Data Mining: Concepts and Techniques61
Attribute Selection: Information Gain
 Class P: buys_computer = “yes”
 Class N: buys_computer = “no”
 Info(D) = I(9, 5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
age pi ni I(pi, ni)
<=30 2 3 0.971
31…40 4 0 0
>40 3 2 0.971
 Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
The term (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s.
 Hence Gain(age) = Info(D) − Info_age(D) = 0.246
 Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
July 23, 2014 Data Mining: Concepts and Techniques62
Computing Information-Gain for
Continuous-Value Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the value A in increasing order
 Typically, the midpoint between each pair of adjacent
values is considered as a possible split point
 (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
 The point with the minimum expected information
requirement for A is selected as the split-point for A
 Split:
 D1 is the set of tuples in D satisfying A ≤ split-point, and
D2 is the set of tuples in D satisfying A > split-point
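A sketch of this midpoint search, reusing the entropy helper from the information-gain sketch above (the data layout and names are assumptions):

```python
def best_split_point(rows, attribute, target):
    """Try the midpoint between each pair of adjacent sorted values of A and
    keep the one with the minimum expected information requirement."""
    values = sorted({r[attribute] for r in rows})
    n = len(rows)
    best_point, best_info = None, float("inf")
    for a, b in zip(values, values[1:]):
        point = (a + b) / 2                       # (a_i + a_{i+1}) / 2
        d1 = [r[target] for r in rows if r[attribute] <= point]
        d2 = [r[target] for r in rows if r[attribute] > point]
        info = (len(d1) / n) * entropy(d1) + (len(d2) / n) * entropy(d2)
        if info < best_info:
            best_point, best_info = point, info
    return best_point, best_info
```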
July 23, 2014 Data Mining: Concepts and Techniques63
Gain Ratio for Attribute Selection
(C4.5)
 Information gain measure is biased towards attributes with
a large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
 GainRatio(A) = Gain(A)/SplitInfo(A)
 Ex.: gain_ratio(income) = 0.029/1.557 ≈ 0.019
 The attribute with the maximum gain ratio is selected as
the splitting attribute
SplitInfo_A(D) = − Σ j=1..v (|Dj| / |D|) × log2(|Dj| / |D|)
SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
July 23, 2014 Data Mining: Concepts and Techniques64
Gini index (CART, IBM
IntelligentMiner)
 If a data set D contains examples from n classes, gini index, gini(D) is
defined as
where pj is the relative frequency of class j in D
 If a data set D is split on A into two subsets D1 and D2, the gini index
gini(D) is defined as
 Reduction in Impurity:
 The attribute that provides the smallest gini_A(D) (or the largest reduction in
impurity) is chosen to split the node (need to enumerate all the
possible splitting points for each attribute)
gini(D) = 1 − Σ j=1..n p_j²
gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
Δgini(A) = gini(D) − gini_A(D)
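A small sketch of these Gini computations, taking class counts as plain lists (an assumed representation):

```python
def gini(counts):
    """gini(D) = 1 - sum_j p_j^2, given the class counts in D."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """gini_A(D): size-weighted gini of the partitions induced by attribute A."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

print(round(gini([9, 5]), 3))   # 0.459, as in the buys_computer example
```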
July 23, 2014 Data Mining: Concepts and Techniques65
Gini index (CART, IBM
IntelligentMiner)
 Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
 Suppose the attribute income partitions D into 10 tuples in D1: {low, medium}
and 4 tuples in D2: {high}
 gini_{income ∈ {low,medium}}(D) = 0.443; similarly, gini_{income ∈ {low,high}}(D) = 0.458 and
gini_{income ∈ {medium,high}}(D) = 0.450, so the split on {low, medium} (and {high}) is the best since it gives the lowest Gini index
 All attributes are assumed continuous-valued
 May need other tools, e.g., clustering, to get the possible split values
 Can be modified for categorical attributes
gini(D) = 1 − (9/14)² − (5/14)² = 0.459
gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
July 23, 2014 Data Mining: Concepts and Techniques66
Comparing Attribute Selection Measures
 The three measures, in general, return good results but
 Information gain:

biased towards multivalued attributes
 Gain ratio:

tends to prefer unbalanced splits in which one
partition is much smaller than the others
 Gini index:

biased to multivalued attributes

has difficulty when # of classes is large

tends to favor tests that result in equal-sized
partitions and purity in both partitions
July 23, 2014 Data Mining: Concepts and Techniques67
Other Attribute Selection Measures
 CHAID: a popular decision tree algorithm, measure based on χ2
test
for independence
 C-SEP: performs better than info. gain and gini index in certain cases
 G-statistics: has a close approximation to χ2
distribution
 MDL (Minimal Description Length) principle (i.e., the simplest solution
is preferred):
 The best tree as the one that requires the fewest # of bits to both
(1) encode the tree, and (2) encode the exceptions to the tree
 Multivariate splits (partition based on multiple variable combinations)
 CART: finds multivariate splits based on a linear comb. of attrs.
 Which attribute selection measure is the best?
 Most give good results, but none is significantly superior to the others
July 23, 2014 Data Mining: Concepts and Techniques68
Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training data
 Too many branches, some may reflect anomalies due to noise or
outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early—do not split a node if this
would result in the goodness measure falling below a threshold

Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees

Use a set of data different from the training data to decide
which is the “best pruned tree”
July 23, 2014 Data Mining: Concepts and Techniques69
Underfitting and Overfitting
Underfitting: when the model is too simple, both training and test errors are large
Overfitting: when the model is too complex, the training error is small but the test error is large
July 23, 2014 Data Mining: Concepts and Techniques70
Overfitting in Classification
 Overfitting: An induced tree may overfit the training data
 Too many branches, some may reflect anomalies due to
noise or outliers
 Poor accuracy for unseen samples
July 23, 2014 Data Mining: Concepts and Techniques71
Enhancements to Basic Decision Tree Induction
 Allow for continuous-valued attributes
 Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete
set of intervals
 Handle missing attribute values
 Assign the most common value of the attribute
 Assign probability to each of the possible values
 Attribute construction
 Create new attributes based on existing ones that are
sparsely represented
 This reduces fragmentation, repetition, and replication
July 23, 2014 Data Mining: Concepts and Techniques72
Classification in Large Databases
 Classification—a classical problem extensively studied by
statisticians and machine learning researchers
 Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
 Why decision tree induction in data mining?
 relatively faster learning speed (than other classification
methods)
 convertible to simple and easy to understand
classification rules
 can use SQL queries for accessing databases
 comparable classification accuracy with other methods
July 23, 2014 Data Mining: Concepts and Techniques73
Scalable Decision Tree Induction Methods
 SLIQ (EDBT’96 — Mehta et al.)
 Builds an index for each attribute and only class list and
the current attribute list reside in memory
 SPRINT (VLDB’96 — J. Shafer et al.)
 Constructs an attribute list data structure
 PUBLIC (VLDB’98 — Rastogi & Shim)
 Integrates tree splitting and tree pruning: stop growing
the tree earlier
 RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
 Builds an AVC-list (attribute, value, class label)
 BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)
 Uses bootstrapping to create several small samples
July 23, 2014 Data Mining: Concepts and Techniques74
Scalability Framework for
RainForest
 Separates the scalability aspects from the criteria that
determine the quality of the tree
 Builds an AVC-list: AVC (Attribute, Value, Class_label)
 AVC-set (of an attribute X )
 Projection of training dataset onto the attribute X and
class label where counts of individual class label are
aggregated
 AVC-group (of a node n )
 Set of AVC-sets of all predictor attributes at the node n
July 23, 2014 Data Mining: Concepts and Techniques75
Rainforest: Training Set and Its AVC
Sets
student Buy_Computer
yes no
yes 6 1
no 3 4
Age Buy_Computer
yes no
<=30 3 2
31..40 4 0
>40 3 2
Credit
rating
Buy_Computer
yes no
fair 6 2
excellent 3 3
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
AVC-set on income; AVC-set on Age
AVC-set on Student
Training Examples
income Buy_Computer
yes no
high 2 2
medium 4 2
low 3 1
AVC-set on
credit_rating
July 23, 2014 Data Mining: Concepts and Techniques76
Chapter 6. Classification and
Prediction
 What is classification? What is
prediction?
 Issues regarding classification
and prediction
 Classification by decision tree
induction
 Bayesian classification
 Rule-based classification
 Classification by back
propagation
 Support Vector Machines
(SVM)
 Associative classification
 Lazy learners (or learning from
your neighbors)
 Other classification methods
 Prediction
 Accuracy and error measures
 Ensemble methods
 Model selection

July 23, 2014 Data Mining: Concepts and Techniques77
Bayesian Classification: Why?
 A statistical classifier: performs probabilistic prediction,
i.e., predicts class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve
Bayesian classifier, has comparable performance with
decision tree and selected neural network classifiers
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with observed
data
 Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard of
optimal decision making against which other methods can
be measured
July 23, 2014 Data Mining: Concepts and Techniques78
Bayesian Theorem: Basics
 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), the probability that the
hypothesis holds given the observed data sample X
 P(H) (prior probability), the initial probability
 E.g., X will buy computer, regardless of age, income, …
 P(X): probability that sample data is observed
 P(X|H) (likelihood), the probability of observing
the sample X, given that the hypothesis holds
 E.g., Given that X will buy computer, the prob. that X is
31..40, medium income
July 23, 2014 Data Mining: Concepts and Techniques79
Bayesian Theorem
 Given training data X, posteriori probability of a hypothesis
H, P(H|X), follows the Bayes theorem
 Informally, this can be written as
posteriori = likelihood x prior/evidence
 Predicts X belongs to Ci iff the probability P(Ci|X) is the
highest among all the P(Ck|X) for all the k classes
 Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
P(H|X) = P(X|H) P(H) / P(X)
July 23, 2014 Data Mining: Concepts and Techniques80
Towards Naïve Bayesian
Classifier
 Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute
vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
 This can be derived from Bayes’ theorem
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
 Since P(X) is constant for all classes, only P(X|Ci) P(Ci)
needs to be maximized
July 23, 2014 Data Mining: Concepts and Techniques81
Derivation of Naïve Bayes
Classifier
 A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):
 This greatly reduces the computation cost: Only counts
the class distribution
 If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having
value xk for Ak divided by |Ci, D| (# of tuples of Ci in D)
 If Ak is continuous-valued, P(xk|Ci) is usually computed
based on a Gaussian distribution with mean μ and
standard deviation σ:
P(X|Ci) = ∏ k=1..n P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
g(x, μ, σ) = (1 / (√(2π) σ)) · exp( −(x − μ)² / (2σ²) )
P(xk|Ci) = g(xk, μCi, σCi)
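A sketch of the Gaussian density used for continuous attributes; the numeric values in the usage line are purely illustrative, not taken from the slides.

```python
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma) = 1/(sqrt(2*pi)*sigma) * exp(-(x - mu)^2 / (2*sigma^2))"""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# hypothetical example: P(age = 35 | Ci) when ages in class Ci have mean 38, std 12
print(gaussian(35, mu=38, sigma=12))
```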
July 23, 2014 Data Mining: Concepts and Techniques82
Naïve Bayesian Classifier: Training
Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data sample
X = (age <=30,
Income = medium,
Student = yes
Credit_rating = Fair)
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
July 23, 2014 Data Mining: Concepts and Techniques83
Naïve Bayesian Classifier: An
Example
 P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
 Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
 X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
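The same arithmetic, reproduced as a short script:

```python
# X = (age <= 30, income = medium, student = yes, credit_rating = fair)
p_yes, p_no = 9 / 14, 5 / 14
px_yes = (2/9) * (4/9) * (6/9) * (6/9)     # ~0.044
px_no  = (3/5) * (2/5) * (1/5) * (2/5)     # ~0.019
print(round(px_yes * p_yes, 3), round(px_no * p_no, 3))  # 0.028 vs 0.007 -> predict "yes"
```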
July 23, 2014 Data Mining: Concepts and Techniques84
Avoiding the 0-Probability
Problem
 Naïve Bayesian prediction requires each conditional prob. be non-
zero. Otherwise, the predicted prob. will be zero
 Ex. Suppose a dataset with 1000 tuples, income=low (0), income=
medium (990), and income = high (10),
 Use Laplacian correction (or Laplacian estimator)
 Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
 The “corrected” prob. estimates are close to their “uncorrected”
counterparts
P(X|Ci) = ∏ k=1..n P(xk|Ci)
July 23, 2014 Data Mining: Concepts and Techniques85
Naïve Bayesian Classifier:
Comments
 Advantages
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumption: class conditional independence, therefore
loss of accuracy
 Practically, dependencies exist among variables

E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.

Dependencies among these cannot be modeled by Naïve
Bayesian Classifier
 How to deal with these dependencies?
 Bayesian Belief Networks
July 23, 2014 Data Mining: Concepts and Techniques86
Chapter 6. Classification and
Prediction
 What is classification? What is
prediction?
 Issues regarding classification
and prediction
 Classification by decision tree
induction
 Bayesian classification
 Rule-based classification
 Classification by back
propagation
 Support Vector Machines
(SVM)
 Associative classification
 Lazy learners (or learning from
your neighbors)
 Other classification methods
 Prediction
 Accuracy and error measures
 Ensemble methods
 Model selection

July 23, 2014 Data Mining: Concepts and Techniques87
Rule-Based Classifier
 Classify records by using a collection of “if…
then…” rules
 Rule: (Condition) → y
 where

Condition is a conjunctions of attributes

y is the class label
 LHS: rule antecedent or condition
 RHS: rule consequent
 Examples of classification rules:

(Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds

(Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
July 23, 2014 Data Mining: Concepts and Techniques88
(Decision tree for buys_computer: the root tests age? with branches <=30, 31..40 and >40; the <=30 branch tests student?, the >40 branch tests credit rating?, and the leaves carry the yes/no class labels, as captured by the rules below.)
 Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = yes
IF age = young AND credit_rating = fair THEN buys_computer = no
Rule Extraction from a Decision Tree
 Rules are easier to understand than large trees
 One rule is created for each path from the root to
a leaf
 Each attribute-value pair along a path forms a
conjunction: the leaf holds the class prediction
 Rules are mutually exclusive and exhaustive
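A sketch encoding the extracted rules as (condition, label) pairs; the attribute names follow the rules above, while the dictionary-based record format is an assumption.

```python
# The extracted rules, applied in order; the first matching rule fires.
rules = [
    (lambda t: t["age"] == "young" and t["student"] == "no",            "no"),
    (lambda t: t["age"] == "young" and t["student"] == "yes",           "yes"),
    (lambda t: t["age"] == "mid-age",                                   "yes"),
    (lambda t: t["age"] == "old" and t["credit_rating"] == "excellent", "yes"),
    (lambda t: t["age"] == "old" and t["credit_rating"] == "fair",      "no"),
]

def classify(record):
    for condition, label in rules:
        if condition(record):
            return label
    return None  # rules extracted from a tree are exhaustive, so this is not reached

print(classify({"age": "young", "student": "yes", "credit_rating": "fair"}))  # yes
```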
July 23, 2014 Data Mining: Concepts and Techniques89
Rule-based Classifier (Example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) →
Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds
July 23, 2014 Data Mining: Concepts and Techniques90
Application of Rule-Based
Classifier
 A rule r covers an instance x if the attributes of
the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
July 23, 2014 Data Mining: Concepts and Techniques91
Rule Coverage and Accuracy
 Coverage of a rule:
 Fraction of records
that satisfy the
antecedent of a rule
 Accuracy of a rule:
 Fraction of records
that satisfy both the
antecedent and
consequent of a rule
Tid Refund Marital Status Taxable Income Class
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
(Status=Single) → No
Coverage = 40%, Accuracy = 50%
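A short sketch that reproduces these numbers from the (Status, Class) columns of the table above:

```python
# Coverage and accuracy of the rule (Status=Single) -> No on the 10 records above.
records = [
    ("Single", "No"), ("Married", "No"), ("Single", "No"), ("Married", "No"),
    ("Divorced", "Yes"), ("Married", "No"), ("Divorced", "No"), ("Single", "Yes"),
    ("Married", "No"), ("Single", "Yes"),
]
covered = [cls for status, cls in records if status == "Single"]
coverage = len(covered) / len(records)         # 4/10 = 40%
accuracy = covered.count("No") / len(covered)  # 2/4  = 50%
print(coverage, accuracy)
```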
July 23, 2014 Data Mining: Concepts and Techniques92
How does Rule-based Classifier
Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
A lemur triggers rule R3, so it is classified as a mammal
A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?
July 23, 2014 Data Mining: Concepts and Techniques93
Characteristics of Rule-Based
Classifier
 Mutually exclusive rules
 Classifier contains mutually exclusive rules if
the rules are independent of each other
 Every record is covered by at most one rule
 Exhaustive rules
 Classifier has exhaustive coverage if it
accounts for every possible combination of
attribute values
 Each record is covered by at least one rule
July 23, 2014 Data Mining: Concepts and Techniques94
From Decision Trees To Rules
Classification Rules
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
Rules are mutually exclusive and exhaustive
Rule set contains as much information as the
tree
July 23, 2014 Data Mining: Concepts and Techniques95
Rules Can Be Simplified
Tid Refund Marital Status Taxable Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No
July 23, 2014 Data Mining: Concepts and Techniques96
Effect of Rule Simplification
 Rules are no longer mutually exclusive
 A record may trigger more than one rule
 Solution?

Ordered rule set

Unordered rule set – use voting schemes
 Rules are no longer exhaustive
 A record may not trigger any rules
 Solution?

Use a default class
July 23, 2014 Data Mining: Concepts and Techniques97
Ordered Rule Set
 Rules are rank ordered according to their priority
 An ordered rule set is known as a decision list
 When a test record is presented to the classifier
 It is assigned to the class label of the highest ranked rule it has
triggered
 If none of the rules fired, it is assigned to the default class
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
July 23, 2014 Data Mining: Concepts and Techniques98
Rule Ordering Schemes
 Rule-based ordering
 Individual rules are ranked based on their quality
 Class-based ordering
 Rules that belong to the same class appear together
July 23, 2014 Data Mining: Concepts and Techniques99
Building Classification Rules
 Direct Method:

Extract rules directly from data

e.g.: RIPPER, CN2, Holte’s 1R
 Indirect Method:

Extract rules from other classification models (e.g.
decision trees, neural networks, etc).

e.g: C4.5rules
July 23, 2014 Data Mining: Concepts and Techniques100
Rule Extraction from the Training Data
 Sequential covering algorithm: Extracts rules directly from training data
 Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially, each for a given class Ci will cover many
tuples of Ci but none (or few) of the tuples of other classes
 Steps:
 Rules are learned one at a time
 Each time a rule is learned, the tuples covered by the rules are
removed
 The process repeats on the remaining tuples until a termination
condition holds, e.g., there are no more training examples or the quality
of a rule returned is below a user-specified threshold
 Compare with decision-tree induction, which learns a set of rules simultaneously
July 23, 2014 Data Mining: Concepts and Techniques101
How to Learn-One-Rule?
 Start with the most general rule possible: condition = empty
 Add new conjuncts (attribute tests) by adopting a greedy depth-first strategy
 Pick the one that most improves the rule quality
 Rule-Quality measures: consider both coverage and accuracy
 Foil-gain (in FOIL & RIPPER): assesses info_gain by extending
condition
It favors rules that have high accuracy and cover many positive tuples
 Rule pruning based on an independent set of test tuples
Pos/neg are # of positive/negative tuples covered by R.
If FOIL_Prune is higher for the pruned version of R, prune R
FOIL_Gain = pos′ × ( log2( pos′ / (pos′ + neg′) ) − log2( pos / (pos + neg) ) )
FOIL_Prune(R) = (pos − neg) / (pos + neg)
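A direct transcription of the two measures (assuming pos > 0 so the logarithms are defined):

```python
import math

def foil_gain(pos, neg, pos2, neg2):
    """FOIL_Gain for extending a rule: (pos, neg) are the counts covered before,
    (pos', neg') = (pos2, neg2) the counts after adding the new conjunct."""
    if pos2 == 0:
        return 0.0   # pos' * (...) is taken as 0 when no positives remain covered
    return pos2 * (math.log2(pos2 / (pos2 + neg2)) - math.log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg); prune R if the pruned version scores higher."""
    return (pos - neg) / (pos + neg)
```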
July 23, 2014 Data Mining: Concepts and Techniques102
Direct Method: Sequential
Covering
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Step (2) and (3) until stopping criterion is
met
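A sketch of this loop; learn_one_rule, quality and rule.covers are assumed helper interfaces standing in for the Learn-One-Rule procedure and rule-quality measure discussed above.

```python
def sequential_covering(records, target_class, learn_one_rule, quality, threshold):
    """Learn rules for one class, removing covered records after each rule."""
    rules, remaining = [], list(records)
    while remaining:
        rule = learn_one_rule(remaining, target_class)      # grow one rule greedily
        if rule is None or quality(rule, remaining) < threshold:
            break                                           # stopping criterion
        rules.append(rule)
        remaining = [r for r in remaining if not rule.covers(r)]  # drop covered records
    return rules
```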
July 23, 2014 Data Mining: Concepts and Techniques103
Example of Sequential Covering
(ii) Step 1
July 23, 2014 Data Mining: Concepts and Techniques104
Example of Sequential Covering…
(Figure: (iii) Step 2 – rule R1 has been learned and the records it covers are removed; (iv) Step 3 – rule R2 is learned from the remaining records.)
July 23, 2014 Data Mining: Concepts and Techniques105
Aspects of Sequential Covering
 Rule Growing
 Instance Elimination
 Rule Evaluation
 Stopping Criterion
 Rule Pruning
July 23, 2014 Data Mining: Concepts and Techniques106
Instance Elimination
 Why do we need to
eliminate instances?
 Otherwise, the next rule is
identical to previous rule
 Why do we remove
positive instances?
 Ensure that the next rule is
different
 Why do we remove
negative instances?
 Prevent underestimating
accuracy of rule
 Compare rules R2 and R3 in
the diagram
July 23, 2014 Data Mining: Concepts and Techniques107
Stopping Criterion and Rule
Pruning
 Stopping criterion
 Compute the gain
 If gain is not significant, discard the new rule
 Rule Pruning
 Similar to post-pruning of decision trees
 Reduced Error Pruning:

Remove one of the conjuncts in the rule

Compare error rate on validation set before and
after pruning

If error improves, prune the conjunct
July 23, 2014 Data Mining: Concepts and Techniques108
Advantages of Rule-Based
Classifiers
 As highly expressive as decision trees
 Easy to interpret
 Easy to generate
 Can classify new instances rapidly
 Performance comparable to decision trees
July 23, 2014 Data Mining: Concepts and Techniques109
Chapter 6. Classification and
Prediction
 What is classification? What is
prediction?
 Issues regarding classification
and prediction
 Classification by decision tree
induction
 Bayesian classification
 Rule-based classification
 Classification by back
propagation
 Support Vector Machines
(SVM)
 Associative classification
 Lazy learners (or learning from
your neighbors)
 Other classification methods
 Prediction
 Accuracy and error measures
 Ensemble methods
 Model selection

July 23, 2014 Data Mining: Concepts and Techniques110
 Classification:
 predicts categorical class labels
 E.g., Personal homepage classification
 xi = (x1, x2, x3, …), yi = +1 or –1
 x1 : # of a word “homepage”
 x2 : # of a word “welcome”
 Mathematically
 x ∈ X = ℜⁿ, y ∈ Y = {+1, –1}
 We want a function f: X → Y
Classification: A Mathematical Mapping
July 23, 2014 Data Mining: Concepts and Techniques111
Linear Classification
 Binary Classification
problem
 The data above the red
line belongs to class ‘x’
 The data below red line
belongs to class ‘o’
 Examples: SVM,
Perceptron, Probabilistic
Classifiers
(Figure: points labeled ‘x’ lie above the red separating line and points labeled ‘o’ lie below it.)
July 23, 2014 Data Mining: Concepts and Techniques112
SVM—Support Vector Machines
 A new classification method for both linear and nonlinear
data
 It uses a nonlinear mapping to transform the original
training data into a higher dimension
 With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
 With an appropriate nonlinear mapping to a sufficiently
high dimension, data from two classes can always be
separated by a hyperplane
 SVM finds this hyperplane using support vectors
(“essential” training tuples) and margins (defined by the
support vectors)
July 23, 2014 Data Mining: Concepts and Techniques113
SVM—History and Applications
 Vapnik and colleagues (1992)—groundwork from Vapnik
& Chervonenkis’ statistical learning theory in 1960s
 Features: training can be slow but accuracy is high owing
to their ability to model complex nonlinear decision
boundaries (margin maximization)
 Used both for classification and prediction
 Applications:
 handwritten digit recognition, object recognition,
speaker identification, benchmarking time-series
prediction tests
July 23, 2014 Data Mining: Concepts and Techniques114
SVM—General Philosophy
(Figure: two candidate separating hyperplanes, one with a small margin and one with a large margin; the support vectors are the training tuples lying on the margin boundaries.)
July 23, 2014 Data Mining: Concepts and Techniques115
SVM—Margins and Support
Vectors
July 23, 2014 Data Mining: Concepts and Techniques116
SVM—When Data Is Linearly
Separable
Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple
with associated class label yi
There are infinite lines (hyperplanes) separating the two classes but we want to
find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)
July 23, 2014 Data Mining: Concepts and Techniques117
SVM—Linearly Separable
 A separating hyperplane can be written as
W ● X + b = 0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
 For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
 The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
 Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
 This becomes a constrained (convex) quadratic optimization
problem: quadratic objective function and linear constraints → Quadratic
Programming (QP) → Lagrangian multipliers
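A minimal sketch with scikit-learn on a tiny, hand-made linearly separable set (an illustrative choice): a linear SVM exposes the learned W and b of W · X + b = 0 and the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ (nearly) hard margin
print(clf.coef_, clf.intercept_)              # W and b of the hyperplane W.X + b = 0
print(clf.support_vectors_)                   # tuples lying on the margin sides H1 / H2
```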
July 23, 2014 Data Mining: Concepts and Techniques118
Why Is SVM Effective on High Dimensional
Data?
 The complexity of trained classifier is characterized by the # of support
vectors rather than the dimensionality of the data
 The support vectors are the essential or critical training examples —
they lie closest to the decision boundary (MMH)
 If all other training examples are removed and the training is repeated,
the same separating hyperplane would be found
 The number of support vectors found can be used to compute an
(upper) bound on the expected error rate of the SVM classifier, which
is independent of the data dimensionality
 Thus, an SVM with a small number of support vectors can have good
generalization, even when the dimensionality of the data is high
July 23, 2014 Data Mining: Concepts and Techniques119
SVM—Linearly Inseparable
 Transform the original input data into a higher dimensional
space
 Search for a linear separating hyperplane in the new space
(Figure: the original input space with attributes A1 and A2 is mapped into a higher-dimensional space where a linear separator exists.)
July 23, 2014 Data Mining: Concepts and Techniques120
SVM—Kernel functions
 Instead of computing the dot product on the transformed data tuples, it
is mathematically equivalent to apply a kernel function K(Xi, Xj) to the
original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj)
 Typical Kernel Functions
 SVM can also be used for classifying multiple (> 2) classes and for
regression analysis (with additional user parameters)
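A small check of the kernel identity K(Xi, Xj) = Φ(Xi) · Φ(Xj) for a homogeneous degree-2 polynomial kernel (one concrete choice of kernel and feature map, used here only for illustration):

```python
import numpy as np

def phi(x):                      # explicit degree-2 feature map for 2-D input
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def K(x, z):                     # homogeneous polynomial kernel of degree 2
    return float(np.dot(x, z)) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(K(x, z), float(np.dot(phi(x), phi(z))))   # identical: K(x, z) = phi(x) . phi(z)
```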
July 23, 2014 Data Mining: Concepts and Techniques121
SVM—Introduction Literature
 “Statistical Learning Theory” by Vapnik: extremely hard to understand,
containing many errors too.
 C. J. C. Burges.
A Tutorial on Support Vector Machines for Pattern Recognition.
Knowledge Discovery and Data Mining, 2(2), 1998.
 Better than Vapnik’s book, but still too hard as an introduction, and
the examples are not intuitive
 The book “An Introduction to Support Vector Machines” by N.
Cristianini and J. Shawe-Taylor
 Also hard as an introduction, but its explanation of Mercer’s
theorem is better than the above references
 The neural network book by Haykin
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 

Cluster analysis

  • 9. July 23, 2014 Data Mining: Concepts and Techniques9 Issues: Data Preparation  Data cleaning  Preprocess data in order to reduce noise and handle missing values  Relevance analysis (feature selection)  Remove the irrelevant or redundant attributes  Data transformation  Generalize and/or normalize data
  • 10. July 23, 2014 Data Mining: Concepts and Techniques10 Issues: Evaluating Classification Methods  Accuracy  classifier accuracy: predicting class label  predictor accuracy: guessing value of predicted attributes  Speed  time to construct the model (training time)  time to use the model (classification/prediction time)  Robustness: handling noise and missing values  Scalability: efficiency in disk-resident databases  Interpretability  understanding and insight provided by the model  Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
  • 11. July 23, 2014 Data Mining: Concepts and Techniques11 Chapter 6. Classification and Prediction  What is classification? What is prediction?  Issues regarding classification and prediction  Classification by decision tree induction  Bayesian classification  Rule-based classification  Classification by back propagation  Support Vector Machines (SVM)  Associative classification  Lazy learners (or learning from your neighbors)  Other classification methods  Prediction  Accuracy and error measures  Ensemble methods  Model selection 
  • 12. July 23, 2014 Data Mining: Concepts and Techniques12 Example of a Decision Tree Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 categorical categorical continuous class Refund MarSt TaxInc YESNO NO NO Yes No MarriedSingle, Divorced < 80K > 80K Splitting Attributes Training Data Model: Decision Tree
  • 13. July 23, 2014 Data Mining: Concepts and Techniques13 Another Example of Decision Tree Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 categorical categorical continuous class MarSt Refund TaxInc YESNO NO NO Yes No Married Single, Divorced < 80K > 80K There could be more than one tree that fits the same data!
  • 14. July 23, 2014 Data Mining: Concepts and Techniques14 Decision Tree Classification Task Decision Tree
  • 15. July 23, 2014 Data Mining: Concepts and Techniques15 Apply Model to Test Data Refund MarSt TaxInc YESNO NO NO Yes No MarriedSingle, Divorced < 80K > 80K Refund Marital Status Taxable Income Cheat No Married 80K ? 10 Test Data Start from the root of tree.
  • 16. July 23, 2014 Data Mining: Concepts and Techniques16 Apply Model to Test Data Refund MarSt TaxInc YESNO NO NO Yes No MarriedSingle, Divorced < 80K > 80K Refund Marital Status Taxable Income Cheat No Married 80K ? 10 Test Data
  • 17. July 23, 2014 Data Mining: Concepts and Techniques17 Apply Model to Test Data Refund MarSt TaxInc YESNO NO NO Yes No MarriedSingle, Divorced < 80K > 80K Refund Marital Status Taxable Income Cheat No Married 80K ? 10 Test Data
  • 18. July 23, 2014 Data Mining: Concepts and Techniques18 Apply Model to Test Data Refund MarSt TaxInc YESNO NO NO Yes No MarriedSingle, Divorced < 80K > 80K Refund Marital Status Taxable Income Cheat No Married 80K ? 10 Test Data
  • 19. July 23, 2014 Data Mining: Concepts and Techniques19 Apply Model to Test Data Refund MarSt TaxInc YESNO NO NO Yes No MarriedSingle, Divorced < 80K > 80K Refund Marital Status Taxable Income Cheat No Married 80K ? 10 Test Data
  • 20. July 23, 2014 Data Mining: Concepts and Techniques20 Apply Model to Test Data Refund MarSt TaxInc YESNO NO NO Yes No MarriedSingle, Divorced < 80K > 80K Refund Marital Status Taxable Income Cheat No Married 80K ? 10 Test Data Assign Cheat to “No”
  • 21. July 23, 2014 Data Mining: Concepts and Techniques21 Decision Tree Classification Task Decision Tree
  • 22. July 23, 2014 Data Mining: Concepts and Techniques22 Algorithm for Decision Tree Induction  Basic algorithm (a greedy algorithm)  Tree is constructed in a top-down recursive divide-and-conquer manner  At start, all the training examples are at the root  Attributes are categorical (if continuous-valued, they are discretized in advance)  Examples are partitioned recursively based on selected attributes  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)  Conditions for stopping partitioning  All samples for a given node belong to the same class  There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf  There are no samples left
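A minimal Python sketch of this top-down, recursive, divide-and-conquer procedure; the attribute-selection heuristic is passed in as select_attr, and all function names here are illustrative rather than taken from the slides.

```python
from collections import Counter

def build_tree(rows, attributes, target, select_attr, default=None):
    """Greedy top-down tree induction sketch (rows are dicts)."""
    if not rows:                          # no samples left
        return default
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:             # all samples belong to the same class
        return labels[0]
    if not attributes:                    # no attributes left: majority voting
        return majority
    best = select_attr(rows, attributes, target)   # heuristic, e.g. information gain
    node = {best: {}}
    rest = [a for a in attributes if a != best]
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        node[best][value] = build_tree(subset, rest, target, select_attr, majority)
    return node
```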
  • 23. July 23, 2014 Data Mining: Concepts and Techniques23 Decision Tree Induction  Many Algorithms:  Hunt’s Algorithm (one of the earliest)  CART  ID3, C4.5  SLIQ,SPRINT
  • 24. July 23, 2014 Data Mining: Concepts and Techniques24 General Structure of Hunt’s Algorithm  Let Dt be the set of training records that reach a node t  General Procedure:  If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt  If Dt is an empty set, then t is a leaf node labeled by the default class, yd  If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset. Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 Dt ?
  • 25. July 23, 2014 Data Mining: Concepts and Techniques25 Hunt’s Algorithm (figure: the tree grows in steps — (1) a single leaf predicting Don’t Cheat; (2) split on Refund: Yes → Don’t Cheat, No → Don’t Cheat; (3) split the Refund = No branch on Marital Status: Married → Don’t Cheat, Single/Divorced → Cheat; (4) split the Single/Divorced branch on Taxable Income: < 80K → Don’t Cheat, >= 80K → Cheat)
  • 26. July 23, 2014 Data Mining: Concepts and Techniques26 Tree Induction  Greedy strategy.  Split the records based on an attribute test that optimizes certain criterion.  Issues  Determine how to split the records  How to specify the attribute test condition?  How to determine the best split?  Determine when to stop splitting
  • 27. July 23, 2014 Data Mining: Concepts and Techniques27 Tree Induction  Greedy strategy.  Split the records based on an attribute test that optimizes certain criterion.  Issues  Determine how to split the records  How to specify the attribute test condition?  How to determine the best split?  Determine when to stop splitting
  • 28. July 23, 2014 Data Mining: Concepts and Techniques28 How to Specify Test Condition?  Depends on attribute types  Nominal  Ordinal  Continuous  Depends on number of ways to split  2-way split  Multi-way split
  • 29. July 23, 2014 Data Mining: Concepts and Techniques29 Splitting Based on Nominal Attributes  Multi-way split: Use as many partitions as distinct values.  Binary split: Divides values into two subsets. Need to find optimal partitioning. (e.g., CarType: multi-way split into Family / Sports / Luxury; binary split {Family, Luxury} vs {Sports} OR {Sports, Luxury} vs {Family})
  • 30. July 23, 2014 Data Mining: Concepts and Techniques30  Multi-way split: Use as many partitions as distinct values.  Binary split: Divides values into two subsets. Need to find optimal partitioning.  What about this split? Splitting Based on Ordinal Attributes Size Small Medium Large Size {Medium, Large} {Small} Size {Small, Medium} {Large} OR Size {Small, Large} {Medium}
  • 31. July 23, 2014 Data Mining: Concepts and Techniques31 Splitting Based on Continuous Attributes  Different ways of handling  Discretization to form an ordinal categorical attribute  Static – discretize once at the beginning  Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.  Binary decision: (A < v) or (A ≥ v)  consider all possible splits and find the best cut  can be more compute intensive
  • 32. July 23, 2014 Data Mining: Concepts and Techniques32 Splitting Based on Continuous Attributes (figure only)
  • 33. July 23, 2014 Data Mining: Concepts and Techniques33 Tree Induction  Greedy strategy.  Split the records based on an attribute test that optimizes certain criterion.  Issues  Determine how to split the records  How to specify the attribute test condition?  How to determine the best split?  Determine when to stop splitting
  • 34. July 23, 2014 Data Mining: Concepts and Techniques34 How to determine the Best Split Before Splitting: 10 records of class 0, 10 records of class 1 Which test condition is the best?
  • 35. July 23, 2014 Data Mining: Concepts and Techniques35 How to determine the Best Split  Greedy approach:  Nodes with homogeneous class distribution are preferred  Need a measure of node impurity: Non-homogeneous, High degree of impurity Homogeneous, Low degree of impurity
  • 36. July 23, 2014 Data Mining: Concepts and Techniques36 Measures of Node Impurity  Gini Index  Entropy  Misclassification error
  • 37. July 23, 2014 Data Mining: Concepts and Techniques37 How to Find the Best Split (figure: before splitting, the parent node has class counts C0: N00 and C1: N01 with impurity M0; splitting on A gives nodes N1 and N2 with impurities M1 and M2, combining to M12; splitting on B gives nodes N3 and N4 with impurities M3 and M4, combining to M34) Gain = M0 – M12 vs. M0 – M34
  • 38. July 23, 2014 Data Mining: Concepts and Techniques38 Examples  Using Information Gain
  • 39. July 23, 2014 Data Mining: Concepts and Techniques39 Information Gain in a Nutshell  InformationGain(A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)  Entropy(S) = −Σ_{d ∈ Decisions} p(d) · log(p(d)), where the decisions are typically yes/no
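These two formulas translate directly into a couple of small helpers; the names entropy and information_gain are illustrative, and rows are assumed to be dicts keyed by attribute name.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes of p(d) * log2 p(d)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Gain(A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    labels = [r[target] for r in rows]
    total = entropy(labels)
    n = len(rows)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder
```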
  • 40. July 23, 2014 Data Mining: Concepts and Techniques40 Playing Tennis
  • 41. July 23, 2014 Data Mining: Concepts and Techniques41 Choosing an Attribute  We want to split our decision tree on one of the attributes  There are four attributes to choose from:  Outlook  Temperature  Humidity  Wind
  • 42. July 23, 2014 Data Mining: Concepts and Techniques42 How to Choose an Attribute  We want to calculate the information gain of each attribute  Let us start with Outlook  What is Entropy(S)?  -5/14*log2(5/14) – 9/14*log2(9/14) = Entropy(5/14,9/14) = 0.9403
  • 43. July 23, 2014 Data Mining: Concepts and Techniques43 Outlook Continued  The expected conditional entropy is: 5/14 * Entropy(3/5,2/5) + 4/14 * Entropy(1,0) + 5/14 * Entropy(3/5,2/5) = 0.6935  So IG(Outlook) = 0.9403 – 0.6935 = 0.2468
  • 44. July 23, 2014 Data Mining: Concepts and Techniques44 Temperature  Now let us look at the attribute Temperature  The expected conditional entropy is: 4/14 * Entropy(2/4,2/4) + 6/14 * Entropy(4/6,2/6) + 4/14 * Entropy(3/4,1/4) = 0.9111  So IG(Temperature) = 0.9403 – 0.9111 = 0.0292
  • 45. July 23, 2014 Data Mining: Concepts and Techniques45 Humidity  Now let us look at attribute Humidity  What is the expected conditional entropy?  7/14 * Entropy(4/7,3/7) + 7/14 * Entropy(6/7,1/7) = 0.7885  So IG(Humidity) = 0.9403 – 0.7885 = 0.1518
  • 46. July 23, 2014 Data Mining: Concepts and Techniques46 Wind  What is the information gain for wind?  Expected conditional entropy: 8/14 * Entropy(6/8,2/8) + 6/14 * Entropy(3/6,3/6) = 0.8922  IG(Wind) = 0.9403 – 0.8922 = 0.048
  • 47. July 23, 2014 Data Mining: Concepts and Techniques47 Information Gains  Outlook 0.2468  Temperature 0.0292  Humidity 0.1518  Wind 0.0481  We choose Outlook since it has the highest information gain
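As a sanity check, the gains above can be reproduced with the entropy/information_gain helpers sketched earlier. The play-tennis table itself only appears as a figure on the slides, so the standard 14-example data set is assumed here.

```python
# Assumed standard play-tennis data; columns: Outlook, Temperature, Humidity, Wind, Tennis.
data = [
    ("Sunny","Hot","High","Weak","no"),          ("Sunny","Hot","High","Strong","no"),
    ("Overcast","Hot","High","Weak","yes"),      ("Rain","Mild","High","Weak","yes"),
    ("Rain","Cool","Normal","Weak","yes"),       ("Rain","Cool","Normal","Strong","no"),
    ("Overcast","Cool","Normal","Strong","yes"), ("Sunny","Mild","High","Weak","no"),
    ("Sunny","Cool","Normal","Weak","yes"),      ("Rain","Mild","Normal","Weak","yes"),
    ("Sunny","Mild","Normal","Strong","yes"),    ("Overcast","Mild","High","Strong","yes"),
    ("Overcast","Hot","Normal","Weak","yes"),    ("Rain","Mild","High","Strong","no"),
]
cols = ["Outlook", "Temperature", "Humidity", "Wind", "Tennis"]
rows = [dict(zip(cols, r)) for r in data]
for attr in cols[:-1]:
    print(attr, round(information_gain(rows, attr, "Tennis"), 3))
# ≈ Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048, as on the slides
```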
  • 48. July 23, 2014 Data Mining: Concepts and Techniques48 Decision Tree So Far  Now must decide what to do when Outlook is:  Sunny  Overcast  Rain
  • 49. July 23, 2014 Data Mining: Concepts and Techniques49 Sunny Branch  Examples to classify:  Temperature, Humidity, Wind, Tennis  Hot, High, Weak, no  Hot, High, Strong, no  Mild, High, Weak, no  Cool, Normal, Weak, yes  Mild, Normal, Strong, yes
  • 50. July 23, 2014 Data Mining: Concepts and Techniques50 Splitting Sunny on Temperature  What is the Entropy of Sunny?  Entropy(2/5,3/5) = 0.9710  What is the expected conditional entropy?  2/5 * Entropy(1,0) + 2/5 * Entropy(1/2,1/2) + 1/5 * Entropy(1,0) = 0.4000  IG(Temperature) = 0.9710 – 0.4000 = 0.5710
  • 51. July 23, 2014 Data Mining: Concepts and Techniques51 Splitting Sunny on Humidity  The expected conditional entropy is  3/5 * Entropy(1,0) + 2/5 * Entropy(1,0) = 0  IG(Humidity) = 0.9710 – 0 = 0.9710
  • 52. July 23, 2014 Data Mining: Concepts and Techniques52 Considering Wind?  Do we need to consider wind as an attribute?  No – it is not possible to do any better than an expected entropy of 0; i.e. humidity must maximize the information gain
  • 53. July 23, 2014 Data Mining: Concepts and Techniques53 New Tree
  • 54. July 23, 2014 Data Mining: Concepts and Techniques54 What if it is Overcast?  All examples indicate yes  So there is no need to further split on an attribute  The information gain for any attribute would have to be 0  Just write yes at this node
  • 55. July 23, 2014 Data Mining: Concepts and Techniques55 New Tree
  • 56. July 23, 2014 Data Mining: Concepts and Techniques56 What about Rain?  Let us consider attribute temperature  First, what is the entropy of the data?  Entropy(3/5,2/5) = 0.9710  Second, what is the expected conditional entropy?  3/5 * Entropy(2/3,1/3) + 2/5 * Entropy(1/2,1/2) = 0.9510  IG(Temperature) = 0.9710 – 0.9510 = 0.020
  • 57. July 23, 2014 Data Mining: Concepts and Techniques57 Or perhaps humidity?  What is the expected conditional entropy?  3/5 * Entropy(2/3,1/3) + 2/5 * Entropy(1/2,1/2) = 0.9510 (the same)  IG(Humidity) = 0.9710 – 0.9510 = 0.020 (again, the same)
  • 58. July 23, 2014 Data Mining: Concepts and Techniques58 Now consider wind  Expected conditional entropy:  3/5*Entropy(1,0) + 2/5*Entropy(1,0) = 0  IG(Wind) = 0.9710 – 0 = 0.9710  Thus, we split on Wind
  • 59. July 23, 2014 Data Mining: Concepts and Techniques59 Split Further?
  • 60. July 23, 2014 Data Mining: Concepts and Techniques60 Final Tree
  • 61. July 23, 2014 Data Mining: Concepts and Techniques61 Attribute Selection: Information Gain  Class P: buys_computer = “yes”  Class N: buys_computer = “no”
    Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
    age | pi | ni | I(pi, ni): <=30 | 2 | 3 | 0.971; 31…40 | 4 | 0 | 0; >40 | 3 | 2 | 0.971
    The term (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence
    Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
    Gain(age) = Info(D) − Info_age(D) = 0.246
    Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
    Training data (age, income, student, credit_rating, buys_computer): <=30 high no fair no; <=30 high no excellent no; 31…40 high no fair yes; >40 medium no fair yes; >40 low yes fair yes; >40 low yes excellent no; 31…40 low yes excellent yes; <=30 medium no fair no; <=30 low yes fair yes; >40 medium yes fair yes; <=30 medium yes excellent yes; 31…40 medium no excellent yes; 31…40 high yes fair yes; >40 medium no excellent no
  • 62. July 23, 2014 Data Mining: Concepts and Techniques62 Computing Information-Gain for Continuous-Value Attributes  Let attribute A be a continuous-valued attribute  Must determine the best split point for A  Sort the value A in increasing order  Typically, the midpoint between each pair of adjacent values is considered as a possible split point  (ai+ai+1)/2 is the midpoint between the values of ai and ai+1  The point with the minimum expected information requirement for A is selected as the split-point for A  Split:  D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
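A small sketch of this midpoint scan, reusing the entropy helper from above; best_split_point is an illustrative name.

```python
def best_split_point(rows, attr, target):
    """Scan midpoints between adjacent sorted values of a continuous attribute and
    return the split point with the minimum expected information requirement."""
    values = sorted(set(r[attr] for r in rows))
    n = len(rows)
    best_point, best_info = None, float("inf")
    for a, b in zip(values, values[1:]):
        point = (a + b) / 2                       # midpoint (a_i + a_{i+1}) / 2
        d1 = [r[target] for r in rows if r[attr] <= point]   # A <= split-point
        d2 = [r[target] for r in rows if r[attr] > point]    # A >  split-point
        info = len(d1) / n * entropy(d1) + len(d2) / n * entropy(d2)
        if info < best_info:
            best_point, best_info = point, info
    return best_point, best_info
```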
  • 63. July 23, 2014 Data Mining: Concepts and Techniques63 Gain Ratio for Attribute Selection (C4.5)  The information gain measure is biased towards attributes with a large number of values  C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain)  SplitInfo_A(D) = −Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)  GainRatio(A) = Gain(A) / SplitInfo_A(D)  Ex. income partitions D into subsets of size 4, 6, and 4: SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557, so gain_ratio(income) = 0.029/1.557 = 0.019  The attribute with the maximum gain ratio is selected as the splitting attribute
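The same idea in code, reusing information_gain from the earlier sketch; split_info and gain_ratio are illustrative names.

```python
import math
from collections import Counter

def split_info(rows, attr):
    """SplitInfo_A(D) = -sum_j |D_j|/|D| * log2(|D_j|/|D|)."""
    n = len(rows)
    sizes = Counter(r[attr] for r in rows).values()
    return -sum((s / n) * math.log2(s / n) for s in sizes)

def gain_ratio(rows, attr, target):
    return information_gain(rows, attr, target) / split_info(rows, attr)
```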
  • 64. July 23, 2014 Data Mining: Concepts and Techniques64 Gini index (CART, IBM IntelligentMiner)  If a data set D contains examples from n classes, the gini index gini(D) is defined as gini(D) = 1 − Σ_{j=1}^{n} p_j², where p_j is the relative frequency of class j in D  If D is split on attribute A into two subsets D1 and D2, the gini index of the split is gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)  Reduction in impurity: Δgini(A) = gini(D) − gini_A(D)  The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
  • 65. July 23, 2014 Data Mining: Concepts and Techniques65 Gini index (CART, IBM IntelligentMiner)  Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”, so gini(D) = 1 − (9/14)² − (5/14)² = 0.459  Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}; then gini_{income ∈ {low, medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) = 0.443, which is the lowest of the three candidate binary splits on income and is therefore the best  All attributes are assumed continuous-valued  May need other tools, e.g., clustering, to get the possible split values  Can be modified for categorical attributes
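A short sketch of the Gini computations, with the buys_computer numbers above as a check; function names are illustrative.

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_binary_split(rows, attr, subset_values, target):
    """gini_A(D) for the binary partition {attr in subset_values} vs the rest."""
    d1 = [r[target] for r in rows if r[attr] in subset_values]
    d2 = [r[target] for r in rows if r[attr] not in subset_values]
    n = len(rows)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

# With the 14-tuple buys_computer data: gini(D) = 1 - (9/14)^2 - (5/14)^2 ≈ 0.459,
# and the split income ∈ {low, medium} vs {high} gives ≈ 0.443.
```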
  • 66. July 23, 2014 Data Mining: Concepts and Techniques66 Comparing Attribute Selection Measures  The three measures, in general, return good results but  Information gain:  biased towards multivalued attributes  Gain ratio:  tends to prefer unbalanced splits in which one partition is much smaller than the others  Gini index:  biased to multivalued attributes  has difficulty when # of classes is large  tends to favor tests that result in equal-sized partitions and purity in both partitions
  • 67. July 23, 2014 Data Mining: Concepts and Techniques67 Other Attribute Selection Measures  CHAID: a popular decision tree algorithm, measure based on χ2 test for independence  C-SEP: performs better than info. gain and gini index in certain cases  G-statistic: has a close approximation to the χ2 distribution  MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):  The best tree is the one that requires the fewest # of bits to both (1) encode the tree, and (2) encode the exceptions to the tree  Multivariate splits (partition based on multiple variable combinations)  CART: finds multivariate splits based on a linear comb. of attrs.  Which attribute selection measure is the best?  Most give good results, but none is significantly superior to the others
  • 68. July 23, 2014 Data Mining: Concepts and Techniques68 Overfitting and Tree Pruning  Overfitting: An induced tree may overfit the training data  Too many branches, some may reflect anomalies due to noise or outliers  Poor accuracy for unseen samples  Two approaches to avoid overfitting  Prepruning: Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold  Difficult to choose an appropriate threshold  Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees  Use a set of data different from the training data to decide which is the “best pruned tree”
  • 69. July 23, 2014 Data Mining: Concepts and Techniques69 Underfitting and Overfitting Overfitting Underfitting: when model is too simple, both training and test errors are large
  • 70. July 23, 2014 Data Mining: Concepts and Techniques70 Overfitting in Classification  Overfitting: An induced tree may overfit the training data  Too many branches, some may reflect anomalies due to noise or outliers  Poor accuracy for unseen samples
  • 71. July 23, 2014 Data Mining: Concepts and Techniques71 Enhancements to Basic Decision Tree Induction  Allow for continuous-valued attributes  Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals  Handle missing attribute values  Assign the most common value of the attribute  Assign probability to each of the possible values  Attribute construction  Create new attributes based on existing ones that are sparsely represented  This reduces fragmentation, repetition, and replication
  • 72. July 23, 2014 Data Mining: Concepts and Techniques72 Classification in Large Databases  Classification—a classical problem extensively studied by statisticians and machine learning researchers  Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed  Why decision tree induction in data mining?  relatively faster learning speed (than other classification methods)  convertible to simple and easy to understand classification rules  can use SQL queries for accessing databases  comparable classification accuracy with other methods
  • 73. July 23, 2014 Data Mining: Concepts and Techniques73 Scalable Decision Tree Induction Methods  SLIQ (EDBT’96 — Mehta et al.)  Builds an index for each attribute and only class list and the current attribute list reside in memory  SPRINT (VLDB’96 — J. Shafer et al.)  Constructs an attribute list data structure  PUBLIC (VLDB’98 — Rastogi & Shim)  Integrates tree splitting and tree pruning: stop growing the tree earlier  RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)  Builds an AVC-list (attribute, value, class label)  BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)  Uses bootstrapping to create several small samples
  • 74. July 23, 2014 Data Mining: Concepts and Techniques74 Scalability Framework for RainForest  Separates the scalability aspects from the criteria that determine the quality of the tree  Builds an AVC-list: AVC (Attribute, Value, Class_label)  AVC-set (of an attribute X )  Projection of training dataset onto the attribute X and class label where counts of individual class label are aggregated  AVC-group (of a node n )  Set of AVC-sets of all predictor attributes at the node n
  • 75. July 23, 2014 Data Mining: Concepts and Techniques75 Rainforest: Training Set and Its AVC Sets student Buy_Computer yes no yes 6 1 no 3 4 Age Buy_Computer yes no <=30 3 2 31..40 4 0 >40 3 2 Credit rating Buy_Computer yes no fair 6 2 excellent 3 3 age income studentcredit_ratingbuys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no AVC-set on incomeAVC-set on Age AVC-set on Student Training Examples income Buy_Computer yes no high 2 2 medium 4 2 low 3 1 AVC-set on credit_rating
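Building an AVC-set is essentially a group-by count. A small sketch follows (illustrative names); run on the training examples above it reproduces the student AVC-set shown on the slide.

```python
from collections import defaultdict

def avc_set(rows, attr, target):
    """AVC-set of attribute attr: per attribute value, aggregate counts per class label."""
    counts = defaultdict(lambda: defaultdict(int))
    for r in rows:
        counts[r[attr]][r[target]] += 1
    return {value: dict(c) for value, c in counts.items()}

def avc_group(rows, predictor_attrs, target):
    """AVC-group of a node: the AVC-sets of all predictor attributes at that node."""
    return {a: avc_set(rows, a, target) for a in predictor_attrs}

# avc_set(rows, "student", "buys_computer")
# -> {'yes': {'yes': 6, 'no': 1}, 'no': {'yes': 3, 'no': 4}}
```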
  • 76. July 23, 2014 Data Mining: Concepts and Techniques76 Chapter 6. Classification and Prediction  What is classification? What is prediction?  Issues regarding classification and prediction  Classification by decision tree induction  Bayesian classification  Rule-based classification  Classification by back propagation  Support Vector Machines (SVM)  Associative classification  Lazy learners (or learning from your neighbors)  Other classification methods  Prediction  Accuracy and error measures  Ensemble methods  Model selection 
  • 77. July 23, 2014 Data Mining: Concepts and Techniques77 Bayesian Classification: Why?  A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities  Foundation: Based on Bayes’ Theorem.  Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers  Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data  Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
  • 78. July 23, 2014 Data Mining: Concepts and Techniques78 Bayesian Theorem: Basics  Let X be a data sample (“evidence”): class label is unknown  Let H be a hypothesis that X belongs to class C  Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X  P(H) (prior probability), the initial probability  E.g., X will buy computer, regardless of age, income, …  P(X): probability that sample data is observed  P(X|H) (posteriori probability), the probability of observing the sample X, given that the hypothesis holds  E.g., Given that X will buy computer, the prob. that X is 31..40, medium income
  • 79. July 23, 2014 Data Mining: Concepts and Techniques79 Bayesian Theorem  Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem: P(H|X) = P(X|H) P(H) / P(X)  Informally, this can be written as posterior = likelihood × prior / evidence  Predicts that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes  Practical difficulty: requires initial knowledge of many probabilities, significant computational cost
  • 80. July 23, 2014 Data Mining: Concepts and Techniques80 Towards Naïve Bayesian Classifier  Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)  Suppose there are m classes C1, C2, …, Cm.  Classification is to derive the maximum a posteriori class, i.e., the one with maximal P(Ci|X)  This can be derived from Bayes’ theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X)  Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized
  • 81. July 23, 2014 Data Mining: Concepts and Techniques81 Derivation of Naïve Bayes Classifier  A simplifying assumption: attributes are conditionally independent (i.e., no dependence relation between attributes): P(X|Ci) = Π_{k=1}^{n} P(x_k|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)  This greatly reduces the computation cost: only counts the class distribution  If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci,D| (# of tuples of Ci in D)  If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ: g(x, μ, σ) = (1 / (√(2π) σ)) · exp(−(x − μ)² / (2σ²)), and P(xk|Ci) = g(xk, μ_Ci, σ_Ci)
  • 82. July 23, 2014 Data Mining: Concepts and Techniques82 Naïve Bayesian Classifier: Training Dataset  Class: C1: buys_computer = ‘yes’, C2: buys_computer = ‘no’  Data sample X = (age <=30, income = medium, student = yes, credit_rating = fair)  Training data (age, income, student, credit_rating, buys_computer): <=30 high no fair no; <=30 high no excellent no; 31…40 high no fair yes; >40 medium no fair yes; >40 low yes fair yes; >40 low yes excellent no; 31…40 low yes excellent yes; <=30 medium no fair no; <=30 low yes fair yes; >40 medium yes fair yes; <=30 medium yes excellent yes; 31…40 medium no excellent yes; 31…40 high yes fair yes; >40 medium no excellent no
  • 83. July 23, 2014 Data Mining: Concepts and Techniques83 Naïve Bayesian Classifier: An Example  P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643 P(buys_computer = “no”) = 5/14= 0.357  Compute P(X|Ci) for each class P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222 P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6 P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444 P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4 P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667 P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2 P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667 P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4  X = (age <= 30 , income = medium, student = yes, credit_rating = fair) P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019 P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028 P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007 Therefore, X belongs to class (“buys_computer = yes”)
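The whole worked example can be reproduced with a few lines of counting. Below is a sketch of a categorical naïve Bayes predictor without smoothing (illustrative names), applied to the 14-tuple training data and the sample X above.

```python
from collections import Counter

def naive_bayes_predict(rows, target, x):
    """Pick the class maximizing P(Ci) * prod_k P(x_k | Ci)
    (categorical attributes, no smoothing, as on the slide)."""
    n = len(rows)
    classes = Counter(r[target] for r in rows)
    scores = {}
    for c, nc in classes.items():
        in_c = [r for r in rows if r[target] == c]
        p = nc / n                                    # prior P(Ci)
        for attr, value in x.items():                 # conditional P(x_k | Ci)
            p *= sum(1 for r in in_c if r[attr] == value) / nc
        scores[c] = p
    return max(scores, key=scores.get), scores

# x = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
# naive_bayes_predict(rows, "buys_computer", x) -> ('yes', {'yes': ≈0.028, 'no': ≈0.007})
```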
  • 84. July 23, 2014 Data Mining: Concepts and Techniques84 Avoiding the 0-Probability Problem  Naïve Bayesian prediction requires each conditional prob. to be non-zero, since P(X|Ci) = Π_{k=1}^{n} P(x_k|Ci); otherwise the predicted prob. will be zero  Ex. Suppose a dataset with 1000 tuples, income = low (0), income = medium (990), and income = high (10)  Use the Laplacian correction (or Laplacian estimator): adding 1 to each case gives Prob(income = low) = 1/1003, Prob(income = medium) = 991/1003, Prob(income = high) = 11/1003  The “corrected” prob. estimates are close to their “uncorrected” counterparts
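A one-line version of the correction for the income example (illustrative helper name):

```python
def laplace_prob(count, total, num_values):
    """Laplacian correction: add 1 to each count, so the denominator
    grows by the number of distinct attribute values."""
    return (count + 1) / (total + num_values)

# income over 1000 tuples: low 0, medium 990, high 10
for value, count in [("low", 0), ("medium", 990), ("high", 10)]:
    print(value, laplace_prob(count, 1000, 3))   # 1/1003, 991/1003, 11/1003
```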
  • 85. July 23, 2014 Data Mining: Concepts and Techniques85 Naïve Bayesian Classifier: Comments  Advantages  Easy to implement  Good results obtained in most of the cases  Disadvantages  Assumption: class conditional independence, therefore loss of accuracy  Practically, dependencies exist among variables  E.g., hospitals: patients: Profile: age, family history, etc. Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.  Dependencies among these cannot be modeled by Naïve Bayesian Classifier  How to deal with these dependencies?  Bayesian Belief Networks
  • 86. July 23, 2014 Data Mining: Concepts and Techniques86 Chapter 6. Classification and Prediction  What is classification? What is prediction?  Issues regarding classification and prediction  Classification by decision tree induction  Bayesian classification  Rule-based classification  Classification by back propagation  Support Vector Machines (SVM)  Associative classification  Lazy learners (or learning from your neighbors)  Other classification methods  Prediction  Accuracy and error measures  Ensemble methods  Model selection 
  • 87. July 23, 2014 Data Mining: Concepts and Techniques87 Rule-Based Classifier  Classify records by using a collection of “if… then…” rules  Rule: (Condition) → y  where  Condition is a conjunction of attribute tests  y is the class label  LHS: rule antecedent or condition  RHS: rule consequent  Examples of classification rules:  (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds  (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
  • 88. July 23, 2014 Data Mining: Concepts and Techniques88 Rule Extraction from a Decision Tree (figure: the buys_computer tree — age? <=30 → student? (no → no, yes → yes); 31..40 → yes; >40 → credit rating? (excellent → yes, fair → no))  Example: Rule extraction from our buys_computer decision-tree IF age = young AND student = no THEN buys_computer = no IF age = young AND student = yes THEN buys_computer = yes IF age = mid-age THEN buys_computer = yes IF age = old AND credit_rating = excellent THEN buys_computer = yes IF age = old AND credit_rating = fair THEN buys_computer = no  Rules are easier to understand than large trees  One rule is created for each path from the root to a leaf  Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction  Rules are mutually exclusive and exhaustive
  • 89. July 23, 2014 Data Mining: Concepts and Techniques89 Rule-based Classifier (Example) R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles R5: (Live in Water = sometimes) → Amphibians Name Blood Type Give Birth Can Fly Live in Water Class human warm yes no no mammals python cold no no no reptiles salmon cold no no yes fishes whale warm yes no yes mammals frog cold no no sometimes amphibians komodo cold no no no reptiles bat warm yes yes no mammals pigeon warm no yes no birds cat warm yes no no mammals leopard shark cold yes no yes fishes turtle cold no no sometimes reptiles penguin warm no no sometimes birds porcupine warm yes no no mammals eel cold no no yes fishes salamander cold no no sometimes amphibians gila monster cold no no no reptiles platypus warm no no no mammals owl warm no yes no birds dolphin warm yes no yes mammals eagle warm no yes no birds
  • 90. July 23, 2014 Data Mining: Concepts and Techniques90 Application of Rule-Based Classifier  A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles R5: (Live in Water = sometimes) → Amphibians The rule R1 covers a hawk => Bird The rule R3 covers the grizzly bear => Mammal Name Blood Type Give Birth Can Fly Live in Water Class hawk warm no yes no ? grizzly bear warm yes no no ?
  • 91. July 23, 2014 Data Mining: Concepts and Techniques91 Rule Coverage and Accuracy  Coverage of a rule:  Fraction of records that satisfy the antecedent of a rule  Accuracy of a rule:  Fraction of records that satisfy both the antecedent and consequent of a rule Tid Refund Marital Status Taxable Income Class 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 (Status=Single) → No Coverage = 40%, Accuracy = 50%
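Coverage and accuracy as defined here are simple fractions. The sketch below (illustrative names) reproduces the 40% / 50% figures for (Status=Single) → No on the 10-record table.

```python
def rule_coverage_accuracy(rows, antecedent, consequent_label, target):
    """Coverage = fraction of records satisfying the antecedent;
    accuracy = fraction of those whose class matches the consequent."""
    covered = [r for r in rows if antecedent(r)]
    coverage = len(covered) / len(rows)
    accuracy = sum(1 for r in covered if r[target] == consequent_label) / len(covered)
    return coverage, accuracy

# For the 10-record table above (column names assumed as in the slide):
# rule_coverage_accuracy(rows, lambda r: r["Marital Status"] == "Single", "No", "Class")
# -> (0.4, 0.5), i.e. coverage 40%, accuracy 50%
```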
  • 92. July 23, 2014 Data Mining: Concepts and Techniques92 How does Rule-based Classifier Work? R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles R5: (Live in Water = sometimes) → Amphibians A lemur triggers rule R3, so it is classified as a mammal A turtle triggers both R4 and R5 A dogfish shark triggers none of the rules Name Blood Type Give Birth Can Fly Live in Water Class lemur warm yes no no ? turtle cold no no sometimes ? dogfish shark cold yes no yes ?
  • 93. July 23, 2014 Data Mining: Concepts and Techniques93 Characteristics of Rule-Based Classifier  Mutually exclusive rules  Classifier contains mutually exclusive rules if the rules are independent of each other  Every record is covered by at most one rule  Exhaustive rules  Classifier has exhaustive coverage if it accounts for every possible combination of attribute values  Each record is covered by at least one rule
  • 94. July 23, 2014 Data Mining: Concepts and Techniques94 From Decision Trees To Rules Classification Rules (Refund=Yes) ==> No (Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No (Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes (Refund=No, Marital Status={Married}) ==> No Rules are mutually exclusive and exhaustive Rule set contains as much information as the tree
  • 95. July 23, 2014 Data Mining: Concepts and Techniques95 Rules Can Be Simplified Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 Initial Rule: (Refund=No) ∧ (Status=Married) → No Simplified Rule: (Status=Married) → No
  • 96. July 23, 2014 Data Mining: Concepts and Techniques96 Effect of Rule Simplification  Rules are no longer mutually exclusive  A record may trigger more than one rule  Solution?  Ordered rule set  Unordered rule set – use voting schemes  Rules are no longer exhaustive  A record may not trigger any rules  Solution?  Use a default class
  • 97. July 23, 2014 Data Mining: Concepts and Techniques97 Ordered Rule Set  Rules are rank ordered according to their priority  An ordered rule set is known as a decision list  When a test record is presented to the classifier  It is assigned to the class label of the highest ranked rule it has triggered  If none of the rules fired, it is assigned to the default class R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles R5: (Live in Water = sometimes) → Amphibians Name Blood Type Give Birth Can Fly Live in Water Class turtle cold no no sometimes ?
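A decision list is easy to express directly. The sketch below (illustrative names and predicates) classifies the turtle with R4, the highest-ranked rule it triggers.

```python
def classify_decision_list(rules, record, default_class):
    """rules: list of (condition, label) in priority order; the record gets
    the label of the highest-ranked rule it triggers, else the default class."""
    for condition, label in rules:
        if condition(record):
            return label
    return default_class

rules = [
    (lambda r: r["Give Birth"] == "no" and r["Can Fly"] == "yes", "birds"),        # R1
    (lambda r: r["Give Birth"] == "no" and r["Live in Water"] == "yes", "fishes"), # R2
    (lambda r: r["Give Birth"] == "yes" and r["Blood Type"] == "warm", "mammals"), # R3
    (lambda r: r["Give Birth"] == "no" and r["Can Fly"] == "no", "reptiles"),      # R4
    (lambda r: r["Live in Water"] == "sometimes", "amphibians"),                   # R5
]
turtle = {"Blood Type": "cold", "Give Birth": "no", "Can Fly": "no", "Live in Water": "sometimes"}
print(classify_decision_list(rules, turtle, "unknown"))   # 'reptiles' (R4 fires before R5)
```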
  • 98. July 23, 2014 Data Mining: Concepts and Techniques98 Rule Ordering Schemes  Rule-based ordering  Individual rules are ranked based on their quality  Class-based ordering  Rules that belong to the same class appear together
  • 99. July 23, 2014 Data Mining: Concepts and Techniques99 Building Classification Rules  Direct Method:  Extract rules directly from data  e.g.: RIPPER, CN2, Holte’s 1R  Indirect Method:  Extract rules from other classification models (e.g. decision trees, neural networks, etc).  e.g: C4.5rules
  • 100. July 23, 2014 Data Mining: Concepts and Techniques100 Rule Extraction from the Training Data  Sequential covering algorithm: Extracts rules directly from training data  Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER  Rules are learned sequentially, each for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes  Steps:  Rules are learned one at a time  Each time a rule is learned, the tuples covered by the rules are removed  The process repeats on the remaining tuples unless termination condition, e.g., when no more training examples or when the quality of a rule returned is below a user-specified threshold  Comp. w. decision-tree induction: learning a set of rules simultaneously
  • 101. July 23, 2014 Data Mining: Concepts and Techniques101 How to Learn-One-Rule?  Start with the most general rule possible: condition = empty  Add new attribute tests by adopting a greedy depth-first strategy  Pick the one that most improves the rule quality  Rule-quality measures: consider both coverage and accuracy  FOIL-gain (in FOIL & RIPPER): assesses info_gain by extending the condition: FOIL_Gain = pos' × (log2(pos' / (pos' + neg')) − log2(pos / (pos + neg))). It favors rules that have high accuracy and cover many positive tuples  Rule pruning based on an independent set of test tuples: FOIL_Prune(R) = (pos − neg) / (pos + neg), where pos/neg are the # of positive/negative tuples covered by R. If FOIL_Prune is higher for the pruned version of R, prune R
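Both measures translate directly into code (illustrative names):

```python
import math

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg)))
    when a new test is added to the rule's condition."""
    before = math.log2(pos / (pos + neg))
    after = math.log2(pos_new / (pos_new + neg_new))
    return pos_new * (after - before)

def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg), evaluated on a validation set."""
    return (pos - neg) / (pos + neg)
```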
  • 102. July 23, 2014 Data Mining: Concepts and Techniques102 Direct Method: Sequential Covering 1. Start from an empty rule 2. Grow a rule using the Learn-One-Rule function 3. Remove training records covered by the rule 4. Repeat Step (2) and (3) until stopping criterion is met
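The four steps as a loop; learn_one_rule and covers are assumed helper functions, not defined on the slides.

```python
def sequential_covering(rows, target, positive_class, learn_one_rule, covers):
    """Grow one rule at a time for the positive class, remove the training
    records it covers, and repeat until a stopping criterion is met."""
    rules, remaining = [], list(rows)
    while any(r[target] == positive_class for r in remaining):
        rule = learn_one_rule(remaining, target, positive_class)
        if rule is None:                              # no acceptable rule found
            break
        covered = [r for r in remaining if covers(rule, r)]
        if not covered:                               # rule covers nothing new
            break
        rules.append(rule)
        remaining = [r for r in remaining if not covers(rule, r)]
    return rules
```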
  • 103. July 23, 2014 Data Mining: Concepts and Techniques103 Example of Sequential Covering (ii) Step 1
  • 104. July 23, 2014 Data Mining: Concepts and Techniques104 Example of Sequential Covering… (iii) Step 2 R1 (iv) Step 3 R1 R2
  • 105. July 23, 2014 Data Mining: Concepts and Techniques105 Aspects of Sequential Covering  Rule Growing  Instance Elimination  Rule Evaluation  Stopping Criterion  Rule Pruning
  • 106. July 23, 2014 Data Mining: Concepts and Techniques106 Instance Elimination  Why do we need to eliminate instances?  Otherwise, the next rule is identical to previous rule  Why do we remove positive instances?  Ensure that the next rule is different  Why do we remove negative instances?  Prevent underestimating accuracy of rule  Compare rules R2 and R3 in the diagram
  • 107. July 23, 2014 Data Mining: Concepts and Techniques107 Stopping Criterion and Rule Pruning  Stopping criterion  Compute the gain  If gain is not significant, discard the new rule  Rule Pruning  Similar to post-pruning of decision trees  Reduced Error Pruning:  Remove one of the conjuncts in the rule  Compare error rate on validation set before and after pruning  If error improves, prune the conjunct
  • 108. July 23, 2014 Data Mining: Concepts and Techniques108 Advantages of Rule-Based Classifiers  As highly expressive as decision trees  Easy to interpret  Easy to generate  Can classify new instances rapidly  Performance comparable to decision trees
  • 109. July 23, 2014 Data Mining: Concepts and Techniques109 Chapter 6. Classification and Prediction  What is classification? What is prediction?  Issues regarding classification and prediction  Classification by decision tree induction  Bayesian classification  Rule-based classification  Classification by back propagation  Support Vector Machines (SVM)  Associative classification  Lazy learners (or learning from your neighbors)  Other classification methods  Prediction  Accuracy and error measures  Ensemble methods  Model selection 
  • 110. July 23, 2014 Data Mining: Concepts and Techniques110  Classification:  predicts categorical class labels  E.g., Personal homepage classification  xi = (x1, x2, x3, …), yi = +1 or –1  x1 : # of occurrences of the word “homepage”  x2 : # of occurrences of the word “welcome”  Mathematically  x ∈ X = ℜ^n , y ∈ Y = {+1, –1}  We want a function f: X → Y Classification: A Mathematical Mapping
  • 111. July 23, 2014 Data Mining: Concepts and Techniques111 Linear Classification  Binary classification problem  The data above the red line belongs to class ‘x’  The data below the red line belongs to class ‘o’  Examples: SVM, Perceptron, Probabilistic Classifiers (figure: a scatter of ‘x’ points above and ‘o’ points below a separating line)
  • 112. July 23, 2014 Data Mining: Concepts and Techniques112 SVM—Support Vector Machines  A new classification method for both linear and nonlinear data  It uses a nonlinear mapping to transform the original training data into a higher dimension  With the new dimension, it searches for the linear optimal separating hyperplane (i.e., “decision boundary”)  With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane  SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors)
  • 113. July 23, 2014 Data Mining: Concepts and Techniques113 SVM—History and Applications  Vapnik and colleagues (1992)—groundwork from Vapnik & Chervonenkis’ statistical learning theory in 1960s  Features: training can be slow but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)  Used both for classification and prediction  Applications:  handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests
  • 114. July 23, 2014 Data Mining: Concepts and Techniques114 SVM—General Philosophy Support Vectors Small Margin Large Margin
  • 115. July 23, 2014 Data Mining: Concepts and Techniques115 SVM—Margins and Support Vectors
  • 116. July 23, 2014 Data Mining: Concepts and Techniques116 SVM—When Data Is Linearly Separable  Let the data D be (X1, y1), …, (X|D|, y|D|), where Xi is a training tuple with associated class label yi  There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)  SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)
  • 117. July 23, 2014 Data Mining: Concepts and Techniques117 SVM—Linearly Separable  A separating hyperplane can be written as W ● X + b = 0 where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)  For 2-D it can be written as w0 + w1 x1 + w2 x2 = 0  The hyperplane defining the sides of the margin: H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1  Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors  This becomes a constrained (convex) quadratic optimization problem: Quadratic objective function and linear constraints  Quadratic Programming (QP)  Lagrangian multipliers
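Given a weight vector W and bias b (however they were obtained), the side test and the support-vector condition from this slide look like the following sketch (illustrative names):

```python
def hyperplane_side(w, b, x):
    """Sign of W·X + b decides the predicted class (+1 / -1)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def on_margin(w, b, x, y, tol=1e-6):
    """Support vectors satisfy y * (W·X + b) = 1, i.e. they lie on H1 or H2."""
    return abs(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) - 1.0) < tol
```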
  • 118. July 23, 2014 Data Mining: Concepts and Techniques118 Why Is SVM Effective on High Dimensional Data?  The complexity of trained classifier is characterized by the # of support vectors rather than the dimensionality of the data  The support vectors are the essential or critical training examples — they lie closest to the decision boundary (MMH)  If all other training examples are removed and the training is repeated, the same separating hyperplane would be found  The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality  Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
  • 119. July 23, 2014 Data Mining: Concepts and Techniques119 SVM—Linearly Inseparable  Transform the original input data into a higher dimensional space  Search for a linear separating hyperplane in the new space A 1 A 2
  • 120. July 23, 2014 Data Mining: Concepts and Techniques120 SVM—Kernel functions  Instead of computing the dot product on the transformed data tuples, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj)  Typical kernel functions include the polynomial kernel, the Gaussian radial basis function kernel, and the sigmoid kernel  SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters)
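A quick numeric check of the kernel idea for the degree-2 polynomial kernel in two dimensions, using the explicit feature map Φ(x) = (x1², √2·x1·x2, x2²); the example values are arbitrary.

```python
import math

def quad_kernel(x, z):
    """K(x, z) = (x · z)^2 computed directly on the original 2-D tuples."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, 0.5)
lhs = quad_kernel(x, z)
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))
print(lhs, rhs)   # both 16.0: the kernel equals the dot product in the feature space
```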
  • 121. July 23, 2014 Data Mining: Concepts and Techniques121 SVM—Introduction Literature  “Statistical Learning Theory” by Vapnik: extremely hard to understand, and it contains many errors too  C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998: better than Vapnik’s book, but still too difficult as an introduction, and the examples are not intuitive  The book “An Introduction to Support Vector Machines” by N. Cristianini and J. Shawe-Taylor: also difficult as an introduction, but the explanation of Mercer’s theorem is better than in the above literature  The neural network book by Haykin
