Breast cancer treatment remains an unresolved challenge for medical practitioners. The key to better outcomes is early diagnosis and treatment; however, even after early diagnosis and treatment there is a high chance of recurrence, so an accurate early prognosis helps patients receive better care. Data mining, as a knowledge discovery field, can contribute to better prognosis through higher prediction accuracy. In this report, working with the WEKA software and the Wisconsin Breast Cancer Database collected by Dr. William H. Wolberg of the University of Wisconsin Hospitals, we show how to build a decision tree with an improved accuracy rate and discuss how decision tree induction yields a better prediction tool.
1. Classification of Breast Cancer dataset using
Decision Tree Induction
Abel Medhanie Gebreyesus
Sunil Nair
HINF6210 Project Presentation – November 25, 2008
2. Agenda
Objective
Dataset
Approach
Classification Methods
Decision Tree
Problems
Future direction
11/25/2008 2
HINF6210/Project presentation/Abel/Sunil
3. Introduction
Breast Cancer prognosis
Breast cancer incidence is high
Improvement in diagnostic methods
Early diagnosis and treatment.
But, recurrence is high
Good prognosis is important….
4. Objective
Significance of project
Previous work done using this dataset
Most previous work indicated room for improving the classifier's accuracy
5. Breast Cancer Dataset
Wisconsin Breast Cancer Database (1991) University of Wisconsin Hospitals,
Dr. William H. Wolberg
# of Instances: 699
# of Attributes: 10 plus class attribute
Class distribution:
Benign (2): 458 (65.5%)
Malignant (4): 241 (34.5%)
Missing Values : 16
6. Attributes
• Indicate cellular characteristics
• Variables are continuous/ordinal with 10 levels

#    Attribute                      Domain
1    Sample code number             id number
2    Clump Thickness                1-10
3    Uniformity of Cell Size        1-10
4    Uniformity of Cell Shape       1-10
5    Marginal Adhesion              1-10
6    Single Epithelial Cell Size    1-10
7    Bare Nuclei                    1-10
8    Bland Chromatin                1-10
9    Normal Nucleoli                1-10
10   Mitoses                        1-10
11   Class                          Benign (2), Malignant (4)
7. Attributes / class - distribution
• The dataset is unbalanced (65.5% benign vs. 34.5% malignant)
8. Our Approach
Data Pre-processing
Comparison between Classification techniques
Decision Tree Induction
Attribute Selection
J48
Evaluation
9. Data Pre-processing
Filter out the ID column
Handle Missing Values
WEKA
10. Data preprocessing
Two options to manage missing data in WEKA:
"ReplaceMissingValues"
weka.filters.unsupervised.attribute.ReplaceMissingValues
Missing nominal and numeric attribute values are replaced with the modes and means respectively
Remove (delete) the tuples with missing values
Missing values: attribute Bare Nuclei = 16
Outliers
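The two WEKA options above (mean/mode replacement vs. tuple removal) can be sketched in Python with pandas. The column below is an invented toy stand-in for Bare Nuclei, not the actual dataset:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the Bare Nuclei column (ordinal 1-10, "?" read as NaN).
df = pd.DataFrame({
    "bare_nuclei": [1, 10, 2, np.nan, 4, np.nan, 1, 3],
    "clump_thickness": [5, 8, 1, 6, 4, 7, 2, 3],
})

# Option 1: replace missing values with the column mean
# (WEKA's ReplaceMissingValues uses means for numeric and
# modes for nominal attributes).
replaced = df.fillna(df.mean(numeric_only=True))

# Option 2: drop any tuple that contains a missing value.
removed = df.dropna()

print(replaced["bare_nuclei"].tolist())  # NaNs become 3.5 (the mean)
print(len(removed))                      # 6 rows remain
```

Option 1 keeps all 699 instances but may bias the imputed attribute; option 2 loses 16 tuples but only uses observed values.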
11. Comparison chart – Handling Missing Values

Confusion Matrix (test split; 223 of 233 instances correctly classified, accuracy rate 95.7%)

               Predicted B   Predicted M   Total
Actual B           160             7         167
Actual M             3            63          66
Total              163            70         233

PERFORMANCE EVALUATION

DATASET             # RULES   MAE   Act. Acc. Rate   Exp. Acc. Rate
Complete               14      8%        94%              87%
Missing Removed        11      5%        96%              90%
Missing Replaced       14      7%        95%              89%

How many predictions occur by chance? The Exp. Acc. Rate column reports the Kappa Statistic, used to measure the agreement between the predicted and actual categorization of the data while correcting for predictions that occur by chance.
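The kappa correction can be recomputed by hand from the confusion matrix on this slide (a plain recalculation for illustration, not WEKA output):

```python
# Cohen's kappa from the slide's confusion matrix (223/233 correct).
tp_b, fn_b = 160, 7    # actual Benign:    predicted B / predicted M
fp_b, tn_b = 3, 63     # actual Malignant: predicted B / predicted M
total = tp_b + fn_b + fp_b + tn_b   # 233

observed = (tp_b + tn_b) / total    # accuracy actually achieved
# Chance agreement: product of row and column marginals per class.
exp_b = (tp_b + fn_b) * (tp_b + fp_b) / total**2
exp_m = (fp_b + tn_b) * (fn_b + tn_b) / total**2
expected = exp_b + exp_m

kappa = (observed - expected) / (1 - expected)
print(round(observed, 4), round(kappa, 3))  # 0.9571 0.896
```

The kappa of ~0.90 matches the "Missing Removed" row: roughly 90% of the agreement beyond chance is captured by the classifier.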
12. Data Pre-processing
[Decision trees compared side by side: missing values replaced (mean/mode) vs. missing values removed (mean/mode)]
13. Agenda
Objective
Dataset
Approach
Data Pre-Processing
Classification Methods
Decision Tree
Problems
Future direction
14. Classification Methods Comparison

PERFORMANCE EVALUATION (test set)

CLASSIFIER                Total Inst.   MAE   Act. Acc. Rate   Exp. Acc. Rate
Naïve Bayes                   233        4%        96%              90%
Neural Network                233       10%        91%              79%
Support Vector Machine        233        3%        97%              94%
DT – J48                      233        4%        97%              92%
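The deck runs this comparison in WEKA; as a rough stand-in, the same four-way comparison can be sketched in Python with scikit-learn. Note the assumptions: sklearn's CART tree approximates J48 (C4.5), and sklearn ships the different Wisconsin *Diagnostic* dataset, so the numbers will not match the table above; the workflow is the point.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Neural Network": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree (CART ~ J48)": DecisionTreeClassifier(random_state=0),
}

# 10-fold cross-validated accuracy for each classifier.
results = {}
for name, clf in classifiers.items():
    results[name] = cross_val_score(clf, X, y, cv=10).mean()
    print(f"{name:27s} {results[name]:.3f}")
```

Scaling matters for the neural network and SVM here, much as discretization choices matter in WEKA.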
15. Classification using Decision Tree
Decision Tree – WEKA J48 (C4.5)
Divide and conquer algorithm
Convert tree to Classification rules
J48 can handle numeric attributes; no need for discretization
Attribute Selection - Information gain
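Information gain, the attribute-selection measure named above, is the reduction in class entropy achieved by splitting on an attribute. A minimal from-scratch sketch (toy values, not the dataset):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Information gain of splitting `labels` on nominal `values`."""
    n = len(labels)
    split_entropy = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        split_entropy += len(subset) / n * entropy(subset)
    return entropy(labels) - split_entropy

# Toy example: the attribute perfectly separates the two classes,
# so the gain equals the full class entropy (1 bit for a 50/50 split).
values = [1, 1, 10, 10]
labels = ["B", "B", "M", "M"]
print(info_gain(values, labels))  # 1.0
```

J48 evaluates a measure like this at each node to pick the next split attribute.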
16. Attributes Selected – highest Information Gain
weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S weka.attributeSelection.Ranker

Rank   Attribute                      Information Gain
1      Uniformity of Cell Size             0.675
2      Uniformity of Cell Shape            0.66
3      Bare Nuclei                         0.564
4      Bland Chromatin                     0.543
5      Single Epithelial Cell Size         0.505
6      Normal Nucleoli                     0.466
7      Clump Thickness                     0.459
8      Marginal Adhesion                   0.443
9      Mitoses                             0.198

PERFORMANCE EVALUATION

DATASET               # RULES   MAE   Act. Acc. Rate   Exp. Acc. Rate
Attributes Selected      11      4%        97%              92%
Missing Removed          11      5%        96%              90%
Missing Replaced         14      7%        95%              89%
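The InfoGainAttributeEval + Ranker combination used above can be approximated in scikit-learn with `mutual_info_classif` (an assumption of this writeup: sklearn's estimator and its bundled Diagnostic dataset differ from the deck's, so the ranks and scores will not match the table):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

data = load_breast_cancer()

# Score each attribute by mutual information with the class,
# then rank highest first -- the same idea as WEKA's Ranker.
scores = mutual_info_classif(data.data, data.target, random_state=0)
ranked = sorted(zip(data.feature_names, scores), key=lambda t: -t[1])

for name, score in ranked[:5]:
    print(f"{name:25s} {score:.3f}")
```

Keeping only the top-ranked attributes can, as the table shows, simplify the tree (fewer rules) without hurting accuracy.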
18. Decision Tree - Problems
Concerns
Missing values
Pruning – Preprune or postprune
Estimating error rates
Unbalanced Dataset
Bias in prediction
Overfitting – fits the training set too closely, hurting test-set performance
Underfitting
19. Confusion Matrix – Performance Evaluation

The overall accuracy rate is the number of correct classifications divided by the total number of classifications:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)
    Error Rate = 1 - Accuracy

                      Predicted Class
                      B (2)    M (4)
Actual   B (2)         TP       FN
Class    M (4)         FP       TN

Accuracy is not a correct measure for an unbalanced dataset, where classes are unequally represented.
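Concretely, on this dataset's class distribution a degenerate classifier that always predicts Benign already looks respectable by raw accuracy, which is why accuracy alone misleads here:

```python
# Majority-class baseline on the slide-5 class distribution.
benign, malignant = 458, 241
total = benign + malignant          # 699 instances

# "Always predict Benign" gets every benign case right
# and every malignant case wrong -- yet scores 65.5%.
majority_baseline = benign / total
print(f"{majority_baseline:.3f}")   # 0.655
```

Any reported accuracy must be judged against this 65.5% floor, not against 50%.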
20. Unbalanced dataset problem
Solution: Stratified Sampling Method
Partitioning of dataset based on class
Random Sampling Process
Create training and test sets that preserve the class proportions
Testing set data independent from Training set.
Standard Verification technique
Best error estimate
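The stratified partitioning described above can be sketched with scikit-learn's `train_test_split` (the dummy feature column is an illustration; only the class labels matter here):

```python
from sklearn.model_selection import train_test_split

# Slide-5 class distribution: 458 benign, 241 malignant.
y = ["B"] * 458 + ["M"] * 241
X = list(range(len(y)))             # dummy feature: instance indices

# stratify=y partitions each class separately, so both splits
# keep the original 65.5% / 34.5% proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0)

print(len(y_te))                        # 233 test instances
print(round(y_tr.count("B") / len(y_tr), 3))  # ~0.655 benign in training
print(round(y_te.count("B") / len(y_te), 3))  # ~0.655 benign in test
```

Because the split is random within each class, the test set stays independent of the training set while both remain representative.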
24. Unbalanced dataset Problem
Solution: Cost Matrix
Cost sensitive classification
Costs not known
A complete financial analysis is needed, e.g. the cost of:
Using the ML tool
Gathering training data
Using the model
Determining the attributes for test
Cross Validation once all costs are known
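Once a cost matrix exists, cost-sensitive classification picks the class with the lowest expected cost rather than the highest probability. A minimal sketch: the costs below are purely illustrative, since (as noted above) the real costs are not known.

```python
import numpy as np

# cost[i][j] = cost of predicting class j when the truth is class i
# (rows/cols: Benign, Malignant). A missed malignancy (FN) is costed
# 10x a false alarm (FP) here -- an invented, illustrative ratio.
cost = np.array([[0.0, 1.0],
                 [10.0, 0.0]])

def min_cost_class(p_benign):
    """Choose B or M from P(Benign) by minimising expected cost."""
    probs = np.array([p_benign, 1.0 - p_benign])
    expected = probs @ cost          # expected cost of each prediction
    return ["B", "M"][int(np.argmin(expected))]

# With these costs the decision threshold shifts: a case that is
# 75% likely Benign is still called Malignant, because a miss
# is ten times as expensive as a false alarm.
print(min_cost_class(0.95))  # B
print(min_cost_class(0.75))  # M
```

With this cost matrix, "Malignant" is predicted whenever P(Malignant) exceeds 1/11, instead of the usual 1/2.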
25. Future direction
The overall accuracy of the classifier needs to be increased
Cluster based Stratified Sampling
Partitioning the original dataset using Kmeans Alg.
Multiple Classifier model
Bagging and Boosting techniques
ROC (Receiver Operating Characteristic)
Plotting the TP Rate (Y-axis) over FP Rate (X-Axis)
Advantage: does not depend on class distribution or error costs
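The ROC construction described above (sweep a threshold over the predicted class probability, plot TP rate against FP rate, integrate for AUC) can be sketched with scikit-learn. The scores below are made up for illustration:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Invented example: 1 = Malignant, scores = predicted P(Malignant).
y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.35, 0.8, 0.4, 0.7, 0.9, 0.95]

# Each threshold yields one (FPR, TPR) point on the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(round(auc, 3))  # 0.875
```

AUC equals the probability that a randomly chosen malignant case is scored above a randomly chosen benign one, which is why it ignores class distribution and error costs.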
26. ROC Curve - Visualization
• Area under the curve (AUC)
• The larger the area, the better the model
[ROC curves shown for the Benign class and for the Malignant class]