Breast cancer treatment remains an unresolved challenge for medical practitioners. The key to better outcomes is early diagnosis and treatment; however, even after early diagnosis and treatment there is a high chance of recurrence, so an accurate early prognosis helps patients receive better care. Data mining, as a knowledge discovery field, can contribute to better prognosis through higher prediction accuracy. In this report, working with the WEKA software and the Wisconsin Breast Cancer Database collected by Dr. William H. Wolberg of the University of Wisconsin Hospitals, we show how to build a decision tree with an improved accuracy rate and discuss how decision tree induction yields a better prediction tool.
1. Classification of Breast Cancer dataset using
Decision Tree Induction
Abel Medhanie Gebreyesus
Sunil Nair
HINF6210 Project Presentation – November 25, 2008
2. Agenda
Objective
Dataset
Approach
Classification Methods
Decision Tree
Problems
Future direction
11/25/2008 2
HINF6210/Project presentation/Abel/Sunil
3. Introduction
Breast Cancer prognosis
Breast cancer incidence is high
Improvement in diagnostic methods
Early diagnosis and treatment.
But, recurrence is high
Good prognosis is important….
4. Objective
Significance of project
Previous work done using this dataset
Most previous work indicated room for improving the classifier's accuracy
5. Breast Cancer Dataset
Wisconsin Breast Cancer Database (1991) University of Wisconsin Hospitals,
Dr. William H. Wolberg
# of Instances: 699
# of Attributes: 10 plus class attribute
Class distribution:
Benign (2): 458 (65.5%)
Malignant (4): 241 (34.5%)
Missing Values : 16
6. Attributes
• Indicate cellular characteristics
• Variables are continuous/ordinal with 10 levels

#    Attribute                      Domain
1    Sample code number             id number
2    Clump Thickness                1-10
3    Uniformity of Cell Size        1-10
4    Uniformity of Cell Shape       1-10
5    Marginal Adhesion              1-10
6    Single Epithelial Cell Size    1-10
7    Bare Nuclei                    1-10
8    Bland Chromatin                1-10
9    Normal Nucleoli                1-10
10   Mitoses                        1-10
11   Class                          Benign (2), Malignant (4)
7. Attributes / class - distribution
• The dataset is unbalanced (65.5% benign vs. 34.5% malignant)
8. Our Approach
Data Pre-processing
Comparison between Classification techniques
Decision Tree Induction
Attribute Selection
J48
Evaluation
9. Data Pre-processing
Filter out the ID column
Handle Missing Values
WEKA
10. Data preprocessing
Two options to manage missing data in WEKA:
"ReplaceMissingValues"
weka.filters.unsupervised.attribute.ReplaceMissingValues
Missing nominal and numeric attribute values are replaced with the modes and means respectively
Remove (delete) the tuples with missing values
Missing values: attribute Bare Nuclei = 16
Outliers
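The two WEKA options above (mean/mode replacement vs. tuple removal) can be sketched in Python with pandas. The column below is an invented toy stand-in for Bare Nuclei, not the actual dataset:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the Bare Nuclei column (ordinal 1-10, "?" read as NaN).
df = pd.DataFrame({
    "bare_nuclei": [1, 10, 2, np.nan, 4, np.nan, 1, 3],
    "clump_thickness": [5, 8, 1, 6, 4, 7, 2, 3],
})

# Option 1: replace missing values with the column mean
# (WEKA's ReplaceMissingValues uses means for numeric and
# modes for nominal attributes).
replaced = df.fillna(df.mean(numeric_only=True))

# Option 2: drop any tuple that contains a missing value.
removed = df.dropna()

print(replaced["bare_nuclei"].tolist())  # NaNs become 3.5 (the mean)
print(len(removed))                      # 6 rows remain
```

Option 1 keeps all 699 instances but may bias the imputed attribute; option 2 loses 16 tuples but only uses observed values.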
11. Comparison chart – Handling Missing Values

Confusion Matrix (test split; 223 of 233 instances correctly classified, accuracy rate 95.7%)

               Predicted B   Predicted M   Total
Actual B           160             7         167
Actual M             3            63          66
Total              163            70         233

PERFORMANCE EVALUATION

DATASET             # RULES   MAE   Act. Acc. Rate   Exp. Acc. Rate
Complete               14      8%        94%              87%
Missing Removed        11      5%        96%              90%
Missing Replaced       14      7%        95%              89%

How many predictions occur by chance? The Exp. Acc. Rate column reports the Kappa Statistic, used to measure the agreement between the predicted and actual categorization of the data while correcting for predictions that occur by chance.
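The kappa correction can be recomputed by hand from the confusion matrix on this slide (a plain recalculation for illustration, not WEKA output):

```python
# Cohen's kappa from the slide's confusion matrix (223/233 correct).
tp_b, fn_b = 160, 7    # actual Benign:    predicted B / predicted M
fp_b, tn_b = 3, 63     # actual Malignant: predicted B / predicted M
total = tp_b + fn_b + fp_b + tn_b   # 233

observed = (tp_b + tn_b) / total    # accuracy actually achieved
# Chance agreement: product of row and column marginals per class.
exp_b = (tp_b + fn_b) * (tp_b + fp_b) / total**2
exp_m = (fp_b + tn_b) * (fn_b + tn_b) / total**2
expected = exp_b + exp_m

kappa = (observed - expected) / (1 - expected)
print(round(observed, 4), round(kappa, 3))  # 0.9571 0.896
```

The kappa of ~0.90 matches the "Missing Removed" row: roughly 90% of the agreement beyond chance is captured by the classifier.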
12. Data Pre-processing
[Decision trees compared side by side: missing values replaced (mean/mode) vs. missing values removed (mean/mode)]
13. Agenda
Objective
Dataset
Approach
Data Pre-Processing
Classification Methods
Decision Tree
Problems
Future direction
14. Classification Methods Comparison

PERFORMANCE EVALUATION (test set)

CLASSIFIER                Total Inst.   MAE   Act. Acc. Rate   Exp. Acc. Rate
Naïve Bayes                   233        4%        96%              90%
Neural Network                233       10%        91%              79%
Support Vector Machine        233        3%        97%              94%
DT – J48                      233        4%        97%              92%
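The deck runs this comparison in WEKA; as a rough stand-in, the same four-way comparison can be sketched in Python with scikit-learn. Note the assumptions: sklearn's CART tree approximates J48 (C4.5), and sklearn ships the different Wisconsin *Diagnostic* dataset, so the numbers will not match the table above; the workflow is the point.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Neural Network": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree (CART ~ J48)": DecisionTreeClassifier(random_state=0),
}

# 10-fold cross-validated accuracy for each classifier.
results = {}
for name, clf in classifiers.items():
    results[name] = cross_val_score(clf, X, y, cv=10).mean()
    print(f"{name:27s} {results[name]:.3f}")
```

Scaling matters for the neural network and SVM here, much as discretization choices matter in WEKA.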
15. Classification using Decision Tree
Decision Tree – WEKA J48 (C4.5)
Divide and conquer algorithm
Convert tree to Classification rules
J48 can handle numeric attributes; no need for discretization
Attribute Selection - Information gain
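Information gain, the attribute-selection measure named above, is the reduction in class entropy achieved by splitting on an attribute. A minimal from-scratch sketch (toy values, not the dataset):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Information gain of splitting `labels` on nominal `values`."""
    n = len(labels)
    split_entropy = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        split_entropy += len(subset) / n * entropy(subset)
    return entropy(labels) - split_entropy

# Toy example: the attribute perfectly separates the two classes,
# so the gain equals the full class entropy (1 bit for a 50/50 split).
values = [1, 1, 10, 10]
labels = ["B", "B", "M", "M"]
print(info_gain(values, labels))  # 1.0
```

J48 evaluates a measure like this at each node to pick the next split attribute.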
16. Attributes Selected – highest Information Gain
weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S weka.attributeSelection.Ranker

Rank   Attribute                      Information Gain
1      Uniformity of Cell Size             0.675
2      Uniformity of Cell Shape            0.66
3      Bare Nuclei                         0.564
4      Bland Chromatin                     0.543
5      Single Epithelial Cell Size         0.505
6      Normal Nucleoli                     0.466
7      Clump Thickness                     0.459
8      Marginal Adhesion                   0.443
9      Mitoses                             0.198

PERFORMANCE EVALUATION

DATASET               # RULES   MAE   Act. Acc. Rate   Exp. Acc. Rate
Attributes Selected      11      4%        97%              92%
Missing Removed          11      5%        96%              90%
Missing Replaced         14      7%        95%              89%
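The InfoGainAttributeEval + Ranker combination used above can be approximated in scikit-learn with `mutual_info_classif` (an assumption of this writeup: sklearn's estimator and its bundled Diagnostic dataset differ from the deck's, so the ranks and scores will not match the table):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

data = load_breast_cancer()

# Score each attribute by mutual information with the class,
# then rank highest first -- the same idea as WEKA's Ranker.
scores = mutual_info_classif(data.data, data.target, random_state=0)
ranked = sorted(zip(data.feature_names, scores), key=lambda t: -t[1])

for name, score in ranked[:5]:
    print(f"{name:25s} {score:.3f}")
```

Keeping only the top-ranked attributes can, as the table shows, simplify the tree (fewer rules) without hurting accuracy.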
18. Decision Tree - Problems
Concerns
Missing values
Pruning – Preprune or postprune
Estimating error rates
Unbalanced Dataset
Bias in prediction
Overfitting – fits the training set too closely, hurting test-set performance
Underfitting
19. Confusion Matrix – Performance Evaluation

The overall accuracy rate is the number of correct classifications divided by the total number of classifications:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)
    Error Rate = 1 - Accuracy

                      Predicted Class
                      B (2)    M (4)
Actual   B (2)         TP       FN
Class    M (4)         FP       TN

Accuracy is not a correct measure for an unbalanced dataset, where classes are unequally represented.
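Concretely, on this dataset's class distribution a degenerate classifier that always predicts Benign already looks respectable by raw accuracy, which is why accuracy alone misleads here:

```python
# Majority-class baseline on the slide-5 class distribution.
benign, malignant = 458, 241
total = benign + malignant          # 699 instances

# "Always predict Benign" gets every benign case right
# and every malignant case wrong -- yet scores 65.5%.
majority_baseline = benign / total
print(f"{majority_baseline:.3f}")   # 0.655
```

Any reported accuracy must be judged against this 65.5% floor, not against 50%.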
20. Unbalanced dataset problem
Solution: Stratified Sampling Method
Partitioning of dataset based on class
Random Sampling Process
Create training and test sets that preserve the class proportions
Testing set data independent from Training set.
Standard Verification technique
Best error estimate
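The stratified partitioning described above can be sketched with scikit-learn's `train_test_split` (the dummy feature column is an illustration; only the class labels matter here):

```python
from sklearn.model_selection import train_test_split

# Slide-5 class distribution: 458 benign, 241 malignant.
y = ["B"] * 458 + ["M"] * 241
X = list(range(len(y)))             # dummy feature: instance indices

# stratify=y partitions each class separately, so both splits
# keep the original 65.5% / 34.5% proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0)

print(len(y_te))                        # 233 test instances
print(round(y_tr.count("B") / len(y_tr), 3))  # ~0.655 benign in training
print(round(y_te.count("B") / len(y_te), 3))  # ~0.655 benign in test
```

Because the split is random within each class, the test set stays independent of the training set while both remain representative.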
24. Unbalanced dataset Problem
Solution: Cost Matrix
Cost sensitive classification
Costs not known
A complete financial analysis is needed, e.g. the cost of:
Using the ML tool
Gathering training data
Using the model
Determining the attributes for test
Cross Validation once all costs are known
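Once a cost matrix exists, cost-sensitive classification picks the class with the lowest expected cost rather than the highest probability. A minimal sketch: the costs below are purely illustrative, since (as noted above) the real costs are not known.

```python
import numpy as np

# cost[i][j] = cost of predicting class j when the truth is class i
# (rows/cols: Benign, Malignant). A missed malignancy (FN) is costed
# 10x a false alarm (FP) here -- an invented, illustrative ratio.
cost = np.array([[0.0, 1.0],
                 [10.0, 0.0]])

def min_cost_class(p_benign):
    """Choose B or M from P(Benign) by minimising expected cost."""
    probs = np.array([p_benign, 1.0 - p_benign])
    expected = probs @ cost          # expected cost of each prediction
    return ["B", "M"][int(np.argmin(expected))]

# With these costs the decision threshold shifts: a case that is
# 75% likely Benign is still called Malignant, because a miss
# is ten times as expensive as a false alarm.
print(min_cost_class(0.95))  # B
print(min_cost_class(0.75))  # M
```

With this cost matrix, "Malignant" is predicted whenever P(Malignant) exceeds 1/11, instead of the usual 1/2.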
25. Future direction
The overall accuracy of the classifier needs to be increased
Cluster based Stratified Sampling
Partitioning the original dataset using Kmeans Alg.
Multiple Classifier model
Bagging and Boosting techniques
ROC (Receiver Operating Characteristic)
Plotting the TP Rate (Y-axis) over FP Rate (X-Axis)
Advantage: does not depend on class distribution or error costs
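The ROC construction described above (sweep a threshold over the predicted class probability, plot TP rate against FP rate, integrate for AUC) can be sketched with scikit-learn. The scores below are made up for illustration:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Invented example: 1 = Malignant, scores = predicted P(Malignant).
y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.35, 0.8, 0.4, 0.7, 0.9, 0.95]

# Each threshold yields one (FPR, TPR) point on the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(round(auc, 3))  # 0.875
```

AUC equals the probability that a randomly chosen malignant case is scored above a randomly chosen benign one, which is why it ignores class distribution and error costs.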
26. ROC Curve - Visualization
• Area under the curve (AUC)
• The larger the area, the better the model
[ROC curves shown for the Benign class and for the Malignant class]