SlideShare ist ein Scribd-Unternehmen logo
1 von 6
Downloaden Sie, um offline zu lesen
1
CLINICAL DECISION SUPPORT FOR HEART DISEASE
USING PREDICTIVE MODELS
Srpriva Sundaraman
Northwestern University
SripriyaSundararaman2013@u.northwestern.edu
Sunil Kakade
Northwestern University
Sunil.kakade@gmail.com
Abstract--Over the last decade, data-driven decision
making has become very prevalent in the healthcare
industry. Computer systems that perform data analysis
and automated clinical decision-making are called
Clinical Decision Support (CDS) systems. CDS
enables healthcare providers take quick and effective
decisions based on existing patient data. Data-driven
evidence-based treatment is proven to have higher
success rates than intuition driven eminence-based
treatment.
This objective of this project is to build a clinical
decision support system for predicting the presence of
heart disease using a classification model. This is a
supervised learning problem with binary target
outcomes indicating either the presence or absence of
heart disease. Other models like clustering or logistic
regression have been used for such problems in the
past, but a classification model would be most ideal
due to the nature of the data.
Several data mining classification algorithms like
artificial neural networks, naïve bayes, and decision
tree were evaluated for performance. The best
performing model in terms of misclassification, false
positive rate, and area under ROC curve was selected
for data analysis. This project analyzes the model
performance results and suggests next steps to improve
upon this clinical decision support model.
I. INTRODUCTION
Healthcare is one of the largest sectors in the United
States. The US spends approximately $2.5 trillion in
healthcare costs each year. Each year about $250
billion of this is attributed to waste, fraud, and abuse.
Data mining and analytics reduce some of the
healthcare waste by improving outcomes. One of the
ways healthcare providers can improve outcomes is by
using data-driven decision making. Data mining is
well suited to help decision making in a healthcare
setting for the following reasons –
Healthcare organizations generate large amount of
data, thus making it a suitable domain for data mining.
The Affordable Care Act by the US government puts
the onus on healthcare providers to improve outcomes.
With the advent of computing power and medical
technology, diverse and elaborate classification
algorithms have been developed to mine the data.
Due to the reasons mentioned above, Data-driven
decision making has become very popular in the
healthcare domain and are called Clinical Decision
Support systems (CDS). CDS goes beyond prediction
by combining prediction with real-time data analysis
and rule based reasoning to arrive at more accurate
decisions.
This objective of this project is to build a clinical
decision support system for predicting the presence of
heart disease. A classification data model is proposed
to detect patterns in existing heart patient data. In the
past other researchers have tried to run classification
models on the same data. Logistic regression models
have generated an accuracy of 77%. Noise tolerant
instance based learning algorithms have been 77%
accurate as well. Clustering algorithms have shown a
marginal improvement in performance with 78.9%
accuracy. The aim of this project will be to create a
model with better accuracy than before.
The outcome from this classification model could be
used as an initial screening for patients. The prediction
would also be useful for physicians when they need to
make quick decisions (ex. Emergency rooms, operation
theaters). The classification model enables physicians
to tap into the “wisdom of other physicians” and come
up with accurate prediction of heart disease in new
patients by classification patterns in existing patient
data.
II. DATA UNDERSTANDING
Data Source - The data used for analysis is collected
from the University of California, Irvine machine
2
learning data repository. The creator of this data set are
– Andras Janosi (Hungarian Institute of Cardiology),
William Steinbrunn (University Hospital, Zurich),
Matthias Pfisterer (University Hospital, Basel), and
Robert Detrano (VA Medical Center and the Cleveland
Clinic Foundation).
This data mining exercise uses two of the datasets from
this heart disease database – The Cleveland Clinic data
for training, and the Switzerland data for testing. The
database originally consisted of 76 attributes with
details about patient body characteristics. The goal was
to determine if presence of heart disease can be
diagnosed from this data.
Data Attributes and Instances - While the original
dataset had 76 attributes, only a subset of the 14 most
important ones were published on the public domain.
Some of the patient sensitive information like name,
SSN, etc. was removed for security purposes. There are
303 instances of patient data in the training dataset and
123 instances in the test data set. The output attribute is
an integer that represents the presence or absence of
heart disease. There were 4 missing values for attribute
“ca’ and 2 missing values for attribute “thal”.
The 14 attributes of the patient are described below:
1. age (numeric) - age of the patient in years
2. sex (numeric) - represented as a binary number (1 =
male, 0 = female)
3. cp (numeric) - represents chest pain type as an
integer. Values range from 1 to 4.
• Value 1: typical angina
• Value 2: atypical angina
• Value 3: non-angina pain
• Value 4: asymptomatic	
  
4. trestbps (numeric) - resting blood pressure measured
in mm Hg on admission to the hospital.
5. chol (numeric) - serum cholesterol of the patient
measured in mg/dl
6. fbs (numeric) - fasting blood sugar of the patient. If
greater than 120 mg/dl the attribute value is 1 (true),
else the attribute value is 0 (false)
7. restecg (numeric) - resting electrocardiographic
results for the patient. This attribute can take 3 integer
values - 0, 1, or 2.
• Value 0: normal
• Value 1: having ST-T wave abnormality
• Value 2: showing probable or definite left
ventricular hypertrophy	
  
8. thalach (numeric) - maximum heart rate of the
patient.
9. exang (numeric) - exercise induced angina. Values
can be 1 for yes or 0 for no.
10. oldpeak (numeric) - measurement of depression
induced by exercise relative to rest
11. slope (numeric) - measure of slope for peak
exercise. Values can be 1, 2, or 3
• Value 1: up sloping
• Value 2: flat
• Value 3: down sloping	
  
12. ca (numeric) - represents number of major vessels.
Attribute values can be 0 to 3.
13. thal (numeric) - represents heart rate of the patient.
It can take values 3, 6, or 7.
• Value 3: normal
• Value 6: fixed defect
• Value 7: reversible defect
14. num (numeric) - contains a numeric value between
0 and 4. Each value represents a heart disease or
absence of all of them.
• Value 0: absence of heart disease
• Value 1 to 4: presence of different heart
diseases
III. DATA PREPARATION
The data mining tool for creating the model would be
WEKA (version 3.6.10), which is an open source
software. The data was obtained in a comma-separated
value (CSV) format. The following modifications were
done to prepare the dataset for training.
Data modification - Since we are only interested in the
presence or absence of heart disease and not the exact
disease classification, the output values of the original
data was modified. The Output value (num) of the
dataset contained 4 values. A value of 0 indicated no
heart disease. Values between 1 and 4 indicated
different heart diseases respectively. The num values of
1 to 4 were changed to 1 in Microsoft Excel.
Data type conversions - The output values (num)
containing the class label were integers. Classification
algorithms require the output to be binary in Weka.
The output values of 1 were converted to “yes” and the
values of 0 were converted to “no” respectively.
Imputation of missing values – The modified CSV file
was imported into Weka. This file had 6 missing values
(4 missing values for attribute “ca” and 2 missing
values for the attribute “thal”). The missing data values
were imputed using the “ReplaceMissingValues”
function under the preprocessing section. Weka uses
local averages to determine the imputed values.
Test data set - The training dataset contained 303
3
instances from the Cleveland Clinic data set. The test
data consisted of 123 records from the Switzerland
University hospitals. The test data was formatted the
same way as the training data. Classification variable
was converted to binary values (yes/no) and the
missing values were imputed. The test data set had
several missing values as compared to the training data
set.
IV. DATA MINING ALGORITHMS
The Cleveland dataset contained characteristics as well
as the diagnosis of patients with heart disease. This is a
supervised learning problem because we have a
specific target in the test data, which can be used to
predict output for new use cases. Algorithms like
classification, regression, or causal modeling are used
for supervised learning.
The output in the training and test data is binary
because the heart disease is either present or not. Hence
classification algorithms would be ideal to train the
Cleveland dataset. The following steps define the
classification workflow –
1. The training and test data is preprocessed.
2. Several learning algorithms are evaluated for
performance and the best one is selected.
3. A training set consisting of records whose output
class labels are known, is used to build the
classification model by using the learning algorithm
4. After training, this model is run against the test set
which consists of records with unknown labels.
5. The performance of classification model is
measured based on the number of correct
classifications.
The training dataset was run on three different
classification algorithms on Weka – Naïve Bayes, J48
Decision Tree, and Artificial Neural Networks using
10-fold cross validation. Cross validation is a technique
for estimating the performance of the model by using
part of the training set for testing. The most accurate
algorithm was determined by the misclassification
error rate, Area under ROC curve, and number of false
positives.
The misclassification error determines the model
accuracy in terms of percentage of incorrect predictions
made by the model. The AUC determines the
predictive capability of the model. This model will be
used for heart disease predictions and hence it is
important for the selected model to not only perform
well on the training data, but to also have a high
predictive accuracy.
The false positives i.e. number of cases that where the
patient had a heart disease, but the model incorrectly
predicts those patients as healthy. This error is very
important in this context. False positives could cause a
potential heart patient to go untreated and may be even
life threatening. We evaluate the models based on their
false positives count. Models with lower number of
false positives are desired.
Three classification algorithms – (1) Naïve Bayes, (2)
Decision Trees, and (3) Artificial Neural Networks
(ANN) were used in the analysis.
Comparison of the performance of the three algorithms
on Weka –
Algorit
hm
Misclassific
ation rate
(%)
Area
under
ROC
curve(A
UC)
False
Positi
ves
Time
taken
(secon
ds)
Naïve
Bayes
16.5 0.891 31 0
J48
Decisio
n Tree
20.79 0.783 36 0.04
Artifici
al
Neural
Networ
ks
22.77 0.833 46 0.58
The ROC chart for the three algorithms –
Naïve Bayes-AUC
J48 Decision Tree – AUC
4
Neural Network- AUC
Among the three algorithms, Naïve Bayes had the
lowest misclassification error, highest area under the
ROC curve (AUC), lowest false positive count, and
fastest time. Hence Naïve Bayes algorithm was
selected to train and test the Heart Disease training and
test model.
V. NAÏVE BAYES MODEL -
EXPERIMENTAL RESULTS AND
ANALYSIS
Naïve Bayes algorithm was run against the training
data set with 10-fold cross-validation, where one-tenths
of the data was held for testing and the rest of the data
was used for training. The algorithm took 0 seconds to
run through 303 records in the Cleveland Clinic
training data set.
Analysis of the training data results –
1. Accuracy - The model accuracy was 83.5% and the
error rate was 16.5%. Out of all the cases evaluated
83.5% of heart disease outcome predictions were
correct and 16.5% of predictions were wrong.
2. Confusion matrix – The classifier correctly
identified (true positives) 145 records as patients who
won’t have heart disease. 31 cases were identified as
healthy by the model, but had a heart disease (false
positives). 108 records are correctly classified as
patients having heart disease by the model. 19 cases are
identified as heart disease patients by the model, but
they were healthy in reality (false negatives).
3. True positive rate (TP Rate) and false positive rate
(FP Rate) – The Naïve Bayes algorithms gets the
weighted average of true positives and false positives
for both outcomes (yes and no). The average TP Rate
or sensitivity is 83.5%. And the average FP Rate is
17.4%. For 83.5% of the cases the model is correctly
predicting the existence of heart disease.
4. Precision – The average precision for this model is
0.836 which means 83.6% of the predictions by the
model are correct.
5. AUC - The area under the ROC curve (AUC) is
0.891. This represents how well the model is able to
distinguish between the target classes. A score of 0.667
is good model accuracy.
The test data (Switzerland university hospitals) was
used to evaluate the Naïve Bayes model built using the
Cleveland Clinic training data.
Analysis of the test data results –
1. Accuracy - The model accuracy for the test set was
70.73% and the error rate was 29.26%. Out of all the
cases evaluated 70.73% of heart disease outcome
predictions were correct and 29.26% of predictions
were wrong.
2. Confusion matrix – The classifier correctly
identified 5 records as patients who won’t have heart
disease. 33 cases are identified as healthy by the model,
but they had a heart disease in reality. 82 records are
correctly classified as patients having heart disease by
the model. 3 cases are identified as heart disease
patients by the model, but they were healthy in reality.
3. True positive rate (TP Rate) and false positive rate
(FP Rate) – The Naïve Bayes algorithms gets the
weighted average of true positives and false positives
for both outcomes (yes and no). The average TP Rate
or sensitivity is 70.7%. And the average FP Rate is
36.9%, which is almost double that of the training set.
For 70.7% of the cases the model is correctly
predicting the existence of heart disease.
4. Precision and Recall – The average precision for
this model is 0.836 which means 83.6% of the
predictions by the model are correct. This is the same
as the true positive rate. The harmonic mean of
precision and recall is F-measure. The weighted
average of f-measure is 0.834.
5. AUC - The area under the ROC curve (AUC) is
0.667. This represents how well the model is able to
distinguish between the target classes. A score of 0.667
is average performance for model accuracy.
The ROC threshold curve for both the training and test
sets –
5
V1. CONCLUSION
The Naïve Bayes model performed very well on the
training data. On the test data the performance reduced.
The model accuracy, and area under ROC curve were
both significantly lower in the test data set. There could
be several reasons for the poor performance –
1) Even though the training data was testing
continuously using cross-validation, the data being
tested was the same as the data being trained. The high
ROC values in training could have happened due to
model over-fitting.
2) The Switzerland patient data in the test set had
several missing values that were imputed in Weka
using local averages. These imputed values could have
been inaccurate.
3) The training data set was quite small. 300 instances
is probably not enough to build a reliable model.
Despite the “average” performance on the test set, this
model is still useful for predicting heart disease in
patients in 70% of the cases. This is definitely a better
solution as opposed to not knowing any information
about the patient. This kind of prediction of heart
disease could be used as a first level diagnosis for
doctors, emergency room services and other healthcare
providers. If a patient was predicted to have heart
disease, physicians could perform further tests on these
patients to confirm the diagnosis. This kind of clinical
decision support is a win-win proposition that helps
save time for doctors and save lives for patients.
VII. FUTURE WORK
The training data set had good quality data, but there
were very few instances to train the classification
model. The test data set was of poor quality with plenty
of missing attributes. The number of records were also
inadequate. Despite these shortcomings the model
performance was above average on the test data and
very good for the training data. With the right kind of
training and test data, the same classification model can
be used to generate better performance and results. The
following are a few ways to improve the classification
model –
1. The missing values in the dataset could be validated
with the corresponding hospitals to generate a more
accurate training and test data set.
2. Additional patient information could be collected
from EMR databases in hospitals. This would generate
more attributes, but we can perform attribute selection
to filter the most important ones.
3. The UCI heart disease data was only collected from
4 hospitals. The heart disease database could be
expanded to add more hospitals and patient information
to create a more diverse training set.
4. The heart disease model could be expanded beyond
predicting occurrence. We could identify the exact
heart disorder from the existing data. This would aid
the physicians further more.
5. Similar classification models can be trained for
other diseases apart from heart diseases.
REFERENCE
Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar.
Introduction to Data Mining. Boston: Pearson Addison
Wesley, 2005.
Provost F., Fawcett T. (2013). Data Science for
Business (pp. 20-40). Sebastopol, CA: O’Reilly Media
Inc.
Bache, K. & Lichman, M. (2013). UCI Machine
Learning Repository [http://archive.ics.uci.edu/ml].
Irvine, CA: University of California, School of
Information and Computer Science.
Caroline Clabaugh, Dave Myszewski, and Jimmy Pang
(2000). Neural network applications. Eric Roberts'
Sophomore College.
URL:http://cs.stanford.edu/people/eroberts/courses/soc
o/projects/2000-01/neural-
networks/Applications/miscellaneous.html
Doust, Dominick and Walsh, Zack, "Data Mining
Clustering: A Healthcare Application" (2011).MCIS
2011 Proceedings. Paper 65.
http://aisel.aisnet.org/mcis2011/65
OECD Health Data (2012).
URL:www.oecd.org/unitedstates/HealthSpendingInUS
A_HealthData2012.pdf
WEKA on Wikipedia. URL:
http://en.wikipedia.org/wiki/WEKA
Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M.,
Schmid, J., Sandhu, S., Guppy, K., Lee, S., &
Froelicher, V. (1989). International application of a
new probability algorithm for the diagnosis of coronary
artery disease. American Journal of Cardiology,
6
64,304--310.
David W. Aha & Dennis Kibler. "Instance-based
prediction of heart-disease presence with the Cleveland
database."
Gennari, J.H., Langley, P, & Fisher, D. (1989). Models
of incremental concept formation. Artificial
Intelligence, 40, 11--61.

Weitere ähnliche Inhalte

Was ist angesagt?

A Heart Disease Prediction Model using Decision Tree
A Heart Disease Prediction Model using Decision TreeA Heart Disease Prediction Model using Decision Tree
A Heart Disease Prediction Model using Decision TreeIOSR Journals
 
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHMHEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHMamiteshg
 
Survey on data mining techniques in heart disease prediction
Survey on data mining techniques in heart disease predictionSurvey on data mining techniques in heart disease prediction
Survey on data mining techniques in heart disease predictionSivagowry Shathesh
 
Classification of Heart Diseases Patients using Data Mining Techniques
Classification of Heart Diseases Patients using Data Mining TechniquesClassification of Heart Diseases Patients using Data Mining Techniques
Classification of Heart Diseases Patients using Data Mining TechniquesLovely Professional University
 
Survey on data mining techniques in heart disease prediction
Survey on data mining techniques in heart disease predictionSurvey on data mining techniques in heart disease prediction
Survey on data mining techniques in heart disease predictionSivagowry Shathesh
 
A Heart Disease Prediction Model using Logistic Regression By Cleveland DataBase
A Heart Disease Prediction Model using Logistic Regression By Cleveland DataBaseA Heart Disease Prediction Model using Logistic Regression By Cleveland DataBase
A Heart Disease Prediction Model using Logistic Regression By Cleveland DataBaseijtsrd
 
Heart Disease Prediction using Machine Learning Algorithm
Heart Disease Prediction using Machine Learning AlgorithmHeart Disease Prediction using Machine Learning Algorithm
Heart Disease Prediction using Machine Learning Algorithmijtsrd
 
A Heart Disease Prediction Model using Logistic Regression
A Heart Disease Prediction Model using Logistic RegressionA Heart Disease Prediction Model using Logistic Regression
A Heart Disease Prediction Model using Logistic Regressionijtsrd
 
Chronic Kidney Disease Prediction
Chronic Kidney Disease PredictionChronic Kidney Disease Prediction
Chronic Kidney Disease PredictionRajandeep Gill
 
A Survey on Heart Disease Prediction Techniques
A Survey on Heart Disease Prediction TechniquesA Survey on Heart Disease Prediction Techniques
A Survey on Heart Disease Prediction Techniquesijtsrd
 
Heart disease prediction system
Heart disease prediction systemHeart disease prediction system
Heart disease prediction systemSWAMI06
 
Psdot 14 using data mining techniques in heart
Psdot 14 using data mining techniques in heartPsdot 14 using data mining techniques in heart
Psdot 14 using data mining techniques in heartZTech Proje
 
Heart disease prediction using Naïve Bayes
Heart disease prediction using Naïve BayesHeart disease prediction using Naïve Bayes
Heart disease prediction using Naïve BayesIRJET Journal
 
HEALTH PREDICTION ANALYSIS USING DATA MINING
HEALTH PREDICTION ANALYSIS USING DATA  MININGHEALTH PREDICTION ANALYSIS USING DATA  MINING
HEALTH PREDICTION ANALYSIS USING DATA MININGAshish Salve
 
IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...
IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...
IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...IRJET Journal
 
PSO-An Intellectual Technique for Feature Reduction on Heart Malady Anticipat...
PSO-An Intellectual Technique for Feature Reduction on Heart Malady Anticipat...PSO-An Intellectual Technique for Feature Reduction on Heart Malady Anticipat...
PSO-An Intellectual Technique for Feature Reduction on Heart Malady Anticipat...Sivagowry Shathesh
 
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.SUJIT SHIBAPRASAD MAITY
 
Prediction of Heart Disease Using Data Mining Techniques- A Review
Prediction of Heart Disease Using Data Mining Techniques- A ReviewPrediction of Heart Disease Using Data Mining Techniques- A Review
Prediction of Heart Disease Using Data Mining Techniques- A ReviewIRJET Journal
 
Heart Attack Prediction using Machine Learning
Heart Attack Prediction using Machine LearningHeart Attack Prediction using Machine Learning
Heart Attack Prediction using Machine Learningmohdshoaibuddin1
 

Was ist angesagt? (20)

A Heart Disease Prediction Model using Decision Tree
A Heart Disease Prediction Model using Decision TreeA Heart Disease Prediction Model using Decision Tree
A Heart Disease Prediction Model using Decision Tree
 
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHMHEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
 
Survey on data mining techniques in heart disease prediction
Survey on data mining techniques in heart disease predictionSurvey on data mining techniques in heart disease prediction
Survey on data mining techniques in heart disease prediction
 
Classification of Heart Diseases Patients using Data Mining Techniques
Classification of Heart Diseases Patients using Data Mining TechniquesClassification of Heart Diseases Patients using Data Mining Techniques
Classification of Heart Diseases Patients using Data Mining Techniques
 
Survey on data mining techniques in heart disease prediction
Survey on data mining techniques in heart disease predictionSurvey on data mining techniques in heart disease prediction
Survey on data mining techniques in heart disease prediction
 
A Heart Disease Prediction Model using Logistic Regression By Cleveland DataBase
A Heart Disease Prediction Model using Logistic Regression By Cleveland DataBaseA Heart Disease Prediction Model using Logistic Regression By Cleveland DataBase
A Heart Disease Prediction Model using Logistic Regression By Cleveland DataBase
 
Heart Disease Prediction using Machine Learning Algorithm
Heart Disease Prediction using Machine Learning AlgorithmHeart Disease Prediction using Machine Learning Algorithm
Heart Disease Prediction using Machine Learning Algorithm
 
A Heart Disease Prediction Model using Logistic Regression
A Heart Disease Prediction Model using Logistic RegressionA Heart Disease Prediction Model using Logistic Regression
A Heart Disease Prediction Model using Logistic Regression
 
Chronic Kidney Disease Prediction
Chronic Kidney Disease PredictionChronic Kidney Disease Prediction
Chronic Kidney Disease Prediction
 
A Survey on Heart Disease Prediction Techniques
A Survey on Heart Disease Prediction TechniquesA Survey on Heart Disease Prediction Techniques
A Survey on Heart Disease Prediction Techniques
 
Heart disease prediction system
Heart disease prediction systemHeart disease prediction system
Heart disease prediction system
 
E04733639
E04733639E04733639
E04733639
 
Psdot 14 using data mining techniques in heart
Psdot 14 using data mining techniques in heartPsdot 14 using data mining techniques in heart
Psdot 14 using data mining techniques in heart
 
Heart disease prediction using Naïve Bayes
Heart disease prediction using Naïve BayesHeart disease prediction using Naïve Bayes
Heart disease prediction using Naïve Bayes
 
HEALTH PREDICTION ANALYSIS USING DATA MINING
HEALTH PREDICTION ANALYSIS USING DATA  MININGHEALTH PREDICTION ANALYSIS USING DATA  MINING
HEALTH PREDICTION ANALYSIS USING DATA MINING
 
IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...
IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...
IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...
 
PSO-An Intellectual Technique for Feature Reduction on Heart Malady Anticipat...
PSO-An Intellectual Technique for Feature Reduction on Heart Malady Anticipat...PSO-An Intellectual Technique for Feature Reduction on Heart Malady Anticipat...
PSO-An Intellectual Technique for Feature Reduction on Heart Malady Anticipat...
 
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
 
Prediction of Heart Disease Using Data Mining Techniques- A Review
Prediction of Heart Disease Using Data Mining Techniques- A ReviewPrediction of Heart Disease Using Data Mining Techniques- A Review
Prediction of Heart Disease Using Data Mining Techniques- A Review
 
Heart Attack Prediction using Machine Learning
Heart Attack Prediction using Machine LearningHeart Attack Prediction using Machine Learning
Heart Attack Prediction using Machine Learning
 

Andere mochten auch

Giáo dục môi trường tại các trung tâm nghiên cứu
Giáo dục môi trường tại các trung tâm nghiên cứuGiáo dục môi trường tại các trung tâm nghiên cứu
Giáo dục môi trường tại các trung tâm nghiên cứuThành Nguyễn
 
Multimedia project am rev
Multimedia project am revMultimedia project am rev
Multimedia project am revGail Waters
 
Marist Financial Aid
Marist Financial AidMarist Financial Aid
Marist Financial AidBrian Apfel
 
GIÁO DỤC MÔI TRƯỜNG NGOÀI THỰC ĐỊA
GIÁO DỤC MÔI TRƯỜNG NGOÀI THỰC ĐỊAGIÁO DỤC MÔI TRƯỜNG NGOÀI THỰC ĐỊA
GIÁO DỤC MÔI TRƯỜNG NGOÀI THỰC ĐỊAThành Nguyễn
 
End of Summer Intern Presentation-3
End of Summer Intern Presentation-3End of Summer Intern Presentation-3
End of Summer Intern Presentation-3Rachel Howard
 
Тренинг по медиа-рекламе
Тренинг по медиа-рекламеТренинг по медиа-рекламе
Тренинг по медиа-рекламеDIEVO
 
Carlos luna: Despertando y estimulando el pensamiento creativo
Carlos luna: Despertando y estimulando el pensamiento creativoCarlos luna: Despertando y estimulando el pensamiento creativo
Carlos luna: Despertando y estimulando el pensamiento creativoplataformabotin
 
Marketing Consulting And Independent Contract Agreement
Marketing Consulting And Independent Contract AgreementMarketing Consulting And Independent Contract Agreement
Marketing Consulting And Independent Contract Agreementsasha lugo
 

Andere mochten auch (17)

Giáo dục môi trường tại các trung tâm nghiên cứu
Giáo dục môi trường tại các trung tâm nghiên cứuGiáo dục môi trường tại các trung tâm nghiên cứu
Giáo dục môi trường tại các trung tâm nghiên cứu
 
MahoutNew
MahoutNewMahoutNew
MahoutNew
 
Termodinamica
TermodinamicaTermodinamica
Termodinamica
 
Profile NOVAON Internet Group
Profile NOVAON Internet Group Profile NOVAON Internet Group
Profile NOVAON Internet Group
 
Cinetica
CineticaCinetica
Cinetica
 
Texvyn Prolearn 2016
Texvyn Prolearn 2016 Texvyn Prolearn 2016
Texvyn Prolearn 2016
 
Multimedia project am rev
Multimedia project am revMultimedia project am rev
Multimedia project am rev
 
Marist Financial Aid
Marist Financial AidMarist Financial Aid
Marist Financial Aid
 
GIÁO DỤC MÔI TRƯỜNG NGOÀI THỰC ĐỊA
GIÁO DỤC MÔI TRƯỜNG NGOÀI THỰC ĐỊAGIÁO DỤC MÔI TRƯỜNG NGOÀI THỰC ĐỊA
GIÁO DỤC MÔI TRƯỜNG NGOÀI THỰC ĐỊA
 
Overview wifi marketing of S-wifi
Overview wifi marketing of S-wifiOverview wifi marketing of S-wifi
Overview wifi marketing of S-wifi
 
FINISHING SCHOOL
FINISHING SCHOOLFINISHING SCHOOL
FINISHING SCHOOL
 
Don't make me think
Don't make me thinkDon't make me think
Don't make me think
 
Barranquilla1
Barranquilla1Barranquilla1
Barranquilla1
 
End of Summer Intern Presentation-3
End of Summer Intern Presentation-3End of Summer Intern Presentation-3
End of Summer Intern Presentation-3
 
Тренинг по медиа-рекламе
Тренинг по медиа-рекламеТренинг по медиа-рекламе
Тренинг по медиа-рекламе
 
Carlos luna: Despertando y estimulando el pensamiento creativo
Carlos luna: Despertando y estimulando el pensamiento creativoCarlos luna: Despertando y estimulando el pensamiento creativo
Carlos luna: Despertando y estimulando el pensamiento creativo
 
Marketing Consulting And Independent Contract Agreement
Marketing Consulting And Independent Contract AgreementMarketing Consulting And Independent Contract Agreement
Marketing Consulting And Independent Contract Agreement
 

Ähnlich wie Clinical_Decision_Support_For_Heart_Disease

javed_prethesis2608 on predcition of heart disease
javed_prethesis2608 on predcition of heart diseasejaved_prethesis2608 on predcition of heart disease
javed_prethesis2608 on predcition of heart diseasejaved75
 
IRJET -Improving the Accuracy of the Heart Disease Prediction using Hybrid Ma...
IRJET -Improving the Accuracy of the Heart Disease Prediction using Hybrid Ma...IRJET -Improving the Accuracy of the Heart Disease Prediction using Hybrid Ma...
IRJET -Improving the Accuracy of the Heart Disease Prediction using Hybrid Ma...IRJET Journal
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Prediction of heart disease using classification mining technique on spark
Prediction of heart disease using classification mining technique on sparkPrediction of heart disease using classification mining technique on spark
Prediction of heart disease using classification mining technique on sparkdbpublications
 
predictionofheartdiseaseusingmachinelearning.pdf
predictionofheartdiseaseusingmachinelearning.pdfpredictionofheartdiseaseusingmachinelearning.pdf
predictionofheartdiseaseusingmachinelearning.pdfDasariSeshadri
 
Prediction of heart disease using machine learning.pptx
Prediction of heart disease using machine learning.pptxPrediction of heart disease using machine learning.pptx
Prediction of heart disease using machine learning.pptxkumari36
 
HEALTH PREDICTION ANALYSIS USING DATA MINING
HEALTH PREDICTION ANALYSIS USING DATA  MININGHEALTH PREDICTION ANALYSIS USING DATA  MINING
HEALTH PREDICTION ANALYSIS USING DATA MININGAshish Salve
 
Supervised Feature Selection for Diagnosis of Coronary Artery Disease Based o...
Supervised Feature Selection for Diagnosis of Coronary Artery Disease Based o...Supervised Feature Selection for Diagnosis of Coronary Artery Disease Based o...
Supervised Feature Selection for Diagnosis of Coronary Artery Disease Based o...cscpconf
 
SUPERVISED FEATURE SELECTION FOR DIAGNOSIS OF CORONARY ARTERY DISEASE BASED O...
SUPERVISED FEATURE SELECTION FOR DIAGNOSIS OF CORONARY ARTERY DISEASE BASED O...SUPERVISED FEATURE SELECTION FOR DIAGNOSIS OF CORONARY ARTERY DISEASE BASED O...
SUPERVISED FEATURE SELECTION FOR DIAGNOSIS OF CORONARY ARTERY DISEASE BASED O...csitconf
 
Comparing Data Mining Techniques used for Heart Disease Prediction
Comparing Data Mining Techniques used for Heart Disease PredictionComparing Data Mining Techniques used for Heart Disease Prediction
Comparing Data Mining Techniques used for Heart Disease PredictionIRJET Journal
 
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...ijdmtaiir
 
Multivariate sample similarity measure for feature selection with a resemblan...
Multivariate sample similarity measure for feature selection with a resemblan...Multivariate sample similarity measure for feature selection with a resemblan...
Multivariate sample similarity measure for feature selection with a resemblan...IJECEIAES
 
IRJET- The Prediction of Heart Disease using Naive Bayes Classifier
IRJET- The Prediction of Heart Disease using Naive Bayes ClassifierIRJET- The Prediction of Heart Disease using Naive Bayes Classifier
IRJET- The Prediction of Heart Disease using Naive Bayes ClassifierIRJET Journal
 

Ähnlich wie Clinical_Decision_Support_For_Heart_Disease (20)

javed_prethesis2608 on predcition of heart disease
javed_prethesis2608 on predcition of heart diseasejaved_prethesis2608 on predcition of heart disease
javed_prethesis2608 on predcition of heart disease
 
PPT.pptx
PPT.pptxPPT.pptx
PPT.pptx
 
Short story_2.pptx
Short story_2.pptxShort story_2.pptx
Short story_2.pptx
 
IRJET -Improving the Accuracy of the Heart Disease Prediction using Hybrid Ma...
IRJET -Improving the Accuracy of the Heart Disease Prediction using Hybrid Ma...IRJET -Improving the Accuracy of the Heart Disease Prediction using Hybrid Ma...
IRJET -Improving the Accuracy of the Heart Disease Prediction using Hybrid Ma...
 
Presentation
PresentationPresentation
Presentation
 
Short story.pptx
Short story.pptxShort story.pptx
Short story.pptx
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Prediction of heart disease using classification mining technique on spark
Prediction of heart disease using classification mining technique on sparkPrediction of heart disease using classification mining technique on spark
Prediction of heart disease using classification mining technique on spark
 
predictionofheartdiseaseusingmachinelearning.pdf
predictionofheartdiseaseusingmachinelearning.pdfpredictionofheartdiseaseusingmachinelearning.pdf
predictionofheartdiseaseusingmachinelearning.pdf
 
Prediction of heart disease using machine learning.pptx
Prediction of heart disease using machine learning.pptxPrediction of heart disease using machine learning.pptx
Prediction of heart disease using machine learning.pptx
 
HEALTH PREDICTION ANALYSIS USING DATA MINING
HEALTH PREDICTION ANALYSIS USING DATA  MININGHEALTH PREDICTION ANALYSIS USING DATA  MINING
HEALTH PREDICTION ANALYSIS USING DATA MINING
 
Supervised Feature Selection for Diagnosis of Coronary Artery Disease Based o...
Supervised Feature Selection for Diagnosis of Coronary Artery Disease Based o...Supervised Feature Selection for Diagnosis of Coronary Artery Disease Based o...
Supervised Feature Selection for Diagnosis of Coronary Artery Disease Based o...
 
SUPERVISED FEATURE SELECTION FOR DIAGNOSIS OF CORONARY ARTERY DISEASE BASED O...
SUPERVISED FEATURE SELECTION FOR DIAGNOSIS OF CORONARY ARTERY DISEASE BASED O...SUPERVISED FEATURE SELECTION FOR DIAGNOSIS OF CORONARY ARTERY DISEASE BASED O...
SUPERVISED FEATURE SELECTION FOR DIAGNOSIS OF CORONARY ARTERY DISEASE BASED O...
 
Comparing Data Mining Techniques used for Heart Disease Prediction
Comparing Data Mining Techniques used for Heart Disease PredictionComparing Data Mining Techniques used for Heart Disease Prediction
Comparing Data Mining Techniques used for Heart Disease Prediction
 
238_heartdisease (1).pdf
238_heartdisease (1).pdf238_heartdisease (1).pdf
238_heartdisease (1).pdf
 
Rishat ML ppt.pptx
Rishat ML ppt.pptxRishat ML ppt.pptx
Rishat ML ppt.pptx
 
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
 
Multivariate sample similarity measure for feature selection with a resemblan...
Multivariate sample similarity measure for feature selection with a resemblan...Multivariate sample similarity measure for feature selection with a resemblan...
Multivariate sample similarity measure for feature selection with a resemblan...
 
IRJET- The Prediction of Heart Disease using Naive Bayes Classifier
IRJET- The Prediction of Heart Disease using Naive Bayes ClassifierIRJET- The Prediction of Heart Disease using Naive Bayes Classifier
IRJET- The Prediction of Heart Disease using Naive Bayes Classifier
 

Clinical_Decision_Support_For_Heart_Disease

  • 1. 1 CLINICAL DECISION SUPPORT FOR HEART DISEASE USING PREDICTIVE MODELS Srpriva Sundaraman Northwestern University SripriyaSundararaman2013@u.northwestern.edu Sunil Kakade Northwestern University Sunil.kakade@gmail.com Abstract--Over the last decade, data-driven decision making has become very prevalent in the healthcare industry. Computer systems that perform data analysis and automated clinical decision-making are called Clinical Decision Support (CDS) systems. CDS enables healthcare providers take quick and effective decisions based on existing patient data. Data-driven evidence-based treatment is proven to have higher success rates than intuition driven eminence-based treatment. This objective of this project is to build a clinical decision support system for predicting the presence of heart disease using a classification model. This is a supervised learning problem with binary target outcomes indicating either the presence or absence of heart disease. Other models like clustering or logistic regression have been used for such problems in the past, but a classification model would be most ideal due to the nature of the data. Several data mining classification algorithms like artificial neural networks, naïve bayes, and decision tree were evaluated for performance. The best performing model in terms of misclassification, false positive rate, and area under ROC curve was selected for data analysis. This project analyzes the model performance results and suggests next steps to improve upon this clinical decision support model. I. INTRODUCTION Healthcare is one of the largest sectors in the United States. The US spends approximately $2.5 trillion in healthcare costs each year. Each year about $250 billion of this is attributed to waste, fraud, and abuse. Data mining and analytics reduce some of the healthcare waste by improving outcomes. One of the ways healthcare providers can improve outcomes is by using data-driven decision making. Data mining is well suited to help decision making in a healthcare setting for the following reasons – Healthcare organizations generate large amount of data, thus making it a suitable domain for data mining. The Affordable Care Act by the US government puts the onus on healthcare providers to improve outcomes. With the advent of computing power and medical technology, diverse and elaborate classification algorithms have been developed to mine the data. Due to the reasons mentioned above, Data-driven decision making has become very popular in the healthcare domain and are called Clinical Decision Support systems (CDS). CDS goes beyond prediction by combining prediction with real-time data analysis and rule based reasoning to arrive at more accurate decisions. This objective of this project is to build a clinical decision support system for predicting the presence of heart disease. A classification data model is proposed to detect patterns in existing heart patient data. In the past other researchers have tried to run classification models on the same data. Logistic regression models have generated an accuracy of 77%. Noise tolerant instance based learning algorithms have been 77% accurate as well. Clustering algorithms have shown a marginal improvement in performance with 78.9% accuracy. The aim of this project will be to create a model with better accuracy than before. The outcome from this classification model could be used as an initial screening for patients. The prediction would also be useful for physicians when they need to make quick decisions (ex. Emergency rooms, operation theaters). The classification model enables physicians to tap into the “wisdom of other physicians” and come up with accurate prediction of heart disease in new patients by classification patterns in existing patient data. II. DATA UNDERSTANDING Data Source - The data used for analysis is collected from the University of California, Irvine machine
  • 2. 2 learning data repository. The creator of this data set are – Andras Janosi (Hungarian Institute of Cardiology), William Steinbrunn (University Hospital, Zurich), Matthias Pfisterer (University Hospital, Basel), and Robert Detrano (VA Medical Center and the Cleveland Clinic Foundation). This data mining exercise uses two of the datasets from this heart disease database – The Cleveland Clinic data for training, and the Switzerland data for testing. The database originally consisted of 76 attributes with details about patient body characteristics. The goal was to determine if presence of heart disease can be diagnosed from this data. Data Attributes and Instances - While the original dataset had 76 attributes, only a subset of the 14 most important ones were published on the public domain. Some of the patient sensitive information like name, SSN, etc. was removed for security purposes. There are 303 instances of patient data in the training dataset and 123 instances in the test data set. The output attribute is an integer that represents the presence or absence of heart disease. There were 4 missing values for attribute “ca’ and 2 missing values for attribute “thal”. The 14 attributes of the patient are described below: 1. age (numeric) - age of the patient in years 2. sex (numeric) - represented as a binary number (1 = male, 0 = female) 3. cp (numeric) - represents chest pain type as an integer. Values range from 1 to 4. • Value 1: typical angina • Value 2: atypical angina • Value 3: non-angina pain • Value 4: asymptomatic   4. trestbps (numeric) - resting blood pressure measured in mm Hg on admission to the hospital. 5. chol (numeric) - serum cholesterol of the patient measured in mg/dl 6. fbs (numeric) - fasting blood sugar of the patient. If greater than 120 mg/dl the attribute value is 1 (true), else the attribute value is 0 (false) 7. restecg (numeric) - resting electrocardiographic results for the patient. This attribute can take 3 integer values - 0, 1, or 2. • Value 0: normal • Value 1: having ST-T wave abnormality • Value 2: showing probable or definite left ventricular hypertrophy   8. thalach (numeric) - maximum heart rate of the patient. 9. exang (numeric) - exercise induced angina. Values can be 1 for yes or 0 for no. 10. oldpeak (numeric) - measurement of depression induced by exercise relative to rest 11. slope (numeric) - measure of slope for peak exercise. Values can be 1, 2, or 3 • Value 1: up sloping • Value 2: flat • Value 3: down sloping   12. ca (numeric) - represents number of major vessels. Attribute values can be 0 to 3. 13. thal (numeric) - represents heart rate of the patient. It can take values 3, 6, or 7. • Value 3: normal • Value 6: fixed defect • Value 7: reversible defect 14. num (numeric) - contains a numeric value between 0 and 4. Each value represents a heart disease or absence of all of them. • Value 0: absence of heart disease • Value 1 to 4: presence of different heart diseases III. DATA PREPARATION The data mining tool for creating the model would be WEKA (version 3.6.10), which is an open source software. The data was obtained in a comma-separated value (CSV) format. The following modifications were done to prepare the dataset for training. Data modification - Since we are only interested in the presence or absence of heart disease and not the exact disease classification, the output values of the original data was modified. The Output value (num) of the dataset contained 4 values. A value of 0 indicated no heart disease. Values between 1 and 4 indicated different heart diseases respectively. The num values of 1 to 4 were changed to 1 in Microsoft Excel. Data type conversions - The output values (num) containing the class label were integers. Classification algorithms require the output to be binary in Weka. The output values of 1 were converted to “yes” and the values of 0 were converted to “no” respectively. Imputation of missing values – The modified CSV file was imported into Weka. This file had 6 missing values (4 missing values for attribute “ca” and 2 missing values for the attribute “thal”). The missing data values were imputed using the “ReplaceMissingValues” function under the preprocessing section. Weka uses local averages to determine the imputed values. Test data set - The training dataset contained 303
  • 3. 3 instances from the Cleveland Clinic data set. The test data consisted of 123 records from the Switzerland University hospitals. The test data was formatted the same way as the training data. Classification variable was converted to binary values (yes/no) and the missing values were imputed. The test data set had several missing values as compared to the training data set. IV. DATA MINING ALGORITHMS The Cleveland dataset contained characteristics as well as the diagnosis of patients with heart disease. This is a supervised learning problem because we have a specific target in the test data, which can be used to predict output for new use cases. Algorithms like classification, regression, or causal modeling are used for supervised learning. The output in the training and test data is binary because the heart disease is either present or not. Hence classification algorithms would be ideal to train the Cleveland dataset. The following steps define the classification workflow – 1. The training and test data is preprocessed. 2. Several learning algorithms are evaluated for performance and the best one is selected. 3. A training set consisting of records whose output class labels are known, is used to build the classification model by using the learning algorithm 4. After training, this model is run against the test set which consists of records with unknown labels. 5. The performance of classification model is measured based on the number of correct classifications. The training dataset was run on three different classification algorithms on Weka – Naïve Bayes, J48 Decision Tree, and Artificial Neural Networks using 10-fold cross validation. Cross validation is a technique for estimating the performance of the model by using part of the training set for testing. The most accurate algorithm was determined by the misclassification error rate, Area under ROC curve, and number of false positives. The misclassification error determines the model accuracy in terms of percentage of incorrect predictions made by the model. The AUC determines the predictive capability of the model. This model will be used for heart disease predictions and hence it is important for the selected model to not only perform well on the training data, but to also have a high predictive accuracy. The false positives i.e. number of cases that where the patient had a heart disease, but the model incorrectly predicts those patients as healthy. This error is very important in this context. False positives could cause a potential heart patient to go untreated and may be even life threatening. We evaluate the models based on their false positives count. Models with lower number of false positives are desired. Three classification algorithms – (1) Naïve Bayes, (2) Decision Trees, and (3) Artificial Neural Networks (ANN) were used in the analysis. Comparison of the performance of the three algorithms on Weka – Algorit hm Misclassific ation rate (%) Area under ROC curve(A UC) False Positi ves Time taken (secon ds) Naïve Bayes 16.5 0.891 31 0 J48 Decisio n Tree 20.79 0.783 36 0.04 Artifici al Neural Networ ks 22.77 0.833 46 0.58 The ROC chart for the three algorithms – Naïve Bayes-AUC J48 Decision Tree – AUC
  • 4. 4 Neural Network- AUC Among the three algorithms, Naïve Bayes had the lowest misclassification error, highest area under the ROC curve (AUC), lowest false positive count, and fastest time. Hence Naïve Bayes algorithm was selected to train and test the Heart Disease training and test model. V. NAÏVE BAYES MODEL - EXPERIMENTAL RESULTS AND ANALYSIS Naïve Bayes algorithm was run against the training data set with 10-fold cross-validation, where one-tenths of the data was held for testing and the rest of the data was used for training. The algorithm took 0 seconds to run through 303 records in the Cleveland Clinic training data set. Analysis of the training data results – 1. Accuracy - The model accuracy was 83.5% and the error rate was 16.5%. Out of all the cases evaluated 83.5% of heart disease outcome predictions were correct and 16.5% of predictions were wrong. 2. Confusion matrix – The classifier correctly identified (true positives) 145 records as patients who won’t have heart disease. 31 cases were identified as healthy by the model, but had a heart disease (false positives). 108 records are correctly classified as patients having heart disease by the model. 19 cases are identified as heart disease patients by the model, but they were healthy in reality (false negatives). 3. True positive rate (TP Rate) and false positive rate (FP Rate) – The Naïve Bayes algorithms gets the weighted average of true positives and false positives for both outcomes (yes and no). The average TP Rate or sensitivity is 83.5%. And the average FP Rate is 17.4%. For 83.5% of the cases the model is correctly predicting the existence of heart disease. 4. Precision – The average precision for this model is 0.836 which means 83.6% of the predictions by the model are correct. 5. AUC - The area under the ROC curve (AUC) is 0.891. This represents how well the model is able to distinguish between the target classes. A score of 0.667 is good model accuracy. The test data (Switzerland university hospitals) was used to evaluate the Naïve Bayes model built using the Cleveland Clinic training data. Analysis of the test data results – 1. Accuracy - The model accuracy for the test set was 70.73% and the error rate was 29.26%. Out of all the cases evaluated 70.73% of heart disease outcome predictions were correct and 29.26% of predictions were wrong. 2. Confusion matrix – The classifier correctly identified 5 records as patients who won’t have heart disease. 33 cases are identified as healthy by the model, but they had a heart disease in reality. 82 records are correctly classified as patients having heart disease by the model. 3 cases are identified as heart disease patients by the model, but they were healthy in reality. 3. True positive rate (TP Rate) and false positive rate (FP Rate) – The Naïve Bayes algorithms gets the weighted average of true positives and false positives for both outcomes (yes and no). The average TP Rate or sensitivity is 70.7%. And the average FP Rate is 36.9%, which is almost double that of the training set. For 70.7% of the cases the model is correctly predicting the existence of heart disease. 4. Precision and Recall – The average precision for this model is 0.836 which means 83.6% of the predictions by the model are correct. This is the same as the true positive rate. The harmonic mean of precision and recall is F-measure. The weighted average of f-measure is 0.834. 5. AUC - The area under the ROC curve (AUC) is 0.667. This represents how well the model is able to distinguish between the target classes. A score of 0.667 is average performance for model accuracy. The ROC threshold curve for both the training and test sets –
  • 5. 5 V1. CONCLUSION The Naïve Bayes model performed very well on the training data. On the test data the performance reduced. The model accuracy, and area under ROC curve were both significantly lower in the test data set. There could be several reasons for the poor performance – 1) Even though the training data was testing continuously using cross-validation, the data being tested was the same as the data being trained. The high ROC values in training could have happened due to model over-fitting. 2) The Switzerland patient data in the test set had several missing values that were imputed in Weka using local averages. These imputed values could have been inaccurate. 3) The training data set was quite small. 300 instances is probably not enough to build a reliable model. Despite the “average” performance on the test set, this model is still useful for predicting heart disease in patients in 70% of the cases. This is definitely a better solution as opposed to not knowing any information about the patient. This kind of prediction of heart disease could be used as a first level diagnosis for doctors, emergency room services and other healthcare providers. If a patient was predicted to have heart disease, physicians could perform further tests on these patients to confirm the diagnosis. This kind of clinical decision support is a win-win proposition that helps save time for doctors and save lives for patients. VII. FUTURE WORK The training data set had good quality data, but there were very few instances to train the classification model. The test data set was of poor quality with plenty of missing attributes. The number of records were also inadequate. Despite these shortcomings the model performance was above average on the test data and very good for the training data. With the right kind of training and test data, the same classification model can be used to generate better performance and results. The following are a few ways to improve the classification model – 1. The missing values in the dataset could be validated with the corresponding hospitals to generate a more accurate training and test data set. 2. Additional patient information could be collected from EMR databases in hospitals. This would generate more attributes, but we can perform attribute selection to filter the most important ones. 3. The UCI heart disease data was only collected from 4 hospitals. The heart disease database could be expanded to add more hospitals and patient information to create a more diverse training set. 4. The heart disease model could be expanded beyond predicting occurrence. We could identify the exact heart disorder from the existing data. This would aid the physicians further more. 5. Similar classification models can be trained for other diseases apart from heart diseases. REFERENCE Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Boston: Pearson Addison Wesley, 2005. Provost F., Fawcett T. (2013). Data Science for Business (pp. 20-40). Sebastopol, CA: O’Reilly Media Inc. Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Caroline Clabaugh, Dave Myszewski, and Jimmy Pang (2000). Neural network applications. Eric Roberts' Sophomore College. URL:http://cs.stanford.edu/people/eroberts/courses/soc o/projects/2000-01/neural- networks/Applications/miscellaneous.html Doust, Dominick and Walsh, Zack, "Data Mining Clustering: A Healthcare Application" (2011).MCIS 2011 Proceedings. Paper 65. http://aisel.aisnet.org/mcis2011/65 OECD Health Data (2012). URL:www.oecd.org/unitedstates/HealthSpendingInUS A_HealthData2012.pdf WEKA on Wikipedia. URL: http://en.wikipedia.org/wiki/WEKA Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology,
  • 6. 6 64,304--310. David W. Aha & Dennis Kibler. "Instance-based prediction of heart-disease presence with the Cleveland database." Gennari, J.H., Langley, P, & Fisher, D. (1989). Models of incremental concept formation. Artificial Intelligence, 40, 11--61.