An Empirical Study on Diabetes Mellitus Prediction
for Typical and Non-Typical Cases using Machine
Learning Approaches
Md. Tanvir Islam1, M. Raihan2, Fahmida Farzana3, Md. Golam Morshed Raju4 and Md. Bellal Hossain5
Department of Computer Science and Engineering, North Western University, Khulna, Bangladesh1-4
Electronics and Communication Engineering Discipline, Khulna University, Khulna, Bangladesh5
Emails: tanvirislamnwu@gmail.com1, mraihan@ieee.org2, mraihan@nwu.edu.bd2, raihanbme@gmail.com2, tanni.dorsonindrio@gmail.com3, golam.morshed.raju.cse01@gmail.com4 and md.bellal.ku@gmail.com5
Abstract—Diabetes is a non-communicable disease that is increasing at an alarming rate all over the world. A high blood sugar level or a lack of insulin is its primary cause, so it is important to find an effective way to predict diabetes before it becomes a major threat to human health; diabetes can be controlled at an early stage if precautions are taken. For this study, we have collected 340 instances with 27 features from patients already diagnosed with diabetes, whose various symptoms are categorized into two types: Typical and Non-Typical. The dataset has been trained using the cross-validation technique, and three Machine Learning (ML) algorithms, namely Bagging, Logistic Regression and Random Forest, have been used for classification. The accuracies are 89.12% for Bagging, 83.24% for Logistic Regression and 90.29% for Random Forest, which are very encouraging.
Keywords—Diabetes Mellitus, Type-2, Machine Learning, Bag-
ging, Logistic Regression, Random Forest, Typical, Non-typical.
I. INTRODUCTION
Nowadays, in the era of modern technology, many people suffer from numerous diseases, and diabetes is one of them. Diabetes Mellitus (DM) arises when the level of sugar in our blood is too high, even though sugar is our primary source of energy [1]. According to the International Diabetes Federation (IDF), in 2011 about 8.4 million people, or 10% of the total population of Bangladesh, had diabetes, and the prevalence of diabetes among adults is expected to increase to 13% by 2030 [2]. A news report in Science Daily states that if diabetes can be prevented at an earlier stage, the chances of minimizing its devastating effects are greater [3]. Machine Learning Techniques (MLT) are broadly utilized in medical prediction [4]. For example, a prediction model was developed to predict Type 2 Diabetes (T2D) using K-means and a Decision Tree; the accuracy, specificity, and sensitivity of that model are 90.04%, 91.28% and 87.27% respectively [5]. The accuracy achieved can differ from one algorithm to another, so it is worthwhile to compare the currently available algorithms and determine which performs best.
In this analysis, our goal is to evaluate the accuracy of three popular algorithms, namely Bagging (BAG), Logistic Regression (LR) and Random Forest (RF), by analyzing the dataset, to compare their performance, and ultimately to integrate these techniques into a mobile or web system to build an expert system.
The rest of the manuscript is arranged as follows: sections II and III elaborate the related works and the methodology, respectively, with particular attention to the fairness of the classifier algorithms. Section IV explains the outcome of this analysis and justifies the novelty of this work. Finally, the paper is concluded in section V.
II. RELATED WORKS
Deeraj et al. 2017 in [6] proposed using the Bayesian and K-Nearest Neighbor algorithms to predict diabetes through data mining; the results of the prediction depend on the chosen attributes, for example age, pregnancy, triceps skin-fold thickness, BMI, diabetes pedigree function, diastolic blood pressure and so on. Another study, performed on three available datasets to examine the behavior of perceptron algorithms, found that the proposed algorithm is better than a plain perceptron algorithm [7]. Using the K-Nearest Neighbor (KNN) and Artificial Neural Network (ANN) classification algorithms, another research team evaluated classifiers for diabetes diseases; they used a total of 768 instances with 9 attributes, and the accuracies of ANN and KNN were 80.86% and 77.24% respectively [8]. With a Deep Neural Network (DNN) and a Support Vector Machine (SVM), a system was developed to predict diabetes; it used 8 significant patient attributes, such as Age, Number of Times Pregnant, Plasma Glucose Concentration, Diastolic Blood Pressure and Body Mass Index, and achieved 77.86% accuracy [9]. A mobile-based decision support system was developed for gestational Diabetes Mellitus (DM) [10]. Another diabetes prediction system was built using cloud analytics on 3075 distinct person records with 14 variables, covering all age groups of people who may or may not have been diagnosed with diabetes; Logistic Regression outperformed Random Forest in all cases, which is why the authors chose LR as their model [11]. Deepika Verma and Nidhi Mishra conducted a study to identify diabetes by applying the Naive Bayes (NB), J48, SMO, MLP and REP Tree algorithms to a dataset and found that SMO gives 76.80% accuracy on the diabetes dataset [12].
Since diabetes is increasing at an alarming rate and has become one of the major health issues for human beings, we have felt the importance of finding an effective and efficient solution. From this perspective, we have been motivated to improve the accuracy of Machine Learning algorithms by using a clinical dataset of diabetes-affected patients.

IEEE - 45670, 10th ICCCNT 2019, July 6-8, 2019, IIT Kanpur, Kanpur, India
III. METHODOLOGY
We can separate our strategy into four primary segments
as follows:
• Data Collection
• Data Preprocessing
• Data Training
• Applications of Machine Learning Algorithms
An overall work-flow of our study is shown in Fig. 1. The analysis starts by importing the dataset of 340 instances and 27 features; data preprocessing then replaces missing data with the mean, median and mode; feature selection applies the Best First Search and Ranker algorithms to find the most significant features; the 10-fold cross-validation technique is applied; the classification algorithms (Bagging, Random Forest and Logistic Regression) are run; and finally the statistical metrics are determined and the performances compared.

Fig. 1: Work-flow of the Overall Analysis
A. Data Collection:
For this analysis, we have collected data from Khulna Diabetes Hospital, Khulna. The dataset includes a total of 340 instances, each having 27 significant features. It contains basic information about the patients and two types of symptoms: Typical and Non-Typical. Table I shows the categories of the symptoms.

TABLE I: Types and Names of Symptoms

Types of Symptoms   Names of Symptoms
Typical             Thirst, Hunger, Weight Loss, Weakness
Non-Typical         Headache for High Blood Pressure, Burning Extremities, Weakness
B. Data preprocessing:
To handle missing information we have used two popular and useful functions in WEKA 3.8 (Waikato Environment for Knowledge Analysis). First, the ReplaceMissingValues function has been used to replace missing data; it swaps every missing value of a nominal or numeric attribute with the mode or mean of that attribute [13]. We have also used another function, named Randomize, which can fill up the missing fields without sacrificing too much performance [13].
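The study performed this step with WEKA's filters; purely as an illustration of the mean/mode replacement rule, the same idea can be sketched in stdlib Python (the toy field values below are made up, not the study's records):

```python
from statistics import mean, mode

# Numeric attribute: missing values are replaced with the mean.
# Nominal attribute: missing values are replaced with the mode.
ages = [45, None, 60, 51]        # numeric (illustrative)
sexes = ["F", "M", None, "F"]    # nominal (illustrative)

age_mean = mean(v for v in ages if v is not None)
sex_mode = mode(v for v in sexes if v is not None)

ages_filled = [age_mean if v is None else v for v in ages]
sexes_filled = [sex_mode if v is None else v for v in sexes]
```

The same per-attribute statistic is applied to every missing cell, which is exactly what makes the filter cheap but potentially variance-reducing.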
C. Data Training:
For training on all the features of the dataset shown in Table II, we have used the 10-Fold Cross-Validation technique. It is a re-sampling technique to evaluate predictive models by partitioning the original sample into a training set to train the model and a test set to evaluate it [14]. The procedure has a single parameter, k, that refers to the number of groups a given data sample is to be split into. It shuffles the dataset randomly, splits the dataset into 10 groups, and finally summarizes the skill of the model using the sample of model evaluation scores [15].
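The splitting step described above can be sketched in plain Python (an illustration only; the study ran cross-validation inside WEKA, and the seed value here is arbitrary):

```python
import random

def k_fold_indices(n, k=10, seed=1):
    """Shuffle indices 0..n-1 and split them into k folds; each fold
    serves once as the test set while the remaining folds form the
    training set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# 340 instances, as in the study's dataset
splits = list(k_fold_indices(340, k=10))
```

Every instance appears in exactly one test fold, so each of the 10 evaluation scores is computed on data the model did not see during training.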
D. Applications of ML Algorithms:
After preprocessing and training the dataset, we have applied three algorithms to it: Bagging, Logistic Regression and Random Forest.
1) Bagging (BAG): It is an ensemble method that re-samples the training data to create a new model for each sample that is drawn [16]. It builds an ensemble of classification models for a learning scheme, where each model gives an equally weighted prediction [14].
Input:
• R, a set of h training tuples
• t, the number of models in the ensemble
• A classification learning scheme (Decision Tree algorithm, Naive Bayes, etc.)
Output: The ensemble, a composite model, L.
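As an illustration of the pseudocode above, here is a minimal stdlib-Python bagging sketch. The toy records, the ensemble size t = 11 and the majority-class base learner are all hypothetical stand-ins for the learning scheme and data described in the text:

```python
import random
from collections import Counter

def train_stump(sample):
    """Trivial base learner: predict the majority class of its own
    bootstrap sample (a stand-in for the decision tree / naive Bayes
    scheme named in the text)."""
    maj = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: maj

def bagging(R, t, learn, seed=1):
    """Draw t bootstrap samples of len(R) tuples from R with
    replacement, train one model per sample, and return an ensemble L
    that predicts by equally weighted majority vote."""
    rng = random.Random(seed)
    models = [learn([rng.choice(R) for _ in range(len(R))])
              for _ in range(t)]
    return lambda x: Counter(m(x) for m in models).most_common(1)[0][0]

# Hypothetical (feature, label) tuples
R = [(0.2, "No"), (0.9, "Yes"), (0.8, "Yes"), (0.1, "No"), (0.7, "Yes")]
L = bagging(R, t=11, learn=train_stump)
```

Because each model gets one vote, no single resampled model can dominate the ensemble's prediction.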
TABLE II: Features List

• Age: Lowest 22, Highest 80; Mean ± SD 49.21 ± 11.95
• Sex: Male 47.65%, Female 52.35%
• Profession: Lowest (Supervisor, Unemployed, Servant, Banker) 1.18%; Highest (Housewife) 52.65%; Rest 46.17%
• Height: Lowest 133 cm, Highest 188 cm; 158.58 ± 8.40
• Weight: Lowest 33 kg, Highest 105 kg; 62.65 ± 11.03
• Body Mass Index: Lowest 11.6 kg/m2, Highest 52.1 kg/m2; 24.81 ± 3.53
• Heart Rate: Lowest 56 bpm, Highest 90 bpm; 74.94 ± 6.11
• Systolic Blood Pressure: Lowest 90 mmHg, Highest 200 mmHg; 129.1 ± 12.39
• Diastolic Blood Pressure: Lowest 40 mmHg, Highest 110 mmHg; 81.52 ± 7.16
• Blood Sugar Before Meal: Lowest 4.2 mmol/L, Highest 45.5 mmol/L; 12.91 ± 4.83
• Blood Sugar After Meal: Lowest 4.4 mmol/L, Highest 45.8 mmol/L; 18.51 ± 5.42
• Urine Color Before Meal: Lowest (Orange) 3.24%; Highest (Green) 68.82%; Rest 27.94%
• Urine Color After Meal: Lowest (Cyan) 0.29%; Highest (Green) 65.00%; Rest 34.71%
• Drug History: Yes 99.41%, No 0.59%
• Weight Loss: Yes 78.24%, No 21.76%
• Thirst: Yes 92.35%, No 7.65%
• Hunger: Yes 78.24%, No 21.76%
• Relatives: Yes 79.41%, No 20.59%
• Physical Activity: Yes 97.64%, No 2.36%
• Smoking: Yes 6.18%, No 93.82%
• Tobacco Chewing: Yes 12.9%, No 87.1%
• Headache for High BP: Yes 81.18%, No 18.82%
• Burning Extremities: Yes 82.94%, No 17.06%
• Weakness: Yes 95.88%, No 4.12%
• Symptoms Duration: Lowest 1 day, Highest 3650 days; Mean 264.25, SD 292.437
• Diabetes Mellitus: Yes 99.41%, No 0.59%
• Outcome: Typical 61.18%, Non-Typical 14.70%, Both 24.12%

*SD = Standard Deviation
2) Logistic Regression (LR): LR is a well-known system for classifying individuals into two mutually exclusive and exhaustive categories, for instance buyer vs. non-buyer or responder vs. non-responder [17]. It predicts the probabilistic outcome of an event by fitting the data to a logit function [17].
LR computes the logit of L, the log of the odds of an individual belonging to class 1, which can easily be converted into the probability of an individual belonging to class 1 [17].
The equations of the logit and the probability are as follows:

Logit L = a0 + a1 × Y1 + a2 × Y2 + · · · + an × Yn ...eq.(1)

Prob (L = 1) = exp(Logit L) / (1 + exp(Logit L)) ...eq.(2)

An individual's predicted probability of belonging to class 1 is computed by "plugging in" the values of the predictor variables for that individual in the two equations above. Here the a's are the LR coefficients, determined by the calculus-based method of maximum likelihood; the intercept a0 has no predictor variable with which it is multiplied [17].
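A quick numeric check of eqs. (1) and (2), with hypothetical coefficients and predictor values (not fitted to the study's data):

```python
from math import exp

# Illustrative coefficients a0 (intercept), a1, a2 and predictors Y1, Y2
a = [-1.0, 0.8, 0.05]
Y = [1.0, 20.0]

# eq. (1): linear combination of predictors on the log-odds scale
logit_L = a[0] + sum(ai * yi for ai, yi in zip(a[1:], Y))

# eq. (2): convert the log-odds back into a probability in (0, 1)
prob = exp(logit_L) / (1 + exp(logit_L))
```

Whatever the coefficient values, eq. (2) always maps the unbounded logit into a valid probability, which is what makes the model usable for classification.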
3) Random Forest (RF): RF is another ensemble technique that is used for classification; it is also capable of performing regression tasks [14]. If a training set, Y, of y tuples is given, then the procedure for generating k decision trees is as follows: for each iteration, i (i = 1, 2, ..., k), a training set, Yi, of y tuples is sampled with replacement from Y. Let U be the number of attributes to be used to determine the split at each node (where U is much smaller than the number of available attributes). To construct a decision tree classifier, Ni, randomly select, at each node, U attributes as candidates for the split at that node [14].
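The attribute-subsampling step that distinguishes RF from plain bagging can be sketched as follows (illustrative Python; U = 5 and the generic feature names are arbitrary choices, not the study's):

```python
import random

def candidate_attributes(all_attrs, U, rng):
    """At each node, RF considers only U randomly chosen attributes
    (U much smaller than the total) as candidates for the split."""
    return rng.sample(all_attrs, U)

attrs = [f"f{i}" for i in range(27)]   # 27 attributes, as in Table II
rng = random.Random(1)
cands = candidate_attributes(attrs, U=5, rng=rng)
```

Restricting each split to a fresh random subset decorrelates the trees, which is why averaging them reduces variance more than bagging full trees would.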
IV. OUTCOMES
The results are analyzed based on 22 performance parameters, for example:
A. Seed:
The seed sets the random number generator used to shuffle the data; changing it yields a different shuffle and therefore different results.
B. Correctly Classified Instances (CCI):
The accuracy of the model on the data used for testing
[13].
Accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn) ...eq.(3)

Here, Tp = True positive, Tn = True negative, Fp = False positive, Fn = False negative.
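Eq. (3) can be checked numerically; the confusion counts below are hypothetical, chosen only so that 340 instances yield roughly the 90.29% CCI reported later for RF:

```python
def accuracy(Tp, Tn, Fp, Fn):
    # eq. (3): fraction of all instances that were classified correctly
    return (Tp + Tn) / (Tp + Tn + Fp + Fn)

# Illustrative counts summing to the study's 340 instances
acc = accuracy(Tp=150, Tn=157, Fp=18, Fn=15)
```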
C. Kappa Statistics (KS):
The Kappa statistic is utilized to quantify the agreement between the predicted and observed classifications of a dataset [13].

K = (R0 − Re) / (1 − Re) ...eq.(4)

where R0 = relative observed agreement among raters and Re = theoretical probability of chance agreement.
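As a numeric check of eq. (4), using illustrative agreement values (not the study's):

```python
def kappa(R0, Re):
    # eq. (4): observed agreement R0, corrected for chance agreement Re
    return (R0 - Re) / (1 - Re)

# Illustrative: 90% observed agreement, 50% expected by chance
k = kappa(R0=0.90, Re=0.50)
```

A value of 0 means the classifier agrees with the truth no more often than chance would; 1 means perfect agreement.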
Fig. 2: ROC Curve of Typical Class for (a) Bagging at seed 3
(b) Logistic Regression at seed 3 (c) Random Forest at seed 1
D. KB Information Score:
For the correct class B of an instance, we can consider 3 different cases [18] as follows:
• If P'(B) > P(B) then the score is positive
• If P'(B) < P(B) then the score is negative
• If P'(B) = P(B) then there is no information (score 0)
where P(B) = the prior probability of B and P'(B) = the posterior probability returned by the classifier.
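The three cases above reduce to a sign comparison between prior and posterior; a minimal sketch with illustrative probabilities:

```python
def kb_score_sign(prior, posterior):
    """Sign of the KB information score for the true class B:
    +1 if the classifier raises P'(B) above the prior P(B),
    -1 if it lowers it, 0 if unchanged (no information)."""
    return (posterior > prior) - (posterior < prior)

signs = [kb_score_sign(0.3, 0.9),   # classifier helped
         kb_score_sign(0.3, 0.1),   # classifier hurt
         kb_score_sign(0.3, 0.3)]   # no information
```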
E. Mean Absolute Error (MAE):
It averages the magnitudes of the individual errors without taking account of their sign [13].

MAE = ( |p1 − b1| + · · · + |pn − bn| ) / n ...eq.(5)

Here, p denotes a predicted value and b the corresponding actual value.
Fig. 3: ROC Curve of Non-Typical Class for (a) Bagging at
seed 3 (b) Logistic Regression at seed 3 (c) Random Forest
at seed 1
F. Relative Absolute Error (RAE):
It is the total absolute error with the same kind of normal-
ization [13].
RAE = ( |p1 − b1| + · · · + |pn − bn| ) / ( |b1 − b̄| + · · · + |bn − b̄| ) ...eq.(6)
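Eqs. (5) and (6) in code, with illustrative predicted/actual values (RAE normalizes the total absolute error by the error of always predicting the mean of the actual values):

```python
def mae(p, b):
    # eq. (5): mean of the absolute errors, sign ignored
    return sum(abs(pi - bi) for pi, bi in zip(p, b)) / len(p)

def rae(p, b):
    # eq. (6): total absolute error relative to predicting the mean
    b_bar = sum(b) / len(b)
    num = sum(abs(pi - bi) for pi, bi in zip(p, b))
    den = sum(abs(bi - b_bar) for bi in b)
    return num / den

p = [0.9, 0.2, 0.7, 0.4]   # predicted values (illustrative)
b = [1.0, 0.0, 1.0, 0.0]   # actual values (illustrative)
```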
G. Specificity/ TN Rate:
It is the proportion of people without the disease who have a negative test result [13].

Specificity = Tn / (Fp + Tn) ...eq.(7)
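Eq. (7) as code, reusing the same hypothetical confusion counts as in the accuracy example:

```python
def specificity(Tn, Fp):
    # eq. (7): fraction of actual negatives that test negative
    return Tn / (Fp + Tn)

spec = specificity(Tn=157, Fp=18)   # illustrative counts
```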
H. Precision (PRE):
Researchers define it as [13]:

PRE = Tp / (Tp + Fp) ...eq.(8)
Fig. 4: ROC Curve of Both Class for (a) Bagging at seed 3
(b) Logistic Regression at seed 3 (c) Random Forest at seed 1
I. Recall (REC):
Researchers defined this parameter as follows [13],
REC = Tp / (Tp + Fn) ...eq.(9)
J. F-Measure:
If it is denoted by FM, then

FM = 2 × (PRE × REC) / (PRE + REC) ...eq.(10)
   = 2 × Tp / (2 × Tp + Fp + Fn) ...eq.(11)
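Eqs. (8) through (11) as code; the counts are hypothetical, and the test confirms that the two forms of the F-Measure agree:

```python
def precision(Tp, Fp):
    # eq. (8)
    return Tp / (Tp + Fp)

def recall(Tp, Fn):
    # eq. (9)
    return Tp / (Tp + Fn)

def f_measure(Tp, Fp, Fn):
    # eq. (11): algebraically equal to the harmonic mean in eq. (10)
    return 2 * Tp / (2 * Tp + Fp + Fn)

# Illustrative confusion counts
pre = precision(Tp=80, Fp=20)
rec = recall(Tp=80, Fn=10)
fm = f_measure(Tp=80, Fp=20, Fn=10)
```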
K. MCC:
It reflects the cohesion between PRE and REC [13].
TABLE III: Comparison of Statistical Metrics for SEED 1

Evaluation Metrics                    Bagging       Logistic      Random Forest
Correctly Classified Instances        86.77%        80.88%        90.29%
Incorrectly Classified Instances      13.24%        19.12%        9.71%
Kappa statistic                       0.75          0.65          0.81
KB Information Score                  298.03 bits   279.89 bits   317.38 bits
Class complexity — order 0            454.75 bits   454.75 bits   454.75 bits
Class complexity — scheme             186.61 bits   8551.65 bits  148.78 bits
Mean absolute error                   0.15          0.16          0.13
Root mean squared error               0.26          0.33          0.23
Relative absolute error               41.11%        43.05%        35.30%
Root relative squared error           60.31%        77.54%        54.31%
Coverage of cases (0.95 level)        -             -             -
Mean rel. region size (0.95 level)    -             -             -
Specificity/TN Rate (Weighted Avg.)   0.62          0.88          0.92
Precision (Weighted Avg.)             0.87          0.81          0.91
Recall (Weighted Avg.)                -             -             -
F-Measure (Weighted Avg.)             0.87          0.81          0.90
MCC (Weighted Avg.)                   0.75          0.64          0.82
ROC Area (Weighted Avg.)              0.94          0.85          0.97
PRC Area (Weighted Avg.)              0.93          0.78          0.96
L. ROC Area:
It is the probability that a randomly chosen positive in-
stance in the test data is ranked above a randomly chosen
negative instance, based on the ranking produced by the
classifier [13].
M. PRC Area:
It is an alternative summary statistic that is preferred by some practitioners, particularly in the information retrieval area [13].
N. Explanation of the Analysis:
The analysis has been carried out over 3 seeds for each algorithm, for 3 classes named Typical, Non-Typical and Both. Fig. 2 shows only the best performance curves for class Typical. For this class, BAG and LR gave their best performance at seed 3, with accuracies of 89.1176% and 83.2353% respectively. RF performed better at seed 1 than at seeds 2 or 3 for the same class, with an accuracy of 90.2941%, which is very impressive.
In Fig. 3, the scenario for class Non-Typical is not much different from Fig. 2: BAG and LR again performed best at seed 3, and similarly RF at seed 1, but keep in mind that this time the class is Non-Typical. Moreover, in Fig. 4 the results are again similar to Figs. 2 and 3: BAG and LR again gave their best performance at seed 3 for class Both.
Table III shows the performance parameters and results of our study, where the CCI are 86.7647%, 80.8824% and 90.2941% for BAG, LR and RF respectively, and the values of their KS are 0.753, 0.6512 and 0.8126. The values of Specificity are 0.62, 0.881 and 0.923, while the values of PRE are 0.866, 0.809 and 0.908. In addition, the ROC and PRC Areas are 0.944, 0.847, 0.972 and 0.930, 0.775, 0.963 respectively.
TABLE IV: Comparison of Statistical Metrics for SEED 2

Evaluation Metrics                    Bagging       Logistic      Random Forest
Correctly Classified Instances        87.65%        80.29%        88.82%
Incorrectly Classified Instances      12.35%        19.71%        11.18%
Kappa statistic                       0.76          0.64          0.78
KB Information Score                  320.91 bits   263.70 bits   313.21 bits
Class complexity — order 0            454.75 bits   454.75 bits   454.75 bits
Class complexity — scheme             146.77 bits   1550.27 bits  155.03 bits
Mean absolute error                   0.12          0.17          0.13
Root mean squared error               0.23          0.35          0.24
Relative absolute error               34.05%        47.06%        36.25%
Root relative squared error           54.37%        82.03%        55.72%
Coverage of cases (0.95 level)        99.41%        87.06%        99.41%
Mean rel. region size (0.95 level)    59.71%        49.61%        62.84%
Specificity/TN Rate (Weighted Avg.)   0.87          0.81          0.90
Precision (Weighted Avg.)             0.88          0.80          0.90
Recall (Weighted Avg.)                0.88          0.80          0.89
F-Measure (Weighted Avg.)             0.87          0.80          0.89
MCC (Weighted Avg.)                   0.76          0.63          0.79
ROC Area (Weighted Avg.)              0.97          0.82          0.97
PRC Area (Weighted Avg.)              0.97          0.74          0.96
TABLE V: Comparison of Statistical Metrics for SEED 3

Evaluation Metrics                    Bagging       Logistic      Random Forest
Correctly Classified Instances        89.12%        83.24%        87.65%
Incorrectly Classified Instances      10.88%        16.76%        12.35%
Kappa statistic                       0.79          0.69          0.76
KB Information Score                  305.73 bits   279.89 bits   320.91 bits
Class complexity — order 0            454.75 bits   454.75 bits   454.75 bits
Class complexity — scheme             176.04 bits   8551.65 bits  146.77 bits
Mean absolute error                   0.14          0.15          0.12
Root mean squared error               0.25          0.32          0.23
Relative absolute error               39.42%        41.77%        34.05%
Root relative squared error           58.07%        75.61%        54.37%
Coverage of cases (0.95 level)        98.82%        89.12%        99.41%
Mean rel. region size (0.95 level)    64.80%        49.12%        59.71%
Specificity/TN Rate (Weighted Avg.)   0.92          0.87          0.90
Precision (Weighted Avg.)             0.89          0.83          0.88
Recall (Weighted Avg.)                0.89          0.83          0.88
F-Measure (Weighted Avg.)             0.89          0.83          0.87
MCC (Weighted Avg.)                   0.79          0.69          0.76
ROC Area (Weighted Avg.)              0.95          0.85          0.97
PRC Area (Weighted Avg.)              0.94          0.79          0.96
Table IV presents the same variables for seed 2. The CCI for BAG, LR and RF are 87.6471%, 80.2941% and 88.8235% respectively, and the KS values are 0.7604, 0.6417 and 0.7832. The values of Specificity in this case for BAG, LR and RF are respectively 0.868, 0.809 and 0.903, while 0.882, 0.804 and 0.896 are the values of PRE. In addition, REC and F-Measure for BAG are 0.876 and 0.872, for LR both are the same at 0.803, and for RF they are 0.888 and 0.885. MCC for
TABLE VI: Comparison with Other Systems

Ref.   Sample  No. of    Algorithms                               Accuracy  Perspective
       Size    Features
[5]    768     8         K-means with J48 Decision Tree (DT)      90.04%    Classification
[7]    4322    2         Ensemble Boosting with Perceptron        75%       Ensemble Learning
                         Algorithm (EPA)
[8]    768     9         ANN                                      80.86%    Classification
                         KNN                                      77.24%
[9]    768     8         LR                                       77.47%    Classification
                         Deep Neural Network (DNN)                77.86%
                         Support Vector Machine (SVM)             77.60%
                         DT                                       76.30%
                         Naive Bayes (NB)                         75.78%
[11]   3075    14        Bagging with DT, RF & LR                 89.17%    Classification
[12]   768     9         NB, SMO, MLP, REP Tree & J48             76.80%    Classification
Ours   340     27        Bagging                                  89.12%    Classification
                         Logistic Regression                      83.24%
                         Random Forest                            90.29%
BAG is 0.762, for LR 0.633 and for RF it is 0.787.
The results of the performance measures for seed 3 are shown in Table V. Specificity for BAG is 0.919, and 0.873 and 0.904 for LR and RF. PRE are 0.893, 0.832 and 0.882, and REC 0.891, 0.832 and 0.876, for BAG, LR and RF respectively. ROC and PRC Areas are 0.950, 0.851, 0.974 and 0.940, 0.791, 0.966 respectively for BAG, LR and RF.
BAG and LR provided their best accuracy at seed 3, that is, after the data had been shuffled three times. On the other hand, the RF algorithm impressively gave its best accuracy at its first seed. So, comparatively, RF is the best algorithm among them.
Table VI illustrates comparisons between various previous systems and our proposed system. The models have been compared based on sample size, number of features, algorithms and accuracy.
Among our proposed systems, the one with the Random Forest algorithm has given the highest accuracy, 90.29%, which is the best accuracy compared with the algorithms used in the previous systems.
V. CONCLUSION
Despite major limitations, the study has been completed successfully with the expected outcomes. Collecting real-time data was one of the main problems we faced at the initial stage, and after managing it, another hurdle was filling in the missing data, since there were several missing values in the dataset. Using MLT, however, we have overcome these issues and performed the analysis to achieve our goal. Among the three algorithms, RF gives the best performance, better than BAG and LR, and BAG performed better than LR. In future, we will extend this study with more algorithms, such as ANN (more specifically a Neuro-Fuzzy Inference System), CNN (Convolutional Neural Network) and advanced Ensemble Learning algorithms. An expert system can be developed from our analysis to predict diabetes more efficiently and effectively.
REFERENCES
[1] R. Basu, “Type 1 Diabetes ”, National Institute of Diabetes and
Digestive and Kidney Diseases (NIDDK), 2017.
[2] S. Akter, M. Rahman, S. Krull Abe and P. Sultana, “Preva-
lence of diabetes and prediabetes and their risk factors among
Bangladeshi adults: a nationwide survey ”, Bulletin of the World
Health Organization, vol. 92, no. 3, pp. 153-228, 2014. Available:
https://www.who.int/bulletin/volumes/92/3/13-128371/en/. [Accessed 8
January 2019].
[3] ScienceDaily, “A better way to predict diabetes: Scientists develop
highly accurate method to predict type 2 diabetes after delivery in
women with gestational diabetes ”, Science News, Toronto, 2016.
[4] I. Kononenko, “Machine learning for medical diagnosis: history, state
of the art and perspective ”, Artificial Intelligence in Medicine, vol.
23, no. 1, pp. 89-109, 2001. Available: 10.1016/s0933-3657(01)00077-
x [Accessed 25 January 2019].
[5] W. Chen, S. Chen, H. Zhang and T. Wu, “A Hybrid Prediction Model
for Type 2 Diabetes Using K-means and Decision Tree ”, in 2017 8th
IEEE International Conference on Software Engineering and Service
Science (ICSESS), Beijing, China, 2017.
[6] D. Shetty, K. Rit, S. Shaikh and N. Patil, “Diabetes disease prediction
using data mining ”, in 2017 International Conference on Innovations in
2017 Information, Embedded and Communication Systems (ICIIECS),
Coimbatore, India, 2017.
[7] R. Mirshahvalad and N. Zanjani, “Diabetes Prediction Using Ensemble
Perceptron Algorithm ”, in 2017 9th International Conference on Com-
putational Intelligence and Communication Networks (CICN), Girne,
Cyprus, 2017.
[8] I. Jasim, A. Duru, K. Shaker, B. Abed and H. Saleh, “Evaluation
and measuring classifiers of diabetes diseases ”, in 2017 International
Conference on Engineering and Technology (ICET), Antalya, Turkey,
2017.
[9] S. Wei, X. Zhao and C. Miao, “A Comprehensive Exploration to the
Machine Learning Techniques for Diabetes Identification ”, in 2018
IEEE 4th World Forum on Internet of Things (WF-IoT), Singapore,
Singapore, 2018.
[10] E. Pustozerov and P. Popova, “Mobile-based decision support sys-
tem for gestational diabetes mellitus ”, in 2018 Ural Symposium on
Biomedical Engineering, Radioelectronics and Information Technology
(USBEREIT), Yekaterinburg, Russia, 2018.
[11] S. Manna, S. Maity, S. Munshi and M. Adhikari, “Diabetes Prediction
Model Using Cloud Analytics ”, in 2018 International Conference on
Advances in Computing, Communications and Informatics (ICACCI),
Bangalore, India, 2018.
[12] D. Verma and N. Mishra, “Analysis and prediction of breast cancer and
diabetes disease datasets using data mining classification techniques ”,
in 2017 International Conference on Intelligent Sustainable Systems
(ICISS), Palladam, India, 2017.
[13] I. Witten, E. Frank and M. Hall, Data Mining practical Machine
Learning Tools and Techniques, 3rd ed. Morgan Kaufmann, 2011, pp.
166-580.
[14] J. Han, M. Kamber and J. Pei, Data Mining Concepts and Techniques,
3rd ed. Morgan Kaufmann, 2011, pp. 370-382.
[15] J. Brownlee, “A Gentle Introduction to k-fold Cross-Validation ”,
Machine Learning Mastery, 2018.
[16] W. Dean, Big Data Mining, and Machine Learning: Value Creation for
Business Leaders and Practitioners (Wiley and SAS Business Series).
Wiley, 2014, pp.124-125.
[17] B. Ratner, Statistical and Machine-Learning Data Mining: Techniques
for Better Predictive Modeling and Analysis of Big Data, 2nd ed. CRC
Press, 2011, pp.97-98.
[18] I. Kononenko and I. Bratko, “Information-Based Evaluation Criterion
for Classifier's Performance”, Machine Learning, vol. 6, no. 1, pp. 67-
80, 1991. [Accessed 21 January 2019].

CompareContrast Essay Outline Essay Outline TheCompareContrast Essay Outline Essay Outline The
CompareContrast Essay Outline Essay Outline The
 
Good Essay Guide Essay Writing Skills, Writing Lesso
Good Essay Guide Essay Writing Skills, Writing LessoGood Essay Guide Essay Writing Skills, Writing Lesso
Good Essay Guide Essay Writing Skills, Writing Lesso
 
Literature Review Chicago Style Sample Welcome T
Literature Review Chicago Style Sample Welcome TLiterature Review Chicago Style Sample Welcome T
Literature Review Chicago Style Sample Welcome T
 
100Th Day Writing Paper With Border And 3-Ruled Lines -
100Th Day Writing Paper With Border And 3-Ruled Lines -100Th Day Writing Paper With Border And 3-Ruled Lines -
100Th Day Writing Paper With Border And 3-Ruled Lines -
 
014 Essay Example Descriptive Person Writing
014 Essay Example Descriptive Person Writing014 Essay Example Descriptive Person Writing
014 Essay Example Descriptive Person Writing
 
6 Essay Writing Tips For Scoring Good Grades
6 Essay Writing Tips For Scoring Good Grades6 Essay Writing Tips For Scoring Good Grades
6 Essay Writing Tips For Scoring Good Grades
 
Scholarship Essay Graduate Program Essay Examples
Scholarship Essay Graduate Program Essay ExamplesScholarship Essay Graduate Program Essay Examples
Scholarship Essay Graduate Program Essay Examples
 
Writing A Strong Introduction To A Descriptive Essay
Writing A Strong Introduction To A Descriptive EssayWriting A Strong Introduction To A Descriptive Essay
Writing A Strong Introduction To A Descriptive Essay
 
Abstract Writing For Research Papers. How To Make Your
Abstract Writing For Research Papers. How To Make YourAbstract Writing For Research Papers. How To Make Your
Abstract Writing For Research Papers. How To Make Your
 
Essay On Child Labour In English How To Write Essay On Child Labour
Essay On Child Labour In English How To Write Essay On Child LabourEssay On Child Labour In English How To Write Essay On Child Labour
Essay On Child Labour In English How To Write Essay On Child Labour
 
Short Essay College Apa Format Paper Does Apa F
Short Essay College Apa Format Paper Does Apa FShort Essay College Apa Format Paper Does Apa F
Short Essay College Apa Format Paper Does Apa F
 
Pustakachi Atmakatha In Marathi Plz Help - Brainly.In
Pustakachi Atmakatha In Marathi Plz Help - Brainly.InPustakachi Atmakatha In Marathi Plz Help - Brainly.In
Pustakachi Atmakatha In Marathi Plz Help - Brainly.In
 
How To Write An Intro Paragraph For A Synthesis Essay - Airey Pen
How To Write An Intro Paragraph For A Synthesis Essay - Airey PenHow To Write An Intro Paragraph For A Synthesis Essay - Airey Pen
How To Write An Intro Paragraph For A Synthesis Essay - Airey Pen
 
(PDF) Guide To Writing Philosophy Essays Rhod
(PDF) Guide To Writing Philosophy Essays Rhod(PDF) Guide To Writing Philosophy Essays Rhod
(PDF) Guide To Writing Philosophy Essays Rhod
 
Social Issues Essay By Kelvin. Online assignment writing service.
Social Issues Essay By Kelvin. Online assignment writing service.Social Issues Essay By Kelvin. Online assignment writing service.
Social Issues Essay By Kelvin. Online assignment writing service.
 
How To Write A College Essay Step By Step Guid
How To Write A College Essay Step By Step GuidHow To Write A College Essay Step By Step Guid
How To Write A College Essay Step By Step Guid
 

Recently uploaded

Personalisation of Education by AI and Big Data - Lourdes Guàrdia
Personalisation of Education by AI and Big Data - Lourdes GuàrdiaPersonalisation of Education by AI and Big Data - Lourdes Guàrdia
Personalisation of Education by AI and Big Data - Lourdes Guàrdia
EADTU
 
SPLICE Working Group: Reusable Code Examples
SPLICE Working Group:Reusable Code ExamplesSPLICE Working Group:Reusable Code Examples
SPLICE Working Group: Reusable Code Examples
Peter Brusilovsky
 
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
中 央社
 

Recently uploaded (20)

AIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.pptAIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.ppt
 
An Overview of the Odoo 17 Knowledge App
An Overview of the Odoo 17 Knowledge AppAn Overview of the Odoo 17 Knowledge App
An Overview of the Odoo 17 Knowledge App
 
How To Create Editable Tree View in Odoo 17
How To Create Editable Tree View in Odoo 17How To Create Editable Tree View in Odoo 17
How To Create Editable Tree View in Odoo 17
 
Đề tieng anh thpt 2024 danh cho cac ban hoc sinh
Đề tieng anh thpt 2024 danh cho cac ban hoc sinhĐề tieng anh thpt 2024 danh cho cac ban hoc sinh
Đề tieng anh thpt 2024 danh cho cac ban hoc sinh
 
Mattingly "AI & Prompt Design: Named Entity Recognition"
Mattingly "AI & Prompt Design: Named Entity Recognition"Mattingly "AI & Prompt Design: Named Entity Recognition"
Mattingly "AI & Prompt Design: Named Entity Recognition"
 
male presentation...pdf.................
male presentation...pdf.................male presentation...pdf.................
male presentation...pdf.................
 
8 Tips for Effective Working Capital Management
8 Tips for Effective Working Capital Management8 Tips for Effective Working Capital Management
8 Tips for Effective Working Capital Management
 
UChicago CMSC 23320 - The Best Commit Messages of 2024
UChicago CMSC 23320 - The Best Commit Messages of 2024UChicago CMSC 23320 - The Best Commit Messages of 2024
UChicago CMSC 23320 - The Best Commit Messages of 2024
 
Sternal Fractures & Dislocations - EMGuidewire Radiology Reading Room
Sternal Fractures & Dislocations - EMGuidewire Radiology Reading RoomSternal Fractures & Dislocations - EMGuidewire Radiology Reading Room
Sternal Fractures & Dislocations - EMGuidewire Radiology Reading Room
 
Personalisation of Education by AI and Big Data - Lourdes Guàrdia
Personalisation of Education by AI and Big Data - Lourdes GuàrdiaPersonalisation of Education by AI and Big Data - Lourdes Guàrdia
Personalisation of Education by AI and Big Data - Lourdes Guàrdia
 
Observing-Correct-Grammar-in-Making-Definitions.pptx
Observing-Correct-Grammar-in-Making-Definitions.pptxObserving-Correct-Grammar-in-Making-Definitions.pptx
Observing-Correct-Grammar-in-Making-Definitions.pptx
 
How to Send Pro Forma Invoice to Your Customers in Odoo 17
How to Send Pro Forma Invoice to Your Customers in Odoo 17How to Send Pro Forma Invoice to Your Customers in Odoo 17
How to Send Pro Forma Invoice to Your Customers in Odoo 17
 
SPLICE Working Group: Reusable Code Examples
SPLICE Working Group:Reusable Code ExamplesSPLICE Working Group:Reusable Code Examples
SPLICE Working Group: Reusable Code Examples
 
Major project report on Tata Motors and its marketing strategies
Major project report on Tata Motors and its marketing strategiesMajor project report on Tata Motors and its marketing strategies
Major project report on Tata Motors and its marketing strategies
 
Supporting Newcomer Multilingual Learners
Supporting Newcomer  Multilingual LearnersSupporting Newcomer  Multilingual Learners
Supporting Newcomer Multilingual Learners
 
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptxAnalyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
 
Basic Civil Engineering notes on Transportation Engineering & Modes of Transport
Basic Civil Engineering notes on Transportation Engineering & Modes of TransportBasic Civil Engineering notes on Transportation Engineering & Modes of Transport
Basic Civil Engineering notes on Transportation Engineering & Modes of Transport
 
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
 
How to Manage Website in Odoo 17 Studio App.pptx
How to Manage Website in Odoo 17 Studio App.pptxHow to Manage Website in Odoo 17 Studio App.pptx
How to Manage Website in Odoo 17 Studio App.pptx
 
Graduate Outcomes Presentation Slides - English (v3).pptx
Graduate Outcomes Presentation Slides - English (v3).pptxGraduate Outcomes Presentation Slides - English (v3).pptx
Graduate Outcomes Presentation Slides - English (v3).pptx
 

An Empirical Study On Diabetes Mellitus Prediction For Typical And Non-Typical Cases Using Machine Learning Approaches

Md. Tanvir Islam, M. Raihan, Fahmida Farzana, Md. Golam Morshed Raju and Md. Bellal Hossain
Department of Computer Science and Engineering, North Western University, Khulna, Bangladesh
Electronics and Communication Engineering Discipline, Khulna University, Khulna, Bangladesh
Emails: tanvirislamnwu@gmail.com, mraihan@ieee.org, mraihan@nwu.edu.bd, raihanbme@gmail.com, tanni.dorsonindrio@gmail.com, golam.morshed.raju.cse01@gmail.com and md.bellal.ku@gmail.com

Abstract—Diabetes is a non-communicable disease that is increasing at an alarming rate all over the world. A high blood sugar level, or a lack of insulin, is its primary cause. It is therefore important to find an effective way to predict diabetes before it becomes a major threat to human health; diabetes can be brought under control at an early stage if precautions are taken. For this study, we collected 340 instances with 26 features from patients who already have diabetes, with various symptoms categorized into two types, Typical and Non-Typical. The cross-validation technique was used for training the dataset, and three Machine Learning (ML) algorithms, Bagging, Logistic Regression and Random Forest, were used for classification. The accuracy is 89.12% for Bagging, 83.24% for Logistic Regression and 90.29% for Random Forest, which is very encouraging.

Keywords—Diabetes Mellitus, Type-2, Machine Learning, Bagging, Logistic Regression, Random Forest, Typical, Non-Typical.

I. INTRODUCTION

Nowadays, in the era of modern technology, many people suffer from numerous diseases, and diabetes is one of them. Diabetes Mellitus (DM) arises when the level of sugar in our blood is too high, even though sugar is our primary source of energy [1].
In 2011, about 8.4 million people, or 10% of the total population, had diabetes in Bangladesh, and according to the International Diabetes Federation (IDF) the prevalence of diabetes among adults will rise to 13% by 2030 [2]. A news report in Science Daily states that if diabetes can be caught at an earlier stage, the chances of minimizing its devastating effects are much greater [3].

Machine Learning Techniques (MLT) are broadly utilized in medical prediction [4]. For example, a prediction model was developed to predict Type 2 Diabetes (T2D) using K-means and a Decision Tree; the accuracy, specificity and sensitivity of that model are 90.04%, 91.28% and 87.27% respectively [5]. The accuracy achieved can differ from algorithm to algorithm, so it is important to determine which of the currently available algorithms performs best. In this analysis, our goal is to measure the accuracy of three popular algorithms, Bagging (BAG), Logistic Regression (LR) and Random Forest (RF), by analyzing the dataset, to compare their performance, and to integrate these techniques into a mobile or web system to develop an expert system.

The rest of the manuscript is arranged as follows: sections II and III elaborate the related works and the methodology, respectively, with particular attention to the fairness of the classifier algorithms. Section IV clarifies the outcome of this analysis and justifies the novelty of this work. Finally, the paper is concluded with section V.

II. RELATED WORKS

Deeraj et al. (2017) [6] proposed using the Bayesian and K-Nearest Neighbor algorithms to predict diabetes using data mining. The result of the prediction depends on the attributes taken, for example age, number of pregnancies, triceps skin-fold thickness, BMI, diabetes pedigree function, diastolic blood pressure and so on.
Another study, performed on three available datasets, compared the behavior of perceptron algorithms and found that the proposed ensemble algorithm is better than a single perceptron algorithm [7]. Using the K-Nearest Neighbor (KNN) and Artificial Neural Network (ANN) classification algorithms, another research team evaluated classifiers for diabetes diseases. They used a total of 768 instances with 9 attributes for this evaluation; the accuracy of ANN and KNN was 80.86% and 77.24% respectively [8]. With a Deep Neural Network (DNN) and a Support Vector Machine (SVM), a system was developed to predict diabetes. It took a total of 8 significant patient attributes, such as age, number of times pregnant, plasma glucose concentration, diastolic blood pressure and body mass index, and achieved 77.86% accuracy [9]. A mobile-based decision support system was developed for gestational Diabetes Mellitus (DM) [10]. Another diabetes prediction system was developed using cloud analytics, with 3075 distinct person records, each with 14 variables, covering all age groups of people who may or may not have been diagnosed with diabetes. The authors found that Logistic Regression outperformed Random Forest in all cases, which is why they chose LR as their model [11]. Deepika Verma and Nidhi Mishra conducted a study to identify diabetes with the Naive Bayes (NB), J48, SMO, MLP and REP Tree algorithms and found that SMO gives 76.80% accuracy on the diabetes dataset [12].

Since diabetes is increasing at an alarming rate and has become one of the major health issues for human beings, we have felt the importance of finding an effective and efficient solution. From this perspective, we have been motivated to increase the accuracy of Machine Learning algorithms by using a clinical dataset of diabetes-affected patients.

IEEE - 45670, 10th ICCCNT 2019, July 6-8, 2019, IIT - Kanpur, Kanpur, India

III. METHODOLOGY

Our strategy can be separated into four primary segments:
• Data Collection
• Data Preprocessing
• Data Training
• Application of Machine Learning Algorithms

An overall work-flow of our study is shown in Fig. 1.

Fig. 1: Work-flow of the Overall Analysis (import the dataset of 340 instances and 27 features; preprocess the data, replacing missing values with the mean, median and mode; select the most significant features with the Best First Search and Ranker algorithms; apply the 10-fold cross-validation technique; apply the Bagging, Logistic Regression and Random Forest classifiers; determine the statistical metrics; and compare their performance.)

A. Data Collection: For this analysis, we collected data from Khulna Diabetes Hospital, Khulna. The dataset includes a total of 340 instances, each with 27 significant features. It contains the patients' basic information and two types of symptoms, Typical and Non-Typical; Table I lists the symptoms in each category.

TABLE I: Types and Names of Symptoms
Typical: Thirst, Hunger, Weight Loss, Weakness
Non-Typical: Headache for High Blood Pressure, Burning Extremities, Weakness

B. Data Preprocessing: To handle missing information we used two popular and useful functions of WEKA 3.8 (Waikato Environment for Knowledge Analysis). First, the ReplaceMissingValues function was used to replace missing data; it swaps every missing value of a nominal or numeric attribute with the mode or mean of that attribute [13]. We also used the Randomize function, which can fill up the missing fields without sacrificing too much performance [13].

C. Data Training: To train all the features of the dataset shown in Table II, we used the 10-fold cross-validation technique.
It is a re-sampling technique for evaluating predictive models by partitioning the original sample into a training set to train the model and a test set to evaluate it [14]. The method has a single parameter, k, that refers to the number of groups a given data sample is to be split into. It shuffles the dataset randomly, splits the dataset into 10 groups, and finally summarizes the skill of the model using the sample of model evaluation scores [15].

D. Application of ML Algorithms: With the preprocessed and trained dataset, we applied three algorithms: Bagging, Logistic Regression and Random Forest.

1) Bagging (BAG): It is an ensemble method that re-samples the training data to create a new model for each sample that is drawn [16]. It builds an ensemble of classification models for a learning scheme in which each model gives an equally weighted prediction [14].
Input:
• R, a set of h training tuples
• t, the number of models in the ensemble
• a classification learning scheme (Decision Tree algorithm, Naive Bayesian, etc.)
Output: the ensemble, an associated model, L.
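The training and classification set-up described above can be sketched with scikit-learn. In this sketch the synthetic make_classification data merely stands in for the clinical dataset, which is not public, and the default hyperparameters are assumptions rather than the paper's exact WEKA configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data mirroring the dataset's shape: 340 instances, 26 features.
X, y = make_classification(n_samples=340, n_features=26, random_state=1)

models = {
    "Bagging": BaggingClassifier(random_state=1),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=1),
}
for name, model in models.items():
    # 10-fold cross-validation, as used for training in this study
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: mean accuracy = {scores.mean():.4f}")
```

Changing `random_state` plays the role of the "seed" parameter discussed in the outcomes section: a different seed shuffles and resamples differently, so the scores vary slightly.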
TABLE II: Features List (ranges with Mean ± SD, or category percentages)
Age: Lowest 22, Highest 80; 49.21 ± 11.95
Sex: Male 47.65%, Female 52.35%
Profession: Lowest: Supervisor, Unemployed, Servant, Banker 1.18%; Highest: Housewife 52.65%; Rest 46.17%
Height: Lowest 133 cm, Highest 188 cm; 158.58 ± 8.40
Weight: Lowest 33 kg, Highest 105 kg; 62.65 ± 11.03
Body Mass Index: Lowest 11.6 kg/m², Highest 52.1 kg/m²; 24.81 ± 3.53
Heart Rate: Lowest 56 bpm, Highest 90 bpm; 74.94 ± 6.11
Systolic Blood Pressure: Lowest 90 mmHg, Highest 200 mmHg; 129.1 ± 12.39
Diastolic Blood Pressure: Lowest 40 mmHg, Highest 110 mmHg; 81.52 ± 7.16
Blood Sugar Before Meal: Lowest 4.2 mmol/L, Highest 45.5 mmol/L; 12.91 ± 4.83
Blood Sugar After Meal: Lowest 4.4 mmol/L, Highest 45.8 mmol/L; 18.51 ± 5.42
Urine Color Before Meal: Lowest: Orange 3.24%; Highest: Green 68.82%; Rest 27.94%
Urine Color After Meal: Lowest: Cyan 0.29%; Highest: Green 65.00%; Rest 34.71%
Drug History: Yes 99.41%, No 0.59%
Weight Loss: Yes 78.24%, No 21.76%
Thirst: Yes 92.35%, No 7.65%
Hunger: Yes 78.24%, No 21.76%
Relatives: Yes 79.41%, No 20.59%
Physical Activity: Yes 97.64%, No 2.36%
Smoking: Yes 6.18%, No 94.82%
Tobacco Chewing: Yes 12.9%, No 87.1%
Headache for High BP: Yes 81.18%, No 18.82%
Burning Extremities: Yes 82.94%, No 17.06%
Weakness: Yes 95.88%, No 4.12%
Symptoms Duration: Lowest 1 day, Highest 3650 days; Mean 264.25, SD 292.437
Diabetes Mellitus: Yes 99.4%, No 0.59%
Outcome: Typical 61.18%, Non-Typical 14.70%, Both 24.12%
(SD = Standard Deviation)

2) Logistic Regression (LR): LR is a well-known technique for classifying individuals into two mutually exclusive and exhaustive categories, for instance buyer vs. non-buyer, or responder vs. non-responder [17]. It predicts a probabilistic outcome for the occurrence of an event by fitting the data to a logit function [17]. LR computes the logit of L, the log of the odds of an individual belonging to class 1, which can without much of a stretch be converted into the probability of that individual belonging to class 1 [17].
The equations for the logit and the probability are as follows:

Logit L = a0 + a1 × Y1 + a2 × Y2 + · · · + an × Yn   ...eq. (1)

Prob(L = 1) = exp(Logit L) / (1 + exp(Logit L))   ...eq. (2)

An individual's predicted probability of belonging to class 1 is computed by “plugging in” the values of the predictor variables for that individual in the two equations above. Here, the a's are the LR coefficients, which are determined by the calculus-based method of maximum likelihood; a0 is the constant term and has no predictor variable with which it is multiplied [17].

3) Random Forest (RF): RF is another ensemble technique used as a classifier; it is also capable of performing regression tasks [14]. If a training set, Y, of y tuples is given, the procedure for generating t decision trees is as follows. For each iteration, j (j = 1, 2, ..., t), a training set, Yj, of y tuples is sampled with replacement from Y. Let U be the number of attributes used to determine the split at each node, where U is much smaller than the number of available attributes. To construct a decision tree classifier, Nj, randomly select, at each node, U attributes as candidates for the split at that node [14].

IV. OUTCOMES

The result is analyzed based on 22 performance parameters, for example:

A. Seed: It specifies the random number used; changing it gives a different result.

B. Correctly Classified Instances (CCI): The accuracy of the model on the data used for testing [13].

Accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn)   ...eq. (3)

Here, Tp = true positives, Tn = true negatives, Fp = false positives and Fn = false negatives.

C. Kappa Statistics (KS): The Kappa statistic is used to quantify the agreement between the predicted and observed classifications of a dataset [13].

K = (R0 − Re) / (1 − Re)   ...eq. (4)

where R0 = the relative observed agreement among raters and Re = the theoretical probability of chance agreement.
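The two measures just defined, eq. (3) and eq. (4), can be sketched in a few lines; the confusion counts and the chance-agreement value below are invented purely for illustration and are not the study's results.

```python
def accuracy(tp, tn, fp, fn):
    # Eq. (3): correctly classified instances over all instances
    return (tp + tn) / (tp + tn + fp + fn)

def kappa(r0, re):
    # Eq. (4): K = (R0 - Re) / (1 - Re)
    return (r0 - re) / (1.0 - re)

acc = accuracy(tp=80, tn=15, fp=3, fn=2)  # hypothetical confusion counts
k = kappa(r0=acc, re=0.60)                # assumed chance agreement of 0.60
```

A Kappa of 0 means the classifier does no better than chance agreement, while 1 means perfect agreement, which is why KS is reported alongside raw accuracy in Tables III-V.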
Fig. 2: ROC Curve of Typical Class for (a) Bagging at seed 3 (b) Logistic Regression at seed 3 (c) Random Forest at seed 1

D. KB Information Score: For the correct class of an instance B, three cases can be considered [18]:
• if P'(B) > P(B), the score is positive;
• if P'(B) < P(B), the score is negative;
• if P'(B) = P(B), there is no information (score 0);
where P(B) is the prior probability of class B and P'(B) is the posterior probability returned by the classifier.

E. Mean Absolute Error (MAE): It averages the magnitudes of the individual errors without considering their signs [13].

MAE = (|p1 − b1| + · · · + |pn − bn|) / n   ...eq. (5)

Here, p stands for the predicted value and b for the actual value.

Fig. 3: ROC Curve of Non-Typical Class for (a) Bagging at seed 3 (b) Logistic Regression at seed 3 (c) Random Forest at seed 1

F. Relative Absolute Error (RAE): It is the total absolute error with the same kind of normalization [13].

RAE = (|p1 − b1| + · · · + |pn − bn|) / (|b1 − b̄| + · · · + |bn − b̄|)   ...eq. (6)

G. Specificity/TN Rate: It gives the proportion of people without the disease who have a negative test result [13].

Specificity = Tn / (Fp + Tn)   ...eq. (7)

H. Precision (PRE): Researchers characterize it by [13]:

PRE = Tp / (Tp + Fp)   ...eq. (8)
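Eqs. (5)-(8) above transcribe directly into code; the value pairs and confusion counts below are illustrative only, not taken from the study.

```python
def mae(p, b):
    # Eq. (5): mean of the absolute errors |p_i - b_i|
    return sum(abs(pi - bi) for pi, bi in zip(p, b)) / len(p)

def rae(p, b):
    # Eq. (6): total absolute error, normalized by the error of always
    # predicting the mean of the actual values
    b_bar = sum(b) / len(b)
    return (sum(abs(pi - bi) for pi, bi in zip(p, b))
            / sum(abs(bi - b_bar) for bi in b))

def specificity(tn, fp):
    # Eq. (7): Tn / (Fp + Tn)
    return tn / (fp + tn)

def precision(tp, fp):
    # Eq. (8): Tp / (Tp + Fp)
    return tp / (tp + fp)

pred, actual = [0.9, 0.2, 0.7], [1.0, 0.0, 1.0]  # toy predicted/actual values
```

An RAE below 100% means the model beats the trivial baseline of always predicting the mean, which is the sense in which the RAE rows of Tables III-V should be read.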
Fig. 4: ROC Curve of Both Class for (a) Bagging at seed 3 (b) Logistic Regression at seed 3 (c) Random Forest at seed 1

I. Recall (REC): Researchers define this parameter as follows [13]:

REC = Tp / (Tp + Fn)   ...eq. (9)

J. F-Measure: If it is denoted by FM, then

FM = 2 × PRE × REC / (PRE + REC)   ...eq. (10)
   = 2 × Tp / (2 × Tp + Fp + Fn)   ...eq. (11)

K. MCC: The cohesion between PRE and REC [13].

TABLE III: Comparison of Statistical Metrics for SEED 1
(columns: Bagging / Logistic / Random Forest)
Correctly Classified Instances: 86.77% / 80.88% / 90.29%
Incorrectly Classified Instances: 13.24% / 19.12% / 9.71%
Kappa statistic: 0.75 / 0.65 / 0.81
KB Information Score: 298.03 bits / 279.89 bits / 317.38 bits
Class complexity — order 0: 454.75 bits / 454.75 bits / 454.75 bits
Class complexity — scheme: 186.61 bits / 8551.65 bits / 148.78 bits
Mean absolute error: 0.15 / 0.16 / 0.13
Root mean squared error: 0.26 / 0.33 / 0.23
Relative absolute error: 41.11% / 43.05% / 35.30%
Root relative squared error: 60.31% / 77.54% / 54.31%
Coverage of cases (0.95 level): 186.61 bits / 8551.65 bits / 148.78 bits
Mean rel. region size (0.95 level): 0.15 / 0.16 / 0.13
Specificity/TN Rate (Weighted Avg.): 0.26 / 0.33 / 0.23
Precision (Weighted Avg.): 41.11% / 43.05% / 35.30%
Recall (Weighted Avg.): 60.31% / 77.54% / 54.31%
F-Measure (Weighted Avg.): 0.87 / 0.81 / 0.90
MCC (Weighted Avg.): 0.75 / 0.64 / 0.82
ROC Area (Weighted Avg.): 0.94 / 0.85 / 0.97
PRC Area (Weighted Avg.): 0.93 / 0.78 / 0.96

L. ROC Area: It is the probability that a randomly chosen positive instance in the test data is ranked above a randomly chosen negative instance, based on the ranking produced by the classifier [13].

M. PRC Area: It is an alternative summary statistic that is preferred by some specialists, particularly in the information-retrieval area [13].

N. Explanation of the Analysis: The analysis was performed with 3 seeds for each algorithm, for the 3 classes Typical, Non-Typical and Both. Fig. 2 shows only the best-performance curves for the class Typical.
For the class Typical, BAG and LR gave their best performance at seed 3, with accuracies of 89.1176% and 83.2353% respectively. At seed 1, RF performed better than at seeds 2 or 3 for the same class, with an accuracy of 90.2941%, which is very impressive.

In Fig. 3, the scenario for the class Non-Typical is not much different from Fig. 2: BAG and LR again performed best at seed 3 and, similarly, RF at seed 1, but this time for the class Non-Typical. Moreover, in Fig. 4 the results are again similar to Figs. 2 and 3: BAG and LR again gave their best performance at seed 3 for the class Both.

Table III shows the performance parameters and results of our study, where the CCI are 86.7647%, 80.8824% and 90.2941% for BAG, LR and RF respectively, and the values of their KS are 0.753, 0.6512 and 0.8126. The values of Specificity are 0.62, 0.881 and 0.923, while the values of PRE are 0.866, 0.809 and 0.908. In addition, the ROC and PRC Areas are 0.944, 0.847, 0.972 and 0.930, 0.775, 0.963 respectively.
TABLE IV: Comparison of Statistical Metrics for SEED 2
(columns: Bagging / Logistic / Random Forest)
Correctly Classified Instances: 87.65% / 80.29% / 88.82%
Incorrectly Classified Instances: 12.35% / 19.71% / 11.18%
Kappa statistic: 0.76 / 0.64 / 0.78
KB Information Score: 320.91 bits / 263.70 bits / 313.21 bits
Class complexity — order 0: 454.75 bits / 454.75 bits / 454.75 bits
Class complexity — scheme: 146.77 bits / 1550.27 bits / 155.03 bits
Mean absolute error: 0.12 / 0.17 / 0.13
Root mean squared error: 0.23 / 0.35 / 0.24
Relative absolute error: 34.05% / 47.06% / 36.25%
Root relative squared error: 54.37% / 82.03% / 55.72%
Coverage of cases (0.95 level): 99.41% / 87.06% / 99.41%
Mean rel. region size (0.95 level): 59.71% / 49.61% / 62.84%
Specificity/TN Rate (Weighted Avg.): 0.87 / 0.81 / 0.90
Precision (Weighted Avg.): 0.88 / 0.80 / 0.90
Recall (Weighted Avg.): 0.88 / 0.80 / 0.89
F-Measure (Weighted Avg.): 0.87 / 0.80 / 0.89
MCC (Weighted Avg.): 0.76 / 0.63 / 0.79
ROC Area (Weighted Avg.): 0.97 / 0.82 / 0.97
PRC Area (Weighted Avg.): 0.97 / 0.74 / 0.96

TABLE V: Comparison of Statistical Metrics for SEED 3
(columns: Bagging / Logistic / Random Forest)
Correctly Classified Instances: 89.12% / 83.24% / 87.65%
Incorrectly Classified Instances: 10.88% / 16.77% / 12.35%
Kappa statistic: 0.79 / 0.69 / 0.76
KB Information Score: 305.73 bits / 279.89 bits / 320.91 bits
Class complexity — order 0: 454.75 bits / 454.75 bits / 454.745 bits
Class complexity — scheme: 176.04 bits / 8551.65 bits / 146.77 bits
Mean absolute error: 0.14 / 0.15 / 0.12
Root mean squared error: 0.25 / 0.32 / 0.23
Relative absolute error: 39.42% / 41.77% / 34.05%
Root relative squared error: 58.07% / 75.61% / 54.37%
Coverage of cases (0.95 level): 98.82% / 89.12% / 99.41%
Mean rel. region size (0.95 level): 64.80% / 49.12% / 59.71%
Specificity/TN Rate (Weighted Avg.): 0.92 / 0.87 / 0.90
Precision (Weighted Avg.): 0.89 / 0.83 / 0.88
Recall (Weighted Avg.): 0.89 / 0.83 / 0.88
F-Measure (Weighted Avg.): 0.89 / 0.83 / 0.87
MCC (Weighted Avg.): 0.79 / 0.69 / 0.76
ROC Area (Weighted Avg.): 0.95 / 0.85 / 0.97
PRC Area (Weighted Avg.): 0.94 / 0.79 / 0.96

Table IV presents the same variables for seed 2. The CCI for BAG, LR and RF are 87.6471%, 80.2941% and 88.8235% respectively, and the KS values are 0.7604, 0.6417 and 0.7832. The values of Specificity in this case for BAG, LR and RF are 0.868, 0.809 and 0.903 respectively, and 0.882, 0.804 and 0.896 are the values of PRE. In addition, REC and F-Measure for BAG are 0.876 and 0.872, for LR both are the same, 0.803, and for RF they are 0.888 and 0.885. MCC for BAG is 0.762, for LR 0.633 and for RF it is 0.787.

The results of the performance metrics for seed 3 are shown in Table V. Specificity for BAG is 0.919, and 0.873 and 0.904 for LR and RF. PRE values are 0.893, 0.832 and 0.882, and REC values are 0.891, 0.832 and 0.876 for BAG, LR and RF respectively. The ROC and PRC Areas are 0.950, 0.851, 0.974 and 0.940, 0.791, 0.966 respectively for BAG, LR and RF. BAG and LR provided their best accuracy at seed 3, that is, after the dataset had been shuffled 3 times; on the other hand, the RF algorithm impressively gave its best accuracy at its first seed. So, comparatively, RF is the best algorithm among them.

TABLE VI: Comparison with Other Systems
(columns: Reference / Sample Size / No. of Features / Algorithms and Accuracy / Perspective of the Paper)
[5]: 768 / 8 / K-means with J48 Decision Tree (DT), 90.04% / Classification
[7]: 4322 / 2 / Ensemble Boosting with Perceptron Algorithm (EPA), 75% / Ensemble Learning
[8]: 768 / 9 / ANN, 80.86%; KNN, 77.24% / Classification
[9]: 768 / 8 / LR, 77.47%; Deep Neural Network (DNN), 77.86%; Support Vector Machine (SVM), 77.60%; DT, 76.30%; Naive Bayes (NB), 75.78% / Classification
[11]: 3075 / 14 / Bagging with DT, RF and LR, 89.17% / Classification
[12]: 768 / 9 / NB, SMO, MLP, REP Tree and J48, 76.80% / Classification
Our Proposed System: 340 / 27 / Bagging, 89.12%; Logistic Regression, 83.24%; Random Forest, 90.29% / Classification

Table VI illustrates comparisons between various previous systems and our proposed system. The models have been compared based on sample size, number of features, algorithms and accuracy.
Among our proposed systems, the one using the Random Forest algorithm gave the highest accuracy, 90.29%, which is also the best accuracy compared with the algorithms used in the previous systems.

V. CONCLUSION

Despite some major limitations, the study was completed successfully with the expected outcomes. Collecting real-time data was one of the main problems we faced in the initial stages, and once that was managed, another obstacle was filling in the missing data, since there were several missing values in
  • 7. the dataset. But using MLT, we have recovered the issues and performed the analysis to achieve our goal. Among the three algorithms, RF gives the best performance than BAG LR and BAG performed better than LR. In future, we will conduct this study with more algorithms like ANN more specifically with Neuro Fuzzy Inference System, CNN (Convolution Neural Network) and advanced Ensemble Learning algorithms. An expert system can be developed with our analysis to predict diabetes more efficiently and effectively. REFERENCES [1] R. Basu, “Type 1 Diabetes ”, National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), 2017. [2] S. Akter, M. Rahman, S. Krull Abe and P. Sultana, “Preva- lence of diabetes and prediabetes and their risk factors among Bangladeshi adults: a nationwide survey ”, Bulletin of the World Health Organization, vol. 92, no. 3, pp. 153-228, 2014. Available: https://www.who.int/bulletin/volumes/92/3/13-128371/en/. [Accessed 8 January 2019]. [3] ScienceDaily, “A better way to predict diabetes: Scientists develop highly accurate method to predict type 2 diabetes after delivery in women with gestational diabetes ”, Science News, Toronto, 2016. [4] I. Kononenko, “Machine learning for medical diagnosis: history, state of the art and perspective ”, Artificial Intelligence in Medicine, vol. 23, no. 1, pp. 89-109, 2001. Available: 10.1016/s0933-3657(01)00077- x [Accessed 25 January 2019]. [5] W. Chen, S. Chen, H. Zhang and T. Wu, “A Hybrid Prediction Model for Type 2 Diabetes Using K-means and Decision Tree ”, in 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 2017. [6] D. Shetty, K. Rit, S. Shaikh and N. Patil, “Diabetes disease prediction using data mining ”, in 2017 International Conference on Innovations in 2017 Information, Embedded and Communication Systems (ICIIECS), Coimbatore, India, 2017. [7] R. Mirshahvalad and N. 
Zanjani, “Diabetes Prediction Using Ensemble Perceptron Algorithm”, in 2017 9th International Conference on Computational Intelligence and Communication Networks (CICN), Girne, Cyprus, 2017.
[8] I. Jasim, A. Duru, K. Shaker, B. Abed and H. Saleh, “Evaluation and measuring classifiers of diabetes diseases”, in 2017 International Conference on Engineering and Technology (ICET), Antalya, Turkey, 2017.
[9] S. Wei, X. Zhao and C. Miao, “A Comprehensive Exploration to the Machine Learning Techniques for Diabetes Identification”, in 2018 IEEE 4th World Forum on Internet of Things (WF-IoT), Singapore, Singapore, 2018.
[10] E. Pustozerov and P. Popova, “Mobile-based decision support system for gestational diabetes mellitus”, in 2018 Ural Symposium on Biomedical Engineering, Radioelectronics and Information Technology (USBEREIT), Yekaterinburg, Russia, 2018.
[11] S. Manna, S. Maity, S. Munshi and M. Adhikari, “Diabetes Prediction Model Using Cloud Analytics”, in 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Bangalore, India, 2018.
[12] D. Verma and N. Mishra, “Analysis and prediction of breast cancer and diabetes disease datasets using data mining classification techniques”, in 2017 International Conference on Intelligent Sustainable Systems (ICISS), Palladam, India, 2017.
[13] I. Witten, E. Frank and M. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed. Morgan Kaufmann, 2011, pp. 166-580.
[14] J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. Morgan Kaufmann, 2011, pp. 370-382.
[15] J. Brownlee, “A Gentle Introduction to k-fold Cross-Validation”, Machine Learning Mastery, 2018.
[16] W. Dean, Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners (Wiley and SAS Business Series). Wiley, 2014, pp. 124-125.
[17] B.
Ratner, Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data, 2nd ed. CRC Press, 2011, pp. 97-98.
[18] I. Kononenko and I. Bratko, “Information-Based Evaluation Criterion for Classifier's Performance”, Machine Learning, vol. 6, no. 1, pp. 67-80, 1991. [Accessed 21 January 2019].