An Empirical Study on Diabetes Mellitus Prediction for Typical and Non-Typical Cases using Machine Learning Approaches
Md. Tanvir Islam1, M. Raihan2, Fahmida Farzana3, Md. Golam Morshed Raju4 and Md. Bellal Hossain5
Department of Computer Science and Engineering, North Western University, Khulna, Bangladesh1-4
Electronics and Communication Engineering Discipline, Khulna University, Khulna, Bangladesh5
Emails: tanvirislamnwu@gmail.com1, mraihan@ieee.org2, mraihan@nwu.edu.bd2, raihanbme@gmail.com2, tanni.dorsonindrio@gmail.com3, golam.morshed.raju.cse01@gmail.com4 and md.bellal.ku@gmail.com5
Abstract—Diabetes is a non-communicable disease that is increasing at an alarming rate all over the world. A high blood sugar level or a lack of insulin is the primary cause. It is therefore important to find an effective way to predict diabetes before it becomes a major threat to human health, since diabetes can be controlled at an early stage if precautions are taken. For this study, we have collected 340 instances with 26 features from patients who already have diabetes, with symptoms categorized into two types: Typical and Non-Typical. A cross-validation technique has been used for training on the dataset, and three Machine Learning (ML) algorithms, Bagging, Logistic Regression and Random Forest, have been used for classification. The accuracies are 89.12% for Bagging, 83.24% for Logistic Regression and 90.29% for Random Forest, which are very encouraging.
Keywords—Diabetes Mellitus, Type-2, Machine Learning, Bag-
ging, Logistic Regression, Random Forest, Typical, Non-typical.
I. INTRODUCTION
Nowadays, in the era of modern technology, many people suffer from numerous diseases, and diabetes is one of them. Diabetes Mellitus (DM) arises when the level of sugar in our blood is too high, even though that sugar is our primary source of energy [1]. According to the International Diabetes Federation (IDF), in 2011 about 8.4 million people, or 10% of the total population of Bangladesh, had diabetes, and the prevalence of diabetes among adults is expected to increase to 13% by 2030 [2]. A news report in Science Daily states that if it is possible to detect diabetes at an earlier stage, the chances of minimizing its devastating effects are greater [3]. Machine Learning Techniques (MLT) are broadly utilized in medical forecasting [4]. For example, a prediction model was developed to predict Type 2 Diabetes (T2D) using K-means and a Decision Tree; the accuracy, specificity and sensitivity of the proposed model are 90.04%, 91.28% and 87.27% respectively [5]. Accuracy varies from algorithm to algorithm, so comparing them makes it possible to identify the best among the currently available algorithms.
In this analysis, our goal is to determine the accuracy of three popular algorithms, namely Bagging (BAG), Logistic Regression (LR) and Random Forest (RF), by analyzing the dataset, to compare their performance, and to integrate these techniques into a system, such as a mobile or web application, to develop an expert system.
The rest of the manuscript is arranged as follows: sections II and III elaborate the related works and the methodology, respectively, with particular attention to the fairness of the classifier algorithms. In section IV, the outcome of this analysis is explained with the aim of justifying the novelty of this work. Finally, the paper concludes with section V.
II. RELATED WORKS
Shetty et al. 2017 in [6] have proposed using the Bayesian and K-Nearest Neighbor algorithms to predict diabetes malady using data mining. The results of the prediction depend on the chosen attributes, for example age, number of pregnancies, triceps skin fold thickness, BMI, pedigree function, diastolic blood pressure and so on. Another research work compared the behavior of perceptron algorithms on three available datasets and found that the proposed algorithm is better than a plain perceptron algorithm [7]. Using the K-Nearest Neighbor (KNN) and Artificial Neural Network (ANN) classification algorithms, another research team evaluated classifiers for diabetes diseases. They used a total of 768 instances with 9 attributes for this evaluation. The accuracies of ANN and KNN were 80.86% and 77.24% [8]. With a Deep Neural Network (DNN) and a Support Vector Machine (SVM), a system was developed to predict diabetes. They took a total of 8 significant attributes of patients, such as Age, Number of Times Pregnant, Plasma Glucose Concentration, Diastolic Blood Pressure, Body Mass Index etc., and got 77.86% accuracy [9]. A mobile-based decision support system was developed for gestational Diabetes Mellitus (DM) [10]. Another diabetes prediction system was developed using cloud analytics, where 3075 distinct person records with 14 variables, covering all age groups of people who may or may not have been diagnosed with diabetes, were used. They found that Logistic Regression outperformed Random Forest in all cases, which is why they decided to use LR as their model [11]. Deepika Verma and Nidhi Mishra conducted a study to identify diabetes by running the Naive Bayes (NB), J48, SMO, MLP and REP Tree algorithms on a dataset and found that SMO gives 76.80% accuracy on the diabetes dataset [12].
Since diabetes is increasing at an alarming rate and has become one of the major health issues for human beings, we have felt the importance of finding an effective and efficient solution. From this perspective, we have been motivated to improve the accuracy of Machine Learning algorithms by using a clinical dataset of diabetes-affected patients.
IEEE - 45670
10th ICCCNT 2019
July 6-8, 2019, IIT - Kanpur, Kanpur, India
III. METHODOLOGY
We can separate our strategy into four primary segments
as follows:
• Data Collection
• Data Preprocessing
• Data Training
• Applications of Machine Learning Algorithms
An overall work flow of our study has been shown in Fig. 1.
Fig. 1: Work-flow of the Overall Analysis. [Flowchart: Start → import the dataset with 340 instances and 27 features → data preprocessing (replace missing data with mean, median and mode) → feature selection (apply the Best First Search and Ranker algorithms to find the most significant features) → apply the 10-fold cross-validation technique → apply the classification algorithms (Bagging, Random Forest, Logistic Regression) → determine statistical metrics → compare performance → End.]
A. Data Collection:
For this analysis, we have collected data from Khulna Diabetes Hospital, Khulna. The dataset includes a total of 340 instances, each with 27 significant features. The dataset contains basic information of patients and two types of symptoms: Typical and Non-Typical. Table I shows the categories of the symptoms.

TABLE I: Types and Names of Symptoms
Types of Symptoms | Names of Symptoms
Typical | Thirst, Hunger, Weight Loss, Weakness
Non-Typical | Headache for High Blood Pressure, Burning Extremities, Weakness
B. Data Preprocessing:
To handle missing information, we have used two popular and useful functions in WEKA 3.8 (Waikato Environment for Knowledge Analysis). First, the ReplaceMissingValues function has been used to replace missing data; it swaps every missing value of a nominal or numeric attribute with the mode or mean of that attribute, respectively [13]. We have also used another function, named Randomize, which can fill up the missing fields without sacrificing too much performance [13].
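As an illustrative sketch of the mean/mode replacement rule (in Python rather than WEKA, over a hypothetical mini-table rather than the study's data):

```python
from statistics import mean, mode

def replace_missing(rows, numeric_cols):
    """Replace None entries column-wise: the mean for numeric columns and the
    mode for nominal ones, mirroring WEKA's ReplaceMissingValues filter."""
    columns = list(zip(*rows))
    filled = []
    for i, col in enumerate(columns):
        present = [v for v in col if v is not None]
        fill = mean(present) if i in numeric_cols else mode(present)
        filled.append([fill if v is None else v for v in col])
    return [list(row) for row in zip(*filled)]

# Hypothetical mini-table (Age, Thirst) with two missing cells:
rows = [[25, "Yes"], [None, "Yes"], [35, None]]
print(replace_missing(rows, numeric_cols={0}))
```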
C. Data Training:
For training on all the features of the dataset shown in Table II, we have used the 10-Fold Cross-Validation technique. It is a re-sampling technique for evaluating predictive models by partitioning the original sample into a training set to train the model and a test set to evaluate it [14]. The procedure has a single parameter, K, that refers to the number of groups that a given data sample is to be split into. It shuffles the dataset randomly, splits the dataset into 10 groups and finally summarizes the skill of the model using the sample of model evaluation scores [15].
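The 10-fold step can be sketched with scikit-learn as a stand-in for WEKA; the synthetic data below (340 instances, 26 predictive features) is illustrative only, not the clinical dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=340, n_features=26, random_state=1)
# cv=10 splits the data into 10 folds; each fold serves once as the test set,
# and one evaluation score is collected per fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(len(scores), scores.mean())  # 10 per-fold accuracies, then their mean
```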
D. Applications of ML Algorithms:
After preprocessing and training on the dataset, we have applied three algorithms to it: Bagging, Logistic Regression and Random Forest.
1) Bagging (BAG): It is an ensemble method that re-samples the training data to create a new model for each sample that is drawn [16]. It builds an ensemble of classification models for a learning scheme, where each model gives an equally weighted prediction [14].
Input:
• R, a set of h training tuples
• t, the number of models in the ensemble
• A classification learning scheme (Decision Tree Algo-
rithm, Naive Bayesian, etc.)
Output: The ensemble, a composite classification model, L.
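A minimal sketch of this resampling scheme, using scikit-learn's BaggingClassifier as a stand-in for the WEKA implementation (synthetic data; the default decision-tree base learner plays the role of the learning scheme):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=340, n_features=26, random_state=1)
# Each of the t = 10 models is fit on a bootstrap resample of the training tuples;
# predictions are combined with equal weight.
ensemble = BaggingClassifier(n_estimators=10, random_state=0).fit(X, y)
print(len(ensemble.estimators_))  # one fitted model per bootstrap sample
```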
TABLE II: Features List
Feature | Subcategory | Data Distribution (Mean ± SD)
Age | Lowest: 22, Highest: 80 | 49.21 ± 11.95
Sex | Male 47.65%, Female 52.35% |
Profession | Lowest: Supervisor, Unemployed, Servant, Banker 1.18%; Highest: Housewife 52.65%; Rest 46.17% |
Height | Lowest: 133 cm, Highest: 188 cm | 158.58 ± 8.40
Weight | Lowest: 33 kg, Highest: 105 kg | 62.65 ± 11.03
Body Mass Index | Lowest: 11.6 kg/m², Highest: 52.1 kg/m² | 24.81 ± 3.53
Heart Rate | Lowest: 56 bpm, Highest: 90 bpm | 74.94 ± 6.11
Systolic Blood Pressure | Lowest: 90 mmHg, Highest: 200 mmHg | 129.1 ± 12.39
Diastolic Blood Pressure | Lowest: 40 mmHg, Highest: 110 mmHg | 81.52 ± 7.16
Blood Sugar Before Meal | Lowest: 4.2 mmol/L, Highest: 45.5 mmol/L | 12.91 ± 4.83
Blood Sugar After Meal | Lowest: 4.4 mmol/L, Highest: 45.8 mmol/L | 18.51 ± 5.42
Urine Color Before Meal | Lowest: Orange 3.24%; Highest: Green 68.82%; Rest 27.94% |
Urine Color After Meal | Lowest: Cyan 0.29%; Highest: Green 65.00%; Rest 34.71% |
Drug History | Yes 99.41%, No 0.59% |
Weight Loss | Yes 78.24%, No 21.76% |
Thirst | Yes 92.35%, No 7.65% |
Hunger | Yes 78.24%, No 21.76% |
Relatives | Yes 79.41%, No 20.59% |
Physical Activity | Yes 97.64%, No 2.36% |
Smoking | Yes 6.18%, No 94.82% |
Tobacco Chewing | Yes 12.9%, No 87.1% |
Headache for High BP | Yes 81.18%, No 18.82% |
Burning Extremities | Yes 82.94%, No 17.06% |
Weakness | Yes 95.88%, No 4.12% |
Symptoms Duration | Lowest: 1 Day, Highest: 3650 Days | Mean: 264.25, SD: 292.437
Diabetes Mellitus | Yes 99.4%, No 0.59% |
Outcome | Typical 61.18%, Non-Typical 14.70%, Both 24.12% |
*SD = Standard Deviation
2) Logistic Regression (LR): LR is a well-known system for grouping individuals into two mutually exclusive and exhaustive categories, for instance buyer vs. non-buyer or responder vs. non-responder [17]. It predicts the probabilistic outcome of an event by fitting the data to the logit function [17].
LR computes the logit of L, the log of the odds of an individual belonging to class 1, which can easily be converted into the likelihood of that individual belonging to class 1 [17].
The equations of the Logit and the Probability are as follows:
Logit L = a0 + a1 × Y1 + a2 × Y2 + · · · + an × Yn ...eq.(1)
Prob (L = 1) = exp (Logit L) / (1 + exp (Logit L)) ...eq.(2)
An individual's predicted probability of belonging to class 1 is computed by “plugging in” the values of that individual's predictor variables into the two equations above. Here, the a's are the LR coefficients, which are determined by the calculus-based method of maximum likelihood; a0 is the one coefficient that has no predictor variable with which it is multiplied [17].
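The two equations can be checked numerically with a small sketch (hypothetical coefficients, not fitted to the study's data):

```python
from math import exp

def logit(a, Y):
    # eq.(1): a[0] is the intercept a0; a[1:] multiply the predictors Y1..Yn
    return a[0] + sum(ai * yi for ai, yi in zip(a[1:], Y))

def prob_class1(a, Y):
    # eq.(2): logistic transform of the logit into a class-1 probability
    L = logit(a, Y)
    return exp(L) / (1 + exp(L))

# A logit of 0 corresponds to even odds, i.e. a probability of 0.5.
print(prob_class1([0.0, 1.0], [0.0]))
```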
3) Random Forest (RF): RF is another ensemble technique that is used for classification; it is also capable of performing regression tasks [14]. Given a training set, Y, of y tuples, the procedure for generating k decision trees is as follows: for each iteration j (j = 1, 2, ..., k), a training set, Yj, of y tuples is sampled with replacement from Y. Let U be the number of attributes to be used to determine the split at each node, where U is much smaller than the number of available attributes. To construct a decision tree classifier, Nj, randomly select, at each node, U attributes as candidates for the split at that node [14].
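A brief sketch of this forest-growing procedure with scikit-learn (synthetic data; max_features plays the role of U):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=340, n_features=26, random_state=1)
# n_estimators = k trees, each grown on a bootstrap sample of Y;
# max_features = U candidate attributes examined at each split (sqrt(26) ≈ 5 << 26).
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)
print(len(forest.estimators_))
```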
IV. OUTCOMES
The results are analyzed based on 22 performance parameters, for example:
A. Seed:
It specifies the random number seed; changing it shuffles the data differently and yields a different result.
B. Correctly Classified Instances (CCI):
The accuracy of the model on the data used for testing
[13].
Accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn) ...eq.(3)
Here, Tp = True positive, Tn = True negative, Fp = False positive, Fn = False negative.
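Eq.(3) can be sketched directly from the four confusion-matrix counts (the counts below are illustrative, not the study's):

```python
def accuracy(tp, tn, fp, fn):
    # eq.(3): correct predictions over all predictions
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative counts only: 290 of 340 instances classified correctly.
print(round(accuracy(tp=150, tn=140, fp=20, fn=30), 4))  # 0.8529
```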
C. Kappa Statistics (KS):
The Kappa statistic is used to quantify the agreement between the predicted and observed classifications of a dataset [13].
K = (R0 − Re) / (1 − Re) ...eq.(4)
where R0 = the relative observed agreement among raters and Re = the hypothetical probability of chance agreement.
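A one-line sketch of eq.(4) (the agreement values are illustrative):

```python
def kappa(r0, re):
    # eq.(4): observed agreement r0, corrected for the chance agreement re
    return (r0 - re) / (1 - re)

# Illustrative values: 87% observed agreement, 45% expected by chance.
print(round(kappa(0.87, 0.45), 3))
```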
Fig. 2: ROC Curve of Typical Class for (a) Bagging at seed 3
(b) Logistic Regression at seed 3 (c) Random Forest at seed 1
D. KB Information Score:
For the correct class B of an instance, we can consider 3 different cases [18], as follows:
• If P'(B) > P(B), then the score is positive
• If P'(B) < P(B), then the score is negative
• If P'(B) = P(B), then there is no information (score 0)
where P(B) = the prior probability of class B and P'(B) = the posterior probability of B returned by the classifier.
E. Mean Absolute Error (MAE):
It averages the magnitudes of the individual errors without taking account of their sign [13].
MAE = (| p1 − b1 | + · · · + | pn − bn |) / n ...eq.(5)
Here, p stands for the predicted value and b for the actual value.
Fig. 3: ROC Curve of Non-Typical Class for (a) Bagging at
seed 3 (b) Logistic Regression at seed 3 (c) Random Forest
at seed 1
F. Relative Absolute Error (RAE):
It is the total absolute error, normalized in the same way [13].
RAE = (| p1 − b1 | + · · · + | pn − bn |) / (| b1 − b̄ | + · · · + | bn − b̄ |) ...eq.(6)
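Eqs.(5) and (6) can be sketched together (the prediction and actual vectors are illustrative):

```python
def mae(pred, actual):
    # eq.(5): mean of the absolute individual errors
    return sum(abs(p - b) for p, b in zip(pred, actual)) / len(actual)

def rae(pred, actual):
    # eq.(6): total absolute error, normalized by the error of always
    # predicting the mean of the actual values
    b_bar = sum(actual) / len(actual)
    num = sum(abs(p - b) for p, b in zip(pred, actual))
    den = sum(abs(b - b_bar) for b in actual)
    return num / den

pred, actual = [2.0, 4.0, 6.0], [3.0, 4.0, 8.0]
print(mae(pred, actual), rae(pred, actual))  # 1.0 0.5
```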
G. Specificity/ TN Rate:
It indicates the proportion of people without the disease who have a negative test result [13].
Specificity = Tn / (Fp + Tn) ...eq.(7)
H. Precision (PRE):
Researchers characterize it by [13]:
PRE = Tp / (Tp + Fp) ...eq.(8)
Fig. 4: ROC Curve of Both Class for (a) Bagging at seed 3
(b) Logistic Regression at seed 3 (c) Random Forest at seed 1
I. Recall (REC):
Researchers defined this parameter as follows [13],
REC = Tp / (Tp + Fn) ...eq.(9)
J. F-Measure:
If it is denoted by FM, then
FM = 2 × (PRE × REC) / (PRE + REC) ...eq.(10)
   = 2 × Tp / (2 × Tp + Fp + Fn) ...eq.(11)
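Eqs.(8)-(11) can be sketched together; the assertion checks that eq.(10) and eq.(11) agree, and the counts are illustrative only:

```python
def precision(tp, fp):
    return tp / (tp + fp)               # eq.(8)

def recall(tp, fn):
    return tp / (tp + fn)               # eq.(9)

def f_measure(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)  # eq.(11)

tp, fp, fn = 150, 20, 30
pre, rec = precision(tp, fp), recall(tp, fn)
# eq.(10), the harmonic mean of PRE and REC, equals eq.(11) algebraically:
assert abs(2 * pre * rec / (pre + rec) - f_measure(tp, fp, fn)) < 1e-12
print(round(f_measure(tp, fp, fn), 3))  # 0.857
```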
K. MCC:
The Matthews Correlation Coefficient, a correlation measure between the predicted and observed classifications, combining PRE and REC [13].
TABLE III: Comparison of Statistical Metrics for SEED 1
Evaluation Metrics | Bagging | Logistic Regression | Random Forest
Correctly Classified Instances | 86.77% | 80.88% | 90.29%
Incorrectly Classified Instances | 13.24% | 19.12% | 9.71%
Kappa statistic | 0.75 | 0.65 | 0.81
KB Information Score | 298.03 bits | 279.89 bits | 317.38 bits
Class complexity — order 0 | 454.75 bits | 454.75 bits | 454.75 bits
Class complexity — scheme | 186.61 bits | 8551.65 bits | 148.78 bits
Mean absolute error | 0.15 | 0.16 | 0.13
Root mean squared error | 0.26 | 0.33 | 0.23
Relative absolute error | 41.11% | 43.05% | 35.30%
Root relative squared error | 60.31% | 77.54% | 54.31%
Coverage of cases (0.95 level) | — | — | —
Mean rel. region size (0.95 level) | — | — | —
Specificity/TN Rate (Weighted Avg.) | 0.62 | 0.88 | 0.92
Precision (Weighted Avg.) | 0.87 | 0.81 | 0.91
Recall (Weighted Avg.) | 0.87 | 0.81 | 0.90
F-Measure (Weighted Avg.) | 0.87 | 0.81 | 0.90
MCC (Weighted Avg.) | 0.75 | 0.64 | 0.82
ROC Area (Weighted Avg.) | 0.94 | 0.85 | 0.97
PRC Area (Weighted Avg.) | 0.93 | 0.78 | 0.96
L. ROC Area:
It is the probability that a randomly chosen positive in-
stance in the test data is ranked above a randomly chosen
negative instance, based on the ranking produced by the
classifier [13].
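This ranking interpretation can be illustrated with scikit-learn's roc_auc_score on a tiny hypothetical example: with positive scores {0.35, 0.8} and negative scores {0.1, 0.4}, 3 of the 4 positive/negative pairs are ranked correctly, so the area is 0.75:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical classifier scores; instances 3 and 4 are the true positives.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_score))  # 0.75
```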
M. PRC Area:
It is an alternative summary statistic that is preferred by some practitioners, particularly in the information retrieval area [13].
N. Explanation of the Analysis:
The analysis has been carried out over 3 seeds for each algorithm and for 3 classes, named Typical, Non-Typical and Both. Fig. 2 shows only the best performance curves for the Typical class. For this class, BAG and LR gave their best performance in seed 3, with accuracies of 89.1176% and 83.2353% respectively. RF performed better in seed 1 than in seeds 2 or 3 for the same class, with an accuracy of 90.2941%, which is very impressive.
In Fig. 3, the scenario for the Non-Typical class is not much different from Fig. 2. BAG and LR again performed best in seed 3 and, similarly, RF in seed 1, but this time for the Non-Typical class. Moreover, in Fig. 4 the results are again similar to Figs. 2 and 3: BAG and LR again gave their best performance in seed 3 for the Both class.
Table III shows the performance parameters and results of our study, where the CCI are 86.7647%, 80.8824% and 90.2941% for BAG, LR and RF respectively, and the values of their KS are 0.753, 0.6512 and 0.8126. The values of Specificity are 0.62, 0.881 and 0.923, while the values of PRE are 0.866, 0.809 and 0.908. In addition, the ROC and PRC Areas are 0.944, 0.847, 0.972 and 0.930, 0.775, 0.963 respectively.
TABLE IV: Comparison of Statistical Metrics for SEED 2
Evaluation Metrics | Bagging | Logistic Regression | Random Forest
Correctly Classified Instances | 87.65% | 80.29% | 88.82%
Incorrectly Classified Instances | 12.35% | 19.71% | 11.18%
Kappa statistic | 0.76 | 0.64 | 0.78
KB Information Score | 320.91 bits | 263.70 bits | 313.21 bits
Class complexity — order 0 | 454.75 bits | 454.75 bits | 454.75 bits
Class complexity — scheme | 146.77 bits | 1550.27 bits | 155.03 bits
Mean absolute error | 0.12 | 0.17 | 0.13
Root mean squared error | 0.23 | 0.35 | 0.24
Relative absolute error | 34.05% | 47.06% | 36.25%
Root relative squared error | 54.37% | 82.03% | 55.72%
Coverage of cases (0.95 level) | 99.41% | 87.06% | 99.41%
Mean rel. region size (0.95 level) | 59.71% | 49.61% | 62.84%
Specificity/TN Rate (Weighted Avg.) | 0.87 | 0.81 | 0.90
Precision (Weighted Avg.) | 0.88 | 0.80 | 0.90
Recall (Weighted Avg.) | 0.88 | 0.80 | 0.89
F-Measure (Weighted Avg.) | 0.87 | 0.80 | 0.89
MCC (Weighted Avg.) | 0.76 | 0.63 | 0.79
ROC Area (Weighted Avg.) | 0.97 | 0.82 | 0.97
PRC Area (Weighted Avg.) | 0.97 | 0.74 | 0.96
TABLE V: Comparison of Statistical Metrics for SEED 3
Evaluation Metrics | Bagging | Logistic Regression | Random Forest
Correctly Classified Instances | 89.12% | 83.24% | 87.65%
Incorrectly Classified Instances | 10.88% | 16.77% | 12.35%
Kappa statistic | 0.79 | 0.69 | 0.76
KB Information Score | 305.73 bits | 279.89 bits | 320.91 bits
Class complexity — order 0 | 454.75 bits | 454.75 bits | 454.75 bits
Class complexity — scheme | 176.04 bits | 8551.65 bits | 146.77 bits
Mean absolute error | 0.14 | 0.15 | 0.12
Root mean squared error | 0.25 | 0.32 | 0.23
Relative absolute error | 39.42% | 41.77% | 34.05%
Root relative squared error | 58.07% | 75.61% | 54.37%
Coverage of cases (0.95 level) | 98.82% | 89.12% | 99.41%
Mean rel. region size (0.95 level) | 64.80% | 49.12% | 59.71%
Specificity/TN Rate (Weighted Avg.) | 0.92 | 0.87 | 0.90
Precision (Weighted Avg.) | 0.89 | 0.83 | 0.88
Recall (Weighted Avg.) | 0.89 | 0.83 | 0.88
F-Measure (Weighted Avg.) | 0.89 | 0.83 | 0.87
MCC (Weighted Avg.) | 0.79 | 0.69 | 0.76
ROC Area (Weighted Avg.) | 0.95 | 0.85 | 0.97
PRC Area (Weighted Avg.) | 0.94 | 0.79 | 0.96
Table IV presents the same variables for seed 2. The CCI for BAG, LR and RF are 87.6471%, 80.2941% and 88.8235% respectively, and the KS values are 0.7604, 0.6417 and 0.7832. The values of Specificity in this case for BAG, LR and RF are 0.868, 0.809 and 0.903 respectively, and 0.882, 0.804 and 0.896 for PRE. In addition, the REC and F-Measure for BAG are 0.876 and 0.872; for LR both are the same, 0.803; and for RF they are 0.888 and 0.885. The MCC for
TABLE VI: Comparison with Other Systems
Reference Number | Sample Size | No. of Features | Algorithms | Accuracy | Perspective of the paper
[5] | 768 | 8 | K-means with J48 Decision Tree (DT) | 90.04% | Classification
[7] | 4322 | 2 | Ensemble Boosting with Perceptron Algorithm (EPA) | 75% | Ensemble Learning
[8] | 768 | 9 | ANN 80.86%, KNN 77.24% | | Classification
[9] | 768 | 8 | LR 77.47%, Deep Neural Network (DNN) 77.86%, Support Vector Machine (SVM) 77.60%, DT 76.30%, Naive Bayes (NB) 75.78% | | Classification
[11] | 3075 | 14 | Bagging with DT, RF and LR | 89.17% | Classification
[12] | 768 | 9 | NB, SMO, MLP, REP Tree and J48 | 76.80% | Classification
Our Proposed System | 340 | 27 | Bagging 89.12%, Logistic Regression 83.24%, Random Forest 90.29% | | Classification
BAG is 0.762, for LR 0.633 and for RF it is 0.787.
The results of the performance measures for seed 3 are shown in Table V. Specificity for BAG is 0.919, and 0.873 and 0.904 for LR and RF. The PRE values are 0.893, 0.832 and 0.882, and the REC values 0.891, 0.832 and 0.876 for BAG, LR and RF respectively. The ROC and PRC Areas are 0.950, 0.851, 0.974 and 0.940, 0.791, 0.966 respectively for BAG, LR and RF.
BAG and LR provided their best accuracy in seed 3, that is, after the dataset had been shuffled three times. On the other hand, the RF algorithm impressively gave its best accuracy in its first seed. So, comparatively, RF is the best algorithm among them.
Table VI compares various previous systems with our proposed system. The models have been compared on sample size, number of features, algorithms and accuracy.
Among our proposed systems, the one with the Random Forest algorithm has given the highest accuracy, 90.29%, which is also the best accuracy compared with the algorithms used in the previous systems.
V. CONCLUSION
Despite having major limitations, the study finished successfully with the expected outcomes. The collection of real-time data was one of the main problems we faced at the initial stage, and after managing it, another obstacle was filling in the missing data, since there were several missing values in the dataset. But using MLT, we overcame these issues and performed the analysis to achieve our goal. Among the three algorithms, RF gives better performance than BAG and LR, and BAG performed better than LR. In future, we will conduct this study with more algorithms such as ANN, more specifically with the Neuro-Fuzzy Inference System, CNN (Convolutional Neural Network) and advanced Ensemble Learning algorithms. An expert system can be developed from our analysis to predict diabetes more efficiently and effectively.
REFERENCES
[1] R. Basu, “Type 1 Diabetes”, National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), 2017.
[2] S. Akter, M. Rahman, S. Krull Abe and P. Sultana, “Prevalence of diabetes and prediabetes and their risk factors among Bangladeshi adults: a nationwide survey”, Bulletin of the World Health Organization, vol. 92, no. 3, pp. 153-228, 2014. Available: https://www.who.int/bulletin/volumes/92/3/13-128371/en/. [Accessed 8 January 2019].
[3] ScienceDaily, “A better way to predict diabetes: Scientists develop highly accurate method to predict type 2 diabetes after delivery in women with gestational diabetes”, Science News, Toronto, 2016.
[4] I. Kononenko, “Machine learning for medical diagnosis: history, state of the art and perspective”, Artificial Intelligence in Medicine, vol. 23, no. 1, pp. 89-109, 2001. Available: 10.1016/s0933-3657(01)00077-x. [Accessed 25 January 2019].
[5] W. Chen, S. Chen, H. Zhang and T. Wu, “A Hybrid Prediction Model for Type 2 Diabetes Using K-means and Decision Tree”, in 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 2017.
[6] D. Shetty, K. Rit, S. Shaikh and N. Patil, “Diabetes disease prediction using data mining”, in 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), Coimbatore, India, 2017.
[7] R. Mirshahvalad and N. Zanjani, “Diabetes Prediction Using Ensemble Perceptron Algorithm”, in 2017 9th International Conference on Computational Intelligence and Communication Networks (CICN), Girne, Cyprus, 2017.
[8] I. Jasim, A. Duru, K. Shaker, B. Abed and H. Saleh, “Evaluation and measuring classifiers of diabetes diseases”, in 2017 International Conference on Engineering and Technology (ICET), Antalya, Turkey, 2017.
[9] S. Wei, X. Zhao and C. Miao, “A Comprehensive Exploration to the Machine Learning Techniques for Diabetes Identification”, in 2018 IEEE 4th World Forum on Internet of Things (WF-IoT), Singapore, Singapore, 2018.
[10] E. Pustozerov and P. Popova, “Mobile-based decision support system for gestational diabetes mellitus”, in 2018 Ural Symposium on Biomedical Engineering, Radioelectronics and Information Technology (USBEREIT), Yekaterinburg, Russia, 2018.
[11] S. Manna, S. Maity, S. Munshi and M. Adhikari, “Diabetes Prediction Model Using Cloud Analytics”, in 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Bangalore, India, 2018.
[12] D. Verma and N. Mishra, “Analysis and prediction of breast cancer and diabetes disease datasets using data mining classification techniques”, in 2017 International Conference on Intelligent Sustainable Systems (ICISS), Palladam, India, 2017.
[13] I. Witten, E. Frank and M. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed. Morgan Kaufmann, 2011, pp. 166-580.
[14] J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. Morgan Kaufmann, 2011, pp. 370-382.
[15] J. Brownlee, “A Gentle Introduction to k-fold Cross-Validation”, Machine Learning Mastery, 2018.
[16] W. Dean, Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners (Wiley and SAS Business Series). Wiley, 2014, pp. 124-125.
[17] B. Ratner, Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data, 2nd ed. CRC Press, 2011, pp. 97-98.
[18] I. Kononenko and I. Bratko, “Information-Based Evaluation Criterion for Classifier's Performance”, Machine Learning, vol. 6, no. 1, pp. 67-80, 1991. [Accessed 21 January 2019].