IMAGE CLASSIFICATION USING KNN, RANDOM FOREST AND SVM ALGORITHMS ON GLAUCOMA DATASETS, WITH THE ACCURACY, SENSITIVITY, AND SPECIFICITY OF EACH ALGORITHM
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
1. Progress REPORT ON
IMAGE CLASSIFICATION USING
DIFFERENT CLASSICAL APPROACHES
UNIVERSITY INSTITUTE OF TECHNOLOGY
THE UNIVERSITY OF BURDWAN
(Dept. Of Information Technology, 2016-2020)
SUPERVISOR: MR. ARINDAM CHOWDHURY
SUBMITTED BY:
(GROUP-03) - 7th Semester
PRASHANT CHOUDHARY (2016-3003)
VIKASH KUMAR (2016-3028)
RAKESH RANJAN (2016-3027)
SUMIT ABHISHEK (2016-3031)
2. Contents
1. Abstract
2. Introduction
3. Problem Statement and Data sets
4. Some terminologies
5. Software & Hardware Requirement
6. Different models used (Algorithms)
a. K-Nearest Neighbors
b. Random Forest Classification
c. Adaptive Boosting
d. Support Vector Machine
7. Implementation of our models on problem set
8. Comparison between various Algorithms
9. Future improvements and scopes
10. Conclusion
11. References
3. ABSTRACT
Image classification is a complex process that may be affected by many
factors. This paper examines current practices, problems, and prospects
of image classification. The emphasis is placed on the summarization of
major advanced classification approaches and the techniques used for
improving classification accuracy. In addition, some important issues
affecting classification performance are discussed. This literature review
suggests that designing a suitable image‐processing procedure is a
prerequisite for a successful classification of remotely sensed data into a
thematic map. Effective use of multiple features of remotely sensed data
and the selection of a suitable classification method are especially
significant for improving classification accuracy. Non‐parametric
classifiers such as neural network, decision tree classifier, and
knowledge‐based classification have increasingly become important
approaches for multisource data classification. Integration of remote
sensing, geographical information systems (GIS), and expert system
emerges as a new research frontier.
More research, however, is needed to identify and reduce uncertainties
in the image‐processing chain to improve classification accuracy.
4. INTRODUCTION
Image classification proceeds through pre-processing, segmentation,
feature extraction and classification. The database is a key part of a
classification system: it contains predefined sample patterns of the
object under consideration, which are compared with the test object to
assign it to the appropriate class. Image classification is an important
task in fields such as biometry, remote sensing, and biomedical imaging.
In a typical classification system, an image is captured by a camera and
then processed. In supervised classification, a classifier is first
trained on a known group of pixels; the trained classifier is then used
to classify other images. Unsupervised classification uses the
properties of the pixels to group them; these groups are known as
clusters, and the process is called clustering. The number of clusters is
decided by the user. Unsupervised classification is used when labelled
training pixels are not available. Examples of classification methods
are Decision Tree, Artificial Neural Network (ANN) and Support Vector
Machines.
5. PROBLEM STATEMENTS AND DATA SETS
Problem statement: To study a retina image dataset and to build a
classifier that predicts whether a person is suffering from glaucoma or not.
The problem statement for a document classifier has two aspects: the
document space and the set of document classes. The former defines the range
of input documents and the latter defines the outputs that the classifier can
produce.
In our project, the document space is a database consisting of several
numerical data sets derived from retinal images.
Data Sets: We took 255 retinal images and performed our classification
on the data derived from them. We used 70% of the data set for training
our model and kept the remaining 30% for testing it.
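The 70/30 split described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `train_test_split`, the fixed seed and the use of integer ids in place of real feature rows are all assumptions made for the example.

```python
import random

def train_test_split(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the samples and cut it into train and test parts."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# 255 placeholder ids standing in for the 255 retinal images (hypothetical data).
train, test = train_test_split(range(255))
print(len(train), len(test))
```

Shuffling before cutting avoids a split that simply follows the order in which the images were collected.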
The features are extracted from the fundus images using image-processing
techniques: kurtosis, k-stat, mean, median and standard deviation. The
resulting numerical features are stored in a dataset.
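As a rough sketch of how such statistical features could be computed from the pixel intensities of one image, consider the example below. It assumes the image has already been flattened into a list of numbers, uses the population standard deviation and the non-excess (Pearson) definition of kurtosis, and leaves out k-stat; which exact definitions the project used is not stated, so these choices are assumptions.

```python
import statistics

def extract_features(pixels):
    """Return simple statistical features of a flat list of pixel intensities."""
    mean = statistics.fmean(pixels)
    median = statistics.median(pixels)
    std = statistics.pstdev(pixels)  # population standard deviation (assumed)
    # Kurtosis as the fourth central moment divided by the squared variance.
    fourth_moment = sum((p - mean) ** 4 for p in pixels) / len(pixels)
    kurtosis = fourth_moment / std ** 4 if std > 0 else 0.0
    return {"mean": mean, "median": median, "std": std, "kurtosis": kurtosis}

# Hypothetical intensities; a real fundus image would supply many more values.
features = extract_features([1, 2, 3, 4, 5])
print(features)
```

Each image then contributes one row of such numbers to the dataset used by the classifiers.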
6. Some Terminologies
Confusion Matrix:
A confusion matrix is a summary of prediction results on a classification problem.
The numbers of correct and incorrect predictions are summarized as counts
and broken down by each class. This is the key to the confusion matrix.
The confusion matrix shows the ways in which your classification model is
confused when it makes predictions.
It gives us insight not only into the errors being made by a classifier but more
importantly the types of errors that are being made.
Definition of the Terms:
• Positive (P) : Observation is positive (for example: is an apple).
• Negative (N) : Observation is not positive (for example: is not an apple).
• True Positive (TP) : Observation is positive, and is predicted to be positive.
• False Negative (FN) : Observation is positive, but is predicted negative.
• True Negative (TN) : Observation is negative, and is predicted to be negative.
• False Positive (FP) : Observation is negative, but is predicted positive.
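The four counts defined above are what accuracy, sensitivity and specificity are computed from. The sketch below uses the standard definitions; the function names and the toy label lists are illustrative assumptions, with 1 standing for the positive class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, TN and FP for a binary classification problem."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp

def metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical labels: 1 = positive, 0 = negative.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
print(counts, metrics(*counts))
```

Sensitivity measures how many of the truly positive cases are found, while specificity measures how many of the truly negative cases are correctly left alone.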
7. SOFTWARE AND HARDWARE REQUIREMENTS
• SOFTWARE
1. Jupyter Notebook (Anaconda): Anaconda is a free and open-source
distribution of the Python and R programming languages
for scientific computing (data science, machine
learning applications, large-scale data processing, predictive
analytics, etc.) that aims to simplify package management and
deployment. Package versions are managed by the package
management system conda. The Anaconda distribution includes
data-science packages suitable for Windows, Linux, and macOS.
Packages installed for the implementation:
a) NumPy Library
b) Pandas Library
c) Matplotlib
2. Browser
• HARDWARE
1. Windows 7/8/10
2. RAM 2GB
3. Minimum Storage 20GB
8. DIFFERENT MODELS USED (Algorithms)
We have used four algorithms:
➢ K-Nearest Neighbors
➢ Random Forest Classification
➢ Adaptive Boosting
➢ Support Vector Machine
K-NEAREST NEIGHBORS
K-NN is a supervised learning classifier. In supervised learning the targets are
known to us, but the mapping from input to target is not. Nearest-neighbour
classification is a good first example for understanding machine learning.
Suppose there are several clusters of labelled samples, and the items within each
identified cluster or group are homogeneous. An unlabelled item now has to be
assigned to one of the labelled groups. K-nearest neighbours is a simple algorithm
for this: it keeps a record of all labelled samples and assigns the new item to the
class that receives the largest number of votes among its k nearest neighbours.
Choosing the number of nearest neighbours, in other words the value of k, plays an
important role in the performance of the model: the accuracy and efficiency of the
k-NN algorithm depend largely on the chosen k. A larger value of k has the
advantage of reducing the variance caused by noisy data.
9. Advantage: KNN makes no assumptions about the data under consideration. It is
very popular because of its simplicity, ease of implementation and effectiveness.
Disadvantage: k-NN does not build a model, so there is no abstraction step.
Prediction is slow, because each new item has to be compared against the stored
samples. Preparing the data for a robust system also takes considerable time.
ALGORITHM FOR KNN:
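A minimal sketch of majority-vote k-NN as described in the text is given here. It is an illustration under assumptions, not the procedure from the original slides: Euclidean distance is assumed as the distance measure, and the function name, sample points and labels are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # Pair every training point's distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature samples forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = ["healthy", "healthy", "healthy", "glaucoma", "glaucoma", "glaucoma"]
print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

Because every prediction scans all stored samples, the cost of classifying one item grows with the size of the training set.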
12. RANDOM FOREST ALGORITHM
Random Forest is a method that operates by constructing multiple decision trees
during the training phase. The random forest takes the decision of the majority of
the trees as its final decision.
A random forest grows many classification trees. To classify a new object from an
input vector, put the input vector down each of the trees in the forest. Each tree
gives a classification, and we say the tree "votes" for that class. The forest chooses
the classification having the most votes (over all the trees in the forest).
Each tree is grown as follows:
1. If the number of cases in the training set is N, sample N cases at random -
but with replacement, from the original data. This sample will be the training
set for growing the tree.
2. If there are M input variables, a number m<<M is specified such that at each
node, m variables are selected at random out of the M and the best split on
these m is used to split the node. The value of m is held constant during the
forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.
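The two sources of randomness in the steps above, together with the final vote, can be sketched as follows. Tree growing itself is left out; the function names and the toy values for N, M and m are assumptions chosen for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(cases, rng):
    """Draw N cases at random, with replacement, from the N original cases."""
    return [rng.choice(cases) for _ in cases]

def random_feature_subset(num_features, m, rng):
    """Pick m of the M input variables at random, without repeats."""
    return rng.sample(range(num_features), m)

def forest_vote(tree_predictions):
    """Return the class that received the most votes over all trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
cases = list(range(10))                         # N = 10 hypothetical case ids
sample = bootstrap_sample(cases, rng)           # training set for one tree
features = random_feature_subset(5, 2, rng)     # M = 5 variables, m = 2 per node
print(sample, features)
print(forest_vote(["glaucoma", "healthy", "glaucoma"]))
```

Because sampling is done with replacement, some cases appear several times in a tree's training set and others not at all.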
13. Algorithm for Construction of a Random Forest
Step 1: Let the number of training cases be “n” and let the number of
variables included in the classifier be “m”.
Step 2: Let the number of input variables used to make the decision at a
node of a tree be “p”. We assume that p is always less than “m”.
Step 3: Choose a training set for the decision tree by drawing n times
with replacement from all “n” available training cases, that is, by taking
a bootstrap sample. The remaining cases are used to estimate the error
of the tree. More generally, bootstrapping is a resampling technique for
estimating properties of a given data set, such as the deviation of an
estimate from the mean, and it is often used for hypothesis tests. A
simple block bootstrap can be used when the data can be divided into
non-overlapping blocks. A moving block bootstrap is used when the
data are divided into overlapping blocks, where the overlap “k” between
the first and second block is always equal to the overlap between the
second and third block, and so on.
Step 4: For each node of the tree, randomly choose p variables on which
to search for the best split. New data are predicted by taking the
majority vote over the trees. To estimate the error, predict the cases
that are not in the bootstrap sample and aggregate the results.
Step 5: Calculate the best split based on these chosen variables in the
training set. Base the decision at that node on the best split.
Step 6: Each tree is fully grown and not pruned. Pruning would cut back
branches to simplify the tree; here the tree is retained completely.
Step 7: The best split is the one with the least error, i.e. the least
deviation from the observed data set.
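Choosing the best split at a node from a random subset of variables can be sketched as below. The error measure here is the number of misclassified cases when each side of the split predicts its majority class; this is an assumption made for simplicity, and practical implementations often use other impurity measures. All names and data are hypothetical.

```python
import random
from collections import Counter

def split_error(rows, labels, feature, threshold):
    """Misclassified count if each side of the split predicts its majority class."""
    errors = 0
    for side in (True, False):
        side_labels = [
            y for x, y in zip(rows, labels) if (x[feature] <= threshold) == side
        ]
        if side_labels:
            majority_count = Counter(side_labels).most_common(1)[0][1]
            errors += len(side_labels) - majority_count
    return errors

def best_split(rows, labels, candidate_features):
    """Return (error, feature, threshold) with the least error among candidates."""
    best = None
    for feature in candidate_features:
        for threshold in sorted({x[feature] for x in rows}):
            err = split_error(rows, labels, feature, threshold)
            if best is None or err < best[0]:
                best = (err, feature, threshold)
    return best

rows = [(1.0, 7.0), (2.0, 3.0), (8.0, 6.0), (9.0, 2.0)]  # toy cases, two variables
labels = [0, 0, 1, 1]
rng = random.Random(0)
candidates = rng.sample(range(2), 1)  # choose p = 1 of the 2 variables at random
print(candidates, best_split(rows, labels, candidates))
```

Restricting each node to a random subset of variables is what makes the individual trees differ from one another.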
14. Advantages:
1. It provides accurate predictions for many types of applications
2. It can measure the importance of each feature with respect to the
training data set.
3. Pairwise proximity between samples can be measured by the
training data set.
Disadvantages:
1. For data including categorical variables with different number of
levels, random forests are biased in favor of those attributes
with more levels.
2. If the data contain groups of correlated features of similar
relevance for the output, then smaller groups are favored over
larger groups
Applications:
1. It is used for image classification for pixel analysis.
2. It is used in the field of bioinformatics for complex data analysis.
3. It is used for video segmentation (high-dimensional data).
16. ADABOOST ALGORITHM
AdaBoost is short for Adaptive Boosting. It was the first really successful
boosting algorithm developed for binary classification, and it is the best starting
point for understanding boosting. Modern boosting methods build on AdaBoost,
most notably stochastic gradient boosting machines.
AdaBoost is generally used with short decision trees. After the first tree is
created, its performance on each training instance is used to weight how much
attention the next tree should pay to each training instance. Training data that is
hard to predict is given more weight, whereas easy-to-predict instances are given
less weight.
Learn AdaBoost Model from Data
AdaBoost is best used to boost the performance of decision trees on binary
classification problems.
Each instance in the training dataset is weighted. The initial weight is set to:
weight(xi) = 1/n
where xi is the i-th training instance and n is the number of training instances.
How To Train One Model?
A weak classifier is prepared on the training data using the weighted samples. Only
binary classification problems are supported, so each decision stump makes one
decision on one input variable and outputs a +1.0 or -1.0 value for the first or
second class value.
The misclassification rate is calculated for the trained model. Traditionally, this is
calculated as:
error = (N - correct) / N
where error is the misclassification rate, correct is the number of training
instances predicted correctly by the model, and N is the total number of training
instances.
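The bookkeeping for one boosting round can be sketched as follows: uniform initial weights, the misclassification rate as the fraction of instances predicted incorrectly, and a reweighting step that raises the weight of misclassified instances. The weighted error, the stage value formula ln((1 - error) / error) and the reweighting rule follow a common AdaBoost formulation and are assumptions here, since the text does not spell them out; all function names are illustrative.

```python
import math

def initial_weights(n):
    """Every training instance starts with weight 1/n."""
    return [1.0 / n] * n

def misclassification_rate(actual, predicted):
    """Fraction of training instances the weak learner got wrong."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return (len(actual) - correct) / len(actual)

def weighted_error(weights, actual, predicted):
    """Share of the total weight that sits on misclassified instances."""
    wrong = sum(w for w, a, p in zip(weights, actual, predicted) if a != p)
    return wrong / sum(weights)

def stage_value(error):
    """Assumed stage formula: larger for weak learners with lower error."""
    return math.log((1.0 - error) / error)

def update_weights(weights, actual, predicted, stage):
    """Scale up misclassified instances, then renormalise to sum to 1."""
    scaled = [
        w * math.exp(stage) if a != p else w
        for w, a, p in zip(weights, actual, predicted)
    ]
    total = sum(scaled)
    return [w / total for w in scaled]

actual = [1, 1, -1, -1]        # hypothetical labels
predicted = [1, -1, -1, -1]    # one mistake made by a hypothetical stump
weights = initial_weights(len(actual))
error = weighted_error(weights, actual, predicted)
stage = stage_value(error)
new_weights = update_weights(weights, actual, predicted, stage)
print(misclassification_rate(actual, predicted), stage, new_weights)
```

After the update, the single misclassified instance carries more weight than each of the others, so the next weak learner is pushed to get it right.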
17. AdaBoost Ensemble
• Weak models are added sequentially, trained using the weighted
training data.
• The process continues until a pre-set number of weak learners
have been created.
• Once completed, you are left with a pool of weak learners, each with a stage
value.
Making Predictions with AdaBoost
Predictions are made by calculating the weighted average of the weak classifiers.
For a new input instance, each weak learner calculates a predicted value as either
+1.0 or -1.0. The predicted values are weighted by each weak learner's stage value.
The prediction for the ensemble model is taken as the sum of the weighted
predictions. If the sum is positive, the first class is predicted; if negative, the
second class is predicted.
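The weighted vote described above can be sketched as follows. The stumps, thresholds and stage values are hypothetical and hand-picked rather than learned, and a sum of exactly zero is arbitrarily assigned to the second class because the text leaves that case open.

```python
def decision_stump(feature, threshold):
    """Build a weak learner that outputs +1.0 or -1.0 from one input variable."""
    def predict(x):
        return 1.0 if x[feature] <= threshold else -1.0
    return predict

def adaboost_predict(weak_learners, stages, x):
    """Sum the stage-weighted votes and map the sign of the sum to a class."""
    total = sum(stage * learner(x) for learner, stage in zip(weak_learners, stages))
    # A zero sum is not covered by the text; it falls to the second class here.
    return "first class" if total > 0 else "second class"

# Three hypothetical stumps with hand-picked stage values.
learners = [decision_stump(0, 5.0), decision_stump(1, 2.0), decision_stump(0, 1.0)]
stages = [0.9, 0.4, 0.3]
print(adaboost_predict(learners, stages, (3.0, 4.0)))
print(adaboost_predict(learners, stages, (6.0, 1.0)))
```

In the first call one stump with a large stage value outvotes two stumps with smaller ones, which is the point of weighting the votes.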
Data Preparation for AdaBoost
This section lists some heuristics for best preparing your data for AdaBoost.
• Quality Data: Because the ensemble method attempts to correct
misclassifications in the training data, you need to be careful that the training
data is of high quality.
• Outliers: Outliers will force the ensemble down the rabbit hole of working hard
to correct for cases that are unrealistic. These could be removed from the
training dataset.
• Noisy Data: Noisy data, specifically noise in the output variable, can be
problematic. If possible, attempt to isolate and clean these from your training
dataset.
18. AdaBoost algorithm advantages:
It makes very good use of weak classifiers by cascading them;
Different classification algorithms can be used as weak classifiers;
AdaBoost has a high degree of precision;
Relative to the bagging algorithm and the Random Forest algorithm, AdaBoost fully
considers the weight of each classifier.
AdaBoost algorithm disadvantages:
The number of AdaBoost iterations, i.e. the number of weak classifiers, is hard to
set well; it can be determined using cross-validation;
Data imbalance leads to a decrease in classification accuracy;
Training is time-consuming, because the best split point has to be re-selected each
time a new weak classifier is trained.
20. SUPPORT VECTOR MACHINE
The support vector machine (SVM) belongs to the category of supervised learning.
It can be used for both regression and classification, but it is best known for
classification, where it is a very efficient classifier. Every object or item is
represented by a point in n-dimensional space, and the value of each feature is
given by a particular coordinate. The items are then divided into classes by
finding a separating hyperplane, as shown in the figure.
The diagram shows the support vectors, which are the coordinates of individual
items. The SVM algorithm is a good choice for separating the two classes.
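Once a hyperplane is known, classifying a point reduces to checking which side of it the point falls on. The sketch below shows only that decision step for a linear SVM; it does not train anything, and the weight vector, bias and sample points are hypothetical values chosen for illustration.

```python
def svm_decision(weights, bias, x):
    """Signed score of x relative to the hyperplane w . x + b = 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def svm_classify(weights, bias, x):
    """Assign class +1 or -1 depending on the side of the hyperplane."""
    return 1 if svm_decision(weights, bias, x) >= 0 else -1

# Hypothetical hyperplane x1 - x2 = 0 in a two-feature space.
w, b = (1.0, -1.0), 0.0
print(svm_classify(w, b, (3.0, 1.0)))
print(svm_classify(w, b, (1.0, 3.0)))
```

Training an SVM amounts to choosing the weights and bias so that the hyperplane separates the classes with the widest possible margin.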
SVM Advantages
SVMs work well even when we have little prior knowledge about the data.
They work well even with unstructured and semi-structured data such as text,
images and trees.
The kernel trick is the real strength of SVM. With an appropriate kernel function,
we can solve many complex problems.
Unlike neural networks, SVM training does not get stuck in local optima.
21. It scales relatively well to high-dimensional data.
SVM models generalize well in practice; the risk of over-fitting is lower in
SVM.
SVM is often compared with ANN. Compared to ANN models, SVMs often
give better results.
SVM Disadvantages
Choosing a “good” kernel function is not easy.
Training time is long for large datasets.
The final model, its variable weights and the impact of individual variables are
difficult to understand and interpret.
Since the final model is not easy to inspect, we cannot make small calibrations to
it, so it is tough to incorporate our business logic.
The SVM hyper-parameters are the cost C and gamma. They are not easy to
fine-tune, and it is hard to visualize their impact.
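To give a feel for what gamma controls, the sketch below evaluates the radial basis function (RBF) kernel, assuming that is the kernel in use; the text does not say which kernel the project chose. The cost parameter C belongs to the training objective and is not shown. The points and gamma values are arbitrary.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

a, b = (0.0, 0.0), (1.0, 1.0)
# The same pair of points looks less similar as gamma grows.
for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(a, b, gamma))
```

A large gamma makes each training point influence only its immediate neighbourhood, while a small gamma spreads that influence widely, which is one reason the parameter is delicate to tune.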
SVM Application
• Protein Structure Prediction
• Intrusion Detection
• Handwriting Recognition
• Detecting Steganography in digital images
• Breast Cancer Diagnosis
• Almost all the applications where ANN is used
24. FURTHER IMPROVEMENTS AND FUTURE SCOPES
On our glaucoma dataset we achieved an accuracy of 82% in detecting the disease;
in future we aim to increase this accuracy further.
We will use algorithms such as Convolutional Neural Networks to increase the
accuracy rate.
Currently we use a numerical data set as the input for classification; in future we
will take the image data set directly as input.
Advances in image processing and classification will be helpful in diagnosing
medical conditions correctly.
They will also be helpful in recognizing people, performing surgery, reprogramming
defects in human DNA, etc.
25. CONCLUSION
This paper gives beginners in this field a brief idea of classifiers.
It helps researchers select the appropriate classifier for their problem.
It explains the KNN, SVM, Random Forest and AdaBoost algorithms,
which are very popular classifiers in the field of image processing. Classifiers
are mainly divided into supervised and unsupervised classifiers. In short, this
paper provides the theoretical background of the above-mentioned classifiers.
We applied four algorithms to our glaucoma dataset and found that the random
forest algorithm has the highest accuracy, 82%, in detecting glaucoma.
We found that the KNN algorithm has the highest specificity.
All these algorithms can be used for better medical diagnosis of diseases such as
cancer and eye disease.
They can also be used for biometric purposes such as identity, face and fingerprint
documentation.