An Efficient System for Vision-Based Recognition of Indian Sign Language Using Soft Computing Techniques
Ph.D. Thesis Presentation
• Introduction
ØTypes of SL
ØMotivation
ØTypes of gestures in SL
ØSign Language Recognition System (SLRS)
→Image Acquisition Technique
→Image Pre-processing
→Image Segmentation
→Feature Extraction
→Classification
ØChallenges
ØObjectives of work
• Literature Review
ØLiterature Review on different SLR processing methods
• Comparative Analysis of Feature Detection and Extraction methods for Vision-based ISLRS (Objective-1)
ØTaxonomy of Feature Extraction Techniques
ØFeature Extraction
ØSIFT
ØSURF
ØFAST
ØBRIEF
ØORB
ØExperimental Results
ØSummary
• A Hybrid approach for Feature Extraction for Vision-Based Indian Sign Language
Recognition (Objective-2)
ØProblems in existing system
ØProposed Solution
ØBrief overview of FiST_CNN
ØBasic Terminology
ØDataset used
ØExperimental Results
ØSummary
• Hand Anatomy and Neural Network-Based Recognition for Indian Sign Language
(Objective-3)
ØHand Geometry
ØBrief overview of FiST_HGNN
ØBasic Terminology
ØDataset used
ØExperimental Results
ØSummary
• Applying FiST_HGNN for Recognition of ISL Words used in Daily Life
(Objective-4)
ØDataset Creation
ØExperimental Results
ØCode Snippets
ØSummary
• Conclusion
• Future Scope
• List of Publications
• References
• Humans communicate to express their emotions, share ideas, collaborate, support one another, serve the community, and advance society.
• Communication is carried out in spoken form through speech and in non-verbal form through gestures.
• Non-verbal communication is based on hand motions, body parts, and facial expressions.
• The World Federation of the Deaf (WFD), a non-governmental organization, states that there are around 70 million deaf-mute people across the world [1].
• Within each country or region where deaf-mute communities exist, sign languages develop independently of the region's spoken language.
• Each sign language has its own grammar and rules, with the shared property that all are visually perceived. As a result, there is a communication gap between the deaf-mute community and others.
• However, advances in science, technology, and computer vision have evolved into tools that help the deaf-mute community interact with the broader community.
• The first sign language recognized by researchers was American Sign Language (ASL). British Sign Language (BSL), developed in the United Kingdom, is the most widely used. Currently, around 7000 sign languages are being used worldwide [1].
• In India, with its range of different sign languages, various sign language dictionaries have been published, such as the Delhi [2], Mumbai [3][4], Calcutta [5], and Bangalore [6] sign language dictionaries.
• A decade ago, [7] produced a dictionary of 1830 sign words found across 14 different states in India.
Figure 1.1. Different SLs used around the world
• According to the 2011 Census, India's population of deaf people was around 50 lakhs.
• To provide a suitable medium for ISL training, the Indian Ministry of Social Justice and
Empowerment assisted the Indira Gandhi National Open University (IGNOU) in establishing
the Indian Sign Language Research and Training Centre (ISLRTC) in 2011.
• ISLRTC conducts research and training at various nodal centres around the country. The ISL dictionary (http://indiansignlanguage.org/dictionary/, 2018) has approximately 2500 signs from 12 states and 42 cities.
• On February 27, 2019, a new dictionary containing 6000 words in medical, academic, legal,
technological, and daily phrases was released.
Figure 1.2. India’s disability distribution (bar chart of disabled persons by type of disability — seeing, hearing, speech, movement — split by total, female, and male, on a scale up to 6,000,000)
• SL is the only mode of communication for deaf and mute people. It provides a medium for sharing thoughts, emotions, and feelings with hard-of-hearing people. Some people have been deaf-mute since birth, while others became deaf-mute later in life.
• Learning is very hard for such people, as they must learn everything through vision alone. SL instructors resolve this problem by becoming the medium of communication between ordinary people and the deaf-mute community. However, not everyone can afford these instructors.
• There is a lack of schools and learning centres for such people in India, and they still receive too little attention. There are only 478 government schools and 372 private schools throughout India, and most schools use the oral mode of communication.
• Further, deaf-mute people face communication problems in public places such as banks, hospitals, homes, public transport, and schools. A vision-based system should therefore be effective and reliable in the real world, as well as inexpensive, affordable, and easy to use.
• This work emphasizes using soft computing techniques to reduce the conversation rift between the deaf-mute community and other people.
• The hand gesture recognition system primarily depends on hand posture.
• Sign language gestures can be broadly divided into two categories: manual and non-manual gestures.
• Based on motion, gestures are further classified as static and dynamic.
• Manual gestures are much simpler to recognize than non-manual gestures.
• Manual gestures use only the hands, while non-manual gestures include mouth morphemes, eye gazes, facial expressions, body shifting, and head tilting.
Figure 1.3. Classification of gestures in SL
• The signer's body is not involved in these gestures, and the hand remains still.
• This category excludes alphabets like 'J' and 'Z', which require motion; the remaining 24 alphabets can be rendered in static form.
• One-hand ISL alphabets and numbers are shown in Figures 1.4 and 1.5, respectively.
Figure 1.4. ISL alphabets using one-hand gestures
Figure 1.5. ISL numbers using one-hand gestures (adopted from google.com)
Figure 1.6. ASL alphabet gestures
• Both hands are used to symbolize two-handed gestures.
• Figure 1.7 illustrates the two-hand ISL alphabet gestures.
• Figure 1.8 shows examples of ISL words.
Figure 1.8. Sample images for static two-hand ISL gesture
Figure 1.7. Sample images for static two-hand ISL alphabets
• These gestures contain motion. Dynamic gestures can be further depicted using one hand or two hands.
a) One-hand gestures: Only one hand is used to depict the gesture; however, the hand is in motion while communicating. Among the alphabets, ‘J’ and ‘Z’ are the gestures that require motion, as shown in Figure 1.9.
Figure 1.9. Sample images for alphabet “J” and “Z.”
b) Two-hand gestures: Both hands are required to depict these gestures, as shown in Figure 1.10. Based on hand movement, two-hand gestures are categorized into Type-0 and Type-1. In Type-0, one hand is considered the principal hand and the other the non-principal hand, while in Type-1 both hands are in continuous motion.
Figure 1.10. Sample images for two hand dynamic gestures
Non-manual gestures mainly resemble gestures having movement. A non-manual gesture consists of mouth gestures, body posture, and facial expressions, as shown in Figure 1.11.
Figure 1.11. Sample images for non-manual gestures
• There are two types of SLRS: device-based and vision-based.
• In a device-based system, a worn device is used to acquire and predict the gesture.
• A vision-based system uses a webcam to acquire and predict the gesture.
• In sign language recognition, a vision-based system requires no trainer and is also versatile.
• The vision-based system offers much more straightforward and intuitive communication between the deaf-mute person and the computer.
• With advances in image and video processing techniques, hand gesture applications such as virtual reality, medical applications, video game consoles, touch screens, and sign language recognition systems have also progressed.
• The device-based method is less flexible, as the user must always wear a glove. In contrast, the vision-based method allows users to interact remotely [9].
• The basic steps presented in the next section focus on the design of a vision-based sign language recognition system. The processing steps needed to develop a predictive model for SLRS are discussed further.
Figure 1.12. Steps in SLRS recognition
• These gadgets have recently become popular for SLR [2, 3, 4]. Surface electromyogram devices use sensors on the skin surface to measure, non-invasively, the signals created in the muscles.
• These systems consist of sensor-based devices, such as gyroscopes and accelerometers, that can measure the hand's motion irrespective of rotation [5]. The sensors are placed in hand gloves, as shown in Figure 1.13.
• In these systems, gestures are acquired through signals using data gloves [6]; two cyber-gloves, one on each hand [7], are also used. Each sensor modality has its benefits and drawbacks.
• However, wearing a glove for the whole duration is difficult and obstructs natural signing. Not only are these devices costly, but they also need a controlled environment to operate.
Figure 1.13. Data acquisition devices
• These systems use a camera in front of the signer to track hand motion [10]. In [8], the authors classified individual signs and signed words from video streams with up to 97.3% accuracy using hand shape, velocity, and location as components.
• Hand geometry parameters are also used to evaluate hand features for 10 ISL gestures [12]. Meanwhile, deep learning techniques such as CNN, VGG16, and MobileNets have provided a boon for SLR systems.
• To improve recognition through a vision-based system, state-of-the-art techniques have been hybridized with deep learning techniques, as shown in Figure 1.14.
• These systems are inexpensive to implement, and much progress has recently been made on vision-based recognition systems.
Figure 1.14. Recognition process through the vision-based system
• In ISLR, the dataset is pre-processed as a cleaning step. This stage prepares the dataset for training, making it easier to evaluate and process computationally.
• Pre-processing is a technique for reducing the algorithm's complexity and increasing its correctness. It may include tasks such as image resizing, geometric and colour transformations, colour-to-grayscale conversion, and many others.
• The grayscale images are then converted to black and white, as shown in Figure 1.15. Pixels with value 1 are black, i.e. the object, while 0 represents the white background.
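This conversion can be sketched in a few lines of NumPy (an illustrative sketch, not the thesis implementation; the BT.601 grayscale weights and the threshold of 128 are assumed choices):

```python
import numpy as np

def rgb_to_gray(img):
    """Luminosity grayscale conversion (ITU-R BT.601 weights)."""
    return img[..., 0] * 0.299 + img[..., 1] * 0.587 + img[..., 2] * 0.114

def gray_to_binary(gray, thresh=128):
    """Dark pixels (the hand/object) become 1, bright background becomes 0,
    matching the convention described above. The threshold is an assumption."""
    return (gray < thresh).astype(np.uint8)

# toy 2x2 "image": one dark (object) pixel, three bright (background) ones
rgb = np.array([[[10, 10, 10], [250, 250, 250]],
                [[240, 240, 240], [255, 255, 255]]], dtype=np.uint8)
binary = gray_to_binary(rgb_to_gray(rgb))
```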
Figure 1.15. Hand gesture conversion from RGB to Grey
Scaling, rotations, and other affine transformations are all part of data augmentation. The main data augmentation techniques are:
i. Flipping: Typically done to increase the dataset size and expose the neural network to various image variants, so the model can recognize the object in any shape or form.
ii. Colour space: Colour channels are used in this technique. The R, G, and B colours are isolated into single matrices to transform the colours.
iii. Cropping: A centre patch of the same height and breadth is cropped for all images in the collection. Random cropping is also employed to create a translation-like effect.
iv. Rotation: The rotation-degree parameter determines the precision of this augmentation. In SLR, a picture rotated at certain angles may overlap with other gestures.
v. Translation: To avoid positional bias in the data, translation shifts the images left, right, up, or down, with zero-padding used to keep the image's spatial dimensions intact.
vi. Noise injection: Injecting a matrix of random values drawn from a Gaussian distribution; noise is added to the images to make the model more robust.
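The augmentations above can be sketched with plain NumPy (illustrative only; the shift amounts and noise level are arbitrary choices, not taken from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

def flip_horizontal(img):
    """Mirror the image left-right (augmentation i)."""
    return img[:, ::-1]

def translate(img, dx, dy):
    """Shift right by dx and down by dy with zero-padding, so the spatial
    dimensions stay intact (augmentation v). dx, dy >= 0 for simplicity."""
    out = np.zeros_like(img)
    h, w = img.shape[:2]
    out[dy:, dx:] = img[:h - dy, :w - dx]
    return out

def add_gaussian_noise(img, sigma=10.0):
    """Inject Gaussian noise, clipping back to the 8-bit range
    (augmentation vi)."""
    noisy = img.astype(float) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

g = np.arange(16, dtype=np.uint8).reshape(4, 4)
```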
Figure 1.16. Translation applied on an ISL word "Above" gesture.
Table 1.3. A comparison of segmentation techniques based on the study

| Segmentation technique | Pros | Cons |
| --- | --- | --- |
| Thresholding-based segmentation | Simple and uncomplicated; requires no prior knowledge to function; as a result, it has a cheap computational cost. | Relies heavily on peaks, with minimal regard for spatial subtleties; noise-sensitive; choosing the best threshold value is challenging. |
| Edge-based segmentation | Appropriate for images with higher object contrast. | Not suitable for images with a lot of noise or edges. |
| Region-based segmentation | Less prone to noise; more valuable when creating similarity criteria is straightforward. | Quite expensive in terms of processing time and memory usage. |
| Clustering-based segmentation | More beneficial for real-world problems because of the fuzzy partial membership used. | Determining the membership functions is not simple. |
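Choosing the threshold automatically is the usual answer to the "hard to choose" drawback of thresholding-based segmentation. A NumPy sketch of Otsu's method (illustrative, not the thesis implementation) picks the threshold that maximizes between-class variance:

```python
import numpy as np

def otsu_threshold(gray):
    """Exhaustively pick the threshold that maximizes the between-class
    variance of the two resulting pixel groups (Otsu's method).
    `gray` is an array of 8-bit grayscale pixel values."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum = np.cumsum(hist)                        # pixel count below t
    cum_mean = np.cumsum(hist * np.arange(256))  # intensity mass below t
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0 = cum[t - 1] / total                  # weight of "background" class
        if w0 == 0.0 or w0 == 1.0:
            continue                             # one class is empty
        m0 = cum_mean[t - 1] / cum[t - 1]        # mean of background class
        m1 = (cum_mean[255] - cum_mean[t - 1]) / (total - cum[t - 1])
        var_between = w0 * (1.0 - w0) * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```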
• Feature extraction is the process of discovering the most promising and informative set of characteristics to increase the accuracy and efficiency of the data to be tested [23].
• After image pre-processing, feature extraction is the most important phase in SLRS.
• Features are of two types: local and global descriptors. Global features characterize the whole image, allowing the entire object to be generalized.
• Shape matrices, invariant moments (Hu, Zernike), HOG, and Co-HOG are common global descriptors [24][25][26] for image retrieval, object recognition, and classification.
• Local descriptors [27][28][29], such as SIFT and SURF, are used for object recognition and identification [30]. For feature extraction in vision-based gesture recognition systems, a variety of approaches have been used, including Zernike moments, Hu moments, HOG, SIFT, ED, FD, DWT, ANN, CNN, fuzzy logic, and GA [33][34][36][37][41][44][64].
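As an illustration of a global descriptor, a simplified HOG-style orientation histogram can be sketched in NumPy (real HOG additionally uses cells, blocks, and block normalization; the 9-bin choice is conventional but assumed here):

```python
import numpy as np

def gradient_orientation_histogram(gray, bins=9):
    """A simplified HOG-style global descriptor: a histogram of gradient
    orientations over the whole image, weighted by gradient magnitude."""
    gy, gx = np.gradient(gray.astype(float))     # row (y) and column (x) gradients
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation in [0, 180)
    hist, _ = np.histogram(ang, bins=bins, range=(0, 180), weights=mag)
    s = hist.sum()
    return hist / s if s > 0 else hist
```

A vertical edge, for example, puts nearly all its mass into the 0-degree bin.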
i. Neural Network (NN): A NN is a multi-layer network with n layers (n ≥ 3) and x_n neurons at the last layer.
• The basic NN is a three-layered architecture composed of an input layer, a hidden layer, and an output layer. In the input and hidden layers, data is processed using weights; the output layer is responsible for predicting the results.
• The NN model is mainly of two types: feedforward and backpropagation. A feedforward NN is shown in Figure 1.17.
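The three-layer architecture can be sketched as a forward pass in NumPy (a toy sketch; the layer sizes 64-16-10, the sigmoid/softmax choices, and the random weights are assumptions, not the thesis model):

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class FeedforwardNN:
    """Three-layer network (input -> hidden -> output) as in Figure 1.17.
    Sizes are illustrative: e.g. 64 input features, 16 hidden units,
    10 output classes."""

    def __init__(self, n_in, n_hidden, n_out):
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        h = sigmoid(x @ self.W1 + self.b1)   # hidden-layer activations
        logits = h @ self.W2 + self.b2
        e = np.exp(logits - logits.max())    # softmax over output classes
        return e / e.sum()

net = FeedforwardNN(64, 16, 10)
probs = net.forward(rng.normal(size=64))
```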
Figure 1.17. The architecture of the Neural Network
• A Convolutional Neural Network (CNN) is a deep, feed-forward artificial neural network that can perform various tasks with even better time and accuracy than other classifiers.
• A typical CNN has three kinds of layers: a convolution layer, a max-pooling layer, and a fully connected layer, as shown in Figure 1.18.
• The first layer is the convolution layer, where 'filters' such as 'blur', 'sharpen', and 'edge-detection' are all realized as a convolution of a kernel (filter) with the image.
• Each feature or pixel of the convolved image is a node in the hidden layer.
• Each number in the kernel is a weight, and that weight is the connection between the features of the input image and the nodes of the hidden layer.
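The convolution and pooling steps can be sketched in NumPy (a naive sketch; like most CNN frameworks, it actually computes cross-correlation, and in a real CNN the kernel weights are learned rather than fixed):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (strictly, cross-correlation): slide the
    kernel over the image and take the weighted sum at each position —
    each kernel entry is a weight, as described above."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max-pooling: keep the maximum of each size x size block."""
    h, w = x.shape[0] // size * size, x.shape[1] // size * size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# a Sobel-like kernel responds strongly to vertical edges
edge_kernel = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]], dtype=float)
```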
Figure 1.18. CNN model architecture
iii. Fuzzy Logic: The concept of fuzzy logic was introduced by Zadeh[31] as a method for
representing human knowledge that is imprecise by nature.
• The most significant benefit of fuzzy logic is that it provides a practical mechanism
for creating non-linear control systems that are difficult to develop and stabilize
using conventional methods.
• Hence, fuzzy logic is most frequently used in device-based recognition.
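A minimal sketch of a fuzzy membership function in Python (the "finger almost straight" set and its angle breakpoints are hypothetical, purely for illustration):

```python
def triangular_membership(x, a, b, c):
    """Degree (0..1) to which x belongs to a fuzzy set whose triangular
    membership rises from a, peaks at b, and falls back to zero at c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# hypothetical fuzzy set "finger almost straight" over a joint angle in degrees
degree = triangular_membership(170.0, 150.0, 180.0, 210.0)
```

Unlike a crisp threshold, an angle of 170° belongs to the set only partially, which is how fuzzy logic represents imprecise knowledge.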
iv. Evolutionary Algorithms: These algorithms are used to solve various optimization problems encountered
in real-life applications[34][23].
• The main principle behind these algorithms is to find an appropriate selection for an application by
simulating natural selection.
• The main principle is that, as nature evolves with the fittest survival, EA aims to find the fittest.
• Some commonly used evolutionary algorithms are Genetic algorithm (GA), Particle swarm
optimization(PSO), Artificial bee colony optimization(ABC), Firefly optimization algorithm(FA), and Ant
colony optimization(ACO).
• In sign language recognition, these algorithms are best used for feature selection: extraction techniques such as HOG and PCA produce a large number of features, from which an evolutionary algorithm can select the most discriminative subset.
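This selection idea can be sketched as a toy genetic algorithm in NumPy (illustrative only; the fitness function, relevance scores, and GA parameters are all assumptions, not taken from the surveyed works):

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(mask, relevance, cost=0.01):
    """Hypothetical fitness: total relevance of the selected features minus
    a small per-feature penalty, so smaller subsets are preferred."""
    return relevance[mask].sum() - cost * mask.sum()

def ga_feature_selection(relevance, pop_size=20, generations=30):
    """Evolve boolean feature masks: keep the fittest half each generation,
    then refill via one-point crossover and bit-flip mutation."""
    n = len(relevance)
    pop = rng.random((pop_size, n)) < 0.5            # random boolean masks
    for _ in range(generations):
        scores = np.array([fitness(ind, relevance) for ind in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[:pop_size // 2]]          # survival of the fittest
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)                  # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n) < 0.05               # mutation
            children.append(child ^ flip)
        pop = np.vstack([parents] + children)
    return max(pop, key=lambda ind: fitness(ind, relevance))
```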
i. Segmentation: Segmenting the hand using skin-colour-based approaches is challenging; extracting the hand shape from a cluttered and complex background is a difficult computer vision problem.
ii. Similar gestures: A multiclass classification system must categorize sign language with an extensive vocabulary, and recognizing similar gestures is a challenging task.
iii. Feature extraction: Feature extraction plays an essential role in ISLR. A technique that can extract accurate, non-redundant features is required to reduce the time complexity of the system.
iv. Dimensionality: Deep learning techniques are the most common techniques used for SLRS. They apply a number of filters at each layer to convolve the image, so the number of parameters grows greatly towards the final layer. This increases the dimensionality of the training model: although these networks provide higher accuracy than traditional methods, the time complexity and the redundancy in the features increase.
v. Hand geometry: The hand is an articulated object with 27 degrees of freedom (related to the number of joints). The interdependences between the fingers depend on the hand movements and the degrees of freedom. Hand gestures can vary in size, position, and orientation, so a combination of these parameters must be approximated to recognize a hand gesture.
vi. Self-occlusion: During the formation of a double-hand gesture, some parts of one hand may hide behind the other, causing self-occlusion. This makes hand segmentation and detection very difficult, so a model robust to self-occlusion is required.
vii. Standardized dataset: Sign language consists of a vocabulary of signs in precisely the same way a spoken language consists of a vocabulary of words. Sign languages are not standard and universal, and their grammars differ from state to state. A standard sign language that can be followed within the country is required. Keeping this in mind, this work focuses only on ISL.
viii. Dynamic gesture recognition: In double-hand recognition systems, one hand might move more quickly than the other, and the system has trouble keeping track of hands moving at various rates. Consequently, a fast recognition system is required.
• To analyze various soft computing-based techniques used for feature extraction and gesture recognition in ISL.
• To propose an efficient and effective technique for feature extraction of static gestures in ISL.
• To propose a soft computing-based technique for the recognition of various gestures used in ISL.
• To apply the above proposed techniques to some real-world problems.
• With the recent advancement in machine learning and computational intelligence methods, intelligent systems for sign language recognition continue to attract the attention of academic researchers and industrial practitioners.
• This study presents a systematic analysis of intelligent systems employed in sign language recognition studies between 2000 and 2022.
• An exhaustive search was conducted using the Google search engine. All of the techniques have been analyzed in terms of accuracy.
• More than 150 articles from the field of gesture recognition were selected, with the main emphasis on vision-based ISLR.
Table 2.1. A summary of existing works in the image acquisition phase

| S.No. | Author | Data acquisition method | Feature set | Dataset | Preprocessing | Segmentation technique |
| --- | --- | --- | --- | --- | --- | --- |
| 1. | [75] | Camera | Hand gestures and movement of the head | Self-created | Grayscale images with Gaussian filtering | Otsu’s thresholding + DWT + Canny edge detection |
| 2. | [76] | Camera | Different hand shapes | Self-created | Grayscale images with median filter + morphological operation + Gaussian filter | Thresholding and blob + crop + Sobel edge detector |
| 3. | [77] | Camera | Different hand shapes | Self-created | - | Sobel edge detector |
| 4. | [78] | Camera | Hand gestures and head positions | Self-created | Grayscale images with morphological operations + average filter | Canny edge + DWT |
| 5. | [70] | Kinect + camera | Different hand shapes | Self-created | Skin filter used for image pre-processing | HSV colour space used for feature extraction |
| 6. | [79] | Camera | Different hand shapes | Publicly available | HE and logarithmic transformation | CIELAB colour space + Canny edge detection |
Table 2.2. Summary of ISL feature extraction techniques work

| S.No. | Author | Feature extraction technique | Gestures | Advantage | Disadvantage | Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| 1. | [138] | ED | 24 alphabets | Less time complexity; recognizes double-handed gestures; differentiates skin colour | Only static images have been used | 97% |
| 2. | [17] | ED | 24 alphabets | Recognizes single- and double-handed gestures on video sequences | Works only in ideal lighting conditions | 96.25% |
| 3. | [111] | FD | 15 words | Differentiates similar gestures | Large dataset | 96.15% |
| 4. | [113] | FD | 46 alphabets, numbers, and words | Dynamic gestures | Dataset of 130,000 is used | 92.16% |
| 5. | [141] | DWT | 52 alphabets, numbers, and words | Considers dynamic gestures | Simple background; large dataset | 81.48% |
| 6. | [114] | DWT | 24 alphabets | Increases adaptability to background complexity and illumination | Less efficient for similar gestures | 90% |
| 7. | [129] | FL | 90 alphabets, numbers, and words | Invariant to scaling, translation, and rotation | Cannot work in a real-time system | 96% |
| 8. | [132] | ANN | 22 alphabets | No noise issue; data normalization is easily done | - | 99.63% |
Table 2.2 (continued). Summary of ISL feature extraction techniques work

| S.No. | Author | Feature extraction technique | Gestures | Advantage | Disadvantage | Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| 10. | [78] | Fuzzy + Neural | 26 alphabets | High recognition rate for single- and double-handed gestures | Accuracy lacks for similar gestures | 97.1% |
| 11. | [141] | SIFT | 26 alphabets | Works better on illumination-varied and scaled images | Processing time is high | - |
| 12. | [99] | Elliptical FD and PCA | 59 alphabets, numbers, and words | Video sequences are used for recognition; works better at recognizing moving objects | Needs improvement on complex-background sequences | 92.34% |
| 13. | [125] | HOG | 36 alphabets and numbers | Robust under changing lighting; no additional hardwired devices required for shape extraction | Accuracy lacks on complex-background images | - |
| 14. | [142] | HOG | 36 alphabets and numbers | Reduced feature vector; minimum computational time | - | 92% |
| 15. | [93] | Adaptive thresholding and SIFT | 50 alphabets, numbers, and words | Eliminates the need for image pre-processing; high accuracy | Works only on static images | 91.84% |
| 16. | [105] | HOG + SIFT | 26 alphabets and numbers | Invariant to illumination, orientation, and occlusion for double-handed gestures | Accuracy lacks on dynamic gestures | 93% |
Table 2.3. Summary of classification techniques reviewed work

| S.No. | Author | Feature extraction technique | Classifier | Accuracy | Observation |
| --- | --- | --- | --- | --- | --- |
| 1. | [105] | Shape descriptors, SIFT + HOG | SVM | 93% | Unable to classify similar gestures |
| 2. | [26] | SIFT + LDA | KNN and SVM | 99% | SVM achieved better accuracy compared to KNN |
| 3. | [154] | SIFT, k-means clustering + BOW | SVM and KNN | - | On a large dataset, SVM performs better than KNN |
| 4. | [70] | Hu moments + motion trajectory | SVM | 97.5% | ISL gestures have been classified |
| 5. | [142] | TOPSIS | SVM | 99.2% (ASL), 92% (ISL) | Good performance under complex backgrounds |
| 6. | [155] | AlexNet, VGG16 model | SVM | 99.82% | Computational complexity is high on large datasets |
| 7. | [76] | Centroid, area of edge | ED | 90.19% | 26 ASL gestures recognized in real time |
| 8. | [173] | Edge detection, FD + DTW | KNN | 96.15% | ISL dynamic gestures have been recognized |
| 9. | [182] | DTW | KNN | 99.23% | 13 ISL alphabets recognized |
| 10. | [157] | ORB, k-means clustering + BOW | KNN | KNN: 95.81%, MLP: 96.96% | MLP performs better than KNN on ASL static gestures |
Table 2.3 (continued). Summary of classification techniques reviewed work

| S.No. | Author | Feature extraction technique | Classifier | Accuracy | Observation |
| --- | --- | --- | --- | --- | --- |
| 11. | [30] | EFD | ANN with backpropagation | 95.10% | Four cameras are used for acquisition |
| 12. | [78] | EFD with PCA | ANN | 92.34% | Better results compared to the morphological process |
| 13. | [79] | Canny edge detector + FCC | ANN | 96.50% | Unable to recognize gestures in low illumination |
| 14. | [184] | HOG | ANN with feedforward and backpropagation algorithm | 99.0% | Two-hand BSL alphabets were recognized |
| 15. | [170] | Boundary and region features | ANFIS | 100% (19 rules), 97.5% (10 rules) | Better performance than the previous model |
| 16. | [78] | Active contours | FIS | 96% | Better results were achieved compared to other models |
| 17. | [185] | GLCM | Fuzzy c-means | 91% | 28 ArSL alphabets were recognized |
| 18. | [35] | - | Continuous HMM and AdaBoost | 92.70% | Improved recognition accuracy compared to the individual CHMM model |
| 19. | [186] | - | SVM with bagged tree classifier | 80% | The bagged tree classifier outperformed the SVM classifier |
| 20. | [187] | - | RF with ANN and SVM | 95.48% | RF outperforms ANN and SVM |
| 21. | [173] | - | ELM with multiple SVMs | 98.7% | Results conclude that ELM outperforms single classifiers |
To attain benchmark performance in this context, the following points are worthy of more research attention:
• Different methods are used for data acquisition [17, 30]. Vision-based acquisition is inexpensive but strongly affected by lighting and background, while device-based acquisition [70] requires a trainer and is very expensive.
• During image pre-processing, objects having the same skin colour as the background cannot be segmented [76]. High/low light intensity, a poor background colour, and an inappropriate signer position will also affect gesture segmentation [94, 95].
• Several feature extraction techniques [20, 21, 100, 102] are used in ISLR. Although these techniques achieve higher accuracy [96], they do so with a large feature set and long processing time.
• Most existing work focuses on increasing the accuracy of the recognition of alphabets and numbers [1, 52].
• Recognition of words in ISL remains a demanding research area. Classifying a test image against trained images is the most crucial part of a recognition system.
• Techniques like SVM [70, 109, 150, 154], HMM [39, 146, 168], KNN [31, 76, 165, 167], and ANN [38, 39, 40, 93, 160] have been used, but the accuracy achieved by these systems is around 95%. An algorithm with improved accuracy and efficiency is needed.
• Even though the approaches discussed above perform effectively in the applications where they are used, the current methodologies have certain limits, either in computational complexity or in recognition accuracy; there is therefore still room for the development of new strategies.
I. CBIR: Content-Based Image Retrieval (CBIR) has been one of the most important research areas in computer vision over the last 20 years. The main idea of CBIR is to analyze image information through an image's low-level features [83], which include colour, texture, shape, and the spatial relationships of objects, and to set up the image's feature vectors as its index. The most common CBIR techniques used for sign language recognition are discussed below.
i. Statistical: Zernike moments require lower computation time compared to regular moments [84][85].
• In [86] these moments are used to extract mutually independent shape information, while [87] applied this approach to Tamil scripts to overcome the information loss of geometric moments.
• Although the computation of these feature vectors is easy, the recognition rate is less efficient. These features are invariant to shape and angle but variant to background and illumination.
ii. Shape based: These techniques rest on the premise that accurate features can be extracted without any change in the shape of the image.
• [93] determines the active finger count by evaluating the Euclidean distance between palm and wrist; from this, feature vectors of finger projected distance (FPD) and finger base angle (FBA) are computed. However, feature selection depends on orientation and rotation angle.
• All features of the processed frames are then extracted using the Fourier descriptor method. Instead of pre-processing techniques such as filtering and segmentation of the hand gesture, scaling and shifting parameters were extracted based on high/low image frequencies up to the 7th level [99].
• These feature extraction techniques lack accuracy and efficiency on large databases [97].
• They also cannot perform well in cluttered backgrounds [100] and are variant to illumination changes.
• Soft computing is an emerging approach in the field of computing that provides a remarkable ability to learn in an atmosphere of uncertainty.
• HOG is used in [115][116] to describe the appearance and shape of local objects within an image. [117] works on continuous gesture recognition by storing 15 frames per gesture in a database. [118] uses HOG for vision-based gesture recognition. [119] extracts global descriptors of an image with a local histogram feature descriptor (LHFD).
• [125][160][161] develop three novel methods (NN-GA, NN-EA and NN-PSO) for effective recognition of gestures in ISL, with the NN optimized using GA, EA and PSO respectively. Experimental results conclude that the NN-PSO approach outperforms the other two methods. [126][162]-[166] use CNNs to automate construction of pools of similar local regions of the hand.
• [127] applied CNNs to extract features directly from video. In [128][129][130], automatic clustering of all frames of a dynamic hand gesture is done by CNNs; the model consists of three max-pooling layers, two fully connected layers, and one SoftMax layer.
S.No | Author | Acquisition Method | Gesture Type | Mode | Technique | Accuracy | Remark
1. | Verma and Dev (2009) | Camera | Both | Dynamic | FSM | - | The proposed technique was successfully applied to gestures such as waving the left hand, waving the right hand, and signalling to stop, forward and rewind.
2. | Adithya et al. [33] | Camera | Both | Static | ANN | 91.11% | Hand shape is extracted using digital image processing techniques.
3. | Kishore et al. [38] | Camera | Both | Dynamic | ANN | 90.17% | The word matching score over multiple instances of training and testing of the neural network was around 90%.
4. | Kaluri and Reddy (2017) | Camera | Both | Static | GA-NN | 90.18% | A genetic algorithm has been used to improve the recognition rate.
5. | Prasad et al. (2016) | Camera | Both | Dynamic | Sugeno fuzzy inference system | 92.5% | The video dataset of Indian signs contains 80 words and sentence testing.
6. | Kishore et al. (2016) | Camera | Both | Dynamic | Fuzzy Inference Engine | 96% | The system achieved better results compared to other models in the same categories.
7. | Hasan et al. (2017) | Camera | Single hand | Static | ANN | 96.50% | -
8. | Fregoso et al. (2021) | Camera | Both | Static | PSO-CNN | 99.98% | Applied optimization algorithms to find the optimal parameters of the CNN architecture.
9. | Shin et al. (2021) | Camera | Both | Static | Gradient Boost Machine | 96.71% | The complex shape of the hand could be easily detected.
10. | Meng and Li (2021) | Camera | Both | Dynamic | Graph Convolution Network | 98.08% | The system achieved better performance and reduced motion blurring, sign variation and finger occlusion.
• Fusion of soft-computing-based and CBIR-based techniques is also employed in the literature to gain the advantages of both.
• [81] integrates SURF and Hu moments to achieve a high recognition rate with less time complexity.
• [88] embeds SIFT and HOG for robust feature extraction of images in cluttered backgrounds and under difficult illumination.
• To improve efficiency, a multiscale oriented histogram together with contour directions is used for feature extraction [133].
• This integration of approaches makes the system memory-efficient, with a high recognition rate of 97.1%.
• Hybrid approaches produce efficient and effective systems, but their implementation is complex.
Figure 2.1. Taxonomy of feature extraction techniques
From the previous study results, a comparative analysis of feature extraction techniques is carried out. Based on their characteristics, the commonly used feature extraction techniques are divided into three main categories: scale-based, intensity-based, and hybrid techniques, as shown in Figure 3.1.
Figure 3.1. Classification of feature extraction techniques
i. SIFT: SIFT features are local and robust to brightness, contrast, affine transformation, and noise. The number of octaves and scales depends on the size of the original image; each octave's image size is half of the previous one. SIFT uses local spectra, which lets it detect and compute keypoints efficiently, as shown in Eq. 3.1:

L(x, y, σ) = G(x, y, σ) ∗ I(x, y)   (3.1)

where G is the Gaussian blur operator, L is the blurred image, I is the image, σ is the scale parameter, and (x, y) are the location coordinates.
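Eq. 3.1 can be sketched directly in NumPy. This is a minimal illustration, not the full SIFT scale space: the separable-kernel implementation, the 3σ kernel radius, and the toy image are our own assumptions (σ = 1.6 is the base scale commonly quoted for SIFT).

```python
import numpy as np

def gaussian_kernel1d(sigma):
    """Sampled 1-D Gaussian, truncated at 3*sigma and normalized to sum to 1."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(image, sigma):
    """L(x, y, sigma) = G(x, y, sigma) * I(x, y) (Eq. 3.1), implemented as a
    separable convolution along rows and then columns."""
    k = gaussian_kernel1d(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

img = np.zeros((32, 32))
img[12:20, 12:20] = 1.0            # toy image: a bright square
L = gaussian_blur(img, sigma=1.6)  # one blurred image of the scale space
```

In the full SIFT pipeline this blur is repeated for several σ per octave, and each new octave starts from the previous one downsampled to half its size.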
ii. SURF: SURF is an improved and faster version of SIFT: a fast and robust algorithm for local feature detection and description. SURF uses integral images with box filters to approximate Gaussian kernels at a point X = (x, y). The Hessian matrix H(x, σ) at x and scale σ is defined in Eq. 3.2:

H(x, σ) = | Lxx(x, σ)  Lxy(x, σ) |
          | Lxy(x, σ)  Lyy(x, σ) |   (3.2)

where Lxx(x, σ) is the convolution of the second-order Gaussian derivative with the image I at point x, and similarly for Lxy(x, σ) and Lyy(x, σ).
i. FAST: FAST is a corner detection method with great computational efficiency, commonly used for real-time processing applications due to its high speed. For every candidate point p, it stores the 16 pixels on a circle around it as a vector, and this is done for all the images. Each circle pixel x can be in one of three states relative to p, as shown in Eq. 3.3:

S(p→x) = d (darker)    if I(p→x) ≤ Ip − t
         s (similar)   if Ip − t < I(p→x) < Ip + t
         b (brighter)  if Ip + t ≤ I(p→x)   (3.3)

where S(p→x) is the state, I(p→x) is the intensity of circle pixel x, Ip is the intensity of p, and t is a threshold. Candidate points whose response falls below the selected threshold are discarded from the vector set.
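The three-state segment test of Eq. 3.3 can be sketched in a few lines. The 16 ring offsets below are the standard radius-3 Bresenham circle, and the FAST-12 corner criterion (a contiguous arc of at least 12 darker or brighter pixels) is one common choice; both are our assumptions for illustration.

```python
import numpy as np

# Offsets (dy, dx) of the 16-pixel Bresenham circle of radius 3 around p.
CIRCLE = [(-3, 0), (-3, 1), (-2, 2), (-1, 3), (0, 3), (1, 3), (2, 2), (3, 1),
          (3, 0), (3, -1), (2, -2), (1, -3), (0, -3), (-1, -3), (-2, -2), (-3, -1)]

def pixel_states(image, p, t):
    """Classify every ring pixel as darker 'd', similar 's' or brighter 'b'
    relative to the nucleus p, per Eq. 3.3."""
    py, px = p
    ip = int(image[py, px])
    states = []
    for dy, dx in CIRCLE:
        i = int(image[py + dy, px + dx])
        states.append('d' if i <= ip - t else 'b' if i >= ip + t else 's')
    return states

def is_corner(states, n=12):
    """FAST-n: accept p if some contiguous arc of n ring pixels shares the
    'd' or 'b' state (the list is doubled to handle wrap-around)."""
    for c in ('d', 'b'):
        run = best = 0
        for s in states * 2:
            run = run + 1 if s == c else 0
            best = max(best, run)
        if best >= n:
            return True
    return False

img = np.zeros((7, 7), dtype=int)
img[3, 3] = 100                          # bright nucleus on a dark background
states = pixel_states(img, (3, 3), t=20)
```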
ii. BRIEF: BRIEF is an efficient feature point descriptor built from binary strings. BRIEF is very fast both to build and to match, and it outperforms descriptors such as SIFT and SURF in speed while remaining competitive in recognition rate. The binary test τ that makes up the feature vector is defined in Eq. 3.4:

τ(p; x, y) = 1 if p(x) < p(y)
             0 if p(x) ≥ p(y)   (3.4)

where p(x) is the intensity of patch p at point x. Choosing a set of n (x, y)-location pairs uniquely defines the set of binary tests, where n is the length of the binary feature vector and can be 128, 256, or 512.
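Eq. 3.4 can be sketched as follows. The uniform sampling of the location pairs, the 31-pixel patch size, and the fixed seed are our assumptions (the original BRIEF formulation samples pairs from a Gaussian over a smoothed patch); binary descriptors are then matched by Hamming distance.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BITS = 256                # descriptor length n: 128, 256 or 512 bits
PATCH = 31                  # patch size around the keypoint
# n (x, y)-location pairs, sampled once and reused for every keypoint
PAIRS = rng.integers(0, PATCH, size=(N_BITS, 4))

def brief_descriptor(patch):
    """tau(p; x, y) = 1 if p(x) < p(y) else 0, over the fixed pairs (Eq. 3.4)."""
    bits = np.empty(N_BITS, dtype=np.uint8)
    for i, (x1, y1, x2, y2) in enumerate(PAIRS):
        bits[i] = 1 if patch[y1, x1] < patch[y2, x2] else 0
    return bits

def hamming(d1, d2):
    """Binary descriptors are compared with the Hamming distance."""
    return int(np.count_nonzero(d1 != d2))

patch = rng.integers(0, 256, size=(PATCH, PATCH))
desc = brief_descriptor(patch)
```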
i. ORB: ORB is an improved combination of FAST and BRIEF, robust to brightness, contrast, rotation, and limited scale change.
• The main contribution of ORB is the addition of a fast and accurate orientation component; its oriented BRIEF features also make computation efficient.
• Like BRIEF, ORB uses local binary descriptors, and it uses the intensity centroid to estimate orientation from intensity change.
• Unlike FAST and BRIEF alone, ORB can both detect keypoints and compute descriptors, and it has thus emerged as one of the most efficient feature extraction techniques in computer vision.
Table 3.1. Comparative analysis of techniques
Experimental Setup
Python 3 with Jupyter Notebook has been used for performing the experiments presented in this chapter. The system's specifications are: Intel® Core™ 1.8 GHz, 8 GB RAM, 256 KB cache per core (3 MB cache in total), and a GPU with 1536 MB VRAM. SIFT, SURF, FAST, BRIEF, and ORB are used as detectors in OpenCV's environment. As no standard dataset for ISL alphabet gestures is available, a dataset from a GitHub project [18], consisting of 4962 images with more than 200 images per gesture, has been used for the experiments, excluding J and Z as those gestures require motion. The dataset includes images with all typical variations such as different orientations, illumination, occlusion, blurring, intensity, and affine transformation, as shown in Figure 3.2.
i. Match Rate: The match rate is calculated from the number of keypoints matched between corresponding query and training images, as shown in Eq. 3.5:

Match rate (%) = (keypoints of training image + keypoints of query image) / (2 × matched keypoints) × 100   (3.5)

For example, in the case of the intensity scale for FAST, if 144 keypoints are extracted from the training image and only 72 from the query image, and the matched keypoints found by the Brute-Force (BF) matcher [2] number 115, then the match rate evaluates to:

Match rate (%) = (144 + 72) / (2 × 115) × 100 = 93.91
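The computation of Eq. 3.5 and the worked example can be reproduced with a small helper (the function name is ours):

```python
def match_rate(train_keypoints, query_keypoints, matched_keypoints):
    """Match rate (%) per Eq. 3.5: keypoint counts of the training and query
    images over twice the number of matched keypoints."""
    return (train_keypoints + query_keypoints) / (2 * matched_keypoints) * 100

# The worked example above: 144 training keypoints, 72 query keypoints,
# 115 Brute-Force matches.
rate = match_rate(144, 72, 115)   # ≈ 93.91
```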
i. Affine Transformation: Table 3.2 shows that under affine transformation SURF gives the lowest match rate and takes the most time.
ii. Intensity Scale: Intensity scaling changes the colour of images at different scales; features are detected from images at different intensities. The results obtained for the intensity scale by the different techniques are shown in Table 3.3.
iii. Orientation: In orientation, the images' views are rotated with respect to each other. Table 3.4 and Figures 3.6, 3.7, and 3.8 show the matching keypoints in training and query images at different rotation angles.
iv. Blurring: Table 3.5 shows that BRIEF is the fastest while SURF is the slowest for blurring.
v. Illumination: Table 3.6 shows that FAST is the fastest and SIFT the slowest in the case of illumination.
vi. Occlusion: From Table 3.7, the occlusion results state that FAST provides the highest match rate in the least time.
Figure 3.2. Matching images at different intensity scales for SIFT, SURF, BRIEF and ORB
Figure 3.6. Image rotated at 0° angle
Figure 3.7. Image rotated at 45° angle
Figure 3.8. Image rotated at 90° angle
Table 3.2. Comparison based on the affine transformation
                       SIFT    SURF    FAST    BRIEF   ORB
Match Rate (%)         77.23   70.44   96.62   95.22   96.21
Processing Time (sec)  1.79    2.21    1.37    1.45    1.43

Table 3.3. Comparison based on the intensity scale
                       SIFT    SURF    FAST    BRIEF   ORB
Match Rate (%)         77.47   65.52   93.91   78.67   85.68
Processing Time (sec)  0.33    0.39    0.23    0.24    0.21
Table 3.4. Comparison based on the orientation
Table 3.5. Comparison based on the blurring
Table 3.7. Comparison based on the occlusion
Table 3.6. Comparison based on the illumination
Figure 3.9 illustrates the performance of all the techniques in terms of match rate on the different parameters. SIFT performed better for intensity scale, occlusion and blurring, while SURF performed the worst on all parameters. The performance of BRIEF degrades in the case of blurring, orientation, occlusion, and intensity scale, while ORB performs worst on blurring. The results conclude that the performance of FAST is superior to all other techniques on all parameters.
Figure 3.9. Performance based on the match rate
Figure 3.10 shows the time taken by all the techniques for all the parameters. The results state that SURF takes much more time than the other four techniques, while ORB takes the least.
Figure 3.10. Performance based on execution time
After experimenting with all the techniques, it is concluded that they have evolved from one another. From the experimental results, the techniques have been categorized into three bands:
a) Common: match rate < 70%, execution time > 2 sec.
b) Good: match rate 70–89%, execution time 1.5–2 sec.
c) Best: match rate > 90%, execution time < 1.5 sec.
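The banding rule above can be expressed as a helper; how to label values that fall between the stated bands (e.g. a match rate of 89.5%) is not specified, so the boundary handling below is our assumption.

```python
def band(match_rate, exec_time):
    """Categorize a technique from its match rate (%) and execution time (s)
    using the Common/Good/Best thresholds above."""
    if match_rate >= 90 and exec_time < 1.5:
        return "Best"
    if match_rate > 70 and exec_time <= 2:
        return "Good"
    return "Common"

# e.g. FAST under affine transformation: 96.62 % in 1.37 s
label = band(96.62, 1.37)   # "Best"
```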
Table 3.8 shows the comparative analysis of all the techniques based on common, good, and best.
Table 3.8. Comparative analysis of all the techniques
Technique | Processing Time | Affine Transformation | Intensity Scale | Orientation | Blurring | Illumination | Occlusion
SIFT  | Common | Good | Good   | Common | Common | Good | Good
SURF  | Common | Good | Common | Common | Common | Good | Common
FAST  | Best   | Best | Best   | Good   | Best   | Good | Best
BRIEF | Best   | Best | Good   | Good   | Good   | Best | Good
ORB   | Best   | Best | Good   | Best   | Good   | Best | Best
• Feature extraction is essential to ISLR, as the system's computational efficiency mainly depends on it.
• Five feature detection and extraction techniques for vision-based ISLR have been compared in terms of match rate and processing time on original images and their deformations.
• It can be observed that the performance of SURF is not suitable for real-time ISLR, whereas BRIEF and ORB do not perform well on intensity scale and blurring, although they give the best and fastest response in terms of detection and matching.
• SIFT performed better for affine transformation, intensity scale, illumination, and occlusion; however, its performance degrades under orientation and blurring, and its processing time is slower than FAST, BRIEF and ORB.
• FAST provides the most noticeable results for all the variation parameters. But FAST can only detect features; it cannot compute descriptors, which limits its use as a complete algorithm for ISLR.
• The limitations of FAST can be overcome by hybridizing it with another algorithm, since a vision-based ISLR system needs accurate and fast results in a real-time environment.
• Further, an attempt is made to improve the performance of existing SIFT for ISLR.
A Hybrid approach for Feature Extraction for Vision-
Based Indian Sign Language Recognition (Objective-2)
• Existing feature extraction techniques may perform exceptionally well in one situation but underperform in others, since each is intended to extract specific features from an image.
• The FAST technique detects keypoints accurately and quickly, even in low-resolution images [21][25][26][33]. However, it is not stable to rotation, blurring, and illumination.
• SIFT, in contrast, has been used for computing features, making analysis very efficient and effective [21][39]. SIFT performs well under these conditions but takes more time for feature extraction [10, 19].
• Meanwhile, the incredible results of CNNs in image processing and image classification have inspired researchers to apply them to SLR [5][12], and many SLR systems make use of CNNs [48][17][20][24]. A CNN has good generalization capability but is computationally expensive.
• To overcome the limitations of SIFT and FAST, a hybridization of the two is done.
• First, FAST is used to detect the keypoints, as it detects keypoints speedily and can do so even in low-resolution images.
• The detected keypoints are then computed using SIFT, which is invariant to orientation and blurring; SIFT returns the final keypoints after computation.
• These images are then passed to a CNN for classification. The computation of the CNN is reduced by providing the image with only the essential keypoints: instead of convolving layer after layer over every part of the image, the CNN now sees only the high-intensity pixels.
• This leads to convolving only the part of the image where the actual gesture is present; the other pixels are treated as non-trainable parameters, as they carry no value.
• The proposed model is named FiST_CNN.
• Figure 4.1 shows the overall architecture of the hybrid FiST_CNN approach for ISL.
• It consists of three major phases: data pre-processing, feature extraction, and training and testing of the CNN.
• In the first phase, the stored static single-handed images are resized to 224×224, and data augmentation is applied to the resized images.
• In the next phase, keypoints are localized by the FAST technique, and the values of these localized keypoints are then computed using SIFT.
• Finally, these values are passed to the CNN for training, after which the CNN classifies the images into their various classes.
Figure 4.1. Architecture of FiST_CNN for ISL
I. Image Resize and Data Augmentation
First, all the images are resized to 224×224 pixels to maintain uniformity in the dataset. Data augmentation is then applied to make the system more robust to image orientations, occlusions, and transformations at different angles and lighting conditions.
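The resize-and-augment step can be sketched with NumPy alone; nearest-neighbour resizing and simple flip/rotation views are simplifications of the full augmentation pipeline and are our own assumptions.

```python
import numpy as np

def resize_nearest(image, size=(224, 224)):
    """Nearest-neighbour resize to 224x224 so the dataset stays uniform."""
    h, w = image.shape[:2]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return image[rows][:, cols]

def augment(image):
    """A few simple augmented views: identity, horizontal flip, rotations."""
    return [image, np.fliplr(image), np.rot90(image), np.rot90(image, 2)]

img = np.arange(64 * 48).reshape(64, 48)   # toy image of arbitrary size
resized = resize_nearest(img)              # (224, 224)
views = augment(resized)                   # 4 augmented views
```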
II. Feature Extraction
In this phase, the keypoints are first localized using the FAST computer vision technique. To identify a pixel p as an interest point, a Bresenham circle of 16 pixels is used as a test mask. Every pixel y on the considered circle may have one of the following three states [14], as shown in Eq. 4.1:

S(p→y) = d (darker)    if I_y ≤ I_p − T
         b (brighter)  if I_y ≥ I_p + T
         s (similar)   if I_p − T < I_y < I_p + T   (4.1)

where I_y is the intensity of pixel y, I_p is the intensity of the nucleus (pixel p), and T is the threshold parameter that controls the number of corner responses.
The magnitude and direction of the localized points are computed by SIFT using its standard equations [41]. The vector of localized magnitudes and gradients computed by FiST is then passed on to the training and testing groups after data augmentation. Note that this approach extracts only the essential keypoints from the image, setting every other pixel value to 0.
III. Data Partitioning
The FiST_CNN approach protects the model from overfitting, which arises when the data contains noise. To validate the performance of the FiST_CNN model, the dataset is divided in a 70:30 ratio after data augmentation.
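The 70:30 partition can be sketched as follows; shuffling with a fixed seed before the split is our assumption.

```python
import numpy as np

def split_70_30(samples, labels, seed=42):
    """Shuffle and split the augmented dataset into 70% training / 30% testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    cut = int(0.7 * len(samples))
    return (samples[idx[:cut]], labels[idx[:cut]],
            samples[idx[cut:]], labels[idx[cut:]])

X = np.arange(100).reshape(100, 1)           # toy feature vectors
y = np.arange(100) % 10                      # toy class labels
X_tr, y_tr, X_te, y_te = split_70_30(X, y)   # 70 training, 30 testing samples
```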
IV. Model Training using CNN
Thereafter, the group of training images (T_t) is passed into the CNN. Various convolution and max-pooling operations are then performed on T_t using Eqs. (4.2)–(4.3) and (4.4) respectively, which give the output size of each layer:

I_c(x, y) = K ∗ I, of size (x − m + 1) × (y − m + 1)   (4.2)
I_c(x, y) = K ∗ I, of size x_n × y_n, where x_n = (x − m)/n + 1 and y_n = (y − m)/n + 1   (4.3)
I_c(x, y) of size (x_n / 2) × (y_n / 2) after 2×2 max pooling   (4.4)

Here I_c(x, y) ∈ T_t is the input image from the training set; a kernel K of size (m, m) and a stride of (n, n) are used.
After this, activation is performed using the ReLU function on I_c(x, y):

I_c(x, y) = max(0, x)   (4.5)

This output is then flattened into a single vector and fed to the dense layer. A dropout ratio of 0.5 is added at the fully connected layer to avoid over-fitting, and a dense layer with 124 neurons is linked as the fully connected layer.
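The layer-size arithmetic of Eqs. 4.2–4.4 can be checked with two small helpers (the helper names are ours; integer division covers the stride-n case):

```python
def conv_out(x, m, n=1):
    """Output width of a valid convolution with an m x m kernel and stride n:
    (x - m) // n + 1, which reduces to x - m + 1 when n = 1 (Eqs. 4.2-4.3)."""
    return (x - m) // n + 1

def pool_out(x, p=2):
    """Output width after p x p max pooling (Eq. 4.4)."""
    return x // p

# a 224x224 input through a 3x3 convolution (stride 1), then 2x2 max pooling
w = conv_out(224, 3)   # 222
w = pool_out(w)        # 111
```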
A Leaky Rectified Linear Unit (Leaky ReLU) is used to introduce non-linearity into the CNN. Categorical cross-entropy is used as the cost function, given in Eq. (4.6):

CE = −log( e^{S_p} / Σ_{j=1}^{C} e^{S_j} )   (4.6)

where S_p is the CNN score for the positive class, C is the number of classes, and S_j is the score for each class j in C. The model is then optimized using Adam, an adaptive gradient-based optimization method. Probabilities are calculated at the final layer using the softmax function of Eq. (4.7):

f(s)_j = e^{s_j} / Σ_{k=1}^{C} e^{s_k}   (4.7)

The trained FiST_CNN model is then saved and used for the prediction of gestures in the testing group.
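Eqs. 4.6 and 4.7 can be sketched in NumPy; subtracting the maximum score before exponentiation is a standard numerical-stability trick and not part of the equations themselves.

```python
import numpy as np

def softmax(scores):
    """Eq. 4.7: f(s)_j = exp(s_j) / sum_k exp(s_k) over the C class scores."""
    e = np.exp(scores - np.max(scores))   # stability shift; result is unchanged
    return e / e.sum()

def categorical_cross_entropy(scores, positive):
    """Eq. 4.6: CE = -log(exp(S_p) / sum_j exp(S_j)) for positive class p."""
    return float(-np.log(softmax(scores)[positive]))

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)                        # probabilities summing to 1
loss = categorical_cross_entropy(scores, 0)    # small when class 0 scores highest
```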
• For an extensive evaluation of the proposed algorithm, four publicly available datasets were used.
• The proposed work has been tested on:
a) uniform-background datasets
b) complex-background datasets.
• The uniform datasets are ISL and MNIST, and the complex-background datasets are Jochen Triesch (JTD) and NUS hand posture-II.
• Data augmentation is applied to both the uniform and complex datasets.
i. MNIST: This dataset contains images for numeric (0 to 9) gestures and is available at [53]. It has 2062 images, with 206 images per gesture, as shown in Figure 4.2.
Figure 4.2. Sample images from MNIST dataset
ii. ISL: This dataset contains images for alphabet gestures except J and Z, as those require motion. It has been taken from a GitHub project [54] and consists of 4962 images, with more than 200 images per gesture. Sample images are shown in Figure 4.3.
Figure 4.3. Sample images of ISL dataset
i. Jochen Triesch (JTD): This dataset contains static gestures collected from 24 subjects against dark, light, and complex backgrounds; sample images are shown in Figure 4.4. It is available at [35]. The images are converted to greyscale before applying the proposed approach. The dataset has a total of 2127 images across ten different classes.
Figure 4.4. Sample images from JTD
ii. NUS-II: This is another hand-posture dataset with complex backgrounds, containing images for both training and testing. Gestures in the training set were collected from 40 subjects against complex backgrounds; it includes 2000 images categorized into 10 different classes. Samples of this dataset are shown in Figure 4.5. The test set has 750 images collected from 15 subjects under different lighting conditions. It is available at [44].
Figure 4.5. Sample images from NUS hand posture-II dataset
For the evaluation of FiST_CNN, the following performance metrics are considered:
i. Accuracy: Accuracy is the number of correct predictions made by the model over all predictions made. The accuracy of FiST_CNN is computed from correct gesture predictions.
ii. Confusion Matrix: The confusion matrix is used to summarize performance at the classification stage on a set of validation data whose values are mapped from the training data.
iii. Computational Time: The total processing time of the model, computed from image pre-processing to the prediction of the label.
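The accuracy and confusion-matrix metrics can be sketched without any ML library (the function names are ours):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Correct predictions over all predictions made."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]
acc = accuracy(y_true, y_pred)             # 0.8
cm = confusion_matrix(y_true, y_pred, 3)   # diagonal holds the correct counts
```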
• The algorithm has been implemented in Python 3 with Jupyter Notebook, and the simulation is done on an Intel® Core™ processor with 8 GB RAM, 256 KB cache per core (3 MB cache in total), and a GPU with 1536 MB VRAM.
• The dataset is split into two parts, training (70%) and testing (30%), as per industry standards.
• The main objective of the performance analysis of FiST_CNN is to maximize the accuracy of the model with reduced computational complexity.
Figure 4.7. Accuracy comparison per epochs for alphabet set
Figure 4.6. Accuracy comparison of FiST_CNN, CNN and SIFT_CNN
Figure 4.8. Accuracy comparison per epochs for numbers
Table 4.1. Accuracy matrix for FiST_CNN (%)
            Training Accuracy   Validation Accuracy
Alphabets   97.89               95.43
Numbers     95.68               92.83
Figure 4.9. Accuracy evaluation for ISL alphabets Figure 4.10. Loss evaluation for ISL alphabets
Figure 4.12. Loss evaluation for numbers
Figure 4.11. Accuracy evaluation for numbers
Figure 4.13. Time comparison of FiST_CNN, CNN and SIFT_CNN
Table 4.2. The feature vector for the ISL alphabet Table 4.3. The feature vector for the ISL number
Figure 4.14. Recognition accuracy on a uniform background
Figure 4.15. Recognition accuracy on a complex background
Table 4.4. Comparative evaluation of FiST_CNN
Author | SL | Technique | Dataset | Background | Accuracy
[93] | ISL | SIFT_CNN | 5000 | Uniform | 92.78%
[208] | ISL | HOG | 1300 | Uniform | 92.20%
[181] | ISL | HOG | 780 | Uniform | <80%
[209] | ARSL | Skin-Blob Tracking | 30 Signs | Uniform | 97%
[161] | ISL | CNN | 35000 | Uniform | 99.72%
[38] | ISL | CNN | 52000 | Uniform | 99.40%
[210] | ASL | Gabor-edge | 720 | Dark, Light, Complex | 86.2%
[211] | ASL | MCT | 720 | Uniform, Complex | 99.2%, 89.8%
[212] | ASL | MOGP | - | Complex | 91.4%
[213] | ISL | Krawtchouk | 1865 | Uniform | 97.9%
[142] | ISL | TOPSIS | 2600 | Complex | 92%
[22] | ASL | Fusion (HOG+LBP) | 2000 | Complex | 95.09%
[232] | ASL | Deep learning with CNN | 2000 | Complex | 94%
Proposed | ISL | FiST_CNN | 4962 | Uniform | 95.56%
Table 4.5. Comparison of work with JTD and NUS-II
Dataset | Author/Approach | Classifier | Accuracy
JTD | Triesch et al. [210] | Gabor edge filter | 86.2%
JTD | MCT [211] | AdaBoost | 98%
JTD | MOGP [212] | SVM | 91.4%
JTD | LHFD [129] | SVM | 95.2%
JTD | Cubic kernel [215] | CNN | 91%
JTD | Joshi et al. [142] | SVM | 92%
JTD | Kelly et al. [216] | SVM | 93%
JTD | X. Y. Wu [217] | CNN | 98.02%
JTD | FiST_CNN | CNN | 94.78%
NUS | Kaur et al. [213] | SVM | 92.50%
NUS | Adithya et al. [232] | SVM | 92.50%
NUS | Pisharady et al. [207] | SVM | 94.36%
NUS | Haile et al. [218] | RTDD | 90.66%
NUS | Kumar et al. [219] | SVM | 94.6%
NUS | Zhang et al. [22] | - | 95.07%
NUS | FiST_CNN | CNN | 95.56%
Figure 4.16. Confusion matrix of FiST_CNN for ISL alphabets
Figure 4.17. Confusion matrix of FiST_CNN for ISL Numbers
Figure 4.18. Confusion matrix for NUS hand posture-II dataset
Figure 4.19. Confusion matrix for JTD dataset
Table 4.6. Precision, Recall and F1 score for FiST_CNN (%)

| Sign | Precision | Recall | F1 score | Sign | Precision | Recall | F1 score |
|---|---|---|---|---|---|---|---|
| A | 100 | 100 | 100 | S | 100 | 100 | 100 |
| B | 100 | 100 | 100 | T | 100 | 99 | 99 |
| C | 100 | 100 | 100 | U | 100 | 92 | 96 |
| D | 98 | 100 | 99 | V | 86 | 100 | 93 |
| E | 100 | 100 | 100 | W | 100 | 95 | 97 |
| F | 100 | 100 | 100 | X | 100 | 100 | 100 |
| G | 100 | 100 | 100 | Y | 100 | 100 | 100 |
| H | 100 | 100 | 100 | ZERO | 98 | 98 | 98 |
| I | 100 | 100 | 100 | ONE | 98 | 100 | 99 |
| K | 100 | 100 | 100 | TWO | 98 | 88 | 92 |
| L | 100 | 97 | 98 | THREE | 100 | 96 | 98 |
| M | 100 | 100 | 100 | FOUR | 90 | 94 | 92 |
| N | 99 | 100 | 99 | FIVE | 95 | 100 | 97 |
| O | 100 | 100 | 100 | SIX | 87 | 94 | 90 |
| P | 100 | 100 | 100 | SEVEN | 92 | 87 | 89 |
| Q | 100 | 100 | 100 | EIGHT | 90 | 96 | 93 |
| R | 100 | 100 | 100 | NINE | 100 | 96 | 98 |
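The per-sign scores in Table 4.6, and in Tables 4.7 and 4.8 below, follow the standard definitions of precision, recall, and F1. A minimal sketch, with the function name and the example counts chosen for illustration:

```python
# Illustrative helper (not thesis code): per-class precision, recall and
# F1 score computed from raw confusion counts for one sign class.
def per_class_metrics(tp, fp, fn):
    """Return (precision, recall, f1) for one sign class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. a class with 99 hits, 1 false positive and 1 miss
p, r, f1 = per_class_metrics(99, 1, 1)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.99 0.99 0.99
```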
Table 4.7. Precision, Recall and F1 score for JTD

| Sign | Precision | Recall | F1 score |
|---|---|---|---|
| a | 0.64 | 0.82 | 0.71 |
| b | 0.81 | 0.68 | 0.73 |
| c | 0.78 | 0.76 | 0.76 |
| d | 0.80 | 0.63 | 0.70 |
| i | 0.69 | 0.60 | 0.64 |
| l | 0.78 | 0.79 | 0.78 |
| g | 0.58 | 0.80 | 0.67 |
| h | 0.72 | 0.69 | 0.70 |
| v | 0.69 | 0.71 | 0.69 |
| y | 0.92 | 0.82 | 0.86 |

Table 4.8. Precision, Recall and F1 score for NUS-II

| Sign | Precision | Recall | F1 score |
|---|---|---|---|
| a | 0.77 | 0.73 | 0.74 |
| b | 0.50 | 0.66 | 0.56 |
| c | 0.53 | 0.61 | 0.56 |
| d | 0.63 | 0.65 | 0.63 |
| e | 0.76 | 0.50 | 0.60 |
| f | 0.80 | 0.68 | 0.73 |
| g | 0.69 | 0.71 | 0.69 |
| h | 0.70 | 0.76 | 0.72 |
| i | 0.62 | 0.71 | 0.66 |
| j | 0.73 | 0.83 | 0.77 |
• A hybrid technique, FiST_CNN, has been developed for effective and efficient feature extraction from static ISL gestures.
• First, features are detected using FAST, which finds keypoints rapidly. SIFT is then used to describe the keypoints in an invariant and distinctive way. Finally, classification is performed with a CNN.
• The performance of the proposed FiST_CNN has been compared with two other techniques, CNN and SIFT_CNN [93].
• The results in Section 4.4 show that FiST_CNN is superior to both CNN and SIFT_CNN [93] in terms of accuracy and computation time. FiST_CNN achieved accuracies of 97.89%, 95.68%, 94.90% and 95.87% on ISL alphabets, MNIST, JTD and NUS-II, respectively.
• Although the proposed hybrid technique is effective and efficient for feature extraction of ISL gestures, there is still scope for further reducing the number of features for efficient recognition of the various ISL gestures.
• Analysis of the shape and geometry of the hand provides the essential features of the hand, as shown in Figure 5.1.
• These methods have shown impressive results, giving high recognition accuracy without using any sensor devices.
• They follow the state-of-the-art approach of locating a set of essential keypoints that represent coordinate positions with the help of neural network models.
• For ISLR, certain points need to be kept in mind before selecting the image pre-processing technique. Extracting the hand shape accurately not only enhances accuracy but also reduces space and time complexity.
• The hand shape can be obtained using pre-processing techniques such as hand segmentation, binary hand, hand contour, and 3D hand models.
• The essential keypoints of hand motion are the hand coordinates, the motion trajectories, and two-hand motion.
Figure 5.1. Hand anatomy with 21 keypoints
• In the proposed work, a three-stage algorithm based on hand anatomy and geometry is proposed.
• First, the palm is detected in the image.
• In the second stage, keypoints are detected on the gesture using state-of-the-art techniques.
• In the third stage, prior information about geometrical features and hand kinematics is used to locate the 21 keypoints on the hand.
• A Neural Network (NN) is then used for training and classification of the gestures.
• The dataset is collected through a camera at different orientations, scales, and illuminations, for ISL words belonging to education, medical, and other real-life domains.
• These steps are repeated for all the training images, and results are then generated on the testing images.
Figure 5.2. Flowchart of FiST_HGNN
• Palm detection: The basic principle is that an object in an image has pixels of similar intensity within a particular region. The image is divided into an $x \times x$ grid, where the value of $x$ is calculated from the image size. This process is iterated until the hand region is extracted, for all images in the dataset. This stage takes all the image pixels as input and can be formulated as:

$$I_o = \{P_v, f(x)\}, \quad x \in \{1, \dots, x-1\} \qquad (5.1)$$

where $I_o$ is the output image of this phase, $P_v$ denotes the pixel value of the image, and $f(x)$ are the features extracted from image $x$. This phase isolates the hand region, making it easier to detect the keypoints.
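The grid-partition idea behind Eq. (5.1) can be sketched as follows; the cell count, threshold, and synthetic image are illustrative assumptions, not the thesis's actual values:

```python
# Minimal sketch of grid-based palm localisation: the image is split into an
# x*x grid of cells, and cells whose mean intensity exceeds a threshold are
# kept as candidate hand regions.
def hand_cells(img, x, thresh):
    h, w = len(img), len(img[0])
    ch, cw = h // x, w // x
    keep = []
    for i in range(x):
        for j in range(x):
            cell = [img[r][c]
                    for r in range(i * ch, (i + 1) * ch)
                    for c in range(j * cw, (j + 1) * cw)]
            if sum(cell) / len(cell) > thresh:  # bright cell -> likely hand
                keep.append((i, j))
    return keep

# Synthetic 8x8 image: a bright "hand" patch in the top-left quadrant
img = [[200 if r < 4 and c < 4 else 0 for c in range(8)] for r in range(8)]
print(hand_cells(img, x=2, thresh=50))  # [(0, 0)]
```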
The next step is to detect keypoints and remove redundant pixels. For each candidate point, the sum of the absolute differences between the centre pixel and the pixels on the contiguous arc is computed as a score function $v$; when the $v$ values of two neighbouring keypoints are compared, the lower one is discarded:

$$I_k(x, y) = \begin{cases} 0, & v(x, y) \le \tau \\ 1, & v(x, y) > \tau \end{cases} \qquad (5.2)$$

where $\tau$ is the threshold value; pixels whose score exceeds $\tau$ are selected for further processing, and the others are discarded. The next step is to localise the detected keypoints. Blurred image octaves are then created using the Gaussian blur operator, with the scale-space function defined as Eq. (5.3):

$$L(x, y, \sigma) = G(x, y, \sigma) * I_k(x, y) \qquad (5.3)$$
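The thresholded score test of Eq. (5.2) can be sketched as below. This is a simplified stand-in: an 8-neighbour ring rather than FAST's 16-pixel Bresenham circle, with an illustrative threshold:

```python
# Sketch of the score test: for each candidate pixel, v is the sum of
# absolute differences between the centre pixel and the pixels on a
# surrounding ring; the pixel is kept only when v exceeds the threshold tau.
def keypoint_mask(img, tau):
    h, w = len(img), len(img[0])
    ring = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    mask = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = sum(abs(img[y][x] - img[y + dy][x + dx]) for dy, dx in ring)
            mask[y][x] = 1 if v > tau else 0
    return mask

# A single bright pixel on a dark background scores v = 8 * 200 = 1600
img = [[0, 0, 0], [0, 200, 0], [0, 0, 0]]
print(keypoint_mask(img, tau=1000)[1][1])  # 1
```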
where $*$ denotes the convolution operator and $G(x, y, \sigma)$ denotes the variable-scale Gaussian. To find scale-invariant keypoints, the Laplacian-of-Gaussian (LoG) approximation is applied; at this phase, the output is the image with keypoints located on the hand. The output of this stage can be formulated as:

$$I_k = \sum_{k_i} \left\lVert P_h^{k_i} - L_{k_i} \right\rVert_2 \qquad (5.4)$$

where $k_i$ denotes the keypoints, $P_h^{k_i}$ denotes the pixels with higher threshold values, and $L_{k_i}$ denotes the pixels with lower threshold values; $\lVert P_h^{k_i} - L_{k_i} \rVert_2$ denotes the $L_2$ norm of the pixel differences. The output of this stage is the image with keypoints located on the higher-intensity pixels.
After detecting the keypoints, a two-dimensional array of coordinates is obtained, where each coordinate corresponds to one of the keypoints on the hand. The location of keypoint $k_i$ is denoted by $P_{ki} = (x_{ki}, y_{ki})$, and the location of the wrist is denoted by $P_w = (x_w, y_w)$. To locate the wrist coordinate, the distance between the two extreme points ($w_x$ and $w_y$) is calculated, as shown in the equation below:

$$w_{xy} = \frac{\theta\,(w_x + w_y)}{2} \qquad (5.5)$$

where $w_{xy}$ is the wrist centre point, $\theta$ is the angle associated with the wrist, and $w_x$ and $w_y$ are the wrist coordinates on the $x$ and $y$ axes, respectively.
Further, a template is created for each joint using the method described in [60]. The distance of each joint is calculated using:

$$D_j = \sqrt{\frac{x_{ki}^2 + y_{ki}^2 + \phi}{2\cos(x_{ki}, y_{ki}) + \tau}}, \quad \text{when } \phi \ge 0 \qquad (5.6)$$

$$\phi = 2w^2 - (x_{ki} - y_{ki})^2 \qquad (5.7)$$

where $x_{ki}$ and $y_{ki}$ are the distances along the $x$ and $y$ coordinates, respectively, with $x_{ki} = \min(D_{x+1,k}, D_{x-1,k})$ and $y_{ki} = \min(D_{y,k+1}, D_{y,k-1})$, and $\phi$ denotes the angle of the wrist with respect to the $x$ and $y$ axes.
• Distance between coordinates:

$$d(c_1, c_2) = \sqrt{(c_{x1} - c_{x2})^2 + (c_{y1} - c_{y2})^2}$$

Example: $d = \sqrt{(143.013 - 113.916)^2 + (105.53 - 94.364)^2} = \sqrt{962} \approx 31.01$

• Angle between coordinates:

$$\cos\theta_{pq} = \frac{p_x q_x + p_y q_y + p_z q_z}{\sqrt{p_x^2 + p_y^2 + p_z^2}\,\sqrt{q_x^2 + q_y^2 + q_z^2}}$$

Example: vectors $\vec{p} = (1, 0, 1)$ and $\vec{q} = (1, 1, 0)$ for joint (12, 18):

$$\cos\theta_{pq} = \frac{(1)(1) + (0)(1) + (1)(0)}{\sqrt{1+0+1}\,\sqrt{1+1+0}} = \frac{1}{2}, \quad \text{so } \theta = 60°$$

Figure 5.3. Distance and angle between the coordinates
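The two computations of Figure 5.3 can be reproduced directly; the function names here are illustrative, not thesis code:

```python
import math

# Euclidean distance between two 2-D keypoint coordinates, and the angle
# between two 3-D joint vectors, as worked out in Figure 5.3.
def distance(c1, c2):
    return math.hypot(c1[0] - c2[0], c1[1] - c2[1])

def angle_deg(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return math.degrees(math.acos(dot / norms))

print(round(angle_deg((1, 0, 1), (1, 1, 0))))  # 60
print(round(distance((143.013, 105.53), (113.916, 94.364)), 2))
```

Note that recomputing the distance from the coordinates listed in the figure gives about 31.17 rather than the 31.01 (√962) shown, which suggests the slide's intermediate value was rounded.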
A set of labelled training samples is provided, assuming that in the feature space these features cluster around multiple centres. FiST_HGNN is a feedforward network with four hidden layers: sixty-three neurons at the input layer; 128, 64, 32, and 16 neurons at the first, second, third, and fourth hidden layers, respectively; and twenty-five neurons at the output layer, as shown in Figure 5.6. The sixty-three input neurons correspond to the keypoint vector $k_i$ for each gesture category $g_c$.
Figure 5.6. Architecture of NN
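Given the layer sizes stated above (63, 128, 64, 32, 16, 25), the total number of trainable weights and biases of the fully connected network can be checked with a short sketch; the reading that 63 = 21 keypoints × 3 values is an assumption consistent with the 21-keypoint hand model:

```python
# Sketch of the FiST_HGNN classifier head: a fully connected feedforward
# network with a 63-neuron input layer, hidden layers of 128, 64, 32 and 16
# neurons, and a 25-class output layer.
layers = [63, 128, 64, 32, 16, 25]

def dense_param_count(sizes):
    """Weights + biases of a fully connected network with these layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

print(dense_param_count(layers))  # 19481
```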
Prediction graph of FiST_HGNN
• To validate the performance of the proposed model, two types of datasets are considered.
• The first category consists of isolated single letter gestures, and in the second category, we considered ISL
word gestures with both uniform and complex backgrounds.
• Isolated hand-letter signs (gestures representing alphabets and digits) from the ISL [22][23] and NUS-II [25] datasets are considered.
• In the two-hand ISL category, gestures for the ISL alphabets, digits, and words [26] have been considered.
• A detailed description of the used dataset is given in Table 5.3.
Table 5.3. Description of dataset
I. Performance metrics: The performance of the proposed model (FiST_HGNN) is evaluated using accuracy and the time taken by the model. Accuracy is the percentage of correct predictions out of all predictions made:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5.8)$$

The time taken by the model is the total time for training and testing the gestures.
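The accuracy formula above, computed from confusion-matrix counts (an illustrative helper, not thesis code):

```python
# Accuracy from confusion-matrix counts: correct predictions (TP + TN)
# divided by all predictions.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=90, tn=5, fp=3, fn=2))  # 0.95
```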
Figure 5.9. Accuracy comparison on isolated single-letter gestures

| Technique | ISL-alphabet (%) | Number (%) |
|---|---|---|
| FiST_HGNN | 97.58 | 95.68 |
| FiST_CNN | 94.75 | 89.23 |
| PointBased+Fullhand | 95.00 | 92.34 |

Ø FiST_HGNN improvement on ISL-alphabet: 2.83% over FiST_CNN; 2.58% over PointBased+Fullhand
Ø FiST_HGNN improvement on numbers: 6.45% over FiST_CNN; 3.34% over PointBased+Fullhand
Figure 5.10. Accuracy comparison on ISL word gestures

| Technique | Sign_word_Dataset (%) | ISL Alphabets and Numbers (%) | Ankita_wadhwan (%) |
|---|---|---|---|
| FiST_HGNN | 98.88 | 98.00 | 97.14 |
| FiST_CNN | 94.00 | 95.03 | 94.23 |
| PointBased+Fullhand | 96.19 | 96.00 | 95.47 |

Ø FiST_HGNN improvement on Sign_word: 4.88% over FiST_CNN; 2.69% over PointBased+Fullhand
Ø On ISL alphabets and numbers: 2.97% over FiST_CNN; 2% over PointBased+Fullhand
Ø On Ankita Wadhwan: 2.91% over FiST_CNN; 1.67% over PointBased+Fullhand
Table 5.4. Time comparison on different datasets

| Dataset | FiST_HGNN (s) | FiST_CNN [90] (s) | PointBased+Fullhand [221] (s) |
|---|---|---|---|
| Isolated single letter: ISL alphabets [226] and digits [205] | 3217.43 | 4628.34 | 3924.45 |
| Isolated single letter: NUS-II [207] | 778.54 | 889.56 | 990.76 |
| ISL word gestures: ISL alphabets and numbers [227] | 2287.67 | 2987.90 | 3130.87 |
| ISL word gestures: Sign-Word [228] | 3278.65 | 5678.89 | 3657.34 |
| ISL word gestures: Ankita Wadhwan [161] | 1543.22 | 3189.43 | 2164.59 |
Table 5.5. Accuracy comparison with other approaches

| Dataset | Author name/Approach used | Classifier | Accuracy (%) |
|---|---|---|---|
| NUS-II [207] | Kaur et al. [213] | SVM | 92.50 |
| NUS-II [207] | Adithya et al. [232] | SVM | 92.50 |
| NUS-II [207] | Pisharady et al. [207] | SVM | 94.36 |
| NUS-II [207] | Kumar et al. [219] | SVM | 94.60 |
| NUS-II [207] | FiST_HGNN | NN | 95.78 |
| ISL alphabets [226] and digits [205] | Ansari and Harit [225] | NN | 63.78 |
| ISL alphabets [226] and digits [205] | Kaur et al. [213] | SVM | 90.00 |
| ISL alphabets [226] and digits [205] | Joshi et al. [142] | SVM | 93.40 |
| ISL alphabets [226] and digits [205] | Rekha et al. [151] | SVM | 91.30 |
| ISL alphabets [226] and digits [205] | Rao et al. [92] | ANN | 90.00 |
| ISL alphabets [226] and digits [205] | FiST_HGNN | NN | 97.58 |
Figure 5.11. Accuracy comparison of different classifiers on the isolated single-letter dataset

| Classifier | Accuracy (%) |
|---|---|
| SVM | 95.15 |
| MLP | 92.68 |
| KNN | 94.19 |
| NN | 97.58 |
Figure 5.12. Accuracy comparison of different classifiers on double-handed ISL alphabets and numbers

| Classifier | Accuracy (%) |
|---|---|
| SVM | 97.13 |
| MLP | 94.68 |
| KNN | 93.19 |
| NN | 98.00 |
Figure 5.13. Confusion matrix on isolated alphabets and numbers
Figure 5.14. Confusion matrix on NUS-II
Figure 5.15. Confusion matrix on sign word dataset
Figure 5.16. Confusion matrix on double-handed alphabets and numbers
Table 5.6. Precision, Recall and F1 Score for ISL alphabets and numbers

| Sign | Precision | Recall | F1 score | Sign | Precision | Recall | F1 score |
|---|---|---|---|---|---|---|---|
| 0 | 0.86 | 0.96 | 0.91 | H | 1.00 | 1.00 | 1.00 |
| 1 | 0.87 | 0.87 | 0.87 | I | 0.95 | 1.00 | 0.97 |
| 2 | 1.00 | 0.92 | 0.97 | K | 0.96 | 0.98 | 0.97 |
| 3 | 1.00 | 1.00 | 1.00 | L | 1.00 | 0.97 | 0.99 |
| 4 | 0.96 | 0.92 | 0.94 | M | 1.00 | 1.00 | 1.00 |
| 5 | 1.00 | 1.00 | 1.00 | N | 1.00 | 1.00 | 1.00 |
| 6 | 0.83 | 0.83 | 0.83 | O | 0.97 | 0.94 | 0.96 |
| 7 | 0.90 | 0.90 | 0.90 | P | 1.00 | 0.98 | 0.99 |
| 8 | 1.00 | 0.83 | 0.91 | Q | 1.00 | 1.00 | 1.00 |
| 9 | 0.97 | 0.94 | 0.96 | R | 0.95 | 0.97 | 0.96 |
| A | 0.98 | 1.00 | 0.99 | S | 1.00 | 0.98 | 0.99 |
| B | 1.00 | 1.00 | 1.00 | T | 1.00 | 1.00 | 1.00 |
| C | 0.97 | 0.97 | 0.97 | U | 1.00 | 1.00 | 1.00 |
| D | 0.93 | 0.96 | 0.95 | V | 1.00 | 1.00 | 1.00 |
| E | 1.00 | 1.00 | 1.00 | W | 0.93 | 1.00 | 0.96 |
| F | 1.00 | 1.00 | 1.00 | X | 0.97 | 1.00 | 0.99 |
| G | 1.00 | 1.00 | 1.00 | Y | 1.00 | 1.00 | 1.00 |
Table 5.7. Precision, Recall and F1 Score for the sign word dataset

| Sign | Precision | Recall | F1 score |
|---|---|---|---|
| Call | 1.00 | 1.00 | 1.00 |
| Close | 1.00 | 0.99 | 1.00 |
| Cold | 1.00 | 1.00 | 1.00 |
| Correct | 1.00 | 1.00 | 1.00 |
| Fine | 0.99 | 1.00 | 1.00 |
| Help | 1.00 | 1.00 | 1.00 |
| Home | 1.00 | 1.00 | 1.00 |
| ILoveYou | 1.00 | 1.00 | 1.00 |
| Like | 1.00 | 1.00 | 1.00 |
| Love | 1.00 | 1.00 | 1.00 |
| No | 1.00 | 1.00 | 1.00 |
| Okk | 1.00 | 1.00 | 1.00 |
| Please | 1.00 | 1.00 | 1.00 |
| Single | 1.00 | 1.00 | 1.00 |
| Sit | 0.99 | 1.00 | 1.00 |
| Tall | 1.00 | 1.00 | 1.00 |
| Wash | 1.00 | 0.99 | 0.99 |
| Work | 1.00 | 1.00 | 1.00 |
| Yes | 0.99 | 1.00 | 1.00 |
| You | 1.00 | 1.00 | 1.00 |
Table 5.9. Precision, Recall and F1 Score for NUS-II

| Sign | Precision | Recall | F1 score |
|---|---|---|---|
| a | 1.00 | 1.00 | 1.00 |
| b | 0.98 | 1.00 | 0.99 |
| c | 0.97 | 0.97 | 0.97 |
| d | 1.00 | 1.00 | 1.00 |
| i | 1.00 | 0.95 | 0.97 |
| l | 1.00 | 0.97 | 0.98 |
| g | 1.00 | 1.00 | 1.00 |
| h | 0.95 | 0.97 | 0.96 |
| v | 0.93 | 0.95 | 0.94 |
| y | 1.00 | 1.00 | 1.00 |
• A hand anatomy-based technique for recognising ISL gestures has been proposed. FiST_HGNN hybridises the earlier FiST with hand geometry to recognise ISL gestures.
• The FiST technique provides rapid detection of keypoints and generates a 128-dimensional feature vector. The twenty-one relevant keypoints are then selected from this feature vector using hand geometry.
• A multilayer feedforward NN is used as the classifier. FiST_HGNN was tested on two categories of ISL datasets (isolated single letters and words).
• FiST_HGNN achieves an accuracy of 97.58% for isolated gestures and 98.88% for word gestures, outperforming the other approaches in the literature.
• However, FiST_HGNN has been tested on only a few real-world gestures. In the next chapter, it is therefore applied to functional gestures commonly used in real-world situations by the deaf and mute community.
6.1.1 Self-Created Dataset
The dataset contains the RGB images of hand gestures of twenty static ISL words, namely, ‘afraid’, ‘agree’, ‘assistance’,
‘bad’, ‘become’, ‘college’, ‘doctor’, ‘from’, ‘pain’, ‘pray’, ‘secondary’, ‘skin’, ‘small’, ‘specific’, ‘today’, ‘stand’,
‘warn’, ‘which’, ‘work’, ‘you’, which are commonly used to convey messages or seek support during real-life usage
[229].
The captured gestures are chosen from the ISL dictionary [58].
The images were gathered from 8 individuals (6 males and 2 females), aged 9 to 30 years.
Nine hundred images are captured for each gesture, so a total of 18000 images are collected.
Figure 6.1. Categorization of the self-created dataset
Figure 6.2. Sample images of the self-created dataset
Table 6.1. Specification of the self-created dataset

| Field | Description |
|---|---|
| Subject | Computer Vision and Pattern Recognition (CVPR) |
| Specific subject type | SL recognition |
| Type of data | Images (200×200 pixels, JPG format) |
| Data acquisition method | The images were captured by asking participants to stand comfortably in front of a wall, using a smartphone camera (iPhone XI). |
| Data format | Labelled RGB images |
| Data collection parameters | All images were captured against a plain background. The volunteers included both males and females with a range of hand sizes. Images were collected in an indoor environment under normal lighting conditions. To keep the gesture displays as genuine as possible, no limitations on the pace of hand motions were enforced. |
| Data source location | BLOOM Speech and Hearing Clinic, Dehradun, Uttarakhand, India |
| Data accessibility | Accessible via the Mendeley link: Tyagi, Akansha; Bansal, Sandhya (2022), "Indian Sign Language - Real-life Words", Mendeley Data, V2, doi:10.17632/s6kgb6r3ss.2 |
Table 6.2. Organization of the images in the self-created dataset

| Folder | File names | Description |
|---|---|---|
| Afraid | afraid_1_user1_1 to afraid_900_user6_150 | Nine hundred samples of the 'Afraid' gesture, taken from six users |
| Agree | agree_1_user1_1 to agree_900_user6_150 | Nine hundred samples of the 'Agree' gesture, taken from six users |
| Assistance | assistance_1_user1_1 to assistance_900_user6_150 | Nine hundred samples of the 'Assistance' gesture, taken from six users |
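An illustrative parser (not part of the thesis) for the naming scheme in Table 6.2, assuming the four underscore-separated fields shown:

```python
# Parse a dataset file name such as 'afraid_1_user1_1' into its fields:
# gesture label, global sample index, user id, and user-local index.
def parse_name(fname):
    gesture, sample, user, idx = fname.split("_")
    return {"gesture": gesture, "sample": int(sample),
            "user": user, "index": int(idx)}

print(parse_name("afraid_1_user1_1"))
# {'gesture': 'afraid', 'sample': 1, 'user': 'user1', 'index': 1}
```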
BLOOM Speech and Hearing Clinic
Anjali Subramanium, Audiologist and Speech-Language Pathologist, ABA Certified (RBT), RCI Registration CRR No. A60779
Clinical services: speech and language therapy, voice disorders, neurological disorders, fluency disorders, autism spectrum disorders, articulation/unclear speech, learning disability, hearing test, hearing aid trial
Date: 18/1/22

To whomever it may concern,
This is to certify that Ms. Akansha, a Ph.D. research scholar, has collected a dataset from our clinic. They are the hard-of-hearing patients, and we do not have any obligations regarding the dataset collection. This letter can be considered as the No Objection Certificate. She can use the dataset in her research publication.

BLOOM Speech and Hearing Clinic, 118, Preet Vihar, Phase-2, Indra Gandhi Marg, Niranjanpur, Dehradun (U.K.)
Figure. Authorization letter for data collection
Figure 6.3. Categorization of the gestures for the existing dataset
i. Medical: Gestures such as 'Elbow', 'Help', 'Skin', 'Call', 'Doctor', 'Hot', 'Lose', 'Pain', 'Leprosy', 'Tobacco', 'Keep', 'Assistance', 'Beside', 'Glove', and 'Sample' are mainly used in the medical field. Sample gestures are shown in Figure 6.4.

Figure 6.4. Sample images for medical

ii. Measurement: Gestures such as 'High', 'How_Many', 'Thick', 'Thin', 'Density', 'Measure', 'Quantity', 'Few', 'Size', 'Unit', 'Little', 'Small', 'Weight', 'Gram', and 'Short'. Sample images from the dataset are shown in Figure 6.5.

Figure 6.5. Sample images for measurement
Figure 6.12. Accuracy comparison on the ISL daily-life words dataset [229]

| Technique | Accuracy (%) |
|---|---|
| CNN | 94.23 |
| FiST_CNN | 89.78 |
| FiST_HGNN | 98.78 |

Ø FiST_HGNN improvement: 4.55% over CNN; 9% over FiST_CNN
Figure 6.13. Accuracy comparison on different techniques for individual gesture from ISL daily-life words dataset
Figure 6.14. Accuracy comparison of different classifiers on ISL daily-life words
Table 6.3. Comparison based on the number of features for the self-created dataset

| Technique | No. of images | Image size | Time (s) | Parameters | Trainable parameters | Non-trainable parameters |
|---|---|---|---|---|---|---|
| CNN | 18000 | 200×200 | 4789.90 | 72 × 10⁷ | 72 × 10⁷ | 0 |
| FiST_CNN [90] | 18000 | 200×200 | 4278.32 | 72 × 10⁷ | 23 × 10⁶ | 71 × 10⁶ |
| FiST_HGNN | 18000 | 200×200 | 3789.75 | 11 × 10⁵ | 11 × 10⁵ | 0 |
Figure 6.15. Confusion matrix on self-created ISL daily-life words
Table 6.4. Precision, Recall and F1 Score for the self-created dataset

| S.No | Sign | Recall | Precision | F1-Score | Correctly identified gestures | Total gestures |
|---|---|---|---|---|---|---|
| 1 | Afraid | 0.99 | 0.98 | 0.98 | 171 | 172 |
| 2 | Agree | 0.99 | 1.00 | 0.99 | 173 | 173 |
| 3 | Assistance | 1.00 | 0.98 | 0.99 | 123 | 128 |
| 4 | Bad | 0.99 | 0.99 | 0.99 | 177 | 178 |
| 5 | Become | 0.99 | 1.00 | 0.99 | 152 | 154 |
| 6 | College | 0.99 | 0.98 | 0.98 | 167 | 167 |
| 7 | Doctor | 0.95 | 0.95 | 0.95 | 99 | 107 |
| 8 | From | 0.98 | 0.97 | 0.97 | 135 | 138 |
| 9 | Pain | 0.98 | 0.99 | 0.98 | 151 | 156 |
| 10 | Pray | 0.99 | 0.97 | 0.98 | 67 | 68 |
| 11 | Secondary | 0.99 | 0.97 | 0.98 | 176 | 177 |
| 12 | Skin | 0.97 | 0.98 | 0.97 | 171 | 172 |
| 13 | Small | 1.00 | 0.98 | 0.99 | 143 | 143 |
| 14 | Specific | 1.00 | 1.00 | 1.00 | 167 | 170 |
| 15 | Stand | 0.97 | 0.96 | 0.97 | 147 | 148 |
| 16 | Today | 0.98 | 1.00 | 0.99 | 172 | 175 |
| 17 | Warn | 0.97 | 0.99 | 0.98 | 172 | 177 |
| 18 | Which | 0.98 | 0.98 | 0.98 | 172 | 174 |
| 19 | Work | 0.99 | 0.98 | 0.99 | 186 | 186 |
| 20 | You | 0.97 | 1.00 | 0.98 | 160 | 160 |
Figure 6.16. Accuracy for general-purpose gestures on dataset [230]

| Technique | Accuracy (%) |
|---|---|
| SIFT_VFH & SIFT_NN | 90.68 |
| FiST_CNN | 94.73 |
| FiST_HGNN | 97.55 |

Ø FiST_HGNN improvement: 6.87% over SIFT_VFH & SIFT_NN; 2.82% over FiST_CNN
Figure 6.17. Accuracy for family and relatives gestures on dataset [231]

| Technique | Accuracy (%) |
|---|---|
| SIFT_CNN | 92.78 |
| FiST_CNN | 95.53 |
| FiST_HGNN | 98.85 |

Ø FiST_HGNN improvement: 6.07% over SIFT_CNN; 3.32% over FiST_CNN