"Isolated Sign Recognition with a Siamese Neural Network of RGB and Depth Streams" EUROCON 2019 Presentation
IEEE EUROCON 2019
1–4 July 2019, Novi Sad, Serbia
Isolated Sign Recognition with a Siamese
Neural Network of RGB and Depth Streams
Anil Osman TUR1,2, Hacer YALIM KELES1,3
1Ankara University Computer Engineering Department
2aotur@ankara.edu.tr, 3hkeles@ankara.edu.tr
Paper №: 02728
Motivation
To help solve communication problems between the deaf and hearing communities.
To provide a human-machine interface for controlling machines with human gestures in other applications.
Problem & Challenges
Recognizing isolated signs, independently of each other.
Each sign is a composition of hand, face, and body features.
High variance of the signs among different signers, e.g. body and pose variations, differences in sign duration, etc.
Multiple modalities of input information, each with its own challenges, e.g. illumination changes and occlusions.
Solution
1. To represent the inputs in a more effective feature space, we employed pretrained Convolutional Neural Networks (CNNs).
2. To classify the feature vectors generated by the CNNs, we need to interpret sequences, so Recurrent Neural Networks (RNNs) are used, specifically Long Short-Term Memory (LSTM) [4] and Gated Recurrent Unit (GRU) [5] models.
3. To generalize over the inputs and be robust to changes and variations, e.g. in lighting or the person signing, regularization methods are used during training.
Dataset
The Montalbano Gesture Dataset [1] is used in the experiments.
Video samples are 640×480 pixels, recorded at 20 fps.
20 different Italian hand gestures, performed by 27 different users.
The dataset includes clothing, lighting, and background changes.
[Figure: sample frames of the RGB, Depth, User-index, and Skeletal modalities]
1. S. Escalera, X. Baró, J. Gonzàlez, M.A. Bautista, M. Madadi, M. Reyes, V. Ponce, H.J. Escalante, J. Shotton, I. Guyon, "ChaLearn Looking at People Challenge 2014: Dataset and results". In: ECCV Workshop, 2014.
Preprocess
RGB and Depth inputs are cropped to 400×400-pixel square images.
A median filter is applied to both inputs.
The user-index data is used as a mask on the depth input for background subtraction.
The number of frames is fixed at 40.
[Figure: cropping of the RGB and Depth images]
Model Architecture
The convolutional parts of pretrained ResNet-50 [2] and VGG16 [3] models are used.
Global max pooling or global average pooling layers are applied to the outputs of the pretrained networks.
The pooling layer outputs are connected to fully-connected (FC) layers.
We experimented with FC layers in ReLU, Sigmoid, and ReLU + Batch Normalization configurations.
The RGB and Depth outputs of the FC layers are concatenated and connected to an LSTM.
The output of the LSTM is connected to a Softmax layer to classify the gestures.
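One forward pass of the pipeline described above can be sketched in plain NumPy (a toy sketch: the random arrays stand in for pretrained CNN feature maps, the layer sizes and initialization are illustrative assumptions, and whether the two FC streams share weights is not specified on the slide, so they are kept separate here):

```python
import numpy as np

rng = np.random.default_rng(0)

def global_avg_pool(fmap):          # fmap: (C, H, W) CNN feature map
    return fmap.mean(axis=(1, 2))   # -> (C,)

def fc_relu(x, W, b):               # fully-connected layer + ReLU
    return np.maximum(W @ x + b, 0.0)

def lstm_step(x, h, c, Wx, Wh, b):
    """One LSTM step; gates stacked as [input, forget, cell, output]."""
    H = h.size
    z = Wx @ x + Wh @ h + b
    i, f, g, o = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
    i, f, o = 1/(1+np.exp(-i)), 1/(1+np.exp(-f)), 1/(1+np.exp(-o))
    c = f * c + i * np.tanh(g)
    return np.tanh(c) * o, c

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative sizes: 2048-d ResNet-50 features, 256-d FC, 128-d LSTM, 20 classes.
C, D, H, K, T = 2048, 256, 128, 20, 40
W_rgb, b_rgb = rng.normal(0, 0.01, (D, C)), np.zeros(D)
W_dep, b_dep = rng.normal(0, 0.01, (D, C)), np.zeros(D)
Wx = rng.normal(0, 0.01, (4*H, 2*D))
Wh, b = rng.normal(0, 0.01, (4*H, H)), np.zeros(4*H)
W_out, b_out = rng.normal(0, 0.01, (K, H)), np.zeros(K)

h, c = np.zeros(H), np.zeros(H)
for t in range(T):                                        # 40 frames per clip
    f_rgb = global_avg_pool(rng.normal(size=(C, 7, 7)))   # stand-in CNN features
    f_dep = global_avg_pool(rng.normal(size=(C, 7, 7)))
    x = np.concatenate([fc_relu(f_rgb, W_rgb, b_rgb),
                        fc_relu(f_dep, W_dep, b_dep)])    # fuse the two streams
    h, c = lstm_step(x, h, c, Wx, Wh, b)
probs = softmax(W_out @ h + b_out)                        # gesture class distribution
```

The "Siamese" aspect of the model, the same pretrained CNN applied to both the RGB and Depth streams, is abstracted away into the stand-in feature maps here.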
Training
We used the Adam optimizer with a learning rate of 1e-4.
We chose a batch size of 16.
The pretrained models are used as feature extractors; no fine-tuning is applied to them.
We experimented with the L2 norm and Dropout as regularization methods.
We chose a lambda constant of 0.2 for the L2 norm and a drop probability of 0.5 for Dropout.
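The hyperparameters above can be wired together as follows (a minimal NumPy sketch of inverted dropout and an Adam step with the L2 term folded into the gradient; the class and function names are ours, not the authors' training code):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, train=True):
    """Inverted dropout: zero units with probability p, rescale at train time."""
    if not train:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

class Adam:
    """Adam update with an L2 penalty (lambda = 0.2) added to the gradient."""
    def __init__(self, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8, l2=0.2):
        self.lr, self.b1, self.b2, self.eps, self.l2 = lr, beta1, beta2, eps, l2
        self.m = self.v = None
        self.t = 0

    def step(self, w, grad):
        grad = grad + self.l2 * w          # L2 regularization term
        if self.m is None:
            self.m, self.v = np.zeros_like(w), np.zeros_like(w)
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```

With this formulation the L2 penalty shrinks the weights even when the task gradient is zero, which is what makes it a regularizer.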
Results
ResNet-50
With the ResNet-50 + LSTM network we reach 93.19% accuracy.

(a) ResNet-50 + LSTM, accuracy (%). Columns: Avg pooling (ReLU, Sigmoid, ReLU + Batch norm), then Max pooling (ReLU, Sigmoid, ReLU + Batch norm); rows with fewer than six values are incomplete in the source.
No regularization: 85.49, 84.70, 86.87, 87.96, 79.96, 92.10
L2: 87.17, 86.97, 85.78, 86.08, 86.97
Dropout: 89.34, 89.34, 93.19
Dropout + L2: 90.92, 54.89, 89.04

(b) ResNet-50 + GRU, accuracy (%), same columns.
No regularization: 89.04, 88.15, 85.59, 90.03, 86.87, 90.92
L2: 85.19, 80.75, 79.17, 82.43, 67.82
Dropout: 90.92, 85.09, 27.34, 91.91
Dropout + L2: 89.24, 89.63, 82.92, 81.54
Results
VGG16
With the VGG16 network we reach up to 91.61% accuracy.

(a) VGG16 + LSTM, accuracy (%). Columns: Avg pooling (ReLU, Sigmoid, ReLU + Batch norm), then Max pooling (ReLU, Sigmoid, ReLU + Batch norm); rows with fewer than six values are incomplete in the source.
No regularization: 87.27, 88.15, 87.56, 85.49, 83.32, 85.39
L2: 88.35, 86.57, 84.60, 87.36, 85.78, 84.01
Dropout: 89.24, 89.14, 87.86, 86.28, 88.25
Dropout + L2: 89.73, 88.25, 87.86, 88.55, 85.88

(b) VGG16 + GRU, accuracy (%), same columns.
No regularization: 89.63, 87.07, 85.39, 82.43, 87.96, 84.50
L2: 86.48, 68.41, 87.46, 54.59
Dropout: 91.51, 90.82, 89.34, 81.84, 90.03, 89.24
Dropout + L2: 91.61, 87.27, 87.46, 86.38, 87.46
Results
Summary
Pretrained ResNet-50 and VGG16 networks are used as feature extractors.
We obtained the best result, 93.19% accuracy, using ResNet-50 with LSTM.
We did not apply hand or face segmentation to the inputs.
We proposed a simple yet effective architecture.
We observed that when the LSTM model starts to memorize the training data, the GRU model avoids this memorization problem.
Acknowledgement
The research presented is part of a project funded by TÜBİTAK (The
Scientific and Technological Research Council of Turkey) under grant
number 217E022.
References
1. S. Escalera, X. Baró, J. Gonzàlez, M.A. Bautista, M. Madadi, M. Reyes, V. Ponce, H.J. Escalante, J. Shotton, I. Guyon, "ChaLearn Looking at People Challenge 2014: Dataset and results". In: ECCV Workshop, 2014.
2. K. He, X. Zhang, S. Ren, J. Sun, "Deep Residual Learning for Image Recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
3. K. Simonyan, A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition". arXiv preprint arXiv:1409.1556, 2014.
4. I. Sutskever, O. Vinyals, Q.V. Le, "Sequence to Sequence Learning with Neural Networks". In: Advances in Neural Information Processing Systems, pp. 3104-3112, 2014.
5. J. Chung, C. Gulcehre, K. Cho, Y. Bengio, "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv preprint arXiv:1412.3555, 2014.
Questions?
The End
Thank you for your attention.