Speech recognition is a key topic in artificial intelligence, as speech is one of the most common forms of human communication. Over the past decades, researchers have developed many speech-controlled prosthetic hands, typically built on conventional speech recognition systems that combine neural networks with hidden Markov models. Recent advances in general-purpose graphics processing units (GPGPUs) enable intelligent devices to run deep neural networks in real time, and state-of-the-art speech recognition systems have consequently shifted from the paradigm of optimizing composite subsystems to the paradigm of end-to-end optimization. However, low-power embedded GPGPUs cannot run these systems in real time. In this paper, we present a deep convolutional neural network (CNN) for speech control of prosthetic hands that runs in real time on an NVIDIA Jetson TX2 developer kit. First, the device captures speech and converts it into 2D features (such as a spectrogram). The CNN then receives the 2D features and classifies the hand gesture. Finally, the hand gesture class is sent to the prosthetic hand's motion control system. The whole system is written in Python with Keras, a deep learning library with a TensorFlow backend. Our experiments demonstrate 91% classification accuracy and a 2 ms running time for producing hand gesture classes (text output) from speech commands, which is fast enough to control prosthetic hands in real time.
2019 First International Conference on Transdisciplinary AI (TransAI), Laguna Hills, California, USA, 2019, pp. 35-42
1. Convolutional Neural Networks for Speech Controlled Prosthetic Hands
Date: 09/26/2019
2019 First International Conference on Transdisciplinary AI (TransAI)
Mohsen Jafarzadeh
Department of Electrical and Computer Engineering
The University of Texas at Dallas
Richardson, TX, USA
Mohsen.Jafarzadeh@utdallas.edu
Yonas Tadesse
Department of Mechanical Engineering
The University of Texas at Dallas
Richardson, TX, USA
Yonas.Tadesse@utdallas.edu
3. Introduction
• ~94,000 upper limb amputees in Europe
  • S. Micera, J. Carpaneto, and S. Raspopovic, “Control of Hand Prostheses Using Peripheral Information,” IEEE Reviews in Biomedical Engineering, vol. 3, pp. 48–68, 2010.
• ~41,000 upper limb amputees in the United States
  • K. Ziegler-Graham, E. J. MacKenzie, P. L. Ephraim, T. G. Travison, and R. Brookmeyer, “Estimating the prevalence of limb loss in the United States: 2005 to 2050,” Arch Phys Med Rehabil, vol. 89, no. 3, pp. 422–429, Mar. 2008.
• About 40 million amputees in the world
  • M. Marino et al., “Access to prosthetic devices in developing countries: Pathways and challenges,” in Proc. IEEE Annu. Global Humanitarian Technol. Conf., 2015, pp. 45–51.
4. Ways to command a prosthetic hand
• Push-buttons
• Joystick
• Keyboard
• Vision
• Electroencephalography (EEG)
• Electroneurography (ENG)
• Electromyography (EMG)
• Speech
5. Speech commanded prosthetic hands
1. Automatic speech recognition (ASR) system
  • maps speech to text
2. Look-up table
  • maps text to a command
3. Low-level controller & driver
  • maps commands and sensor data to electrical voltages
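A minimal Python sketch of stages 2 and 3, assuming a hypothetical gesture vocabulary and a hypothetical controller.drive interface (the slides do not specify the command set or the driver API):

# Stage 2: look-up table from recognized text to a hand command.
# Gesture names and per-finger set-points are illustrative only.
LOOKUP = {
    "open":  [0.0, 0.0, 0.0, 0.0, 0.0],
    "close": [1.0, 1.0, 1.0, 1.0, 1.0],
}

def command_hand(text, sensor_data, controller):
    # Stage 2: text -> command (trajectory); unknown words are ignored.
    trajectory = LOOKUP.get(text)
    if trajectory is None:
        return
    # Stage 3: the low-level controller maps the command plus sensor
    # data to electrical voltages (hypothetical driver interface).
    controller.drive(trajectory, sensor_data)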
6. Automatic Speech Recognition (ASR) Systems
• Traditional ASR systems have 4 subsystems
  • Preprocessing
  • Feature extraction
  • Language model
  • Classifier
    • Combination of Gaussian mixture models and hidden Markov models (GMM-HMM)
    • Combination of artificial neural networks and hidden Markov models (ANN-HMM)
• Recent ASR systems are end-to-end
8. Related Works
• CMU Sphinx used to control a surgical robot
  • K. Zinchenko, C.-Y. Wu, and K.-T. Song, “A Study on Speech Recognition Control for a Surgical Robot,” IEEE Transactions on Industrial Informatics, vol. 13, no. 2, pp. 607–615, 2017.
• Control of a hand exoskeleton
  • Combination of discrete wavelet transforms and hidden Markov models
  • S. Guo, Z. Wang, J. Guo, Q. Fu, and N. Li, “Design of the Speech Control System for a Upper Limb Rehabilitation Robot Based on Wavelet De-noising,” 2018 IEEE International Conference on Mechatronics and Automation (ICMA), 2018.
• Control of a robotic hand
  • A multi-layer perceptron
  • 13 speech features (five time-domain + eight frequency-domain)
  • R. Ismail, M. Ariyanto, W. Caesarendra, I. Haryanto, H. K. Dewoto, and Paryanto, “Speech control of robotic hand augmented with 3D animation using neural network,” 2016 IEEE EMBS Conference on Biomedical Engineering and Sciences (IECBES), 2016.
9. Deep learning speech recognition
• Enabled by GPGPUs + large datasets
• Very deep networks
• Too slow for embedded devices

T. Tan, Y. Qian, H. Hu, Y. Zhou, W. Ding, and K. Yu, “Adaptive very deep convolutional residual network for noise robust speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 8, pp. 1393–1405, 2018.
10. Embedded GPGPU
| Spec      | Google Coral                 | NVIDIA Jetson Nano | NVIDIA Jetson TX2                 | NVIDIA AGX Xavier |
| GPU       | Vivante GC7000 Lite, 16 core | Maxwell, 128 core  | Pascal, 256 core                  | Volta, 512 core   |
| TPU       | Google Edge                  | -                  | -                                 | -                 |
| CPU       | 4-core Cortex-A53            | 4-core Cortex-A57  | 4-core Cortex-A57 + 2-core Denver | 8-core Carmel     |
| RAM       | 1 GB                         | 4 GB               | 8 GB                              | 16 GB             |
| Storage   | 8 GB                         | 16 GB              | 32 GB                             | 32 GB             |
| GFLOPS    | 32                           | 236                | 559                               | 1300              |
| GPIO      | 8                            | 5                  | 8                                 | 4                 |
| USB       | 1 x USB 3.0 + 1 x USB-C      | 4 x USB 3.0        | 1 x USB 3.0 + 1 x USB 2.0         | 2 x USB-C (3.1)   |
| UART      | 2                            | 1                  | 1                                 | 1                 |
| I2C       | 2                            | 2                  | 4                                 | 2                 |
| SPI       | 1 with 2 CS                  | 2 with 2 CS        | 1 with 2 CS                       | 1 with 2 CS       |
| CAN       | 0                            | 0                  | 1                                 | 1                 |
| I2S       | 1                            | 1                  | 2                                 | 1                 |
| Size (mm) | 88 x 60 x 24                 | 100 x 80 x 29      | 170 x 170 x 51                    | 105 x 105 x 85    |
| Weight    | 227 g                        | 244 g              | 1.5 kg                            | 630 g             |
| Price ($) | 150                          | 100                | 400                               | 700               |
11. Related Works
• Reduce neural network size
  • by setting some weights of the network to zero (sketch below)
    • to create a sparse network
    • excellent in the case of FPGAs
  • by pruning neurons or even layers
• Useful but not sufficient
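A generic magnitude-pruning sketch of the weight-zeroing idea (an illustration of the technique, not the specific method used in the works above):

import numpy as np

def prune_weights(weights, sparsity=0.5):
    # Zero out the smallest-magnitude weights so that the given
    # fraction (`sparsity`) of the layer becomes exactly zero.
    threshold = np.percentile(np.abs(weights), sparsity * 100)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

# Example with a Keras layer (weights w, biases b):
# w, b = layer.get_weights()
# layer.set_weights([prune_weights(w, sparsity=0.8), b])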
12. Contribution
• Control of prosthetic hands with speech input
• Using a convolutional neural network (CNN)
• Maps 2D features of speech input to text
• Without a hidden Markov model
• Minimize the size of the CNN
• Real-time on an embedded GPGPU
15. Proposed Method

| Layer | Type                       | Number of filters | Filter size | Stride | Activation | Output shape | Number of parameters |
| 0     | Input (log of spectrogram) | -                 | -           | -      | -          | 129 x 71 x 1 | 0     |
| 1     | Convolution 2D             | 8                 | 10 x 7      | 1      | ReLU       | 120 x 65 x 8 | 568   |
| 2     | Pooling 2D                 | -                 | 7 x 5       | 1      | Max        | 17 x 13 x 8  | 0     |
| 3     | Batch normalization        | -                 | -           | -      | -          | 17 x 13 x 8  | 32    |
| 4     | Convolution 2D             | 32                | 7 x 5       | 1      | ReLU       | 11 x 9 x 32  | 8992  |
| 5     | Pooling 2D                 | -                 | 5 x 3       | 1      | Max        | 2 x 3 x 32   | 0     |
| 6     | Batch normalization        | -                 | -           | -      | -          | 2 x 3 x 32   | 128   |
| 7     | Flatten                    | -                 | -           | -      | -          | 192          | 0     |
| 8     | Dense                      | -                 | -           | -      | ReLU       | 64           | 12352 |
| 9     | Dropout                    | -                 | -           | -      | -          | 64           | 0     |
| 10    | Dense                      | -                 | -           | -      | SoftMax    | 9            | 585   |
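The table maps directly onto Keras, the framework used in this work. The sketch below reproduces the listed output shapes and parameter counts exactly; note that the pooling strides must equal the pool sizes (Keras's default) to obtain the listed shapes, and the dropout rate is an assumption, since the slides do not state it:

from tensorflow.keras import layers, models

model = models.Sequential([
    # Input: log of spectrogram, 129 x 71 x 1
    layers.Conv2D(8, (10, 7), activation="relu",
                  input_shape=(129, 71, 1)),       # 120 x 65 x 8, 568 params
    layers.MaxPooling2D((7, 5)),                   # 17 x 13 x 8 (stride = pool size)
    layers.BatchNormalization(),                   # 32 params
    layers.Conv2D(32, (7, 5), activation="relu"),  # 11 x 9 x 32, 8992 params
    layers.MaxPooling2D((5, 3)),                   # 2 x 3 x 32
    layers.BatchNormalization(),                   # 128 params
    layers.Flatten(),                              # 192
    layers.Dense(64, activation="relu"),           # 12352 params
    layers.Dropout(0.5),                           # rate assumed; not in the slides
    layers.Dense(9, activation="softmax"),         # 8 words + "unknown", 585 params
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])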
17. Data set
• Google Speech Commands dataset
• Open source, Creative Commons BY 4.0 license
• 35 words
• 105,829 utterances
• Each utterance is one second or shorter
• WAV format files
• 16 kHz sample rate with linear 16-bit single-channel PCM values
• Several minutes of various kinds of background noise
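A sketch of converting one dataset utterance into the 129 x 71 log-spectrogram the network takes as input. The file name is hypothetical, and the STFT parameters are assumptions (the slides do not give them); a 256-sample window with 32-sample overlap is used here because it yields exactly 129 x 71 for a one-second 16 kHz clip:

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

# Hypothetical file name following the dataset's directory layout.
rate, samples = wavfile.read("on/0a2b400e_nohash_0.wav")
# Zero-pad utterances shorter than one second to a full 16,000 samples.
samples = np.pad(samples, (0, max(0, rate - len(samples))), mode="constant")
freqs, times, sxx = spectrogram(samples, fs=rate, nperseg=256, noverlap=32)
features = np.log(sxx + 1e-10)       # logarithm; epsilon avoids log(0)
x = features[..., np.newaxis]        # (129, 71) -> (129, 71, 1) for the CNN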
18. Results
• We used 8 classes (words): "zero", "one", "two", "three", "four", "five", "on", and "off"
• We grouped the rest of the words into one class, "unknown"
• Adam optimizer
• Keras (TensorFlow back-end)
• Logarithm of spectrogram as input
19. Discussion
• One-hot encoding (see the example below)
  • Dimension of output vector = number of words + 1
• Increasing the depth of the network has little effect on overall accuracy
• If the number of words increases significantly, the number of filters should be increased
• Real time: ~2 ms on an NVIDIA Jetson TX2 developer kit (embedded GPGPU)
• Increasing speed:
  • NVIDIA AGX Xavier
  • Using C++ and TensorRT
  • wav2letter++
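For example, with the 8 command words plus the "unknown" class, each label is a 9-dimensional one-hot vector:

from tensorflow.keras.utils import to_categorical

words = ["zero", "one", "two", "three", "four", "five", "on", "off", "unknown"]
y = to_categorical(words.index("on"), num_classes=len(words))
# y = [0. 0. 0. 0. 0. 0. 1. 0. 0.]  -> dimension = number of words + 1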
20. Future works
• Investigate a CNN that detects the owner's voice among other speakers
  • The proposed CNN is robust to accent, speed, noise, etc.
  • By combining these two CNNs
• Unconditional and conditional teacher-student training (see the sketch below)
  • Instead of one-hot encoding
  • Teacher (bigger network) & student (smaller network)
• Comparing different types of 2D features
  • Current experiment: logarithm of the speech spectrogram
  • Future experiment: power-normalized cepstral coefficients (PNCC)
• CNNs with raw speech input
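A hedged sketch of the teacher-student idea: instead of one-hot targets, the student is trained to match the teacher's softened output distribution (generic knowledge distillation; the temperature value is an illustrative assumption):

import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    # Cross-entropy between the teacher's and student's softened
    # class distributions; a higher temperature reveals more of the
    # teacher's knowledge about similar-sounding words.
    soft_targets = tf.nn.softmax(teacher_logits / temperature)
    student_log_probs = tf.nn.log_softmax(student_logits / temperature)
    return -tf.reduce_mean(
        tf.reduce_sum(soft_targets * student_log_probs, axis=-1))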
21. Conclusion
• Speech control of a prosthetic hand using a convolutional neural network (CNN)
  • Without a hidden Markov model (HMM)
• The proposed CNN maps 2D features to text (classes, one-hot encoding)
• A look-up table maps text to the trajectory (command) for the hand's low-level controller and driver
• Real-time performance (~2 ms) on an NVIDIA Jetson TX2 developer kit (an embedded GPGPU)
• Accuracy of 91% in a noisy environment
• Speed can be increased with an NVIDIA AGX Xavier, with C++ and TensorRT, or with wav2letter++
• Future works: a CNN that detects the owner's voice among other speakers, unconditional and conditional teacher-student training, comparing different types of 2D features such as PNCC, and investigating CNNs with raw speech input