Speech recognition is a key topic in artificial intelligence, as speech is one of the most common forms of human communication. Over the past decades, researchers have developed many speech-controlled prosthetic hands, typically built on conventional speech recognition systems that combine neural networks with hidden Markov models. Recent advances in general-purpose graphics processing units (GPGPUs) enable intelligent devices to run deep neural networks in real time, and state-of-the-art speech recognition systems have consequently shifted from the paradigm of optimizing composite subsystems to the paradigm of end-to-end optimization. However, low-power embedded GPGPUs cannot run these systems in real time. In this paper, we present a deep convolutional neural network (CNN) for speech control of prosthetic hands that runs in real time on an NVIDIA Jetson TX2 developer kit. First, the device captures speech and converts it into 2D features (such as a spectrogram). The CNN then receives the 2D features and classifies the hand gesture. Finally, the hand gesture class is sent to the prosthetic hand's motion control system. The whole system is written in Python with Keras, a deep learning library with a TensorFlow backend. Our experiments demonstrate 91% classification accuracy and a 2 ms running time for producing hand gesture classes (text output) from speech commands, which is fast enough to control prosthetic hands in real time.
2019 First International Conference on Transdisciplinary AI (TransAI), Laguna Hills, California, USA, 2019, pp. 35-42
1. Convolutional Neural Networks for Speech Controlled Prosthetic Hands
Date: 09/26/2019
2019 First International Conference on Transdisciplinary AI (TransAI)
Mohsen Jafarzadeh
Department of Electrical and Computer Engineering
The University of Texas at Dallas
Richardson, TX, USA
Mohsen.Jafarzadeh@utdallas.edu
Yonas Tadesse
Department of Mechanical Engineering
The University of Texas at Dallas
Richardson, TX, USA
Yonas.Tadesse@utdallas.edu
3. Introduction
• ~94,000 upper limb amputees in Europe
  • S. Micera, J. Carpaneto, and S. Raspopovic, “Control of Hand Prostheses Using Peripheral Information,” IEEE Reviews in Biomedical Engineering, vol. 3, pp. 48–68, 2010.
• ~41,000 upper limb amputees in the United States
  • K. Ziegler-Graham, E. J. MacKenzie, P. L. Ephraim, T. G. Travison, and R. Brookmeyer, “Estimating the prevalence of limb loss in the United States: 2005 to 2050,” Arch Phys Med Rehabil, vol. 89, no. 3, pp. 422–429, Mar. 2008.
• About 40 million amputees in the world
  • M. Marino et al., “Access to prosthetic devices in developing countries: Pathways and challenges,” in Proc. IEEE Annu. Global Humanitarian Technol. Conf., 2015, pp. 45–51.
4. Ways to command a prosthetic hand
• Push-buttons
• Joystick
• Keyboard
• Vision
• Electroencephalography (EEG)
• Electroneurography (ENG)
• Electromyography (EMG)
• Speech
5. Speech commanded prosthetic hands
1. Automatic speech recognition (ASR) system
  • maps speech to text
2. Look-up table
  • maps text to a command
3. Low-level controller & driver
  • maps commands and sensor data to electrical voltages
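A minimal Python sketch of stages 2 and 3, assuming a hypothetical gesture vocabulary and a hypothetical controller.drive interface (the slides do not specify the command set or the driver API):

# Stage 2: look-up table from recognized text to a hand command.
# Gesture names and per-finger set-points are illustrative only.
LOOKUP = {
    "open":  [0.0, 0.0, 0.0, 0.0, 0.0],
    "close": [1.0, 1.0, 1.0, 1.0, 1.0],
}

def command_hand(text, sensor_data, controller):
    # Stage 2: text -> command (trajectory); unknown words are ignored.
    trajectory = LOOKUP.get(text)
    if trajectory is None:
        return
    # Stage 3: the low-level controller maps the command plus sensor
    # data to electrical voltages (hypothetical driver interface).
    controller.drive(trajectory, sensor_data)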
6. Automatic Speech Recognition (ASR) Systems
• Traditional ASR systems have 4 subsystems
  • Preprocessing
  • Feature extraction
  • Language model
  • Classifier
    • Combination of Gaussian mixture models and hidden Markov models (GMM-HMM)
    • Combination of artificial neural networks and hidden Markov models (ANN-HMM)
• Recent ASR systems are end-to-end
8. Related Works
• CMU Sphinx used to control a surgical robot
  • K. Zinchenko, C.-Y. Wu, and K.-T. Song, “A Study on Speech Recognition Control for a Surgical Robot,” IEEE Transactions on Industrial Informatics, vol. 13, no. 2, pp. 607–615, 2017.
• Control of a hand exoskeleton
  • Combination of discrete wavelet transforms and hidden Markov models
  • S. Guo, Z. Wang, J. Guo, Q. Fu, and N. Li, “Design of the Speech Control System for a Upper Limb Rehabilitation Robot Based on Wavelet De-noising,” 2018 IEEE International Conference on Mechatronics and Automation (ICMA), 2018.
• Control of a robotic hand
  • A multi-layer perceptron
  • 13 speech features (five time-domain + eight frequency-domain)
  • R. Ismail, M. Ariyanto, W. Caesarendra, I. Haryanto, H. K. Dewoto, and Paryanto, “Speech control of robotic hand augmented with 3D animation using neural network,” 2016 IEEE EMBS Conference on Biomedical Engineering and Sciences (IECBES), 2016.
9. Deep learning speech recognition
• Enabled by GPGPUs + large datasets
• Very deep networks
• Too slow for embedded devices

T. Tan, Y. Qian, H. Hu, Y. Zhou, W. Ding, and K. Yu, “Adaptive very deep convolutional residual network for noise robust speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 8, pp. 1393–1405, 2018.
10. Embedded GPGPU
| Spec      | Google Coral                 | NVIDIA Jetson Nano | NVIDIA Jetson TX2                 | NVIDIA AGX Xavier |
| GPU       | Vivante GC7000 Lite, 16 core | Maxwell, 128 core  | Pascal, 256 core                  | Volta, 512 core   |
| TPU       | Google Edge                  | -                  | -                                 | -                 |
| CPU       | 4-core Cortex-A53            | 4-core Cortex-A57  | 4-core Cortex-A57 + 2-core Denver | 8-core Carmel     |
| RAM       | 1 GB                         | 4 GB               | 8 GB                              | 16 GB             |
| Storage   | 8 GB                         | 16 GB              | 32 GB                             | 32 GB             |
| GFLOPS    | 32                           | 236                | 559                               | 1300              |
| GPIO      | 8                            | 5                  | 8                                 | 4                 |
| USB       | 1 x USB 3.0 + 1 x USB-C      | 4 x USB 3.0        | 1 x USB 3.0 + 1 x USB 2.0         | 2 x USB-C (3.1)   |
| UART      | 2                            | 1                  | 1                                 | 1                 |
| I2C       | 2                            | 2                  | 4                                 | 2                 |
| SPI       | 1 with 2 CS                  | 2 with 2 CS        | 1 with 2 CS                       | 1 with 2 CS       |
| CAN       | 0                            | 0                  | 1                                 | 1                 |
| I2S       | 1                            | 1                  | 2                                 | 1                 |
| Size (mm) | 88 x 60 x 24                 | 100 x 80 x 29      | 170 x 170 x 51                    | 105 x 105 x 85    |
| Weight    | 227 g                        | 244 g              | 1.5 kg                            | 630 g             |
| Price ($) | 150                          | 100                | 400                               | 700               |
11. Related Works
• Reduce neural network size
  • by setting some weights of the network to zero (sketch below)
    • to create a sparse network
    • excellent in the case of FPGAs
  • by pruning neurons or even layers
• Useful but not sufficient
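A generic magnitude-pruning sketch of the weight-zeroing idea (an illustration of the technique, not the specific method used in the works above):

import numpy as np

def prune_weights(weights, sparsity=0.5):
    # Zero out the smallest-magnitude weights so that the given
    # fraction (`sparsity`) of the layer becomes exactly zero.
    threshold = np.percentile(np.abs(weights), sparsity * 100)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

# Example with a Keras layer (weights w, biases b):
# w, b = layer.get_weights()
# layer.set_weights([prune_weights(w, sparsity=0.8), b])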
12. Contribution
• Control of prosthetic hands with speech input
• Using a convolutional neural network (CNN)
• Maps 2D features of speech input to text
• Without a hidden Markov model
• Minimize the size of the CNN
• Real-time on an embedded GPGPU
15. Proposed Method

| Layer | Type                       | Number of filters | Filter size | Stride | Activation | Output shape | Number of parameters |
| 0     | Input (log of spectrogram) | -                 | -           | -      | -          | 129 x 71 x 1 | 0     |
| 1     | Convolution 2D             | 8                 | 10 x 7      | 1      | ReLU       | 120 x 65 x 8 | 568   |
| 2     | Pooling 2D                 | -                 | 7 x 5       | 1      | Max        | 17 x 13 x 8  | 0     |
| 3     | Batch normalization        | -                 | -           | -      | -          | 17 x 13 x 8  | 32    |
| 4     | Convolution 2D             | 32                | 7 x 5       | 1      | ReLU       | 11 x 9 x 32  | 8992  |
| 5     | Pooling 2D                 | -                 | 5 x 3       | 1      | Max        | 2 x 3 x 32   | 0     |
| 6     | Batch normalization        | -                 | -           | -      | -          | 2 x 3 x 32   | 128   |
| 7     | Flatten                    | -                 | -           | -      | -          | 192          | 0     |
| 8     | Dense                      | -                 | -           | -      | ReLU       | 64           | 12352 |
| 9     | Dropout                    | -                 | -           | -      | -          | 64           | 0     |
| 10    | Dense                      | -                 | -           | -      | SoftMax    | 9            | 585   |
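The table maps directly onto Keras, the framework used in this work. The sketch below reproduces the listed output shapes and parameter counts exactly; note that the pooling strides must equal the pool sizes (Keras's default) to obtain the listed shapes, and the dropout rate is an assumption, since the slides do not state it:

from tensorflow.keras import layers, models

model = models.Sequential([
    # Input: log of spectrogram, 129 x 71 x 1
    layers.Conv2D(8, (10, 7), activation="relu",
                  input_shape=(129, 71, 1)),       # 120 x 65 x 8, 568 params
    layers.MaxPooling2D((7, 5)),                   # 17 x 13 x 8 (stride = pool size)
    layers.BatchNormalization(),                   # 32 params
    layers.Conv2D(32, (7, 5), activation="relu"),  # 11 x 9 x 32, 8992 params
    layers.MaxPooling2D((5, 3)),                   # 2 x 3 x 32
    layers.BatchNormalization(),                   # 128 params
    layers.Flatten(),                              # 192
    layers.Dense(64, activation="relu"),           # 12352 params
    layers.Dropout(0.5),                           # rate assumed; not in the slides
    layers.Dense(9, activation="softmax"),         # 8 words + "unknown", 585 params
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])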
17. Data set
• Google Speech Commands dataset
• Open source, Creative Commons BY 4.0 license
• 35 words
• 105,829 utterances
• Each utterance is one second or shorter
• WAV format files
• 16 kHz sample rate with linear 16-bit single-channel PCM values
• Several minutes of various kinds of background noise
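A sketch of converting one dataset utterance into the 129 x 71 log-spectrogram the network takes as input. The file name is hypothetical, and the STFT parameters are assumptions (the slides do not give them); a 256-sample window with 32-sample overlap is used here because it yields exactly 129 x 71 for a one-second 16 kHz clip:

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

# Hypothetical file name following the dataset's directory layout.
rate, samples = wavfile.read("on/0a2b400e_nohash_0.wav")
# Zero-pad utterances shorter than one second to a full 16,000 samples.
samples = np.pad(samples, (0, max(0, rate - len(samples))), mode="constant")
freqs, times, sxx = spectrogram(samples, fs=rate, nperseg=256, noverlap=32)
features = np.log(sxx + 1e-10)       # logarithm; epsilon avoids log(0)
x = features[..., np.newaxis]        # (129, 71) -> (129, 71, 1) for the CNN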
18. Results
• We used 8 classes (words): "zero", "one", "two", "three", "four", "five", "on", and "off"
• We grouped the rest of the words into one class, "unknown"
• Adam optimizer
• Keras (TensorFlow back-end)
• Logarithm of spectrogram as input
19. Discussion
• One-hot encoding (see the example below)
  • Dimension of output vector = number of words + 1
• Increasing the depth of the network has little effect on overall accuracy
• If the number of words increases significantly, the number of filters should be increased
• Real time: ~2 ms on an NVIDIA Jetson TX2 developer kit (embedded GPGPU)
• Increasing speed:
  • NVIDIA AGX Xavier
  • Using C++ and TensorRT
  • wav2letter++
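For example, with the 8 command words plus the "unknown" class, each label is a 9-dimensional one-hot vector:

from tensorflow.keras.utils import to_categorical

words = ["zero", "one", "two", "three", "four", "five", "on", "off", "unknown"]
y = to_categorical(words.index("on"), num_classes=len(words))
# y = [0. 0. 0. 0. 0. 0. 1. 0. 0.]  -> dimension = number of words + 1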
20. Future works
• Investigate a CNN that detects the owner's voice among other speakers
  • The proposed CNN is robust to accent, speed, noise, etc.
  • By combining these two CNNs
• Unconditional and conditional teacher-student training (see the sketch below)
  • Instead of one-hot encoding
  • Teacher (bigger network) & student (smaller network)
• Comparing different types of 2D features
  • Current experiment: logarithm of the speech spectrogram
  • Future experiment: power-normalized cepstral coefficients (PNCC)
• CNNs with raw speech input
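A hedged sketch of the teacher-student idea: instead of one-hot targets, the student is trained to match the teacher's softened output distribution (generic knowledge distillation; the temperature value is an illustrative assumption):

import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    # Cross-entropy between the teacher's and student's softened
    # class distributions; a higher temperature reveals more of the
    # teacher's knowledge about similar-sounding words.
    soft_targets = tf.nn.softmax(teacher_logits / temperature)
    student_log_probs = tf.nn.log_softmax(student_logits / temperature)
    return -tf.reduce_mean(
        tf.reduce_sum(soft_targets * student_log_probs, axis=-1))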
21. Conclusion
• Speech control of a prosthetic hand using a convolutional neural network (CNN)
  • Without a hidden Markov model (HMM)
• The proposed CNN maps 2D features to text (classes, one-hot encoding)
• A look-up table maps text to the trajectory (command) for the hand's low-level controller and driver
• Real-time performance (~2 ms) on an NVIDIA Jetson TX2 developer kit (an embedded GPGPU)
• Accuracy of 91% in a noisy environment
• Speed can be increased with an NVIDIA AGX Xavier, with C++ and TensorRT, or with wav2letter++
• Future works: a CNN that detects the owner's voice among other speakers, unconditional and conditional teacher-student training, comparing different types of 2D features such as PNCC, and investigating CNNs with raw speech input