Many people are remarkably good at focusing their attention on one person or one voice in a multi-speaker scenario, ‘muting’ other speakers and background noise. This is known as the cocktail party effect. For other people, separating audio sources is a challenge.
In this presentation I will focus on solving this problem with deep neural networks and TensorFlow. I will share technical and implementation details with the audience, and discuss the gains, pain points, and merits of the solutions as they relate to:
* Preparing, transforming and augmenting relevant data for speech separation and noise removal.
* Creating, training and optimizing various neural network architectures.
* Hardware options for running networks on tiny devices.
* And the end goal: real-time speech separation on a small embedded platform.
I will present a vision of future smart ear pods, smart headsets and smart hearing aids running deep neural networks.
Participants will gain insight into some of the latest advances and limitations of speech separation with deep neural networks on embedded devices with regard to:
* Data transformation and augmentation.
* Deep neural network models for speech separation and for removing noise.
* Training smaller and faster neural networks.
* Creating a real-time speech separation pipeline.
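Data preparation and augmentation for speech separation typically mean mixing clean speech with competing speech or noise at a controlled signal-to-noise ratio. A minimal numpy sketch (the 16 kHz rate, tone "speech", and random noise are placeholders, not the talk's actual data):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio of the mixture is `snr_db` dB."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # gain such that 10*log10(speech_power / (gain^2 * noise_power)) == snr_db
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # placeholder "speech"
noise = rng.standard_normal(16000)                           # placeholder noise
mixture = mix_at_snr(speech, noise, snr_db=5.0)
```

Sweeping `snr_db` over a range of values is a common way to augment a small clean-speech corpus into many training mixtures.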
2. Christian Grant
Listening at the Cocktail Party with Deep Neural Networks and TensorFlow
#UnifiedDataAnalytics #SparkAISummit
tcpip001@gmail.com
3. Agenda
• The cocktail party problem
• Solving the cocktail party problem with deep neural networks (DNNs)
• Future vision and use cases
• Barriers to adopting deep neural networks at the edge
• Overcoming barriers in infrastructure, datasets, AI chips, and edge AI software
4. Real-time speech separation at the edge
Vision
6. The Cocktail Party Problem
When multiple people are speaking at the same time
• at a restaurant
• at an airport
• or at a cocktail party
tuning in to one speaker is relatively easy for individuals with no hearing impairment. Individuals with hearing impairment have difficulty understanding speech in the presence of competing voices.
7. Problem – Mixed Audio
M1: It was a great Halloween party
F1: The nest was built with small twigs
8. Solution – Separated Tracks
M1: It was a great Halloween party
F1: The nest was built with small twigs
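Separation quality is commonly scored with the signal-to-distortion ratio (SDR), the evaluation metric used later in this talk. A simplified numpy version (non-permutation, non-scale-invariant; the example signal is a placeholder):

```python
import numpy as np

def sdr(reference, estimate):
    """Signal-to-distortion ratio in dB (simplified variant):
    reference energy over estimation-error energy."""
    error = estimate - reference
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))

clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
# An estimate that is 10% off in amplitude scores exactly 20 dB
score = sdr(clean, 1.1 * clean)
```

Higher is better: a perfect separation gives infinite SDR, while an estimate equal to the raw mixture scores the SDR of the mixture itself.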
16. Tasks
Platform:
• Deep learning virtual machine
• Real-time prediction platform
• Demo platform
• Tiny platform
Data:
• Generalized dataset
• Noise data
Transformation:
• STFT
• ISTFT
• Spectrogram
Code:
• Theano to Keras + TF
• Keras + TF to tf.keras
• Estimator API
• User-friendly code
Training:
• HINT dataset
• Training lots of models
Evaluation:
• Lab listening tests
• Metric: signal-to-distortion ratio
Predictions:
• Prediction pipeline
• Predict 1000s of examples for lots of models
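The STFT/ISTFT transformation step can be sketched with scipy: the complex spectrogram is split into a magnitude part (what the network sees) and a phase part (kept aside for resynthesis). Window and overlap sizes here are illustrative choices, not necessarily the talk's:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # 1 s, 440 Hz placeholder signal

# STFT -> complex spectrogram; 512-sample windows give 257 frequency bins
freqs, frames, Z = stft(x, fs=fs, nperseg=512, noverlap=384)
magnitude, phase = np.abs(Z), np.angle(Z)  # magnitude feeds the network

# ISTFT: recombine the (possibly network-modified) magnitude with the saved phase
_, x_rec = istft(magnitude * np.exp(1j * phase), fs=fs, nperseg=512, noverlap=384)
```

With the magnitude left unmodified, the round trip reconstructs the input; in the separation pipeline the network outputs a per-speaker magnitude (or mask) that replaces `magnitude` before the ISTFT.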
18. Tool Selection
Keras on Theano:
• No development
Keras on TensorFlow:
• Keras
• Easy to convert
• Google
• Large ecosystem
• TensorFlow Lite
• GPU
TensorFlow Keras API:
• Keras
• Very easy to convert
• Google
• Large ecosystem
• TensorFlow Lite
• GPU
Estimator API:
• Distributed and local
• Keras models
• Google
• TensorFlow Lite
• TensorFlow Extended
• Production ready
31. Training
Data is generated batch-by-batch as (inputs, outputs) tuples by a Python generator.

callbacks_list = [checkpoint, early_stopping, tensor_board]
history = model.fit_generator(generator,
                              steps_per_epoch=int(epoch_size / minbatchsize),
                              epochs=num_epochs,
                              validation_data=(Val_in, Val_out),
                              callbacks=callbacks_list,
                              verbose=2)
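A minimal sketch of the kind of generator `fit_generator` consumes, independent of any model; the array shapes (257-bin magnitude-spectrogram frames) and names are hypothetical:

```python
import numpy as np

def batch_generator(mixtures, targets, batch_size):
    """Yield (inputs, outputs) tuples indefinitely, as fit_generator expects."""
    n_examples = mixtures.shape[0]
    while True:
        order = np.random.permutation(n_examples)  # reshuffle every epoch
        for start in range(0, n_examples - batch_size + 1, batch_size):
            batch = order[start:start + batch_size]
            yield mixtures[batch], targets[batch]

# Toy data: 100 mixture/target pairs of 257-bin magnitude-spectrogram frames
mixtures = np.random.rand(100, 257).astype(np.float32)
targets = np.random.rand(100, 257).astype(np.float32)
x_batch, y_batch = next(batch_generator(mixtures, targets, batch_size=16))
```

Because the generator loops forever, `steps_per_epoch` tells Keras how many batches constitute one epoch.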
39. Related Use Cases
Speech to Speech:
• Speech separation
• Accessibility
• Noise removal
• Drive-through fast food / cashier
Speech to Text:
• Live transcription
• Air traffic control
• Audio environment classification
• Keyword speech interfaces
• Speech identification
• Speaker voice identification
Speech to Text + Text to Speech:
• Dialogue systems
• Wake word detection (keyword / trigger word detection)
Devices and Sensors:
• AI pods, AI ear buds
• AI headsets
• Hearables, hearing aids
• Brain chips
• EEG, ECG
• Microphone arrays
45. Barriers to Tiny ML Adoption
• Production environment
• Training
– Dataset
– Algorithm
• Inference
– Devices and chips
– Software
– Real-time inference (latency < 20 milliseconds)
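The real-time budget of 20 ms per frame can be checked with a simple timing harness. This sketch times a stand-in matrix multiply rather than a real network; on a real device you would time the actual inference call:

```python
import time
import numpy as np

def median_latency_ms(infer, frame, n_runs=100):
    """Median wall-clock time of one call to `infer`, in milliseconds."""
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        infer(frame)
        samples.append((time.perf_counter() - t0) * 1000.0)
    return float(np.median(samples))

# Stand-in for a real network: one 257x257 matrix multiply per 20 ms frame
weights = np.random.rand(257, 257).astype(np.float32)
frame = np.random.rand(257).astype(np.float32)
latency_ms = median_latency_ms(lambda f: f @ weights, frame)
budget_met = latency_ms < 20.0
```

The median (rather than the mean) is used so that occasional scheduler hiccups do not dominate the measurement; for a hard real-time claim you would look at worst-case latency instead.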
49. Production Environment
[Architecture diagram: a production ML pipeline]
• Integrated frontend for job management, monitoring, debugging, and data/model/evaluation visualization
• Configuration framework and job orchestration
• Pipeline components: data ingestion → data analysis + validation → data transformation → trainer (with tuner) → model evaluation → serving, with logging throughout
• Shared pipeline storage
• Utilities for garbage collection and data access controls
50. Production Reference Architecture
[Architecture diagram: cloud data-lake reference architecture]
• Sources: ERP, RDBMS, 3rd party, text, sensors, audio and video, machine logs, web and social, IoT
• Ingestion into a data lake with landing, cleansed, and publishing/LOB zones (staging, integration, calculation, semantic; SIM, EDW, EDM)
• Cross-cutting services: data quality, profiling, retention, reconciliation, metadata, security, monitoring
• Cloud deployment: private, public, hybrid
• Labs: agile data lab, data science lab, deep learning lab, and an engineering lab hosting the production ML pipeline environment
• Consumption: information portal (descriptive), analyst workbench (diagnostic), data scientists (predictive/prescriptive), decision makers, business operations, operational processes, app store, edge devices
• Compute: calculation engines, advanced analytical engines and servers, data lake (data processing; production models and analytics)
52. Improved Voice Datasets
HINT:
• 1,560 sentences
• 6 speakers
• 50% male, 50% female
• 2–3 second clips
• Studio recording
Build Your Own Speech Dataset:
• Download a few hours of speeches
• Chop the speeches into 2-second clips (~1 million examples)
• Preprocess as needed
Common Voice (Mozilla):
• Short sentences
• 30 GB
• 1,087 hours
• 39,577 voices
• 23% US English, 9% UK English
• Open-source voice database
Speech Commands Dataset:
• 65,000 utterances
• 30 short words
• 1-second clips
• Contributed by the public through the AIY website
AudioSet (Google):
• 2,084,320 labeled sound clips
• 10-second clips
• 1,01,065 speech clips
• From YouTube
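The build-your-own route can be sketched in a few lines of numpy; the 16 kHz rate and the random waveform stand in for real downloaded speeches:

```python
import numpy as np

def chop_into_clips(waveform, sample_rate, clip_seconds=2.0):
    """Split a long recording into fixed-length clips, dropping any trailing remainder."""
    clip_len = int(sample_rate * clip_seconds)
    n_clips = waveform.shape[0] // clip_len
    return waveform[:n_clips * clip_len].reshape(n_clips, clip_len)

# A 65-second placeholder "speech" at 16 kHz -> 32 two-second clips of 32,000 samples
speech = np.random.randn(65 * 16000).astype(np.float32)
clips = chop_into_clips(speech, 16000)
```

Applied to a few hours of speeches per speaker, this yields the large example counts the slide mentions.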
61. Lightweight AI Software
• Model size
– Parameters
– Megabytes
• Optimizing the model
– Pruning
– Quantization
• Libraries
– TensorFlow Lite
– TensorFlow Lite (next version)
• Real-time (latency <= 20 milliseconds)
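Quantization stores weights in 8 bits instead of 32, cutting model size roughly 4x. Below is a numpy sketch of the affine (scale + zero-point) scheme that post-training quantization in tools like TensorFlow Lite is based on; it is illustrative, not TFLite's actual implementation:

```python
import numpy as np

def quantize_uint8(w):
    """Affine (asymmetric) 8-bit quantization of a weight tensor:
    map [w.min(), w.max()] onto integer codes 0..255."""
    scale = (w.max() - w.min()) / 255.0
    zero_point = int(np.round(-w.min() / scale))  # code that represents 0.0
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the 8-bit codes."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale, zp = quantize_uint8(w)
w_rec = dequantize(q, scale, zp)
# q is 4x smaller than w (uint8 vs float32), at the cost of a small
# per-weight error bounded by roughly half a quantization step
```

Pruning is complementary: it zeroes out small weights so the network can be stored sparsely, and the two are often combined before deployment to tiny devices.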
62. Summary
• The cocktail party problem
• Solving the cocktail party problem with deep neural networks (DNNs)
• Future vision and related use cases
• Barriers to adopting deep neural networks at the edge
• Overcoming barriers in infrastructure, datasets, AI chips, and edge AI software
63. Resources
• Simple Audio Recognition Tutorial
https://www.tensorflow.org/tutorials/sequences/audio_recognition
• Speaker and speech dependence in a deep neural networks speech separation algorithm, Eriksholm
https://wdh01.azureedge.net/-/media/eriksholm/main/files/publications/2019/bramslow-et-al-spin-2019-speaker-and-speech-dependence-in-a-deep-neural-networks-speech-separation-a.pdf?la=en&rev=9978
• Why the Future of Machine Learning is Tiny, Pete Warden’s Blog
https://petewarden.com/2018/06/11/why-the-future-of-machine-learning-is-tiny/
• Looking to Listen: Audio-Visual Speech Separation
https://ai.googleblog.com/2018/04/looking-to-listen-audio-visual-speech.html
• Voice Datasets
– Common Voice: https://voice.mozilla.org/en
– AudioSet: https://research.google.com/audioset/
• Live Transcribe
– https://play.google.com/store/apps/details?id=com.google.audio.hearing.visualization.accessibility.scribe&hl=en_US