Many people are remarkably good at focusing their attention on one person or one voice in a multi-speaker scenario, ‘muting’ other speakers and background noise. This is known as the cocktail party effect. For other people, separating audio sources is a challenge.
In this presentation I will focus on solving this problem with deep neural networks and TensorFlow. I will share technical and implementation details with the audience, and discuss the gains, pain points, and merits of the solutions as they relate to:
* Preparing, transforming and augmenting relevant data for speech separation and noise removal.
* Creating, training and optimizing various neural network architectures.
* Hardware options for running networks on tiny devices.
* And the end goal: real-time speech separation on a small embedded platform.
I will present a vision of future smart ear pods, smart headsets and smart hearing aids running deep neural networks.
Participants will gain insight into some of the latest advances and limitations of speech separation with deep neural networks on embedded devices with regard to:
* Data transformation and augmentation.
* Deep neural network models for speech separation and for removing noise.
* Training smaller and faster neural networks.
* Creating a real-time speech separation pipeline.
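Data preparation and augmentation for speech separation typically mean mixing clean speech with competing speech or noise at a controlled signal-to-noise ratio. A minimal numpy sketch (the 16 kHz rate, tone "speech", and random noise are placeholders, not the talk's actual data):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio of the mixture is `snr_db` dB."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # gain such that 10*log10(speech_power / (gain^2 * noise_power)) == snr_db
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # placeholder "speech"
noise = rng.standard_normal(16000)                           # placeholder noise
mixture = mix_at_snr(speech, noise, snr_db=5.0)
```

Sweeping `snr_db` over a range of values is a common way to augment a small clean-speech corpus into many training mixtures.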
2. Christian Grant
Listening at the Cocktail Party with Deep Neural Networks and TensorFlow
#UnifiedDataAnalytics #SparkAISummit
tcpip001@gmail.com
3. Agenda
• The cocktail party problem
• Solving the cocktail party problem with deep neural networks (DNNs)
• Future vision and use cases
• Barriers to adopting deep neural networks at the edge
• Overcoming barriers in infrastructure, datasets, AI chips, and edge AI software
4. Real-time speech separation at the edge
Vision
6. The Cocktail Party Problem
When multiple people are speaking at the same time
• at a restaurant
• at an airport
• or at a cocktail party
tuning in to one speaker is relatively easy for individuals with no hearing impairment. Individuals with hearing impairment have difficulty understanding speech in the presence of competing voices.
7. Problem – Mixed Audio
M1: It was a great Halloween party
F1: The nest was built with small twigs
8. Solution – Separated Tracks
M1: It was a great Halloween party
F1: The nest was built with small twigs
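Separation quality is commonly scored with the signal-to-distortion ratio (SDR), the evaluation metric used later in this talk. A simplified numpy version (non-permutation, non-scale-invariant; the example signal is a placeholder):

```python
import numpy as np

def sdr(reference, estimate):
    """Signal-to-distortion ratio in dB (simplified variant):
    reference energy over estimation-error energy."""
    error = estimate - reference
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))

clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
# An estimate that is 10% off in amplitude scores exactly 20 dB
score = sdr(clean, 1.1 * clean)
```

Higher is better: a perfect separation gives infinite SDR, while an estimate equal to the raw mixture scores the SDR of the mixture itself.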
16. Tasks
Platform:
• Deep learning virtual machine
• Real-time prediction platform
• Demo platform
• Tiny platform
Data:
• Generalized dataset
• Noise data
Transformation:
• STFT
• ISTFT
• Spectrogram
Code:
• Theano to Keras + TF
• Keras + TF to tf.keras
• Estimator API
• User-friendly code
Training:
• HINT dataset
• Training lots of models
Evaluation:
• Lab listening tests
• Metric: signal-to-distortion ratio
Predictions:
• Prediction pipeline
• Predict 1000s of examples for lots of models
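The STFT/ISTFT transformation step can be sketched with scipy: the complex spectrogram is split into a magnitude part (what the network sees) and a phase part (kept aside for resynthesis). Window and overlap sizes here are illustrative choices, not necessarily the talk's:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # 1 s, 440 Hz placeholder signal

# STFT -> complex spectrogram; 512-sample windows give 257 frequency bins
freqs, frames, Z = stft(x, fs=fs, nperseg=512, noverlap=384)
magnitude, phase = np.abs(Z), np.angle(Z)  # magnitude feeds the network

# ISTFT: recombine the (possibly network-modified) magnitude with the saved phase
_, x_rec = istft(magnitude * np.exp(1j * phase), fs=fs, nperseg=512, noverlap=384)
```

With the magnitude left unmodified, the round trip reconstructs the input; in the separation pipeline the network outputs a per-speaker magnitude (or mask) that replaces `magnitude` before the ISTFT.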
18. Tool Selection
Keras on Theano:
• No development
Keras on TensorFlow:
• Keras
• Easy to convert
• Google
• Large ecosystem
• TensorFlow Lite
• GPU
TensorFlow Keras API:
• Keras
• Very easy to convert
• Google
• Large ecosystem
• TensorFlow Lite
• GPU
Estimator API:
• Distributed and local
• Keras models
• Google
• TensorFlow Lite
• TensorFlow Extended
• Production ready
31. Training
Data is generated batch-by-batch as (inputs, outputs) tuples by a Python generator.

callbacks_list = [checkpoint, early_stopping, tensor_board]
history = model.fit_generator(generator,
                              steps_per_epoch=int(epoch_size / minbatchsize),
                              epochs=num_epochs,
                              validation_data=(Val_in, Val_out),
                              callbacks=callbacks_list,
                              verbose=2)
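A minimal sketch of the kind of generator `fit_generator` consumes, independent of any model; the array shapes (257-bin magnitude-spectrogram frames) and names are hypothetical:

```python
import numpy as np

def batch_generator(mixtures, targets, batch_size):
    """Yield (inputs, outputs) tuples indefinitely, as fit_generator expects."""
    n_examples = mixtures.shape[0]
    while True:
        order = np.random.permutation(n_examples)  # reshuffle every epoch
        for start in range(0, n_examples - batch_size + 1, batch_size):
            batch = order[start:start + batch_size]
            yield mixtures[batch], targets[batch]

# Toy data: 100 mixture/target pairs of 257-bin magnitude-spectrogram frames
mixtures = np.random.rand(100, 257).astype(np.float32)
targets = np.random.rand(100, 257).astype(np.float32)
x_batch, y_batch = next(batch_generator(mixtures, targets, batch_size=16))
```

Because the generator loops forever, `steps_per_epoch` tells Keras how many batches constitute one epoch.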
39. Related Use Cases
Speech to Speech:
• Speech separation
• Accessibility
• Noise removal
• Drive-through fast food / cashier
Speech to Text:
• Live transcription
• Air traffic control
• Audio environment classification
• Keyword speech interfaces
• Speech identification
• Speaker voice identification
Speech to Text + Text to Speech:
• Dialogue systems
• Wake word detection (keyword / trigger word detection)
Devices and Sensors:
• AI pods, AI ear buds
• AI headsets
• Hearables, hearing aids
• Brain chips
• EEG, ECG
• Microphone arrays
45. Barriers to Tiny ML Adoption
• Production environment
• Training
– Dataset
– Algorithm
• Inference
– Devices and chips
– Software
– Real-time inference (latency < 20 milliseconds)
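The real-time budget of 20 ms per frame can be checked with a simple timing harness. This sketch times a stand-in matrix multiply rather than a real network; on a real device you would time the actual inference call:

```python
import time
import numpy as np

def median_latency_ms(infer, frame, n_runs=100):
    """Median wall-clock time of one call to `infer`, in milliseconds."""
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        infer(frame)
        samples.append((time.perf_counter() - t0) * 1000.0)
    return float(np.median(samples))

# Stand-in for a real network: one 257x257 matrix multiply per 20 ms frame
weights = np.random.rand(257, 257).astype(np.float32)
frame = np.random.rand(257).astype(np.float32)
latency_ms = median_latency_ms(lambda f: f @ weights, frame)
budget_met = latency_ms < 20.0
```

The median (rather than the mean) is used so that occasional scheduler hiccups do not dominate the measurement; for a hard real-time claim you would look at worst-case latency instead.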
49. Production Environment
[Architecture diagram: a production ML pipeline]
• Integrated frontend for job management, monitoring, debugging, and data/model/evaluation visualization
• Configuration framework and job orchestration
• Pipeline components: data ingestion → data analysis + validation → data transformation → trainer (with tuner) → model evaluation → serving, with logging throughout
• Shared pipeline storage
• Utilities for garbage collection and data access controls
50. Production Reference Architecture
[Architecture diagram: cloud data-lake reference architecture]
• Sources: ERP, RDBMS, 3rd party, text, sensors, audio and video, machine logs, web and social, IoT
• Ingestion into a data lake with landing, cleansed, and publishing/LOB zones (staging, integration, calculation, semantic; SIM, EDW, EDM)
• Cross-cutting services: data quality, profiling, retention, reconciliation, metadata, security, monitoring
• Cloud deployment: private, public, hybrid
• Labs: agile data lab, data science lab, deep learning lab, and an engineering lab hosting the production ML pipeline environment
• Consumption: information portal (descriptive), analyst workbench (diagnostic), data scientists (predictive/prescriptive), decision makers, business operations, operational processes, app store, edge devices
• Compute: calculation engines, advanced analytical engines and servers, data lake (data processing; production models and analytics)
52. Improved Voice Datasets
HINT:
• 1,560 sentences
• 6 speakers
• 50% male, 50% female
• 2–3 second clips
• Studio recording
Build Your Own Speech Dataset:
• Download a few hours of speeches
• Chop the speeches into 2-second clips (~1 million examples)
• Preprocess as needed
Common Voice (Mozilla):
• Short sentences
• 30 GB
• 1,087 hours
• 39,577 voices
• 23% US English, 9% UK English
• Open-source voice database
Speech Commands Dataset:
• 65,000 utterances
• 30 short words
• 1-second clips
• Contributed by the public through the AIY website
AudioSet (Google):
• 2,084,320 labeled sound clips
• 10-second clips
• 1,01,065 speech clips
• From YouTube
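The build-your-own route can be sketched in a few lines of numpy; the 16 kHz rate and the random waveform stand in for real downloaded speeches:

```python
import numpy as np

def chop_into_clips(waveform, sample_rate, clip_seconds=2.0):
    """Split a long recording into fixed-length clips, dropping any trailing remainder."""
    clip_len = int(sample_rate * clip_seconds)
    n_clips = waveform.shape[0] // clip_len
    return waveform[:n_clips * clip_len].reshape(n_clips, clip_len)

# A 65-second placeholder "speech" at 16 kHz -> 32 two-second clips of 32,000 samples
speech = np.random.randn(65 * 16000).astype(np.float32)
clips = chop_into_clips(speech, 16000)
```

Applied to a few hours of speeches per speaker, this yields the large example counts the slide mentions.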
61. Lightweight AI Software
• Model size
– Parameters
– Megabytes
• Optimizing the model
– Pruning
– Quantization
• Libraries
– TensorFlow Lite
– TensorFlow Lite (next version)
• Real-time (latency <= 20 milliseconds)
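Quantization stores weights in 8 bits instead of 32, cutting model size roughly 4x. Below is a numpy sketch of the affine (scale + zero-point) scheme that post-training quantization in tools like TensorFlow Lite is based on; it is illustrative, not TFLite's actual implementation:

```python
import numpy as np

def quantize_uint8(w):
    """Affine (asymmetric) 8-bit quantization of a weight tensor:
    map [w.min(), w.max()] onto integer codes 0..255."""
    scale = (w.max() - w.min()) / 255.0
    zero_point = int(np.round(-w.min() / scale))  # code that represents 0.0
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the 8-bit codes."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale, zp = quantize_uint8(w)
w_rec = dequantize(q, scale, zp)
# q is 4x smaller than w (uint8 vs float32), at the cost of a small
# per-weight error bounded by roughly half a quantization step
```

Pruning is complementary: it zeroes out small weights so the network can be stored sparsely, and the two are often combined before deployment to tiny devices.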
62. Summary
• The cocktail party problem
• Solving the cocktail party problem with deep neural networks (DNNs)
• Future vision and related use cases
• Barriers to adopting deep neural networks at the edge
• Overcoming barriers in infrastructure, datasets, AI chips, and edge AI software
63. Resources
• Simple Audio Recognition Tutorial
https://www.tensorflow.org/tutorials/sequences/audio_recognition
• Speaker and speech dependence in a deep neural networks speech separation algorithm, Eriksholm
https://wdh01.azureedge.net/-/media/eriksholm/main/files/publications/2019/bramslow-et-al-spin-2019-speaker-and-speech-dependence-in-a-deep-neural-networks-speech-separation-a.pdf?la=en&rev=9978
• Why the Future of Machine Learning is Tiny, Pete Warden’s Blog
https://petewarden.com/2018/06/11/why-the-future-of-machine-learning-is-tiny/
• Looking to Listen: Audio-Visual Speech Separation
https://ai.googleblog.com/2018/04/looking-to-listen-audio-visual-speech.html
• Voice Datasets
– Common Voice: https://voice.mozilla.org/en
– AudioSet: https://research.google.com/audioset/
• Live Transcribe
– https://play.google.com/store/apps/details?id=com.google.audio.hearing.visualization.accessibility.scribe&hl=en_US