Speech Recognition Front-End:
Voice Activity Detection & Speech
Enhancement
Juntae Kim, Ph.D. Candidate
School of Electrical Engineering
KAIST
For NAVER
Voice Activity
Detection
She had your dark suit in greasy
wash water all year.
Local device (smart speaker, robot)
Overview
End Point
Detection
Speech
Enhancement
Speech
Recognition
Server
Today’s topic
Voice Activity Detection Using an Adaptive
Context Attention Model
Kim, Juntae, and Minsoo Hahn. "Voice Activity Detection Using an Adaptive Context Attention Model." IEEE Signal Processing Letters (2018).
The most famous VAD repository on GitHub
Voice activity detection (VAD)
Objective: detect only the speech segments in the incoming signal.
Important points for VAD:
① Robustness to various real-world noise environments.
② Robustness to distance variation.
③ Low computational cost with low latency.
Conventional methods (a minimal sketch of approach ① follows below):
① Statistical signal processing based approaches  assume the DFT coefficients of the speech and noise signals are Gaussian random variables and make the decision by computing the likelihood ratio.
② Feature engineering based approaches  harmonicity, energy, zero-crossing rate, entropy, etc.
③ Traditional machine learning based approaches  SVM, LDA, KNN, etc.
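As promised above, here is a minimal NumPy/SciPy sketch of approach ①, assuming a Sohn-style Gaussian likelihood-ratio model; the leading-frames noise estimate, the maximum-likelihood a priori SNR, and the threshold are illustrative assumptions, not the exact recipe of any cited method.

```python
import numpy as np
from scipy.signal import stft

def lr_vad(x, fs=8000, threshold=1.0, noise_frames=10):
    """Toy Gaussian likelihood-ratio VAD, for illustration only."""
    _, _, X = stft(x, fs=fs, nperseg=200, noverlap=120)  # 25 ms window, 10 ms shift
    power = np.abs(X) ** 2                               # |DFT|^2 per (freq, frame)
    # Assume the leading frames are noise-only (stand-in for a real noise tracker).
    noise_psd = power[:, :noise_frames].mean(axis=1, keepdims=True)
    snr_post = power / np.maximum(noise_psd, 1e-12)      # a posteriori SNR
    snr_prio = np.maximum(snr_post - 1.0, 0.0)           # crude ML a priori SNR
    # Per-bin log-likelihood ratio under complex-Gaussian speech/noise models.
    llr = snr_post * snr_prio / (1.0 + snr_prio) - np.log1p(snr_prio)
    return llr.mean(axis=0) > threshold                  # frame-wise decision

# Usage: speech_flags = lr_vad(signal)  # boolean per frame
```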
Deep learning based VAD
Research branches for deep-learning based VAD
Acoustic
Feature
Extraction
Neural network:
DNN, CNN,
LSTM
Decision
Which acoustic features are useful for VAD?
 These approaches show outstanding performance; however, they generally need multiple features, which increases the computation cost.
Can we directly use the raw waveform as the input to the neural network?
 Raw-waveform approaches generally incur high computational cost, even though the performance improvement is slight.
Which neural network architecture makes VAD robust to various noise environments?
 Many researchers apply state-of-the-art architectures from other fields, such as LSTM, CNN, and DNN, but these architectures show a trade-off across noise types.
Which neural network architecture can effectively use the context information of the speech signal for VAD?
Deep learning based VAD
Boosted deep neural network (bDNN)
Zhang, Xiao-Lei, and DeLiang Wang. "Boosting contextual information for deep neural network based voice activity detection." IEEE/ACM Transactions on Audio, Speech and Language
Processing (TASLP) 24.2 (2016): 252-264.
Inputs: frames subsampled from the context window {x_{n−W}, …, x_{n−1}, x_n, x_{n+1}, …, x_{n+1+u}, …, x_{n+W}} with W = 19 and u = 9.
Outputs: the extended outputs through time (one prediction per context frame); the overlapping predictions are combined by averaging.
Loss: mean squared error.
bDNN shows outstanding performance by adopting this boosting strategy; however, it can only use fixed context information. (A sketch of the indexing and averaging follows below.)
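A sketch of the two bDNN ingredients named above, under our own assumptions: the exact subsampled index pattern for W = 19, u = 9 is inferred from the slide's fragments, and the averaging step assumes one prediction per dense offset in the window.

```python
import numpy as np

def context_indices(n, W=19, u=9):
    """Subsampled context around frame n (slide: W=19, u=9); the exact
    index pattern is our assumption based on the bDNN description."""
    left = list(range(n - W, n - 1, u))           # sparse far-left context
    right = list(range(n + 1 + u, n + W + 1, u))  # sparse far-right context
    return left + [n - 1, n, n + 1] + right

def boosted_average(pred):
    """pred[i, j]: prediction of frame (i + offset j) made by the window
    centred at frame i. Averaging every prediction that refers to the same
    frame is the 'combine with average' boosting step."""
    T, K = pred.shape
    half = K // 2
    out = np.zeros(T)
    cnt = np.zeros(T)
    for i in range(T):
        for j in range(K):
            t = i + j - half
            if 0 <= t < T:
                out[t] += pred[i, j]
                cnt[t] += 1
    return out / np.maximum(cnt, 1)
```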
Deep learning based VAD
Boosted deep neural network (bDNN)
Zhang, Xiao-Lei, and DeLiang Wang. "Boosting contextual information for deep neural network based voice activity detection." IEEE/ACM Transactions on Audio, Speech and Language
Processing (TASLP) 24.2 (2016): 252-264.
bDNN shows outstanding performance by adopting the boosting strategy; however, it can only use fixed context information.
To address this, Zhang et al. built a multi-stacking model from several bDNNs with different input context sizes. This structure achieved state-of-the-art performance, but its computation cost was 11 times that of a single bDNN. Still, this result implies that if we use the context information adaptively, we can obtain a performance improvement.
Deep learning based VAD
Motivation
Can a single model adaptively use the context information (CI) according to noise types and SNRs?
For noisy acoustic features there is no ground truth for the proper usage of CI, so we use a reinforcement-like method: from the input acoustic features, repeatedly search for a proper CI usage; if the found CI usage makes the classification correct, give the model a reward.
Deep learning based VAD
Adaptive context attention model (ACAM)
• Decoder: determines which context is important (attention).
• Core network: given the previous hidden state (h_{m,t−1}) and the input information reflecting the noise environment (g_{m,t}), proposes the next action to the succeeding module.
• Encoder: aggregates the information from the results of the current action.
[Figure: ACAM unrolled over decision steps t = 1, …, T — at each step the decoder emits an attention distribution ρ_{m,t} over the context frames (e.g. sharpening from ρ_{m,t} ≈ [0.05, …, 0.5, …, 0.05] to ρ_{m,t+1} ≈ [0.05, …, 0.6, …, 0.05]); the attended input g_{m,t} drives the core-network recurrence h_{m,t} = f_h(h_{m,t−1}, g_{m,t}); the decision for frame m is made from the final hidden state h_{m,T}, and the Start/Stop loop repeats until the stop condition is met.]
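A minimal PyTorch sketch of one ACAM step as described above: the decoder attends over K context frames and the core network is a GRU cell. The names `att_layer` and `rnn_cell`, the GRU choice, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch

# Illustrative dimensions (assumptions): D = feature dim, H = hidden size,
# K = number of context frames attended over.
D, H, K = 96, 128, 7
att_layer = torch.nn.Linear(H, K)   # h_{m,t-1} -> attention logits over context
rnn_cell = torch.nn.GRUCell(D, H)   # core-network recurrence

def acam_step(frames, h_prev):
    """One illustrative ACAM step.
    frames: (K, D) context feature frames; h_prev: (1, H) hidden state."""
    rho = torch.softmax(att_layer(h_prev), dim=-1)  # attention rho_{m,t} over K frames
    g = rho @ frames                                # attended input summary g_{m,t}, (1, D)
    h = rnn_cell(g, h_prev)                         # h_{m,t} = f_h(h_{m,t-1}, g_{m,t})
    return h, rho
```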
Experimental Setup
Training Phase (20 h)
• Speech dataset: 4,620 utterances from the training set in the TIMIT corpus.
• Noise dataset: 20,000 types from Sound effect library.
• Noise addition was conducted with a randomly selected SNR between −10 and 12 dB.
Test phase
D1 dataset (2.4 h)
• Speech dataset: 192 utterances from the test set of the TIMIT corpus.
• Noise dataset: 15 types of noise from NOISEX-92 corpus. (jet cockpit1, jet cockpit2, destroyer engine,
destroyer operations, F-16 cockpit, factory1, factory2, HF channel, military vehicle, M109 tank,
machine gun, pink, Volvo, speech babble, and white).
• Noise addition was conducted with SNRs −5, 0, 5, 10 dB.
D2 dataset (2 h)
• Real-world dataset recorded with a Galaxy S8.
D3 dataset (72 h)
• YouTube dataset recorded in the real world.
Experimental Setup
Zhang, Xiao-Lei, and DeLiang Wang. "Boosting contextual information for deep neural network based voice activity detection." IEEE/ACM Transactions on Audio, Speech and Language
Processing (TASLP) 24.2 (2016): 252-264.
Acoustic features: Multi-resolution cochleagram features (MRCG)
ACAM 1: Fix the attention.
ACAM 2: Train the model with 𝐽𝑠𝑣 only.
ACAM 3: Train the model with 𝐽.
D1: TIMIT+NoiseX92 (2.4h), D2: Recorded dataset(2h), D3: YouTube Dataset(72h)
Performance Measure: area under the ROC curve (AUC) in %.
Number of parameters for each model:

            HFCL   MSFI   DNN       bDNN      LSTM      ACAM
# param.    –      ~2 k   ~3015 k   ~3018 k   ~2097 k   ~953 k

Computation time (ms) for each model; the number in brackets is the MRCG extraction time:

                                            HFCL   MSFI    DNN            bDNN           LSTM            ACAM
Avg. processing time per second of signal   9.04   67.73   0.88 + (206)   9.24 + (206)   31.61 + (206)   10.12 + (206)
Experimental Results – VAD Performance Comparison
End-to-End Speech Enhancement Using
Boosting-Based Two-Step Neural Network
Kim, Juntae, and Minsoo Hahn. "End-to-End Speech Enhancement Using Boosting-Based Two-Step Neural Network." Submitted to IEEE Signal Processing Letters (2018).
Speech enhancement (SE)
Objective: remove the noise from the incoming noisy speech signal while preserving the speech: y(t) = x(t) + n(t) → x̂(t).
Important points for SE:
① Perceptual quality (related to speech distortion).
② Noise reduction.
③ Computational cost.
Conventional methods: Wiener filtering. The noisy signal y(t) is filtered by H(w) to obtain the enhanced signal ŷ(t):

H(w) = S_x(w) / ( S_x(w) + S_n(w) ) = 1 / ( 1 + S_n(w)/S_x(w) ),

where S_x(w) = E[ |F{x(t)}|² ] and S_n(w) = E[ |F{n(t)}|² ].
* Assumption: x(t) and n(t) are independent. S_n(w) is estimated from the silence region (found by conducting VAD).
How can we find H(w)?  Minimum mean squared error (MMSE) estimation:

minimize E[ ( X(w) − H(w)Y(w) )² ],

which trades off noise reduction against speech distortion.
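As a concrete illustration of the filter above, a minimal NumPy/SciPy sketch of the Wiener gain H(w) = S_x/(S_x+S_n); estimating the noise PSD from the leading frames (standing in for a VAD-detected silence region) and estimating S_x by power subtraction are our own simplifying assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_enhance(y, fs=8000, noise_frames=10):
    """Textbook Wiener-filter sketch, illustrative only."""
    _, _, Y = stft(y, fs=fs, nperseg=200, noverlap=120)      # 25 ms / 10 ms frames
    Syy = np.abs(Y) ** 2
    Sn = Syy[:, :noise_frames].mean(axis=1, keepdims=True)   # noise PSD from 'silence'
    Sx = np.maximum(Syy - Sn, 1e-12)                         # crude clean-speech PSD
    H = Sx / (Sx + Sn)                                       # Wiener gain in [0, 1]
    _, y_hat = istft(H * Y, fs=fs, nperseg=200, noverlap=120)
    return y_hat
```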
Deep learning based SE
Method 1: directly map the noisy log-power spectral (LPS) features to clean LPS:

x̂_t = f(y_t | θ),

where x̂_t ∈ ℝ^N is an enhanced LPS vector, y_t ∈ ℝ^N is a noisy LPS vector, t is the frame index, N is the feature dimension, and f(· | θ) denotes the neural-network-based function.
Xu, Yong, et al. "A regression approach to speech enhancement based on deep neural networks." IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 23.1 (2015): 7-19.
1. Fully Convolutional Neural Network
2. Deep Neural Network
3. Etc.
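A minimal PyTorch sketch of such a Method-1 regressor; the layer sizes, 7-frame input context, and class name are placeholders, not the configuration of Xu et al.

```python
import torch.nn as nn

class LPSMapper(nn.Module):
    """Minimal Method-1 regressor: noisy LPS context -> clean LPS frame."""
    def __init__(self, n_freq=129, context=7, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_freq * context, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq),   # linear output: the target is LPS (regression)
        )

    def forward(self, y_context):        # (batch, n_freq * context)
        return self.net(y_context)

# Trained with MSE between predicted and clean LPS, e.g. nn.MSELoss().
```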
Deep learning based SE
Method 2: SE is carried out by adopting the masking method:

x̂_t(k) = f^mask(x_t(k), y_t(k)) = 2 log( m_t(k) e^{y_t(k)/2} ) = y_t(k) + 2 log m_t(k),
m_t^IRM(k) = ( e^{x_t(k)} / e^{y_t(k)} )^{1/2},

where x̂_t(k), x_t(k), and y_t(k) are the kth elements of x̂_t, x_t, and y_t, respectively; x_t ∈ ℝ^N is a clean LPS vector, k is the frequency bin, and m_t(k) is the mask value for the kth element. The IRM in the second line is the most widely used mask.
[Diagram: Method-2 pipeline — training stage: clean/noisy samples x(t), y(t) → feature extraction → Y → mask extraction m(t) → DNN training; enhancement stage: noisy samples y(t) → feature extraction → Y → trained DNN → m̂(t) → reconstruction → x̂(t).]
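A small NumPy sketch of the LPS-domain mask algebra above; the exponent and log constants follow our reconstruction of the slide's garbled formula.

```python
import numpy as np

def irm_lps(x_lps, y_lps):
    """IRM in the LPS domain: m = (e^x / e^y)^(1/2), so that applying it
    recovers x_hat = y + 2*log(m) = x when the mask is ideal."""
    return np.exp(0.5 * (x_lps - y_lps))

def apply_mask_lps(y_lps, m):
    """x_hat(k) = y(k) + 2*log m(k): multiplying the magnitude spectrum by m
    adds 2*log(m) in the log-power domain."""
    return y_lps + 2.0 * np.log(np.maximum(m, 1e-12))
```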
Deep learning based SE – Summary
Conventional approaches: MMSE, Wiener filtering, optimally modified log-spectral amplitude (OM-LSA), minimum variance distortionless response (MVDR).
Method 1 (direct mapping): DNN with skip connection, fully convolutional network (FCN), LSTM, convolutional LSTM, GAN, multitask learning.
Method 2 (masking): ideal ratio mask (IRM), complex ideal ratio mask (cIRM), multitask learning.
Proposed: two-step network (TSN).
Conventional Approaches
Pros:
① Good performance, but only in specific (stationary) noise environments.
② Small model size (only the frequency response H(w) has to be stored).
Cons:
① Vulnerable in non-stationary noise environments.
② Computation cost is relatively high (matrix inversion operations).
Method 1
Pros:
① Good performance compared to conventional approaches, although some muffled sound (a smoothing effect in the spectrogram) is observed.
② Strong performance in unseen noise environments.
③ Cheap computation if we use simple acoustic features such as log-power spectra and a simple architecture such as a DNN.
Cons:
① GAN-based SE methods were proposed to reduce the speech distortion, but their training is quite hard because of over-regularization.
② Depending on the architecture used, there is a trade-off across noise types.
③ Model size is relatively big (many parameters).
④ Some network types (bi-directional RNNs) prevent online enhancement.
⑤ Phase reconstruction seems to be hard in this framework.
Method 2 compared to Method 1
Pros:
① Phase reconstruction becomes easier than in Method 1.
② Some performance improvement over the Method-1 framework is reported (although some papers dispute this).
Cons:
① An additional masking operation is needed.
② Even though a neural network can model any arbitrary function, there is no solid reason why the masking method should outperform Method 1; depending on the architecture used, there is a trade-off across noise types.
Two-Step Network
[Diagram: TSN training and enhancement stages — training stage: clean/noisy samples x(t), y(t) → feature extraction → X, Y → prior-net training → X*, Y → post-net training; enhancement stage: noisy samples y(t) → feature extraction → Y → prior-net enhancing → post-net enhancing → reconstruction → x̂(t).]
Proposed method: SE is carried out in an end-to-end manner while implicitly considering the masking method (our model directly maps noisy features to clean features):

x̂_t = f^post(X_t*, Y_t), which can subsume the masking function f^mask(x_t, y_t).

Why end-to-end?
① We can share acoustic features with the following modules (the speech recognition system).
② We can save computation cost (no additional masking operation).
③ We can fully exploit the potential modeling power of the neural network.
Two-Step Network
[Figure: TSN architecture — the pri-NN g^pri(·) maps each windowed noisy input (…, y_{t−1}, y_t, y_{t+1}, …) to multiple shifted predictions x̃; stacking these yields X_t*, which is combined with Y_t into V_t and fed to the post-NN to produce x̂_t.]

X_t* = g^pri(Y_{t+k}),  k = −τ, …, τ;   X_t* = [X_t^pred, X_t^CI],

where X_t^pred = {x̃_t^(m)} collects the M predictions of the current frame x_t, and X_t^CI collects the predictions corresponding to the neighbouring frames.
Why multiple outputs for the pri-NN?
 We obtain multiple predictions X_t^pred for x_t, and from these we can adopt the boosting method. The simplest boosting method is averaging:

x̂_t = (1/M) Σ_{m=1}^{M} x̃_t^(m).
Two-Step Network
Drawback 1: the predictions x̃_t^(1), x̃_t^(2), x̃_t^(3) come from different input contexts, so their data distributions across the frequency dimension can differ from one another.
Two-Step Network
Drawback 2: we cannot use X_t^CI, which is highly correlated with x_t, because its entries are predictions of the neighbouring frames.
Drawback 3: we cannot use Y_t, so we cannot implicitly model the masking method, x̂_t = f^post(X_t*, Y_t), which would subsume f^mask(x_t, y_t). Moreover, depending on the noise type, Y_t can contain intact clean-speech features in some frequency bands.
Two-Step Network
Idea: try to use all the useful information related to x_t: X_t^pred, X_t^CI, and Y_t.
V_t = concat(X_t*, Y_t) ∈ ℝ^{N×(2τ+1)(2τ+2)}.

However, we cannot use the simple averaging method because of X_t^CI and Y_t. Instead, we use convolution-based boosting to filter out noisy information while aggregating target-related information from each feature vector.
Two-Step Network
Each post-NN layer is a convolution along the frequency axis:

H_t^l = K^l ∗ H_t^{l−1},
H_t^l(f, j) = Σ_{u=1}^{S^l} Σ_{i=1}^{I^l} K^l(u, i, j) H_t^{l−1}(f + u − 1, i),  j = 1, …, O^l,

where H_t^l ∈ ℝ^{N×O^l} and K^l ∈ ℝ^{S^l×I^l×O^l} are the output feature maps and the convolutional kernel, respectively, and S^l, I^l, and O^l denote the kernel size and the numbers of input and output feature maps of the l-th convolutional layer.
Idea: convolution-based boosting filters out noisy information while aggregating target-related information from each feature vector.
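A PyTorch sketch of this convolution-based boosting stage. The channel schedule follows the TSN post-NN table later in the deck ({(256, 5, 1), …, (1, 5, 1)}); treating the (2τ+1)(2τ+2) columns of V_t as input channels is our reading of the reconstructed dimension, so in_ch = 90 for τ = 4 is an assumption.

```python
import torch.nn as nn

class PostNN(nn.Module):
    """Sketch: the stacked prediction/noisy vectors in V_t are input channels;
    1-D convolutions run along frequency and collapse them to one frame."""
    def __init__(self, in_ch=90, kernel=5):
        super().__init__()
        chans = [in_ch, 256, 128, 64, 32, 32, 32, 32, 1]  # 8 conv layers, kernel 5, stride 1
        layers = []
        for i in range(len(chans) - 1):
            layers.append(nn.Conv1d(chans[i], chans[i + 1], kernel, padding=kernel // 2))
            if i < len(chans) - 2:
                layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, v):                 # v: (batch, in_ch, n_freq)
        return self.net(v).squeeze(1)     # (batch, n_freq) enhanced LPS frame
```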
Two-Step Network
Loss function:

J_loss = J_post + λ J_pri = Σ_t ( ‖x̂_t − x_t‖₂² + λ ‖X_t* − X_t^prior‖_F² ),

where X_t^prior = [x_{t−τ}, …, x_t, …, x_{t+τ}] stacks the clean targets for the pri-NN outputs.
If λ is set to 0, we cannot regard X_t* as containing X_t^pred, because there is no longer any evidence that X_t^pred contains predictions of x_t; the pri-NN then loses its character as a set of weak predictors, and the boosting effect of the TSN from multiple predictions is negated.
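The reconstructed loss, written as a short PyTorch sketch; λ = 0.1 is an arbitrary placeholder, since the slide does not give the value used in the paper.

```python
import torch

def tsn_loss(x_hat, x, X_star, X_prior, lam=0.1):
    """J_loss = J_post + lambda * J_pri, as reconstructed above."""
    j_post = torch.mean((x_hat - x) ** 2)        # ||x_hat - x||^2 term
    j_pri = torch.mean((X_star - X_prior) ** 2)  # ||X* - X_prior||_F^2 term
    return j_post + lam * j_pri
```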
Experimental Setup
Training Phase
• Speech dataset: 4,620 utterances from the training set in the TIMIT corpus.
• Noise dataset: 100 types of noise from HU dataset, 50 types from Sound effect library.
• Noise addition was conducted with a randomly selected SNR between −5 and 20 dB.
• We repeat this procedure until the length of the entire training dataset is approximately 50 h.
• 90% of the training dataset is used in training the model and the remaining 10% in validation.
Test phase
D1 dataset
• Speech dataset: 192 utterances from the test set of the TIMIT corpus.
• Noise dataset: 15 types of noise from NOISEX-92 corpus. (jet cockpit1, jet cockpit2, destroyer engine,
destroyer operations, F-16 cockpit, factory1, factory2, HF channel, military vehicle, M109 tank,
machine gun, pink, Volvo, speech babble, and white).
• Noise addition was conducted with SNRs −5, 0, 5, 10 dB.
D2 dataset
• Real-world dataset recorded with a Galaxy S8.
Experimental Setup
Additional Information
• Sampling rate: 8 kHz.
• Window shift and length: 10 ms and 25 ms.
• Log-power spectra (LPS) were used for acoustic features.
• Z-score-normalization was conducted across the frequency dimension for LPS.
• When reconstructing the waveform, the noisy phase information was used.
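A sketch of this stated feature pipeline (8 kHz, 25 ms window, 10 ms shift, LPS, z-score normalisation, noisy phase kept for reconstruction). Whether the normalisation statistics are per frequency bin over time or per frame over frequency is ambiguous on the slide; per-bin is assumed here.

```python
import numpy as np
from scipy.signal import stft

def lps_features(y, fs=8000):
    """Noisy LPS features plus the noisy phase used at reconstruction time."""
    _, _, Y = stft(y, fs=fs, nperseg=int(0.025 * fs), noverlap=int(0.015 * fs))
    lps = np.log(np.abs(Y) ** 2 + 1e-12)           # (n_freq, n_frames) log-power
    mu = lps.mean(axis=1, keepdims=True)           # per-bin mean (assumption)
    sd = lps.std(axis=1, keepdims=True) + 1e-12    # per-bin std (assumption)
    return (lps - mu) / sd, np.angle(Y)
```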
Baseline Methods
• Deep neural network (DNN)
• DNN with skip connection (sDNN)
• Ideal ratio mask with DNN (IRM)
• Fully convolutional neural network (FCN)
• Long short term memory recurrent neural network (LSTM)
Experimental Setup
TSN
• Pri-NN: τ = 4; 3 hidden layers = {1024, 1024, 1024}.
• Post-NN: 8 convolutional layers = {(256, 5, 1), (128, 5, 1), (64, 5, 1), (32, 5, 1), (32, 5, 1), (32, 5, 1), (32, 5, 1), (1, 5, 1)}.

Compact TSN (cTSN)
• Pri-NN: τ = 4; 3 hidden layers = {512, 512, 512}.
• Post-NN: 4 convolutional layers = {(256, 5, 1), (128, 5, 1), (64, 5, 1), (1, 5, 1)}.
Model size comparison (in millions of parameters):

DNN     sDNN    IRM     FCN    LSTM    TSN    cTSN
11.03   11.03   11.03   0.28   13.24   4.13   1.87
Evaluation metrics
• The perceptual evaluation of speech quality (PESQ).
• The short-time objective intelligibility (STOI, in %).
• The segmental signal-to-noise ratio (SSNR).
• The log-spectral distortion (LSD).
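PESQ and STOI come from standard tools, but SSNR is simple enough to sketch. A common definition (frame-wise SNR in dB, clipped to [−10, 35] dB and averaged) is shown below; the paper's exact variant may differ.

```python
import numpy as np

def ssnr(clean, enhanced, frame=200, lo=-10.0, hi=35.0):
    """Segmental SNR in dB over non-overlapping frames, clipped and averaged."""
    n = min(len(clean), len(enhanced)) // frame * frame
    c = clean[:n].reshape(-1, frame)
    e = enhanced[:n].reshape(-1, frame)
    num = np.sum(c ** 2, axis=1)
    den = np.sum((c - e) ** 2, axis=1) + 1e-12
    snr = 10.0 * np.log10(num / den + 1e-12)
    return float(np.mean(np.clip(snr, lo, hi)))
```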
Experimental Results – Investigation of the TSN framework
• TSN-1 was trained with V_t in which X_t* is replaced by X_t^pred, to observe the influence of X_t^CI.
• TSN-2 was trained with V_t from which Y_t is removed, to investigate the effect of implicitly modeling the masking method.
• TSN-3, our proposed method, was trained with the full V_t.
Findings:
• Using V_t is effective!
• Training the pri-NN well (via the λ J_pri term in J_loss = J_post + λ J_pri) is more important than the post-NN.
• Boosting is important for the performance improvement.
Experimental Results – Performance Evaluation –D1
PESQ per noise type on D1 (averaged over SNRs):

PESQ    babble  buccaneer1  buccaneer2  destroyer  destroyerops  f16    factory1  factory2  hfchannel  leopard  m109   machinegun  pink   volvo  white  Average
Noisy   1.898   2.038       1.847       2.064      2.036         2.094  1.980     2.402     2.174      2.752    2.511  2.913       1.997  3.510  1.997  2.281
FNN     2.223   2.272       2.099       2.266      2.371         2.345  2.319     2.645     2.318      2.842    2.699  3.000       2.314  3.571  2.270  2.504
SDNN1   2.205   2.236       2.117       2.253      2.363         2.332  2.306     2.615     2.267      2.785    2.657  2.945       2.297  3.739  2.260  2.492
LSTM    2.233   2.145       1.985       2.152      2.384         2.241  2.293     2.604     1.997      2.666    2.695  2.944       2.220  3.647  2.301  2.434
FCN     2.075   1.979       1.986       2.248      2.068         2.118  2.019     2.159     2.316      2.377    2.201  2.349       1.997  2.603  2.146  2.176
IRM     2.198   2.220       2.110       2.248      2.358         2.322  2.309     2.610     2.271      2.781    2.648  2.932       2.300  3.731  2.266  2.487
cTSN    2.273   2.335       2.242       2.351      2.423         2.462  2.378     2.721     2.506      3.016    2.783  3.066       2.357  3.566  2.316  2.586
TSN     2.264   2.407       2.245       2.379      2.439         2.442  2.400     2.765     2.485      3.016    2.826  3.117       2.431  3.653  2.378  2.616
Results on D1 by SNR:

PESQ      −5 dB    0 dB    5 dB    10 dB   Avg.
Noisy     1.627    1.926   2.247   2.572   2.093
DNN       2.187    2.525   2.827   3.104   2.661
sDNN1     2.158    2.504   2.813   3.096   2.643
IRM       2.155    2.499   2.807   3.089   2.638
FCN       1.9      2.178   2.451   2.703   2.308
LSTM      2.073    2.445   2.783   3.085   2.597
cTSN      2.255    2.597   2.907   3.191   2.738
TSN       2.281    2.629   2.939   3.225   2.769

STOI (%)  −5 dB    0 dB    5 dB    10 dB   Avg.
Noisy     60.62    70.5    79.86   87.77   74.69
DNN       67.57    77.57   84.95   90.01   80.03
sDNN1     66.89    77.03   84.66   90.12   79.67
IRM       66.92    77.02   84.67   90.14   79.69
FCN       60.51    69.32   76.56   82.1    72.12
LSTM      66.15    77.04   85.15   90.62   79.74
cTSN      69.51    79.41   86.63   91.68   81.81
TSN       70.47    80.38   87.45   92.31   82.66

SSNR      −5 dB    0 dB    5 dB    10 dB   Avg.
Noisy     −7.374   −5.632  −3.241  −0.259  −4.127
DNN       −1.463   0.167   1.903   3.54    1.037
sDNN1     −1.641   0.123   2.017   3.865   1.091
IRM       −1.641   0.118   2.035   3.873   1.096
FCN       −0.412   0.719   1.615   2.321   1.061
LSTM      −2.420   −0.267  1.757   3.545   0.654
cTSN      −0.619   1.145   3.024   4.825   2.094
TSN       −0.820   1.037   3.02    4.911   2.037

LSD       −5 dB    0 dB    5 dB    10 dB   Avg.
Noisy     2.239    2.067   1.828   1.545   1.919
DNN       1.539    1.369   1.21    1.077   1.299
sDNN1     1.567    1.397   1.242   1.11    1.329
IRM       1.566    1.393   1.235   1.102   1.324
FCN       1.903    2.02    2.048   2.02    1.998
LSTM      1.693    1.449   1.252   1.099   1.373
cTSN      1.479    1.319   1.173   1.048   1.255
TSN       1.478    1.307   1.156   1.027   1.242
Average processing time per second of speech (ms), measured on an Intel Core i7-6700K workstation with an Nvidia GTX 1080 Ti:

DNN     SDNN1   LSTM    FCN     IRM     cTSN    TSN
13.91   14.15   63.98   30.09   13.54   13.10   23.87
[Spectrograms: noisy input (white noise, 5 dB) vs. DNN output vs. cTSN output.]
Experimental Results – Preference Test – D2
Number of participants: 20.
Test: for each pair (noisy, DNN, TSN), choose the best one with respect to noise reduction and speech distortion; 20 pairs were used.
Result: TSN 85 %, DNN 15 % (p < 10⁻¹⁸).
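The reconstructed significance level (p < 10⁻¹⁸) is presumably from a binomial test on the preference counts; a sketch under the assumption of 400 total votes (20 participants × 20 pairs) with 85 % for TSN, which is not stated explicitly on the slide:

```python
from scipy.stats import binomtest

# Assumed counts: 340 of 400 votes prefer TSN over DNN.
result = binomtest(340, n=400, p=0.5, alternative='greater')
print(result.pvalue)  # astronomically small, consistent with p < 1e-18
```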
Thank you!

Weitere ähnliche Inhalte

Was ist angesagt?

Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Unde...
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Unde...Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Unde...
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Unde...Vitaly Bondar
 
A Deep Journey into Super-resolution
A Deep Journey into Super-resolutionA Deep Journey into Super-resolution
A Deep Journey into Super-resolutionRonak Mehta
 
[Paper] Multiscale Vision Transformers(MVit)
[Paper] Multiscale Vision Transformers(MVit)[Paper] Multiscale Vision Transformers(MVit)
[Paper] Multiscale Vision Transformers(MVit)Susang Kim
 
Deep neural networks cnn rnn_ae_some practical techniques
Deep neural networks cnn rnn_ae_some practical techniquesDeep neural networks cnn rnn_ae_some practical techniques
Deep neural networks cnn rnn_ae_some practical techniquesKang Pilsung
 
Decision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningDecision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningAbhishek Vijayvargia
 
Speech emotion recognition
Speech emotion recognitionSpeech emotion recognition
Speech emotion recognitionsaniya shaikh
 
How Deep Learning Could Predict Weather Events
How Deep Learning Could Predict Weather EventsHow Deep Learning Could Predict Weather Events
How Deep Learning Could Predict Weather Eventsinside-BigData.com
 
Real time implementation of unscented kalman filter for target tracking
Real time implementation of unscented kalman filter for target trackingReal time implementation of unscented kalman filter for target tracking
Real time implementation of unscented kalman filter for target trackingIAEME Publication
 
Convolutional neural network in practice
Convolutional neural network in practiceConvolutional neural network in practice
Convolutional neural network in practice남주 김
 
Support vector machine
Support vector machineSupport vector machine
Support vector machineMusa Hawamdah
 
Image Filtering in the Frequency Domain
Image Filtering in the Frequency DomainImage Filtering in the Frequency Domain
Image Filtering in the Frequency DomainAmnaakhaan
 
Artificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rulesArtificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rulesMohammed Bennamoun
 
PR-376: Softmax Splatting for Video Frame Interpolation
PR-376: Softmax Splatting for Video Frame InterpolationPR-376: Softmax Splatting for Video Frame Interpolation
PR-376: Softmax Splatting for Video Frame InterpolationHyeongmin Lee
 
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...Edureka!
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Deep Learning Italia
 
DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...
DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...
DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...Joonhyung Lee
 
Deep Learning For Speech Recognition
Deep Learning For Speech RecognitionDeep Learning For Speech Recognition
Deep Learning For Speech Recognitionananth
 

Was ist angesagt? (20)

Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Unde...
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Unde...Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Unde...
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Unde...
 
A Deep Journey into Super-resolution
A Deep Journey into Super-resolutionA Deep Journey into Super-resolution
A Deep Journey into Super-resolution
 
[Paper] Multiscale Vision Transformers(MVit)
[Paper] Multiscale Vision Transformers(MVit)[Paper] Multiscale Vision Transformers(MVit)
[Paper] Multiscale Vision Transformers(MVit)
 
Deep neural networks cnn rnn_ae_some practical techniques
Deep neural networks cnn rnn_ae_some practical techniquesDeep neural networks cnn rnn_ae_some practical techniques
Deep neural networks cnn rnn_ae_some practical techniques
 
Decision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningDecision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learning
 
Speech emotion recognition
Speech emotion recognitionSpeech emotion recognition
Speech emotion recognition
 
Bleu vs rouge
Bleu vs rougeBleu vs rouge
Bleu vs rouge
 
How Deep Learning Could Predict Weather Events
How Deep Learning Could Predict Weather EventsHow Deep Learning Could Predict Weather Events
How Deep Learning Could Predict Weather Events
 
Real time implementation of unscented kalman filter for target tracking
Real time implementation of unscented kalman filter for target trackingReal time implementation of unscented kalman filter for target tracking
Real time implementation of unscented kalman filter for target tracking
 
Convolutional neural network in practice
Convolutional neural network in practiceConvolutional neural network in practice
Convolutional neural network in practice
 
Support vector machine
Support vector machineSupport vector machine
Support vector machine
 
Image Filtering in the Frequency Domain
Image Filtering in the Frequency DomainImage Filtering in the Frequency Domain
Image Filtering in the Frequency Domain
 
Genetic Algorithms
Genetic AlgorithmsGenetic Algorithms
Genetic Algorithms
 
Artificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rulesArtificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rules
 
PR-376: Softmax Splatting for Video Frame Interpolation
PR-376: Softmax Splatting for Video Frame InterpolationPR-376: Softmax Splatting for Video Frame Interpolation
PR-376: Softmax Splatting for Video Frame Interpolation
 
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
 
DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...
DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...
DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...
 
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
 
Deep Learning For Speech Recognition
Deep Learning For Speech RecognitionDeep Learning For Speech Recognition
Deep Learning For Speech Recognition
 

Ähnlich wie Deep Learning Based Voice Activity Detection and Speech Enhancement

Attention gated encoder-decoder for ultrasonic signal denoising
Attention gated encoder-decoder for ultrasonic signal denoisingAttention gated encoder-decoder for ultrasonic signal denoising
Attention gated encoder-decoder for ultrasonic signal denoisingIAESIJAI
 
"An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ..."An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ...butest
 
Sound event detection using deep neural networks
Sound event detection using deep neural networksSound event detection using deep neural networks
Sound event detection using deep neural networksTELKOMNIKA JOURNAL
 
Applications of ann_in_microwave_engineering
Applications of ann_in_microwave_engineeringApplications of ann_in_microwave_engineering
Applications of ann_in_microwave_engineeringprasadhegdegn
 
MULTI-DOMAIN UNPAIRED ULTRASOUND IMAGE ARTIFACT REMOVAL USING A SINGLE CONVOL...
MULTI-DOMAIN UNPAIRED ULTRASOUND IMAGE ARTIFACT REMOVAL USING A SINGLE CONVOL...MULTI-DOMAIN UNPAIRED ULTRASOUND IMAGE ARTIFACT REMOVAL USING A SINGLE CONVOL...
MULTI-DOMAIN UNPAIRED ULTRASOUND IMAGE ARTIFACT REMOVAL USING A SINGLE CONVOL...JaeyoungHuh2
 
Beamforming for Multiuser Massive MIMO Systems: Digital versus Hybrid Analog-...
Beamforming for Multiuser Massive MIMO Systems: Digital versus Hybrid Analog-...Beamforming for Multiuser Massive MIMO Systems: Digital versus Hybrid Analog-...
Beamforming for Multiuser Massive MIMO Systems: Digital versus Hybrid Analog-...T. E. BOGALE
 
Real-time neural text-to-speech with sequence-to-sequence acoustic model and ...
Real-time neural text-to-speech with sequence-to-sequence acoustic model and ...Real-time neural text-to-speech with sequence-to-sequence acoustic model and ...
Real-time neural text-to-speech with sequence-to-sequence acoustic model and ...Takuma_OKAMOTO
 
thesis presentation_liyang
thesis presentation_liyangthesis presentation_liyang
thesis presentation_liyangLiyang Zhang
 
Introduction to adaptive filtering and its applications.ppt
Introduction to adaptive filtering and its applications.pptIntroduction to adaptive filtering and its applications.ppt
Introduction to adaptive filtering and its applications.pptdebeshidutta2
 
Non-Linear Optimization Scheme for Non-Orthogonal Multiuser Access
Non-Linear Optimization Schemefor Non-Orthogonal Multiuser AccessNon-Linear Optimization Schemefor Non-Orthogonal Multiuser Access
Non-Linear Optimization Scheme for Non-Orthogonal Multiuser AccessVladimir Lyashev
 
EBDSS Max Research Report - Final
EBDSS  Max  Research Report - FinalEBDSS  Max  Research Report - Final
EBDSS Max Research Report - FinalMax Robertson
 
Hairong Qi V Swaminathan
Hairong Qi V SwaminathanHairong Qi V Swaminathan
Hairong Qi V SwaminathanFNian
 
Deep Learning - 인공지능 기계학습의 새로운 트랜드 :김인중
Deep Learning - 인공지능 기계학습의 새로운 트랜드 :김인중Deep Learning - 인공지능 기계학습의 새로운 트랜드 :김인중
Deep Learning - 인공지능 기계학습의 새로운 트랜드 :김인중datasciencekorea
 
Neural Networks for High Performance Time-Delay Estimation and Acoustic Sourc...
Neural Networks for High Performance Time-Delay Estimation and Acoustic Sourc...Neural Networks for High Performance Time-Delay Estimation and Acoustic Sourc...
Neural Networks for High Performance Time-Delay Estimation and Acoustic Sourc...cscpconf
 
NEURAL NETWORKS FOR HIGH PERFORMANCE TIME-DELAY ESTIMATION AND ACOUSTIC SOURC...
NEURAL NETWORKS FOR HIGH PERFORMANCE TIME-DELAY ESTIMATION AND ACOUSTIC SOURC...NEURAL NETWORKS FOR HIGH PERFORMANCE TIME-DELAY ESTIMATION AND ACOUSTIC SOURC...
NEURAL NETWORKS FOR HIGH PERFORMANCE TIME-DELAY ESTIMATION AND ACOUSTIC SOURC...csandit
 
Jörg Stelzer
Jörg StelzerJörg Stelzer
Jörg Stelzerbutest
 

Ähnlich wie Deep Learning Based Voice Activity Detection and Speech Enhancement (20)

Attention gated encoder-decoder for ultrasonic signal denoising
Attention gated encoder-decoder for ultrasonic signal denoisingAttention gated encoder-decoder for ultrasonic signal denoising
Attention gated encoder-decoder for ultrasonic signal denoising
 
"An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ..."An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ...
 
Sound event detection using deep neural networks
Sound event detection using deep neural networksSound event detection using deep neural networks
Sound event detection using deep neural networks
 
Applications of ann_in_microwave_engineering
Applications of ann_in_microwave_engineeringApplications of ann_in_microwave_engineering
Applications of ann_in_microwave_engineering
 
MULTI-DOMAIN UNPAIRED ULTRASOUND IMAGE ARTIFACT REMOVAL USING A SINGLE CONVOL...
MULTI-DOMAIN UNPAIRED ULTRASOUND IMAGE ARTIFACT REMOVAL USING A SINGLE CONVOL...MULTI-DOMAIN UNPAIRED ULTRASOUND IMAGE ARTIFACT REMOVAL USING A SINGLE CONVOL...
MULTI-DOMAIN UNPAIRED ULTRASOUND IMAGE ARTIFACT REMOVAL USING A SINGLE CONVOL...
 
Beamforming for Multiuser Massive MIMO Systems: Digital versus Hybrid Analog-...
Beamforming for Multiuser Massive MIMO Systems: Digital versus Hybrid Analog-...Beamforming for Multiuser Massive MIMO Systems: Digital versus Hybrid Analog-...
Beamforming for Multiuser Massive MIMO Systems: Digital versus Hybrid Analog-...
 
Real-time neural text-to-speech with sequence-to-sequence acoustic model and ...
Real-time neural text-to-speech with sequence-to-sequence acoustic model and ...Real-time neural text-to-speech with sequence-to-sequence acoustic model and ...
Real-time neural text-to-speech with sequence-to-sequence acoustic model and ...
 
Real Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform DomainReal Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform Domain
 
thesis presentation_liyang
thesis presentation_liyangthesis presentation_liyang
thesis presentation_liyang
 
Final presentation
Final presentationFinal presentation
Final presentation
 
Introduction to adaptive filtering and its applications.ppt
Introduction to adaptive filtering and its applications.pptIntroduction to adaptive filtering and its applications.ppt
Introduction to adaptive filtering and its applications.ppt
 
Non-Linear Optimization Scheme for Non-Orthogonal Multiuser Access
Non-Linear Optimization Schemefor Non-Orthogonal Multiuser AccessNon-Linear Optimization Schemefor Non-Orthogonal Multiuser Access
Non-Linear Optimization Scheme for Non-Orthogonal Multiuser Access
 
Sudormrf.pdf
Sudormrf.pdfSudormrf.pdf
Sudormrf.pdf
 
EBDSS Max Research Report - Final
EBDSS  Max  Research Report - FinalEBDSS  Max  Research Report - Final
EBDSS Max Research Report - Final
 
Hairong Qi V Swaminathan
Hairong Qi V SwaminathanHairong Qi V Swaminathan
Hairong Qi V Swaminathan
 
Deep Learning - 인공지능 기계학습의 새로운 트랜드 :김인중
Deep Learning - 인공지능 기계학습의 새로운 트랜드 :김인중Deep Learning - 인공지능 기계학습의 새로운 트랜드 :김인중
Deep Learning - 인공지능 기계학습의 새로운 트랜드 :김인중
 
Neural Networks for High Performance Time-Delay Estimation and Acoustic Sourc...
Neural Networks for High Performance Time-Delay Estimation and Acoustic Sourc...Neural Networks for High Performance Time-Delay Estimation and Acoustic Sourc...
Neural Networks for High Performance Time-Delay Estimation and Acoustic Sourc...
 
NEURAL NETWORKS FOR HIGH PERFORMANCE TIME-DELAY ESTIMATION AND ACOUSTIC SOURC...
NEURAL NETWORKS FOR HIGH PERFORMANCE TIME-DELAY ESTIMATION AND ACOUSTIC SOURC...NEURAL NETWORKS FOR HIGH PERFORMANCE TIME-DELAY ESTIMATION AND ACOUSTIC SOURC...
NEURAL NETWORKS FOR HIGH PERFORMANCE TIME-DELAY ESTIMATION AND ACOUSTIC SOURC...
 
Jörg Stelzer
Jörg StelzerJörg Stelzer
Jörg Stelzer
 
J010245458
J010245458J010245458
J010245458
 

Mehr von NAVER Engineering

디자인 시스템에 직방 ZUIX
디자인 시스템에 직방 ZUIX디자인 시스템에 직방 ZUIX
디자인 시스템에 직방 ZUIXNAVER Engineering
 
진화하는 디자인 시스템(걸음마 편)
진화하는 디자인 시스템(걸음마 편)진화하는 디자인 시스템(걸음마 편)
진화하는 디자인 시스템(걸음마 편)NAVER Engineering
 
서비스 운영을 위한 디자인시스템 프로젝트
서비스 운영을 위한 디자인시스템 프로젝트서비스 운영을 위한 디자인시스템 프로젝트
서비스 운영을 위한 디자인시스템 프로젝트NAVER Engineering
 
BPL(Banksalad Product Language) 무야호
BPL(Banksalad Product Language) 무야호BPL(Banksalad Product Language) 무야호
BPL(Banksalad Product Language) 무야호NAVER Engineering
 
이번 생에 디자인 시스템은 처음이라
이번 생에 디자인 시스템은 처음이라이번 생에 디자인 시스템은 처음이라
이번 생에 디자인 시스템은 처음이라NAVER Engineering
 
날고 있는 여러 비행기 넘나 들며 정비하기
날고 있는 여러 비행기 넘나 들며 정비하기날고 있는 여러 비행기 넘나 들며 정비하기
날고 있는 여러 비행기 넘나 들며 정비하기NAVER Engineering
 
쏘카프레임 구축 배경과 과정
 쏘카프레임 구축 배경과 과정 쏘카프레임 구축 배경과 과정
쏘카프레임 구축 배경과 과정NAVER Engineering
 
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기NAVER Engineering
 
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)NAVER Engineering
 
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드NAVER Engineering
 
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기NAVER Engineering
 
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활NAVER Engineering
 
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출NAVER Engineering
 
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우NAVER Engineering
 
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...NAVER Engineering
 
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법NAVER Engineering
 
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며NAVER Engineering
 
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기NAVER Engineering
 
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기NAVER Engineering
 

Mehr von NAVER Engineering (20)

React vac pattern
React vac patternReact vac pattern
React vac pattern
 
디자인 시스템에 직방 ZUIX
디자인 시스템에 직방 ZUIX디자인 시스템에 직방 ZUIX
디자인 시스템에 직방 ZUIX
 
진화하는 디자인 시스템(걸음마 편)
진화하는 디자인 시스템(걸음마 편)진화하는 디자인 시스템(걸음마 편)
진화하는 디자인 시스템(걸음마 편)
 
서비스 운영을 위한 디자인시스템 프로젝트
서비스 운영을 위한 디자인시스템 프로젝트서비스 운영을 위한 디자인시스템 프로젝트
서비스 운영을 위한 디자인시스템 프로젝트
 
BPL(Banksalad Product Language) 무야호
BPL(Banksalad Product Language) 무야호BPL(Banksalad Product Language) 무야호
BPL(Banksalad Product Language) 무야호
 
이번 생에 디자인 시스템은 처음이라
이번 생에 디자인 시스템은 처음이라이번 생에 디자인 시스템은 처음이라
이번 생에 디자인 시스템은 처음이라
 
날고 있는 여러 비행기 넘나 들며 정비하기
날고 있는 여러 비행기 넘나 들며 정비하기날고 있는 여러 비행기 넘나 들며 정비하기
날고 있는 여러 비행기 넘나 들며 정비하기
 
쏘카프레임 구축 배경과 과정
 쏘카프레임 구축 배경과 과정 쏘카프레임 구축 배경과 과정
쏘카프레임 구축 배경과 과정
 
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
 
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
 
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
 
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
 
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
 
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
 
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
 
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
 
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
 
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
 
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
 
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
 

Kürzlich hochgeladen

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 

Kürzlich hochgeladen (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

Deep Learning Based Voice Activity Detection and Speech Enhancement

  • 1. Speech Recognition Front-End: Voice Activity Detection & Speech Enhancement Juntae Kim, Ph. D Candidate School of Electrical Engineering KAIST For NAVER
  • 2. Voice Activity Detection She had your dark suit in greasy wash water all year. Local device (smart speaker, robot) Overview End Point Detection Speech Enhancement Speech Recognition Server Today’s topic
  • 3. Voice Activity Detection Using an Adaptive Context Attention Model Kim, Juntae, and Minsoo Hahn. "Voice Activity Detection Using an Adaptive Context Attention Model." IEEE Signal Processing Letters (2018). The most famous VAD repository in github
  • 4. Voice activity detection (VAD) Objective: From incoming signal, detecting the speech signal only. Important Points for VAD: ① Robustness to the various real-world noise environments. ② Robustness to the distance variation. ③ Compact computational cost with low-latency. Conventional methods: ① Statistical signal processing based approaches.  Assume DFT coefficients of speech and noise signal to Gaussian random variable and conduct the decision by calculating the likelihood ratio. ② Feature engineering based approaches.  Harmonicity, energy, zero-crossing rate, entropy and etc. ③ Traditional machine learning based approaches.  SVM, LDA, KNN and etc.
  • 5. Deep learning based VAD Research branches for deep-learning based VAD Acoustic Feature Extraction Neural network: DNN, CNN, LSTM Decision Which acoustic features are useful for VAD?  These approaches show outstanding performance, however, generally needs to multiple features so that computation cost can be increased. Can we directly use this raw-waveform for neural network?  Generally raw-waveform based approach needs high computational cost even the performance improvement is slight. Which neural network architecture make the VAD robust to various noise environments?  Generally many researchers try to apply state-of-the-art architectures from other field such as LSTM, CNN and DNN but such architecture show the trade-off according to the noise types. Which neural network architecture can effectively use the context information of speech signal for VAD?
  • 6. Deep learning based VAD Boosted deep neural network (bDNN) Zhang, Xiao-Lei, and DeLiang Wang. "Boosting contextual information for deep neural network based voice activity detection." IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24.2 (2016): 252-264. 𝑥 𝑛 𝑥 𝑛+1 𝑥 𝑛−1 𝑥 𝑛+1+𝑢 𝑥 𝑛+𝑊 Inputs: Frames subsampled with W=19, u=9 are used as input. Combine with average method Extended outputs through time are used as output. Loss : Mean squared error bDNN shows outstanding performance by adopting the boosting strategy.  However it only can use fixed context information.
  • 7. Deep learning based VAD Boosted deep neural network (bDNN) Zhang, Xiao-Lei, and DeLiang Wang. "Boosting contextual information for deep neural network based voice activity detection." IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24.2 (2016): 252-264. bDNN shows outstanding performance by adopting the boosting strategy.  However it only can use fixed context information. In order to solve this problem, Zhang et el. built multi-stacking model with several bDNNs that have various context size of inputs. This structure shows state-of-the-art performance but the computation cost was 11 times higher than single bDNN. However, this result implies that if we use the context information adaptively, we can get some performance improvement.
  • 8. Deep learning based VAD – Motivation: Can a single model adaptively use the context information (CI) according to the noise type and SNR? For noisy acoustic features there is no ground truth for what the proper usage of CI is, so we adopt a reinforcement-like method: from the input acoustic features, the model repeatedly searches for a proper CI usage, and whenever the chosen usage leads to a correct classification, the model receives a reward.
  • 9. Deep learning based VAD Adaptive context attention model (ACAM) • Decoder: determines which context frames are important (attention). • Core network: given the previous hidden state 𝐡_{m,t−1} and the encoded input information about the noise environment 𝐠_{m,t}, proposes the next action to the succeeding module. • Encoder: aggregates the information resulting from the current action. A minimal code sketch of this loop follows. [Figure: ACAM flowchart (start  attend  encode  core network 𝐡_{m,t} = f(𝐡_{m,t−1}, 𝐠_{m,t})  decode  repeat until stop). Between iterations the attention weights 𝛒_{m,t} over the input window sharpen around the informative frames, e.g. from [0.05, 0.1, 0.1, 0.5, 0.1, 0.1, 0.05] toward [0.05, 0.1, 0.05, 0.6, 0.05, 0.1, 0.05].]
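The loop above can be pictured with a short sketch. The following PyTorch code is a minimal illustration of the attend/encode/core/decode iteration, not the paper's implementation; the module names, layer sizes, GRU cell, and number of attention steps are all assumptions made for the example.

```python
# A minimal sketch of the ACAM loop (assumed names and sizes, not the
# paper's implementation): attend -> encode -> core -> decode, repeated.
import torch
import torch.nn as nn

class ACAMSketch(nn.Module):
    def __init__(self, feat_dim=768, win=7, hidden=128, steps=5):
        super().__init__()
        self.steps = steps                                # attention iterations
        self.encoder = nn.Linear(feat_dim * win, hidden)  # aggregate attended window
        self.core = nn.GRUCell(hidden, hidden)            # h_t = f(h_{t-1}, g_t)
        self.decoder = nn.Linear(hidden, win)             # propose next attention
        self.classifier = nn.Linear(hidden, 1)            # speech / non-speech

    def forward(self, window):                            # window: (B, win, feat_dim)
        B, W, _ = window.shape
        rho = window.new_full((B, W), 1.0 / W)            # start from uniform attention
        h = window.new_zeros(B, self.core.hidden_size)
        for _ in range(self.steps):
            attended = window * rho.unsqueeze(-1)         # re-weight context frames
            g = torch.tanh(self.encoder(attended.flatten(1)))
            h = self.core(g, h)                           # core-network recurrence
            rho = torch.softmax(self.decoder(h), dim=-1)  # adapt the context usage
        return torch.sigmoid(self.classifier(h))          # VAD posterior for the window
```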
  • 10. Experimental Setup Training phase (20 h): • Speech dataset: 4,620 utterances from the training set of the TIMIT corpus. • Noise dataset: 20,000 noise types from a sound-effect library. • Noise addition was conducted with an SNR randomly selected between −10 and 12 dB. Test phase, D1 dataset (2.4 h): • Speech dataset: 192 utterances from the test set of the TIMIT corpus. • Noise dataset: 15 noise types from the NOISEX-92 corpus (jet cockpit1, jet cockpit2, destroyer engine, destroyer operations, F-16 cockpit, factory1, factory2, HF channel, military vehicle, M109 tank, machine gun, pink, Volvo, speech babble, and white). • Noise addition was conducted with SNRs of −5, 0, 5, and 10 dB. D2 dataset (2 h): • Real-world dataset recorded with a Galaxy 8. D3 dataset (72 h): • YouTube dataset recorded in the real world.
  • 11. Experimental Setup Acoustic features: multi-resolution cochleagram (MRCG) features. Zhang, Xiao-Lei, and DeLiang Wang. "Boosting contextual information for deep neural network based voice activity detection." IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24.2 (2016): 252-264.
  • 12. Experimental Results ACAM 1: the attention is fixed. ACAM 2: the model is trained with J_sv only. ACAM 3: the model is trained with the full loss J. D1: TIMIT + NOISEX-92 (2.4 h), D2: recorded dataset (2 h), D3: YouTube dataset (72 h). Performance measure: area under the ROC curve (AUC), in %.
Number of parameters for each model:
| | HFCL | MSFI | DNN | bDNN | LSTM | ACAM |
| # param. | ▪ | ~2 k | ~3015 k | ~3018 k | ~2097 k | ~953 k |
Average processing time per second of signal (ms); the number in brackets is the MRCG extraction time:
| | HFCL | MSFI | DNN | bDNN | LSTM | ACAM |
| time | 9.04 | 67.73 | 0.88 + (206) | 9.24 + (206) | 31.61 + (206) | 10.12 + (206) |
  • 13. End-to-End Speech Enhancement Using Boosting-Based Two-Step Neural Network Kim, Juntae, and Minsoo Hahn. "End-to-End Speech Enhancement Using Boosting-Based Two-Step Neural Network," submitted to IEEE Signal Processing Letters (2018).
  • 14. Speech enhancement (SE) Objective: from the incoming noisy speech signal y(t), remove the noise signal while conserving the speech signal, producing an enhanced signal ŷ(t). Important points for SE: ① perceptual quality (related to speech distortion); ② noise reduction (there is a trade-off between ① and ②); ③ computational cost. Conventional methods: Wiener filtering. Assuming x(t) and n(t) are independent, $H(w) = \frac{S_x(w)}{S_x(w) + S_n(w)} = \frac{1}{1 + S_n(w)/S_x(w)}$, where $S_x(w) = E[|F[x(t)]|^2]$ and $S_n(w) = E[|F[n(t)]|^2]$. $S_n(w)$ is estimated from the silence regions (found by running a VAD). How can we find H(w)? By minimum mean squared error (MMSE) estimation: minimize $E[(X(w) - H(w)Y(w))^2]$.
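As a concrete illustration, the Wiener gain fits in a few lines of numpy. This is a minimal sketch, assuming the noise PSD has already been estimated from VAD-detected silence; the function name, spectral floor, and smoothing details that real systems add are illustrative, not from the slide.

```python
# A minimal numpy sketch of the Wiener gain H(w) = S_x / (S_x + S_n),
# assuming the noise PSD S_n was already estimated from silence regions.
import numpy as np

def wiener_gain(noisy_psd, noise_psd, floor=1e-3):
    speech_psd = np.maximum(noisy_psd - noise_psd, 0.0)  # crude S_x = S_y - S_n
    gain = speech_psd / (speech_psd + noise_psd + 1e-12)
    return np.maximum(gain, floor)                       # floor limits musical noise

# Usage idea: Y = STFT(y); X_hat = wiener_gain(np.abs(Y)**2, S_n) * Y; inverse STFT.
```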
  • 15. Deep learning based SE Method 1: directly map the noisy log-power spectral (LPS) features to the clean LPS: $\hat{\mathbf{x}}_t = f(\mathbf{Y}_t \mid \theta)$, where $\hat{\mathbf{x}}_t \in \mathbb{R}^N$ is an enhanced LPS vector, $\mathbf{y}_t \in \mathbb{R}^N$ is a noisy LPS vector (the network input 𝐘_t typically stacks y_t with its neighbouring frames), t is the frame index, N is the feature dimension, and f(· | θ) denotes the neural-network-based function. Typical choices for f: 1. fully convolutional neural networks; 2. deep neural networks; 3. etc. Xu, Yong, et al. "A regression approach to speech enhancement based on deep neural networks." IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 23.1 (2015): 7-19.
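For orientation, Method 1 reduces to a plain regression network. The PyTorch sketch below is illustrative only; the layer sizes, context width, and variable names are assumptions, not the configuration of Xu et al.

```python
# A minimal PyTorch sketch of Method 1 (noisy-LPS -> clean-LPS regression).
import torch
import torch.nn as nn

N, C = 257, 5                              # LPS dimension, stacked context frames

model = nn.Sequential(                     # f( . | theta)
    nn.Linear(N * C, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, N),                    # enhanced LPS frame x_hat_t
)

noisy = torch.randn(32, N * C)             # batch of stacked noisy LPS frames
clean = torch.randn(32, N)                 # aligned clean LPS targets
loss = nn.functional.mse_loss(model(noisy), clean)   # regression objective
loss.backward()
```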
  • 16. Deep learning based SE Method 2: SE is carried out by adopting a masking method: $\hat{x}_t(k) = f_{mask}(x_t(k), y_t(k)) = y_t(k) + 2\log m_t(k)$, with the ideal ratio mask (IRM) $m_t^{IRM}(k) = e^{(x_t(k) - y_t(k))/2}$, where $\hat{x}_t(k)$, $x_t(k)$, and $y_t(k)$ are the k-th elements of $\hat{\mathbf{x}}_t$, $\mathbf{x}_t$, and $\mathbf{y}_t$, respectively, $\mathbf{x}_t \in \mathbb{R}^N$ is a clean LPS vector, k is the frequency bin, and $m_t(k)$ is the mask value for the k-th element. The IRM in the second equation is the most widely used mask. [Figure: Method 2 pipeline. Training stage: clean/noisy pairs (x(t), y(t))  feature extraction  mask extraction m(t)  DNN training. Enhancement stage: noisy samples y(t)  feature extraction 𝐘  DNN mask estimate m̂(t)  reconstruction  x̂(t).]
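The two mask equations translate directly into code. A minimal numpy sketch, assuming aligned clean/noisy LPS frames; the function names and the clipping floor are illustrative.

```python
# A minimal numpy sketch of the LPS-domain mask relations above; the mask
# is often clipped to [0, 1] in practice.
import numpy as np

def irm_target(clean_lps, noisy_lps):
    # m_t(k) = exp((x_t(k) - y_t(k)) / 2): the magnitude-domain ratio |X|/|Y|
    return np.exp((clean_lps - noisy_lps) / 2.0)

def apply_mask(noisy_lps, mask, eps=1e-8):
    # x_hat_t(k) = y_t(k) + 2 * log m_t(k): masking expressed in the LPS domain
    return noisy_lps + 2.0 * np.log(np.maximum(mask, eps))
```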
  • 17. Deep learning based SE – Summary
Conventional approaches (MMSE, Wiener filtering, optimally modified log-spectral amplitude, minimum variance distortionless response (MVDR)):
Pros: ① good, but only in specific (stationary) noise environments; ② the model size is small (only the impulse response H(w) has to be stored).
Cons: ① vulnerable in non-stationary noise environments; ② the computational cost is relatively high (matrix inversion operations).
Method 1 (DNN with skip connections, fully convolutional network (FCN), GAN, multitask learning):
Pros: ① good compared to the conventional approaches, although some muffled sound (a smoothing effect in the spectrogram) is observed; ② strong performance in unseen noise environments; ③ the computational cost is low if simple acoustic features (e.g., log-power spectra) and simple architectures (e.g., DNN) are used.
Cons: ① GAN-based SE methods were proposed to reduce the speech distortion, but their training is quite hard because of over-regularization; ② depending on the architecture used, there is some trade-off between noise types; ③ the model size is relatively big (many parameters); ④ some network types (bi-directional RNNs) prevent online enhancement; ⑤ phase reconstruction seems to be hard in this framework.
Method 2 (LSTM, convolutional LSTM, multitask learning, with masks such as the ideal ratio mask (IRM) and the complex ideal ratio mask (cIRM)), compared to Method 1:
Pros: ① phase reconstruction becomes easier than in Method 1; ② some performance improvement over the Method 1 framework is reported (some papers dispute this).
Cons: ① an additional masking operation is needed; ② even though a neural network can model any arbitrary function, there is no solid reason why the masking method should outperform Method 1, and depending on the architecture used there is again some trade-off between noise types.
The two-step network (TSN, next slides) is the proposed approach in this taxonomy.
  • 18. Two-Step Network Proposed method: SE is carried out in an end-to-end manner (the model directly maps noisy features to clean features) while implicitly considering the masking method: $\hat{\mathbf{x}}_t = f_{post}(\mathbf{X}_t^*, \mathbf{Y}_t) \approx f_{mask}(\mathbf{x}_t, \mathbf{y}_t)$. Why end-to-end? ① Acoustic features can be shared with the following modules (e.g., the speech recognition system). ② Computation is saved (no additional masking operation). ③ The full potential modeling power of the neural network can be exploited. A composition sketch follows. [Figure: TSN pipeline. Training stage: (x(t), y(t))  feature extraction  𝐗, 𝐘  prior-net training  𝐗*, 𝐘  post-net training. Enhancement stage: y(t)  feature extraction 𝐘  prior-net enhancing  post-net enhancing  reconstruction  x̂(t).]
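Putting the pieces together, the composition $\hat{\mathbf{x}}_t = f_{post}(\mathbf{X}_t^*, \mathbf{Y}_t)$ can be sketched as follows. All sizes and the exact layer stacks are placeholders, not the paper's networks; the point is only the data flow from pri-NN output through concatenation to the post-NN.

```python
# A minimal PyTorch sketch of the TSN data flow (placeholder networks and
# sizes, not the paper's models): pri-NN -> concat with Y_t -> post-NN.
import torch
import torch.nn as nn

N, tau = 257, 4                          # frequency bins, context half-width
ctx = 2 * tau + 1                        # frames covered by one pri-NN input

pri_nn = nn.Sequential(                  # X_t^* = g_pri(Y_t)
    nn.Linear(N * ctx, 512), nn.ReLU(),
    nn.Linear(512, N * ctx),             # predictions for frames t-tau .. t+tau
)
post_nn = nn.Sequential(                 # x_hat_t = f_post(X_t^*, Y_t)
    nn.Conv1d(2 * ctx, 64, 5, padding=2), nn.ReLU(),
    nn.Conv1d(64, 1, 5, padding=2),
)

Y = torch.randn(32, ctx, N)              # stacked noisy LPS context Y_t
X_star = pri_nn(Y.flatten(1)).view(32, ctx, N)
V = torch.cat([X_star, Y], dim=1)        # V_t = concat(X_t^*, Y_t)
x_hat = post_nn(V).squeeze(1)            # enhanced frame, shape (32, N)
```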
  • 19. Two-Step Network The pri-NN $g_{pri}(\cdot)$ maps each noisy input context to multiple output frames, $\mathbf{X}_t^* = g_{pri}(\mathbf{Y}_t)$, and its outputs are split into $\mathbf{X}_t^{pred}$ (multiple predictions $\tilde{\mathbf{x}}_t^{(m)}$ of the current frame $\mathbf{x}_t$, produced from differently shifted input contexts) and $\mathbf{X}_t^{CI}$ (predictions of the neighbouring frames, i.e., context information). Why multiple outputs for the pri-NN? Because we get multiple predictions $\mathbf{X}_t^{pred}$ for $\mathbf{x}_t$, and from multiple predictions we can adopt a boosting method. Simplest boosting method (see the sketch below): $\hat{\mathbf{x}}_t = \frac{1}{M}\sum_m \tilde{\mathbf{x}}_t^{(m)}$. [Figure: the pri-NN predicts x̃ frames from sliding windows of y; the stacked outputs form 𝐗_t^pred and 𝐗_t^CI, which together with 𝐘_t feed the post-NN via 𝐕_t  x̂_t.]
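The simplest boosting step is just an average over the M per-frame predictions, as this short numpy sketch shows (shapes are illustrative; the pri-NN producing the shifted-context predictions is omitted).

```python
# The simplest boosting step: average the M predictions of frame t.
import numpy as np

M, N = 5, 257                       # predictions per frame, LPS dimension
x_pred = np.random.randn(M, N)      # x_tilde_t^(m): M estimates of frame x_t
x_hat = x_pred.mean(axis=0)         # x_hat_t = (1/M) * sum_m x_tilde_t^(m)
```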
  • 20. Two-Step Network Drawback 1 of the simplest boosting method $\hat{\mathbf{x}}_t = \frac{1}{M}\sum_m \tilde{\mathbf{x}}_t^{(m)}$: the predictions $\tilde{\mathbf{x}}_t^{(1)}, \tilde{\mathbf{x}}_t^{(2)}, \tilde{\mathbf{x}}_t^{(3)}, \ldots$ of the same frame are produced from different input contexts, so their data distributions across the frequency dimension can differ from each other.
  • 21. Two-Step Network Drawback 2: simple averaging cannot use $\mathbf{X}_t^{CI}$, which is highly correlated with $\mathbf{x}_t$, because it corresponds to the neighbouring frames rather than to the current frame. Drawback 3: simple averaging cannot use $\mathbf{Y}_t$, so the masking method cannot be modeled implicitly, i.e., $\hat{\mathbf{x}}_t = f_{post}(\mathbf{X}_t^*, \mathbf{Y}_t) \approx f_{mask}(\mathbf{x}_t, \mathbf{y}_t)$ is out of reach. Moreover, depending on the noise type, $\mathbf{Y}_t$ can contain intact clean speech features in some frequency bands.
  • 22. Two-Step Network Idea: use all the information related to $\mathbf{x}_t$, namely $\mathbf{X}_t^{pred}$, $\mathbf{X}_t^{CI}$, and $\mathbf{Y}_t$, by stacking it into $\mathbf{V}_t = \mathrm{concat}(\mathbf{X}_t^*, \mathbf{Y}_t)$. However, the simple averaging method cannot be applied to $\mathbf{V}_t$ because of $\mathbf{X}_t^{CI}$ and $\mathbf{Y}_t$. Instead, convolution-based boosting is used: the convolution filters out noisy information while aggregating target-related information from each feature vector.
  • 23. Two-Step Network Convolution-based boosting (a code sketch follows): $\mathbf{H}_t^{l} = f\!\left(\mathbf{K}^{l} \ast \mathbf{H}_t^{l-1}\right)$, elementwise $\mathbf{H}_t^{l}(j, o) = f\!\Big(\sum_{i=1}^{I^l}\sum_{s=1}^{S^l} \mathbf{K}^{l}(s, i, o)\, \mathbf{H}_t^{l-1}(j+s-1, i)\Big)$, where $\mathbf{H}_t^{l} \in \mathbb{R}^{N \times O^l}$ and $\mathbf{K}^{l} \in \mathbb{R}^{S^l \times I^l \times O^l}$ are the output feature maps and the convolutional kernel, respectively, and $S^l$, $I^l$, and $O^l$ denote the size of the convolutional kernel and the numbers of input and output feature maps, respectively, of the l-th convolutional layer.
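This layer pattern can be made concrete with a 1-D convolution stack over frequency whose input channels are the vectors stacked in $\mathbf{V}_t$. The sketch below borrows the cTSN post-NN configuration {(256, 5, 1), (128, 5, 1), (64, 5, 1), (1, 5, 1)} from the later setup slide; the channel count of $\mathbf{V}_t$ and the padding choice are assumptions.

```python
# A 1-D convolution stack over frequency; input channels are the stacked
# prediction/noisy vectors in V_t. Layer pattern follows the cTSN setting.
import torch
import torch.nn as nn

N, C_in = 257, 10                   # frequency bins, stacked vectors in V_t

post_nn = nn.Sequential(
    nn.Conv1d(C_in, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(256, 128, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(128, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(64, 1, kernel_size=5, padding=2),    # collapses to x_hat_t
)

V = torch.randn(32, C_in, N)        # batch of stacked inputs V_t
x_hat = post_nn(V).squeeze(1)       # (32, N) enhanced LPS frames
```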
  • 24. Two-Step Network Loss function: $J_{loss} = J_{post} + \lambda J_{pri} = \sum_t \left( \|\hat{\mathbf{x}}_t - \mathbf{x}_t\|_2^2 + \lambda \|\mathbf{X}_t^{*} - \mathbf{X}_t^{prior}\|_F^2 \right)$, where $\mathbf{X}_t^{prior} = [\mathbf{x}_{t-\tau}, \ldots, \mathbf{x}_t, \ldots, \mathbf{x}_{t+\tau}]$. If λ is set to 0, we cannot expect $\mathbf{X}_t^*$ to contain $\mathbf{X}_t^{pred}$, because there is no longer any guarantee that $\mathbf{X}_t^{pred}$ includes predictions of $\mathbf{x}_t$; the pri-NN then loses its characteristic as a set of weak predictors, and the boosting effect of the TSN over multiple predictions is negated. A loss sketch follows.
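The loss translates directly into a few lines of PyTorch. A minimal sketch; the λ value, reduction mode, and tensor shapes are illustrative.

```python
# A minimal PyTorch sketch of the two-term TSN loss.
import torch.nn.functional as F

def tsn_loss(x_hat, x, X_star, X_prior, lam=0.1):
    j_post = F.mse_loss(x_hat, x, reduction="sum")        # ||x_hat_t - x_t||^2
    j_pri = F.mse_loss(X_star, X_prior, reduction="sum")  # ||X_t^* - X_t^prior||_F^2
    return j_post + lam * j_pri                           # J_post + lambda * J_pri
```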
  • 25. Experimental Setup Training phase: • Speech dataset: 4,620 utterances from the training set of the TIMIT corpus. • Noise dataset: 100 noise types from the HU dataset and 50 types from a sound-effect library. • Noise addition was conducted with an SNR randomly selected between −5 and 20 dB. • This procedure is repeated until the length of the entire training dataset is approximately 50 h. • 90% of the training dataset is used for training the model and the remaining 10% for validation. Test phase, D1 dataset: • Speech dataset: 192 utterances from the test set of the TIMIT corpus. • Noise dataset: 15 noise types from the NOISEX-92 corpus (jet cockpit1, jet cockpit2, destroyer engine, destroyer operations, F-16 cockpit, factory1, factory2, HF channel, military vehicle, M109 tank, machine gun, pink, Volvo, speech babble, and white). • Noise addition was conducted with SNRs of −5, 0, 5, and 10 dB. D2 dataset: • Real-world dataset recorded with a Galaxy 8.
  • 26. Experimental Setup Additional information: • Sampling rate: 8 kHz. • Window shift and length: 10 and 25 ms. • Log-power spectra (LPS) were used as acoustic features. • Z-score normalization was conducted across the frequency dimension of the LPS. • When reconstructing the waveform, the noisy phase information was used (see the sketch below). Baseline methods: • Deep neural network (DNN) • DNN with skip connections (sDNN) • Ideal ratio mask with DNN (IRM) • Fully convolutional neural network (FCN) • Long short-term memory recurrent neural network (LSTM)
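The feature pipeline on this slide can be sketched end to end with scipy. The sketch below assumes per-frequency-bin z-scoring over time (one reading of "across the frequency dimension") and illustrative epsilon values; it is not the authors' code.

```python
# A scipy/numpy sketch of the slide's pipeline (8 kHz, 25 ms window, 10 ms
# shift): LPS analysis, z-scoring, and resynthesis with the noisy phase.
import numpy as np
from scipy.signal import stft, istft

def lps_features(y, fs=8000):
    _, _, Y = stft(y, fs, nperseg=200, noverlap=120)   # 25 ms frames, 10 ms hop
    lps = np.log(np.abs(Y) ** 2 + 1e-12)
    mu = lps.mean(axis=1, keepdims=True)
    sd = lps.std(axis=1, keepdims=True)
    return (lps - mu) / (sd + 1e-12), np.angle(Y), (mu, sd)

def reconstruct(lps_norm, noisy_phase, stats, fs=8000):
    mu, sd = stats
    mag = np.exp((lps_norm * sd + mu) / 2.0)           # undo z-score, undo LPS
    _, y_hat = istft(mag * np.exp(1j * noisy_phase), fs,
                     nperseg=200, noverlap=120)
    return y_hat
```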
  • 27. Experimental Setup
TSN: pri-NN: τ = 4, 3 hidden layers = {1024, 1024, 1024}; post-NN: 8 convolutional layers = {(256, 5, 1), (128, 5, 1), (64, 5, 1), (32, 5, 1), (32, 5, 1), (32, 5, 1), (32, 5, 1), (1, 5, 1)}.
Compact TSN (cTSN): pri-NN: τ = 4, 3 hidden layers = {512, 512, 512}; post-NN: 4 convolutional layers = {(256, 5, 1), (128, 5, 1), (64, 5, 1), (1, 5, 1)}.
Model size comparison (in millions of parameters):
| DNN | sDNN | IRM | FCN | LSTM | TSN | cTSN |
| 11.03 | 11.03 | 11.03 | 0.28 | 13.24 | 4.13 | 1.87 |
Evaluation metrics: • the perceptual evaluation of speech quality (PESQ); • the short-time objective intelligibility (STOI, in %); • the segmental signal-to-noise ratio (SSNR); • the adopted log-spectral distortion (LSD).
  • 28. Experimental Results – Investigation of the TSN framework
• TSN-1 was trained with $\mathbf{V}_t$ in which $\mathbf{X}_t^*$ is replaced by $\mathbf{X}_t^{pred}$, to observe the influence of $\mathbf{X}_t^{CI}$.
• TSN-2 was trained with $\mathbf{V}_t$ from which $\mathbf{Y}_t$ is removed, to investigate the effect of implicitly modeling the masking method.
• TSN-3, the proposed method, was trained with the full $\mathbf{V}_t$.
Findings: using the full $\mathbf{V}_t$ is effective; training the pri-NN well (the λ-weighted term in $J_{loss} = \sum_t(\|\hat{\mathbf{x}}_t - \mathbf{x}_t\|_2^2 + \lambda\|\mathbf{X}_t^* - \mathbf{X}_t^{prior}\|_F^2)$) is more important than the post-NN; and boosting is important for the performance improvement.
  • 29. Experimental Results – Performance Evaluation – D1
PESQ per noise type:
| Model | babble | buccaneer1 | buccaneer2 | destroyer | destroyerops | f16 | factory1 | factory2 | hfchannel | leopard | m109 | machinegun | pink | volvo | white | Average |
| Noisy | 1.898 | 2.038 | 1.847 | 2.064 | 2.036 | 2.094 | 1.980 | 2.402 | 2.174 | 2.752 | 2.511 | 2.913 | 1.997 | 3.510 | 1.997 | 2.281 |
| FNN | 2.223 | 2.272 | 2.099 | 2.266 | 2.371 | 2.345 | 2.319 | 2.645 | 2.318 | 2.842 | 2.699 | 3.000 | 2.314 | 3.571 | 2.270 | 2.504 |
| SDNN1 | 2.205 | 2.236 | 2.117 | 2.253 | 2.363 | 2.332 | 2.306 | 2.615 | 2.267 | 2.785 | 2.657 | 2.945 | 2.297 | 3.739 | 2.260 | 2.492 |
| LSTM | 2.233 | 2.145 | 1.985 | 2.152 | 2.384 | 2.241 | 2.293 | 2.604 | 1.997 | 2.666 | 2.695 | 2.944 | 2.220 | 3.647 | 2.301 | 2.434 |
| FCN | 2.075 | 1.979 | 1.986 | 2.248 | 2.068 | 2.118 | 2.019 | 2.159 | 2.316 | 2.377 | 2.201 | 2.349 | 1.997 | 2.603 | 2.146 | 2.176 |
| IRM | 2.198 | 2.220 | 2.110 | 2.248 | 2.358 | 2.322 | 2.309 | 2.610 | 2.271 | 2.781 | 2.648 | 2.932 | 2.300 | 3.731 | 2.266 | 2.487 |
| cTSN | 2.273 | 2.335 | 2.242 | 2.351 | 2.423 | 2.462 | 2.378 | 2.721 | 2.506 | 3.016 | 2.783 | 3.066 | 2.357 | 3.566 | 2.316 | 2.586 |
| TSN | 2.264 | 2.407 | 2.245 | 2.379 | 2.439 | 2.442 | 2.400 | 2.765 | 2.485 | 3.016 | 2.826 | 3.117 | 2.431 | 3.653 | 2.378 | 2.616 |
Results by SNR:
PESQ:
| Model | −5 dB | 0 dB | 5 dB | 10 dB | Avr. |
| Noisy | 1.627 | 1.926 | 2.247 | 2.572 | 2.093 |
| DNN | 2.187 | 2.525 | 2.827 | 3.104 | 2.661 |
| sDNN1 | 2.158 | 2.504 | 2.813 | 3.096 | 2.643 |
| IRM | 2.155 | 2.499 | 2.807 | 3.089 | 2.638 |
| FCN | 1.9 | 2.178 | 2.451 | 2.703 | 2.308 |
| LSTM | 2.073 | 2.445 | 2.783 | 3.085 | 2.597 |
| cTSN | 2.255 | 2.597 | 2.907 | 3.191 | 2.738 |
| TSN | 2.281 | 2.629 | 2.939 | 3.225 | 2.769 |
STOI (%):
| Model | −5 dB | 0 dB | 5 dB | 10 dB | Avr. |
| Noisy | 60.62 | 70.5 | 79.86 | 87.77 | 74.69 |
| DNN | 67.57 | 77.57 | 84.95 | 90.01 | 80.03 |
| sDNN1 | 66.89 | 77.03 | 84.66 | 90.12 | 79.67 |
| IRM | 66.92 | 77.02 | 84.67 | 90.14 | 79.69 |
| FCN | 60.51 | 69.32 | 76.56 | 82.1 | 72.12 |
| LSTM | 66.15 | 77.04 | 85.15 | 90.62 | 79.74 |
| cTSN | 69.51 | 79.41 | 86.63 | 91.68 | 81.81 |
| TSN | 70.47 | 80.38 | 87.45 | 92.31 | 82.66 |
SSNR:
| Model | −5 dB | 0 dB | 5 dB | 10 dB | Avr. |
| Noisy | −7.374 | −5.632 | −3.241 | −0.259 | −4.127 |
| DNN | −1.463 | 0.167 | 1.903 | 3.54 | 1.037 |
| sDNN1 | −1.641 | 0.123 | 2.017 | 3.865 | 1.091 |
| IRM | −1.641 | 0.118 | 2.035 | 3.873 | 1.096 |
| FCN | −0.412 | 0.719 | 1.615 | 2.321 | 1.061 |
| LSTM | −2.420 | −0.267 | 1.757 | 3.545 | 0.654 |
| cTSN | −0.619 | 1.145 | 3.024 | 4.825 | 2.094 |
| TSN | −0.820 | 1.037 | 3.02 | 4.911 | 2.037 |
LSD:
| Model | −5 dB | 0 dB | 5 dB | 10 dB | Avr. |
| Noisy | 2.239 | 2.067 | 1.828 | 1.545 | 1.919 |
| DNN | 1.539 | 1.369 | 1.21 | 1.077 | 1.299 |
| sDNN1 | 1.567 | 1.397 | 1.242 | 1.11 | 1.329 |
| IRM | 1.566 | 1.393 | 1.235 | 1.102 | 1.324 |
| FCN | 1.903 | 2.02 | 2.048 | 2.02 | 1.998 |
| LSTM | 1.693 | 1.449 | 1.252 | 1.099 | 1.373 |
| cTSN | 1.479 | 1.319 | 1.173 | 1.048 | 1.255 |
| TSN | 1.478 | 1.307 | 1.156 | 1.027 | 1.242 |
Average processing time per second of speech (ms), on an Intel Core i7-6700K workstation with an Nvidia GTX 1080 Ti:
| DNN | SDNN1 | LSTM | FCN | IRM | cTSN | TSN |
| 13.91 | 14.15 | 63.98 | 30.09 | 13.54 | 13.10 | 23.87 |
[Figure: spectrograms of the noisy input (white noise, 5 dB) and of the DNN and cTSN outputs.]
  • 30. Experimental Results – Preference Test – D2 Number of participants: 20. Test: for each of 20 pairs (noisy, DNN, TSN), choose the best sample with respect to noise reduction and speech distortion. Result: TSN 85%, DNN 15% ($p < 10^{-18}$). [Audio demo: noisy, TSN, and DNN samples.]