• Do not compare results across different tables!
– Configurations may differ
• Most results shown here can be found in:
Takuya Yoshioka and Mark J. F. Gales, “Environmentally
robust ASR front-end for deep neural network acoustic
models,” Computer Speech and Language, vol. 31, no. 1, pp.
65-86, May 2015
1. Motivation
2. Corpus
• AMI meeting corpus
3. Baseline systems
• SI and SAT set-ups
4. Assessment of environmental robustness of
DNN acoustic models
5. Front-end techniques
6. Combined effects
Little investigation done
• Multi-party interaction
– 4 participants in each meeting
• Multi-channel recordings
– Distant microphones – only first channel used
– Head-set & lapel microphones
• 2 recording set-ups
– 70h scenario-based meetings
– 30h real meetings
• Different rooms
• Multiple sources of distortion
– Reverberation
– Additive noise
– Overlapping speech
• Moving speakers
• Many non-natives
• SI: speaker independent
– For online transcription
– DNN-HMM hybrid
• SAT: speaker adaptive training
– For offline transcription
– MLP tandem
• Manual segmentations used
• Overlapping segments ignored
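State output distributions modelled with
– GMM or
– DNN
HMM likelihood of the observation sequence:
$$p(\mathbf{X}) = \sum_{\mathbf{q} \in Q} P(q_0) \prod_{t=1}^{T} P(q_t \mid q_{t-1})\, p(\mathbf{x}_t \mid q_t)$$
GMM state output distribution:
$$p(\mathbf{x}_t \mid j) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}^{(jm)}, \boldsymbol{\Sigma}^{(jm)})$$
DNN state output distribution:
$$p(\mathbf{x}_t \mid j) \propto \frac{p(j \mid \mathbf{x}_t)}{p(j)}$$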
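The DNN case above can be sketched in a few lines: the network's state posteriors are divided by the state priors to give scaled likelihoods that the HMM decoder can use. This is a minimal illustration, not the deck's implementation; the function name and the toy numbers are assumptions.

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Return log p(x_t | j) up to an additive constant per frame.

    log_posteriors: (num_frames, num_states) array of log p(j | x_t)
    log_priors:     (num_states,) array of log p(j)
    """
    return log_posteriors - log_priors

# Toy values (hypothetical): two frames, three HMM states.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.6, 0.3]])
priors = np.array([0.5, 0.3, 0.2])
scaled = scaled_log_likelihoods(np.log(posteriors), np.log(priors))
```

In practice the priors are usually estimated from state occupancy counts in the training alignment.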
• Discriminative pre-training
• Cross entropy fine-tuning
• Discriminative pre-training
• Trained on Tesla K20
• cuBLAS 5.5 used
• Mini-batch size: 800 frames
• Learning rate: “newbob” scheduling
• 10% held-out data for CV
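As a rough sketch of what "newbob" scheduling typically means: the learning rate stays fixed until the gain on the held-out CV set drops below a threshold, is then halved every epoch, and training stops once the gain drops below a threshold again. The thresholds, initial rate, and accuracy values below are assumptions for illustration; the slides do not state them.

```python
def newbob_learning_rates(cv_accuracies, initial_lr,
                          start_halving_gain=0.5, stop_gain=0.5):
    """Return the learning rate used in each epoch under a newbob-style rule.

    cv_accuracies: held-out frame accuracy (%) measured after each epoch.
    Thresholds are absolute accuracy gains (assumed values, not from the slides).
    """
    rates = []
    lr = initial_lr
    halving = False
    previous = None
    for accuracy in cv_accuracies:
        rates.append(lr)  # rate used for the epoch that produced this accuracy
        if previous is not None:
            gain = accuracy - previous
            if halving and gain < stop_gain:
                break  # second small gain: stop training
            if gain < start_halving_gain:
                halving = True  # first small gain: start halving
        previous = accuracy
        if halving:
            lr *= 0.5
    return rates

# Hypothetical CV accuracies for six epochs.
rates = newbob_learning_rates([40.0, 45.0, 47.0, 47.3, 48.5, 48.7],
                              initial_lr=0.008)
```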
System            Parameterisation   %WER
                                     Dev    Eval   Avg
MPE GMM-HMM       HLDA               54.7   55.6   55.2
DNN-HMM hybrid    FBANK              43.5   42.6   43.1
This work                            40.0   39.3   39.7
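For readers unfamiliar with the metric in these tables, %WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a generic illustration with made-up sentences; it is not the scoring tool used for the reported numbers.

```python
def word_error_rate(reference, hypothesis):
    """Return %WER: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the processed reference prefix and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (ref_word != hyp_word)))  # substitution
        previous = current
    return 100.0 * previous[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```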
Data set   Parameterisation   %WER
                              Dev    Eval   Avg
SDM        FBANK              43.5   42.6   43.1
IHM        FBANK              28.2   24.6   26.4
• 39.2% of the errors caused by acoustic distortion
• DNN-HMMs not so robust
• Discriminative pre-training
• Cross entropy fine-tuning
• Discriminative pre-training
Alignment   DNN input   %WER
                        Dev    Eval   Avg
SDM         IHM         30.6   27.0   28.8
IHM         SDM         41.8   40.8   41.3
IHM         SDM         41.7   40.6   41.2
Using 648-2,000^5-4,000 DNN:
DNN training more sensitive to noise than state alignment
Speech enhancement
Feature transformation
Multi-stream features
Previous work
– Beamforming yields gains
– No investigation on single-microphone algorithms
• Based on linear time (almost) invariant filters
• Applied to complex-valued STFT coefficients
• The filters automatically adjusted using observations
– WPE for 1ch dereverberation (NTT’s work)
– BeamformIt for denoising (ICSI’s work)
• 8 microphones used, dedicated to meetings
• Unlikely to produce irregular transitions
$$y_{f,t} = x_{f,t} - \sum_{k=T_0}^{T_1} g_{f,k}\, x_{f,t-k}$$
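The filtering step above can be sketched as follows, assuming the filter taps are already known. Estimating the taps from the observations (as WPE does) is not shown, and the function name, array layout, and toy signal are assumptions for illustration.

```python
import numpy as np

def apply_delayed_linear_filter(x, g, first_lag):
    """Subtract a per-bin linear prediction of the current frame from past frames.

    x: (num_bins, num_frames) complex STFT coefficients
    g: (num_bins, num_taps) filter taps; tap i acts on lag first_lag + i
    """
    num_frames = x.shape[1]
    y = x.astype(complex)  # astype returns a copy, so x is left untouched
    for i in range(g.shape[1]):
        lag = first_lag + i
        if lag >= num_frames:
            break
        # y[f, t] -= g[f, i] * x[f, t - lag] for every t >= lag
        y[:, lag:] -= g[:, i:i + 1] * x[:, :num_frames - lag]
    return y

# Toy input (hypothetical): one frequency bin containing a single impulse.
x = np.zeros((1, 6), dtype=complex)
x[0, 0] = 1.0
g = np.array([[0.5, 0.25]], dtype=complex)
y = apply_delayed_linear_filter(x, g, first_lag=2)
```

Because the same taps are applied to every frame of an utterance, the output varies smoothly over time, which fits the slide's remark that irregular transitions are unlikely.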
            Dev                             Eval
Alignment   SDM    +Derev   BFIt (8mics)    SDM    +Derev   BFIt (8mics)
MPE         43.8   41.8     38.6            43.0   41.3     36.6
Hybrid      43.5   41.7     38.8            43.3   41.4     36.7
• Dereverberation helps even with a single microphone
• Multi-microphone beamforming works well
DNN size   Context frames   Dev               Eval
                            SDM    +Derev     SDM    +Derev
1,000^5    9                43.8   41.8       43.0   41.3
1,500^5    9                43.5   42.0       42.6   41.1
1,500^5    13               42.8   41.8       42.9   41.2
1,500^5    19               43.0   41.7       42.9   41.2
2,000^5    9                43.8   41.3       42.9   40.4
4.7% gain from 1ch dereverberation (relative)
Speech enhancement
Feature transformation
Multi-stream features
No positive results reported previously
• Applied to magnitude spectra
• Cross terms (often) ignored
• Frame-by-frame modification
– Harmful for DNN?
• Noise estimated using long-term statistics
– IMCRA (used here), minimum statistics, etc
• Deltas from un-enhanced speech
– Essential for obtaining gains
$$|y_{f,t}|^2 = |x_{f,t}|^2 + |n_{f,t}|^2$$
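Under the additive power model above (cross terms ignored), a basic power spectral subtraction can be sketched as below. The noise power is assumed to be given; the IMCRA noise tracker used in the deck is not implemented here, and the flooring constant is an assumption.

```python
import numpy as np

def power_spectral_subtraction(noisy_power, noise_power, floor=0.01):
    """Estimate clean-speech power |x|^2 from |y|^2 and an estimate of |n|^2.

    Both inputs are (num_bins, num_frames) arrays of non-negative values.
    The result is floored at a fraction of the noisy power so it stays positive.
    """
    estimate = noisy_power - noise_power
    return np.maximum(estimate, floor * noisy_power)

# Toy values (hypothetical): one bin where speech dominates, one where noise does.
noisy = np.array([[4.0], [1.0]])
noise = np.array([[1.0], [2.0]])
clean = power_spectral_subtraction(noisy, noise)
```

The floor is applied frame by frame, which is one reason this kind of processing can introduce the frame-level irregularities the slide warns about.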
• Applied to FBANK features
• The following mismatch function used
• Frame-by-frame modification
• Noise model estimated with EM
• Deltas from un-enhanced speech
$$\mathbf{y}_t = \mathbf{x}_t + \mathbf{h} + \log\left(1 + \exp(\mathbf{n}_t - \mathbf{x}_t - \mathbf{h})\right)$$
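A small numerical sketch of the standard log-mel mismatch function may help: clean speech x, channel h, and noise n combine in the linear filterbank domain, so in the log domain the noisy feature is log(exp(x + h) + exp(n)). The function names and toy values are assumptions; the EM noise estimation and the VTS linearisation are not shown.

```python
import numpy as np

def corrupt_log_mel(x, n, h):
    """Noisy log-mel feature from clean speech x, noise n, and channel h."""
    # Numerically stable form of log(exp(x + h) + exp(n)).
    return np.logaddexp(x + h, n)

def corrupt_log_mel_direct(x, n, h):
    """Same quantity written as x + h + log(1 + exp(n - x - h))."""
    return x + h + np.log(1.0 + np.exp(n - x - h))

# Toy log-mel values (hypothetical).
x = np.array([2.0, 0.5, -1.0])
n = np.array([0.0, 0.5, 1.0])
h = 0.3
y = corrupt_log_mel(x, n, h)
```

When speech dominates (x + h much larger than n) the output is close to x + h, and when noise dominates it is close to n.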
Enhancement target     %WER
Spectrum   Feature     Dev    Eval   Avg
N          N           42.0   41.1   41.6
Y          N           41.3   40.9   41.1
N          Y           41.4   40.5   41.0
Y          Y           42.0   41.0   41.5
• Small consistent gains
• Different methods should not be connected
Enhancement target                              %WER
Spectrum   Feature                              Dev    Eval   Avg
N          N                                    42.0   41.1   41.6
Y          N                                    41.3   40.9   41.1
N          Y                                    41.4   40.5   41.0
Y          Y                                    42.0   41.0   41.5
Y          Y (using multi-stream approach)      41.4   40.4   40.9
Speech enhancement
Feature transformation
Multi-stream features
• Frame level
– FMPE, RDT, FE-CMLLR
– Seems to be subsumed by DNN
• Speaker (or environment) level
– Global CMLLR, LIN, fDLR, VTLN
– Multiple decoding passes required → SAT
• Utterance level
– Single-pass decoding → SI
• Seems robust against supervision errors
• STC transform used to deal with correlations:
$$\mathbf{y}_t^{(s)} = \mathbf{A}^{(s)} \mathbf{L}\, \mathbf{x}_t^{(s)} + \mathbf{b}^{(s)}$$
Form of speaker transform   %WER
                            Dev    Eval   Avg
None (SI)                   42.6   40.2   41.4
Full                        37.4   37.4   37.4
Block diagonal              37.3   36.6   37.0
• ~10% relative gains obtained
• “Block diagonal” outperforms “full”
Form of speaker transform   %WER
                            Dev    Eval   Avg
None (SI)                   42.6   40.2   41.4
Full                        37.4   37.4   37.4
Block diagonal              37.3   36.6   37.0
On IHM data set:
None (SI)                   27.8   24.2   26.0
Full                        23.8   21.6   22.7
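To make the "full" versus "block diagonal" distinction concrete, the sketch below builds a block-diagonal matrix and applies a speaker-level affine transform to a sequence of feature vectors. Which feature streams form the blocks is an assumption here (for example static, first-order and second-order deltas), and estimating the transform is not shown.

```python
import numpy as np

def block_diagonal(blocks):
    """Assemble square blocks into one block-diagonal matrix."""
    size = sum(block.shape[0] for block in blocks)
    matrix = np.zeros((size, size))
    offset = 0
    for block in blocks:
        dim = block.shape[0]
        matrix[offset:offset + dim, offset:offset + dim] = block
        offset += dim
    return matrix

def apply_speaker_transform(features, A, b, L=None):
    """Compute y_t = A L x_t + b for every row x_t of features (num_frames, dim)."""
    if L is None:
        L = np.eye(features.shape[1])  # no decorrelating transform
    return features @ (A @ L).T + b

# Toy example (hypothetical sizes): a 2-dim block and a 1-dim block.
A = block_diagonal([2.0 * np.eye(2), 3.0 * np.eye(1)])
b = np.array([1.0, 0.0, -1.0])
x = np.ones((4, 3))
y = apply_speaker_transform(x, A, b)
```

A block-diagonal transform has fewer free parameters than a full one, which may be why it is estimated more reliably from limited speaker data.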
$$\mathbf{y}_t^{(u)} = \mathbf{A}^{(c(u))} \mathbf{L}\, \mathbf{x}_t^{(u)} + \mathbf{b}^{(c(u))}$$
c(u): cluster assigned to utterance u
Clustering performed using:
– utterance-specific iVectors
– k-means (GMM yielded similar performance figures)
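The quantised scheme can be sketched as: cluster the training utterances' iVectors with k-means, then at test time pick the transform of the nearest centroid and apply it in a single pass. This is a minimal illustration under assumptions: all names and data are hypothetical, the per-cluster transforms are taken as given, and the STC transform is assumed to have been applied to the features already.

```python
import numpy as np

def kmeans(vectors, num_clusters, num_iterations=20, seed=0):
    """Plain k-means on the rows of vectors; returns the centroids."""
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(vectors), size=num_clusters, replace=False)
    centroids = vectors[picks].astype(float)
    for _ in range(num_iterations):
        distances = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = distances.argmin(axis=1)
        for c in range(num_clusters):
            members = vectors[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return centroids

def nearest_cluster(ivector, centroids):
    """Index c(u) of the centroid closest to an utterance's iVector."""
    return int(((centroids - ivector) ** 2).sum(axis=1).argmin())

def apply_quantised_transform(features, ivector, centroids, transforms):
    """Apply the (A, b) pair of the utterance's cluster to its features."""
    A, b = transforms[nearest_cluster(ivector, centroids)]
    return features @ A.T + b

# Toy iVectors (hypothetical): two well-separated groups.
ivectors = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                     [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids = kmeans(ivectors, num_clusters=2)
```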
$$\mathbf{m}^{(u)} = \mathbf{m}^{(0)} + \mathbf{T}\mathbf{w}^{(u)}$$
Subspace representation of the deviation from UBM
[Figure: m(0), m(1), m(2), m(3) and the variability subspace]
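The subspace relation above can be illustrated numerically. The sketch builds an utterance supervector from a low-dimensional vector w and recovers w by least squares. Note the assumption: real iVector extraction works from Baum-Welch statistics under a probabilistic model, so the least-squares step is only a simplified stand-in, and all sizes and values are made up.

```python
import numpy as np

def utterance_supervector(ubm_mean, T, w):
    """m(u) = m(0) + T w(u): utterance mean as an offset within the subspace."""
    return ubm_mean + T @ w

def least_squares_ivector(supervector, ubm_mean, T):
    """Simplified point estimate of w(u) from a known supervector."""
    w, *_ = np.linalg.lstsq(T, supervector - ubm_mean, rcond=None)
    return w

# Toy dimensions (hypothetical): 6-dim supervector, 2-dim subspace.
rng = np.random.default_rng(0)
ubm_mean = rng.normal(size=6)
T = rng.normal(size=(6, 2))
w_true = np.array([0.5, -1.5])
m_u = utterance_supervector(ubm_mean, T, w_true)
w_est = least_squares_ivector(m_u, ubm_mean, T)
```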
#Clusters     %WER
              Dev    Eval   Avg
No QCMLLR     41.9   40.9   41.4
64            41.0   40.4   40.7
32            41.0   40.0   40.5
16            41.5   40.5   41.0
On IHM data set:
No QCMLLR     27.8   24.2   26.0
32            26.9   23.5   25.2
• Using 32 clusters yielded best performance
• Similar gains on both SDM and IHM
Speech enhancement
Feature transformation
Multi-stream features
• Originally proposed by Aachen for shallow MLP
tandem configurations
• Exploits DNN’s insensitivity to the increase in input
dimensionality
• (Hopefully) complement features masked by noise
• Allows multiple enhancement results to be
combined
• Four types of auxiliary features investigated:
– MFCC (Δ/Δ2)
– PLP
– Gammatone cepstra
• Different frequency warping
• STFT not used
– Intra-frame delta ( )
• Emphasises spectral peaks/dips
Feature set                                        #features   %WER
                                                               Dev    Eval   Avg
FBANK+Δ+Δ2 (baseline)                              72          41.9   40.9   41.4
+PLP                                               85          40.7   40.3   40.5
+Gammatone                                         88          40.8   40.0   40.4
+MFCC                                              85          41.1   39.7   40.4
+MFCC+Δ+Δ2                                         111         40.6   40.2   40.4
+intra-frame delta (1st and 2nd order)             120         40.9   39.8   40.4
+MFCC + intra-frame delta (1st and 2nd order)      133         40.4   39.8   40.1
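A multi-stream input is built by concatenating the auxiliary features to the baseline vector frame by frame. The sketch below reproduces the dimensionalities in the table (72 baseline, 13 MFCC, 24 + 24 intra-frame deltas, 133 in total). The intra-frame delta shown is an assumed definition, a central difference across adjacent filterbank channels within a frame; the deck's exact formula is not recoverable from the slide text.

```python
import numpy as np

def intra_frame_delta(fbank):
    """Difference along the channel axis of (num_frames, num_channels) features.

    Edge channels are handled by replicating the first and last channel.
    """
    padded = np.pad(fbank, ((0, 0), (1, 1)), mode="edge")
    return 0.5 * (padded[:, 2:] - padded[:, :-2])

def stack_streams(*streams):
    """Concatenate feature streams that share the same number of frames."""
    return np.concatenate(streams, axis=1)

# Toy data (hypothetical): 5 frames, 24 filterbank channels rising linearly.
fbank = np.tile(np.arange(24.0), (5, 1))
delta = intra_frame_delta(fbank)
delta2 = intra_frame_delta(delta)
baseline = np.zeros((5, 72))   # stands in for FBANK plus time deltas
mfcc = np.zeros((5, 13))       # stands in for the MFCC stream
stacked = stack_streams(baseline, mfcc, delta, delta2)
```

A difference across channels is large where the spectrum changes quickly with frequency, which is consistent with the slide's note that the feature emphasises spectral peaks and dips.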
• Speech enhancement
– Linear filtering
– Spectral/feature enhancement
• Feature transformation
– Quantised CMLLR
– (Global CMLLR for SAT)
• Multi-stream features
Front-end                                      %WER
                                               Dev    Eval   Avg
FBANK baseline                                 43.1   42.4   42.8
+WPE                                           41.8   40.7   41.3
+MFCC + intra-frame delta (1st and 2nd order)  40.5   40.1   40.3
+IMCRA+FE-VTS                                  40.0   39.3   39.7
+QCMLLR                                        40.9   39.5   40.2
• Effects additive except for QCMLLR
• QCMLLR may work if applied to the entire feature set
System                       Parameterisation   %WER
                                                Dev    Eval   Avg
SAT GMM-HMM, MPE trained     HLDA               48.8   50.2   49.5
SAT tandem, MPE trained      FBANK              40.7   40.9   40.8
SI hybrid                    FBANK              43.5   42.6   43.1
• Outperforms SAT GMM-HMM
• Outperforms SI hybrid
Front-end         %WER
                  Dev    Eval   Avg
FBANK baseline    40.1   41.3   40.7
+WPE              38.9   39.3   39.1
+MFCC             38.5   38.5   38.5
+IMCRA+FE-VTS     38.4   38.7   38.6
+CMLLR            36.6   36.7   36.7
+CMLLR            36.9   37.0   37.0
+CMLLR            38.4   38.6   38.5
• Effects of WPE and CMLLR are additive
• Using auxiliary features yields small gains over CMLLR
features
• Denoising subsumed by CMLLR (as expected)
• Front-end processing approaches yield gains
over state-of-the-art DNN-based AMs
– Linear filtering (WPE, BeamformIt)
– Spectral/feature enhancement (IMCRA, FE-VTS)
– Feature transformation (QCMLLR, CMLLR)
– Multi-stream features
• Possible to combine different classes of
approaches
Environmentally robust ASR front end for DNN-based acoustic models

  • 1.
  • 2. • Do not compare results across different tables! – Configurations may differ • Most results shown here can be found in: Takuya Yoshioka and Mark J. F. Gales, “Environmentally robust ASR front-end for deep neural network acoustic models,” Computer Speech and Language, vol. 31, no. 1, pp. 65-86, May 2015
  • 3. 1. Motivation 2. Corpus • AMI meeting corpus 3. Baseline systems • SI and SAT set-ups 4. Assessment of environmental robustness of DNN acoustic models 5. Front-end techniques 6. Combined effects
  • 4.
  • 5.
  • 7. • Multi-party interaction – 4 participants in each meeting • Multi-channel recordings – Distant microphones – only first channel used – Head-set & lapel microphones • 2 recording set-ups – 70h scenario-based meetings – 30h real meetings
  • 8. • Different rooms • Multiple sources of distortion – Reverberation – Additive noise – Overlapping speech • Moving speakers • Many non-natives
  • 9. • SI : speaker independent – For online transcription – DNN-HMM hybrid • SAT: speaker adaptive training – For offline transcription – MLP tandem
  • 10.
  • 11. • Manual segmentations used • Overlapping segments ignored
  • 12.
  • 13. State output distributions modelled with – GMM or – DNN ¦ – / Q T t tttt qpqqPqPp q xX 1 10 )|()|()()|( ¦ M m jmjm mjmt Ncjp 1 )()( ),;()|( Σμxx )( )|( )|( jp jp jp t t x x
  • 14. Æ • Discriminative pre-training • Cross entropy fine-tuning • Discriminative pre-training
  • 15. • Trained on Telta K20 • cuBLAS 5.5 used • Mini-batch size: 800 frames • Learning rate: “newbob” scheduling • 10% held-out data for CV
  • 16. System Parame- terisation %WER Dev Eval Avg MPE GMM-HMM HLDA 54.7 55.6 55.2 DNN-HMM hybrid FBANK 43.5 42.6 43.1 This work 40.0 39.3 39.7
  • 17.
  • 18. %WER:
      Data set   Parameterisation   Dev    Eval   Avg
      SDM        FBANK              43.5   42.6   43.1
      IHM        FBANK              28.2   24.6   26.4
      • 39.2% of the errors caused by acoustic distortion • DNN-HMMs not so robust
  • 19.
  • 20. • Discriminative pre-training • Cross entropy fine-tuning • Discriminative pre-training
  • 21.
  • 22. %WER:
      Alignment   DNN input   Dev    Eval   Avg
      SDM         IHM         30.6   27.0   28.8
      IHM         SDM         41.8   40.8   41.3
      IHM         SDM         41.7   40.6   41.2
      Using 648-2,000^5-4,000 DNN: DNN training more sensitive to noise than state alignment
  • 23.
  • 26. Previous work – Beamforming yields gains – No investigation on single-microphone algorithms
  • 27. • Based on linear time (almost) invariant filters • Applied to complex-valued STFT coefficients • The filters automatically adjusted using observations – WPE for 1ch dereverberation (NTT’s work) – BeamformIt for denoising (ICSI’s work) • 8 microphones used, dedicated to meetings • Unlikely to produce irregular transitions
      $y_{f,t} = x_{f,t} - \sum_{k=T_0}^{T_1} g_{f,k}\, x_{f,t-k}$
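To make the linear-filtering idea concrete, the sketch below applies a fixed set of per-frequency filter taps to a single-channel STFT, assuming the filter subtracts a weighted sum of delayed frames from the current frame (the sign and the tap range are assumptions; the operators are not legible in the slide transcript). It only applies given taps; estimating them, which is what WPE actually does, is not shown. All names are hypothetical.

```python
import numpy as np

def apply_linear_filter(X, G, delay):
    """Apply y[f, t] = x[f, t] - sum_i G[f, i] * x[f, t - delay - i].

    X:     (F, T) complex STFT coefficients of the observed signal.
    G:     (F, K) complex filter taps, one set per frequency bin.
    delay: index of the first tap (number of frames skipped).
    Frames before the start of the signal are treated as zero.
    """
    num_frames = X.shape[1]
    Y = X.copy()
    for i in range(G.shape[1]):
        shift = delay + i
        if shift >= num_frames:
            break
        # Subtract the tap-weighted, delayed copy of the input.
        Y[:, shift:] -= G[:, i:i + 1] * X[:, :num_frames - shift]
    return Y

X_demo = np.ones((1, 5), dtype=complex)
G_demo = np.array([[0.5 + 0j]])
Y_demo = apply_linear_filter(X_demo, G_demo, delay=2)
```

Because the same taps are used for every frame, the output changes smoothly over time, which is consistent with the slide's remark that such filters are unlikely to produce irregular transitions.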
  • 28. 
      Alignment   Dev SDM   Dev +Derev   Dev BFIt (8mics)   Eval SDM   Eval +Derev   Eval BFIt (8mics)
      MPE         43.8      41.8         38.6               43.0       41.3          36.6
      Hybrid      43.5      41.7         38.8               43.3       41.4          36.7
      • Dereverberation helps even with single microphone • Multi-microphone beamforming works well
  • 29. 
      DNN size   Context frames   Dev SDM   Dev +Derev   Eval SDM   Eval +Derev
      1,000^5    9                43.8      41.8         43.0       41.3
      1,500^5    9                43.5      42.0         42.6       41.1
      1,500^5    13               42.8      41.8         42.9       41.2
      1,500^5    19               43.0      41.7         42.9       41.2
      2,000^5    9                43.8      41.3         42.9       40.4
      4.7% gain from 1ch dereverberation (relative)
  • 31. No positive results reported previously
  • 32. • Applied to magnitude spectra • Cross terms (often) ignored • Frame-by-frame modification – Harmful for DNN? • Noise estimated using long-term statistics – IMCRA (used here), minimum statistics, etc. • Deltas from un-enhanced speech – Essential for obtaining gains
      $|y_{f,t}|^2 = |x_{f,t}|^2 + |n_{f,t}|^2$
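Under the additive power model with cross terms ignored (observed power = clean power + noise power), the simplest enhancement is power spectral subtraction. The sketch below is a generic illustration of that idea, not the IMCRA-based method used in the slides: the noise power is assumed to be given, and the flooring constant is a hypothetical choice.

```python
import numpy as np

def power_spectral_subtraction(Y, noise_power, floor=0.01):
    """Estimate the clean magnitude spectrum from an observed STFT.

    Y:           (F, T) complex STFT of the observed (noisy) signal.
    noise_power: (F, T) or (F, 1) estimate of the noise power |n|^2.
    floor:       fraction of the observed power kept as a lower bound, so the
                 estimate never becomes negative (hypothetical value).
    """
    observed_power = np.abs(Y) ** 2
    clean_power = np.maximum(observed_power - noise_power,
                             floor * observed_power)
    return np.sqrt(clean_power)

Y_obs = np.array([[2.0 + 0j, 1.0 + 0j]])
noise = np.array([[1.0]])
magnitude = power_spectral_subtraction(Y_obs, noise)
```

The per-frame `maximum` is the kind of frame-by-frame modification the slide flags as potentially harmful for a DNN, since it can introduce abrupt changes between neighbouring frames.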
  • 33.
  • 34. • Applied to FBANK features • The following mismatch function used • Frame-by-frame modification • Noise model estimated with EM • Deltas from un-enhanced speech
      $\mathbf{y}_t = \mathbf{x}_t + \mathbf{h} + \log(1 + \exp(\mathbf{n}_t - \mathbf{x}_t - \mathbf{h}))$
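The mismatch function can be checked numerically. The sketch below assumes the standard log-spectral form y = x + h + log(1 + exp(n - x - h)), where x is clean speech, n is additive noise and h is a channel term, all in the log filter-bank domain; it confirms that this is the same as adding channel-filtered speech and noise in the linear domain. The numbers are made up.

```python
import numpy as np

def mismatch(x, n, h):
    """Noisy log filter-bank feature from clean speech x, noise n, channel h."""
    return x + h + np.log1p(np.exp(n - x - h))

x_clean = np.array([2.0, 0.5, -1.0])   # toy clean log energies
n_noise = np.array([0.0, 0.5, 1.0])    # toy noise log energies
h_chan = np.array([0.1, 0.1, 0.1])     # toy channel offset
y_noisy = mismatch(x_clean, n_noise, h_chan)

# Equivalent linear-domain view: exp(y) = exp(x + h) + exp(n).
y_linear = np.log(np.exp(x_clean + h_chan) + np.exp(n_noise))
```

When the noise is far below the speech the log term vanishes and y is close to x + h; when the noise dominates, y approaches n.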
  • 35. %WER by enhancement target:
      Spectrum   Feature   Dev    Eval   Avg
      N          N         42.0   41.1   41.6
      Y          N         41.3   40.9   41.1
      N          Y         41.4   40.5   41.0
      Y          Y         42.0   41.0   41.5
      • Small consistent gains • Different methods should not be connected
  • 36. %WER by enhancement target:
      Spectrum   Feature   Dev    Eval   Avg
      N          N         42.0   41.1   41.6
      Y          N         41.3   40.9   41.1
      N          Y         41.4   40.5   41.0
      Y          Y         42.0   41.0   41.5
      Y          Y         41.4   40.4   40.9   (using multi-stream approach)
  • 38. • Frame level – FMPE, RDT, FE-CMLLR – Seems to be subsumed by DNN • Speaker (or environment) level – Global CMLLR, LIN, fDLR, VTLN – Multiple decoding passes required → SAT • Utterance level – Single-pass decoding → SI
  • 39. • Seems robust against supervision errors • STC transform used to deal with correlations:
      $\mathbf{y}_t^{(s)} = \mathbf{A}^{(s)} \mathbf{L}\, \mathbf{x}_t^{(s)} + \mathbf{b}^{(s)}$
  • 40.
  • 41.
  • 42. %WER:
      Form of speaker transform   Dev    Eval   Avg
      None (SI)                   42.6   40.2   41.4
      Full                        37.4   37.4   37.4
      Block diagonal              37.3   36.6   37.0
      • ~10% relative gains obtained • “Block diagonal” outperforms “full”
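The “~10% relative gains” figure can be reproduced from the averaged WERs on this slide (41.4 for SI, 37.4 for the full transform, 37.0 for the block-diagonal transform). A minimal sketch; the helper name is an assumption.

```python
def relative_wer_reduction(baseline_wer, improved_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer

# Averaged %WER values taken from the slide above.
gain_full = relative_wer_reduction(41.4, 37.4)
gain_block = relative_wer_reduction(41.4, 37.0)
print(f"full: {gain_full:.1f}%  block diagonal: {gain_block:.1f}%")
```

Both values land close to 10%, with the block-diagonal transform slightly ahead.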
  • 43. %WER:
      Form of speaker transform   Dev    Eval   Avg
      None (SI)                   42.6   40.2   41.4
      Full                        37.4   37.4   37.4
      Block diagonal              37.3   36.6   37.0
      On IHM data set:
      None (SI)                   27.8   24.2   26.0
      Full                        23.8   21.6   22.7
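  • 44. $\mathbf{y}_t^{(u)} = \mathbf{A}^{(c(u))} \mathbf{L}\, \mathbf{x}_t^{(u)} + \mathbf{b}^{(c(u))}$, with $c: u \mapsto c(u)$ mapping utterance $u$ to a cluster. Clustering performed using: – utterance-specific iVectors – k-means (GMM yielded similar performance figures)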
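The quantised CMLLR idea can be sketched as two steps: pick the cluster whose k-means centroid is nearest to the utterance's iVector, then apply that cluster's affine transform to every frame. The sketch assumes the transform has the form y = A L x + b with a shared matrix L (the STC transform) and cluster-specific A and b; this form, the Euclidean distance, and all names and toy values are assumptions for illustration.

```python
import numpy as np

def assign_cluster(ivector, centroids):
    """Return c(u): index of the centroid nearest to the utterance iVector."""
    distances = np.linalg.norm(centroids - ivector, axis=1)
    return int(np.argmin(distances))

def qcmllr_transform(X, ivector, centroids, A, b, L):
    """Apply the transform of the utterance's cluster to all frames.

    X: (T, D) utterance features; centroids: (C, R) k-means centroids;
    A: (C, D, D) and b: (C, D) cluster transforms; L: (D, D) shared transform.
    """
    c = assign_cluster(ivector, centroids)
    # Row-vector form of y_t = A[c] @ L @ x_t + b[c].
    return X @ (A[c] @ L).T + b[c]

centroids_demo = np.array([[0.0, 0.0], [10.0, 10.0]])
A_demo = np.stack([np.eye(2), 2.0 * np.eye(2)])
b_demo = np.array([[0.0, 0.0], [1.0, 1.0]])
Y_q = qcmllr_transform(np.array([[1.0, 2.0]]), np.array([9.0, 8.0]),
                       centroids_demo, A_demo, b_demo, np.eye(2))
```

Because the cluster is chosen from the utterance itself and the transforms are pre-trained, no first decoding pass is needed, which is what makes this usable in the single-pass SI set-up.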
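  • 45. $\mathbf{m}^{(u)} = \mathbf{m}^{(0)} + \mathbf{T}\mathbf{w}^{(u)}$, where $\mathbf{T}\mathbf{w}^{(u)}$ is the subspace representation of the deviation from the UBM. [Figure: supervectors m(0), m(1), m(2), m(3) in the variability subspace]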
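The subspace model m(u) = m(0) + T w(u) says that each utterance's mean supervector differs from the UBM supervector only along the columns of T. The toy sketch below builds one supervector from a known w and recovers it by ordinary least squares. This is a deliberate simplification: real iVector extraction computes a posterior estimate from Baum-Welch statistics, which is not shown here. All values are made up.

```python
import numpy as np

# Hypothetical sizes: supervector dimension 6, subspace dimension 2.
m0 = np.array([1.0, 0.0, -1.0, 2.0, 0.5, 0.0])   # UBM mean supervector
T_mat = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0],
                  [2.0, 0.0],
                  [0.0, 2.0],
                  [1.0, -1.0]])                    # variability subspace
w_true = np.array([1.0, -2.0])                     # utterance factor w(u)

# Forward model: utterance-dependent supervector.
m_u = m0 + T_mat @ w_true

# Simplified point estimate of w(u): least-squares fit of the deviation.
w_hat, *_ = np.linalg.lstsq(T_mat, m_u - m0, rcond=None)
```

The low-dimensional vector w(u) is what gets clustered with k-means in the quantised CMLLR scheme.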
  • 46.
  • 47.
  • 48.
  • 49. %WER:
      #Clusters    Dev    Eval   Avg
      No QCMLLR    41.9   40.9   41.4
      64           41.0   40.4   40.7
      32           41.0   40.0   40.5
      16           41.5   40.5   41.0
      On IHM data set:
      No QCMLLR    27.8   24.2   26.0
      32           26.9   23.5   25.2
  • 50. • Using 32 clusters yielded best performance • Similar gains on both SDM and IHM
  • 52. • Originally proposed by Aachen for shallow MLP tandem configurations • Exploits DNN’s insensitivity to the increase in input dimensionality • (Hopefully) complement features masked by noise • Allows multiple enhancement results to be combined
  • 53. • Four types of auxiliary features investigated: – MFCC (Δ/Δ²) – PLP – Gammatone cepstra • Different frequency warping • STFT not used – Intra-frame delta • Emphasises spectral peaks/dips
  • 54. %WER:
      Feature set                                  #features   Dev    Eval   Avg
      FBANK+Δ+Δ² (baseline)                        72          41.9   40.9   41.4
      +PLP                                         85          40.7   40.3   40.5
      +Gammatone                                   88          40.8   40.0   40.4
      +MFCC                                        85          41.1   39.7   40.4
      +MFCC+Δ+Δ²                                   111         40.6   40.2   40.4
      +intra-frame deltas (1st, 2nd order)         120         40.9   39.8   40.4
      +MFCC+intra-frame deltas (1st, 2nd order)    133         40.4   39.8   40.1
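The multi-stream approach amounts to concatenating several feature streams frame by frame before feeding them to the DNN. The sketch below builds the 72-dimensional baseline (static FBANK plus Δ and Δ²) and appends an MFCC stream to reach 85 dimensions. The stream sizes of 24 and 13 are inferred from the feature counts in the table, and the regression-based delta with a window of 2 is an assumption; random arrays stand in for real features.

```python
import numpy as np

def delta(feats, window=2):
    """Regression-based temporal deltas of a (T, D) feature matrix."""
    num_frames = feats.shape[0]
    # Repeat the edge frames so every frame has a full context.
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = np.zeros_like(feats, dtype=float)
    for k in range(1, window + 1):
        ahead = padded[window + k:window + k + num_frames]
        behind = padded[window - k:window - k + num_frames]
        out += k * (ahead - behind)
    return out / denom

rng = np.random.default_rng(0)
fbank = rng.standard_normal((100, 24))   # stand-in for 24 log filter-bank outputs
mfcc = rng.standard_normal((100, 13))    # stand-in for 13 MFCCs

d1 = delta(fbank)
baseline = np.hstack([fbank, d1, delta(d1)])   # FBANK + delta + delta-delta
multi_stream = np.hstack([baseline, mfcc])     # baseline + MFCC stream
```

The DNN input simply grows from 72 to 85 dimensions per frame; no other change to the model is required, which is why the approach relies on the DNN tolerating higher input dimensionality.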
  • 55. • Speech enhancement – Linear filtering – Spectral/feature enhancement • Feature transformation – Quantised CMLLR – (Global CMLLR for SAT) • Multi-stream features
  • 57. %WER:
      Front-end                    Dev    Eval   Avg
      FBANK baseline               43.1   42.4   42.8
      +WPE                         41.8   40.7   41.3
      +MFCC+intra-frame deltas     40.5   40.1   40.3
      +IMCRA+FE-VTS                40.0   39.3   39.7
      +QCMLLR                      40.9   39.5   40.2
      • Effects additive except for QCMLLR • QCMLLR may work if applied to the entire feature set
  • 58.
  • 59. %WER:
      System                       Parameterisation   Dev    Eval   Avg
      SAT GMM-HMM, MPE trained     HLDA               48.8   50.2   49.5
      SAT tandem, MPE trained      FBANK              40.7   40.9   40.8
      SI hybrid                    FBANK              43.5   42.6   43.1
      • Outperforms SAT GMM-HMM • Outperforms SI hybrid
  • 61. %WER:
      Front-end         Dev    Eval   Avg
      FBANK baseline    40.1   41.3   40.7
      +WPE              38.9   39.3   39.1
      +MFCC             38.5   38.5   38.5
      +IMCRA+FE-VTS     38.4   38.7   38.6
      +CMLLR            36.6   36.7   36.7
      +CMLLR            36.9   37.0   37.0
      +CMLLR            38.4   38.6   38.5
      • Effects of WPE and CMLLR are additive • Using auxiliary features yields small gains over CMLLR features • Denoising subsumed by CMLLR (as expected)
  • 62. • Front-end processing approaches yield gains over state-of-the-art DNN-based AMs – Linear filtering (WPE, BeamformIt) – Spectral/feature enhancement (IMCRA, FE-VTS) – Feature transformation (QCMLLR, CMLLR) – Multi-stream features • Possible to combine different classes of approaches