Improving Speech Recognition Using Limited Accent Diverse
British English Training Data
With Acoustic Model and Data Selection
By Maryam Najafian
Supervisor: Prof. Martin Russell
University of Birmingham, UK
4th October 2016
Email: m.najafian@utdallas.edu
Motivation
1/12
Regional accents can be a problem for Speech Technology!
Overview
• Problems: (1) the multi-conditional data problem, (2) recognition of 14
regional accents of British English, (3) defining an approach to measure
accent difficulty
• Low-dimensional visualisation of the AID feature space reveals expected
relationships between regional accents.
• One approach to accent-robust ASR is adaptation to the speaker's accent,
using an online AID to select an accent-specific acoustic model [1, 2, 3]
(sketched after this slide)
• Another approach to accent-robust ASR is to use AID to analyse the training
data and apply data selection when training a DNN-based system [1, 2]
[1] M. Najafian, "Acoustic model selection for recognition of regional accented speech," Ph.D. dissertation, University of Birmingham, UK, 2016.
[2] M. Najafian et al., "Identification of British English regional accents using fusion of i-vector and multi-accent phonotactic systems," in Proc. Odyssey, 2016, pp. 132-139.
[3] M. Najafian et al., "Unsupervised model selection for recognition of regional accented speech," in Proc. Interspeech, 2014.
[4] M. Najafian et al., "Improving speech recognition using limited accent diverse British English training data with deep neural networks," in Proc. MLSP, 2016.
2/12
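As a concrete illustration of the first approach above (AID-driven acoustic model selection), here is a minimal Python sketch. Everything in it is a stand-in: the accent list is abbreviated, the AID is a toy nearest-template rule, and the "decoding" step just returns the name of the selected model; the real system uses the ACCDIST/i-vector classifiers and accent-adapted HMM sets described on the following slides.

```python
# Minimal sketch of approach 1: run an online accent-ID (AID) classifier on the
# incoming utterance and decode with the acoustic model matched to the predicted
# accent. All names and values below are illustrative placeholders.
import numpy as np

ACCENTS = [
    "glasgow", "liverpool", "birmingham", "newcastle",
    # ... the full ABI set covers 14 accent regions
]

def aid_classify(features: np.ndarray) -> str:
    """Toy AID: pick the accent whose (random) template is closest to the features."""
    rng = np.random.default_rng(0)
    templates = {a: rng.normal(size=features.shape) for a in ACCENTS}
    return min(ACCENTS, key=lambda a: float(np.linalg.norm(features - templates[a])))

def recognise(features: np.ndarray) -> str:
    accent = aid_classify(features)                      # unsupervised, online AID
    acoustic_model = f"wsjcam0-am-adapted-to-{accent}"   # accent-specific model (placeholder name)
    return acoustic_model                                # a real system would decode with this model

print(recognise(np.zeros(400)))
```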
Objectives
• This research is concerned with automatic speech recognition (ASR) for accented
speech, using a range of AID systems for GMM-HMM and DNN-HMM based
acoustic model selection
• Trained on the SI training set (92 speakers, 7,861 utterances) of the WSJCAM0
corpus of read British English speech
• Tested/adapted on the ABI corpus: 14 different accents, 285 speakers
3/12
Baseline AID System Design
• Phonotactic AID: 80.65% accuracy
• i-vector AID: 76.76% accuracy
• ACCDIST-SVM AID: 95% accuracy
4/12
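The sketch below shows, in outline only, the kind of SVM classification and score-level fusion these AID systems rely on. The features are random stand-ins for ACCDIST distance tables and i-vectors, and the fusion weight is an arbitrary assumption, so the output is meaningless; the point is the mechanics of combining posteriors from two accent classifiers.

```python
# Outline-only sketch of an SVM accent classifier plus score-level fusion of two
# AID systems (e.g. ACCDIST-SVM and i-vector). Features are synthetic stand-ins.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)
n_accents, per_accent = 14, 20
y = np.repeat(np.arange(n_accents), per_accent)              # accent labels
X_accdist = rng.normal(size=(len(y), 300))                   # stand-in ACCDIST features
X_ivec = rng.normal(size=(len(y), 400))                      # stand-in i-vectors

svm_accdist = SVC(kernel="linear", probability=True).fit(X_accdist, y)
svm_ivec = SVC(kernel="linear", probability=True).fit(X_ivec, y)

def fused_accent(accdist_feat, ivec, w=0.7):
    """Weighted sum of the two systems' class posteriors (weight w is assumed)."""
    p = (w * svm_accdist.predict_proba(accdist_feat[None, :])
         + (1 - w) * svm_ivec.predict_proba(ivec[None, :]))
    return int(np.argmax(p))

print(fused_accent(X_accdist[0], X_ivec[0]))                 # predicted accent index
```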
ACCDIST Accent ID feature space
5/12
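The original slide shows a 2-D plot of the ACCDIST feature space. A minimal sketch of how such a visualisation can be produced is given below, assuming synthetic speaker-level features and only a handful of accent labels; the real plot uses the ACCDIST features for all 14 ABI accents.

```python
# Minimal sketch of the feature-space visualisation: project speaker-level AID
# features to 2-D with PCA and plot them coloured by accent. Data are synthetic.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
accents = ["Glasgow", "Liverpool", "Birmingham", "SSE"]       # illustrative subset
X = np.vstack([rng.normal(loc=i, size=(20, 300)) for i in range(len(accents))])
labels = np.repeat(np.arange(len(accents)), 20)

proj = PCA(n_components=2).fit_transform(X)                   # 2-D projection
for i, name in enumerate(accents):
    pts = proj[labels == i]
    plt.scatter(pts[:, 0], pts[:, 1], s=12, label=name)
plt.legend()
plt.title("AID feature space, 2-D PCA (synthetic data)")
plt.show()
```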
GMM-HMM: Unsupervised Adaptation
6/12
GMM-HMM: Speaker versus Accent Adaptation
Supervised speaker versus accent adaptation
Unsupervised speaker versus accent adaptation
7/12
DNN-HMM versus GMM-HMM
???
8/12
Accent properties of WSJCAM0
using an i-vector based AID
9/12
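One way to read this slide: the trained i-vector AID is run over the WSJCAM0 training speakers to estimate the accent make-up of the nominally accent-neutral training set. The sketch below shows only that tabulation step, with a stub in place of the real i-vector AID from slide 4; speaker names, labels, and the toy rule are assumptions.

```python
# Hypothetical sketch: classify each WSJCAM0 training speaker with an i-vector
# AID stub and tabulate the predicted accents of the training set.
from collections import Counter
import numpy as np

def ivector_aid(ivec: np.ndarray) -> str:
    """Stand-in for the trained i-vector AID classifier."""
    return ["sse", "northern", "scottish"][int(abs(ivec.sum())) % 3]   # arbitrary toy rule

rng = np.random.default_rng(1)
speaker_ivectors = {f"spk{i:03d}": rng.normal(size=400) for i in range(92)}   # 92 SI speakers
accent_counts = Counter(ivector_aid(v) for v in speaker_ivectors.values())
print(accent_counts.most_common())
```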
DNN-based ASR vs. i-vector-based AID error rates
10/12
DNN-HMM: Extra Training Material (ETM) &
Extra Pre-Training Material (EPM)
The relationship between AID and ASR error rates motivated an analysis of the effect of
supplementing the WSJCAM0 training set with different types of accented speech
(a sketch of the two supplementation strategies follows this slide).
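Below is a schematic Python sketch of the two strategies, under the assumption (suggested by the names) that ETM adds the accented data to both pre-training and supervised training of the DNN, while EPM adds it to pre-training only. Paths and the zero-hour placeholder are illustrative, not the actual data lists; the 2.25 h and 8.96 h figures come from the summary slide.

```python
# Schematic sketch of ETM vs. EPM training-set construction (assumed semantics).
from dataclasses import dataclass

@dataclass
class DataBlock:
    path: str
    accent: str
    hours: float

wsjcam0 = DataBlock("wsjcam0/si_tr", "mixed British English", 0.0)   # hours left as a placeholder
glaswegian_extra = DataBlock("abi/glasgow", "Glaswegian", 2.25)      # most "difficult" accent
diverse_extra = DataBlock("abi/all_regions", "14-accent mix", 8.96)

def build_training_sets(strategy: str, extra: DataBlock):
    """Return (pre-training list, supervised training list) for a given strategy."""
    pretrain, train = [wsjcam0], [wsjcam0]
    if strategy == "EPM":            # Extra Pre-training Material only
        pretrain.append(extra)
    elif strategy == "ETM":          # Extra Training Material
        pretrain.append(extra)
        train.append(extra)
    return pretrain, train

print(build_training_sets("ETM", glaswegian_extra))
```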
Summary and publications
• To address the multi-accent learning problem in a deep learning acoustic
modelling framework with limited resources, this work introduced a
concept called accent difficulty to analyse the training set.
• A relative gain of 46.85% is achieved in recognising the Accents of the British
Isles (ABI) corpus by applying a baseline DNN model rather than a Gaussian
mixture model.
• Our results show that, across all accent regions, supplementing the
training set with a small amount of data from the most difficult accent
(2.25 hours of Glaswegian) leads to a similar gain in performance
as using a large amount of accent-diverse data (8.96 hours from 14
accent regions), even though this accent accounts for just 14% of the test
data.
12/12
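For clarity on how a "relative gain" figure like the one quoted above is computed: it is the error-rate reduction expressed as a fraction of the baseline error rate. The word error rates in the snippet below are invented purely to show the arithmetic, not the actual results.

```python
# Worked example of the relative-gain arithmetic with made-up WER values.
def relative_gain(wer_baseline: float, wer_new: float) -> float:
    return 100.0 * (wer_baseline - wer_new) / wer_baseline

print(f"{relative_gain(20.0, 10.63):.2f}%")   # -> 46.85% for these illustrative numbers
```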
Thank you for listening
Email: m.najafian@utdallas.edu
