1. Performance Analysis of a Bangla Speech Recognizer Model Using a Hidden Markov Model (HMM)
Submitted by: Md. Abdullah-al-MAMUN
2. OUTLINE
What is speech recognition?
The Structure of ASR
Speech Database
Feature Extraction
Hidden Markov Model
Forward algorithm
Backward algorithm
Viterbi algorithm
Training & Recognition
Result
Conclusions
References
3. What is Speech Recognition?
In computer science, speech recognition is the translation of spoken words into text.
It is the process of converting the acoustic signal captured by a microphone into a set of words.
Speech recognition is also known as "Automatic Speech Recognition (ASR)" or "Speech to Text (STT)".
4. Model of Bangla Speech Recognition
Fig. 1: Simple model of Bangla Speech Recognition
6. Speech Database
- A speech database is a collection of recorded speech accessible on a computer and supported with the necessary transcriptions.
- The databases collect the observations required for parameter estimation.
- In this ASR system, I have used about 1200 keywords.
9. Speech Signal Analysis
Feature Extraction for ASR:
- The aim is to extract the voice features that distinguish the different phonemes of a language.
[Figure: a sequence of feature vectors produced by the Feature Extraction block]
10. MFCC Extraction
Speech signal x(n) → Pre-emphasis → x'(n) → Window → xt(n) → DFT → Xt(k) → Mel filter banks → Yt(m) → Log(|·|²) → IDFT → MFCC yt(m)
MFCC (Mel-frequency cepstral coefficients) is a representation of the short-term power spectrum of a sound, widely used in audio processing.
The MFCCs are the amplitudes of the resulting spectrum.
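The pipeline above can be sketched in plain NumPy. This is a minimal illustration, not the code behind the reported system: the frame length, hop size, filter count, and coefficient count are common defaults chosen here as assumptions, and the final cepstral step is realized as a DCT, as is standard practice for the slide's IDFT box.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    """Toy MFCC extractor following the slide's pipeline (parameters are assumptions)."""
    signal = np.asarray(signal, dtype=float)
    # Pre-emphasis: x'(n) = x(n) - 0.97 * x(n-1)
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing and Hamming window: xt(n)
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # DFT and power spectrum: |Xt(k)|^2
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular mel filter bank: Yt(m)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)
    # Log of the mel energies, then DCT (the cepstral / "IDFT" step): yt(m)
    log_mel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_mels)))
    return log_mel @ dct.T

# One second of silence yields 98 frames of 13 coefficients each
feats = mfcc([0.0] * 16000)
```

With a 400-sample frame and a 160-sample hop at 16 kHz, each feature vector summarizes 25 ms of speech every 10 ms.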
11. Explanatory Example
[Figure: speech waveform of the phoneme "ae"; the signal after pre-emphasis and Hamming windowing; its power spectrum; and the resulting MFCCs]
12. Feature Vector to P(O|M) via HMM
[Figure: a sequence of feature vectors fed into the HMM, which outputs P(O|M)]
For each input word O, the HMM M assigns a corresponding probability P(O|M), which can be computed with the HMM.
14. Elements of an HMM
1) Set of hidden states S = {1, 2, ..., N}
2) Set of observation symbols O = {o1, o2, ..., oM}, where M is the number of observation symbols
3) The initial state distribution: π = {πi}, πi = P(s0 = i), 1 ≤ i ≤ N
4) State transition probability distribution: A = {aij}, aij = P(st = j | st-1 = i), 1 ≤ i, j ≤ N
5) Observation symbol probability distribution in state j: B = {bj(k)}, bj(k) = P(Xt = ok | st = j), 1 ≤ j ≤ N, 1 ≤ k ≤ M
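These elements can be written down concretely as arrays. The values below are toy numbers for illustration: the transition probabilities borrow the a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5 shown on the Viterbi slide, while the initial and emission probabilities are invented, not taken from the presented system.

```python
import numpy as np

# Toy HMM with N = 2 hidden states and M = 3 observation symbols.
pi = np.array([0.6, 0.4])          # initial distribution: pi_i = P(s0 = i)
A = np.array([[0.7, 0.3],          # transitions: a_ij = P(st = j | st-1 = i)
              [0.5, 0.5]])
B = np.array([[0.6, 0.3, 0.1],     # emissions: b_j(k) = P(Xt = o_k | st = j)
              [0.1, 0.2, 0.7]])

# Each of pi, each row of A, and each row of B is a probability
# distribution, so they must sum to one.
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```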
15. Three Basic Problems in HMM
1. The Evaluation Problem – Given a model λ = (A, B, π) and a sequence of observations O = (o1, o2, o3, ..., oM), what is the probability P(O|λ), i.e., the probability that the model generates the observations?
2. The Decoding Problem – Given a model λ = (A, B, π) and a sequence of observations O = (o1, o2, o3, ..., oM), what is the most likely state sequence in the model that produces the observations?
3. The Learning Problem – Given a model λ = (A, B, π) and a set of observations O = (o1, o2, o3, ..., oM), how can we adjust the model parameters λ to maximize the probability P(O|λ)?
How to evaluate an HMM? Forward Algorithm
How to decode an HMM? Viterbi Algorithm
How to train an HMM? Baum-Welch Algorithm
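The Evaluation Problem has the cheapest of the three solutions. A compact sketch of the forward algorithm, with toy parameters that are assumptions rather than values from the deck:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Evaluation problem: compute P(O | lambda) with the forward algorithm.

    alpha_t(j) = P(o_1 .. o_t, s_t = j | lambda), updated one step at a time.
    """
    alpha = pi * B[:, obs[0]]            # initialisation at t = 1
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # induction: sum over previous states
    return alpha.sum()                   # termination: sum over final states

# Toy model (illustrative values only)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.5, 0.5]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])

# Likelihood of the quantized observation sequence (o_1, o_2, o_3)
p = forward(pi, A, B, [0, 1, 2])
```

The recursion costs O(N²T) instead of the O(N^T) of enumerating every state path, which is why the forward algorithm makes evaluation tractable.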
34. Viterbi Algorithm (Backtracking to Obtain Labeling)
[Figure: a two-state trellis over times 2-4, starting from S0 with initial probabilities π1, π2; transition probabilities a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5; the emission values 0.6, 0.1, 0.3, 0.1, 0.1, 0.2 label the trellis arcs]
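The backtracking step on this slide can be sketched as follows. The transition probabilities match the trellis (a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5); the initial and emission distributions are assumed values, since the slide does not give them in full.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Decoding problem: most likely state sequence for the observations.

    delta_t(j) = probability of the best path ending in state j at time t;
    back-pointers record which previous state achieved that maximum.
    """
    delta = pi * B[:, obs[0]]
    back = []
    for o in obs[1:]:
        trans = delta[:, None] * A           # trans[i, j]: best path into i, then i -> j
        back.append(trans.argmax(axis=0))    # best predecessor of each state j
        delta = trans.max(axis=0) * B[:, o]
    # Backtracking to obtain the labeling, as on the slide
    state = int(delta.argmax())
    path = [state]
    for ptr in reversed(back):
        state = int(ptr[state])
        path.append(state)
    return path[::-1]

pi = np.array([0.5, 0.5])                    # assumed uniform start
A = np.array([[0.7, 0.3], [0.5, 0.5]])       # transitions from the trellis
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])  # assumed emissions

path = viterbi(pi, A, B, [0, 1, 2])
```

Unlike the forward algorithm, which sums over all paths, Viterbi keeps only the single best path into each state, so max replaces the sum in the induction step.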
35. Implementing HMM for Speech Modeling (Training and Recognition)
- Building HMM speech models based on the correspondence between the observation sequences Y and the state sequences S (TRAINING).
- Recognizing speech using the stored HMM models Θ and the actual observation Y (RECOGNITION).
[Figure: Speech Samples → Feature Extraction → Y; together with the state sequence S, Y is used to train the HMM, producing the models Θ; the Recognition block then uses Θ and Y to output the recognized words W*]
36. RECOGNITION Process
Given an input speech signal, let S = (s1, s2, ..., sT) be the state sequence to be recognized.
Let xt be the feature vector computed at time t, where the feature sequence from time 1 to t is denoted X = (x1, x2, ..., xt).
The recognized states S* can be obtained by:
S* = ArgMax P(S, X|Φ)
[Figure: the search algorithm combines the dynamic structure {st-1} → st with the static structure Φ and the features xt, using P(xt, {st} | {st-1}, Φ) to produce S*]
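For isolated keywords, the argmax above reduces to scoring the observation sequence against each stored word model and picking the best. A hedged sketch, in which both word models and all their parameter values are hypothetical placeholders, not models from the 1200-keyword system:

```python
import numpy as np

def forward_prob(pi, A, B, obs):
    """P(O | model) via the forward algorithm (see the Evaluation Problem)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# Two hypothetical stored word models Theta = {word: (pi, A, B)}
models = {
    "word_1": (np.array([0.6, 0.4]),
               np.array([[0.7, 0.3], [0.5, 0.5]]),
               np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])),
    "word_2": (np.array([0.5, 0.5]),
               np.array([[0.2, 0.8], [0.6, 0.4]]),
               np.array([[0.1, 0.1, 0.8], [0.3, 0.4, 0.3]])),
}

obs = [0, 1, 2]  # a quantized observation sequence from the feature extractor

# Recognition: the word whose model gives the observations the highest likelihood
best = max(models, key=lambda w: forward_prob(*models[w], obs))
```

In a full system the scores would be kept in the log domain to avoid underflow on long utterances; plain probabilities are used here only to keep the sketch short.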
40. Conclusions
No speech recognizer to date achieves 100% accuracy.
Avoid using a poor-quality microphone; consider using a better one.
One important point is that training the computer will provide an even better experience.