This document describes using deep belief networks (DBNs) for acoustic modeling in automatic speech recognition. It involves pre-training a multi-layer neural network as a generative model one layer at a time using restricted Boltzmann machines. The pre-trained network is then fine-tuned discriminatively using backpropagation to output phoneme probabilities. The approach achieves better phone recognition than Gaussian mixture models by learning multiple layers of features from data without strong distribution assumptions.
1. Acoustic Modeling using Deep Belief Networks
Yueshen Xu
xuyueshen@163.com
CCNT, Zhejiang University
2. Abstract
Problem
Achieving better phone recognition
Method
Deep neural networks that contain many layers of features and a very large number of parameters
Used in place of Gaussian mixture models
Steps
Step 1: Pre-train the network as a multi-layer generative model of a window of spectral feature vectors, without making use of any discriminative information
Step 2: Use backpropagation to fine-tune the features so that they are better at predicting a probability distribution over HMM states
3. Introduction
Typical Automatic Speech Recognition System
Model the sequential structure of speech signals: Hidden Markov
Model
Spectral representation of the sound wave: HMM states + mixtures of Gaussians + Mel-frequency cepstral coefficients (MFCCs)
New research direction
Deeper acoustic models containing many layers of features
Feedforward neural networks
Advantages
Estimating the posterior probabilities of HMM states does not require detailed assumptions about the data distribution
Suitable for discrete and continuous features
4. Introduction
Comparison of MFCCs and GMMs
MFCCs
Partially overcome the very strong conditional independence
assumption of HMM
GMM
Easy to fit to data using the EM algorithm
Inefficient at modeling high-dimensional data
Previous work on neural networks
Using the backpropagation algorithm to train neural networks discriminatively
Generative modeling vs. discriminative training
Generative modeling can make efficient use of unlabeled speech
5. Introduction
Main novelty of this paper
Achieve consistently better phone recognition performance by pre-
training a multi-layer neural network
One layer at a time, as a generative model
General description
The generative pre-training creates many layers of feature detectors
Backpropagation is then used to adjust the features in every layer so that they are more useful for discrimination
6. Learning a multilayer generative model
Two vital assumptions of this paper
Discrimination is more directly related to the underlying causes of the data than to the individual elements of the data itself
A good feature vector representation of the underlying causes can
be recovered from the input data by modeling its higher order
statistical structure
Directed view
Fit a multilayer generative model with infinitely many layers of latent variables
Undirected view
Fitting a relatively simple type of learning module that only has one
layer of latent variables
7. Learning a multilayer generative model
Undirected view
Restricted Boltzmann Machine(RBM)
Bipartite graph in which visible units are connected to hidden units
No visible-visible or hidden-hidden connections
Visible units vs. hidden units
Visible units: represent the observations
Hidden units: represent features, linked to the visible units by undirected weighted connections
RBM in this paper
Binary RBM
Both hidden and visible units are binary and stochastic
Gaussian-Bernoulli RBM
Hidden units are binary but visible units are linear with Gaussian noise
8. Learning a multilayer generative model
Binary RBM
The weights on the connections and biases of individual units
define a probability distribution over the joint states of visible and
hidden units via an energy function
The conditional distribution p(h | v, θ): p(h_j = 1 | v, θ) = σ(a_j + Σ_i w_ij v_i)
The conditional distribution p(v | h, θ): p(v_i = 1 | h, θ) = σ(b_i + Σ_j w_ij h_j), where σ(x) = 1 / (1 + e^(−x))
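As a minimal NumPy sketch of these two conditionals (illustrative names, not code from the paper; W has shape (num_visible, num_hidden), a and b are the hidden and visible biases):

```python
import numpy as np

def sigmoid(x):
    # Logistic function sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, a):
    # p(h_j = 1 | v, theta) = sigma(a_j + sum_i w_ij * v_i)
    return sigmoid(a + v @ W)

def p_v_given_h(h, W, b):
    # p(v_i = 1 | h, theta) = sigma(b_i + sum_j w_ij * h_j)
    # (In a Gaussian-Bernoulli RBM the visible units would instead be linear with Gaussian noise.)
    return sigmoid(b + h @ W.T)
```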
9. Learning a multilayer generative model
Learning a DBN
Update each weight w_ij using the difference between two measured, pairwise correlations:
Δw_ij = ε ( ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_recon )
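This is the standard one-step contrastive divergence (CD-1) approximation; a hedged NumPy sketch with the same shapes as above (the learning rate eps and single-vector interface are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, a, b, eps=0.01):
    # Positive phase: hidden probabilities and a binary sample, given one data vector.
    h_prob = sigmoid(a + v_data @ W)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Negative phase: reconstruct the visible units, then re-infer the hidden probabilities.
    v_recon = sigmoid(b + h_sample @ W.T)
    h_recon = sigmoid(a + v_recon @ W)
    # delta w_ij = eps * ( <v_i h_j>_data - <v_i h_j>_recon )
    return W + eps * (np.outer(v_data, h_prob) - np.outer(v_recon, h_recon))
```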
Directed view
A sigmoid belief net consisting of multiple layers of binary
stochastic units
Hidden layers: binary features
Visible layer: binary data vectors
10. Learning a multilayer generative model
Generating data from the model
Binary states are chosen for the top layer of hidden units, and lower layers are then sampled top-down (see the sketch below)
The weights on the top-down connections are adjusted by performing gradient ascent in the expected log probability of generating the training data
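For intuition, a minimal sketch of this ancestral, top-down sampling; the list of top-down weight matrices and biases is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def generate(top_down_weights, biases, num_top_units):
    # Choose binary states for the top layer of hidden units.
    state = (rng.random(num_top_units) < 0.5).astype(float)
    # Sample each lower layer from its sigmoid conditional, down to the visible layer.
    # Each W has shape (units_in_layer_above, units_in_layer_below).
    for W, b in zip(top_down_weights, biases):
        prob = sigmoid(b + state @ W)
        state = (rng.random(prob.shape) < prob).astype(float)
    return state  # a sampled visible vector
```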
Challenge
Getting unbiased samples from the exponentially large posterior distribution is intractable
The hidden units are not conditionally independent given the visible data
Learning with tied weights (1/2)
Learning context: a sigmoid belief net with an infinite number of layers and tied, symmetric weights between layers
The posterior can then be computed by simply multiplying the visible vector by the transposed weight matrix and applying the logistic function
11. Learning a multilayer generative model
An infinite sigmoid belief net with tied weights
Inference is easy: once the posteriors have been sampled for the first hidden layer, the same process can be used for the next hidden layer
Learning is a little more difficult
Because every copy of the tied weight matrix gets different derivatives
12. Learning a multilayer generative model
Unbiased estimate of the sum of derivatives
h(2) can be viewed as a noisy but unbiased estimate of probabilities
for visible units predicted by h(1)
h(3) can be viewed as a noisy but unbiased estimate of probabilities
for visible units predicted by h(2)
13. Learning a multilayer generative model
Learning different weights in each layer
Making the generative model more powerful by allowing different
weights in different layers
Step 1: Learn with all of the weight matrices tied together
Step 2: Untie the bottom weight matrix from the other matrices
Step 3: Freeze the bottom matrix as W(1)
Step 4: Keep all remaining matrices tied together and continue learning the higher matrices
This involves first inferring h(1) from v using the frozen W(1), and then inferring h(2), h(3), and h(4) in a similar bottom-up manner using W or W^T (see the sketch below)
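A hedged sketch of this greedy, layer-by-layer procedure; train_rbm stands in for CD-style RBM training (as in the earlier sketch) and is an assumed helper, not an API from the paper:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def pretrain_dbn(data, hidden_sizes, train_rbm):
    # data: (num_cases, num_visible); hidden_sizes: e.g. [512, 512, 512]
    # train_rbm(inputs, num_hidden) -> (W, hidden_bias) trains one RBM, e.g. with CD-1.
    weights, inputs = [], data
    for num_hidden in hidden_sizes:
        # Learn one RBM on the current inputs, then freeze its weight matrix W(k).
        W, a = train_rbm(inputs, num_hidden)
        weights.append((W, a))
        # Infer the hidden features and treat them as the "data" for the next layer's RBM.
        inputs = sigmoid(a + inputs @ W)
    return weights  # the untied, frozen weight matrices of the DBN
```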
14. Learning a multilayer generative model
Deep belief net (DBN)
Having learned K layers of features, we get a directed generative model called a 'deep belief net'
The DBN has K different weight matrices between its lower layers, plus an infinite number of higher layers with tied weights
This paper models the whole system as a feedforward, deterministic neural network
This network is then discriminatively fine-tuned using backpropagation to maximize the log probability of the correct HMM states (see the sketch below)
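A minimal sketch of the resulting deterministic feedforward pass, with a softmax output layer over HMM states added on top; the softmax layer, its parameters, and all names are assumptions consistent with the description above, not code from the paper:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def forward(x, dbn_weights, W_out, b_out):
    # x: a single input vector. Deterministic pass through the pre-trained
    # layers, using probabilities instead of stochastic binary states.
    for W, a in dbn_weights:
        x = sigmoid(a + x @ W)
    # Softmax over HMM states; backpropagation through this network would
    # maximize the log probability of the correct state for each frame.
    logits = b_out + x @ W_out
    e = np.exp(logits - logits.max())
    return e / e.sum()
```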
15. Using Deep Belief Nets for Phone Recognition
Visible units
A context window of n successive frames of speech coefficients (see the splicing sketch below)
Generating phone sequences
The resulting feedforward neural network is discriminatively trained to output a probability distribution over all possible labels of the central frame
The probability distributions over all possible labels for each frame are then fed into a standard Viterbi decoder
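As an illustration of how the visible vector could be assembled, a sketch that splices n successive frames of speech coefficients around each central frame (n = 11 and the edge-padding scheme are assumptions for the example):

```python
import numpy as np

def context_windows(frames, n=11):
    # frames: (num_frames, num_coeffs); returns one spliced visible vector per central frame.
    half = n // 2
    # Repeat the first and last frames so edge frames still get a full window.
    padded = np.vstack([frames[:1].repeat(half, axis=0),
                        frames,
                        frames[-1:].repeat(half, axis=0)])
    return np.stack([padded[i:i + n].ravel() for i in range(len(frames))])
```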
16. Conclusions
Novelty
This is the first application to acoustic modeling of neural networks
in which multiple layers of features are generatively pre-trained
This approach can be extended to explicitly model the covariance
structure of input features
It can be used to jointly train acoustic and language models
It can be applied to large-vocabulary tasks in place of GMMs