This document describes using deep belief networks (DBNs) for acoustic modeling in automatic speech recognition. It involves pre-training a multi-layer neural network as a generative model one layer at a time using restricted Boltzmann machines. The pre-trained network is then fine-tuned discriminatively using backpropagation to output phoneme probabilities. The approach achieves better phone recognition than Gaussian mixture models by learning multiple layers of features from data without strong distribution assumptions.
1. Acoustic Modeling using Deep Belief Networks
Yueshen Xu
xuyueshen@163.com
CCNT, Zhejiang University
2. Abstract
Problem
Achieving better phone recognition
Method
Deep neural networks that contain many layers of features and a very large number of parameters
Used in place of Gaussian mixture models
Steps
Step 1: Pre-train the network as a multi-layer generative model of a window of spectral feature vectors, without making use of any discriminative information
Step 2: Use backpropagation to fine-tune the features so that they are better at predicting a probability distribution over HMM states
3. Introduction
Typical Automatic Speech Recognition System
Model the sequential structure of speech signals: Hidden Markov
Model
Spectral representation of the sound wave: HMM states + mixtures of Gaussians + Mel-frequency cepstral coefficients (MFCCs)
New research direction
Deeper acoustic models containing many layers of features
Feedforward neural networks
Advantages
Estimating the posterior probabilities of HMM states does not require detailed assumptions about the data distribution
Suitable for discrete and continuous features
4. Introduction
Comparison of MFCCs and GMMs
MFCCs
Partially overcome the very strong conditional independence
assumption of HMM
GMM
Easy to fit to data using the EM algorithm
Inefficient at modeling high-dimensional data
Previous work on neural networks
Using the backpropagation algorithm to train neural networks discriminatively
Generative modeling vs. discriminative training
Generative modeling can make efficient use of unlabeled speech
5. Introduction
Main novelty of this paper
Achieve consistently better phone recognition performance by pre-
training a multi-layer neural network
One layer at a time, as a generative model
General description
The generative pre-training creates many layers of feature detectors
Backpropagation is then used to adjust the features in every layer so that they are more useful for discrimination
6. Learning a multilayer generative model
Two vital assumptions of this paper
Discrimination is more directly related to the underlying causes of the data than to the individual elements of the data itself
A good feature vector representation of the underlying causes can
be recovered from the input data by modeling its higher order
statistical structure
Directed view
Fit a multilayer generative model with infinitely many layers of latent variables
Undirected view
Fitting a relatively simple type of learning module that only has one
layer of latent variables
7. Learning a multilayer generative model
Undirected view
Restricted Boltzmann Machine(RBM)
Bipartite graph in which visible units are connected to hidden units
No visible-visible or hidden-hidden connections
Visible units vs. hidden units
Visible units: represent the observations
Hidden units: represent features, linked to the visible units by undirected weighted connections
RBM in this paper
Binary RBM
Both hidden and visible units are binary and stochastic
Gaussian-Bernoulli RBM
Hidden units are binary but visible units are linear with Gaussian noise
8. Learning a multilayer generative model
Binary RBM
The weights on the connections and biases of individual units
define a probability distribution over the joint states of visible and
hidden units via an energy function
The conditional distribution p(h | v, θ): p(h_j = 1 | v, θ) = σ(a_j + Σ_i w_ij v_i)
The conditional distribution p(v | h, θ): p(v_i = 1 | h, θ) = σ(b_i + Σ_j w_ij h_j), where σ(x) = 1 / (1 + e^(−x))
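As a minimal NumPy sketch of these two conditionals (illustrative names, not code from the paper; W has shape (num_visible, num_hidden), a and b are the hidden and visible biases):

```python
import numpy as np

def sigmoid(x):
    # Logistic function sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, a):
    # p(h_j = 1 | v, theta) = sigma(a_j + sum_i w_ij * v_i)
    return sigmoid(a + v @ W)

def p_v_given_h(h, W, b):
    # p(v_i = 1 | h, theta) = sigma(b_i + sum_j w_ij * h_j)
    # (In a Gaussian-Bernoulli RBM the visible units would instead be linear with Gaussian noise.)
    return sigmoid(b + h @ W.T)
```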
9. Learning a multilayer generative model
Learning a DBN
Update each weight w_ij using the difference between two measured, pairwise correlations:
Δw_ij = ε ( ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_recon )
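This is the standard one-step contrastive divergence (CD-1) approximation; a hedged NumPy sketch with the same shapes as above (the learning rate eps and single-vector interface are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, a, b, eps=0.01):
    # Positive phase: hidden probabilities and a binary sample, given one data vector.
    h_prob = sigmoid(a + v_data @ W)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Negative phase: reconstruct the visible units, then re-infer the hidden probabilities.
    v_recon = sigmoid(b + h_sample @ W.T)
    h_recon = sigmoid(a + v_recon @ W)
    # delta w_ij = eps * ( <v_i h_j>_data - <v_i h_j>_recon )
    return W + eps * (np.outer(v_data, h_prob) - np.outer(v_recon, h_recon))
```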
Directed view
A sigmoid belief net consisting of multiple layers of binary
stochastic units
Hidden layers: binary features
Visible layer: binary data vectors
10. Learning a multilayer generative model
Generating data from the model
Binary states are chosen for the top layer of hidden units, and lower layers are then sampled top-down (see the sketch below)
The weights on the top-down connections are adjusted by performing gradient ascent in the expected log probability of generating the training data
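For intuition, a minimal sketch of this ancestral, top-down sampling; the list of top-down weight matrices and biases is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def generate(top_down_weights, biases, num_top_units):
    # Choose binary states for the top layer of hidden units.
    state = (rng.random(num_top_units) < 0.5).astype(float)
    # Sample each lower layer from its sigmoid conditional, down to the visible layer.
    # Each W has shape (units_in_layer_above, units_in_layer_below).
    for W, b in zip(top_down_weights, biases):
        prob = sigmoid(b + state @ W)
        state = (rng.random(prob.shape) < prob).astype(float)
    return state  # a sampled visible vector
```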
Challenge
Getting unbiased samples from the exponentially large posterior distribution is intractable
The hidden units are not conditionally independent given the visible data
Learning with tied weights (1/2)
Learning context: a sigmoid belief net with an infinite number of layers and tied, symmetric weights between layers
The posterior can then be computed by simply multiplying the visible vector by the transposed weight matrix and applying the logistic function
11. Learning a multilayer generative model
An infinite sigmoid belief net with tied weights
Inference is easy: once the posteriors have been sampled for the first hidden layer, the same process can be used for the next hidden layer
Learning is a little more difficult
Because every copy of the tied weight matrix gets different derivatives
12. Learning a multilayer generative model
Unbiased estimate of the sum of derivatives
h(2) can be viewed as a noisy but unbiased estimate of probabilities
for visible units predicted by h(1)
h(3) can be viewed as a noisy but unbiased estimate of probabilities
for visible units predicted by h(2)
13. Learning a multilayer generative model
Learning different weights in each layer
Making the generative model more powerful by allowing different
weights in different layers
Step 1: Learn with all of the weight matrices tied together
Step 2: Untie the bottom weight matrix from the other matrices
Step 3: Freeze the bottom matrix as W(1)
Step 4: Keep all remaining matrices tied together and continue learning the higher matrices
This involves first inferring h(1) from v using the frozen W(1), and then inferring h(2), h(3), and h(4) in a similar bottom-up manner using W or W^T (see the sketch below)
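A hedged sketch of this greedy, layer-by-layer procedure; train_rbm stands in for CD-style RBM training (as in the earlier sketch) and is an assumed helper, not an API from the paper:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def pretrain_dbn(data, hidden_sizes, train_rbm):
    # data: (num_cases, num_visible); hidden_sizes: e.g. [512, 512, 512]
    # train_rbm(inputs, num_hidden) -> (W, hidden_bias) trains one RBM, e.g. with CD-1.
    weights, inputs = [], data
    for num_hidden in hidden_sizes:
        # Learn one RBM on the current inputs, then freeze its weight matrix W(k).
        W, a = train_rbm(inputs, num_hidden)
        weights.append((W, a))
        # Infer the hidden features and treat them as the "data" for the next layer's RBM.
        inputs = sigmoid(a + inputs @ W)
    return weights  # the untied, frozen weight matrices of the DBN
```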
14. Learning a multilayer generative model
Deep belief net (DBN)
Having learned K layers of features, we get a directed generative model called a 'deep belief net'
The DBN has K different weight matrices between its lower layers, plus an infinite number of higher layers with tied weights
This paper models the whole system as a feedforward, deterministic neural network
This network is then discriminatively fine-tuned using backpropagation to maximize the log probability of the correct HMM states (see the sketch below)
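A minimal sketch of the resulting deterministic feedforward pass, with a softmax output layer over HMM states added on top; the softmax layer, its parameters, and all names are assumptions consistent with the description above, not code from the paper:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def forward(x, dbn_weights, W_out, b_out):
    # x: a single input vector. Deterministic pass through the pre-trained
    # layers, using probabilities instead of stochastic binary states.
    for W, a in dbn_weights:
        x = sigmoid(a + x @ W)
    # Softmax over HMM states; backpropagation through this network would
    # maximize the log probability of the correct state for each frame.
    logits = b_out + x @ W_out
    e = np.exp(logits - logits.max())
    return e / e.sum()
```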
15. Using Deep Belief Nets for Phone Recognition
Visible units
A context window of n successive frames of speech coefficients (see the splicing sketch below)
Generating phone sequences
The resulting feedforward neural network is discriminatively trained to output a probability distribution over all possible labels of the central frame
The probability distributions over all possible labels for each frame are then fed into a standard Viterbi decoder
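As an illustration of how the visible vector could be assembled, a sketch that splices n successive frames of speech coefficients around each central frame (n = 11 and the edge-padding scheme are assumptions for the example):

```python
import numpy as np

def context_windows(frames, n=11):
    # frames: (num_frames, num_coeffs); returns one spliced visible vector per central frame.
    half = n // 2
    # Repeat the first and last frames so edge frames still get a full window.
    padded = np.vstack([frames[:1].repeat(half, axis=0),
                        frames,
                        frames[-1:].repeat(half, axis=0)])
    return np.stack([padded[i:i + n].ravel() for i in range(len(frames))])
```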
16. Conclusions
Novelty
This is the first application to acoustic modeling of neural networks
in which multiple layers of features are generatively pre-trained
This approach can be extended to explicitly model the covariance
structure of input features
It can be used to jointly train acoustic and language models
It can be applied to large-vocabulary tasks in place of GMMs