Towards Machine Comprehension of Spoken Content

TAIPEI | SEP. 21-22, 2016
李宏毅 Hung-yi Lee
TOWARDS MACHINE COMPREHENSION
OF SPOKEN CONTENT

MULTIMEDIA INTERNET CONTENT
• 300 hours of multimedia are uploaded per minute (2015.01)
• 1,874 courses on Coursera (2016.04)
• We need machines that listen to the audio data, understand it, and extract useful information for humans.
• In multimedia content, the spoken part carries very important information about the content.
• Nobody is able to go through all of this data.
• This talk gives an overview of the technology developed at NTU Speech Lab.

OVERVIEW
2016/9/26
Spoken Content → Speech Recognition → Text
Deep Learning

Deep Learning for Speech Recognition
• Acoustic Model
  • DNN + HMM (widely used)
  • CTC
  • Sequence-to-sequence learning
  • DNN + structured SVM [Meng & Lee, ICASSP 10]
  • DNN + structured DNN [Liao & Lee, ASRU 15]
[Figure: (a) use DNN phone posterior as acoustic vector; (b) structured SVM; (c) structured DNN. Labels: speech signal, feature extraction, x (acoustic vector sequence), y (phoneme label sequence, e.g. a c b a), Ψ(x, y), F1(x, y; θ1), F2(x, y; θ2), input layer, hidden layers h1 ... hL, output layer.]

Deep Learning for Speech Recognition
• Language Model
  • RNN
  • LSTM
  • Neural Turing Machine
  • Attention-based Model
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[Ko & Lee, submitted to ICASSP 17]
[Liu & Lee, submitted to ICASSP 17]

OVERVIEW
Spoken Content → Speech Recognition → Text
Summarization

SPEECH SUMMARIZATION
Extractive Summaries
• Select the most informative segments to form a compact version
• Example: a retrieved audio file 1 hour long → a 10-minute summary
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015/Structured%20Lecture/Summarization%20Hidden_2.ecm.mp4/index.html
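The extractive selection step can be sketched as a greedy choice under a time budget. This is only an illustration: the segment scores, the durations, and the function name `select_segments` are hypothetical, and the slides do not say how informativeness is actually scored.

```python
from typing import List, Tuple

# Hypothetical segment: (start_time_sec, duration_sec, informativeness_score).
# Durations are assumed to be positive.
Segment = Tuple[float, float, float]


def select_segments(segments: List[Segment], budget_sec: float) -> List[Segment]:
    """Greedily pick high-scoring segments until the time budget is used up."""
    # Rank by score per second so long segments do not win by length alone.
    ranked = sorted(segments, key=lambda s: s[2] / s[1], reverse=True)
    chosen: List[Segment] = []
    used = 0.0
    for seg in ranked:
        if used + seg[1] <= budget_sec:
            chosen.append(seg)
            used += seg[1]
    # Restore chronological order so the summary plays back naturally.
    return sorted(chosen, key=lambda s: s[0])


if __name__ == "__main__":
    demo = [(0, 60, 3.0), (60, 120, 2.0), (180, 30, 4.0), (210, 90, 9.0)]
    for start, dur, score in select_segments(demo, budget_sec=180):
        print(f"start={start}s duration={dur}s score={score}")
```

A greedy pass is the simplest choice here; it does not guarantee the best total score for a given budget.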

SPEECH SUMMARIZATION
Abstractive Summaries
• Input document (long word sequence): x1 x2 x3 …… xN
• Summary (short word sequence): y1 y2 y3 y4 ……
• The machine first reads and understands the document.
• The machine then writes the summary in its own words.
[Yu & Lee, SLT 16]

OVERVIEW
Spoken Content → Speech Recognition → Text
Key Term Extraction, Summarization

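Key Term Extraction
[Shen & Lee, Interspeech 16]
• The machine first skims the whole document.
• The machine extracts the key points of the document.
• It then goes back and highlights the key points.
[Figure: document words x1 x2 x3 x4 … xT → Embedding Layer → V1 V2 V3 V4 … VT; attention weights α1 α2 α3 α4 … αT; weighted sum Σ αi Vi → Hidden Layer → Output Layer → Key Terms: DNN, LSTM. Other labels: OT.]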
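The weighted sum Σ αi Vi in the figure can be written out as a small sketch. It assumes the weights αi come from a softmax over per-word scores, which the slide does not state; the function names and the toy vectors are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into attention weights alpha_i that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(vectors: List[List[float]], scores: List[float]) -> List[float]:
    """Return sum_i alpha_i * V_i for word embeddings V_1 ... V_T."""
    alphas = softmax(scores)
    dim = len(vectors[0])
    return [sum(a * v[d] for a, v in zip(alphas, vectors)) for d in range(dim)]


if __name__ == "__main__":
    # Toy 2-D embeddings for a three-word document (illustrative values only).
    embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(attend(embeddings, scores=[0.1, 2.0, 0.3]))
```

Words with larger scores contribute more to the document vector, which is what lets the model focus on the key terms.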

OVERVIEW
Spoken Content → Speech Recognition → Text
Spoken Content Retrieval, Key Term Extraction, Summarization

Spoken Content Retrieval
• Transcribe spoken content into text by speech recognition
• Use a text retrieval approach to search the transcriptions
[Figure: Spoken Content → Speech Recognition (Models) → Text → Text Retrieval → Retrieval Result, with a Query from the learner. Other labels: Black Box.]

Overview Paper
• Lin-shan Lee, James Glass, Hung-yi Lee, Chun-an Chan, "Spoken Content
Retrieval —Beyond Cascading Speech Recognition with Text Retrieval,"
IEEE/ACM Transactions on Audio, Speech, and Language Processing,
vol.23, no.9, pp.1389-1420, Sept. 2015
• http://speech.ee.ntu.edu.tw/~tlkagk/paper/Overview.pdf
• 3-hour tutorial at INTERSPEECH 2016
• Slides: http://speech.ee.ntu.edu.tw/~tlkagk/slide/spoken_content_retrieval_IS16.pdf

Audio is difficult to browse
• Retrieval results for spoken content are usually noisy
• When the system returns the retrieval results, the user cannot tell at first glance what he/she got
[Figure: Retrieval Result]

OVERVIEW
Spoken Content → Speech Recognition → Text
Spoken Content Retrieval, Key Term Extraction, Summarization, Interaction
Interaction example: the user asks for “Deep Learning”; the system asks back, “Related to Machine Learning or Education?”

Challenges
• Given the information entered by the users, which action should be taken?
  • “Give me an example.”
  • “Is it relevant to XXX?”
  • “More precisely, please.”
  • “Show the results.”
• The retrieval system learns to take the most effective actions from historical interaction experiences.

Deep Reinforcement Learning
• The actions are determined by a neural network
  • Input: information that helps to make the decision
  • Output: which action should be taken
  • Take the action with the highest score
• The network parameters can be optimized from historical interactions.
[Figure: Information → DNN → scores for Action A, Action B, …, Action Z → Max]
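The "score every action, take the max" step can be sketched as follows. It covers action selection only, not the reinforcement learning that would fit the weights from historical interaction. The action names, the one-hidden-layer network, and the weight values are all hypothetical.

```python
import math
from typing import List

# Hypothetical action names, loosely based on the example system responses.
ACTIONS = ["ask_for_example", "ask_relevance", "ask_for_precision", "show_results"]


def score_actions(features: List[float],
                  w_hidden: List[List[float]],
                  w_out: List[List[float]]) -> List[float]:
    """One hidden tanh layer followed by a linear layer: one score per action."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features)))
              for row in w_hidden]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


def choose_action(features: List[float],
                  w_hidden: List[List[float]],
                  w_out: List[List[float]]) -> str:
    """Return the action whose score is highest."""
    scores = score_actions(features, w_hidden, w_out)
    best = max(range(len(scores)), key=scores.__getitem__)
    return ACTIONS[best]


if __name__ == "__main__":
    # Hand-picked toy weights; a real system would learn these.
    w_hidden = [[1.0, 0.0], [0.0, 1.0]]
    w_out = [[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.5, 0.0]]
    print(choose_action([1.0, 0.0], w_hidden, w_out))
```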

Deep Reinforcement Learning
• Different network depth
• The task cannot be addressed by a linear model. Some depth is needed.
[Figure labels: More Interaction; Better retrieval performance; Less user labor]

OVERVIEW
Spoken Content → Speech Recognition → Text
Spoken Content Retrieval, Key Term Extraction, Summarization, Interaction, Organization

Today’s Retrieval Techniques
752 matches

More is less ……
• Given all the related lectures from different courses, the learner asks: “Which lecture should I go to first?”
• Learning Map
  • Nodes: lectures on the same topic
  • Edges: suggested learning order
[Shen & Lee, Interspeech 15]
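A learning map of this kind is a directed graph, and one valid study order can be read off with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The lecture names and the helper `learning_order` are hypothetical, and how the edges are obtained is outside this sketch.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def learning_order(prerequisites: Dict[str, Set[str]]) -> List[str]:
    """Return the nodes so that every lecture comes after its prerequisites.

    `prerequisites` maps each node to the set of nodes that should be
    studied before it (the incoming edges of the learning map).
    """
    return list(TopologicalSorter(prerequisites).static_order())


if __name__ == "__main__":
    # Hypothetical learning map with three topic nodes.
    learning_map = {
        "Deep Learning": {"Neural Networks"},
        "Neural Networks": {"Linear Models"},
        "Linear Models": set(),
    }
    print(learning_order(learning_map))
```

If the map contains a cycle, `static_order()` raises `graphlib.CycleError`, which is a useful sanity check on the suggested ordering.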

OVERVIEW
Spoken Content → Speech Recognition → Text
Spoken Content Retrieval, Key Term Extraction, Summarization, Question Answering, Interaction, Organization

Spoken Question Answering
• Spoken Question Answering: the machine answers questions based on the information in spoken content
• Question: “What is a possible origin of Venus’ clouds?”
• Answer: “Gases released as a result of volcanic activity”

Spoken Question Answering
• TOEFL Listening Comprehension Test by Machine
Audio Story: (The original story is 5 min long.)
Question: “What is a possible origin of Venus’ clouds?”
Choices:
(A) gases released as a result of volcanic activity
(B) chemical reactions caused by high surface temperatures
(C) bursts of radio energy from the planet's surface
(D) strong winds that blow dust into the atmosphere
[Tseng & Lee, Interspeech 16]

Results
[Bar chart: accuracy (%) of naive approaches (1) to (7)]
Memory Network (proposed by FB AI group): 39.2%

Model Architecture
Question: “What is a possible origin of Venus’ clouds?”
• Question → Semantic Analysis → Question Semantics
• Audio Story → Speech Recognition → Semantic Analysis
• Attention → Answer
• Select the choice most similar to the answer
• The model is learned end-to-end.
Audio Story:
…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano ……
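The last two steps (attention over the story, then choosing the closest choice) can be illustrated with a toy sketch. It assumes that the question, the story sentences, and the choices have already been turned into fixed-length vectors. The cosine-based attention scores, the function names, and the toy vectors are assumptions for illustration, not the model in the cited papers.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def attention_answer(question: Vector, story: List[Vector]) -> List[float]:
    """Weight story sentences by similarity to the question, then sum them."""
    scores = [cosine(question, s) for s in story]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    return [sum(a * s[d] for a, s in zip(alphas, story))
            for d in range(len(question))]


def pick_choice(answer: Vector, choices: List[Vector]) -> int:
    """Index of the choice vector most similar to the answer vector."""
    return max(range(len(choices)), key=lambda i: cosine(answer, choices[i]))


if __name__ == "__main__":
    question = [1.0, 0.0]
    story = [[1.0, 0.0], [0.0, 1.0]]  # two toy sentence vectors
    choices = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0], [0.0, -1.0]]
    answer = attention_answer(question, story)
    print("selected choice:", "ABCD"[pick_choice(answer, choices)])
```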

Results
[Bar chart: accuracy (%) of naive approaches (1) to (7)]
Memory Network (proposed by FB AI group): 39.2%
Proposed Approach: 48.8%
[Fang & Hsu & Lee, SLT 16]
[Tseng & Lee, Interspeech 16]

OVERVIEW
Spoken Content → Speech Recognition → Text
Spoken Content Retrieval, Key Term Extraction, Summarization, Question Answering, Interaction, Organization
Is speech recognition essential?

CHALLENGES IN SPEECH RECOGNITION?
• Lots of audio files in different languages are on the Internet.
• Most languages have little annotated data for training speech recognition systems.
• Some audio files are produced in several different languages.
• Some languages do not even have a written form.
• Out-of-vocabulary (OOV) problem

OVERVIEW
Spoken Content → Speech Recognition → Text
Spoken Content Retrieval, Key Term Extraction, Summarization, Question Answering, Interaction, Organization
Is speech recognition essential?
Is it possible to directly understand spoken content?

Preliminary Study: Learning from Audio Books
• The machine listens to lots of audio books [Chung, Interspeech 16]
• The machine does not have any prior knowledge, like an infant

Preliminary Study: Audio Word to Vector
• Audio segment corresponding to an unknown word → fixed-length vector

Preliminary Study: Audio Word to Vector
• The audio segments corresponding to words with similar pronunciations are close to each other.
• Unsupervised
[Figure: vectors for audio segments of “ever”, “never”, “dog” and “dogs”]

Sequence-to-sequence Auto-encoder
• An RNN Encoder reads the acoustic features x1 x2 x3 x4 of an audio segment.
• The values in the memory represent the whole audio segment.
• These values form the audio segment vector: the vector we want.
• How to train the RNN Encoder?

Sequence-to-sequence Auto-encoder
• RNN Encoder: reads the input acoustic features x1 x2 x3 x4 of the audio segment.
• RNN Decoder: outputs y1 y2 y3 y4, which should reconstruct x1 x2 x3 x4.
• The RNN encoder and decoder are jointly trained.
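The shape of a sequence-to-sequence auto-encoder can be shown with a forward-pass-only sketch. It assumes a plain tanh RNN with random, untrained weights. Joint training would adjust all four weight matrices to minimise the reconstruction error, and that step is omitted here. All names and dimensions are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def encode(frames: List[Vector], w_in: Matrix, w_rec: Matrix) -> Vector:
    """Run a tanh RNN over the frames; the final state is the segment vector."""
    h = [0.0] * len(w_rec)
    for x in frames:
        h = [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_rec, h))]
    return h


def decode(code: Vector, steps: int, w_dec: Matrix, w_out: Matrix) -> List[Vector]:
    """Unroll the decoder from the segment vector, emitting one frame per step."""
    h, outputs = list(code), []
    for _ in range(steps):
        outputs.append(matvec(w_out, h))
        h = [math.tanh(v) for v in matvec(w_dec, h)]
    return outputs


def reconstruction_error(frames: List[Vector], outputs: List[Vector]) -> float:
    """Mean squared error between input frames x_t and decoder outputs y_t."""
    diffs = [(a - b) ** 2 for x, y in zip(frames, outputs) for a, b in zip(x, y)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    rng = random.Random(0)
    feat_dim, hidden_dim = 3, 4
    w_in = rand_matrix(hidden_dim, feat_dim, rng)
    w_rec = rand_matrix(hidden_dim, hidden_dim, rng)
    w_dec = rand_matrix(hidden_dim, hidden_dim, rng)
    w_out = rand_matrix(feat_dim, hidden_dim, rng)
    segment = [[rng.random() for _ in range(feat_dim)] for _ in range(5)]
    code = encode(segment, w_in, w_rec)
    print(len(code), reconstruction_error(segment, decode(code, 5, w_dec, w_out)))
```

Whatever the number of input frames, `encode` returns a vector of the same length, which is the property the slides rely on.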

Experimental Results
• Two audio segments (e.g. “never” and “ever”) each go through the RNN Encoder; the cosine similarity of the two vectors is compared with the edit distance between the phoneme sequences.

Experimental Results
• More similar pronunciation → higher cosine similarity.
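The two quantities being compared can be computed as in the sketch below. The phoneme transcriptions are ARPAbet-style assumptions written by hand, and the vectors in the demo are made up; real vectors would come from the RNN encoder.

```python
import math
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    never = ["N", "EH", "V", "ER"]  # assumed transcription
    ever = ["EH", "V", "ER"]        # assumed transcription
    dog = ["D", "AO", "G"]          # assumed transcription
    print(edit_distance(never, ever), edit_distance(never, dog))
    # Made-up segment vectors, only to show the call.
    print(cosine_similarity([0.9, 0.1, 0.3], [0.8, 0.2, 0.3]))
```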

Observation
• Visualizing embedding vectors of the words: audio segment vectors projected onto 2-D
[Figure: fear, near, name, fame]

Next Step ……
• Including semantics?
flower tree
dog
cat
cats
walk
walked
run

CONCLUDING REMARKS
Spoken Content → Speech Recognition → Text
Spoken Content Retrieval, Key Term Extraction, Summarization, Question Answering, Interaction, Organization
With Deep Learning, machines will understand spoken content and extract useful information for humans.

If you want to “deeply learn deep learning”
• My course: Machine learning and having it deep and structured
  http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html
  6-hour version: http://www.slideshare.net/tw_dsconf/ss-62245351
• “Neural Networks and Deep Learning”, written by Michael Nielsen
  http://neuralnetworksanddeeplearning.com/
• “Deep Learning”, written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville
  http://www.deeplearningbook.org

THANK YOU
