Deep Recurrent Neural Networks with Layer-wise Multi-head Attentions
for Punctuation Restoration
Seokhwan Kim
Adobe Research, San Jose, CA, USA
Punctuation Restoration
Aims to generate punctuation marks for unpunctuated ASR outputs
A sequence labelling task
to predict the punctuation label y_t for the t-th timestep
in a given word sequence X = {x_1, ..., x_t, ..., x_T}:

y_t = \begin{cases} c \in C & \text{if a punctuation symbol } c \text{ is located between } x_{t-1} \text{ and } x_t, \\ O & \text{otherwise,} \end{cases}

where C = {‘,COMMA’, ‘.PERIOD’, ‘?QMARK’}
Examples
Input: so if we make no changes today what does tomorrow
look like well i think you probably already have the picture
Output: so , if we make no changes today , what does
tomorrow look like ? well , i think you probably already have
the picture .
t xt yt
1 so O
2 if ,COMMA
3 we O
4 make O
5 no O
6 changes O
7 today O
8 what ,COMMA
9 does O
10 tomorrow O
11 look O
12 like O
13 well ?QMARK
14 i ,COMMA
15 think O
16 you O
17 probably O
18 already O
19 have O
20 the O
21 picture O
22 </s> .PERIOD
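The labelling scheme above can be sketched as a small helper that converts a punctuated reference (with marks as standalone tokens, as in the Output above) into (word, label) pairs; the function name and implementation are illustrative, not taken from the poster.

```python
# Hypothetical helper: derive per-token labels y_t from a punctuated
# reference, following the labelling scheme above. A mark found between
# x_{t-1} and x_t becomes the label of the *following* word x_t; a
# sentence-final mark attaches to the </s> token.
PUNCT = {",": ",COMMA", ".": ".PERIOD", "?": "?QMARK"}

def label_sequence(punctuated):
    words, labels = [], []
    pending = "O"
    for tok in punctuated.split():
        if tok in PUNCT:
            pending = PUNCT[tok]   # applies to the next word
        else:
            words.append(tok)
            labels.append(pending)
            pending = "O"
    words.append("</s>")           # carries any sentence-final mark
    labels.append(pending)
    return list(zip(words, labels))
```

Running it on the example sentence reproduces the table above, including `("if", ",COMMA")` at t=2 and `("</s>", ".PERIOD")` at the end.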
Related Work
Multiple fully-connected hidden layers [1]
Convolutional neural networks (CNNs) [2]
Recurrent neural networks (RNNs) [4]
RNNs with neural attention mechanisms [3]
Main Contributions
Context encoding with stacked recurrent layers to
learn more hierarchical aspects of the input
Attention applied to every intermediate
hidden layer to capture layer-wise features
Multi-head attention instead of a single attention
function to further diversify the layer-wise attentions
Methods
Model Architecture
[Figure: model architecture — the input words x_1, ..., x_T feed n bidirectional recurrent layers producing states h^1_t, ..., h^n_t; m attention heads per layer, queried by the uni-directional states s_t, produce the contexts used to predict y_t]
Word Embedding
Each word x_t in a given input sequence X is represented as a
d-dimensional distributed vector x_t \in \mathbb{R}^d
Word embeddings can be learned from scratch with random
initialization or fine-tuned from pre-trained word vectors
Bi-directional Recurrent Layers
Stacking multiple recurrent layers on top of each other
\overrightarrow{h}^i_t =
\begin{cases}
\mathrm{GRU}\left(x_t, \overrightarrow{h}^1_{t-1}\right) & \text{if } i = 1, \\
\mathrm{GRU}\left(h^{i-1}_t, \overrightarrow{h}^i_{t-1}\right) & \text{if } i = 2, \cdots, n,
\end{cases}

The backward state \overleftarrow{h}^i_t is computed in the same way,
but in reverse order from the end of the sequence T back to t.
Both directional states are concatenated into the output state
h^i_t = \left[\overrightarrow{h}^i_t, \overleftarrow{h}^i_t\right] \in \mathbb{R}^{2d}
to represent both contexts together
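The stacked bidirectional encoder described by the equations above can be sketched in NumPy as follows; the parameter shapes, initialization scale, and function names are illustrative assumptions, not the poster's exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    """One GRU update: update gate z, reset gate r, candidate state."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    return (1.0 - z) * h + z * h_tilde

def init_params(d_in, d, rng):
    k = lambda m, n: 0.1 * rng.standard_normal((m, n))
    return {"Wz": k(d, d_in), "Uz": k(d, d), "bz": np.zeros(d),
            "Wr": k(d, d_in), "Ur": k(d, d), "br": np.zeros(d),
            "Wh": k(d, d_in), "Uh": k(d, d), "bh": np.zeros(d)}

def bigru_stack(xs, n_layers, d, rng):
    """Stacked bidirectional GRU: layer i consumes layer i-1's
    concatenated forward/backward states h^{i-1}_t, as above."""
    states_per_layer = []
    inputs, d_in = xs, xs[0].shape[0]
    for _ in range(n_layers):
        fp, bp = init_params(d_in, d, rng), init_params(d_in, d, rng)
        T = len(inputs)
        fwd, h = [], np.zeros(d)
        for t in range(T):                 # forward pass, t = 1..T
            h = gru_step(inputs[t], h, fp)
            fwd.append(h)
        bwd, h = [None] * T, np.zeros(d)
        for t in reversed(range(T)):       # backward pass, t = T..1
            h = gru_step(inputs[t], h, bp)
            bwd[t] = h
        layer = [np.concatenate([fwd[t], bwd[t]]) for t in range(T)]
        states_per_layer.append(layer)
        inputs, d_in = layer, 2 * d        # feed h^i_t into layer i+1
    return states_per_layer
```

Each layer's output state has dimension 2d, matching h^i_t ∈ R^{2d}.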
Uni-directional Recurrent Layer
The top-layer outputs [h^n_1, \cdots, h^n_T] are forwarded to a
uni-directional recurrent layer:
s_t = \mathrm{GRU}\left(h^n_t, s_{t-1}\right) \in \mathbb{R}^d,
Multi-head Attentions
The hidden state s_t constitutes the query to neural attentions
based on scaled dot-product attention:
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V,
Multi-head attentions are applied to every layer separately to
compute the weighted contexts from different subspaces:
f^{i,j} = \mathrm{Attn}\left(S \cdot W^{i,j}_Q,\ H^i \cdot W^{i,j}_K,\ H^i \cdot W^{i,j}_V\right),
where S = [s_1; s_2; \cdots; s_T] and H^i = [h^i_1; h^i_2; \cdots; h^i_T]
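The scaled dot-product attention and the per-layer, per-head contexts f^{i,j} can be sketched in NumPy; the projection sizes and random initialization here are illustrative assumptions.

```python
import numpy as np

def attn(Q, K, V):
    # Attn(Q, K, V) = softmax(Q K^T / sqrt(d)) V, row-wise softmax
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def layerwise_multihead(S, layer_states, m, d, rng):
    """For each layer's states H^i (T x 2d) and each of m heads,
    project queries S (T x d) and keys/values H^i into a d-dim
    subspace and attend; returns all contexts f^{i,j}."""
    contexts = []
    for Hi in layer_states:
        for _ in range(m):  # separate W_Q, W_K, W_V per head and layer
            WQ = 0.1 * rng.standard_normal((S.shape[1], d))
            WK = 0.1 * rng.standard_normal((Hi.shape[1], d))
            WV = 0.1 * rng.standard_normal((Hi.shape[1], d))
            contexts.append(attn(S @ WQ, Hi @ WK, Hi @ WV))
    return contexts
```

With n layers and m heads this yields n·m context matrices, one f^{i,j} per (layer, head) pair.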
Softmax Classification
y_t = \mathrm{softmax}\left(\left[s_t, f^{1,1}_t, \cdots, f^{n,m}_t\right] W_y + b_y\right)
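The classification step concatenates s_t with every attended context for timestep t and applies a softmax over the label set; a minimal sketch, with illustrative names:

```python
import numpy as np

def classify(st, contexts_t, Wy, by):
    # y_t = softmax([s_t, f^{1,1}_t, ..., f^{n,m}_t] W_y + b_y)
    z = np.concatenate([st] + contexts_t) @ Wy + by
    e = np.exp(z - z.max())          # numerically stable softmax
    return e / e.sum()
```

The output is a probability distribution over the labels {O, ,COMMA, .PERIOD, ?QMARK}.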
Experimental Setup
IWSLT dataset
English reference transcripts of TED Talks
2.1M, 296k, and 13k words for the training,
development, and test sets (reference transcripts), respectively
Model Configurations
Number of recurrent layers: n ∈ {1, 2, 3, 4}
Number of attention heads: m ∈ {1, 2, 3, 4, 5}
Attended contexts: layer-wise or only on top
Implementation Details
Word embeddings: 50-d GloVe vectors trained on Wikipedia
Optimizer: Adam
Loss: negative log-likelihood
Hidden dimension d = 256
The same dimension is used across all components
except the word embedding layer
Mini-batch size: 128
Dropout rate: 0.5
Early stopping within 100 epochs
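The implementation details above can be collected into a single configuration sketch; the dictionary name and key names are illustrative, the values come from the poster.

```python
# Hyperparameters as listed on the poster; n and m were searched
# over the listed ranges rather than fixed.
CONFIG = {
    "embedding_dim": 50,                          # 50-d GloVe (Wikipedia)
    "hidden_dim": 256,                            # d, shared across components
    "recurrent_layers_searched": [1, 2, 3, 4],    # n
    "attention_heads_searched": [1, 2, 3, 4, 5],  # m
    "optimizer": "Adam",
    "loss": "negative log-likelihood",
    "batch_size": 128,
    "dropout": 0.5,
    "max_epochs": 100,                            # with early stopping
}
```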
Results
Comparisons of the performances with
different network configurations
[Bar chart: F-measure (0.62–0.68) over 1–4 recurrent layers for four configurations: single-head top-layer, single-head layer-wise, multi-head top-layer, and multi-head layer-wise attention]
Results
COMMA PERIOD QUESTION OVERALL
Models P R F P R F P R F P R F SER
DNN-A [1] 48.6 42.4 45.3 59.7 68.3 63.7 - - - 54.8 53.6 54.2 66.9
CNN-2A [1] 48.1 44.5 46.2 57.6 69.0 62.8 - - - 53.4 55.0 54.2 68.0
T-LSTM [2] 49.6 41.4 45.1 60.2 53.4 56.6 57.1 43.5 49.4 55.0 47.2 50.8 74.0
T-BRNN [3] 64.4 45.2 53.1 72.3 71.5 71.9 67.5 58.7 62.8 68.9 58.1 63.1 51.3
T-BRNN-pre [3] 65.5 47.1 54.8 73.3 72.5 72.9 70.7 63.0 66.7 70.0 59.7 64.4 49.7
Single-BiRNN [4] 62.2 47.7 54.0 74.6 72.1 73.4 67.5 52.9 59.3 69.2 59.8 64.2 51.1
Corr-BiRNN [4] 60.9 52.4 56.4 75.3 70.8 73.0 70.7 56.9 63.0 68.6 61.6 64.9 50.8
DRNN-LWMA 63.4 55.7 59.3 76.0 73.5 74.7 75.0 71.7 73.3 70.0 64.6 67.2 47.3
DRNN-LWMA-pre 62.9 60.8 61.9 77.3 73.7 75.5 69.6 69.6 69.6 69.9 67.2 68.6 46.1
[Figure: attention weight visualizations over the example sentence "so if we make no changes today what does tomorrow look like well i think you probably already have the picture"]
References
[1] Xiaoyin Che, Cheng Wang, Haojin Yang, and Christoph Meinel, "Punctuation prediction for unsegmented transcript based on word vector," LREC 2016.
[2] Ottokar Tilk and Tanel Alumäe, "LSTM for punctuation restoration in speech transcripts," INTERSPEECH 2015.
[3] Ottokar Tilk and Tanel Alumäe, "Bidirectional recurrent neural network with attention mechanism for punctuation restoration," INTERSPEECH 2016.
[4] Vardaan Pahuja et al., "Joint learning of correlated sequence labelling tasks using bidirectional recurrent neural networks," INTERSPEECH 2017.
345 Park Avenue, San Jose, CA 95110, USA Email: seokim@adobe.com

More Related Content

What's hot

Dijkstra's algorithm
Dijkstra's algorithmDijkstra's algorithm
Dijkstra's algorithm
gsp1294
 
Sns mid term-test2-solution
Sns mid term-test2-solutionSns mid term-test2-solution
Sns mid term-test2-solution
cheekeong1231
 
From L to N: Nonlinear Predictors in Generalized Models
From L to N: Nonlinear Predictors in Generalized ModelsFrom L to N: Nonlinear Predictors in Generalized Models
From L to N: Nonlinear Predictors in Generalized Models
htstatistics
 
gnm: a Package for Generalized Nonlinear Models
gnm: a Package for Generalized Nonlinear Modelsgnm: a Package for Generalized Nonlinear Models
gnm: a Package for Generalized Nonlinear Models
htstatistics
 
Multiplicative Interaction Models in R
Multiplicative Interaction Models in RMultiplicative Interaction Models in R
Multiplicative Interaction Models in R
htstatistics
 
lecture 30
lecture 30lecture 30
lecture 30
sajinsc
 
Dijkstra's Algorithm
Dijkstra's AlgorithmDijkstra's Algorithm
Dijkstra's Algorithm
guest862df4e
 

What's hot (20)

Biconnectivity
BiconnectivityBiconnectivity
Biconnectivity
 
Linear Cryptanalysis Lecture 線形解読法
Linear Cryptanalysis Lecture 線形解読法Linear Cryptanalysis Lecture 線形解読法
Linear Cryptanalysis Lecture 線形解読法
 
Single source stortest path bellman ford and dijkstra
Single source stortest path bellman ford and dijkstraSingle source stortest path bellman ford and dijkstra
Single source stortest path bellman ford and dijkstra
 
Euclid's Algorithm for Greatest Common Divisor - Time Complexity Analysis
Euclid's Algorithm for Greatest Common Divisor - Time Complexity AnalysisEuclid's Algorithm for Greatest Common Divisor - Time Complexity Analysis
Euclid's Algorithm for Greatest Common Divisor - Time Complexity Analysis
 
Prim's Algorithm on minimum spanning tree
Prim's Algorithm on minimum spanning treePrim's Algorithm on minimum spanning tree
Prim's Algorithm on minimum spanning tree
 
Generalized Nonlinear Models in R
Generalized Nonlinear Models in RGeneralized Nonlinear Models in R
Generalized Nonlinear Models in R
 
Dijkstra's algorithm
Dijkstra's algorithmDijkstra's algorithm
Dijkstra's algorithm
 
Sns mid term-test2-solution
Sns mid term-test2-solutionSns mid term-test2-solution
Sns mid term-test2-solution
 
From L to N: Nonlinear Predictors in Generalized Models
From L to N: Nonlinear Predictors in Generalized ModelsFrom L to N: Nonlinear Predictors in Generalized Models
From L to N: Nonlinear Predictors in Generalized Models
 
Merge Sort
Merge SortMerge Sort
Merge Sort
 
gnm: a Package for Generalized Nonlinear Models
gnm: a Package for Generalized Nonlinear Modelsgnm: a Package for Generalized Nonlinear Models
gnm: a Package for Generalized Nonlinear Models
 
Aaex4 group2(中英夾雜)
Aaex4 group2(中英夾雜)Aaex4 group2(中英夾雜)
Aaex4 group2(中英夾雜)
 
Multiplicative Interaction Models in R
Multiplicative Interaction Models in RMultiplicative Interaction Models in R
Multiplicative Interaction Models in R
 
Deep dive into rsa
Deep dive into rsaDeep dive into rsa
Deep dive into rsa
 
Prim Algorithm and kruskal algorithm
Prim Algorithm and kruskal algorithmPrim Algorithm and kruskal algorithm
Prim Algorithm and kruskal algorithm
 
Sns pre sem
Sns pre semSns pre sem
Sns pre sem
 
lecture 30
lecture 30lecture 30
lecture 30
 
Justesen codes alternant codes goppa codes
Justesen codes alternant codes goppa codesJustesen codes alternant codes goppa codes
Justesen codes alternant codes goppa codes
 
Lec10
Lec10Lec10
Lec10
 
Dijkstra's Algorithm
Dijkstra's AlgorithmDijkstra's Algorithm
Dijkstra's Algorithm
 

Similar to Deep Recurrent Neural Networks with Layer-wise Multi-head Attentions for Punctuation Restoration

Divide-and-Conquer & Dynamic ProgrammingDivide-and-Conqu.docx
Divide-and-Conquer & Dynamic ProgrammingDivide-and-Conqu.docxDivide-and-Conquer & Dynamic ProgrammingDivide-and-Conqu.docx
Divide-and-Conquer & Dynamic ProgrammingDivide-and-Conqu.docx
jacksnathalie
 
2_1 Edit Distance.pptx
2_1 Edit Distance.pptx2_1 Edit Distance.pptx
2_1 Edit Distance.pptx
tanishamahajan11
 
Skiena algorithm 2007 lecture17 edit distance
Skiena algorithm 2007 lecture17 edit distanceSkiena algorithm 2007 lecture17 edit distance
Skiena algorithm 2007 lecture17 edit distance
zukun
 
machinelearning project
machinelearning projectmachinelearning project
machinelearning project
Lianli Liu
 
Skiena algorithm 2007 lecture15 backtracing
Skiena algorithm 2007 lecture15 backtracingSkiena algorithm 2007 lecture15 backtracing
Skiena algorithm 2007 lecture15 backtracing
zukun
 
HW 5-RSAascii2str.mfunction str = ascii2str(ascii) .docx
HW 5-RSAascii2str.mfunction str = ascii2str(ascii)        .docxHW 5-RSAascii2str.mfunction str = ascii2str(ascii)        .docx
HW 5-RSAascii2str.mfunction str = ascii2str(ascii) .docx
wellesleyterresa
 

Similar to Deep Recurrent Neural Networks with Layer-wise Multi-head Attentions for Punctuation Restoration (20)

Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
 
Divide-and-Conquer & Dynamic ProgrammingDivide-and-Conqu.docx
Divide-and-Conquer & Dynamic ProgrammingDivide-and-Conqu.docxDivide-and-Conquer & Dynamic ProgrammingDivide-and-Conqu.docx
Divide-and-Conquer & Dynamic ProgrammingDivide-and-Conqu.docx
 
Machine Learning 1
Machine Learning 1Machine Learning 1
Machine Learning 1
 
RNN and sequence-to-sequence processing
RNN and sequence-to-sequence processingRNN and sequence-to-sequence processing
RNN and sequence-to-sequence processing
 
2_1 Edit Distance.pptx
2_1 Edit Distance.pptx2_1 Edit Distance.pptx
2_1 Edit Distance.pptx
 
Skiena algorithm 2007 lecture17 edit distance
Skiena algorithm 2007 lecture17 edit distanceSkiena algorithm 2007 lecture17 edit distance
Skiena algorithm 2007 lecture17 edit distance
 
Estimating structured vector autoregressive models
Estimating structured vector autoregressive modelsEstimating structured vector autoregressive models
Estimating structured vector autoregressive models
 
2_EditDistance_Jan_08_2020.pptx
2_EditDistance_Jan_08_2020.pptx2_EditDistance_Jan_08_2020.pptx
2_EditDistance_Jan_08_2020.pptx
 
Nondeterministic Finite Automata AFN.pdf
Nondeterministic Finite Automata AFN.pdfNondeterministic Finite Automata AFN.pdf
Nondeterministic Finite Automata AFN.pdf
 
01 - DAA - PPT.pptx
01 - DAA - PPT.pptx01 - DAA - PPT.pptx
01 - DAA - PPT.pptx
 
machinelearning project
machinelearning projectmachinelearning project
machinelearning project
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine Learning
 
Skiena algorithm 2007 lecture15 backtracing
Skiena algorithm 2007 lecture15 backtracingSkiena algorithm 2007 lecture15 backtracing
Skiena algorithm 2007 lecture15 backtracing
 
Identification of the Mathematical Models of Complex Relaxation Processes in ...
Identification of the Mathematical Models of Complex Relaxation Processes in ...Identification of the Mathematical Models of Complex Relaxation Processes in ...
Identification of the Mathematical Models of Complex Relaxation Processes in ...
 
MVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priorsMVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priors
 
LeetCode Solutions In Java .pdf
LeetCode Solutions In Java .pdfLeetCode Solutions In Java .pdf
LeetCode Solutions In Java .pdf
 
Chapter 06 rsa cryptosystem
Chapter 06   rsa cryptosystemChapter 06   rsa cryptosystem
Chapter 06 rsa cryptosystem
 
Response Surface in Tensor Train format for Uncertainty Quantification
Response Surface in Tensor Train format for Uncertainty QuantificationResponse Surface in Tensor Train format for Uncertainty Quantification
Response Surface in Tensor Train format for Uncertainty Quantification
 
Introduction to Tree-LSTMs
Introduction to Tree-LSTMsIntroduction to Tree-LSTMs
Introduction to Tree-LSTMs
 
HW 5-RSAascii2str.mfunction str = ascii2str(ascii) .docx
HW 5-RSAascii2str.mfunction str = ascii2str(ascii)        .docxHW 5-RSAascii2str.mfunction str = ascii2str(ascii)        .docx
HW 5-RSAascii2str.mfunction str = ascii2str(ascii) .docx
 

More from Seokhwan Kim

Exploring Convolutional and Recurrent Neural Networks in Sequential Labelling...
Exploring Convolutional and Recurrent Neural Networks in Sequential Labelling...Exploring Convolutional and Recurrent Neural Networks in Sequential Labelling...
Exploring Convolutional and Recurrent Neural Networks in Sequential Labelling...
Seokhwan Kim
 
The Fourth Dialog State Tracking Challenge (DSTC4)
The Fourth Dialog State Tracking Challenge (DSTC4)The Fourth Dialog State Tracking Challenge (DSTC4)
The Fourth Dialog State Tracking Challenge (DSTC4)
Seokhwan Kim
 
Wikification of Concept Mentions within Spoken Dialogues Using Domain Constra...
Wikification of Concept Mentions within Spoken Dialogues Using Domain Constra...Wikification of Concept Mentions within Spoken Dialogues Using Domain Constra...
Wikification of Concept Mentions within Spoken Dialogues Using Domain Constra...
Seokhwan Kim
 
Towards Improving Dialogue Topic Tracking Performances with Wikification of C...
Towards Improving Dialogue Topic Tracking Performances with Wikification of C...Towards Improving Dialogue Topic Tracking Performances with Wikification of C...
Towards Improving Dialogue Topic Tracking Performances with Wikification of C...
Seokhwan Kim
 
Wikipedia-based Kernels for Dialogue Topic Tracking
Wikipedia-based Kernels for Dialogue Topic TrackingWikipedia-based Kernels for Dialogue Topic Tracking
Wikipedia-based Kernels for Dialogue Topic Tracking
Seokhwan Kim
 
A Graph-based Cross-lingual Projection Approach for Spoken Language Understan...
A Graph-based Cross-lingual Projection Approach for Spoken Language Understan...A Graph-based Cross-lingual Projection Approach for Spoken Language Understan...
A Graph-based Cross-lingual Projection Approach for Spoken Language Understan...
Seokhwan Kim
 
A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relatio...
A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relatio...A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relatio...
A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relatio...
Seokhwan Kim
 
MMR-based active machine learning for Bio named entity recognition
MMR-based active machine learning for Bio named entity recognitionMMR-based active machine learning for Bio named entity recognition
MMR-based active machine learning for Bio named entity recognition
Seokhwan Kim
 
A semi-supervised method for efficient construction of statistical spoken lan...
A semi-supervised method for efficient construction of statistical spoken lan...A semi-supervised method for efficient construction of statistical spoken lan...
A semi-supervised method for efficient construction of statistical spoken lan...
Seokhwan Kim
 
A spoken dialog system for electronic program guide information access
A spoken dialog system for electronic program guide information accessA spoken dialog system for electronic program guide information access
A spoken dialog system for electronic program guide information access
Seokhwan Kim
 
An alignment-based approach to semi-supervised relation extraction including ...
An alignment-based approach to semi-supervised relation extraction including ...An alignment-based approach to semi-supervised relation extraction including ...
An alignment-based approach to semi-supervised relation extraction including ...
Seokhwan Kim
 
An Alignment-based Pattern Representation Model for Information Extraction
An Alignment-based Pattern Representation Model for Information ExtractionAn Alignment-based Pattern Representation Model for Information Extraction
An Alignment-based Pattern Representation Model for Information Extraction
Seokhwan Kim
 
EPG 정보 검색을 위한 예제 기반 자연어 대화 시스템
EPG 정보 검색을 위한 예제 기반 자연어 대화 시스템EPG 정보 검색을 위한 예제 기반 자연어 대화 시스템
EPG 정보 검색을 위한 예제 기반 자연어 대화 시스템
Seokhwan Kim
 
A Cross-Lingual Annotation Projection Approach for Relation Detection
A Cross-Lingual Annotation Projection Approach for Relation DetectionA Cross-Lingual Annotation Projection Approach for Relation Detection
A Cross-Lingual Annotation Projection Approach for Relation Detection
Seokhwan Kim
 

More from Seokhwan Kim (20)

The Eighth Dialog System Technology Challenge (DSTC8)
The Eighth Dialog System Technology Challenge (DSTC8)The Eighth Dialog System Technology Challenge (DSTC8)
The Eighth Dialog System Technology Challenge (DSTC8)
 
Dynamic Memory Networks for Dialogue Topic Tracking
Dynamic Memory Networks for Dialogue Topic TrackingDynamic Memory Networks for Dialogue Topic Tracking
Dynamic Memory Networks for Dialogue Topic Tracking
 
The Fifth Dialog State Tracking Challenge (DSTC5)
The Fifth Dialog State Tracking Challenge (DSTC5)The Fifth Dialog State Tracking Challenge (DSTC5)
The Fifth Dialog State Tracking Challenge (DSTC5)
 
Natural Language in Human-Robot Interaction
Natural Language in Human-Robot InteractionNatural Language in Human-Robot Interaction
Natural Language in Human-Robot Interaction
 
Exploring Convolutional and Recurrent Neural Networks in Sequential Labelling...
Exploring Convolutional and Recurrent Neural Networks in Sequential Labelling...Exploring Convolutional and Recurrent Neural Networks in Sequential Labelling...
Exploring Convolutional and Recurrent Neural Networks in Sequential Labelling...
 
The Fourth Dialog State Tracking Challenge (DSTC4)
The Fourth Dialog State Tracking Challenge (DSTC4)The Fourth Dialog State Tracking Challenge (DSTC4)
The Fourth Dialog State Tracking Challenge (DSTC4)
 
Wikification of Concept Mentions within Spoken Dialogues Using Domain Constra...
Wikification of Concept Mentions within Spoken Dialogues Using Domain Constra...Wikification of Concept Mentions within Spoken Dialogues Using Domain Constra...
Wikification of Concept Mentions within Spoken Dialogues Using Domain Constra...
 
Towards Improving Dialogue Topic Tracking Performances with Wikification of C...
Towards Improving Dialogue Topic Tracking Performances with Wikification of C...Towards Improving Dialogue Topic Tracking Performances with Wikification of C...
Towards Improving Dialogue Topic Tracking Performances with Wikification of C...
 
A Composite Kernel Approach for Dialog Topic Tracking with Structured Domain ...
A Composite Kernel Approach for Dialog Topic Tracking with Structured Domain ...A Composite Kernel Approach for Dialog Topic Tracking with Structured Domain ...
A Composite Kernel Approach for Dialog Topic Tracking with Structured Domain ...
 
Sequential Labeling for Tracking Dynamic Dialog States
Sequential Labeling for Tracking Dynamic Dialog StatesSequential Labeling for Tracking Dynamic Dialog States
Sequential Labeling for Tracking Dynamic Dialog States
 
Wikipedia-based Kernels for Dialogue Topic Tracking
Wikipedia-based Kernels for Dialogue Topic TrackingWikipedia-based Kernels for Dialogue Topic Tracking
Wikipedia-based Kernels for Dialogue Topic Tracking
 
A Graph-based Cross-lingual Projection Approach for Spoken Language Understan...
A Graph-based Cross-lingual Projection Approach for Spoken Language Understan...A Graph-based Cross-lingual Projection Approach for Spoken Language Understan...
A Graph-based Cross-lingual Projection Approach for Spoken Language Understan...
 
A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relatio...
A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relatio...A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relatio...
A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relatio...
 
MMR-based active machine learning for Bio named entity recognition
MMR-based active machine learning for Bio named entity recognitionMMR-based active machine learning for Bio named entity recognition
MMR-based active machine learning for Bio named entity recognition
 
A semi-supervised method for efficient construction of statistical spoken lan...
A semi-supervised method for efficient construction of statistical spoken lan...A semi-supervised method for efficient construction of statistical spoken lan...
A semi-supervised method for efficient construction of statistical spoken lan...
 
A spoken dialog system for electronic program guide information access
A spoken dialog system for electronic program guide information accessA spoken dialog system for electronic program guide information access
A spoken dialog system for electronic program guide information access
 
An alignment-based approach to semi-supervised relation extraction including ...
An alignment-based approach to semi-supervised relation extraction including ...An alignment-based approach to semi-supervised relation extraction including ...
An alignment-based approach to semi-supervised relation extraction including ...
 
An Alignment-based Pattern Representation Model for Information Extraction
An Alignment-based Pattern Representation Model for Information ExtractionAn Alignment-based Pattern Representation Model for Information Extraction
An Alignment-based Pattern Representation Model for Information Extraction
 
EPG 정보 검색을 위한 예제 기반 자연어 대화 시스템
EPG 정보 검색을 위한 예제 기반 자연어 대화 시스템EPG 정보 검색을 위한 예제 기반 자연어 대화 시스템
EPG 정보 검색을 위한 예제 기반 자연어 대화 시스템
 
A Cross-Lingual Annotation Projection Approach for Relation Detection
A Cross-Lingual Annotation Projection Approach for Relation DetectionA Cross-Lingual Annotation Projection Approach for Relation Detection
A Cross-Lingual Annotation Projection Approach for Relation Detection
 

Recently uploaded

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 

Recently uploaded (20)

The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 

Deep Recurrent Neural Networks with Layer-wise Multi-head Attentions for Punctuation Restoration

  • 1. Deep Recurrent Neural Networks with Layer-wise Multi-head Attentions for Punctuation Restoration Seokhwan Kim Adobe Research, San Jose, CA, USA Punctuation Restoration Aims to generate the punctuation marks from the unpunctuated ASR outputs A sequence labelling task to predict the punctuation label yt for the t-th timestep in a given word sequence X = {x1 · · · xt · · · xT } where C ∈ {‘, COMMA , ‘.PERIOD , ‘?QMARK } yt =    c ∈ C if a punctuation symbol c is located between xt−1 and xt, O otherwise, Examples Input: so if we make no changes today what does tomorrow look like well i think you probably already have the picture Output: so , if we make no changes today , what does tomorrow look like ? well , i think you probably already have the picture . t xt yt 1 so O 2 if ,COMMA 3 we O 4 make O 5 no O 6 changes O 7 today O 8 what ,COMMA 9 does O 10 tomorrow O 11 look O t xt yt 12 like O 13 well ?QMARK 14 i ,COMMA 15 think O 16 you O 17 probably O 18 already O 19 have O 20 the O 21 picture O 22 </s> .PERIOD Related Work Multiple fully-connected hidden layers [1] Convolutional neural networks (CNNs) [2] Recurrent neural networks (RNNs) [4] RNNs with neural attention mechanisms [3] Main Contributions Context encoding from the stacked recurrent layers to learn more hierarchical aspects Neural attentions applied on every intermediate hidden layer to capture the layer-wise features Multi-head attentions instead of a single attention function to diversify the layer-wise attentions further Methods Model Architecture yt st-1st-1 stst ∙∙∙∙∙∙ n-bidirectional recurrent layers AttentionAttention AttentionAttention AttentionAttention m-heads mm mm Attention Attention Attention m-heads m m x1 ∙∙∙h1 1h1 1 h2 1h2 1 hn 1hn 1x1 ∙∙∙h1 1 h2 1 hn 1 xT ∙∙∙h1 Th1 T h2 Th2 T hn Thn TxT ∙∙∙h1 T h2 T hn T h2 th2 txt ∙∙∙h1 th1 t hn thn th2 txt ∙∙∙h1 t hn t ∙∙∙ ∙∙∙∙∙∙ ∙∙∙∙∙∙ ∙∙∙∙∙∙ ∙∙∙ ∙∙∙ ∙∙∙ ∙∙∙ ∙∙∙ ∙∙∙∙∙∙ ∙∙∙∙∙∙ ∙∙∙∙∙∙ ∙∙∙ ∙∙∙ ∙∙∙ ∙∙∙ Word Embedding Each word xt in a given input 
sequence X is represented as a d-dimensional distributed vector x_t ∈ R^d. Word embeddings can be learned from scratch with random initialization or fine-tuned from pre-trained word vectors.

Bi-directional Recurrent Layers
Multiple recurrent layers are stacked on top of each other:

    →h_t^i = { GRU(x_t, →h_{t-1}^1),        if i = 1,
             { GRU(h_t^{i-1}, →h_{t-1}^i),  if i = 2, ..., n.

The backward state ←h_t^i is computed in the same way but in reverse order, from the end of the sequence T down to t. Both directional states are concatenated into the output state h_t^i = [→h_t^i, ←h_t^i] ∈ R^{2d} to represent both contexts together.

Uni-directional Recurrent Layer
The top-layer outputs [h_1^n, ..., h_T^n] are forwarded to a uni-directional recurrent layer:

    s_t = GRU(h_t^n, s_{t-1}) ∈ R^d.

Multi-head Attentions
The hidden state s_t constitutes the query to neural attentions based on scaled dot-product attention:

    Attn(Q, K, V) = softmax(QK^T / √d) V.

Multi-head attentions are applied to every layer separately to compute the weighted contexts from different subspaces:

    f^{i,j} = Attn(S · W_Q^{i,j}, H^i · W_K^{i,j}, H^i · W_V^{i,j}),

where S = [s_1; s_2; ...; s_T] and H^i = [h_1^i; h_2^i; ...; h_T^i].

Softmax Classification

    y_t = softmax([s_t, f_t^{1,1}, ..., f_t^{n,m}] W_y + b_y).

Experimental Setup

IWSLT dataset
- English reference transcripts of TED Talks
- 2.1M, 296k, and 13k words in the training, development, and (reference) test sets, respectively

Model Configurations
- Number of recurrent layers: n ∈ {1, 2, 3, 4}
- Number of attention heads: m ∈ {1, 2, 3, 4, 5}
- Attended contexts: layer-wise or top-layer only

Implementation Details
- Word embeddings: 50-dimensional GloVe vectors trained on Wikipedia
- Optimizer: Adam
- Loss: negative log-likelihood
- Hidden dimension: d = 256, shared across all components except the word embedding layer
- Mini-batch size: 128
- Dropout rate: 0.5
- Early stopping within 100 epochs

Results

Comparisons of the performances with different network configurations:
[Figure: F-measure (roughly 0.62–0.68) for single-head top-layer, single-head layer-wise, multi-head top-layer, and multi-head layer-wise attention, each with 1, 2, 3, and 4 recurrent layers]
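As an implementation aside, the layer-wise multi-head attention and the concatenated classifier input described in Methods can be sketched in NumPy. This is a minimal sketch only: the dimensions, variable names, and random weights are illustrative assumptions, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn(Q, K, V):
    # Attn(Q, K, V) = softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(0)
T, d, n_layers, m_heads, d_head = 5, 8, 2, 3, 4

# Stand-ins for s_1..s_T from the uni-directional layer and the
# bi-directional outputs h^i_1..h^i_T (dimension 2d) of each layer.
S = rng.standard_normal((T, d))
H = [rng.standard_normal((T, 2 * d)) for _ in range(n_layers)]

contexts = []
for i in range(n_layers):          # attention on every layer (layer-wise)
    for j in range(m_heads):       # m heads per layer
        Wq = rng.standard_normal((d, d_head))
        Wk = rng.standard_normal((2 * d, d_head))
        Wv = rng.standard_normal((2 * d, d_head))
        # f^{i,j} = Attn(S Wq, H^i Wk, H^i Wv), shape (T, d_head)
        contexts.append(attn(S @ Wq, H[i] @ Wk, H[i] @ Wv))

# Rows of `features` are [s_t, f^{1,1}_t, ..., f^{n,m}_t],
# the inputs to the softmax classifier.
features = np.concatenate([S] + contexts, axis=-1)
print(features.shape)  # (5, 32): d + n*m*d_head = 8 + 2*3*4
```

Note the query always comes from the uni-directional states S, while keys and values come from the layer being attended, which is what makes the contexts layer-wise.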
                    COMMA             PERIOD            QUESTION          OVERALL
Models              P    R    F      P    R    F      P    R    F      P    R    F     SER
DNN-A [1]          48.6 42.4 45.3   59.7 68.3 63.7    -    -    -     54.8 53.6 54.2  66.9
CNN-2A [1]         48.1 44.5 46.2   57.6 69.0 62.8    -    -    -     53.4 55.0 54.2  68.0
T-LSTM [2]         49.6 41.4 45.1   60.2 53.4 56.6   57.1 43.5 49.4   55.0 47.2 50.8  74.0
T-BRNN [3]         64.4 45.2 53.1   72.3 71.5 71.9   67.5 58.7 62.8   68.9 58.1 63.1  51.3
T-BRNN-pre [3]     65.5 47.1 54.8   73.3 72.5 72.9   70.7 63.0 66.7   70.0 59.7 64.4  49.7
Single-BiRNN [4]   62.2 47.7 54.0   74.6 72.1 73.4   67.5 52.9 59.3   69.2 59.8 64.2  51.1
Corr-BiRNN [4]     60.9 52.4 56.4   75.3 70.8 73.0   70.7 56.9 63.0   68.6 61.6 64.9  50.8
DRNN-LWMA          63.4 55.7 59.3   76.0 73.5 74.7   75.0 71.7 73.3   70.0 64.6 67.2  47.3
DRNN-LWMA-pre      62.9 60.8 61.9   77.3 73.7 75.5   69.6 69.6 69.6   69.9 67.2 68.6  46.1

[Figure: layer-wise attention visualizations aligning the unpunctuated input "so if we make no changes today what does tomorrow look like well i think you probably already have the picture" with its punctuated output]

References

[1] Xiaoyin Che, Cheng Wang, Haojin Yang, and Christoph Meinel, "Punctuation prediction for unsegmented transcript based on word vector", LREC 2016.
[2] Ottokar Tilk and Tanel Alumae, "LSTM for punctuation restoration in speech transcripts", INTERSPEECH 2015.
[3] Ottokar Tilk and Tanel Alumae, "Bidirectional recurrent neural network with attention mechanism for punctuation restoration", INTERSPEECH 2016.
[4] Vardaan Pahuja et al., "Joint learning of correlated sequence labelling tasks using bidirectional recurrent neural networks", INTERSPEECH 2017.

345 Park Avenue, San Jose, CA 95110, USA
Email: seokim@adobe.com
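As a worked example of the labelling scheme in the Punctuation Restoration section (a punctuation mark between x_{t-1} and x_t becomes the label y_t of the following token, with </s> catching a sentence-final mark), here is a short Python sketch; the function name and the whitespace tokenization are simplifying assumptions, not part of the paper.

```python
# Map standalone punctuation tokens to the paper's label set C.
PUNCT_LABELS = {",": ",COMMA", ".": ".PERIOD", "?": "?QMARK"}

def label_sequence(tokenized_output):
    """Turn a punctuated token stream into (x_t, y_t) pairs: a mark
    between x_{t-1} and x_t labels the *following* token x_t."""
    pairs = []
    pending = "O"                     # label waiting for the next word
    for tok in tokenized_output.split():
        if tok in PUNCT_LABELS:       # standalone punctuation token
            pending = PUNCT_LABELS[tok]
            continue
        pairs.append((tok, pending))
        pending = "O"
    pairs.append(("</s>", pending))   # sentence-final mark lands on </s>
    return pairs

pairs = label_sequence(
    "so , if we make no changes today , what does tomorrow "
    "look like ? well , i think you probably already have the picture ."
)
print(pairs[:2])  # [('so', 'O'), ('if', ',COMMA')]
```

Running this on the poster's example reproduces the 22-row table above, including ("well", "?QMARK") at t = 13 and ("</s>", ".PERIOD") at t = 22.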