This document summarizes research on transition-based dependency parsing with selectional branching. It discusses how transition-based parsing can be made non-greedy by considering multiple transition sequences using beam search or selectional branching. Selectional branching dynamically adjusts the beam size for each sentence by collecting the k-best transition sequences for low confidence transitions in the one-best sequence, generating fewer parse trees than fixed-width beam search while maintaining accuracy.
Transition-based Dependency Parsing with Selectional Branching
1. Transition-based Dependency Parsing with Selectional Branching
Presented at the 4th Workshop on Statistical Parsing of Morphologically Rich Languages
October 18th, 2013
Jinho D. Choi
University of Massachusetts Amherst
2. Greedy vs. Non-greedy Parsing
• Greedy parsing
  - Considers only one head for each token.
  - Generates one parse tree per sentence.
  - e.g., transition-based parsing (2 ms / sentence).
• Non-greedy parsing
  - Considers multiple heads for each token.
  - Generates multiple parse trees per sentence.
  - e.g., transition-based parsing with beam search, graph-based parsing, linear programming, dual decomposition (≥ 93%).
3. Motivation
• How often do we need non-greedy parsing?
  - Our greedy parser performs as accurately as our non-greedy parser about 64% of the time.
  - The gap is even smaller when they are evaluated on non-benchmark data (e.g., tweets, chats, blogs).
• Many applications are time sensitive.
  - Some applications need at least one complete parse tree ready within a limited time period (e.g., search, dialog, Q/A).
• Hard sentences are hard for any parser!
  - Considering more heads does not always guarantee more accurate parse results.
13. Transition-based Parsing
• Transition-based dependency parsing (greedy)
  - Considers one transition for each parsing state.
  - [Diagram: a single chain of states from the initial state S through transitions t1, ..., t′, ..., tL to a complete tree T. What if t′ is not the correct transition?]
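A minimal sketch of this greedy loop, assuming a hypothetical parser-state interface (`is_terminal`, `valid_transitions`, `apply`, `tree`) and a `score` function standing in for the transition classifier; it is an illustration, not the paper's implementation.

```python
def greedy_parse(state, score):
    """Greedy transition-based decoding: commit to exactly one transition
    per parsing state, so one sentence yields exactly one tree."""
    while not state.is_terminal():
        # pick the single highest-scoring transition t' and never revisit it
        t = max(state.valid_transitions(), key=lambda y: score(state, y))
        state = state.apply(t)
    return state.tree()  # if some t' was wrong, the error has propagated
```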
19. Transition-based Parsing
• Transition-based dependency parsing with beam search
  - Considers the b highest scoring transitions for each block of parsing states.
  - [Diagram: from the initial state S, transitions t11 ... tb1 branch into states S1 ... Sb; the b sequences are expanded in parallel until each yields a complete tree, producing T1 ... Tb.]
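A matching sketch of the beam-search variant, under the same assumed state interface as the greedy sketch above; the beam width b and the tuple layout are illustrative.

```python
import heapq

def beam_parse(state, score, b=80):
    """Beam-search decoding: keep the b highest-scoring partial transition
    sequences at every step; each sentence yields exactly b trees T1..Tb."""
    beam = [(0.0, state)]
    while any(not s.is_terminal() for _, s in beam):
        candidates = []
        for total, s in beam:
            if s.is_terminal():
                candidates.append((total, s))      # finished sequences stay
            else:
                for t in s.valid_transitions():
                    candidates.append((total + score(s, t), s.apply(t)))
        beam = heapq.nlargest(b, candidates, key=lambda c: c[0])
    return [s.tree() for _, s in beam]             # fixed width, easy or hard
```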
24. Selectional Branching
• Issues with beam search
  - Generates a fixed number of parse trees no matter how easy or hard the input sentence is.
  - Is it possible to dynamically adjust the beam size for each individual sentence?
• Selectional branching (see the sketch below)
  - The one-best transition sequence is found by a greedy parser.
  - Collect the k-best state-transition pairs for each low confidence transition used to generate the one-best sequence.
  - Generate transition sequences from the b-1 highest scoring state-transition pairs in the collection.
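A sketch of these three steps under the same assumed state interface as the earlier sketches, plus a hypothetical `kbest(state)` implementing the margin-based C^k classifier defined later in the deck; names and structure are illustrative, not the paper's code.

```python
import heapq

def selectional_branching(state, score, kbest, b=80):
    """Selectional branching: one greedy pass collects runner-up
    (state, transition, score) pairs at low-confidence decisions; only the
    b-1 best pairs are branched, reusing the states already built."""
    pairs = []                                     # the collected pairs
    while not state.is_terminal():
        preds = kbest(state)                       # [(transition, f-score), ...]
        if len(preds) > 1:                         # low confidence transition
            pairs += [(state, t, v) for t, v in preds[1:]]
        state = state.apply(preds[0][0])
    trees = [state.tree()]                         # tree of one-best sequence
    for s, t, _ in heapq.nlargest(b - 1, pairs, key=lambda p: p[2]):
        s = s.apply(t)                             # carry on the shared prefix
        while not s.is_terminal():                 # finish the branch greedily
            s = s.apply(max(s.valid_transitions(), key=lambda y: score(s, y)))
        trees.append(s.tree())
    return trees                                   # <= b trees; fewer when easy
```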
47. Selectional Branching
• [Diagram: λ collects the low confidence state-transition pairs from the one-best sequence (e.g., t′12 at S2, t′22 at S3, t′32 at S4); each branch reuses the states already built and is continued to a complete tree T.]
• Carries on parsing states from the one-best sequence.
• Guaranteed to generate fewer trees than beam search when |λ| ≤ b.
48. Low Confidence Transition
• Let C^1 be a classifier that uses a feature map \Phi(x, y) and a weight vector w to measure a score for each label y \in Y, choosing the label with the highest score given the parsing state x (when there is a tie, the first label is chosen). This can be expressed as logistic regression:

  C^1(x) = \arg\max_{y \in Y} f(x, y)

  f(x, y) = \frac{\exp(w \cdot \Phi(x, y))}{\sum_{y' \in Y} \exp(w \cdot \Phi(x, y'))}

• To find low confidence predictions, we use the margins (score differences) between the best prediction and the other predictions. If all margins are greater than a threshold, the best prediction is considered highly confident; otherwise, it is not. Given this analogy, the k-best predictions can be found as follows (m \ge 0 is a margin threshold):

  C^k(x, m) = K\arg\max_{y \in Y} f(x, y) \quad \text{s.t.} \quad f(x, C^1(x)) - f(x, y) \le m

• 'K arg max' returns a set of k' labels whose margins to C^1(x) are at most m, where k' \le k. When m = 0, it returns only the highest scoring labels, including C^1(x); when m = 1, it returns all labels. C^1(x) is low confident if |C^k(x, m)| > 1.
• While the greedy parser generates the one-best sequence T_1 = [s_{11}, ..., s_{1t}], the pairs (s_{1j}, p_{2j}), ..., (s_{1j}, p_{kj}) are collected for each low confidence prediction p_{1j}.
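A small, self-contained sketch of C^k(x, m): softmax the raw transition scores into f(x, y), then keep up to k labels within margin m of the best. The transition names and scores below are made up for illustration.

```python
import math

def kbest_with_margin(scores, k=2, m=0.05):
    """C^k(x, m): softmax raw scores into f(x, y), then return up to k
    labels whose margin to the best label is at most m."""
    z = sum(math.exp(v) for v in scores.values())
    f = {y: math.exp(v) / z for y, v in scores.items()}
    best = max(f.values())
    within = sorted((y for y in f if best - f[y] <= m),
                    key=lambda y: f[y], reverse=True)
    return within[:k]

# made-up scores: C^1(x) is low confident here because |C^k(x, m)| > 1
print(kbest_with_margin({"SHIFT": 1.2, "LEFT-ARC": 1.1, "RIGHT-ARC": -0.3}))
# -> ['SHIFT', 'LEFT-ARC']
```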
49. Experiments
• Parsing algorithm (Choi & McCallum, 2013)
  - Hybrid between Nivre's arc-eager and list-based algorithms.
  - Projective parsing: O(n).
  - Non-projective parsing: expected linear time.
• Features (see the sketch below)
  - Rich non-local features from Zhang & Nivre, 2011.
  - For languages with morphological features, morphologies of σ[0] and β[0] are used as unigram features.
  - For languages with coarse-grained POS tags, feature templates using fine-grained POS tags are replicated.
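A rough sketch of the two language-specific extensions above; the token fields (`.fpos`, `.cpos`, `.morph`) and the template strings are hypothetical stand-ins, not ClearNLP's actual feature API.

```python
def language_specific_features(s0, b0, has_morph, has_coarse_pos):
    """Sketch of the feature extensions above. s0/b0 are hypothetical token
    objects for sigma[0] and beta[0] with .fpos, .cpos, and .morph fields."""
    feats = []
    for name, tok in (("s0", s0), ("b0", b0)):
        feats.append(f"{name}:fpos={tok.fpos}")        # base POS template
        if has_coarse_pos:
            feats.append(f"{name}:cpos={tok.cpos}")    # replicated template
        if has_morph:
            feats += [f"{name}:morph={m}" for m in tok.morph]  # unigrams
    return feats
```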
50. Number of Transitions
• Using an Intel Xeon 2.57GHz machine, it takes less than 40 minutes to train on the entire Penn Treebank, including times for IO, feature extraction, and bootstrapping.
• [Figure 5: the total number of transitions performed during decoding with respect to beam sizes (beam size = 1, 2, 4, 8, 16, 32, 64, 80).]
54. SPMRL 2013 Shared Task
• Baseline results provided by ClearNLP.

                    5K                    Full
  Language     LAS    UAS    LS      LAS    UAS    LS
  Arabic       81.72  84.46  93.41   84.19  86.48  94.43
  Basque       78.01  84.62  82.71   79.16  85.32  83.63
  French       73.39  85.30  81.42   74.51  86.41  82.00
  German       82.58  85.36  90.49   86.73  88.80  92.95
  Hebrew       75.09  81.74  82.84   -      -      -
  Hungarian    81.98  86.09  88.26   82.68  86.56  88.80
  Korean       76.28  80.39  87.32   83.55  86.82  92.39
  Polish       80.64  88.49  86.47   81.12  89.24  86.59
  Swedish      80.96  86.48  85.10   -      -      -
55. Conclusion
• Selectional branching
  - Uses confidence estimates to decide when to employ a beam.
  - Shows accuracy comparable to traditional beam search.
  - Runs faster than any other non-greedy parsing approach.
• ClearNLP
  - Provides several NLP tools including a morphological analyzer, a dependency parser, a semantic role labeler, etc.
  - Webpage: clearnlp.com.