This document summarizes research on transition-based dependency parsing with selectional branching. It discusses how transition-based parsing can be made non-greedy by considering multiple transition sequences using beam search or selectional branching. Selectional branching dynamically adjusts the beam size for each sentence by collecting the k-best transition sequences for low confidence transitions in the one-best sequence, generating fewer parse trees than fixed-width beam search while maintaining accuracy.
Transition-based Dependency Parsing with Selectional Branching
1. Transition-based Dependency Parsing with Selectional Branching
Presented at the 4th Workshop on Statistical Parsing of Morphologically Rich Languages
October 18th, 2013
Jinho D. Choi
University of Massachusetts Amherst
2. Greedy vs. Non-greedy Parsing
• Greedy parsing
  - Considers only one head for each token.
  - Generates one parse tree per sentence.
  - e.g., transition-based parsing (2 ms / sentence).
• Non-greedy parsing
  - Considers multiple heads for each token.
  - Generates multiple parse trees per sentence.
  - e.g., transition-based parsing with beam search, graph-based parsing, linear programming, dual decomposition (≥ 93%).
3. Motivation
• How often do we need non-greedy parsing?
  - Our greedy parser performs as accurately as our non-greedy parser about 64% of the time.
  - The gap is even smaller when they are evaluated on non-benchmark data (e.g., tweets, chats, blogs).
• Many applications are time sensitive.
  - Some applications need at least one complete parse tree ready within a limited time period (e.g., search, dialog, Q/A).
• Hard sentences are hard for any parser!
  - Considering more heads does not always guarantee more accurate parse results.
13. Transition-based Parsing
• Transition-based dependency parsing (greedy)
  - Considers one transition for each parsing state.
  - [Diagram: a single chain of states from the initial state S through transitions t1, ..., t′, ..., tL to a complete tree T. What if t′ is not the correct transition?]
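A minimal sketch of this greedy loop, assuming a hypothetical parser-state interface (`is_terminal`, `valid_transitions`, `apply`, `tree`) and a `score` function standing in for the transition classifier; it is an illustration, not the paper's implementation.

```python
def greedy_parse(state, score):
    """Greedy transition-based decoding: commit to exactly one transition
    per parsing state, so one sentence yields exactly one tree."""
    while not state.is_terminal():
        # pick the single highest-scoring transition t' and never revisit it
        t = max(state.valid_transitions(), key=lambda y: score(state, y))
        state = state.apply(t)
    return state.tree()  # if some t' was wrong, the error has propagated
```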
19. Transition-based Parsing
• Transition-based dependency parsing with beam search
  - Considers the b highest scoring transitions for each block of parsing states.
  - [Diagram: from the initial state S, transitions t11 ... tb1 branch into states S1 ... Sb; the b sequences are expanded in parallel until each yields a complete tree, producing T1 ... Tb.]
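A matching sketch of the beam-search variant, under the same assumed state interface as the greedy sketch above; the beam width b and the tuple layout are illustrative.

```python
import heapq

def beam_parse(state, score, b=80):
    """Beam-search decoding: keep the b highest-scoring partial transition
    sequences at every step; each sentence yields exactly b trees T1..Tb."""
    beam = [(0.0, state)]
    while any(not s.is_terminal() for _, s in beam):
        candidates = []
        for total, s in beam:
            if s.is_terminal():
                candidates.append((total, s))      # finished sequences stay
            else:
                for t in s.valid_transitions():
                    candidates.append((total + score(s, t), s.apply(t)))
        beam = heapq.nlargest(b, candidates, key=lambda c: c[0])
    return [s.tree() for _, s in beam]             # fixed width, easy or hard
```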
24. Selectional Branching
• Issues with beam search
  - Generates a fixed number of parse trees no matter how easy or hard the input sentence is.
  - Is it possible to dynamically adjust the beam size for each individual sentence?
• Selectional branching (see the sketch below)
  - The one-best transition sequence is found by a greedy parser.
  - Collect the k-best state-transition pairs for each low confidence transition used to generate the one-best sequence.
  - Generate transition sequences from the b-1 highest scoring state-transition pairs in the collection.
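A sketch of these three steps under the same assumed state interface as the earlier sketches, plus a hypothetical `kbest(state)` implementing the margin-based C^k classifier defined later in the deck; names and structure are illustrative, not the paper's code.

```python
import heapq

def selectional_branching(state, score, kbest, b=80):
    """Selectional branching: one greedy pass collects runner-up
    (state, transition, score) pairs at low-confidence decisions; only the
    b-1 best pairs are branched, reusing the states already built."""
    pairs = []                                     # the collected pairs
    while not state.is_terminal():
        preds = kbest(state)                       # [(transition, f-score), ...]
        if len(preds) > 1:                         # low confidence transition
            pairs += [(state, t, v) for t, v in preds[1:]]
        state = state.apply(preds[0][0])
    trees = [state.tree()]                         # tree of one-best sequence
    for s, t, _ in heapq.nlargest(b - 1, pairs, key=lambda p: p[2]):
        s = s.apply(t)                             # carry on the shared prefix
        while not s.is_terminal():                 # finish the branch greedily
            s = s.apply(max(s.valid_transitions(), key=lambda y: score(s, y)))
        trees.append(s.tree())
    return trees                                   # <= b trees; fewer when easy
```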
47. Selectional Branching
• [Diagram: λ collects the low confidence state-transition pairs from the one-best sequence (e.g., t′12 at S2, t′22 at S3, t′32 at S4); each branch reuses the states already built and is continued to a complete tree T.]
• Carries on parsing states from the one-best sequence.
• Guaranteed to generate fewer trees than beam search when |λ| ≤ b.
48. Low Confidence Transition
• Let C^1 be a classifier that uses a feature map \Phi(x, y) and a weight vector w to measure a score for each label y \in Y, choosing the label with the highest score given the parsing state x (when there is a tie, the first label is chosen). This can be expressed as logistic regression:

  C^1(x) = \arg\max_{y \in Y} f(x, y)

  f(x, y) = \frac{\exp(w \cdot \Phi(x, y))}{\sum_{y' \in Y} \exp(w \cdot \Phi(x, y'))}

• To find low confidence predictions, we use the margins (score differences) between the best prediction and the other predictions. If all margins are greater than a threshold, the best prediction is considered highly confident; otherwise, it is not. Given this analogy, the k-best predictions can be found as follows (m \ge 0 is a margin threshold):

  C^k(x, m) = K\arg\max_{y \in Y} f(x, y) \quad \text{s.t.} \quad f(x, C^1(x)) - f(x, y) \le m

• 'K arg max' returns a set of k' labels whose margins to C^1(x) are at most m, where k' \le k. When m = 0, it returns only the highest scoring labels, including C^1(x); when m = 1, it returns all labels. C^1(x) is low confident if |C^k(x, m)| > 1.
• While the greedy parser generates the one-best sequence T_1 = [s_{11}, ..., s_{1t}], the pairs (s_{1j}, p_{2j}), ..., (s_{1j}, p_{kj}) are collected for each low confidence prediction p_{1j}.
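A small, self-contained sketch of C^k(x, m): softmax the raw transition scores into f(x, y), then keep up to k labels within margin m of the best. The transition names and scores below are made up for illustration.

```python
import math

def kbest_with_margin(scores, k=2, m=0.05):
    """C^k(x, m): softmax raw scores into f(x, y), then return up to k
    labels whose margin to the best label is at most m."""
    z = sum(math.exp(v) for v in scores.values())
    f = {y: math.exp(v) / z for y, v in scores.items()}
    best = max(f.values())
    within = sorted((y for y in f if best - f[y] <= m),
                    key=lambda y: f[y], reverse=True)
    return within[:k]

# made-up scores: C^1(x) is low confident here because |C^k(x, m)| > 1
print(kbest_with_margin({"SHIFT": 1.2, "LEFT-ARC": 1.1, "RIGHT-ARC": -0.3}))
# -> ['SHIFT', 'LEFT-ARC']
```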
49. Experiments
• Parsing algorithm (Choi & McCallum, 2013)
  - Hybrid between Nivre's arc-eager and list-based algorithms.
  - Projective parsing: O(n).
  - Non-projective parsing: expected linear time.
• Features (see the sketch below)
  - Rich non-local features from Zhang & Nivre, 2011.
  - For languages with morphological features, morphologies of σ[0] and β[0] are used as unigram features.
  - For languages with coarse-grained POS tags, feature templates using fine-grained POS tags are replicated.
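A rough sketch of the two language-specific extensions above; the token fields (`.fpos`, `.cpos`, `.morph`) and the template strings are hypothetical stand-ins, not ClearNLP's actual feature API.

```python
def language_specific_features(s0, b0, has_morph, has_coarse_pos):
    """Sketch of the feature extensions above. s0/b0 are hypothetical token
    objects for sigma[0] and beta[0] with .fpos, .cpos, and .morph fields."""
    feats = []
    for name, tok in (("s0", s0), ("b0", b0)):
        feats.append(f"{name}:fpos={tok.fpos}")        # base POS template
        if has_coarse_pos:
            feats.append(f"{name}:cpos={tok.cpos}")    # replicated template
        if has_morph:
            feats += [f"{name}:morph={m}" for m in tok.morph]  # unigrams
    return feats
```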
50. Number of Transitions
• Using an Intel Xeon 2.57GHz machine, it takes less than 40 minutes to train on the entire Penn Treebank, including times for IO, feature extraction, and bootstrapping.
• [Figure 5: the total number of transitions performed during decoding with respect to beam sizes (beam size = 1, 2, 4, 8, 16, 32, 64, 80).]
54. SPMRL 2013 Shared Task
• Baseline results provided by ClearNLP.

                    5K                    Full
  Language     LAS    UAS    LS      LAS    UAS    LS
  Arabic       81.72  84.46  93.41   84.19  86.48  94.43
  Basque       78.01  84.62  82.71   79.16  85.32  83.63
  French       73.39  85.30  81.42   74.51  86.41  82.00
  German       82.58  85.36  90.49   86.73  88.80  92.95
  Hebrew       75.09  81.74  82.84   -      -      -
  Hungarian    81.98  86.09  88.26   82.68  86.56  88.80
  Korean       76.28  80.39  87.32   83.55  86.82  92.39
  Polish       80.64  88.49  86.47   81.12  89.24  86.59
  Swedish      80.96  86.48  85.10   -      -      -
55. Conclusion
• Selectional branching
  - Uses confidence estimates to decide when to employ a beam.
  - Shows accuracy comparable to traditional beam search.
  - Runs faster than any other non-greedy parsing approach.
• ClearNLP
  - Provides several NLP tools including a morphological analyzer, a dependency parser, a semantic role labeler, etc.
  - Webpage: clearnlp.com.