This document summarizes research on statistical dependency parsing for Korean. It discusses why dependency parsing suits Korean, given the language's flexible word order and rich morphology, and describes several Korean treebanks available for training parsers, including the large Sejong Treebank. The researchers converted the Sejong Treebank to dependency structures and built a statistical parser for Korean using a transition-based algorithm and machine learning. Because each Korean token contains multiple morphemes, feature selection was especially important.
Statistical Dependency Parsing in Korean: From Corpus Generation To Automatic Parsing
1. Statistical Dependency Parsing in Korean:
From Corpus Generation To Automatic Parsing
Workshop on Statistical Parsing of Morphologically-Rich Languages
12th International Conference on Parsing Technologies
Jinho D. Choi & Martha Palmer
University of Colorado at Boulder
October 6th, 2011
choijd@colorado.edu
Thursday, October 6, 2011
2. Dependency Parsing in Korean
• Why dependency parsing in Korean?
- Korean is a flexible word order language.
[Figure: constituent trees for the SOV construction "She still him loved" and its scrambled variant "Him she still *T* loved" — the dependency relations (SBJ, ADV, OBJ) stay the same under either word order.]
3. Dependency Parsing in Korean
• Why dependency parsing in Korean?
- Korean is a flexible word order language.
- Rich morphology makes dependency parsing easier.
[Figure: "She still him (loved)" — the auxiliary particle on "she" and the object case marker on "him" identify the SBJ and OBJ relations directly.]
4. Dependency Parsing in Korean
• Statistical dependency parsing in Korean
- Sufficiently large training data is required.
• Not much training data available for Korean dependency parsing.
• Constituent Treebanks in Korean
- Penn Korean Treebank: 15K sentences.
- KAIST Treebank: 30K sentences.
- Sejong Treebank: 60K sentences.
• The most recent and largest Treebank in Korean.
• Containing Penn Treebank style constituent trees.
5. Sejong Treebank
• Phrase structure
- Including phrase tags, POS tags, and function tags.
- Each token can be broken into several morphemes.
[Figure: Sejong Treebank constituent tree for "She still loved him" — each token is annotated as a sequence of morphemes, e.g. a pronoun/NP + /JX, an adverb/MAG, a pronoun/NP + /JKO, and a noun/NNG + /XSV + /EP + /EF. Tokens are mostly separated by white spaces.]
6. Sejong Treebank
• Phrase-level tags and function tags
- Function tags are also used to determine dependency relations during the conversion.

Phrase-level tags            Function tags
S    Sentence                SBJ  Subject
Q    Quotative clause        OBJ  Object
NP   Noun phrase             CMP  Complement
VP   Verb phrase             MOD  Noun modifier
VNP  Copula phrase           AJT  Predicate modifier
AP   Adverb phrase           CNJ  Conjunctive
DP   Adnoun phrase           INT  Vocative
IP   Interjection phrase     PRN  Parenthetical

Table 2: Phrase-level tags (left) and function tags (right) in the Sejong Treebank.

POS tags
NNG  General noun          MM   Adnoun               EP   Prefinal EM
NNP  Proper noun           MAG  General adverb       EF   Final EM
NNB  Bound noun            MAJ  Conjunctive adverb   EC   Conjunctive EM
NP   Pronoun               IC   Interjection         ETN  Nominalizing EM
NR   Numeral               JKS  Subjective CP        ETM  Adnominalizing EM
VV   Verb                  JKC  Complemental CP      XPN  Noun prefix
VA   Adjective             JKG  Adnomial CP          XSN  Noun DS
VX   Auxiliary predicate   JKO  Objective CP         XSV  Verb DS
VCP  Copula                JKB  Adverbial CP         XSA  Adjective DS
VCN  Negation adjective    JKV  Vocative CP          XR   Base morpheme
NF   Noun-like word        JKQ  Quotative CP         SF/SP/SS/SE/SO  Punctuation
NV   Predicate-like word   JX   Auxiliary PR         SL   Foreign word
NA   Unknown word          JC   Conjunctive PR       SH   Chinese word
                                                     SN   Number

Table 1: POS tags in the Sejong Treebank (CP: case particle, PR: particle, EM: ending marker, DS: derivational suffix, SF SP SS SE SO: different types of punctuation).
7. Dependency Conversion
• Conversion steps
- Find the head of each phrase using head-percolation rules.
• All other nodes in the phrase become dependents of the head.
- Re-direct dependencies for empty categories.
• Empty categories are not annotated in the Sejong Treebank.
• Skipping this step generates only projective dependency trees.
- Label (automatically generated) dependencies.
• Special cases
- Coordination, nested function tags.
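The head-finding step above can be sketched as a recursive procedure. This is a minimal illustration, not the authors' converter: the tree encoding ((tag, children) pairs with (tag, word) terminals) and the head-finder passed in are assumptions made for the example.

```python
# Minimal sketch of the projective conversion step (empty categories
# skipped): find each phrase's head child, then attach the lexical heads
# of all other children to it. Tree encoding is an assumption.

def to_deps(node, find_head, arcs):
    """node: (tag, children) or (pos_tag, word). Fills arcs {dep: head}
    and returns the lexical head word of the node."""
    tag, content = node
    if isinstance(content, str):              # terminal: (pos_tag, word)
        return content
    child_tags = [child[0] for child in content]
    h = find_head(tag, child_tags)            # index of the head child
    heads = [to_deps(child, find_head, arcs) for child in content]
    for i, word in enumerate(heads):
        if i != h:
            arcs[word] = heads[h]             # other children -> head
    return heads[h]
```

With a trivial head-final rule (always pick the rightmost child), converting the slide-2 tree attaches "She", "still", and "him" to "loved".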
8. Dependency Conversion
• Head-percolation rules
- Achieved by analyzing each phrase in the Sejong Treebank.
- Korean is a head-final language.

S      r  VP;VNP;S;NP|AP;Q;*
Q      l  S|VP|VNP|NP;Q;*
NP     r  NP;S;VP;VNP;AP;*
VP     r  VP;VNP;NP;S;IP;*
VNP    r  VNP;NP;S;*
AP     r  AP;VP;NP;S;*
DP     r  DP;VP;*
IP     r  IP;VNP;*
X|L|R  r  *

Table 3: Head-percolation rules for the Sejong Treebank. l/r implies looking for the leftmost/rightmost constituent. * implies any phrase-level tag. | implies a logical OR and ; is a delimiter between tags.

- No rules to find the head morpheme of each token.
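Applying Table 3 is mechanical: walk the priority list and scan the children from the right (or from the left, for Q). The sketch below transcribes the rules from this slide; the encoding and function shape are assumptions for illustration.

```python
# Sketch of applying Table 3 head-percolation rules to one phrase.
# Rules transcribed from this slide; the encoding is an assumption.

HEADRULES = {
    "S":   ("r", ["VP", "VNP", "S", "NP|AP", "Q", "*"]),
    "Q":   ("l", ["S|VP|VNP|NP", "Q", "*"]),
    "NP":  ("r", ["NP", "S", "VP", "VNP", "AP", "*"]),
    "VP":  ("r", ["VP", "VNP", "NP", "S", "IP", "*"]),
    "VNP": ("r", ["VNP", "NP", "S", "*"]),
    "AP":  ("r", ["AP", "VP", "NP", "S", "*"]),
    "DP":  ("r", ["DP", "VP", "*"]),
    "IP":  ("r", ["IP", "VNP", "*"]),
}

def find_head(phrase_tag, child_tags):
    """Return the index of the head child among child_tags."""
    direction, priorities = HEADRULES.get(phrase_tag, ("r", ["*"]))
    order = list(range(len(child_tags)))
    if direction == "r":                       # rightmost-first search
        order.reverse()
    for tagset in priorities:                  # tags in priority order
        for i in order:
            tag = child_tags[i].split("-")[0]  # drop function tags
            if tagset == "*" or tag in tagset.split("|"):
                return i
    return order[0]
```

All other children of the phrase then become dependents of the child at the returned index.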
9. Dependency Conversion
• Dependency labels
- Labels retained from the function tags (e.g., SBJ and OBJ).
- Labels inferred from constituent relations.

input : (c, p), where c is a dependent of p.
output: a dependency label l for the relation c ← p.
begin
    if   p = root           then ROOT → l
    elif c.pos = AP         then ADV  → l
    elif p.pos = AP         then AMOD → l
    elif p.pos = DP         then DMOD → l
    elif p.pos = NP         then NMOD → l
    elif p.pos = VP|VNP|IP  then VMOD → l
    else                         DEP  → l
end
Algorithm 1: Getting inferred labels.

[Figure: dependency tree for "She still loved him" converted from the constituent tree on slide 5, using the function tags for the SBJ and OBJ labels.]
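Algorithm 1 translates directly into code. A sketch, assuming each node exposes its phrase tag as a string and the artificial root is tagged "root":

```python
# Direct transcription of Algorithm 1 (inferred dependency labels).

def inferred_label(c_pos, p_pos):
    """Label for dependent c (tag c_pos) attached to head p (tag p_pos)."""
    if p_pos == "root":
        return "ROOT"
    if c_pos == "AP":
        return "ADV"
    if p_pos == "AP":
        return "AMOD"
    if p_pos == "DP":
        return "DMOD"
    if p_pos == "NP":
        return "NMOD"
    if p_pos in ("VP", "VNP", "IP"):
        return "VMOD"
    return "DEP"
```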
10. Dependency Conversion
• Coordination
- Previous conjuncts as dependents of the following conjuncts.
• Nested function tag
- Nodes with nested f-tags become the heads of the phrases.
[Figure: constituent tree for "I_and he_and she home left" and its converted dependencies — each NP-CNJ conjunct depends on the following conjunct (CNJ), the node carrying the nested NP-SBJ tag heads the coordinated phrase (SBJ), and "home" attaches as OBJ.]
11. Dependency Parsing
• Dependency parsing algorithm
- Transition-based, non-projective parsing algorithm.
• Choi & Palmer, 2011.
- Performs transitions from both projective and non-projective
dependency parsing algorithms selectively.
• Linear time parsing speed in practice for non-projective trees.
• Machine learning algorithm
- Liblinear L2-regularized L1-loss support vector machine (c = 0.1, e = 0.1, B = 0).
Jinho D. Choi & Martha Palmer. 2011. Getting the Most out of
Transition-based Dependency Parsing. In Proceedings of ACL:HLT’11
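For intuition about transition-based parsing in general (this is NOT the Choi & Palmer 2011 algorithm, which selectively mixes projective and non-projective transitions), here is the classic arc-standard oracle run on a gold tree: it recovers the arcs with a linear number of SHIFT/LEFT-ARC/RIGHT-ARC transitions.

```python
# Illustrative arc-standard oracle: given gold heads, derive the
# transition sequence and the arcs it rebuilds. A simplification for
# intuition only; assumes a projective gold tree.

def oracle_parse(gold_heads):
    """gold_heads: {token_id: head_id}, 0 = artificial root.
    Returns (transitions, arcs)."""
    n = len(gold_heads)
    stack, buffer, arcs, trans = [0], list(range(1, n + 1)), {}, []
    deps = {}                                   # head -> gold dependents
    for d, h in gold_heads.items():
        deps.setdefault(h, set()).add(d)
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            s1, s0 = stack[-2], stack[-1]
            if s1 != 0 and gold_heads[s1] == s0:
                arcs[s1] = s0                   # LEFT-ARC: s1 <- s0
                stack.pop(-2)
                trans.append("LEFT-ARC")
                continue
            if gold_heads.get(s0) == s1 and deps.get(s0, set()) <= set(arcs):
                arcs[s0] = s1                   # RIGHT-ARC: s0 <- s1
                stack.pop()
                trans.append("RIGHT-ARC")
                continue
        stack.append(buffer.pop(0))             # SHIFT
        trans.append("SHIFT")
    return trans, arcs
```

For "She still loved him" (she→loved, still→loved, him→loved, loved→root), the oracle recovers exactly the gold arcs with one SHIFT per token.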
12. Dependency Parsing
• Feature selection
- Each token consists of multiple morphemes (up to 21).
- POS tag feature of each token?
• (NNG & XSV & EP & EF & SF) vs. (NNG | XSV | EP | EF | SF)
• Sparse information vs. lack of information.
Happy medium?
Nakrang + Princess + JX    → /NNP + /NNG + /JX
Hodong + Prince + JKO      → /NNP + /NNG + /JKO
Love + XSV + EP + EF + .   → /NNG + /XSV + /EP + /EF + ./SF
13. Dependency Parsing
• Morpheme selection
- Many morphemes are not helpful for parsing; as a compromise, we select only certain types of morphemes and use these as features.

FS  The first morpheme
LS  The last morpheme before JK|DS|EM
JK  Particles (J* in Table 1)
DS  Derivational suffixes (XS* in Table 1)
EM  Ending markers (E* in Table 1)
PY  The last punctuation, only if there is no other morpheme followed by the punctuation

Table 6: Types of morphemes in each token used to extract features for our parsing models.

Nakrang + Princess + JX    (/NNP + /NNG + /JX)            → /NNP /NNG /JX
Hodong + Prince + JKO      (/NNP + /NNG + /JKO)           → /NNP /NNG /JKO
Love + XSV + EP + EF + .   (/NNG + /XSV + /EP + /EF + ./SF) → /NNG /XSV /EF /SF

Figure 6: Morphemes extracted from the tokens in Figure 5 with respect to the types in Table 6.

- From our experiments, features extracted from the JK and EM morphemes are found to be the most useful.
- For n-grams where n > 1, it is not obvious which combinations of these morphemes across different tokens should be used.
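The Table 6 selection can be sketched as follows, assuming each token is a list of (form, POS) morpheme pairs; the tag prefixes (J*, XS*, E*) follow Table 1, and the token encoding is an assumption for illustration.

```python
# Sketch of the Table 6 morpheme selection for one token, where a token
# is a list of (form, pos) morpheme pairs (an assumed encoding).

PUNCT = {"SF", "SP", "SS", "SE", "SO"}

def select_morphemes(token):
    """Return the selected feature morphemes as {type: pos_tag}."""
    out = {"FS": token[0][1]}                      # FS: first morpheme
    for form, pos in token:
        if pos.startswith("J"):
            out["JK"] = pos                        # JK: particles (J*)
        elif pos.startswith("XS"):
            out["DS"] = pos                        # DS: deriv. suffixes (XS*)
        elif pos.startswith("E"):
            out["EM"] = pos                        # EM: ending markers (E*)
    for form, pos in token:                        # LS: last morpheme
        if pos.startswith(("J", "XS", "E")):       #     before JK|DS|EM
            break
        out["LS"] = pos
    if token[-1][1] in PUNCT:                      # PY: final punctuation
        out["PY"] = token[-1][1]
    return out
```

On the Figure 5 tokens, this reproduces the Figure 6 extractions, e.g. {NNP, NNG, JX} for the first token and {NNG, XSV, EF, SF} for the last.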
14. Dependency Parsing
• Feature extraction
- Extract features using only important morphemes.
• Individual POS tag features of the 1st and 3rd tokens:
- NNP1, NNG1, JK1, NNG3, XSV3, EF3
• Joined features of POS tags between the 1st and 3rd tokens:
- NNP1_NNG3, NNP1_XSV3, NNP1_EF3, JK1_NNG3, JK1_XSV3
- Tokens used: wi, wj, wi±1, wj±1

Nakrang + Princess + JX    (/NNP + /NNG + /JX)  → /NNP /NNG /JX
Hodong + Prince + JKO      (/NNP + /NNG + /JKO) → /NNP /NNG /JKO
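Given the selected morphemes per token, the individual and joined POS features above can be generated like this (a sketch; the dict-of-selected-tags input matches the morpheme selection on the previous slide).

```python
# Sketch: individual and joined POS-tag features for a token pair
# (w_i, w_j), given each token's selected morpheme POS tags.

def pos_features(ti, tj, i, j):
    """ti, tj: {morpheme_type: pos_tag}; i, j: token positions."""
    feats = [f"{pos}{i}" for pos in ti.values()]      # individual, w_i
    feats += [f"{pos}{j}" for pos in tj.values()]     # individual, w_j
    for pi in ti.values():                            # joined pairs
        for pj in tj.values():
            feats.append(f"{pi}{i}_{pj}{j}")
    return feats
```

Joining tags keeps the features informative without exploding into the fully-concatenated (and very sparse) per-token tag sequences.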
15. Experiments
• Corpora
- Dependency trees converted from the Sejong Treebank.
- Consists of 20 sources in 6 genres:
• Newspaper (NP), Magazine (MZ), Fiction (FI), Memoir (ME), Informative Book (IB), and Educational Cartoon (EC).
- For development and evaluation, we pick one newspaper (about art), one fiction text, and one informative book (about trans-nationalism), using the first half of each for development and the second half for evaluation.
- Evaluation sets are very diverse compared to training sets.
• Ensures the robustness of our parsing models, which is important because we hope to use this model to parse various texts on the web.

      NP     MZ     FI     ME     IB     EC
T  8,060  6,713 15,646  5,053  7,983  1,548
D  2,048      -  2,174      -  1,307      -
E  2,048      -  2,175      -  1,308      -

Table 10: Number of sentences in training (T), development (D), and evaluation (E) sets for each genre.
16. Experiments
• Morphological analysis
- Two automatic morphological analyzers are used.
• Intelligent Morphological Analyzer
- Developed by the Sejong project.
- Provides the same morphological analysis as their Treebank.
• Considered as fine-grained morphological analysis.
• Mach (Shim and Yang, 2002)
- Analyzes 1.3M words per second.
- Provides more coarse-grained morphological analysis.
Kwangseob Shim & Jaehyung Yang. 2002. A Supersonic Korean
Morphological Analyzer. In Proceedings of COLING’02
17. Experiments
• Evaluations
- Gold-standard vs. automatic morphological analysis.
• Relatively low performance from the automatic system.
- Fine vs. coarse-grained morphological analysis.
• Differences are not too significant.
- Robustness across different genres.
      Gold, fine-grained      Auto, fine-grained      Auto, coarse-grained
      LAS    UAS    LS        LAS    UAS    LS        LAS    UAS    LS
NP   82.58  84.32  94.05    79.61  82.35  91.49    79.00  81.68  91.50
FI   84.78  87.04  93.70    81.54  85.04  90.95    80.11  83.96  90.24
IB   84.21  85.50  95.82    80.45  82.14  92.73    81.43  83.38  93.89
Avg. 83.74  85.47  94.57    80.43  83.01  91.77    80.14  82.89  91.99

Table 11: Parsing accuracies achieved by the three models (in %). LAS - labeled attachment score, UAS - unlabeled attachment score, LS - label accuracy score.
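The three scores in Table 11 are straightforward to compute from per-token (head, label) pairs; a minimal sketch:

```python
# Sketch of LAS / UAS / LS given gold and predicted (head, label)
# pairs, one per token.

def attachment_scores(gold, pred):
    """gold, pred: lists of (head, label). Returns (LAS, UAS, LS) in %."""
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred))   # head correct
    ls = sum(g[1] == p[1] for g, p in zip(gold, pred))    # label correct
    las = sum(g == p for g, p in zip(gold, pred))         # both correct
    return tuple(round(100.0 * s / n, 2) for s in (las, uas, ls))
```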
18. Conclusion
• Contributions
- Generating a Korean Dependency Treebank.
- Selecting important morphemes for dependency parsing.
- Evaluating the impact of fine vs. coarse-grained
morphological analysis on dependency parsing.
- Evaluating the robustness across different genres.
• Future work
- Increase the feature span beyond bigrams.
- Find head morphemes of individual tokens.
- Insert empty categories.
19. Acknowledgements
• Special thanks are due to
- Professor Kong Joo Lee of Chungnam National University.
- Professor Kwangseob Shim of Shungshin Women’s University.
We gratefully acknowledge the support of the National Science Foundation Grant CISE-IIS-RI-0910992, Richer Representations for Machine Translation. Any opinions,
findings, and conclusions or recommendations expressed
in this material are those of the authors and do not
necessarily reflect the views of the National Science
Foundation.