Statistical Dependency Parsing in Korean:
              From Corpus Generation To Automatic Parsing
       Workshop on Statistical Parsing of Morphologically-Rich Languages
           12th International Conference on Parsing Technologies


                            Jinho D. Choi & Martha Palmer

                            University of Colorado at Boulder
                                   October 6th, 2011
                                  choijd@colorado.edu



Thursday, October 6, 2011
Dependency Parsing in Korean
             •       Why dependency parsing in Korean?
                   -        Korean is a flexible word order language.
                                                                             S
                            SOV construction
                               S                                 NP-OBJ-1           S

               NP-SBJ                   VP                                 NP-SBJ               VP

                               AP                  VP                               AP                    VP

                                        NP-OBJ            VP                                   NP-OBJ           VP


                  She          still         him         loved       Him    she     still           *T*        loved
                                                   OBJ                                              ADV
                                             ADV                                              SBJ
                                       SBJ                                              OBJ




                                                                 2
Dependency Parsing in Korean
             •       Why dependency parsing in Korean?
                   -        Korean is a flexible word order language.

                    -        Rich morphology makes dependency parsing easier.


                            	  	                                              	  	 
       She + Aux. particle                            loved
                                                                      He + Obj. case marker

                                                SBJ   ADV     OBJ




                                         She          still         him




                                                      3
Dependency Parsing in Korean
             •       Statistical dependency parsing in Korean
                   -        Sufficiently large training data is required.
                            •   Not much training data available for Korean dependency parsing.


             •       Constituent Treebanks in Korean
                   -        Penn Korean Treebank: 15K sentences.

                   -        KAIST Treebank: 30K sentences.

                   -        Sejong Treebank: 60K sentences.
                            •   The most recent and largest Treebank in Korean.

                             •   Contains Penn Treebank-style constituent trees.



                                                          4
Sejong Treebank
             •       Phrase structure
                   -        Including phrase tags, POS tags, and function tags.

                   -        Each token can be broken into several morphemes.
                              S
             NP-SBJ                    VP
                               AP             VP
                                       NP-OBJ       VP

                She            still    him        loved

                She   : …/NP + …/JX
                still : …/MAG
                him   : …/NP + …/JKO
                loved : …/NNG + …/XSV + …/EP + …/EF

                                                                   Tokens are mostly separated
                                                                       by white spaces.


                                                          5
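Each Sejong token annotation concatenates morpheme/tag pairs with `+`. A minimal sketch of splitting such an annotation into morphemes; the romanized surfaces are illustrative placeholders, not actual Sejong data:

```python
def split_morphemes(annotation):
    """Split a Sejong-style annotation 'surface/TAG+surface/TAG+...'
    into (surface, tag) pairs for one white-space token."""
    pairs = []
    for chunk in annotation.split("+"):
        surface, tag = chunk.rsplit("/", 1)  # rsplit guards '/' inside the surface
        pairs.append((surface, tag))
    return pairs

# e.g. 'loved' = noun + verb-derivational suffix + prefinal and final ending markers
print(split_morphemes("salang/NNG+ha/XSV+ass/EP+da/EF"))
# -> [('salang', 'NNG'), ('ha', 'XSV'), ('ass', 'EP'), ('da', 'EF')]
```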
Sejong Treebank
             •       Tagsets
                   -        Phrase-level tags and function tags (Table 2); POS tags (Table 1).

                   -        These tags are also used to determine dependency relations
                            during the conversion.

                          Phrase-level tags               Function tags
                          S    Sentence                   SBJ  Subject
                          Q    Quotative clause           OBJ  Object
                          NP   Noun phrase                CMP  Complement
                          VP   Verb phrase                MOD  Noun modifier
                          VNP  Copula phrase              AJT  Predicate modifier
                          AP   Adverb phrase              CNJ  Conjunctive
                          DP   Adnoun phrase              INT  Vocative
                          IP   Interjection phrase        PRN  Parenthetical

           Table 2: Phrase-level tags (left) and function tags (right) in the Sejong Treebank.

             NNG  General noun         MM   Adnoun              EP   Prefinal EM        JX  Auxiliary PR
             NNP  Proper noun          MAG  General adverb      EF   Final EM           JC  Conjunctive PR
             NNB  Bound noun           MAJ  Conjunctive adverb  EC   Conjunctive EM     IC  Interjection
             NP   Pronoun              JKS  Subjective CP       ETN  Nominalizing EM    SN  Number
             NR   Numeral              JKC  Complemental CP     ETM  Adnominalizing EM  SL  Foreign word
             VV   Verb                 JKG  Adnomial CP         XPN  Noun prefix        SH  Chinese word
             VA   Adjective            JKO  Objective CP        XSN  Noun DS            NF  Noun-like word
             VX   Auxiliary predicate  JKB  Adverbial CP        XSV  Verb DS            NV  Predicate-like word
             VCP  Copula               JKV  Vocative CP         XSA  Adjective DS       NA  Unknown word
             VCN  Negation adjective   JKQ  Quotative CP        XR   Base morpheme      SF, SP, SS, SE, SO, SW

           Table 1: POS tags in the Sejong Treebank (PM: predicate marker, CP: case particle,
           EM: ending marker, DS: derivational suffix, PR: particle, SF SP SS SE SO: different
           types of punctuation).

                                                          6
Dependency Conversion
             •       Conversion steps
                   -        Find the head of each phrase using head-percolation rules.
                            •   All other nodes in the phrase become dependents of the head.

                   -        Re-direct dependencies for empty categories.
                            •   Empty categories are not annotated in the Sejong Treebank.

                            •   Skipping this step generates only projective dependency trees.

                   -        Label (automatically generated) dependencies.


             •       Special cases
                   -        Coordination, nested function tags.


                                                          7
Dependency Conversion
             •       Head-percolation rules
                   -        Achieved by analyzing each phrase in the Sejong Treebank.

                   -        Korean is a head-final language.

                          S      r  VP;VNP;S;NP|AP;Q;*
                          Q      l  S|VP|VNP|NP;Q;*
                          NP     r  NP;S;VP;VNP;AP;*
                          VP     r  VP;VNP;NP;S;IP;*
                          VNP    r  VNP;NP;S;*
                          AP     r  AP;VP;NP;S;*
                          DP     r  DP;VP;*
                          IP     r  IP;VNP;*
                          X|L|R  r  *

           Table 3: Head-percolation rules for the Sejong Treebank. l/r implies looking for the
           leftmost/rightmost constituent. * implies any phrase-level tag. | implies a logical OR
           and ; is a delimiter between tags.

                   -        No rules to find the head morpheme of each token.

                                                          8
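The head-percolation rules can be applied mechanically to pick the head child of each phrase. A minimal sketch; the rule table is transcribed from the slide, while `find_head` itself is an illustrative helper, not the authors' code:

```python
# Head-percolation rules: (direction, ';'-separated alternatives).
# 'r' = take the rightmost matching child, 'l' = the leftmost.
HEADRULES = {
    "S":   ("r", ["VP", "VNP", "S", "NP|AP", "Q", "*"]),
    "Q":   ("l", ["S|VP|VNP|NP", "Q", "*"]),
    "NP":  ("r", ["NP", "S", "VP", "VNP", "AP", "*"]),
    "VP":  ("r", ["VP", "VNP", "NP", "S", "IP", "*"]),
    "VNP": ("r", ["VNP", "NP", "S", "*"]),
    "AP":  ("r", ["AP", "VP", "NP", "S", "*"]),
    "DP":  ("r", ["DP", "VP", "*"]),
    "IP":  ("r", ["IP", "VNP", "*"]),
}

def find_head(phrase_tag, children):
    """Return the index of the head child of a phrase.

    children: phrase-level tags of the children (function tags stripped).
    Alternatives are tried in order; '|' joins interchangeable tags and
    '*' matches anything. Unlisted tags (X, L, R) fall back to 'r *'.
    """
    direction, alternatives = HEADRULES.get(phrase_tag, ("r", ["*"]))
    order = range(len(children) - 1, -1, -1) if direction == "r" else range(len(children))
    for alt in alternatives:
        tags = alt.split("|")
        for i in order:
            if alt == "*" or children[i] in tags:
                return i
    return len(children) - 1  # unreachable given the '*' fallback

# 'She still him loved': S -> [NP(-SBJ), VP]; the VP child is the head
print(find_head("S", ["NP", "VP"]))  # -> 1
```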
Dependency Conversion
             •       Dependency labels
                   -        Labels retained from the function tags (e.g., SBJ and OBJ).

                   -        Labels inferred from constituent relations.

                          input : (c, p), where c is a dependent of p.
                          output: A dependency label l for the arc c ← p.
                          begin
                              if   p = root          then ROOT → l
                              elif c.pos = AP        then ADV  → l
                              elif p.pos = AP        then AMOD → l
                              elif p.pos = DP        then DMOD → l
                              elif p.pos = NP        then NMOD → l
                              elif p.pos = VP|VNP|IP then VMOD → l
                              else DEP → l
                          end
                          Algorithm 1: Getting inferred labels.

                                                          9
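Algorithm 1 transcribes almost directly into code. A sketch; the function and argument names are illustrative, not the authors':

```python
def inferred_label(c_pos, p_pos, p_is_root=False):
    """Infer a dependency label for dependent c attached to head p,
    following Algorithm 1 (used when no function tag is retained)."""
    if p_is_root:
        return "ROOT"
    if c_pos == "AP":
        return "ADV"
    if p_pos == "AP":
        return "AMOD"
    if p_pos == "DP":
        return "DMOD"
    if p_pos == "NP":
        return "NMOD"
    if p_pos in ("VP", "VNP", "IP"):
        return "VMOD"
    return "DEP"

print(inferred_label("AP", "VP"))  # adverb-phrase dependent -> ADV
print(inferred_label("NP", "VP"))  # dependent of a verb phrase -> VMOD
```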
Dependency Conversion
             •       Coordination
                   -        Previous conjuncts as dependents of the following conjuncts.

             •       Nested function tag
                   -        Nodes with nested f-tags become the heads of the phrases.
                                                                   S

                                             NP-SBJ                            VP

                                     NP-CNJ       NP-SBJ               NP-OBJ        VP

                                              NP-CNJ     NP-SBJ


                                     I_and    he_and         she        home         left
                                         CNJ           CNJ                     OBJ
                                                                         SBJ


                                                              10
Dependency Parsing
             •       Dependency parsing algorithm
                   -        Transition-based, non-projective parsing algorithm.
                            •   Choi & Palmer, 2011.

                   -        Performs transitions from both projective and non-projective
                            dependency parsing algorithms selectively.
                            •   Linear time parsing speed in practice for non-projective trees.


             •       Machine learning algorithm
                    -        Liblinear L2-regularized L1-loss support vector classification
                             (c = 0.1, e = 0.1, B = 0; Hsieh et al., 2008).

                                   Jinho D. Choi & Martha Palmer. 2011. Getting the Most out of
                                Transition-based Dependency Parsing. In Proceedings of ACL:HLT’11


                                                            11
Dependency Parsing
             •       Feature selection
                   -        Each token consists of multiple morphemes (up to 21).

                   -        POS tag feature of each token?
                            •   (NNG & XSV & EP & EF & SF) vs. (NNG | XSV | EP | EF | SF)

                            •   Sparse information vs. lack of information.
                                                                                 Happy medium?
                                  Nakrang_ :   …/NNP + …/NNG + …/JX
                                               (Nakrang + Princess + JX)

                                  Hodong_  :   …/NNP + …/NNG + …/JKO
                                               (Hodong + Prince + JKO)

                                               …/NNG + …/XSV + …/EP + …/EF + ./SF
                                               (Love + XSV + EP + EF + .)


                                                           12
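The two extremes above (one joined tag versus individual morpheme tags) can be made concrete. A sketch; the helper name is illustrative:

```python
def pos_features(tags, joined=True):
    """Two extremes for a token's POS features: one joined tag
    (specific but sparse) vs. individual morpheme tags (dense but
    shared across many different tokens)."""
    if joined:
        return ["+".join(tags)]   # e.g. 'NNG+XSV+EP+EF+SF'
    return list(tags)             # e.g. ['NNG', 'XSV', ...]

tags = ["NNG", "XSV", "EP", "EF", "SF"]
print(pos_features(tags))                 # -> ['NNG+XSV+EP+EF+SF']
print(pos_features(tags, joined=False))   # -> ['NNG', 'XSV', 'EP', 'EF', 'SF']
```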
Dependency Parsing
             •       Morpheme selection
                   -        Many morphemes are not helpful for parsing; as a compromise, we
                            select only certain types of morphemes and use them as features.

                          FS   The first morpheme
                          LS   The last morpheme before JK|DS|EM
                          JK   Particles (J* in Table 1)
                          DS   Derivational suffixes (XS* in Table 1)
                          EM   Ending markers (E* in Table 1)
                          PY   The last punctuation, only if there is no other
                               morpheme followed by the punctuation

           Table 6: Types of morphemes in each token used to extract features for our
           parsing models.

                          Nakrang + Princess + JX     →   NNP, NNG, JX
                          Hodong + Prince + JKO       →   NNP, NNG, JKO
                          Love + XSV + EP + EF + .    →   NNG, XSV, EF, SF

           Figure 6: Morphemes extracted from the tokens in Figure 5 with respect to the
           types in Table 6.

                   -        For unigrams, these morphemes can be used either individually (e.g.,
                            the POS tag of JK for the 1st token is JX) or jointly (e.g., a joined
                            feature of POS tags between LS and JK for the 1st token is NNG+JX).

                   -        From our experiments, features extracted from the JK and EM
                            morphemes are found to be the most useful.

                                                          13
Dependency Parsing
             •       Feature extraction
                   -        Extract features using only important morphemes.
                            •   Individual POS tag features of the 1st and 3rd tokens:
                                NNP1, NNG1, JK1, NNG3, XSV3, EF3

                            •   Joined features of POS tags between the 1st and 3rd tokens:
                                NNP1_NNG3, NNP1_XSV3, NNP1_EF3, JK1_NNG3, JK1_XSV3

                   -        Tokens used: wi, wj, wi±1, wj±1

                          Nakrang + Princess + JX      (NNP + NNG + JX)
                          Hodong + Prince + JKO        (NNP + NNG + JKO)
                          Love + XSV + EP + EF + .     (NNG + XSV + EP + EF + SF)

                                                          14
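The joined cross-token features amount to a Cartesian product over each token's selected morpheme tags. A sketch, with the positional subscripts (NNP1, NNG3, ...) dropped for brevity:

```python
def joined_features(tags_i, tags_j):
    """Joined POS-tag features between the selected morphemes of two
    tokens wi and wj (cf. NNP1_NNG3, JK1_XSV3 on the slide)."""
    return [f"{a}_{b}" for a in tags_i for b in tags_j]

# selected morphemes of the 1st token (NNP, JK) and 3rd token (NNG, XSV, EF)
print(joined_features(["NNP", "JK"], ["NNG", "XSV", "EF"]))
# -> ['NNP_NNG', 'NNP_XSV', 'NNP_EF', 'JK_NNG', 'JK_XSV', 'JK_EF']
```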
Experiments
             •       Corpora
                   -        Dependency trees converted from the Sejong Treebank.

                   -        Consists of 20 sources in 6 genres:
                            •   Newspaper (NP), Magazine (MZ), Fiction (FI), Memoir (ME),
                                Informative Book (IB), and Educational Cartoon (EC).

                   -        For the development and evaluation sets, we pick one newspaper
                            about art, one fiction text, and one informative book about
                            trans-nationalism, and use the first half of each for development
                            and the second half for evaluation.

                   -        Evaluation sets are very diverse compared to the training sets.
                            •   Ensures the robustness of our parsing models, which is
                                important because we hope to use them to parse various
                                texts on the web.

                                        NP      MZ       FI      ME      IB      EC
                                  T   8,060   6,713   15,646   5,053   7,983   1,548
                                  D   2,048     -      2,174     -     1,307     -
                                  E   2,048     -      2,175     -     1,308     -

                          Table 10: Number of sentences in training (T), development (D),
                          and evaluation (E) sets for each genre.

                                                          15
Experiments
             •       Morphological analysis
                   -        Two automatic morphological analyzers are used.

             •       Intelligent Morphological Analyzer
                   -        Developed by the Sejong project.

                   -        Provides the same morphological analysis as their Treebank.
                            •   Considered as fine-grained morphological analysis.

             •       Mach (Shim and Yang, 2002)
                   -        Analyzes 1.3M words per second.

                   -        Provides more coarse-grained morphological analysis.
                                   Kwangseob Shim & Jaehyung Yang. 2002. A Supersonic Korean
                                     Morphological Analyzer. In Proceedings of COLING’02

                                                           16
Experiments
             •       Evaluations
                   -        Gold-standard vs. automatic morphological analysis.
                            •    Relatively low performance from the automatic system.

                    -        Fine- vs. coarse-grained morphological analysis.
                            •    Differences are not too significant.

                   -        Robustness across different genres.
                                  Gold, fine-grained     Auto, fine-grained    Auto, coarse-grained
                                LAS UAS          LS   LAS UAS          LS   LAS UAS           LS
                NP              82.58 84.32 94.05     79.61 82.35 91.49     79.00 81.68 91.50
                 FI             84.78 87.04 93.70     81.54 85.04 90.95     80.11 83.96 90.24
                 IB             84.21 85.50 95.82     80.45 82.14 92.73     81.43 83.38 93.89
                Avg.            83.74 85.47 94.57     80.43 83.01 91.77     80.14 82.89 91.99

Table 11: Parsing accuracies achieved by three models (in %). LAS - labeled attachment score,
UAS - unlabeled attachment score, LS - label accuracy score.
                                                     17
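The three metrics in Table 11 can be computed as follows; a sketch assuming gold and predicted trees are given as per-token (head, label) pairs:

```python
def attachment_scores(gold, pred):
    """Compute LAS/UAS/LS in percent.

    gold, pred: lists of (head, label) pairs, one per token.
    UAS counts correct heads, LS correct labels, LAS both at once.
    """
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred))
    ls = sum(g[1] == p[1] for g, p in zip(gold, pred))
    las = sum(g == p for g, p in zip(gold, pred))
    return {"LAS": 100.0 * las / n, "UAS": 100.0 * uas / n, "LS": 100.0 * ls / n}

gold = [(4, "SBJ"), (4, "ADV"), (4, "OBJ"), (0, "ROOT")]
pred = [(4, "SBJ"), (4, "ADV"), (2, "OBJ"), (0, "ROOT")]  # one wrong head
print(attachment_scores(gold, pred))  # LAS 75.0, UAS 75.0, LS 100.0
```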
Conclusion
             •       Contributions
                   -        Generating a Korean Dependency Treebank.

                   -        Selecting important morphemes for dependency parsing.

                   -        Evaluating the impact of fine vs. coarse-grained
                            morphological analysis on dependency parsing.

                   -        Evaluating the robustness across different genres.

             •       Future work
                   -        Increase the feature span beyond bigrams.

                   -        Find head morphemes of individual tokens.

                   -        Insert empty categories.


                                                       18
Acknowledgements
             •       Special thanks are due to
                   -        Professor Kong Joo Lee of Chungnam National University.

                    -        Professor Kwangseob Shim of Sungshin Women’s University.


             •       We gratefully acknowledge the support of the National
                     Science Foundation Grants CISE-IIS-RI-0910992, Richer
                     Representations for Machine Translation. Any opinions,
                     findings, and conclusions or recommendations expressed
                     in this material are those of the authors and do not
                     necessarily reflect the views of the National Science
                     Foundation.



                                                    19

Statistical Dependency Parsing in Korean: From Corpus Generation To Automatic Parsing

  • 3. Dependency Parsing in Korean
       • Why dependency parsing in Korean?
         - Korean is a flexible word order language.
         - Rich morphology makes it easy for dependency parsing.
       (Example: "She still him loved"; the auxiliary particle on "She" and the
        objective case marker on "him" identify the SBJ and OBJ dependents of
        "loved" regardless of word order, while "still" attaches as ADV.)
  • 4. Dependency Parsing in Korean
       • Statistical dependency parsing in Korean
         - Sufficiently large training data is required.
           • Not much training data is available for Korean dependency parsing.
       • Constituent Treebanks in Korean
         - Penn Korean Treebank: 15K sentences.
         - KAIST Treebank: 30K sentences.
         - Sejong Treebank: 60K sentences.
           • The most recent and largest Treebank in Korean.
           • Contains Penn Treebank style constituent trees.
  • 5. Sejong Treebank
       • Phrase structure
         - Includes phrase tags, POS tags, and function tags.
         - Each token can be broken into several morphemes.
         - Tokens are mostly separated by white spaces.
       (Example: "She still him loved", analyzed as NP-SBJ, AP, NP-OBJ, and VP,
        with morpheme analyses She = NP+JX, still = MAG, him = NP+JKO,
        loved = NNG+XSV+EP+EF.)
  • 6. Sejong Treebank
       • Phrase-level tags and function tags (Table 2); these tags are also used
         to determine dependency relations during the conversion:

         S    Sentence              SBJ  Subject
         Q    Quotative clause      OBJ  Object
         NP   Noun phrase           CMP  Complement
         VP   Verb phrase           MOD  Noun modifier
         VNP  Copula phrase         AJT  Predicate modifier
         AP   Adverb phrase         CNJ  Conjunctive
         DP   Adnoun phrase         INT  Vocative
         IP   Interjection phrase   PRN  Parenthetical

       • POS tags (Table 1; CP: case particle, PR: particle, EM: ending marker,
         DS: derivational suffix):

         NNG  General noun          JKS  Subjective CP       EP   Prefinal EM
         NNP  Proper noun           JKC  Complemental CP     EF   Final EM
         NNB  Bound noun            JKG  Adnomial CP         EC   Conjunctive EM
         NP   Pronoun               JKO  Objective CP        ETN  Nominalizing EM
         NR   Numeral               JKB  Adverbial CP        ETM  Adnominalizing EM
         VV   Verb                  JKV  Vocative CP         XPN  Noun prefix
         VA   Adjective             JKQ  Quotative CP        XSN  Noun DS
         VX   Auxiliary predicate   JX   Auxiliary PR        XSV  Verb DS
         VCP  Copula                JC   Conjunctive PR      XSA  Adjective DS
         VCN  Negation adjective    MAG  General adverb      XR   Base morpheme
         MM   Adnoun                MAJ  Conjunctive adverb  SN   Number
         IC   Interjection          NF   Noun-like word      SL   Foreign word
         NV   Predicate-like word   NA   Unknown word        SH   Chinese word
         SF, SP, SS, SE, SO, SW: different types of punctuation
  • 7. Dependency Conversion
       • Conversion steps
         - Find the head of each phrase using head-percolation rules.
           • All other nodes in the phrase become dependents of the head.
         - Re-direct dependencies for empty categories.
           • Empty categories are not annotated in the Sejong Treebank.
           • Skipping this step generates only projective dependency trees.
         - Label the (automatically generated) dependencies.
       • Special cases
         - Coordination, nested function tags.
  • 8. Dependency Conversion
       • Head-percolation rules (Table 3)
         - Achieved by analyzing each phrase in the Sejong Treebank.
         - Korean is a head-final language.
         - No rules are needed to find the head morpheme of each token.
         - Some approaches have instead treated each morpheme as an individual
           token to parse (Chung et al., 2010).

           S      r  VP;VNP;S;NP|AP;Q;*
           Q      l  S|VP|VNP|NP;Q;*
           NP     r  NP;S;VP;VNP;AP;*
           VP     r  VP;VNP;NP;S;IP;*
           VNP    r  VNP;NP;S;*
           AP     r  AP;VP;NP;S;*
           DP     r  DP;VP;*
           IP     r  IP;VNP;*
           X|L|R  r  *

         - l/r implies looking for the leftmost/rightmost constituent; * implies
           any phrase-level tag; | implies a logical OR; ; is a delimiter
           between tags.
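The rules above can be applied mechanically to pick the head child of a phrase. The sketch below is a hypothetical Python helper, not the conversion tool used in the paper; the dictionary encoding of the rule table and the function-tag stripping are assumptions:

```python
# Sketch of head-finding with the head-percolation rules above.
# Hypothetical illustration, not the authors' conversion tool.

HEADRULES = {
    "S":   ("r", ["VP", "VNP", "S", "NP|AP", "Q", "*"]),
    "Q":   ("l", ["S|VP|VNP|NP", "Q", "*"]),
    "NP":  ("r", ["NP", "S", "VP", "VNP", "AP", "*"]),
    "VP":  ("r", ["VP", "VNP", "NP", "S", "IP", "*"]),
    "VNP": ("r", ["VNP", "NP", "S", "*"]),
    "AP":  ("r", ["AP", "VP", "NP", "S", "*"]),
    "DP":  ("r", ["DP", "VP", "*"]),
    "IP":  ("r", ["IP", "VNP", "*"]),
}

def find_head_child(phrase_tag, child_tags):
    """Return the index of the head child among child_tags.

    'r' scans children right-to-left (Korean is head-final),
    'l' scans left-to-right; tag sets are tried in precedence
    order, '|' is a logical OR, and '*' matches any tag.
    """
    direction, priorities = HEADRULES.get(phrase_tag, ("r", ["*"]))
    indices = range(len(child_tags) - 1, -1, -1) if direction == "r" \
              else range(len(child_tags))
    for tagset in priorities:                  # precedence order
        allowed = tagset.split("|")
        for i in indices:
            # strip function tags, e.g. 'NP-SBJ' -> 'NP'
            tag = child_tags[i].split("-")[0]
            if tagset == "*" or tag in allowed:
                return i
    return len(child_tags) - 1                 # fallback: rightmost child

# 'S -> NP-SBJ VP': the VP is the head; NP-SBJ becomes its dependent.
print(find_head_child("S", ["NP-SBJ", "VP"]))  # -> 1
```

Once the head child is known, every other child of the phrase is attached to it as a dependent, which is exactly the first conversion step on the previous slide.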
  • 9. Dependency Conversion
       • Dependency labels
         - Labels retained from the function tags (e.g., SBJ, OBJ).
         - Labels inferred from constituent relations (Algorithm 1):

           input:  (c, p), where c is a dependent of p
           output: a dependency label l as c ← p
           if   p = root          then ROOT → l
           elif c.pos = AP        then ADV  → l
           elif p.pos = AP        then AMOD → l
           elif p.pos = DP        then DMOD → l
           elif p.pos = NP        then NMOD → l
           elif p.pos = VP|VNP|IP then VMOD → l
           else                        DEP  → l
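Algorithm 1 is given in full above, so it can be transcribed directly; the sketch below assumes a minimal `Node` carrying the phrase-level tag in a `pos` attribute, which is an illustration convention rather than the authors' data structure:

```python
# Algorithm 1 from the slide: infer a dependency label for the
# relation c <- p when no function tag is available on c.
from collections import namedtuple

Node = namedtuple("Node", "pos")  # hypothetical minimal node
ROOT = None                       # sentinel for the artificial root

def inferred_label(c, p):
    """c is a dependent of p; both carry a phrase-level .pos tag."""
    if p is ROOT:
        return "ROOT"
    if c.pos == "AP":
        return "ADV"              # adverb phrases act as adverbials
    if p.pos == "AP":
        return "AMOD"
    if p.pos == "DP":
        return "DMOD"
    if p.pos == "NP":
        return "NMOD"
    if p.pos in ("VP", "VNP", "IP"):
        return "VMOD"
    return "DEP"                  # default label

print(inferred_label(Node("AP"), Node("VP")))  # -> ADV
```

Note that the dependent's own tag is only consulted for AP; every other branch keys off the head's phrase type, which is why the check order in the algorithm matters.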
  • 10. Dependency Conversion
       • Coordination
         - Previous conjuncts become dependents of the following conjuncts.
       • Nested function tags
         - Nodes with nested f-tags become the heads of the phrases.
       (Example: "I_and he_and she home left"; the first two conjuncts attach
        as CNJ, "she" is the SBJ and "home" the OBJ of "left".)
  • 11. Dependency Parsing
       • Dependency parsing algorithm
         - Transition-based, non-projective parsing algorithm (Choi & Palmer, 2011).
         - Selectively performs transitions from both projective and
           non-projective dependency parsing algorithms.
         - Linear-time parsing speed in practice for non-projective trees.
       • Machine learning algorithm
         - Liblinear L2-regularized L1-loss support vector machine.
       Jinho D. Choi & Martha Palmer. 2011. Getting the Most out of
       Transition-based Dependency Parsing. In Proceedings of ACL:HLT'11.
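For intuition about what "transition-based" means here, the sketch below implements a plain projective arc-standard oracle. It is NOT the Choi & Palmer (2011) algorithm, which additionally mixes in non-projective (list-based) transitions; it is a simplified, hypothetical illustration of the shift/arc machinery:

```python
# Illustration only: a plain projective arc-standard oracle that
# replays the transitions needed to rebuild a given gold tree.
# NOT the Choi & Palmer (2011) selective projective/non-projective
# algorithm; just the basic transition-based parsing idea.

def arc_standard_oracle(heads):
    """heads[i] = head index of token i+1 (0 = artificial root).
    Returns the transition sequence and the arcs it rebuilds."""
    stack, buffer = [0], list(range(1, len(heads) + 1))
    arcs, transitions = set(), []
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            top, below = stack[-1], stack[-2]
            # LEFT-ARC: the node below the top depends on the top
            if below != 0 and heads[below - 1] == top:
                arcs.add((top, below))
                stack.pop(-2)
                transitions.append("LEFT-ARC")
                continue
            # RIGHT-ARC: the top depends on the node below it, and no
            # token still in the buffer needs the top as its head
            if heads[top - 1] == below and \
               all(heads[b - 1] != top for b in buffer):
                arcs.add((below, top))
                stack.pop()
                transitions.append("RIGHT-ARC")
                continue
        if not buffer:
            raise ValueError("tree is non-projective or malformed")
        stack.append(buffer.pop(0))
        transitions.append("SHIFT")
    return transitions, arcs

# "She still him loved": all three words depend on the final verb,
# which attaches to the root, i.e. heads = [4, 4, 4, 0].
trans, arcs = arc_standard_oracle([4, 4, 4, 0])
```

In a trained parser, the oracle's decision at each step is replaced by the SVM classifier's prediction over the features described on the next slides.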
  • 12. Dependency Parsing
       • Feature selection
         - Each token consists of multiple morphemes (up to 21).
         - POS tag feature of each token?
           • (NNG & XSV & EP & EF & SF) vs. (NNG | XSV | EP | EF | SF)
           • Sparse information vs. lack of information. Is there a happy medium?
       (Example tokens: Nakrang + Princess + JX = NNP+NNG+JX;
        Hodong + Prince + JKO = NNP+NNG+JKO;
        Love + XSV + EP + EF + . = NNG+XSV+EP+EF+SF.)
  • 13. Dependency Parsing
       • Morpheme selection
         - Joining all POS tags within a token (e.g., NNP+NNG+JX) causes very
           sparse feature vectors; using each tag individually lacks information.
         - As a compromise, only certain types of morphemes are selected and
           used as features (Table 6):

           FS  The first morpheme
           LS  The last morpheme before JO|DS|EM
           JK  Particles (J* in Table 1)
           DS  Derivational suffixes (XS* in Table 1)
           EM  Ending markers (E* in Table 1)
           PY  The last punctuation, only if there is no other morpheme
               followed by the punctuation

         - From the experiments, features extracted from the JK and EM
           morphemes are found to be the most useful.
Dependency Parsing
• Machine learning
  - Liblinear (Hsieh et al., 2008), applying c = 0.1 (cost), e = 0.1 (termination criterion), B = 0 (bias).
• Feature extraction
  - Extract features using only important morphemes (Table 6):
    • JK - particles (J* in Table 1)
    • DS - derivational suffixes (XS* in Table 1)
    • EM - ending markers (E* in Table 1)
    • PY - the last punctuation, only if there is no other morpheme followed by the punctuation
  - Individual POS tag features of the 1st and 3rd tokens:
    NNP1, NNG1, JK1, NNG3, XSV3, EF3
  - Joined features of POS tags between the 1st and 3rd tokens:
    NNP1_NNG3, NNP1_XSV3, NNP1_EF3, JK1_NNG3, JK1_XSV3
  - Tokens used: wi, wj, wi±1, wj±1
  - From our experiments, features extracted from the JK and EM morphemes are found to be the most useful.
14
Thursday, October 6, 2011
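The selection-plus-feature-generation step above can be sketched as follows (a hypothetical illustration, not the authors' implementation; the helper names and the simplified rule "keep the first morpheme's tag plus all J*/XS*/E* tags" are my assumptions):

```python
# Keep only "important" morpheme tags per token - approximated here as the
# first morpheme's tag plus any particle (J*), derivational suffix (XS*),
# or ending marker (E*) - then build individual and joined POS-tag
# features between two tokens wi and wj.

def important_tags(token):
    """Return the POS tags of the important morphemes in a token."""
    keep = [token[0][1]]  # tag of the first (lexical) morpheme
    keep += [pos for _, pos in token[1:] if pos.startswith(("J", "XS", "E"))]
    return keep

def features(wi, wj):
    """Individual and joined POS-tag features for a token pair."""
    ind = [f"wi.pos={p}" for p in important_tags(wi)]
    ind += [f"wj.pos={p}" for p in important_tags(wj)]
    joined = [f"wi_wj.pos={pi}_{pj}"
              for pi in important_tags(wi) for pj in important_tags(wj)]
    return ind + joined

# 1st and 3rd tokens of Figure 6 (surface forms shown as glosses)
w1 = [("Nakrang", "NNP"), ("Princess", "NNG"), ("-neun", "JX")]
w3 = [("Love", "NNG"), ("-ha", "XSV"), ("-yeoss", "EP"), ("-da", "EF"), (".", "SF")]
print(features(w1, w3))
```

Restricting the pool this way keeps the joined features (e.g., NNP_XSV) from exploding combinatorially over every morpheme in both tokens.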
Experiments
• Corpora
  - Dependency trees converted from the Sejong Treebank.
  - Consists of 6 genres: Newspaper (NP), Magazine (MZ), Fiction (FI), Memoir (ME), Informative Book (IB), and Educational Cartoon (EC).
  - Development and evaluation sets are very diverse compared to the training data.
    • Testing on such evaluation sets ensures the robustness of our parsing models, which is important because we hope to use them to parse various texts on the web.

  Table 10: Number of sentences in training (T), development (D), and evaluation (E) sets for each genre.

        NP     MZ      FI     ME    IB    EC
    T  8,060  6,713  15,646  5,053 7,983 1,548
    D  2,048    -     2,174    -   1,307   -
    E  2,048    -     2,175    -   1,308   -
15
Thursday, October 6, 2011
Experiments
• Morphological analysis
  - Two automatic morphological analyzers are used.
  • Intelligent Morphological Analyzer
    - Developed by the Sejong project.
    - Provides the same morphological analysis as their Treebank.
      • Considered as fine-grained morphological analysis.
  • Mach (Shim and Yang, 2002)
    - Analyzes 1.3M words per second.
    - Provides a more coarse-grained morphological analysis.

Kwangseob Shim & Jaehyung Yang. 2002. A Supersonic Korean Morphological Analyzer. In Proceedings of COLING'02.
16
Thursday, October 6, 2011
Experiments
• Evaluations
  - Gold-standard vs. automatic morphological analysis.
    • Relatively low performance from the automatic system.
  - Fine vs. coarse-grained morphological analysis.
    • Differences are not too significant.
  - Robustness across different genres.

  Table 11: Parsing accuracies achieved by the three models (in %). LAS - labeled attachment score, UAS - unlabeled attachment score, LS - label accuracy score.

          Gold, fine-grained     Auto, fine-grained     Auto, coarse-grained
          LAS    UAS    LS       LAS    UAS    LS       LAS    UAS    LS
    NP    82.58  84.32  94.05    79.61  82.35  91.49    79.00  81.68  91.50
    FI    84.78  87.04  93.70    81.54  85.04  90.95    80.11  83.96  90.24
    IB    84.21  85.50  95.82    80.45  82.14  92.73    81.43  83.38  93.89
    Avg.  83.74  85.47  94.57    80.43  83.01  91.77    80.14  82.89  91.99
17
Thursday, October 6, 2011
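The three scores reported in the table follow the standard dependency-evaluation definitions: LAS requires both the head and the label to be correct, UAS only the head, and LS only the label. A minimal sketch of these definitions (my own illustration, not the authors' evaluation script):

```python
def scores(gold, pred):
    """gold, pred: one (head_index, dependency_label) pair per token.

    Returns (LAS, UAS, LS) in percent.
    """
    n = len(gold)
    las = sum(g == p for g, p in zip(gold, pred))        # head and label correct
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred))  # head correct
    ls = sum(g[1] == p[1] for g, p in zip(gold, pred))   # label correct
    return las * 100.0 / n, uas * 100.0 / n, ls * 100.0 / n

# Toy 4-token sentence: one head error, one label error.
gold = [(2, "SBJ"), (0, "ROOT"), (2, "OBJ"), (3, "ADV")]
pred = [(2, "SBJ"), (0, "ROOT"), (2, "ADV"), (2, "ADV")]
print(scores(gold, pred))  # (50.0, 75.0, 75.0)
```

Note that LAS is always less than or equal to both UAS and LS, which the table's rows also reflect.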
Conclusion
• Contributions
  - Generating a Korean Dependency Treebank.
  - Selecting important morphemes for dependency parsing.
  - Evaluating the impact of fine vs. coarse-grained morphological analysis on dependency parsing.
  - Evaluating the robustness across different genres.
• Future work
  - Increase the feature span beyond bigrams.
  - Find head morphemes of individual tokens.
  - Insert empty categories.
18
Thursday, October 6, 2011
Acknowledgements
• Special thanks are due to
  - Professor Kong Joo Lee of Chungnam National University.
  - Professor Kwangseob Shim of Shungshin Women's University.
• We gratefully acknowledge the support of the National Science Foundation Grant CISE-IIS-RI-0910992, Richer Representations for Machine Translation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
19
Thursday, October 6, 2011