This thesis makes two contributions, in the form of lexical resources, to (1) dependency parsing in Korean and (2) semantic parsing in English. First, we describe our methodology for building three dependency treebanks in Korean, derived from existing treebanks and pseudo-annotated according to the latest Universal Dependencies (UD) guidelines. The original Google Korean UD Treebank is re-tokenized to ensure morpheme-level annotation consistency with other corpora while maintaining the linguistic validity of the revised tokens. Phrase structure trees in the Penn Korean Treebank and the Kaist Treebank are automatically converted into UD dependency trees by applying head-percolation rules and linguistically motivated heuristics. A total of 38K+ dependency trees are generated. To the best of our knowledge, this is the first time that these three Korean treebanks have been converted into UD dependency treebanks following the latest annotation guidelines. Second, we introduce an ongoing project to augment the OntoNotes phrase structure treebank with semantic features found in the Abstract Meaning Representation (AMR), as part of an effort to build an accurate AMR parser. We propose a novel technique for AMR parsing that first trains a dependency parser on the OntoNotes corpus augmented with numbered arguments from the Proposition Bank (PropBank), and then applies transfer learning to adapt the trained dependency parser to the AMR parsing task. A preliminary step is to prepare the dependency data: during constituent-to-dependency conversion, dependencies that define predicate-argument structure are automatically replaced with their corresponding PropBank argument labels.
To the best of our knowledge, this is the first time that PropBank labels have been inserted directly into dependency structures. The result is a new dependency corpus that combines rich syntactic information with the semantic role information provided by PropBank, which fully describes the predicate-argument structure, making it an ideal resource for AMR parsing and, more broadly, semantic parsing.
2. Contents
1. Thesis Road Map
2. Background Part 1: [Constituency & Dependency Grammar]
3. Constituent-to-Dependency Conversion
4. Universal Dependency Treebanks in Korean
5. Background Part 2: [Predicate-Argument Structure & AMR]
6. PropBank-Augmented OntoNotes Corpus
7. Contributions
3. Parsing
´ Syntactic Parsing: Constituency, Dependency
´ Semantic Parsing: Semantic Role Labeling, AMR
´ Diagram labels: Korean, PropBank, AMR ..?; legend: done, in progress..
4. Constituency (Phrase Structure)
´ Constituent: a word or a phrase that acts like a single grammatical unit
´ Root
´ Terminals
´ Non-Terminals
5. Dependency
´ Dependency: A directed arc that establishes a head-child relation between two nodes
´ Dependency label describes the child’s role in relation to the head
´ Can represent languages with flexible word order
6. Well-Formed Dependency Graphs
[Figure: an arc labeled dep drawn between a head and a child]
1. Unique Root
2. Single Head
3. Connected
4. Acyclic
5. Projective
Jurafsky, D.; Martin, J. H., Speech and Language Processing: Dependency Parsing, Ch. 14, p. 5
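
The five properties listed on this slide can be checked mechanically. The following is a minimal sketch, not taken from the thesis: it assumes tokens are numbered 1..n, that node 0 is an artificial root, and that arcs are given as (head, dependent) pairs. The function name and the returned dictionary are illustrative choices, and "connected" is approximated as "every token reaches the root by following heads".

```python
def check_well_formed(n, arcs):
    """Check the listed properties for a graph over tokens 1..n.

    Assumptions (not from the slides): node 0 is an artificial root and
    `arcs` is a list of (head, dependent) pairs.
    """
    heads = {}
    single_head = True
    for head, dep in arcs:
        if dep in heads:          # a second incoming arc for the same token
            single_head = False
        heads.setdefault(dep, head)

    # Exactly one arc leaves the artificial root.
    unique_root = sum(1 for head, _ in arcs if head == 0) == 1

    def status(token):
        """Follow head links upward until the root, a cycle, or a dead end."""
        seen = set()
        while token != 0:
            if token in seen:
                return "cycle"
            if token not in heads:
                return "headless"
            seen.add(token)
            token = heads[token]
        return "ok"

    statuses = [status(t) for t in range(1, n + 1)]

    # Projectivity is tested here as "no two arcs cross".
    spans = [tuple(sorted(arc)) for arc in arcs]
    crossing = any(a < c < b < d for a, b in spans for c, d in spans)

    return {
        "unique_root": unique_root,
        "single_head": single_head,
        "connected": all(s == "ok" for s in statuses),
        "acyclic": "cycle" not in statuses,
        "projective": not crossing,
    }


# "Michael played the guitar": played is the root, guitar heads "the".
tree = check_well_formed(4, [(0, 2), (2, 1), (2, 4), (4, 3)])
# Same size, but the arc 1 -> 4 crosses the arc 0 -> 3.
crossed = check_well_formed(4, [(0, 3), (3, 1), (3, 2), (1, 4)])
```

In this sketch the first graph satisfies all five checks, while the second fails only the projectivity check.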
7. Korean
´ A morphologically rich language
´ Morphology: study of how words are formed
´ Morphological Analysis: kamsahamnida (thank) -> kamsa (thank) + ham (verbalize) + nida (ending marker)
´ Several large constituency treebanks
´ Q: What about dependency?
´ Relatively free word order
´ Morphemes provide syntactic function as well as meaning of words
´ Lack of large publicly available dependency corpora
9. Approach
´ Leverage the large annotated constituency treebanks
´ Convert the constituent trees into dependency trees!
10. Constituent-to-Dependency Conversion [1]
0. Redirect Dependencies for Empty Categories (if they exist)
1. Establish Head-Child Dependency relations using Head-Percolation Rules
2. Infer Dependency Labels using Linguistic Heuristics
11. Empty Categories
´ Characteristic of treebanks annotated in Penn Treebank [3] style
´ OntoNotes [4], Penn Korean Treebank [5]
´ Nominal units that indicate the location of their antecedent syntactic elements
´ Enable the representation of long-distance dependencies
´ Often break the projectivity property
12. Types of Empty Categories in the Penn Korean Treebank
1. Trace (*T*): An argument that precedes its subject leaves a trace in its place, a pointer to the index of its antecedent in the tree
´ Trace Mapping
2. Ellipsis (*?*): Dropped predicate in a matrix clause or a clausal coordination
´ Heuristics to identify the location of the shared predicate
3. Empty Assignment (*pro*): Dropped arguments
4. Empty Operator (*op*): Relative Clauses
[Figure: example tree after wh-movement; node label *?*]
13. Head-Percolation Rules
(S (ADCP (ADC 반면[Meanwhile]))
   (S (NP-SBJ (NPR+NNX+PAU 삼성+측+은[Samsung]))
      (VP (NP-OBJ (NNC+PCA 논평+을[to comment]))
          (VV (NNC+XSV+EPF+EFN 거부+하+었+다[refused]))))
   (SFN .))
´ For every node in the tree, locate the head by iterating through its immediate children and matching the POS tags in the order delimited by ;
´ r: Iterate from right to left (Korean is a head-final language)
´ A terminal node’s head is itself
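
As an illustration of how head-percolation rules drive the conversion, here is a minimal sketch applied to the example tree on this slide. The rule table `RULES` is hypothetical (the actual rules are not reproduced in the slides), function tags such as -SBJ are simply stripped before matching, and the fallback of taking the rightmost child when no rule matches is an assumption.

```python
# Hypothetical rule table: phrase label -> (direction, POS/phrase priority list).
# "r" means the children are scanned from right to left.
RULES = {
    "S": ("r", ["VP", "S", "NP"]),
    "VP": ("r", ["VV", "VP", "NP"]),
}

# The example tree as (label, children) pairs; a terminal is (POS, word).
TREE = ("S", [
    ("ADCP", [("ADC", "반면")]),
    ("S", [
        ("NP-SBJ", [("NPR+NNX+PAU", "삼성+측+은")]),
        ("VP", [
            ("NP-OBJ", [("NNC+PCA", "논평+을")]),
            ("VV", [("NNC+XSV+EPF+EFN", "거부+하+었+다")]),
        ]),
    ]),
    ("SFN", "."),
])


def base(label):
    """Strip function tags: NP-SBJ -> NP."""
    return label.split("-")[0]


def head_child(node):
    """Pick the head child of a non-terminal using the rule table."""
    label, children = node
    direction, priorities = RULES.get(base(label), ("r", []))
    ordered = children[::-1] if direction == "r" else children
    for wanted in priorities:
        for child in ordered:
            if base(child[0]) == wanted:
                return child
    return ordered[0]  # assumed fallback: first child in scan order


def lexical_head(node):
    """A terminal is its own head; otherwise percolate up from the head child."""
    label, children = node
    if isinstance(children, str):
        return children
    return lexical_head(head_child(node))


def dependencies(node):
    """Every non-head child becomes a dependent of the node's lexical head."""
    label, children = node
    if isinstance(children, str):
        return []
    head = head_child(node)
    arcs = []
    for child in children:
        if child is not head:
            arcs.append((lexical_head(node), lexical_head(child)))
        arcs.extend(dependencies(child))
    return arcs


ARCS = dependencies(TREE)
```

Under these assumed rules, the verb 거부+하+었+다 becomes the head of the sentence and the other four terminals attach to it. The arcs are unlabeled at this stage; labels are inferred in a separate step.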
14. Dependency Label Inference
´ Linguistic heuristics:
´ Morphological analysis of the head and the dependent
´ POS
´ Word
´ Function tags
´ Function Tags
´ Annotated in Penn Treebank-style treebanks
´ Provide additional syntactic / semantic information
´ Ex) NP-SBJ -> The NP (Noun Phrase) is the subject of a clause or a sentence
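
A sketch of how function tags could feed label inference follows. The mappings below are hypothetical and cover only a few cases: `nsubj`, `obj`, `punct`, and `dep` are UD relation names, but which function tag or POS maps to which relation in the actual conversion is not specified in these slides, and the morphological and word-level heuristics are omitted.

```python
# Hypothetical mappings, for illustration only.
FUNCTION_TAG_LABELS = {"SBJ": "nsubj", "OBJ": "obj"}
POS_LABELS = {"SFN": "punct"}


def infer_label(dependent_label, default="dep"):
    """Infer a dependency label from the dependent's phrase label or POS.

    Function tags (the parts after '-') are checked first, then the
    category itself; anything unmatched receives the default label.
    """
    category, *function_tags = dependent_label.split("-")
    for tag in function_tags:
        if tag in FUNCTION_TAG_LABELS:
            return FUNCTION_TAG_LABELS[tag]
    return POS_LABELS.get(category, default)


labels = [infer_label(x) for x in ["NP-SBJ", "NP-OBJ", "SFN", "ADCP"]]
```

Checking function tags before the category reflects the slide's point that function tags carry the extra syntactic information needed to tell a subject NP from an object NP.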
15. Universal Dependencies [6]
´ Effort to create a consistent annotation scheme for multiple languages
´ Encourage multi-lingual parsing experiments and comparative analysis
´ Defines a POS and dependency label tagset
´ Suggests a universal way of annotating certain sentence constructions, but allows room for language-specific extensions
´ Ex) Coordination
16. The Google UD Korean Treebank
´ McDonald et al. [10] released a UD Korean Treebank of 6K sentences
´ Issues:
´ Coarse tokenization of suffixes, particles, and punctuation marks
´ Outdated annotation scheme
´ Our approach:
´ Perform a systematic conversion, including re-tokenization, to match the latest guidelines
´ shown image by image
19. Discussion
´ Google Korean Treebank
´ Further possibilities for errors exist
´ Ex) abundance of flat dependency relations
´ Kaist Treebank
´ A small set of phrasal POS tags and a lack of function tags made dependency inference difficult
´ Source code to be available at https://github.com/emorynlp/ud-korean.
20. Predicate Argument Structure
´ Predicate: describes the subject
´ Usually a verb
´ Argument: helps the predicate complete its meaning
´ ARG0: agent, ARG1: patient, ARG2: instrument, attribute, benefactive (for …)
´ Ex 1) Michael played the guitar
´ play (ARG0: Michael, ARG1: the guitar)
´ Ex 2) Sam was awake by 9 a.m.
´ be (ARG1: Sam, ARG2: awake, ARGM-TMP: by 9 a.m.)
´ awake (ARG0: Sam, ARGM-TMP: by 9 a.m.)
´ The task of assigning semantic roles to words or phrases is known as Semantic Role Labeling.
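
The two examples above can be held in a small data structure. This is an illustrative sketch only: the `Proposition` class and its method names are hypothetical and not part of PropBank or of the thesis.

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    """A predicate with its labeled arguments (hypothetical container)."""
    predicate: str
    arguments: dict = field(default_factory=dict)

    def core_arguments(self):
        """Numbered arguments (ARG0, ARG1, ...) without ARGM-* modifiers."""
        return {role: text for role, text in self.arguments.items()
                if role[3:].isdigit()}

    def render(self):
        """Format the proposition in the notation used on the slide."""
        args = ", ".join(f"{role}: {text}" for role, text in self.arguments.items())
        return f"{self.predicate} ({args})"


play = Proposition("play", {"ARG0": "Michael", "ARG1": "the guitar"})
be = Proposition("be", {"ARG1": "Sam", "ARG2": "awake", "ARGM-TMP": "by 9 a.m."})
```

Separating numbered arguments from ARGM modifiers matters later in the deck, since the corpus augmentation concerns the numbered arguments.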
21. PropBank [7]
´ Given a predicate of a sentence in the OntoNotes corpus,
´ Provides the sense ID to specify a particular meaning of the predicate
´ Lists the predicate’s arguments, along with their semantic roles
´ Ex) follow.01 : be subsequent
´ ARG0: causal agent
´ ARG1: thing following
´ ARG2: thing followed
22. But…
´ Hard to guarantee that a typical dependency parser will represent all predicate argument relations annotated in PropBank in its parse tree.
´ Cannot break the properties that define a dependency tree
23. Deep Dependency Graph (DDG) [11]
´ Retains two of the four properties:
1. Unique Root
2. Connected
´ Seeks to abstract away from syntactic idiosyncrasies and produce the same dependency graph (not a tree) for phrases/sentences with similar meaning.
´ DDG can represent complete predicate argument structures
24. Abstract Meaning Representation (AMR) [8]
´ Represents meaning as a rooted, directed, and labeled graph
´ Variables easily handle intra-sentence co-reference
´ Inherits the PropBank semantic roles (arg0, arg1, etc.)
´ Ex) “The professor likes to drink coffee.”
´ Note: “The” and “to” are omitted in the AMR for their lack of semantic contribution.
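
To make the role of variables concrete, the example sentence can be written as a small graph and printed in PENMAN-style bracket notation. This is a sketch under assumptions: the concept names `like-01` and `drink-01` and the exact graph are our reading of the sentence, not copied from the slide's figure.

```python
# Assumed AMR for "The professor likes to drink coffee."
# Each variable names one node; reusing the variable p expresses that the
# professor is both the liker and the drinker.
INSTANCES = {"l": "like-01", "p": "professor", "d": "drink-01", "c": "coffee"}
EDGES = [
    ("l", "ARG0", "p"),
    ("l", "ARG1", "d"),
    ("d", "ARG0", "p"),   # re-entrancy: same node p, not a second professor
    ("d", "ARG1", "c"),
]


def to_penman(var, seen):
    """Serialize the graph from `var`; already-visited nodes print as bare variables."""
    if var in seen:
        return var
    seen.add(var)
    parts = [f"{var} / {INSTANCES[var]}"]
    for source, role, target in EDGES:
        if source == var:
            parts.append(f":{role} {to_penman(target, seen)}")
    return "(" + " ".join(parts) + ")"


AMR_TEXT = to_penman("l", set())
```

The second mention of `p` is printed as a bare variable, which is how the notation handles intra-sentence co-reference without duplicating the node.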
25. AMR Parsing
´ Transition-based Dependency Tree to AMR Mapping [9]
´ Exploits the head-child dependency in both representations
´ Two-step algorithm:
1. A dependency parser is run to obtain a dependency tree of the source text
2. A transition-based framework transforms the input dependency tree into an AMR
´ Adding linguistic features such as named entities as input to the mapping framework yields better results
26. Hypothesis
´ Premise
´ AMR inherits the core semantic roles from PropBank
´ DDG can produce dependency graphs with complete predicate-argument structure
´ Preliminary Step
´ In OntoNotes, replace the dependency relations between a predicate and its arguments with PropBank labels
´ Hypothesis
´ Training a dependency parser on the treebank modified in this way will partially teach it how to do semantic role labeling
´ The trained model can then be further trained on the AMR parsing task
27. Insertion of PropBank Labels into OntoNotes
´ Straightforward in the general case
´ For each predicate in the OntoNotes sentence,
1. invoke the corresponding PropBank entry
2. identify the DDG dependency between the predicate and each of its arguments
3. replace the dependency relation with PropBank labels
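
The three steps above can be sketched as a single pass over the dependency arcs. The data layout is an assumption made for illustration: arcs are (head, dependent, label) triples over token indices, PropBank annotations are given as a mapping from a predicate token to the head tokens of its arguments, and the syntactic labels in the example are placeholders rather than actual DDG labels.

```python
def insert_propbank_labels(arcs, propositions):
    """Replace predicate-argument dependency labels with PropBank labels.

    arcs:         list of (head, dependent, label) triples
    propositions: {predicate_token: {argument_head_token: propbank_label}}
    Returns the relabeled arcs plus any arguments that had no direct arc
    from their predicate (candidate mismatch cases).
    """
    relabeled, used = [], set()
    for head, dep, label in arcs:
        new_label = propositions.get(head, {}).get(dep)
        if new_label is not None:
            used.add((head, dep))
            label = new_label
        relabeled.append((head, dep, label))
    unmatched = [(pred, arg, role)
                 for pred, args in propositions.items()
                 for arg, role in args.items()
                 if (pred, arg) not in used]
    return relabeled, unmatched


# Tokens: 1 And, 2 ad, 3 agencies, 4 insist, 5 that, 6 they, 7 do
arcs = [(4, 1, "cc"), (3, 2, "compound"), (4, 3, "nsubj"), (0, 4, "root"),
        (7, 5, "mark"), (7, 6, "nsubj"), (4, 7, "ccomp")]
propositions = {4: {3: "ARG0", 7: "ARG1"}}   # insist.01
new_arcs, unmatched = insert_propbank_labels(arcs, propositions)
```

Only arcs that leave the predicate `insist` are relabeled; the subject arc of `do` keeps its syntactic label because `do` has no annotated arguments in this example.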
28. Example
(TOP (S (CC And)
(NP-SBJ (NN ad)
(NNS agencies))
(VP (VBP insist)
(SBAR (IN that)
(S (NP-SBJ (PRP they))
(VP (VBP do)
(VP (-NONE- *?*))))))
(. .)))
nw/wsj/17/wsj_1705.parse 25 3 gold insist insist.01 ----- 1:1-ARG0 3:0-rel 4:1-ARG1
nw/wsj/17/wsj_1705.parse 25 6 gold do do.01 ----- 6:0-rel
[Figure annotations: arg0 and arg1 arcs; each pointer reads as node index:height]
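
The pointers in the annotation line can be resolved against the tree with a few lines of code. This sketch assumes that the node index counts terminals from 0 (including empty categories) and that height 0 denotes the POS-level node directly above the word, which is consistent with 3:0 pointing at "insist". Only simple pointers are handled; the function names are illustrative.

```python
import re

TREE = """(TOP (S (CC And) (NP-SBJ (NN ad) (NNS agencies))
  (VP (VBP insist) (SBAR (IN that) (S (NP-SBJ (PRP they))
  (VP (VBP do) (VP (-NONE- *?*)))))) (. .)))"""
PROP = ("nw/wsj/17/wsj_1705.parse 25 3 gold insist insist.01 ----- "
        "1:1-ARG0 3:0-rel 4:1-ARG1")


def parse_tree(text):
    """Parse a bracketed tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                     # skip "("
        node = [tokens[pos]]         # node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1                     # skip ")"
        return node

    return read()


def is_preterminal(node):
    return len(node) == 2 and isinstance(node[1], str)


def preterminal_paths(node, path=()):
    """Yield the root-to-preterminal path of each terminal, left to right."""
    path = path + (node,)
    if is_preterminal(node):
        yield path
    else:
        for child in node[1:]:
            yield from preterminal_paths(child, path)


def words(node):
    if is_preterminal(node):
        return [node[1]]
    return [w for child in node[1:] for w in words(child)]


def read_prop(line, tree):
    """Return the roleset and the text span covered by each pointer."""
    fields = line.split()
    roleset, pointers = fields[5], fields[7:]
    paths = list(preterminal_paths(tree))
    spans = {}
    for item in pointers:
        pointer, label = item.split("-", 1)
        terminal, height = (int(x) for x in pointer.split(":"))
        node = paths[terminal][-1 - height]   # climb `height` levels
        spans[label] = " ".join(words(node))
    return roleset, spans


ROLESET, SPANS = read_prop(PROP, parse_tree(TREE))
```

Under these assumptions, 1:1 climbs one level from "ad" to the subject NP, and 4:1 climbs from "that" to the SBAR clause, matching the arg0 and arg1 annotations in the figure.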
30. Contributions
1. Systematic updates to the Google UD Korean Treebank to match the latest UD annotation guidelines
2. Constituent-to-dependency conversion of the phrase structure trees in the Penn Korean Treebank and the Kaist Treebank
3. Analysis of the three converted Korean dependency treebanks
4. Construction of a new corpus by replacing dependencies that represent predicate argument structure in OntoNotes with PropBank labels
5. Analysis of mismatch cases between PropBank and DDG
31. References
´ [1] Choi, J. D.; Palmer, M., Guidelines for the Clear Style Constituent to Dependency Conversion, Technical Report 01-12, University of Colorado Boulder, 2012.
´ [2] Jurafsky, D.; Martin, J. H., Speech and Language Processing: Dependency Parsing, Ch. 14, p. 5.
´ [3] Marcus, M. et al., The Penn Treebank: Annotating Predicate Argument Structure, In Proceedings of the Workshop on Human Language Technology, HLT ‘94, Association for Computational Linguistics, pp. 114-119.
´ [4] Weischedel, R. et al., OntoNotes: A Large Training Corpus for Enhanced Processing.
´ [5] Han, C. et al., Development and Evaluation of a Korean Treebank and Its Application to NLP, In Proceedings of the Third International Conference on Language Resources and Evaluation, LREC 2002, May 29-31, 2002.
´ [6] Nivre, J.; Bosco, C.; Choi, J.; et al., Universal Dependencies 1.0, 2015.
32. References
´ [7] Palmer, M. et al., The Proposition Bank: An Annotated Corpus of Semantic Roles, Computational Linguistics 31, 1 (2005), 71-106.
´ [8] Banarescu, L. et al., Abstract Meaning Representation for Sembanking, 2013.
´ [9] Wang, C. et al., A Transition-Based Algorithm for AMR Parsing, 2015.
´ [10] McDonald, R. et al., Universal Dependency Annotation for Multilingual Parsing, 2013.
´ [11] Choi, J. D., Deep Dependency Graph Conversion in English, In Proceedings of the 15th International Workshop on Treebanks and Linguistic Theories, TLT'17, pages 35-62, Bloomington, IN, 2017.