Material of the Natural Language Processing (NLP) Workshop with STIC-Asia representatives and the Nepal team.
August 30-31, 2007.
Patan Dhoka, Lalitpur, Nepal.
1. Robust rule-based parsing
(quick overview)
I. Robustness
II. Three robust rule-based parsers of English
III. Common features
IV. Example : identification of subjects in Syntex
2. I. Robustness
(Aït-Mokhtar et al. 1997)
« the ability to provide useful analyses for real-world input text. By useful analyses, we mean analyses that are (at least partially) correct and usable in some automatic task or application »
implies :
  one analysis (even partial) for any real-world input
  ability to process irregular input and to overcome analysis errors
  efficiency
3. I. Types of robust parsers
(Aït-Mokhtar et al. 1997)
parsers based on traditional theoretical models with rule-based and/or stochastic post-processing
  Minipar (Lin 1995)
stochastic parsers
  Charniak’s parser (2000)
rule-based parsers
  Non-Projective Dependency Parser (Järvinen & Tapanainen 1997)
  Syntex (Bourigault 2007)
  Cass (Abney 1990, 1995)
most parsers are hybrid
4. II.1 Non-Projective Dependency Parser
(Tapanainen & Järvinen 1997)
Tagged Text → Syntactic Labeling → Selection of syntactic links → Pruning → OUTPUT
Resource : valency and subcategorization information
Syntactic Labeling : « all legitimate surface-syntactic labels are added to the set of morphological readings »
Selection of syntactic links : « syntactic rules discard contextually illegitimate alternatives or select legitimate ones »
Pruning : general heuristics disambiguate the last of the syntactic links
5. II.1 Non-Projective Dependency Parser
(Tapanainen & Järvinen 1997)
Rules establish dependency links between words
Rules are contextual :
SELECT (@SUBJ) IF (1C AUXMOD HEAD);
Example : How do you do ? (labels in the figure : SUBJ, AUX)
If the preceding word is an unambiguous auxiliary, the current word is the subject of this auxiliary
Rules use syntactic links established by preceding rules
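A minimal Python sketch of how such a contextual rule could be applied, following the gloss given above (preceding word is an unambiguous auxiliary). It is not the NPDP rule engine: the Token class, the function name, the tags AUX, PRON, ADV, V and every label other than @SUBJ are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Token:
    form: str
    tags: Set[str]               # remaining morphological readings (hypothetical tagset)
    labels: Set[str]             # candidate syntactic labels
    head: Optional[int] = None   # index of the governor once a link is selected


def select_subj_after_aux(tokens):
    """If the preceding word is an unambiguous auxiliary, keep only @SUBJ
    on the current word and link it to that auxiliary."""
    for i in range(1, len(tokens)):
        prev, cur = tokens[i - 1], tokens[i]
        if prev.tags == {"AUX"} and "@SUBJ" in cur.labels:
            cur.labels = {"@SUBJ"}   # SELECT: discard the other candidate labels
            cur.head = i - 1         # dependency link to the auxiliary
    return tokens


sentence = [
    Token("How", {"ADV"}, {"@ADVL"}),
    Token("do", {"AUX"}, {"@AUX"}),
    Token("you", {"PRON"}, {"@SUBJ", "@OBJ"}),
    Token("do", {"V"}, {"@MAIN"}),
]
select_subj_after_aux(sentence)
```

After the call, « you » keeps the single label @SUBJ and points to the auxiliary at index 1; the other words are untouched.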
6. II.2 Syntex
(Bourigault 2007)
Tagged Text → Verb Chunk → non recursive NP → non recursive SP → Object, Subject → Prep Attachment → OUTPUT
Resource : endogenous and exogenous subcategorization information
Verb Chunk : he will leave
non recursive NP : the man ; happy tree friends
non recursive SP (prepositional phrase) : from Paris
Object, Subject : This is the man
Prep Attachment : This is the man from Paris
7. II.2 Syntex (Bourigault 2007)
One module per syntactic relation
Each module processes the sentence from left to right.
Like the Non-Projective Dependency Parser, the rules :
  establish dependency relations between words
  are contextual
  use syntactic links established by preceding rules
The identification of a dependency link is formulated as a « path » to be followed through the existing links and grammatical categories, from governor to dependent or from dependent to governor
Ambiguous relations : selection of potential governors + disambiguation with probabilities
Those who think they are interested in water supply must vote
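A minimal Python sketch of the two-step treatment of ambiguous relations (candidate governors first, then a probabilistic choice). The function name, the example and the probability values are invented for illustration; they are not taken from Syntex.

```python
def choose_governor(candidates, probability):
    """Pick the most probable governor among the candidates selected by the rules."""
    return max(candidates, key=lambda governor: probability.get(governor, 0.0))


# Attachment of "from" in "This is the man from Paris".
candidates = ["is", "man"]
# Invented probabilities that "from" attaches to each candidate governor.
probability = {"is": 0.2, "man": 0.8}
governor = choose_governor(candidates, probability)
```

With these made-up values the preposition is attached to « man ».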
8. II.3 Cass (Abney 1990,1995)
Tagged Text → CHUNK FILTER → CLAUSE FILTER → PARSE FILTER → OUTPUT
Resource : subcategorization information
CHUNK FILTER (NP filter, Chunk filter)
  Non recursive chunks
  Internal structure remains ambiguous
  [NP the happy tree friends]
  [VP will leave]
  [SP from [NP the happy tree friends]]
CLAUSE FILTER
  Raw Clause filter
    Subject-predicate relation
    Beginning and end of simplex clauses
    [SUBJ This] [PRED is] [NP the man] [SP from Paris]
  Clause Repair filter
    Repair if no subject-predicate relation
PARSE FILTER
  Assembles recursive structures
  [[This] [is] [NP the man] [SP from Paris]]
9. II.3 Cass (Abney 1990,1995)
Each filter uses transducers :
PP → (Prep|To)+ (NP|Vbg)
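A minimal Python sketch of what matching this pattern over a sequence of category symbols could look like. It is a hand-written matcher for the single pattern (Prep|To)+ (NP|Vbg), not Cass's transducer machinery; the function names and the output format are assumptions.

```python
def match_pp(tags, start):
    """Try to match (Prep|To)+ (NP|Vbg) at position start.
    Return the end index (exclusive) of the match, or None."""
    i = start
    while i < len(tags) and tags[i] in ("Prep", "To"):
        i += 1
    if i > start and i < len(tags) and tags[i] in ("NP", "Vbg"):
        return i + 1
    return None


def chunk_pp(tags):
    """Single deterministic left-to-right pass: wrap each match in a PP chunk,
    copy every other symbol unchanged."""
    out, i = [], 0
    while i < len(tags):
        end = match_pp(tags, i)
        if end is None:
            out.append(tags[i])
            i += 1
        else:
            out.append(("PP", tags[i:end]))
            i = end
    return out


chunks = chunk_pp(["NP", "VP", "Prep", "NP"])
```

Here the last two symbols are grouped into one PP chunk, and the symbols before them pass through unchanged.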
Use of repair (also used in Syntex and NPDP, but less explicitly) :
Each filter makes a decision (determinism), the safest one in case of ambiguity
« ambiguity is not propagated downstream »
« repair consists in directly modifying erroneous structure without regard to the history of computation that produced the structure »
« when errors become apparent downstream, the parser attempts to repair them »
10. II.3 Cass (Abney 1990,1995)
Example of repair
In South Australia beds of boulders were deposited …
Erroneous structure output from the Chunk filter
[SP In [NP South Australia beds]] [SP of [NP boulders]] [VP were deposited]
Raw Clause filter : no subject is found
Repair filter tries to find a subject by modifying the structure
[SP In [NP South Australia]] [NP-SUBJ beds] [SP of [NP boulders]] [VP were deposited]
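A minimal Python sketch of this kind of repair on the example above. The heuristic used here (detach the last word of the first multi-word NP embedded in an SP and promote it to subject) is a simplification chosen for illustration, not Abney's actual repair procedure; the chunk representation and function names are assumptions.

```python
def has_subject(chunks):
    """A top-level NP before the first VP counts as a subject candidate."""
    for label, _ in chunks:
        if label == "VP":
            return False
        if label in ("NP", "NP-SUBJ"):
            return True
    return False


def repair_missing_subject(chunks):
    """If no subject precedes the VP, modify the chunk structure directly:
    split the first multi-word NP found inside an SP and promote its last word."""
    if has_subject(chunks):
        return chunks
    repaired, done = [], False
    for label, content in chunks:
        if label == "VP":
            done = True                      # only look to the left of the verb
        if not done and label == "SP":
            prep, (np_label, words) = content
            if len(words) > 1:
                repaired.append(("SP", (prep, (np_label, words[:-1]))))
                repaired.append(("NP-SUBJ", words[-1:]))
                done = True
                continue
        repaired.append((label, content))
    return repaired


chunk_output = [
    ("SP", ("In", ("NP", ["South", "Australia", "beds"]))),
    ("SP", ("of", ("NP", ["boulders"]))),
    ("VP", ["were", "deposited"]),
]
repaired = repair_missing_subject(chunk_output)
```

The repair works on the structure alone, without consulting how the chunk filter produced it.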
11. III. Common features : Incrementality
The parsing task is divided into subtasks
  reduces the overall complexity of the main task :
  « factoring the problem into a sequence of small, well defined questions » (Abney 1990).
The sentence is parsed in several phases, each phase producing an intermediate structure
  allows each phase to use the syntactic information left by the preceding phase
  « the level of abstraction produced during the 1st phase (...) facilitates the description of deeper syntactic relations » (Aït-Mokhtar et al. 1997)
  ease of maintenance
  problem of circularity : it is difficult to choose in what order the relations should be identified (Bourigault 2007)
12. III. Common features : determinism and repair
Each parsing phase yields one solution.
In case of ambiguity, the safest choice is made, even if
some higher level information is needed
ambiguity is not propagated downstream
Most regular errors can be repaired later on
≠ parallelism, backtracking
« The salient performance is not errors vs no errors,
but the tradeoff between speed and error rate » (Abney
1990)
13. III. Common features : no syntactic theory
Difference between :
  the theoretical study of the syntactic structures of language
  the automatic identification of grammatical relations in real-world texts
Difficulties in automatic syntactic analysis :
  lack of knowledge (semantics/pragmatics for disambiguation)
  deviation from the norm of the language
  errors of preceding processing steps
Use of common grammatical knowledge
Hours of corpus observation to find clues for automatic identification
14. III. Common features : implicit grammatical knowledge
Bipartite architecture :
  Lexical information
  Recognition routines
No independent declaration of grammatical knowledge
Difficult / impossible to set apart :
  Grammatical knowledge
  Non grammar-based heuristics
No linguist/computer scientist job separation
Need for both linguistic and programming know-how
A condition for scalability and robustness
15. IV. Example : the subject relation in Syntex
The identification of the subject relation is formulated as a
«path» through the already identified grammatical relations :
start from tensed verb
move to the left
stop when you encounter an ungoverned Noun
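Figure : « the cost of technology takes time to shrink »
Tags : the/Det cost/Noun of/Prep technology/Noun takes/Verb time/Noun to/Prep shrink/Noun
Links : DET, PREP, NOMPREP, SUJ, OBJ, NOMPREP
« takes » is the TENSED VERB ; « cost » is the SUBJECT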
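A minimal Python sketch of the three steps listed on this slide (start from the tensed verb, move to the left, stop at an ungoverned Noun), applied to « the cost of technology takes time to shrink ». The Word class, the function name and the governor indices are assumptions: the links DET, PREP and NOMPREP are read here as « cost » governing « the » and « of », and « of » governing « technology ». Words to the right of the verb are left out because the leftward path never visits them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    form: str
    tag: str                         # e.g. "Det", "Noun", "Prep", "Verb"
    governor: Optional[int] = None   # index of the governing word, if already linked
    tensed: bool = False


def find_subject(words, verb_index):
    """Start from the tensed verb, move left, stop at the first ungoverned Noun."""
    for i in range(verb_index - 1, -1, -1):
        if words[i].tag == "Noun" and words[i].governor is None:
            return i
    return None


words = [
    Word("the", "Det", governor=1),          # DET link
    Word("cost", "Noun"),                    # ungoverned
    Word("of", "Prep", governor=1),          # PREP link
    Word("technology", "Noun", governor=2),  # NOMPREP link
    Word("takes", "Verb", tensed=True),
]
verb_index = next(i for i, w in enumerate(words) if w.tensed)
subject_index = find_subject(words, verb_index)
```

« technology » is skipped because it is already governed by the preposition, so the path stops at « cost ».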
16. IV. Using existing links
The subject might be far from the tensed verb
Many configurations are possible :
Initiatives leading to cessation of smoking in workplaces are adopted (islands : Gerund, PP, PP)
Those who think they are interested in water supply must vote. (islands : Clause, Clause, PP)
No reference to the war, or to the alliance, should remain (islands : PP, Conj, PP)
Existing links form dependency islands (~ syntagms or isolated words)
Following up the islands until a reasonable subject is found makes it possible to find subjects without describing all possible configurations or doing too much computing
17. IV. Ambiguities
Many persons have died in Darfur since the conflict began
A person sitting on the death row since the age of 16 is
not the same as before.
Many adults believe education equates intelligence.
Those who think they are interested in water supply must
vote.
When to stop? When to follow up ? When to repair ?
18. IV. Path decomposition
At each island, a decision is made by a dedicated sub-module (one type of island = one sub-module) :
  follow up to the island on the left
  stop and identify a subject
    without repair
    with repair
  change path direction
    to the right
    to any other position in the sentence
  call other module
  stop and return failure
Decisions are encoded as if-then rules that may test :
  local and non-local context : lemmas, morphosyntactic tags, links, presence of commas…
  specific information left by other modules : encountered tags, activated modules…
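A minimal Python sketch of the « one type of island = one sub-module » dispatch. Only three of the decisions listed above are modelled (follow up to the left, stop with a subject, stop with failure); repair, change of direction and module calls are left out. All names, the island segmentation and the trivial rules inside each sub-module are assumptions, not Syntex code.

```python
from enum import Enum, auto


class Decision(Enum):
    FOLLOW_LEFT = auto()
    STOP_SUBJECT = auto()
    FAIL = auto()


def pp_module(island, context):
    return Decision.FOLLOW_LEFT      # assumed rule: skip over a PP island


def gerund_module(island, context):
    return Decision.FOLLOW_LEFT      # assumed rule: skip over a gerund island


def np_module(island, context):
    return Decision.STOP_SUBJECT     # assumed rule: an NP island is the subject


SUBMODULES = {"PP": pp_module, "Gerund": gerund_module, "NP": np_module}


def find_subject_island(islands):
    """Walk the islands leftwards from the tensed verb; each island type
    is handled by its own sub-module, which returns a decision."""
    context = {"visited": []}        # information left for later sub-modules
    for position in range(len(islands) - 1, -1, -1):
        kind, text = islands[position]
        module = SUBMODULES.get(kind)
        decision = module((kind, text), context) if module else Decision.FAIL
        context["visited"].append(kind)
        if decision is Decision.STOP_SUBJECT:
            return text
        if decision is Decision.FAIL:
            return None
    return None


# Islands to the left of "are adopted" (segmentation assumed for illustration).
islands = [
    ("NP", "Initiatives"),
    ("Gerund", "leading to cessation"),
    ("PP", "of smoking"),
    ("PP", "in workplaces"),
]
subject = find_subject_island(islands)
```

Adding a new island type then amounts to registering one more function in SUBMODULES.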
19. IV. Path example : following up
Korea who we believe to have WMD is safe from us.
Link : SUBJ
Islands : Clause, PP
Modules : PP module, Clause module
Pattern : _ RelPron [[SUJ Pron] Verb]
20. IV. Path example : repair
Many adults believe education equates intelligence.
Links : OBJ, SUBJ
Island : Clause
Module : Clause module
Pattern : ## [[SUBJ NP] Verb [[OBJ [SUBJ NP] Verb ] Verb OBJ NP]]
21. IV. Path example : sub-module call
On the walls were scarlet banners
Link : SUBJ
Islands : PP, NP
Modules : PP module, Wall module, InvertedSubject module
Pattern : ## [PP] Verb _
22. IV. Path example : change path
On the contrary, war hysteria was continuous and deliberate, and acts such as looting, murdering, the slaughters of prisoners, were considered as normal.
Modules and islands in the figure : PP module, Clause module, Conj, Adj, Noun, PP module, Commas module
Commas module : +2.6 recall, -0.07 precision
All three political Parties at the federal level, and certainly at the
provincial level in different sections, have parity clauses.
Although no directive was ever issued, it was known that the chief of the Department intended that within one week no reference to the war with Eurasia, or to the alliance, should remain
23. IV. Evaluation on Susanne Corpus
            Tensed verb identification   Subject identification       Correct tensed verb
            (TreeTagger)                 (if tensed verb correct)     and correct subject
precision   94.87                        94.56                        89.51
recall      89.76                        90.84                        81.53
F-measure   92.24                        92.66                        85.33
Only shallow subjects are evaluated.
Subject relations such as the following are not identified or evaluated :
I’ve never seen the dog hiding his bones.
She wants me to clean my shoes
The book is read by the boy
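The F-measure figures reported on this slide are consistent with the usual harmonic mean of precision and recall, as the short Python check below suggests. The formula is the standard balanced F-measure; the slide does not state which variant was used, so this is an assumption, and the dictionary keys are only descriptive labels.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported on the slide, in percent.
scores = {
    "tensed verb identification": (94.87, 89.76),
    "subject identification": (94.56, 90.84),
    "tensed verb and subject": (89.51, 81.53),
}
f_scores = {name: round(f_measure(p, r), 2) for name, (p, r) in scores.items()}
```

Rounded to two decimals, the three values match the reported 92.24, 92.66 and 85.33.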
24. Bibliography
Abney (1990) : « Rapid Incremental Parsing with Repair », Proceedings of the 6th New OED Conference, University of Waterloo, Waterloo, Ontario.
Abney (1995) : « Partial Parsing via Finite-State Cascades », Natural Language Engineering, Cambridge University Press.
www.sfs.uni-tuebingen.de/~abney/StevenAbney.html#cass
Aït-Mokhtar et al. (1997) : « Incremental Finite-State Parsing », Proceedings of ANLP-97, Washington.
Bourigault (2007) : Syntex, analyseur syntaxique opérationnel, Habilitation à Diriger des Recherches thesis, Université Toulouse - Le Mirail.
w3.univ-tlse2.fr/erss/textes/pagespersos/bourigault/syntex.html
Charniak (2000) : « A Maximum-Entropy-Inspired Parser », Proceedings of the North American Chapter of the Association for Computational Linguistics, pp. 132–139.
http://www.cfilt.iitb.ac.in/~anupama/charniak.php
Lin (1995) : « Dependency-based Evaluation of Minipar », Proceedings of IJCAI.
http://www.cs.ualberta.ca/~lindek/downloads.htm
Tapanainen & Järvinen (1997) : « A Dependency Parser for English », Technical Reports, No. TR-1, Department of General Linguistics, University of Helsinki, March 1997.
www.connexor.com
TreeTagger : http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
Evaluation Corpus : ftp://ftp.cs.umanitoba.ca/pub/lindek/depeval