Annotation of anaphora
   and coreference for
 automatic processing
                   Constantin Orasan

Research Group in Computational Linguistics
          University of Wolverhampton, UK
             http://www.wlv.ac.uk/~in6093/
Why use corpora in
anaphora/coreference resolution
   In this talk, corpora are discussed for:
       Training machine learning systems
       Testing anaphora/coreference resolution
        algorithms


   Annotation:
       Linguistically motivated: tries to capture certain
        phenomena (usually focuses on anaphora)
       Application motivated: limited relations are
        encoded (usually focuses on coreference)
Structure
1.   Background information
2.   The MUC annotation for coreference
3.   The NP4E corpus
4.   Event coreference and NP coreference
5.   Conclusions
Anaphora and anaphora
resolution
   cohesion which points back to some previous item
    (Halliday and Hasan, 1976)
   the pointing back word is called an anaphor, the
    entity to which it refers or for which it stands is its
    antecedent (Mitkov, 2002)
   The process of determining the antecedent of an
    anaphor is called anaphora resolution (Mitkov,
    2002)
   Anaphora resolution can be seen as a process of
    filling empty or almost empty expressions with
    information from other expressions
Coreference and coreference
resolution
   When the anaphor refers to an antecedent
    and when both have the same referent in the real
    world, they are termed coreferential (Mitkov,
    2002)

   The process of establishing which referential
    NPs point to the same discourse entity is
    called coreference resolution
Examples of anaphoric
expressions from Mitkov (2002)
Sophia Loren says she will always be grateful to
Bono. The actress revealed that the U2 singer helped
her calm down when she became scared by a
thunderstorm while travelling on a plane.

Coreferential chains:
      {Sophia Loren, she, the actress, her, she},
      {Bono, the U2 singer},
      {a thunderstorm},
      {a plane}
Examples of anaphoric
expressions from Mitkov (2002)
   Indirect anaphora: Although the store had only just
    opened, the food hall was busy and there were long
    queues at the tills.
   Identity-of-sense anaphora: The man who gave
    his paycheck to his wife was wiser than the man who
    gave it to his mistress
   Verb and adverb anaphora: Stephanie sang, as
    did Mike
   Bound anaphora: Every man has his own agenda
   Cataphora: The elevator opened for him on the 14th
    floor, and Alec stepped out quickly.
Anaphora vs. coreference
   There are many anaphoric expressions which are
    not coreferential
   Most of the coreferential expressions are anaphoric
    (Sophia Loren, the actress)
   Coreferential expressions that may be or may not be
    anaphoric
       (Sophia Loren, the actress Sophia Loren) – not anaphoric?
       (the actress Sophia Loren, Sophia Loren) – anaphoric
   Coreferential expressions which are not anaphoric
    (Sophia Loren, Sophia Loren)
   Cross-document coreference is not anaphora
Substitution test
   To determine whether two entities are
    coreferential, the substitution test is used
       Sophia Loren says she will always be grateful to
        Bono  Sophia Loren says Sophia Loren will
        always be grateful to Bono.
       John has his own agenda  John has John’s own
        agenda
       Every man has his own agenda.  Every man has
        every man’s own agenda. ??
Anaphora & coreference in
computational linguistics
   are important preprocessing steps for a wide
    range of applications such as machine
    translation, information extraction, automatic
    summarisation, etc.

   From a linguistic perspective the expressions
    processed are rather limited
Developing annotated corpora for
   computational linguistics
     A simple, reliable annotation task
     Producing a CL-oriented resource
     Capturing the most widespread and best-understood anaphoric
      relation:

         identity-of-reference direct nominal anaphora

            Elements corresponding to the same discourse entity
            Including identity, synonymy, generalisation and specialisation
            Referring expressions (pronouns, definite NPs, or proper names)
             have non-pronominal NP antecedents in the preceding text / dialogue
Terminology
   Entity = an object or set of objects in the world
   Entities can have types (ACE requires annotating
    only certain types, e.g. person, location,
    organisation, etc.)
   Mention = a textual reference to an entity (usually an
    NP)
   Direct anaphora = identity of head, generalisation,
    specialisation or synonymy
   Indirect anaphora = part-of, set-membership
Annotation of anaphora/
coreference
   In general the process can be split into two
    stages:
       Identification and annotation of elements involved
        in a relation (annotation of mentions)
       Identification and annotation of relations between
        mentions
   The two stages can be done together or
    separately
Annotation of mentions
   Annotate everything?
   Singletons should be annotated because they
    influence evaluation measures (except the MUC
    score – see the sketch below)
   If everything is to be annotated, it is easier to do
    this annotation in the first instance

   Syntactic annotation can be useful
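
Why singletons leave the MUC score untouched can be seen from the
link-based measure of Vilain et al. (1995). Below is a minimal Python
sketch, not the official scorer: a key chain of size one contributes
zero links to both the numerator and the denominator of recall (and
symmetrically for precision).

def muc_recall(key_chains, response_chains):
    """key_chains, response_chains: lists of sets of mention ids."""
    numerator = denominator = 0
    for chain in key_chains:
        # partition the key chain by the response: one part per response
        # chain it intersects, plus one part per unresolved mention
        parts = [r & chain for r in response_chains if r & chain]
        covered = set().union(*parts) if parts else set()
        n_parts = len(parts) + len(chain - covered)
        numerator += len(chain) - n_parts
        denominator += len(chain) - 1      # 0 for a singleton chain
    return numerator / denominator if denominator else 0.0

key = [{"m1", "m2", "m3"}, {"m4"}]         # {m4} is a singleton
resp = [{"m1", "m2"}, {"m3"}]
print(muc_recall(key, resp))               # 0.5, with or without {m4}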
Annotation of relations
   Each annotation scheme defines a set of
    relations that should be covered
   The relations normally happen between
    mentions/markables
MUC annotation (Hirschman
1997)
    Defined in the coreference resolution task at MUC
    The criteria used to define the task were:
    1.   Support for the MUC information extraction tasks;
    2.   Ability to achieve good (ca. 95%) interannotator
         agreement;
    3.   Ability to mark text up quickly (and therefore, cheaply);
    4.   Desire to create a corpus for research on coreference and
         discourse phenomena, independent of the MUC extraction
         task.
    These criteria are not necessarily consistent with
     each other
MUC annotation scheme
   Marks only relations between noun phrases
   Does not mark relations between verbs,
    clauses, etc.
   Marks only IDENTITY which defines
    equivalence classes and is not directional
   Values which are clearly distinct should not
    be allowed to be in the same class e.g. the
    stock price fell from $4.02 to $3.85
MUC annotation scheme (II)
   SGML used
    <COREF ID="100">Lawson Mardon Group Ltd.</COREF> said
    <COREF ID="101" TYPE="IDENT" REF="100">it </COREF> ...
   Attributes:
       ID a unique identifier for a mention
       REF indicates links between mentions
       TYPE the type of link (only IDENT supported)
       MIN the minimum span to be identified in order to be
        considered correct in automatic evaluation
       STATUS=“OPT” to indicate optional elements to be
        resolved (see the parsing sketch below)
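
As an illustration of how this markup encodes chains, here is a minimal
Python sketch (the helper names are mine, not part of the MUC tooling)
that follows REF links back to the first mention of each entity and
groups mentions into equivalence classes:

import re

TAG = re.compile(r'<COREF\b([^>]*)>')
ATTR = re.compile(r'(\w+)="([^"]*)"')

def coref_chains(sgml):
    """Group COREF mentions into chains by following REF links."""
    parent = {}
    for tag in TAG.finditer(sgml):
        attrs = dict(ATTR.findall(tag.group(1)))
        parent.setdefault(attrs["ID"], attrs["ID"])
        if "REF" in attrs:
            parent[attrs["ID"]] = attrs["REF"]   # anaphor -> antecedent

    def first(x):                                # walk back to chain start
        while parent.get(x, x) != x:
            x = parent[x]
        return x

    chains = {}
    for mention in parent:
        chains.setdefault(first(mention), set()).add(mention)
    return list(chains.values())

text = ('<COREF ID="100">Lawson Mardon Group Ltd.</COREF> said '
        '<COREF ID="101" TYPE="IDENT" REF="100">it</COREF>')
print(coref_chains(text))                        # [{'100', '101'}]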
MUC annotation scheme –
markables (III)
   NPs (including dates, percentages and
    currency expressions), personal and
    demonstrative pronouns
   Interrogative “wh-” NPs are not marked
    (Which engine would you like to use?)
   The extent of the markable is quite loosely
    defined (it must include the head, but should
    really include the maximal NP, with the MIN
    attribute having the head as its value)
MUC annotation scheme –
relations
   Basic coreference
   Bound anaphors
   Apposition
    <COREF ID="1" MIN="Julius Caesar">Julius Caesar, <COREF
    ID="2" REF="1" MIN="emperor" TYPE="IDENT"> the/a well-known
    emperor,</COREF></COREF>
   Predicate nominals
    <COREF ID="1" MIN="Julius Caesar">Julius Caesar</COREF> is
    <COREF ID="2" REF="1" MIN="emperor" TYPE="IDENT">the/a
    well-known emperor</COREF> who …
   For appositions and predicate nominals there needs
    to be certainty (is, not may be)
MUC annotation - criticism
   Van Deemter and Kibble (1999) criticised the
    MUC scheme because it goes beyond
    annotation of coreference as it is commonly
    understood because:
       It marks quantifying NPs (e.g. every man, most
        people)
       Marks indefinite NPs
        Henry Higgins, who was formerly sales director of Sudsy
        Soaps, became president of Dreamy Detergents.
       and, one can argue, not in a consistent manner
        the stock price fell from $4.02 to $3.85
MUC annotation & corpus
   Despite criticism the MUC annotation provided a
    starting point for standardising
    anaphora/coreference annotation schemes
   Designed to mark only a small set of expressions
    and relations which can be tackled by computers
   Was proposed in the context of a competition 
    comparison of results and backing of an
    organisation
   The corpus is available
Corpus of technical manuals
(Mitkov et al. 2000)
   A corpus of technical manuals annotated with
    a MUC-7 like annotation scheme
   Annotates only identity of reference between
    direct nominal referential expressions
   Less interesting from linguistic perspective,
    but used to develop automatic methods
Corpus of technical manuals
(Mitkov et al. 2000)
   Full coreferential chains are annotated
   All the mentions are annotated regardless of
    whether they are singletons or not
   The relation of coreference is considered fully
    transitive (see the sketch after this slide)
   The MUC annotation scheme was used but
    the guidelines were not adapted completely
   CLinkA (Orasan 2000) was used for
    annotation
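
Since the relation is treated as fully transitive, pairwise
anaphor–antecedent links collapse into equivalence classes. The
union-find sketch below (the helper and the toy links are illustrative,
not CLinkA internals) shows that step:

def build_chains(links):
    """links: (anaphor, antecedent) pairs; returns equivalence classes."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]        # path halving
            x = parent[x]
        return x

    for anaphor, antecedent in links:
        ra, rb = find(anaphor), find(antecedent)
        if ra != rb:
            parent[ra] = rb                      # merge the two chains

    chains = {}
    for mention in parent:
        chains.setdefault(find(mention), []).append(mention)
    return list(chains.values())

links = [("she", "Sophia Loren"), ("the actress", "Sophia Loren"),
         ("the U2 singer", "Bono")]
print(build_chains(links))   # one Sophia Loren chain, one Bono chain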
Annotation guidelines
   The starting point was the MUC-7 annotation
    guidelines, but:
       Stricter about what identity of meaning means
        (e.g. we do not consider indefinite appositional
        phrases coreferential with the phrases that
        contain them)
       An indefinite NP cannot refer to anything
       Gerunds are not considered mentions
   Add missing phenomena:
       V [NP1] as [NP2] – not coreferential
        [use [a diagonal linear gradient] as [the map]] – is not
        coreferential
        [elect [John Prescott] as [Prime Minister]], – is not coreferential

       …if [[ an NTSC Ø ]i or [ PAL monitor ]j]k is being used…[ The
        NTSC monitor ]l… - not coreferential

        …[[the pixels’ luminance]i or [Ø Ø saturation]j ]k is important…
        [The pixels’ saturation]j - coreferential
Annotation guidelines – short version

Do:
(i) annotate identity-of-reference direct nominal anaphora
(ii) annotate definite descriptions which stand in any of the identity,
synonymy, generalisation, specialisation, or copula relationships with
an antecedent
(iii) annotate definite NPs in a copula relation as coreferential
(iv) annotate definite appositional and bracketed phrases as
coreferential with the NP of which they are a part
(v) annotate NPs at all levels from base to complex and co-ordinated
(vi) familiarise yourself with the use of unfamiliar, highly specialised
terminology by searching through the text

Do not:
(i) annotate indefinite predicate nominals that are linked to other
elements by perception verbs as coreferential with those elements
(ii) annotate identity-of-sense anaphora
(iii) annotate indirect anaphora between markables
(iv) annotate cross-document coreference
(v) annotate indefinite NPs in copula relations with other NPs as
coreferential
(vi) annotate non-permanent or “potential” coreference between markables
(vii) annotate bound anaphors
(viii) consider gerunds of any kind markable
(ix) annotate anaphora over disjoined antecedents
(x) consider first or second person pronouns markable
Speed of annotation (Mitkov et
al. 2000)
   Speed of annotation in one hour:
       At the beginning while the guidelines were being created:
        assign 288 mentions to 220 entities covering on average
        2051 words in text
       After the annotators became used to the task and the
        guidelines finalised: assign 315 mentions to 250 entities
        covering on average 1411 words in text
   Fast track annotation for pronoun resolution in one
    hour: 113 pronouns, 944 candidates and 148
    antecedents, covering 10862 words
Speed of annotation (II)
   Most of the time during the annotation is
    spent identifying the mentions

   … existing annotation levels can prove very
    beneficial
Reasons for disagreements
   The process is tedious and requires high
    levels of concentration
   Two main reasons for disagreement:
       Unsteady references – mentions which may
        belong to different entities throughout the document
        (e.g. image, the window) – the automatic
        annotation option of the annotation tool may also
        mislead
       Specialised terminology
Improving annotation strategies
   Unsteady reference: Pre-annotation stage to clarify
    topic segments

   Domain knowledge: Pre-annotation stage to
    disambiguate unknown technical terminology

   ‘Master strategy’ combining individual
    approaches:
       Printing text prior to annotation - increases familiarity
       Two step process
       Taking note of troublesome cases to discuss later with others
       Annotating intensively vs sporadically
NP4E (Hasler et al. 2006)
   The goal was to develop annotation guidelines for
    NP and event coreference in newspaper articles
    about terrorism/security threats
   A small corpus annotated with NP and event
    coreference was produced
   An attempt to produce a more refined annotated
    resource than our previous corpus
   5 clusters of related documents in the domain were
    built, about 50,000 words
   http://clg.wlv.ac.uk/projects/NP4E/
NP coreference annotation
   Used the guidelines developed by Mitkov et al.
    (2000) as the starting point,
   but adapted them for our goals and texts
       All the mentions need to be annotated, both definite and
        indefinite NPs
       Introduced coref and ucoref tags to be able to deal with
        uncertainties
        [The government] will argue that… [[McVeigh] and [Nichols]] were [the
        masterminds of [the bombing plot]]

       Types of relations between an NP and its antecedent:
        identity, synonymy, generalisation, specialisation and
        other, but we do not annotate indirect anaphora
NP coreference annotation (II)
   Types of (coreference) relations we identify: NP, copular, apposition,
    bracketed text, speech pronoun and other
   Link to the first element of the chain in most of the cases for type NP
   For copular, apposition, bracketed text and speech pronouns (pronouns
    which occur in direct speech), the anaphor should be linked back to the
    nearest mention of the antecedent in the text (see the sketch after this slide)

   Do not annotate different readings of an NP as
    coreferential
    [A jobless Taiwanese journalist who commandeered [a Taiwan airliner] to [China]]…
    [China] ordered [[its] airports] to beef up [security]…
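
A hypothetical helper (function and argument names are mine, purely
illustrative) capturing the linking rule above: relations of type NP
link to the first element of the chain, while the other types link to
the nearest preceding mention of the antecedent.

def link_target(chain_positions, anaphor_pos, relation_type):
    """chain_positions: document offsets of earlier mentions in the chain."""
    if relation_type == "NP":
        return chain_positions[0]          # first element of the chain
    # copular, apposition, bracketed text, speech pronoun: nearest mention
    return max(p for p in chain_positions if p < anaphor_pos)

print(link_target([3, 17, 40], 52, "NP"))          # 3
print(link_target([3, 17, 40], 52, "apposition"))  # 40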
[Screenshot: annotation of NPs using PALinkA. WordNet is consulted about
the relation between the two NPs and the user can override WordNet’s
decision; in the example, the plane is marked as coreferential with
The aircraft.]
Issues arising during the NP
annotation
   The antecedent of the pronoun we in direct speech can
    be linked to: the individual speaker, a group
    represented by the speaker or nothing
   General concepts such as violence, terror, terrorism,
    police, etc. are sometimes used in a general sense,
    so it is difficult to know whether to annotate and how
   Sometimes difficult to decide the best indefinite NP
    as an antecedent

    …the man detained for hijacking [a Taiwanese airliner]… Liu
    forced [a Far East Air Transport domestic plane]… Beijing
    returned [the Boeing 757]…
Issues arising during the NP
annotation (II)
   Mark relative pronouns/clauses and link them
    to the nearest mention
    Chinese officials were tightlipped whether [Liu Shan-chung,
    45, [who] is in custody in China's southeastern city of
    Xiamen], would be prosecuted or repatriated to Taiwan.


   The type of relation is sometimes difficult to
    establish without the help of WordNet (have
    ident, non-ident)
Annotation of event coreference
    Event = a thing that happens or takes place, a single
     specific occurrence, either instantaneous or
     ongoing.
    Used the ACE annotation guidelines as the starting
     point
    Events marked: ATTACK, DEFEND, INJURE, DIE,
     CONTACT
    Identify the trigger = the best word to represent the
     event
    Triggers: verbs, nouns, adjectives and pronouns
    {The blast} {killed} 168 people…and {injured}
     hundreds more… (ATTACK: noun, DIE: verb,
     INJURE: verb)
Event triggers
   ATTACK: attack events are physical actions which aim to cause harm
    or damage to things or people: attack, bomb, shoot, blast, war, fighting,
    clashes, throw, hit, hold, spent.
   DEFEND: defend events are events where people or organisations
    defend something, usually against someone or something else:
    sheltering, reinforcing, running, prepared.
   INJURE: injure events involve people experiencing physical harm:
    injure, hurt, maim, paralyse, wounded, ailing.
   DIE: die events happen when a person’s life ends: kill, dead, suicide,
    fatal, assassinate, died, death.
   CONTACT: contact events occur when two or more parties
    communicate in order to try and resolve something, reach an
    agreement or better relations between different sides etc. This category
    includes demands, threats and promises made by parties during
    negotiations: meeting, talks, summit, met, negotiations, conference,
    called, talked, phoned, discussed, promised, threatened, agree, reject,
    demand.
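
A toy sketch of how such lists could drive a first-pass trigger
spotter; the lexicon below is only an excerpt of the lists above, and
real use would lemmatise tokens first so that, e.g., killed matches
kill:

TRIGGERS = {
    "ATTACK":  {"attack", "bomb", "shoot", "blast", "war", "fighting"},
    "DEFEND":  {"sheltering", "reinforcing", "running", "prepared"},
    "INJURE":  {"injure", "hurt", "maim", "wounded"},
    "DIE":     {"kill", "dead", "suicide", "died", "death"},
    "CONTACT": {"meeting", "talks", "summit", "negotiations"},
}

def spot_triggers(tokens):
    """Yield (token, event type) for every candidate trigger word."""
    for tok in tokens:
        for etype, lexicon in TRIGGERS.items():
            if tok.lower() in lexicon:
                yield tok, etype

# without lemmatisation only "blast" is found; "killed" would need "kill"
print(list(spot_triggers("The blast killed 168 people".split())))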
Annotation of event coreference
   Two stage process: identify the triggers and then
    link them
   Link the arguments of an event to NPs annotated in the
    previous stage
   The arguments are event dependent (e.g.
    ATTACKER, MEANS, VICTIM, CAUSE, AGENT,
    TOPIC and MEDIUM)
   The arguments should be linked to NPs from the
    same sentence or nearby sentences if they are
    necessary to disambiguate the event
   Also TENSE, MODALITY and POLARITY need to
    be indicated (see the sketch below)
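
A hypothetical record of what an annotator fills in for one event
mention (the class and field names are mine; the roles and attributes
follow the slides):

from dataclasses import dataclass, field

@dataclass
class EventMention:
    trigger: str                 # best word representing the event
    etype: str                   # ATTACK, DEFEND, INJURE, DIE or CONTACT
    arguments: dict = field(default_factory=dict)  # role -> id of an annotated NP
    tense: str = ""              # e.g. past
    modality: str = ""           # e.g. asserted vs. promised/threatened
    polarity: str = "positive"   # positive or negative
    chain: str = ""              # event coreference chain this mention joins

blast = EventMention("blast", "ATTACK", {"TARGET": "np_12"}, tense="past")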
[Screenshot: annotation of an attack event using PALinkA. The event the
operation is annotated with TYPE: attack, TIME: Dec. 17, REF: stormed,
TARGET: the Japanese ambassador's residence in Lima (FACILITY),
ATTACKER: MRTA rebels (PERSON), PLACE: Lima (LOCATION).]
Issues with event annotation
   Very difficult annotation task
   At times it is difficult to decide the tense of an event
    in direct speech
   Whether to include demands, promises or threats in
    the CONTACT (or use them only as a signal of
    modality)
   Whether to make a distinction between
    speaker/hearer in CONTACT events (especially in
    the case of demands, promises or threats)
What do coreferential events indicate?
(Hasler and Orasan 2009)

   Starting point – do coreferential events have
    coreferential arguments?
   We had a corpus of about 12,000 words
    annotated with event coreference
   344 unique event mentions
   106 coreferential chains with 2 to 10 triggers
   238 events referred to by only one trigger
Zaire planes bombs rebels as U.N. seeks war’s end.
a293   TRIGGER: bombs
       ATTACKER: –
       MEANS: Zaire planes: ID=0: CHAIN=0: VEHICLE
       PLACE: –
       TARGET: rebels: ID=1: CHAIN=1: PERSON
       TIME: –

       Zaire said on Monday its warplanes were bombing three key rebel-held towns in its eastern
       border provinces and that the raids would increase in intensity.
a333   TRIGGER: bombing
       ATTACKER: Zaire: ID=44: CHAIN=5: ORGANISATION
       MEANS: its warplanes: ID=46: CHAIN=46: VEHICLE
       PLACE: three key rebel-held towns in its eastern border provinces: ID=48:
       CHAIN=14: LOCATION
       TARGET: three key rebel-held towns in its eastern border provinces: ID=48:
       CHAIN=14: LOCATION
       TIME: Monday: ID=45: CHAIN=7

       “Since this morning the FAZ (Zaire army) has been bombing Bukavu, Shabunda and
       Walikale”, said a defence ministry statement in the capital Kinshasa.
a334   TRIGGER: bombing
       ATTACKER: the FAZ (Zaire army): ID=53: CHAIN=53: ORGANISATION
       MEANS: –
       PLACE: Bukavu, Shabunda and Walikale: ID=55: CHAIN=14: LOCATION
       TARGET: Bukavu, Shabunda and Walikale: ID=55: CHAIN=14: LOCATION
       TIME: this morning: ID=52: CHAIN=52
Referential relations between
arguments
   104 chains considered:
       22 (21.15%) contained only coreferential NPs
       23 (22.12%) contained only non-coreferential NPs
       9 chains ignored
       50 (48.07%) contain a mixture of coreferential and
        non-coreferential NPs


   If indirect anaphora is not annotated, 70% of
    chains are affected
ID     TRIGGER                       ARGUMENT: AGENT(S)
c389   an emergency summit           the leaders of both nations: ID=20:
                                     CHAIN=20: PERS

c397   the two-hour closed meeting   they: ID=24: CHAIN=20: PERS

c408   the summit                    Fujimori: ID=60: CHAIN=32: PERS
                                     Hashimoto: ID=58:CHAIN=40:PERS

c409   the summit                    Fujimori: ID=60: CHAIN=32: PERS
                                     Hashimoto: ID=58: CHAIN=40: PERS

c418   the summit                    rebels: ID=110: CHAIN=14: PERS

c432   the summit                    he: ID=170: CHAIN=40: PERS
Identity of sense
   There are cases where even though the strings are
    the same we do not have identity of reference: at
    least nine people and nine confirmed dead
   Hundred, at least 500 people, the first group of at
    least 500 people, but probably more than that and
    the 500

   It can be argued that events of INJURE, DIE and
    DEFEND with such parameters are not
    coreferential, but the ATTACK events that cause
    them are.
at least nine people were killed and up to 37 wounded
i343   TRIGGER: wounded
       AGENT: the FAZ (Zaire army): ID=53: CHAIN=53: ORG
       VICTIM: up to 37: ID=66: CHAIN=66: PERSON
       CAUSE: –
       PLACE: Bukavu: ID=70: CHAIN=17: LOCATION
       TIME: Monday: ID=69: CHAIN=7

       there are nine confirmed dead and 37 wounded
i346   TRIGGER: wounded
       AGENT: –
       VICTIM: 37 wounded: ID=86: CHAIN=78: PERSON
       CAUSE: –
       PLACE: –
       TIME: –
Missing slots
   Coreference between events can be established
    even if many slots are not filled in:
       Peru’s Fujimori says hostage talks still young.
       ...the President said talks to free them were still in their
        preliminary phase.
       ”We cannot predict how many more weeks these
        discussions will take.”
       ”We are still at a preliminary stage in the conversations.”
       Fujimori said he hoped Nestor Cerpa would personally take
        part in the talks when they resume on Monday at 11am.
Contact events
   Involve 2 or more parties
   The parties are usually introduced bit by bit
    and event coreference is necessary to
    establish all the participants
   Cross-document event coreference is
    sometimes necessary to collect all the
    participants
Conclusions
   The guidelines should not be used directly; the
    characteristics of the texts should be considered
   For automatic processing, a MUC-like annotation may provide a good
    trade-off between the linguistic detail encoded and the difficulty of
    annotation
   However, quite often this annotation is not enough for more
    advanced processing
   Have a more refined notion of “identity”

    Coreference is a scalar relation holding between two (or more)
    linguistic expressions that refer to DEs considered to be at the
    same granularity level relevant to the pragmatic purpose.
    (Recasens, Hovy and Marti, forthcoming)
Thank you!
References
   van Deemter, Kees and Rodger Kibble, (1999). What is coreference and what
    should coreference annotation be? In Amit Bagga, Breck Baldwin, and S Shelton
    (eds.), Proceedings of ACL workshop on Coreference and Its Applications.
    Maryland.
   Halliday, M. A. K., and Hasan, R. (1976).Cohesion in English. London: Longman.
   Hasler, L. and Orasan, C. (2009). Do coreferential arguments make event mentions
    coreferential? Proceedings of the 7th Discourse Anaphora and Anaphor Resolution
    Colloquium (DAARC 2009), Goa, India, 5-6 November 2009, 151-163
   Hasler, L., Orasan, C. and Naumann, K. (2006) NPs for Events: Experiments in
    coreference annotation. In Proceedings of the 5th Language Resources and
    Evaluation Conference (LREC2006). Genoa, Italy, 24-26 May, 1167-1172
   Hirschman, L. (1997). MUC-7 coreference task definition. Version 3.0
   Mitkov, R. (2002): Anaphora Resolution. Longman
   Mitkov, R., Evans, R., Orasan, C., Barbu, C., Jones L. and Sotirova, V. (2000)
    Coreference and anaphora: developing annotating tools, annotated resources and
     annotation strategies. Proceedings of the Discourse Anaphora and Anaphora
     Resolution Colloquium (DAARC'2000), 49-58. Lancaster, UK

  • 53. References  van Deemter, Kees and Rodger Kibble, (1999). What is coreference and what should coreference annotation be? In Amit Bagga, Breck Baldwin, and S Shelton (eds.), Proceedings of ACL workshop on Coreference and Its Applications. Maryland.  Halliday, M. A. K., and Hasan, R. (1976).Cohesion in English. London: Longman.  Hasler, L. and Orasan. C (2009). Do coreferential arguments make event mentions coreferential? Proceedings of the 7th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC 2009), Goa, India, 5-6 November 2009, 151-163  Hasler, L., Orasan, C. and Naumann, K. (2006) NPs for Events: Experiments in coreference annotation. In Proceedings of the 5th Language Resources and Evaluation Conference (LREC2006). Genoa, Italy, 24-26 May, 1167-1172  Hirschman, L. (1997). MUC-7 coreference task definition. Version 3.0  Mitkov, R. (2002): Anaphora Resolution. Longman  Mitkov, R., Evans, R., Orasan, C., Barbu, C., Jones L. and Sotirova, V. (2000) Coreference and anaphora: developing annotating tools, annotated resources and annotation strategies Proceedings of the Discourse Anaphora and Anaphora Resolution Colloquium (DAARC'2000)), 49-58. Lancaster, UK