2. BioNLP'09 Task 1
Events in abstracts
Given: gene and gene products (proteins)
Wanted: events
− type
− trigger
− participant(s)
− cause (if applicable)
3. Example
"I kappa B/MAD3 masks the nuclear localization
signal of NFkappa B p65 and requires the
transactivation domain to inhibit NFkappa B
p65 DNA binding."
Event: negative regulation
Trigger: masks
Theme1: the first p65
Cause: MAD3
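The example event can be written down as a small record. This is a minimal sketch only: the class and field names are hypothetical and do not follow the official shared-task annotation format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    """One Task 1 event (illustrative field names, not the official format)."""
    event_type: str                                   # e.g. "Negative_regulation"
    trigger: str                                      # trigger word(s) in the sentence
    themes: List[str] = field(default_factory=list)   # participant(s)
    cause: Optional[str] = None                       # only if applicable


# The event from the example sentence above.
example = Event(
    event_type="Negative_regulation",
    trigger="masks",
    themes=["NFkappa B p65 (first mention)"],
    cause="MAD3",
)
```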
5. Training and Test Data
Training data: 800 abstracts
Development data: 150 abstracts
Test data: 260 abstracts
6. The System
Trigger recognition
− Methods similar to NER
− Classification
Argument detection
− Graph edge selection
− Classification
Semantic postprocessing
− Rule-based
7. Trigger Detection
Token labelling (one class for each event type and one negative class)
92% of triggers are single token
− Adjacent tokens form a trigger if they appear in the
training data
Triggers that share a token:
− Combined class: gene expression/pos regulation
A graph node for each trigger
− Not duplicated just yet
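A rough sketch of the labelling scheme: every token gets one label, and a token that triggers two event types gets a combined class. The function name, the `neg` label, and the `/` separator are assumptions made for illustration.

```python
from typing import Dict, List, Set

NEGATIVE = "neg"  # assumed name for the "no trigger" class


def token_labels(tokens: List[str], trigger_types: Dict[int, Set[str]]) -> List[str]:
    """Assign exactly one label per token.

    trigger_types maps a token index to the set of event types it triggers.
    Tokens shared by several triggers get one combined class.
    """
    labels = []
    for i in range(len(tokens)):
        types = trigger_types.get(i, set())
        labels.append("/".join(sorted(types)) if types else NEGATIVE)
    return labels


labels = token_labels(
    ["overexpression", "of", "p65"],
    {0: {"Gene_expression", "Positive_regulation"}},
)
```

The combined label is split back into separate event nodes only later, which matches the "not duplicated just yet" remark.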
8. Classification SVM
Token features
− Binary: capitalisation, presence of punctuation or
numeric characters
− Stem
− Character bigrams and trigrams
− Token is a known trigger in the training data
− All the above for linear and dependency
“neighbours”
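A minimal sketch of the token features listed above, as a sparse feature dictionary. The feature names are made up, and the stem feature is left out because the slides do not say which stemmer was used.

```python
import string
from typing import Dict, List, Set


def char_ngrams(text: str, n: int) -> List[str]:
    """All contiguous character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_features(token: str, known_triggers: Set[str]) -> Dict[str, float]:
    """Binary features plus character bigrams and trigrams for one token."""
    feats = {
        "has_capital": float(any(c.isupper() for c in token)),
        "has_punct": float(any(c in string.punctuation for c in token)),
        "has_digit": float(any(c.isdigit() for c in token)),
        "known_trigger": float(token.lower() in known_triggers),
    }
    for n in (2, 3):
        for gram in char_ngrams(token.lower(), n):
            feats[f"char{n}={gram}"] = 1.0
    return feats


features = token_features("NF-kappaB", known_triggers={"masks", "inhibit"})
```

The same function would be applied to the linear and dependency neighbours of the token, with a prefix on the feature names to keep them apart.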
9. Classification SVM
Frequency features
− # of named entities
In sentence
In a linear window around the token
Bag-of-words count of token texts in the sentence (?)
Dependency chains
− Up to depth of 3 from the token are constructed
− At each depth both token and frequency features
− Plus dep type and sequence of dep types in chain
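One possible way to enumerate dependency chains up to depth 3 from a token. The graph representation and function name are assumptions; the slides do not describe the actual data structures.

```python
from typing import Dict, List, Tuple

# token index -> list of (dependency type, neighbouring token index)
Graph = Dict[int, List[Tuple[str, int]]]


def dependency_chains(graph: Graph, start: int, max_depth: int = 3):
    """Return (token path, dependency-type sequence) for each chain of
    length 1..max_depth that starts at `start` and visits no token twice."""
    chains = []

    def walk(path: List[int], deps: List[str]) -> None:
        if deps:
            chains.append((tuple(path), tuple(deps)))
        if len(deps) == max_depth:
            return
        for dep, nxt in graph.get(path[-1], []):
            if nxt not in path:
                walk(path + [nxt], deps + [dep])

    walk([start], [])
    return chains


# A made-up chain of five tokens linked by four dependencies.
toy_graph = {0: [("nsubj", 1)], 1: [("prep_of", 2)], 2: [("nn", 3)], 3: [("amod", 4)]}
chains = dependency_chains(toy_graph, start=0)
```

Token and frequency features would then be computed for the last token of each chain, together with the dependency-type sequence.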
10. Two SVMs
“Somewhat” different feature sets
Combined weighted results
“This design should be considered an artifact of
the time-constrained, experiment-driven
development of the system rather than a
principled design”
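The slides do not say how the two results are weighted. As a sketch under that caveat, a simple linear interpolation of per-class scores could look like this (the function name and `weight_a` parameter are hypothetical):

```python
from typing import Dict


def combine_scores(scores_a: Dict[str, float],
                   scores_b: Dict[str, float],
                   weight_a: float = 0.5) -> Dict[str, float]:
    """Weighted sum of the per-class scores of two classifiers.

    A class missing from one classifier counts as score 0.0 there.
    """
    classes = set(scores_a) | set(scores_b)
    return {
        c: weight_a * scores_a.get(c, 0.0) + (1.0 - weight_a) * scores_b.get(c, 0.0)
        for c in classes
    }


combined = combine_scores({"neg": 1.0, "Binding": 0.0},
                          {"neg": 0.0, "Binding": 1.0},
                          weight_a=0.25)
```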
11. Precision/Recall tradeoff
Undetected trigger → undetected event
All triggers have events in the training data →
bias towards reporting an event for all detected
triggers
Adjust P/R explicitly
− multiply the classifier score of the negative class by β
− find β experimentally
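A sketch of the β adjustment with a plain grid search. It assumes non-negative class scores (so β < 1 makes the negative class less likely to win) and tunes β on trigger-level F-score; the objective actually used is not stated on the slide, and all names here are illustrative.

```python
from typing import Dict, List, Sequence


def predict_with_beta(scores: Dict[str, float], beta: float, negative: str = "neg") -> str:
    """Scale the negative-class score by beta, then pick the best class."""
    adjusted = {c: (s * beta if c == negative else s) for c, s in scores.items()}
    return max(adjusted, key=adjusted.get)


def f_score(gold: Sequence[str], pred: Sequence[str], negative: str = "neg") -> float:
    """F1 over the non-negative classes."""
    tp = sum(1 for g, p in zip(gold, pred) if p != negative and p == g)
    if tp == 0:
        return 0.0
    precision = tp / sum(1 for p in pred if p != negative)
    recall = tp / sum(1 for g in gold if g != negative)
    return 2 * precision * recall / (precision + recall)


def best_beta(rows: List[Dict[str, float]], gold: Sequence[str],
              candidates: Sequence[float]) -> float:
    """Pick the candidate beta with the highest F-score on held-out data."""
    return max(candidates,
               key=lambda b: f_score(gold, [predict_with_beta(r, b) for r in rows]))


rows = [{"neg": 0.6, "Binding": 0.5}, {"neg": 0.9, "Binding": 0.2}]
gold = ["Binding", "neg"]
chosen = best_beta(rows, gold, candidates=[1.0, 0.7, 0.1])
```

In this toy data β = 1.0 misses the trigger, β = 0.1 over-predicts, and the middle value recovers the gold labels.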
12. Edge Detection
Multiclass SVM
All potential directed edges
− Event node to named entity
− Event node to event node (nested event)
− Labelled as theme, cause, or negative
Each edge is predicted independently
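The candidate set can be enumerated as below; each pair would then be classified independently as theme, cause, or negative. This is a sketch with hypothetical node identifiers.

```python
from itertools import permutations
from typing import List, Tuple


def candidate_edges(event_nodes: List[str], entity_nodes: List[str]) -> List[Tuple[str, str]]:
    """All potential directed edges: event -> named entity, and
    event -> other event (for nested events). Entities have no outgoing edges."""
    edges = [(ev, ent) for ev in event_nodes for ent in entity_nodes]
    edges += list(permutations(event_nodes, 2))
    return edges


edges = candidate_edges(["e1", "e2"], ["p1"])
```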
13. Feature Set – Central Concept
Shortest undirected path of syntactic dependencies in the
Stanford scheme parse of the sentence.
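A breadth-first search is one simple way to get such a path. The sketch below ignores edge direction during the search; the toy parse is invented for illustration and is not real parser output.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Tuple


def shortest_path(edges: Iterable[Tuple[str, str, str]],
                  source: str, target: str) -> Optional[List[str]]:
    """Shortest path between two tokens, treating dependencies as undirected.

    edges holds (governor, dependent, dependency type) triples.
    Returns the token sequence, or None if the tokens are not connected.
    """
    neighbours: Dict[str, List[str]] = {}
    for gov, dep, _ in edges:
        neighbours.setdefault(gov, []).append(dep)
        neighbours.setdefault(dep, []).append(gov)

    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


# Invented dependencies, loosely based on the example sentence.
toy_parse = [("masks", "MAD3", "nsubj"), ("masks", "signal", "dobj"),
             ("signal", "p65", "prep_of")]
path = shortest_path(toy_parse, "MAD3", "p65")
```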
14. Feature Set
Token text, POS, entity/event class,
dependency (subject)
N-grams: merging the attributes of 2–4
− Consecutive tokens
− Consecutive dependencies
− Each token and two neighbouring dependencies
− Each dependency and two neighbouring tokens
− One bigram showing direction
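A sketch of the first n-gram type only, reading the range as two to four consecutive tokens on the shortest path (an assumption). The attribute merged here is the token text; the `_` separator and function name are made up.

```python
from typing import List, Sequence


def path_ngrams(attributes: Sequence[str], sizes: Sequence[int] = (2, 3, 4)) -> List[str]:
    """Merge one attribute of consecutive path elements into n-gram features."""
    grams = []
    for n in sizes:
        for i in range(len(attributes) - n + 1):
            grams.append("_".join(attributes[i:i + n]))
    return grams


grams = path_ngrams(["MAD3", "masks", "signal", "p65"])
```

The same function could be called with POS tags, entity/event classes, or dependency types in place of token texts.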
15. Other Features
Individual component features
Semantic node features
Frequency features
16. Semantic Post-Processing
Duplicate nodes
− Same class and same trigger
− Combined trigger
Remove improper arguments
Remove directed cycles by removing the
weakest link
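Cycle removal might be sketched as follows. Taking "weakest" to mean the edge with the lowest classifier confidence is an assumption, as are the function names.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Edge = Tuple[str, str]


def find_cycle(edges: Iterable[Edge]) -> Optional[List[Edge]]:
    """Return the edges of one directed cycle, or None if the graph is acyclic."""
    graph: Dict[str, List[str]] = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    state: Dict[str, str] = {}  # "open" = on the current DFS stack, "done" = finished

    def visit(node: str, stack: List[str]) -> Optional[List[Edge]]:
        state[node] = "open"
        stack.append(node)
        for nxt in graph.get(node, []):
            if state.get(nxt) == "open":
                cycle = stack[stack.index(nxt):] + [nxt]
                return list(zip(cycle, cycle[1:]))
            if nxt not in state:
                found = visit(nxt, stack)
                if found:
                    return found
        stack.pop()
        state[node] = "done"
        return None

    for node in list(graph):
        if node not in state:
            found = visit(node, [])
            if found:
                return found
    return None


def break_cycles(edge_scores: Dict[Edge, float]) -> Dict[Edge, float]:
    """Repeatedly drop the lowest-scoring edge of a cycle until none remain."""
    kept = dict(edge_scores)
    while True:
        cycle = find_cycle(kept)
        if cycle is None:
            return kept
        del kept[min(cycle, key=kept.get)]


scored = {("e1", "e2"): 0.9, ("e2", "e3"): 0.4, ("e3", "e1"): 0.7, ("e1", "p1"): 0.8}
acyclic = break_cycles(scored)
```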
17. Duplicating Event Nodes
Task restrictions
− No two causes,
− must have theme,
− etc.
Several heuristics
xth first dependency in shortest path from the event for binding