Learning to Extract Relations for Protein Annotation

Jee-Hyub Kim (a), Alex Mitchell (b,c), Teresa K. Attwood (b,c),
Melanie Hilario (a)
(a) University of Geneva
(b) University of Manchester
(c) European Bioinformatics Institute
ISMB/ECCB2007, 23 July 2007

Contents
Introduction
Related Work
Problem and Approach
Methods
Experimental Results
Conclusion

Introduction
Protein Annotation
Definition
Given a protein sequence or name, describe the protein
with all relevant information
Two main methods
Sequence analysis (e.g., BLAST, CLUSTALW)
Literature analysis
Traditionally done manually by human annotators; automation
is needed.
Text mining has been used for automation.
Two main tasks in text-mining
Information retrieval (IR): to retrieve relevant documents
Information extraction (IE): to extract certain pieces of
information from text for pre-defined entities and their
relations.

Introduction
Adaptive Information Extraction
IE is domain-specific.
Generally, a system developed for one domain cannot be reused
for a new domain.
Developing IE systems requires a significant amount of
domain knowledge.
Developing IE rules
Defining relations
These are the two main bottlenecks.
Many new domains still need their own IE systems,
e.g., cell cycle, tissue specificity.

Related Work
IE System Development: Developing IE Rules
Knowledge engineering (KE) based approach
Knowledge engineers write hand-crafted rules with the help
of domain experts (e.g., biologists).
Not scalable
Machine learning (ML) based approach
ML has been used to learn IE rules in order to increase
their robustness and coverage.
Training data: annotated corpora, pre-labelled corpora, or raw corpora.
Labor required to develop IE systems
KE > ML (annotated corpora > pre-labelled corpora > raw
corpora)

Related Work
IE System Development: Defining Relations
All previous IE systems assume that the relations to be
extracted are already defined.
It is hard to specify precisely all possible relations to
extract, especially in complex and dynamically evolving
domains (e.g., the biological domain).
Positioning (ML-based approach)
Corpora Relations Pre-defined Not defined
Annotated Soderland (1999), Freitag (2000), Califf and Mooney (2003)
Pre-labeled Riloff (1996) Our work
Raw Hasegawa et al. (2004) Collier (1996)

Our Problem
Goal
To alleviate the burden of developing IE systems for
biologists
Problem definition
Given relevant sentences that describe protein X in terms
of any topic Y and irrelevant sentences
Learn to extract relations for protein annotation
Two sub-problems
What to extract
How to extract

Bottom-up Approach
Figure: From sentences to relations

Methods
System Architecture

Methods
Analyzing Sentences: MBSP
Memory-Based Shallow Parser (MBSP)
Developed by Walter Daelemans
Extended with named entity taggers and SVO relation finder
Provides various types of information: POS, SVO, NE, etc.
Adapted to the biological domain on the basis of the GENIA
corpus
97.6% accuracy on POS tagging
71.0% accuracy on protein named entity recognition

Methods
Analyzing Sentences: Example
Example
INPUT: Examples of this are the RNA-binding protein containing
the RNA-binding domain (RBD) ...
OUTPUT:
Chunk                         Syntactic    Semantic  SVO relation
Examples                      noun phrase  -         subject of 'are'
of                            preposition  -         -
this                          noun phrase  -         -
are                           verb phrase  -         -
the RNA-binding protein       noun phrase  protein   subject of 'contain'
containing                    verb phrase  -         -
the RNA-binding domain (RBD)  noun phrase  domain    object of 'contain'
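One way to hold such parser output in memory is sketched below. The `Chunk` record and the `arguments_of` helper are hypothetical names chosen for illustration; they are not part of MBSP, and the rows are copied by hand from the example rather than produced by a parser.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    """One analyzed chunk (hypothetical record, not an MBSP type)."""
    text: str
    syntactic: str                  # e.g. "noun phrase", "verb phrase"
    semantic: Optional[str] = None  # e.g. "protein", "domain"
    svo: Optional[str] = None       # e.g. "subject of 'contain'"


# Rows transcribed from the example output.
analyzed = [
    Chunk("Examples", "noun phrase", svo="subject of 'are'"),
    Chunk("of", "preposition"),
    Chunk("this", "noun phrase"),
    Chunk("are", "verb phrase"),
    Chunk("the RNA-binding protein", "noun phrase", "protein",
          "subject of 'contain'"),
    Chunk("containing", "verb phrase"),
    Chunk("the RNA-binding domain (RBD)", "noun phrase", "domain",
          "object of 'contain'"),
]


def arguments_of(verb, chunks):
    """Return the (subject, object) chunk texts attached to a verb lemma."""
    subj = next((c.text for c in chunks
                 if c.svo == f"subject of '{verb}'"), None)
    obj = next((c.text for c in chunks
                if c.svo == f"object of '{verb}'"), None)
    return subj, obj


print(arguments_of("contain", analyzed))
```

Keeping the syntactic, semantic, and SVO fields side by side makes it straightforward to test rule conditions that combine them.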

Methods
Learning IE Rules: Inductive Logic Programming (ILP)
We applied ILP to learn IE rules.
ILP is an ML algorithm that induces rules from examples.
Outputs are readable and interpretable by domain experts.
Can deal with relational information (e.g., parse trees).
Problem Setting
B ∧ H ⊨ E
Given B (Background Knowledge) and E (Examples), find H
(Hypothesis).
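The problem setting can be illustrated with a deliberately simplified, propositional sketch. It assumes each example is a set of ground facts and each rule is a set of required facts; real ILP systems work with first-order clauses and background knowledge, so this is only an analogy. The fact names are invented for illustration.

```python
def covers(rule, facts):
    """A rule (set of required literals) covers an example if all hold."""
    return rule <= facts


def check_hypothesis(hypothesis, positives, negatives):
    """Return (complete, consistent) for a list of rules.

    complete:   every positive example is covered by some rule
    consistent: no negative example is covered by any rule
    """
    complete = all(any(covers(r, e) for r in hypothesis) for e in positives)
    consistent = not any(covers(r, e) for r in hypothesis for e in negatives)
    return complete, consistent


# Toy examples (invented): facts describing analyzed sentences.
positives = [
    {"subj(protein)", "verb(contain)", "dobj(domain)"},
    {"subj(protein)", "verb(contain)", "dobj(domain)", "prep(in)"},
]
negatives = [
    {"subj(protein)", "verb(express)", "prep(in)"},
]
hypothesis = [{"verb(contain)", "dobj(domain)"}]

print(check_hypothesis(hypothesis, positives, negatives))
```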

Methods
Learning IE Rules: Representation
B (Background Knowledge)
Linguistic heuristics (for single-slot IE pattern)
e.g., <subj> verb, verb <dobj>, verb preposition <np>, noun
prep <np>, etc.
Sentence descriptions (i.e., analyzed sentences)
E (Examples)
Positive and negative examples (i.e., relevant and irrelevant
sentences)
H (Hypothesis)
A set of IE rules

Methods
Learning IE Rules: Generalization
Figure: From a sentence to an IE rule

Methods
Learning IE Rules: Step 1

Methods
WRAcc(Rule) = coverage(Rule) ∗ (accuracy(Rule) − accuracy(Head ← true))
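A minimal sketch of this score, assuming the usual reading of its terms: coverage is the fraction of all examples the rule covers, accuracy is the fraction of covered examples that are positive, and accuracy(Head ← true) is the fraction of positives in the whole set. The function name and the counts in the usage line are made up for illustration.

```python
def wracc(covered_pos, covered_neg, total_pos, total_neg):
    """Weighted relative accuracy of a rule from example counts."""
    total = total_pos + total_neg
    covered = covered_pos + covered_neg
    if total == 0 or covered == 0:
        return 0.0  # a rule that covers nothing gets a neutral score
    coverage = covered / total             # coverage(Rule)
    accuracy = covered_pos / covered       # accuracy(Rule)
    default_accuracy = total_pos / total   # accuracy(Head <- true)
    return coverage * (accuracy - default_accuracy)


# Invented counts: the rule covers 8 positives and 2 negatives
# out of 50 positives and 50 negatives.
score = wracc(covered_pos=8, covered_neg=2, total_pos=50, total_neg=50)
print(round(score, 3))
```

A rule scores above zero only when it is more accurate than always predicting the positive class, and the coverage factor penalizes rules that are accurate on very few examples.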

Methods
Selecting IE Rules
Why is this step necessary?
Rules are learned to classify sentences, not to extract
information.
As a result, spurious IE rules are learned in the previous step
and need to be filtered out by domain experts.
Rules are presented to users together with supporting information.
Example
RULE: <subj:*> vp:contain & vp:contain <dobj:domain> [9, 0.9]
S: Myocilin is a secreted glycoprotein that forms multimers and contains a
leucine zipper and an olfactomedin domain.

Methods
Transforming Rules
Once IE rules are selected, they are transformed into relations.
Mapping between IE rules and relations
IE rule        Relation
extract        argument
trigger        relation name
syntactic tag  argument position
Example
<subj:*>[X] vp:contain & vp:contain <dobj:domain>[Y] → contain(X,Y)
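The mapping can be sketched as a small string transformation. The regular expressions below assume the rule notation used on these slides (`<tag:class>[Var]` for extracts, `vp:verb` for the trigger) and handle verb triggers only; `rule_to_relation` is a hypothetical helper, not the authors' implementation. Following the examples elsewhere in the slides, `subj` and `dobj` become positional arguments and any other tag is kept as a label.

```python
import re

SLOT = re.compile(r"<(\w+):([^>]+)>\[(\w+)\]")  # <tag:class>[Var]
TRIGGER = re.compile(r"\bvp:(\w+)")             # verb trigger


def rule_to_relation(rule):
    """Turn an IE rule string into a relation signature string."""
    trigger = TRIGGER.search(rule)
    if trigger is None:
        raise ValueError("no verb trigger in rule")
    # extract -> argument, keyed by its syntactic tag
    args = {tag: var for tag, _cls, var in SLOT.findall(rule)}
    ordered = []
    for tag in ("subj", "dobj"):  # syntactic tag -> argument position
        if tag in args:
            ordered.append(args.pop(tag))
    # remaining tags (e.g. prepositions) stay as labelled arguments
    ordered += [f"{tag}:{var}" for tag, var in args.items()]
    return f"{trigger.group(1)}({','.join(ordered)})"


print(rule_to_relation("<subj:*>[X] vp:contain & vp:contain <dobj:domain>[Y]"))
```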

Methods
Grouping Rules
Post-processing rules
Pattern                                             Example
A verb (active form) B                              A activate B
B be ed-participle by A                             B be activated by A
nominal form (with suffix -tion) of verb of B by A  activation of B by A
A be nominal form (with suffix -or) of verb of B    A is an activator of B
A be ... that verb (active form) B                  A is ... that activates B
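As an illustration of how these surface patterns could be folded into one relation, here is a toy sketch. It assumes single-token arguments and takes the verb's derived forms as explicit parameters instead of deriving them morphologically; `build_patterns` and `normalize` are hypothetical helpers.

```python
import re


def build_patterns(verb, participle, nominal, agent):
    """Compile the five surface patterns for one verb (toy version)."""
    return [re.compile(p) for p in (
        rf"(?P<A>\w+) {verb}s? (?P<B>\w+)",                       # A activates B
        rf"(?P<B>\w+) (?:is|are|be) {participle} by (?P<A>\w+)",  # passive
        rf"{nominal} of (?P<B>\w+) by (?P<A>\w+)",                # -tion nominal
        rf"(?P<A>\w+) (?:is|be) an? {agent} of (?P<B>\w+)",       # -or nominal
        rf"(?P<A>\w+) (?:is|be) .+ that {verb}s? (?P<B>\w+)",     # relative clause
    )]


def normalize(phrase, verb, patterns):
    """Map a phrase to verb(A,B) if any pattern matches, else None."""
    for pattern in patterns:
        match = pattern.fullmatch(phrase)
        if match:
            return f"{verb}({match.group('A')},{match.group('B')})"
    return None


PATTERNS = build_patterns("activate", "activated", "activation", "activator")
for phrase in ("X activates Y", "Y is activated by X", "activation of Y by X",
               "X is an activator of Y", "X is a kinase that activates Y"):
    print(phrase, "->", normalize(phrase, "activate", PATTERNS))
```

All five phrases map to the same relation, which is the effect the grouping step is after.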

Methods
Applying Rules to Extract Relations
Now, we have a set of IE rules for each relation.
IE rule and relation
RULE: <subj:protein>[X] vp:promote & vp:promote <dobj:disease>[Y]
TRIGGER: promote
RELATION: promote(X,Y)
Example
INPUT: Our data demonstrate that PKC beta II promotes colon cancer, at
least in part, through induction of Cox-2, suppression of TGF-beta
signaling, and establishment of a TGF-beta-resistant, hyperproliferative
state in the colonic epithelium.
OUTPUT: promote('PKC beta II','colon cancer')
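The matching step for this example can be sketched as follows. The chunk dictionaries are a hand-written stand-in for parser output (only the chunks relevant to the rule are listed), and `apply_rule` is a hypothetical helper that handles subject/direct-object rules only.

```python
def apply_rule(chunks, trigger, subj_class, dobj_class):
    """Apply a <subj:C1> vp:trigger <dobj:C2> rule to analyzed chunks.

    Returns the instantiated relation as a string, or None if the rule
    does not match. A class of "*" accepts any semantic class.
    """
    subj = obj = None
    for chunk in chunks:
        if chunk.get("verb") != trigger:
            continue  # chunk is not an argument of the trigger verb
        if chunk["role"] == "subj" and subj_class in ("*", chunk.get("sem")):
            subj = chunk["text"]
        elif chunk["role"] == "dobj" and dobj_class in ("*", chunk.get("sem")):
            obj = chunk["text"]
    if subj is None or obj is None:
        return None
    return f"{trigger}('{subj}','{obj}')"


# Hand-written analysis of the input sentence (assumed, not parser output).
chunks = [
    {"text": "Our data", "role": "subj", "verb": "demonstrate", "sem": None},
    {"text": "PKC beta II", "role": "subj", "verb": "promote", "sem": "protein"},
    {"text": "colon cancer", "role": "dobj", "verb": "promote", "sem": "disease"},
]

relation = apply_rule(chunks, "promote", "protein", "disease")
print(relation)
```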

Experiments
Our IE system was applied to the annotation of PRINTS (a protein
family database).
Development Corpora
Topic      Positives  Negatives  Class distribution (positives-negatives)
Disease    777        1403       36-64%
Function   1268       2625       33-67%
Structure  1159       1750       40-60%
80% for training, 20% for test
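The class distribution column follows directly from the sentence counts, as this short check shows. The dictionary simply restates the table; `class_distribution` is an illustrative helper name.

```python
# Positive and negative sentence counts per topic, from the table.
corpora = {
    "Disease": (777, 1403),
    "Function": (1268, 2625),
    "Structure": (1159, 1750),
}


def class_distribution(positives, negatives):
    """Return the rounded percentages of positives and negatives."""
    total = positives + negatives
    return round(100 * positives / total), round(100 * negatives / total)


for topic, (pos, neg) in corpora.items():
    pos_pct, neg_pct = class_distribution(pos, neg)
    print(f"{topic}: {pos_pct}-{neg_pct}%")
```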

Evaluation
All the extracted relations are manually evaluated by
domain experts.
Results
Topic      Learned rules  Selected rules  Relations  Precision (%)  Recall (%)  F1-measure (%)
Disease    55             32              21         75             18.3        29.4
Function   125            64              23         66.3           15.1        24.6
Structure  146            76              20         85.3           61          71.1
For comparison, recent work on extracting regulatory gene/protein
networks reported an F1-measure of 44% (Saric et al., 2006).
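The F1-measure is the harmonic mean of precision and recall, so the last column of the results can be recomputed from the two columns before it. The dictionary below restates the reported precision and recall; `f1` is an illustrative helper.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (%) per topic, from the results table.
results = {
    "Disease": (75.0, 18.3),
    "Function": (66.3, 15.1),
    "Structure": (85.3, 61.0),
}

for topic, (p, r) in results.items():
    print(f"{topic}: F1 = {f1(p, r):.1f}")
```

Because the harmonic mean is dominated by the smaller value, the low recall on Disease and Function pulls their F1 well below their precision.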

Protein Annotation Examples
Discovered relations
Disease: be_associated, is_a, be_mutated, be_caused, be_deleted, contribute, etc.
Function: induce, block, mediate, is_a, belong, act, etc.
Structure: contain, form, share, lack, bind, encode, be_conserved, etc.
Annotation example for protein NF-kappaB
be_implicated(,'NF-kappaB', in:'the pathogenesis')
regulate('IkappaBalpha', 'NF-kappaB'), activate('BCMA', 'NF-kappaB')
be_composed('NF-kappaB', of:'heterodimeric complexes')

Limitations
Failure to find inverse relation
S: Expression of uPAR in tumor extracts also inversely correlates with
prognosis in many forms of cancer.
RULE: np:expression <of:protein>[X] & vp:correlate <with:prognosis>[Y]
RELATION: correlate(expression('expression',of:'uPAR'), with:'prognosis')
Anaphora problem
S: Whereas the overall structure resembles that of the NF-kappaB
p50-DNA complex, pronounced differences are observed within the
'insert region'.
RULE: <subj:structure>[X] vp:resemble & vp:resemble <dobj:C>[Y]
RELATION: resemble('the overall structure','that')

Conclusion
Proposed a methodology for developing IE systems with
resources that can be provided by biologists.
Learned relations as well as IE rules without annotated
corpora.
Annotated proteins with structured information (i.e.,
predicate argument structure) in terms of any topic.
Validated the methodology over different topics (function,
structure, disease, cancer) in the biomedical domain.
Advantage: will alleviate the burden of developing IE
systems for users who have little or no formal IE training.

Appendix
For Further Reading
Mary Elaine Califf and Raymond J. Mooney.
Bottom-up relational learning of pattern matching rules for
information extraction.
Journal of Machine Learning Research, 4:177–210, 2003.
R. Collier.
Automatic template creation for information extraction, an
overview, 1996.
Dayne Freitag.
Machine learning for information extraction in informal
domains.
Machine Learning, 39(2/3):169–202, 2000.
Takaaki Hasegawa, Satoshi Sekine, and Ralph Grishman.
Discovering relations among named entities from large
corpora.
In ACL, pages 415–422, 2004.
Ellen Riloff.
Automatically generating extraction patterns from untagged
text.
In Proceedings of the Thirteenth National Conference on
Artificial Intelligence (AAAI-96), pages 1044–1049, 1996.
Jasmin Saric, Lars Juhl Jensen, Peer Bork, Rossitza
Ouzounova, and Isabel Rojas.
Extracting regulatory gene expression networks from
PubMed.
In ACL, pages 191–198, 2004.
Stephen Soderland.
Learning information extraction rules for semi-structured
and free text.
Machine Learning, 34(1-3):233–272, 1999.
