SlideShare ist ein Scribd-Unternehmen logo
1 von 32
Downloaden Sie, um offline zu lesen
Learning to Extract Relations
Learning to Extract Relations for Protein
Annotation
Jee-Hyub Kima, Alex Mitchellb,c, Teresa K. Attwoodb,c,
Melanie Hilarioa
aUniversity of Geneva
bUniversity of Manchster
cEuropean Bioinformatics Institute
ISMB/ECCB2007, 23 July 2007
1 / 32
Learning to Extract Relations
Contents
Introduction
Related Work
Problem and Approach
Methods
Experimental Results
Conclusion
2 / 32
Learning to Extract Relations
Introduction
Protein Annotation
DeïŹnition
Given a protein sequence or name, describe the protein
with all relevant information
Two main methods
Sequence analysis method (e.g. BLAST, CLUSTALW, etc.)
Literature analysis method
Traditionally, done manually by human annotators, and
automation needed
Text-mining has been used for automation.
Two main tasks in text-mining
Information retrieval (IR): to retrieve relevant documents
Information extraction (IE): to extract certain pieces of
information from text for pre-deïŹned entities and their
relations.
3 / 32
Learning to Extract Relations
Introduction
Adaptive Information Extraction
IE is domain-speciïŹc.
Generally, one developed system cannot be used for a new
domain.
Developing IE systems requires a signiïŹcant amount of
domain knowledge.
Developing IE rules
DeïŹning relations
These are two main bottlenecks.
Still many new domains need to develop their own IE
systems.
e.g., cell cycle, tissue speciïŹcity, etc.
4 / 32
Learning to Extract Relations
Related Work
IE System Development: Developing IE Rules
Kowledge engineering (KE) based approach
Knowledge engineers write hand-crafted rules with the help
of domain experts (e.g., biologists).
Not scalable
Machine learning (ML) based approach
To increase robustness and coverage of IE rules
ML has been used to learn IE rules.
Annotated corpora, pre-labelled corpora, raw corpora.
Labor required to develop IE systems
KE > ML (annotated corpora > pre-labelled corpora > raw
corpora)
5 / 32
Learning to Extract Relations
Related Work
IE System Development: DeïŹning Relations
All the previous IE systems assume relations to be
extracted are already deïŹned.
It is hard to specify precisely all possible relations to
extract, especially in complex and dynamically-evolving
domains (e.g., biological domain)
Positioning (ML-based approach)
Corpora  Relations Pre-deïŹned Not deïŹned
Annotated Soderland (1999), Freitag (2000),
Califf and Money (2003)
Pre-labeled Riloff (1996) Our work
Raw Hasegawa et al. (2004) Collier (1996)
6 / 32
Learning to Extract Relations
Problem and Approach
Our Problem
Goal
To alleviate the burden of developing IE systems for
biologists
Problem deïŹnition
Given relevant sentences that describe protein X in terms
of any topic Y and irrelevant sentences
Learn to extract relations for protein annotation
Two sub-problems
What to extract
How to extract
7 / 32
Learning to Extract Relations
Problem and Approach
Bottom-up Approach
Figure: From sentences to relations
8 / 32
Learning to Extract Relations
Methods
System Architecture
9 / 32
Learning to Extract Relations
Methods
Analyzing Sentences: MBSP
Memory-Based Shallow Parser (MBSP)
Developed by Walter Daelemans
Extended with named entity taggers and SVO relation ïŹnder
Provides various types of information: POS, SVO, NE, etc.
Adapted to the biological domain on the basis of the GENIA
corpus
97.6% accuracy on POS tagging
71.0% accuracy on protein named entity recognition
10 / 32
Learning to Extract Relations
Methods
Analyzing Sentences: Example
Example
INPUT: Examples of this are the RNA-binding protein containing
the RNA-binding domain (RBD) ...
OUTPUT:
Chunk Syntactic Semantic SVO relation
Examples noun phrase subject of ’are’
of preposition
this noun phrase
are verb phrase
the RNA-binding protein noun phrase protein subject of ’contain’
containing verb phrase
the RNA-binding domain (RBD) noun phrase domain object of ’contain’
11 / 32
Learning to Extract Relations
Methods
Learning IE Rules: Inductive Logic Programming (ILP)
We applied ILP to learn IE rules.
ILP is a ML algorithm that induces rules from examples.
Outputs are readable and interpretable by the domain
experts.
Can deal with relational information (e.g., parse trees).
Problem Setting
B ∧ H |= E
Given B (Background Knowledge) and E (Examples), ïŹnd H
(Hypothesis).
12 / 32
Learning to Extract Relations
Methods
Learning IE Rules: Representation
B (Background Knowledge)
Linguistic heuristics (for single-slot IE pattern)
e.g., <subj> verb, verb <dobj>, verb preposition <np>, noun
prep <np>, etc.
Sentence descriptions (ie., analyzed sentences)
E (Examples)
Positive and negative examples (ie., relevant and irrelevant
sentences)
H (Hypothesis)
A set of IE rules
13 / 32
Learning to Extract Relations
Methods
Learning IE Rules: Generalization
Figure: From a sentence to an IE rule
14 / 32
Learning to Extract Relations
Methods
Learning IE Rules: Step 1
Figure: From a sentence to an IE rule
15 / 32
Learning to Extract Relations
Methods
Learning IE Rules: Step 2
Figure: From a sentence to an IE rule
16 / 32
Learning to Extract Relations
Methods
Learning IE Rules: Step 3
Figure: From a sentence to an IE rule
17 / 32
Learning to Extract Relations
Methods
Learning IE Rules: Step 4
Figure: From a sentence to an IE rule
WRAcc(Rule) = coverage(Rule) ∗ (accuracy(Rule) − accuracy(Head ← true))
18 / 32
Learning to Extract Relations
Methods
Selecting IE Rules
Why is this step necessary?
Rules are learned to classify sentences. Not to extract
information.
Spurious IE rules are learned from the previous step.
Need to be ïŹltered out by domain experts.
Rules are provided to users with information.
Example
RULE: <subj:*> vp:contain & vp:contain <dobj:domain> [9, 0.9]
S: Myocilin is a secreted glycoprotein that forms multimers and contains a
leucine zipper and an olfactomedin domain.
19 / 32
Learning to Extract Relations
Methods
Transforming Rules
Once IE rules are selected, transformed into relations.
Mapping between IE rules and relations
IE Rules Relations
extract argument
trigger relation name
syntactic tag argument position
Example
<subj:*>[X] vp:contain & vp:contain <dobj:domain>[Y] → contain(X,Y)
20 / 32
Learning to Extract Relations
Methods
Grouping Rules
Post-processing rules
Pattern Example
A verb (active form) B A activate B
B be ed-participle by A B be activated by A
nominal form (with sufïŹx -tion) of verb of B by A activation of B by A
A be nominal form (with sufïŹx -or) of verb of B A is an activator of B
A be ... that verb (active form) B A is ... that activates B
21 / 32
Learning to Extract Relations
Methods
Applying Rules to Extract Relations
Now, we have a set of IE rules for each relation.
IE rule and relation
RULE: <subj:protein>[X] vp:promote & vp:promote <dobj:disease>[Y]
TRIGGER: promote
RELATION: promote(X,Y)
Example
INPUT: Our data demonstrate that PKC beta II promotes colon cancer, at
least in part, through induction of Cox-2, suppression of TGF-beta
signaling, and establishment of a TGF-beta-resistant, hyperproliferative
state in the colonic epithelium.
OUTPUT: promote(’PKC beta II’,’colon cancer’)
22 / 32
Learning to Extract Relations
Experimental Results
Experiments
Our IE system was applied for PRINTS (a protein family
database) annotation.
Development Corpora
Topic Positives Negatives Class Distribution
Disease 777 1403 36-64%
Function 1268 2625 33-67%
Structure 1159 1750 40-60%
80% for training, 20% for test
23 / 32
Learning to Extract Relations
Experimental Results
Evaluation
All the extracted relations are manually evaluated by
domain experts.
Results
Topic Learned Selected Relations Precision Recall F1-measure
rules rules
Disease 55 32 21 75 18.3 29.4
Function 125 64 23 66.3 15.1 24.6
Structure 146 76 20 85.3 61 71.1
cf. a recent work of extracting regulatory gene/protein
networks
F1-measure of 44% (Saric et al, 2006)
24 / 32
Learning to Extract Relations
Experimental Results
Protein Annotation Examples
Discovered relations
Disease be_associated, is_a, be_mutated, be_caused, be_deleted, contribute, etc.
Function induce, block, mediate, is_a, belong, act, etc.
Structure contain, form, share, lack, bind, encode, be_conserved, etc.
Annotation example for protein NF-kappaB
be_implicated(,’NF-kappaB’, in:’the pathogenesis’)
regulate(’IkappaBalpha’, ’NF-kappaB’), activate(’BCMA’, ’NF-kappaB’)
be_composed(’NF-kappaB’, of:’heterodimeric complexes’)
25 / 32
Learning to Extract Relations
Experimental Results
Limitations
Failure to ïŹnd inverse relation
S: Expression of uPAR in tumor extracts also inversely correlates with
prognosis in many forms of cancer.
RULE: np:expression <of:protein>[X] & vp:correlate <with:prognosis>[Y]
RELATION: correlate(expression(’expression’,of:’uPAR’), with:’prognosis’)
Anaphora problem
S: Whereas the overall structure resembles that of the NF-kappaB
p50-DNA complex , pronounced differences are observed within the ’
insert region ’.
RULE: <subj:structure>[X] vp:resemble & vp:resemble <dobj:C>[Y]
RELATION: resemble(’the overall structure’,’that’)
26 / 32
Learning to Extract Relations
Conclusion
Conclusion
Proposed a methodology for developing IE systems with
resources that can be provided by biologists.
Learned relations as well as IE rules without annotated
corpora.
Annotated proteins with structured information (i.e.,
predicate argument structure) in terms of any topic.
Validated the methodology over different topics (function,
structure, disease, cancer) in bio-medical domain.
Advantage: will alleviate the burden of developing IE
systems for users who have little or no formal IE training.
27 / 32
Thank you for your attention!
Learning to Extract Relations
Appendix
For Further Reading
For Further Reading I
Mary Elaine Califf and Raymond J. Mooney.
Bottom-up relational learning of pattern matching rules for
information extraction.
Journal of Machine Learning Research, 4:177–210, 2003.
R. Collier.
Automatic template creation for information extraction, an
overview, 1996.
Dayne Freitag.
Machine learning for information extraction in informal
domains.
Machine Learning, 39(2/3):169–202, 2000.
29 / 32
Learning to Extract Relations
Appendix
For Further Reading
For Further Reading II
Takaaki Hasegawa, Satoshi Sekine, and Ralph Grishman.
Discovering relations among named entities from large
corpora.
In ACL, pages 415–422, 2004.
Ellen Riloff.
Automatically generating extraction patterns from untagged
text.
In Proceedings of the Thirteenth National Conference on
ArtiïŹcial Intelligence (AAAI-96), pages 1044–1049, 1996.
30 / 32
Learning to Extract Relations
Appendix
For Further Reading
For Further Reading III
Jasmin Saric, Lars Juhl Jensen, Peer Bork, Rossitza
Ouzounova, and Isabel Rojas.
Extracting regulatory gene expression networks from
pubmed.
In ACL, pages 191–198, 2004.
Stephen Soderland.
Learning information extraction rules for semi-structured
and free text.
Machine Learning, 34(1-3):233–272, 1999.
31 / 32
[Sod99]
[Fre00]
[CM03]
[Ril96]
[HSG04]
[Col96]
[SJB+04]

Weitere Àhnliche Inhalte

Was ist angesagt?

Bio inspiring computing and its application in cheminformatics
Bio inspiring computing and its application in cheminformaticsBio inspiring computing and its application in cheminformatics
Bio inspiring computing and its application in cheminformaticsabdelazim Galal
 
Novelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articlesNovelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articlescsandit
 
Cheminformatics
CheminformaticsCheminformatics
Cheminformaticsbaoilleach
 
Brian Andrews Resume
Brian Andrews ResumeBrian Andrews Resume
Brian Andrews ResumeBrian Andrews
 
Cheminformatics
CheminformaticsCheminformatics
Cheminformaticsbaoilleach
 
A little more semantics goes a lot further!  Getting more out of Linked Data ...
A little more semantics goes a lot further!  Getting more out of Linked Data ...A little more semantics goes a lot further!  Getting more out of Linked Data ...
A little more semantics goes a lot further!  Getting more out of Linked Data ...Michel Dumontier
 
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCESSTATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCEScscpconf
 

Was ist angesagt? (9)

Bio inspiring computing and its application in cheminformatics
Bio inspiring computing and its application in cheminformaticsBio inspiring computing and its application in cheminformatics
Bio inspiring computing and its application in cheminformatics
 
D017422528
D017422528D017422528
D017422528
 
BEA12_sakaguchi
BEA12_sakaguchiBEA12_sakaguchi
BEA12_sakaguchi
 
Novelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articlesNovelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articles
 
Cheminformatics
CheminformaticsCheminformatics
Cheminformatics
 
Brian Andrews Resume
Brian Andrews ResumeBrian Andrews Resume
Brian Andrews Resume
 
Cheminformatics
CheminformaticsCheminformatics
Cheminformatics
 
A little more semantics goes a lot further!  Getting more out of Linked Data ...
A little more semantics goes a lot further!  Getting more out of Linked Data ...A little more semantics goes a lot further!  Getting more out of Linked Data ...
A little more semantics goes a lot further!  Getting more out of Linked Data ...
 
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCESSTATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
 

Andere mochten auch

From logistic regression to linear chain CRF
From logistic regression to linear chain CRFFrom logistic regression to linear chain CRF
From logistic regression to linear chain CRFDarren Yow-Bang Wang
 
Scalable Text Mining
Scalable Text MiningScalable Text Mining
Scalable Text MiningJee-Hyub Kim
 
Europe PubMed Central and Linked Data
Europe PubMed Central and Linked DataEurope PubMed Central and Linked Data
Europe PubMed Central and Linked DataJee-Hyub Kim
 
Literature Services Resource Description Framework
Literature Services Resource Description FrameworkLiterature Services Resource Description Framework
Literature Services Resource Description FrameworkJee-Hyub Kim
 
AMAR - Projeto Jardim OceĂąnico Presente
AMAR - Projeto Jardim OceĂąnico PresenteAMAR - Projeto Jardim OceĂąnico Presente
AMAR - Projeto Jardim OceĂąnico PresenteAmar Jardim OceĂąnico
 
Planificaciondesistemas er
Planificaciondesistemas erPlanificaciondesistemas er
Planificaciondesistemas erJorge Pong Ng Chong
 
3Com ACCB-100
3Com ACCB-1003Com ACCB-100
3Com ACCB-100savomir
 
3Com 08004E34R5B9
3Com 08004E34R5B93Com 08004E34R5B9
3Com 08004E34R5B9savomir
 
Kim kleps 10 most dangerous sports
Kim kleps 10 most dangerous sportsKim kleps 10 most dangerous sports
Kim kleps 10 most dangerous sportsKimKleps
 
Pembangunan mapan
Pembangunan mapanPembangunan mapan
Pembangunan mapanmila_famila
 
Forced marriage
Forced marriageForced marriage
Forced marriageNabeel Raza
 
Circuito electrico 11 2
Circuito electrico 11 2Circuito electrico 11 2
Circuito electrico 11 2sluciia06
 
The art of writing proper paragraphs
The art of writing proper paragraphsThe art of writing proper paragraphs
The art of writing proper paragraphsMarc Draijer
 

Andere mochten auch (15)

From logistic regression to linear chain CRF
From logistic regression to linear chain CRFFrom logistic regression to linear chain CRF
From logistic regression to linear chain CRF
 
Scalable Text Mining
Scalable Text MiningScalable Text Mining
Scalable Text Mining
 
Regularization
RegularizationRegularization
Regularization
 
Europe PubMed Central and Linked Data
Europe PubMed Central and Linked DataEurope PubMed Central and Linked Data
Europe PubMed Central and Linked Data
 
Literature Services Resource Description Framework
Literature Services Resource Description FrameworkLiterature Services Resource Description Framework
Literature Services Resource Description Framework
 
AMAR - Projeto Jardim OceĂąnico Presente
AMAR - Projeto Jardim OceĂąnico PresenteAMAR - Projeto Jardim OceĂąnico Presente
AMAR - Projeto Jardim OceĂąnico Presente
 
Planificaciondesistemas er
Planificaciondesistemas erPlanificaciondesistemas er
Planificaciondesistemas er
 
DNA Labs
DNA LabsDNA Labs
DNA Labs
 
3Com ACCB-100
3Com ACCB-1003Com ACCB-100
3Com ACCB-100
 
3Com 08004E34R5B9
3Com 08004E34R5B93Com 08004E34R5B9
3Com 08004E34R5B9
 
Kim kleps 10 most dangerous sports
Kim kleps 10 most dangerous sportsKim kleps 10 most dangerous sports
Kim kleps 10 most dangerous sports
 
Pembangunan mapan
Pembangunan mapanPembangunan mapan
Pembangunan mapan
 
Forced marriage
Forced marriageForced marriage
Forced marriage
 
Circuito electrico 11 2
Circuito electrico 11 2Circuito electrico 11 2
Circuito electrico 11 2
 
The art of writing proper paragraphs
The art of writing proper paragraphsThe art of writing proper paragraphs
The art of writing proper paragraphs
 

Ähnlich wie Learning to Extract Relations for Protein Annotation

ppt
pptppt
pptbutest
 
A03730108
A03730108A03730108
A03730108theijes
 
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATAIDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATAijistjournal
 
Identifying the semantic relations on
Identifying the semantic relations onIdentifying the semantic relations on
Identifying the semantic relations onijistjournal
 
Cornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 NetsCornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 NetsMark Gerstein
 
Essential Biology 04.1 Chromosomes, Genes, Alleles, Mutations
Essential Biology 04.1   Chromosomes, Genes, Alleles, MutationsEssential Biology 04.1   Chromosomes, Genes, Alleles, Mutations
Essential Biology 04.1 Chromosomes, Genes, Alleles, MutationsStephen Taylor
 
32_Nov07_MachineLear..
32_Nov07_MachineLear..32_Nov07_MachineLear..
32_Nov07_MachineLear..butest
 
Annotation of SBML Models Through Rule-Based Semantic Integration
Annotation of SBML Models Through Rule-Based Semantic IntegrationAnnotation of SBML Models Through Rule-Based Semantic Integration
Annotation of SBML Models Through Rule-Based Semantic IntegrationAllyson Lister
 
Poster genome engineering & Synthetic Biology 2016
Poster genome engineering & Synthetic Biology 2016Poster genome engineering & Synthetic Biology 2016
Poster genome engineering & Synthetic Biology 2016Michiel Stock
 
Structural Systems Pharmacology
Structural Systems PharmacologyStructural Systems Pharmacology
Structural Systems PharmacologyPhilip Bourne
 
Revealing Trends Based on Defined Queries in Biological Publications Using Co...
Revealing Trends Based on Defined Queries in Biological Publications Using Co...Revealing Trends Based on Defined Queries in Biological Publications Using Co...
Revealing Trends Based on Defined Queries in Biological Publications Using Co...Hadi Mohammadzadeh
 
provenance of microarray experiments
provenance of microarray experimentsprovenance of microarray experiments
provenance of microarray experimentsHelena Deus
 
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSIONCOMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSIONcsandit
 
University of Texas at Austin
University of Texas at AustinUniversity of Texas at Austin
University of Texas at Austinbutest
 
Lec1-Into
Lec1-IntoLec1-Into
Lec1-Intobutest
 
Information extraction for Free Text
Information extraction for Free TextInformation extraction for Free Text
Information extraction for Free Textbutest
 
PMC Poster - phylogenetic algorithm for morphological data
PMC Poster - phylogenetic algorithm for morphological dataPMC Poster - phylogenetic algorithm for morphological data
PMC Poster - phylogenetic algorithm for morphological dataYiteng Dang
 
Advantages And Disadvantages Of Chronic Kidney Disease
Advantages And Disadvantages Of Chronic Kidney DiseaseAdvantages And Disadvantages Of Chronic Kidney Disease
Advantages And Disadvantages Of Chronic Kidney DiseaseKaren Oliver
 
SBML FOR OPTIMIZING DECISION SUPPORT'S TOOLS
SBML FOR OPTIMIZING DECISION SUPPORT'S TOOLS SBML FOR OPTIMIZING DECISION SUPPORT'S TOOLS
SBML FOR OPTIMIZING DECISION SUPPORT'S TOOLS cscpconf
 

Ähnlich wie Learning to Extract Relations for Protein Annotation (20)

ppt
pptppt
ppt
 
A03730108
A03730108A03730108
A03730108
 
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATAIDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
 
Identifying the semantic relations on
Identifying the semantic relations onIdentifying the semantic relations on
Identifying the semantic relations on
 
Cornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 NetsCornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 Nets
 
Essential Biology 04.1 Chromosomes, Genes, Alleles, Mutations
Essential Biology 04.1   Chromosomes, Genes, Alleles, MutationsEssential Biology 04.1   Chromosomes, Genes, Alleles, Mutations
Essential Biology 04.1 Chromosomes, Genes, Alleles, Mutations
 
32_Nov07_MachineLear..
32_Nov07_MachineLear..32_Nov07_MachineLear..
32_Nov07_MachineLear..
 
Annotation of SBML Models Through Rule-Based Semantic Integration
Annotation of SBML Models Through Rule-Based Semantic IntegrationAnnotation of SBML Models Through Rule-Based Semantic Integration
Annotation of SBML Models Through Rule-Based Semantic Integration
 
Poster genome engineering & Synthetic Biology 2016
Poster genome engineering & Synthetic Biology 2016Poster genome engineering & Synthetic Biology 2016
Poster genome engineering & Synthetic Biology 2016
 
Prosdocimi ucb cdao
Prosdocimi ucb cdaoProsdocimi ucb cdao
Prosdocimi ucb cdao
 
Structural Systems Pharmacology
Structural Systems PharmacologyStructural Systems Pharmacology
Structural Systems Pharmacology
 
Revealing Trends Based on Defined Queries in Biological Publications Using Co...
Revealing Trends Based on Defined Queries in Biological Publications Using Co...Revealing Trends Based on Defined Queries in Biological Publications Using Co...
Revealing Trends Based on Defined Queries in Biological Publications Using Co...
 
provenance of microarray experiments
provenance of microarray experimentsprovenance of microarray experiments
provenance of microarray experiments
 
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSIONCOMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
 
University of Texas at Austin
University of Texas at AustinUniversity of Texas at Austin
University of Texas at Austin
 
Lec1-Into
Lec1-IntoLec1-Into
Lec1-Into
 
Information extraction for Free Text
Information extraction for Free TextInformation extraction for Free Text
Information extraction for Free Text
 
PMC Poster - phylogenetic algorithm for morphological data
PMC Poster - phylogenetic algorithm for morphological dataPMC Poster - phylogenetic algorithm for morphological data
PMC Poster - phylogenetic algorithm for morphological data
 
Advantages And Disadvantages Of Chronic Kidney Disease
Advantages And Disadvantages Of Chronic Kidney DiseaseAdvantages And Disadvantages Of Chronic Kidney Disease
Advantages And Disadvantages Of Chronic Kidney Disease
 
SBML FOR OPTIMIZING DECISION SUPPORT'S TOOLS
SBML FOR OPTIMIZING DECISION SUPPORT'S TOOLS SBML FOR OPTIMIZING DECISION SUPPORT'S TOOLS
SBML FOR OPTIMIZING DECISION SUPPORT'S TOOLS
 

KĂŒrzlich hochgeladen

100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptDr. Soumendra Kumar Patra
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 

KĂŒrzlich hochgeladen (20)

100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 

Learning to Extract Relations for Protein Annotation

  • 1. Learning to Extract Relations Learning to Extract Relations for Protein Annotation Jee-Hyub Kima, Alex Mitchellb,c, Teresa K. Attwoodb,c, Melanie Hilarioa aUniversity of Geneva bUniversity of Manchster cEuropean Bioinformatics Institute ISMB/ECCB2007, 23 July 2007 1 / 32
  • 2. Learning to Extract Relations Contents Introduction Related Work Problem and Approach Methods Experimental Results Conclusion 2 / 32
  • 3. Learning to Extract Relations Introduction Protein Annotation DeïŹnition Given a protein sequence or name, describe the protein with all relevant information Two main methods Sequence analysis method (e.g. BLAST, CLUSTALW, etc.) Literature analysis method Traditionally, done manually by human annotators, and automation needed Text-mining has been used for automation. Two main tasks in text-mining Information retrieval (IR): to retrieve relevant documents Information extraction (IE): to extract certain pieces of information from text for pre-deïŹned entities and their relations. 3 / 32
  • 4. Learning to Extract Relations Introduction Adaptive Information Extraction IE is domain-speciïŹc. Generally, one developed system cannot be used for a new domain. Developing IE systems requires a signiïŹcant amount of domain knowledge. Developing IE rules DeïŹning relations These are two main bottlenecks. Still many new domains need to develop their own IE systems. e.g., cell cycle, tissue speciïŹcity, etc. 4 / 32
  • 5. Learning to Extract Relations Related Work IE System Development: Developing IE Rules Kowledge engineering (KE) based approach Knowledge engineers write hand-crafted rules with the help of domain experts (e.g., biologists). Not scalable Machine learning (ML) based approach To increase robustness and coverage of IE rules ML has been used to learn IE rules. Annotated corpora, pre-labelled corpora, raw corpora. Labor required to develop IE systems KE > ML (annotated corpora > pre-labelled corpora > raw corpora) 5 / 32
  • 6. Learning to Extract Relations Related Work IE System Development: DeïŹning Relations All the previous IE systems assume relations to be extracted are already deïŹned. It is hard to specify precisely all possible relations to extract, especially in complex and dynamically-evolving domains (e.g., biological domain) Positioning (ML-based approach) Corpora Relations Pre-deïŹned Not deïŹned Annotated Soderland (1999), Freitag (2000), Califf and Money (2003) Pre-labeled Riloff (1996) Our work Raw Hasegawa et al. (2004) Collier (1996) 6 / 32
  • 7. Learning to Extract Relations Problem and Approach Our Problem Goal To alleviate the burden of developing IE systems for biologists Problem deïŹnition Given relevant sentences that describe protein X in terms of any topic Y and irrelevant sentences Learn to extract relations for protein annotation Two sub-problems What to extract How to extract 7 / 32
  • 8. Learning to Extract Relations Problem and Approach Bottom-up Approach Figure: From sentences to relations 8 / 32
  • 9. Learning to Extract Relations Methods System Architecture 9 / 32
  • 10. Learning to Extract Relations Methods Analyzing Sentences: MBSP Memory-Based Shallow Parser (MBSP) Developed by Walter Daelemans Extended with named entity taggers and SVO relation ïŹnder Provides various types of information: POS, SVO, NE, etc. Adapted to the biological domain on the basis of the GENIA corpus 97.6% accuracy on POS tagging 71.0% accuracy on protein named entity recognition 10 / 32
  • 11. Learning to Extract Relations Methods Analyzing Sentences: Example Example INPUT: Examples of this are the RNA-binding protein containing the RNA-binding domain (RBD) ... OUTPUT: Chunk Syntactic Semantic SVO relation Examples noun phrase subject of ’are’ of preposition this noun phrase are verb phrase the RNA-binding protein noun phrase protein subject of ’contain’ containing verb phrase the RNA-binding domain (RBD) noun phrase domain object of ’contain’ 11 / 32
  • 12. Learning to Extract Relations Methods Learning IE Rules: Inductive Logic Programming (ILP) We applied ILP to learn IE rules. ILP is a ML algorithm that induces rules from examples. Outputs are readable and interpretable by the domain experts. Can deal with relational information (e.g., parse trees). Problem Setting B ∧ H |= E Given B (Background Knowledge) and E (Examples), ïŹnd H (Hypothesis). 12 / 32
  • 13. Learning to Extract Relations Methods Learning IE Rules: Representation B (Background Knowledge) Linguistic heuristics (for single-slot IE pattern) e.g., <subj> verb, verb <dobj>, verb preposition <np>, noun prep <np>, etc. Sentence descriptions (ie., analyzed sentences) E (Examples) Positive and negative examples (ie., relevant and irrelevant sentences) H (Hypothesis) A set of IE rules 13 / 32
  • 14. Learning to Extract Relations Methods Learning IE Rules: Generalization Figure: From a sentence to an IE rule 14 / 32
  • 15. Learning to Extract Relations Methods Learning IE Rules: Step 1 Figure: From a sentence to an IE rule 15 / 32
  • 16. Learning to Extract Relations Methods Learning IE Rules: Step 2 Figure: From a sentence to an IE rule 16 / 32
  • 17. Learning to Extract Relations Methods Learning IE Rules: Step 3 Figure: From a sentence to an IE rule 17 / 32
  • 18. Learning to Extract Relations Methods Learning IE Rules: Step 4 Figure: From a sentence to an IE rule WRAcc(Rule) = coverage(Rule) ∗ (accuracy(Rule) − accuracy(Head ← true)) 18 / 32
  • 19. Learning to Extract Relations Methods Selecting IE Rules Why is this step necessary? Rules are learned to classify sentences. Not to extract information. Spurious IE rules are learned from the previous step. Need to be ïŹltered out by domain experts. Rules are provided to users with information. Example RULE: <subj:*> vp:contain & vp:contain <dobj:domain> [9, 0.9] S: Myocilin is a secreted glycoprotein that forms multimers and contains a leucine zipper and an olfactomedin domain. 19 / 32
  • 20. Learning to Extract Relations Methods Transforming Rules Once IE rules are selected, transformed into relations. Mapping between IE rules and relations IE Rules Relations extract argument trigger relation name syntactic tag argument position Example <subj:*>[X] vp:contain & vp:contain <dobj:domain>[Y] → contain(X,Y) 20 / 32
  • 21. Learning to Extract Relations Methods Grouping Rules Post-processing rules Pattern Example A verb (active form) B A activate B B be ed-participle by A B be activated by A nominal form (with sufïŹx -tion) of verb of B by A activation of B by A A be nominal form (with sufïŹx -or) of verb of B A is an activator of B A be ... that verb (active form) B A is ... that activates B 21 / 32
  • 22. Learning to Extract Relations Methods Applying Rules to Extract Relations Now, we have a set of IE rules for each relation. IE rule and relation RULE: <subj:protein>[X] vp:promote & vp:promote <dobj:disease>[Y] TRIGGER: promote RELATION: promote(X,Y) Example INPUT: Our data demonstrate that PKC beta II promotes colon cancer, at least in part, through induction of Cox-2, suppression of TGF-beta signaling, and establishment of a TGF-beta-resistant, hyperproliferative state in the colonic epithelium. OUTPUT: promote(’PKC beta II’,’colon cancer’) 22 / 32
  • 23. Learning to Extract Relations Experimental Results Experiments Our IE system was applied for PRINTS (a protein family database) annotation. Development Corpora Topic Positives Negatives Class Distribution Disease 777 1403 36-64% Function 1268 2625 33-67% Structure 1159 1750 40-60% 80% for training, 20% for test 23 / 32
  • 24. Learning to Extract Relations Experimental Results Evaluation All the extracted relations are manually evaluated by domain experts. Results Topic Learned Selected Relations Precision Recall F1-measure rules rules Disease 55 32 21 75 18.3 29.4 Function 125 64 23 66.3 15.1 24.6 Structure 146 76 20 85.3 61 71.1 cf. a recent work of extracting regulatory gene/protein networks F1-measure of 44% (Saric et al, 2006) 24 / 32
  • 25. Learning to Extract Relations Experimental Results Protein Annotation Examples Discovered relations Disease be_associated, is_a, be_mutated, be_caused, be_deleted, contribute, etc. Function induce, block, mediate, is_a, belong, act, etc. Structure contain, form, share, lack, bind, encode, be_conserved, etc. Annotation example for protein NF-kappaB be_implicated(,’NF-kappaB’, in:’the pathogenesis’) regulate(’IkappaBalpha’, ’NF-kappaB’), activate(’BCMA’, ’NF-kappaB’) be_composed(’NF-kappaB’, of:’heterodimeric complexes’) 25 / 32
  • 26. Learning to Extract Relations Experimental Results Limitations Failure to ïŹnd inverse relation S: Expression of uPAR in tumor extracts also inversely correlates with prognosis in many forms of cancer. RULE: np:expression <of:protein>[X] & vp:correlate <with:prognosis>[Y] RELATION: correlate(expression(’expression’,of:’uPAR’), with:’prognosis’) Anaphora problem S: Whereas the overall structure resembles that of the NF-kappaB p50-DNA complex , pronounced differences are observed within the ’ insert region ’. RULE: <subj:structure>[X] vp:resemble & vp:resemble <dobj:C>[Y] RELATION: resemble(’the overall structure’,’that’) 26 / 32
  • 27. Learning to Extract Relations Conclusion Conclusion Proposed a methodology for developing IE systems with resources that can be provided by biologists. Learned relations as well as IE rules without annotated corpora. Annotated proteins with structured information (i.e., predicate argument structure) in terms of any topic. Validated the methodology over different topics (function, structure, disease, cancer) in bio-medical domain. Advantage: will alleviate the burden of developing IE systems for users who have little or no formal IE training. 27 / 32
  • 28. Thank you for your attention!
  • 29. Learning to Extract Relations Appendix For Further Reading For Further Reading I Mary Elaine Califf and Raymond J. Mooney. Bottom-up relational learning of pattern matching rules for information extraction. Journal of Machine Learning Research, 4:177–210, 2003. R. Collier. Automatic template creation for information extraction, an overview, 1996. Dayne Freitag. Machine learning for information extraction in informal domains. Machine Learning, 39(2/3):169–202, 2000. 29 / 32
  • 30. Learning to Extract Relations Appendix For Further Reading For Further Reading II Takaaki Hasegawa, Satoshi Sekine, and Ralph Grishman. Discovering relations among named entities from large corpora. In ACL, pages 415–422, 2004. Ellen Riloff. Automatically generating extraction patterns from untagged text. In Proceedings of the Thirteenth National Conference on ArtiïŹcial Intelligence (AAAI-96), pages 1044–1049, 1996. 30 / 32
  • 31. Learning to Extract Relations Appendix For Further Reading For Further Reading III Jasmin Saric, Lars Juhl Jensen, Peer Bork, Rossitza Ouzounova, and Isabel Rojas. Extracting regulatory gene expression networks from pubmed. In ACL, pages 191–198, 2004. Stephen Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233–272, 1999. 31 / 32