Extending the Espresso Method for Greater Recall
1. Relationship Extraction from Text
Derek Springer
UCLA Computer Science Department
November 19, 2009
2. Related Works
• Ganapathi, Swathi. “Relationship Extraction from Text:
Comparison and Experimental Evaluation of the State-of-
the-Art.” UCLA comp exam. March 2009.
• Chu, A., Sakurai, S., Cárdenas, A. F., "Automatic Detection
of Treatment Relationships in Patent Retrieval." 2008 CIKM
Patent Information Retrieval Workshop. October 2008.
3. Related Works, cont'd
• Girju, R. "Automatic Detection of Causal Relations for
Question Answering." In the proceedings of the 41st Annual
Meeting of the Association for Computational Linguistics
(ACL 2003). Workshop on "Multilingual Summarization and
Question Answering - Machine Learning and Beyond".
2003.
• Pantel, Patrick and Pennacchiotti, Marco. "Espresso:
Leveraging Generic Patterns for Automatically Harvesting
Semantic Relations." In Proceedings of Conference on
Computational Linguistics / Association for Computational
Linguistics (COLING/ACL-06). pp. 113-120.
Sydney, Australia. 2006.
4. Relationship Extraction
• The task of recognizing the assertion of a
particular relationship between two or more
entities in text.
• Can aid in the development of
standalone, intelligent, automated and adaptable
user-specific content retrieval systems.
• We focus on extracting treatment relationships
→ A (subject) used to treat B (object).
5. Goals and Contributions
• Extended state-of-the-art Espresso relationship
extraction system originally implemented by
Ganapathi.
• Did an in-depth experimental evaluation of the
developed system while comparing it to prior
work (Chu, Ganapathi).
• Future goal is to use the system developed here
as a plug-in relationship feature extractor for
iScore.
6. Integration Into iScore
• iScore presents additional articles based on an
aggregate score of “interestingness.”
• We believe filtering articles based on
relationships can improve the results of iScore.
• We hypothesize that extending the Espresso
system implemented by Swathi Ganapathi will
improve the ability of a system such as iScore to
utilize relationship extraction as a feature.
7. Comparison Criteria
• Performance: Want system to have high
precision and recall
• Minimal Supervision: Want system to require
little to no human supervision
• Breadth: Want system to extract relations from
varying corpus sizes, domains and formats.
• Generality: Want system to extract wide variety
of relation types without losing its edge in any of
the above criteria.
8. The Espresso Algorithm
• General purpose algorithm which can be used to
extract a wide variety of binary relations.
• Requires minimal supervision. Only input is a
small seed set of known relations.
• Detects relationships by looking at individual
sentences, so it works well on all kinds of corpora.
• In tests conducted by the creators of the
algorithm, Espresso produced balanced
precision and recall.
11. Ganapathi's Implementation
• Ganapathi's approach uses lexico-syntactic
patterns of the form NP1 VP NP2 (Verb category
in Table 1).
• The VP contains a treatment verb or pattern, and
the two NPs contain the subject and object.
• This structure is very common, accounting for
37.8% of all relationships.
12. Extension
• A large number of relationships fall outside this
pattern and may provide fruitful results.
• Expanding the implementation to include these
relationship types:
- Noun+Prep e.g. "X settlement with Y"
- Verb+Prep e.g. "X moved to Y"
- Infinitive e.g. "X plans to acquire Y"
- Modifier e.g. "X is Y winner"
• With these patterns added, the system retrieves
91.2% of common relationships.
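To make the pattern categories above concrete, here is a minimal sketch that classifies the words linking X and Y by their part-of-speech tags. It is an illustration only, not the implementation described in these slides: the function name classify_link, the regular expressions, and the use of Penn Treebank style tags are all assumptions. The Modifier category is left out because its pattern ("X is Y winner") wraps around Y rather than sitting between X and Y.

```python
import re

# Hypothetical, simplified tag patterns for the words between X and Y.
# Penn Treebank style tags are assumed: VB* = verb, NN* = noun,
# IN = preposition, TO = the word "to".
# Order matters: Infinitive is checked before the shorter patterns.
CATEGORY_PATTERNS = [
    ("Infinitive", re.compile(r"^(VB\w* )*TO VB\w*$")),  # "plans to acquire"
    ("Verb+Prep", re.compile(r"^VB\w* (IN|TO)$")),       # "moved to"
    ("Noun+Prep", re.compile(r"^NN\w* (IN|TO)$")),       # "settlement with"
    ("Verb", re.compile(r"^VB\w*$")),                    # "treats"
]


def classify_link(tags):
    """Return the category name for the tag sequence linking X and Y."""
    key = " ".join(tags)
    for name, pattern in CATEGORY_PATTERNS:
        if pattern.match(key):
            return name
    return "Other"


if __name__ == "__main__":
    examples = {
        "X settlement with Y": ["NN", "IN"],
        "X moved to Y": ["VBD", "TO"],
        "X plans to acquire Y": ["VBZ", "TO", "VB"],
        "X treats Y": ["VBZ"],
    }
    for sentence, tags in examples.items():
        print(sentence, "->", classify_link(tags))
```

A real system would obtain the tags from a part-of-speech tagger and would need patterns that tolerate adverbs, auxiliaries and determiners; the sketch only shows how the categories differ in shape.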
13. Test Corpora
• Patent Corpus: Developed by Shige
o 50,000 drug patent documents from 2008 from Class 424 & 514 of
the U.S. Patents Classification: “drug, bio-affecting and body
treating compositions” and their subclasses.
o Patents were pre-filtered to those containing the keywords
“diabetes”, “metastatic”, “cancer”, “tuberculosis”, “lung”, “bronchitis”,
“coronary artery”
o All sentences from each document were added to a sentence table in
the schema
• PubMed Corpus: Developed by Gustavo
o Comprised of medical abstracts from PubMed
o Each abstract was parsed and all sentences from each abstract
were stored as individual tuples in the sentence table
16. Procedure
1. Re-tag original data set to incorporate extended
relationship types.
2. Re-run Ganapathi's baseline Espresso
implementation to compare against updated data
set.
3. Run extended Espresso implementation to
compare against updated data set.
17. Experiment #1: Extraction on Drug
Patent Corpus
• Drug Patent corpus used.
• Algorithm was run with seed relations and 12 verbs were extracted as
being relevant (verbs with rπ greater than 0.2).
• These treatment verbs were used to create a test sentence set of 120
sentences, i.e. 10 sentences for each of the 12 relevant treatment
verbs.
• 358 possible relations were extracted, and we calculated the ri score
for each of them.
• 208 relations had an ri score greater than the threshold, of which
126 were actually correct (through manual tagging).
• Of the original 358 relations, manual tagging determined that 213
were correct treatment relations.
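The counts on this slide are enough to work out precision, recall and F-score by hand. The sketch below does that arithmetic under the standard definitions, assuming that the 126 correct extractions are all among the 213 manually tagged correct relations; the function name is illustrative, and the resulting figures are computed here from the counts rather than quoted from the original results.

```python
def precision_recall_f1(num_extracted, num_correct_extracted, num_correct_total):
    """Standard precision, recall and F1 from raw counts."""
    precision = num_correct_extracted / num_extracted
    recall = num_correct_extracted / num_correct_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # 208 relations above the threshold, 126 of them correct,
    # 213 correct relations among all 358 candidates.
    p, r, f = precision_recall_f1(208, 126, 213)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
    # prints: precision=0.606 recall=0.592 f1=0.599
```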
19. Experiment #2: Number of
Relationships and Performance
• Drug Patent corpus used.
• Test the performance of the system under
smaller and larger data loads.
• Started with the initial set of 120 sentences obtained
from the Drug Patent corpus (10 sentences for each
verb, 12 verbs as in test #1).
• Increased the number of sentences for each
verb by 10 at each step, giving sentence sets of
240 and 360 sentences.
21. Experiment #2 Analysis
• Performance of the system and the number of
relationships are inversely related.
• ri scores are inversely affected by the max pmi across
all relationship instances, so having more relationship
instances in a set may lower the ri of all of those
relationships.
• More relationships => chance of a greater max pmi =>
lowered ri for all relationship instances.
• Not a concern → articles likely won't have 200 relations
of the same type.
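The effect described above follows from the instance reliability measure in the Espresso paper (Pantel and Pennacchiotti, 2006), where each pmi value is divided by the maximum pmi before being weighted by the pattern reliability rπ and averaged over the patterns. The sketch below illustrates that normalization only; the function name, pattern names and numeric values are made up for the example, and details of the original measure such as pmi discounting are omitted.

```python
def instance_reliability(pmi_by_pattern, pattern_reliability, max_pmi):
    """Average of (pmi / max_pmi) * r_pi over all patterns.

    pmi_by_pattern: pmi between this instance and each pattern
                    (patterns not listed contribute 0).
    pattern_reliability: r_pi score for every pattern in the set.
    max_pmi: largest pmi over all instance/pattern pairs.
    """
    total = sum(
        (pmi / max_pmi) * pattern_reliability[pattern]
        for pattern, pmi in pmi_by_pattern.items()
    )
    return total / len(pattern_reliability)


if __name__ == "__main__":
    # Illustrative values only.
    r_pi = {"treats": 0.8, "cures": 0.4}
    pmi = {"treats": 2.0, "cures": 1.0}

    small_set = instance_reliability(pmi, r_pi, max_pmi=4.0)
    # A larger set contains some other instance with a higher pmi.
    large_set = instance_reliability(pmi, r_pi, max_pmi=8.0)
    print(small_set, large_set)
```

With everything else held fixed, doubling the max pmi halves the ri of this instance, which is consistent with the drop in performance observed on the larger sentence sets.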
22. Experiment #3: Extraction on
PubMed Corpus
• PubMed corpus used.
• Want to test the performance of the system on a corpus of
a different type and size
• Algorithm was run with input seed relations on this corpus
and the 10 verbs with the topmost rπ values were extracted
• We constructed a test sentence set of 80 sentences (8
sentences for every relevant verb)
• We then extracted a total of 162 relations from this test set
and calculated their ri scores.
• The average ri score was used as the threshold value
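Using the average ri score as the cut-off can be sketched in a few lines. The relation tuples and scores below are invented for illustration and the function name is an assumption; the strict "greater than" comparison follows the wording used for Experiment #1.

```python
from statistics import mean


def filter_by_mean_threshold(scores):
    """Keep relations whose ri score is above the mean ri score."""
    threshold = mean(scores.values())
    kept = {rel: s for rel, s in scores.items() if s > threshold}
    return threshold, kept


if __name__ == "__main__":
    # Hypothetical (subject, object) relations with made-up ri scores.
    ri_scores = {
        ("metformin", "diabetes"): 0.9,
        ("aspirin", "study"): 0.4,
        ("method", "patients"): 0.2,
    }
    threshold, kept = filter_by_mean_threshold(ri_scores)
    print(threshold, sorted(kept))
```

A mean-based threshold adapts to each test set without manual tuning, but it also means the cut-off moves whenever the score distribution changes.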
25. Experiment #3 Analysis
• Performance is worse on PubMed corpus.
• Patent corpus dealt with drugs and cures for diseases.
• Therefore, there was an abundance of treatment type
relations in patent corpus.
• PubMed had more general medical data and only
contained abstracts => less info.
• Therefore, there were fewer treatment relations in
PubMed, which affected performance.
27. Analysis
• F-score of Ganapathi's version of Espresso fell
nearly 10% → due to lower recall, as predicted.
• Results of extension over the re-tagged data are
on par with Ganapathi's original results.
• Given that Ganapathi's system dropped nearly
10% on the re-tagged data, this suggests the
extension is more general-purpose than the
original version.
28. Success
• Recall of system is more important than
precision, especially when it comes to using
relationships as a feature in iScore.
• Method is almost completely automated.
• Easily expanded to extract other relationship types by
changing the input seed relations.
• Initial results seem insignificant, but analysis indicates
that the extended system has the potential to be a
general-purpose relationship extraction feature.
29. Future Work
• Development of a relationship feature extractor
for iScore.
• Relations will have to be syntactically and
semantically compared with relations present in
other articles and the best article matches will be
returned as “interesting” choices for a user.
• Optimizations: algorithm design
improvements, database connection
optimizations and parallelization.