Concept-Based Information Retrieval using Explicit Semantic Analysis. M.Sc. seminar talk. Ofer Egozi, CS Department, Technion. Supervisor: Prof. Shaul Markovitch. 24/6/09
Information Retrieval [diagram labels: Query, IR; evaluation measures: recall, precision]
Ranked retrieval [diagram labels: Query, IR]
Keyword-based retrieval: Bag Of Words (BOW) [diagram labels: Query, IR]
Problem: retrieval misses
TREC topic #411: salvaging shipwreck treasure
TREC document LA071689-0089: “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday.”
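The miss can be reproduced with a minimal bag-of-words sketch. The tokenizer below (lowercase alphabetic tokens, no stemming) is an assumption made for illustration, not the retrieval engine used in the talk; it only shows that the topic and the relevant document share no terms.

```python
import re

def bag_of_words(text):
    """Lowercase the text and return its set of alphabetic tokens (no stemming)."""
    return set(re.findall(r"[a-z]+", text.lower()))

query = "salvaging shipwreck treasure"
document = (
    "ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater "
    "for more than 2,000 years in the wreck of a Roman ship that sank in the "
    "Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
)

# Terms that a purely keyword-based matcher could use to connect the two texts.
shared_terms = bag_of_words(query) & bag_of_words(document)
print(shared_terms)
```

The intersection is empty: "wreck" and "ship" do not match "shipwreck" as whole tokens, and "artifacts" is related to "treasure" only through world knowledge.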
The vocabulary problem
Identity: syntax (tokenization, stemming…)
Similarity: synonyms (WordNet etc.)
Relatedness: semantics / world knowledge (???)
Applied to the query “salvaging shipwreck treasure” and TREC document LA071689-0089:
? [but also shipping/treasurer]
Synonymy / polysemy: ? [but also deliver/scavenge/relieve]
Concept-based retrieval [diagram labels: IR; query “salvaging shipwreck treasure”; TREC document LA071689-0089]
Concept-based representations
Human-edited thesauri (e.g. WordNet). Source: editors; concepts: words; mapping: manual.
Corpus-based thesauri (e.g. co-occurrence). Source: corpus; concepts: words; mapping: automatic.
Ontology mapping (e.g. KeyConcept). Source: ontology; concepts: ontology node(s); mapping: automatic.
Latent analysis (e.g. LSA, pLSA, LDA). Source: corpus; concepts: word distributions; mapping: automatic.
Drawbacks of these approaches: insufficient granularity; non-intuitive concepts; expensive repetitive computations; non-scalable solution.
Is it possible to devise a concept-based representation that is scalable, computationally feasible, and uses intuitive and granular concepts?
Explicit Semantic Analysis: Gabrilovich and Markovitch (2005, 2006, 2007)
Explicit Semantic Analysis (ESA)
Wikipedia is viewed as an ontology: a collection of ~1M concepts (e.g. World War II, Panthera, Jane Fonda, Island).
Every Wikipedia article represents a concept.
Article words are associated with the concept (TF·IDF). Example for the concept Panthera: Cat [0.92], Leopard [0.84], Roar [0.77].
The semantics of a word is the vector of its associations with Wikipedia concepts. Example for the word “Cat”: Panthera [0.92], Cat [0.95], Jane Fonda [0.07].
Explicit Semantic Analysis (ESA)
The semantics of a text fragment is the average vector (centroid) of the semantics of its words. In practice: disambiguation…
Example inputs: “mouse”, “button”, “mouse button”.
Concept weights shown: Mouse (computing) [0.81], Mickey Mouse [0.81], Game Controller [0.64], Button [0.93], Game Controller [0.32], Mouse (rodent) [0.91], John Steinbeck [0.17], Mouse (computing) [0.95], Mouse (rodent) [0.56], Dick Button [0.84], Mouse (computing) [0.84], Drag-and-drop [0.91]
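The word-to-concept mapping and the centroid step can be sketched as follows. This is a minimal illustration under stated assumptions: the four "articles" are invented toy text rather than Wikipedia content, the TF·IDF variant (raw term frequency times log inverse document frequency) is one common choice, and the function names `build_word_vectors` and `esa_vector` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy stand-ins for Wikipedia articles (invented text, not real article content).
concepts = {
    "Panthera": "panthera cat leopard roar cat",
    "Cat": "cat pet cat kitten",
    "Mouse (rodent)": "mouse rodent cat",
    "Mouse (computing)": "mouse button computer",
}

def build_word_vectors(concepts):
    """Map each word to {concept: TF-IDF weight}, i.e. an inverted index over concepts."""
    n = len(concepts)
    term_counts = {name: Counter(text.split()) for name, text in concepts.items()}
    doc_freq = Counter(word for counts in term_counts.values() for word in counts)
    vectors = defaultdict(dict)
    for name, counts in term_counts.items():
        for word, tf in counts.items():
            vectors[word][name] = tf * math.log(n / doc_freq[word])
    return vectors

def esa_vector(text, word_vectors):
    """Centroid (average) of the concept vectors of the known words in `text`."""
    words = [w for w in text.lower().split() if w in word_vectors]
    centroid = defaultdict(float)
    for w in words:
        for concept, weight in word_vectors[w].items():
            centroid[concept] += weight / len(words)
    return dict(centroid)

word_vectors = build_word_vectors(concepts)
mouse_button = esa_vector("mouse button", word_vectors)
print(max(mouse_button, key=mouse_button.get))
```

In this toy index "mouse" alone is split evenly between the rodent and computing concepts, while "mouse button" is dominated by Mouse (computing), which is the disambiguation effect the slide describes.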
MORAG: An ESA-based information retrieval algorithm (“Morag” is Hebrew for flail). “Concept-based feature generation and selection for information retrieval”, AAAI-2008
Enrich documents/queries [diagram labels: ESA, IR, Query]. Constraint: use only the strongest concepts.
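Keeping only the strongest concepts amounts to truncating a sparse concept vector to its top-weighted entries. The sketch below assumes a plain top-k cut; the helper name `strongest_concepts` and the cutoff `k` are illustrative, and the talk does not state the actual cutoff used.

```python
def strongest_concepts(vector, k):
    """Keep only the k highest-weighted concepts of a sparse concept vector."""
    top = sorted(vector.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(top)

# Weights taken from the "Cat" example earlier in the talk.
cat_vector = {"Panthera": 0.92, "Cat": 0.95, "Jane Fonda": 0.07}
print(strongest_concepts(cat_vector, k=2))
```

With k=2 the weakly associated concept Jane Fonda is dropped and only Cat and Panthera remain.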
Problem: documents (in)coherence
TREC document LA120790-0036: “REFERENCE BOOKS SPEAK VOLUMES TO KIDS; With the school year in high gear, it's a good time to consider new additions to children's home reference libraries… …Also new from Pharos-World Almanac: "The World Almanac InfoPedia," a single-volume visual encyclopedia designed for ages 8 to 16… …"The Doubleday Children's Encyclopedia," designed for youngsters 7 to 11, bridges the gap between single-subject picture books and formal reference books… …"The Lost Wreck of the Isis" by Robert Ballard is the latest adventure in the Time Quest Series from Scholastic-Madison Press ($15.95 hardcover). Designed for children 8 to 12, it tells the story of Ballard's 1988 discovery of an ancient Roman shipwreck deep in the Mediterranean Sea…”
The document is judged relevant for topic 411 due to one relevant passage in it.
Concepts generated for this document will average to the books / children concepts, and lose the shipwreck mentions…
This is not an issue in BOW retrieval, where words are indexed independently. How should it be dealt with in concept-based retrieval?
Solution: split to passages
$$\mathrm{ConceptScore}(d) = \mathrm{ConceptScore}(\text{full-doc}) + \max_{\text{passage} \in d} \mathrm{ConceptScore}(\text{passage})$$
Index both the full document and its passages. Best performance is achieved by fixed-length overlapping sliding windows.
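The passage-based score can be sketched as below. Several parts are assumptions for illustration: cosine similarity stands in for ConceptScore (the talk does not define it here), `TOY_INDEX` is an invented word-to-concept table rather than a real ESA index, and the window size and step are arbitrary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def sliding_windows(tokens, size, step):
    """Fixed-length overlapping windows; the last window may be shorter."""
    if len(tokens) <= size:
        return [tokens]
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + step, step)]

def document_concept_score(query_vec, tokens, to_vector, size, step):
    """Full-document score plus the score of the best-matching passage."""
    full_doc = cosine(query_vec, to_vector(tokens))
    best_passage = max(cosine(query_vec, to_vector(p)) for p in sliding_windows(tokens, size, step))
    return full_doc + best_passage

# Hypothetical word-to-concept weights standing in for a real ESA index.
TOY_INDEX = {
    "wreck": {"Shipwreck": 0.8},
    "books": {"Book": 1.0},
    "encyclopedia": {"Book": 0.7},
    "children": {"Child": 1.0},
}

def to_vector(tokens):
    """Sum the concept weights of all tokens (unknown tokens contribute nothing)."""
    vec = {}
    for t in tokens:
        for concept, weight in TOY_INDEX.get(t, {}).items():
            vec[concept] = vec.get(concept, 0.0) + weight
    return vec

tokens = "books children encyclopedia books children roman wreck".split()
query_vec = {"Shipwreck": 1.0}
full_only = cosine(query_vec, to_vector(tokens))
combined = document_concept_score(query_vec, tokens, to_vector, size=2, step=1)
print(round(full_only, 3), round(combined, 3))
```

The full-document score is low because the book and children concepts dominate the vector, while the window covering "roman wreck" matches the query concept fully and lifts the combined score.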
Morag ranking
$$\mathrm{Score}(q,d) = \lambda \cdot \mathrm{ConceptScore}(q,d) + (1-\lambda) \cdot \mathrm{KeywordScore}(q,d), \qquad 0 \le \lambda \le 1$$
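The ranking formula is a linear interpolation between the concept-based and keyword-based scores. In the sketch below the weight is called `lam` (an assumed name), and the candidate scores are made-up numbers used only to show how the weight changes the ordering.

```python
def morag_style_score(concept_score, keyword_score, lam):
    """Linear interpolation of a concept-based and a keyword-based score."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * concept_score + (1.0 - lam) * keyword_score

# Hypothetical (concept_score, keyword_score) pairs for two documents.
candidates = {"doc_a": (0.9, 0.1), "doc_b": (0.2, 0.7)}

def rank(candidates, lam):
    """Order document ids by their interpolated score, best first."""
    return sorted(candidates, key=lambda d: morag_style_score(*candidates[d], lam=lam), reverse=True)

print(rank(candidates, lam=0.5))
print(rank(candidates, lam=0.0))
```

With `lam=0.0` the ranking reduces to pure keyword retrieval, and with `lam=1.0` to pure concept-based retrieval.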
ESA-based retrieval example
Treasure
Maritime archaeology
Marine salvage
History of the British Virgin Islands
Wrecking (shipwreck)
Key West, Florida
Flotsam and jetsam
Wreck diving
Spanish treasure fleet
TREC document LA071689-0089: “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday.”
Wreck diving
RMS Titanic
USS Hoel (DD-533)
Shipwreck

USS Meade (DD-602)
TREC topic #411: salvaging shipwreck treasure
  • 40.
  • 41. Economy of Estonia
  • 42. Estonia at the 2000 Summer Olympics
  • 43. Estonia at the 2004 Summer Olympics
  • 44. Estonia national football team
  • 45. Estonia at the 2006 Winter Olympics
  • 49.
  • 50. Estonia at the 2004 Summer Olympics
  • 52. Estonia at the 2006 Winter Olympics
  • 53. 1992 Summer Olympics
  • 54. Athletics at the 2004 Summer Olympics
  • 55. 2000 Summer Olympics
  • 56. 2006 Winter Olympics
  • 57. Cross-country skiing 2006 Winter Olympics
  • 59. “Economy” is not mentioned, but TF·IDF of “Estonia” is strong enough to trigger this concept on its own…
  • 60. Problem: selecting query features. Selection could remove noisy ESA concepts. However, the IR task provides no training data… Focus on query concepts: the query is short and noisy, while feature selection (FS) at indexing lacks context. A utility function U(+|-) requires a target measure, and hence a training set. (Diagram: f = ESA(q), filtered using U into f’.)
  • 61. Solution: Pseudo-Relevance Feedback. Use BOW results as positive / negative examples
  • 62. ESA feature selection methods. IG (filter): calculate each feature’s Information Gain in separating positive and negative examples, and take the best-performing features. RV (filter): add concepts in the positive examples to the candidate features, and re-weight all features based on their weights in the examples. IIG (wrapper): find the subset of features that best separates positive and negative examples, employing heuristic search.
  • 64. Economy of Estonia
  • 65. Estonia at the 2000 Summer Olympics
  • 66. Estonia at the 2004 Summer Olympics
  • 67. Estonia national football team
  • 68. Estonia at the 2006 Winter Olympics
  • 74. Economy of Europe
  • 77. Estonia at the 2000 Summer Olympics
  • 78. Estonia at the 2004 Summer Olympics
  • 80. Estonia at the 2006 Winter Olympics
  • 81. 1992 Summer Olympics
  • 82. Athletics at the 2004 Summer Olympics
  • 83. 2000 Summer Olympics
  • 84. 2006 Winter Olympics
  • 85. Cross-country skiing 2006 Winter Olympics
  • 87. Morag evaluation: significant performance improvement, both over our own baseline and over top-performing TREC-8 BOW baselines. Concept-based performance by itself is quite low. A major reason is the TREC ‘pooling’ method, which implies that relevant documents found only by Morag will not be judged as such…
  • 88. Morag evaluation: optimal (“Oracle”) selection analysis shows much more potential for Morag
  • 89. Morag evaluation: pseudo-relevance proves to be a good approximation of actual relevance
  • 90. Conclusion. Morag: a new methodology for concept-based information retrieval. Documents and queries are enhanced by Wikipedia concepts. Informative features are selected using pseudo-relevance feedback. The generated features improve the performance of BOW-based systems.
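
The passage-based concept score (slide 19) and the Morag ranking (slide 20) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: concept vectors are plain dictionaries, cosine similarity stands in for the concept-based scoring function, `toy_esa` is a hypothetical stand-in for a real ESA interpreter, and `lam` is a name chosen here for the mixing weight between the concept and keyword scores.

```python
import math
from typing import Callable, Dict, List

ConceptVector = Dict[str, float]  # concept name -> weight

# Toy stand-in for an ESA interpreter (hypothetical weights, illustration only).
TOY_CONCEPTS: Dict[str, ConceptVector] = {
    "book": {"Book": 1.0},
    "shipwreck": {"Shipwreck": 1.0},
}


def toy_esa(tokens: List[str]) -> ConceptVector:
    """Sum the per-word concept vectors of a text."""
    vec: ConceptVector = {}
    for tok in tokens:
        for concept, weight in TOY_CONCEPTS.get(tok, {}).items():
            vec[concept] = vec.get(concept, 0.0) + weight
    return vec


def cosine(a: ConceptVector, b: ConceptVector) -> float:
    """Cosine similarity between two sparse concept vectors (assumed scorer)."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def sliding_windows(tokens: List[str], size: int, step: int) -> List[List[str]]:
    """Fixed-length windows; consecutive windows overlap when step < size."""
    if len(tokens) <= size:
        return [tokens]
    starts = range(0, len(tokens) - size + step, step)
    return [tokens[s:s + size] for s in starts]


def concept_score(query: List[str], doc: List[str],
                  esa: Callable[[List[str]], ConceptVector],
                  size: int = 50, step: int = 25) -> float:
    """ConceptScore(full-doc) + max over passages of ConceptScore(passage)."""
    q_vec = esa(query)
    full = cosine(q_vec, esa(doc))
    best = max(cosine(q_vec, esa(p)) for p in sliding_windows(doc, size, step))
    return full + best


def morag_score(concept: float, keyword: float, lam: float = 0.5) -> float:
    """Linear mix of the concept-based and keyword-based scores."""
    return lam * concept + (1.0 - lam) * keyword
```

For a toy document of six "book" tokens followed by two "shipwreck" tokens, the full-document vector is dominated by the Book concept, while the final two-token window matches the query "shipwreck" exactly. The passage term therefore restores the signal that averaging over the whole document dilutes, which is the incoherence problem described on slide 18.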

Editor's notes

  1. This is a relevant document for this TREC query that is not retrieved by a standard BOW system: none of the keywords are found in the document
  2. Methods for dealing mainly with synonymy. Each of the existing methods has its issues: stemming loses nuances, tokenization may create words the author did not intend, and synonyms may intensify ambiguity (polysemy). However, mapping “treasure” to “ancient artifacts found in a sunken Roman ship” requires significant world knowledge that these methods cannot offer.
  3. The promise of concept-based retrieval is that by transforming to a domain of concepts and performing retrieval in it rather than in the domain of words, the previously described problems will be dramatically reduced.
  4. Existing concept-based representation approaches. KeyConcept is the most similar to ESA, but 1) it uses a very small ontology (1564 concepts), and 2) its query processing is MANUAL.
  5. ESA can also be generated from other knowledge sources (it was successfully applied to ODP), but recent papers focused on Wikipedia, which proved the most fitting.
  6. These vectors are for illustration only, actual concepts and weights are different in real life (so don’t try the maths…)
  7. First results were published in AAAI 2008
  8. The constraint is due to the very large number of concepts in each vector, which can easily inflate the index to a huge scale
  9. Enough overlap between concepts of query and target document – document is retrieved despite having no keywords match!
  10. However, we seem to also have false positives causing results to be far from optimal. These (and previous slide) are actual top 10 concepts generated for these texts.
  11. Going back to the ESA classifier generation, we see why Baltic Sea was triggered by the query (although we still consider it not relevant enough)
  12. Still, the bottom line is that we prefer this not to happen. One option is to change how concepts are generated for texts of more than one word, but in this research we decided not to change the ESA mechanism itself.
  13. Indeed, when applying ESA to text categorization, training data was crucial in removing noisy features…
  14. Relevance feedback is the process where a user assigns relevancy labels to retrieved documents. Pseudo means we let the system decide that the top documents are considered relevant. Naturally this is less accurate, but better than no data at all.
  15. More details on these actual methods are given in the paper.
  16. The less relevant features usually either appear in both the positive and negative example sets or appear in neither. Useful features are those that appear more often in positive examples.
  17. TREC is the most comprehensive and most studied IR benchmark. We used two datasets, plus a third one (TREC-7) for parameter tuning. These graphs show the impact of feature selection, hence they show the performance of the concept-based subsystem alone.
  18. The full MORAG system results. Parameter tuning works well. Improvement is most apparent when the baseline is weaker. Note that the performance of concept-based retrieval by itself is quite low. One major reason is that it is likely to find relevant documents that other systems did not find, and which therefore were not judged under the TREC ‘pooling’ judgment method. This means that MORAG’s performance is probably underrated. See the paper for more details.
  19. We attempted to estimate how much further potential future work may uncover. An exhaustive search over all feature subsets provides such an estimate, and it shows considerable potential for further work.
  20. The graphs are not far apart. An interesting trend is that beyond a certain threshold, adding more pseudo-relevant documents harms performance as their relevance becomes less accurate. This does not happen when true relevant documents are used, which indicates that label inaccuracy is indeed the cause.
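
The pseudo-relevance feedback step (note 14) and the IG filter (slide 62, note 16) can be illustrated with a small sketch. It rests on several assumptions that the slides do not state: each example document is reduced to the set of concepts it triggers, negative examples are taken from the bottom of the BOW ranking, and a concept is kept only if it occurs in a larger share of positive than of negative examples. All function names are hypothetical.

```python
import math
from typing import List, Set, Tuple


def pseudo_labels(ranked: List[str], k_pos: int, k_neg: int) -> Tuple[List[str], List[str]]:
    """Treat the top of a BOW ranking as positive and the bottom as negative (assumption)."""
    positives = ranked[:k_pos]
    negatives = ranked[max(k_pos, len(ranked) - k_neg):]  # never overlaps the positives
    return positives, negatives


def entropy(pos: int, neg: int) -> float:
    """Binary entropy (in bits) of a set with pos positive and neg negative items."""
    total = pos + neg
    h = 0.0
    for n in (pos, neg):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h


def information_gain(concept: str, positives: List[Set[str]], negatives: List[Set[str]]) -> float:
    """Entropy reduction obtained by splitting the examples on presence of the concept."""
    total = len(positives) + len(negatives)
    if total == 0:
        return 0.0
    p_with = sum(concept in d for d in positives)
    n_with = sum(concept in d for d in negatives)
    p_without = len(positives) - p_with
    n_without = len(negatives) - n_with
    after = ((p_with + n_with) / total) * entropy(p_with, n_with) \
        + ((p_without + n_without) / total) * entropy(p_without, n_without)
    return entropy(len(positives), len(negatives)) - after


def select_concepts(query_concepts: List[str], positives: List[Set[str]],
                    negatives: List[Set[str]], keep: int) -> List[str]:
    """Keep the highest-IG query concepts that lean towards the positive examples."""
    def leans_positive(c: str) -> bool:
        pos_rate = sum(c in d for d in positives) / max(len(positives), 1)
        neg_rate = sum(c in d for d in negatives) / max(len(negatives), 1)
        return pos_rate > neg_rate

    candidates = [c for c in query_concepts if leans_positive(c)]
    candidates.sort(key=lambda c: information_gain(c, positives, negatives), reverse=True)
    return candidates[:keep]
```

The extra positive-share check is there because information gain is symmetric: a concept that appears only in negative examples separates the two sets just as well as one that appears only in positive examples, yet only the latter is useful for enriching the query.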