My master's thesis seminar at the Technion, summarizing my research work, which was partly published in an AAAI-08 paper and has now been submitted to TOIS. Download and read the notes for more details. Comments/questions are very welcome!
5. Problem: retrieval misses TREC document LA071689-0089 “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday." for TREC topic #411: salvaging shipwreck treasure
6. The vocabulary problem. Identity: syntax (tokenization, stemming…) [but also shipping/treasurer]. Similarity: synonyms (WordNet etc.) [but also deliver/scavenge/relieve]; synonymy / polysemy. Relatedness: semantics / world knowledge (???). Query “salvaging shipwreck treasure” vs. “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
7. Concept-based retrieval: both the query “salvaging shipwreck treasure” and the document “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday." are mapped into a space of concepts before matching.
11. Explicit Semantic Analysis (ESA). Wikipedia is viewed as an ontology: a collection of ~1M concepts (e.g. World War II, Panthera, Jane Fonda, Island).
12. Explicit Semantic Analysis (ESA). Every Wikipedia article represents a concept, e.g. Panthera. Article words are associated with the concept via TF·IDF: Cat [0.92], Leopard [0.84], Roar [0.77].
14. Explicit Semantic Analysis (ESA). The semantics of a word is the vector of its associations with Wikipedia concepts, e.g. cat → Panthera [0.92], Cat [0.95], Jane Fonda [0.07].
15. Explicit Semantic Analysis (ESA). The semantics of a text fragment is the average vector (centroid) of the semantics of its words. In practice this yields disambiguation: “mouse” alone triggers Mouse (rodent) [0.91], Mouse (computing) [0.84], Mickey Mouse [0.81], John Steinbeck [0.17]; “button” alone triggers Button [0.93], Dick Button [0.84], Game Controller [0.32]; but the centroid of “mouse button” is dominated by Mouse (computing) [0.95], Drag-and-drop [0.91], Game Controller [0.64], Mouse (rodent) [0.56].
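The word-vector and centroid steps above can be sketched as follows. As on the slide, the concepts and weights are toy values for illustration only, not real ESA output:

```python
# Toy sketch of ESA representation (illustrative concepts and weights only).
# Each word maps to a vector of TF.IDF-like associations with Wikipedia
# concepts; a text fragment is the centroid (average) of its word vectors.

WORD_VECTORS = {
    "mouse":  {"Mouse (rodent)": 0.91, "Mouse (computing)": 0.84, "Mickey Mouse": 0.81},
    "button": {"Button": 0.93, "Dick Button": 0.84, "Mouse (computing)": 0.32},
}

def esa_centroid(text):
    """Average the ESA vectors of the known words in `text`."""
    words = [w for w in text.lower().split() if w in WORD_VECTORS]
    centroid = {}
    for w in words:
        for concept, weight in WORD_VECTORS[w].items():
            centroid[concept] = centroid.get(concept, 0.0) + weight / len(words)
    return centroid

vec = esa_centroid("mouse button")
# Concepts shared by both words rise to the top of the centroid:
top = max(vec, key=vec.get)  # "Mouse (computing)"
```

Even with these toy numbers, the computing sense wins for "mouse button" because it is the only concept reinforced by both words, which is the disambiguation effect the slide illustrates.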
16. MORAG*: An ESA-based information retrieval algorithm *MORAG: Flail in Hebrew “Concept-based feature generation and selection for information retrieval”, AAAI-2008
18. Problem: document (in)coherence. TREC document LA120790-0036 REFERENCE BOOKS SPEAK VOLUMES TO KIDS; With the school year in high gear, it's a good time to consider new additions to children's home reference libraries… …Also new from Pharos-World Almanac: "The World Almanac InfoPedia," a single-volume visual encyclopedia designed for ages 8 to 16… …"The Doubleday Children's Encyclopedia," designed for youngsters 7 to 11, bridges the gap between single-subject picture books and formal reference books… …"The Lost Wreck of the Isis" by Robert Ballard is the latest adventure in the Time Quest Series from Scholastic-Madison Press ($15.95 hardcover). Designed for children 8 to 12, it tells the story of Ballard's 1988 discovery of an ancient Roman shipwreck deep in the Mediterranean Sea… The document is judged relevant for topic 411 due to one relevant passage in it. This is not an issue in BOW retrieval, where words are indexed independently; how to deal with it in concept-based retrieval? Concepts generated for this document will average to the books/children concepts and lose the shipwreck mention…
19. Solution: split into passages. ConceptScore(d) = ConceptScore(full-doc) + max_{passage ∈ d} ConceptScore(passage). Index both the full document and its passages. Best performance is achieved by fixed-length overlapping sliding windows.
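A minimal sketch of this scoring rule, with hypothetical helper names and a plain dot product standing in for the actual concept similarity:

```python
# Sketch of the slide's scoring rule: combine the full-document concept score
# with the score of the best-matching passage, so that a single relevant
# passage is not averaged away by the rest of the document.

def passage_windows(tokens, size=50, step=25):
    """Fixed-length overlapping sliding windows over the document tokens."""
    return [tokens[i:i + size] for i in range(0, max(1, len(tokens) - size + 1), step)]

def concept_score(query_vec, doc_vec):
    """Dot product between sparse concept vectors (stand-in similarity)."""
    return sum(w * doc_vec.get(c, 0.0) for c, w in query_vec.items())

def morag_concept_score(query_vec, full_doc_vec, passage_vecs):
    # ConceptScore(d) = ConceptScore(full-doc) + max over passages
    return concept_score(query_vec, full_doc_vec) + max(
        concept_score(query_vec, p) for p in passage_vecs
    )

# Toy example in the spirit of slide 18: the document as a whole is about
# books/encyclopedias, but one passage is about the shipwreck.
score = morag_concept_score(
    {"Shipwreck": 1.0},
    {"Shipwreck": 0.1, "Encyclopedia": 0.9},
    [{"Encyclopedia": 0.9}, {"Shipwreck": 0.8}],
)
```

The max over passages is what rescues the LA120790-0036 case: the shipwreck passage scores high on its own even though the document centroid is dominated by children's-books concepts.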
59. “Economy” is not mentioned, but TF·IDF of “Estonia” is strong enough to trigger this concept on its own…
60. Problem: selecting query features. Selection could remove noisy ESA concepts; however, the IR task provides no training data. We focus on query concepts: the query is short and noisy, while feature selection at indexing time lacks context. A utility function U(+|−) requires a target measure and a training set. Pipeline: f = ESA(q) → Filter → f′.
62. ESA feature selection methods. IG (filter): calculate each feature's Information Gain in separating positive from negative examples; keep the top-scoring features. RV (filter): add concepts from the positive examples to the candidate features, and re-weight all features based on their weights in the examples. IIG (wrapper): find the subset of features that best separates positive from negative examples, using heuristic search.
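The IG filter can be sketched as follows. This is a generic information-gain computation over concept presence/absence in the example documents, not the paper's actual code:

```python
# Sketch of the IG filter: score each candidate concept by the information
# gain of its presence/absence in splitting (pseudo-)positive from
# (pseudo-)negative example documents.
import math

def entropy(pos, neg):
    """Binary entropy of a pos/neg split, in bits."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for n in (pos, neg):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h

def information_gain(concept, positives, negatives):
    """positives/negatives: lists of concept sets, one per example document."""
    p_in = sum(concept in d for d in positives)
    n_in = sum(concept in d for d in negatives)
    p_out, n_out = len(positives) - p_in, len(negatives) - n_in
    total = len(positives) + len(negatives)
    before = entropy(len(positives), len(negatives))
    after = ((p_in + n_in) / total) * entropy(p_in, n_in) \
          + ((p_out + n_out) / total) * entropy(p_out, n_out)
    return before - after

# Toy examples: "Shipwreck" appears only in positives (high gain);
# "Rome" appears equally in both sets (no gain).
positives = [{"Shipwreck", "Rome"}, {"Shipwreck"}]
negatives = [{"Encyclopedia"}, {"Encyclopedia", "Rome"}]
```

This also matches the later note about feature utility: a concept appearing in both sets (or in neither) scores near zero, while one concentrated in the positives scores high.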
87. Morag evaluation. Significant performance improvement over our own baseline and also over the top-performing TREC-8 BOW baselines. Concept-based performance by itself is quite low; a major reason is the TREC ‘pooling’ method, which implies that relevant documents found only by Morag were never judged as relevant…
90. Conclusion. Morag: a new methodology for concept-based information retrieval. Documents and queries are enhanced with Wikipedia concepts; informative features are selected using pseudo-relevance feedback; the generated features improve the performance of BOW-based systems.
This is a relevant document for this TREC query that is not retrieved by a standard BOW system: none of the query keywords appear in the document.
These methods deal mainly with synonymy, and each has its issues: stemming loses nuances; tokenization may create words the author did not intend; synonyms may intensify ambiguity (polysemy). Moreover, mapping “treasure” to “ancient artifacts found in a sunken Roman ship” requires significant world knowledge that these methods cannot offer.
The promise of concept-based retrieval is that by transforming to a domain of concepts and performing retrieval in it rather than in the domain of words, the previously described problems will be dramatically reduced.
Existing concept-based representation approaches. KeyConcept is the most similar to ESA, but: 1) it uses a very small ontology (1,564 concepts); 2) its query processing is manual.
ESA can also be generated from other knowledge sources (it was successfully applied to the ODP), but recent papers have focused on Wikipedia, which proved the best fit.
These vectors are for illustration only, actual concepts and weights are different in real life (so don’t try the maths…)
First results were published in AAAI 2008
The constraint is due to the very large number of concepts per vector, which can easily inflate the index to a huge scale.
There is enough overlap between the concepts of the query and the target document: the document is retrieved despite having no keyword match!
However, we also seem to have false positives, causing results to be far from optimal. These (and the previous slide's) are the actual top-10 concepts generated for these texts.
Going back to the ESA classifier generation, we see why Baltic Sea was triggered by the query (although we still consider it not relevant enough)
Still, the bottom line is that we would prefer this not to happen. One option is to change how concepts are generated for multi-word inputs; in this research we decided not to make any changes to the ESA mechanism itself.
Indeed, when applying ESA to text categorization, training data was crucial in removing noisy features…
Relevance feedback is the process where a user assigns relevance labels to retrieved documents. ‘Pseudo’ means we let the system decide: the top-ranked documents are simply assumed to be relevant. Naturally this is less accurate, but better than no data at all.
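A minimal sketch of this pseudo-relevance step, assuming an initial BOW ranking is available (the function name and cutoff parameters are my own, not the paper's):

```python
# Sketch of pseudo-relevance feedback: run an initial retrieval, then treat
# the top-k results as positive examples and the following lower-ranked
# results as negative examples for feature selection. No user labels needed.

def pseudo_feedback(ranked_doc_ids, k_pos=10, k_neg=100):
    """Split an initial ranking into pseudo-positive and pseudo-negative sets."""
    positives = ranked_doc_ids[:k_pos]
    negatives = ranked_doc_ids[k_pos:k_pos + k_neg]
    return positives, negatives

pos, neg = pseudo_feedback(list(range(200)), k_pos=10, k_neg=100)
```

The resulting sets can then feed a utility function such as the IG filter; the trade-off noted above is that larger `k_pos` adds training signal but also adds documents that are less likely to actually be relevant.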
More details on these actual methods are given in the paper.
Less relevant features tend either to appear in both the positive and negative example sets, or in neither; useful features appear mostly in the positive examples.
TREC is the most comprehensive and well-studied IR benchmark. We used two datasets, plus a third (TREC-7) for parameter tuning. These graphs show the impact of feature selection, hence they show the performance of the concept-based subsystem alone.
The full MORAG system results. Parameter tuning works well, and the improvement is most apparent when the baseline is weaker. Note that concept-based retrieval by itself scores quite low; one major reason is that it often finds relevant documents that other systems did not, and which were therefore never judged under the TREC ‘pooling’ method. That means MORAG's performance is probably underrated. See the paper for more details.
We attempted to estimate what future work may uncover. An exhaustive search over all feature subsets provides such an estimate, and it shows substantial room for further improvement.
The graphs are not far apart. An interesting trend: beyond a certain threshold, adding more pseudo-relevant documents harms performance, as their relevance becomes less accurate; with truly relevant documents this does not happen, which supports that explanation.