My master's thesis seminar at the Technion, summarizing my research work, which was partly published in an AAAI-08 paper and has now been submitted to TOIS. Download and read the notes for more details. Comments/questions are very welcome!
5. Problem: retrieval misses TREC document LA071689-0089 “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday." for TREC topic #411: salvaging shipwreck treasure
6. The vocabulary problem. Identity: syntax (tokenization, stemming…) [but also shipping/treasurer]. Similarity: synonyms (WordNet etc.) [but also deliver/scavenge/relieve]; synonymy / polysemy. Relatedness: semantics / world knowledge (???). Query “salvaging shipwreck treasure” vs. “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
7. Concept-based retrieval: both the query “salvaging shipwreck treasure” and the document “ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday." are mapped into a space of concepts before matching.
11. Explicit Semantic Analysis (ESA). Wikipedia is viewed as an ontology: a collection of ~1M concepts (e.g. World War II, Panthera, Jane Fonda, Island).
12. Explicit Semantic Analysis (ESA). Every Wikipedia article represents a concept, e.g. Panthera. Article words are associated with the concept via TF·IDF: Cat [0.92], Leopard [0.84], Roar [0.77].
14. Explicit Semantic Analysis (ESA). The semantics of a word is the vector of its associations with Wikipedia concepts, e.g. cat → Panthera [0.92], Cat [0.95], Jane Fonda [0.07].
15. Explicit Semantic Analysis (ESA). The semantics of a text fragment is the average vector (centroid) of the semantics of its words. In practice this yields disambiguation: “mouse” alone triggers Mouse (rodent) [0.91], Mouse (computing) [0.84], Mickey Mouse [0.81], John Steinbeck [0.17]; “button” alone triggers Button [0.93], Dick Button [0.84], Game Controller [0.32]; but the centroid of “mouse button” is dominated by Mouse (computing) [0.95], Drag-and-drop [0.91], Game Controller [0.64], Mouse (rodent) [0.56].
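The word-vector and centroid steps above can be sketched as follows. As on the slide, the concepts and weights are toy values for illustration only, not real ESA output:

```python
# Toy sketch of ESA representation (illustrative concepts and weights only).
# Each word maps to a vector of TF.IDF-like associations with Wikipedia
# concepts; a text fragment is the centroid (average) of its word vectors.

WORD_VECTORS = {
    "mouse":  {"Mouse (rodent)": 0.91, "Mouse (computing)": 0.84, "Mickey Mouse": 0.81},
    "button": {"Button": 0.93, "Dick Button": 0.84, "Mouse (computing)": 0.32},
}

def esa_centroid(text):
    """Average the ESA vectors of the known words in `text`."""
    words = [w for w in text.lower().split() if w in WORD_VECTORS]
    centroid = {}
    for w in words:
        for concept, weight in WORD_VECTORS[w].items():
            centroid[concept] = centroid.get(concept, 0.0) + weight / len(words)
    return centroid

vec = esa_centroid("mouse button")
# Concepts shared by both words rise to the top of the centroid:
top = max(vec, key=vec.get)  # "Mouse (computing)"
```

Even with these toy numbers, the computing sense wins for "mouse button" because it is the only concept reinforced by both words, which is the disambiguation effect the slide illustrates.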
16. MORAG*: An ESA-based information retrieval algorithm *MORAG: Flail in Hebrew “Concept-based feature generation and selection for information retrieval”, AAAI-2008
18. Problem: document (in)coherence. TREC document LA120790-0036 REFERENCE BOOKS SPEAK VOLUMES TO KIDS; With the school year in high gear, it's a good time to consider new additions to children's home reference libraries… …Also new from Pharos-World Almanac: "The World Almanac InfoPedia," a single-volume visual encyclopedia designed for ages 8 to 16… …"The Doubleday Children's Encyclopedia," designed for youngsters 7 to 11, bridges the gap between single-subject picture books and formal reference books… …"The Lost Wreck of the Isis" by Robert Ballard is the latest adventure in the Time Quest Series from Scholastic-Madison Press ($15.95 hardcover). Designed for children 8 to 12, it tells the story of Ballard's 1988 discovery of an ancient Roman shipwreck deep in the Mediterranean Sea… The document is judged relevant for topic 411 due to one relevant passage in it. This is not an issue in BOW retrieval, where words are indexed independently; how to deal with it in concept-based retrieval? Concepts generated for this document will average to the books/children concepts and lose the shipwreck mention…
19. Solution: split into passages. ConceptScore(d) = ConceptScore(full-doc) + max_{passage ∈ d} ConceptScore(passage). Index both the full document and its passages. Best performance is achieved by fixed-length overlapping sliding windows.
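A minimal sketch of this scoring rule, with hypothetical helper names and a plain dot product standing in for the actual concept similarity:

```python
# Sketch of the slide's scoring rule: combine the full-document concept score
# with the score of the best-matching passage, so that a single relevant
# passage is not averaged away by the rest of the document.

def passage_windows(tokens, size=50, step=25):
    """Fixed-length overlapping sliding windows over the document tokens."""
    return [tokens[i:i + size] for i in range(0, max(1, len(tokens) - size + 1), step)]

def concept_score(query_vec, doc_vec):
    """Dot product between sparse concept vectors (stand-in similarity)."""
    return sum(w * doc_vec.get(c, 0.0) for c, w in query_vec.items())

def morag_concept_score(query_vec, full_doc_vec, passage_vecs):
    # ConceptScore(d) = ConceptScore(full-doc) + max over passages
    return concept_score(query_vec, full_doc_vec) + max(
        concept_score(query_vec, p) for p in passage_vecs
    )

# Toy example in the spirit of slide 18: the document as a whole is about
# books/encyclopedias, but one passage is about the shipwreck.
score = morag_concept_score(
    {"Shipwreck": 1.0},
    {"Shipwreck": 0.1, "Encyclopedia": 0.9},
    [{"Encyclopedia": 0.9}, {"Shipwreck": 0.8}],
)
```

The max over passages is what rescues the LA120790-0036 case: the shipwreck passage scores high on its own even though the document centroid is dominated by children's-books concepts.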
59. “Economy” is not mentioned, but TF·IDF of “Estonia” is strong enough to trigger this concept on its own…
60. Problem: selecting query features. Selection could remove noisy ESA concepts; however, the IR task provides no training data. We focus on query concepts: the query is short and noisy, while feature selection at indexing time lacks context. A utility function U(+|−) requires a target measure and a training set. Pipeline: f = ESA(q) → Filter → f′.
62. ESA feature selection methods. IG (filter): calculate each feature's Information Gain in separating positive from negative examples; keep the top-scoring features. RV (filter): add concepts from the positive examples to the candidate features, and re-weight all features based on their weights in the examples. IIG (wrapper): find the subset of features that best separates positive from negative examples, using heuristic search.
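The IG filter can be sketched as follows. This is a generic information-gain computation over concept presence/absence in the example documents, not the paper's actual code:

```python
# Sketch of the IG filter: score each candidate concept by the information
# gain of its presence/absence in splitting (pseudo-)positive from
# (pseudo-)negative example documents.
import math

def entropy(pos, neg):
    """Binary entropy of a pos/neg split, in bits."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for n in (pos, neg):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h

def information_gain(concept, positives, negatives):
    """positives/negatives: lists of concept sets, one per example document."""
    p_in = sum(concept in d for d in positives)
    n_in = sum(concept in d for d in negatives)
    p_out, n_out = len(positives) - p_in, len(negatives) - n_in
    total = len(positives) + len(negatives)
    before = entropy(len(positives), len(negatives))
    after = ((p_in + n_in) / total) * entropy(p_in, n_in) \
          + ((p_out + n_out) / total) * entropy(p_out, n_out)
    return before - after

# Toy examples: "Shipwreck" appears only in positives (high gain);
# "Rome" appears equally in both sets (no gain).
positives = [{"Shipwreck", "Rome"}, {"Shipwreck"}]
negatives = [{"Encyclopedia"}, {"Encyclopedia", "Rome"}]
```

This also matches the later note about feature utility: a concept appearing in both sets (or in neither) scores near zero, while one concentrated in the positives scores high.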
87. Morag evaluation. Significant performance improvement over our own baseline and also over the top-performing TREC-8 BOW baselines. Concept-based performance by itself is quite low; a major reason is the TREC ‘pooling’ method, which implies that relevant documents found only by Morag were never judged as relevant…
90. Conclusion. Morag: a new methodology for concept-based information retrieval. Documents and queries are enhanced with Wikipedia concepts; informative features are selected using pseudo-relevance feedback; the generated features improve the performance of BOW-based systems.
This is a relevant document for this TREC query that is not retrieved by a standard BOW system: none of the query keywords appear in the document.
These methods deal mainly with synonymy, and each has its issues: stemming loses nuances; tokenization may create words the author did not intend; synonyms may intensify ambiguity (polysemy). Moreover, mapping “treasure” to “ancient artifacts found in a sunken Roman ship” requires significant world knowledge that these methods cannot offer.
The promise of concept-based retrieval is that by transforming to a domain of concepts and performing retrieval in it rather than in the domain of words, the previously described problems will be dramatically reduced.
Existing concept-based representation approaches. KeyConcept is the most similar to ESA, but: 1) it uses a very small ontology (1,564 concepts); 2) its query processing is manual.
ESA can also be generated from other knowledge sources (it was successfully applied to the ODP), but recent papers have focused on Wikipedia, which proved the best fit.
These vectors are for illustration only, actual concepts and weights are different in real life (so don’t try the maths…)
First results were published in AAAI 2008
The constraint is due to the very large number of concepts per vector, which can easily inflate the index to a huge scale.
There is enough overlap between the concepts of the query and the target document: the document is retrieved despite having no keyword match!
However, we also seem to have false positives, causing results to be far from optimal. These (and the previous slide's) are the actual top-10 concepts generated for these texts.
Going back to the ESA classifier generation, we see why Baltic Sea was triggered by the query (although we still consider it not relevant enough)
Still, the bottom line is that we would prefer this not to happen. One option is to change how concepts are generated for multi-word inputs; in this research we decided not to make any changes to the ESA mechanism itself.
Indeed, when applying ESA to text categorization, training data was crucial in removing noisy features…
Relevance feedback is the process where a user assigns relevance labels to retrieved documents. ‘Pseudo’ means we let the system decide: the top-ranked documents are simply assumed to be relevant. Naturally this is less accurate, but better than no data at all.
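A minimal sketch of this pseudo-relevance step, assuming an initial BOW ranking is available (the function name and cutoff parameters are my own, not the paper's):

```python
# Sketch of pseudo-relevance feedback: run an initial retrieval, then treat
# the top-k results as positive examples and the following lower-ranked
# results as negative examples for feature selection. No user labels needed.

def pseudo_feedback(ranked_doc_ids, k_pos=10, k_neg=100):
    """Split an initial ranking into pseudo-positive and pseudo-negative sets."""
    positives = ranked_doc_ids[:k_pos]
    negatives = ranked_doc_ids[k_pos:k_pos + k_neg]
    return positives, negatives

pos, neg = pseudo_feedback(list(range(200)), k_pos=10, k_neg=100)
```

The resulting sets can then feed a utility function such as the IG filter; the trade-off noted above is that larger `k_pos` adds training signal but also adds documents that are less likely to actually be relevant.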
More details on these actual methods are given in the paper.
Less relevant features tend either to appear in both the positive and negative example sets, or in neither; useful features appear mostly in the positive examples.
TREC is the most comprehensive and well-studied IR benchmark. We used two datasets, plus a third (TREC-7) for parameter tuning. These graphs show the impact of feature selection, hence they show the performance of the concept-based subsystem alone.
The full MORAG system results. Parameter tuning works well, and the improvement is most apparent when the baseline is weaker. Note that concept-based retrieval by itself scores quite low; one major reason is that it often finds relevant documents that other systems did not, and which were therefore never judged under the TREC ‘pooling’ method. That means MORAG's performance is probably underrated. See the paper for more details.
We attempted to estimate what future work may uncover. An exhaustive search over all feature subsets provides such an estimate, and it shows substantial room for further improvement.
The graphs are not far apart. An interesting trend: beyond a certain threshold, adding more pseudo-relevant documents harms performance, as their relevance becomes less accurate; with truly relevant documents this does not happen, which supports that explanation.