The document discusses different models for search systems, including Boolean retrieval, vector space models, and latent semantic indexing. Boolean retrieval represents documents and queries with descriptors and uses Boolean logic for matching. Vector space models represent documents and queries as vectors in a multidimensional space based on terms and calculate similarity between vectors for matching. Latent semantic indexing performs further matrix manipulation on the vector space to capture word dependencies and project vectors into a smaller, denser space. The document also discusses evaluating search effectiveness using measures like precision, recall, and F-score on test collections.
Under the Hood of Your Favorite Search System
1. What’s Under the Hood of Your Favorite Search System?
   Ellen Voorhees
   ellen.voorhees@nist.gov
   The Advanced E-Discovery Institute, November 12-13, 2009
2. So you want to build a search engine
   - What is the collection to be searched?
   - How will the content (text, other media) be represented? [indexing]
   - How will the information need be represented? [query language]
   - How will the respective representations be matched? [retrieval model]
   - How effective is the search?
3. Boolean Retrieval
   The model
   - documents represented by descriptors
     - descriptors were originally manually assigned concepts from a controlled vocabulary
     - modern implementations generally use the words in the text as descriptors
   - information need represented by descriptors structured with Boolean operators
     - modern implementations include more operators than just AND, OR, NOT
   - a match occurs if and only if the document satisfies the Boolean expression
     - “fuzzy match” systems use descriptor weights and relax the strict binary interpretation
   Pros and cons
   - good: transparency (it is clear exactly why a document was retrieved)
   - bad: little control over retrieved-set size; no ranking; searchers must learn the query language
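The strict-match behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not a real implementation: the documents and the query are invented, and descriptors are simply the words in each text.

```python
# Invented example documents; each is indexed as its set of word descriptors.
docs = {
    "d1": "email retention policy for litigation hold",
    "d2": "litigation hold released after settlement",
    "d3": "email archive migration project",
}
index = {doc_id: set(text.split()) for doc_id, text in docs.items()}

# Query: (email AND litigation) NOT settlement
def query(terms):
    return "email" in terms and "litigation" in terms and "settlement" not in terms

hits = sorted(doc_id for doc_id, terms in index.items() if query(terms))
print(hits)  # ['d1']
```

Note that the result is an unordered set of exact matches: a document either satisfies the expression or it does not, with no ranking, which is exactly the limitation the slide lists under "bad".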
4. Vector Space Model
   The model
   - documents represented as vectors in N-dimensional space, where N is the number of ‘terms’ in the document set
     - a term is usually a word (stem), but might be a phrase or a thesaurus class
     - terms are weighted based on the frequency and distribution of their occurrences
   - information need is natural-language text mapped into the same space
   - matching is similarity between query and document vectors
     - example similarity: cosine of the angle between the vectors
     - allows documents to be ranked by decreasing similarity
   Pros and cons
   - good: less brittle than pure Boolean
   - bad: less transparency (depending on the weights, a document with few query terms can be ranked higher than a document with many)
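The cosine-similarity ranking the slide mentions can be sketched as follows. The weighting here is raw term frequency for simplicity (real systems typically use frequency-and-distribution weights such as tf-idf), and the documents are invented.

```python
import math
from collections import Counter

def vectorize(text):
    # Raw term-frequency weights; a real system would also weight by
    # distribution across the collection (e.g. tf-idf).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = {
    "d1": "litigation hold for email retention",
    "d2": "email archive migration",
    "d3": "weather forecast for tomorrow",
}
query = vectorize("email litigation")

# Rank documents by decreasing similarity to the query.
ranked = sorted(docs, key=lambda d: cosine(vectorize(docs[d]), query),
                reverse=True)
print(ranked)  # ['d1', 'd2', 'd3']
```

Unlike the Boolean model, every document gets a score, so partial matches like d2 still appear in the ranking rather than being excluded outright.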
5. Vector Similarities
   Document-document similarity
   - docs are similar to the extent they contain the same terms
   - doc pairs with maximal similarity: detects duplicates
   - document clustering
     - cluster hypothesis: “Closely associated documents tend to be relevant to the same requests.”
     - thus, do retrieval by returning whole clusters, since there is usually much more information in a doc-doc comparison than in a doc-query comparison
   Term-term similarity
   - terms are similar to the extent they occur in the same documents
   - term clustering
     - query expansion
     - provides a bottom-up description of the document set

   Example term-document matrix (documents D1-D6 as rows, terms T1-T4 as columns; the final value of the D6 row is missing in the source):

         T1  T2  T3  T4 ...
   D1     5   0  33   0 ...
   D2     0   0   8   0 ...
   D3     1   4   0   2 ...
   D4     0   3   0   4 ...
   D5     0   1   0   0 ...
   D6     3   2   0   ...
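Both kinds of similarity fall out of the same term-document matrix: comparing rows gives doc-doc similarity, comparing columns gives term-term similarity. The sketch below uses the slide's matrix with unnormalized dot products as the similarity measure (cosine normalization would be the obvious refinement); the missing final value of D6 is assumed to be 0, which is an assumption, not something the slide states.

```python
# The slide's term-document matrix: rows = documents D1-D6,
# columns = terms T1-T4. The last value of D6 is missing in the
# source and is assumed to be 0 here.
M = [
    [5, 0, 33, 0],  # D1
    [0, 0,  8, 0],  # D2
    [1, 4,  0, 2],  # D3
    [0, 3,  0, 4],  # D4
    [0, 1,  0, 0],  # D5
    [3, 2,  0, 0],  # D6
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Document-document similarity: compare rows of M.
doc_doc = [[dot(M[i], M[j]) for j in range(len(M))] for i in range(len(M))]

# Term-term similarity: compare columns of M.
cols = list(zip(*M))
term_term = [[dot(cols[i], cols[j]) for j in range(len(cols))]
             for i in range(len(cols))]

print(doc_doc[0][1])    # 264: D1 and D2 are similar (both heavy in T3)
print(term_term[1][3])  # 20: T2 and T4 co-occur (in D3 and D4)
```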
6. Further Matrix Manipulation: Latent Semantic Indexing
   - Mathematically, the axes in a vector space are orthogonal to one another
     - so the vector space model technically assumes words occur in documents independently of any other words (which is nonsense)
     - this vector space is very large, and very sparse
   - Perform a singular value decomposition of the original matrix and select the first X eigenvectors as the new axes
     - X is chosen to be much smaller than the number of terms, producing a much smaller, denser vector space
     - project the document vectors into the new space
     - elements in a vector no longer correspond to words
     - the new axes capture some (but which?) dependencies among the original word occurrences
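The SVD-and-truncate step can be sketched with NumPy (assumed available). The matrix values are invented; the point is only the mechanics: decompose, keep the X largest singular values (here X = 2), and project the documents into the resulting dense, low-dimensional space.

```python
import numpy as np

# Invented term-document matrix: rows = 5 terms, columns = 4 documents.
A = np.array([
    [2, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 3, 0, 2],
    [0, 1, 0, 1],
    [1, 0, 2, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

X = 2  # keep only the X largest singular values (X << number of terms)
Uk, sk, Vtk = U[:, :X], s[:X], Vt[:X, :]

# Each document is now a dense X-dimensional vector (one column below);
# its elements no longer correspond to individual words.
docs_latent = np.diag(sk) @ Vtk
print(docs_latent.shape)  # (2, 4): 4 documents, each a 2-D vector
```

Query-document matching then proceeds as in the plain vector space model, but in the reduced space, where documents can score as similar even when they share no literal terms.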
11. too much variability for test collections to predict tight bounds

   [figure: overlap of the retrieved and relevant document sets]

   Precision = (number relevant retrieved) / (number retrieved)
   Recall = (number relevant retrieved) / (total relevant)
   F = (2 × Precision × Recall) / (Precision + Recall)
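The three measures above are straightforward to compute once the counts are in hand. The counts in this sketch are invented for illustration.

```python
def precision(num_rel_ret, num_ret):
    # Fraction of retrieved documents that are relevant.
    return num_rel_ret / num_ret

def recall(num_rel_ret, total_rel):
    # Fraction of all relevant documents that were retrieved.
    return num_rel_ret / total_rel

def f_score(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if p + r else 0.0

# Invented example: 20 documents retrieved, 12 of them relevant,
# 30 relevant documents in the whole collection.
p = precision(12, 20)  # 0.6
r = recall(12, 30)     # 0.4
print(round(f_score(p, r), 2))  # 0.48
```

Because F is a harmonic mean, it sits closer to the lower of the two values, so a system cannot score well by maximizing one measure at the expense of the other.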