The document discusses different models for search systems, including Boolean retrieval, vector space models, and latent semantic indexing. Boolean retrieval represents documents and queries with descriptors and uses Boolean logic for matching. Vector space models represent documents and queries as vectors in a multidimensional space based on terms and calculate similarity between vectors for matching. Latent semantic indexing performs further matrix manipulation on the vector space to capture word dependencies and project vectors into a smaller, denser space. The document also discusses evaluating search effectiveness using measures like precision, recall, and F-score on test collections.
Under the Hood of Your Favorite Search System
1. What’s Under the Hood of Your Favorite Search System?
   Ellen Voorhees
   ellen.voorhees@nist.gov
   The Advanced E-Discovery Institute, November 12-13, 2009
2. So you want to build a search engine
   - What is the collection to be searched?
   - How will the content (text, other media) be represented? [indexing]
   - How will the information need be represented? [query language]
   - How will the respective representations be matched? [retrieval model]
   - How effective is the search?
3. Boolean Retrieval
   The model
   - documents represented by descriptors
     - descriptors were originally manually assigned concepts from a controlled vocabulary
     - modern implementations generally use the words in the text as descriptors
   - information need represented by descriptors structured with Boolean operators
     - modern implementations include more operators than just AND, OR, NOT
   - a match occurs if and only if the document satisfies the Boolean expression
     - “fuzzy match” systems use descriptor weights and relax the strict binary interpretation
   Pros and cons
   - good: transparency (it is clear exactly why a document was retrieved)
   - bad: little control over retrieved-set size; no ranking; searchers must learn the query language
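The strict-match behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not a real implementation: the documents and the query are invented, and descriptors are simply the words in each text.

```python
# Invented example documents; each is indexed as its set of word descriptors.
docs = {
    "d1": "email retention policy for litigation hold",
    "d2": "litigation hold released after settlement",
    "d3": "email archive migration project",
}
index = {doc_id: set(text.split()) for doc_id, text in docs.items()}

# Query: (email AND litigation) NOT settlement
def query(terms):
    return "email" in terms and "litigation" in terms and "settlement" not in terms

hits = sorted(doc_id for doc_id, terms in index.items() if query(terms))
print(hits)  # ['d1']
```

Note that the result is an unordered set of exact matches: a document either satisfies the expression or it does not, with no ranking, which is exactly the limitation the slide lists under "bad".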
4. Vector Space Model
   The model
   - documents represented as vectors in N-dimensional space, where N is the number of ‘terms’ in the document set
     - a term is usually a word (stem), but might be a phrase or a thesaurus class
     - terms are weighted based on the frequency and distribution of their occurrences
   - information need is natural-language text mapped into the same space
   - matching is similarity between query and document vectors
     - example similarity: cosine of the angle between the vectors
     - allows documents to be ranked by decreasing similarity
   Pros and cons
   - good: less brittle than pure Boolean
   - bad: less transparency (depending on the weights, a document with few query terms can be ranked higher than a document with many)
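The cosine-similarity ranking the slide mentions can be sketched as follows. The weighting here is raw term frequency for simplicity (real systems typically use frequency-and-distribution weights such as tf-idf), and the documents are invented.

```python
import math
from collections import Counter

def vectorize(text):
    # Raw term-frequency weights; a real system would also weight by
    # distribution across the collection (e.g. tf-idf).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = {
    "d1": "litigation hold for email retention",
    "d2": "email archive migration",
    "d3": "weather forecast for tomorrow",
}
query = vectorize("email litigation")

# Rank documents by decreasing similarity to the query.
ranked = sorted(docs, key=lambda d: cosine(vectorize(docs[d]), query),
                reverse=True)
print(ranked)  # ['d1', 'd2', 'd3']
```

Unlike the Boolean model, every document gets a score, so partial matches like d2 still appear in the ranking rather than being excluded outright.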
5. Vector Similarities
   Document-document similarity
   - docs are similar to the extent they contain the same terms
   - doc pairs with maximal similarity: detects duplicates
   - document clustering
     - cluster hypothesis: “Closely associated documents tend to be relevant to the same requests.”
     - thus, do retrieval by returning whole clusters, since there is usually much more information in a doc-doc comparison than in a doc-query comparison
   Term-term similarity
   - terms are similar to the extent they occur in the same documents
   - term clustering
     - query expansion
     - provides a bottom-up description of the document set

   Example term-document matrix (documents D1-D6 as rows, terms T1-T4 as columns; the final value of the D6 row is missing in the source):

         T1  T2  T3  T4 ...
   D1     5   0  33   0 ...
   D2     0   0   8   0 ...
   D3     1   4   0   2 ...
   D4     0   3   0   4 ...
   D5     0   1   0   0 ...
   D6     3   2   0   ...
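Both kinds of similarity fall out of the same term-document matrix: comparing rows gives doc-doc similarity, comparing columns gives term-term similarity. The sketch below uses the slide's matrix with unnormalized dot products as the similarity measure (cosine normalization would be the obvious refinement); the missing final value of D6 is assumed to be 0, which is an assumption, not something the slide states.

```python
# The slide's term-document matrix: rows = documents D1-D6,
# columns = terms T1-T4. The last value of D6 is missing in the
# source and is assumed to be 0 here.
M = [
    [5, 0, 33, 0],  # D1
    [0, 0,  8, 0],  # D2
    [1, 4,  0, 2],  # D3
    [0, 3,  0, 4],  # D4
    [0, 1,  0, 0],  # D5
    [3, 2,  0, 0],  # D6
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Document-document similarity: compare rows of M.
doc_doc = [[dot(M[i], M[j]) for j in range(len(M))] for i in range(len(M))]

# Term-term similarity: compare columns of M.
cols = list(zip(*M))
term_term = [[dot(cols[i], cols[j]) for j in range(len(cols))]
             for i in range(len(cols))]

print(doc_doc[0][1])    # 264: D1 and D2 are similar (both heavy in T3)
print(term_term[1][3])  # 20: T2 and T4 co-occur (in D3 and D4)
```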
6. Further Matrix Manipulation: Latent Semantic Indexing
   - Mathematically, the axes in a vector space are orthogonal to one another
     - so the vector space model technically assumes words occur in documents independently of any other words (which is nonsense)
     - this vector space is very large, and very sparse
   - Perform a singular value decomposition of the original matrix and select the first X eigenvectors as the new axes
     - X is chosen to be much smaller than the number of terms, producing a much smaller, denser vector space
     - project the document vectors into the new space
     - elements in a vector no longer correspond to words
     - the new axes capture some (but which?) dependencies among the original word occurrences
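The SVD-and-truncate step can be sketched with NumPy (assumed available). The matrix values are invented; the point is only the mechanics: decompose, keep the X largest singular values (here X = 2), and project the documents into the resulting dense, low-dimensional space.

```python
import numpy as np

# Invented term-document matrix: rows = 5 terms, columns = 4 documents.
A = np.array([
    [2, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 3, 0, 2],
    [0, 1, 0, 1],
    [1, 0, 2, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

X = 2  # keep only the X largest singular values (X << number of terms)
Uk, sk, Vtk = U[:, :X], s[:X], Vt[:X, :]

# Each document is now a dense X-dimensional vector (one column below);
# its elements no longer correspond to individual words.
docs_latent = np.diag(sk) @ Vtk
print(docs_latent.shape)  # (2, 4): 4 documents, each a 2-D vector
```

Query-document matching then proceeds as in the plain vector space model, but in the reduced space, where documents can score as similar even when they share no literal terms.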
11. too much variability for test collections to predict tight bounds

   [figure: overlap of the retrieved and relevant document sets]

   Precision = (number relevant retrieved) / (number retrieved)
   Recall = (number relevant retrieved) / (total relevant)
   F = (2 × Precision × Recall) / (Precision + Recall)
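The three measures above are straightforward to compute once the counts are in hand. The counts in this sketch are invented for illustration.

```python
def precision(num_rel_ret, num_ret):
    # Fraction of retrieved documents that are relevant.
    return num_rel_ret / num_ret

def recall(num_rel_ret, total_rel):
    # Fraction of all relevant documents that were retrieved.
    return num_rel_ret / total_rel

def f_score(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if p + r else 0.0

# Invented example: 20 documents retrieved, 12 of them relevant,
# 30 relevant documents in the whole collection.
p = precision(12, 20)  # 0.6
r = recall(12, 30)     # 0.4
print(round(f_score(p, r), 2))  # 0.48
```

Because F is a harmonic mean, it sits closer to the lower of the two values, so a system cannot score well by maximizing one measure at the expense of the other.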