Introduction to Information Retrieval: Language models for information retrieval, by C.D. Manning, P. Raghavan, and H. Schütze. Presentation by Dustin Smith, The University of Texas at Austin School of Information, dustin.smith@utexas.edu. 10/3/2011. INF384H / CS395T: Concepts of Information Retrieval
Christopher Manning: background
  • BA, Australian National University, 1989 (majors in mathematics, computer science and linguistics)
  • PhD, Stanford, Linguistics, 1995
  • Asst Professor, Carnegie Mellon University, Computational Linguistics Program, 1994-96
  • Lecturer, University of Sydney, Dept of Linguistics, 1996-99
  • Asst Professor, Stanford University, Depts of Computer Science and Linguistics, 1999-2006
  • Current: Assoc Professor, Stanford University, Depts of Linguistics and Computer Science
Prabhakar Raghavan: background
  • Undergraduate degree in electrical engineering from IIT Madras
  • PhD in computer science from UC Berkeley
  • Current: works at Yahoo! Labs and is a Consulting Professor of Computer Science at Stanford University
Hinrich Schütze: background
  • Technical University of Braunschweig: Vordiplom Mathematik, Vordiplom Informatik
  • University of Stuttgart: Diplom Informatik (MSCS)
  • Stanford University: Ph.D., Computational Linguistics
  • Current: Chair of Theoretical Computational Linguistics, Institute for Natural Language Processing at the University of Stuttgart
Chapter/Presentation Outline
  • Introduction to the concept of Language Models
      • Finite automata and language models
      • Types of language models
      • Multinomial distributions over words
  • Description of the Query Likelihood Model
      • Using query likelihood language models in IR
      • Estimating the query generation probability
      • Ponte and Croft’s experiments
  • Comparison of the language modeling approach to IR against other approaches to IR
  • Description of various extensions to the language modeling approach
Language Models
  • Based on the idea that a document is a good match for a query if the document's language model is likely to generate the query.
  • An alternative to the traditional approach of modeling the query-document probability directly.
Finite automata and language models (238)
  • The process is analogous for a document model.
  • Figure 12.2 represents a single node with a single distribution over terms such that $\sum_{t \in V} P(t) = 1$.
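The single-node model described above can be sketched in a few lines of Python. This is only an illustration: the vocabulary, the probabilities, and the helper names `is_valid_distribution` and `generate` are assumptions made for the example, not values from Figure 12.2.

```python
import random

# Hypothetical unigram model: one state with one distribution over the vocabulary V.
unigram_model = {"frog": 0.2, "toad": 0.2, "said": 0.3, "likes": 0.2, "that": 0.1}

def is_valid_distribution(model, tol=1e-9):
    """Check that term probabilities are non-negative and sum to 1."""
    return all(p >= 0 for p in model.values()) and abs(sum(model.values()) - 1.0) <= tol

def generate(model, length, seed=0):
    """Emit `length` terms, each drawn independently from the same distribution."""
    rng = random.Random(seed)
    terms = list(model)
    weights = [model[t] for t in terms]
    return rng.choices(terms, weights=weights, k=length)

sample = generate(unigram_model, length=5)
```

Because there is only one state, every term is drawn from the same distribution regardless of what was generated before it.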
Calculating phrase probability with stop/continue probability included (238)
  • This calculation is shown with stop probabilities, but in practice these are left out.
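A minimal sketch of the calculation with and without stop probabilities follows. The term probabilities, the stop probability of 0.2, and the function name `phrase_probability` are illustrative assumptions; the slide's own figures did not survive extraction.

```python
import math

def phrase_probability(model, terms, p_stop=None):
    """Probability of a non-empty term sequence under a unigram model.

    If p_stop is given, the model continues after each of the first n-1 terms
    and stops after the last one; otherwise only term probabilities are multiplied.
    """
    if not terms:
        raise ValueError("terms must be non-empty")
    emission = math.prod(model[t] for t in terms)
    if p_stop is None:
        return emission
    p_continue = 1.0 - p_stop
    return emission * p_continue ** (len(terms) - 1) * p_stop

# Hypothetical model and phrase, for illustration only.
model = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01, "likes": 0.02}
phrase = "frog said that toad likes frog".split()

without_stop = phrase_probability(model, phrase)
with_stop = phrase_probability(model, phrase, p_stop=0.2)
```

If every model being compared uses the same stop probability, the extra factor depends only on the phrase length, which is one plausible reason it can be left out when comparing models on a fixed query.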
Comparison of document models (239-240)
  • Given a query s = “frog said that toad likes that dog”, the probability under each of the two models is calculated by simply multiplying the term probabilities.
  • $P(s \mid M_1)$ scores higher than $P(s \mid M_2)$ because the query terms are, overall, more probable under $M_1$, so the product is greater.
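The comparison can be reproduced with a short sketch. The two partial models below are illustrative assumptions (the table from the slide did not survive extraction), and `sequence_probability` is a hypothetical helper name.

```python
import math

# Illustrative partial models: only the terms needed for the query are listed.
M1 = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01, "likes": 0.02, "dog": 0.005}
M2 = {"frog": 0.0002, "said": 0.03, "that": 0.04, "toad": 0.0001, "likes": 0.04, "dog": 0.01}

def sequence_probability(model, terms):
    """Multiply the per-term probabilities; unseen terms contribute probability 0."""
    return math.prod(model.get(t, 0.0) for t in terms)

s = "frog said that toad likes that dog".split()
p1 = sequence_probability(M1, s)
p2 = sequence_probability(M2, s)
```

With these assumed values, `M2` actually gives "likes" and "dog" a higher probability than `M1` does, but its much lower probabilities for "frog" and "toad" dominate the product.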
Types of language models (240)
  • Chain rule
  • Bigram Language Model
  • Section conclusion: which $M_d$ to use?
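The chain rule and the model types contrasted on this slide can be written out as follows. This is a sketch in standard notation for a four-term sequence; the sequence length and the labels $P_{\mathrm{uni}}$ and $P_{\mathrm{bi}}$ are assumptions, since the slide's own equations were lost in extraction.

```latex
\begin{align*}
% Chain rule: exact decomposition, each term conditioned on all earlier terms
P(t_1 t_2 t_3 t_4) &= P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_1 t_2)\, P(t_4 \mid t_1 t_2 t_3) \\
% Unigram model: all conditioning context is discarded
P_{\mathrm{uni}}(t_1 t_2 t_3 t_4) &= P(t_1)\, P(t_2)\, P(t_3)\, P(t_4) \\
% Bigram model: each term is conditioned on the previous term only
P_{\mathrm{bi}}(t_1 t_2 t_3 t_4) &= P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_2)\, P(t_4 \mid t_3)
\end{align*}
```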
The Query Likelihood Model
Using query likelihood language models in IR (242-243)
  • Using Bayes' rule: $P(d \mid q) = P(q \mid d)\, P(d) / P(q)$.
  • $P(q)$ is the same for all documents and $P(d)$ is often treated as uniform across documents, so ranking by $P(d \mid q)$ is equivalent to ranking by $P(q \mid d)$.
  • In the query likelihood model we construct a language model $M_d$ from each document.
  • Goal: to rank documents by $P(d \mid q)$, where the probability of a document is interpreted as the likelihood that it is relevant to the query.
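The ranking argument can be spelled out step by step. This is a sketch under the stated assumption of a uniform document prior; with a non-uniform $P(d)$ the last step would not hold.

```latex
\begin{align*}
P(d \mid q) &= \frac{P(q \mid d)\, P(d)}{P(q)} \\
            &\propto P(q \mid d)\, P(d) && \text{($P(q)$ does not depend on $d$)} \\
            &\propto P(q \mid d)        && \text{(assuming $P(d)$ is uniform over documents)}
\end{align*}
```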
Using query likelihood language models in IR (242-243)
  • Multinomial unigram language model: $P(q \mid M_d) = K_q \prod_{t \in V} P(t \mid M_d)^{\mathrm{tf}_{t,d}}$
  • $K_q$ is dropped because it is constant for a given query, so it does not affect the ranking.
Query generation process:
  1. Infer a LM for each document.
  2. Estimate $P(q \mid M_d)$, the probability of generating the query according to each of these document models.
  3. Rank the documents according to these probabilities.
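Steps 2 and 3 of the process above can be sketched as follows, with $K_q$ dropped and the product computed in log space to avoid underflow. The document models are hypothetical and given directly rather than inferred, the exponent is taken to be the count of each term in the query (an assumption about how to read the formula), and the function names are illustrative.

```python
import math
from collections import Counter

def log_query_likelihood(query_terms, model):
    """Log of the product of P(t | M_d) over query terms, with K_q dropped.

    Returns -inf when some query term has zero probability under the model.
    """
    score = 0.0
    for term, count in Counter(query_terms).items():
        p = model.get(term, 0.0)
        if p == 0.0:
            return float("-inf")
        score += count * math.log(p)
    return score

def rank(query_terms, document_models):
    """Return (doc_id, score) pairs sorted from most to least likely."""
    scored = [(doc_id, log_query_likelihood(query_terms, m))
              for doc_id, m in document_models.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical document models (step 1 is assumed to have been done already).
document_models = {
    "d1": {"frog": 0.2, "toad": 0.2, "pond": 0.6},
    "d2": {"frog": 0.05, "toad": 0.05, "dog": 0.9},
    "d3": {"dog": 0.5, "cat": 0.5},
}
ranking = rank(["frog", "toad", "frog"], document_models)
```

Working with log probabilities leaves the ranking unchanged because the logarithm is monotonic.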
Estimating the query generation probability (244)
  • $\mathrm{tf}_{t,d}$ is the raw term frequency of term t in document d.
  • $L_d$ is the number of tokens in document d.
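Given these two quantities, one common estimate is the maximum likelihood estimate $\mathrm{tf}_{t,d} / L_d$ for each term. The sketch below assumes that estimate (the slide's own formula did not survive extraction) and a non-empty document; the function names and the toy document are illustrative.

```python
from collections import Counter

def mle_model(tokens):
    """Estimate P(t | M_d) as tf_{t,d} / L_d for every term in a non-empty document."""
    tf = Counter(tokens)      # raw term frequencies tf_{t,d}
    length = len(tokens)      # document length L_d in tokens
    return {term: count / length for term, count in tf.items()}

def query_likelihood(query_terms, model):
    """Product of the estimated term probabilities; unseen terms contribute 0."""
    p = 1.0
    for t in query_terms:
        p *= model.get(t, 0.0)
    return p

doc = "frog said that toad likes frog".split()
model = mle_model(doc)

seen = query_likelihood(["frog", "toad"], model)    # both terms occur in the document
unseen = query_likelihood(["frog", "dog"], model)   # "dog" does not occur in the document
```

A single query term missing from the document drives the whole product to zero, which is the usual motivation for smoothing the estimate.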

Language Models for Information Retrieval

  • 1. Introduction to Information Retrieval:Language models for information retrievalby C.D. Manning, P. Raghavan, and H. Schutze. Presentation by Dustin Smith The University of Texas at Austin School of Information dustin.smith@utexas.edu 10/3/2011 1 INF384H / CS395T: Concepts of Information Retrieval
  • 2. Christopher Manning – background BA Australian National University 1989 (majors in mathematics, computer science and linguistics) PhD Stanford Linguistics 1995 Asst Professor Carnegie Mellon University Computational Linguistics Program 1994-96 Lecturer University of Sydney Dept of Linguistics 1996-99 Asst Professor Stanford University Depts of Computer Science and Linguistics 1999-2006 Current: Assoc Professor Stanford University Depts of Linguistics and Computer Science 10/3/2011 INF384H / CS395T: Concepts of Information Retrieval 2
  • 3. Prabhakar Raghavan– background Undergraduate degree in electrical engineering from ITT, Madras PhD in computer science from UC Berkeley Current: Working at Yahoo! Labs and is a Consulting Professor of Computer Science at Stanford University 10/3/2011 INF384H / CS395T: Concepts of Information Retrieval 3
  • 4. Hinrich Schütze– background Technical University of Braunschweig Vordiplom Mathematik Vordiplom Informatik University of Stuttgart, Diplom Informatik (MSCS) Stanford University, Ph.D., Computational Linguistics Current: Chair of Theoretical Computational Linguistics, Institute for Natural Language Processing at the University of Stuttgart 10/3/2011 INF384H / CS395T: Concepts of Information Retrieval 4
  • 5. Chapter/Presentation Outline Introduction to the concept of Language Models Finite automata and language models Types of language models Multinomial distributions over words Description of the Query Likelihood Model Using query likelihood language models in IR Estimating the query generation probability Ponte and Croft’s experiments Comparison of the language modeling approach to IR against other approaches to IR Description of various extension to the language modeling approach 10/3/2011 INF384H / CS395T: Concepts of Information Retrieval 5
  • 6. Language Models Based on concept that a document is a good match for a query if the document model is likely to generate the query. An alternative to the straightforward query-document probability model. (traditional approach) 10/3/2011 INF384H / CS395T: Concepts of Information Retrieval 6
  • 7.
  • 8. The process is analogous for a document model
  • 9. Figure 12.2 represents a single node with a single distribution over terms s.t.𝑡∈𝑉𝑃(𝑡)=1.   10/3/2011 INF384H / CS395T: Concepts of Information Retrieval 7 Language Models
  • 10.
  • 11. This calculation is shown with stop probabilities, but in practice these are left out. 10/3/2011 INF384H / CS395T: Concepts of Information Retrieval 8 Language Models
  • 12.
  • 13. Given a query s = “frog said that toad likes that dog”,our two model probabilities are calculated by simply multiplying term distributions.
  • 14. It’s evident why P(s|𝑀1) scores higher than P(s|𝑀2). More query terms were present in P(s|𝑀1) and so the probability is greater.  10/3/2011 INF384H / CS395T: Concepts of Information Retrieval 9 Language Models
  • 15.
  • 18. Language Models, chain rule: Which $M_d$ to use?
  • 19. The Query Likelihood Model: Using query likelihood language models in IR (242-243). Using Bayes' rule: $P(d \mid q) = P(q \mid d)\,P(d)/P(q)$. $P(q)$ is the same for all documents and $P(d)$ is treated as uniform across documents, so ranking by $P(d \mid q)$ is equivalent to ranking by $P(q \mid d)$. In the query likelihood model we construct a language model $M_d$ from each document. Goal: to rank documents by $P(d \mid q)$, where the probability of a document is interpreted as the likelihood that it is relevant to the query.
  • 20. The Query Likelihood Model: Using query likelihood language models in IR (242-243). Multinomial unigram language model: $P(q \mid M_d) = K_q \prod_{t \in V} P(t \mid M_d)^{\mathrm{tf}_{t,d}}$. $K_q$ is dropped because it is constant for a given query, so it does not affect the ranking of documents. Query generation process: 1. Infer a LM for each document. 2. Estimate $P(q \mid M_d)$, the probability of generating the query according to each one of these document models. 3. Rank the documents according to these probabilities.
  • 21.
  • 22. $\mathrm{tf}_{t,d}$ is the raw term frequency of term $t$ in document $d$.
  • 23. The Query Likelihood Model: $L_d$ is the number of tokens in document $d$.
  • 24.
  • 27. Conceptually the same: the probability estimate for a word present in the document combines a discounted maximum likelihood estimate (MLE) and a fraction of the estimate of its prevalence in the whole collection.
  • 28.
  • 29. First experiments on the language modeling approach to IR
  • 30. The Query Likelihood Model: Performed on TREC topics 202-250 over TREC disks 2 and 3. LM performed much better than tf-idf (specifically at higher recall levels).
  • 31. Language Modeling Versus Other Approaches in IR: LM vs. BIM vs. XML retrieval (249). Language models and the most successful XML retrieval models approach relevance modeling in a roundabout way, as opposed to the BIM, which evaluates relevance directly. LM initially appears not to include relevance modeling. The most successful XML retrieval models assume that queries and documents are objects of the same type. The BIM has relevance as the central variable that is evaluated.
  • 32. LM vs. traditional tf-idf (249). The LM has significant relations to tf-idf models: both directly use term frequency; both have a method of mixing document frequency and collection frequency to produce probabilities; both treat terms independently. They differ on a more conceptual level: LM intuitions are more probabilistic than geometric; LM mathematical models are more principled rather than heuristic; LM differs in its use of tf and df.
  • 33.
  • 34.
  • 35. Extended Language Modeling Approaches: Easier to incorporate relevance feedback by expanding the query with terms from relevant documents.
  • 36.
  • 37. Using a document model to produce a relevant query
  • 39. Using a query model to produce a relevant document
  • 41. Extended Language Modeling Approaches: Comparing these models
  • 42.
  • 43. Outperforms query and document likelihood models
  • 44. Extended Language Modeling Approaches: But scores are not comparable across queries.
  • 45. Extended Language Modeling Approaches: Translation Model – Features (251). Answer to synonymy in basic LM models. Lets you generate query words that are not in a document by translating to alternate terms with similar meaning. Provides a basis for executing cross-language IR.
  • 46. Extended Language Modeling Approaches: Translation Model – Issues (251). Computationally intensive. Need to build the model using outside resources: a thesaurus, a bilingual dictionary, or a statistical machine translation system's translation dictionary.
  • 47. Thanks for not throwing vegetables! Questions?
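
The query generation process on slide 20 (infer a model per document, estimate $P(q \mid M_d)$, rank) can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: it assumes whitespace tokenization and the maximum likelihood estimate $\mathrm{tf}_{t,d}/L_d$, and the two documents, the query, and the helper names `mle_unigram` and `query_likelihood` are hypothetical.

```python
from collections import Counter


def mle_unigram(tokens):
    """Maximum likelihood unigram model: P(t | M_d) = tf_{t,d} / L_d."""
    counts = Counter(tokens)
    length = len(tokens)
    return {term: tf / length for term, tf in counts.items()}


def query_likelihood(query_tokens, model):
    """P(q | M_d) as a product of per-term probabilities.

    Terms that never occur in the document get probability 0 (no smoothing).
    """
    prob = 1.0
    for term in query_tokens:
        prob *= model.get(term, 0.0)
    return prob


# Hypothetical two-document collection, tokenized on whitespace.
docs = {
    "d1": "the frog said that the toad likes the frog".split(),
    "d2": "the dog said that the cat likes the dog".split(),
}

# Step 1: infer a language model for each document.
models = {doc_id: mle_unigram(tokens) for doc_id, tokens in docs.items()}

# Step 2: estimate P(q | M_d) under each document model.
query = "frog likes toad".split()
scores = {doc_id: query_likelihood(query, m) for doc_id, m in models.items()}

# Step 3: rank the documents by these probabilities.
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy collection `d2` never mentions "frog" or "toad", so its unsmoothed score is exactly zero. That is the zero-probability issue that motivates smoothing.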

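Slide 27 describes smoothing as combining a discounted document estimate with a fraction of the collection estimate. One way to illustrate this is linear interpolation, $\lambda\,P_{mle}(t \mid M_d) + (1-\lambda)\,P_{mle}(t \mid M_c)$. The sketch below is an assumption-laden illustration: the value $\lambda = 0.5$, the two tiny documents, and the function names are all hypothetical, and the collection model is simply the concatenation of the documents.

```python
from collections import Counter


def mle(tokens):
    """Maximum likelihood unigram estimate: count / length."""
    counts = Counter(tokens)
    return {t: c / len(tokens) for t, c in counts.items()}


def smoothed_query_likelihood(query_tokens, doc_tokens, collection_tokens, lam=0.5):
    """Product over query terms of lam * P(t | M_d) + (1 - lam) * P(t | M_c).

    lam = 0.5 is an assumed value; lam = 1.0 falls back to the unsmoothed MLE.
    """
    doc_model = mle(doc_tokens)
    coll_model = mle(collection_tokens)
    prob = 1.0
    for t in query_tokens:
        prob *= lam * doc_model.get(t, 0.0) + (1 - lam) * coll_model.get(t, 0.0)
    return prob


# Hypothetical collection: a tiger document that never uses the word "cats".
docs = {
    "tigers": "tigers are big striped animals".split(),
    "pets": "cats are small pets".split(),
}
collection = [t for tokens in docs.values() for t in tokens]
query = "big striped cats".split()

scores = {d: smoothed_query_likelihood(query, toks, collection)
          for d, toks in docs.items()}
unsmoothed = {d: smoothed_query_likelihood(query, toks, collection, lam=1.0)
              for d, toks in docs.items()}
```

Without smoothing both documents score zero, because each is missing at least one query term. With interpolation the tiger document scores higher, since it matches two of the three query terms.
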
Editor's notes

  1. --- Rah-guh-vun
  2. --- appears to be the most social and outgoing of our nerdy authors
  3. Just read it
  4. ---Q-d model: modeling the relevance of a document to a query
  5. Alright, so: what is a document model? And how does it generate the query? They use the concept of automata to help explain what is meant by a language or document model. For any given document you have an alphabet with respect to that document and a language produced by that alphabet. Probability is distributed over terms such that the sum of all probabilities is equal to 1. Straightforward.
  6. --- I didn't quite understand where the 0.8 stop/continue probability came from. --- Left out because, given a fixed STOP probability, leaving it out does not affect results when comparing models. Now we will compare models.
  7. Next we look at probability over sequences of terms.
  8. --- By using the chain rule, we can build probabilities over sequences of terms. --- Two specific models that use the chain rule are the unigram and bigram models. Describe images. --- The fundamental question in language modeling is which document model to use.
  9. ---now we introduce formally the model representing the initial concepts of LM for IR
  10. The most common way to achieve the goal of the query likelihood model is to use the multinomial unigram language model. The query generation process is random. Next: estimating this $P(q \mid M_d)$.
  11. Basically we are counting how often each word occurs and dividing by the total number of words in the document. Notice the ^, which indicates that this probability is an estimate. Therein lies the issue with language models, which leads to the recurring issue of "zero probabilities", which then leads to the much-used approach of "smoothing", which we will see in detail in the next two presentations.
  12. The initial idea behind smoothing was to allow for non-occurring terms to be in a query generated by the document model. Give example: say you have a document about tigers that doesn't contain the word "cat", but a user queries "big striped cats". One of the important points in this section is that smoothing is essential for the overall good properties of LMs.
  13. --- But, as Dr. Lease has mentioned, it's easy to get good results when you are comparing to the standard tf-idf. --- Next: comparison of language models to other IR approaches.
  14. But they mention that LM can be thought to indirectly include relevance modeling by viewing documents and information needs as the same type of object and analyzing it with NLP. BIM = binary independence model.
  15. - Both use tf. - Both use df and cf to produce probabilities. - Both treat terms independently. --- Next: document model.
  16. Downsides: both downsides stem from there being less text to estimate with. Next: all three approaches.
  17. --- So far we've addressed query likelihood and document likelihood; now they focus on comparing these models. Next: model comparison.
  18. Q -- What will we use to compare models? One example would be the notorious KL divergence. Comment -- Some prior results show that comparing models outperforms both query and document likelihood models. Comment -- Not bad for ad hoc queries, but bad for topic tracking. Next: translation model.
  19. -- Synonymy: uses similar, but not the same, words to say the same thing. --- I believe synonymy is still a pretty big issue.
  20. -- more computationally intensive than basic LM approaches-- all of these extended language models have been shown to improve basic LM approaches
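
The model comparison approach mentioned in note 18 can be illustrated with a small KL divergence computation between a query model and each document model, $\sum_t P(t \mid M_q)\,\log\frac{P(t \mid M_q)}{P(t \mid M_d)}$, where a lower value means a closer match. This is only a sketch under stated assumptions: the document models are smoothed by linear interpolation with an assumed $\lambda = 0.5$ so that no query term has zero probability, every query term is assumed to occur somewhere in the collection, and the documents and function names are hypothetical.

```python
import math
from collections import Counter


def mle(tokens):
    """Maximum likelihood unigram estimate: count / length."""
    counts = Counter(tokens)
    return {t: c / len(tokens) for t, c in counts.items()}


def smoothed_doc_model(doc_tokens, collection_model, lam=0.5):
    """Interpolate the document MLE with the collection model (lam is assumed)."""
    doc = mle(doc_tokens)
    return {t: lam * doc.get(t, 0.0) + (1 - lam) * p_c
            for t, p_c in collection_model.items()}


def kl_divergence(query_model, doc_model):
    """Sum over query terms of P(t|M_q) * log(P(t|M_q) / P(t|M_d)).

    Assumes every query term has nonzero probability in doc_model.
    """
    return sum(p_q * math.log(p_q / doc_model[t])
               for t, p_q in query_model.items())


# Hypothetical collection and query.
docs = {
    "tigers": "tigers are big striped animals".split(),
    "pets": "cats are small pets".split(),
}
collection_model = mle([t for toks in docs.values() for t in toks])
query_model = mle("big striped cats".split())

divergences = {
    d: kl_divergence(query_model, smoothed_doc_model(toks, collection_model))
    for d, toks in docs.items()
}
best = min(divergences, key=divergences.get)  # lowest divergence ranks first
```

Because the divergence depends on the query model's own entropy, the raw values are only meaningful for ranking documents against one query, which matches the slide's caveat that scores are not comparable across queries.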