SlideShare ist ein Scribd-Unternehmen logo
1 von 34
CORPUS STRUCTURE, LANGUAGE MODELS, AND AD  HOC INFORMATION RETRIEVAL Oren Kurland and Lillian Lee Department of  Computer Science Cornell University Ithaca, NY
INFORMATION RETRIEVAL ,[object Object],[object Object]
INFORMATION RETRIEVAL (CONTD.)
CLUSTERING ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
A DATA SET WITH CLEAR CLUSTER STRUCTURE Ch. 16
CLUSTERING(CONTD.)
THE BIG PICTURE
TERMINOLOGY ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
TERM FREQUENCY ,[object Object],[object Object],[object Object],[object Object],[object Object]
DOCUMENT FREQUENCY ,[object Object],[object Object],[object Object],[object Object]
TF.IDF ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
TERM WEIGHTS ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
LANGUAGE MODELS FOR IR ,[object Object],[object Object],[object Object],[object Object]
LANGUAGE MODELS ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
THE SIMPLEST LANGUAGE MODEL (UNIGRAM MODEL) ,[object Object],[object Object],[object Object],[object Object]
SMOOTHING ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Query  = “the  algorithms  for  data  mining” Another Reason for Smoothing p( “algorithms”|d1)  = p(“algorithm”|d2) p( “data”|d1)  < p(“data”|d2) p( “mining”|d1)  < p(“mining”|d2) So we should make p(“the”) and p(“for”)  less different  for all docs,  and smoothing helps achieve this goal
  Content words p DML (w|d1):   0.04  0.001  0.02  0.002  0.003   p DML (w|d2):  0.02  0.001  0.01  0.003  0.004 Intuitively, d2 should have a higher score,  but p(q|d1)>p(q|d2)
 Query  = “the  algorithms  for  data  mining” P(w|REF)  0.2  0.00001  0.2  0.00001  0.00001 Smoothed p(w|d1):   0.184  0.000109  0.182  0.000209  0.000309 Smoothed p(w|d2):   0.182  0.000109  0.181  0.000309  0.000409
RETRIEVAL FRAMEWORK ,[object Object],[object Object],[object Object],[object Object]
CLUSTER-BASED SMOOTHING/SCORING ,[object Object],How likely doc D  belongs to cluster C Only effective when interpolated with the basic LM scores Likelihood of Q  given C
RETRIEVAL ALGORITHM Base line method:-  The documents are simply ranked by probabilistic functions on the basis of frequency of words encountered from query.
Probabilistic IR query d1 d2 dn 
 Information need document collection matching Introduction
BASIS SELECT This algorithm uses the pooling of statistics from documents simply to decide whether the document is worth ranking or not. Only the basis documents are allowed to appear in the final output list having some minimum thresh hold frequency.
IR based on LM query d1 d2 dn 
 Information need document collection generation 
 Introduction
SET SELECT ALGORITHM In this case all the documents may appear in the final output list. The idea is that any document in the “best” cluster, basis or not is potentially relevant.  BAG SELECT The documents appearing in more than one cluster should get extra consideration. The name is in reference to the incorporation in the document’s multiplicity in the bag formed from the “multi set union”.
ASPECT – X  RATIO The degree of relevance on a particular probability is based on the strength of association between d and c where d is the document and c is the query.  The uniform aspect x ratio assumes that every d  Є  c has same degree of association.
A HYBRID ALGORITHM An Interpolation algorithm combines the advantages of both selection-only algorithms and the aspect-x Algorithms The algorithm can be derived by dropping the original as-pect model's conditional independence assumption|namely, that p(qjd; c) = p(qjc) | and instead setting p(qjd; c) in Equation 1 to p(qjd)+(1¥ž)p(qjc), where ž indicates the degree of emphasis on individual-document information. If we do so, then via some algebra we get p(qjd) = žp(qjd) +(1¥ž)Pc p(qjc)p(cjd). Finally, applying the same assumptions as described in our discussion of the aspect-x algorithm yields a score function that is the linear interpolation of the score of the standard LM approach and the score of the aspect-x algorithm.
TEXT GENERATION WITH UNIGRAM LM  (Unigram) Language Model   p(w|   ) 
 text  0.2 mining 0.1 assocation 0.01 clustering 0.02 
 food  0.00001 
 Topic 1: Text mining 
 food 0.25 nutrition 0.1 healthy 0.05 diet 0.02 
 Topic 2: Health Document d Sampling Given   , p(d|   ) varies according to d  Text mining paper Food nutrition paper
ESTIMATION OF UNIGRAM LM (Unigram) Language Model   p(w|   )=? Document text 10 mining 5 association 3 database 3 algorithm 2 
 query 1 efficient 1 Estimation Total #words =100 
 text  ? mining ? assocation ? database ? 
 query ? 
 10/100 5/100 3/100 3/100 1/100 How good is the estimated model ? It gives our document sample the highest prob, but it doesn’t generalize well
 More about this later

THE BASIC LM APPROACH [PONTE & CROFT 98]  Document Text mining paper Food nutrition paper Query =  “ data mining algorithms” Language Model  
 text  ? mining ? assocation ? clustering ? 
 food  ? 
 
 food ? nutrition ? healthy ? diet ? 
 ? Which model would most  likely have generated this query?
[object Object],[object Object]
EXPERIMENTAL RESULT
Thank You

Weitere Àhnliche Inhalte

Was ist angesagt?

Text classification using Text kernels
Text classification using Text kernelsText classification using Text kernels
Text classification using Text kernelsDev Nath
 
Topic model an introduction
Topic model an introductionTopic model an introduction
Topic model an introductionYueshen Xu
 
Topic modeling using big data analytics
Topic modeling using big data analyticsTopic modeling using big data analytics
Topic modeling using big data analyticsFarheen Nilofer
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackBhaskar Mitra
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Text classification-php-v4
Text classification-php-v4Text classification-php-v4
Text classification-php-v4Glenn De Backer
 
Text Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmText Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmIJTET Journal
 
Text categorization
Text categorizationText categorization
Text categorizationKU Leuven
 
Topic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsTopic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsClaudia Wagner
 
Document ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspaceDocument ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspacePrakash Dubey
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modelingHiroyuki Kuromiya
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxKalpit Desai
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Author Topic Model
Author Topic ModelAuthor Topic Model
Author Topic ModelFReeze FRancis
 
Lec 4,5
Lec 4,5Lec 4,5
Lec 4,5alaa223
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentationSoojung Hong
 
ĐĄĐ”Ń€ĐłĐ”ĐžÌ† ĐšĐŸĐ»ŃŒŃ†ĐŸĐČ â€”ĐĐ˜ĐŁ ВйЭ —ICBDA 2015
ĐĄĐ”Ń€ĐłĐ”ĐžÌ† ĐšĐŸĐ»ŃŒŃ†ĐŸĐČ â€”ĐĐ˜ĐŁ ВйЭ —ICBDA 2015ĐĄĐ”Ń€ĐłĐ”ĐžÌ† ĐšĐŸĐ»ŃŒŃ†ĐŸĐČ â€”ĐĐ˜ĐŁ ВйЭ —ICBDA 2015
ĐĄĐ”Ń€ĐłĐ”ĐžÌ† ĐšĐŸĐ»ŃŒŃ†ĐŸĐČ â€”ĐĐ˜ĐŁ ВйЭ —ICBDA 2015rusbase
 
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...Sebastian Ruder
 

Was ist angesagt? (20)

Text classification using Text kernels
Text classification using Text kernelsText classification using Text kernels
Text classification using Text kernels
 
Topic Models
Topic ModelsTopic Models
Topic Models
 
Topic model an introduction
Topic model an introductionTopic model an introduction
Topic model an introduction
 
Topic modeling using big data analytics
Topic modeling using big data analyticsTopic modeling using big data analytics
Topic modeling using big data analytics
 
Topics Modeling
Topics ModelingTopics Modeling
Topics Modeling
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning Track
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Text classification-php-v4
Text classification-php-v4Text classification-php-v4
Text classification-php-v4
 
Text Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmText Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor Algorithm
 
Text categorization
Text categorizationText categorization
Text categorization
 
Topic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsTopic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic Models
 
Document ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspaceDocument ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspace
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modeling
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptx
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Author Topic Model
Author Topic ModelAuthor Topic Model
Author Topic Model
 
Lec 4,5
Lec 4,5Lec 4,5
Lec 4,5
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentation
 
ĐĄĐ”Ń€ĐłĐ”ĐžÌ† ĐšĐŸĐ»ŃŒŃ†ĐŸĐČ â€”ĐĐ˜ĐŁ ВйЭ —ICBDA 2015
ĐĄĐ”Ń€ĐłĐ”ĐžÌ† ĐšĐŸĐ»ŃŒŃ†ĐŸĐČ â€”ĐĐ˜ĐŁ ВйЭ —ICBDA 2015ĐĄĐ”Ń€ĐłĐ”ĐžÌ† ĐšĐŸĐ»ŃŒŃ†ĐŸĐČ â€”ĐĐ˜ĐŁ ВйЭ —ICBDA 2015
ĐĄĐ”Ń€ĐłĐ”ĐžÌ† ĐšĐŸĐ»ŃŒŃ†ĐŸĐČ â€”ĐĐ˜ĐŁ ВйЭ —ICBDA 2015
 
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
 

Ähnlich wie Artificial Intelligence

Language Technology Enhanced Learning
Language Technology Enhanced LearningLanguage Technology Enhanced Learning
Language Technology Enhanced Learningtelss09
 
Discovering Novel Information with sentence Level clustering From Multi-docu...
Discovering Novel Information with sentence Level clustering  From Multi-docu...Discovering Novel Information with sentence Level clustering  From Multi-docu...
Discovering Novel Information with sentence Level clustering From Multi-docu...irjes
 
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)Sean Golliher
 
4-IR Models_new.ppt
4-IR Models_new.ppt4-IR Models_new.ppt
4-IR Models_new.pptBereketAraya
 
4-IR Models_new.ppt
4-IR Models_new.ppt4-IR Models_new.ppt
4-IR Models_new.pptBereketAraya
 
Slides
SlidesSlides
Slidesbutest
 
ECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGGeorge Simov
 
Language independent document
Language independent documentLanguage independent document
Language independent documentijcsit
 
Representing Documents and Queries as Sets of Word Embedded Vectors for Infor...
Representing Documents and Queries as Sets of Word Embedded Vectors for Infor...Representing Documents and Queries as Sets of Word Embedded Vectors for Infor...
Representing Documents and Queries as Sets of Word Embedded Vectors for Infor...Dwaipayan Roy
 
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...onlmcq
 
lecture_mooney.ppt
lecture_mooney.pptlecture_mooney.ppt
lecture_mooney.pptbutest
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
 
TEXT CLUSTERING.doc
TEXT CLUSTERING.docTEXT CLUSTERING.doc
TEXT CLUSTERING.docnaveenchaurasia
 
Probablistic information retrieval
Probablistic information retrievalProbablistic information retrieval
Probablistic information retrievalNisha Arankandath
 
The Geometry of Learning
The Geometry of LearningThe Geometry of Learning
The Geometry of Learningfridolin.wild
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModelingSardhendu Mishra
 
A first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupA first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupDan Sullivan, Ph.D.
 
similarity measure
similarity measure similarity measure
similarity measure ZHAO Sam
 

Ähnlich wie Artificial Intelligence (20)

Language Technology Enhanced Learning
Language Technology Enhanced LearningLanguage Technology Enhanced Learning
Language Technology Enhanced Learning
 
Discovering Novel Information with sentence Level clustering From Multi-docu...
Discovering Novel Information with sentence Level clustering  From Multi-docu...Discovering Novel Information with sentence Level clustering  From Multi-docu...
Discovering Novel Information with sentence Level clustering From Multi-docu...
 
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
 
4-IR Models_new.ppt
4-IR Models_new.ppt4-IR Models_new.ppt
4-IR Models_new.ppt
 
4-IR Models_new.ppt
4-IR Models_new.ppt4-IR Models_new.ppt
4-IR Models_new.ppt
 
Slides
SlidesSlides
Slides
 
ECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERING
 
Language independent document
Language independent documentLanguage independent document
Language independent document
 
Representing Documents and Queries as Sets of Word Embedded Vectors for Infor...
Representing Documents and Queries as Sets of Word Embedded Vectors for Infor...Representing Documents and Queries as Sets of Word Embedded Vectors for Infor...
Representing Documents and Queries as Sets of Word Embedded Vectors for Infor...
 
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
 
lecture_mooney.ppt
lecture_mooney.pptlecture_mooney.ppt
lecture_mooney.ppt
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
TEXT CLUSTERING.doc
TEXT CLUSTERING.docTEXT CLUSTERING.doc
TEXT CLUSTERING.doc
 
Some Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBASome Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBA
 
Probablistic information retrieval
Probablistic information retrievalProbablistic information retrieval
Probablistic information retrieval
 
The Geometry of Learning
The Geometry of LearningThe Geometry of Learning
The Geometry of Learning
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 
A first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupA first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetup
 
similarity measure
similarity measure similarity measure
similarity measure
 

Mehr von vini89

Similarity based methods for word sense disambiguation
Similarity based methods for word sense disambiguationSimilarity based methods for word sense disambiguation
Similarity based methods for word sense disambiguationvini89
 
Machine translation with statistical approach
Machine translation with statistical approachMachine translation with statistical approach
Machine translation with statistical approachvini89
 
Hcs
HcsHcs
Hcsvini89
 
Fuzzy logic
Fuzzy logicFuzzy logic
Fuzzy logicvini89
 
Ann
Ann Ann
Ann vini89
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligencevini89
 
Ai
Ai Ai
Ai vini89
 
Ai presentation
Ai presentationAi presentation
Ai presentationvini89
 
Similarity based methods for word sense disambiguation
Similarity based methods for word sense disambiguationSimilarity based methods for word sense disambiguation
Similarity based methods for word sense disambiguationvini89
 
Mycin
MycinMycin
Mycinvini89
 

Mehr von vini89 (10)

Similarity based methods for word sense disambiguation
Similarity based methods for word sense disambiguationSimilarity based methods for word sense disambiguation
Similarity based methods for word sense disambiguation
 
Machine translation with statistical approach
Machine translation with statistical approachMachine translation with statistical approach
Machine translation with statistical approach
 
Hcs
HcsHcs
Hcs
 
Fuzzy logic
Fuzzy logicFuzzy logic
Fuzzy logic
 
Ann
Ann Ann
Ann
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Ai
Ai Ai
Ai
 
Ai presentation
Ai presentationAi presentation
Ai presentation
 
Similarity based methods for word sense disambiguation
Similarity based methods for word sense disambiguationSimilarity based methods for word sense disambiguation
Similarity based methods for word sense disambiguation
 
Mycin
MycinMycin
Mycin
 

KĂŒrzlich hochgeladen

Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
Visit to a blind student's school🧑‍🩯🧑‍🩯(community medicine)
Visit to a blind student's school🧑‍🩯🧑‍🩯(community medicine)Visit to a blind student's school🧑‍🩯🧑‍🩯(community medicine)
Visit to a blind student's school🧑‍🩯🧑‍🩯(community medicine)lakshayb543
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxChelloAnnAsuncion2
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
Q4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxQ4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxnelietumpap1
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 

KĂŒrzlich hochgeladen (20)

Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
Visit to a blind student's school🧑‍🩯🧑‍🩯(community medicine)
Visit to a blind student's school🧑‍🩯🧑‍🩯(community medicine)Visit to a blind student's school🧑‍🩯🧑‍🩯(community medicine)
Visit to a blind student's school🧑‍🩯🧑‍🩯(community medicine)
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
Q4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxQ4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptx
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 

Artificial Intelligence

  • 1. CORPUS STRUCTURE, LANGUAGE MODELS, AND AD HOC INFORMATION RETRIEVAL Oren Kurland and Lillian Lee Department of Computer Science Cornell University Ithaca, NY
  • 2.
  • 4.
  • 5. A DATA SET WITH CLEAR CLUSTER STRUCTURE Ch. 16
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17. Query = “the algorithms for data mining” Another Reason for Smoothing p( “algorithms”|d1) = p(“algorithm”|d2) p( “data”|d1) < p(“data”|d2) p( “mining”|d1) < p(“mining”|d2) So we should make p(“the”) and p(“for”) less different for all docs, and smoothing helps achieve this goal
 Content words p DML (w|d1): 0.04 0.001 0.02 0.002 0.003 p DML (w|d2): 0.02 0.001 0.01 0.003 0.004 Intuitively, d2 should have a higher score, but p(q|d1)>p(q|d2)
 Query = “the algorithms for data mining” P(w|REF) 0.2 0.00001 0.2 0.00001 0.00001 Smoothed p(w|d1): 0.184 0.000109 0.182 0.000209 0.000309 Smoothed p(w|d2): 0.182 0.000109 0.181 0.000309 0.000409
  • 18.
  • 19.
  • 20. RETRIEVAL ALGORITHM Base line method:- The documents are simply ranked by probabilistic functions on the basis of frequency of words encountered from query.
  • 21. Probabilistic IR query d1 d2 dn 
 Information need document collection matching Introduction
  • 22. BASIS SELECT This algorithm uses the pooling of statistics from documents simply to decide whether the document is worth ranking or not. Only the basis documents are allowed to appear in the final output list having some minimum thresh hold frequency.
  • 23. IR based on LM query d1 d2 dn 
 Information need document collection generation 
 Introduction
  • 24. SET SELECT ALGORITHM In this case all the documents may appear in the final output list. The idea is that any document in the “best” cluster, basis or not is potentially relevant. BAG SELECT The documents appearing in more than one cluster should get extra consideration. The name is in reference to the incorporation in the document’s multiplicity in the bag formed from the “multi set union”.
  • 25. ASPECT – X RATIO The degree of relevance on a particular probability is based on the strength of association between d and c where d is the document and c is the query. The uniform aspect x ratio assumes that every d Є c has same degree of association.
  • 26. A HYBRID ALGORITHM An Interpolation algorithm combines the advantages of both selection-only algorithms and the aspect-x Algorithms The algorithm can be derived by dropping the original as-pect model's conditional independence assumption|namely, that p(qjd; c) = p(qjc) | and instead setting p(qjd; c) in Equation 1 to p(qjd)+(1¥ž)p(qjc), where ž indicates the degree of emphasis on individual-document information. If we do so, then via some algebra we get p(qjd) = žp(qjd) +(1¥ž)Pc p(qjc)p(cjd). Finally, applying the same assumptions as described in our discussion of the aspect-x algorithm yields a score function that is the linear interpolation of the score of the standard LM approach and the score of the aspect-x algorithm.
  • 27. TEXT GENERATION WITH UNIGRAM LM (Unigram) Language Model  p(w|  ) 
 text 0.2 mining 0.1 assocation 0.01 clustering 0.02 
 food 0.00001 
 Topic 1: Text mining 
 food 0.25 nutrition 0.1 healthy 0.05 diet 0.02 
 Topic 2: Health Document d Sampling Given  , p(d|  ) varies according to d Text mining paper Food nutrition paper
  • 28. ESTIMATION OF UNIGRAM LM (Unigram) Language Model  p(w|  )=? Document text 10 mining 5 association 3 database 3 algorithm 2 
 query 1 efficient 1 Estimation Total #words =100 
 text ? mining ? assocation ? database ? 
 query ? 
 10/100 5/100 3/100 3/100 1/100 How good is the estimated model ? It gives our document sample the highest prob, but it doesn’t generalize well
 More about this later

  • 29. THE BASIC LM APPROACH [PONTE & CROFT 98] Document Text mining paper Food nutrition paper Query = “ data mining algorithms” Language Model 
 text ? mining ? assocation ? clustering ? 
 food ? 
 
 food ? nutrition ? healthy ? diet ? 
 ? Which model would most likely have generated this query?
  • 30.
  • 32.
  • 33.