1. CORPUS STRUCTURE, LANGUAGE MODELS, AND AD HOC INFORMATION RETRIEVAL Oren Kurland and Lillian Lee Department of Computer Science Cornell University Ithaca, NY
17. Another Reason for Smoothing
Query = "the algorithms for data mining"

Maximum-likelihood estimates:
  w:           the     algorithms  for     data      mining
  pML(w|d1):   0.04    0.001       0.02    0.002     0.003
  pML(w|d2):   0.02    0.001       0.01    0.003     0.004

For the content words, p("algorithms"|d1) = p("algorithms"|d2), p("data"|d1) < p("data"|d2), and p("mining"|d1) < p("mining"|d2). Intuitively, d2 should have the higher score, but p(q|d1) > p(q|d2): d1 wins only on the function words "the" and "for". So we should make p("the") and p("for") less different across documents, and smoothing helps achieve this goal…

Smoothing against a reference model:
  p(w|REF):           0.2     0.00001   0.2     0.00001   0.00001
  smoothed p(w|d1):   0.184   0.000109  0.182   0.000209  0.000309
  smoothed p(w|d2):   0.182   0.000109  0.181   0.000309  0.000409
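The slide's effect can be reproduced with a short sketch. This is a minimal illustration, assuming Jelinek-Mercer smoothing with document weight lam = 0.1, which recovers the slide's smoothed values exactly (0.1 * 0.04 + 0.9 * 0.2 = 0.184, and so on):

```python
from math import prod

query = ["the", "algorithms", "for", "data", "mining"]

# Maximum-likelihood estimates from the slide, one entry per query word
p_ml_d1 = [0.04, 0.001, 0.02, 0.002, 0.003]
p_ml_d2 = [0.02, 0.001, 0.01, 0.003, 0.004]
p_ref   = [0.2, 0.00001, 0.2, 0.00001, 0.00001]  # reference (collection) model

lam = 0.1  # weight on the document model; reproduces the slide's numbers

def jm_smooth(p_ml, p_ref, lam):
    """Jelinek-Mercer: interpolate the document model with the reference model."""
    return [lam * pd + (1 - lam) * pr for pd, pr in zip(p_ml, p_ref)]

unsmoothed_d1 = prod(p_ml_d1)
unsmoothed_d2 = prod(p_ml_d2)
smoothed_d1 = prod(jm_smooth(p_ml_d1, p_ref, lam))
smoothed_d2 = prod(jm_smooth(p_ml_d2, p_ref, lam))

print(unsmoothed_d1 > unsmoothed_d2)  # True: d1 wrongly wins without smoothing
print(smoothed_d2 > smoothed_d1)      # True: smoothing lets content words dominate
```

With smoothing, the function words "the" and "for" contribute nearly identical factors to both documents, so the ranking is decided by the content words, as intended.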
20. RETRIEVAL ALGORITHM
Baseline method: documents are ranked by a probabilistic scoring function based on the frequency of the query words encountered in each document.
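A minimal sketch of such a baseline, assuming the standard query-likelihood scoring with Jelinek-Mercer smoothing (the toy documents and the weight lam = 0.1 are illustrative choices, not from the slides):

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_terms, lam=0.1):
    """Log of the smoothed query likelihood: sum of per-term log-probabilities."""
    doc_counts = Counter(doc_terms)
    coll_counts = Counter(collection_terms)
    score = 0.0
    for w in query_terms:
        p_doc = doc_counts[w] / len(doc_terms)
        p_coll = coll_counts[w] / len(collection_terms)
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

docs = {
    "d1": "data mining algorithms for data".split(),
    "d2": "cooking recipes and food".split(),
}
collection = [w for terms in docs.values() for w in terms]

query = "data mining".split()
ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d], collection),
                 reverse=True)
print(ranking)  # ['d1', 'd2']: the data-mining document ranks first
```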
21. Probabilistic IR (Introduction)
[Diagram: an information need is expressed as a query q, which is matched against documents d1, d2, …, dn from the document collection.]
22. BASIS SELECT
This algorithm pools statistics from the documents to decide whether a document is worth ranking at all. Only basis documents that meet some minimum threshold frequency are allowed to appear in the final output list.
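One way to sketch this filtering step, under the assumption that clusters have already been ranked against the query and that a document's "frequency" is the number of top clusters it occurs in (the cluster contents here are hypothetical):

```python
from collections import Counter

def basis_select(ranked_clusters, k, min_freq=1):
    """Keep only 'basis' documents: those occurring in the top-k clusters
    at least min_freq times; all other documents are excluded entirely."""
    freq = Counter(d for cluster in ranked_clusters[:k] for d in cluster)
    return [d for d, c in freq.items() if c >= min_freq]

clusters = [["d1", "d2"], ["d2", "d3"], ["d4"]]  # clusters ranked by query match
print(basis_select(clusters, 2))                 # ['d1', 'd2', 'd3']
print(basis_select(clusters, 2, min_freq=2))     # ['d2']: d4 never qualifies
```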
23. IR based on LM (Introduction)
[Diagram: instead of matching, each document d1, d2, …, dn in the collection is viewed as generating the query that expresses the information need.]
24. SET SELECT ALGORITHM
In this case, all documents may appear in the final output list. The idea is that any document in a "best" cluster, basis document or not, is potentially relevant.
BAG SELECT
Documents appearing in more than one cluster should get extra consideration. The name refers to incorporating each document's multiplicity in the bag formed by the multiset union of the selected clusters.
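The contrast between the two selection rules can be sketched as follows; the cluster contents are hypothetical, and ranking within the bag by raw multiplicity is a simplifying assumption:

```python
from collections import Counter

def set_select(ranked_clusters, k):
    """Any document in one of the top-k ('best') clusters is a candidate."""
    return {d for cluster in ranked_clusters[:k] for d in cluster}

def bag_select(ranked_clusters, k):
    """Multiset union of the top-k clusters: a document appearing in several
    of them accumulates credit for each appearance."""
    bag = Counter()
    for cluster in ranked_clusters[:k]:
        bag.update(cluster)
    return [d for d, _ in bag.most_common()]  # ordered by multiplicity

clusters = [["d1", "d2"], ["d2", "d3"], ["d4"]]
print(sorted(set_select(clusters, 2)))  # ['d1', 'd2', 'd3']: membership only
print(bag_select(clusters, 2)[0])       # 'd2': it appears in both top clusters
```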
25. ASPECT-X RATIO
The degree of relevance for a particular probability is based on the strength of association between d and c, where d is the document and c is the cluster. The uniform aspect-x ratio assumes that every d ∈ c has the same degree of association.
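The aspect-style score this ratio feeds into, Σc p(q|c)·p(c|d), can be sketched directly; the probability values below are illustrative, with the uniform ratio expressed as equal p(c|d) mass over the clusters containing d:

```python
def aspect_x_score(p_q_given_c, p_c_given_d):
    """Score document d by summing, over clusters c, p(q|c) * p(c|d)."""
    return sum(p_q_given_c[c] * p_c_given_d.get(c, 0.0) for c in p_q_given_c)

# Uniform aspect-x ratio: d belongs to clusters c1 and c2 with equal weight
p_q_given_c = {"c1": 0.25, "c2": 0.125}  # query likelihood under each cluster LM
p_c_given_d = {"c1": 0.5, "c2": 0.5}     # document-cluster association

print(aspect_x_score(p_q_given_c, p_c_given_d))  # 0.1875
```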
26. A HYBRID ALGORITHM
An interpolation algorithm combines the advantages of both the selection-only algorithms and the aspect-x algorithm. The algorithm can be derived by dropping the original aspect model's conditional-independence assumption, namely that p(q|d,c) = p(q|c), and instead setting p(q|d,c) in Equation 1 to λ·p(q|d) + (1−λ)·p(q|c), where λ indicates the degree of emphasis on individual-document information. If we do so, then via some algebra we get

  p(q|d) = λ·p(q|d) + (1−λ)·Σc p(q|c)·p(c|d).

Finally, applying the same assumptions as described in our discussion of the aspect-x algorithm yields a score function that is the linear interpolation of the score of the standard LM approach and the score of the aspect-x algorithm.
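The resulting interpolated score function can be sketched in a few lines; the probability values are illustrative, chosen so the endpoints λ = 1 (pure LM) and λ = 0 (pure aspect-x) are easy to check:

```python
def hybrid_score(lam, p_q_given_d, p_q_given_c, p_c_given_d):
    """lam * p(q|d) + (1 - lam) * sum_c p(q|c) * p(c|d)."""
    aspect = sum(p_q_given_c[c] * p_c_given_d.get(c, 0.0) for c in p_q_given_c)
    return lam * p_q_given_d + (1 - lam) * aspect

p_q_given_d = 0.125         # standard LM score for this document
p_q_given_c = {"c1": 0.25}  # query likelihood under the cluster model
p_c_given_d = {"c1": 1.0}   # document-cluster association

print(hybrid_score(1.0, p_q_given_d, p_q_given_c, p_c_given_d))  # 0.125: pure LM
print(hybrid_score(0.0, p_q_given_d, p_q_given_c, p_c_given_d))  # 0.25: pure aspect-x
print(hybrid_score(0.5, p_q_given_d, p_q_given_c, p_c_given_d))  # 0.1875: midpoint
```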
27. TEXT GENERATION WITH A UNIGRAM LM
A (unigram) language model θ assigns a probability p(w|θ) to each word.
  Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, …
  Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
Sampling from θ generates a document d. Given θ, p(d|θ) varies according to d: the text-mining model is far more likely to produce a text-mining paper than a food-nutrition paper, and vice versa.
28. ESTIMATION OF A UNIGRAM LM
(Unigram) language model θ: p(w|θ) = ?
Document word counts (total #words = 100): text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1.
Maximum-likelihood estimates: p(text|θ) = 10/100, p(mining|θ) = 5/100, p(association|θ) = 3/100, p(database|θ) = 3/100, …, p(query|θ) = 1/100, …
How good is the estimated model? It gives our document sample the highest probability, but it doesn't generalize well… more about this later…
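The maximum-likelihood estimate on the slide is just count(w, d) / |d|, which a short sketch can confirm; the "filler" words are a stand-in for the remaining 75 words the slide elides:

```python
from collections import Counter

def mle_unigram(doc_terms):
    """Maximum-likelihood unigram LM: p(w|theta) = count(w, d) / |d|."""
    total = len(doc_terms)
    return {w: c / total for w, c in Counter(doc_terms).items()}

# Mirror the slide's counts in a 100-word document
doc = (["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["database"] * 3
       + ["algorithm"] * 2 + ["query"] + ["efficient"] + ["filler"] * 75)
model = mle_unigram(doc)

print(model["text"])    # 0.1, i.e. 10/100
print(model["mining"])  # 0.05, i.e. 5/100
```

Note that any word not in the document gets probability zero under this estimate, which is exactly why it fails to generalize and why smoothing is needed.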
29. THE BASIC LM APPROACH [PONTE & CROFT 98]
Query = "data mining algorithms". Two documents: a text-mining paper and a food-nutrition paper, each with its own language model (text ?, mining ?, association ?, clustering ?, …, food ?, … vs. food ?, nutrition ?, healthy ?, diet ?, …). Which model would most likely have generated this query?
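The question on the slide can be answered computationally: score each document model by the likelihood it assigns to the query. The probability values below are hypothetical (the slide leaves them as "?"), and the small eps stands in for proper smoothing of unseen words:

```python
from math import prod

def query_likelihood(model, query, eps=1e-9):
    """p(q|theta) under a unigram LM; eps is a crude floor for unseen words."""
    return prod(model.get(w, eps) for w in query)

# Hypothetical unigram models for the two papers on the slide
text_lm = {"text": 0.2, "mining": 0.1, "data": 0.05, "algorithms": 0.02}
food_lm = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}

query = ["data", "mining", "algorithms"]
print(query_likelihood(text_lm, query) > query_likelihood(food_lm, query))
# True: the text-mining model is far more likely to have generated this query
```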