This talk covers the basics of the science of Information Retrieval, using a story mode to explore information and its various aspects. It then takes you on a quick journey through the process of building a search engine.
Information Retrieval
1. INFORMATION RETRIEVAL: A Look into the Science of Web Search Engines. Muhammad Atif Qureshi. Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
2. Contents: Story Mode Learning; Learning by Imagination; Appendix.
3. Story Mode Learning (Borrowed from Prof. Jimmy Lin, University of Maryland, Scientist at Twitter)
4. Information Retrieval Systems. Information: What is "information"? Retrieval: What do we mean by "retrieval"? What are the different types of information needs? Systems: How do computer systems fit into the human information seeking process?
5. What is Information? What do you think? There is no "correct" definition. Cookie Monster's definition: "news or facts about something". Different approaches: Philosophy, Psychology, Linguistics, Electrical engineering, Physics, Computer science, Information science.
6. Dictionary says... Oxford English Dictionary. information: informing, telling; thing told, knowledge, items of knowledge, news. knowledge: knowing familiarity gained by experience; person's range of information; a theoretical or practical understanding of; the sum of what is known. Random House Dictionary. information: knowledge communicated or received concerning a particular fact or circumstance; news.
7. Intuitive Notions. Information must: be something, although the exact nature (substance, energy, or abstract concept) is not clear; be "new": repetition of previously received messages is not informative; be "true": false or counterfactual information is "mis-information"; be "about" something. Source: Robert M. Losee. (1997) A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), 254-269.
8. Three Views of Information: information as process; information as communication; information as message transmission and reception.
9. One View. Information = characteristics of the output of a process. It tells us something about the process and the input. Information-generating processes do not occur in isolation. [Diagram: Input -> Process -> Output; chained, the output of Process1 becomes the input of Process2, and so on.]
10. Where's the human? If a tree falls in the forest, and no one is around to hear it, is information transmitted? In the "information as process" view: yes, but that's not very interesting to us. We're concerned about information for human consumption: transmission of information from one person to another; recording of information; reconstruction of stored information.
11. Another View. Information science is characterized by "the deliberate (purposeful) structure of the message by the sender in order to affect the image structure of the recipient". This implies that the sender has knowledge of the recipient's structure. Text = "a collection of signs purposefully structured by a sender with the intention of changing image-structure of a recipient". Information = "the structure of any text which is capable of changing the image-structure of a recipient".
12. Transfer of Information. Communication = transmission of information. [Diagram: sender and recipient linked at three levels: thoughts to thoughts (telepathy?), words to words (writing), sounds to sounds (speech); encoding on one side, decoding on the other.]
13. Information Theory. Better called "communication theory". Developed by Claude Shannon in the 1940s. Concerned with the transmission of electrical signals over wires: how do we send information quickly and reliably? Underlies modern electronic communication: voice and data traffic, over copper, fiber optic, wireless, etc. Famous result: Channel Capacity Theorem. Formal measure of information in terms of entropy. Information = "reduction in surprise".
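The "formal measure of information in terms of entropy" can be illustrated with a minimal sketch. It assumes the message is a string whose characters are the symbols, and it estimates symbol probabilities from their counts in that string; the function name `shannon_entropy` is chosen here for illustration and does not come from the slides.

```python
import math
from collections import Counter


def shannon_entropy(message: str) -> float:
    """Entropy in bits per symbol, H = -sum(p * log2(p)), with p estimated from counts."""
    counts = Counter(message)
    total = len(message)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


print(shannon_entropy("aaaa"))  # a single repeated symbol carries no surprise
print(shannon_entropy("abab"))  # two equally likely symbols
print(shannon_entropy("abcd"))  # four equally likely symbols
```

The more evenly spread the symbols are, the higher the entropy, which matches the intuition that a repeated message is not informative.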
14. The Noisy Channel Model. Communication = producing the same message at the destination that was sent at the source. The message must be encoded for transmission across a medium (called the channel), but the channel is noisy and can distort the message. Semantics (meaning) is irrelevant. [Diagram: Source -> message -> Transmitter -> channel (with noise) -> Receiver -> message -> Destination.]
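A toy simulation may make the noisy channel model concrete. This is only a sketch under stated assumptions: the message is a list of bits, the channel flips each bit independently (a binary symmetric channel), and the encoding is a simple repetition code with majority-vote decoding. None of these specifics appear in the slides.

```python
import random


def encode(bits, repeat=3):
    """Transmitter: repetition code, every bit is sent `repeat` times."""
    return [b for b in bits for _ in range(repeat)]


def noisy_channel(bits, flip_probability, rng):
    """Channel: flip each transmitted bit independently with the given probability."""
    return [b ^ 1 if rng.random() < flip_probability else b for b in bits]


def decode(bits, repeat=3):
    """Receiver: majority vote over each group of `repeat` received bits."""
    groups = [bits[i:i + repeat] for i in range(0, len(bits), repeat)]
    return [1 if sum(g) * 2 > len(g) else 0 for g in groups]


message = [1, 0, 1, 1, 0]
received = decode(noisy_channel(encode(message), 0.05, random.Random(42)))
print(message, received)
```

The code never looks at what the bits mean, which is the sense in which semantics is irrelevant to the model.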
15. A Synthesis. Information retrieval as communication over time and space, across a noisy channel. [Diagram: the noisy channel model (Source -> Transmitter -> channel -> Receiver -> Destination) set alongside retrieval (Sender -> Encoding by indexing/writing -> storage -> Decoding by retrieval/reading -> Recipient), with noise in both.]
16. "Retrieval?" "Fetch something" that's been stored. Recover a stored state of knowledge. Search through stored messages to find some messages relevant to the task at hand. [Diagram: Sender -> Encoding (indexing/writing) -> storage -> Decoding (retrieval/reading) -> Recipient, with noise.]
17. What is IR? Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user. Source: Nicholas J. Belkin. (1980) Anomalous States of Knowledge as a Basis for Information Retrieval. Canadian Journal of Information Science, 5, 133-143.
18. Types of Information Needs. Retrospective: "searching the past"; different queries posed against a static collection; time invariant. Prospective: "searching the future"; static query posed against a dynamic collection; time dependent.
19. Retrospective Searches (I). Ad hoc retrieval (find documents "about this"): Identify positive accomplishments of the Hubble telescope since it was launched in 1991. Compile a list of mammals that are considered to be endangered, identify their habitat and, if possible, specify what threatens them. Known item search: Find Jimmy Lin's homepage. What's the ISBN number of "Modern Information Retrieval"? Directed exploration: Who makes the best chocolates? What video conferencing systems exist for digital reference desk services?
20. Retrospective Searches (II). Question answering. "Factoid": Who discovered Oxygen? When did Hawaii become a state? Where is Ayer's Rock located? What team won the World Series in 1992? "List": What countries export oil? Name U.S. cities that have a "Shubert" theater. "Definition": Who is Aaron Copland? What is a quasar?
21. Prospective "Searches". Filtering: make a binary decision about each incoming document (spam or not spam?). Routing: sort incoming documents into different bins (categorize news headlines: World? Nation? Metro? Sports?).
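The difference between filtering and routing can be sketched in a few lines. The keyword sets below are hypothetical stand-ins for a trained classifier, and the threshold of two spam words is an arbitrary assumption made only for this illustration.

```python
# Hypothetical keyword profiles; a real system would learn these from data.
SPAM_WORDS = {"winner", "free", "prize"}
BINS = {
    "Sports": {"match", "team", "score"},
    "World": {"election", "treaty", "summit"},
}


def is_spam(document: str) -> bool:
    """Filtering: one binary decision per incoming document."""
    return len(SPAM_WORDS & set(document.lower().split())) >= 2


def route(document: str) -> str:
    """Routing: place the document in the bin whose keywords it matches most."""
    words = set(document.lower().split())
    best = max(BINS, key=lambda name: len(BINS[name] & words))
    return best if BINS[best] & words else "Unsorted"


for doc in ["You are a winner of a free prize", "The team won the match"]:
    print(doc, "| spam:", is_spam(doc), "| bin:", route(doc))
```

In both cases the "query" (the keyword profile) stays fixed while new documents keep arriving, which is what makes these searches prospective.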
22. What types of information? Text (documents and portions thereof); XML and structured documents; images; audio (sound effects, songs, etc.); video; source code; applications/Web services.
23. Content-Based Search. This is a relatively new concept! What else would you search on? What's more effective? Why is this hard in many applications?
24. Interesting Examples. Google image search: http://images.google.com/ Google video search: http://video.google.com/ Query by humming: http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming.html
25. What about databases? What are examples of databases? Banks storing account information; retailers storing inventories; universities storing student grades. What exactly is a (relational) database? Think of them as a collection of tables. They model some aspect of "the world".
26. A (Simple) Database Example. [Tables shown: Student Table, Department Table, Course Table, Enrollment Table.]
27. Database Queries. What would you want to know from a database? What classes is John Arrow enrolled in? Who has the highest grade in LBSC 690? Who's in the history department? Of all the non-CLIS students taking LBSC 690 with a last name shorter than six characters who were born on a Monday, who has the longest email address?
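The first three questions can be written as formal queries. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the column names and every row of data (including the second student and the HIST 720 course) are made up for illustration, since the slides name the four tables but do not show their contents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, department_id INTEGER);
CREATE TABLE course (id INTEGER PRIMARY KEY, code TEXT);
CREATE TABLE enrollment (student_id INTEGER, course_id INTEGER, grade INTEGER);

-- Invented sample rows, for illustration only.
INSERT INTO department VALUES (1, 'History'), (2, 'CLIS');
INSERT INTO student VALUES (1, 'John Arrow', 1), (2, 'Susan Wong', 2);
INSERT INTO course VALUES (1, 'LBSC 690'), (2, 'HIST 720');
INSERT INTO enrollment VALUES (1, 1, 82), (1, 2, 90), (2, 1, 95);
""")

# What classes is John Arrow enrolled in?
classes = conn.execute("""
    SELECT course.code FROM course
    JOIN enrollment ON enrollment.course_id = course.id
    JOIN student ON student.id = enrollment.student_id
    WHERE student.name = ? ORDER BY course.code
""", ("John Arrow",)).fetchall()

# Who has the highest grade in LBSC 690?
top = conn.execute("""
    SELECT student.name FROM student
    JOIN enrollment ON enrollment.student_id = student.id
    JOIN course ON course.id = enrollment.course_id
    WHERE course.code = ? ORDER BY enrollment.grade DESC LIMIT 1
""", ("LBSC 690",)).fetchone()

# Who's in the history department?
history = conn.execute("""
    SELECT student.name FROM student
    JOIN department ON department.id = student.department_id
    WHERE department.name = ?
""", ("History",)).fetchall()

print(classes, top, history)
```

Each query has exactly one correct answer for a given set of tables, in contrast to the vague information needs an IR system has to serve.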
28. Databases vs. IR
What we're retrieving. IR: Mostly unstructured; free text with some metadata. Databases: Structured data; clear semantics based on a formal model.
Queries we're posing. IR: Vague, imprecise information needs (often expressed in natural language). Databases: Formally (mathematically) defined queries; unambiguous.
Results we get. IR: Sometimes relevant, often not. Databases: Exact; always correct in a formal sense.
Interaction with system. IR: Interaction is important. Databases: One-shot queries.
Other issues. IR: Issues downplayed. Databases: Concurrency, recovery, atomicity are all critical.
29. The Big Picture. The four components of the information retrieval environment: User, Process, System, Collection. [Slide labels: "What computer geeks care about!" and "What we care about!"]
30. The Information Retrieval Cycle. [Diagram: Source Selection -> (Resource) -> Query Formulation -> (Query) -> Search -> (Ranked List) -> Selection -> (Documents) -> Examination -> (Documents) -> Delivery, with feedback loops labeled "query reformulation, vocabulary learning, relevance feedback" and "source reselection".]
31. Supporting the Search Process. [Diagram: the retrieval cycle (Source Selection -> Query Formulation -> Search -> Selection -> Examination -> Delivery) with the system side added: Acquisition -> (Collection) -> Indexing -> (Index), which supports Search.]
32. Simplification? Is this itself a vast simplification? [Diagram: the information retrieval cycle from slide 30, repeated.]
33. Tackling the IR Challenge. Divide and conquer! Strategy: use encapsulation to limit complexity. Approach: define interfaces (input and output) for each component; define the functions performed by each component; study each component in isolation; repeat the process within components as needed; make sure that this decomposition makes sense. Result: a hierarchical decomposition.
34. Where do we make the cut? Study the IR black box in isolation: simple behavior (in goes a query, out come documents); optimize the quality of the documents that come out. Study everything else around the black box: put the human back in the loop! [Diagram: Query -> Search -> Ranked List.]
35. The IR Black Box. [Diagram: Query and Documents go into the black box; Hits come out.]
36. Inside the IR Black Box. [Diagram: Query -> Representation Function -> Query Representation; Documents -> Representation Function -> Document Representation -> Index; a Comparison Function matches the Query Representation against the Index and returns Hits.]
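The components inside the black box can be sketched as three small functions. This is a deliberately naive illustration: the representation function is assumed to be a lowercase bag of words, the comparison function simply counts matched query terms, and the "index" is a plain dictionary of document representations. The sample documents are invented.

```python
from collections import Counter


def represent(text: str) -> Counter:
    """Representation function: lowercase bag of words (used for queries and documents)."""
    return Counter(text.lower().split())


def compare(query_rep: Counter, doc_rep: Counter) -> int:
    """Comparison function: occurrences of the query's terms in the document."""
    return sum(doc_rep[term] for term in query_rep)


def search(query: str, index: dict) -> list:
    """Return ids of documents with a non-zero score, best first (ties broken by id)."""
    q = represent(query)
    scored = [(compare(q, rep), doc_id) for doc_id, rep in index.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]


documents = {
    "d1": "football match results and sports news",
    "d2": "university education and science courses",
    "d3": "sports science in university education",
}
index = {doc_id: represent(text) for doc_id, text in documents.items()}
print(search("sports science", index))
```

Swapping in a better representation or comparison function changes the quality of the hits without changing the interfaces, which is the point of studying the black box in isolation.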
37. The Central Problem in IR. [Diagram: the Information Seeker turns Concepts into Query Terms; Authors turn Concepts into Document Terms.] Do these represent the same concepts?
39. Imagine a System. We have thousands of web pages; what makes these web pages different? Perhaps the different key terms or keywords occurring in them (e.g., sports, education, video sharing).
40. Realize Query Needs. What do we expect when we query? A query can be a single word (no order), a collection of words, i.e., a free sentence (order does not matter), or a strict phrase (order matters, e.g., "I love Pakistan"). How do we manage the data of web pages? With a bag-of-words data structure, with or without the positions of words/terms (simply, a posting list of words/terms). What's the best match? We have many matching results, but what's the order?
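A posting list with positions, and the difference between a free-sentence query and a strict phrase query, can be sketched as follows. The function names and the three sample pages are assumptions made for this illustration; tokenization is plain whitespace splitting.

```python
from collections import defaultdict


def build_index(pages: dict) -> dict:
    """Map each term to {page_id: [positions]}, i.e. a positional posting list."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(page_id, []).append(position)
    return index


def match_all(index: dict, query: str) -> set:
    """Free-sentence query: pages containing every query term, in any order."""
    postings = [set(index.get(term, {})) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


def match_phrase(index: dict, query: str) -> set:
    """Phrase query: the terms must appear at consecutive positions."""
    terms = query.lower().split()
    hits = set()
    for page_id in match_all(index, query):
        starts = index[terms[0]][page_id]
        if any(all(s + i in index[t][page_id] for i, t in enumerate(terms)) for s in starts):
            hits.add(page_id)
    return hits


pages = {
    "p1": "I love Pakistan",
    "p2": "Pakistan I love cricket",
    "p3": "I love cricket",
}
index = build_index(pages)
print(match_all(index, "I love Pakistan"))     # order ignored
print(match_phrase(index, "I love Pakistan"))  # order enforced
```

Without stored positions only `match_all` would be possible; the positions are what allow the phrase check.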
41. Order of Matching Results. How could we rank web pages? Via the matching score of the query content against web pages, i.e., content based methods. Via the importance of web pages, i.e., link based methods.
42. What does Content Tell? Content information: rare terms give more information than frequent terms, as common terms do not differentiate well between the content of documents (information entropy). So what do common words amount to? Stop words (the extreme case, e.g., it, a, the) or words of lesser importance (e.g., the word "science" inside scientific documents).
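One common way to turn "rare terms give more information" into numbers is inverse document frequency (idf), combined with term counts and cosine similarity. The sketch below uses the plain variant idf = log(N / df) on three invented documents; real systems usually add smoothing and length normalization, which are left out here.

```python
import math
from collections import Counter

docs = {
    "d1": "the science of the stars",
    "d2": "the science of cooking",
    "d3": "the history of the violin",
}
tokenized = {d: text.split() for d, text in docs.items()}


def idf(term: str) -> float:
    """Inverse document frequency: log(N / number of documents containing the term)."""
    df = sum(1 for tokens in tokenized.values() if term in tokens)
    return math.log(len(tokenized) / df) if df else 0.0


def tfidf(tokens: list) -> dict:
    """Weight each term by its count in the text times its idf."""
    return {t: c * idf(t) for t, c in Counter(tokens).items()}


def cosine(a: dict, b: dict) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def rank(query: str) -> list:
    """Order document ids by cosine similarity to the query, best first."""
    q = tfidf(query.split())
    scores = {d: cosine(q, tfidf(tokens)) for d, tokens in tokenized.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))


print(idf("the"), idf("science"), idf("stars"))
print(rank("the violin"))
```

A word that appears in every document, such as "the", gets weight zero and so behaves like a stop word without needing an explicit stop list.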
43. Ranking Methods. Content based methods, for example: tf-idf with cosine similarity, BM25, etc. Link based methods, for example: PageRank, HITS, etc.
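For the link based family, PageRank can be sketched as a power iteration. This is a simplified illustration, not the production algorithm: the damping factor of 0.85 and the fixed 50 iterations are conventional but assumed values, every link target is assumed to be a key of the graph, and the four-page graph is invented.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            targets = outgoing or pages  # a page without links spreads its rank evenly
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


links = {
    "home": ["news", "sports"],
    "news": ["home"],
    "sports": ["home", "news"],
    "blog": ["home"],
}
scores = pagerank(links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

The score depends only on who links to whom, never on the words in a page, which is why it complements the content based methods.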
44. What is More in Ranking? What other measures can we take to rank better? Combining content based methods with link based methods. How about learning to rank from user click-through data (apply machine learning)? How about learning from the social web (apply social science theories)?
45. Lots of Web Pages. How about scalability? We have too many words; can we limit them? Example: is "studying" conceptually different from "study" or "studies"? Maybe not (a technique called stemming could reduce them all to the single concept "study"). Stemming may not be sufficient; then how about clustering web pages into topics (e.g., the terms study, science, arts, university, school, college would form a single concept, or topic, which may be called education)?
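The idea of stemming can be shown with a toy suffix-stripping function. The three rules below are invented for this example and are far cruder than a real stemmer such as Porter's; they are only meant to show how several surface forms collapse into one index term.

```python
# Hypothetical rules: (suffix to remove, replacement), tried in order.
RULES = (("ies", "y"), ("ing", ""), ("s", ""))


def toy_stem(word: str) -> str:
    """Apply the first matching rule, keeping at least three letters of the word."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word


words = ["Studying", "study", "studies", "universities", "university"]
vocabulary = {toy_stem(w) for w in words}
print(vocabulary)  # five surface forms reduce to two index terms
```

Fewer distinct terms means a smaller index, at the cost of occasionally merging words that should stay apart.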
46. Is it sufficient? Can we feel confident about how a Web search engine works? No, it was just a summary for the day.
47. Guess! What would you see next?
48. Our search engine. Yes, we are making it.
50. Outline. What is Research? How to prepare yourself for IR research?
51. What is Research?
52. What is Research? Research: discover new knowledge; seek answers to questions. Basic research: goal: expand man's knowledge (e.g., which genes control social behavior of honey bees?); often driven by curiosity (but not always); high impact examples: relativity theory, DNA, ... Applied research: goal: improve the human condition, i.e., improve the world (e.g., how to cure cancers?); driven by practical needs; high impact examples: computers, transistors, vaccinations, ... The boundary is vague; the distinction isn't important.
53. Why Research? [Diagram labels: Basic Research, Applied Research, Application Development; Curiosity, Funding, Amount of knowledge, Advancement of Technology, Utility of Applications.]
54. Where's IR Research? [Diagram labels: Basic Research, Applied Research, Application Development; Funding, Quality of Life, Amount of knowledge, Advancement of Technology, Utility of Applications; Information Science, Computer Science.]
55. Research Process. Identification of the topic (e.g., Web search). Hypothesis formulation (e.g., algorithm X is better than Y=state-of-the-art). Experiment design (measures, data, etc.) (e.g., retrieval accuracy on a sample of web data). Test the hypothesis (e.g., compare X and Y on the data). Draw conclusions and repeat the cycle of hypothesis formulation and testing if necessary (e.g., Y is better only for some queries, now what?).
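The "compare X and Y on the data" step can be sketched with one simple measure, precision at k. The rankings and relevance judgments below are invented, and precision at k is only one of many possible evaluation measures; a real experiment would also need many more queries and a significance test.

```python
def precision_at_k(ranking: list, relevant: set, k: int) -> float:
    """Fraction of the top-k results that are judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def mean_precision(run: dict, judgments: dict, k: int) -> float:
    """Average precision@k over all queries in a run."""
    return sum(precision_at_k(run[q], judgments[q], k) for q in run) / len(run)


# Invented relevance judgments and system outputs for two queries.
judgments = {"q1": {"d1", "d2"}, "q2": {"d5"}}
run_x = {"q1": ["d1", "d2", "d9"], "q2": ["d7", "d5", "d8"]}
run_y = {"q1": ["d9", "d1", "d8"], "q2": ["d5", "d7", "d8"]}

print("X:", mean_precision(run_x, judgments, 2))
print("Y:", mean_precision(run_y, judgments, 2))
for q in judgments:
    print(q, precision_at_k(run_x[q], judgments[q], 1), precision_at_k(run_y[q], judgments[q], 1))
```

In this toy data X wins on average at k=2, yet Y places the relevant document first for q2, which is the kind of per-query difference that sends the cycle back to hypothesis formulation.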
56. Typical IR Research Process. Look for a high-impact topic (basic or applied). New problem: define/frame the problem. Identify weaknesses of existing solutions, if any. Propose new methods. Choose data sets (often a main challenge). Design evaluation measures (can be very difficult). Run many experiments (need to have clear research hypotheses). Analyze results and repeat the steps above if necessary. Publish research results.
57. Research Methods. Exploratory research: identify and frame a new problem (e.g., "a survey/outlook of personalized search"). Constructive research: construct a (new) solution to a problem (e.g., "a new method for expert finding"). Empirical research: evaluate and compare existing solutions (e.g., "a comparative evaluation of link analysis methods for web search"). The "E-C-E cycle": exploratory -> constructive -> empirical -> exploratory ...
58. Types of Research Questions and Results. Exploratory (Framework): What's out there? Descriptive (Principles): What does it look like? How does it work? Evaluative (Empirical results): How well does a method solve a problem? Explanatory (Causes): Why does something happen the way it happens? Predictive (Models): What would happen if xxx?
59. Solid and High Impact Research. Solid work: a clear hypothesis (research question) with a conclusive result (either positive or negative); clearly adds to our knowledge base (what can we learn from this work?). Implication: a solid, focused contribution is often better than a non-conclusive broad exploration. High impact = high-importance-of-problem * high-quality-of-solution: high impact = open up an important problem; high impact = close a problem with the best solution; high impact = major milestones in between. Implication: question the importance of the problem, and don't just be satisfied with a good solution; make it the best.
60. How to Prepare Yourself for IR Research?
61. What it Takes to do Research? Curiosity: allows you to ask questions. Critical thinking: allows you to challenge assumptions. Learning: takes you to the frontier of knowledge. Persistence: so that you don't give up. Respect for data and truth: ensures your research is solid. Communication: allows you to publish your work. ...
62. Learning about IR (1/2). Start with an IR textbook (e.g., Manning et al., Grossman & Frieder, a forthcoming book from UMass, ...). Then read "Readings in IR" by Karen Sparck Jones and Peter Willett. And read the papers recommended in the following article: http://www.sigir.org/forum/2005D/2005d_sigirforum_moffat.pdf Read other papers published in recent IR/IR-related conferences.
63. Learning about IR (2/2). Getting more focused: choose your favorite sub-area (e.g., retrieval models); extend your knowledge about related topics (e.g., machine learning, statistical modeling, optimization). Stay at the frontier: keep monitoring the literature in both IR and related areas. Broaden your view: keep an eye on industry activities (read about industry trends; try out novel prototype systems) and on funding trends (read requests for proposals).
64. Critical Thinking. Develop a habit of asking questions, especially "why" questions. Always try to make sense of what you have read/heard; don't let any question pass by. Get used to challenging everything. Practical advice: question every claim made in a paper or a talk (can you argue the other way?); try to write two opposite reviews of a paper (one mainly to argue for accepting the paper and the other for rejecting it); force yourself to challenge one point in every talk that you attend and raise a question.
65. Respect Data and Truth. Be honest with the experiment results. Don't throw away negative results! Try to learn from negative results. Don't twist data to fit your hypothesis; instead, let the hypothesis choose data. Be objective in data analysis and interpretation; don't mislead readers. Aim at understanding/explanation instead of just good results. Be careful not to over-generalize (for both good and bad results); you may be far from the truth.
66. Communications. General communication skills: oral and written; formal and informal; talk to people with different levels of background. Be clear, concise, accurate, and adaptive (elaborate with examples, summarize by abstraction). English proficiency. Get used to talking to people from different fields.
67. Persistence. Work only on topics that you are passionate about. Work only on hypotheses that you believe in. Don't draw negative conclusions prematurely and give up easily: positive results may be hidden in negative results, and in many cases negative results don't completely reject a hypothesis. Be comfortable with criticisms about your work (learn from negative reviews of a rejected paper). Think of possibilities of repositioning a work.
68. Optimize Your Training. Know your strengths and weaknesses: strong in math vs. strong in system development; creative vs. thorough; ... Train yourself to fix weaknesses. Find strategic partners. Position yourself to take advantage of your strengths.
69. Thank You. Reach me on Twitter: @matifq Email me: maqureshi@iba.edu.pk