10 - Information Retrieval - 1

Lecture Information Service Engineering, Summer Semester 2017, Karlsruhe Institute of Technology, KIT Karlsruhe



  1. Information Service Engineering
      Lecture 10: Linked Data Engineering - 5 and Information Retrieval - 1
      Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
      Summer Semester 2017
      This file is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0)
  2. Last Lecture: Linked Data Engineering - 4
      3.1 Knowledge Representations and Ontologies
      3.2 Semantic Web and the Web of Data
      3.3 Linked Data Principles
      3.4 How to name Things - URIs
      3.5 Resource Description Framework (RDF)
      3.6 Creating new Models with RDFS
      3.7 Querying RDF(S) with SPARQL
      3.8 More Expressivity with Web Ontology Language (OWL)
      3.9 Wikipedia, DBpedia, and Wikidata
      3.10 Linked Data Programming
      ● From Wikipedia to DBpedia
      ● Differences between DBpedia and Wikidata
  3. Lecture 10: Linked Data Engineering - 5
      3.1 Knowledge Representations and Ontologies
      3.2 Semantic Web and the Web of Data
      3.3 Linked Data Principles
      3.4 How to name Things - URIs
      3.5 Resource Description Framework (RDF) as simple Data Model
      3.6 Creating new Models with RDFS
      3.7 Querying RDF(S) with SPARQL
      3.8 More Expressivity with Web Ontology Language (OWL)
      3.9 Wikipedia, DBpedia, and Wikidata
      3.10 Linked Data Programming
  4. Linked Data Driven Web Applications
      ● Required Components:
        ○ Local RDF Store
          ■ caching of results
          ■ permanent storage
      Source: M. Hausenblas: Linked Data Applications, DERI Technical Report, 2009
      (3. Linked Data Engineering / 3.10 Linked Data Programming)
  5. Linked Data Driven Web Applications
      ● Required Components:
        ○ Logic (Controller) (= Business Logic) and User Interface
          ■ (not LOD specific)
  6. Linked Data Driven Web Applications
      ● Required Components:
        ○ Data Integration component
          ■ get data directly from the LOD Cloud or
          ■ via a Semantic Indexer
  7. Linked Data Driven Web Applications
      ● Required Components:
        ○ Data Re-/Publishing component
          ■ write application-dependent data back into the Web of Data
  8. Linked Data Driven Web Applications
      ● Required Components:
        ○ Local RDF Store
          ■ caching of results
          ■ permanent storage
        ○ Logic (Controller) (= Business Logic) and User Interface
          ■ (not LOD specific)
        ○ Data Integration component
          ■ get data directly from the LOD Cloud or
          ■ via a Semantic Indexer
        ○ Data Re-/Publishing component
          ■ write application-dependent data back into the Web of Data
      Source: M. Hausenblas: Linked Data Applications, DERI Technical Report, 2009
  9. Linked Data Driven Web Applications
      ● The easiest way is to make use of a suitable library:
        ○ SPARQL JavaScript Library: http://www.thefigtrees.net/lee/blog/2006/04/sparql_calendar_demo_a_sparql.html
        ○ ARC for SPARQL (PHP): https://github.com/semsol/arc2/wiki
        ○ dotNetRDF (C#): https://dotnetrdf.github.io/
        ○ Jena/ARQ (Java): http://jena.apache.org/
        ○ Sesame (Java): http://rdf4j.org/
        ○ SPARQL Wrapper (Python): http://rdflib.github.io/sparqlwrapper/
  10. Linked Data Programming Example
      ● The easiest way is to make use of a suitable library:
        ○ SPARQL Wrapper (Python): http://rdflib.github.io/sparqlwrapper/
      ● Access to Linked Data via SPARQL endpoints
        ○ let's choose DBpedia (just for simplicity): http://dbpedia.org/sparql
      ● Now we have to think of a simple example.
  11. Linked Data Programming Example
      ● Example Application:
        ○ Build a simple application that looks for today's birthdays of famous people, e.g. authors
        ○ Create a list of authors whose birthday is today, including some additional information, e.g.
          ■ Year of birth
          ■ Short description
        ○ Create a simple (local) web page for the task (i.e. encode the results in HTML) that can be displayed in the browser
        ○ We use Python and the SPARQL Wrapper library
  12. Prerequisites - (Manual) Data Analysis
      ● Choose a representative example:
        ○ E.g. Alexandre Dumas from DBpedia: http://dbpedia.org/page/Alexandre_Dumas
  13. Prerequisites - (Manual) Data Analysis
      ● Data Analysis via SPARQL
        ○ What kind of entities are you looking for?
          ■ ?author rdf:type dbo:Writer .
        ○ What information do you need?
          ■ ?author dbo:birthDate ?birthdate .
          ■ ?author rdfs:label ?name .
          ■ ?author rdfs:comment ?description .
          ■ OPTIONAL { ?author dbo:thumbnail ?thumbnail }
        ○ Any filter criteria?
          ■ (lang(?name)="en") && (lang(?description)="en")
          ■ (SUBSTR(STR(?birthdate),6)="07-05")
          ■ More sophisticated: replace the fixed "07-05" with SUBSTR(STR(bif:curdate('')),6), where bif:curdate is a built-in function of the Virtuoso triple store that returns the current date
  14. Prerequisites - SPARQL Queries
      [Screenshot: SPARQL query at dbpedia.org]
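The triple patterns and filters collected during the data analysis can be assembled into one query string. The following is a minimal sketch, not the query shown on the original slide: the function name `build_birthday_query`, the `LIMIT` clause, and the choice to compute the month and day in Python (instead of Virtuoso's `bif:curdate`) are assumptions made for illustration.

```python
from datetime import date

# Standard namespace prefixes used by the triple patterns on the slide.
PREFIXES = """\
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
"""


def build_birthday_query(day: date, limit: int = 50) -> str:
    """Return a SPARQL query for writers born on the month and day of `day`.

    `limit` is an assumption added to keep result sets small.
    """
    month_day = day.strftime("%m-%d")  # e.g. "07-05"
    return PREFIXES + f"""
SELECT ?author ?name ?birthdate ?description ?thumbnail
WHERE {{
  ?author rdf:type dbo:Writer .
  ?author dbo:birthDate ?birthdate .
  ?author rdfs:label ?name .
  ?author rdfs:comment ?description .
  OPTIONAL {{ ?author dbo:thumbnail ?thumbnail }}
  FILTER ( lang(?name) = "en" && lang(?description) = "en"
           && SUBSTR(STR(?birthdate), 6) = "{month_day}" )
}}
LIMIT {limit}
"""


if __name__ == "__main__":
    print(build_birthday_query(date.today()))
```

Computing the date on the client keeps the query portable to endpoints that are not backed by Virtuoso.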
  15. Prerequisites - Download & Install
      ● You should have Python installed on your computer
        ○ https://www.python.org/downloads/
      ● Download SPARQL Wrapper for Python
        ○ http://rdflib.github.io/sparqlwrapper/
      ● Follow the installation instructions
  17. Linked Data Programming Example
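The code on the original example slides did not survive extraction. As a hedged stand-in, the sketch below shows one way the example application could be put together with the SPARQL Wrapper library: send a query to the DBpedia endpoint, then encode the result bindings as a small HTML page. The function names `fetch_bindings` and `render_html` and the page layout are assumptions; the variable names `name`, `birthdate`, and `description` follow the query variables from the data analysis slide.

```python
import html

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint named on the slides


def fetch_bindings(query: str, endpoint: str = ENDPOINT) -> list:
    """Send `query` to the SPARQL endpoint and return the JSON result bindings."""
    # Imported here so that render_html can be used without the library installed.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]


def render_html(bindings: list) -> str:
    """Encode result bindings as a simple HTML page (hypothetical layout)."""
    items = []
    for row in bindings:
        name = html.escape(row["name"]["value"])
        # Assumes an xsd:date literal such as "1802-07-24"; the year is the first 4 characters.
        year = html.escape(row["birthdate"]["value"][:4])
        description = html.escape(row["description"]["value"])
        items.append(f"<li><b>{name}</b> ({year}): {description}</li>")
    return (
        "<html><body><h1>Today's birthdays</h1><ul>\n"
        + "\n".join(items)
        + "\n</ul></body></html>"
    )
```

Writing the returned string to a local `.html` file and opening it in a browser gives the local web page described in the example task.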
  18. Linked Data Programming Example
      ● Alternative: simple example with Jena ARQ and Java: http://jena.apache.org/

          // Apache Jena 3 and later; older releases used the package com.hp.hpl.jena.query
          import org.apache.jena.query.*;

          String service = "...";        // address of the SPARQL endpoint
          String query = "SELECT ...";   // your SPARQL query
          QueryExecution e = QueryExecutionFactory.sparqlService(service, query);
          ResultSet results = e.execSelect();
          while (results.hasNext()) {
              QuerySolution s = results.nextSolution();
              // ... process one result row
          }
          e.close();                     // release the connection
  19. You may also try out Wikidata
      [Screenshot: SPARQL query at wikidata.org]
  20. Next Lecture: Linked Data Engineering - 5 (outline 3.1 to 3.10, as on slide 3)
  21. Lecture Overview
      1. Information, Natural Language and the Web
      2. Natural Language Processing
      3. Linked Data Engineering
      4. Information Retrieval
      5. Knowledge Mining
      6. Exploratory Search and Recommender Systems
  22. Lecture 10: Information Retrieval
      4.1 A Brief History of Libraries and IR
      4.2 Fundamental Concepts of IR
      4.3 Information Retrieval Models
      4.4 Retrieval Evaluation
      4.5 Web Information Retrieval
      4.6 Document Crawling, Text Processing, and Indexing
      4.7 Query Processing and Result Representation
      4.8 Question Answering
  23. Information Retrieval
      ● “Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Gerard Salton, 1968, [1])
      ● “IR is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).” (Manning et al., 2008, [2])
      (4. Information Retrieval / 4.1 A Brief History of Libraries and IR)
  24. How old is Information Retrieval?
  25. Libraries and Information Retrieval
      ● 3000 - 2000 BC: Sumerian archives
        ○ Clay tablets in cuneiform script stored in temple rooms
        ○ Mostly inventories and records of commercial transactions
      ● 300 BC: Library of Alexandria
        ○ Idea: a universal library holding copies of all the world’s books
        ○ At its height, the library contained almost 750,000 books in the form of papyrus scrolls
      Image: https://commons.wikimedia.org/wiki/File:Milkau_Oberer_Teil_der_Stele_mit_dem_Text_von_Hammurapis_Gesetzescode_369-2.jpg
  26. Libraries and Information Retrieval
      ● Middle Ages: Monastic Libraries
        ○ Christian monks saved texts of Roman and Greek antiquity from being lost by copying them
        ○ Vatican Library founded in 1475
      ● c. 1450: Printing Press
        ○ Johannes Gutenberg introduced movable type to Europe
        ○ Copying books became much easier and less expensive
      Image: https://commons.wikimedia.org/wiki/File:Buchdrucker-1568.png
  27. Libraries and Information Retrieval
      ● German National Library
        ○ 24M items
        ○ Located in Leipzig, Frankfurt (Main), and Berlin
      ● Library of Congress
        ○ the world’s largest library
        ○ 155M items
        ○ Classification system: Library of Congress Classification
  28. Library Catalog and Index
      ● Items are catalogued by metadata
        ○ Author, Editor, Title, ISBN, ...
        ○ Keyword, e.g. “information retrieval”
        ○ Subject area, e.g. “information systems”
        ○ Specialized classification systems
          ■ Library of Congress Classification (LCC)
          ■ Dewey Decimal Classification (DDC)
          ■ Universal Decimal Classification (UDC)
      Example record: http://www.worldcat.org/title/modern-information-retrieval/oclc/40602840
  29. Full Text Search and Concordance
      ● Catalogue cards serve as document proxies
      ● Experts must catalogue each item individually
      ● Full text search: every word is a keyword
      Image: https://commons.wikimedia.org/wiki/File%3ASchlagwortkatalog.jpg
  30. Full Text Search and Concordance
      ● Before information retrieval, in the pre-computer era: Concordances
        ○ Alphabetical list of the principal words used in a book, listing every instance of each word with its immediate context
        ○ Only for works of special importance, e.g. the Bible
        ○ First Bible concordance by Hugh of Saint-Cher, with the help of 500 monks, c. 1250
      Image: https://commons.wikimedia.org/w/index.php?title=File:A_Concordance_to_the_English_Poems_of_Thomas_Gray_(1908).djvu&page=17
  31. Early Information Retrieval
      ● 1957: Hans-Peter Luhn (IBM) uses words as indexing units for documents
        ○ Measure similarity between documents by word overlap
      ● 1960s and 1970s: Gerard Salton and his students (Harvard, Cornell) create the SMART system
        ○ Vector space model
        ○ Relevance feedback
      ● 1972: Karen Spärck Jones introduces inverse document frequency
  32. Information Retrieval Timeline
      ● 1992: TREC - annual Text Retrieval Conference
        ○ Sponsored by the U.S. National Institute of Standards and Technology and the U.S. Department of Defense
        ○ Many different tracks, e.g. blogs, genomics, spam, video, etc.
        ○ Provides data sets and test problems
      ● 1994: WebCrawler, the first full-text Web search engine
      ● 1998: Google
      ● Current Research Questions:
        ○ Scalability, Speed, Quality
      ● IR related Research at ISE:
        ○ Semantic Search, Exploratory Search, Question Answering
  33. Lecture 10: Information Retrieval (outline 4.1 to 4.8, as on slide 22)
  34. Information Retrieval Basics
      ● A document is a coherent passage of free text
      ● Examples:
        ○ Web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, IM sessions, dictionary entries, etc.
      ● Common properties:
        ○ Written in natural language
        ○ Significant text content
        ○ Some structure, e.g.
          ■ Papers: title, author, date, or
          ■ Email: subject, sender, destination, date
      Image: https://pixabay.com/en/files-paper-office-paperwork-stack-1614223/
      (4. Information Retrieval / 4.2 Fundamental Concepts of IR)
  35. Information Retrieval Basics
      ● A document collection is a set of documents
        ○ also known as a corpus
        ○ usually, all documents within a collection are similar with respect to some criterion
      ● Examples:
        ○ Chinese Patents
        ○ The articles covered by The New York Times
        ○ Amazon Product Reviews
        ○ The Web
  36. Information Retrieval Basics
      ● An information need is the topic about which the user (or a group of users) desires to know more
        ○ Refers to an individual, hidden cognitive state
        ○ Paradoxical: It describes the user’s ignorance
        ○ Ill-defined
      ● Examples:
        ○ What is the capital of Uruguay?
        ○ Is it really true that Elvis is still alive?
        ○ Show me some definitions of “information need”!
  37. Information Retrieval Basics
      ● A query is what the user conveys to the computer in an attempt to communicate the information need
        ○ Stated in a formal query language
        ○ Usually a list of search terms (keywords)
      ● Keyword queries are often poor descriptions of actual information needs
        ○ E.g., a query for “jaguar” could mean “places to buy Jaguar cars” or the cat.
      ● Search queries (in particular one-word queries) are under-specified.
        ○ Semantics of long queries are ignored
  38. Information Retrieval Basics
      ● A document is relevant with respect to some user’s information need if the user perceives it as containing information of value with respect to this information need
        ○ Usually assumed to be a binary concept, but could also be graded
      ● Example:
        ○ Information need: “What is relevance in IR?”
      ● Relevant document:
        ○ Wikipedia’s entry “Relevance (information retrieval)”
  39. The Information Retrieval Paradigm
      [Diagram: a set of queries leads via query formulation to a query; a set of documents leads via indexing to an index; query and index are matched based on (string) similarity]
      Images: https://pixabay.com/en/files-paper-office-paperwork-stack-1614223/ and https://pixabay.com/en/post-it-paper-notes-record-memory-1079361/
  40. Classical Information Retrieval: Simplified Form
      [Diagram labels: search term(s), keyword(s), search index, search query, document corpus, document]
  41. IR System Architecture - Indexing: Basic Building Blocks
      [Diagram: Text Acquisition, Text Transformation, Index Creation, Index, Document Store]
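The indexing building blocks can be sketched in a few lines. The example below is an illustration under simplifying assumptions, not the architecture of any particular system: text acquisition is reduced to an in-memory dictionary of documents, text transformation to lowercasing and splitting on non-alphanumeric characters, and index creation to an inverted index that maps each term to the documents containing it together with the term frequency. The function names `tokenize` and `build_index` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Text transformation: lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict) -> dict:
    """Index creation: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


if __name__ == "__main__":
    # Text acquisition stand-in: a tiny in-memory document store.
    store = {1: "Jaguar cars and more cars", 2: "The jaguar is a cat"}
    for term, postings in sorted(build_index(store).items()):
        print(term, postings)
```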
  42. IR System Architecture - Querying: Basic Building Blocks
      [Diagram: User Interaction, Ranking, Evaluation, Index, Document Store, Log Data]
      ● Retrieval model uses queries and index to generate a ranked list of documents
  43. Lecture 10: Information Retrieval (outline 4.1 to 4.8, as on slide 22)
  44. Categorization of IR Models
      ● Set-theoretic models
        ○ represent documents as sets of words or phrases
        ○ similarities are usually derived from set-theoretic operations on those sets
      (4. Information Retrieval / 4.3 Information Retrieval Models)
  45. Categorization of IR Models
      ● Algebraic models
        ○ usually represent documents and queries as vectors, matrices, or tuples
        ○ similarity of the query vector and document vector is represented as a scalar value
  46. Categorization of IR Models
      ● Probabilistic models
        ○ treat the process of document retrieval as probabilistic inference
        ○ similarities are computed as probabilities that a document is relevant for a given query
  47. Categorization of IR Models
      ● Models without term interdependencies
        ○ treat different terms/words as independent
        ○ in vector space models this is represented by the orthogonality assumption of term vectors
        ○ in probabilistic models this is represented by an independence assumption for term variables
  48. Categorization of IR Models
      ● Models with immanent term interdependencies
        ○ allow a representation of interdependencies between terms
        ○ the interdependency between two terms is defined by the model itself
      ● Models with transcendent term interdependencies
        ○ do not specify how the interdependency between two terms is defined
        ○ rely on an external source for the degree of interdependency between two terms
  49. Categorization of IR Models
      Source: Dominik Kuropka: Modelle zur Repräsentation natürlichsprachlicher Dokumente. Ontologie-basiertes Information-Filtering und -Retrieval mit relationalen Datenbanken, Advances in Information Systems and Management Science, Bd. 10, Logos Verlag, Berlin, 2004.
  50. Boolean Retrieval Model
      ● Propositional logic as retrieval language
      ● Selection and connection of arbitrary document sets via Boolean connectors (search operators)
      ● Easy to implement
      ● No differentiated term weights
      ● No ranking
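A minimal sketch of Boolean retrieval, assuming whitespace tokenization and a toy three-document collection: each term maps to the set of documents that contain it, and the Boolean connectors AND, OR, and NOT become set intersection, union, and difference. The helper names `build_postings` and `docs_with` are hypothetical.

```python
def build_postings(docs: dict) -> dict:
    """Map each term to the set of ids of the documents that contain it."""
    postings = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            postings.setdefault(term, set()).add(doc_id)
    return postings


def docs_with(postings: dict, term: str) -> set:
    """Return the document set for one term (empty if the term is unknown)."""
    return postings.get(term.lower(), set())


if __name__ == "__main__":
    docs = {1: "jaguar car dealer", 2: "jaguar cat habitat", 3: "car dealer prices"}
    postings = build_postings(docs)
    all_ids = set(docs)
    # jaguar AND NOT car
    print(docs_with(postings, "jaguar") & (all_ids - docs_with(postings, "car")))
    # car OR cat
    print(docs_with(postings, "car") | docs_with(postings, "cat"))
```

The result is an unordered set: a document either matches the expression or it does not, which is why the model offers no ranking.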
  51. Vector Space Model
      ● Documents and queries are represented as points in a high-dimensional vector space ℝ^n
      ● For retrieval, the Euclidean distance or the cosine similarity between the search query vector and the document vector is used
      ● Ranking according to distance
      ● Differentiated term weights
      ● Linear order of terms in documents is lost
      ● No semantic sensitivity (vocabulary dependency)
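The vector space model can be illustrated with a short sketch. The weighting scheme here is an assumption, since the slide does not fix one: term weights are raw term frequency times log(N / document frequency), vectors are stored sparsely as dictionaries, and documents are ranked by cosine similarity to the query. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs: dict):
    """Return sparse tf-idf vectors per document and the idf table."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter(t for tokens in tokenized.values() for t in set(tokens))
    idf = {t: math.log(n_docs / count) for t, count in df.items()}
    vectors = {
        d: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for d, tokens in tokenized.items()
    }
    return vectors, idf


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query: str, docs: dict) -> list:
    """Return (similarity, doc_id) pairs, most similar document first."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that do not occur in the collection get weight 0.
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.lower().split()).items()}
    return sorted(((cosine(q, vec), d) for d, vec in vectors.items()), reverse=True)


if __name__ == "__main__":
    docs = {"d1": "jaguar car dealer", "d2": "jaguar cat habitat", "d3": "car dealer prices"}
    for similarity, doc_id in rank("jaguar cat", docs):
        print(doc_id, round(similarity, 3))
```

Because each document is reduced to a bag of weighted terms, word order plays no role, and a document that uses a synonym of a query term scores zero for it.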
  52. Probabilistic Model
      ● Documents are weighted according to their relevance for a search query
      ● The IR system estimates the probability of relevance for a search query
      ● Relevance feedback for search query q yields term weights for the terms t_i of q
      ● For a new document d_m, the relevance of d_m for the search query q can be determined via the weights of the terms t_i
  53. Lecture 10: Information Retrieval (outline 4.1 to 4.8, as on slide 22)
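The slide does not name a specific weighting formula. As one possible instance, the sketch below uses the Robertson/Spärck Jones relevance weight with 0.5 smoothing, a technique swapped in here for illustration: from documents that a user judged relevant for query q, it estimates a weight per query term t_i, and then scores a new document d_m by summing the weights of the query terms it contains. Documents are modelled as sets of terms, and all function names are hypothetical.

```python
import math


def rsj_weight(n_docs: int, n_with_term: int, n_relevant: int, n_relevant_with_term: int) -> float:
    """Robertson/Spärck Jones relevance weight with 0.5 smoothing."""
    N, n, R, r = n_docs, n_with_term, n_relevant, n_relevant_with_term
    odds_relevant = (r + 0.5) / (R - r + 0.5)
    odds_non_relevant = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(odds_relevant / odds_non_relevant)


def term_weights(query_terms, docs: dict, relevant_ids: set) -> dict:
    """Estimate one weight per query term from relevance feedback."""
    weights = {}
    for term in query_terms:
        containing = {d for d, terms in docs.items() if term in terms}
        weights[term] = rsj_weight(
            len(docs), len(containing), len(relevant_ids), len(containing & relevant_ids)
        )
    return weights


def score(doc_terms: set, weights: dict) -> float:
    """Score a (possibly new) document by the weights of the query terms it contains."""
    return sum(w for term, w in weights.items() if term in doc_terms)


if __name__ == "__main__":
    docs = {
        1: {"jaguar", "cat"},
        2: {"jaguar", "car"},
        3: {"cat", "zoo"},
        4: {"car", "dealer"},
    }
    feedback = {1, 3}  # documents a user marked as relevant for the query
    weights = term_weights(["jaguar", "cat"], docs, feedback)
    print(weights)
    print(score({"cat", "habitat"}, weights))
```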
  54. Retrieval Evaluation
      [Diagram: User Interaction, Ranking, Evaluation, Index, Document Store, Log Data]
      ● Evaluation: monitors and measures effectiveness and efficiency (primarily offline)
      (4. Information Retrieval / 4.4 Retrieval Evaluation)
  55. Retrieval Evaluation
      ● Evaluation is key to building effective and efficient search engines.
      ● Drives advancement of search engines (when intuition fails)
      ● Measurement usually carried out in controlled laboratory experiments (to control the many factors)
      ● Effectiveness: Measures ability to find the right information
        ○ Compare ranking to user relevance feedback
      ● Efficiency: Measures ability to do this quickly
        ○ Measure time and space requirements
      ● Effectiveness, efficiency, and cost are related
        ○ Efficiency and cost targets may impact effectiveness.
  56. Retrieval Evaluation
      ● How to objectively measure the quality of a (classification) experiment?
        ○ Compare your achieved results with a ground truth (gold standard)
      ● How to obtain a ground truth?
        ○ Often this means investing manual effort
      ● How to compare achieved results with a ground truth?
        ○ Correctness: Precision
        ○ Completeness: Recall
        ○ Correctness & Completeness: F-Measure
  57. Confusion Matrix
      ● Contains information about relevant documents and documents retrieved by a search engine
      ● A table with two rows and two columns that reports the number of
        ○ false positives, false negatives, true positives, and true negatives.

                                      Search results
                                      retrieved: true    retrieved: false
      Ground truth  relevant: true    true positive      false negative
                    relevant: false   false positive     true negative
  58. 58. Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology Recall and Precision ● Recall is the fraction of relevant documents that are retrieved relevant documents retrieved documents True Positives False Negative True Negatives False Positive 4. Information Retrieval / 4.4 Retrieval Evaluation ● Precision is the fraction of retrieved documents that are relevant
  59. 59. F-Measure
    ● The F1-Measure is the harmonic mean of precision and recall.
    [Figure: Venn diagram of relevant documents and retrieved documents with true positives, false negatives, false positives, and true negatives.]
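The harmonic mean of precision P and recall R is F1 = 2PR / (P + R). The sketch below computes all three measures from confusion-matrix counts. The function name, the toy counts, and the convention of returning 0.0 when a denominator is zero are assumptions made for this example, not something the slides prescribe.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) computed from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # nothing retrieved -> 0.0 by convention
    recall = tp / (tp + fn) if tp + fn else 0.0     # nothing relevant -> 0.0 by convention
    if precision + recall == 0:
        return precision, recall, 0.0               # avoid dividing by zero
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy counts (invented): 2 true positives, 3 false positives, 2 false negatives.
p, r, f1 = precision_recall_f1(tp=2, fp=3, fn=2)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, a system cannot reach a high F1 by maximizing only precision or only recall.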
  67. 67. Ranking Effectiveness
    ● Problem: Evaluate a ranking and not just a Boolean classification
    ● Idea: Calculate recall and precision at every rank position
    [Figure: two rankings of 10 documents each, with the relevant documents highlighted; 6 documents are relevant in total.]

    Ranking #1
      Rank       1     2     3     4     5     6     7     8     9     10
      Relevant   yes   no    yes   yes   yes   yes   no    no    no    yes
      Recall     0.17  0.17  0.33  0.5   0.67  0.83  0.83  0.83  0.83  1.0
      Precision  1.0   0.5   0.67  0.75  0.8   0.83  0.71  0.63  0.56  0.6
    Ranking #2
      Rank       1     2     3     4     5     6     7     8     9     10
      Relevant   no    yes   no    no    yes   yes   yes   no    yes   yes
      Recall     0.0   0.17  0.17  0.17  0.33  0.5   0.67  0.67  0.83  1.0
      Precision  0.0   0.5   0.33  0.25  0.4   0.5   0.57  0.5   0.56  0.6
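The recall and precision rows above can be recomputed with a short loop. This is a minimal sketch: the Boolean relevance lists are read off the recall values on the slide (recall rises exactly at the ranks that hold a relevant document), and the function name is chosen for this example only.

```python
def precision_recall_at_each_rank(relevance):
    """Return a list of (rank, recall, precision) tuples.

    relevance[i] is True if the document at rank i + 1 is relevant.
    Assumes every relevant document appears somewhere in the list,
    as in the slide example (6 relevant documents, all of them ranked).
    """
    total_relevant = sum(relevance)
    rows, hits = [], 0
    for rank, is_relevant in enumerate(relevance, start=1):
        hits += is_relevant                       # True counts as 1
        rows.append((rank, hits / total_relevant, hits / rank))
    return rows


ranking_1 = [True, False, True, True, True, True, False, False, False, True]
ranking_2 = [False, True, False, False, True, True, True, False, True, True]

for name, ranking in (("Ranking #1", ranking_1), ("Ranking #2", ranking_2)):
    print(name)
    for rank, recall, precision in precision_recall_at_each_rank(ranking):
        print(f"  rank {rank:2d}  recall={recall:.2f}  precision={precision:.2f}")
```

Two-decimal formatting may differ from the slide in the last digit for borderline values such as 5/8 = 0.625, which the slide shows as 0.63.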
  68. 68. Summarizing a Ranking
    ● Problem: Long lists are difficult to compare
    ● Ideas:
      1. Calculate recall and precision at a small number of fixed rank positions
        ■ Compare two rankings for the same query:
          ● If precision at position p is higher, recall at position p is higher too.
          ● “Precision at rank p” (p=5, p=10, p=20)
          ● Ignores the ranking after p and ignores the order within ranks 1 to p.
      2. Average the precision values from the rank positions where relevant documents are retrieved
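Precision at rank p can be sketched in a few lines. The example reuses the two rankings from the "Ranking Effectiveness" slide, with relevance flags reconstructed from the recall values shown there; the function name is an assumption for this example, and dividing by p even when fewer than p documents are returned is one common convention, not something the slide specifies.

```python
def precision_at(relevance, p):
    """Fraction of the top-p ranked documents that are relevant."""
    return sum(relevance[:p]) / p


ranking_1 = [True, False, True, True, True, True, False, False, False, True]
ranking_2 = [False, True, False, False, True, True, True, False, True, True]

for p in (5, 10):
    print(f"P@{p}: ranking #1 = {precision_at(ranking_1, p):.2f}, "
          f"ranking #2 = {precision_at(ranking_2, p):.2f}")
```

At p=5 the measure separates the two rankings, while at p=10 both contain the same 6 relevant documents and score identically, which illustrates that the order within ranks 1 to p is ignored.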
  69. 69. Average Precision
    Rankings #1 and #2 are the two rankings from the "Ranking Effectiveness" slide; each contains all 6 relevant documents. Average precision is the mean of the precision values at the ranks where a relevant document is retrieved.
    ● Average precision for ranking #1 (relevant documents at ranks 1, 3, 4, 5, 6, 10): (1.0 + 0.67 + 0.75 + 0.8 + 0.83 + 0.6) / 6 = 0.78
    ● Average precision for ranking #2 (relevant documents at ranks 2, 5, 6, 7, 9, 10): (0.5 + 0.4 + 0.5 + 0.57 + 0.56 + 0.6) / 6 = 0.52
    ● Emphasizes top-ranked documents
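The two averages can be checked with a minimal sketch. The relevance lists are reconstructed from the recall values of the two example rankings, the function name is chosen for this example, and the code assumes that every relevant document appears in the ranking, so it divides by the number of hits.

```python
def average_precision(relevance):
    """Mean of the precision values at the ranks that hold a relevant document.

    relevance[i] is True if the document at rank i + 1 is relevant.
    Assumes all relevant documents occur in the ranking.
    """
    hits, precisions = 0, []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0


ranking_1 = [True, False, True, True, True, True, False, False, False, True]
ranking_2 = [False, True, False, False, True, True, True, False, True, True]

ap_1 = average_precision(ranking_1)
ap_2 = average_precision(ranking_2)
print(f"AP ranking #1 = {ap_1:.3f}, AP ranking #2 = {ap_2:.3f}")
```

Both rankings contain the same relevant documents within the first 10 positions, yet ranking #1 scores higher because its relevant documents sit nearer the top.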
  70. 70. Mean Average Precision
    ● Each ranking produces an average precision
    ● Mean Average Precision (MAP):
      ○ Summarize rankings from multiple queries by averaging the average precision
      ○ Most often used measure in research papers
      ○ Requires many relevance judgements
  71. 71. Mean Average Precision
    [Figure: two result lists of 10 documents each, with the relevant documents for the respective query highlighted.]

    Result #1 (query #1, 5 relevant documents)
      Rank       1     2     3     4     5     6     7     8     9     10
      Relevant   yes   no    yes   no    no    yes   no    no    yes   yes
      Recall     0.2   0.2   0.4   0.4   0.4   0.6   0.6   0.6   0.8   1.0
      Precision  1.0   0.5   0.67  0.5   0.4   0.5   0.43  0.38  0.44  0.5
    Result #2 (query #2, 3 relevant documents)
      Rank       1     2     3     4     5     6     7     8     9     10
      Relevant   no    yes   no    no    yes   no    yes   no    no    no
      Recall     0.0   0.33  0.33  0.33  0.67  0.67  1.0   1.0   1.0   1.0
      Precision  0.0   0.5   0.33  0.25  0.4   0.33  0.43  0.38  0.33  0.3

    ● Average precision for result #1: (1.0 + 0.67 + 0.5 + 0.44 + 0.5) / 5 = 0.62
    ● Average precision for result #2: (0.5 + 0.4 + 0.43) / 3 = 0.44
    ● Mean Average Precision: MAP = (0.62 + 0.44) / 2 = 0.53
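The MAP value above can be reproduced with a short sketch. The relevance lists are reconstructed from the recall values of the two result lists, the function names are chosen for this example, and the code assumes that every relevant document for a query appears in its result list.

```python
def average_precision(relevance):
    """Mean precision over the ranks that hold a relevant document."""
    hits, precisions = 0, []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(results):
    """Average the per-query average precision over all queries."""
    return sum(average_precision(r) for r in results) / len(results)


# One relevance list per query, in rank order.
result_1 = [True, False, True, False, False, True, False, False, True, True]
result_2 = [False, True, False, False, True, False, True, False, False, False]

map_score = mean_average_precision([result_1, result_2])
print(f"AP #1 = {average_precision(result_1):.3f}")
print(f"AP #2 = {average_precision(result_2):.3f}")
print(f"MAP   = {map_score:.3f}")
```

Each query contributes equally to MAP regardless of how many relevant documents it has, so query #2 with 3 relevant documents weighs as much as query #1 with 5.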
  72. 72. Information Service Engineering, Lecture 10: Information Retrieval
    4.1 A Brief History of Libraries and IR
    4.2 Fundamental Concepts of IR
    4.3 Information Retrieval Models
    4.4 Retrieval Evaluation
    4.5 Web Information Retrieval
    4.6 Document Crawling, Text Processing, and Indexing
    4.7 Query Processing and Result Representation
    4.8 Question Answering
  74. 74. Syllabus Questions
    ● What are the main components of Linked Data driven Web applications and how do they interact?
    ● Explain the fundamental concepts of Information Retrieval.
    ● Explain the architecture of an IR system.
    ● Explain the Boolean Retrieval model. What are its benefits and its drawbacks?
    ● Explain the Vector Space Retrieval model. What are its benefits and its drawbacks?
    ● Explain how the ranking of search results can be evaluated.
