The Philosophy of Information
 Retrieval Evaluation (2001)

       by Ellen Voorhees
The Author

• Computer scientist, Retrieval Group,
  NIST (15 years)
    o   TREC, TRECVid, and TAC - large-scale evaluation of
        technologies for processing natural language text and
        searching diverse media types
•   Research focus: "developing and validating
    appropriate evaluation schemes to measure system
    effectiveness in these areas"

• Siemens Corporate Research (9 years)
    o   factory automation, intelligent agents, agents
        applied to information access




                  http://www.linkedin.com/pub/ellen-voorhees/6/115/3b8
NIST (National Institute of Standards and
Technology)
• Non-regulatory agency of U.S. Dept of Commerce

• "Promote U.S. innovation and industrial competitiveness [...]
  enhance economic security and improve our quality of life"

• Estimated 2011 budget: $722 million

• Standards Reference Materials (experimental control samples,
  quality control benchmarks), election technology, ID cards

• 3 Nobel Prize Winners




          http://en.wikipedia.org/wiki/National_Institute_of_Standards_and_Technology
Premises

• User-based evaluation (p.1)

  o   better, more direct measure of user needs
  o   BUT very expensive and difficult to execute properly

• System evaluation (p.1)

  o   less expensive
  o   abstraction of retrieval process
  o   can control variables
         increases power of comparative experiments
  o   diagnostic information about system behavior
The Cranfield Paradigm
• Dominant model for 4 decades (p.1)

• Cranfield 2 experiment (1960s) - first lab testing of IR system
  (p.2)

   o   investigated which indexing language is best
   o   design: considering the performance of index languages
       free from operational variable contamination
   o   aeronautics experts, aeronautics collection
   o   test collection: documents, information needs/topics,
       relevance judgment set
   o   assumptions:
          relevance approximated by topical similarity
          single judgment set representative of user population
          lists of relevant documents for each topic complete
Modern Adaptations to Cranfield
Cranfield assumptions not strictly true; need to decrease noise (p.3)
• Why the assumptions break down
   o   modern collections larger and more diverse
   o   less complete relevance judgments

• Adaptations:
   o Ranked list of documents for each topic
        ordered by decreasing retrieval likelihood
   o Effectiveness as a whole computed as average across
     topics
   o Large number of topics
   o Use pooling (judging only a subset of documents) instead of
     judging every document (p.4)
   o Assumptions don't need to be strictly true for test
     collection to be viable
        different retrieval run scores compared on same test
        collections
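The adaptations above (a ranked list per topic, effectiveness averaged across topics) can be sketched as follows. This is a minimal illustration, not the procedure from the paper: mean average precision is used as one common measure, since the slides do not name a specific one, and the names `average_precision`, `mean_average_precision`, `run`, and `qrels` are hypothetical.

```python
def average_precision(ranked_docs, relevant):
    """Average precision of one ranked list against a set of relevant doc ids."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(run, qrels):
    """Mean of the per-topic average precision over all topics in qrels."""
    scores = [average_precision(run.get(topic, []), rel)
              for topic, rel in qrels.items()]
    return sum(scores) / len(scores)


# Toy data (invented for illustration): topic id -> ranked list / relevant set.
run = {"t1": ["d1", "d2", "d3"], "t2": ["d4", "d5", "d6"]}
qrels = {"t1": {"d1", "d3"}, "t2": {"d5"}}

print(mean_average_precision(run, qrels))
```

Because every run is scored against the same topics and judgments, the averaged scores are comparable across runs even when the judgments are imperfect.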
How to Build a Test Collection
(TREC example)
• Set of documents and topics (reflective of operational setting
  and real tasks) (p.4)
   o e.g. law articles for law library

• Participants run topics against documents
   o return top documents per topic

• Pool formed, then judged by relevance assessors
   o evaluated using relevance judgments (binary)

• Results returned to participant

• Relevance judgments turn documents and topics into test
  collection (p.5)
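The pooling step described above can be sketched as below. This is a simplified illustration assuming each participant run maps a topic id to a ranked list of document ids; `build_pool` and the `depth` parameter are hypothetical names, and the toy runs are invented.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` documents from every run, per topic."""
    pool = {}
    for run in runs:
        for topic, ranked_docs in run.items():
            pool.setdefault(topic, set()).update(ranked_docs[:depth])
    return pool


# Two toy participant runs for a single topic.
run_a = {"t1": ["d1", "d2", "d3"]}
run_b = {"t1": ["d2", "d4", "d5"]}

# Only documents in the pool are shown to the relevance assessors.
pool = build_pool([run_a, run_b], depth=2)
print(sorted(pool["t1"]))
```

Documents outside the pool (here `d3` and `d5`) are never judged, which is the source of the incomplete judgments discussed on the next slide.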
Effects of Pooling and Incomplete Judgments
• Pooling doesn't produce complete judgments (p.5)
   o Some relevant documents are never judged
   o Relevant documents found later tend to come from lower in
     system rankings

• Unjudged relevant documents skewed across topics (p.6)
   o topics with many relevant documents initially tend to gain
     more later on

• What to do?
  o deep and diverse pool (p.9)
  o recall-oriented manual runs to supplement
  o opt for smaller, fair judgment set rather than larger biased
    set
Assessor Relevance Judgments

• Different judges, different time settings (p.9)

• Different assessors produce different relevance sets for the
  same topics (subjectivity of relevance)

• TREC: 3 judges (p.10)

• Overlap < 50%, assessors really disagreed
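As a rough sketch of how such an overlap figure can be computed, assume overlap means the size of the intersection of the assessors' relevant sets divided by the size of their union (the slides do not spell out the definition). The function name `overlap` and the toy judgments are hypothetical.

```python
def overlap(*relevant_sets):
    """Intersection size divided by union size of the assessors' relevant sets."""
    union = set().union(*relevant_sets)
    if not union:
        return 0.0  # no assessor marked anything relevant
    intersection = set.intersection(*map(set, relevant_sets))
    return len(intersection) / len(union)


# Toy judgments for one topic from two assessors.
assessor_a = {"d1", "d2", "d3"}
assessor_b = {"d2", "d3", "d4"}

print(overlap(assessor_a, assessor_b))
```

The same function accepts three or more sets, matching the three-judge setting mentioned above.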
Evaluating with Assessor Inconsistency
• Perform system ranking, sorting by value obtained by each
  system (p.10)

• Query-Relevance Set: different combinations of assessor
  judgments per topic

• Repeat experiments several times: (p.13)
  o different measures
  o different topic sets
  o different systems
  o different assessor groups

• Comparative evaluation result: stability of ranked retrieval
  results
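One way to quantify the stability of system rankings is a rank correlation between the rankings produced under two different judgment sets. The sketch below uses Kendall's tau as one common choice; the slides do not name a measure, ties are not specially handled, and the function name `kendall_tau` and all scores are invented for illustration.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau between two system-score tables (needs >= 2 systems)."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Positive product: both judgment sets order the pair the same way.
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / pairs


# Toy effectiveness scores for the same systems under two judgment sets.
under_qrels_1 = {"sysA": 0.30, "sysB": 0.25, "sysC": 0.20}
under_qrels_2 = {"sysA": 0.28, "sysB": 0.18, "sysC": 0.22}

print(kendall_tau(under_qrels_1, under_qrels_2))
```

A value near 1 means the two judgment sets rank the systems almost identically, even if the absolute scores differ.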
Cross-Language Collections

• More difficult to build than monolingual collections (p.13)
  o separate set of assessors for each language
  o multiple assessors for 1 topic
  o need diverse pools for all languages
      minority language pools smaller and less diverse (p.14)

• What to do?
  o close coordination for consistency (p.13)
  o proceed with care
Discussion

• Do laboratory experiments translate to operational settings?

• Which metrics or evaluation scores are more meaningful to
  you?

• Are there other ways to reduce noise and error?
