Philosophy of IR Evaluation, by Ellen Voorhees
1. The Philosophy of Information Retrieval Evaluation (2001), by Ellen Voorhees
2. The Author
• Computer scientist, Retrieval Group, NIST (15 years)
o TREC, TRECVid, and TAC: large-scale evaluation of technologies for processing natural language text and searching diverse media types
• Research focus: "developing and validating appropriate evaluation schemes to measure system effectiveness in these areas"
• Siemens Corporate Research (9 years)
o factory automation, intelligent agents, agents applied to information access
http://www.linkedin.com/pub/ellen-voorhees/6/115/3b8
3. NIST (National Institute of Standards and Technology)
• Non-regulatory agency of the U.S. Dept. of Commerce
• Mission: "promote U.S. innovation and industrial competitiveness [...] enhance economic security and improve our quality of life"
• Estimated 2011 budget: $722 million
• Standard Reference Materials (experimental control samples, quality-control benchmarks), election technology, ID cards
• 3 Nobel Prize winners
http://en.wikipedia.org/wiki/National_Institute_of_Standards_and_Technology
4. Premises
• User-based evaluation (p.1)
o better, more direct measure of whether user needs are met
o BUT very expensive and difficult to execute properly
• System evaluation (p.1)
o less expensive
o abstraction of the retrieval process
o can control variables
increases the power of comparative experiments
o diagnostic information about system behavior
5. The Cranfield Paradigm
• Dominant model for 4 decades (p.1)
• Cranfield 2 experiment (1960s): the first laboratory testing of IR systems (p.2)
o investigated which indexing language was best
o design: measure the performance of index languages free from contamination by operational variables
o aeronautics experts, aeronautics collection
o test collection: documents, information needs/topics, relevance judgment set (see the sketch after this list)
o assumptions:
relevance approximated by topical similarity
single judgment set representative of the user population
lists of relevant documents for each topic are complete
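To make the Cranfield triple concrete, here is a minimal sketch of a test collection as the slide defines it: documents, topics (information needs), and a binary relevance judgment set. All class and identifier names are illustrative, not from the paper.

    # Minimal sketch of the Cranfield-style test collection triple.
    # All names here are illustrative, not from the paper.
    from dataclasses import dataclass, field

    @dataclass
    class TestCollection:
        documents: dict[str, str]        # doc_id -> document text
        topics: dict[str, str]           # topic_id -> information-need statement
        qrels: dict[str, set[str]] = field(default_factory=dict)
        # qrels: topic_id -> set of doc_ids judged relevant (binary judgments)

    collection = TestCollection(
        documents={"d1": "boundary layer flow ...", "d2": "wing loading ..."},
        topics={"t1": "papers on boundary layer control"},
        qrels={"t1": {"d1"}},
    )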
6. Modern Adaptations to the Cranfield Paradigm
• Assumptions not true in modern settings; need to decrease noise (p.3)
o modern collections are larger and more diverse
o relevance judgments are less complete
• Adaptations (see the sketch after this list):
o ranked list of documents for each topic, ordered by decreasing likelihood of relevance
o overall effectiveness computed as an average across topics
o large number of topics
o pooling (judging only a subset of documents) instead of complete judgments (p.4)
o assumptions don't need to be strictly true for a test collection to be viable
different retrieval runs' scores are compared on the same test collection
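The two central adaptations, ranked lists per topic and effectiveness averaged across topics, are exactly what a measure such as mean average precision computes. A minimal sketch, assuming binary qrels; the run/qrels dictionary shapes are assumptions for illustration, not TREC file formats.

    # Sketch of "ranked list per topic, effectiveness averaged across topics".
    # run: topic_id -> ranked list of doc_ids; qrels: topic_id -> relevant doc_ids.

    def average_precision(ranking: list[str], relevant: set[str]) -> float:
        hits, precision_sum = 0, 0.0
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                hits += 1
                precision_sum += hits / rank  # precision at each relevant document
        return precision_sum / len(relevant) if relevant else 0.0

    def mean_average_precision(run: dict[str, list[str]],
                               qrels: dict[str, set[str]]) -> float:
        # Per-topic scores first, then the mean across topics.
        scores = [average_precision(run[t], qrels.get(t, set())) for t in run]
        return sum(scores) / len(scores) if scores else 0.0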
7. How to Build a Test Collection (TREC Example)
• Set of documents and topics, reflective of the operational setting and real tasks (p.4)
o e.g., law articles for a law library
• Participants run the topics against the documents
o return top documents per topic
• Pool formed, then judged by relevance assessors (see the pooling sketch after this list)
o runs evaluated using the (binary) relevance judgments
• Results returned to participants
• Relevance judgments turn documents and topics into a test collection (p.5)
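Pool formation itself is simple: take the union of the top-k documents from every submitted run for a topic, and hand that set to the assessors (pool depth was commonly 100 at TREC). A minimal sketch with hypothetical names:

    # Sketch of pool formation: union of each run's top-k documents per topic.
    def form_pool(runs: list[dict[str, list[str]]],
                  topic: str, k: int = 100) -> set[str]:
        pool: set[str] = set()
        for run in runs:
            pool.update(run.get(topic, [])[:k])  # duplicates collapse in the set
        return pool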
8. Effects of Pooling and Incomplete Judgments
• Pooling doesn't produce complete judgments (p.5)
o some relevant documents are never judged
o relevant documents found later tend to come from lower in system rankings
• Incompleteness is skewed across topics (p.6)
o topics with many relevant documents found initially tend to have many more unjudged relevant documents remaining
• What to do? (see the sketch after this list)
o build deep and diverse pools (p.9)
o supplement with recall-oriented manual runs
o opt for a smaller, fair judgment set rather than a larger, biased one
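One way to see the incompleteness problem in practice is to measure how much of a run's top-ranked output was never judged; by the usual convention, unjudged documents are scored as non-relevant, which penalizes runs that did not contribute to the pool. A minimal sketch with hypothetical names:

    # Sketch: fraction of a run's top-k documents that were never judged.
    # 'judged' is the set of all doc_ids in the pool for this topic.
    def unjudged_fraction(ranking: list[str], judged: set[str],
                          k: int = 100) -> float:
        top = ranking[:k]
        if not top:
            return 0.0
        return sum(1 for doc_id in top if doc_id not in judged) / len(top)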
9. Assessor Relevance Judgments
• Judgments differ across judges and across time (p.9)
• Different assessors produce different relevance sets for the same topics (relevance is subjective)
• TREC: 3 judges per topic (p.10)
• Overlap < 50%: assessors genuinely disagreed
10. Evaluating with Assessor Inconsistency
• Rank systems by sorting on the score each system obtains (p.10)
• Query-relevance sets (qrels): build different qrels from different combinations of assessor judgments per topic
• Repeat the experiments several times with: (p.13)
o different measures
o different topic sets
o different systems
o different assessor groups
• Comparative evaluation result: the ranking of systems is stable despite assessor inconsistency (see the sketch after this list)
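The stability claim can be checked directly: score every system under two different assessors' qrels and compare the two induced system rankings, for example with Kendall's tau. A minimal sketch (ties between systems are simply dropped; all names are illustrative):

    # Sketch: compare system rankings induced by two qrel variants.
    # scores_a / scores_b: system_name -> effectiveness under each qrel set;
    # both dicts are assumed to cover the same systems.
    from itertools import combinations

    def kendall_tau(scores_a: dict[str, float],
                    scores_b: dict[str, float]) -> float:
        concordant = discordant = 0
        for s, t in combinations(scores_a, 2):
            agreement = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
            if agreement > 0:
                concordant += 1   # pair ordered the same way in both rankings
            elif agreement < 0:
                discordant += 1   # pair flips between the rankings
        pairs = concordant + discordant
        return (concordant - discordant) / pairs if pairs else 1.0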
11. Cross-Language Collections
• More difficult to build than monolingual collections (p.13)
o separate set of assessors for each language
o multiple assessors per topic
o need diverse pools for all languages
minority-language pools are smaller and less diverse (p.14)
• What to do?
o coordinate closely for consistency (p.13)
o proceed with care
12. Discussion
• Do laboratory experiments translate to operational settings?
• Which metrics or evaluation scores are more meaningful to you?
• Are there other ways to reduce noise and error?