Alistair Moffat and Justin Zobel, "Rank-Biased Precision for
Measurement of Retrieval Effectiveness", TOIS vol. 27 no. 1, 2008.

Ofer Egozi
LARA group, Technion
Introduction to IR Evaluation



    Mean Average Precision



    Rank-Biased Precision



    Analysis of RBP

Task: given query q, output ranked list of

    documents
    ◦ Find probability that document d is relevant for q
    Evaluation is difficult

    ◦ No (per query) test data
    ◦ Queries vary tremendously
    ◦ Relevance is a vague (human) concept
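Precision / recall

    ◦ Precision: \( |alg(q,D) \cap rel(q,D)| \,/\, |alg(q,D)| \)
    ◦ Recall: \( |alg(q,D) \cap rel(q,D)| \,/\, |rel(q,D)| \)
    ◦ D is the collection, alg(q,D) the set of documents returned for q, rel(q,D) the set of documents relevant to q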
    ◦ Precision and recall usually conflict
    ◦ Single measures proposed
        (P@X, RR, AP…)
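
The set-based definitions of precision and recall can be written out directly. The following is a minimal sketch, assuming documents are identified by hashable IDs; the function names and the toy document IDs are illustrative and not taken from the slides.

```python
def precision(alg: set, rel: set) -> float:
    """Fraction of the returned documents that are relevant."""
    return len(alg & rel) / len(alg) if alg else 0.0


def recall(alg: set, rel: set) -> float:
    """Fraction of the relevant documents that were returned."""
    return len(alg & rel) / len(rel) if rel else 0.0


# Hypothetical toy data: 4 documents returned, 5 relevant, 2 in common.
alg = {"d1", "d2", "d3", "d4"}
rel = {"d2", "d4", "d7", "d8", "d9"}

p = precision(alg, rel)  # 2 of the 4 returned documents are relevant
r = recall(alg, rel)     # 2 of the 5 relevant documents were returned
```

Returning more documents can only keep recall the same or raise it, but it usually dilutes precision, which is the conflict noted above.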
Relevancy requires human judgment

    ◦ Exhaustive judging is not scalable
    ◦ TREC uses pooling
    ◦ Pooling has been shown to miss a significant portion of the relevant documents…
    ◦ … but it has also been shown to compare systems well
    ◦ It is biased against novel approaches
In the real world, what does recall measure?

    ◦ Recall is important only with "perfect" knowledge
    ◦ If I got one result, and there is another I don’t know
      of, am I half-satisfied?...
    ◦ …yes, for specific needs (legal, patent)     session
    ◦ "Boiling temperature of lead"

    Precision is more user-oriented

    ◦ P@10 measures real user satisfaction
    ◦ Still, P@10=0.3 can mean first three or last three…
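
The point that P@10 ignores where the relevant documents sit can be checked with a small sketch. It assumes a ranking is given as a list of 0/1 relevance flags; the function name and both rankings are hypothetical.

```python
def precision_at_k(rels: list, k: int) -> float:
    """P@k: fraction of the top-k results that are relevant (rels holds 0/1 flags)."""
    return sum(rels[:k]) / k


# Two hypothetical rankings with three relevant documents each.
first_three = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
last_three = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

p_first = precision_at_k(first_three, 10)
p_last = precision_at_k(last_three, 10)
print(p_first, p_last)  # both are 0.3
```

A user who reads from the top would presumably prefer the first ranking, yet the measure cannot tell the two apart.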
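Average precision (AP) is calculated as

    ◦ \( AP = \frac{1}{R} \sum_{i=1}^{d} r_i \cdot P@i \), where \( r_i \in \{0,1\} \) is the relevance of the document at rank i, d is the ranking depth, and R is the total number of relevant documents
    ◦ Intuitively: sum all P@X where rel found, divide by
      total rel to normalize for summing across queries
    Example: $$---$----$-----$--- ($ = relevant, - = not relevant), AP ≈ 0.6316 if R = 5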



    Consider: $$---$----$-----$$$$


    ◦ AP is down to 0.5324 (from 0.6316), despite P@20 increasing from 0.25 to 0.40
    ◦ Finding more rels can harm AP performance!
    ◦ Similar problems if some are initially unjudged
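
The two example rankings can be scored with a short sketch of average precision. It assumes '$' marks a relevant document and '-' a non-relevant one, and that R equals the number of relevant documents present in the ranking unless told otherwise; the function name and the `total_relevant` parameter are illustrative.

```python
def average_precision(ranking: str, total_relevant=None) -> float:
    """AP of a ranking written as a string, '$' = relevant, '-' = not relevant.

    total_relevant is R; if omitted, every relevant document is assumed
    to appear in the ranking (an assumption of this sketch).
    """
    hits = 0
    precision_sum = 0.0
    for rank, mark in enumerate(ranking, start=1):
        if mark == "$":
            hits += 1
            precision_sum += hits / rank  # P@rank at each relevant document
    r = hits if total_relevant is None else total_relevant
    return precision_sum / r if r else 0.0


before = average_precision("$$---$----$-----$---")
after = average_precision("$$---$----$-----$$$$")
print(round(before, 4), round(after, 4))  # 0.6316 0.5324
```

The three extra relevant documents at ranks 18 to 20 each contribute a precision below the running average, and R grows from 5 to 8, so the score falls even though the ranking is strictly better.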
Methodological problem of instability

    ◦ Results may depend on judging extent
    ◦ More judging can be destabilizing (meaning error
      margins don’t shrink with reducing uncertainty)
Complex abstraction of user satisfaction

    ◦ "Every time a relevant document is encountered, the user pauses, asks 'Over the
      documents I have seen so far, on average how satisfied am I?' and writes a number
      on a piece of paper. Finally, when the user has examined every document in the
      collection, because this is the only way to be sure that all of the relevant ones have
      been seen, the user computes the average of the values they have written."

    How can R be truly calculated?

    Think of evaluating a Google query…
    Still, MAP is highly popular and useful:

    ◦ Validated in numerous TREC studies
    ◦ Shown to be stable and robust across query sets
      (for deep enough pools)
Induced by a user model

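    ◦ The user always looks at the first document and moves on to each next one with probability p, so the document at rank i is observed with probability \( p^{i-1} \)
    ◦ Expected #docs seen: \( \sum_{i=1}^{\infty} p^{i-1} = \frac{1}{1-p} \)
    ◦ Total expected utility (\( r_i \) = known relevance function): \( \sum_{i=1}^{d} r_i \cdot p^{i-1} \)
    ◦ RBP = expected utility rate = utility/effort: \( RBP = (1-p) \sum_{i=1}^{d} r_i \cdot p^{i-1} \)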
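
In the Moffat and Zobel paper cited on the title slide, this user model gives RBP = (1 - p) · Σ r_i · p^(i-1), with ranks i starting at 1. A minimal sketch of that formula follows; the function name and the example ranking are illustrative choices, and relevance is assumed to be binary here.

```python
def rbp(rels, p: float) -> float:
    """Rank-biased precision: (1 - p) * sum_i r_i * p**(i - 1), ranks starting at 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    # enumerate() starts at 0, so p ** i is the weight p**(rank - 1).
    return (1.0 - p) * sum(r * p ** i for i, r in enumerate(rels))


# '$' = relevant, '-' = not relevant.
ranking = [1 if c == "$" else 0 for c in "$$---$----$-----$---"]

score_half = rbp(ranking, 0.5)
score_lucky = rbp(ranking, 0.0)  # only rank 1 counts when p = 0
```

Because the weights (1 - p) · p^(i-1) sum to 1 over an infinite ranking, the score stays in [0, 1] without needing R or the collection size.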
Values of p reflect user behaviors

    ◦ p = 0.95: persistent user (about 60% chance of reaching the 2nd page)

    ◦ p = 0.5: impatient user (about 0.1% chance of reaching the 2nd page)

    ◦ p = 0: "I'm feeling lucky" (identical to P@1)



    Values of p control the contribution of each relevant document
    ◦ But always positive (for 0 < p < 1)!
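
The second-page percentages quoted for the different users can be reproduced with a short sketch. It assumes 10 results per page, so that reaching page 2 means viewing rank 11; that page size and both function names are assumptions of this example, not statements from the slides.

```python
def prob_reaching_rank(p: float, rank: int) -> float:
    """Probability that the modelled user views this rank (rank 1 is always viewed)."""
    return p ** (rank - 1)


def rank_weight(p: float, rank: int) -> float:
    """Contribution of a relevant document at this rank to RBP."""
    return (1.0 - p) * p ** (rank - 1)


for p in (0.95, 0.5):
    # Assumption: 10 results per page, so page 2 starts at rank 11.
    print(p, round(prob_reaching_rank(p, 11), 4))
# 0.95 0.5987
# 0.5 0.001
```

For any p strictly between 0 and 1, `rank_weight` is small at deep ranks but never zero, so a relevant document always adds something to the score.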
Uncertainty: how many relevant documents?

    (down the ranking, or even in current depth)
    The RBP value is inherently a lower bound


    Residual uncertainty is easy to calculate: assume every unjudged document is relevant…
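
The residual can be sketched as follows: score the judged documents to get the lower bound, then add the weight of everything unjudged, including the infinite tail beyond the evaluated depth d, whose total weight is p^d. This is a sketch under stated assumptions: unjudged documents are marked with `None`, relevance is binary, and the function name is hypothetical.

```python
def rbp_with_residual(judgments, p: float):
    """Return (base, residual) for a ranking judged to some depth.

    judgments: list with 1 (relevant), 0 (not relevant) or None (unjudged).
    base counts unjudged documents as not relevant, giving a lower bound;
    residual is the extra score if every unjudged document, and every
    document beyond the end of the list, turned out to be relevant.
    """
    base = 0.0
    residual = 0.0
    for i, j in enumerate(judgments):  # i = rank - 1
        weight = (1.0 - p) * p ** i
        if j is None:
            residual += weight
        else:
            base += j * weight
    residual += p ** len(judgments)    # tail beyond the evaluated depth
    return base, residual


base, residual = rbp_with_residual([1, None, 0, 1], 0.5)
print(base, base + residual)  # the true RBP lies between these two values
```

Judging more documents or evaluating deeper can only shrink the residual, so the interval narrows as uncertainty is reduced.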
[Figure: similarity (correlation) between measures]

[Figure: detected significance in evaluated systems' ranking]
RBP has significant advantages:

    ◦ Based on a solid and supported user model
    ◦ Real-life, no unknown factors (R, |D|)
    ◦ Error bounds for uncertainty
    ◦ Statistical significance as good as others

    But also:

    ◦ Absolute values, not relative to query difficulty
    ◦ A choice for p must be made

IR Evaluation using Rank-Biased Precision
