How Do Gain and Discount Functions Affect the Correlation between DCG and User Satisfaction?

How do Gain and Discount Functions Affect the
Correlation between DCG and User Satisfaction?
Julián Urbano Mónica Marrero
ECIR 2015
Vienna, March 30th
Discount d(i ; k) Gain g(r)
Zipfian: 1/݅ Linear: ‫ݎ‬
Linear: ሺ݇ ൅ ݅ െ 1ሻ/݇ Exp(2): 2௥
െ 1
Constant: 1 Exp(3): 3௥
െ 1
Log(2): 1/ logଶ ݅ ൅ 1 Exp(5): 5௥
െ 1
Log(3): 1/ logଷ ݅ ൅ 2 Bin(1): Iሾ‫ݎ‬ ൒ 1ሿ
Log(5): 1/ logହሺ݅ ൅ 4ሻ Bin(2): Iሾ‫ݎ‬ ൒ 2ሿ
Discount functions
Rank i
Discountd(i)
0.00.20.40.60.81.0
1 2 3 4 5
Zipfian
Linear
Constant
Log(2)
Log(3)
Log(5)
Gain functions
Relevance r
Gaing(r)
0510152025
0 1 2
Linear
Exp(2)
Exp(3)
Exp(5)
Bin(1)Bin(2)
Documents
Information
Need
Real World Cranfield
IR System
Topic
Relevance
Judgments
IR System
Documents
GAP
DCG
ERR
Static
Component
Dynamic
Component
Test
Collection
Effectiveness
Measures
Time to complete task, Idle time,
Success rate, Frustration, Satisfaction,
Ease of use, Ease of learning…
Precision, Average Precision, Reciprocal Rank,
Q-measure, Discounted Cumulative Gain,
Rank-Biased Precision, Time-Biased Gain…
Live Observation
What Gain and Discount for DCG are
better to predict user satisfaction?
• First, let’s normalize DCG scores (this is not nDCG!)
• One system with DCG=φ. What does it mean?
• Intuition: φ·100% of users will be satisfied
• P(Sat|DCG= φ)= φ
• Two systems with ΔDCG=Δφ. What does it mean?
• Intuition: users will prefer the (supposedly) better one
• P(Pref|ΔDCG=Δφ)=1
P(Sat) and P(Pref) depend on the systems,
not on how we evaluate them. Yet, there are
many different ways to compute effectiveness
Experiment
• Collect user preferences between two systems
• Map DCG onto P(Sat)
• Map ΔDCG onto P(Pref)
• Music recommendation task
• Ad-hoc, informational, enjoyable by assessors
• Preferences less confounded by interface effects
• All data from MIREX (TREC-like for Music IR)
• Datasets from 2007–2012
• 3-point relevance scale: {0, 1, 2}
• 4115 examples
• Uniformly covering the [0,1] range of ΔDCG
• 432 unique queries
• 5636 unique documents
• Crowdsourced with Crowdflower
• Trap examples for quality control
• 113 subjects
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
Zipfian discount
Difference in DCG
Probabilitythatusersagree
Gains
Linear
Exp(2)
Exp(3)
Exp(5)
Bin(1)
Bin(2)
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
Linear discount
Difference in DCG
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
Log(2) discount
Difference in DCG
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
Log(3) discount
Difference in DCG
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
Log(5) discount
Difference in DCG
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
Constant discount
Difference in DCG
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
Zipfian discount
DCG
Probabilityofusersatisfaction
Gains
Linear
Exp(2)
Exp(3)
Exp(5)
Bin(1)
Bin(2)
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
Linear discount
DCG
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
Log(2) discount
DCG
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
Log(3) discount
DCG
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
Log(5) discount
DCG
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
Constant discount
DCG
Results (1 system): DCG predicting user satisfaction Results (2 systems): ΔDCG predicting user preference
Results: bias of Gain and Discount functions
• Diagonal: how far is P(Sat|DCG) from the ideal diagonal?
• Intuitiveness of DCG scores
• Endpoint: how far is P(Sat|DCG) from the ideal 0% and 100%?
• User disagreement and goodness of the DCG user model
• Top: how far is P(Pref|ΔDCG) from the ideal 100%?
• Discriminative power
Summary and Implications
• New method to map system effectiveness onto user satisfaction
• Sample application to DCG for a music recommendation task
• Gain functions that emphasize highly relevant documents underestimate
user satisfaction. Linear gain is better than exponential
• All discount functions bias the prediction of user satisfaction
• This task might be too enjoyable to observe discount effect
• Size (of the DCG difference) does matter
• Non-parametric statistics (eg. Sign test, Wilcoxon test) and just looking at
the ranking of systems (eg. Kendall τ) oversimplify the evaluation problem
• Zero-point null hypothesis testing (ie. H0 : ΔDCG=0) is not reasonable
• Future work will investigate this method for Text IR
• Provide a common framework, based on P(Sat) and P(Pref), to evaluate
with informational and navigational queries using appropriate measures
Data and code
available online
‫݇@ܩܥܦ‬ ൌ
∑ ݃ ‫ݎ‬௜ ⋅ ݀ ݅ ; ݇௞
௜ୀଵ
∑ ݃ ‫ݎ‬௠௔௫ ⋅ ݀ ݅ ; ݇௞
௜ୀଵ
0.060.100.14
Diagonal bias
Bias
Bin(2)
Bin(1)
Exp(5)
Exp(3)
Exp(2)
Linear
0.060.100.14
Zipfian
Linear
Log(2)
Log(3)
Log(5)
Constant
0.060.100.14
Gain Discount
0.140.180.22
Endpoint bias
Bias
Bin(2)
Bin(1)
Exp(5)
Exp(3)
Exp(2)
Linear
0.140.180.22
Zipfian
Linear
Log(2)
Log(3)
Log(5)
Constant
0.140.180.22
Gain Discount
0.460.500.54
Top bias
Bias
Bin(2)
Bin(1)
Exp(5)
Exp(3)
Exp(2)
Linear
0.460.500.54
Zipfian
Linear
Log(2)
Log(3)
Log(5)
Constant
0.460.500.54
Gain Discount

How Do Gain and Discount Functions Affect the Correlation between DCG and User Satisfaction?

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Andere mochten auch

Andere mochten auch (13)

Ähnlich wie How Do Gain and Discount Functions Affect the Correlation between DCG and User Satisfaction?

Ähnlich wie How Do Gain and Discount Functions Affect the Correlation between DCG and User Satisfaction? (20)

Mehr von Julián Urbano

Mehr von Julián Urbano (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

How Do Gain and Discount Functions Affect the Correlation between DCG and User Satisfaction?