Crowdsourcing
Preference Judgments for
Evaluation of Music Similarity Tasks

Julián Urbano, Jorge Morato,
Mónica Marrero and Diego Martín
http://julian-urbano.info
Twitter: @julian_urbano


SIGIR CSE 2010
Geneva, Switzerland · July 23rd
2



Outline
•   Introduction
•   Motivation
•   Alternative Methodology
•   Crowdsourcing Preferences
•   Results
•   Conclusions and Future Work
3



Evaluation Experiments
• Essential for Information Retrieval [Voorhees, 2002]

• Traditionally followed the Cranfield paradigm
  ▫ Relevance judgments are the most important
    part of test collections (and the most expensive)

• In the music domain, evaluation was not taken
  very seriously until quite recently
  ▫ MIREX appeared in 2005 [Downie et al., 2010]
  ▫ Additional problems with the construction and
    maintenance of test collections [Downie, 2004]
4



Music Similarity Tasks
• Given a music piece (i.e. the query) return a
  ranked list of other pieces similar to it
 ▫ Actual music contents, forget the metadata!

• It comes in two flavors
 ▫ Symbolic Melodic Similarity (SMS)
 ▫ Audio Music Similarity (AMS)

• It is inherently more complex to evaluate
 ▫ Relevance judgments are very problematic
5



Relevance (Similarity) Judgments
• Relevance is usually considered on a fixed scale
  ▫ Relevant, not relevant, very relevant…

• For music similarity tasks relevance is rather
  continuous [Selfridge-Field, 1998][Typke et al., 2005][Jones et al., 2007]
  ▫ Single melodic changes are not perceived to
    change the overall melody
      Move a note up or down in pitch, shorten it, etc.
  ▫ But the similarity is weaker as more changes apply

• Where is the line between relevance levels?
6



Partially Ordered Lists
• The relevance of a document is implied by its
  position in a partially ordered list [Typke et al., 2005]
  ▫ Does not need any prefixed relevance scale

• Ordered groups of equally relevant documents
  ▫ Have to keep the order of the groups
  ▫ Allow permutations within the same group

• Assessors only need to be sure that any pair of
  documents is ordered properly
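
To make the structure concrete, here is a minimal sketch (mine, not from the slides; names and data are made up) of a partially ordered list as a list of groups, with a check that a ranking respects the group order while allowing permutations within a group:

```python
# Hypothetical sketch (names are mine, not from the slides): a partially
# ordered list as a list of groups, ordered from most to least similar.
pol = [{"A", "B", "C"}, {"D", "E"}]

def group_index(pol, doc):
    """Index of the group that contains doc."""
    return next(i for i, group in enumerate(pol) if doc in group)

def properly_ordered(pol, ranking):
    """True if no document from a later group precedes one from an earlier
    group; permutations within the same group are allowed."""
    idx = [group_index(pol, d) for d in ranking]
    return all(a <= b for a, b in zip(idx, idx[1:]))

print(properly_ordered(pol, ["B", "A", "C", "E", "D"]))  # True: within-group swaps
print(properly_ordered(pol, ["A", "D", "B", "C", "E"]))  # False: group order broken
```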
7



Partially Ordered Lists (II)
8



Partially Ordered Lists (and III)
• Used in the first edition of MIREX in 2005
 [Downie et al., 2005]



• Widely accepted by the MIR community
  to report new developments
 [Urbano et al., 2010a][Pinto et al., 2008][Hanna et al., 2007][Grachten et al., 2006]



• MIREX was forced to move to traditional
  level-based relevance in 2006
 ▫ Partially ordered lists are expensive
 ▫ And have some inconsistencies
9



Expensiveness
• The ground truth for just 11 queries took 35
  music experts about 2 hours each [Typke et al., 2005]
 ▫ Only 11 of them had time to work on all 11 queries
 ▫ This exceeds MIREX’s resources for a single task

• MIREX had to move to level-based relevance
 ▫ BROAD: Not Similar, Somewhat Similar, Very Similar
 ▫ FINE: numerical, from 0 to 10 with one decimal digit

• Problems with assessor consistency came up
10



Issues with Assessor Consistency
• The line between levels is certainly unclear
 [Jones et al., 2007][Downie et al., 2010]
11



Original Methodology
• Go back to partially ordered lists
 ▫   Filter the collection
 ▫   Have the experts rank the candidates
 ▫   Arrange the candidates by rank
 ▫   Aggregate candidates whose ranks are not
      significantly different (Mann-Whitney U; a sketch follows below)
• There are known odd results and inconsistencies
 [Typke et al., 2005][Hanna et al., 2007][Urbano et al., 2010b]
  ▫ Changes that do not alter the actual perception,
    such as clef, key or time signature, should be disregarded
  ▫ Much like changing the language of a text
    and using synonyms [Urbano et al., 2010a]
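
As a rough illustration of the aggregation step referenced above (my reading of it, not the exact procedure of Typke et al., 2005; data and names are made up), candidates ordered by rank are merged into one group until a Mann-Whitney U test finds a significant difference:

```python
# A rough sketch of the aggregation step (my reading, not the exact procedure
# of Typke et al., 2005). rank_samples holds, per candidate, the ranks given
# by the experts, already ordered from most to least similar.
from scipy.stats import mannwhitneyu

def aggregate(rank_samples, alpha=0.05):
    """Return a list of groups (sets of docs) for a partially ordered list."""
    groups, current = [], [rank_samples[0]]
    for doc, ranks in rank_samples[1:]:
        # Compare against the last candidate added to the current group.
        _, p = mannwhitneyu(current[-1][1], ranks, alternative="two-sided")
        if p < alpha:                       # significantly different: new group
            groups.append({d for d, _ in current})
            current = [(doc, ranks)]
        else:                               # not significant: same group
            current.append((doc, ranks))
    groups.append({d for d, _ in current})
    return groups

# Made-up expert ranks for three candidates (lower rank = more similar):
ranks = [("doc1", [1, 2, 1, 1, 2]),
         ("doc2", [2, 1, 2, 3, 1]),
         ("doc3", [5, 5, 4, 5, 5])]
print(aggregate(ranks))  # e.g. [{'doc1', 'doc2'}, {'doc3'}]
```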
12



Inconsistencies due to Ranking
13



Alternative Methodology
• Minimize inconsistencies [Urbano et al., 2010b]
• Make the whole process cheaper

• Reasonable Person hypothesis [Downie, 2004]
  ▫ With crowdsourcing (finally)

• Use Amazon Mechanical Turk
  ▫ Get rid of experts [Alonso et al., 2008][Alonso et al., 2009]
  ▫ Work with “reasonable turkers”
  ▫ Explore other domains to apply crowdsourcing
14



Equally Relevant Documents
• Experts were forced to give totally ordered lists

• One would expect ranks to randomly average out
  ▫ Half the experts prefer one document
  ▫ Half the experts prefer the other one

• That is hardly the case
  ▫ Do not expect similar ranks if the experts
    cannot give equal ranks in the first place
15



Give Audio instead of Images
• Experts may be guided by the images, not the music
 ▫ Some irrelevant changes in the image can deceive them




• No music expertise should be needed
 ▫ Reasonable person turker hypothesis
16



Preference Judgments
• In their heads, experts actually make
  preference judgments
 ▫ Similar to a binary search
 ▫ Which accelerates assessor fatigue as the list grows

• Already noted for level-based relevance
 ▫ Go back and re-judge [Downie et al., 2010][Jones et al., 2007]
 ▫ Overlap between BROAD and FINE scores

• Change the relevance assessment question
 ▫ Which is more similar to Q: A or B? [Carterette et al., 2008]
17



Preference Judgments (II)
• Better than traditional level-based relevance
 ▫ Inter-assessor agreement
 ▫ Time to answer

• In our case, three-point preferences
 ▫ A < B (A is more similar)
 ▫ A = B (they are equally similar/dissimilar)
 ▫ A > B (B is more similar)
18



Preference Judgments (and III)
• Use a modified QuickSort algorithm to sort
  documents in a partially ordered list
 ▫ Do not need all O(n²) judgments, only O(n·log n)
   (a sketch of the idea follows below)




              (Figure legend: X is the current pivot on the segment; X has been a pivot already)
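
A minimal sketch of the idea, not the authors' exact implementation: a three-way QuickSort whose comparator is the crowdsourced preference with respect to the query Q, with ties (A = B) joining the pivot's group:

```python
# A minimal sketch of the idea, not the authors' exact implementation.
import random

def preference_quicksort(docs, prefer):
    """prefer(a, b): -1 if a is more similar to Q, +1 if b is, 0 if equal.
    Returns a partially ordered list: a list of groups, most similar first."""
    if not docs:
        return []
    pivot = random.choice(docs)              # pivot choice is listed as future work
    more, equal, less = [], [pivot], []
    for d in docs:
        if d is pivot:
            continue
        p = prefer(d, pivot)                 # one preference judgment per pair
        if p < 0:
            more.append(d)                   # d preferred over the pivot
        elif p > 0:
            less.append(d)                   # pivot preferred over d
        else:
            equal.append(d)                  # tied: joins the pivot's group
    return (preference_quicksort(more, prefer)
            + [set(equal)]
            + preference_quicksort(less, prefer))

# Toy demo with made-up similarity scores standing in for human judgments:
sim = {"A": 0.9, "B": 0.9, "C": 0.7, "D": 0.4, "E": 0.4}
prefer = lambda a, b: (sim[b] > sim[a]) - (sim[b] < sim[a])
print(preference_quicksort(list(sim), prefer))  # [{'A', 'B'}, {'C'}, {'D', 'E'}]
```

In practice prefer(a, b) would be the aggregate of the ~10 worker judgments collected for that pair; the sketch runs comparisons one by one, whereas the study sent each QuickSort iteration to MTurk as a batch of HITs.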
19



How Many Assessors?
• Ranks are given to each document in a pair
 ▫ +1 if it is preferred over the other one
 ▫ -1 if the other one is preferred
 ▫ 0 if they were judged equally similar/dissimilar
• Test for signed differences in the samples (one option sketched below)
• In the original lists 35 experts were used
 ▫ Ranks of a single document ranged from 1 to more than 20
• Our rank sample is less (and equally) variable
 ▫ rank(A) = -rank(B) ⇒ var(A) = var(B)
 ▫ Effect size is larger so statistical power increases
 ▫ Fewer assessors are needed overall
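
The slides do not name the exact test, so the following is just one plausible choice: a sign test on the +1/0/-1 scores collected for a single pair:

```python
# One plausible choice (an assumption, the slides only say "signed differences"):
# a sign test on the +1/0/-1 scores that the assessors give to a pair (A, B).
from scipy.stats import binomtest

def sign_test(scores):
    """p-value for the null hypothesis that A and B are equally preferred."""
    wins_a = sum(1 for s in scores if s > 0)
    wins_b = sum(1 for s in scores if s < 0)
    n = wins_a + wins_b                 # ties (0) carry no sign information
    if n == 0:
        return 1.0
    return binomtest(wins_a, n, 0.5, alternative="two-sided").pvalue

# Made-up judgments from 10 assessors: 8 prefer A, 1 prefers B, 1 tie.
print(sign_test([+1, +1, +1, +1, 0, +1, +1, +1, -1, +1]))  # ~0.039: A preferred
```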
20



Crowdsourcing Preferences
• Crowdsourcing seems very appropriate
 ▫   Reasonable person hypothesis
 ▫   Audio instead of images
 ▫   Preference judgments
 ▫   QuickSort for partially ordered lists
• The task can be split into very small assignments
• It should be much cheaper and more consistent
 ▫   Do not need experts
 ▫   Workers are not deceived, which increases consistency
 ▫   Easier and faster to judge
 ▫   Need fewer judgments and judges
21



New Domain of Application
• Crowdsourcing has been used mainly to evaluate
  text documents in English

• How about other languages?
 ▫ Spanish [Alonso et al., 2010]

• How about multimedia?
 ▫ Image tagging? [Nowak et al., 2010]
 ▫ Music similarity?
22



Data
• MIREX 2005 Evaluation collection
 ▫ ~550 musical incipits in MIDI format
 ▫ 11 queries also in MIDI format
 ▫ 4 to 23 candidates per query

• Convert to MP3 as it is easier to play in browsers
• Trim the leading and trailing silence
 ▫ Durations go from 1 to 57 secs. (mean 6) down to 1 to 26 secs. (mean 4)
 ▫ 4 to 24 secs. (mean 13) to listen to all 3 incipits (Q, A and B)
• Uploaded all MP3 files and a Flash player to a
  private server to stream data on the fly
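
The slides do not say which tools were used for this step; purely as an illustration, the trimming and MP3 export could be done with pydub, assuming the MIDI incipits were already rendered to audio (file names below are made up):

```python
# Illustration only: the slides do not say which tools were used. This assumes
# the MIDI incipits were already rendered to WAV; file names are made up.
from pydub import AudioSegment
from pydub.silence import detect_leading_silence

def trim_and_export(in_path, out_path, threshold_dbfs=-50.0):
    audio = AudioSegment.from_file(in_path)
    lead = detect_leading_silence(audio, silence_threshold=threshold_dbfs)
    tail = detect_leading_silence(audio.reverse(), silence_threshold=threshold_dbfs)
    trimmed = audio[lead:len(audio) - tail]          # positions are in milliseconds
    trimmed.export(out_path, format="mp3")           # needs ffmpeg with an MP3 encoder
    return trimmed

trim_and_export("incipit_0001.wav", "incipit_0001.mp3")
```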
23



HIT Design




         2 yummy cents of a dollar
24



Threats to Validity
• Basically had to randomize everything
 ▫   Initial order of candidates in the first segment
 ▫   Alternate between queries
 ▫   Alternate between pivots of the same query
 ▫   Alternate pivots as variations A and B
• Let the workers know about this randomization
• In the first trials some documents were judged
  more similar to the query than the query itself!
 ▫ Require at least 95% acceptance rate
 ▫ Ask for 10 different workers per HIT [Alonso et al., 2009]
 ▫ Beware of bots (always judged equal in 8 secs.)
25



Summary of Submissions
•   The 11 lists account for 119 candidates to judge
•   Sent 8 batches (QuickSort iterations) to MTurk
•   Had to judge 281 pairs (38% of all possible pairs) = 2810 judgments
•   79 unique workers over about a day and a half
•   A total cost (excluding trials) of $70.25
26



Feedback and Music Background
• 23 of the 79 workers gave us feedback
 ▫ 4 very positive comments: very relaxing music
 ▫ 1 greedy worker: give me more money
 ▫ 2 technical problems loading the audio in 2 HITs
      Not reported by any of the other 9 workers

 ▫   5 reported no music background
 ▫   6 had formal music education
 ▫   9 had been professional practitioners for several years
 ▫   9 play an instrument, mainly the piano
 ▫   6 perform in a choir
27



Agreement between Workers
• Forget about Fleiss' Kappa
 ▫ It does not account for the size of the disagreement
 ▫ A<B vs. A=B is not as bad as A<B vs. B<A
• Look at all 45 pairs of judgments per document pair (10 workers)
 ▫   +2 if total agreement (e.g. A<B and A<B)
 ▫   +1 if partial agreement (e.g. A<B and A=B)
 ▫   0 if no agreement (i.e. A<B and B<A)
 ▫   Divide by 90, the score if all 45 pairs agreed totally (sketch below)

• Average agreement score per pair was 0.664
 ▫ From 0.506 (iteration 8) to 0.822 (iteration 2)
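
A direct reconstruction of that agreement score from the description above (variable names and the example judgments are mine):

```python
# Reconstruction of the agreement score described above (variable names and
# the example judgments are mine): -1 means A<B, 0 means A=B, +1 means A>B.
from itertools import combinations

def agreement_score(judgments):
    """Pairwise agreement among the workers that judged one (A, B) pair,
    normalized so that 1 means all pairs of workers agree totally."""
    score, pairs = 0, list(combinations(judgments, 2))
    for j1, j2 in pairs:
        if j1 == j2:
            score += 2                     # total agreement
        elif j1 == 0 or j2 == 0:
            score += 1                     # partial agreement (one said A=B)
        # opposite preferences add nothing
    return score / (2 * len(pairs))        # 90 when there are 10 workers

# Made-up judgments from 10 workers for one pair:
print(agreement_score([-1, -1, -1, 0, -1, -1, 0, -1, +1, -1]))  # ~0.67
```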
28



Agreement Workers-Experts
• Those 10 worker judgments per pair were actually aggregated




                  (Table omitted; percentages are per row total)
 ▫ 155 (55%) total agreement
 ▫ 102 (36%) partial agreement
 ▫ 23 (8%) no agreement
• Total agreement score = 0.735
• Supports the reasonable person hypothesis
29



Agreement Single Worker-Experts
30



Agreement (Summary)




• Very similar judgments overall
 ▫ The reasonable person hypothesis still stands
 ▫ Crowdsourcing seems a feasible alternative
 ▫ No music expertise seems necessary
• We could use just one assessor per pair
 ▫ If we could keep him/her throughout the query
31



Ground Truth Similarity
• Do high agreement scores translate into
  highly similar ground truth lists?

• Consider the original lists (All-2) as ground truth
• And the crowdsourced lists as a system’s result
  ▫ Compute the Average Dynamic Recall [Typke et al., 2006]
  ▫ And then the other way around

• Also compare with the (more consistent) original
  lists aggregated in Any-1 form [Urbano et al., 2010b]
32



Ground Truth Similarity (II)
• The result depends on the initial ordering
  ▫ Ground truth = (A, B, C), (D, E)
  ▫ Results1 = (A, B), (D, E, C)
    ADR score = 0.933
 ▫ Results2 = (A, B), (C, D, E)
    ADR score = 1

• Results1 and Results2 are identical as partially ordered lists,
  yet they get different ADR scores

• Generate 1000 equivalent versions by randomly
  permuting the documents within each group
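
For reference, a sketch of Average Dynamic Recall as defined by Typke et al. (2006), as I understand it; it reproduces both scores in the example above:

```python
# A sketch of Average Dynamic Recall (Typke et al., 2006) as I understand it.
def adr(groups, ranking):
    """groups: ground-truth partially ordered list (list of document sets).
    ranking: the flattened result list. Returns the ADR score."""
    cum, total_docs = [], 0
    for g in groups:
        total_docs += len(g)
        cum.append(total_docs)
    score = 0.0
    for i in range(1, total_docs + 1):
        # Smallest prefix of groups that covers the first i ranks.
        g_idx = next(j for j, c in enumerate(cum) if c >= i)
        allowed = set().union(*groups[:g_idx + 1])
        score += len(set(ranking[:i]) & allowed) / i
    return score / total_docs

gt = [{"A", "B", "C"}, {"D", "E"}]
print(adr(gt, ["A", "B", "D", "E", "C"]))  # Results1: 0.933...
print(adr(gt, ["A", "B", "C", "D", "E"]))  # Results2: 1.0
```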
33



Ground Truth Similarity (and III)




             (Table omitted; Min. and Max. shown between square brackets)



• Very similar to the original All-2 lists
• Like the Any-1 version, also more restrictive
• More consistent (workers were not deceived)
34



MIREX 2005 Revisited
• Would the evaluation have been affected?
 ▫ Re-evaluated the 7 systems that participated
 ▫ Included our Splines system [Urbano et al., 2010a]




• All systems perform significantly worse
 ▫ ADR scores drop by 9–15%
• But their ranking is just the same
 ▫ Kendall’s τ = 1
35



Conclusions
• Partially ordered lists should come back

• We proposed an alternative methodology
 ▫ Asked for three-point preference judgments
 ▫ Used Amazon Mechanical Turk
    Crowdsourcing can be used for music-related tasks
    Provided empirical evidence supporting the
     reasonable person hypothesis


• What for?
 ▫ More affordable and large-scale evaluations
36



Conclusions (and II)
• We need fewer assessors
 ▫ More queries with the same manpower
• Preferences are easier and faster to judge
• Fewer judgments are required
 ▫ Sorting algorithm

• Avoid inconsistencies (A=B option)
• Using audio instead of images gets rid of experts

• From 70 expert hours to 35 hours for $70
37



Future Work
• Choice of pivots in the sorting algorithm
 ▫ e.g. the query itself would not provide information

• Study the collections for the Audio Music Similarity tasks
 ▫ They have more data
    Though it is inaccessible
 ▫ But no partially ordered lists (yet)

• Use our methodology with one real expert
  judging preferences for the same query
• Also try crowdsourcing with one single worker
38



Future Work (and II)
• Experimental study on the characteristics of
  music similarity perception by humans
 ▫ Is it transitive?
     We assumed it is
 ▫ Is it symmetrical?

• If these properties do not hold we have problems

• If they do, we can start thinking about Minimal
  and Incremental Test Collections
 [Carterette et al., 2005]
39



And That’s It!




                 Picture by 姒儿喵喵
