Discovering User Perceptions of
  Semantic Similarity in
  Near-duplicate Multimedia Files

Raynor Vliegendhart (speaker)
Martha Larson
Johan Pouwelse

WWW 2012 Workshop on Crowdsourcing Web Search (CrowdSearch 2012),
Lyon, France, April 17, 2012.
Outline

• Introduction
• Crowdsourcing Task
• Results
• Conclusions and Future Work




Question:
Are these the same? Why (not)?


            Chrono Cross -
            'Dream of the Shore Near Another World'
            Violin/Piano Cover


            Chrono Cross
            Dream of the Shore Near Another World
            Violin and Piano

                       sources: YouTube, IQYNEj51EUI (left), Iuh3YrJtK3M (right)

Two possible answers:

• “Yes, it’s the same song”
• “No, these are different performances by different performers”
Problem:
What constitutes a near duplicate?


Functional near-duplicate multimedia items are items
that fulfill the same purpose for the user.

Once the user has one of these items, there is no
additional need for another.





Our work:
• Discovering new notions of user-perceived
  similarity between multimedia files

• in a file-sharing setting

• through a crowdsourcing task.




Motivation:
Clustering items in search results




                            screenshot from Tribler (tribler.org)

Crowdsourcing Task:
Point the odd one out

• Three multimedia files displayed as search results
• Worker points the odd one out and justifies why


• Challenge: eliciting serious judgments




Crowdsourcing Task:
   Eliciting serious judgments (1)

   “Imagine that you downloaded
    the three items in the list
    and that you view them.”

Harry Potter and the Sorcerers Stone Audio
Book (478 MB)

Harry Potter and the Sorcerer s Stone
(2001)(ENG GER NL) 2Lions- (4.36 GB)

Harry Potter.And.The.Sorcerer.Stone.DVDR.
NTSC.SKJACK.Universal.S (4.46 GB)


Crowdsourcing Task:
Eliciting serious judgments (2)

• Don’t force workers to make a contrast
• Explain the definition of functional similarity


o The items are comparable. They are for all practical purposes the
  same. Someone would never really need all three of these.

o Each item can be considered unique. I can imagine that someone
  might really want to download all three of these items.

o One item is not like the other two. (Please mark that item in the list.)
  The other two items are comparable.
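
When processing worker responses, the three answer options above could be captured in a small data structure. The Python sketch below is only an illustration: the names `Answer`, `Judgment` and `is_well_formed` are hypothetical and do not come from the study. The rules it checks (every answer carries a justification, and only an "odd one out" answer marks an item) are assumptions based on the task description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Answer(Enum):
    """The three options offered to the worker (names are hypothetical)."""
    COMPARABLE = "all three items are comparable"
    UNIQUE = "each item can be considered unique"
    ODD_ONE_OUT = "one item is not like the other two"


@dataclass
class Judgment:
    answer: Answer
    justification: str               # free-text explanation by the worker
    odd_index: Optional[int] = None  # 0, 1 or 2; only set for ODD_ONE_OUT


def is_well_formed(j: Judgment) -> bool:
    """Assumed sanity check, not the study's actual validation logic."""
    if not j.justification.strip():
        return False  # an empty justification is treated as not serious
    if j.answer is Answer.ODD_ONE_OUT:
        return j.odd_index in (0, 1, 2)  # the odd item must be marked
    return j.odd_index is None  # no item may be marked otherwise
```

A made-up response such as `Judgment(Answer.ODD_ONE_OUT, "The first item is an audiobook", odd_index=0)` would pass this check, while the same answer without a marked item would not.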

Final HIT Design




Dataset




top 100 content → 75 queries → 75 results lists / 32,773 filenames

• 1000 random triads (test set)
• 28 manually selected triads (validation set)
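
One way the random test triads could be drawn is to pick a results list and sample three distinct filenames from it, repeating until enough triads are collected. The slides do not say how the triads were actually sampled, so the sketch below is an assumption. The function `sample_triads`, the `results_lists` mapping (query to filenames) and the seed are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Triad = Tuple[str, str, str]


def sample_triads(results_lists: Dict[str, List[str]],
                  n_triads: int,
                  seed: int = 0) -> List[Tuple[str, Triad]]:
    """Draw n_triads triads, each taken from a single query's results list."""
    rng = random.Random(seed)
    # Only lists with at least three distinct filenames can yield a triad.
    eligible = sorted(q for q, names in results_lists.items()
                      if len(set(names)) >= 3)
    if not eligible:
        raise ValueError("no results list has three distinct filenames")
    triads = []
    for _ in range(n_triads):
        query = rng.choice(eligible)
        # Sample three different filenames without replacement.
        a, b, c = rng.sample(sorted(set(results_lists[query])), 3)
        triads.append((query, (a, b, c)))
    return triads
```

This simple version may draw the same triad more than once; a real setup would probably deduplicate.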

Results

Two HITs run concurrently:

• Recruitment HIT: 3 validation triads
• Main HIT: 1000 test triads + 28 validation triads mixed in
  (3 workers per test triad)

Recruitment HIT → 14 qualified workers → 8 → Main HIT

• free-text judgments for 308 test triads
• < 36h
Card Sort

• Print judgments on small pieces of paper
• Group similar judgments into piles
• Merge piles iteratively
• Label each pile
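
The card sort was done by hand on paper. If the resulting piles were also recorded digitally, the bookkeeping might look like the sketch below. The class `CardSort` and its method names are hypothetical; the grouping decisions themselves remain human judgments supplied by the caller.

```python
from typing import Dict, List


class CardSort:
    """Tracks piles of free-text judgments during a manual card sort."""

    def __init__(self) -> None:
        self.piles: Dict[int, List[str]] = {}  # pile id -> judgments
        self.labels: Dict[int, str] = {}       # pile id -> label
        self._next_id = 0

    def new_pile(self, judgment: str) -> int:
        """Start a new pile with one judgment and return its id."""
        pile_id = self._next_id
        self._next_id += 1
        self.piles[pile_id] = [judgment]
        return pile_id

    def add(self, pile_id: int, judgment: str) -> None:
        """Place a judgment on an existing pile."""
        self.piles[pile_id].append(judgment)

    def merge(self, keep: int, absorb: int) -> None:
        """Merge pile `absorb` into pile `keep`."""
        if keep == absorb:
            return
        self.piles[keep].extend(self.piles.pop(absorb))
        self.labels.pop(absorb, None)

    def label(self, pile_id: int, text: str) -> None:
        """Attach a label to an existing pile."""
        if pile_id not in self.piles:
            raise KeyError(pile_id)
        self.labels[pile_id] = text
```

For instance, two judgments about a Hindi and a Spanish version of a movie could start as separate piles and later be merged under the label "different language".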




Card Sort

Example: “different language”
• “The third item is a Hindi language version of the movie”
• “This is a Spanish version of the movie represented by the other two”
• …




User-perceived Similarity Dimensions

Different movie vs. TV show                     Different movie
Normal cut vs. extended cut                     Movie vs. trailer
Cartoon vs. movie                               Comic vs. movie
Movie vs. book                                  Audiobook vs. movie
Game vs. corresponding movie                    Sequels (movies)
Commentary document vs. movie                   Soundtrack vs. corresponding movie
Movie/TV show vs. unrelated audio album         Movie vs. wallpaper
Different episode                               Complete season vs. individual episodes
Episodes from different season                  Graphic novel vs. TV episode
Multiple episodes vs. full season               Different realization of same legend/story
Different songs                                 Different albums
Song vs. album                                  Collection vs. album
Album vs. remix                                 Event capture vs. song
Explicit version                                Bonus track included
Song vs. collection of songs+videos             Event capture vs. unrelated movie
Language of subtitles                           Different language
Mobile vs. normal version                       Quality and/or source
Different codec/container (MP4 audio vs. MP3)   Different game
Crack vs. game                                  Software versions
Different game, same series                     Different application
Addon vs. main application                      Documentation (pdf) vs. software
List (text document) vs. unrelated item         Safe vs. X-Rated


Conclusions

• A wealth of user-perceived dimensions of similarity discovered,
  some of which we could not have thought of ourselves
• Quick results thanks to an interesting crowdsourcing task
  that focuses on engagement and on encouraging serious workers




Future Work

• Expand experiments, larger worker volume
• Other multimedia search settings
• Crowdsourcing the card sorting process


• Use findings to guide design of clustering algorithms
 Done: first version is deployed in Tribler




Questions?




