We describe the development of a test collection for investigating speech retrieval beyond the identification of relevant content. The collection focuses on satisfying user information needs for queries associated with specific types of speech acts. It is based on an archive of Internet video from the video sharing platform blip.tv, and was provided by the MediaEval benchmarking initiative. A crowdsourcing approach was used to identify segments in the video data which contain speech acts, to create a description of the video containing each act, and to generate search queries designed to refind the speech act. We describe and reflect on our experiences with crowdsourcing this test collection using the Amazon Mechanical Turk platform, and highlight the challenges of constructing this dataset, including the selection of the data source, the design of the crowdsourcing task, and the specification of queries and relevant items.
Creating a Data Collection for Evaluating Rich Speech Retrieval (LREC 2012)
Maria Eskevich (1), Gareth J.F. Jones (1), Martha Larson (2), Roeland Ordelman (3)

(1) Centre for Digital Video Processing, Centre for Next Generation Localisation, School of Computing, Dublin City University, Dublin, Ireland
(2) Delft University of Technology, Delft, The Netherlands
(3) University of Twente, The Netherlands
Outline

- MediaEval benchmark
- MediaEval 2011 Rich Speech Retrieval Task
- What is crowdsourcing?
- Crowdsourcing in the development of speech and language resources
- Development of an effective crowdsourcing task
- Comments on results
- Conclusion
- Future work: Brave New Task at MediaEval 2012
MediaEval

- Multimedia Evaluation benchmarking initiative.
- Evaluates new algorithms for multimedia access and retrieval.
- Emphasizes the "multi" in multimedia: speech, audio, visual content, tags, users, context.
- Innovates new tasks and techniques focusing on the human and social aspects of multimedia content.
MediaEval 2011: Rich Speech Retrieval (RSR) Task

Task goal: the information to be found is a combination of the required audio and visual content and the speaker's intention. For example, the utterance "I will be there tomorrow" can be a promise or a warning depending on that intention.

Conventional retrieval considers the words and their meaning:

    Transcript 1 = Transcript 2
    Meaning 1 = Meaning 2
    -> Conventional retrieval

Extended speech retrieval also considers the speaker's intention:

    Transcript 1 = Transcript 2
    Meaning 1 = Meaning 2
    Speech act 1 = Speech act 2
    -> Extended speech retrieval
MediaEval 2011 RSR Task: the ME10WWW dataset

- Videos from the Internet video sharing platform blip.tv (1974 episodes, 350 hours).
- Automatic Speech Recognition (ASR) transcripts provided by LIMSI and Vocapia Research.
- No queries or relevant items.

-> For the retrieval experiment we therefore collect user-generated queries and user-generated relevant items, via crowdsourcing technology.
What is crowdsourcing?

Crowdsourcing is a form of human computation. Human computation is a method of having people do things that we might consider assigning to a computing device, e.g. a language translation task. A crowdsourcing system facilitates a crowdsourcing process.

Factors to take into account:
- Sufficient number of workers
- Level of payment
- Clear instructions
- Possible cheating
Crowdsourcing in the Development of Speech and Language Resources

Crowdsourcing is well suited to simple, straightforward natural language processing tasks: work by non-expert crowdsourced workers is of a similar standard to that performed by expert workers for
- translation / translation assessment
- transcription of the worker's native language
- word sense disambiguation
- temporal annotation
[Snow et al., 2008]

Research question at the collection creation stage: can untrained crowdsourced workers undertake extended tasks which require them to be creative?
Crowdsourcing with Amazon Mechanical Turk

A task is referred to as a 'Human Intelligence Task' or HIT.

Crowdsourcing procedure:
- HIT initiation: the requester uploads a HIT.
- Work: workers carry out the HIT.
- Review: the requester reviews the completed work and confirms payment to the worker at a previously set rate. (The requester also has the option of paying more, as a "bonus".)
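For orientation, here is a minimal sketch of that three-step requester workflow against the present-day MTurk API via boto3 (which postdates this 2011 work); the task URL, HIT metadata, and payment amounts are illustrative placeholders, not the values used in the study.

```python
# Minimal sketch of the requester workflow (initiate HIT, review, pay + bonus)
# using the modern boto3 MTurk client. URL, metadata and amounts are
# illustrative placeholders.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# An ExternalQuestion points workers at a page hosting the annotation form.
QUESTION_XML = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.org/rsr-annotation-form</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>
"""

# HIT initiation: the requester uploads a HIT with a previously set payment.
hit = mturk.create_hit(
    Title="Find and describe an interesting segment in a short video",
    Description="Watch a video, mark a segment, transcribe it, write queries.",
    Keywords="video, annotation, speech",
    Reward="0.19",                       # base payment in USD, fixed in advance
    MaxAssignments=1,
    AssignmentDurationInSeconds=30 * 60,
    LifetimeInSeconds=7 * 24 * 3600,
    Question=QUESTION_XML,
)

# Review: inspect each submitted assignment, approve it (releasing the base
# payment) and optionally grant the extra "bonus" payment.
submitted = mturk.list_assignments_for_hit(
    HITId=hit["HIT"]["HITId"], AssignmentStatuses=["Submitted"])
for a in submitted["Assignments"]:
    mturk.approve_assignment(AssignmentId=a["AssignmentId"])
    mturk.send_bonus(WorkerId=a["WorkerId"], AssignmentId=a["AssignmentId"],
                     BonusAmount="0.10", Reason="High-quality annotation")
```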
Information expected from the worker to create a test collection for the RSR task:

- Speech act type:
  - 'expressives': apology, opinion
  - 'assertives': definition
  - 'directives': warning
  - 'commissives': promise
- Time of the labelled speech act: beginning and end.
- An accurate transcript of the labelled speech act.
- Queries to refind this speech act:
  - a full-sentence query
  - a short web-style query
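To make the expected output concrete, each completed HIT can be viewed as one record along the following lines; the field names and values are our own illustration, not the schema of the actual collection.

```python
# One completed HIT as a single record. Field names and values are
# illustrative; the actual collection's format may differ.
from dataclasses import dataclass

@dataclass
class SpeechActAnnotation:
    video_id: str        # blip.tv episode identifier (hypothetical value below)
    act_type: str        # 'apology', 'opinion', 'definition', 'warning' or 'promise'
    start_sec: float     # beginning of the labelled speech act
    end_sec: float       # end of the labelled speech act
    transcript: str      # worker's accurate transcript of the segment
    sentence_query: str  # full-sentence query to refind the segment
    web_query: str       # short web-style query to refind the segment

example = SpeechActAnnotation(
    video_id="episode_0042",
    act_type="promise",
    start_sec=312.0,
    end_sec=331.5,
    transcript="I promise we will ship the new release next month.",
    sentence_query="Find the segment where the speaker promises to ship the new release next month.",
    web_query="promise ship new release next month",
)
```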
Data management for Amazon Mechanical Turk

ME10WWW videos vary in length, so for longer videos, starting points approximately 7 minutes apart are calculated (a sketch of this computation follows the table):

Data set | Episodes | Starting points
Dev      |      247 |             562
Test     |     1727 |            3278
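A minimal sketch of how such starting points might be computed, assuming a fixed 7-minute spacing; the slides do not specify the exact segmentation rule, so treat this as one plausible reading.

```python
# Sketch: place playback starting points roughly 7 minutes apart in each
# episode, so no single HIT asks a worker to scan an overly long video.
# The exact spacing rule used for ME10WWW is assumed, not documented here.
SPACING_SEC = 7 * 60  # approximately 7 minutes

def starting_points(duration_sec: float, spacing: float = SPACING_SEC) -> list[float]:
    """Return playback offsets (in seconds) spaced ~`spacing` apart."""
    points = [0.0]
    t = spacing
    while t < duration_sec:
        points.append(t)
        t += spacing
    return points

# A 30-minute episode yields starting points at 0, 7, 14, 21 and 28 minutes:
print(starting_points(30 * 60))  # [0.0, 420.0, 840.0, 1260.0, 1680.0]
```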
Crowdsourcing experiment

Worker expectations:
- reward vs. work
- per-hour rate

Requester uploads the pilot HIT:
- pilot wording
- 0.11 $ + bonus per speech act type

Workers' feedback on the pilot:
- the reward is not worth the work
- the task is too complicated

Requester updates the HIT:
- rewording
- examples
- 0.19 $ + bonus (0-21$); workers suggest the bonus size (we mention that we are a non-profit organization)

Workers' feedback on the updated HIT:
- the reward is worth the work
- the task is comprehensible
- workers are not greedy!
HIT example

Pilot:
"Please watch the video and find a short portion of the video (a segment) that contains an interesting quote. The quote must fall into one of these six categories."

Revised:
"Imagine that you are watching videos on YouTube. When you come across something interesting you might want to share it on Facebook, Twitter or your favorite social network. Now please watch this video and search for an interesting video segment that you would like to share with others because it is (an apology, a definition, an opinion, a promise, a warning)."
Results

[Chart: number of collected queries per speech act type]

Prices:
- Dev set: 40 $ for 30 queries
- Test set: 80 $ for 50 queries
Results assessment

- Number of accepted HITs = number of collected queries.
- No overlap of workers between the dev and test sets.
- Creative work invites creative cheating:
  - copying and pasting the provided examples
    -> examples should be given as pictures, not as text
  - choosing the "no speech act found in the video" option
    -> manual assessment by the requester is needed
- Workers rarely find noteworthy content later than the third minute after the playback starting point in the video.
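A screening pass of the kind these observations motivate might look like the following sketch; the record fields reuse the illustrative schema above, and the checks and thresholds are our own rendering of the failure modes listed, not figures from the study.

```python
# Sketch of a screening pass over submitted HITs, motivated by the cheating
# patterns above. Field names follow the illustrative record shown earlier;
# EXAMPLE_TEXTS would hold the example transcripts shown in the instructions.
EXAMPLE_TEXTS = {
    "i promise we will ship the new release next month.",  # hypothetical
}

def review_reasons(sub: dict) -> list[str]:
    """Return reasons why a submission needs manual review by the requester."""
    reasons = []
    # Creative cheating 1: the worker copied a provided example verbatim.
    if sub["transcript"].strip().lower() in EXAMPLE_TEXTS:
        reasons.append("transcript copied from the instructions")
    # Creative cheating 2: the worker claimed no speech act was present.
    if sub.get("no_act_found"):
        reasons.append("'no speech act found' claim needs verification")
    # Workers rarely look beyond ~3 minutes past the playback starting point;
    # a much later segment is unusual and worth a second look.
    if sub["start_sec"] - sub["playback_start_sec"] > 3 * 60:
        reasons.append("segment starts unusually late; double-check timing")
    return reasons
```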
Conclusions

- It is possible to crowdsource extensive and complex tasks to support speech and language resources.
- Use concepts and vocabulary familiar to the workers.
- Pay attention to the technical issues of watching the video.
- Preprocess the videos into smaller segments.
- Creative work demands a higher reward level, or simply a more flexible payment system.
- There is a high level of wastage due to task complexity.
MediaEval 2012 Brave New Task: Search and Hyperlinking

Use scenario: a user is searching for a known segment in a video collection. Because the information in that segment might not be sufficient for their information need, they also want links to other related video segments, which may help to satisfy information needs related to this video.

Sub-tasks:
- Search: finding suitable video segments based on a short natural language query.
- Linking: defining links to other relevant video segments in the collection.
MediaEval 2012
Thank you for your attention!
Welcome to MediaEval 2012! http://multimediaeval.org