Lecture by Marco Tagliasacchi (Politecnico di Milano) for the Summer School on Social Media Modeling and Search, an event of the European Chapter of ACM SIGMM, supported by the CUbRIK and Social Sensor projects.
10-14 September, Fira, Santorini, Greece
4. + Crowdsourcing
- Crowdsourcing is an example of human computing
- Use an online community of human workers to complete useful tasks
- The task is outsourced to an undefined public
- Main idea: design tasks that are
  - Easy for humans
  - Hard for machines
6. + Applications in multimedia retrieval
- Create annotated data sets for training
  - Reduces both cost and time needed to gather annotations...
  - ...but annotations might be noisy!
- Validate the output of multimedia retrieval systems
- Query expansion / reformulation
7. + Creating annotated training sets [Sorokin and Forsyth, 2008]
- Collect annotations for computer vision data sets
  - people segmentation
[Figure: example annotations collected with Protocol 1 and Protocol 2]
8. + Creating annotated training sets [Sorokin and Forsyth, 2008]
- Collect annotations for computer vision data sets
  - people segmentation and pose annotation
[Figure 1: example results obtained from the annotation experiments (Protocols 2, 3 and 4)]
9. + Creating annotated training sets [Sorokin and Forsyth, 2008]
- Observations:
  - Annotators make errors
  - Quality of annotators is heterogeneous
  - The quality of the annotations depends on the difficulty of the task
[Figures 5-6: annotation quality per image and per landmark. Experiment 3 (trace the boundary of the person) is scored by area(XOR)/area(AND) of the best pair of annotations, the lower the better: mean 0.21, std 0.14, median 0.16. Experiment 4 (click on 14 landmarks) is scored by the mean error in pixels between annotation points, the lower the better: mean 8.71, std 6.29, median 7.35.]
10. + Creating annotated training sets [Soleymani and Larson, 2010]
- MediaEval 2010 Affect Task
- Use of Amazon Mechanical Turk to annotate the Affect Task Corpus
  - 126 videos (2-5 mins in length)
- Annotate
  - Mood (e.g., pleased, helpless, energetic, etc.)
  - Emotion (e.g., sadness, joy, anger, etc.)
  - Boredom (nine-point rating scale)
  - Liking (nine-point rating scale)
11. + Creating annotated training sets [Nowak and Ruger, 2010]
- Crowdsourcing image concepts. 53 concepts, e.g.,
  - Abstract categories: party life, beach holidays, snow, etc.
  - Time of the day: day, night, no visual cue
  - ...
- Subset of 99 images from the ImageCLEF2009 dataset
[Figure 1: annotation tool used for the acquisition of expert annotations; the MTurk HIT template mirrors it, presenting each image with a question survey in three sections (Scene Description, Representation, Pictured Objects)]
12. + Creating annotated training sets [Nowak and Ruger, 2010]
- Study of expert and non-expert labeling
- Inter-annotation agreement among experts: very high
- Influence of the expert ground truth on concept-based retrieval ranking: very limited
- Inter-annotation agreement among non-experts: high, although not as good as among experts
- Influence of averaged annotations (experts vs. non-experts) on concept-based retrieval ranking: averaging filters out noisy non-expert annotations
13. + Creating annotated training sets [Vondrick et al., 2010]
- Crowdsourcing object tracking in video
- Annotators draw bounding boxes
[Fig. 2: the video labeling user interface; all previously labeled entities are shown]
14. + Creating annotated training sets [Vondrick et al., 2010]
- Annotators label the enclosing bounding box of an entity every T frames
- Bounding boxes at intermediate time instants are interpolated (a minimal sketch follows below)
- Interesting trade-off between
  - Cost of turk workers
  - Cost of interpolation on the Amazon EC2 cloud
[Figure: (a) Field drills, (b) Basketball players]
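To make the interpolation step concrete, here is a minimal sketch of linear interpolation between two keyframe boxes labeled T frames apart. The box format (x, y, w, h) and the helper name are assumptions for illustration, not the actual implementation of Vondrick et al.

```python
def interpolate_boxes(box_a, box_b, t_a, t_b):
    """Linearly interpolate (x, y, w, h) boxes labeled at keyframes t_a < t_b."""
    boxes = {t_a: box_a, t_b: box_b}
    for t in range(t_a + 1, t_b):
        alpha = (t - t_a) / (t_b - t_a)
        boxes[t] = tuple((1 - alpha) * a + alpha * b
                         for a, b in zip(box_a, box_b))
    return boxes

# Example: a worker labels frame 0 and frame 10 (T = 10);
# frames 1-9 are filled in automatically.
track = interpolate_boxes((50, 40, 30, 60), (80, 45, 30, 60), 0, 10)
```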
15. + Creating annotated training sets [Urbano et al., 2010]
- Goal: evaluation of music information retrieval systems
- Use crowdsourcing as an alternative to experts to create ground truths of partially ordered lists
- Good agreement (92% complete + partial) with experts
[Figure 4: HIT design — workers listen to an original melody and two incipits, then judge which variation is more similar to the original (A, B, or equally similar/dissimilar); a melody contained in another had to be judged equally similar, to comply with the original guidelines]
16. + Validate the output of MIR systems [Snoek et al., 2010][Freiburg et al., 2011]
- Search engine for archival rock 'n' roll concert video
- Use of crowdsourcing to improve, extend and share automatically detected concepts in video fragments
[Figure 1: eleven common concert concepts detected automatically, for which user feedback is collected: Audience, Close-up, Hands, Pinkpop hat, Keyboard, Guitar player, Drummer, Over the shoulder, Singer, Stage, Pinkpop logo]
[Figure 2: timeline-based video player; colored dots correspond to automated visual detection results, and users can navigate directly to fragments of interest by interacting with the dots, which pop up a feedback overlay]
[Figure 4: results for Experiment 2, quality of fragment labels vs. user-feedback agreement]
17. + Validate the output of MIR systems [Steiner et al., 2011]
- Propose a browser extension to navigate detected events in videos
- Visual events (shot changes)
- Occurrence events (analysis of metadata by means of NLP to detect named entities)
- Interest-based events (click counters on detected visual events)
[Fig. 2: screenshot of the YouTube browser extension, showing the three different event types]
18. + Validate the output of MIR systems [Goeau et al., 2011]
- Visual plant species identification
  - Based on local visual features
- Crowdsourced validation
[Figure 1: GUI of the web application]
19. + Validate the output of MIR systems [Yan et al., 2010]
- CrowdSearch combines
  - Automated image search: local processing on mobile phones + backend processing
  - Real-time human validation of search results: Amazon Mechanical Turk
- Studies the trade-off in terms of
  - Delay
  - Accuracy
  - Cost
- An adaptive algorithm uses delay and result prediction models of human responses to decide when to invoke human validation; once a candidate image is validated, it is returned to the user as a valid search result
- More on this later...
[Figure 2: an image search query, candidate images, and real-time validation tasks]
22. + Annotation model
- A set of objects to annotate: i = 1, ..., I
- A set of annotators: j = 1, ..., J
- Types of annotations (a minimal data-layout sketch follows below)
  - Binary
  - Categorical (multi-class)
  - Numerical
  - Other
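One convenient way to hold this model in code is a sparse mapping from (object i, annotator j) pairs to labels. The dictionary layout and helper name below are assumptions for illustration (binary labels, missing entries allowed), not something prescribed in the lecture.

```python
# Sketch of the annotation model: I objects, J annotators, binary labels,
# with possibly missing (object, annotator) entries.
annotations = {
    (0, 0): 1, (0, 1): 1, (0, 2): 0,   # object 0 labeled by annotators 0, 1, 2
    (1, 0): 0, (1, 2): 0,              # object 1: annotator 1 gave no label
}

def labels_for_object(annotations, i):
    """Collect all labels y_i^j received by object i."""
    return [y for (obj, _), y in annotations.items() if obj == i]

print(labels_for_object(annotations, 0))  # [1, 1, 0]
```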
24. + Aggregating annotations
- Majority voting (baseline)
  - For each object, assign the label that received the largest number of votes (a minimal sketch follows below)
- Aggregating annotations
  - [Dawid and Skene, 1979]
  - [Snow et al., 2008]
  - [Whitehill et al., 2009]
  - ...
- Aggregating and learning
  - [Sheng et al., 2008]
  - [Donmez et al., 2009]
  - [Raykar et al., 2010]
  - ...
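A minimal sketch of the majority-voting baseline for one object; ties are broken arbitrarily, a detail the slides do not specify.

```python
from collections import Counter

def majority_vote(labels):
    """Return the label that received the largest number of votes."""
    return Counter(labels).most_common(1)[0][0]

# Example: three annotators label the same object
print(majority_vote([1, 1, 0]))  # -> 1
```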
25. + Aggregating annotations: Majority voting
- Assume that
  - the annotator quality is independent from the object: P(y_i^j = y_i) = p_j
  - all annotators have the same quality: p_j = p
- The integrated quality of majority voting using I = 2N + 1 annotators is (evaluated numerically in the sketch below)

  q = P(y^{MV} = y) = \sum_{i=0}^{N} \binom{2N+1}{i} \, p^{2N+1-i} \, (1-p)^{i}
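The integrated quality q can be evaluated numerically by summing the binomial terms above; this small sketch (the function name is mine) reproduces, for instance, the behavior of the curves shown on the next slide.

```python
from math import comb

def integrated_quality(p, num_labelers):
    """P(majority vote = true label) with an odd number of labelers of quality p."""
    assert num_labelers % 2 == 1, "majority voting assumes 2N+1 labelers"
    n = (num_labelers - 1) // 2
    return sum(comb(num_labelers, i) * p ** (num_labelers - i) * (1 - p) ** i
               for i in range(n + 1))

print(integrated_quality(0.7, 1))  # 0.7
print(integrated_quality(0.7, 5))  # ~0.837: five mediocre labelers beat one
```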
26. + Aggregating annotations: Majority voting
[Figure 2: the relationship between integrated labeling quality q, individual quality p (curves for p = 0.4 to 1.0), and the number of labelers (1 to 13)]
27. + Aggregating annotations [Snow et al., 2008]
- Binary labels: y_i^j \in \{0, 1\}
- The true label is estimated by evaluating the posterior log-odds, i.e.,

  \log \frac{P(y_i = 1 \mid y_i^1, \ldots, y_i^J)}{P(y_i = 0 \mid y_i^1, \ldots, y_i^J)}

- Applying Bayes' theorem (a minimal aggregation sketch follows below):

  \log \frac{P(y_i = 1 \mid y_i^1, \ldots, y_i^J)}{P(y_i = 0 \mid y_i^1, \ldots, y_i^J)} = \sum_j \log \frac{P(y_i^j \mid y_i = 1)}{P(y_i^j \mid y_i = 0)} + \log \frac{P(y_i = 1)}{P(y_i = 0)}

  (posterior log-odds = sum of per-annotator log-likelihood ratios + prior log-odds)
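A minimal sketch of this posterior log-odds aggregation, assuming the per-annotator likelihoods P(y^j | y) are already available (estimating them is the topic of the next slide); the data layout and variable names are my own, not Snow et al.'s code.

```python
from math import log

def log_odds(votes, likelihoods, prior_pos=0.5):
    """Log-odds of y_i = 1 given one vote per annotator (Naive Bayes combination).

    votes:       {annotator j: observed label y_i^j in {0, 1}} for one object
    likelihoods: {j: {(observed, true): P(y^j = observed | y = true)}}
    """
    score = log(prior_pos) - log(1 - prior_pos)
    for j, y_obs in votes.items():
        score += log(likelihoods[j][(y_obs, 1)]) - log(likelihoods[j][(y_obs, 0)])
    return score  # positive => predict label 1

# Annotator 0 is 90% accurate, annotator 1 barely better than chance (55%).
likelihoods = {
    0: {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.1, (0, 0): 0.9},
    1: {(1, 1): 0.55, (0, 1): 0.45, (1, 0): 0.45, (0, 0): 0.55},
}
print(log_odds({0: 1, 1: 0}, likelihoods))  # > 0: the reliable annotator dominates
```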
28. + Aggregating annotations [Snow et al., 2008]
- How to estimate P(y_i^j \mid y_i = 1) and P(y_i^j \mid y_i = 0)?
- Gold standard:
  - Some objects have known labels
  - Ask to annotate these objects
  - Compute the empirical p.m.f. for the object(s) with known labels:

    P(y^j = 1 \mid y = 1) = \frac{\text{number of correct annotations}}{\text{number of annotations of objects with true label } 1}

  - Compute the performance of annotator j (independent from the object):

    P(y_1^j \mid y_1 = 1) = P(y_2^j \mid y_2 = 1) = \ldots = P(y_I^j \mid y_I = 1) = P(y^j \mid y = 1)

  (a minimal estimation sketch follows below)
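A minimal sketch of this gold-standard estimate for a single annotator; the data layout and the add-one smoothing (to avoid zero probabilities with few gold items) are my own assumptions, not part of Snow et al.'s description here.

```python
def estimate_likelihoods(gold_labels, annotations, annotator):
    """Estimate P(y^j = a | y = t), with a, t in {0, 1}, from gold-labeled objects.

    gold_labels: {object i: true label y_i}
    annotations: {(object i, annotator j): observed label y_i^j}
    """
    counts = {(a, t): 1 for a in (0, 1) for t in (0, 1)}  # add-one smoothing
    for (i, j), y_obs in annotations.items():
        if j == annotator and i in gold_labels:
            counts[(y_obs, gold_labels[i])] += 1
    totals = {t: counts[(0, t)] + counts[(1, t)] for t in (0, 1)}
    return {(a, t): counts[(a, t)] / totals[t] for a in (0, 1) for t in (0, 1)}

gold = {0: 1, 1: 0, 2: 1}
ann = {(0, 7): 1, (1, 7): 0, (2, 7): 0}   # annotator 7 gets object 2 wrong
print(estimate_likelihoods(gold, ann, 7))  # e.g. P(y^7 = 1 | y = 1) = 2/4 = 0.5
```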
29. + Aggregating annotations [Snow et al., 2008]
- Each annotator's vote is weighted by the log-likelihood ratio for their given response (Naïve Bayes)
- More reliable annotators are weighted more:

  \log \frac{P(y_i = 1 \mid y_i^1, \ldots, y_i^J)}{P(y_i = 0 \mid y_i^1, \ldots, y_i^J)} = \sum_j \log \frac{P(y_i^j \mid y_i = 1)}{P(y_i^j \mid y_i = 0)} + \log \frac{P(y_i = 1)}{P(y_i = 0)}

- Issue: obtaining a gold standard is costly!
30. + Aggregating annotations [Kumar and Lease, 2011]
- With very accurate annotators (p_j \sim U(0.6, 1.0)), it is better to label more examples once
- With very noisy annotators (p_j \sim U(0.3, 0.7)), aggregating labels helps, if annotator accuracies are taken into account
- SL: Single Labeling; MV: Majority Voting; NB: Naïve Bayes
[Figure 1: p \sim U(0.6, 1.0) — with very accurate annotators, generating multiple labels (to improve consensus label accuracy) provides little benefit; labeling effort is better spent single-labeling more examples]
[Figure 2: p \sim U(0.4, 0.6) — with very noisy annotators, single labeling yields such poor training data that there is no benefit from labeling more examples (a flat learning curve); MV just aggregates this noise, whereas NB, by modeling worker accuracies and weighting their labels appropriately, improves consensus labeling accuracy (and thereby classifier accuracy)]
[Figure 3: p \sim U(0.3, 0.7) — with greater variance in accuracies vs. Figure 2, NB further improves]