Slides for EMNLP 2008
1. Cheap and Fast - But is it Good?
Evaluating Nonexpert Annotations
for Natural Language Tasks
Rion Snow, Brendan O'Connor, Daniel Jurafsky, Andrew Y. Ng
2. The primacy of data
(Banko and Brill, 2001):
Scaling to Very Very Large Corpora
for Natural Language Disambiguation
3. Datasets drive research
• statistical parsing: Penn Treebank
• semantic role labeling: PropBank
• word sense disambiguation: WordNet, SemCor
• speech recognition: Switchboard
• textual entailment: Pascal RTE
• statistical machine translation: UN Parallel Text
4. The advent of human computation
• Open Mind Common Sense (Singh et al., 2002)
• Games with a Purpose (von Ahn and Dabbish, 2004)
• Online Word Games (Vickrey et al., 2008)
6. Using AMT for dataset creation
• Su et al. (2007): name resolution, attribute extraction
• Nakov (2008): paraphrasing noun compounds
• Kaisser and Lowe (2008): sentence-level QA annotation
• Kaisser et al. (2008): customizing QA summary length
• Zaenen (2008): evaluating RTE agreement
7. Using AMT is cheap
Paper                      Labels   Cents/Label
Su et al. (2007)           10,500   1.5
Nakov (2008)               19,018   unreported
Kaisser and Lowe (2008)    24,321   2.0
Kaisser et al. (2008)      45,300   3.7
Zaenen (2008)               4,000   2.0
9. But is it good?
• Objective: compare nonexpert annotation quality on NLP tasks against
  gold-standard, expert-annotated data
• Method: pick 5 standard datasets and relabel each data point with 10
  new annotations
• Compare Turker agreement against each dataset's reported expert
  interannotator agreement
10. Tasks
• Affect recognition: fear("Tropical storm forms in Atlantic") >
  fear("Goal delight for Sheva")
  Strapparava and Mihalcea (2007)
• Word similarity: sim(boy, lad) > sim(rooster, noon)
  Miller and Charles (1991)
• Textual entailment: if "Microsoft was established in Italy in 1985",
  does "Microsoft was established in 1985" follow?
  Dagan et al. (2006)
• WSD: "a bass on the line" vs. "a funky bass line"
  Pradhan et al. (2007)
• Temporal annotation: does ran happen before fell in
  "The horse ran past the barn fell."?
  Pustejovsky et al. (2003)
13. Interannotator Agreement
Emotion    1-E ITA
Anger      0.459
Disgust    0.583
Fear       0.711
Joy        0.596
Sadness    0.645
Surprise   0.464
Valence    0.844
All        0.603
• 6 total experts.
• One expert's ITA is calculated as the average of Pearson correlations
  from that annotator to the average of the other 5 annotators.
14. Nonexpert ITA
We average over k annotations to create a single "proto-labeler".
We plot the ITA of this proto-labeler for up to 10 annotations and
compare it to the average single-expert ITA.
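The averaging step can be sketched as follows. This is a minimal sketch, assuming ITA here means the Pearson correlation of the k-vote average against a reference score series (e.g. the expert average); the function names and toy data are illustrative, not from the talk.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def proto_labeler_ita(nonexpert_scores, reference, k):
    """Average the first k nonexpert scores per item into a single
    'proto-labeler', then correlate it with the reference scores."""
    proto = [statistics.mean(scores[:k]) for scores in nonexpert_scores]
    return pearson(proto, reference)
```

Sweeping k from 1 to 10 and comparing each value against the single-expert ITA reproduces the shape of the per-emotion curves on the next slide.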
15. Interannotator Agreement
Emotion    1-E ITA   10-N ITA
Anger      0.459     0.675
Disgust    0.583     0.746
Fear       0.711     0.689
Joy        0.596     0.632
Sadness    0.645     0.776
Surprise   0.464     0.496
Valence    0.844     0.669
All        0.603     0.694
(Plots: proto-labeler correlation vs. number of nonexpert annotators, 2-10,
one panel per emotion: anger, disgust, fear, joy, sadness, surprise.)
Number of nonexpert annotators required to match expert ITA, on average: 4
17. Error Analysis: WSD
Only 1 "mistake" out of 177 labels:
"The Egyptian president said he would visit Libya today..."
Semeval Task 17 marks this as the "executive officer of a firm" sense,
while Turkers voted for the "head of a country" sense.
18. Error Analysis: RTE
~10 disagreements out of 100:
• Bob Carpenter: "Over half of the residual disagreements between the
  Turker annotations and the gold standard were of this highly suspect
  nature and some were just wrong."
• Bob Carpenter's full analysis is available at "Fool's Gold Standard",
  http://lingpipe-blog.com/
Close examples:
T: "A car bomb that exploded outside a U.S. military base near Beiji,
   killed 11 Iraqis."
H: "A car bomb exploded outside a U.S. base in the northern town of
   Beiji, killing 11 Iraqis."
Labeled TRUE in PASCAL RTE-1; Turkers vote 6-4 FALSE.

T: "Google files for its long awaited IPO."
H: "Google goes public."
Labeled TRUE in PASCAL RTE-1; Turkers vote 6-4 FALSE.
19. Weighting Annotators
• There are a small number of very prolific, very noisy annotators.
(Plot: per-annotator accuracy [0.4-1.0] vs. number of annotations
[0-800], RTE task.)
• We should be able to do better than majority voting.
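The majority-voting baseline referred to here is just the most common label among an item's annotators. A minimal sketch (ties fall to whichever label was seen first, an arbitrary but deterministic choice):

```python
from collections import Counter

def majority_vote(labels):
    """Baseline aggregation: the most frequent label among the
    annotators of a single item."""
    return Counter(labels).most_common(1)[0][0]
```

The weakness motivating the next slide: every annotator counts equally, so one prolific noisy annotator can outvote a careful one.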
20. Weighting Annotators
• To infer the true value x_i, we weight each response y_i from
  annotator w using a small gold-standard training set.
• We estimate each annotator's response distribution from 5% of the
  gold-standard test set, and evaluate with 20-fold cross-validation.
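One way to realize such a weighting is sketched below. This is an illustrative sketch, not necessarily the talk's exact estimator: annotators are treated as independent, each annotator's response distribution P(y | x) over binary labels is Laplace-smoothed from the small gold subset, and votes are combined as naive-Bayes log-odds under a uniform prior. All names are hypothetical.

```python
import math

def estimate_confusions(gold, responses):
    """Per-annotator P(response | true label) over binary labels {0, 1},
    Laplace-smoothed, estimated from the small gold training set.
    gold: {item: true_label}; responses: {annotator: {item: label}}."""
    conf = {}
    for w, answers in responses.items():
        counts = {(t, y): 1.0 for t in (0, 1) for y in (0, 1)}  # add-one smoothing
        for item, true in gold.items():
            if item in answers:
                counts[(true, answers[item])] += 1.0
        conf[w] = {t: {y: counts[(t, y)] / (counts[(t, 0)] + counts[(t, 1)])
                       for y in (0, 1)}
                   for t in (0, 1)}
    return conf

def weighted_vote(item, responses, conf):
    """Combine one item's responses as naive-Bayes log-odds for label 1.
    Accurate annotators get large weights; random ones near-zero weight."""
    log_odds = 0.0
    for w, answers in responses.items():
        if item in answers:
            y = answers[item]
            log_odds += math.log(conf[w][1][y]) - math.log(conf[w][0][y])
    return 1 if log_odds > 0 else 0
```

An annotator who always answers 1 ends up with P(y=1 | x=1) ≈ P(y=1 | x=0), so their log-odds contribution is near zero, which is exactly the improvement over unweighted majority voting.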
22. Cost Summary
Task                  Labels   Cost (USD)   Time (hrs)   Labels/USD   Labels/Hour
Affect Recognition      7000      $2.00        5.93        3500.0       1180.4
Word Similarity          300      $0.20        0.17        1500.0       1724.1
Textual Entailment      8000      $8.00       89.3         1000.0         89.59
Temporal Annotation     4620     $13.86       39.9          333.3        115.85
WSD                     1770      $1.76        8.59        1005.7        206.1
All                    21690     $25.82      143.9          840.0        150.7
23. In Summary
• All collected data and annotator instructions are available at:
  http://ai.stanford.edu/~rion/annotations
• Summary blog post and comments on the Dolores Labs blog:
  http://blog.doloreslabs.com
25. Training systems on nonexpert annotations
• A simple affect recognition classifier trained on the averaged
  nonexpert votes outperforms one trained on a single expert annotation.
26. Where are Turkers?
United States 77.1%
India 5.3%
Philippines 2.8%
Canada 2.8%
UK 1.9%
Germany 0.8%
Italy 0.5%
Netherlands 0.5%
Portugal 0.5%
Australia 0.4%
Remaining 7.3% divided among 78 countries / territories
Analysis by Dolores Labs
27. Who are Turkers?
(Charts: gender, age, education, annual income)
"Mechanical Turk: The Demographics", Panos Ipeirotis, NYU
behind-the-enemy-lines.blogspot.com
28. Why are Turkers?
A. To Kill Time
B. Fruitful way to spend free time
C. Income purposes
D. Pocket change/extra cash
E. For entertainment
F. Challenge, self-competition
G. Unemployed, no regular job, part-time job
H. To sharpen/ To keep mind sharp
I. Learn English
"Why People Participate on Mechanical Turk, Now Tabulated", Panos Ipeirotis, NYU
behind-the-enemy-lines.blogspot.com
29. How much does AMT pay?
"How Much Turking Pays?", Panos Ipeirotis, NYU
behind-the-enemy-lines.blogspot.com
35. Affect Recognition
We label 100 headlines for each of 7 emotions.
We pay 4 cents for 20 headlines (140 total labels).
Total cost: $2.00
Time to complete: 5.94 hrs
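The arithmetic behind these numbers can be checked directly (assuming 10 annotators per headline, per the 10-annotation design on the earlier method slide):

```python
headlines = 100
emotions = 7
annotators = 10                   # 10 nonexpert annotations per item
hit_price = 0.04                  # 4 cents buys 20 headlines x 7 emotions = 140 labels

hits = (headlines // 20) * annotators           # 5 HITs per annotator, 10 annotators
total_labels = headlines * emotions * annotators
total_cost = hits * hit_price

print(total_labels, total_cost)   # 7000 labels for $2.00
```

These totals match the Affect Recognition row of the cost summary: 7000 labels for $2.00.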
36. Example Task: Word Similarity
30 word pairs (Rubenstein and Goodenough, xxxx)
We pay 10 Turkers 2 cents apiece to score all 30 word pairs.
Total cost: $0.20
Time to complete: 10.4 minutes
38. • Comparison against multiple annotators
• (graphs)
• avg. number of nonexperts : expert = 4
39. Datasets lead the way
WSJ + syntactic annotation = Penn TreeBank => statistical parsing
Brown corpus + sense labeling = SemCor => WSD
TreeBank + role labels = PropBank => SRL
political speeches + translations = UN parallel corpora => statistical
machine translation
more: RTE, TimeBank, ACE/MUC, etc.
40. Datasets drive research
• statistical parsing: Penn Treebank
• semantic role labeling: PropBank
• word sense disambiguation: WordNet, SemCor
• speech recognition: Switchboard
• social network analysis: Enron E-mail Corpus
• statistical MT: UN Parallel Text
• textual entailment: Pascal RTE