Presentation as part of the "Social Media Annotation" Tutorial at ISWC2014. Content: What is crowdsourcing? What are typical steps taken when crowdsourcing the creation of training and verification corpora? What is the state of the art in performing these steps? How do these steps differ between mechanised labour and GWAPs?
1. University of Sheffield, NLP
Crowdsourcing Best Practices
Marta Sabou, Kalina Bontcheva
Leon Derczynski, Arno Scharl
2. University of Sheffield, NLP
The Science of Corpus Annotation
• Quite well understood best practice in how to create linguistic
annotation of consistently high quality by employing, training, and
managing groups of linguistic and/or domain experts
• Necessary in order to ensure reusability and repeatability of results
• The acquired corpora are of very high quality
• Costs are unfortunately also very high: estimated at between $0.36
and $1.0 per annotation (Zaidan and Callison-Burch, 2011; Poesio et
al., 2012)
3. University of Sheffield, NLP
Goals
What is crowdsourcing?
What is a typical workflow for crowdsourcing NLP tasks?
What are general solutions used by the state of the art?
How do different crowdsourcing genres compare?
5. University of Sheffield, NLP
An undefined and generally large group
Compared to in-house projects:
• cheaper (by about 33%)
• reaches a large number of users
• reaches diverse user groups,
e.g., speakers of rare languages
6. University of Sheffield, NLP
Genre 1: Mechanised Labour
• Participants (workers) paid a small amount of money to
complete easy tasks (HIT = Human Intelligence Task)
9. University of Sheffield, NLP
Workflow for Crowdsourcing (Corpora)
1. Project Definition
2. Data and UI Preparation
3. Running the Project
4. Evaluation & Corpus Delivery
11. University of Sheffield, NLP
Definition of semantic relations between concept pairs.
Coal is a sub-category of Fossil Fuel
12. University of Sheffield, NLP
Trade-offs: Cost; Timescale; Worker skills
Small, simple tasks, fast completion => MLab
Complex, large tasks, slower completion => GWAP
13. University of Sheffield, NLP
• Data distribution: how “micro” is each microtask?
• Long paragraphs hard to digest, worker fatigue
• For most NLP tasks: one sentence corresponds to one task (see the sketch below)
• Single sentences not always appropriate: e.g. for co-reference
• Task Type
• Selection task: WSD, sentiment analysis, entity
disambiguation, relation typing.
• Sequence marking task: co-reference resolution.
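To make the "one sentence = one task" split concrete, here is a minimal Python sketch; it assumes NLTK with the punkt sentence model, and the dictionary fields are illustrative, not any platform's schema:

```python
# Minimal sketch: turn a document into one-sentence microtasks,
# keeping provenance so units can be re-assembled after annotation.
# Assumes: nltk is installed and nltk.download("punkt") has been run.
from nltk.tokenize import sent_tokenize

def make_microtasks(doc_id, text):
    for i, sentence in enumerate(sent_tokenize(text)):
        yield {"doc_id": doc_id, "sent_id": i, "text": sentence}

tasks = list(make_microtasks("doc-42",
    "Coal is a fossil fuel. It forms over millions of years."))
```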
14. University of Sheffield, NLP
• Categories per selection-type task:
• Experts (Hovy, 2010): max 10, ideally 7
• In crowdsourcing fewer categories, typically 3-4
• To reduce cognitive load, focus on one category at a time
(e.g., one NE type)
• Number of workers per task:
• Depends on the subjectivity and complexity of the task
• Minimum 3, optimally 5
• Dynamic worker assignment for inconclusive tasks (see the sketch below)
• Lawson et al. (2010): the number of required labels varies for different aspects of
the same NLP problem. Good results with only 4 annotators for Person NEs,
but 6 were needed for Locations and 7 for Organizations
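A minimal sketch of dynamic worker assignment using a simple margin rule; the thresholds (minimum 3, maximum 7, margin of 2) are illustrative values, not prescribed by any platform:

```python
from collections import Counter

def needs_more_judgments(labels, min_workers=3, max_workers=7, margin=2):
    # Dynamic assignment: start with min_workers judgments, then keep
    # assigning extra workers while the top label's lead over the
    # runner-up is below `margin`, capped at max_workers.
    if len(labels) < min_workers:
        return True
    if len(labels) >= max_workers:
        return False
    ranked = Counter(labels).most_common(2)
    lead = ranked[0][1] - (ranked[1][1] if len(ranked) > 1 else 0)
    return lead < margin
```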
15. University of Sheffield, NLP
Reward scheme
• What to reward? - money, game points
• When to reward? - when work entered or after its evaluation
• How much to reward?
• Typically between $0.01 and $0.05 per task (5 units)
• No clear, repeatable results on the quality:reward relation
• High rewards get work done faster, but not better
• A pilot task gives timings, so pay at least minimum wage (see the sketch below)
• What to do with “bad” work? - detect at run-time and
exclude
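As a back-of-the-envelope check, a pilot's median task time can be turned into a per-task reward; the $7.25/hour figure below is just an example minimum wage, not a recommendation:

```python
import math

def reward_per_task(median_seconds_per_task, hourly_min_wage=7.25):
    # Round up to the next cent so a typical worker earns at least
    # the assumed hourly minimum wage.
    cents = math.ceil(hourly_min_wage * 100 * median_seconds_per_task / 3600)
    return cents / 100

print(reward_per_task(20))  # a 20-second task => $0.05, within the typical range
```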
17. University of Sheffield, NLP
Categories: 10
Players/task: 7
Payment: points awarded based on previously contributed judgments
18. University of Sheffield, NLP
Categories: 10
Players/task: 10
Payment: $0.05 per 5 units
Players filtered through gold data
19. University of Sheffield, NLP
Workflow for Crowdsourcing Corpora
1. Project Definition
2. Data and UI Preparation
3. Running the Project
4. Evaluation & Corpus Delivery
20. University of Sheffield, NLP
• Pre-process the corpus linguistically, as needed, e.g.
• Tokenise text if users need to select words (see the sketch below)
• Identify proper names/noun phrases if we want to classify these
• Bring in additional context, if needed, e.g. text of the user profile from
Twitter; a link to the Wikipedia page
• For GWAPs:
• Collect interesting input data if possible, i.e., texts that are fun to
read and work on
• Clean input data to remove errors (these will lower player
satisfaction)
• MLab can be used for cleaning the data set
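A minimal pre-processing sketch, again assuming NLTK as the toolkit; the unit schema and the context-URL field are illustrative:

```python
# Sketch: tokenise each unit so the UI can render selectable words,
# and attach optional extra context (e.g. a Twitter profile or a
# Wikipedia link). Assumes nltk with its tokenizer models downloaded.
from nltk import word_tokenize

def prepare_unit(sentence, context_url=None):
    return {"text": sentence,
            "tokens": word_tokenize(sentence),
            "context": context_url}
```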
21. University of Sheffield, NLP
• Build and test the user interfaces
• Easy to medium difficulty in AMT/CF; templates provided for
some task types
• Medium to hard for GWAPs
• Job management interfaces
• Provided in MLab platforms
• Must be built from scratch for GWAPs
• Comparative interface set-up times:
• CF: 2 days; Climate Quiz: 2 months
• OntoPronto (Thaler et al., 2012): 5 months
23. University of Sheffield, NLP
HINT: Add explicitly verifiable questions to the UI:
- help filter out spammers
- force workers to read the task input
24. University of Sheffield, NLP
Pilot the design, measure performance, try again
• Simple, clear design important
• Binary decision tasks get good results
Run bigger pilot studies with volunteers to test
everything and collect gold units for quality control later
25. University of Sheffield, NLP
Workflow for Crowdsourcing Corpora
1. Project Definition
2. Data and UI Preparation
3. Running the Project
4. Evaluation & Corpus Delivery
26. University of Sheffield, NLP
Contributor recruitment:
• MLab - easy, given the platforms’ large worker pools and economic
incentives
• GWAPs - challenging, requires much PR
• Social-network-based games allow inviting friends, to leverage the viral
aspect of SNs
• Multi-channel advertisement: local and national press, science websites,
blogs, bookmarking websites, gaming forums, and social networking sites
Contributor screening (MLab only):
• by country, by skill (e.g., spoken language), by reliability
• through competency tests; answers to gold units
27. University of Sheffield, NLP
IN-TASK QUALITY CONTROL
Train contributors - through instructions:
• be clear and concise;
• avoid technical jargon;
• provide both positive and negative examples.
Train contributors - through gold data:
• CF - known data units (gold units) hidden in tasks
• When completing a gold unit, a worker is shown the expected answer thus
being trained “on the job”
• Workers who fail a certain percentage of gold units are automatically
excluded from the job (see the sketch below)
Great opportunity to train workers and amend expert data
Better gold data means better output quality, for the same cost
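A sketch of gold-unit screening under the assumption of a fixed accuracy threshold; the 70% cut-off and the minimum of 4 gold units seen are illustrative, since real platforms let requesters configure these:

```python
class GoldScreen:
    # Track each worker's accuracy on hidden gold units and flag workers
    # who fall below a threshold for exclusion from the job.
    def __init__(self, min_accuracy=0.7, min_gold_seen=4):
        self.min_accuracy = min_accuracy
        self.min_gold_seen = min_gold_seen
        self.stats = {}  # worker_id -> (correct, total)

    def record(self, worker_id, answer, gold_answer):
        correct, total = self.stats.get(worker_id, (0, 0))
        self.stats[worker_id] = (correct + (answer == gold_answer), total + 1)

    def is_excluded(self, worker_id):
        correct, total = self.stats.get(worker_id, (0, 0))
        return total >= self.min_gold_seen and correct / total < self.min_accuracy
```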
30. University of Sheffield, NLP
• For large tasks - Multi-batch methodology
• Submit tasks in multiple batches (see the sketch below)
• Ensure contributor diversity by starting batches at different times
• Needs less gold data
• Deal with worker disputes!
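The batching itself is simple; the point is launching the batches at staggered times. A sketch, with the batch size as an assumed parameter:

```python
def batches(tasks, batch_size=500):
    # Split a large job into batches that can be launched at different
    # times (of day, of week) to broaden the contributor pool; gold units
    # can be reused across batches, so less gold data is needed overall.
    for i in range(0, len(tasks), batch_size):
        yield tasks[i:i + batch_size]
```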
31. University of Sheffield, NLP
Workflow for Crowdsourcing Corpora
1. Project Definition
2. Data and UI Preparation
3. Running the Project
4. Evaluation & Corpus Delivery
32. University of Sheffield, NLP
• Evaluate individual contributor inputs to produce final decision
• Majority vote
• Discard inputs from low-trusted contributors (e.g. Hsueh et al. (2009))
• Aggregation:
• Merge individual units from the microtasks (e.g. sentences) into
complete documents, including all crowdsourced markup
• Majority voting; average; collection
• Aggregation strategies (sketched below):
• Climate Quiz: a relation is chosen for a pair only if it received at
least 4 more votes than the next most popular relation
• CF - majority voting; a confidence value computed taking
worker accuracy into account
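The strategies above as a minimal Python sketch; the margin of 4 comes from the Climate Quiz rule, while the accuracy-weighted vote is only an approximation of CF's confidence computation, not its actual implementation:

```python
from collections import Counter

def majority_vote(labels):
    # Plain majority vote over worker labels.
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(judgments):
    # CF-style approximation: weight each label by the worker's gold
    # accuracy; `judgments` is a list of (label, worker_accuracy) pairs.
    scores = Counter()
    for label, accuracy in judgments:
        scores[label] += accuracy
    label, score = scores.most_common(1)[0]
    return label, score / sum(scores.values())  # confidence in (0, 1]

def margin_vote(labels, margin=4):
    # Climate-Quiz rule: accept the top relation only if it leads the
    # runner-up by at least `margin` votes; otherwise leave undecided.
    ranked = Counter(labels).most_common(2)
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin:
        return ranked[0][0]
    return None
```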
33. University of Sheffield, NLP
• Evaluate corpus quality (see the sketch below)
• Compute inter-worker agreement
• Compute agreement between workers and trusted annotators
• Compare to a gold-standard baseline (P/R/F/Accuracy)
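Both metrics are straightforward to compute; a sketch using simple pairwise observed agreement (a chance-corrected measure such as Fleiss' kappa would be a natural refinement) and per-class P/R/F against gold:

```python
from itertools import combinations

def pairwise_agreement(label_sets):
    # Observed inter-worker agreement: fraction of agreeing label pairs
    # per item, averaged over items with at least two labels.
    ratios = []
    for labels in label_sets:
        pairs = list(combinations(labels, 2))
        if pairs:
            ratios.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(ratios) / len(ratios)

def precision_recall_f1(predicted, gold, positive):
    # P/R/F against a gold standard for one class of interest.
    tp = sum(p == positive and g == positive for p, g in zip(predicted, gold))
    fp = sum(p == positive and g != positive for p, g in zip(predicted, gold))
    fn = sum(p != positive and g == positive for p, g in zip(predicted, gold))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```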
• To facilitate reuse:
• deliver the corpus in a widely used format (XCES, CoNLL, GATE XML)
• share it with the research community
35. University of Sheffield, NLP
Evaluation of relation selection task:
Comparison with Gold Standard
Same data, different aggregation
36. University of Sheffield, NLP
Legal and Ethical Issues
1. Acknowledging the Crowd‘s contribution
S. Cooper, [other authors], and Foldit players: Predicting protein structures
with a multiplayer online game. Nature, 466(7307):756-760, 2010.
2. Ensuring privacy and wellbeing
1. Mechanised labour criticised for low wages and lack of worker rights
2. Majority of workers rely on microtasks as main income source
3. Prevent prolonged use & user exploitation (e.g. daily caps)
3. Licensing and consent
1. Some clearly state the use of Creative Commons licenses
2. General failure to provide informed-consent information
Crowdsourcing is an emerging collaborative approach for acquiring annotated corpora and a wide range of other linguistic resources
Three main kinds of crowdsourcing platforms
paid-for marketplaces such as Amazon Mechanical Turk (AMT) and CrowdFlower (CF)
games with a purpose
volunteer-based platforms such as crowdcrafting
Paid for crowdsourcing can be 33% cheaper than in-house employees when applied to tasks such as tagging and classification (Hoffmann, 2009)
Games with a purpose can be even cheaper in the long run, since the players are not paid.
However, the cost of implementing a game can be higher than AMT/CF costs for smaller projects (Poesio et al., 2012)
Tap into the large number of contributors/players available across the globe, through the internet
Easy to reach native speakers in various languages (but beware Google translate cheaters!)
Contributors are extrinsically motivated through economic incentives
Most NLP projects use crowdsourcing marketplaces: Amazon Mechanical Turk and CrowdFlower
Requesters post Human Intelligence Tasks (HITs) to a large population of micro-workers (Callison-Burch and Dredze, 2010a)
Snow et al. (2008) collect event and affect annotations, while Lawson et al. (2010) and Finin et al. (2010) annotate special types of texts such as emails and Twitter feeds, respectively.
Challenges:
low quality output due to the workers’ purely economic motivation
high costs for large tasks (Parent and Eskenazi, 2011)
ethical issues (Fort et al., 2011)
In GWAPs (von Ahn and Dabbish, 2008), contributors carry out annotation tasks as a side effect of playing a game
Example GWAPs:
Phratris for annotating syntactic dependencies (Attardi, 2010)
PhraseDetectives (Poesio et al.,2012) to acquire anaphora annotations
Sentiment Quiz (Scharl et al., 2012) to annotate sentiment
http://www.wordrobe.org/ - A collection of NLP games incl. POS, NE
Challenges:
Designing appealing games and attracting a critical mass of players are among the key success factors within this genre (Wang et al., 2012)
In 2008, the group built a FB game that required players to rate the sentiment associated with a sentence on a 5-value scale, then used this as a training corpus for the sentiment detection module. Over 800 players played the game.
In 2009 the game was released in a slightly different form, with the aim of gathering sentiment lexicons, i.e., associations between words and their sentiment polarity (ratings from as many as 12 players were averaged to get the final value). The game ran in 7 different languages and attracted over 4,000 players.
Let this serve as an introductory example of a crowdsourcing project; crowdsourcing is, however, not a new phenomenon.
Volunteers contribute because they are interested in a domain or support a cause
Compared to paid-for marketplaces, GWAPs:
reduce costs and the incentive to cheat as players are intrinsically motivated
promise superior results, due to motivated players and better utilization of sporadic, explorer-type users (Parent and Eskenazi, 2011)
Few papers exist, and most of those offer “theoretical”/survey-based comparisons.
Climate Quiz is a GWAP deployed over the Facebook social networking platform. It is focused on acquiring factual knowledge in the domain of climate change.
The game is coupled with an ontology learning algorithm, as follows. The ontology learning algorithm extracts terms from unstructured and structured data sources. The term pairs that are most likely related based on the algorithm’s input data sources are subsequently sent to Climate Quiz, where players assign relations to each pair. These relations are fed back into the algorithm which uses them to refine the learned ontology and to derive new term pairs that should be connected.
As depicted here, Climate Quiz asks players to evaluate whether two concepts presented by the system are related (e.g. environmental activism, activism), and which label is the most appropriate to describe this relation (e.g. is a sub-category of). Players can assign one of eight relations, three of which are generic (is a sub-category of, is identical to, is the opposite of), whereas five are domain-specific (opposes, supports, threatens, influences, works on/with). Two further relations, “other” and “is not related to”, were added for cases not covered by the previous eight relations. The game’s interface allows players to switch the position of the two concepts or to skip ambiguous pairs.
In order to allow the comparative analysis of the two HC genres, a mechanised labour version of Climate Quiz was created on the CrowdFlower (CF) platform.
In addition to the game interface, two verification questions were added to “force” the contributors to read the terms before selecting a random relation.
Can run for hours, days or years, depending on genre and size
Quality is measured in terms of agreement with a gold standard.
Note: depending on how the raw input from CF is aggregated, the results are very different. In particular, the aggregation mechanism of CQ (the highest-scoring relation must have at least 4 more votes than the second-highest) leads to worse results than the aggregation methods of CF (which take worker performance into account during majority voting).
Our findings experimentally verify all the differences between the two genres that the literature-based study identified. Additionally, thanks to the experimental approach, we have concrete details about the actual values of some of the parameters.
For those aspects where earlier studies disagree, we found that:
1) With the appropriate aggregation method, MLab results can be as good as those obtained with games, at least for the task in question
2) Worker diversity is higher in GWAPs