University of Sheffield, NLP 
Crowdsourcing Best Practices 
Marta Sabou, Kalina Bontcheva 
Leon Derczynski, Arno Scharl
University of Sheffield, NLP 
The Science of Corpus Annotation 
• Best practice for creating linguistic annotation of consistently high quality, 
by employing, training, and managing groups of linguistic and/or domain 
experts, is now quite well understood 
• Necessary in order to ensure reusability and repeatability of results 
• The resulting corpora are of very high quality 
• Costs are unfortunately also very high: estimated at between $0.36 
and $1.00 per annotation (Zaidan and Callison-Burch, 2011; Poesio et 
al., 2012)
University of Sheffield, NLP 
Goals 
What is crowdsourcing? 
What is a typical workflow for crowdsourcing NLP tasks? 
What are general solutions used by the state of the art? 
How do different crowdsourcing genres compare?
University of Sheffield, NLP
University of Sheffield, NLP 
Undefined and generally large group 
Compared to in-house projects: 
• cheaper (by ~33%) 
• reach a large number of users; 
• reach diverse user groups, 
e.g., speakers of rare languages
University of Sheffield, NLP 
Genre 1: Mechanised Labour 
• Participants (workers) paid a small amount of money to 
complete easy tasks (HIT = Human Intelligence Task)
University of Sheffield, NLP 
Genre 2: Games with a purpose (GWAPs)
University of Sheffield, NLP 
Genre 3: Altruistic Crowdsourcing
University of Sheffield, NLP 
Workflow for Crowdsourcing (Corpora) 
1. Project Definition 
2. Data and UI Preparation 
3. Running the Project 
4. Evaluation & Corpus Delivery
University of Sheffield, NLP
University of Sheffield, NLP 
Definition of semantic relations between concept pairs. 
Coal is a subcategory of Fossil Fuel
University of Sheffield, NLP 
Trade-offs: Cost; Timescale; Worker skills 
Small, simple tasks, fast completion => MLab 
Complex, large tasks, slower completion => GWAP
University of Sheffield, NLP 
• Data distribution: how “micro” is each microtask? 
• Long paragraphs hard to digest, worker fatigue 
• For most NLP tasks: one sentence corresponds to one task (see the splitting sketch below) 
• Single sentences not always appropriate: e.g. for co-ref 
• Task Type 
• Selection task: WSD, sentiment analysis, entity 
disambiguation, relation typing. 
• Sequence marking task: co-reference resolution.
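
The data-distribution step above amounts to cutting documents into small, self-contained units. A minimal sketch of that split, assuming a naive regex sentence splitter and illustrative record fields (doc_id, context); a real project would use a proper sentence splitter and the task schema of its chosen platform:

```python
# Minimal sketch: turning documents into one-sentence microtasks.
# The regex sentence splitter and the record fields (doc_id, context) are
# illustrative assumptions, not any platform's task schema.
import re

def to_microtasks(doc_id, text, context_window=1):
    """Split a document into per-sentence task units, keeping neighbours as optional context."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    tasks = []
    for i, sent in enumerate(sentences):
        context = sentences[max(0, i - context_window):i] + sentences[i + 1:i + 1 + context_window]
        tasks.append({"doc_id": doc_id, "unit_id": i, "text": sent, "context": context})
    return tasks

# Example: one short document becomes two microtasks, each carrying its neighbour as context.
print(to_microtasks("d1", "Coal is a fossil fuel. It is mined worldwide."))
```

Keeping a small context window is one way to make single-sentence units workable for tasks, such as co-reference, that need surrounding text.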
University of Sheffield, NLP 
• Categories per selection-type task: 
• Experts (Hovy, 2010): max 10, ideally 7 
• In crowdsourcing, fewer categories, typically 3-4 
• To reduce cognitive load, focus on one category at a time 
(e.g., one NE type) 
• Number of workers per task: 
• Depends on the subjective nature/complexity of the task 
• Minimum 3, optimally 5 
• Dynamic worker assignment for inconclusive tasks (sketched below) 
• Lawson et al. (2010): the number of required labels varies for different aspects of 
the same NLP problem. Good results with only 4 annotators for Person NEs, 
but 6 are required for Locations and 7 for Organizations
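
One way to realise dynamic worker assignment is to keep requesting judgments until the leading label has a clear margin over the runner-up. A minimal sketch, assuming a hypothetical request_judgment() callable supplied by the job runner (real platforms expose this through their own job settings):

```python
# Sketch of dynamic worker assignment: start with a minimum number of judgments
# and add more only while the decision is still inconclusive.
from collections import Counter

def collect_until_conclusive(request_judgment, min_workers=3, max_workers=7, margin=2):
    """Keep asking for judgments until the top label leads by `margin`, or the cap is hit."""
    labels = [request_judgment() for _ in range(min_workers)]
    while len(labels) < max_workers:
        counts = Counter(labels).most_common(2)
        lead = counts[0][1] - (counts[1][1] if len(counts) > 1 else 0)
        if lead >= margin:
            break
        labels.append(request_judgment())  # ask one more worker
    return Counter(labels).most_common(1)[0][0], labels
```

With min_workers=3 and margin=2, a unanimous first round stops immediately, while a 2-1 split keeps pulling in workers up to the cap of 7.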
University of Sheffield, NLP 
Reward scheme 
• What to reward? - money, game points 
• When to reward? - when work is entered, or after its evaluation 
• How much to reward? 
• Typically between $0.01 and $0.05 per task (5 units) 
• No clear, repeatable results on the quality-to-reward relation 
• Higher rewards get the work done faster, but not better 
• A pilot task gives timings, so pay at least minimum wage (see the wage sketch below) 
• What to do with “bad” work? - detect at run-time and 
exclude
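
To make the "pay at least minimum wage" point concrete, a back-of-the-envelope sketch; the $8/hour target and 20-second timing are made-up illustration values, not recommendations:

```python
def fair_reward_per_task(median_seconds_per_task, hourly_wage=8.0):
    """Convert pilot timings into a per-task reward that meets a target hourly wage."""
    tasks_per_hour = 3600.0 / median_seconds_per_task
    return hourly_wage / tasks_per_hour

# A pilot showing ~20 s per task at an $8/hour target implies roughly $0.044 per task.
print(round(fair_reward_per_task(20), 3))
```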
University of Sheffield, NLP
University of Sheffield, NLP 
Categories: 10 
Players/task: 7 
Payment: points awarded based on previously contributed judgments
University of Sheffield, NLP 
Categories: 10 
Players/task: 10 
Payment: $0.05 per 5 units 
Players filtered through gold data
University of Sheffield, NLP 
Workflow for Crowdsourcing Corpora 
1. Project Definition 
2. Data and UI Preparation 
3. Running the Project 
4. Evaluation & Corpus Delivery
University of Sheffield, NLP 
• Pre-process the corpus linguistically, as needed, e.g. 
• Tokenise text if users need to select words (see the tokenisation sketch below) 
• Identify proper names/noun phrases if we want to classify these 
• Bring in additional context, if needed, e.g. text of a user profile from 
Twitter; link to a Wikipedia page 
• For GWAPs: 
• Collect interesting input data if possible, i.e., texts that are fun to 
read and work on 
• Clean input data to remove errors (these will lower player 
satisfaction) 
• MLab can be used for cleaning the data set
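
A minimal sketch of the tokenisation step, producing character offsets that a selection UI could highlight; the regex tokeniser and field names are simplifying assumptions, not a full NLP pipeline:

```python
# Regex-based tokenisation with character offsets, so a worker-facing UI can
# let contributors click or highlight individual tokens.
import re

def tokenize_with_offsets(text):
    return [{"token": m.group(), "start": m.start(), "end": m.end()}
            for m in re.finditer(r"\w+|[^\w\s]", text)]

print(tokenize_with_offsets("Coal is a fossil fuel."))
```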
University of Sheffield, NLP 
• Build and test the user interfaces 
• Easy to medium difficulty in AMT/CF; templates provided for 
some task types 
• Medium to hard for GWAPs 
• Job management interfaces 
• Provided in MLab platforms 
• Must be built from scratch for GWAPs 
• Comparative interface set-up times: 
• CF: 2 days; Climate Quiz: 2 months 
• OntoPronto: 5 months (Thaler et al., 2012)
University of Sheffield, NLP 
Example: Job Management Interface
University of Sheffield, NLP 
HINT: Add explicitly verifiable 
questions to the UI: 
- help filter out spammers 
- force workers to read the task 
input
University of Sheffield, NLP 
Pilot the design, measure performance, try again 
• Simple, clear design important 
• Binary decision tasks get good results 
Run bigger pilot studies with volunteers to test 
everything and collect gold units for quality control later
University of Sheffield, NLP 
Workflow for Crowdsourcing Corpora 
1. Project Definition 
2. Data and UI Preparation 
3. Running the Project 
4. Evaluation & Corpus Delivery
University of Sheffield, NLP 
Contributor recruitment: 
• MLab - easy, given the platforms’ large worker pools and economic 
incentives 
• GWAPs - challenging, requires much PR. 
• Social-network-based games allow inviting friends, to leverage the viral 
aspect of SNs 
• Multi-channel advertising: local and national press, science websites, 
blogs, bookmarking websites, gaming forums, and social networking 
sites 
Contributor screening (only in MLab): 
• MLab - by country, by skill (e.g., spoken language), by reliability 
• MLab - screening through competency tests; answers to gold units
University of Sheffield, NLP 
IN-TASK QUALITY CONTROL 
Train contributors - through instructions: 
• be clear and concise; 
• avoid technical jargon; 
• provide both positive and negative examples. 
Train contributors - through gold data: 
• CF - known data units (gold units) hidden in tasks 
• When completing a gold unit, a worker is shown the expected answer, thus 
being trained “on the job” 
• Workers who fail a certain percentage of gold units are automatically 
excluded from the job (see the screening sketch below) 
Great opportunity to train workers and amend expert data 
Better gold data means better output quality, for the same cost
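
A sketch of the gold-unit screening idea, assuming an illustrative in-memory layout of judgments rather than any platform's export format:

```python
# Exclude workers whose accuracy on known (gold) units falls below a threshold.
def screen_workers(judgments, gold, min_accuracy=0.7):
    """judgments: {worker_id: {unit_id: label}}; gold: {unit_id: expected_label}."""
    trusted, excluded = set(), set()
    for worker, answers in judgments.items():
        scored = [(answers[u] == label) for u, label in gold.items() if u in answers]
        accuracy = sum(scored) / len(scored) if scored else 0.0
        (trusted if accuracy >= min_accuracy else excluded).add(worker)
    return trusted, excluded
```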
University of Sheffield, NLP 
Example: CF Instructions
University of Sheffield, NLP
University of Sheffield, NLP 
• For large tasks - Multi-batch methodology 
• Submit tasks in multiple batches 
• Ensure contributor diversity by starting batches at different times 
• Needs less gold data 
• Deal with worker disputes!
University of Sheffield, NLP 
Workflow for Crowdsourcing Corpora 
1. Project Definition 
2. Data and UI Preparation 
3. Running the Project 
4. Evaluation & Corpus Delivery
University of Sheffield, NLP 
• Evaluate individual contributor inputs to produce a final decision 
• Majority vote 
• Discard inputs from low-trust contributors (e.g. Hsueh et al. (2009)) 
• Aggregation: 
• Merge individual units from the microtasks (e.g. sentences) into 
complete documents, including all crowdsourced markup 
• Majority voting; averaging; collection 
• Aggregation strategies (both sketched below): 
• Climate Quiz: a relation is chosen for a pair only if it received 4 more 
votes than the next most popular relation 
• CF: majority voting; a confidence value is computed taking into 
account worker accuracy
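
Two aggregation sketches for a single unit's judgments: weighted majority voting with a worker-accuracy-based confidence (a loose approximation of the CF-style idea), and the Climate Quiz margin rule as described above; the data layouts are illustrative assumptions.

```python
from collections import Counter, defaultdict

def weighted_vote(labels_by_worker, worker_accuracy):
    """labels_by_worker: {worker_id: label}; returns (label, confidence in [0, 1])."""
    scores = defaultdict(float)
    for worker, label in labels_by_worker.items():
        scores[label] += worker_accuracy.get(worker, 0.5)  # unknown workers get a neutral weight
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())

def climate_quiz_rule(labels, margin=4):
    """Accept the top relation only if it leads the runner-up by `margin` votes."""
    counts = Counter(labels).most_common(2)
    lead = counts[0][1] - (counts[1][1] if len(counts) > 1 else 0)
    return counts[0][0] if lead >= margin else None
```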
University of Sheffield, NLP 
• Evaluate corpus quality (see the evaluation sketch below): 
• Compute inter-worker agreement 
• Compute agreement between workers and trusted annotators 
• Compare to a gold standard baseline (P/R/F/Acc) 
• To facilitate reuse: 
• deliver the corpus in a widely used format (XCES, CoNLL, GATE XML) 
• share it with the research community
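
A sketch of the evaluation step: simple observed pairwise agreement and precision/recall/F1 against a gold standard (a real study would also use chance-corrected agreement measures such as Cohen's kappa):

```python
def observed_agreement(labels_a, labels_b):
    """Fraction of units on which two label sequences agree."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def precision_recall_f1(predicted, gold, positive_label):
    """P/R/F1 of aggregated crowd labels against gold, for one label of interest."""
    tp = sum(p == positive_label == g for p, g in zip(predicted, gold))
    fp = sum(p == positive_label != g for p, g in zip(predicted, gold))
    fn = sum(g == positive_label != p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```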
University of Sheffield, NLP
University of Sheffield, NLP 
Evaluation of relation selection task: 
Comparison with Gold Standard 
Same data, different aggregation
University of Sheffield, NLP 
Legal and Ethical Issues 
1. Acknowledging the Crowd‘s contribution 
S. Cooper et al., and Foldit players: Predicting protein structures 
with a multiplayer online game. Nature, 466(7307):756-760, 2010. 
2. Ensuring privacy and wellbeing 
1. Mechanised labour criticised for low wages and lack of worker rights 
2. A majority of workers rely on microtasks as their main income source 
3. Prevent prolonged use & user exploitation (e.g. daily caps) 
3. Licensing and consent 
1. Some projects clearly state the use of Creative Commons licences 
2. General failure to provide informed consent information
University of Sheffield, NLP
University of Sheffield, NLP 
Thank you! 
Questions?


Editor's Notes

  1. Crowdsourcing is an emerging collaborative approach for acquiring annotated corpora and a wide range of other linguistic resources. There are three main kinds of crowdsourcing platforms: paid-for marketplaces such as Amazon Mechanical Turk (AMT) and CrowdFlower (CF); games with a purpose; and volunteer-based platforms such as crowdcrafting. Paid-for crowdsourcing can be 33% cheaper than in-house employees when applied to tasks such as tagging and classification (Hoffmann, 2009). Games with a purpose can be even cheaper in the long run, since the players are not paid; however, the cost of implementing a game can be higher than AMT/CF costs for smaller projects (Poesio et al., 2012). Crowdsourcing taps into the large number of contributors/players available across the globe, through the internet, and makes it easy to reach native speakers of various languages (but beware Google Translate cheaters!).
  2. Contributors are extrinsically motivated through economic incentives. Most NLP projects use crowdsourcing marketplaces: Amazon Mechanical Turk and CrowdFlower. Requesters post Human Intelligence Tasks (HITs) to a large population of micro-workers (Callison-Burch and Dredze, 2010a). Snow et al. (2008) collect event and affect annotations, while Lawson et al. (2010) and Finin et al. (2010) annotate special types of texts such as emails and Twitter feeds, respectively. Challenges: low-quality output due to the workers’ purely economic motivation; high costs for large tasks (Parent and Eskenazi, 2011); ethical issues (Fort et al., 2011).
  3. In GWAPs (von Ahn and Dabbish, 2008), contributors carry out annotation tasks as a side effect of playing a game. Example GWAPs: Phratris for annotating syntactic dependencies (Attardi, 2010); PhraseDetectives (Poesio et al., 2012) to acquire anaphora annotations; Sentiment Quiz (Scharl et al., 2012) to annotate sentiment; http://www.wordrobe.org/, a collection of NLP games incl. POS and NE. Challenges: designing appealing games and attracting a critical mass of players are among the key success factors within this genre (Wang et al., 2012). In 2008, the group built a Facebook game that required players to rate the sentiment associated with a sentence on a 5-value scale, then used this as a training corpus for the sentiment detection module. Over 800 players played the game. In 2009 the game was released in a slightly different form, with the aim of gathering sentiment lexicons, i.e., associations between words and their sentiment polarity (ratings from as many as 12 players were averaged to get the final value). The game ran in 7 different languages and attracted over 4,000 players. Let this be an introductory example of a crowdsourcing project; crowdsourcing is, however, not a new phenomenon.
  4. Volunteers contribute because they are interested in a domain or support a cause.
  5. Compared to paid-for marketplaces, GWAPs: reduce costs and the incentive to cheat, as players are intrinsically motivated; promise superior results, due to motivated players and better utilization of sporadic, explorer-type users (Parent and Eskenazi, 2011). Few papers compare the two genres, and most of those offer only “theoretical”/survey-based comparisons.
  6. Climate Quiz is a GWAP deployed over the Facebook social networking platform. It is focused on acquiring factual knowledge in the domain of climate change. The game is coupled with an ontology learning algorithm, as follows. The ontology learning algorithm extracts terms from unstructured and structured data sources. The term pairs that are most likely related based on the algorithm’s input data sources are subsequently sent to Climate Quiz, where players assign relations to each pair. These relations are fed back into the algorithm, which uses them to refine the learned ontology and to derive new term pairs that should be connected. As depicted here, Climate Quiz asks players to evaluate whether two concepts presented by the system are related (e.g. environmental activism, activism), and which label is the most appropriate to describe this relation (e.g. is a sub-category of). Players can assign one of eight relations, three of which are generic (is a sub-category of, is identical to, is the opposite of), whereas five are domain-specific (opposes, supports, threatens, influences, works on/with). Two further relations, “other” and “is not related to”, were added for cases not covered by the previous eight relations. The game’s interface allows players to switch the position of the two concepts or to skip ambiguous pairs.
  7. In order to allow a comparative analysis of the two HC genres, a mechanised labour version of Climate Quiz was created on the CrowdFlower (CF) platform. In addition to the game interface, two verification questions were added to “force” the contributors to read the terms before selecting a random relation.
  8. A crowdsourcing project can run for hours, days, or years, depending on genre and size.
  9. Quality is measured in terms of agreement with a gold standard. Note: depending on how the raw input from CF is aggregated, the results are very different. In particular, the aggregation mechanism of Climate Quiz (the highest-scored relation must have 4 more votes than the second-scored relation) leads to worse results than when the aggregation methods of CF are used (these take worker performance into account during majority voting).
  10. Our findings verify experimentally all the differences between the two genres that the literature-based study identified. Additionally, thanks to the experimental approach, we have concrete details about the actual values of some of the parameters. For those aspects where earlier studies disagree, we found that: 1) with the appropriate aggregation method, MLab results can be as good as those obtained with games, at least for the task in question; 2) worker diversity is higher in GWAPs.