This document provides an overview of crowdsourcing and human computation. It begins with examples of using Amazon Mechanical Turk for basic tasks such as labeling data, then shows how crowdsourcing supports more complex applications, covering incentive design, quality control, and platform selection. It closes with guidance on task design, experiment workflow, and usability considerations for effective crowdsourcing.
1. Matt Lease
School of Information, University of Texas at Austin
ml@ischool.utexas.edu | @mattlease
Crowdsourcing & Human Computation
Labeling Data & Building Hybrid Systems
Slides: www.slideshare.net/mattlease
2. Roadmap
• A Quick Example
• Crowd-powered data collection & applications
• Crowdsourcing, Incentives, & Demographics
• Mechanical Turk & Other Platforms
• Designing for Crowds & Statistical QA
• Open Problems
• Broader Considerations & a Darker Side
3. What is Crowdsourcing?
• Let’s start with a simple example!
• Goal
– See a concrete example of real crowdsourcing
– Ground later discussion of abstract concepts
– Provide a specific example with which we will
contrast other forms of crowdsourcing
7. Traditional Data Collection
• Setup data collection software / harness
• Recruit participants / annotators / assessors
• Pay a flat fee for experiment or hourly wage
• Characteristics
– Slow
– Expensive
– Difficult and/or Tedious
– Sample Bias…
8. “Hello World” Demo
• Let’s create and run a simple MTurk HIT
• This is a teaser highlighting concepts
– Don’t worry about details; we’ll revisit them
• Goal
– See a concrete example of real crowdsourcing
– Ground our later discussion of abstract concepts
– Provide a specific example with which we will
contrast other forms of crowdsourcing
11. NLP: Snow et al. (EMNLP 2008)
• MTurk annotation for 5 Tasks
– Affect recognition
– Word similarity
– Recognizing textual entailment
– Event temporal ordering
– Word sense disambiguation
• 22K labels for US $26
• High agreement between
consensus labels and
gold-standard labels
13. IR: Alonso et al. (SIGIR Forum 2008)
• MTurk for Information Retrieval (IR)
– Judge relevance of search engine results
• Many follow-on studies (design, quality, cost)
14. User Studies: Kittur, Chi, & Suh (CHI 2008)
• “…make creating believable invalid responses as
effortful as completing the task in good faith.”
15. Remote Usability Testing
• Liu, Bias, Lease, and Kuipers, ASIS&T, 2012
• Remote usability testing via MTurk & CrowdFlower
vs. traditional on-site testing
• Advantages
– More (Diverse) Participants
– High Speed
– Low Cost
• Disadvantages
– Lower Quality Feedback
– Less Interaction
– Greater need for quality control
– Less Focused User Groups
17. Human Subjects Research:
Surveys, Demographics, etc.
• A Guide to Behavioral Experiments
on Mechanical Turk
– W. Mason and S. Suri (2010). SSRN online.
• Crowdsourcing for Human Subjects Research
– L. Schmidt (CrowdConf 2010)
• Crowdsourcing Content Analysis for Behavioral Research:
Insights from Mechanical Turk
– Conley & Tosti-Kharas (2010). Academy of Management
• Amazon's Mechanical Turk: A New Source of
Inexpensive, Yet High-Quality, Data?
– M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
– see also: Amazon Mechanical Turk Guide for Social Scientists
18. Luis von Ahn (CMU)
• PhD Thesis, December 2005
• Law & von Ahn, Book, June 2011
19. ESP Game (Games With a Purpose)
L. von Ahn and L. Dabbish (2004)
23. Crowd Sensing
• Steve Kelling, et al. A Human/Computer Learning
Network to Improve Biodiversity Conservation
and Research. AI Magazine 34.1 (2012): 10.
24. Tracking Sentiment in Online Media
Brew et al., PAIS 2010
• Volunteer-crowd
• Judge in exchange for
access to rich content
• Balance system needs
with user interest
• Daily updates to non-stationary distribution
25. PHASE 2: FROM DATA COLLECTION
TO HUMAN COMPUTATION
27. Human Computation
• What was old is new (When Computers Were Human, Princeton University Press, 2005)
• Crowdsourcing: A New Branch of Computer Science
– D.A. Grier, March 29, 2011
• Tabulating the heavens: computing the Nautical Almanac in 18th-century England
– M. Croarken, 2003
28. The Mechanical Turk
Constructed and unveiled in 1770 by Wolfgang von Kempelen (1734–1804)
• J. Pontin. Artificial Intelligence, With Help From the Humans. New York Times (March 25, 2007)
30. Human Computation
• Having people do stuff instead of computers
• Investigates use of people to execute certain
computations for which capabilities of current
automated methods are more limited
• Explores the metaphor of computation for
characterizing attributes, capabilities, and
limitations of human task performance
43. From Outsourcing to Crowdsourcing
• Take a job traditionally
performed by a known agent
(often an employee)
• Outsource it to an undefined,
generally large group of
people via an open call
• New application of principles
from open source movement
• Evolving & broadly defined ...
44. Crowdsourcing models
• Micro-tasks & citizen science
• Co-Creation
• Open Innovation, Contests
• Prediction Markets
• Crowd Funding and Charity
• “Gamification” (not serious gaming)
• Transparent
• cQ&A, Social Search, and Polling
• Physical Interface/Task
45. What is Crowdsourcing?
• Mechanisms and methodology for directing
crowd action to achieve some goal(s)
– E.g., novel ways of collecting data from crowds
• Powered by internet-connectivity
• Related topics:
– Human computation
– Collective intelligence
– Crowd/Social computing
– Wisdom of Crowds
– People services, Human Clouds, Peer-production, …
46. What is not crowdsourcing?
• Analyzing existing datasets (no matter source)
– Data mining
– Visual analytics
• Use of few people
– Mixed-initiative design
– Active learning
• Conducting a survey or poll… (*)
– Novelty?
47. Crowdsourcing Key Questions
• What are the goals?
– Purposeful directing of human activity
• How can you incentivize participation?
– Incentive engineering
– Who are the target participants?
• Which model(s) are most appropriate?
– How to adapt them to your context and goals?
48. Wisdom of Crowds (WoC)
Requires
• Diversity
• Independence
• Decentralization
• Aggregation
Input: large, diverse sample
(to increase likelihood of overall pool quality)
Output: consensus or selection (aggregation)
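The aggregation step above can be illustrated with a toy simulation (ours, not from the slides): averaging many independent, noisy guesses tends to land closer to the truth than a small sample does.

```python
import random

def crowd_estimate(truth, n_workers, noise=25.0, seed=0):
    """Average n_workers independent guesses, each truth + Gaussian noise."""
    rng = random.Random(seed)
    guesses = [truth + rng.gauss(0, noise) for _ in range(n_workers)]
    return sum(guesses) / len(guesses)

# The error of the mean shrinks roughly as 1/sqrt(n) when errors are
# independent -- the statistical core of "wisdom of crowds".
estimate = crowd_estimate(truth=100.0, n_workers=2000)
```

Note which assumptions do the work here: the diversity and independence requirements above are exactly what keeps individual errors from being correlated.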
49. What do you want to accomplish?
• Create
• Execute task/computation
• Fund
• Innovate and/or discover
• Learn
• Monitor
• Predict
51. Why should your crowd participate?
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige (leaderboards, badges)
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource
Multiple incentives can often operate in parallel (*caveat)
52. Example: Wikipedia
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource
53. Example: DuoLingo
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource
54. Example:
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource
55. Example: ESP
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource
56. Example: fold.it
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource
57. Example: FreeRice
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource
58. Example: cQ&A
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource
59. Example: reCaptcha
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource
Is there an existing human
activity you can harness
for another purpose?
60. Example: Mechanical Turk
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource
61. Dan Pink – YouTube video
“The Surprising Truth about what Motivates us”
62. Who are the workers?
• A. Baio, November 2008. The Faces of Mechanical Turk.
• P. Ipeirotis. March 2010.
The New Demographics of Mechanical Turk
• J. Ross, et al. Who are the Crowdworkers?... CHI 2010.
63. MTurk Demographics
• 2008-2009 studies found
less global and diverse
than previously thought
– US
– Female
– Educated
– Bored
– Money is secondary
64. 2010 shows increasing diversity
47% US, 34% India, 19% other (P. Ipeirotis, March 2010)
65. How Much to Pay?
• Price commensurate with task effort
– Ex: $0.02 for yes/no answer + $0.02 bonus for optional feedback
• Ethics & market-factors: W. Mason and S. Suri, 2010.
– e.g. non-profit SamaSource involves workers in refugee camps
– Predict right price given market & task: Wang et al. CSDM’11
• Uptake & time-to-completion vs. Cost & Quality
– Too little $$, no interest or slow – too much $$, attract spammers
– Real problem is lack of reliable QA substrate
• Accuracy & quantity
– More pay = more work, not better (W. Mason and D. Watts, 2009)
• Heuristics: start small, watch uptake and bargaining feedback
• Worker retention (“anchoring”)
See also: L.B. Chilton et al. KDD-HCOMP 2010.
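One way to make "price commensurate with task effort" concrete is to back out a per-HIT price from an estimated task time and a target effective hourly wage (an illustrative heuristic of ours, not an MTurk feature):

```python
def price_per_hit(seconds_per_hit, target_hourly_wage):
    """Per-HIT reward implied by a target effective hourly wage."""
    return round(target_hourly_wage * seconds_per_hit / 3600.0, 2)

# 30-second yes/no judgments at a $6/hour target:
price_per_hit(30, 6.0)   # -> 0.05
```

A small pilot run supplies the per-HIT timing; watch uptake and worker feedback before scaling.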
67. Does anyone really use it? Yes!
http://www.mturk-tracker.com (P. Ipeirotis’10)
From 1/09 – 4/10, 7M HITs from 10K requesters
worth $500,000 USD (a significant under-estimate)
68. MTurk: The Requester
• Sign up with your Amazon account
• Amazon payments
• Purchase prepaid HITs
• There is no minimum or up-front fee
• MTurk collects a 10% commission
• The minimum commission charge is $0.005 per HIT
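Using the fee schedule quoted on this slide (10% commission with a $0.005 minimum per assignment; Amazon's current rates may differ), a budget estimate is straightforward:

```python
def requester_cost(reward, n_assignments):
    """Total cost of one HIT: reward plus commission per assignment."""
    fee = max(0.10 * reward, 0.005)   # 10% commission, $0.005 minimum
    return n_assignments * (reward + fee)

# Five $0.02 assignments cost 5 * (0.02 + 0.005) = $0.125
budget = requester_cost(reward=0.02, n_assignments=5)
```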
69. MTurk Dashboard
• Three tabs
– Design
– Publish
– Manage
• Design
– HIT Template
• Publish
– Make work available
• Manage
– Monitor progress
72. MTurk API
• Amazon Web Services API
• Rich set of services
• Command line tools
• More flexibility than dashboard
73. MTurk Dashboard vs. API
• Dashboard
– Easy to prototype
– Setup and launch an experiment in a few minutes
• API
– Ability to integrate AMT as part of a system
– Ideal if you want to run experiments regularly
– Schedule tasks
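As a sketch of the API route (assuming boto3 as the AWS client library; the task text and numbers are invented), the request for MTurk's CreateHIT operation can be assembled in code and reused across experiments:

```python
def build_hit_params(title, description, reward_usd, assignments, question_xml):
    """Assemble keyword arguments for the MTurk CreateHIT operation."""
    return {
        "Title": title,
        "Description": description,
        "Reward": f"{reward_usd:.2f}",    # MTurk expects the reward as a string
        "MaxAssignments": assignments,    # redundant labels per item
        "AssignmentDurationInSeconds": 600,
        "LifetimeInSeconds": 86400,
        "Question": question_xml,
    }

params = build_hit_params("Judge relevance", "Is this page relevant?",
                          0.05, 5, "<QuestionForm>...</QuestionForm>")
# With AWS credentials configured, the actual call would be:
# boto3.client("mturk").create_hit(**params)
```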
83. Typical Workflow
• Define and design what to test
• Sample data
• Design the experiment
• Run experiment
• Collect data and analyze results
• Quality control
84. Development Framework
• Incremental approach (from Omar Alonso)
• Measure, evaluate, and adjust as you go
• Suitable for repeatable tasks
85. Survey Design
• One of the most important parts
• Part art, part science
• Instructions are key
• Prepare to iterate
86. Questionnaire Design
• Ask the right questions
• Workers may not be IR experts, so don't assume they share your terminology
• Show examples
• Hire a technical writer
– Engineer writes the specification
– Writer communicates
87. UX Design
• Time to apply all those usability concepts
• Generic tips
– Experiment should be self-contained.
– Keep it short and simple. Brief and concise.
– Be very clear with the relevance task.
– Engage with the worker. Avoid boring stuff.
– Always ask for feedback (open-ended question) in
an input box.
88. UX Design - II
• Presentation
• Document design
• Highlight important concepts
• Colors and fonts
• Need to grab attention
• Localization
89. Implementation
• Similar to a UX study
• Build a mock up and test it with your team
– Yes, you need to judge some tasks
• Incorporate feedback and run a test on MTurk
with a very small data set
– Time the experiment
– Do people understand the task?
• Analyze results
– Look for spammers
– Check completion times
• Iterate and modify accordingly
90. Implementation – II
• Introduce quality control
– Qualification test
– Gold answers (honey pots)
• Adjust passing grade and worker approval rate
• Run experiment with new settings & same data
• Scale on data
• Scale on workers
91. Other design principles
• Text alignment
• Legibility
• Reading level: complexity of words and sentences
• Attractiveness (worker’s attention & enjoyment)
• Multi-cultural / multi-lingual
• Who is the audience (e.g. target worker community)
– Special needs communities (e.g. simple color blindness)
• Parsimony
• Cognitive load: mental rigor needed to perform task
• Exposure effect
92. The human side
• As a worker
– I hate when instructions are not clear
– I’m not a spammer – I just don’t get what you want
– Boring task
– Good pay is ideal but not the only condition for engagement
• As a requester
– Attrition
– Balancing act: a task that would produce the right results and
is appealing to workers
– I want your honest answer for the task
– I want qualified workers; system should do some of that for me
• Managing crowds and tasks is a daily activity
– more difficult than managing computers
94. When to assess quality of work
• Beforehand (prior to main task activity)
– How: “qualification tests” or similar mechanism
– Purpose: screening, selection, recruiting, training
• During
– How: assess labels as worker produces them
• Like random checks on a manufacturing line
– Purpose: calibrate, reward/penalize, weight
• After
– How: compute accuracy metrics post-hoc
– Purpose: filter, calibrate, weight, retain (HR)
– E.g. Jung & Lease (2011), Tang & Lease (2011), ...
95. How do we measure work quality?
• Compare worker’s label vs.
– Known (correct, trusted) label
– Other workers’ labels
• P. Ipeirotis. Worker Evaluation in Crowdsourcing: Gold Data or
Multiple Workers? Sept. 2010.
– Model predictions of the above
• Model the labels (Ryu & Lease, ASIS&T11)
• Model the workers (Chen et al., AAAI’10)
• Verify worker’s label
– Yourself
– Tiered approach (e.g. Find-Fix-Verify)
• Quinn and B. Bederson’09, Bernstein et al.’10
96. Typical Assumptions
• Objective truth exists
– no minority voice / rare insights
– Can relax this to model “truth distribution”
• Automatic answer comparison/evaluation
– What about free text responses? Hope from NLP…
• Automatic essay scoring
• Translation (BLEU: Papineni, ACL’2002)
• Summarization (Rouge: C.Y. Lin, WAS’2004)
– Have people do it (yourself or find-verify crowd, etc.)
97. Distinguishing Bias vs. Noise
• Ipeirotis (HComp 2010)
• People often have consistent, idiosyncratic
skews in their labels (bias)
– E.g. I like action movies, so they get higher ratings
• Once detected, systematic bias can be
calibrated for and corrected (yeah!)
• Noise, however, seems random & inconsistent
– this is the real issue we want to focus on
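A minimal sketch of that calibration (the scores are invented): estimate a worker's average offset on items with trusted labels, then subtract it; what remains is noise.

```python
def calibrate(worker_scores, gold_scores):
    """Estimate and remove a worker's systematic offset vs. trusted labels."""
    bias = sum(w - g for w, g in zip(worker_scores, gold_scores)) / len(gold_scores)
    corrected = [w - bias for w in worker_scores]
    return corrected, bias

# A worker who consistently rates one point high:
corrected, bias = calibrate([4, 5, 5, 3], [3, 4, 4, 2])
# bias == 1.0; corrected == [3.0, 4.0, 4.0, 2.0]
```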
98. Comparing to known answers
• AKA: gold, honey pot, verifiable answer, trap
• Assumes you have known answers
• Cost vs. Benefit
– Producing known answers (experts?)
– % of work spent re-producing them
• Finer points
– Controls against collusion
– What if workers recognize the honey pots?
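A minimal honey-pot check (the function and item IDs are illustrative): score each worker only on the trap items with known answers.

```python
def honeypot_accuracy(responses, gold):
    """Fraction of trap items this worker answered correctly (None if none seen)."""
    seen = [item for item in gold if item in responses]
    if not seen:
        return None
    return sum(responses[item] == gold[item] for item in seen) / len(seen)

gold = {"q7": "yes", "q13": "no"}               # trusted trap answers
worker = {"q1": "yes", "q7": "yes", "q13": "yes"}
honeypot_accuracy(worker, gold)                 # 1 of 2 traps correct -> 0.5
```

Rotating and rewording traps helps with the recognition problem raised above.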
99. Comparing to other workers
• AKA: consensus, plurality, redundant labeling
• Well-known metrics for measuring agreement
• Cost vs. Benefit: % of work that is redundant
• Finer points
– Is consensus “truth” or systematic bias of group?
– What if no one really knows what they’re doing?
• Low agreement across workers indicates the problem is with the
task (or a specific example), not the workers
– Risk of collusion
• Sheng et al. (KDD 2008)
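Plurality aggregation itself is short to write (labels invented); the agreement fraction it returns is one signal for the finer points above.

```python
from collections import Counter

def majority_label(labels):
    """Consensus label by plurality, plus the fraction that agreed."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

majority_label(["rel", "rel", "nonrel", "rel", "nonrel"])   # ("rel", 0.6)
```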
100. Comparing to predicted label
• Ryu & Lease, ASIS&T11
• Catch-22 extremes
– If model is really bad, why bother comparing?
– If model is really good, why collect human labels?
• Exploit model confidence
– Trust predictions proportional to confidence
– What if model very confident and wrong?
• Active learning
– Time sensitive: Accuracy / confidence changes
101. Compare to predicted worker labels
• Chen et al., AAAI’10
• Avoid inefficiency of redundant labeling
– See also: Dekel & Shamir (COLT’2009)
• Train a classifier for each worker
• For each example labeled by a worker
– Compare to predicted labels for all other workers
• Issues
• Sparsity: workers have to stick around to train model…
• Time-sensitivity: New workers & incremental updates?
102. Methods for measuring agreement
• What to look for
– Agreement, reliability, validity
• Inter-agreement level
– Agreement between judges
– Agreement between judges and the gold set
• Some statistics
– Percentage agreement
– Cohen’s kappa (2 raters)
– Fleiss’ kappa (any number of raters)
– Krippendorff’s alpha
• With majority vote, what if 2 say relevant, 3 say not?
– Use expert to break ties (Kochhar et al, HCOMP’10; GQR)
– Collect more judgments as needed to reduce uncertainty
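Of the statistics listed, percentage agreement and Cohen's kappa for two raters are easy to compute directly (a self-contained sketch with invented labels):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters.
    (Undefined when expected chance agreement is 1.)"""
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (p_obs - p_chance) / (1 - p_chance)

a = ["R", "R", "N", "R", "N", "N"]
b = ["R", "N", "N", "R", "N", "R"]
kappa = cohens_kappa(a, b)   # 4/6 observed vs. 0.5 by chance -> 1/3
```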
103. Other practical tips
• Sign up as worker and do some HITs
• “Eat your own dog food”
• Monitor discussion forums
• Address feedback (e.g., poor guidelines,
payments, passing grade, etc.)
• Everything counts!
– Overall design only as strong as weakest link
105. Why Eytan Adar hates MTurk Research
(CHI 2011 CHC Workshop)
• Overly-narrow focus on MTurk
– Identify general vs. platform-specific problems
– Academic vs. Industrial problems
• Inattention to prior work in other disciplines
• Turks aren’t Martians
– Just human behavior…
106. What about sensitive data?
• Not all data can be publicly disclosed
– User data (e.g. AOL query log, Netflix ratings)
– Intellectual property
– Legal confidentiality
• Need to restrict who is in your crowd
– Separate channel (workforce) from technology
– Hot question for adoption at enterprise level
107. A Few Open Questions
• How should we balance automation vs.
human computation? Which does what?
• Who’s the right person for the job?
• How do we handle complex tasks? Can we
decompose them into smaller tasks? How?
108. What about ethics?
• Silberman, Irani, and Ross (2010)
– “How should we… conceptualize the role of these
people who we ask to power our computing?”
– Power dynamics between parties
• What are the consequences for a worker
when your actions harm their reputation?
– “Abstraction hides detail”
• Fort, Adda, and Cohen (2011)
– “…opportunities for our community to deliberately
value ethics above cost savings.”
112. Micro-tasks & Task Decomposition
• Small, simple tasks can be completed faster by
reducing extraneous context and detail
– e.g. “Can you name who is in this photo?”
• Current workflow research investigates how to
decompose complex tasks into simpler ones
113. Context & Informed Consent
• What is the larger task I’m contributing to?
• Who will benefit from it and how?
116. Issues of Identity Fraud
• Compromised & exploited worker accounts
• Sybil attacks: use of multiple worker identities
• Script bots masquerading as human workers
Robert Sim, MSR Faculty Summit’12
117. Safeguarding Personal Data
• “What are the characteristics of MTurk workers?... the MTurk
system is set up to strictly protect workers’ anonymity….”
119. What about regulation?
• Wolfson & Lease (ASIS&T 2011)
• As usual, technology is ahead of the law
– employment law
– patent inventorship
– data security and the Federal Trade Commission
– copyright ownership
– securities regulation of crowdfunding
• Take-away: don’t panic, but be mindful
– Understand risks of “just in-time compliance”
120. Digital Dirty Jobs
• NY Times: Policing the Web’s Lurid Precincts
• Gawker: Facebook content moderation
• CultureDigitally: The dirty job of keeping
Facebook clean
• Even LDC annotators reading typical
news articles report stress & nightmares!
121. Jeff Howe’s Vision vs. Reality?
• Vision of empowering worker freedom:
– work whenever you want for whomever you want
• When $$$ is at stake, populations at risk may
be compelled to perform work by others
– Digital sweat shops? Digital slaves?
– We really don’t know (and need to learn more…)
– Traction? Human Trafficking at MSR Summit’12
124. What about trust?
• Some reports of robot “workers” on MTurk
– E.g. McCreadie et al. (2011)
– Violates terms of service
• Why not just use a captcha?
126. Requester Fraud on MTurk
“Do not do any HITs that involve: filling in
CAPTCHAs; secret shopping; test our web page;
test zip code; free trial; click my link; surveys or
quizzes (unless the requester is listed with a
smiley in the Hall of Fame/Shame); anything
that involves sending a text message; or
basically anything that asks for any personal
information at all—even your zip code. If you
feel in your gut it’s not on the level, IT’S NOT.
Why? Because they are scams...”
131. Conclusion
• Crowdsourcing is quickly transforming practice
in industry and academia via greater efficiency
• Crowd computing enables a new design space
for applications, augmenting state-of-the-art AI
with human computation to offer
new capabilities and user experiences
• With people at the center of this new computing
paradigm, important research questions
bridge technological & social considerations
132. The Future of Crowd Work
Paper @ ACM CSCW 2013
Kittur, Nickerson, Bernstein, Gerber,
Shaw, Zimmerman, Lease, and Horton
133. Brief Digression: Information Schools
• At 30 universities in N. America, Europe, Asia
• Study human-centered aspects of information
technologies: design, implementation, policy, …
www.ischools.org
Wobbrock et al., 2009
135. • Aniket Kittur, Jeffrey Nickerson, Michael S. Bernstein, Elizabeth
Gerber, Aaron Shaw, John Zimmerman, Matthew Lease, and
John J. Horton. The Future of Crowd Work. In ACM Computer
Supported Cooperative Work (CSCW), February 2013.
• Alex Quinn and Ben Bederson. Human Computation: A Survey
and Taxonomy of a Growing Field. In Proceedings of CHI 2011.
• Law and von Ahn (2011). Human Computation
Surveys
136. 2013 Crowdsourcing
• 1st year of HComp as AAAI conference
• TREC 2013 Crowdsourcing Track
• Springer’s Information Retrieval (articles online):
Crowdsourcing for Information Retrieval
• 4th CrowdConf (San Francisco, Fall)
• 1st Crowdsourcing Week (Singapore, April)
138. 2012 Workshops & Conferences
• AAAI: Human Computation (HComp) (July 22-23)
• AAAI Spring Symposium: Wisdom of the Crowd (March 26-28)
• ACL: 3rd Workshop of the People's Web meets NLP (July 12-13)
• AMCIS: Crowdsourcing Innovation, Knowledge, and Creativity in Virtual Communities (August 9-12)
• CHI: CrowdCamp (May 5-6)
• CIKM: Multimodal Crowd Sensing (CrowdSens) (Oct. or Nov.)
• Collective Intelligence (April 18-20)
• CrowdConf 2012 -- 3rd Annual Conference on the Future of Distributed Work (October 23)
• CrowdNet - 2nd Workshop on Cloud Labor and Human Computation (Jan 26-27)
• EC: Social Computing and User Generated Content Workshop (June 7)
• ICDIM: Emerging Problem-specific Crowdsourcing Technologies (August 23)
• ICEC: Harnessing Collective Intelligence with Games (September)
• ICML: Machine Learning in Human Computation & Crowdsourcing (June 30)
• ICWE: 1st International Workshop on Crowdsourced Web Engineering (CroWE) (July 27)
• KDD: Workshop on Crowdsourcing and Data Mining (August 12)
• Multimedia: Crowdsourcing for Multimedia (Nov 2)
• SocialCom: Social Media for Human Computation (September 6)
• TREC-Crowd: 2nd TREC Crowdsourcing Track (Nov. 14-16)
• WWW: CrowdSearch: Crowdsourcing Web search (April 17)
139. 2011 Workshops & Conferences
• AAAI-HCOMP: 3rd Human Computation Workshop (Aug. 8)
• ACIS: Crowdsourcing, Value Co-Creation, & Digital Economy Innovation (Nov. 30 – Dec. 2)
• Crowdsourcing Technologies for Language and Cognition Studies (July 27)
• CHI-CHC: Crowdsourcing and Human Computation (May 8)
• CIKM: BooksOnline (Oct. 24, “crowdsourcing … online books”)
• CrowdConf 2011 -- 2nd Conf. on the Future of Distributed Work (Nov. 1-2)
• Crowdsourcing: Improving … Scientific Data Through Social Networking (June 13)
• EC: Workshop on Social Computing and User Generated Content (June 5)
• ICWE: 2nd International Workshop on Enterprise Crowdsourcing (June 20)
• Interspeech: Crowdsourcing for speech processing (August)
• NIPS: Second Workshop on Computational Social Science and the Wisdom of Crowds (Dec. TBD)
• SIGIR-CIR: Workshop on Crowdsourcing for Information Retrieval (July 28)
• TREC-Crowd: 1st TREC Crowdsourcing Track (Nov. 16-18)
• UbiComp: 2nd Workshop on Ubiquitous Crowdsourcing (Sep. 18)
• WSDM-CSDM: Crowdsourcing for Search and Data Mining (Feb. 9)
140. 2011 Tutorials and Keynotes
• By Omar Alonso and/or Matthew Lease
– CLEF: Crowdsourcing for Information Retrieval Experimentation and Evaluation (Sep. 20, Omar only)
– CrowdConf: Crowdsourcing for Research and Engineering
– IJCNLP: Crowd Computing: Opportunities and Challenges (Nov. 10, Matt only)
– WSDM: Crowdsourcing 101: Putting the WSDM of Crowds to Work for You (Feb. 9)
– SIGIR: Crowdsourcing for Information Retrieval: Principles, Methods, and Applications (July 24)
• AAAI: Human Computation: Core Research Questions and State of the Art
– Edith Law and Luis von Ahn, August 7
• ASIS&T: How to Identify Ducks In Flight: A Crowdsourcing Approach to Biodiversity Research and
Conservation
– Steve Kelling, October 10, ebird
• EC: Conducting Behavioral Research Using Amazon's Mechanical Turk
– Winter Mason and Siddharth Suri, June 5
• HCIC: Quality Crowdsourcing for Human Computer Interaction Research
– Ed Chi, June 14-18
– Also see his: Crowdsourcing for HCI Research with Amazon Mechanical Turk
• Multimedia: Frontiers in Multimedia Search
– Alan Hanjalic and Martha Larson, Nov 28
• VLDB: Crowdsourcing Applications and Platforms
– Anhai Doan, Michael Franklin, Donald Kossmann, and Tim Kraska
• WWW: Managing Crowdsourced Human Computation
– Panos Ipeirotis and Praveen Paritosh
142. More Books
July 2010, kindle-only: “This book introduces you to the
top crowdsourcing sites and outlines step by step with
photos the exact process to get started as a requester on
Amazon Mechanical Turk.“
143. Resources
A Few Blogs
Behind Enemy Lines (P.G. Ipeirotis, NYU)
Deneme: a Mechanical Turk experiments blog (Greg Little, MIT)
CrowdFlower Blog
http://experimentalturk.wordpress.com
Jeff Howe
A Few Sites
The Crowdsortium
Crowdsourcing.org
CrowdsourceBase (for workers)
Daily Crowdsource
MTurk Forums and Resources
Turker Nation: http://turkers.proboards.com
http://www.turkalert.com (and its blog)
Turkopticon: report/avoid shady requestors
Amazon Forum for MTurk
144. Bibliography
J. Barr and L. Cabrera. “AI gets a Brain”, ACM Queue, May 2006.
Bernstein, M. et al. Soylent: A Word Processor with a Crowd Inside. UIST 2010. Best Student Paper award.
Bederson, B.B., Hu, C., & Resnik, P. Translation by Iterative Collaboration between Monolingual Users, Proceedings of Graphics
Interface (GI 2010), 39-46.
N. Bradburn, S. Sudman, and B. Wansink. Asking Questions: The Definitive Guide to Questionnaire Design, Jossey-Bass, 2004.
C. Callison-Burch. “Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk”, EMNLP 2009.
P. Dai, Mausam, and D. Weld. “Decision-Theoretic Control of Crowd-Sourced Workflows”, AAAI, 2010.
J. Davis et al. “The HPU”, IEEE Computer Vision and Pattern Recognition Workshop on Advancing Computer Vision with Human
in the Loop (ACVHL), June 2010.
M. Gashler, C. Giraud-Carrier, T. Martinez. Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous, ICMLA 2008.
D. A. Grier. When Computers Were Human. Princeton University Press, 2005. ISBN 0691091579
JS. Hacker and L. von Ahn. “Matchin: Eliciting User Preferences with an Online Game”, CHI 2009.
J. Heer, M. Bostock. “Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design”, CHI 2010.
P. Heymann and H. Garcia-Molina. “Human Processing”, Technical Report, Stanford Info Lab, 2010.
J. Howe. “Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business”. Crown Business, New York, 2008.
P. Hsueh, P. Melville, V. Sindhwami. “Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria”. NAACL HLT
Workshop on Active Learning and NLP, 2009.
B. Huberman, D. Romero, and F. Wu. “Crowdsourcing, attention and productivity”. Journal of Information Science, 2009.
P.G. Ipeirotis. The New Demographics of Mechanical Turk. March 9, 2010. PDF and Spreadsheet.
P.G. Ipeirotis, R. Chandrasekar and P. Bennett. Report on the human computation workshop. SIGKDD Explorations v11 no 2 pp. 80-83, 2010.
P.G. Ipeirotis. Analyzing the Amazon Mechanical Turk Marketplace. CeDER-10-04 (Sept. 11, 2010)
145. Bibliography (2)
A. Kittur, E. Chi, and B. Suh. “Crowdsourcing user studies with Mechanical Turk”, SIGCHI 2008.
Aniket Kittur, Boris Smus, Robert E. Kraut. CrowdForge: Crowdsourcing Complex Work. CHI 2011
Adriana Kovashka and Matthew Lease. “Human and Machine Detection of … Similarity in Art”. CrowdConf 2010.
K. Krippendorff. "Content Analysis", Sage Publications, 2003
G. Little, L. Chilton, M. Goldman, and R. Miller. “TurKit: Tools for Iterative Tasks on Mechanical Turk”, HCOMP 2009.
T. Malone, R. Laubacher, and C. Dellarocas. Harnessing Crowds: Mapping the Genome of Collective Intelligence.
2009.
W. Mason and D. Watts. “Financial Incentives and the ’Performance of Crowds’”, HCOMP Workshop at KDD 2009.
J. Nielsen. “Usability Engineering”, Morgan-Kaufman, 1994.
A. Quinn and B. Bederson. “A Taxonomy of Distributed Human Computation”, Technical Report HCIL-2009-23, 2009
J. Ross, L. Irani, M. Six Silberman, A. Zaldivar, and B. Tomlinson. “Who are the Crowdworkers?: Shifting
Demographics in Amazon Mechanical Turk”. CHI 2010.
F. Scheuren. “What is a Survey” (http://www.whatisasurvey.info) 2004.
R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. “Cheap and Fast But is it Good? Evaluating Non-Expert Annotations
for Natural Language Tasks”. EMNLP-2008.
V. Sheng, F. Provost, P. Ipeirotis. “Get Another Label? Improving Data Quality … Using Multiple, Noisy Labelers”
KDD 2008.
S. Weber. “The Success of Open Source”, Harvard University Press, 2004.
L. von Ahn. Games with a purpose. Computer, 39 (6), 92–94, 2006.
L. von Ahn and L. Dabbish. “Designing Games with a purpose”. CACM, Vol. 51, No. 8, 2008.
146. Bibliography (3)
Shuo Chen et al. What if the Irresponsible Teachers Are Dominating? A Method of Training on Samples and
Clustering on Teachers. AAAI 2010.
Paul Heymann, Hector Garcia-Molina: Turkalytics: analytics for human computation. WWW 2011.
Florian Laws, Christian Scheible and Hinrich Schütze. Active Learning with Amazon Mechanical Turk.
EMNLP 2011.
C.Y. Lin. Rouge: A package for automatic evaluation of summaries. Proceedings of the workshop on text
summarization branches out (WAS), 2004.
C. Marshall and F. Shipman “The Ownership and Reuse of Visual Media”, JCDL, 2011.
Hohyon Ryu and Matthew Lease. Crowdworker Filtering with Support Vector Machine. ASIS&T 2011.
Wei Tang and Matthew Lease. Semi-Supervised Consensus Labeling for Crowdsourcing. ACM SIGIR
Workshop on Crowdsourcing for Information Retrieval (CIR), 2011.
S. Vijayanarasimhan and K. Grauman. Large-Scale Live Active Learning: Training Object Detectors with
Crawled Data and Crowds. CVPR 2011.
Stephen Wolfson and Matthew Lease. Look Before You Leap: Legal Pitfalls of Crowdsourcing. ASIS&T 2011.
146
147. Recent Work
• Della Penna, N., and M. D. Reid. (2012). “Crowd & Prejudice: An Impossibility Theorem for Crowd Labelling without a Gold
Standard.” in Proceedings of Collective Intelligence. arXiv preprint arXiv:1204.3511.
• Demartini, Gianluca, D.E. Difallah, and P. Cudre-Mauroux. (2012). “ZenCrowd: leveraging probabilistic reasoning and
crowdsourcing techniques for large-scale entity linking.” 21st Annual Conference on the World Wide Web (WWW).
• Donmez, Pinar, Jaime Carbonell, and Jeff Schneider. (2010). “A probabilistic framework to learn from multiple
annotators with time-varying accuracy.” in SIAM International Conference on Data Mining (SDM), 826-837.
• Donmez, Pinar, Jaime Carbonell, and Jeff Schneider. (2009). “Efficiently learning the accuracy of labeling sources for
selective sampling.” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and
data mining (KDD), 259-268.
• Fort, K., Adda, G., and Cohen, K. (2011). Amazon Mechanical Turk: Gold mine or coal mine? Computational
Linguistics, 37(2):413–420.
• Ghosh, A., Satyen Kale, and Preston McAfee. (2012). “Who Moderates the Moderators? Crowdsourcing Abuse Detection
in User-Generated Content.” in Proceedings of the 12th ACM conference on Electronic commerce.
• Ho, C. J., and J. W. Vaughan. (2012). “Online Task Assignment in Crowdsourcing Markets.” in Twenty-Sixth AAAI Conference
on Artificial Intelligence.
• Jung, Hyun Joon, and Matthew Lease. (2012). “Inferring Missing Relevance Judgments from Crowd Workers via
Probabilistic Matrix Factorization.” in Proceedings of the 36th international ACM SIGIR conference on Research and
development in information retrieval.
• Kamar, E., S. Hacker, and E. Horvitz. (2012). “Combining Human and Machine Intelligence in Large-scale Crowdsourcing.” in
Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
• Karger, D. R., S. Oh, and D. Shah. (2011). “Budget-optimal task allocation for reliable crowdsourcing systems.” arXiv preprint
arXiv:1110.3564.
• Kazai, Gabriella, Jaap Kamps, and Natasa Milic-Frayling. (2012). “An Analysis of Human Factors and Label Accuracy in
Crowdsourcing Relevance Judgments.” Springer's Information Retrieval Journal: Special Issue on Crowdsourcing.
147
148. Recent Work (2)
• Lin, C.H. and Mausam and Weld, D.S. (2012). “Crowdsourcing Control: Moving Beyond Multiple Choice.” in
Proceedings of the 4th Human Computation Workshop (HCOMP) at AAAI.
• Liu, C., and Y. M. Wang. (2012). “TrueLabel + Confusions: A Spectrum of Probabilistic Models in Analyzing Multiple
Ratings.” in Proceedings of the 29th International Conference on Machine Learning (ICML).
• Liu, Di, Randolph Bias, Matthew Lease, and Rebecca Kuipers. (2012). “Crowdsourcing for Usability Testing.” in
Proceedings of the 75th Annual Meeting of the American Society for Information Science and Technology (ASIS&T).
• Ramesh, A., A. Parameswaran, Hector Garcia-Molina, and Neoklis Polyzotis. (2012). Identifying Reliable Workers Swiftly.
• Raykar, Vikas, Yu, S., Zhao, L.H., Valadez, G.H., Florin, C., Bogoni, L., and Moy, L. (2010). “Learning From Crowds.” Journal
of Machine Learning Research 11:1297-1322.
• Raykar, Vikas, Yu, S., Zhao, L.H., Jerebko, A., Florin, C., Valadez, G.H., Bogoni, L., and Moy, L. (2009). “Supervised
learning from multiple experts: whom to trust when everyone lies a bit.” in Proceedings of the 26th Annual
International Conference on Machine Learning (ICML), 889-896.
• Raykar, Vikas C., and Shipeng Yu. (2012). “Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling
Tasks.” Journal of Machine Learning Research 13:491-518.
• Wauthier, Fabian L., and Michael I. Jordan. (2012). “Bayesian Bias Mitigation for Crowdsourcing.” in Advances in neural
information processing systems (NIPS).
• Weld, D.S., Mausam, and Dai, P. (2011). “Execution control for crowdsourcing.” in Proceedings of the 24th ACM
symposium adjunct on User interface software and technology (UIST).
• Weld, D.S., Mausam, and Dai, P. (2011). “Human Intelligence Needs Artificial Intelligence.” in Proceedings of the 3rd
Human Computation Workshop (HCOMP) at AAAI.
• Welinder, Peter, Steve Branson, Serge Belongie, and Pietro Perona. (2010). “The Multidimensional Wisdom of
Crowds.” in Advances in Neural Information Processing Systems (NIPS), 2424-2432.
• Welinder, Peter, and Pietro Perona. (2010). “Online crowdsourcing: rating annotators and obtaining cost-effective
labels.” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 25-32.
• Whitehill, J., P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. (2009). “Whose Vote Should Count More: Optimal Integration
of Labels from Labelers of Unknown Expertise.” in Advances in Neural Information Processing Systems (NIPS).
• Yan, Y., and R. Rosales. (2011). “Active learning from crowds.” in Proceedings of the 28th Annual International
Conference on Machine Learning (ICML).
148