Crowdsourcing for Search Evaluation and Social-Algorithmic Search


Matthew Lease
University of Texas at Austin

Omar Alonso
Microsoft

August 12, 2012

August 12, 2012 1

Topics
• Crowd-powered data collection & applications
– Evaluation: relevance judging, interactive studies, log data
– Training: e.g., active learning (e.g. learning to rank)
– Search: answering, verification, collaborations, physical
• Crowdsourcing & human computation
• Crowdsourcing platforms
• Incentive Engineering & Demographics
• Designing for Crowds & Quality assurance
• Future Challenges
• Broader Issues and the Dark Side

What is Crowdsourcing?
• Let’s start with an example and work back
toward a more general definition
• Example: Amazon’s Mechanical Turk (MTurk)
• Goal
– See a concrete example of real crowdsourcing
– Ground later discussion of abstract concepts
– Provide a specific example with which we will
contrast other forms of crowdsourcing


Human Intelligence Tasks (HITs)


Jane saw the man with the binoculars


Traditional Data Collection
• Setup data collection software / harness
• Recruit participants / annotators / assessors
• Pay a flat fee for experiment or hourly wage

• Characteristics
– Slow
– Expensive
– Difficult and/or Tedious
– Sample Bias…


“Hello World” Demo
• Let’s create and run a simple MTurk HIT
• This is a teaser highlighting concepts
– Don’t worry about details; we’ll revisit them
• Goal
– See a concrete example of real crowdsourcing
– Ground our later discussion of abstract concepts
– Provide a specific example with which we will
contrast other forms of crowdsourcing


Flip a coin
• Please flip a coin and report the results
• Two questions
1. Coin type?
2. Heads or tails?
• Results
Coin type    Count
Dollar       56
Euro         11
Other        30
(blank)      3
Grand Total  100

Outcome      Count
head         57
tail         43
Grand Total  100
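
As one illustration of what can be done with results like these, the sketch below checks whether 57 heads out of 100 reported flips is consistent with a fair coin, using an exact two-sided binomial test. The choice of test and the function name are assumptions made for illustration; they are not part of the original demo.

```python
from math import comb


def two_sided_binomial_p(heads: int, n: int) -> float:
    """Exact two-sided p-value for the hypothesis of a fair coin (p = 0.5)."""
    # With p = 0.5 the distribution is symmetric around n / 2, so we sum the
    # probability of every outcome at least as far from n / 2 as the observed one.
    distance = abs(heads - n / 2)
    tail = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= distance)
    return tail / 2 ** n


# Counts reported on the slide: 57 heads, 43 tails.
p_value = two_sided_binomial_p(57, 100)
print(f"p-value = {p_value:.3f}")
```

A large p-value only says the reported counts are compatible with a fair coin; it cannot show that workers actually flipped one.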


NOW WHAT CAN I DO WITH IT?


PHASE 1:
COLLECTING & LABELING DATA


Data is King!
• Massive free Web data
changed how we train
learning systems
– Banko and Brill (2001).
Human Language Tech.
– Halevy et al. (2009). IEEE
Intelligent Systems.

• Crowds provide new access to cheap & labeled
Big Data. But quality also matters!

NLP: Snow et al. (EMNLP 2008)
• MTurk annotation for 5 Tasks
– Affect recognition
– Word similarity
– Recognizing textual entailment
– Event temporal ordering
– Word sense disambiguation
• 22K labels for US $26
• High agreement between
consensus labels and
gold-standard labels

Computer Vision:
Sorokin & Forsyth (CVPR 2008)
• 4K labels for US $60


IR: Alonso et al. (SIGIR Forum 2008)
• MTurk for Information Retrieval (IR)
– Judge relevance of search engine results
• Many follow-on studies (design, quality, cost)


User Studies: Kittur, Chi, & Suh (CHI 2008)

• “…make creating believable invalid responses as
effortful as completing the task in good faith.”


Social & Behavioral Sciences
• A Guide to Behavioral Experiments
on Mechanical Turk
– W. Mason and S. Suri (2010). SSRN online.
• Crowdsourcing for Human Subjects Research
– L. Schmidt (CrowdConf 2010)
• Crowdsourcing Content Analysis for Behavioral Research:
Insights from Mechanical Turk
– Conley & Tosti-Kharas (2010). Academy of Management
• Amazon's Mechanical Turk: A New Source of
Inexpensive, Yet High-Quality, Data?
– M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
– see also: Amazon Mechanical Turk Guide for Social Scientists

Remote Usability Testing
• Liu, Bias, Lease, and Kuipers, ASIS&T, 2012
• Compares remote usability testing using MTurk and
CrowdFlower (not uTest) vs. traditional on-site testing
• Advantages
– More (Diverse) Participants
– High Speed
– Low Cost
• Disadvantages
– Lower Quality Feedback
– Less Interaction
– Greater need for quality control
– Less Focused User Groups

NLP Example – Dialect Identification


NLP Example – Machine Translation
• Manual evaluation on translation quality is
slow and expensive
• High agreement between non-experts and
experts
• $0.10 to translate a sentence

C. Callison-Burch. “Fast, Cheap, and Creative: Evaluating Translation Quality
Using Amazon’s Mechanical Turk”, EMNLP 2009.


Computer Vision – Painting Similarity

Kovashka & Lease, CrowdConf’10


IR Example – Relevance and ads


IR Example – Product Search


IR Example – Snippet Evaluation
• Study on summary lengths
• Determine preferred result length
• Asked workers to categorize web queries
• Asked workers to evaluate snippet quality
• Payment between $0.01 and $0.05 per HIT

M. Kaisser, M. Hearst, and L. Lowe. “Improving Search Results Quality by Customizing Summary Lengths”, ACL/HLT, 2008.


IR Example – Relevance Assessment
• Replace TREC-like relevance assessors with MTurk?
• Selected topic “space program” (011)
• Modified original 4-page instructions from TREC
• Workers more accurate than original assessors!
• 40% provided justification for each answer

O. Alonso and S. Mizzaro. “Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment”, SIGIR Workshop
on the Future of IR Evaluation, 2009.


IR Example – Timeline Annotation
• Workers annotate timeline on politics, sports, culture
• Given a timex (1970s, 1982, etc.) suggest something
• Given an event (Vietnam, World cup, etc.) suggest a timex

K. Berberich, S. Bedathur, O. Alonso, G. Weikum “A Language Modeling Approach for Temporal Information Needs”. ECIR 2010


COLLECTING DATA WITH OTHER
CROWDS & OTHER INCENTIVES


Why Eytan Adar hates MTurk Research
(CHI 2011 CHC Workshop)
• Overly-narrow focus on MTurk
– Identify general vs. platform-specific problems
– Academic vs. Industrial problems
• Inattention to prior work in other disciplines
• Turks aren’t Martians
– Just human behavior (more later…)


ESP Game (Games With a Purpose)
L. von Ahn and L. Dabbish (2004)


reCaptcha

L. von Ahn et al. (2008). In Science.

Human Sensing and Monitoring
• Sullivan et al. (2009). Biological Conservation 142(10)
• Keynote by Steve Kelling at ASIS&T 2011


• Learning to map from web pages to queries
• Human computation game to elicit data
• Home grown system (no AMT)
• Try it!
pagehunt.msrlivelabs.com

See also:
• H. Ma et al. “Improving Search Engines Using Human Computation Games”, CIKM 2009.
• Law et al. SearchWar. HCOMP 2009.
• Bennett et al. Picture This. HCOMP 2009.

Tracking Sentiment in Online Media
Brew et al., PAIS 2010
• Volunteer-crowd
• Judge in exchange for
access to rich content
• Balance system needs
with user interest
• Daily updates to non-
stationary distribution

PHASE 2: FROM DATA COLLECTION
TO HUMAN COMPUTATION


Human Computation
• What was old is new

• Crowdsourcing: A New Branch
of Computer Science
– D.A. Grier, March 29, 2011

• Tabulating the heavens:
computing the Nautical
Almanac in 18th-century
England - M. Croarken’03 Princeton University Press, 2005

The Mechanical Turk

Constructed and unveiled in 1770 by Wolfgang von Kempelen (1734–1804)

J. Pontin. Artificial Intelligence, With Help From
the Humans. New York Times (March 25, 2007)


The Human Processing Unit (HPU)
• Davis et al. (2010)

HPU


Human Computation
• Having people do stuff instead of computers
• Investigates use of people to execute certain
computations for which capabilities of current
automated methods are more limited
• Explores the metaphor of computation for
characterizing attributes, capabilities, and
limitations of human performance in executing
desired tasks
• Computation is required, crowd is not
• von Ahn’s Thesis (2005), Law & von Ahn (2011)

APPLYING HUMAN COMPUTATION:
CROWD-POWERED APPLICATIONS


Crowd-Assisted Search: “Amazon Remembers”


Crowd-Assisted Search (2)

• Yan et al., MobiSys’10

• CrowdTerrier
(McCreadie et al., SIGIR’12)


Translation by monolingual speakers
• C. Hu, CHI 2009


Soylent: A Word Processor with a Crowd Inside

• Bernstein et al., UIST 2010


fold.it
S. Cooper et al. (2010)

Alice G. Walton. Online Gamers Help Solve Mystery of
Critical AIDS Virus Enzyme. The Atlantic, October 8, 2011.

PlateMate (Noronha et al., UIST’11)


Image Analysis and more: Eatery


VizWiz
Bigham et al. (UIST 2010)


Crowd Sensing: Waze


THE SOCIAL SIDE OF SEARCH


People are more than HPUs
• Why is Facebook popular? People are social.
• Information needs are contextually grounded in
our social experiences and social networks
• The value of social search may be more than
the relevance of the search results
• Our social networks also embody additional
knowledge about us, our needs, and the world
The social dimension complements computation

Community Q&A


Complex Information Needs
• Who is Rahm Emanuel, Obama's Chief of Staff?
• How have dramatic shifts in terrorists resulted in an
equally dramatic shift in terrorist organizations?
• How do I find what events were in the news on my sons
birthday?
• Do you think the current drop in the Stock Market is
related to Obama's election to President?
• Why are prisoners on death row given final medicals?
• Should George Bush attack Iran's nuclear facility
before he leaves office?
• Why are people against gay marriage?
• Does anyone know anything interesting that happened
nation wide in 2008?
• Should the fact that a prisoner has cancer have any
bearing on an appeal for bail?
Source: Yahoo! Answers, “News & Events”, Nov. 6 2008

Community Q&A
• Ask the village vs. searching the archive
• Posting and waiting can be slow
– Find similar questions already answered
• Best answer (winner-take-all) vs. voting
• Challenges
– Questions shorter than documents
– Questions not queries, colloquial, errors
– Latency & quality (e.g. question routing)
• Cf. work by Bruce Croft & students

Horowitz & Kamvar, WWW’10
• Routing: Trust vs. Authority
• Social networks vs. search engines
– See also: Morris & Teevan, HCIC’12

Social Network integration
• Facebook Questions (with Bing)
• Google+ (acquired Aardvark)
• Twitter (cf. Paul, Hong, and Chi, ICWSM’11)


Search Buddies
Hecht et al. ICWSM 2012; Morris MSR Talk


{where to go on vacation}

• Tons of results
• Read title + snippet + URL
• Explore a few pages in detail

• MTurk: 50 answers, $1.80
• Quora: 2 answers
• Y! Answers: 2 answers
• FB: 1 answer

Countries
Cities


• Let’s execute the same query on different days
Execution #1     Execution #2     Execution #3
Las Vegas 3      Kerala 6         Las Vegas 4
Hawaii 2         Goa 4            Himachal pradesh 3
Kerala 2         Ooty 3           Mauritius 2
Key West 2       Switzerland 3    Ooty 2
Orlando 2        Agra 2
                 kodaikanal 2
                 New Zealand 2

• Table shows places with frequency >= 2
• Every execution uses same template & 50 workers
• Completion time more or less the same
• Results may differ
• Related work: Zhang et al., CHI 2012

SO WHAT IS CROWDSOURCING?


From Outsourcing to Crowdsourcing
• Take a job traditionally
performed by a known agent
(often an employee)
• Outsource it to an undefined,
generally large group of
people via an open call
• New application of principles
from open source movement
• Evolving & broadly defined ...

Crowdsourcing models
• Micro-tasks & citizen science
• Co-Creation
• Open Innovation, Contests
• Prediction Markets
• Crowd Funding and Charity
• “Gamification” (not serious gaming)
• Transparent
• cQ&A, Social Search, and Polling
• Physical Interface/Task

What is Crowdsourcing?
• A set of mechanisms and methods for scaling &
directing crowd activities to achieve some goal(s)
• Enabled by internet-connectivity
• Many related topics/areas:
– Human computation (next slide…)
– Collective intelligence
– Crowd/Social computing
– Wisdom of Crowds
– People services, Human Clouds, Peer-production, …

What is not crowdsourcing?
• Post-hoc use of pre-existing crowd data
– Data mining
– Visual analytics
• Use of one or few people
– Mixed-initiative design
– Active learning
• Conducting a survey or poll… (*)


Crowdsourcing Key Questions
• What are the goals?
– Purposeful directing of human activity

• How can you incentivize participation?
– Incentive engineering
– Who are the target participants?

• Which model(s) are most appropriate?
– How to adapt them to your context and goals?

What do you want to accomplish?
• Create
• Execute task/computation
• Fund
• Innovate and/or discover
• Learn
• Monitor
• Predict

INCENTIVE ENGINEERING


Who are
the workers?

• A. Baio, November 2008. The Faces of Mechanical Turk.
• P. Ipeirotis. March 2010.
The New Demographics of Mechanical Turk
• J. Ross, et al. Who are the Crowdworkers?... CHI 2010.

MTurk Demographics
• 2008-2009 studies found
less global and diverse
than previously thought
– US
– Female
– Educated
– Bored
– Money is secondary


2010 shows increasing diversity
47% US, 34% India, 19% other (P. Ipeirotis. March 2010)


Why should your crowd participate?
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige (leaderboards, badges)
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

Multiple incentives can often operate in parallel (*caveat)

Example: Wikipedia
• Obtain recognition or prestige


Example: DuoLingo


Example:


Example: ESP


Example: fold.it


Example: FreeRice


Example: cQ&A


Example: reCaptcha
• Do Good (altruism)
• Learn something new

Is there an existing human activity you can harness
for another purpose?



Example: Mechanical Turk


How Much to Pay?
• Price commensurate with task effort
– Ex: $0.02 for yes/no answer + $0.02 bonus for optional feedback
• Ethics & market-factors: W. Mason and S. Suri, 2010.
– e.g. non-profit SamaSource contracts workers in refugee camps
– Predict right price given market & task: Wang et al. CSDM’11
• Uptake & time-to-completion vs. Cost & Quality
– Too little $$, no interest or slow – too much $$, attract spammers
– Real problem is lack of reliable QA substrate
• Accuracy & quantity
– More pay = more work, not better (W. Mason and D. Watts, 2009)
• Heuristics: start small, watch uptake and bargaining feedback
• Worker retention (“anchoring”)
See also: L.B. Chilton et al. KDD-HCOMP 2010.
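
To make "price commensurate with task effort" concrete, one rough check is to convert a per-task reward into an effective hourly rate. The sketch below assumes you have timed the task yourself; the 30-second figure is a hypothetical value, not a number from the slides.

```python
def effective_hourly_rate(reward_usd: float, seconds_per_task: float) -> float:
    """Convert a per-task reward into the hourly wage it implies."""
    return reward_usd * 3600 / seconds_per_task


# $0.02 for a yes/no answer (from the slide), assumed to take about 30 seconds.
rate = effective_hourly_rate(0.02, 30)
print(f"${rate:.2f} per hour")
```

Timing a small pilot batch, as suggested under "start small", gives a more realistic estimate than timing the task yourself.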

Dan Pink – YouTube video
“The Surprising Truth about what Motivates us”


PLATFORMS


Mechanical What?


Does anyone really use it? Yes!

http://www.mturk-tracker.com (P. Ipeirotis’10)

From 1/09 – 4/10, 7M HITs from 10K requesters
worth $500,000 USD (significant under-estimate)

MTurk: The Requester
• Sign up with your Amazon account
• Amazon payments
• Purchase prepaid HITs
• There is no minimum or up-front fee
• MTurk collects a 10% commission
• The minimum commission charge is $0.005 per HIT
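
The fee structure above can be turned into a quick budget estimate. This is a minimal sketch using the 10% commission and $0.005 minimum quoted on the slide. It assumes the commission and its minimum apply to every paid judgment, and the function and parameter names are hypothetical.

```python
from decimal import Decimal

COMMISSION_RATE = Decimal("0.10")   # 10% commission, as quoted on the slide
MIN_COMMISSION = Decimal("0.005")   # minimum commission charge, as quoted


def batch_cost(reward, n_items, workers_per_item):
    """Return (worker_pay, commission, total) in dollars for one batch."""
    reward = Decimal(str(reward))
    n_paid = n_items * workers_per_item
    # Assumption: the commission, with its minimum, applies to each paid judgment.
    fee_each = max(reward * COMMISSION_RATE, MIN_COMMISSION)
    worker_pay = reward * n_paid
    commission = fee_each * n_paid
    return worker_pay, commission, worker_pay + commission


# 100 items, 5 workers each, $0.02 per judgment.
pay, fee, total = batch_cost("0.02", n_items=100, workers_per_item=5)
print(f"workers ${pay:.2f} + commission ${fee:.2f} = ${total:.2f}")
```

At very low rewards the minimum charge dominates: in this example the commission comes to 25% of worker pay, not 10%.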


MTurk Dashboard
• Three tabs
– Design
– Publish
– Manage
• Design
– HIT Template
• Publish
– Make work available
• Manage
– Monitor progress


MTurk: Dashboard - II


MTurk API
• Amazon Web Services API
• Rich set of services
• Command line tools
• More flexibility than dashboard


MTurk Dashboard vs. API
• Dashboard
– Easy to prototype
– Setup and launch an experiment in a few minutes
• API
– Ability to integrate AMT as part of a system
– Ideal if you want to run experiments regularly
– Schedule tasks


• Multiple Channels
• Gold-based tests
• Only pay for
“trusted” judgments


CloudFactory
• Information below from Mark Sears (Oct. 18, 2011)
• Cloud Labor API
– Tools to design virtual assembly lines
– workflows with multiple tasks chained together
• Focus on self serve tools for people to easily design crowd-powered assembly lines
that can be easily integrated into software applications
• Interfaces: command-line, RESTful API, and Web
• Each “task station” can have either a human or robot worker assigned
– web software services (AlchemyAPI, SendGrid, Google APIs, Twilio, etc.) or local software can
be combined with human computation
• Many built-in "best practices"
– “Tournament Stations” where multiple results are compared by other cloud workers until
confidence of best answer is reached
– “Improver Stations” have workers improve and correct work by other workers
– Badges are earned by cloud workers passing tests created by requesters
– Training and tools to create skill tests will be flexible
– Algorithms to detect and kick out spammers/cheaters/lazy/bad workers


More Crowd Labor Platforms
• Clickworker
• CloudCrowd
• CrowdSource
• DoMyStuff
• Humanoid (by Matt Swason et al.)
• Microtask
• MobileWorks (by Anand Kulkarni)
• myGengo
• SmartSheet
• vWorker
• Industry heavy-weights
– Elance
– Liveops
– oDesk
– uTest
• and more…


Platform alternatives
• Why MTurk
– Amazon brand, lots of research papers
– Speed, price, diversity, payments
• Why not
– Crowdsourcing != MTurk
– Spam, no analytics, must build tools for worker & task quality
• Microsoft Universal Human Relevance System (UHRS)
• How to build your own crowdsourcing platform
– Back-end
– Template language for creating experiments
– Scheduler
– Payments?

Why Micro-Tasks?
• Easy, cheap and fast
• Ready-to-use infrastructure, e.g.
– MTurk payments, workforce, interface widgets
– CrowdFlower quality control mechanisms, etc.
– Many others …
• Allows early, iterative, frequent trials
– Iteratively prototype and test new ideas
– Try new tasks, test when you want & as you go
• Many successful examples of use reported

Micro-Task Issues
• Process
– Task design, instructions, setup, iteration
• Choose crowdsourcing platform (or roll your own)
• Human factors
– Payment / incentives, interface and interaction design,
communication, reputation, recruitment, retention
• Quality Control / Data Quality
– Trust, reliability, spam detection, consensus labeling


WORKFLOW DESIGN


PlateMate - Architecture


Kulkarni et al.,
CSCW 2012

Turkomatic


CrowdForge: Workers perform a task
or further decompose them

Kittur et al., CHI 2011


Kittur et al., CrowdWeaver, CSCW 2012


DESIGNING FOR CROWDS


Typical Workflow
• Define and design what to test
• Sample data
• Design the experiment
• Run experiment
• Collect data and analyze results
• Quality control


Development Framework
• Incremental approach
• Measure, evaluate, and adjust as you go
• Suitable for repeatable tasks


Survey Design
• One of the most important parts
• Part art, part science
• Instructions are key
• Prepare to iterate


Questionnaire Design
• Ask the right questions
• Workers may not be IR experts so don’t
assume the same understanding in terms of
terminology
• Show examples
• Hire a technical writer
– Engineer writes the specification
– Writer communicates


UX Design
• Time to apply all those usability concepts
• Generic tips
– Experiment should be self-contained.
– Keep it short and simple. Brief and concise.
– Be very clear with the relevance task.
– Engage with the worker. Avoid boring stuff.
– Always ask for feedback (open-ended question) in
an input box.


UX Design - II
• Presentation
• Document design
• Highlight important concepts
• Colors and fonts
• Need to grab attention
• Localization


Examples - I
• Asking too much, task not clear, “do NOT/reject”
• Worker has to do a lot of stuff


Example - II
• Lot of work for a few cents
• Go here, go there, copy, enter, count …


A Better Example
• All information is available
– What to do
– Search result
– Question to answer


Form and Metadata
• Form with a closed question (binary relevance) and
open-ended question (user feedback)
• Clear title, useful keywords
• Workers need to find your task


Relevance Judging – Example I


Relevance Judging – Example II


Implementation
• Similar to a UX
• Build a mock up and test it with your team
– Yes, you need to judge some tasks
• Incorporate feedback and run a test on MTurk
with a very small data set
– Time the experiment
– Do people understand the task?
• Analyze results
– Look for spammers
– Check completion times
• Iterate and modify accordingly

Implementation – II
• Introduce quality control
– Qualification test
– Gold answers (honey pots)
• Adjust passing grade and worker approval rate
• Run experiment with new settings & same data
• Scale on data
• Scale on workers


Experiment in Production
• Lots of tasks on MTurk at any moment
• Need to grab attention
• Importance of experiment metadata
• When to schedule
– Split a large task into batches and have 1 single
batch in the system
– Always review feedback from batch n before
uploading n+1
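
The batching advice above can be sketched as a small driver loop: one batch in the system at a time, with a review step before the next upload. `publish` and `review` are hypothetical callbacks standing in for whatever platform client and inspection process you use; nothing here calls a real MTurk API.

```python
from typing import Callable, Iterator, List, Sequence


def make_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def run_in_batches(items: Sequence[str], batch_size: int,
                   publish: Callable[[List[str]], list],
                   review: Callable[[list], bool]) -> int:
    """Publish one batch at a time; stop as soon as a review fails."""
    completed = 0
    for batch in make_batches(items, batch_size):
        results = publish(batch)      # assumed to block until the batch is done
        completed += 1
        if not review(results):       # inspect feedback from batch n before n+1
            break
    return completed


# Stand-ins for a real platform client and a manual review step.
def fake_publish(batch: List[str]) -> list:
    return [(doc, "R") for doc in batch]


def fake_review(results: list) -> bool:
    return True


docs = [f"doc{i}" for i in range(7)]
print(run_in_batches(docs, 3, fake_publish, fake_review), "batches completed")
```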


Other design principles
• Text alignment
• Legibility
• Reading level: complexity of words and sentences
• Attractiveness (worker’s attention & enjoyment)
• Multi-cultural / multi-lingual
• Who is the audience (e.g. target worker community)
– Special needs communities (e.g. simple color blindness)
• Parsimony
• Cognitive load: mental rigor needed to perform task
• Exposure effect

The human side
• As a worker
– I hate when instructions are not clear
– I’m not a spammer – I just don’t get what you want
– Boring task
– Good pay is ideal but not the only condition for engagement
• As a requester
– Attrition
– Balancing act: a task that would produce the right results and
is appealing to workers
– I want your honest answer for the task
– I want qualified workers; system should do some of that for me
• Managing crowds and tasks is a daily activity
– more difficult than managing computers

Things that work
• Qualification tests
• Honey-pots
• Good content and good presentation
• Economy of attention
• Things to improve
– Manage workers in different levels of expertise
including spammers and potential cases.
– Mix different pools of workers based on different
profile and expertise levels.


Things that need work
• UX and guidelines
– Help the worker
– Cost of interaction
• Scheduling and refresh rate
• Exposure effect
• Sometimes we just don’t agree
• How crowdsourcable is your task


RELEVANCE JUDGING & CROWDSOURCING


Motivating Example: Relevance Judging

• Relevance of search results is difficult to judge
– Highly subjective
– Expensive to measure
• Professional editors commonly used
• Potential benefits of crowdsourcing
– Scalability (time and cost)
– Diversity of judgments


Started with a joke …


Results for {idiot} at WSDM 2011
February 2011: 5/7 (R), 2/7 (NR)
– Most of the time those TV reality stars have absolutely no talent. They do whatever
they can to make a quick dollar. Most of the time the reality tv stars don not have
a mind of their own. R
– Most are just celebrity wannabees. Many have little or no talent, they just want
fame. R
– I can see this one going both ways. A particular sort of reality star comes to
mind, though, one who was voted off Survivor because he chose not to use his
immunity necklace. Sometimes the label fits, but sometimes it might be unfair. R
– Just because someone else thinks they are an "idiot", doesn't mean that is what the
word means. I don't like to think that any one person's photo would be used to
describe a certain term. NR
– While some reality-television stars are genuinely stupid (or cultivate an image of
stupidity), that does not mean they can or should be classified as "idiots." Some
simply act that way to increase their TV exposure and potential earnings. Other
reality-television stars are really intelligent people, and may be considered as
idiots by people who don't like them or agree with them. It is too subjective an
issue to be a good result for a search engine. NR
– Have you seen the knuckledraggers on reality television? They should be required to
change their names to idiot after appearing on the show. You could put numbers
after the word idiot so we can tell them apart. R
– Although I have not followed too many of these shows, those that I have encountered
have for a great part a very common property. That property is that most of the
participants involved exhibit a shallow self-serving personality that borders on
social pathological behavior. To perform or act in such an abysmal way could only
be an act of an idiot. R

Two Simple Examples of MTurk
1. Ask workers to classify a query
2. Ask workers to judge document relevance

Steps
• Define high-level task
• Design & implement interface & backend
• Launch, monitor progress, and assess work
• Iterate design


Query Classification Task
• Ask the user to classify a query
• Show a form that contains a few categories
• Upload a few queries (~20)
• Use 3 workers


Relevance Judging Task
• Use a few documents from a standard
collection used for evaluating search engines
• Ask user to make binary judgments
• Modification: graded judging
• Use 5 workers


Content quality
• People like to work on things that they like
• TREC ad-hoc vs. INEX
– TREC experiments took twice as long to complete
– INEX (Wikipedia), TREC (LA Times, FBIS)
• Topics
– INEX: Olympic games, movies, salad recipes, etc.
– TREC: cosmic events, Schengen agreement, etc.
• Content and judgments according to modern times
– Airport security docs are pre 9/11
– Antarctic exploration (global warming)


Content quality - II
• Document length
• Randomize content
• Avoid worker fatigue
– Judging 100 documents on the same subject can
be tiring, leading to decreasing quality


Presentation
• People scan documents for relevance cues
• Document design
• Highlighting no more than 10%


Presentation - II


Relevance justification
• Why settle for a label?
• Let workers justify answers
– cf. Zaidan et al. (2007) “annotator rationales”
• INEX
– 22% of assignments with comments
• Must be optional
• Let’s see how people justify


“Relevant” answers
[Salad Recipes]
Doesn't mention the word 'salad', but the recipe is one that could be considered a
salad, or a salad topping, or a sandwich spread.
Egg salad recipe
Egg salad recipe is discussed.
History of salad cream is discussed.
Includes salad recipe
It has information about salad recipes.
Potato Salad
Potato salad recipes are listed.
Recipe for a salad dressing.
Salad Recipes are discussed.
Salad cream is discussed.
Salad info and recipe
The article contains a salad recipe.
The article discusses methods of making potato salad.
The recipe is for a dressing for a salad, so the information is somewhat narrow for
the topic but is still potentially relevant for a researcher.
This article describes a specific salad. Although it does not list a specific recipe,
it does contain information relevant to the search topic.
gives a recipe for tuna salad
relevant for tuna salad recipes
relevant to salad recipes
this is on-topic for salad recipes


“Not relevant” answers
[Salad Recipes]
About gaming not salad recipes.
Article is about Norway.
Article is about Region Codes.
Article is about forests.
Article is about geography.
Document is about forest and trees.
Has nothing to do with salad or recipes.
Not a salad recipe
Not about recipes
Not about salad recipes
There is no recipe, just a comment on how salads fit into meal formats.
There is nothing mentioned about salads.
While dressings should be mentioned with salads, this is an article on one specific
type of dressing, no recipe for salads.
article about a swiss tv show
completely off-topic for salad recipes
not a salad recipe
not about salad recipes
totally off base


Feedback length

• Workers will justify answers
• Has to be optional for good
feedback
• In E51, mandatory comments
– Length dropped
– “Relevant” or “Not Relevant”


Was the task difficult?
• Ask workers to rate difficulty of a search topic
• 50 topics; 5 workers, $0.01 per task


QUALITY ASSURANCE


When to assess quality of work
• Beforehand (prior to main task activity)
– How: “qualification tests” or similar mechanism
– Purpose: screening, selection, recruiting, training
• During
– How: assess labels as worker produces them
• Like random checks on a manufacturing line
– Purpose: calibrate, reward/penalize, weight
• After
– How: compute accuracy metrics post-hoc
– Purpose: filter, calibrate, weight, retain (HR)
– E.g. Jung & Lease (2011), Tang & Lease (2011), ...

How do we measure work quality?
• Compare worker’s label vs.
– Known (correct, trusted) label
– Other workers’ labels
• P. Ipeirotis. Worker Evaluation in Crowdsourcing: Gold Data or
Multiple Workers? Sept. 2010.
– Model predictions of the above
• Model the labels (Ryu & Lease, ASIS&T’11)
• Model the workers (Chen et al., AAAI’10)
• Verify worker’s label
– Yourself
– Tiered approach (e.g. Find-Fix-Verify)
• Quinn and B. Bederson’09, Bernstein et al.’10

Typical Assumptions
• Objective truth exists
– no minority voice / rare insights
– Can relax this to model “truth distribution”
• Automatic answer comparison/evaluation
– What about free text responses? Hope from NLP…
• Automatic essay scoring
• Translation (BLEU: Papineni, ACL’2002)
• Summarization (Rouge: C.Y. Lin, WAS’2004)
– Have people do it (yourself or find-verify crowd, etc.)

Distinguishing Bias vs. Noise
• Ipeirotis (HComp 2010)
• People often have consistent, idiosyncratic
skews in their labels (bias)
– E.g. I like action movies, so they get higher ratings
• Once detected, systematic bias can be
calibrated for and corrected (yeah!)
• Noise, however, seems random & inconsistent
– this is the real issue we want to focus on
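
Correcting systematic bias can be illustrated with a deliberately simple model. The sketch below assumes each worker's bias is a constant additive offset on a numeric rating scale, estimated from how far their ratings sit from the per-item mean. This is an illustrative assumption, not the method from the cited work, and all names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def worker_offsets(ratings):
    """ratings: list of (worker, item, score). Return worker -> mean offset."""
    by_item = defaultdict(list)
    for _, item, score in ratings:
        by_item[item].append(score)
    item_mean = {item: mean(scores) for item, scores in by_item.items()}

    diffs = defaultdict(list)
    for worker, item, score in ratings:
        diffs[worker].append(score - item_mean[item])
    return {worker: mean(d) for worker, d in diffs.items()}


def calibrate(ratings):
    """Subtract each worker's estimated offset from that worker's scores."""
    offsets = worker_offsets(ratings)
    return [(w, item, score - offsets[w]) for w, item, score in ratings]


# Worker "a" consistently rates one point high, "b" one point low.
ratings = [("a", "x", 5.0), ("a", "y", 4.0),
           ("b", "x", 3.0), ("b", "y", 2.0),
           ("c", "x", 4.0), ("c", "y", 3.0)]
offsets = worker_offsets(ratings)
calibrated = calibrate(ratings)
print(offsets)
```

Random noise has no stable offset to subtract, which is why it is the harder problem of the two.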


Comparing to known answers
• AKA: gold, honey pot, verifiable answer, trap
• Assumes you have known answers
• Cost vs. Benefit
– Producing known answers (experts?)
– % of work spent re-producing them
• Finer points
– Controls against collusion
– What if workers recognize the honey pots?
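
A minimal sketch of gold-based checking: score each worker only on the items with known answers, then keep workers above a passing grade. The data layout, the function names, and the 0.7 threshold are assumptions chosen for illustration.

```python
def gold_accuracy(labels, gold):
    """labels: worker -> {item: label}; gold: {item: known label}.

    Return worker -> accuracy on the gold items that worker judged.
    """
    scores = {}
    for worker, answers in labels.items():
        checked = [item for item in answers if item in gold]
        if checked:  # workers who saw no gold items get no score
            correct = sum(answers[item] == gold[item] for item in checked)
            scores[worker] = correct / len(checked)
    return scores


def trusted_workers(labels, gold, passing_grade=0.7):
    """Return the set of workers whose gold accuracy meets the passing grade."""
    return {w for w, acc in gold_accuracy(labels, gold).items()
            if acc >= passing_grade}


gold = {"g1": "R", "g2": "NR"}
labels = {
    "w1": {"d1": "R", "d2": "NR", "g1": "R", "g2": "NR"},
    "w2": {"d1": "R", "g1": "NR", "g2": "R"},
}
scores = gold_accuracy(labels, gold)
trusted = trusted_workers(labels, gold)
print(scores, trusted)
```

The fraction of gold items mixed into each batch is the cost side of the trade-off listed above.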


Comparing to other workers
• AKA: consensus, plurality, redundant labeling
• Well-known metrics for measuring agreement
• Cost vs. Benefit: % of work that is redundant
• Finer points
– Is consensus “truth” or systematic bias of group?
– What if no one really knows what they’re doing?
• Low agreement across workers often indicates a problem with the
task (or a specific example), not the workers
– Risk of collusion
• Sheng et al. (KDD 2008)
August 12, 2012 159

Comparing to predicted label
• Ryu & Lease, ASIS&T’11 (CrowdConf’11 poster)
• Catch-22 extremes
– If model is really bad, why bother comparing?
– If model is really good, why collect human labels?
• Exploit model confidence
– Trust predictions proportional to confidence
– What if model very confident and wrong?
• Active learning
– Time sensitive: Accuracy / confidence changes
August 12, 2012 160

Compare to predicted worker labels
• Chen et al., AAAI’10
• Avoid inefficiency of redundant labeling
– See also: Dekel & Shamir (COLT’2009)
• Train a classifier for each worker
• For each example labeled by a worker
– Compare to predicted labels for all other workers
• Issues
• Sparsity: workers have to stick around to train model…
• Time-sensitivity: New workers & incremental updates?

August 12, 2012 161

Methods for measuring agreement
• What to look for
– Agreement, reliability, validity
• Inter-agreement level
– Agreement between judges
– Agreement between judges and the gold set
• Some statistics
– Percentage agreement
– Cohen’s kappa (2 raters)
– Fleiss’ kappa (any number of raters)
– Krippendorff’s alpha
• With majority vote, what if 2 say relevant, 3 say not?
– Use expert to break ties (Kochhar et al, HCOMP’10; GQR)
– Collect more judgments as needed to reduce uncertainty
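For two judges, percentage agreement and Cohen's kappa can be computed directly from the paired labels. The sketch below uses the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_e is the agreement expected by chance from each judge's label frequencies; the example labels are made up.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which the two judges give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa for two judges labeling the same items in the same order."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: both judges independently pick the same category.
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    if p_e == 1.0:  # both judges used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)

judge_a = ["rel", "rel", "non", "non", "rel", "non", "rel", "non", "rel", "rel"]
judge_b = ["rel", "non", "non", "non", "rel", "non", "rel", "rel", "rel", "rel"]
agreement = percent_agreement(judge_a, judge_b)  # 8 of 10 items match
kappa = cohens_kappa(judge_a, judge_b)
```

Here raw agreement is 0.8 but kappa is noticeably lower, because with a 60/40 label split the judges would agree about half the time by chance.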
August 12, 2012 162

Inter-rater reliability
• Lots of research
• Statistics books cover most of the material
• Three categories based on the goals
– Consensus estimates
– Consistency estimates
– Measurement estimates

August 12, 2012 163

Sample code
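– R packages psy and irr
# Fleiss' kappa (irr): rows = items, columns = raters
> library(irr)
> my_data <- read.delim(file="test.txt", header=TRUE, sep="\t")
> kappam.fleiss(my_data, exact=FALSE)

# Cohen's kappa (psy): exactly two rater columns
> library(psy)
> my_data2 <- read.delim(file="test2.txt", header=TRUE, sep="\t")
> ckappa(my_data2)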
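For readers without R, Fleiss' kappa is short enough to compute by hand. This is a sketch of the textbook formula, assuming the same layout as the R input (one row per item, one column per rater, every item judged by the same number of raters); it is not a port of the irr implementation.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """ratings: list of rows, one per item; each row holds one label per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number of ratings")
    category_totals = Counter()
    p_bar = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Per-item agreement: share of rater pairs that agree on this item.
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_bar /= n_items
    # Chance agreement from the overall category proportions.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical judgments: 4 documents, 3 raters each.
table = [
    ["rel", "rel", "rel"],
    ["rel", "rel", "non"],
    ["non", "non", "non"],
    ["rel", "non", "non"],
]
kappa_f = fleiss_kappa(table)
```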

August 12, 2012 164

k coefficient
• Different interpretations of k
• For practical purposes you need to be >= moderate
• Results may vary
k             Interpretation
< 0           Poor agreement
0.00 – 0.20   Slight agreement
0.21 – 0.40   Fair agreement
0.41 – 0.60   Moderate agreement
0.61 – 0.80   Substantial agreement
0.81 – 1.00   Almost perfect agreement

August 12, 2012 165

Detection Theory
• Sensitivity measures
– High sensitivity: good ability to discriminate
– Low sensitivity: poor ability
Stimulus class   “Yes”           “No”
S1               Hits            Misses
S2               False alarms    Correct rejections

Hit rate H = P(“yes”|S1)
False alarm rate F = P(“yes”|S2)
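One common sensitivity measure built from these two rates is d′ = z(H) − z(F), where z is the inverse of the standard normal CDF. The slide does not name d′; it is added here as the standard detection-theory index, and the counts below are invented for illustration.

```python
from statistics import NormalDist

def rates(hits, misses, false_alarms, correct_rejections):
    """Hit rate and false-alarm rate from the four cells of the table."""
    h = hits / (hits + misses)
    f = false_alarms / (false_alarms + correct_rejections)
    return h, f

def d_prime(h, f):
    """Sensitivity index d' = z(H) - z(F).
    inv_cdf raises StatisticsError when a rate is exactly 0 or 1,
    so such rates need an adjustment before calling this."""
    z = NormalDist().inv_cdf
    return z(h) - z(f)

# Hypothetical worker: 50 relevant and 50 non-relevant documents judged.
h, f = rates(hits=40, misses=10, false_alarms=10, correct_rejections=40)
sensitivity = d_prime(h, f)
```

A worker who answers "yes" to everything gets a high hit rate but an equally high false-alarm rate, so d′ stays near zero; that is the point of separating sensitivity from response bias.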

August 12, 2012 166

Finding Consensus
• When multiple workers disagree on the
correct label, how do we resolve this?
– Simple majority vote (or average and round)
– Weighted majority vote (e.g. naive Bayes)
• Many papers from machine learning…
• If wide disagreement, likely there is a bigger
problem which consensus doesn’t address
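The two voting schemes can be sketched side by side. The weighted version below uses log-odds of an estimated per-worker accuracy as the vote weight, which is what a naive Bayes combination reduces to under the assumption of independent workers with symmetric errors; the accuracies, the `default` prior, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def majority_vote(votes):
    """votes: list of (worker, label). Returns the top label, or None on a tie."""
    ranked = Counter(label for _, label in votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def weighted_vote(votes, accuracy, default=0.6):
    """Each vote counts log(p / (1 - p)), with p the worker's estimated accuracy.
    Unknown workers get the assumed `default` accuracy."""
    scores = defaultdict(float)
    for worker, label in votes:
        p = min(max(accuracy.get(worker, default), 0.01), 0.99)  # avoid log(0)
        scores[label] += math.log(p / (1 - p))
    return max(scores, key=scores.get)

votes = [("w1", "rel"), ("w2", "non"), ("w3", "non")]
accuracy = {"w1": 0.95, "w2": 0.6, "w3": 0.6}  # e.g. estimated from gold items
simple = majority_vote(votes)
weighted = weighted_vote(votes, accuracy)
```

In this example one highly reliable worker outweighs two marginal ones, so the two schemes disagree; which answer is right still depends on how trustworthy the accuracy estimates are.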

August 12, 2012 168

Quality Control on MTurk
• Rejecting work & Blocking workers (more later…)
– Requesters don’t want bad PR or complaint emails
– Common practice: always pay, block as needed
• Approval rate: easy to use, but value?
– P. Ipeirotis. Be a Top Mechanical Turk Worker: You Need $5
and 5 Minutes. Oct. 2010
– Many requesters never reject…
• Qualification test
– Pre-screen workers’ capabilities & effectiveness
– Example and pros/cons in next slides…
• Geographic restrictions
• Mechanical Turk Masters (June 23, 2011)
– Recent addition, degree of benefit TBD…
August 12, 2012 169

Quality Control in General
• Extremely important part of the experiment
• Approach as “overall” quality; not just for workers
• Bi-directional channel
– You may think the worker is doing a bad job.
– The same worker may think you are a lousy requester.

August 12, 2012 171

Tools and Packages for MTurk
• QA infrastructure layers atop MTurk promote
useful separation-of-concerns from task
– TurKit
• quikTurKit provides nearly real-time services
– Turkit-online (??)
– Get Another Label (& qmturk)
– Turk Surveyor
– cv-web-annotation-toolkit (image labeling)
– Soylent
– Boto (python library)
• Turkpipe: submit batches of jobs using the command line.
• More needed…
August 12, 2012 172

A qualification test snippet
<Question>
  <QuestionIdentifier>question1</QuestionIdentifier>
  <QuestionContent>
    <Text>Carbon monoxide poisoning is</Text>
  </QuestionContent>
  <AnswerSpecification>
    <SelectionAnswer>
      <StyleSuggestion>radiobutton</StyleSuggestion>
      <Selections>
        <Selection>
          <SelectionIdentifier>1</SelectionIdentifier>
          <Text>A chemical technique</Text>
        </Selection>
        <Selection>
          <SelectionIdentifier>2</SelectionIdentifier>
          <Text>A green energy treatment</Text>
        </Selection>
        <Selection>
          <SelectionIdentifier>3</SelectionIdentifier>
          <Text>A phenomenon associated with sports</Text>
        </Selection>
        <Selection>
          <SelectionIdentifier>4</SelectionIdentifier>
          <Text>None of the above</Text>
        </Selection>
      </Selections>
    </SelectionAnswer>
  </AnswerSpecification>
</Question>
August 12, 2012 173

Qualification tests: pros and cons
• Advantages
– Great tool for controlling quality
– Adjust passing grade
• Disadvantages
– Extra cost to design and implement the test
– May turn off workers, hurt completion time
– Refresh the test on a regular basis
– Hard to verify subjective tasks like judging relevance
• Try creating task-related questions to get worker
familiar with task before starting task in earnest
August 12, 2012 174

More on quality control & assurance
• HR issues: recruiting, selection, & retention
– e.g., post/tweet, design a better qualification test,
bonuses, …
• Collect more redundant judgments…
– at some point defeats cost savings of
crowdsourcing
– 5 workers is often sufficient
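A back-of-the-envelope model shows why returns diminish. Assuming binary labels, independent workers, and a single shared accuracy p (all simplifying assumptions, not claims from the slides), the chance that a majority of n workers is correct is a binomial tail sum.

```python
from math import comb

def majority_correct(n, p):
    """P(majority of n independent workers is right), each right with prob p.
    n must be odd so that ties cannot occur."""
    if n % 2 == 0:
        raise ValueError("use an odd number of workers")
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

# Hypothetical per-worker accuracy of 0.7.
curve = {n: majority_correct(n, 0.7) for n in (1, 3, 5, 7, 9)}
```

With p = 0.7, three workers give 0.784 and five give about 0.837; each further pair of workers adds less than the pair before, while cost grows linearly.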

August 12, 2012 175

Robots and Captchas
• Some reports of robots on MTurk
– E.g. McCreadie et al. (2011)
– violation of terms of service
– Artificial artificial artificial intelligence
• Captchas seem ideal, but…
– There is abuse of robots using turkers to solve captchas so
they can access web resources
– Turker wisdom is therefore to avoid such HITs
• What to do?
– Use standard captchas, notify workers
– Block robots other ways (e.g. external HITs)
– Catch robots through standard QC, response times
– Use HIT-specific captchas (Kazai et al., 2011)
August 12, 2012 176

Other quality heuristics
• Justification/feedback as quasi-captcha
– Successfully proven in past experiments
– Should be optional
– Automatically verifying feedback was written by a
person may be difficult (classic spam detection task)
• Broken URL/incorrect object
– Leave an outlier in the data set
– Workers will tell you
– If somebody answers “excellent” on a graded
relevance test for a broken URL => probably spammer

August 12, 2012 177

Dealing with bad workers
• Pay for “bad” work instead of rejecting it?
– Pro: preserve reputation, admit if poor design at fault
– Con: promote fraud, undermine approval rating system
• Use bonus as incentive
– Pay the minimum $0.01 and $0.01 for bonus
– Better than rejecting a $0.02 task
• If spammer “caught”, block from future tasks
– May be easier to always pay, then block as needed

August 12, 2012 178

Worker feedback
• Real feedback received via email after rejection
• Worker XXX
I did. If you read these articles most of them have
nothing to do with space programs. I’m not an idiot.

• Worker XXX
As far as I remember there wasn't an explanation about
what to do when there is no name in the text. I believe I
did write a few comments on that, too. So I think you're
being unfair rejecting my HITs.

August 12, 2012 179

Real email exchange with worker after rejection
WORKER: this is not fair , you made me work for 10 cents and i lost my 30 minutes
of time ,power and lot more and gave me 2 rejections at least you may keep it
pending. please show some respect to turkers

REQUESTER: I'm sorry about the rejection. However, in the directions given in the
hit, we have the following instructions: IN ORDER TO GET PAID, you must judge all 5
webpages below *AND* complete a minimum of three HITs.

Unfortunately, because you only completed two hits, we had to reject those hits.
We do this because we need a certain amount of data on which to make decisions
about judgment quality. I'm sorry if this caused any distress. Feel free to contact me
if you have any additional questions or concerns.

WORKER: I understood the problems. At that time my kid was crying and i went to
look after. that's why i responded like that. I was very much worried about a hit
being rejected. The real fact is that i haven't seen that instructions of 5 web page
and started doing as i do the dolores labs hit, then someone called me and i went
to attend that call. sorry for that and thanks for your kind concern.
August 12, 2012 180

Exchange with worker
• Worker XXX
Thank you. I will post positive feedback for you at
Turker Nation.

Me: was this a sarcastic comment?

• I took a chance by accepting some of your HITs to see if
you were a trustworthy author. My experience with you
has been favorable so I will put in a good word for you
on that website. This will help you get higher quality
applicants in the future, which will provide higher
quality work, which might be worth more to you, which
hopefully means higher HIT amounts in the future.

August 12, 2012 181

Build Your Reputation as a Requester
• Word of mouth effect
– Workers trust the requester (pay on time, clear
explanation if there is a rejection)
– Experiments tend to go faster
– Announce forthcoming tasks (e.g. tweet)
• Disclose your real identity?

August 12, 2012 182

Other practical tips
• Sign up as worker and do some HITs
• “Eat your own dog food”
• Monitor discussion forums
• Address feedback (e.g., poor guidelines,
payments, passing grade, etc.)
• Everything counts!
– Overall design only as strong as weakest link

August 12, 2012 183

Conclusions
• But one may say “this is all good but looks like
a ton of work”
• The original goal: data is king
• Data quality and experimental designs are
preconditions to make sure we get the right
stuff
• Data will later be used for rankers, ML
models, evaluations, etc.
• Don’t cut corners
August 12, 2012 184

THE ROAD AHEAD

August 12, 2012 185

What about sensitive data?
• Not all data can be publicly disclosed
– User data (e.g. AOL query log, Netflix ratings)
– Intellectual property
– Legal confidentiality
• Need to restrict who is in your crowd
– Separate channel (workforce) from technology
– Hot question for adoption at enterprise level

August 12, 2012 186

Wisdom of Crowds (WoC)
Requires
• Diversity
• Independence
• Decentralization
• Aggregation

Input: large, diverse sample
(to increase likelihood of overall pool quality)
Output: consensus or selection (aggregation)
August 12, 2012 187

WoC vs. Ensemble Learning
• Combine multiple models to improve performance
over any constituent model
– Can use many weak learners to make a strong one
– Compensate for poor models with extra computation
• Works better with diverse, independent learners
• cf. NIPS 2010-2011 Workshops
– Computational Social Science & the Wisdom of Crowds
• More investigation needed of traditional feature-
based machine learning & ensemble methods for
consensus labeling with crowdsourcing
August 12, 2012 188

Active Learning
• Minimize number of labels to achieve goal
accuracy rate of classifier
– Select examples to label to maximize learning
• Vijayanarasimhan and Grauman (CVPR 2011)
– Simple margin criteria: select maximally uncertain
examples to label next

– Finding which examples are uncertain can be
computationally intensive (workers have to wait)
– Use locality-sensitive hashing to find uncertain
examples in sub-linear time
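The simple margin criterion can be illustrated with a linear classifier: the most uncertain examples are those closest to the decision boundary. The sketch below does a brute-force linear scan over the pool, which is exactly the cost that the hashing approach above is meant to avoid; it does not implement locality-sensitive hashing, and the weights and pool are made up.

```python
def margin(w, b, x):
    """Unnormalized distance of x from the hyperplane w.x + b = 0."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b)

def select_most_uncertain(w, b, pool, k):
    """Indices of the k pool examples with the smallest margin (linear scan)."""
    return sorted(range(len(pool)), key=lambda i: margin(w, b, pool[i]))[:k]

# Hypothetical current model and unlabeled pool.
w, b = [1.0, -1.0], 0.0
pool = [[3.0, 0.0], [1.0, 0.75], [0.0, 2.5], [0.5, 0.6]]
picked = select_most_uncertain(w, b, pool, k=2)  # send these to the crowd next
```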
August 12, 2012 189

Active Learning (2)
• V&G report each learning iteration ~ 75 min
– 15 minutes for model training & selection
– 60 minutes waiting for crowd labels
• Leaving workers idle may lose them, slowing
uptake and completion times
• Keep workers occupied
– Mason and Suri (2010): paid waiting room
– Laws et al. (EMNLP 2011): parallelize labeling and
example selection via producer-consumer model
• Workers consume examples, produce labels
• Model consumes label, produces examples
August 12, 2012 190

Query execution
• So you want to combine CPU + HPU in a DB?
• Crowd can answer difficult queries
• Query processing with human computation
• Long term goal
– When to switch from CPU to HPU and vice versa

August 12, 2012 191

MapReduce with human computation
• Commonalities
– Large task divided into smaller sub-problems
– Work distributed among worker nodes (workers)
– Collect all answers and combine them
– Varying performance of heterogeneous
CPUs/HPUs
• Variations
– Human response latency / size of “cluster”
– Some tasks are not suitable

August 12, 2012 192

A Few Questions
• How should we balance automation vs.
human computation? Which does what?

• Who’s the right person for the job?

• How do we handle complex tasks? Can we
decompose them into smaller tasks? How?

August 12, 2012 193

Research problems – operational
• Methodology
– Budget, people, documents, queries, presentation,
incentives, etc.
– Scheduling
– Quality
• What’s the best “mix” of HC for a task?
• What are the tasks suitable for HC?
• Can I crowdsource my task?
– Eickhoff and de Vries, WSDM 2011 CSDM Workshop

August 12, 2012 194

More problems
• Human factors vs. outcomes
• Editors vs. workers
• Pricing tasks
• Predicting worker quality from observable
properties (e.g. task completion time)
• HIT / Requester ranking or recommendation
• Expert search: who are the right workers given
task nature and constraints
• Ensemble methods for Crowd Wisdom consensus
August 12, 2012 195

Problems: crowds, clouds and algorithms
• Infrastructure
– Current platforms are very rudimentary
– No tools for data analysis
• Dealing with uncertainty (propagate rather than mask)
– Temporal and labeling uncertainty
– Learning algorithms
– Search evaluation
– Active learning (which example is likely to be labeled correctly)
• Combining CPU + HPU
– Human Remote Call?
– Procedural vs. declarative?
– Integration points with enterprise systems
August 12, 2012 196

Algorithms
• Bandit problems; explore-exploit
• Optimizing amount of work by workers
– Humans have limited throughput
– Harder to scale than machines
• Selecting the right crowds
• Stopping rule
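One simple stopping rule, offered only as an illustration: keep requesting labels for an item until the leading label is ahead by a fixed number of votes, or a budget cap is hit. The parameters `lead` and `max_labels` are hypothetical knobs, not values recommended by the slides.

```python
from collections import Counter

def collect_until_confident(label_stream, lead=2, max_labels=7):
    """Consume labels one at a time; stop once the top label leads the
    runner-up by `lead` votes or `max_labels` labels have been used.
    Returns (winning_label, labels_used)."""
    counts = Counter()
    for used, label in enumerate(label_stream, start=1):
        counts[label] += 1
        ranked = counts.most_common(2)
        top = ranked[0][1]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if top - runner_up >= lead or used >= max_labels:
            return ranked[0][0], used
    # Stream ran dry before either condition was met.
    winner = counts.most_common(1)[0][0] if counts else None
    return winner, sum(counts.values())

easy = collect_until_confident(iter(["rel", "rel", "non"]))
hard = collect_until_confident(iter(["rel", "non", "non", "rel", "non", "non"]))
```

Easy items stop after two labels while contested items consume more of the budget, which is the adaptive alternative to a fixed redundancy level for every item.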

August 12, 2012 197

BROADER CONSIDERATIONS:
ETHICS, ECONOMICS, REGULATION

August 12, 2012 198

What about ethics?
• Silberman, Irani, and Ross (2010)
– “How should we… conceptualize the role of these
people who we ask to power our computing?”
– Power dynamics between parties
• What are the consequences for a worker
when your actions harm their reputation?
– “Abstraction hides detail”

• Fort, Adda, and Cohen (2011)
– “…opportunities for our community to deliberately
value ethics above cost savings.”
August 12, 2012 199

Example: SamaSource

August 12, 2012 200

Davis et al. (2010) The HPU.

HPU

August 12, 2012 201

HPU: “Abstraction hides detail”
• Not just turning a mechanical crank

August 12, 2012 202

Micro-tasks & Task Decomposition
• Small, simple tasks can be completed faster by
reducing extraneous context and detail
– e.g. “Can you name who is in this photo?”

• Current workflow research investigates how to
decompose complex tasks into simpler ones
August 12, 2012 203

Context & Informed Consent

• What is the larger task I’m contributing to?
• Who will benefit from it and how?
August 12, 2012 204

What about the regulation?
• Wolfson & Lease (ASIS&T 2011)
• As usual, technology is ahead of the law
– employment law
– patent inventorship
– data security and the Federal Trade Commission
– copyright ownership
– securities regulation of crowdfunding
• Take-away: don’t panic, but be mindful
– Understand risks of “just in-time compliance”

August 12, 2012 205

Digital Dirty Jobs
• NY Times: Policing the Web’s Lurid Precincts
• Gawker: Facebook
content moderation
• CultureDigitally: The dirty job
of keeping Facebook clean

August 12, 2012 206

Jeff Howe Vision vs. Reality?
• Vision of empowering worker freedom:
– work whenever you want for whomever you want
• When $$$ is at stake, populations at risk may
be compelled to perform work by others
– Digital sweat shops? Digital slaves?
– We really don’t know (and need to learn more…)
– Traction? Human Trafficking at MSR Summit’12

August 12, 2012 207

A DARKER SIDE TO CROWDSOURCING
& HUMAN COMPUTATION

August 12, 2012 208

Putting the shoe on the other foot:
Spam

August 12, 2012 209

What about trust?
• Some reports of robot “workers” on MTurk
– E.g. McCreadie et al. (2011)
– Violates terms of service
• Why not just use a captcha?

August 12, 2012 210

Captcha Fraud

August 12, 2012 211

Requester Fraud on MTurk
“Do not do any HITs that involve: filling in
CAPTCHAs; secret shopping; test our web page;
test zip code; free trial; click my link; surveys or
quizzes (unless the requester is listed with a
smiley in the Hall of Fame/Shame); anything
that involves sending a text message; or
basically anything that asks for any personal
information at all—even your zip code. If you
feel in your gut it’s not on the level, IT’S NOT.
Why? Because they are scams...”
August 12, 2012 212

Defeating CAPTCHAs with crowds

August 12, 2012 213

WWW’12

August 12, 2012 215

Robert Sim, MSR Summit’12

August 12, 2012 216

Conclusions
• Crowdsourcing works and is here to stay
• Fast turnaround, easy to experiment, cheap
• Still have to design the experiments carefully!
• Usability considerations
• Worker quality
• User feedback extremely useful

August 12, 2012 217

Conclusions - II
• Lots of opportunities to improve current platforms
• Integration with current systems
• While MTurk was first to market in the micro-task vertical,
many other vendors are emerging with different
affordances or value-added features

• Many open research problems …

August 12, 2012 218

Conclusions – III
• Important to know your limitations and be
ready to collaborate
• Lots of different skills and expertise required
– Social/behavioral science
– Human factors
– Algorithms
– Economics
– Distributed systems
– Statistics

August 12, 2012 219

REFERENCES & RESOURCES

August 12, 2012 220

Surveys
• Ipeirotis, Panagiotis G., R. Chandrasekar, and P. Bennett. (2009).
“A report on the human computation workshop (HComp).” ACM
SIGKDD Explorations Newsletter 11(2).

• Alex Quinn and Ben Bederson. Human Computation: A Survey
and Taxonomy of a Growing Field. In Proceedings of CHI 2011.

• Law and von Ahn (2011). Human Computation

August 12, 2012 221

2013 Events Planned
Research events
• 1st year of HComp as AAAI conference
• 2nd annual Collective Intelligence?

Industrial Events
• 4th CrowdConf (San Francisco, Fall)
• 1st Crowdsourcing Week (Singapore, April)

August 12, 2012 222

TREC Crowdsourcing Track
• Year 1 (2011) – horizontals
– Task 1 (hci): collect crowd relevance judgments
– Task 2 (stats): aggregate judgments
– Organizers: Kazai & Lease
– Sponsors: Amazon, CrowdFlower

• Year 2 (2012) – content types
– Task 1 (text): judge relevance
– Task 2 (images): judge relevance
– Organizers: Ipeirotis, Kazai, Lease, & Smucker
– Sponsors: Amazon, CrowdFlower, MobileWorks
August 12, 2012 223

2012 Workshops & Conferences
• AAAI: Human Computation (HComp) (July 22-23)
• AAAI Spring Symposium: Wisdom of the Crowd (March 26-28)
• ACL: 3rd Workshop of the People's Web meets NLP (July 12-13)
• AMCIS: Crowdsourcing Innovation, Knowledge, and Creativity in Virtual Communities (August 9-12)
• CHI: CrowdCamp (May 5-6)
• CIKM: Multimodal Crowd Sensing (CrowdSens) (Oct. or Nov.)
• Collective Intelligence (April 18-20)
• CrowdConf 2012 -- 3rd Annual Conference on the Future of Distributed Work (October 23)
• CrowdNet - 2nd Workshop on Cloud Labor and Human Computation (Jan 26-27)
• EC: Social Computing and User Generated Content Workshop (June 7)
• ICDIM: Emerging Problem-specific Crowdsourcing Technologies (August 23)
• ICEC: Harnessing Collective Intelligence with Games (September)
• ICML: Machine Learning in Human Computation & Crowdsourcing (June 30)
• ICWE: 1st International Workshop on Crowdsourced Web Engineering (CroWE) (July 27)
• KDD: Workshop on Crowdsourcing and Data Mining (August 12)
• Multimedia: Crowdsourcing for Multimedia (Nov 2)
• SocialCom: Social Media for Human Computation (September 6)
• TREC-Crowd: 2nd TREC Crowdsourcing Track (Nov. 14-16)
• WWW: CrowdSearch: Crowdsourcing Web search (April 17)
August 12, 2012 224

Journal Special Issues 2012

– Springer’s Information Retrieval (articles now online):
Crowdsourcing for Information Retrieval

– IEEE Internet Computing (articles now online):
Crowdsourcing (Sept./Oct. 2012)

– Hindawi’s Advances in Multimedia Journal: Multimedia
Semantics Analysis via Crowdsourcing Geocontext

August 12, 2012 225

2011 Workshops & Conferences
• AAAI-HCOMP: 3rd Human Computation Workshop (Aug. 8)
• ACIS: Crowdsourcing, Value Co-Creation, & Digital Economy Innovation (Nov. 30 – Dec. 2)
• Crowdsourcing Technologies for Language and Cognition Studies (July 27)
• CHI-CHC: Crowdsourcing and Human Computation (May 8)
• CIKM: BooksOnline (Oct. 24, “crowdsourcing … online books”)
• CrowdConf 2011 -- 2nd Conf. on the Future of Distributed Work (Nov. 1-2)
• Crowdsourcing: Improving … Scientific Data Through Social Networking (June 13)
• EC: Workshop on Social Computing and User Generated Content (June 5)
• ICWE: 2nd International Workshop on Enterprise Crowdsourcing (June 20)
• Interspeech: Crowdsourcing for speech processing (August)
• NIPS: Second Workshop on Computational Social Science and the Wisdom of Crowds (Dec. TBD)
• SIGIR-CIR: Workshop on Crowdsourcing for Information Retrieval (July 28)
• TREC-Crowd: 1st TREC Crowdsourcing Track (Nov. 16-18)
• UbiComp: 2nd Workshop on Ubiquitous Crowdsourcing (Sep. 18)
• WSDM-CSDM: Crowdsourcing for Search and Data Mining (Feb. 9)
August 12, 2012 226

2011 Tutorials and Keynotes
• By Omar Alonso and/or Matthew Lease
– CLEF: Crowdsourcing for Information Retrieval Experimentation and Evaluation (Sep. 20, Omar only)
– CrowdConf: Crowdsourcing for Research and Engineering
– IJCNLP: Crowd Computing: Opportunities and Challenges (Nov. 10, Matt only)
– WSDM: Crowdsourcing 101: Putting the WSDM of Crowds to Work for You (Feb. 9)
– SIGIR: Crowdsourcing for Information Retrieval: Principles, Methods, and Applications (July 24)

• AAAI: Human Computation: Core Research Questions and State of the Art
– Edith Law and Luis von Ahn, August 7
• ASIS&T: How to Identify Ducks In Flight: A Crowdsourcing Approach to Biodiversity Research and
Conservation
– Steve Kelling, October 10, ebird
• EC: Conducting Behavioral Research Using Amazon's Mechanical Turk
– Winter Mason and Siddharth Suri, June 5
• HCIC: Quality Crowdsourcing for Human Computer Interaction Research
– Ed Chi, June 14-18 (about HCIC)
– Also see his: Crowdsourcing for HCI Research with Amazon Mechanical Turk
• Multimedia: Frontiers in Multimedia Search
– Alan Hanjalic and Martha Larson, Nov 28
• VLDB: Crowdsourcing Applications and Platforms
– Anhai Doan, Michael Franklin, Donald Kossmann, and Tim Kraska
• WWW: Managing Crowdsourced Human Computation
– Panos Ipeirotis and Praveen Paritosh
August 12, 2012 227

Thank You!
Crowdsourcing news & information:
ir.ischool.utexas.edu/crowd

For further questions, contact us at:
omar.alonso@microsoft.com
ml@ischool.utexas.edu

Cartoons by Mateo Burtch (buta@sonic.net)
August 12, 2012 228

Additional Literature Reviews
• Man-Ching Yuen, Irwin King, and Kwong-Sak
Leung. A Survey of Crowdsourcing Systems.
SocialCom 2011.
• A. Doan, R. Ramakrishnan, A. Halevy.
Crowdsourcing Systems on the World-Wide
Web. Communications of the ACM, 2011.

August 12, 2012 229

More Books
July 2010, Kindle-only: “This book introduces you to the
top crowdsourcing sites and outlines step by step with
photos the exact process to get started as a requester on
Amazon Mechanical Turk.”

August 12, 2012 230

Resources
A Few Blogs
• Behind Enemy Lines (P.G. Ipeirotis, NYU)
• Deneme: a Mechanical Turk experiments blog (Greg Little, MIT)
• CrowdFlower Blog
• http://experimentalturk.wordpress.com
• Jeff Howe

A Few Sites
• The Crowdsortium
• Crowdsourcing.org
• CrowdsourceBase (for workers)
• Daily Crowdsource

MTurk Forums and Resources
• Turker Nation: http://turkers.proboards.com
• http://www.turkalert.com (and its blog)
• Turkopticon: report/avoid shady requesters
• Amazon Forum for MTurk
August 12, 2012 231

Bibliography
• J. Barr and L. Cabrera. “AI gets a Brain”, ACM Queue, May 2006.
• Bernstein, M. et al. Soylent: A Word Processor with a Crowd Inside. UIST 2010. Best Student Paper award.
• Bederson, B.B., Hu, C., & Resnik, P. Translation by Iterative Collaboration between Monolingual Users, Proceedings of Graphics
Interface (GI 2010), 39-46.
• N. Bradburn, S. Sudman, and B. Wansink. Asking Questions: The Definitive Guide to Questionnaire Design, Jossey-Bass, 2004.
• C. Callison-Burch. “Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk”, EMNLP 2009.
• P. Dai, Mausam, and D. Weld. “Decision-Theoretic Control of Crowd-Sourced Workflows”, AAAI, 2010.
• J. Davis et al. “The HPU”, IEEE Computer Vision and Pattern Recognition Workshop on Advancing Computer Vision with Humans
in the Loop (ACVHL), June 2010.
• M. Gashler, C. Giraud-Carrier, T. Martinez. Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous, ICMLA 2008.
• D. A. Grier. When Computers Were Human. Princeton University Press, 2005. ISBN 0691091579
• S. Hacker and L. von Ahn. “Matchin: Eliciting User Preferences with an Online Game”, CHI 2009.
• J. Heer, M. Bostock. “Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design”, CHI 2010.
• P. Heymann and H. Garcia-Molina. “Human Processing”, Technical Report, Stanford Info Lab, 2010.
• J. Howe. “Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business”. Crown Business, New York, 2008.
• P. Hsueh, P. Melville, V. Sindhwani. “Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria”. NAACL HLT
Workshop on Active Learning and NLP, 2009.
• B. Huberman, D. Romero, and F. Wu. “Crowdsourcing, attention and productivity”. Journal of Information Science, 2009.
• P.G. Ipeirotis. The New Demographics of Mechanical Turk. March 9, 2010. PDF and Spreadsheet.
• P.G. Ipeirotis, R. Chandrasekar and P. Bennett. Report on the human computation workshop. SIGKDD Explorations v11 no 2 pp. 80-83, 2010.
• P.G. Ipeirotis. Analyzing the Amazon Mechanical Turk Marketplace. CeDER-10-04 (Sept. 11, 2010)

August 12, 2012 232

Bibliography (2)
• A. Kittur, E. Chi, and B. Suh. “Crowdsourcing user studies with Mechanical Turk”, SIGCHI 2008.
• Aniket Kittur, Boris Smus, Robert E. Kraut. CrowdForge: Crowdsourcing Complex Work. CHI 2011
• Adriana Kovashka and Matthew Lease. “Human and Machine Detection of … Similarity in Art”. CrowdConf 2010.
• K. Krippendorff. "Content Analysis", Sage Publications, 2003
• G. Little, L. Chilton, M. Goldman, and R. Miller. “TurKit: Tools for Iterative Tasks on Mechanical Turk”, HCOMP 2009.
• T. Malone, R. Laubacher, and C. Dellarocas. Harnessing Crowds: Mapping the Genome of Collective Intelligence.
2009.
• W. Mason and D. Watts. “Financial Incentives and the ’Performance of Crowds’”, HCOMP Workshop at KDD 2009.
• J. Nielsen. “Usability Engineering”, Morgan Kaufmann, 1994.
• A. Quinn and B. Bederson. “A Taxonomy of Distributed Human Computation”, Technical Report HCIL-2009-23, 2009
• J. Ross, L. Irani, M. Six Silberman, A. Zaldivar, and B. Tomlinson. “Who are the Crowdworkers?: Shifting
Demographics in Amazon Mechanical Turk”. CHI 2010.
• F. Scheuren. “What is a Survey” (http://www.whatisasurvey.info) 2004.
• R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. “Cheap and Fast But is it Good? Evaluating Non-Expert Annotations
for Natural Language Tasks”. EMNLP-2008.
• V. Sheng, F. Provost, P. Ipeirotis. “Get Another Label? Improving Data Quality … Using Multiple, Noisy Labelers”
KDD 2008.
• S. Weber. “The Success of Open Source”, Harvard University Press, 2004.
• L. von Ahn. Games with a purpose. Computer, 39 (6), 92–94, 2006.
• L. von Ahn and L. Dabbish. “Designing Games with a purpose”. CACM, Vol. 51, No. 8, 2008.

August 12, 2012 233

Bibliography (3)
• Shuo Chen et al. What if the Irresponsible Teachers Are Dominating? A Method of Training on Samples and
Clustering on Teachers. AAAI 2010.
• Paul Heymann, Hector Garcia-Molina: Turkalytics: analytics for human computation. WWW 2011.
• Florian Laws, Christian Scheible and Hinrich Schütze. Active Learning with Amazon Mechanical Turk.
EMNLP 2011.
• C.Y. Lin. Rouge: A package for automatic evaluation of summaries. Proceedings of the workshop on text
summarization branches out (WAS), 2004.
• C. Marshall and F. Shipman “The Ownership and Reuse of Visual Media”, JCDL, 2011.
• Hohyon Ryu and Matthew Lease. Crowdworker Filtering with Support Vector Machine. ASIS&T 2011.
• Wei Tang and Matthew Lease. Semi-Supervised Consensus Labeling for Crowdsourcing. ACM SIGIR
Workshop on Crowdsourcing for Information Retrieval (CIR), 2011.
• S. Vijayanarasimhan and K. Grauman. Large-Scale Live Active Learning: Training Object Detectors with
Crawled Data and Crowds. CVPR 2011.
• Stephen Wolfson and Matthew Lease. Look Before You Leap: Legal Pitfalls of Crowdsourcing. ASIS&T 2011.

August 12, 2012 234

Recent Work
• Della Penna, N, and M D Reid. (2012). “Crowd & Prejudice: An Impossibility Theorem for Crowd Labelling without a Gold
Standard.” in Proceedings of Collective Intelligence. Arxiv preprint arXiv:1204.3511.
• Demartini, Gianluca, D.E. Difallah, and P. Cudre-Mauroux. (2012). “ZenCrowd: leveraging probabilistic reasoning and
crowdsourcing techniques for large-scale entity linking.” 21st Annual Conference on the World Wide Web (WWW).
• Donmez, Pinar, Jaime Carbonell, and Jeff Schneider. (2010). “A probabilistic framework to learn from multiple
annotators with time-varying accuracy.” in SIAM International Conference on Data Mining (SDM), 826-837.
• Donmez, Pinar, Jaime Carbonell, and Jeff Schneider. (2009). “Efficiently learning the accuracy of labeling sources for
selective sampling.” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and
data mining (KDD), 259-268.
• Fort, K., Adda, G., and Cohen, K. (2011). Amazon Mechanical Turk: Gold mine or coal mine? Computational
Linguistics, 37(2):413–420.
• Ghosh, A, Satyen Kale, and Preston McAfee. (2012). “Who Moderates the Moderators? Crowdsourcing Abuse Detection
in User-Generated Content.” in Proceedings of the 12th ACM conference on Electronic commerce.
• Ho, C J, and J W Vaughan. (2012). “Online Task Assignment in Crowdsourcing Markets.” in Twenty-Sixth AAAI Conference
on Artificial Intelligence.
• Jung, Hyun Joon, and Matthew Lease. (2012). “Inferring Missing Relevance Judgments from Crowd Workers via
Probabilistic Matrix Factorization.” in Proceedings of the 35th international ACM SIGIR conference on Research and
development in information retrieval.
• Kamar, E, S Hacker, and E Horvitz. (2012). “Combining Human and Machine Intelligence in Large-scale Crowdsourcing.” in
Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
• Karger, D R, S Oh, and D Shah. (2011). “Budget-optimal task allocation for reliable crowdsourcing systems.” Arxiv preprint
arXiv:1110.3564.
• Kazai, Gabriella, Jaap Kamps, and Natasa Milic-Frayling. (2012). “An Analysis of Human Factors and Label Accuracy in
Crowdsourcing Relevance Judgments.” Springer's Information Retrieval Journal: Special Issue on Crowdsourcing.
August 12, 2012 235

Recent Work (2)
• Lin, C.H., Mausam, and Weld, D.S. (2012). “Crowdsourcing Control: Moving Beyond Multiple Choice.” in
Proceedings of the 4th Human Computation Workshop (HCOMP) at AAAI.
• Liu, C, and Y M Wang. (2012). “TrueLabel + Confusions: A Spectrum of Probabilistic Models in Analyzing Multiple
Ratings.” in Proceedings of the 29th International Conference on Machine Learning (ICML).
• Liu, Di, Randolph Bias, Matthew Lease, and Rebecca Kuipers. (2012). “Crowdsourcing for Usability Testing.” in
Proceedings of the 75th Annual Meeting of the American Society for Information Science and Technology (ASIS&T).
• Ramesh, A, A Parameswaran, Hector Garcia-Molina, and Neoklis Polyzotis. (2012). Identifying Reliable Workers Swiftly.
• Raykar, Vikas, Yu, S., Zhao, L.H., Valadez, G.H., Florin, C., Bogoni, L., and Moy, L. (2010). “Learning From Crowds.” Journal
of Machine Learning Research 11:1297-1322.
• Raykar, Vikas, Yu, S., Zhao, L.H., Jerebko, A., Florin, C., Valadez, G.H., Bogoni, L., and Moy, L. (2009). “Supervised
learning from multiple experts: whom to trust when everyone lies a bit.” in Proceedings of the 26th Annual
International Conference on Machine Learning (ICML), 889-896.
• Raykar, Vikas C, and Shipeng Yu. (2012). “Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling
Tasks.” Journal of Machine Learning Research 13:491-518.
• Wauthier, Fabian L., and Michael I. Jordan. (2012). “Bayesian Bias Mitigation for Crowdsourcing.” in Advances in neural
information processing systems (NIPS).
• Weld, D.S., Mausam, and Dai, P. (2011). “Execution control for crowdsourcing.” in Proceedings of the 24th ACM
symposium adjunct on User interface software and technology (UIST).
• Weld, D.S., Mausam, and Dai, P. (2011). “Human Intelligence Needs Artificial Intelligence.” in Proceedings of the 3rd
Human Computation Workshop (HCOMP) at AAAI.
• Welinder, Peter, Steve Branson, Serge Belongie, and Pietro Perona. (2010). “The Multidimensional Wisdom of
Crowds.” in Advances in Neural Information Processing Systems (NIPS), 2424-2432.
• Welinder, Peter, and Pietro Perona. (2010). “Online crowdsourcing: rating annotators and obtaining cost-effective
labels.” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 25-32.
• Whitehill, J, P Ruvolo, T Wu, J Bergsma, and J Movellan. (2009). “Whose Vote Should Count More: Optimal Integration
of Labels from Labelers of Unknown Expertise.” in Advances in Neural Information Processing Systems (NIPS).
• Yan, Y, and R Rosales. (2011). “Active learning from crowds.” in Proceedings of the 28th Annual International
Conference on Machine Learning (ICML).

Crowdsourcing in IR: 2008-2010
 2008
 O. Alonso, D. Rose, and B. Stewart. “Crowdsourcing for relevance evaluation”, SIGIR Forum, Vol. 42, No. 2.

 2009
 O. Alonso and S. Mizzaro. “Can we get rid of TREC Assessors? Using Mechanical Turk for … Assessment”. SIGIR Workshop on the Future of IR Evaluation.
 P.N. Bennett, D.M. Chickering, A. Mityagin. Learning Consensus Opinion: Mining Data from a Labeling Game. WWW.
 G. Kazai, N. Milic-Frayling, and J. Costello. “Towards Methods for the Collective Gathering and Quality Control of Relevance Assessments”, SIGIR.
 G. Kazai and N. Milic-Frayling. “… Quality of Relevance Assessments Collected through Crowdsourcing”. SIGIR Workshop on the Future of IR Evaluation.
 Law et al. “SearchWar”. HCOMP.
 H. Ma, R. Chandrasekar, C. Quirk, and A. Gupta. “Improving Search Engines Using Human Computation Games”, CIKM 2009.

 2010
 SIGIR Workshop on Crowdsourcing for Search Evaluation.
 O. Alonso, R. Schenkel, and M. Theobald. “Crowdsourcing Assessments for XML Ranked Retrieval”, ECIR.
 K. Berberich, S. Bedathur, O. Alonso, G. Weikum “A Language Modeling Approach for Temporal Information Needs”, ECIR.
 C. Grady and M. Lease. “Crowdsourcing Document Relevance Assessment with Mechanical Turk”. NAACL HLT Workshop on … Amazon's Mechanical Turk.
 Grace Hui Yang, Anton Mityagin, Krysta M. Svore, and Sergey Markov . “Collecting High Quality Overlapping Labels at Low Cost”. SIGIR.
 G. Kazai. “An Exploration of the Influence that Task Parameters Have on the Performance of Crowds”. CrowdConf.
 G. Kazai. “… Crowdsourcing in Building an Evaluation Platform for Searching Collections of Digitized Books”., Workshop on Very Large Digital Libraries (VLDL)
 Stephanie Nowak and Stefan Ruger. How Reliable are Annotations via Crowdsourcing? MIR.
 Jean-François Paiement, Dr. James G. Shanahan, and Remi Zajac. “Crowdsourcing Local Search Relevance”. CrowdConf.
 Maria Stone and Omar Alonso. “A Comparison of On-Demand Workforce with Trained Judges for Web Search Relevance Evaluation”. CrowdConf.
 T. Yan, V. Kumar, and D. Ganesan. CrowdSearch: exploiting crowds for accurate real-time image search on mobile phones. MobiSys pp. 77--90, 2010.


Crowdsourcing in IR: 2011
• WSDM Workshop on Crowdsourcing for Search and Data Mining.
• SIGIR Workshop on Crowdsourcing for Information Retrieval.
• 1st TREC Crowdsourcing Track.

• O. Alonso and R. Baeza-Yates. “Design and Implementation of Relevance Assessments using Crowdsourcing”. ECIR 2011.
• Roi Blanco, Harry Halpin, Daniel Herzig, Peter Mika, Jeffrey Pound, Henry Thompson, Thanh D. Tran. “Repeatable and
Reliable Search System Evaluation using Crowd-Sourcing”. SIGIR 2011.
• Yen-Ta Huang, An-Jung Cheng, Liang-Chi Hsieh, Winston H. Hsu, Kuo-Wei Chang. “Region-Based Landmark Discovery by
Crowdsourcing Geo-Referenced Photos”. SIGIR 2011.
• Hyun Joon Jung, Matthew Lease. “Improving Consensus Accuracy via Z-score and Weighted Voting”. HCOMP 2011.
• G. Kasneci, J. Van Gael, D. Stern, and T. Graepel. “CoBayes: Bayesian Knowledge Corroboration with Assessors of
Unknown Areas of Expertise”. WSDM 2011.
• Gabriella Kazai. “In Search of Quality in Crowdsourcing for Search Engine Evaluation”. ECIR 2011.
• Gabriella Kazai, Jaap Kamps, Marijn Koolen, Natasa Milic-Frayling. “Crowdsourcing for Book Search Evaluation: Impact of Quality
on Comparative System Ranking”. SIGIR 2011.
• Abhimanu Kumar, Matthew Lease. “Learning to Rank From a Noisy Crowd”. SIGIR 2011.
• Edith Law, Paul N. Bennett, and Eric Horvitz. “The Effects of Choice in Routing Relevance Judgments”. SIGIR 2011.

