Crowdsourcing for HCI Research with Amazon Mechanical Turk
1. Crowdsourcing for Human-Computer Interaction Research
Ed H. Chi
Research Scientist
Google
(work done while at [Xerox] PARC with Aniket Kittur)
2. User studies
• Getting input from users is important in HCI
– surveys
– rapid prototyping
– usability tests
– cognitive walkthroughs
– performance measures
– quantitative ratings
3. User studies
• Getting input from users is expensive
– Time costs
– Monetary costs
• Often have to trade off costs with sample size
4. Online solutions
• Online user surveys
• Remote usability testing
• Online experiments
• But still have difficulties
– Rely on practitioner for recruiting participants
– Limited pool of participants
5. Crowdsourcing
• Make tasks available for anyone online to complete
• Quickly access a large user pool, collect data, and
compensate users
• Example: NASA Clickworkers
– 100k+ volunteers identified Mars craters from
space photographs
– Aggregate results virtually indistinguishable from
expert geologists
[Figure: crater markings by experts vs. by crowds]
http://clickworkers.arc.nasa.gov
6. Amazon's Mechanical Turk
• Market for human intelligence tasks
• Typically short, objective tasks
– Tag an image
– Find a webpage
– Evaluate relevance of search results
• Users complete tasks for a few pennies each
8. Using Mechanical Turk for user studies
• Traditional user studies vs. Mechanical Turk:
– Task complexity: complex, long tasks vs. simple, short tasks
– Task subjectivity: subjective opinions vs. objective, verifiable answers
– User information: targeted demographics, high interactivity vs. unknown
demographics, limited interactivity
• Can Mechanical Turk be usefully used for user studies?
9. Task
• Assess quality of Wikipedia articles
• Started with ratings from expert Wikipedians
– 14 articles (e.g., "Germany", "Noam Chomsky")
– 7-point scale
• Can we get matching ratings with Mechanical Turk?
10. Experiment 1
• Rate articles on 7-point scales:
– Well written
– Factually accurate
– Overall quality
• Free-text input:
– What improvements does the article need?
• Paid $0.05 each
11. Experiment 1: Good news
• 58 users made 210 ratings (15 per article)
– $10.50 total
• Fast results
– 44% within a day, 100% within two days
– Many completed within minutes
12. Experiment 1: Bad news
• Correlation between Turkers and Wikipedians
only marginally significant (r=.50, p=.07; see the sketch below)
• Worse, 59% potentially invalid responses
Experiment 1:
– Invalid comments: 49%
– <1 min responses: 31%
• Nearly 75% of these done by only 8 users
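As a rough illustration of the statistic above, the correlation can be computed by averaging each article's Turker ratings and correlating those per-article means with the expert ratings. A minimal sketch with made-up numbers (and fewer articles than the 14 in the study), not the authors' analysis code:

# Illustrative only: the ratings below are invented, not study data.
from scipy.stats import pearsonr
import numpy as np

expert = np.array([6.0, 5.5, 4.0, 3.0, 6.5, 2.5, 5.0])   # expert Wikipedian ratings (7-point scale)
turker = [                                                 # several Turker ratings per article
    [6, 5, 7, 6, 5], [5, 6, 5, 4, 6], [4, 3, 5, 4, 4],
    [3, 2, 4, 3, 3], [7, 6, 6, 7, 6], [3, 2, 3, 2, 4], [5, 5, 6, 4, 5],
]
turker_means = np.array([np.mean(r) for r in turker])

r, p = pearsonr(turker_means, expert)                      # correlate per-article means with experts
print(f"r = {r:.2f}, p = {p:.3f}")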
13. Not a good start
• Summary of Experiment 1:
– Only marginal correlation with experts.
– Heavy gaming of the system by a minority
• Possible responses:
– Make sure these gamers are not rewarded
– Ban them from doing your HITs in the future
– Create a reputation system [Dolores Labs]
• Can we change how we collect user input?
14. Design changes
• Use verifiable questions to signal monitoring
– How many sections does the article have?
– How many images does the article have?
– How many references does the article have?
15. Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as
good-faith answers
– Provide 4-6 keywords that would give someone a
good summary of the contents of the article
16. Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as
good-faith answers
• Make verifiable answers useful for completing
task
– Used tasks similar to how Wikipedians described
evaluating quality (organization, presentation,
references)
17. Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as
good-faith answers
• Make verifiable answers useful for completing
task
• Put verifiable tasks before subjective
responses
– First do objective tasks and summarization
– Only then evaluate subjective quality
– Ecological validity?
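To make the design changes above concrete, here is a minimal sketch of how verifiable answers could be checked automatically against known article facts to flag suspect responses. The field names, tolerances, and example values are illustrative assumptions, not details from the original study.

# Hypothetical response-screening sketch; field names and thresholds are assumptions.
def is_suspect(response, truth):
    """Flag a response whose verifiable answers stray too far from ground truth."""
    checks = [
        abs(int(response["num_sections"]) - truth["num_sections"]) <= 1,
        abs(int(response["num_images"]) - truth["num_images"]) <= 1,
        abs(int(response["num_references"]) - truth["num_references"]) <= 2,
        len(response["keywords"].split(",")) >= 4,   # asked for 4-6 summary keywords
    ]
    return not all(checks)

truth = {"num_sections": 12, "num_images": 5, "num_references": 48}
response = {"num_sections": "11", "num_images": "5",
            "num_references": "50", "keywords": "history, geography, economy, culture"}
print(is_suspect(response, truth))   # False: the verifiable answers look plausible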
18. Experiment 2: Results
• 124 users provided 277 ratings (~20 per article)
• Significant positive correlation with Wikipedians (r=.66, p=.01)
• Smaller proportion of malicious responses
• Increased time on task
Experiment 1 vs. Experiment 2:
– Invalid comments: 49% → 3%
– <1 min responses: 31% → 7%
– Median time on task: 1:30 → 4:06
19. Generalizing to other user studies
• Combine objective and subjective questions
– Rapid prototyping: ask verifiable questions about
content/design of prototype before subjective
evaluation
– User surveys: ask common-knowledge questions
before asking for opinions
20. Limitations of mechanical turk
• No control over users' environment
– Potential for different browsers, physical
distractions
– General problem with online experimentation
• Not designed for user studies
– Difficult to do between-subjects designs (see the sketch below)
– Involves some programming
• Users
– Uncertainty about user demographics, expertise
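One common workaround for the between-subjects limitation is to track worker IDs yourself and assign each new worker to a single condition, excluding repeat participants. A minimal sketch with hypothetical identifiers, assuming worker IDs are collected with each submission:

# Illustrative between-subjects assignment; names and conditions are assumptions.
import random

assigned = {}  # worker_id -> condition

def assign_condition(worker_id, conditions=("baseline", "visualization")):
    """Give each worker exactly one condition; return None for repeat participants."""
    if worker_id in assigned:
        return None
    condition = random.choice(conditions)
    assigned[worker_id] = condition
    return condition

print(assign_condition("A1B2C3"))   # e.g. 'visualization'
print(assign_condition("A1B2C3"))   # None: repeat worker, excluded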
21. Quick Summary
• Mechanical Turk offers the practitioner a way to
access a large user pool and quickly collect data at
low cost
• Good results require careful task design
1. Use verifiable questions to signal monitoring
2. Make malicious answers as high cost as good-faith
answers
3. Make verifiable answers useful for completing task
4. Put verifiable tasks before subjective responses
22. Crowdsourcing for HCI Research
• Does my interface/visualization work?
– WikiDashboard: transparency visualization for Wikipedia
– J. Heer's work at Stanford looking at perceptual effects
• Coding of large amounts of user data
– What is a question in Twitter? (Sharoda Paul at PARC)
• Decompose tasks into smaller tasks
– Digital Taylorism
– Frederick Winslow Taylor (1856-1915), author of the 1911 book
'The Principles of Scientific Management'
• Incentive mechanisms
– Intrinsic vs. Extrinsic rewards
– Games vs. Pay
25. What is Wikipedia?
"Wikipedia is the best thing ever. Anyone in the world can write anything they
want about any subject, so you know you're getting the best possible information."
– Steve Carell, The Office
27. What would make you trust Wikipedia more?
"Wikipedia, just by its nature, is impossible to trust completely. I don't think
this can necessarily be changed."
28. WikiDashboard
• Transparency of social dynamics can reduce conflict and coordination issues
• Attribution encourages contribution
– WikiDashboard: social dashboard for wikis
– Prototype system: http://wikidashboard.parc.com
• Visualization for every wiki page showing an edit history timeline and the
top individual editors
• Can drill down into the activity history of specific editors and view edits
to see changes side-by-side
Citation: Suh et al., CHI 2008 Proceedings
30. [Screenshot: WikiDashboard highlighting an article's top editor ("WastedTimeR")]
31. Surfacing information
• Numerous studies mining Wikipedia revision
history to surface trust-relevant information
– Adler & Alfaro, 2007; Dondio et al., 2006; Kittur et al., 2007;
Viegas et al., 2004; Zeng et al., 2006
– Suh, Chi, Kittur, & Pendleton, CHI 2008
• But how much impact can this have on user
perceptions in a system which is inherently
mutable?
32. Hypotheses
1. Visualization will impact perceptions of trust
2. Compared to baseline, visualization will
impact trust both positively and negatively
3. Visualization should have the most impact when there is
high uncertainty about an article
• Low quality
• High controversy
33. Design
• 3 x 2 x 2 design
– Visualization: high stability, low stability, baseline (none)
– Quality: high, low
– Controversy: controversial, uncontroversial
• Articles in each quality x controversy cell:
– High quality, controversial: Abortion; George Bush
– High quality, uncontroversial: Volcano; Shark
– Low quality, controversial: Pro-life feminism; Scientology and celebrities
– Low quality, uncontroversial: Disk defragmenter; Beeswax
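As a small illustration of the factorial structure, the 12 visualization x quality x controversy conditions can be enumerated directly from the factors above. This is only a sketch of the design grid, reusing the article names from the slide:

# Enumerate the 3 x 2 x 2 conditions (illustrative sketch of the design grid).
from itertools import product

visualizations = ["high stability", "low stability", "baseline (none)"]
articles = {  # (quality, controversy) -> example articles from the slide
    ("high", "controversial"): ["Abortion", "George Bush"],
    ("high", "uncontroversial"): ["Volcano", "Shark"],
    ("low", "controversial"): ["Pro-life feminism", "Scientology and celebrities"],
    ("low", "uncontroversial"): ["Disk defragmenter", "Beeswax"],
}

# 3 visualizations x 4 quality/controversy cells = 12 conditions
for viz, ((quality, controversy), names) in product(visualizations, articles.items()):
    print(f"{viz:16} | {quality} quality | {controversy:15} | {names}")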
40. Method
• Users recruited via Amazon's Mechanical Turk
– 253 participants
– 673 ratings
– 7 cents per rating
– Kittur, Chi, & Suh, CHI 2008: Crowdsourcing user studies
• To ensure salience and valid answers, participants
answered:
– In what time period was this article the least stable?
– How stable has this article been for the last month?
– Who was the last editor?
– How trustworthy do you consider the above editor?
41. Results
[Chart: mean trustworthiness ratings (1-7) for low- vs. high-quality articles,
uncontroversial vs. controversial, under high-stability, baseline, and
low-stability visualizations]
Main effects of quality and controversy:
• High-quality articles > low-quality articles (F(1, 425) = 25.37, p < .001)
• Uncontroversial articles > controversial articles (F(1, 425) = 4.69, p = .031)
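For readers unfamiliar with this kind of analysis, the main effects above come from an analysis of variance on the trustworthiness ratings. A minimal sketch with invented data (not the study's dataset), using a standard two-way ANOVA in Python:

# Illustrative two-way ANOVA: rating ~ quality x controversy (made-up data).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "rating": [6, 5, 6, 3, 2, 3, 6, 6, 5, 2, 1, 2],
    "quality": ["high"] * 3 + ["low"] * 3 + ["high"] * 3 + ["low"] * 3,
    "controversy": ["uncontroversial"] * 6 + ["controversial"] * 6,
})

model = ols("rating ~ C(quality) * C(controversy)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # F and p for main effects and interaction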
42. Results
[Chart: same trustworthiness ratings as above, highlighting the quality x
controversy interaction]
Interaction effect of quality and controversy:
• High-quality articles were rated equally trustworthy whether controversial
or not, while
• low-quality articles were rated lower when they were controversial than
when they were uncontroversial.
43. Results
[Chart: trustworthiness ratings by quality and controversy under high-stability,
baseline, and low-stability visualization conditions]
1. Significant effect of visualization
– High > low, p < .001
2. Visualization has both positive and negative effects
– High > baseline, p < .001
– Low < baseline, p < .01
3. No interaction of visualization with either quality or controversy
– Robust across conditions