Date: September 6th, 2017
Speaker: Jesse Chandler, PhD, is a survey researcher at Mathematica Policy Research and an Adjunct Faculty Associate at the Institute for Social Research at the University of Michigan.
Overview: Crowdsourcing has had a dramatic impact on the speed and scale at which scientific research can be conducted. Clinical scientists have particularly benefited from readily available research study participants and streamlined recruiting and payment systems afforded by Amazon Mechanical Turk (MTurk), a popular labor market for crowdsourcing workers. MTurk has been used in this capacity for more than five years. The popularity and novelty of the platform have spurred numerous methodological investigations, making it the most studied nonprobability sample available to researchers. This article summarizes what is known about MTurk sample composition and data quality with an emphasis on findings relevant to clinical psychological research. It then addresses methodological issues with using MTurk--many of which are common to other nonprobability samples but unfamiliar to clinical science researchers--and suggests concrete steps to avoid these issues or minimize their impact.
Recruiting Study Participants Online using Amazon's Mechanical Turk
1. Digital Scholar
Webinar
September 6th, 2017
Hosted by the Southern California Clinical and Translational Science Institute (SC CTSI)
University of Southern California (USC) and Children’s Hospital Los Angeles (CHLA)
4. Today’s Learning Objectives
Describe the potential and strengths of Mechanical Turk as a
complementary participant recruitment tool for clinical translational studies
Identify study types where MTurk is applicable
Describe basic features of MTurk and how they are used
Describe potential weaknesses of using MTurk (e.g., data quality,
external validity of results) and how to address them
5. Jesse Chandler, PhD
Today’s Speaker
Topic: Recruiting study participants online using Amazon's
Mechanical Turk
Speaker: Jesse Chandler, PhD, a survey researcher at
Mathematica Policy Research and an Adjunct Faculty
Associate at the Institute for Social Research at the
University of Michigan
6. Questions: Please use the Q&A Feature
1. Click on the tab here to
access Q&A
2. Ask and post question here
13. Other Advantages of MTurk
• Simple: Easy to use interface. Security,
recruitment, identity verification and
payment handled by Amazon
• Fast: Hundreds of responses per day
• Cost effective: $0.10 per respondent
minute (plus fee)
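The rule of thumb above lends itself to a quick back-of-envelope budget. A minimal sketch follows; note that the 20% platform fee used here is an assumption for illustration, not Amazon's quoted rate, so check current MTurk pricing before budgeting a real study.

```python
# Rough cost estimate for an MTurk study at $0.10 per respondent-minute.
# The 20% platform fee below is an assumed figure, not Amazon's quoted rate.

def estimate_cost(n_respondents, minutes_per_response,
                  rate_per_minute=0.10, fee_rate=0.20):
    """Return (worker_payments, total_including_platform_fee) in dollars."""
    payments = n_respondents * minutes_per_response * rate_per_minute
    total = payments * (1 + fee_rate)
    return round(payments, 2), round(total, 2)

# Example: a 10-minute survey with 200 respondents.
payments, total = estimate_cost(200, 10)
```

Under these assumptions, 200 respondents to a 10-minute survey cost $200 in worker payments and $240 once the fee is included.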
16. Who are Mechanical Turk Workers?
• Workers are mostly Indian and American
• Most research relies heavily on American
workers
• 500,000 registered users
• The typical lab will struggle to reach more
than 15,000 workers in any quarter
Stewart et al., 2017
17. Diverse but not Representative
                               USA      Mechanical Turk
Population size                323m     500k (15k active)
Age                            47.1     33.5
White                          74%      83%
4-year degree                  19%      35%
Republican                     29%      18%
Democrat                       35%      41%
LG(B)                          1.7% (1.8%)   3.8% (6.9%)
Atheist                        3%       21%
Has children                   ~54%     ~30%
Working age with disability    11%      ~5%
Casey et al., 2017
23. Linking external platforms to MTurk
Provide workers with a code that they then submit to MTurk.
Please paste this code into the MTurk HIT to confirm your
participation:
Confirmation Code: ${e://Field/ResponseID}
Pass a workerID through to the survey website
https://qualtrics.com/SE/?SID=SV_bjBZj&MID='+mturkworkerID+'
Pe’er, Paolacci, Chandler & Mueller, 2012
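The second approach above can be scripted. The sketch below shows the idea in Python: MTurk appends the worker's ID to an external task URL as a query parameter, and the task page forwards it to the survey platform. The parameter names (`workerId`, `SID`, `MID`) and the survey ID are illustrative placeholders, not an exact Qualtrics or MTurk specification.

```python
# Sketch of passing a workerId through to a survey platform.
# Parameter names and the survey ID below are illustrative placeholders.
from urllib.parse import urlparse, parse_qs, urlencode

def build_survey_url(landing_url, survey_base="https://qualtrics.com/SE/"):
    """Pull workerId out of the task's landing URL and embed it in the survey link."""
    params = parse_qs(urlparse(landing_url).query)
    worker_id = params.get("workerId", ["MISSING"])[0]
    return survey_base + "?" + urlencode({"SID": "SV_bjBZj", "MID": worker_id})

url = build_survey_url("https://example.com/task?workerId=A1B2C3&assignmentId=XYZ")
```

Embedding the ID this way lets you match survey responses back to MTurk submissions without asking workers to retype anything, which is less error-prone than hand-entered confirmation codes.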
29. Behavioral Science Research
• Surveys and survey experiments
– General population
– Specific groups
• Pilot testing and item generation (Fowler et
al., 2015; Sina, Krauss & Rosenfield, 2014)
• Experimental games (Arechar et al., 2017)
• Measures of reaction time (Crump et al., 2013)
• Eye tracking (Tran et al., 2017; Xu et al., 2015)
31. Behavioral Health Research on MTurk
• About 12% use psychotropic medication
• About 20% lifetime history of diagnosis
• Average prevalence of ADHD
• Average prevalence of acquired brain
injury
• Tend to be a little more socially anxious
• Tend to score a little higher on autism
spectrum measures
Bernstein & Calamia, 2017; Chandler & Shapiro, 2016;
Shapiro, Chandler & Mueller 2014; Wymbs & Dawson, 2015
33. Longitudinal studies
• Many published papers collect multi-wave
data across time periods ranging from
months up to a year
– Retention rate is usually about 60-70%
• Two-week daily diary study of alcohol use
– 70% completed at least four entries
– 60% adherence
Boynton & Richman, 2014; Chandler et al., 2013; Schleider &
Weisz, 2015; Shapiro et al., 2013; Weins & Walker, 2014
35. Content coding and judgment
• Annotation of text in forums (MacLean & Heer,
2013; Vlahovic et al., 2014)
• Speech pathology ratings (McAllister et al., 2014)
Data collection
• Upload pictures of thermostats (Meier et al.,
2011)
• Upload letters about standardized testing
(Chandler, unpublished data)
Workers as Research Assistants
37. An Illustration from Political Science
• Accurate:
– 15 workers as good as 5
experts
– Worker and expert ratings,
r = .96
• Fast: 22,000 statements
in 5 hours
• Cost Effective: Total cost
of $1080
• Elastic: Scaled up or
down quickly
• DIY: Anybody can
replicate it
Benoit et al., 2015
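The aggregation logic behind this result can be sketched in a few lines: average several noisy worker ratings per item, then correlate the crowd mean with an expert benchmark. The numbers below are made up for illustration; they are not Benoit et al.'s data.

```python
# Toy illustration: averaging noisy worker ratings recovers the expert signal.
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

expert = [1.0, 2.0, 3.0, 4.0, 5.0]           # benchmark ratings for 5 items
workers = [                                   # 3 workers, noisy but unbiased
    [1.2, 1.8, 3.1, 4.3, 4.9],
    [0.8, 2.3, 2.8, 3.9, 5.2],
    [1.1, 2.0, 3.2, 3.8, 5.0],
]
crowd = [mean(item) for item in zip(*workers)]  # item-wise crowd average
r = pearson(crowd, expert)
```

Even with only three simulated raters, the crowd mean tracks the benchmark almost perfectly; adding raters shrinks the residual noise further.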
41. Transactive Crowds
• MTurk workers asked to provide cognitive
reappraisals of the negative thoughts of
other workers (Morris & Picard, 2014)
• An app that allows people with visual
impairments to upload images and receive
near real-time descriptions of their contents
(Bingham et al., 2010)
51. Workers are Basically Honest
Mechanical Turk                      GSS (2008, 2010)
Age (+/- 1 year)      97.8%         Age (+/- 1 year)     94.2%
Biological sex        98.6%         Sex                  99.1%
Race                  97.8%         Race                 93.6%
Latino ethnicity      96.9%         Latino ethnicity     93.4%
State residency       97.6%         Residency at 16      96%
MTurk data: Casey et al., 2017
53. Factual Knowledge Questions
• How many countries are in
Africa?
– 10% guess 53 or 54
• Which Nobel Prize did
Venkatraman Ramakrishnan
win?
– 30% guess Chemistry
Goodman, Cryder & Cheema, 2013;
Chandler & Paolacci, unpublished data
58. Potential Consequences of Fraud
• People might respond to subsequent questions
truthfully, adding noise to any measurements
• People might respond to subsequent questions using
a lay theory about how they “should” respond
Wessling-Sharpe, Huber & Netzer, 2017
59. When “Good Enough” is Good Enough
• Would I have used a non-probability sample to begin
with?
• Will discussions about a representative-sample study
design be more effective if I already have data?
• Do I need a way to prioritize which treatments I
decide to test or not test?
• If a treatment has an impact, would I act differently if I
learned that the effect was 15% larger or smaller than I
had initially observed?
• Is there a better ROI for an answer that is +/- 15% at
1/10th the cost and in 1/10th the time?
64. Getting Started
Stewart, N., Chandler, J., & Paolacci, G. (2017).
Crowdsourcing samples in cognitive science. Trends in
Cognitive Sciences
Chandler, J., & Shapiro, D. (2016). Conducting clinical
research using crowdsourced convenience
samples. Annual Review of Clinical Psychology
Mason, W., & Suri, S. (2012). Conducting behavioral
research on Amazon’s Mechanical Turk. Behavior Research
Methods
Ranard, B. L., Ha, Y. P., Meisel, Z. F., Asch, D. A., Hill, S. S.,
Becker, L. B., ... & Merchant, R. M. (2014). Crowdsourcing—
harnessing the masses to advance health and medicine, a
systematic review. Journal of General Internal Medicine
66. Questions
Program director: Katja Reuter, PhD
Email: katja.reuter@usc.edu
Twitter: @dmsci
Information about the program: http://sc-ctsi.org/digital-scholar/
Next Digital Scholar Webinar
Oct 4, 2017 | 12-1 PM PST
Topic: Disseminating scientific papers via Twitter: Practical
insights and research evidence
Speaker: Stefanie Haustein, PhD, Assistant Professor, School
of Information Studies, University of Ottawa
Register at: sc-ctsi.org/digital-scholar/register
Editor’s Notes
Welcome to today’s Digital Scholar Webinar at the University of Southern California.
Advances in digital technology have led to a heightened interest in exploring the use of digital practices and tools to benefit researchers and clinicians.
This webinar series is focused on the workflows and needs of health sciences researchers.
Today we will introduce a Web-based crowdsourcing tool that supports research participant recruitment.
Crowdsourcing through Amazon’s Mechanical Turk (MTurk) may serve as a new way to complement your existing study recruitment.
After today’s webinar, you will be able to…
1. Describe the potential and strengths of Mechanical Turk as a complementary participant recruitment tool for clinical translational studies.
2. Identify study types where MTurk is applicable
3. Describe basic features of MTurk and how they are used
4. Describe potential weaknesses of using MTurk (e.g., data quality, external validity of results) and how to address them
I am delighted to introduce today’s speaker, Dr. Jesse Chandler, a survey researcher at Mathematica Policy Research and an Adjunct Faculty Associate at the Institute for Social Research at the University of Michigan.
We will have time for your questions at the end of the presentation. Please add your questions to the Q&A, which you can find on the right side.
Named after an 18th-century fake chess-playing machine
Appealing features – say what they are, discuss spread
Researchers are “requesters”
Tasks (like surveys) are HITs
People who want to complete tasks are “workers”
Workers accept and then submit HITs to requesters for approval
Requesters discretionarily approve work
A worker’s approval rate determines access to future work
Definitely more diverse than college samples
MTurk is about as representative as other convenience samples
About as good as community samples from college towns (Berinsky et al., 2012)
Slightly less representative than existing commercial panels (e.g., ANSEP; Berinsky et al., 2012; GfK/TESS; Mullinex, Druckman & Freese, 2014; Weinberg et al., 2012)
But you can stratify and/or apply weighting to MTurk samples (Greenblatt, 2013)
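The weighting idea can be made concrete with a toy example: give each stratum a weight of population share divided by sample share, so the weighted sample reproduces population shares. The proportions below are illustrative, loosely inspired by the education gap on slide 17, not exact figures.

```python
# Post-stratification sketch: weight each stratum by
# population share / sample share so weighted sample shares
# match the population. Proportions below are illustrative.

def poststrat_weights(pop_share, sample_share):
    return {k: pop_share[k] / sample_share[k] for k in pop_share}

pop = {"degree": 0.19, "no_degree": 0.81}    # population shares
samp = {"degree": 0.35, "no_degree": 0.65}   # over-educated sample shares
w = poststrat_weights(pop, samp)
weighted = {k: samp[k] * w[k] for k in samp}  # recovers population shares
```

Overrepresented strata get weights below 1 and underrepresented strata get weights above 1; the same logic extends to cross-classified strata (e.g., education by age), at the cost of larger variance when some cells are sparse.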
More socially anxious, introverted (Goodman, Cryder, & Cheema, 2013; Kosara & Ziemkiewicz, 2010; Shapiro et al., 2013)
Less emotionally stable (Goodman et al., 2013; Kosara & Ziemkiewicz, 2010; Holubec-Gootzeit 2014)
Higher autism spectrum quotient (Palmer, Payton, Enticott & Hohwy, 2014)
An API: Feature rich and can be integrated with other software
Native GUI: Simple to use and less features
TurkPrime: Feature rich and more cost effective than the native GUI
Use premade qualifications or create your own
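The approval-rate gate mentioned earlier is one of the premade qualifications. The sketch below shows roughly the requirement structure the MTurk API accepts when creating a HIT, expressed as plain data with no API call; verify the system qualification ID and field names against the current API reference before use.

```python
# Sketch of an MTurk QualificationRequirement restricting a HIT to
# workers with a >= 95% approval rate. Shown as plain data; in practice
# this dict is passed when creating a HIT via the MTurk API.
APPROVAL_RATE_QUAL_ID = "000000000000000000L0"  # system qualification: % approved

def approval_rate_requirement(min_percent=95):
    return {
        "QualificationTypeId": APPROVAL_RATE_QUAL_ID,
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [min_percent],
        # Hide the HIT entirely from workers who do not qualify.
        "ActionsGuarded": "DiscoverPreviewAndAccept",
    }

req = approval_rate_requirement(95)
```

Custom qualifications work the same way: you create a qualification type, assign it to specific workers (e.g., past participants you want to include or exclude), and reference its ID in the same structure.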
Lots of research on addiction and substance abuse; other topics include perceptions of physicians and physician messaging, intelligibility of medical pictograms, and perceptions of warnings and attitudes (e.g., vaccines)
Infant attention
Need to email workers – can do this through TurkPrime or the API
Medical word identification: 84% agreement between a pair of Turkers and an aggregate of 9 nurses; the best automated alternative reached 72%
Similar work has used workers as a replacement for speech pathologists: 9 workers = 3 speech pathologists
Coding text posted by breast cancer survivors in online forums
Lee, A. Y., & Tufail, A. (2014). Mechanical Turk based system for macular OCT segmentation. Investigative Ophthalmology & Visual Science, 55(13), 4787-4787.
Training sets for machine learning
Triaging images and video data
Content coding
Generating experimental stimuli
Generating survey questions
r = .98
r = .81
An effect requires a specific demographic characteristic to occur
The strength of an effect depends on a demographic characteristic (and this matters)
Interest in a particular subgroup that is really not representative
Interest in the robustness of a treatment or in a precise estimate of effect size
Can potentially interfere with correlations between knowledge and attitudes or behavior