In this talk, I discuss several interactive crowd-powered systems
that help people address real-world problems. For instance, VizWiz
sends questions blind people have about their visual environment to
the crowd, Legion allows outsourcing of desktop tasks to the crowd,
and Scribe allows the crowd to caption audio in real time. Thousands
of people have engaged with these systems, providing an
interesting look at how end users want to interact with crowd work.
Collectively, these systems illustrate a new approach to human
computation in which the dynamic crowd is provided the computational
support needed to act as a single, high-quality agent. The classic
advantage of the crowd has been its wisdom, but our systems are
beginning to show how crowd agents can surpass even expert individuals
on motor and cognitive performance tasks.
Crowd Agents: Interactive Crowd-Powered Systems in the Real World
1. Crowd Agents
Interactive Crowd-Powered Systems in the Real World
Jeffrey P. Bigham
University of Rochester
University of Rochester Human-Computer Interaction Jeffrey P. Bigham
3. Introduction VizWiz Crowd Agents Scribe
4. Human Assistance in History
What the Disability Community Can Teach Us About Interactive Crowdsourcing. Jeffrey P. Bigham and Richard Ladner. interactions magazine. July 2011.
5. Connectivity
6. Courtesy of John Brabyn
7. Remote Assistance
Video Relay Services
Real-Time Captioning
8. Connectivity -> Crowd
Mechanical Turk
Friends and Family on Social Networks
9. VizWiz
Bigham et al. Nearly Real-Time Answers to Visual Questions. UIST 2010.
10. Access Technology
• Optical Character Recognition
• Color Recognizers
• Talking GPS
• …
Problems
1. Limited Scope
2. Unacceptable Error Rate
3. $$$
4. Not Exactly What Users Want
11. Releasing VizWiz
• Released on May 31, 2011
– 5000 users asked more than 50,000 questions
– answers in less than a minute
12. Recruiting Crowd Quickly
How many workers do we need?
- number of current workers
- likelihood of needing more workers
quikTurkit posts or removes jobs as needed; Turkers answer multiple questions per session.
At $4/hr, latency drops to under 30 seconds from start to finish.
quikturkit.googlecode.com
Bigham et al. Nearly Real-Time Answers to Visual Questions. UIST 2010.
13. Characterization of the Crowd
- Workers Come and Go
- Some May Do the Wrong Thing
14. Supporting a Continuous Interaction?
Where’s the coffee?
Walk to the end of this hall, turn right.
Turn right into the kitchen.
Soda on the left, coffee on the right.
How do I use this machine?
15. Model for Crowd Agents
16. Model for Crowd Agents
Input Mediation
Learning
17. Model for Crowd Agents
• What interface is being controlled?
• How is input mediation done?
• Role of automated agents?
18. Chorus
19. (Example conversation with the Chorus crowd agent)
20. Legion: Control of Any Interface
(Architecture diagram: the Legion client sends a video stream, task description, and crowd agreement/payment info through a Flash Media Server; quikTurkit recruits workers; worker input, key presses and mouse clicks, passes through the input mediators back to the Legion client as mediated input. The worker interface explains the controls, gives feedback reflecting the worker’s last key press and whether the interface last followed the crowd or the worker, and shows the current bonus level, tied to crowd agreement. Results chart: completion time in seconds and success rates out of 10 trials for the Solo, Mob, Vote, Active, and Leader input mediators.)
W. Lasecki, S. White, K. Murray, R. Miller, and J.P. Bigham “Real-Time Control of
Existing Interfaces.” UIST 2011.
21. (Legion demos: copying a whiteboard table into a spreadsheet; driving a webcam robot by natural language commands)
22. Crowd Memory
W.S. Lasecki, S.C. White, K.I. Murray and J.P. Bigham. “Crowd Memory: Learning in
the Collective.” Collective Intelligence 2012.
23. Crowd Memory
24. Deployable Activity Recognition
W.S. Lasecki, Y. Song, H. Kautz, and J.P. Bigham. “Real-Time Activity Labeling for
Deployable Activity Recognition.” Submitted to CSCW 2012. Pervasive 2012 (poster)
25. Legion: Scribe
Real-Time Captions by Groups of Non-Experts
26. Real-Time Captioning
Problem: produce a text transcript of speech with less than 5 seconds of latency
Stenographers: expensive, difficult to schedule, lack domain expertise, pretty accurate.
ASR: cheap, available on demand, can be trained for new vocab, does not work*.
(“Can I help?” “NO, you are worse than ASR.”)
* in real settings, from an unknown mic, with a speaker who hasn’t trained the ASR
27. Real-Time Captioning
W. Lasecki, C. Miller, A. Sadilek, A. Abumoussa, D. Borrello, R. Kushalnagar, J.P.
Bigham. “Real-Time Captioning by Groups of Non-Experts.” UIST 2012.
28. Input Mediator
Multiple Sequence Alignment, Online Version
(Figure: the alignment graph grows in stages over time as words arrive.)
Worker 1: open the file now
Worker 2: the java fiel
Worker 3: open java file up and
Baseline: open the java file now and
W.S. Lasecki, C.D. Miller, D. Borrello and J.P. Bigham. “Online Sequence Alignment
for Real-Time Audio Transcription by Non-Experts.” AAAI 2012 (poster).
29. Scribe Interface
Encourages:
- real-time input
- global coverage
- short sequences
Co-evolution of Interface and Algorithm
30. Coverage Graph
31. Tradeoff
Failures:
“n-factorial”
“in pectoral”
32. Interesting Qualities
• Captionists can be experts
– not at captioning but in the subject
• Low cost
– $30/hour on MTurk (did not optimize)
– or free (impossible before)
• Recruited on demand
– for only as long as needed
33. Scribe vs. ASR
Scribe: Web prefetching is 1 technique that ressearchers rely on history based to the non history based technique the downloaded pages will be scanned and hyperlinks will be…
ASR: A lactate fencing is one thinking that and etc. rely on to improve network. Phillipe pitching. Anything survived all incident techniques…
34. Incorporating ASR
Coverage Increase: 28% to 55%
(single worker case)
35. Conclusions
General Lessons, Science, and the Future
36.
“What would it take for me
to be proud of my daughter
being a crowd worker?”
- Niki Kittur @ CrowdCamp
Currence Bigham after her first running race.
37. Do Good
Connect to help and support.
Do Better
Do better work than anyone could alone.
38.
hci.cs.rochester.edu
@jeffbigham
Thanks!
Funded by: National Science Foundation Grants (#IIS-1149709, #IIS-1116051, #IIS-1049080), and Google.
Editor’s notes
Hi everyone, I’m Jeff from the University of Rochester. Over the past few years, we have been working on crowd-powered systems designed to be used in the real world, to help real people solve everyday problems. Today, I’ll tell you about some of those systems, general lessons I think we can take away from them, and how users and workers have reacted to interacting with them on real tasks.
But, before I do, I want to take a bit of a step back, in order to place our work in the context of history, and to set a foundation for my vision for the future. Since the earliest days of computer science, computer scientists have dreamed of a future world in which we work seamlessly with machines to get things done in the real world. The AI and HCI communities have in particular taken up this challenge, with slightly different focuses, but with what I believe to be often similar end goals. What I’m excited about is that I believe we’re finally at a point where we can actually build the intelligent interactive agents of our dreams. A big part of how we’ll build them is real-time human computation, which I believe requires a tight coupling of AI and HCI.
A lot of my research is in building applications targeted at helping people with disabilities, and nowhere is the long history of human assistance as readily apparent as it is there. People provide one another assistance every day. Volunteers may go to a blind person’s home to read her mail, sign language interpreters help ensure education is available to deaf students, and friends help people with physical disabilities get around. This has been true forever; what has changed is connectivity.
Connectivity means that wherever I am, whatever I need, I can now easily recruit a person to help me with it. I needn’t rely on having someone nearby or technology that is itself intelligent enough to help me.
And, people with disabilities were some of the first to leverage what we today might call human computation. This sketch from the early 90s illustrates a service developed by the Smith-Kettlewell Eye Institute, in which a blind person has scanned a frozen dinner and is talking to a remote supporter to find out more about it. I especially like this picture because the blind person is being assisted remotely by a person in a wheelchair.
As technology improved, so did the services available to people with disabilities. By 2000 or so, deaf people were connecting to video relay services that allowed them to sign to hearing folks on the phone, and they connected to remote real-time captionists who could convert speech to written text. These were huge advances, but because they required experts who needed to be available for a long time, they were very expensive, in the range of $100 to $200 an hour.
I’m excited because increasing connectivity now means that anyone can help: workers on Mechanical Turk, volunteers, and friends and family. Potentially, this makes the market for assistance much more elastic.
A few years ago, we explored this potential through an iPhone application that we developed called VizWiz. VizWiz lets blind people take a picture, speak a question, and get an answer back in a few seconds from people out on the web.
There is already a lot of great access technology that serves as sensors onto an inaccessible world for people with disabilities. OCR recognizes text, color recognizers can help people coordinate outfits, and talking GPS units can help people find their way. Unfortunately, despite its promise, this technology remains limited in the scope of problems it can reliably solve, and it still has unacceptable error rates for real applications. The technology is expensive, costing 100s to 1000s of dollars, and in the end often isn’t exactly what users would want anyway. In fact, we as technologists often don’t really know what users really want.
And so, in the course of running what we call a deployable Wizard-of-Oz experiment, we released VizWiz on the app store about a year ago, to pretty dramatic results: 5000 users have asked more than 50,000 questions. This provides us an unprecedented look at what blind people might actually want to know about their visual environments.
So, how do we get answers back quickly for VizWiz? On the backend, we run a service called quikTurkit. The goal of quikTurkit is to keep workers around to answer questions. It can either be used on demand (when a question is received) or keep a pool of workers around at all times to further reduce latency. To help improve on-demand response times, the VizWiz application lets quikTurkit know when someone has started to interact with it (that is, took a picture), so it can begin recruiting workers. An interesting result that came from our initial work is that time to answer is very much dependent on how difficult the work is to do; in this case, VizWiz questions are all answered pretty quickly, but they are answered most quickly when the question can actually be answered from the photo and the question could be automatically converted to text using speech recognition. It turns out keeping a steady pool around isn’t that expensive, and doing so further reduces latency to under 30 seconds from when a question is received to when an answer is sent.
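The pooling logic described in this note can be sketched roughly as follows. The sizing rule, function names, and the three-answers-per-worker assumption are my illustrative guesses, not quikTurkit’s actual implementation:

```python
# Illustrative sketch of quikTurkit-style worker pooling (assumed
# logic, not the real quikTurkit code).

def target_pool_size(active_questions, expected_new, answers_per_worker=3):
    """Estimate how many workers to keep recruited: enough to cover
    current questions plus a cushion for likely new arrivals."""
    demand = active_questions + expected_new
    # Each recruited worker answers several questions per session,
    # so fewer workers are needed than outstanding questions.
    return max(1, -(-demand // answers_per_worker))  # ceiling division

def jobs_to_post(current_workers, active_questions, expected_new):
    """Post more jobs when the pool is short; a negative result
    means jobs can be removed because the pool is oversized."""
    return target_pool_size(active_questions, expected_new) - current_workers
```

The key design point from the note is that recruiting begins as soon as the user takes a picture, before the question even arrives, so the pool is warm when it is needed.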
Our experience with VizWiz led us to characterize the crowd that is easy to recruit online as follows: the crowd is dynamic, which means that workers come and go. And, some workers may do the wrong thing.
So, given that characterization of the crowd, imagine that we wanted to support a richer, continuous interaction like this one. How could that work with the crowds that we have?
We could imagine recruiting a single worker from the crowd, who could chat with the user much like they would on IM. This has definite advantages. For instance, by using existing interfaces, we can leverage all that we know about making these usable, and we can leverage the experience that people have using them: turkers know how to use instant messenger, and so do blind people. But doing this naively fails under our model of the crowd; in particular, what if a worker provides bad input, or what if a worker disappears entirely? To accommodate this, we add in more workers, all controlling IM as they know how to do. But now we have another problem: the user’s interaction is not what they’re accustomed to; namely, they’re being expected to hold multiple conversations at once.
To address this, we introduce an input mediation layer that takes all the input that it receives and condenses it to a single stream that is forwarded on. This layer could be powered by an automatic algorithm, or also powered by the crowd. We might also introduce learning into the pipeline, so that the system can learn to serve as one of its own workers, thus, for instance, allowing the crowd to take on the difficult bit of adapting to new environments, after which the automatic agents take over.
This model is what we mean by crowd agents: crowd workers acting as one. And so, the questions that define a particular crowd agent are: What interface is being controlled? How is input mediation done? And, what is the role of automated agents?
Our Chorus system demonstrates how this works for chatting with the crowd agent. Each crowd worker chats in an interface that looks a lot like instant messenger. To maintain consistency, they are provided a space for shared memory. The crowd mediates its own inputs by voting responses through.
This is an example conversation; in this case, the user chats with the crowd agent about a place to eat in Los Angeles. It seems as though this real-time chat is happening with a real person. Behind the scenes, Chorus is making sure that happens. Workers propose messages, and only those that receive enough votes are forwarded on. In our experiments, the crowd agent was able to reliably carry on a conversation with the user, answering nearly all questions in a reasonable way. Even though the crowd is comprised of people, issues like consistency and memory make a Crowd Turing Test something reasonable we might explore in the future.
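The vote-through mediation described here can be sketched minimally as below; the data structures and the two-vote threshold are assumptions for illustration, not Chorus’s actual parameters:

```python
# Minimal sketch of Chorus-style input mediation: workers propose chat
# messages and vote on proposals; only sufficiently endorsed messages
# are forwarded to the end user. Threshold is an assumed parameter.

from collections import Counter

def mediate(proposals, votes, min_votes=2):
    """proposals: candidate messages from workers; votes: list of
    messages workers voted for. Returns the messages that cleared
    the threshold, most-endorsed first."""
    tally = Counter(votes)
    passed = [m for m in proposals if tally[m] >= min_votes]
    return sorted(passed, key=lambda m: -tally[m])
```

Filtering by agreement is what lets a dynamic, partly unreliable crowd present a single consistent conversational voice to the user.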
Legion is another system that we created. In this case, we put the crowd agent in control of an existing desktop interface via VNC (remote desktop). Crowd workers send their commands (key presses or mouse clicks), and the Legion input mediators decide how to forward them on. The most basic strategy one might try is to divide time into windows and just take a vote, but it turns out this is slow and leads to thrashing. What worked best for us in this case was to use the vote not to decide what to do next, but to elect leaders who would temporarily assume full control. Over a number of trials on different tasks, the leader input mediator showed the best compromise between speed and successful task completion.
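One way the leader election might work is sketched below, under the assumption that workers are scored by how often their recent inputs matched the crowd’s per-step majority; the paper’s exact election criterion may differ:

```python
# Sketch of a Legion-style "leader" input mediator: rather than voting
# on every key press, the crowd's recent agreement elects one worker
# who temporarily gets full control. Scoring rule is an assumption.

from collections import Counter

def elect_leader(recent_inputs):
    """recent_inputs: {worker_id: [inputs in the last time window]}.
    Score each worker by agreement with the per-step majority input;
    the best-agreeing worker becomes leader."""
    steps = max(len(v) for v in recent_inputs.values())
    scores = {w: 0 for w in recent_inputs}
    for t in range(steps):
        step = [v[t] for v in recent_inputs.values() if t < len(v)]
        majority, _ = Counter(step).most_common(1)[0]
        for w, v in recent_inputs.items():
            if t < len(v) and v[t] == majority:
                scores[w] += 1
    return max(scores, key=scores.get)
```

Electing a leader avoids the thrashing of per-action voting while still grounding control in crowd agreement.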
Legion can be used for all sorts of tasks. In this example, we used it to copy a table we drew on a whiteboard into a spreadsheet. We even drove a robot around with it, in this case turning a cheap mobile webcam into a robot that followed natural language commands.
We also used Legion to investigate properties of our crowd; specifically, with people coming and going, would the crowd learn from each other? We set up a simple board in a first-person shooter in which players needed to press one of two buttons to progress through the game (either a white or a black button). We told the first generation of crowd workers which button to press, and then let them loose.
Over the course of an hour-long experiment, the crowd completely turned over several times, but they continued to press only the white button, presumably because they were learning from each other. We relate this back to the concept of Organizational Learning, which is one construct that helps to explain how culture and traditions are passed down from generation to generation at organizations ranging in size from families to nations. Of course, the time scales of the crowd are much shorter.
We also created a system for more deployable activity recognition using this model. The idea is that while automated systems can do a decent job at recognizing activities, they struggle in new environments or when someone does an activity in a new way. In our system, when the automated system, in this case an HMM-based activity recognizer, is not confident about a label, it sends the video out to the crowd. Each crowd worker inputs activity labels, and other crowd workers serve as the input mediator to decide what is forwarded. The labels get sent along with the sensor stream to train the system to work better next time. As an interesting side note, automated suggestions serve a dual purpose. Clearly, they can be used directly when they are correct, but they also help tune the crowd to the desired granularity of response; for instance, if the suggested label is “making breakfast,” workers are less likely to suggest and choose lower-level actions like “raising spoon” or “closing bag of cereal.”
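The confidence-gated loop in this note can be sketched roughly as below; the recognizer and crowd interfaces are hypothetical stand-ins, and the 0.8 threshold is an assumed parameter:

```python
# Sketch of the deployable activity-recognition loop described above:
# the automatic recognizer labels on its own when confident, defers to
# the crowd otherwise, and banks crowd labels as new training data.
# The recognizer/ask_crowd callables here are assumed stand-ins.

def label_stream(frames, recognizer, ask_crowd, threshold=0.8):
    labels, new_training = [], []
    for frame in frames:
        label, confidence = recognizer(frame)
        if confidence >= threshold:
            labels.append(label)          # trust the automatic label
        else:
            crowd_label = ask_crowd(frame)  # mediated crowd answer
            labels.append(crowd_label)
            new_training.append((frame, crowd_label))  # retrain later
    return labels, new_training
```

The crowd thus handles exactly the cases the recognizer cannot, and its answers shrink that set over time.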
The final system I’ll describe is called Legion: Scribe, which allows groups of non-experts to caption speech in real time for deaf and hard of hearing students.
Real-time captioning is the problem of producing a text version of speech with less than 5 seconds of latency. Currently, there are two main approaches to real-time captioning, and they both have drawbacks. The first is to employ professional stenographers: they are pretty accurate, but expensive, difficult to schedule, and often lacking domain expertise, which makes it difficult to caption advanced technical material. The second is Automatic Speech Recognition: it’s cheap, available on demand, and able to be adapted to new vocab. Unfortunately, despite impressive advances over the past few decades, it does not work… which is only a slight exaggeration; it does not work in novel contexts, such as when a deaf student shows up to a classroom and pulls out her iPhone. So, that led me to ask whether I could help. I type pretty fast, I know about computer science, maybe I could at least help caption our courses. Unfortunately, I can’t. In fact, by some metrics, I’m worse than ASR because I just can’t type fast enough.
So, we built a system that allows me to help. It’s called Scribe. A traditional stenographer setup looks like this: you stream audio to someone, they type what they hear, and the digital text is forwarded back to you. Unfortunately, if that person is me, I can’t type the 225 wpm or so necessary to keep up with natural speaking rates. So, instead, we distribute the audio to multiple people, they all type, and then we merge the text they type together to form a single output. Making this work well has two main components: the computer interface side, which encourages workers to type what they hear and to type different parts of the speech; and the algorithm side, which takes these pieces and stitches them back together. First, the algorithm:
It turns out that our problem is sort of similar to one encountered in computational biology. In particular, in shotgun sequencing, DNA is broken into multiple short strands that can be more easily sequenced. These sequences are then merged back together in order by multiple sequence alignment (MSA), by computing the best alignment. To use MSA, we replaced the mutation model for nucleotides with a natural language model. MSA is usually an offline procedure. To do alignment online, we perform a greedy search on a dependency graph that we create, in which edges join words that appeared next to each other in the crowd input.
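A highly simplified toy version of the dependency-graph idea is sketched below. The real system performs multiple sequence alignment with a natural language model; this sketch only illustrates the intuition of joining words that appeared next to each other in worker input and greedily following the best-supported edges:

```python
# Toy sketch of merging partial captions via a word-adjacency graph.
# This is NOT the actual Scribe alignment algorithm, just the
# dependency-graph intuition: edges join adjacent words from worker
# input, and a greedy walk follows the most-supported edges.

from collections import Counter, defaultdict

def merge_partial_captions(partials, max_len=20):
    edges = defaultdict(Counter)   # word -> Counter of next words
    starts = Counter()
    for words in partials:
        if words:
            starts[words[0]] += 1
        for a, b in zip(words, words[1:]):
            edges[a][b] += 1
    if not starts:
        return []
    merged, seen = [starts.most_common(1)[0][0]], set()
    while merged[-1] in edges and len(merged) < max_len:
        nxt = edges[merged[-1]].most_common(1)[0][0]
        if nxt in seen:            # avoid cycling on repeated words
            break
        seen.add(nxt)
        merged.append(nxt)
    return merged
```

Even this toy version shows how overlapping fragments from slow typists can reconstruct a longer utterance none of them typed in full.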
Unfortunately, this is only half the story, because it turns out the interface design for this task was non-trivial. The task is actually pretty difficult and, by its nature, frustrating, because you really can’t do it perfectly. Our interface encourages real-time input with feedback to captionists, and encourages global coverage by systematically varying the volume of the clip. The algorithm only works well with continuous sequences, and so the interface rewards workers for typing a few words in a row. Each word a worker types is more latent than the last, so the interface stops rewarding workers after sequences of about eight words. Scribe required us to carefully consider both the interface and algorithm at once so we could make up for a deficiency in one with the other, and so we describe this process as one of co-evolution.
So, we ran some experiments with the system with a bunch of workers, both local undergrads and turk workers, captioning some technical lectures from courses drawn from MIT X. The first thing you’d want to know is whether our workers can actually even type all of the words that they hear in aggregate. This graph shows they can, at about 7 workers, although it’s important to point out that these workers were complete novices. As expected, Scribe quickly outperforms both ASR and a single worker.
Here’s a precision vs. coverage graph; in this case, coverage is roughly recall. We can get pretty close to CART, although metrics in this space are tricky, because not all errors are created equally. Because of how the computer systems stenographers use to convert phonemes to text work, they often make homonym errors. These errors are compounded when the captionist is not a domain expert. So, for instance, when transcribing an Electrical Engineering lecture, CART transcribed “n-factorial” as “in pectoral,” whereas our workers got it right.
Believe it or not, deaf and hard of hearing people often actually prefer our captions. Here’s a quick video that illustrates one of the reasons why. First, you’ll notice that while our captions aren’t perfect, the errors make much more sense than ASR’s. This is one reason that even while ASR seems competitive with individuals on automated metrics, in practice it is much worse.
This is one reason why incorporating ASR back into the overall system is difficult. Nevertheless, doing so does increase coverage substantially, from 28% to 55%, showing there is information there that could be leveraged. I’m most excited about the work that we’re beginning that will use real-time crowd captions to train ASR on the fly.
So, I am done with the majority of my talk. But I want to end with a challenge, a partial solution, and another challenge.
The first challenge is not mine. It comes from Niki Kittur at CrowdCamp at CHI this year. He asks, “What would it take for me to be proud of my daughter for being a crowd worker?” And I think this is a very interesting question. So much of our research in human computation is about how to get the crowd to do work we don’t want to do, and how to compensate for the low-quality work they often provide, that I think we are missing an enormous opportunity to leverage the crowd to do work that we can be proud of, that we would be proud for our sons and daughters to do, that we would be proud to do ourselves.
So, part of my answer to Niki’s question is to pursue systems that allow us to come together to Do Good. I think VizWiz is a great example of this. Spend a few seconds, and help a blind person go about their day more independently. I would be proud of my daughter for doing that. Eventually, I think we can build interactive, crowd-powered systems that provide real value to all of us during our everyday lives. But I think we can also Do Better. One of the reasons why I am excited about Scribe is that it allows me to do something as part of a crowd that I simply could not do alone. Real-time captioning requires motor and cognitive performance at the outer limits of what humans can do. The challenge we are currently pursuing is to better understand the capabilities of crowd agents, both through the development of new applications that leverage them and their potentially super-human abilities, and through the development of a basic science of crowd motor and cognitive performance modeled on what we have for individual humans. Collectively, I hope these directions will allow crowdsourcing work to transition from work we don’t want to do to work we can be proud to do.
The content of this talk is the result of the hard work of a whole bunch of collaborators, some of whom are shown here, and of generous funding by the National Science Foundation and Google.