Professor Steve Roberts; The Bayesian Crowd: scalable information combination for Citizen Science and Crowdsourcing
1. The Bayesian Crowd: scalable information combination for Citizen Science and Crowdsourcing
Stephen Roberts
Machine Learning Research Group, Oxford-Man Institute
University of Oxford
Alan Turing Institute
Joint work with Edwin Simpson, Steven Reece, Matteo Venanzi
Bayes Nets Meeting, January 2017
3. Core methodology – Bayesian inference
• Bayesian modelling allows for explicit incorporation of all desiderata
• Effort focused not only on theory development, but also on algorithmic
implementations that are timely and practical for real-world, real-time
scenarios
• Single, under- and over-arching philosophy…
“one method to rule them all… and in the darkness bind them”
“The language is that of Bayesian inference, which I will not utter
here...”
$p(a \mid b) = p(b \mid a)\,p(a) / p(b)$
4. What does this buy us?
• Uncertainty at all levels of inference is naturally taken into account
• Optimal fusion of information: subjective, objective
• Handling missing values
• Handling of noise
• Principled inference of confidence and risk
• Optimal decision making
12. How can we deal with unreliable worker responses and
very large datasets?
Big data: Square Kilometer Array,
10 petabytes compressed images/day
Noisy reports: Twitter, Typhoon Haiyan
13. Aims: Reliability and Efficiency
● Challenge: volunteers have varying reliability
– Different knowledge, interests, skills
– Typically handled with redundancy → build a consensus
● Challenge: datasets are large, what to prioritise?
● Aim: increase accuracy by learning reliability
● Aim: use our volunteers' time efficiently
– Reduce redundant decisions
– Deploy experts where needed
– Use additional data to scale up to larger datasets
14. Machine Learning: aggregate responses and assign
tasks intelligently
● Probabilistic models of people and data
● Handle uncertainty in model
● Optimise and automate analysis to reduce costs
[Diagram labels: Data, Crowd, Crowd Annotations, Machine learning, Results]
15. Zooniverse has 26 current applications across a
range of domains, with 1 million volunteers
● Can we use ML to handle variations in ability?
● Or to match tasks to people's interests and skills?
16. How can we combine annotations from
different members of the crowd?
● Fewer annotations needed from more reliable labellers
● Confidence and trust → user weights
● But weighted majority is soft selection
– Blurred decision boundaries
● Need to combine different expertise + weak labellers
17. Bayesian Methods
● Optimal framework for combining evidence
● Quantify prior beliefs explicitly
– E.g. workers are mostly better than random
● Quantifies uncertainty at all levels
– Which agents are reliable?
– Do we need more evidence for an object's target class?
● Principled approach
– Move away from fine-tuning each project
– E.g. avoid trial-and-error thresholds to determine when
consensus is reached
18. How can we aggregate responses intelligently?
● Bayes' rule combines different pieces of information
● Weight workers' contributions through their likelihood $p(c^{(k)} \mid t)$
of response $c^{(k)}$ to class $t$
● Optimal weighted majority decision
● Error guarantees
● Soft selection
$p(t \mid c) \propto p(t) \prod_{k \in K} p(c^{(k)} \mid t)$
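The combination rule on this slide can be sketched in a few lines. This is a minimal illustration that assumes each worker's confusion matrix is already known (IBCC infers it); the function name `combine_responses`, the worker names and all numbers are hypothetical and not taken from the talk's software.

```python
import math

def combine_responses(prior, confusion, responses):
    """Posterior p(t | c) proportional to p(t) * prod_k p(c^(k) | t).

    prior:     list of class proportions p(t), length J
    confusion: dict worker -> J rows, row t holds p(c^(k) | t)
    responses: dict worker -> index of the response that worker gave
    """
    log_post = [math.log(p) for p in prior]
    for worker, c in responses.items():
        for t in range(len(prior)):
            # Each worker contributes the likelihood of the response
            # they actually gave, under each candidate target class.
            log_post[t] += math.log(confusion[worker][t][c])
    top = max(log_post)                      # subtract max for stability
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

prior = [0.5, 0.5]
confusion = {
    "reliable": [[0.9, 0.1], [0.2, 0.8]],
    "noisy":    [[0.6, 0.4], [0.5, 0.5]],
}
# The reliable worker says class 0, the noisy worker says class 1.
posterior = combine_responses(prior, confusion, {"reliable": 0, "noisy": 1})
```

The reliable worker dominates the result because their likelihoods are far from uniform, which is the sense in which the rule acts as a weighted vote.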
19. Likelihood defined by a confusion matrix
● Likelihood $\pi^{(k)} = p(c^{(k)} \mid t)$ of response $c^{(k)}$ to class $t$:
Target class t | A   | B   | C
1              | 0.7 | 0.1 | 0.2
2              | 0.4 | 0.4 | 0.2
(rows: target class $t$; columns: response $c^{(k)}$)
● Richer than user accuracy weights:
– Differing skill levels in each class
– Responses need not be votes
20. Independent Bayesian classifier combination
(IBCC) handles parameter uncertainty
[Graphical model: proportions of each class (Dirichlet) → target labels (multinomial); target labels and worker-specific confusion matrix (Dirichlet) → observed worker responses (multinomial)]
● Deal rationally with limited or missing data
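One way to read the graphical model is as a recipe for generating data. The sketch below draws synthetic targets and responses from it, assuming NumPy is available. The names `sample_ibcc` and `nu0` are assumptions made for illustration, and sharing one prior `alpha0` across all workers is a simplification.

```python
import numpy as np

def sample_ibcc(n_objects, n_workers, nu0, alpha0, seed=0):
    """Draw targets and worker responses from the IBCC generative model.

    nu0:    (J,) Dirichlet pseudo-counts for the class proportions
    alpha0: (J, L) Dirichlet pseudo-counts for each confusion-matrix row,
            shared by every worker here (an assumption)
    """
    rng = np.random.default_rng(seed)
    alpha0 = np.asarray(alpha0, dtype=float)
    n_classes, n_responses = alpha0.shape
    kappa = rng.dirichlet(nu0)                        # class proportions
    t = rng.choice(n_classes, size=n_objects, p=kappa)  # target labels
    # One confusion matrix per worker; every row is its own Dirichlet draw.
    pi = np.array([[rng.dirichlet(alpha0[j]) for j in range(n_classes)]
                   for _ in range(n_workers)])        # shape (K, J, L)
    c = np.empty((n_objects, n_workers), dtype=int)
    for i in range(n_objects):
        for k in range(n_workers):
            # Response depends on the target class and the worker's row.
            c[i, k] = rng.choice(n_responses, p=pi[k, t[i]])
    return kappa, t, pi, c

kappa, t, pi, c = sample_ibcc(
    n_objects=100, n_workers=5,
    nu0=[1.0, 1.0], alpha0=[[7.0, 1.0, 2.0], [4.0, 4.0, 2.0]])
```

Inference runs this recipe in reverse: given only `c`, recover beliefs about `t`, `pi` and `kappa`.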
21. Hyperparameters encode prior beliefs in worker
behaviour, e.g. worker is better than random
● Optimise/marginalise to handle model uncertainty
● Share prior pseudo-counts between similar projects
● Ratio → relative probability of agent responses given class t
● Magnitude → strength of prior beliefs
Prior pseudo-counts (rows: target class t; columns: response c^(k)):
Target class t | A | B | C
1              | 7 | 1 | 2
2              | 4 | 4 | 2
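The ratio and magnitude roles of the pseudo-counts can be checked numerically. The sketch below uses the counts from the table; the `observed` counts are hypothetical and assume the true class of those objects is known, which is a simplification of the full model where targets are latent.

```python
import numpy as np

# Prior pseudo-counts for one worker (rows: target class 1, 2;
# columns: responses A, B, C), taken from the table above.
alpha0 = np.array([[7.0, 1.0, 2.0],
                   [4.0, 4.0, 2.0]])

# Ratio within each row gives the prior expected confusion matrix.
expected_pi = alpha0 / alpha0.sum(axis=1, keepdims=True)

# Hypothetical tallies of this worker's responses on objects of known class.
observed = np.array([[30.0, 2.0, 1.0],
                     [3.0, 25.0, 4.0]])

# Dirichlet-multinomial conjugacy: posterior counts = prior + observed.
alpha_post = alpha0 + observed
posterior_pi = alpha_post / alpha_post.sum(axis=1, keepdims=True)

# Magnitude sets the strength of belief: scaling alpha0 by 10 leaves
# expected_pi unchanged but makes the same data move the estimate less.
strong_post = 10 * alpha0 + observed
strong_pi = strong_post / strong_post.sum(axis=1, keepdims=True)
```

Note that `expected_pi` reproduces the 0.7/0.1/0.2 and 0.4/0.4/0.2 rows of the confusion matrix shown earlier in the talk.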
22. Inference
Joint distribution, conditioned on hyper-hyperparameters
Gibbs sampling – rather slow
Variational Bayes – offers fast inference, at the
expense of approximations
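A compact sketch of mean-field variational inference for the IBCC model follows, assuming NumPy. It is not the PyIBCC code: the function names, the shared prior `alpha0`, the fixed iteration count and the synthetic data are all illustrative assumptions, and the small `digamma` helper is an approximation written here to avoid extra dependencies.

```python
import numpy as np

def digamma(x):
    """Digamma for x > 0: recurrence up to x >= 6, then asymptotic series."""
    x = np.asarray(x, dtype=float)
    result = np.zeros_like(x)
    while np.any(x < 6.0):
        small = x < 6.0
        result = result - np.where(small, 1.0 / x, 0.0)  # psi(x) = psi(x+1) - 1/x
        x = x + small
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + np.log(x) - 0.5 / x - series

def vb_ibcc(c, nu0, alpha0, n_iter=50):
    """c: (N, K) integer responses, -1 where a worker gave no response."""
    c = np.asarray(c)
    nu0 = np.asarray(nu0, dtype=float)          # (J,)
    alpha0 = np.asarray(alpha0, dtype=float)    # (J, L)
    n_workers, n_responses = c.shape[1], alpha0.shape[1]
    # One-hot responses (N, K, L); missing responses become all-zero rows.
    onehot = (c[:, :, None] == np.arange(n_responses)).astype(float)
    nu = nu0.copy()
    alpha = np.tile(alpha0, (n_workers, 1, 1))  # (K, J, L)
    for _ in range(n_iter):
        # Expected log parameters under the current Dirichlet factors.
        ln_kappa = digamma(nu) - digamma(nu.sum())
        ln_pi = digamma(alpha) - digamma(alpha.sum(axis=2, keepdims=True))
        # Update q(t): prior term plus each worker's expected log-likelihood.
        ln_q = ln_kappa + np.einsum("ikl,kjl->ij", onehot, ln_pi)
        ln_q -= ln_q.max(axis=1, keepdims=True)
        q_t = np.exp(ln_q)
        q_t /= q_t.sum(axis=1, keepdims=True)
        # Update Dirichlet counts with soft (expected) counts.
        nu = nu0 + q_t.sum(axis=0)
        alpha = alpha0 + np.einsum("ij,ikl->kjl", q_t, onehot)
    return q_t, alpha, nu

# Synthetic check: 5 workers sharing one hypothetical confusion matrix.
rng = np.random.default_rng(1)
t_true = rng.integers(0, 2, size=200)
pi_true = np.array([[0.85, 0.15], [0.2, 0.8]])
c = np.array([[rng.choice(2, p=pi_true[t]) for _ in range(5)] for t in t_true])
q_t, alpha, nu = vb_ibcc(c, nu0=[1, 1], alpha0=[[2, 1], [1, 2]])
accuracy = (q_t.argmax(axis=1) == t_true).mean()
```

The diagonal-heavy `alpha0` encodes "better than random" and also breaks the label-switching symmetry, so the first pass behaves like a weighted vote and later passes refine it.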
23. Variational Bayes
[Equation terms: negative free energy, Kullback-Leibler divergence]
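The two terms named on this slide are the parts of the standard variational decomposition of the log evidence. Written out for the IBCC variables, assuming q is the approximate posterior over targets t, confusion matrices π and class proportions κ (the symbol $\mathcal{L}$ for the negative free energy is an assumed notation):

```latex
\ln p(c) \;=\;
\underbrace{\sum_{t} \int q(t,\pi,\kappa)\,
  \ln \frac{p(c,t,\pi,\kappa)}{q(t,\pi,\kappa)}\,
  \mathrm{d}\pi\,\mathrm{d}\kappa}_{\mathcal{L}(q)\ \text{(negative free energy)}}
\;+\;
\underbrace{\mathrm{KL}\!\left(q(t,\pi,\kappa)\,\middle\|\,p(t,\pi,\kappa \mid c)\right)}_{\text{Kullback-Leibler divergence}\ \ge\ 0}
```

Because the left-hand side does not depend on q, raising $\mathcal{L}(q)$ is the same as shrinking the KL divergence between q and the true posterior.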
28. Zooniverse: Galaxy Zoo Supernovae
Users rate each presented object, which provides a score of
-1 : very unlikely SN object
1 : possible SN object
3 : likely SN object
(“true” labels obtained retrospectively via Palomar Transient Factory
spectrographic analysis)
29. IBCC-VB outperforms alternatives
Galaxy Zoo Supernovae, area under the ROC curve (AUC); higher is better:
Method            | AUC
IBCC-VB           | 0.90
Mean              | 0.65
Weighted Sum      | 0.64
Weighted Majority | 0.58
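For readers unfamiliar with the metric, AUC can be computed directly as the probability that a randomly chosen positive object is scored above a randomly chosen negative one. The sketch below uses made-up scores, not the Galaxy Zoo Supernovae data.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) statistic.

    scores: predicted score per object, e.g. posterior p(t = supernova)
    labels: 1 for positive objects, 0 for negative objects
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores for five objects: 5 of the 6 positive/negative
# pairs are ranked correctly, so the AUC is 5/6.
example_auc = auc([0.9, 0.8, 0.4, 0.35, 0.1], [1, 1, 0, 1, 0])
```

An AUC of 0.5 corresponds to random ranking, which puts the 0.58 to 0.65 range of the baselines in context.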
31. Community detection over E[π] matrices:
behaviour types among Zooniverse users
Sensible, Extreme, Random, Optimist, Pessimist
● vbIBCC provides insights into crowd behaviour using
Bayesian community analysis
● Design training to influence these types
● CommunityBCC builds these types into the model to
better predict new workers
32. CommunityBCC builds these distinct types into
the model to better understand new workers
● Priors constrain the worker model
● Fewer examples needed to learn reliabilities
33. Dynamic IBCC: behaviour changes as people
learn, get bored, move...
● Detect a worker's current state: aggregate correctly,
select suitable tasks, influence behaviour
Current state
35. What about dynamics?
[Graphical model over time: “true” decision label (multinomial); set of all observed decisions (multinomial); agent-specific “confusion” matrix with Dirichlet priors]
36. Dynamic IBCC tracks changes to the confusion
matrix over time
● Bayes' filter estimates evolving Markov chain
● Assumption: unexpected behaviour → state changes
[Figure: Galaxy Zoo Supernovae example volunteer]
37. Dynamic IBCC tracks changes to the confusion
matrix over time
● Bayes' filter estimates evolving Markov chain
● Assumption: unexpected behaviour → state changes
[Figure: Mechanical Turk document classification]
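To give a feel for tracking a changing confusion matrix, here is a deliberately simple stand-in: exponentially discounted pseudo-counts. It is not the DynIBCC Bayes' filter described on the slide; it assumes the true class of each object is known, and the function name and the `forget` parameter are hypothetical.

```python
def track_confusion(alpha0, events, forget=0.9):
    """Track one worker's expected confusion matrix over time.

    alpha0: J rows of prior pseudo-counts (lists of length L)
    events: (true_class, response) index pairs in time order
    forget: discount in (0, 1]; 1.0 recovers ordinary static counting
    Returns the expected confusion matrix after each event.
    """
    alpha = [row[:] for row in alpha0]
    history = []
    for t, c in events:
        # Shrink accumulated evidence towards the prior so that recent
        # behaviour dominates the estimate.
        alpha = [[a0 + forget * (a - a0) for a, a0 in zip(row, row0)]
                 for row, row0 in zip(alpha, alpha0)]
        alpha[t][c] += 1.0
        history.append([[a / sum(row) for a in row] for row in alpha])
    return history

# A worker who labels class 0 correctly 30 times, then wrongly 30 times.
events = [(0, 0)] * 30 + [(0, 1)] * 30
history = track_confusion([[2.0, 1.0], [1.0, 2.0]], events)
before_change = history[29][0][0]   # p(response 0 | class 0) after event 30
after_change = history[-1][0][1]    # p(response 1 | class 0) at the end
```

With `forget=1.0` the final estimate would stay near 50/50, which illustrates why a static model aggregates such a worker poorly.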
39. Combining the crowd with features:
TREC Crowdsourcing Challenge
● IBCC + 2000 LDA features acting
as additional classifiers [11]
● Classify unlabelled documents
● Results:
– 0.81 AUC with only 16% of
documents labelled at all
– 0.77 for next-best approach
– 1st place required multiple
labellings of all documents
40. BCCWords: an efficient way to learn language
in new contexts
[Figure: accuracy (0.2 to 0.8) against number of labels (25,000 to 150,000) on CrowdFlower Tweet Sentiment, comparing IBCC, CBCC, ScalBCCWords, MV (text classifier), DawidSkene, MV and Vote distribution]
[Figure: positive words about the weather learnt by BCCWords]
BCCWords increases accuracy with limited labels
41. Unstructured data in social media: a rich
source of timely information
Real-time, local events – e.g. emergency reports after an
earthquake
Sentiment about products, health and social issues – e.g.
opinions about H1N1, product reviews
Butler 2013, Morrow et al. 2011
42. Understanding Textual Data Streams
● Turn unstructured data into reliable, machine-readable
information
● Automated classifiers struggle to understand diverse,
evolving language in new contexts
● Need new tools to resolve ambiguity and lack of
training data
[Figure: Ushahidi – from Haiti 2010 earthquake (Morrow et al. 2011)]
[Figure: categories of earthquake reports, Nepal, 2015, Quakemap.org]
[Figure: gender, “Love” and “Dude” (Kivran-Swaine et al., 2013)]
43. Interpreting Language through Crowdsourcing
● Biased and noisy interpretations
● Scalability: the workers cannot label everything multiple times
● New techniques needed to reduce the workload of labellers
using textual information
● How to learn a language model from unreliable judgements?
[Figure labels: Repetitive Tasks, Time, Costs]
44. Scenario: Sentiment Analysis of Tweets and
Reviews
Dataset | Text | Platform | Sentiment Classes | No. Documents | No. Judgements | No. Workers
2013 CrowdScale shared task challenge | Tweets about weather | CrowdFlower | Positive, Negative, Neutral (–), Not related (X), Unknown (?) | 98,980 | 569,375 | 461
Rodrigues et al., 2013 | Rotten Tomatoes Movie Reviews | Amazon Mechanical Turk | Positive, Negative | 5,000 | 27,747 | 203
Example tweets:
“Morning sunshine” 09:18 PM June 7, 2011
“Is it rainy too? Totally hate it” 10:05 PM June 7, 2011
“lovely sunny day” 10:06 PM June 7, 2011
45. Bayesian Classifier Combination with Words:
BCCWords
● Bayes' theorem provides a principled mathematical
framework for classifier combination
– Dawid & Skene, 1979; Kim & Ghahramani, 2012; Simpson et al., 2013;
Venanzi et al., 2014.
– Outperforms weighted majority voting etc.
46. Bayesian Classifier Combination with Words:
BCCWords
● Novel approach to combine weak signals from text
and crowd
– Model the reliability of members of the crowd
– Train a language model to reduce the number of
judgements needed
47. Reliability of judgements defined by a
confusion matrix for each worker
● Defines likelihood for worker k: $p(\mathrm{label}^{(k)} \mid \text{true class})$
● Aggregate support for class c using Bayes' rule: $\prod_{k \in K} p(\mathrm{label}^{(k)} \mid \text{true class} = c)$
● Richer than weighting by overall accuracy:
– Accounts for bias and random noise
– Differing skill levels in each class
– Labels need not be votes for true class
Confusion matrix (rows: true class; columns: label^(k)):
True class | +ve | uncertain | -ve
+ve        | 0.7 | 0.1       | 0.2
-ve        | 0.4 | 0.4       | 0.2
48. Likelihood of text features in each class: bag-of-words
$\omega_c = p(\mathrm{word}_n \mid \text{true class} = c)$
● Words have different likelihoods in each sentiment class
● Prior distribution over word likelihoods in each class
● Learning posterior: update pseudo-counts as we observe words
in document of class c
[Figure: two word distributions ω_c, one in which “good” and “nice” are more likely and one in which “terrible” is more likely]
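The pseudo-count update for the bag-of-words model can be sketched as follows. The function names, the symmetric prior of 1.0 per word and the toy vocabulary size are assumptions for illustration; in the full model the class of each document is uncertain, which is why the update accepts soft class probabilities.

```python
from collections import Counter

def update_word_counts(counts, document_words, class_probs):
    """Add a document's words to the observed counts of each class.

    counts:      dict class -> Counter of (soft) observed word counts
    class_probs: dict class -> current belief that the document belongs
                 to that class; use 1.0 when the class is known
    """
    for c, p in class_probs.items():
        for w in document_words:
            counts[c][w] += p

def word_likelihood(counts, c, word, vocab_size, prior=1.0):
    """Posterior mean of omega_c[word] under a symmetric Dirichlet prior."""
    total = sum(counts[c].values()) + prior * vocab_size
    return (counts[c][word] + prior) / total

counts = {"positive": Counter(), "negative": Counter()}
update_word_counts(counts, ["lovely", "sunny", "day"], {"positive": 1.0})
update_word_counts(counts, ["totally", "hate", "rainy"], {"negative": 1.0})

vocab_size = 6  # six distinct words seen in this toy example
p_sunny_pos = word_likelihood(counts, "positive", "sunny", vocab_size)
p_sunny_neg = word_likelihood(counts, "negative", "sunny", vocab_size)
```

After one document per class, "sunny" is already twice as likely under the positive class (2/9 against 1/9), while the prior keeps unseen words from getting zero probability.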
51. BCCWords: judgements are conditioned on
true class
[Graphical model: Confusion Matrix and True Class → Judgement Label; plate over N documents]
52. BCCWords: judgements and words are
conditioned on the true class
[Graphical model: Confusion Matrix and True Class → Judgement Label; Word Likelihoods ω_c and True Class → Words; plate over N documents]
53. BCCWords: judgements and words are
conditioned on the true class
[Graphical model: Confusion Matrix and True Class → Judgement Label; Word Likelihoods ω_c and True Class → Words; plate over N documents]
Use Bayes' rule to infer true class
from labels and words
… but we need to learn the likelihoods
from true class labels
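The inference step on this slide, with the likelihoods treated as known, can be sketched like this. The function name, the worker id `w1` and the word likelihoods are hypothetical; the confusion-matrix values are the ones from the worker reliability table earlier in the talk.

```python
import math

def posterior_true_class(prior, confusion, labels, word_lik, words):
    """p(true class | labels, words) by Bayes' rule.

    prior:     dict class -> p(class)
    confusion: dict worker -> dict class -> dict label -> p(label | class)
    labels:    dict worker -> label that worker gave
    word_lik:  dict class -> dict word -> p(word | class)
    words:     list of words in the document
    """
    log_p = {}
    for c, p in prior.items():
        lp = math.log(p)
        # Crowd evidence: likelihood of each worker's label under class c.
        lp += sum(math.log(confusion[k][c][l]) for k, l in labels.items())
        # Text evidence: bag-of-words likelihood under class c.
        lp += sum(math.log(word_lik[c][w]) for w in words)
        log_p[c] = lp
    top = max(log_p.values())
    unnorm = {c: math.exp(lp - top) for c, lp in log_p.items()}
    total = sum(unnorm.values())
    return {c: u / total for c, u in unnorm.items()}

prior = {"+ve": 0.5, "-ve": 0.5}
confusion = {"w1": {
    "+ve": {"+ve": 0.7, "uncertain": 0.1, "-ve": 0.2},
    "-ve": {"+ve": 0.4, "uncertain": 0.4, "-ve": 0.2},
}}
word_lik = {"+ve": {"sunny": 0.3, "rainy": 0.1},
            "-ve": {"sunny": 0.1, "rainy": 0.3}}
post = posterior_true_class(prior, confusion, {"w1": "+ve"}, word_lik, ["sunny"])
```

The single label alone gives 0.7 against 0.4; adding the word "sunny" sharpens the posterior to 0.84, which is how text reduces the number of judgements needed.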
54. Variational Bayes: learn confusion matrices, language
model and true class with limited training data
● Computationally efficient: 20 mins for 500k judgements, 98k tweets
● Iteratively updates each variable in turn, learning from latent structure
and any prior knowledge or training data
● Algorithm can be distributed to constrain memory requirements
55. Experiments: Sentiment Analysis of Tweets and
Reviews
56. Language Model for Weather Sentiment
[Figure: most likely words and discriminative words for the Positive and Negative classes]
57. Distinct worker types show the importance of
learning reliability
[Figure: confusion matrices for a Good Worker and an Inaccurate Worker; axes: true class, worker label, probability (0 to 1); CrowdFlower Weather – 5 classes]
58. Summary: BCCWords fuses subjective
interpretations to learn models of language in
the wild
● Important to account for skills and bias
of individuals in crowd
● Learns worker reliability and language
model in a single integrated inference
algorithm
● Uses textual information to reduce the
number of judgements required
● Bayesian inference
– Proven framework for fusing information
– Handles uncertainty in true class labels
and model itself
59. Moving towards efficient learning with
Crowd in-the-Loop
● Turn masses of unstructured, heterogeneous data into
reliable, machine-readable information
● Use the model to choose who does what task
● Detect different interpretations of language between communities
in the crowd?
60. Intelligent agent-task assignment:
who should classify which object?
● Aim: direct crowd's effort to learn quickly and cheaply
● Prioritise tasks by considering their features and confidence
in their classification
● Task choice depends on the workers available
● Maximise expected utility
DynIBCC confusion matrix
describes individual skills
61. Utility of response: information gain about
targets when DynIBCC is updated
● Naturally balances exploration and exploitation
● Explore an agent's behaviour from silver tasks
– Objects already labelled confidently by crowd
– Increases utility of past responses
● Exploit an agent's skills to learn uncertain targets t
$E[U_\tau(k, i)] = E[I(t; c_i^{(k)} \mid D_\tau)]$
where i is the index of the target object, k the worker ID, $D_\tau$ the crowdsourced data
collected so far and τ the time index
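A simplified version of this utility is the mutual information between one target and one worker's next response. The sketch below treats the expected confusion matrix as a fixed point estimate, so it captures only the "exploit" part of the slide's utility and ignores what the response would teach about the worker; the function names and numbers are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(x * math.log(x) for x in p if x > 0)

def expected_information_gain(p_t, pi):
    """Mutual information I(t; c) for one target and one worker.

    p_t: current posterior over the target's classes, length J
    pi:  expected confusion matrix, J rows holding p(c | t)
    """
    n_classes, n_responses = len(p_t), len(pi[0])
    # Predictive distribution over the worker's response.
    p_c = [sum(p_t[j] * pi[j][l] for j in range(n_classes))
           for l in range(n_responses)]
    gain = entropy(p_t)
    for l in range(n_responses):
        if p_c[l] == 0:
            continue
        post = [p_t[j] * pi[j][l] / p_c[l] for j in range(n_classes)]
        gain -= p_c[l] * entropy(post)  # expected remaining uncertainty
    return gain

workers = {"expert": [[0.95, 0.05], [0.05, 0.95]],
           "random": [[0.5, 0.5], [0.5, 0.5]]}
targets = {"uncertain": [0.5, 0.5], "confident": [0.99, 0.01]}

# Greedy choice: the (worker, object) pair with the highest expected gain.
best_pair = max(((k, i) for k in workers for i in targets),
                key=lambda ki: expected_information_gain(targets[ki[1]],
                                                         workers[ki[0]]))
```

As expected, the greedy choice sends the expert to the uncertain object, and a random worker has zero utility on any object under this simplified measure.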
62. Hiring and firing algorithm makes greedy
assignments to reduce computational cost
● Hire for priority task that matches current skills
● Fire if new crowd members likely to do better
63. Loose crowds on the web and in organisations:
Disaster Response
● Extracting key information from noisy background
– Text: Twitter, Ushahidi 15000 messages in a few weeks [8]
– Images: Satellite, Social Media
– Team communications, other agencies
● Locations of emergencies:
– continuous target function
64. Bayesian crowdsourced heatmaps visualise
likely emergencies and information gaps
● Neighbouring reports related by spatial Gaussian
process (GP) classifier
[Graphical model symbols: Κ, t_i, c_i^(k), π^(k), α0^(k); labels: density of emergencies at (x,y), emergency state at (x,y), GP variance]
Sigmoid function maps GP to Dirichlet
65. Bayesian crowdsourced heatmaps visualise
likely emergencies and information gaps
[Figure: Ushahidi crowd + trusted report from first responder]
67. Adaptive training and motivation to create diverse
skills and stimulate workers
● Model worker preferences, rewards
● Fast approximations to future utility
– Deduct cost of rewards
– Add retention, work rate, reliability
– Target clusters of workers
● Selecting tasks/training: consider person's
history
Apprenticeship/Peer Training
Infer improvements in confusion
matrices from effect of task on others
68. Models for combining new data types and target
functions
● Targets have multiple dimensions
– Shapes in PlanetFour
● Poisson processes, event rates
– Malaria rates
69. Actively switch types of tasks to optimise
learning from the crowd
● Select questions from decision tree
● Labelling, comparing, marking features, grouping...
● Utility varies: accuracy of responses, current model of
features...
[Figure labels: “34.556”, “...is like...”, “Maximise information about t”]
70. Learn how people make decisions by
actively adapting tasks
● Improve automation,
reduce work
● Select interaction mode or
questions in the micro-task
● Maximise information given
current model
● Crowd-supervised feature
extraction, e.g. adapting
PCA to learn more useful
features from the crowd
[Figure label: Projection]
71. Summary: Bayesian models enable accurate
and scalable crowdsourcing across domains
● Quantify uncertainty in data, model and worker behaviour
● Actively learn from crowds using model of features
● Opportunities: optimisation and learning to automate
with humans-in-the-loop
[Diagram labels: Data, Crowd, Crowd Annotations, Machine learning, Results]
72. ORCHID and Zooniverse collaborators worked
with Rescue Global to identify and then refine
their critical information requirements.
• placement of life detectors and water
filters within 50 mile radius of Kathmandu.
Crowd labelled 1200 Planet Labs satellite images
using Zooniverse software.
• Recruited 25 image labellers from within
Oxford University and Rescue Global staff
(they worked hard over the bank holiday
weekend).
Folded in OpenStreetMap building density data
and inferred population density map using
ORCHID data processing algorithms.
Delivered map overlay to Rescue Global for
dissemination to their CaDRA partners (SARaid,
Team Rubicon, CADENA).
Timeline:
25/04/15: 7.8 earthquake in Gorkha District of Nepal
29/04/15 to 2/05/15
02/05/15 to 20:13 GMT 05/05/15
00:15 GMT 06/05/15
05/05/15
73. Software on GitHub
● http://www.robots.ox.ac.uk/~edwin/
– Please use and report bugs
● PyIBCC: IBCC-VB and DynIBCC-VB in Python 2
– Collaborating with Zooniverse
● MatlabIBCC: IBCC-VB and DynIBCC-VB in Matlab
Acknowledgements
● Uni of Southampton: Nick Jennings, Alex Rogers, Sarvapali
Ramchurn, Matteo Venanzi
● Oxford: Edwin Simpson, Steve Reece, Chris Lintott, Zooniverse team
● EPSRC (UK research council), the ORCHID project, Rescue Global,
Microsoft, Zooniverse
74. References
[1] Dawid, A. P., Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 20-28.
[2] Kim, H. C., Ghahramani, Z. (2012). Bayesian classifier combination. In International conference on artificial intelligence and statistics (pp. 619-627).
[3] E. Simpson, S. Roberts, I. Psorakis, A. Smith and C. Lintott (2011). Bayesian Combination of Multiple, Imperfect Classifiers. Proceedings of NIPS 2011 workshop
[4] Simpson, E., Roberts, S., Psorakis, I., Smith, A. (2013). Dynamic Bayesian combination of multiple imperfect classifiers. In Decision Making and Imperfection (pp. 1-35). Springer.
[5] Psorakis, I., Roberts, S., Ebden, M., Sheldon, B. (2011). Overlapping Community Detection using Bayesian Nonnegative Matrix Factorization. Physical Review E, 83.
[6] Venanzi, M., Guiver, J., Kazai, G., Kohli, P., Shokouhi, M. (2014). Community-based Bayesian aggregation models for crowdsourcing. In Proceedings of the 23rd international conference on World wide web (pp. 155-164). International World Wide Web Conferences Steering Committee.
[7] E. Simpson, S. Roberts (2015 – to appear). Bayesian Methods for Intelligent Task Assignment in Crowdsourcing Systems, Scalable Decision Making: Uncertainty, Imperfection, Deliberation; Studies in Computational Intelligence, Springer
[8] N. Morrow, N. Mock, A. Papendieck, and N. Kocmich (2011). Independent Evaluation of the Ushahidi Haiti Project. Development Information Systems., 8:2011.
[9] MacKay, David J. C. (1992). Information-based objective functions for active data selection. Neural computation, 4(4):590–604.
[10] Chen, X., Bennett, P. N., Collins-Thompson, K., and Horvitz, E. (2013). Pairwise ranking aggregation in a crowdsourced setting. In Proceedings of the sixth ACM international conference on Web search and data mining. ACM
[11] E. Simpson, S. Reece, A. Penta, G. Ramchurn, and S. Roberts (2012). Using a Bayesian Model to Combine LDA Features with Crowdsourced Responses. In The Twenty-First Text REtrieval Conference (TREC 2012), Crowdsourcing Track, NIST.
[12] S. Nitzan, J. Paroush (1982). Optimal decision rules in uncertain dichotomous choice situations. International Economic Review, 23(2):289–297.
[13] D. Berend, A. Kontorovich (2014). Consistency of Weighted Majority Votes. NIPS
[14] Y. Zhang, X. Chen, D. Zhou, M. Jordan (2014). Spectral methods meet EM: a Provably Optimal Algorithm for Crowdsourcing.