FCPCCS - Big Data and Crowdsourcing
Pattern-recognition and the crowd
What would you do with unlimited human analysts?
People
Data
Categories
Models
Unstructured data gets structured (bonus: a system that gets smarter over time)
[Diagram: Adaptive System (Machine Learning, Optimization, Human Annotation, Prediction Engine); outputs: Structured Data, Reports, Action]
Efficiency of human time is a major benefit
[Figure: bar chart, “Finding Relevant News Articles”, showing % analyst time saved and % accuracy (compared to humans) for Health Sciences, Manufacturing, News Category 1, News Category 2, and News Category 4; axis from 0% to 100%; plotted values range from 73% to 99%]
The importance of definition
• If people can’t agree on what’s-in and what’s-out, it’s
hard to train a machine
Wait a sec! Aren’t these ducks?
(Can we agree to disagree?)
The importance of definition
• If people can’t agree on what’s-in and what’s-out, it’s
hard to train a machine
• In our case, toxicity was defined as:
  • ad hominem attacks (directed at specific people)
  • bigoted comments (e.g., sexist, racist, homophobic)
• Set definitions
• Then see if people are consistent
• Run pilots
• Measure inter-annotator agreement
• Iterate
Inter-annotator agreement: is everyone
measuring the same way?
Quick recommendation for inter-annotator agreement
• You can measure consistency; probably the best way is Krippendorff’s alpha
• Don’t use percentage agreement, particularly when the data are skewed towards one category
  • If 95% of the data fall under one category label, two people coding at random would still agree so often that % agreement would make you think you had a reliable study (even though you wouldn’t)
• And you can ALSO use models to check these things
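
The skew problem can be checked numerically. The sketch below is illustrative and not from the talk: it simulates two hypothetical coders who ignore the items and say "major" 95% of the time, then compares raw percent agreement with Krippendorff's alpha. The alpha function covers only the simplest case (two coders, nominal labels, no missing data), and all function names are made up for this example. Analytically, two such coders should agree on about 0.95² + 0.05² = 0.905 of the items while alpha stays near zero.

```python
import random
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Share of items on which the two coders chose the same label."""
    return sum(x == y for x, y in zip(coder_a, coder_b)) / len(coder_a)


def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for two coders, nominal labels, no missing data."""
    n_items = len(coder_a)
    n_values = 2 * n_items  # every item contributes two labels
    # Observed disagreement: fraction of items where the coders differ.
    d_o = sum(x != y for x, y in zip(coder_a, coder_b)) / n_items
    # Expected disagreement: chance that two labels drawn without
    # replacement from the pooled labels of both coders differ.
    pooled = Counter(coder_a) + Counter(coder_b)
    same = sum(c * (c - 1) for c in pooled.values())
    d_e = 1 - same / (n_values * (n_values - 1))
    if d_e == 0:
        return float("nan")  # only one label was ever used; alpha is undefined
    return 1 - d_o / d_e


def random_coder(n_items, rng, p_major=0.95):
    """Hypothetical coder who ignores the item and mostly says 'major'."""
    return ["major" if rng.random() < p_major else "minor" for _ in range(n_items)]


rng = random.Random(0)
a = random_coder(20_000, rng)
b = random_coder(20_000, rng)
print(f"percent agreement:    {percent_agreement(a, b):.3f}")
print(f"Krippendorff's alpha: {krippendorff_alpha_nominal(a, b):.3f}")
```

For real studies with more coders, missing labels, or ordinal scales, a maintained implementation is a safer choice than this two-coder sketch.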
Finding healthy communities (supportive)
And unhealthy ones (toxic)
Collect data and annotations, then interrogate it
[Diagram labels: Human annotations; “Which people/categories should we be wary of?”; “Which annotations do we select to train a model with?”; A classifier that can predict unseen data]
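
One simple way to start asking "which people should we be wary of?" is to compare each annotator with the per-item majority label. This is a minimal sketch under stated assumptions, not the method used in the talk: the annotation tuples, annotator names, and function names are all hypothetical, and tied items are simply skipped.

```python
from collections import Counter, defaultdict

# Hypothetical annotations: (item_id, annotator, label)
annotations = [
    (1, "ann_a", "toxic"), (1, "ann_b", "toxic"), (1, "ann_c", "ok"),
    (2, "ann_a", "ok"),    (2, "ann_b", "ok"),    (2, "ann_c", "toxic"),
    (3, "ann_a", "toxic"), (3, "ann_b", "toxic"), (3, "ann_c", "toxic"),
    (4, "ann_a", "ok"),    (4, "ann_b", "ok"),    (4, "ann_c", "toxic"),
]


def majority_labels(annotations):
    """Map each item to its most common label, skipping items with a tie."""
    votes = defaultdict(Counter)
    for item, _annotator, label in annotations:
        votes[item][label] += 1
    consensus = {}
    for item, counter in votes.items():
        (top_label, top_count), *rest = counter.most_common(2)
        if rest and rest[0][1] == top_count:
            continue  # tie: no clear majority for this item
        consensus[item] = top_label
    return consensus


def agreement_with_majority(annotations, consensus):
    """Fraction of each annotator's labels that match the majority label."""
    hits, totals = Counter(), Counter()
    for item, annotator, label in annotations:
        if item in consensus:
            totals[annotator] += 1
            hits[annotator] += label == consensus[item]
    return {annotator: hits[annotator] / totals[annotator] for annotator in totals}


consensus = majority_labels(annotations)
scores = agreement_with_majority(annotations, consensus)
for annotator, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{annotator}: {score:.2f}")
```

Agreement with the majority has the same weakness as percent agreement on skewed data, so a low score is a prompt to look at that annotator's work, not a verdict. The same tally grouped by category instead of annotator points to categories whose definitions may need another iteration.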
Routing messages that matter
Processing millions of SMS in 12 African languages
• Intent of sender (e.g., report a problem, ask a question, or make a suggestion)
• Categorization (e.g., orphans and vulnerable children, violence against children, health, nutrition)
• Language detection (e.g., English, Acholi, Karamojong, Luganda, Nkole, Swahili, Lango)
• Location (e.g., village names)
1.4%
Top 3 categories in Nigeria
[Chart: categories Employment, U-report support, Health; values shown: 9.69%, 17.68%, 39.44%]
The Donald Rumsfeld Question
How do I find what I don’t know I don’t
know?
Negative topics in Walmart employee reviews
[Figure: topics shown are Hours/Benefits, Management, Work/life balance, Company Values, Dealing With Customers, Training & Expectation, and Low Pay; counts shown are 968, 518, 2,404, 1,241, 658, 968, and 1,446]
Structuring unstructured data lets you combine it with other metadata
[Figure: two bar charts, “Common Pros among Employees” and “Common Cons among Employees”, each comparing Current and Former employees; vertical axes in percent]
Question: What improves models the
most?
Instead of worrying about the algorithms
in the machine
It’s almost always better to just get more
pandas
How else do you verify?
• We assess model accuracy using cross-validation.
  • Instead of using all annotated data to train a model, you hold out a random 10% and build the model with the rest.
  • Then you predict against that 10%. You do this 10 times and average the accuracy.
• Precision measures “if we automatically label something as X, how often are we right?”
• Recall measures “how much of the stuff that SHOULD have label X is actually given label X?”
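
The procedure above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the system described in the talk: the data are synthetic, the "model" is a toy nearest-mean classifier on one numeric feature, and the ten held-out sets are taken as disjoint folds (standard 10-fold cross-validation), which is one reading of "hold out a random 10% ... 10 times". All function names are hypothetical.

```python
import random
from statistics import mean


def train_nearest_mean(xs, ys):
    """Toy model: remember the mean feature value of each label."""
    by_label = {}
    for x, y in zip(xs, ys):
        by_label.setdefault(y, []).append(x)
    means = {label: mean(values) for label, values in by_label.items()}
    # Predict the label whose mean is closest to the new value.
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def k_fold_accuracy(features, labels, train, k=10, seed=0):
    """Hold out each of k random folds in turn; return the mean accuracy."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        predict = train([features[i] for i in train_idx],
                        [labels[i] for i in train_idx])
        correct = sum(predict(features[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return mean(accuracies)


def precision_recall(gold, predicted, target):
    """Precision and recall for one label, from paired gold/predicted lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == target and p == target for g, p in pairs)
    fp = sum(g != target and p == target for g, p in pairs)
    fn = sum(g == target and p != target for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Synthetic data: label "X" tends to have larger feature values.
rng = random.Random(42)
features = [rng.gauss(2.0, 1.0) for _ in range(100)] + \
           [rng.gauss(0.0, 1.0) for _ in range(100)]
labels = ["X"] * 100 + ["other"] * 100

accuracy = k_fold_accuracy(features, labels, train_nearest_mean)
print(f"mean 10-fold accuracy: {accuracy:.2f}")

gold = ["X", "X", "X", "X", "other", "other"]
pred = ["X", "X", "other", "other", "X", "other"]
p, r = precision_recall(gold, pred, "X")
print(f"precision={p:.2f} recall={r:.2f}")
```

Reporting precision and recall per label alongside accuracy matters most when one label is rare, for the same reason percent agreement misleads on skewed data.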
The system gets smarter
• Here’s what happens going across the first 2,543 annotations on one REALLY low-signal classification task
• By 9,744 annotations, our accuracy is 97%
Other tasks are more straightforward
[Figure: line chart, “F-scores go up with more annotations”; x-axis: number of paragraphs annotated (50 to 200); y-axis: F-score (0.00 to 1.00); series: Disease, Country, Reported_deaths, Reported_cases, Date, Issue, Location, People affected, # of deaths, Event date]
Project workflow
Phase 1: Data
• Data capture, normalization and loading
Phase 2: Discovery
• Topic discovery
• Category creation
• Expert data annotation
• Category verification
Phase 3: Training
• Guideline creation
• Annotator validation
• Model training
Phase 4: Optimization
• Model evaluation
• Category refinement
Phase 5: Model Deployment
• Full system integration
• Model performance
• Metrics reporting
email tyler@idibon.com
twitter @idibon
www idibon.com
THANK YOU!

Editor’s notes

  1. http://nypost.com/2015/02/07/meet-the-bird-brains-batty-enough-to-go-bird-watching-in-winter/
  2. This is the basic stuff you want. (It’s a little self-serving because Idibon’s adaptive system is what makes us special, but we really do believe that optimizing training on relevant data with meaningful categories is THE way to deliver business value.) By using computers to create an initial understanding of data and elevate specific cases for Human Annotation, we use computers to make human decisions smarter, and humans to make computer decisions smarter. Our system optimizes work by using cutting-edge Machine Learning that improves accuracy and learns iteratively. Our Prediction Engine provides initial conclusions for further evaluation by human analysts and is also what allows us to scale to tens of millions of messages a day. Our Optimization process teaches our algorithm what results to select for, essentially refining its accuracy. The key takeaway here is that we optimize for human analysts’ time; we can cluster data initially and automatically, then we can escalate specific cases to human annotation. Much of the learning is unsupervised and therefore faster, cheaper and actually more accurate. After iterations in our adaptive system, previously unstructured data is now structured. This structured data can be delivered in different outputs, including CSV file exports for your analysts to build reports or direct routing to customer service agents to take action.
  3. As you can see, different categories have different results. News category 1 is awesome: you really don’t have to show human analysts much data to get all the relevant stuff (you show them 10% of the data and still get 99% of what the client cares about). Manufacturing is less awesome. You can reduce your workload by 73%, but you have to accept that you’ll only get 83% of the stuff you care about (you’ll miss 17%). If you want to get more like 90% accuracy, you need to review more documents, and you “only” get a workload reduction of ~56%. Ideally, you want a system that gets better over time.
  4. First case study! http://idibon.com/toxicity-in-reddit-communities-a-journey-to-the-darkest-depths-of-the-interwebs/
  5. Lately, Reddit has gotten a lot of press for having terrible, awful communities
  6. See also http://cswww.essex.ac.uk/Research/nle/arrau/icagr.pdf
  7. http://blog.ioactive.com/2013/05/security-101-machine-learning-and-big.html
  8. The important thing is having definitions people will agree with and can be consistent with…and which actually answer organizational objectives. Do you care about whether duck decoys and/or rubber duckies are ducks or not? WHY? http://blog.ioactive.com/2013/05/security-101-machine-learning-and-big.html
  9. The trickiest thing about ad hominem attacks as a definition is: what to do with trash talk in sports/gaming. Tricky!
  10. The trickiest thing about ad hominem attacks as a definition is: what to do with trash talk in sports/gaming. Tricky!
  11. This is interactive, check out: http://idibon.com/toxicity-in-reddit-communities-a-journey-to-the-darkest-depths-of-the-interwebs/ The DIY (do it yourself) group is the one that is most supportive and least toxic. This data ties to actual upvote/downvote behavior, meaning that you’re not actually a supportive community if everyone downvotes the supportive comments, nor are you a toxic community if everyone downvotes the toxic comments.
  12. This is interactive, check out: http://idibon.com/toxicity-in-reddit-communities-a-journey-to-the-darkest-depths-of-the-interwebs/ It’s only when everyone upvotes toxic comments that you are a toxic community by our definition here.
  13. We also specifically looked at bigotry. Indeed, /r/TheRedPill is seen as the most bigoted. It’s a subreddit dedicated to proud male chauvinism. This is interactive, check out: http://idibon.com/toxicity-in-reddit-communities-a-journey-to-the-darkest-depths-of-the-interwebs/
  14. Case study three: http://idibon.com/idibon-supports-unicef-provide-natural-language-processing-sms-based-social-monitoring-systems-africa/ Photo: http://unicefaids.tumblr.com/post/37835112363/photo-young-people-in-kitwe-zambia-explore-the
  15. The United Nations Children’s Fund (UNICEF) is a United Nations branch that provides long-term humanitarian and developmental assistance to children and mothers in developing countries. Idibon provides scalable natural language processing and analytics to UNICEF’s multinational U-report applications, enabling UNICEF to process text messages sent from citizens in Uganda and Nigeria “to better understand and empower marginalized communities that are often excluded due to language barriers.” (Evan Wheeler, CTO of UNICEF’s Global Innovation Centre) UNICEF U-report only has six dedicated analysts to process and respond to millions of messages a month, and Idibon’s technology enables the organization to operate efficiently and at scale. Specifically, Idibon processes each SMS in four ways: (1) intent of sender, to prioritize support/services (UNICEF receives more than a million messages a month and can only respond to about a thousand); (2) categorization, to prioritize support/services and to route to the appropriate analyst; (3) language detection, to route to the appropriate analyst; (4) location, to identify where to send support/services. Press release: http://unicefstories.org/2015/02/09/idibon-supports-unicef-to-provide-natural-language-processing-to-sms-based-social-monitoring-systems-in-africa/
  16. Environment is an important issue. But it looks to be about 1.4% of the data, which means you do have to get enough data to build a model. Note that different countries/languages talk about the environment differently (Uganda: droughts, cows; Nigeria: oil). So you may have more or less heterogeneity in your rarer categories. Image from http://www.theatlantic.com/photo/2011/06/nigeria-the-cost-of-oil/100082/ For more recent news: http://www.theguardian.com/environment/2015/jan/07/niger-delta-communities-to-sue-shell-in-london-for-oil-spill-compensation
  17. “Environment” is clearly an important issue in Nigeria but only 1.4% of the messages are classified that way. (One other thing: high/low percentages don’t necessarily correspond to personal or societal importance.)
  18. Each needle found makes the next one easier to find, buuuuuuut some things you want to find are just too rare. You can’t model things that aren’t in the data.
  19. At UNICEF, different people care about different categories: the people who respond to rumors of Ebola outbreaks or cures are different from the people trying to keep track of economic issues. Most actionable is, of course, finding people who specifically require support about participating in the community.
  20. Pay and Opportunities are much less of a pro once employees have left Walmart, and become more of a con. Management is highly criticised amongst both current and former employees.
  21. 9,744 annotations total: 951 for engageable, 8,793 for irrelevant.
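
The trade-off described in note 3 (review fewer documents, accept that you miss some relevant ones) can be sketched in a few lines of Python. This is a minimal illustration only, not Idibon's system: the function name `review_fraction_for_recall`, the toy scores, and the toy labels are all invented for the example. It assumes a model has already scored each document for relevance and that analysts review documents from the highest score down.

```python
def review_fraction_for_recall(scores, labels, target_recall):
    """Smallest share of documents an analyst must review, working down
    from the highest model score, to find `target_recall` of the
    relevant documents.

    scores: hypothetical model relevance scores, one per document
    labels: 1 if the document is truly relevant, 0 otherwise
    Returns (fraction_reviewed, recall_achieved).
    """
    total_relevant = sum(labels)
    if total_relevant == 0:
        raise ValueError("no relevant documents in the sample")

    # Rank documents so the most likely relevant ones are reviewed first.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    found = 0
    for reviewed, (_, label) in enumerate(ranked, start=1):
        found += label
        recall = found / total_relevant
        if recall >= target_recall:
            return reviewed / len(ranked), recall

    # Not reachable for target_recall <= 1.0; kept as a safe fallback.
    return 1.0, found / total_relevant


# Toy data, made up for illustration (not from the deck).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

for target in (0.75, 1.0):
    fraction, recall = review_fraction_for_recall(toy_scores, toy_labels, target)
    print(f"target recall {target:.0%}: review {fraction:.0%} of documents, "
          f"workload reduction {1 - fraction:.0%}, recall {recall:.0%}")
```

On the toy data, pushing the target from 75% to 100% of the relevant documents doubles the share that has to be reviewed, which is the same shape as the Manufacturing example: higher coverage costs analyst time.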