Ushine Plug-In
Using machine learning and natural language processing
to improve the human review process of crisis reports
Topics
● Intro to project
● Project contents
● Data sets
● Evaluation
● Data ethics
● Future work
How to Follow Up...
● GitHub repository (open-source project code + wiki documentation):
http://github.com/dssg/ushine-learning
Collaborators welcome! (Both within and outside of Ushahidi.)
● DSSG team e-mail: dssg-ushahidi@googlegroups.com
● Main Ushahidi contacts: Emmanuel Kala + Heather Leson
● Data Science for Social Good fellowship: http://dssg.io
Thanks!
Thanks to our partners at Ushahidi and the many
individuals and organizations who generously gave us
their advice and feedback...
Alphabetically:
Chris Albon, Rob Baker, George Chamales, Jennifer Chan,
Crisis Mappers, Schuyler Erle, Sara-Jayne Farmer, Rayid
Ghani, Eric Goodwin, Catherine Graham, Neil Horning,
Humanity Road, Anahi Ayala Iacucci, Rob Mitchum,
Emmanuel Kala, David Kobia, Heather Leson, Rob Munro,
Chris Thompson, Syria Tracker, Juan-Pablo Velez.
Project Contents [August 20]
1) Detect language of report text
2) Identify private information in report text
3) Identify locations in report text
4) Identify URLs in report text
5) Suggest categories of report
6) Detect (near-)duplicate reports
Ushahidi Process
DSSG helps here
Report Review w/o Ushine
Report Review with Ushine
(Exact user interface still
under development)
Scope
● Ushine DOES:
○ Improve the human review process of reports
● Ushine DOESN’T:
○ Verify reports
○ “Really” understand the report
○ Achieve 100% accuracy in anything
1) Detect report language
Useful for:
● In multi-lingual situation, automatically route reports to
speakers of that language
● Flag reports that need / don’t need translations
○ (if deployment specifies certain set of acceptable
languages)
Caveats:
● Not 100% accurate
● Performs less well on “imperfect” writing
○ e.g. SMS-speak, mixed languages
1) Detect report language
Technical details:
● Tested 4 plug-in language detectors on 850
reports, for agreement with human language
identification:
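As a minimal sketch of how agreement with human language labels could be measured: the detector names and labels below are made-up placeholders, not the project's detectors or results, and `agreement_rate` is a hypothetical helper.

```python
def agreement_rate(predicted, human):
    """Fraction of reports where the detector's language matches the human label."""
    if len(predicted) != len(human):
        raise ValueError("label lists must have the same length")
    if not human:
        return 0.0
    matches = sum(p == h for p, h in zip(predicted, human))
    return matches / len(human)

# Hypothetical toy labels (ISO 639-1 codes), NOT the project's data.
human_labels = ["en", "en", "sw", "fr", "en"]
detector_outputs = {
    "detector_a": ["en", "en", "sw", "en", "en"],
    "detector_b": ["en", "sw", "sw", "fr", "fr"],
}

# One agreement score per detector, so detectors can be ranked against each other.
rates = {name: agreement_rate(preds, human_labels)
         for name, preds in detector_outputs.items()}
```

Running the same function over each candidate detector gives a single comparable number per detector.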
2) Identify Private Info
Identify people’s names, organizations’ names, locations, e-mail
addresses, URLs, phone/ID numbers, Twitter usernames
Useful for:
● Flagging private info in report that reviewer might want to remove, to
protect sensitive people/situations
● As an extra check before exporting reports to others.
Technical details:
● Use NLTK’s pre-trained Named Entity Recognizer (NER) to identify
people’s names, organizations’ names, and locations.
● Use regular expressions to identify e-mail addresses, URLs, phone/ID
numbers, and Twitter usernames.
● Better to be overly careful: false negatives are more dangerous than false
positives
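A minimal sketch of the regular-expression step for three of the listed types (e-mail addresses, phone/ID-like numbers, Twitter usernames). The patterns and the `flag_private_info` helper are illustrative assumptions, not the project's actual expressions. They are kept deliberately permissive, in line with preferring false positives over false negatives.

```python
import re

# Illustrative patterns only (assumptions); real deployments need local tuning.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # "@" must not follow a word character, so e-mail domains are not matched.
    "twitter": re.compile(r"(?<![A-Za-z0-9_@])@[A-Za-z0-9_]{1,15}"),
    # Any run of 7+ digits, allowing spaces, dots, dashes and parentheses.
    "number": re.compile(r"\+?\d[\d\s().-]{5,}\d"),
}

def flag_private_info(text):
    """Return (kind, matched_text, start, end) for each hit, sorted by position."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((kind, m.group(), m.start(), m.end()))
    return sorted(hits, key=lambda h: h[2])

report = "Call +254 700 123456 or mail jane.doe@example.org, see @ushahidi"
flags = flag_private_info(report)
```

The character offsets let a reviewer interface highlight the flagged spans without rewording the sentence.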
2) Identify Private Info
Caveats:
● Not 100% accurate.
○ Use to support, not replace, humans. (Though humans are not 100%
accurate by themselves either!)
○ Always be aware of the responsibility to protect sensitive information.
○ Non-sensitive deployments (non-wars/disasters) may still have
sensitive information.
○ (More on data ethics @ end)
● Definition of “private” can be very subjective and nuanced.
● Does not re-word sentence; only identifies problematic words for editing.
● Currently only useful for English text (though extendable to other
languages given a suitable NER)
3) Identify Locations
Useful for:
● Identifying text within report that may refer to a location
Caveats:
● Imperfect accuracy, especially on imperfect English
● Currently only useful for English text (though extendable to other
languages given a suitable NER)
● Does not geo-locate location for mapping, just makes it easier to figure out
what text to then geo-locate.
Technical details:
● Use NLTK’s pre-trained Named Entity Recognizer (NER)
4) Identify URLs (links)
Useful for:
● Identifying text within report that refers to a URL (photo/video/article/etc.)
Technical details:
● Use regular expressions
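A minimal sketch of URL extraction with a regular expression. The pattern and the `extract_urls` helper are assumptions for illustration. The pattern only catches links that start with `http://`, `https://` or `www.`.

```python
import re

# Assumed pattern: scheme or "www." prefix, then anything up to whitespace/quotes.
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)
TRAILING_PUNCT = ".,;:!?)]}"

def extract_urls(text):
    """Return URL-like substrings with trailing sentence punctuation stripped."""
    return [m.group().rstrip(TRAILING_PUNCT) for m in URL_RE.finditer(text)]

urls = extract_urls(
    "Video at http://example.org/clip.mp4, photos on www.example.com/pics."
)
```

Stripping trailing punctuation matters for free-text reports, where a link is often followed directly by a comma or full stop.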
A Detour on Data Sets
● So far none of the tasks have required
“training data” on past Ushahidi deployments
○ (NLTK’s named entity recognizer uses its own
training data, not from Ushahidi)
● Next task, category rankings, DOES require
Ushahidi training data
● Data cleanliness: Often lacking
○ We wrote scripts to automate cleaning
○ Useful for other Ushahidi work too!
Data Sets - Examples
● Additional unusable datasets for various reasons (e.g. overly
formulaic language)
● Many additional CrowdMap datasets (not used by Ushine because of
time constraints)
● Sensitive data was removed before being shared with us
Data Set Differences
● Afghanistan election (peaceful)
● Kenyan election (less peaceful)
5) Category Suggestions
For each category (e.g. “Bribery” or “Violence”),
give 0-100% rating of how likely the report is to belong
Useful for:
● Increasing speed and accuracy of the category assignment process
Caveats:
● Not 100% accurate
● “Cold start” problem
5) Category Suggestions
● Global classifier:
○ Classifier trained on previous deployments (e.g.
previous Indian and Venezuelan election reports) then
used for a new deployment (e.g. new Kenyan
election)
● Local classifier:
○ Train a classifier on-the-fly on reports annotated in a
new deployment. Cold-start problem.
● Adaptive classifier:
○ Retrain global classifier on the current deployment
5) Category Suggestions
● Learning Curve Plot from Mexico election
(Higher F1 score means better performance)
5) Category Suggestions
Technical details:
● Binary classifier for each category.
● Local classifier: Bag-of-words unigram
frequency features (with frequency cut-off = 5)
○ In general, bigrams & TF-IDF normalization did not
help.
● Global classifier for election deployment
○ Trained using 7 election deployments
○ For each category label, cross-deployment validation
was used to select the feature set (unigram, bigram,
TF-IDF) and the C parameter.
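A minimal sketch of the local classifier's feature step: bag-of-words unigram counts with a frequency cut-off, plus one-vs-rest labels for a single category. The cut-off is interpreted here as a minimum total count across all reports, which is an assumption. The tokenizer and helper names are hypothetical, and the classifier itself is not shown.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens (a simple assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(reports, min_count=5):
    """Keep unigrams whose total count across all reports is >= min_count."""
    counts = Counter(tok for r in reports for tok in tokenize(r))
    kept = sorted(tok for tok, c in counts.items() if c >= min_count)
    return {tok: i for i, tok in enumerate(kept)}

def to_features(report, vocab):
    """Unigram frequency vector for one report."""
    vec = [0] * len(vocab)
    for tok in tokenize(report):
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] += 1
    return vec

def binary_labels(report_categories, category):
    """One-vs-rest labels: 1 if the report carries this category, else 0."""
    return [1 if category in cats else 0 for cats in report_categories]

# Toy data (made up); min_count is lowered to 2 so the tiny example keeps words.
reports = [
    "police asked for a bribe",
    "bribe paid at polling station",
    "violence near polling station",
]
vocab = build_vocabulary(reports, min_count=2)
X = [to_features(r, vocab) for r in reports]
y = binary_labels([{"Bribery"}, {"Bribery"}, {"Violence"}], "Bribery")
```

Each category gets its own `y` vector over the same `X`, which matches the one-binary-classifier-per-category setup.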
5) Category Suggestions
Technical details:
● Adaptive Classifier
○ Interpolates between local classifier f and global
classifier g using
(1-α)*g(x) + α*f(x),
where x is a report.
○ α is tuned on-the-fly to maximize F1 score by
grid search.
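A minimal sketch of the interpolation (1-α)*g(x) + α*f(x) with a grid search over α. The scores and labels are made-up toy values, and the decision threshold of 0.5, the grid of 11 values and the helper names are assumptions.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def interpolate(g_score, f_score, alpha):
    """(1 - alpha) * g(x) + alpha * f(x): alpha=0 is global, alpha=1 is local."""
    return (1 - alpha) * g_score + alpha * f_score

def tune_alpha(g_scores, f_scores, y_true, steps=10, threshold=0.5):
    """Grid search over alpha in [0, 1]; keeps the first alpha with the best F1."""
    best_alpha, best_f1 = 0.0, -1.0
    for i in range(steps + 1):
        alpha = i / steps
        preds = [1 if interpolate(g, f, alpha) >= threshold else 0
                 for g, f in zip(g_scores, f_scores)]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_alpha, best_f1 = alpha, score
    return best_alpha, best_f1

# Toy scores for four already-annotated reports (made up for illustration).
g_scores = [0.9, 0.2, 0.3, 0.8]   # global classifier g(x)
f_scores = [0.8, 0.1, 0.9, 0.1]   # local classifier f(x)
y_true = [1, 0, 1, 0]
best_alpha, best_f1 = tune_alpha(g_scores, f_scores, y_true)
```

With few annotated reports, a small α leans on the global classifier. As local annotations accumulate, the search can shift weight toward the local one.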
6) Detect (near-) duplicates
Has the report already been submitted, or retweeted?
Useful for:
● Identifying (near-)duplicate reports to prevent
copies and redundant work
Caveats:
● Not 100% accurate
● Not looking at “similar/related content”, but rather (near-)duplicates
Technical details:
● SimHash efficiently hashes each report text to a 64-bit representation.
● (Near-)duplicates have short distances
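A minimal sketch of a 64-bit SimHash with Hamming distance as the "short distance" measure. The tokenization, the use of MD5 as the per-token hash and equal token weights are assumptions, not necessarily the project's choices.

```python
import hashlib
import re

BITS = 64

def simhash(text):
    """64-bit SimHash: each token's hash votes +1/-1 on every bit position."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the token hash
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")

a = simhash("Polling station closed early in the northern district")
b = simhash("polling station CLOSED early in the northern district")
distance = hamming_distance(a, b)
```

A report would then be flagged as a (near-)duplicate when its distance to an earlier fingerprint falls below a chosen threshold. That threshold is deployment-specific and is not given in the slides.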
Evaluation
Currently analyzing the results of an evaluation experiment
that simulates an election crisis.
Assess the impact on users’ speed and accuracy of
● identifying private info, location, URLs
● choosing categories
3 comparison groups:
1) “Regular” process w/o computer suggestions
2) Our computer’s suggestions
3) “Perfect” suggestions
Ushahidi Plugin integration
● Configurable URL for the Ushine web
service
● Extract location names and other entities
from report text. These are displayed as
report metadata
● Detect and display the report language
● Suggest reports that are similar to the
current one
Data Ethics
This isn’t today’s focus, but very important as part of an on-going
Ushahidi discussion:
1) Private information tool especially should be used wisely -- not 100%
accurate and does not replace, but rather supports, thoughtful human
decision-making.
2) To improve category classification, need access to training data.
How to store data? Who has access?
Carelessness about sensitive data
can have real and bad consequences!
Non-sensitive deployments (non-wars/disasters)
may still have sensitive information.
Automated vs. Suggestions
● In theory, everything could be automated
○ Ex: Automatically select top-ranked categories
instead of giving humans the rankings
● Ushahidi reports need high quality data, so
we recommend using our package’s output
as suggestions to guide human decisions
● Especially important for sensitive tasks like
private information detection!
Future Ideas
1. Urgency assessment
2. Filter irrelevant reports (not strictly spam)
3. Automatically propose new [sub-]categories
4. Cluster similar (non-identical) reports
5. Hierarchical topic modelling / visualization
6. …?
How to Follow Up...
● GitHub repository (project code + wiki documentation):
http://github.com/dssg/ushine-learning
Collaborators welcome! (Both within and outside of Ushahidi.)
● DSSG team e-mail: dssg-ushahidi@googlegroups.com
● Main Ushahidi contacts: Emmanuel Kala + Heather Leson
● Data Science for Social Good fellowship: http://dssg.io

Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
RAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIRAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AI
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
20200723_insight_release_plan
20200723_insight_release_plan20200723_insight_release_plan
20200723_insight_release_plan
 
Spring24-Release Overview - Wellingtion User Group-1.pdf
Spring24-Release Overview - Wellingtion User Group-1.pdfSpring24-Release Overview - Wellingtion User Group-1.pdf
Spring24-Release Overview - Wellingtion User Group-1.pdf
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 

Data Science for Social Good and Ushahidi - Final Presentation

  • 1. Ushine Plug-In Using machine learning and natural language processing to improve the human review process of crisis reports
  • 2. Topics ● Intro to project ● Project contents ● Data sets ● Evaluation ● Data ethics ● Future work
  • 3. How to Follow Up... ● GitHub repository (open-source project code + wiki documentation): http://github.com/dssg/ushine-learning Collaborators welcome! (Both within and outside of Ushahidi.) ● DSSG team e-mail: dssg-ushahidi@googlegroups.com ● Main Ushahidi contacts: Emmanuel Kala + Heather Leson ● Data Science for Social Good fellowship: http://dssg.io
  • 4. Thanks! Thanks to our partners at Ushahidi and the many individuals and organizations who generously gave us their advice and feedback... Alphabetically: Chris Albon, Rob Baker, George Chamales, Jennifer Chan, Crisis Mappers, Schuyler Erle, Sara-Jayne Farmer, Rayid Ghani, Eric Goodwin, Catherine Graham, Neil Horning, Humanity Road, Anahi Ayala Iacucci, Rob Mitchum, Emmanuel Kala, David Kobia, Heather Leson, Rob Munro, Chris Thompson, Syria Tracker, Juan-Pablo Velez.
  • 5. Project Contents [August 20] 1) Detect language of report text 2) Identify private information in report text 3) Identify locations in report text 4) Identify URLs in report text 5) Suggest categories of report 6) Detect (near-)duplicate reports
  • 8. Report Review with Ushine (Exact user interface still under development)
  • 9. Scope ● Ushine DOES: ○ Improve the human review process of reports ● Ushine DOESN’T: ○ Verify reports ○ “Really” understand the report ○ Achieve 100% accuracy in anything
  • 10. 1) Detect report language Useful for: ● In multi-lingual situations, automatically route reports to speakers of that language ● Flag reports that need / don’t need translations ○ (if deployment specifies certain set of acceptable languages) Caveats: ● Not 100% accurate ● Performs less well on “imperfect” writing ○ e.g. SMS-speak, mixed languages
  • 11. 1) Detect report language Technical details: ● Tested 4 plug-in language detectors on 850 reports, for agreement with human language identification:
  • 12. 2) Identify Private Info Identify people’s names, organizations’ names, locations, e-mail addresses, URLs, phone/ID numbers, Twitter usernames Useful for: ● Flagging private info in report that reviewer might want to remove, to protect sensitive people/situations ● As an extra check before exporting reports to others. Technical details: ● Use NLTK’s pre-trained Named Entity Recognizer (NER) to identify people’s names, organizations’ names, and locations. ● Use regular expressions to identify e-mail addresses, URLs, phone/ID numbers, and Twitter usernames. ● Better to be overly careful: false negatives are more dangerous than false positives
  • 13. 2) Identify Private Info Caveats: ● Not 100% accurate. ○ Use to support, not replace, humans. (Though humans are not 100% accurate by themselves either!) ○ Always be aware of the responsibility to protect sensitive information. ○ Non-sensitive deployments (non-wars/disasters) may still have sensitive information. ○ (More on data ethics @ end) ● Definition of “private” can be very subjective and nuanced. ● Does not re-word sentence; only identifies problematic words for editing. ● Currently only useful for English text (though extendable to other languages given a suitable NER)
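
The regular-expression step for private info (e-mail addresses, URLs, phone/ID numbers, Twitter usernames) might look roughly like the sketch below. The patterns and the name `find_private_spans` are illustrative assumptions, not the expressions used in the Ushine code. They are deliberately loose, in line with the slide's point that false negatives are more dangerous than false positives.

```python
import re

# Hypothetical patterns for illustration; the project's real expressions may differ.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://[^\s<>\"']+"),
    # "@" must not follow a word character, so e-mail addresses are not matched.
    "twitter": re.compile(r"(?<![\w@])@\w{1,15}\b"),
    # Loose on purpose: any run of 7+ digits, optionally separated by spaces or punctuation.
    "phone_or_id": re.compile(r"\+?\d[\d\s().-]{5,}\d"),
}


def find_private_spans(text):
    """Return (kind, start, end, matched_text) tuples, sorted by position."""
    spans = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((kind, match.start(), match.end(), match.group()))
    return sorted(spans, key=lambda span: span[1])
```

The function only reports spans; as the caveats say, rewording or removal is left to the human reviewer.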
  • 14. 3) Identify Locations Useful for: ● Identifying text within report that may refer to a location Caveats: ● Imperfect accuracy, especially on imperfect English ● Currently only useful for English text (though extendable to other languages given a suitable NER) ● Does not geo-locate location for mapping, just makes it easier to figure out what text to then geo-locate. Technical details: ● Use NLTK’s pre-trained Named Entity Recognizer (NER)
  • 15. 4) Identify URLs (links) Useful for: ● Identifying text within report that refers to a URL (photo/video/article/etc.) Technical details: ● Use regular expressions
  • 16. A Detour on Data Sets ● So far none of the tasks have required “training data” on past Ushahidi deployments ○ (NLTK’s named entity recognizer uses its own training data, not from Ushahidi) ● Next task, category rankings, DOES require Ushahidi training data ● Data cleanliness: Often lacking ○ We wrote scripts to automate cleaning ○ Useful for other Ushahidi work too!
  • 17. Data Sets - Examples Additional unusable datasets for various reasons (e.g. overly formulaic language) Many additional CrowdMap datasets (not used by Ushine because of time constraints) Sensitive data was removed before being shared with us
  • 19. 5) Category Suggestions For each category (e.g. “Bribery” or “Violence”), give 0-100% rating of how likely the report is to belong Useful for: ● Increasing speed and accuracy of the category assignment process Caveats: ● Not 100% accurate ● “Cold start” problem
  • 20. 5) Category Suggestions ● Global classifier: ○ Classifier trained on previous deployments (e.g. previous Indian and Venezuelan election reports) then used for a new deployment (e.g. new Kenyan election) ● Local classifier: ○ Train a classifier on-the-fly on reports annotated in a new deployment. Cold-start problem. ● Adaptive classifier: ○ Retrain global classifier on the current deployment
  • 21. 5) Category Suggestions ● Learning Curve Plot from Mexico election (Higher F1 score means better performance)
  • 22. 5) Category Suggestions Technical details: ● Binary classifier for each category. ● Local classifier: Bag-of-words unigram frequency features (with frequency cut-off = 5) ○ In general, bigrams & TF-IDF normalization did not help. ● Global classifier for election deployment ○ Trained using 7 election deployments ○ For each category label, cross-deployment validation was used to select feature sets (unigram, tfidf, bigram, and C parameter).
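
As a rough illustration of the local classifier's features, the sketch below builds unigram count vectors with a frequency cut-off and derives one binary label list per category. It makes assumptions the slides do not settle: the cut-off is applied to total corpus counts and means "at least 5", the tokenizer is a simple regular expression, and all function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into simple word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(reports, min_count=5):
    """Keep unigrams seen at least min_count times across all reports."""
    totals = Counter()
    for text in reports:
        totals.update(tokenize(text))
    kept = sorted(tok for tok, n in totals.items() if n >= min_count)
    return {tok: index for index, tok in enumerate(kept)}


def vectorize(text, vocabulary):
    """Bag-of-words vector of raw unigram counts (no TF-IDF normalization)."""
    vector = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(text)).items():
        index = vocabulary.get(tok)
        if index is not None:
            vector[index] = n
    return vector


def one_vs_rest_labels(report_categories, category):
    """Binary targets for one category: 1 if the report carries it, else 0."""
    return [1 if category in cats else 0 for cats in report_categories]
```

Each category would then get its own binary classifier trained on these vectors and labels. The choice of learning algorithm is left open here.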
  • 23. 5) Category Suggestions Technical details: ● Adaptive Classifier ○ Interpolates between local classifier f and global classifier g using (1-α)*g(x) + α*f(x), where x is a report. ○ α is tuned on-the-fly to maximize F1 score by grid search.
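
The interpolation (1-α)*g(x) + α*f(x) and the grid search over α can be sketched as follows. This is a minimal illustration under stated assumptions: classifier outputs are scores in [0, 1], a report is assigned the category when the blended score reaches a threshold of 0.5, α is searched on an evenly spaced grid, and F1 is measured on reports already annotated in the current deployment. None of these details are given in the slides, and the function names are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def interpolate(g_score, f_score, alpha):
    """Blend global score g(x) and local score f(x): (1 - alpha)*g + alpha*f."""
    return (1 - alpha) * g_score + alpha * f_score


def tune_alpha(g_scores, f_scores, y_true, steps=11, threshold=0.5):
    """Grid-search alpha in [0, 1]; return the first alpha with the best F1."""
    best_alpha, best_f1 = 0.0, -1.0
    for i in range(steps):
        alpha = i / (steps - 1)
        preds = [
            1 if interpolate(g, f, alpha) >= threshold else 0
            for g, f in zip(g_scores, f_scores)
        ]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_alpha, best_f1 = alpha, score
    return best_alpha, best_f1
```

With α = 0 the blend is the global classifier alone, which sidesteps the cold-start problem. As annotations accumulate and the local classifier improves, the search can move α toward 1.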
  • 24. 6) Detect (near-) duplicates Has the report already been submitted, or retweeted? Useful for: ● Identifying (near-)duplicate reports to prevent copies and redundant work Caveats: ● Not 100% accurate ● Not looking at “similar/related content”, but rather (near-)duplicates Technical details: ● SimHash efficiently hashes each report text to a 64-bit representation. ● (Near-)duplicates have short distances
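
A minimal SimHash sketch shows how report text can be reduced to a 64-bit fingerprint and compared by Hamming distance. The specifics are assumptions rather than the project's implementation: tokens are unweighted words, each token is hashed with the first 8 bytes of MD5, and any distance cut-off for calling two reports near-duplicates would need tuning on real data.

```python
import hashlib
import re


def simhash(text):
    """64-bit SimHash over lowercase word tokens, all with weight 1."""
    weights = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # 64-bit token hash
        for bit in range(64):
            # Each token votes +1 or -1 on every bit position.
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Because every token votes on each bit, reports that share most of their words (for example a retweet with a short prefix) tend to produce fingerprints that differ in only a few bits, while unrelated reports tend to differ in many.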
  • 25. Evaluation Currently analyzing the results of an evaluation experiment that simulates an election crisis. Assess the impact on users’ speed and accuracy of ● identifying private info, location, URLs ● choosing categories 3 comparison groups: 1) “Regular” process w/o computer suggestions 2) Our computer’s suggestions 3) “Perfect” suggestions
  • 27. Ushahidi Plugin integration ● Configurable URL for the Ushine web service ● Extract location names and other entities from report text. These are displayed as report metadata ● Detect and display the report language ● Suggest reports that are similar to the current one
  • 28. Data Ethics This isn’t today’s focus, but very important as part of an on-going Ushahidi discussion: 1) Private information tool especially should be used wisely -- not 100% accurate and does not replace, but rather supports, thoughtful human decision-making. 2) To improve category classification, need access to training data. How to store data? Who has access? Carelessness about sensitive data can have real and bad consequences! Non-sensitive deployments (non-wars/disasters) may still have sensitive information.
  • 29. Automated vs. Suggestions ● In theory, everything could be automated ○ Ex: Automatically select top-ranked categories instead of giving humans the rankings ● Ushahidi reports need high quality data, so we recommend using our package’s output as suggestions to guide human decisions ● Especially important for sensitive tasks like private information detection!
  • 30. Future Ideas 1. Urgency assessment 2. Filter irrelevant reports (not strictly spam) 3. Automatically propose new [sub-]categories 4. Cluster similar (non-identical) reports 5. Hierarchical topic modelling / visualization 6. …?
  • 31. How to Follow Up... ● GitHub repository (project code + wiki documentation): http://github.com/dssg/ushine-learning Collaborators welcome! (Both within and outside of Ushahidi.) ● DSSG team e-mail: dssg-ushahidi@googlegroups.com ● Main Ushahidi contacts: Emmanuel Kala + Heather Leson ● Data Science for Social Good fellowship: http://dssg.io