The Eric and Wendy Schmidt Data Science for Social Good - Summer Fellowship 2013
Preliminary Update July 2013
About the DSSG Rock stars:
http://dssg.io/
https://twitter.com/datascifellows/
Their project:
http://dssg.io/2013/07/15/ushahidi-machine-learning-for-human-rights.html
More @ ushahidi.com / wiki.ushahidi.com / blog.ushahidi.com
Data Science for Social Good and Ushahidi
1. Project Update - July 11, 2013
The Eric & Wendy Schmidt Data Science for Social Good Summer Fellowship 2013
www.dssg.io | dssg-ushahidi@googlegroups.com
4. Data Sets
23,000 reports from 20 datasets
• 22% English
• 35% non-English
• 43% mixed languages
Each report includes text, category, location,
sometimes more data
7. Current Task Status [July 11]
1) Suggest categories
2) Extract named entities (especially locations)
3) Detect language
End of presentation has more extensive technical details
8. Toy Demo
http://ec2-54-218-196-140.us-west-2.compute.amazonaws.com/home
Note this is ONLY a basic "toy" user interface to demonstrate the current prototype functionality.
Our plan is to deliver an open-source code library,
which Ushahidi will incorporate into the existing user interface.
If link doesn't work -- just look at the screenshots in the next slides. :)
11. Secondary Project Ideas
1. Detect private info to strip
2. Urgency assessment
3. Filtering irrelevant reports (not strictly spam)
4. Automatically proposing new [sub-]categories
5. Cluster similar (non-identical) reports
6. Hierarchical topic modelling / visualization
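For idea 1 (detecting private info to strip), a minimal Python sketch might start with pattern matching for the easy cases. The regular expressions and the placeholder tags below are assumptions for illustration, not part of the prototype. Names and street addresses would need more than regexes (the named-entity step is one possible source).

```python
import re

# Assumed patterns: deliberately loose, so they will also catch some
# non-private digit runs (for example dates written as 2013-07-11).
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{6,}\d(?!\w)")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_private_info(text):
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


redacted = redact_private_info(
    "Call +254 712 345 678 or mail jane@example.org about the road."
)
print(redacted)
```

In a real workflow the matches would more likely be flagged for the annotator to confirm than stripped automatically.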
12. Evaluation Plans
• Tap into Ushahidi and crisis mapping
communities for feedback
• Simulate past event with our system
• Success metrics:
o Increased annotator speed
o Increased annotator categorization accuracy
o Decreased annotator frustration/tedium
13. Feedback welcome!
Contact us at dssg-ushahidi@googlegroups.com
We would love your input!
See next 4 slides for technical details on our 4 tasks...
or skip if you're happy to stay unaware... :)
14. 1) Suggest categories
Currently:
• Simple bag-of-words unigram features
• 1-vs.-all classification (scikit-learn)
• Merge little categories into fewer big categories
• Performance uninspiring :(
Future:
Bigrams... word frequency filter...
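The setup on this slide can be sketched roughly as follows with scikit-learn. This is a toy illustration, not the project's code: the example reports, the category names, the choice of logistic regression as the base classifier, and the helper names are all assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Made-up training data; real reports would come from the Ushahidi datasets.
reports = [
    "shots fired near the market",
    "armed men attacked the village",
    "polling station opened late",
    "ballot boxes missing at polling station",
    "bridge collapsed road blocked",
    "no water or electricity in the camp",
]
labels = ["security", "security", "election", "election",
          "infrastructure", "infrastructure"]


def build_category_suggester(ngram_range=(1, 1), min_df=1):
    """Bag-of-words counts fed into one-vs-all binary classifiers."""
    return make_pipeline(
        CountVectorizer(ngram_range=ngram_range, min_df=min_df),
        OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    )


def suggest_categories(model, text, top_k=2):
    """Return the top_k categories ranked by predicted probability."""
    probabilities = model.predict_proba([text])[0]
    ranked = sorted(zip(model.classes_, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return [category for category, _ in ranked[:top_k]]


model = build_category_suggester().fit(reports, labels)
suggestions = suggest_categories(model, "polling station closed early")
print(suggestions)
```

Passing ngram_range=(1, 2) adds bigram features, and raising min_df drops rare words, which roughly corresponds to the two "Future" ideas. The sketch assigns one category per report; reports with several categories would need a multi-label setup.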
15. 2) Extract named entities
Currently:
• NLTK's Named Entity Recognizer
• Eval: pretty good
Future:
• Train location-recognizer on datasets
• Merge types for non-location NEs
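A rough sketch of how NLTK's recognizer output could be reduced to "location" versus "everything else" is below. The label grouping (GPE and LOCATION count as locations, all other types are merged into OTHER) and the function names are assumptions. extract_entities needs NLTK's tokenizer, tagger, and chunker data packages to be downloaded first (the exact package names depend on the NLTK version), so the example runs only the tree-handling part on a hand-built tree.

```python
import nltk
from nltk import Tree

# Assumption: these chunker labels are treated as locations.
LOCATION_LABELS = {"GPE", "LOCATION"}


def entities_from_tree(tree):
    """Collect (phrase, kind) pairs from a chunked sentence tree.

    Non-location entity types are merged into a single OTHER kind.
    """
    entities = []
    for subtree in tree.subtrees(lambda t: t.label() != "S"):
        phrase = " ".join(word for word, _tag in subtree.leaves())
        kind = "LOCATION" if subtree.label() in LOCATION_LABELS else "OTHER"
        entities.append((phrase, kind))
    return entities


def extract_entities(text):
    """Tokenize, POS-tag, and chunk a report (requires NLTK data packages)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return entities_from_tree(nltk.ne_chunk(tagged))


# Hand-built tree in the shape ne_chunk returns, so no model download is needed.
example_tree = Tree("S", [
    ("Flooding", "NN"), ("near", "IN"),
    Tree("GPE", [("Port-au-Prince", "NNP")]),
    ("reported", "VBN"), ("by", "IN"),
    Tree("PERSON", [("Jean", "NNP"), ("Pierre", "NNP")]),
])
entities = entities_from_tree(example_tree)
print(entities)
```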
16. 3) Detect Language
Currently:
• Existing packages (Bing, Python, ...)
Future:
• Evaluate quality
• Allow event-specific language bias
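One way to read "event-specific language bias" is as a prior over languages that reweights whatever scores a detector returns. The sketch below assumes the detector gives a score per language code; the scores, the prior, and the function name are made up for illustration, and no real detection package is called.

```python
def rerank_with_event_prior(scores, prior, floor=0.01):
    """Multiply detector scores by an event-specific prior and renormalize.

    Languages missing from the prior get a small floor weight so they are
    penalized but not ruled out.
    """
    weighted = {lang: score * prior.get(lang, floor)
                for lang, score in scores.items()}
    total = sum(weighted.values())
    if total == 0:
        return dict(scores)
    return {lang: weight / total for lang, weight in weighted.items()}


# Hypothetical detector output for a short, ambiguous report.
detector_scores = {"en": 0.45, "sw": 0.40, "fr": 0.15}
# Hypothetical prior for an event where Swahili reports are common.
event_prior = {"sw": 0.60, "en": 0.35}

reranked = rerank_with_event_prior(detector_scores, event_prior)
best_language = max(reranked, key=reranked.get)
print(best_language, reranked)
```

With these numbers the detector alone slightly prefers English, while the reweighted scores prefer Swahili.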
We're happy to give an update on our Ushahidi project. [Abe Gong]
Citizens submit reports (via SMS, Twitter, and the web), which are reviewed by annotators. It's a slow manual process: categorizing, geolocating, stripping private info, etc.
We're building a data wizardry system to support the manual annotation process.
Notes on the secondary project ideas:
1. Since Ushahidi reports are mostly public, private info should be hidden. Examples: names, phone numbers, and addresses.
4. Example: in the Haiti earthquake, we might observe unexpected robbery reports arising.
5. This is mainly for a better workflow, because annotators can work better when they process similar reports together.
6. To see which topics commonly occur in elections in general, and which topics occur only in the Kenyan election specifically.
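For idea 5 (clustering similar, non-identical reports so annotators can handle them together), a simple baseline could group reports by word overlap. This is only a sketch: Jaccard similarity, the greedy grouping, the 0.5 threshold, and the example reports are all assumptions, not the project's method.

```python
import re


def tokens(text):
    """Lowercased set of word tokens in a report."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_similar(reports, threshold=0.5):
    """Greedily assign each report to the first group whose first member
    is at least `threshold` similar; otherwise start a new group.

    Returns a list of groups, each a list of report indices.
    """
    groups = []  # pairs of (token set of first member, member indices)
    for index, report in enumerate(reports):
        report_tokens = tokens(report)
        for representative, members in groups:
            if jaccard(report_tokens, representative) >= threshold:
                members.append(index)
                break
        else:
            groups.append((report_tokens, [index]))
    return [members for _, members in groups]


example_reports = [
    "Polling station closed early in Kibera",
    "Polling station in Kibera closed early today",
    "Road blocked by fallen tree near Kisumu",
]
groups = group_similar(example_reports)
print(groups)
```

The first two reports share six of seven distinct words, so they land in one group, while the third starts its own.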