EMBERS AutoGSR is a novel, web-based framework that generates a comprehensive database of validated civil unrest events with minimal human effort. AutoGSR has been deployed for the past 6 months and continually processes data 24x7 in an automated fashion. The system extracts civil unrest events of the type "who protested where, when and why?" from news articles published in over 7 languages and collected from 16 countries.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
2. Introduction
• AutoGSR is a part of the EMBERS project
• EMBERS is a fully automated, 24x7, cloud-hosted system that mines massive streams of open source data such as Twitter, Facebook, news, and blogs to generate forecasts for civil unrest events that will happen in the future
• EMBERS has been funded by IARPA's OSI program since 2012
• Forecasts for civil unrest events generated by EMBERS are evaluated against ground truth reported in news articles. This ground truth is generated manually by MITRE using a team of analysts. However, this manual approach to generating ground truth is not scalable.
3. Goal
AutoGSR aims to generate comprehensive ground
truth data
• by extracting events of type:
“Who protested where, when and why”
• from news articles in:
Spanish, Portuguese, English and Arabic
• While minimizing the manual effort required
In the OSI program, the ground truth data, which comprises records of civil unrest events reported in Latin American newspapers, is referred to as the Gold Standard Report (GSR). Since we are automating the process of ground truth generation, we named our system AutoGSR.
4. Sub-Goals
1. Minimize the Manual Effort required to generate
the Ground Truth Civil Unrest Data
2. Generate a “comprehensive” dataset
• For the OSI project, IARPA is generating the GSR with the help of MITRE.
• MITRE's GSR generation process is purely manual, leading to high cost.
• The basic idea behind AutoGSR is to make GSR generation economically feasible.
• Why emphasize the word "comprehensive"?
o Because automated event extractors have poor recall
• Almost all of the civil unrest events need to be identified
• Crucial from the point of view of the OSI evaluations
• This dataset is also used by the EMBERS forecasting models for training
5. Why do Automated Extractors
have poor Recall?
• Because most extraction methods are
based on patterns, e.g. <student w/2 protest>
– While patterns work nicely with semi-structured data
like medical reports, calendar notifications etc., they work
poorly for unstructured data like news, blogs etc.
– Free-flowing text can express the same information in a
wide variety of ways:
• Spread across multiple sentences
• Co-reference resolution
• Negation, etc.
6. Precision/Recall Tradeoff
• Rigid patterns (high precision, low recall)
– <student w/2 protest>
– Matches true events
– Loses out on several other real events (e.g. a labor strike)
– ICEWS
• Loose patterns (low precision, high recall); preferred here, since comprehensiveness demands high recall
– <Noun w/2 protest/alt>
– Identifies almost all real events
– Matches several false events (e.g. a player strike)
– GDELT
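As a rough illustration of these two pattern styles, here is a toy proximity matcher. This is a sketch only: the keyword sets and the stand-in list for `<Noun>` are hypothetical, not the actual ICEWS or GDELT patterns.

```python
import re

def proximity_match(text, left_terms, right_terms, k=2):
    """Rough analogue of the `w/k` proximity operator: True when any
    left term occurs within k tokens of any right term."""
    tokens = re.findall(r"\w+", text.lower())
    left = [i for i, t in enumerate(tokens) if t in left_terms]
    right = [i for i, t in enumerate(tokens) if t in right_terms]
    return any(abs(i - j) <= k for i in left for j in right)

RIGID_LEFT = {"student", "students"}                         # <student w/2 protest>
LOOSE_LEFT = {"students", "workers", "players", "teachers"}  # stand-in for <Noun>
PROTEST_TERMS = {"protest", "protests", "strike", "strikes"}

# Rigid pattern: catches the student protest but misses the real labor strike.
assert proximity_match("students join protest over tuition", RIGID_LEFT, {"protest"})
assert not proximity_match("workers begin strike over wages", RIGID_LEFT, {"protest"})
# Loose pattern: recovers the labor strike, but also fires on a sports story.
assert proximity_match("workers begin strike over wages", LOOSE_LEFT, PROTEST_TERMS)
assert proximity_match("the players strike rattled the league", LOOSE_LEFT, PROTEST_TERMS)
```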
7. What ratio of articles report
true protest events?
[Chart: AutoGSR article counts for April 2016 (10 LA countries). Google Search returned 17,633 articles; 9,868 were processed; 2,976 reported actual protests, i.e. only 16.8% of the searched articles.]
10. Baseline Version
• This baseline version automates the GSR production process:
– Performs keyword-based Google search queries and downloads links
– Extracts the article text from these links and looks for protest keywords
– Loads into the interface only those articles that contain protest keywords
• Also translates articles into English
• Loads the image associated with the article
• Highlights protest keywords
• Identifies city names from the article text and pre-populates the location dropdown for faster encoding
– The interface allows users to encode articles by clicking a few buttons
– The interface also allows users to review and resolve conflicts
• The encoding process still remains manual:
– Does not perform any classification or filtering of articles
– Does not provide any encoding recommendations
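The keyword filtering and highlighting steps of the baseline could look roughly like the following sketch. The keyword list here is hypothetical and abbreviated; the deployed system maintains per-language lists.

```python
import re

# Hypothetical protest keyword list (the real list is larger and per-language).
PROTEST_KEYWORDS = {"protesta", "manifestación", "marcha", "huelga", "protest"}

def has_protest_keyword(article_text):
    """Baseline filter: load an article into the interface only if it
    mentions at least one protest keyword."""
    tokens = set(re.findall(r"\w+", article_text.lower()))
    return bool(tokens & PROTEST_KEYWORDS)

def highlight_keywords(article_text):
    """Wrap matched keywords in **...** markers for interface highlighting."""
    pattern = "|".join(sorted(PROTEST_KEYWORDS, key=len, reverse=True))
    return re.sub(rf"\b({pattern})\b", r"**\1**",
                  article_text, flags=re.IGNORECASE)

text = "Miles de personas marchan en protesta contra la reforma"
assert has_protest_keyword(text)
assert "**protesta**" in highlight_keywords(text)
```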
12. The "Intelligent" Version
• This version introduces several machine learning
models for:
– Discovery and classification of news articles
– Encoding recommendations:
• Recommendations for individual encoding elements
• Recommendations for the whole encoding tuple
• The architecture has a very flexible design:
– It is easy to plug third-party models into the system
• New interface:
– Similar news stories are clustered together in real time
– Shows non-protest articles separately from protest
articles
13. Models Ecosystem

Filtering-Based Models
• Rule-based models that classify incoming news articles as protest or non-protest with 0-or-1 certainty.
• Models:
1. Sub-domain based filtering model
2. URL based filtering model
3. Negative keyword based filtering model
• Approach: All articles are passed through each of these models. If any model classifies an article as non-protest, the article is labeled as a non-protest article in the interface.

Probability-Based Models
• Models that assign a probability score to an incoming article specifying whether the article reports a protest.
• Models:
1. Naïve-Bayes Document Classifier
2. Image based Classifier
3. SEO Meta Tags based Classifier
4. Deep Learning Classifier
• Approach: Each model assigns an individual probability to an incoming article; the article's final probability is calculated using a model-ensemble approach. In the interface, the user can specify a cutoff probability score; articles with a probability greater than the cutoff appear as protest articles.

Recommendations-Based Models
• Models that assume the incoming article is a protest article and recommend complete or partial encoding(s) for it.
• Models:
1. Clustering based model for full-encoding recommendation
2. Geo-location model for location recommendation
3. Key sentence(s) recommendation
4. SEO Meta Tags based recommendations
5. National or statewide protest recommendation
• Approach: These recommendations appear in the interface for each article and are clickable, allowing users to select an encoding with just one click.
14. Models Ecosystem (duplicate slide for quick reference; content identical to slide 13)
15. Sub-Domain Based Filtering
• Many sub-domains are tagged as
non-relevant for protest articles:
– Sports, Entertainment, Editorial etc.
• If an article appears in any of these sub-domains,
it is classified as a non-protest
article
• Filtering-Based Model
16. URL-Based Filtering
• Even within relevant sub-domains, there may be
several URL structures that are irrelevant. For example:
– URLs summarizing the top stories of the day
Ex: http://www.clarin.com/politica/
– URLs summarizing stories by topic
Ex: http://www.clarin.com/tema/manifestaciones.html
– URLs corresponding to search terms
Ex: http://www.clarin.com/buscador?q=protesta
• Filtering-Based Model
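A minimal sketch of such URL rules, using the example URLs above. The block lists here are hypothetical; in practice they would be maintained per news source.

```python
from urllib.parse import urlparse

# Hypothetical block lists, maintained per news source in practice.
BLOCKED_SECTIONS = {"deportes", "espectaculos", "editorial", "sports"}
INDEX_SEGMENTS = {"tema", "buscador"}  # topic and search pages

def passes_url_filters(url):
    """Drop non-story URLs: blocked sections, topic/search pages, and
    bare section indexes like http://www.clarin.com/politica/ ."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    if segments and segments[0].lower() in BLOCKED_SECTIONS:
        return False                       # irrelevant sub-domain/section
    if any(s.lower() in INDEX_SEGMENTS for s in segments):
        return False                       # topic or search-term page
    if "q=" in parsed.query:
        return False                       # search results
    if len(segments) <= 1 and parsed.path.endswith("/"):
        return False                       # section index ("top stories")
    return True

assert not passes_url_filters("http://www.clarin.com/politica/")
assert not passes_url_filters("http://www.clarin.com/tema/manifestaciones.html")
assert not passes_url_filters("http://www.clarin.com/buscador?q=protesta")
assert passes_url_filters("http://www.clarin.com/politica/marcha-docentes.html")
```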
17. Negative Keyword Based
Filtering Model
• For many protest keywords, there exist words
(negative keywords) which, when used together with the
protest keyword, alter its meaning. For example:

Protest Keyword | Negative Keyword Phrase | Meaning
marcha | poner en marcha | to start; to set in motion
protesta | tomar protesta | to swear in (a public official)
protesta | rendir protesta | to swear in (a public official)

• Filtering-Based Model
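The phrase table above translates into a simple containment check. This is a sketch with a hypothetical phrase table; `keyword_is_negated` is an illustrative name, not the system's actual function.

```python
# Hypothetical phrase table mirroring the slide: protest keyword ->
# negative-keyword phrases that flip its meaning.
NEGATIVE_PHRASES = {
    "marcha": ["poner en marcha", "puso en marcha"],
    "protesta": ["tomar protesta", "rendir protesta"],
}

def keyword_is_negated(text, keyword):
    """True when every occurrence of `keyword` sits inside a negative
    phrase, so the protest sense is absent from the article."""
    text = text.lower()
    hits = text.count(keyword)
    negated = sum(text.count(phrase)
                  for phrase in NEGATIVE_PHRASES.get(keyword, []))
    return hits > 0 and negated >= hits

# "puso en marcha" = set in motion (no protest); a plain "marcha" keeps it.
assert keyword_is_negated("El gobierno puso en marcha el programa", "marcha")
assert not keyword_is_negated("Miles participaron en la marcha", "marcha")
```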
18. Models Ecosystem (duplicate slide for quick reference; content identical to slide 13)
19. Naïve-Bayes Document
Classifier
1. For each article in the training set, extract named
entities: people, locations, and organizations
2. For each country, for every mention of people, locations,
organizations, and protest keywords in the training set,
estimate the probability of the article being a protest article
3. For an incoming article, based on the mentions of
people, locations, organizations, and protest keywords in it,
assign a Naïve-Bayes probability of the article being a
protest article
• Probability-Based Model
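The three steps above can be sketched as a small mention-based Naïve Bayes model. Named-entity extraction is assumed to happen upstream, and the training mentions below are hypothetical.

```python
import math
from collections import defaultdict

class MentionNaiveBayes:
    """Minimal Naïve Bayes over extracted mentions (people, locations,
    organizations, protest keywords), with Laplace smoothing."""

    def fit(self, mention_lists, labels):
        self.prior = {c: labels.count(c) / len(labels) for c in set(labels)}
        self.counts = {c: defaultdict(int) for c in self.prior}
        self.totals = {c: 0 for c in self.prior}
        self.vocab = set()
        for mentions, c in zip(mention_lists, labels):
            for m in mentions:
                self.counts[c][m] += 1
                self.totals[c] += 1
                self.vocab.add(m)
        return self

    def prob_protest(self, mentions):
        logp = {}
        for c in self.prior:
            lp = math.log(self.prior[c])
            for m in mentions:
                lp += math.log((self.counts[c][m] + 1) /
                               (self.totals[c] + len(self.vocab)))
            logp[c] = lp
        z = sum(math.exp(v) for v in logp.values())
        return math.exp(logp["protest"]) / z

# Hypothetical training mentions.
train = [(["CNTE", "Oaxaca", "huelga"], "protest"),
         (["maestros", "marcha", "Ciudad de México"], "protest"),
         (["Peña Nieto", "reforma"], "other")]
nb = MentionNaiveBayes().fit([m for m, _ in train], [c for _, c in train])
score = nb.prob_protest(["CNTE", "huelga"])
assert 0.5 < score < 1.0  # familiar protest mentions push the score up
```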
20. Image Based Classifier
• A picture is worth 1,000 words
• An image classification model that learns from the
images in the training set and classifies incoming
images as protest images or not
• Excludes cases where the article image is a standard
image, such as a newspaper logo, or there is no associated
image
• Probability-Based Model
21. SEO Meta Tags based
Classification and Suggestions
• Almost every news site uses SEO meta tags that make it
easy for search engine crawlers to index their content
• These tags provide very succinct information
about the article that can be used to our advantage:
summary, abstract, description, keywords, publish date,
etc.
• These tags are generated specifically for each article to
get a better presence on the web
• Probability-Based and Suggestion-Based Model
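Extracting such tags requires only a small HTML parser; here is a sketch using Python's standard library (the sample page is hypothetical).

```python
from html.parser import HTMLParser

class MetaTagExtractor(HTMLParser):
    """Pull SEO <meta> tags (description, keywords, og:title, ...) from an
    article page; their succinct summaries can feed both classification
    and encoding suggestions."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("name") or attrs.get("property")
        if key and "content" in attrs:
            self.meta[key.lower()] = attrs["content"]

# Hypothetical article page.
html = ('<html><head>'
        '<meta name="description" content="Docentes marchan en Oaxaca">'
        '<meta name="keywords" content="protesta,CNTE,Oaxaca">'
        '</head></html>')
extractor = MetaTagExtractor()
extractor.feed(html)
assert extractor.meta["keywords"] == "protesta,CNTE,Oaxaca"
```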
22. SEO Meta Tags based
Classification and Suggestions
23. Deep Learning Classifier
• Uses neural network based deep learning
techniques such as word2vec and doc2vec to
classify incoming articles into protest and
non-protest
• Probability-Based Model
24. Model Ensemble
• The goal of the model ensemble is to combine the probabilities from each of
the probability-based models into one final probability score for the
article
• Takes into account how good each of the models has been in the
past
• Also handles cases where one or more of the models cannot
generate a probability score (e.g. when no image is
present)
• The interface shows only one single combined probability for each
article. The interface allows the user to specify a cutoff probability
score; any article with a combined probability score greater than the
cutoff is shown as a protest article in the interface
• Part of the Probability-Based Models
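One plausible reading of this ensemble is a weighted average whose weights reflect each model's past performance; the deployed combination rule may differ. A sketch with hypothetical scores and weights:

```python
def ensemble_probability(model_scores, model_weights):
    """Combine per-model probabilities via a weighted average. Models that
    could not score the article (e.g. the image classifier when no image
    is present) return None and are skipped."""
    num = den = 0.0
    for name, p in model_scores.items():
        if p is None:
            continue
        w = model_weights[name]
        num += w * p
        den += w
    return num / den if den else None

# Hypothetical per-model scores and past-performance weights.
scores = {"naive_bayes": 0.9, "image": None, "seo_meta": 0.7, "deep": 0.8}
weights = {"naive_bayes": 0.3, "image": 0.2, "seo_meta": 0.2, "deep": 0.3}
combined = ensemble_probability(scores, weights)  # 0.8125
cutoff = 0.5  # user-specified in the interface
assert combined > cutoff  # shown as a protest article
```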
25. Models Ecosystem (duplicate slide for quick reference; content identical to slide 13)
26. Clustering-Based Full Encoding
Recommendation
• Articles referring to the same topic are clustered together in real time
in the interface
– Uses a third-party search-results clustering algorithm named Lingo3G
• If any article in the cluster has already been encoded, the
system starts recommending the same encoding for the other articles in
the cluster
• If multiple articles in the same cluster have different encodings,
recommendations are based on the most-used encoding tuple
• Recommendations are clickable and allow a user to encode the
article with just one click
• Recommendation-Based Model
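The most-used-tuple recommendation within a cluster can be sketched as follows; the article records and encoding tuples here are hypothetical.

```python
from collections import Counter

def recommend_encoding(cluster_articles):
    """Recommend the most frequently used full encoding tuple among the
    already-encoded articles in a cluster (None if none are encoded)."""
    encodings = [a["encoding"] for a in cluster_articles if a.get("encoding")]
    if not encodings:
        return None
    return Counter(encodings).most_common(1)[0][0]

# Hypothetical cluster: (who, where, when, why) encoding tuples.
cluster = [
    {"url": "a1", "encoding": ("teachers", "Oaxaca", "2016-04-02", "education")},
    {"url": "a2", "encoding": ("teachers", "Oaxaca", "2016-04-02", "education")},
    {"url": "a3", "encoding": ("teachers", "Oaxaca", "2016-04-02", "wages")},
    {"url": "a4"},  # unencoded: receives the majority recommendation
]
assert recommend_encoding(cluster) == ("teachers", "Oaxaca", "2016-04-02", "education")
```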
27. Geo-Location Model
• This model works on location named entities extracted
from the article text and an extended version of the World
Gazetteer to recommend the location the article is
talking about
• Also handles cases where the article reports landmarks
instead of city names
• Recommendation-Based Model
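A minimal sketch of the gazetteer lookup, with a tiny hypothetical gazetteer standing in for the extended World Gazetteer:

```python
# Tiny stand-in for the extended World Gazetteer used by the system;
# landmark entries map to their host city.
GAZETTEER = {
    "zócalo": ("Ciudad de México", "Distrito Federal", "Mexico"),  # landmark
    "ciudad de méxico": ("Ciudad de México", "Distrito Federal", "Mexico"),
    "oaxaca": ("Oaxaca", "Oaxaca", "Mexico"),
}

def recommend_location(location_mentions):
    """Map location named entities (city names or landmarks) to a
    (city, state, country) recommendation; first match wins in this sketch."""
    for mention in location_mentions:
        hit = GAZETTEER.get(mention.lower())
        if hit:
            return hit
    return None

assert recommend_location(["Zócalo"]) == (
    "Ciudad de México", "Distrito Federal", "Mexico")
```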
28. Key Sentence(s) Suggestion
• This is a neural network based model that identifies key sentences
in the article:
– Sentences reporting the protest
– Sentences reporting reasons for the protest, or the participating population
– Sentences providing contextual information
• In the interface, the user can toggle the reading view to show:
– Just the highlighted sentences of the article
– The full article
• Recommendation-Based Model
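The actual model is neural; purely to illustrate the kind of output the interface consumes, here is a keyword-based stand-in with hypothetical cue-word lists:

```python
import re

# Keyword heuristic standing in for the neural key-sentence model;
# cue-word lists are hypothetical.
CUE_WORDS = {
    "protest": {"protesta", "marcha", "huelga", "manifestación"},
    "reason": {"contra", "exigen", "demandan", "reclaman"},
}

def highlight_sentences(article_text):
    """Tag each sentence with the kinds of key information it appears to
    carry; untagged sentences are treated as context."""
    tagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", article_text.strip()):
        words = set(re.findall(r"\w+", sentence.lower()))
        tags = [kind for kind, cues in CUE_WORDS.items() if words & cues]
        tagged.append((sentence, tags))
    return tagged

result = highlight_sentences(
    "Miles asistieron a la marcha. Exigen mejores salarios. El clima fue soleado.")
# First sentence reports the protest, second the reason, third is context.
```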
29. National / Statewide Protest
Suggestion
• A simple keyword-based model that looks for variants of
the words "national" or "statewide" in the article text and
recommends that the protest may be a
nationwide protest
• Used mainly as a cautionary model to alert users that the
article might need to be encoded as a nationwide/statewide
protest article instead of a city-level protest
article
• Recommendation-Based Model
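A sketch of this cautionary check; the cue list is hypothetical and abbreviated compared to the deployed variants.

```python
import re

# Hypothetical, abbreviated cue list for "national"/"statewide" variants.
NATIONWIDE_CUES = re.compile(r"\b(nacional|nationwide|estatal|statewide)\b")

def maybe_nationwide(article_text):
    """Cautionary flag: the protest may need to be encoded as nationwide
    or statewide instead of city-level."""
    return NATIONWIDE_CUES.search(article_text.lower()) is not None

assert maybe_nationwide("Sindicatos convocan a un paro nacional el viernes")
assert not maybe_nationwide("Vecinos protestan frente al ayuntamiento")
```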
30. Adding a New Model
• The system has a very flexible architecture that allows
new models to be added as long as they fall into one of the
three categories: filtering, probability, or recommendation
based
• The system treats models as black boxes and uses a
standard interface for calling them:
– Based on the model type, the system expects a standard
response
– For example, it is very easy to integrate BBN SERIF into the new
version: SERIF receives an article through an API and
returns the extracted event (full or partial), which is then
automatically shown as a suggestion in the interface
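The black-box contract could be sketched as an abstract base class; all names here are illustrative, not the system's actual API.

```python
from abc import ABC, abstractmethod

class AutoGSRModel(ABC):
    """Black-box contract: any third-party model (e.g. BBN SERIF) can be
    plugged in as long as it declares one of the three types and returns
    the standard response for that type."""
    model_type = None  # "filtering" | "probability" | "recommendation"

    @abstractmethod
    def process(self, article):
        """filtering -> bool, probability -> float in [0, 1],
        recommendation -> full or partial encoding dict (or None)."""

class KeywordFilter(AutoGSRModel):
    """Toy filtering-based model used to exercise the contract."""
    model_type = "filtering"

    def process(self, article):
        return "protesta" in article["text"].lower()

registry = [KeywordFilter()]
article = {"text": "Una protesta frente al congreso"}
assert all(m.process(article) for m in registry if m.model_type == "filtering")
```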
32. New "Intelligent" Interface
• New intelligent interface:
– User-defined criteria for classifying protest / non-protest articles
– Similar articles appear in clusters, thereby reducing redundancy
– Shows full-event encoding suggestions (event extraction) for the article.
There are two ways to show these full-event suggestions:
• Clustering-based suggestions: assuming that articles in the cluster are similar,
encodings from the encoded articles are used to make suggestions for the
unencoded articles
• Ensembled recommendation suggestions: full encoding tuple suggestions are
generated from the partial suggestions made by the recommendation-based
models
– Individual suggestions are shown in the encoding form itself. These
suggestions are generated by the recommendation models
– Shows the output from all the classification models, along with their
comments, as easy-to-read, well-constructed English statements
– Key sentences highlighted, with the ability to tag sentences and switch
between two reading views: "Full Article" and "Highlighted Text"
35. AutoGSR Interface
Allows the user to choose their criteria for
selecting protest/non-protest articles. The user
can define a cutoff confidence probability
for classifying an article as a protest article.
36. AutoGSR Interface
The returned articles are clustered on the
fly so that similar articles appear in the
same cluster. The system also generates
cluster labels.
37. AutoGSR Interface
Clicking on a cluster shows all the articles
in the cluster, along with color-coding to
differentiate encoded articles from
unencoded articles.
38. AutoGSR Interface
Full encoding suggestions, along with
confidence scores, are generated based on
the encodings of the other articles in the
cluster.
40. AutoGSR Interface
Shows the output from all the
classification models, along with their
comments, as easy-to-read, well-constructed
English statements.
42. AutoGSR Interface
Based on the output of the key-sentence
recommendation model, sentences deemed to
contain the information required for event
extraction are highlighted. Further, a user
can also click a particular sentence and
record the type of information it provides
if they disagree with the system-generated
recommendations.