This talk was presented to NYC Open Data Meetup Group on Nov 11, 2014.
Speaker:
Daeil Kim is currently a data scientist at the Times and is finishing up his Ph.D at Brown University on work related to developing scalable inference algorithms for Bayesian Nonparametric models. His work at the Times spans a variety of problems related to the company's business interests, audience development, as well as developing tools to aid journalism.
Topic:
This talk will focus mostly on how machine learning can help with problems that crop up in journalism. We'll begin by talking about using popular supervised learning algorithms such as regularized Logistic Regression to assist a journalist's work in uncovering insights into a story regarding the recall of Takata airbags in cars. Afterwards, we'll look at using topic modeling to deal with large document dumps generated from FOIA (Freedom of Information Act) requests, and at Refinery, a simple web-based tool that eases the implementation of such tasks. Finally, if there is time, we will go over how topic models have been extended to assist in the problem of designing an efficient recommendation engine for text-based content.
2. Overview
● The Story of Faulty Takata Airbags
○ Using Logistic Regression to predict suspicious comments
● Dealing with large document corpora: The FOIA problem
○ What are Topic Models?
■ What are topics and why are they useful?
■ Latent Dirichlet Allocation - A Graphical Model Perspective
■ Scalable Topic Models
○ Refinery: A Locally Deployable Web Platform for Large Document Analysis
■ The Technology Stack for Refinery
■ How does Refinery work?
● Future Directions
4. Complaints data from the NHTSA
The Data
The data contains 33,204 comments, 2,219 of which were painstakingly labeled as suspicious (by Hiroko Tabuchi).
A Machine Learning Approach
Develop an algorithm that can predict whether a comment is suspicious or not. The algorithm will then learn from the dataset which features are representative of a suspicious comment.
5. The Machine Learning Approach
A sample comment. We will preprocess this data for the algorithm
- NEW TOYOTA CAMRY LE PURCHASED JANUARY 2004 - ON FEBRUARY 25TH KEY WOULD NOT TURN (TOOK 10 - 15 MINUTES TO START IT) -
LATER WHILE PARKING, THE CAR THE STEERING LOCKED TURNING THE CAR TO THE RIGHT - THE CAR ACCELERATED AND SURGED DESPITE
DEPRESSING THE BRAKE (SAME AS ODI PEO4021) - THOUGH THE CAR BROKE A METAL FLAG POLE, DAMAGED A RETAINING WALL, AND FELL
SEVEN FEET INTO A MAJOR STREET, THE AIR BAGS DID NOT DEPLOY - CAR IS SEVERELY DAMAGED: WHEELS, TIRES, FRONT END, GAS TANK,
FRONT AXLE - DRIVER HAS A SWOLLEN AND SORE KNEE ALONG WITH SIGNIFICANT SOFT TISSUE INJURIES INCLUDING BACK PAIN *SC *JB
TOKENIZE
(NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Break this into individual words
(NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Break this into bigrams (every two-word combination)
FILTER
(NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Remove tokens that appear in fewer than 5 comments
(NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Remove bigrams that appear in fewer than 5 comments
The data now consists of 33,204 examples with 56,191 features
DATA IS READY FOR TRAINING!
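This tokenize-and-filter pipeline maps directly onto a standard bag-of-words setup. A minimal sketch with scikit-learn, assuming the NHTSA comments have already been loaded into a list of strings (load_nhtsa_comments is a hypothetical helper, and the exact feature count will depend on the tokenizer):

from sklearn.feature_extraction.text import CountVectorizer

comments = load_nhtsa_comments()  # hypothetical loader returning the 33,204 raw comment strings

# Unigrams and bigrams, dropping any token that appears in fewer than
# 5 comments -- this mirrors the TOKENIZE and FILTER steps above.
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=5)
X = vectorizer.fit_transform(comments)
print(X.shape)  # roughly (33204, 56191) word / bigram count features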
6. Cross-Validation
[Figure: a document-term matrix with one row per comment ID, columns of features (i.e., word frequencies), and a label column (S = Suspicious, NS = Not Suspicious). A subset of the rows is our training set; the remaining held-out rows are our test set, which we use after training to obtain accuracy measures.]
7. How did we do?
Experiment Setup
We hold out 25% of both the suspicious and the not-suspicious comments for testing and train on the rest. We do this 5 times, creating random splits and retraining the model on each split.
Performance!
We obtain a very high AUC (~0.97) on our test sets.
Check what we missed
These comments are potentially worth checking twice.
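A sketch of this experiment with scikit-learn: five random stratified splits that each hold out 25% of both classes, a regularized logistic regression fit on the rest, and AUC computed on the held-out comments (X and y follow from the preprocessing sketch above; the regularization settings here are illustrative, not the exact ones used in the talk):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import roc_auc_score

# X: sparse document-term matrix, y: 1 = suspicious, 0 = not suspicious
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
aucs = []
for train_idx, test_idx in splitter.split(X, y):
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))
print(np.mean(aucs))  # the talk reports roughly 0.97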
8. The most predictive words / features
[Figure: top features ranked by model weight, one list predictive of a suspicious comment and one predictive of a normal comment.]
After training the model, we then applied it to the full dataset. We looked for comments that Hiroko didn't label as suspicious but that the algorithm did, to follow up on (374 out of 33K total).
Result: 7 new cases where a passenger was injured were discovered among the comments she missed.
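One way to recover this "most predictive features" view from the fitted model is to sort the logistic regression coefficients, then score the full dataset to surface unlabeled comments the model flags. A sketch reusing the vectorizer and clf names from the snippets above (get_feature_names_out assumes a recent scikit-learn; older versions call it get_feature_names):

import numpy as np

feature_names = np.array(vectorizer.get_feature_names_out())
coefs = clf.coef_.ravel()

# Largest positive weights push toward "suspicious"; most negative toward "normal".
top_suspicious = feature_names[np.argsort(coefs)[-20:][::-1]]
top_normal = feature_names[np.argsort(coefs)[:20]]
print("Predictive of suspicious:", top_suspicious)
print("Predictive of normal:", top_normal)

# Score every comment and surface ones the model calls suspicious
# but that were not hand-labeled, for a reporter to double-check.
flagged = (clf.predict(X) == 1) & (y == 0)
print(flagged.sum(), "comments to follow up on")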
9. Dealing with large document corpora (e.g., FOIA dumps)
We’ll use Topic Models for making sense of these large document collections!
10. What are Topic Models?
An example document:
"There are reasons to believe that the genetics of an organism are likely to shift due to the extreme changes in our climate. To protect them, our politicians must pass environmental legislation that can protect our future species from becoming extinct…"
Decompose documents as a probability distribution over "topic" indices.
[Figure: bar chart of the document's topic proportions (between 0 and 1) over topics such as "Climate Change", "Politics", and "Genetics".]
Topics in turn represent probability distributions over the unique words in your vocabulary.
[Figure: top words under the "Politics", "Climate Change", and "Genetics" topics.]
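A minimal sketch of what this decomposition looks like in code, using scikit-learn's LatentDirichletAllocation (the tiny corpus and the choice of 3 topics are purely illustrative):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["genetics of an organism may shift with extreme changes in climate ...",
        "politicians must pass environmental legislation ...",
        "..."]  # illustrative corpus
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic proportions (theta)
print(doc_topics[0])                    # e.g. mostly one topic, a little of the others

# Each topic is a distribution over the vocabulary: show its top words.
words = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print("topic", k, ":", [words[i] for i in topic.argsort()[-10:][::-1]])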
12. Bayes Theorem
Posterior distribution: the probability of our new model given the data.
Likelihood: given our model, how likely is this data?
Prior: our prior belief about the world. In terms of LDA, our modeling assumptions / priors.
Normalization constant: needed for valid probabilities, and it makes this problem a lot harder.
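For reference, the equation these annotations describe (Bayes' rule, written here in LaTeX for data x and model parameters / latent variables θ):

p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}

i.e., posterior = likelihood × prior / normalization constant.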
13. Posterior Inference in LDA
GOAL: Obtain this posterior, which means we need to calculate an intractable normalizing term.
For LDA, this is the posterior over latent variables: how much a document contains of each topic k (θ), and the topic assignment z of each word.
LDA: Latent Dirichlet Allocation (Bayesian Topic Model)
Blei et al., 2001
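Written out in LDA's notation (per-document topic proportions θ, word-level topic assignments z, observed words w, hyperparameters α and β), the posterior and its intractable normalizer are roughly:

p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)},
\qquad
p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \sum_{z} p(z \mid \theta)\, p(w \mid z, \beta)\, d\theta

The integral over θ coupled with the sum over all possible topic assignments z is what makes the denominator intractable, which motivates approximate inference (variational methods, Gibbs sampling).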
14. Scalable Learning & Inference in Topic Models
LDA: Latent Dirichlet Allocation (Bayesian Topic Model)
Blei et al., 2001
Analyze a subset (mini-batch) of your total documents before updating; update θ, z, and β after analyzing each mini-batch of documents.
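This mini-batch scheme corresponds to online (stochastic) variational inference for LDA (Hoffman et al., 2010). scikit-learn exposes it through the learning_method and batch_size options; a sketch with illustrative parameter values:

from sklearn.decomposition import LatentDirichletAllocation

# Online variational Bayes: the topic-word parameters (beta) are updated after
# each mini-batch of documents instead of after a full pass over the corpus.
lda = LatentDirichletAllocation(
    n_components=50,
    learning_method="online",  # stochastic / mini-batch updates
    batch_size=256,            # documents analyzed per update
    max_iter=10,
    random_state=0,
)
lda.fit(counts)  # counts: a (possibly very large) document-term matrix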
15. Refinery: An open source web-app for large document analyses
Daeil Kim @ New York Times
Founder of Refinery
daeil.kim@nytimes.com
Ben Swanson @ MIT Media Lab
Co-Founder of Refinery
dujiaozhu@gmail.com
Refinery is a 2014 Knight Prototype Fund winner. Check it out at: http://docrefinery.org
16. Installing Refinery
3 simple steps to get Refinery running (install the prerequisites first!)
1) Command → git clone https://github.com/daeilkim/refinery.git
2) Go to the root folder. Command → vagrant up
3) Open a browser and go to → 11.11.11.11:8080
17. A Typical Refinery Pipeline
Step 1: Upload documents
Step 2: Extract topics from a topic model
Step 3: Find a subset of documents with topics of interest
Step 4: Discover interesting phrases
18. A Quick Refinery Demo
Extracting NYT articles matching the keyword "obama" in 2013.
What themes / topics defined the Obama administration during 2013?
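The demo's input (articles mentioning "obama" during 2013) could be pulled with the NYT Article Search API; a rough sketch, assuming you have an API key from developer.nytimes.com (only the first page of results is fetched, and paging and error handling are omitted):

import requests

API_KEY = "YOUR_KEY"  # obtain from developer.nytimes.com
resp = requests.get(
    "https://api.nytimes.com/svc/search/v2/articlesearch.json",
    params={
        "q": "obama",
        "begin_date": "20130101",
        "end_date": "20131231",
        "api-key": API_KEY,
    },
)
docs = resp.json()["response"]["docs"]  # article metadata and snippets to feed into Refinery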
19. Future Directions: Better tools for Investigative Reporting
[Figure: a three-stage pipeline: (1) Collecting & Scraping Data, (2) Filtering & Cleaning Data, (3) Extracting Insights.]
Great tools like DocumentCloud take care of steps 1 & 2. Refinery focuses on extracting insights from relatively clean data. Enterprise stories might be completed in a fraction of the time.
20. Interesting Extensions to Topic Models
Combining topic models with recommendation systems (a sketch of one such formulation follows below).
[Figure: the generative processes of LDA / topic modeling and of a matrix factorization model, side by side.]
Benefits
● The model thinks of users as mixtures of topics. We are what we read and rate.
● The ratings in turn also help shape the topics that are discovered.
● Can do in-matrix and out-of-matrix predictions.
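One well-known formulation along these lines is collaborative topic regression (Wang & Blei, 2011), where each item's latent vector is its LDA topic proportions plus a ratings-driven offset. A sketch of the ratings side of that generative process, written in LaTeX (user vectors u_i, item topic proportions θ_j, offsets ε_j, precisions λ_u and λ_v, confidence c_ij):

u_i \sim \mathcal{N}(0, \lambda_u^{-1} I), \qquad
\epsilon_j \sim \mathcal{N}(0, \lambda_v^{-1} I), \qquad
v_j = \theta_j + \epsilon_j, \qquad
r_{ij} \sim \mathcal{N}(u_i^{\top} v_j,\; c_{ij}^{-1})

Because v_j falls back to the topic proportions θ_j when an item has no ratings yet, this kind of model can make out-of-matrix predictions for brand-new articles, which is the property highlighted above.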