Aiding journalism with 
machine learning @ NYT 
Dae Il Kim - daeil.kim@nytimes.com
Overview 
● The Story of Faulty Takata Airbags 
○ Using Logistic Regression to predict suspicious comments 
● Dealing with large document corpuses: The FOIA problem 
○ What are Topic Models? 
■ What are topics and why are they useful? 
■ Latent Dirichlet Allocation - A Graphical Model Perspective 
■ Scalable Topic Models 
○ Refinery: A Locally Deployable Web Platform for Large Document Analysis 
■ The Technology Stack for Refinery 
■ How does Refinery work? 
● Future Directions
The Story of Faulty Takata Airbags
Complaints data from NHTSA 
The Data 
The data contains 33,204 comments, 2,219 of 
which were painstakingly labeled as suspicious by 
Hiroko Tabuchi. 
A Machine Learning Approach 
Develop an algorithm that predicts 
whether a comment is suspicious or not. 
The algorithm learns from the dataset 
which features are representative of a suspicious 
comment.
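As a minimal sketch of the kind of model the overview names (logistic regression), the block below fits weights by batch gradient descent on the log loss. The function names, learning rate, and epoch count are illustrative assumptions, not the settings used in the talk, and a real pipeline would more likely rely on a library implementation.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, learning_rate=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss.

    X is a list of feature vectors, y a list of 0/1 labels
    (1 = suspicious). Hyperparameters are illustrative assumptions.
    """
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            p = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
            error = p - label  # gradient of the log loss w.r.t. the logit
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias


def predict_proba(features, weights, bias):
    """Estimated probability that a comment is suspicious."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
```

A positive learned weight marks a feature as predictive of a suspicious comment, and a negative weight marks it as predictive of a normal one.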
The Machine Learning Approach 
A sample comment. We will preprocess this data for the algorithm 
- NEW TOYOTA CAMRY LE PURCHASED JANUARY 2004 - ON FEBRUARY 25TH KEY WOULD NOT TURN (TOOK 10 - 15 MINUTES TO START IT) - 
LATER WHILE PARKING, THE CAR THE STEERING LOCKED TURNING THE CAR TO THE RIGHT - THE CAR ACCELERATED AND SURGED DESPITE 
DEPRESSING THE BRAKE (SAME AS ODI PEO4021) - THOUGH THE CAR BROKE A METAL FLAG POLE, DAMAGED A RETAINING WALL, AND FELL 
SEVEN FEET INTO A MAJOR STREET, THE AIR BAGS DID NOT DEPLOY - CAR IS SEVERELY DAMAGED: WHEELS, TIRES, FRONT END, GAS TANK, 
FRONT AXLE - DRIVER HAS A SWOLLEN AND SORE KNEE ALONG WITH SIGNIFICANT SOFT TISSUE INJURIES INCLUDING BACK PAIN *SC *JB 
TOKENIZE 
(NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Break this into individual words 
(NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Break this into bigrams (every pair of adjacent words) 
FILTER 
(NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Remove tokens that appear in fewer than 5 comments 
(NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Remove bigrams that appear in fewer than 5 comments 
The data now consists of 33,204 examples with 56,191 features 
DATA IS READY FOR TRAINING!
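The TOKENIZE and FILTER steps can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions: tokens are split on whitespace only (punctuation is not stripped), and the helper names `tokenize`, `build_vocabulary`, and `featurize` are hypothetical, not taken from the talk.

```python
from collections import Counter


def tokenize(comment):
    """Return the unigrams and adjacent-word bigrams of a comment."""
    words = comment.upper().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def build_vocabulary(comments, min_comments=5):
    """Keep features that appear in at least `min_comments` distinct comments."""
    doc_freq = Counter()
    for comment in comments:
        # A set counts each feature once per comment (document frequency).
        doc_freq.update(set(tokenize(comment)))
    kept = sorted(f for f, n in doc_freq.items() if n >= min_comments)
    return {feature: index for index, feature in enumerate(kept)}


def featurize(comment, vocabulary):
    """Return a count vector with one entry per vocabulary feature."""
    vector = [0] * len(vocabulary)
    for feature, count in Counter(tokenize(comment)).items():
        if feature in vocabulary:
            vector[vocabulary[feature]] = count
    return vector
```

Applying `featurize` to every comment yields one row of word and bigram counts per comment, which is the matrix the classifier is trained on.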
Cross-Validation 
Comment ID 
Features (i.e., word frequency) 
0 0 0 3 1 0 2 0... 
1 0 0 0 2 0 1 1... 
... 
1 1 5 1 2 0 0 1... 
Labels (S = Suspicious, NS = Not Suspicious) 
This is our training set. Take a subset of the 
data for training 
S 
NS 
S 
S 
NS 
NS 
NS 
NS 
NS 
This is our test set. After training, test on 
this dataset to obtain accuracy measures.
How did we do? 
Experiment Setup 
We hold out 25% of both the 
suspicious and not suspicious 
comments for testing and train on 
the rest. We do this 5 times, creating 
random splits and retraining the 
model with these splits. 
Performance! 
We obtain a very high AUC (~0.97) on 
our test sets. 
Check what we missed 
These comments are potentially 
worth checking twice.
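The experiment setup can be sketched as a stratified hold-out split plus an AUC computation. This is a plain-Python illustration: the function names and the seed handling are assumptions, and the pairwise AUC shown here is quadratic in the number of comments, so a library routine would be preferable at the scale of the real dataset.

```python
import random


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Hold out the same fraction of each class.

    Returns (train_indices, test_indices). `seed` selects one random split.
    """
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        n_test = int(round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def auc(labels, scores):
    """Area under the ROC curve.

    Computed as the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one; ties count half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Calling `stratified_split` with five different seeds, retraining on each training set, and averaging `auc` over the five test sets mirrors the repeated random splits described on this slide.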
The most predictive words / features 
Predictive of a 
suspicious comment 
Predictive of a 
normal comment. 
After training the model, 
we applied it to 
the full dataset. 
We looked for 
comments that Hiroko 
didn’t label as 
suspicious but the 
algorithm did, and 
followed up on them (374 of ~33K total). 
Result: 7 new cases 
in which a passenger 
was injured were 
discovered among 
the comments she 
missed.
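The follow-up step amounts to a simple filter: keep comments the model scores as suspicious that were not hand-labeled as such. The sketch below assumes a 0.5 probability threshold and a hypothetical helper name; the talk does not state which cutoff was used.

```python
def flag_for_review(scores, labeled_suspicious, threshold=0.5):
    """Return indices of comments worth a second look, highest score first.

    scores: model probability of "suspicious" for every comment.
    labeled_suspicious: set of indices already hand-labeled as suspicious.
    threshold: assumed cutoff, not taken from the talk.
    """
    candidates = [
        i for i, score in enumerate(scores)
        if score >= threshold and i not in labeled_suspicious
    ]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)
```

Sorting by score puts the comments the model is most confident about at the top of the reporter's reading list.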
Dealing with large document corpuses (e.g., FOIA dumps) 
We’ll use Topic Models for making sense of these large document collections!
What are Topic Models? 
There are reasons to believe that the 
genetics of an organism are likely to 
shift due to the extreme changes in our 
climate. To protect them, our politicians 
must pass environmental legislation 
that can protect our future species from 
becoming extinct… 
Decompose documents into a probability distribution over “topic” indices. 
[Figure: topic proportions on a 0 to 1 scale for “Climate Change”, “Politics”, and “Genetics”] 
Topics in turn represent probability distributions over the unique words in your vocabulary. 
“Politics” “Climate Change” “Genetics”
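The two layers of a topic model (documents as distributions over topics, topics as distributions over words) can be made concrete with a small sketch. All probabilities below are invented for illustration; only the three topic names and the example words come from the slides.

```python
# Illustrative numbers only: one document's topic proportions.
doc_topics = {"Climate Change": 0.5, "Politics": 0.3, "Genetics": 0.2}

# Illustrative numbers only: each topic is a distribution over words.
topic_words = {
    "Climate Change": {"climate": 0.4, "pollution": 0.4, "government": 0.2},
    "Politics": {"obama": 0.4, "state": 0.3, "government": 0.3},
    "Genetics": {"dna": 0.4, "gene": 0.4, "drug": 0.2},
}


def word_probability(word, doc_topics, topic_words):
    """p(word | document) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topic_words[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )
```

A word such as "government" receives probability from more than one topic, which is how a single document can mix several themes.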
Topic Models: A Graphical Model Perspective 
LDA: Latent Dirichlet Allocation (Bayesian Topic Model) 
Blei et al., 2001 
[Figure: topic proportions on a 0 to 1 scale for “Climate Change”, “Politics”, and “Genetics”] 
Word counts: dna: 2, obama: 1, state: 1, gene: 2, 
climate: 3, government: 1, drug: 2, 
pollution: 3
Bayes Theorem 
Prior belief about the world. In terms of 
LDA, our modeling assumptions / priors. 
Normalization constant makes this 
problem a lot harder. We need this 
for valid probabilities. 
Likelihood. Given our model, 
how likely is this data? 
Posterior distribution. Probability of our 
new model given the data.
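For reference, the four annotated terms fit together as follows. The symbols $m$ (model) and $D$ (data) are notation chosen here for illustration, since the slide's own equation did not survive extraction.

```latex
\underbrace{p(m \mid D)}_{\text{posterior}}
  = \frac{\overbrace{p(D \mid m)}^{\text{likelihood}} \;
          \overbrace{p(m)}^{\text{prior}}}
         {\underbrace{p(D)}_{\text{normalization constant}}},
\qquad
p(D) = \int p(D \mid m)\, p(m)\, \mathrm{d}m .
```

The integral over all possible models in $p(D)$ is what makes the normalization constant hard to compute.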
Posterior Inference in LDA 
GOAL: Obtain this posterior 
which means that we need to 
calculate this intractable term: 
For LDA, this represents the posterior 
over the latent variables: how much 
of topic k a document contains (θ) 
and the topic-word assignments (z). 
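In the standard notation of the LDA paper, the posterior and its intractable normalizer can be written as below. The Dirichlet hyperparameter $\alpha$, the observed words $w$, and the per-document word index $n$ are assumptions borrowed from that notation; they do not appear in the extracted slide text.

```latex
p(\theta, z \mid w, \alpha, \beta)
  = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)},
\qquad
p(w \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
    \, \mathrm{d}\theta .
```

The denominator couples $\theta$ and $\beta$ inside the sum over topic assignments, so it has no closed form and approximate inference is used instead.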
Scalable Learning & Inference in Topic Models 
Update θ, z, and β after analyzing 
each mini-batch of documents. 
Analyze a subset of your total documents before updating.
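One common way to realize mini-batch updates is a stochastic-variational-style scheme: estimate the global topic parameters from each mini-batch, then blend that estimate into the running parameters with a decaying step size. The sketch below shows only this outer loop machinery. The step-size schedule, its constants `tau0` and `kappa`, and the function names are assumptions; the slides do not say which algorithm was used.

```python
def minibatches(documents, batch_size):
    """Yield consecutive subsets of the document list."""
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]


def step_size(t, tau0=1.0, kappa=0.7):
    """Decaying step size for the t-th mini-batch (t starts at 0).

    tau0 and kappa are assumed constants; kappa in (0.5, 1] is the
    usual range for this kind of schedule.
    """
    return (tau0 + t) ** (-kappa)


def blend(global_params, batch_estimate, rho):
    """Move the global parameters a fraction rho toward the batch estimate."""
    return [(1.0 - rho) * g + rho * b for g, b in zip(global_params, batch_estimate)]
```

Because each update touches only one mini-batch, the full corpus never has to be processed before the topics start to take shape.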
Refinery: An open source web-app for large document analyses 
Daeil Kim @ New York Times 
Founder of Refinery 
daeil.kim@nytimes.com 
Ben Swanson @ MIT Media Lab 
Co-Founder of Refinery 
dujiaozhu@gmail.com 
Refinery is a 2014 Knight Prototype Fund winner. Check it out at: http://docrefinery.org
Installing Refinery 
3 simple steps to get Refinery running (install the prerequisites first!) 
1) Command → git clone https://github.com/daeilkim/refinery.git 
2) Go to the root folder. Command → vagrant up 
3) Open a browser and go to → 11.11.11.11:8080
A Typical Refinery Pipeline 
Step 1: Upload documents 
Step 2: Extract Topics from a Topic 
Model 
Step 3: Find a subset of documents with 
topics of interest. 
Step 4: Discover Interesting Phrases
A Quick Refinery Demo 
Extracting NYT 
articles matching the 
keyword “obama” from 
2013. 
What themes / topics defined the Obama administration during 2013?
Future Directions: Better tools for Investigative Reporting 
1) Collecting & Scraping Data 
2) Filtering & Cleaning Data 
3) Extracting Insights 
Great tools like DocumentCloud take 
care of steps 1 & 2. 
Refinery focuses on extracting 
insights from relatively clean data. 
Enterprise stories might 
be completed in a 
fraction of the time.
Interesting Extensions to Topic Models 
Combining topic models with recommendation systems. 
LDA / Topic Modeling 
Matrix Factorization Model 
Generative Process 
Generative Process 
Benefits 
● The model thinks of users as mixtures 
of topics. We are what we read and 
rate. 
● The ratings in turn help shape the 
topics that are discovered. 
● Can do in-matrix and out-of-matrix 
predictions.

Master's Thesis - Data Science - Presentation
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
 
Rock Songs common codes and conventions.pptx
Rock Songs common codes and conventions.pptxRock Songs common codes and conventions.pptx
Rock Songs common codes and conventions.pptx
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?
 

Using Machine Learning to aid Journalism at the New York Times

  • 1. Aiding journalism with machine learning @ NYT Dae Il Kim - daeil.kim@nytimes.com
  • 2. Overview ● The Story of Faulty Takata Airbags ○ Using Logistic Regression to predict suspicious comments ● Dealing with large document corpuses: The FOIA problem ○ What are Topic Models? ■ What are topics and why are they useful? ■ Latent Dirichlet Allocation - A Graphical Model Perspective ■ Scalable Topic Models ○ Refinery: A Locally Deployable Web Platform for Large Document Analysis ■ The Technology Stack for Refinery ■ How does Refinery work? ● Future Directions
  • 3. The Story of Faulty Takata Airbags
  • 4. Complaints data from NHTSA. The Data: The dataset contains 33,204 comments, 2,219 of which were painstakingly labeled as suspicious (by Hiroko Tabuchi). A Machine Learning Approach: Develop an algorithm that predicts whether a comment is suspicious or not. The algorithm learns from the dataset which features are representative of a suspicious comment.
  • 5. The Machine Learning Approach. A sample comment, which we preprocess for the algorithm: - NEW TOYOTA CAMRY LE PURCHASED JANUARY 2004 - ON FEBRUARY 25TH KEY WOULD NOT TURN (TOOK 10 - 15 MINUTES TO START IT) - LATER WHILE PARKING, THE CAR THE STEERING LOCKED TURNING THE CAR TO THE RIGHT - THE CAR ACCELERATED AND SURGED DESPITE DEPRESSING THE BRAKE (SAME AS ODI PEO4021) - THOUGH THE CAR BROKE A METAL FLAG POLE, DAMAGED A RETAINING WALL, AND FELL SEVEN FEET INTO A MAJOR STREET, THE AIR BAGS DID NOT DEPLOY - CAR IS SEVERELY DAMAGED: WHEELS, TIRES, FRONT END, GAS TANK, FRONT AXLE - DRIVER HAS A SWOLLEN AND SORE KNEE ALONG WITH SIGNIFICANT SOFT TISSUE INJURIES INCLUDING BACK PAIN *SC *JB TOKENIZE: (NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Break this into individual words. (NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Break this into bigrams (every pair of adjacent words). FILTER: (NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Remove tokens that appear in fewer than 5 comments. (NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Remove bigrams that appear in fewer than 5 comments. The data now consists of 33,204 examples with 56,191 features. DATA IS READY FOR TRAINING!
  • 6. Cross-Validation. [Figure: a table with one row per comment ID, feature columns (i.e., word frequencies, e.g. 0 0 0 3 1 0 2 0...), and a label column (S = Suspicious, NS = Not Suspicious).] Training set: take a subset of the data for training. Test set: after training, test on this dataset to obtain accuracy measures.
  • 7. How did we do? Experiment Setup: We hold out 25% of both the suspicious and not suspicious comments for testing and train on the rest. We do this 5 times, creating random splits and retraining the model on each split. Performance: We obtain a very high AUC (~0.97) on our test sets. Check what we missed: These comments are potentially worth checking twice.
  • 8. The most predictive words / features. [Figure panels: words predictive of a suspicious comment; words predictive of a normal comment.] After training the model, we applied it to the full dataset. We followed up on comments that Hiroko had not labeled as suspicious but the algorithm did (374 / 33K total). Result: 7 new cases where a passenger was injured were discovered among the comments she missed.
  • 9. Dealing with large document corpuses (i.e., FOIA dumps). We’ll use Topic Models to make sense of these large document collections!
  • 10. What are Topic Models? Example document: “There are reasons to believe that the genetics of an organism are likely to shift due to the extreme changes in our climate. To protect them, our politicians must pass environmental legislation that can protect our future species from becoming extinct…” Decompose documents as a probability distribution over “topic” indices. [Figure: the document's topic proportions, on a 0 to 1 scale, over “Climate Change”, “Politics”, and “Genetics”.] Topics in turn represent probability distributions over the unique words in your vocabulary. [Figure: word distributions for “Politics”, “Climate Change”, and “Genetics”.]
  • 11. Topic Models: A Graphical Model Perspective. LDA: Latent Dirichlet Allocation (Bayesian Topic Model), Blei et al., 2001. [Figure: topic proportions, on a 0 to 1 scale, over “Climate Change”, “Politics”, and “Genetics”.] Example word counts: dna: 2, obama: 1, state: 1, gene: 2, climate: 3, government: 1, drug: 2, pollution: 3
  • 12. Bayes Theorem. Prior: our belief about the world; in terms of LDA, our modeling assumptions / priors. Normalization constant: makes this problem a lot harder, but we need it for valid probabilities. Likelihood: given our model, how likely is this data? Posterior distribution: the probability of our new model given the data.
  • 13. Posterior Inference in LDA. GOAL: Obtain this posterior, which means that we need to calculate an intractable term [equation image not preserved]. For LDA, this represents the posterior over the latent variables: how much of topic k a document contains (θ) and the topic-word assignments z. LDA: Latent Dirichlet Allocation (Bayesian Topic Model), Blei et al., 2001.
  • 14. Scalable Learning & Inference in Topic Models. LDA: Latent Dirichlet Allocation (Bayesian Topic Model), Blei et al., 2001. Analyze a subset (mini-batch) of your total documents before updating. Update θ, z, and β after analyzing each mini-batch of documents.
  • 15. Refinery: An open source web-app for large document analyses. Daeil Kim @ New York Times, Founder of Refinery, daeil.kim@nytimes.com. Ben Swanson @ MIT Media Lab, Co-Founder of Refinery, dujiaozhu@gmail.com. Refinery is a 2014 Knight Prototype Fund winner. Check it out at: http://docrefinery.org
  • 16. Installing Refinery. 3 simple steps to get Refinery running. Install these first! 1) Command → git clone https://github.com/daeilkim/refinery.git 2) Go to the root folder. Command → vagrant up 3) Open a browser and go to → 11.11.11.11:8080
  • 17. A Typical Refinery Pipeline. Step 1: Upload documents. Step 2: Extract topics from a Topic Model. Step 3: Find a subset of documents with topics of interest. Step 4: Discover interesting phrases.
  • 18. A Quick Refinery Demo Extracting NYT articles from keyword “obama” in 2013. What themes / topics defined the Obama administration during 2013?
  • 19. Future Directions: Better tools for Investigative Reporting. Steps: 1) Collecting & Scraping Data, 2) Filtering & Cleaning Data, 3) Extracting Insights. Great tools like DocumentCloud take care of steps 1 & 2. Refinery focuses on extracting insights from relatively clean data. Enterprise stories might be completed in a fraction of the time.
  • 20. Interesting Extensions to Topic Models. Combining topic models with recommendation systems. [Figure: the generative process of LDA / topic modeling alongside the generative process of a matrix factorization model.] Benefits: ● The model thinks of users as mixtures of topics. We are what we read and rate. ● The ratings in turn help shape the topics that are also discovered. ● Can do in-matrix and out-of-matrix predictions.
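
The preprocessing on slide 5 (tokenize into words and bigrams, then drop anything seen in fewer than 5 comments) can be sketched roughly as follows. This is a minimal illustration, not the code used at the NYT: the function names `tokenize`, `build_vocabulary`, and `featurize` are hypothetical, and splitting on whitespace is an assumption, since the slides do not say how punctuation was handled.

```python
from collections import Counter


def tokenize(comment):
    """Split a comment into words plus bigrams (pairs of adjacent words).

    Assumption: tokens are separated by whitespace; punctuation is kept as-is.
    """
    words = comment.upper().split()
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def build_vocabulary(comments, min_comments=5):
    """Keep only tokens that appear in at least `min_comments` comments."""
    doc_freq = Counter()
    for comment in comments:
        # A set counts each token once per comment (document frequency).
        doc_freq.update(set(tokenize(comment)))
    kept = sorted(t for t, n in doc_freq.items() if n >= min_comments)
    return {token: idx for idx, token in enumerate(kept)}


def featurize(comment, vocabulary):
    """Return the word-frequency vector of one comment over the vocabulary."""
    counts = Counter(tokenize(comment))
    # Dicts preserve insertion order, so positions match the vocabulary indices.
    return [counts.get(token, 0) for token in vocabulary]
```

Applying `featurize` to every comment yields a matrix with one row per comment and one column per surviving token, the same shape as the "33,204 examples with 56,191 features" described on the slide.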
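
The experiment setup on slide 7 (hold out 25% of each class, repeat over 5 random splits, report AUC) might look like the sketch below. It covers only the splitting and the scoring; fitting the logistic regression itself is left out. The helper names `stratified_split` and `auc` are hypothetical, labels are assumed to be 1 for suspicious and 0 for not suspicious, and the pairwise AUC is quadratic in the number of comments, so it is meant for illustration rather than for the full dataset.

```python
import random


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Hold out `test_fraction` of each class; return (train_idx, test_idx)."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Compares every positive with every negative.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Five random splits, as on the slide: a different seed gives a different split.
example_labels = [1] * 8 + [0] * 32
splits = [stratified_split(example_labels, seed=s) for s in range(5)]
```

For each split, a classifier would be trained on the training indices and `auc` computed from its scores on the test indices; averaging over the five splits gives a figure comparable to the ~0.97 reported.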
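
The equations on slides 12 and 13 are not preserved in the transcript. For reference, the standard forms are written out below in the usual LDA notation, which may differ from what the slides showed. Here θ, z, and β are the symbols used on the slides, w denotes the observed words of a document, and α (the Dirichlet hyperparameter on θ) is an assumption, since it does not appear in the transcript.

```latex
% Bayes' theorem: posterior = likelihood x prior / normalization constant
p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}

% LDA posterior over the latent variables of one document
p(\theta, z \mid w, \alpha, \beta)
  = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}

% The normalization constant couples \theta and \beta inside the integral,
% which is what makes exact inference intractable
p(w \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta
```

Because the last integral has no closed form, approximate methods are used in practice, which is the setting for the mini-batch updates on slide 14.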