SlideShare ist ein Scribd-Unternehmen logo
1 von 44
Building the Next New York
Times Recommendation
Engine
By Alexander Spangher
Problem Statement:
1. The New York Times publishes over 300 articles, blog posts and interactive
stories a day.
Corpus:
n articles that are still relevant
over the past x days
For each user:
1 2 3 4 ...
30 day reading history
Machine Learning
“All of machine learning can be broken down into regression and matrix
factorization.”
-A drunk PhD student at a bar
1. Regression: f(input) = output
2. Factorization: f(output) = output
-Yann Lecun, 2015
Problem Statement (Refined)
1. Define pool of articles.
Not all articles expire at the same rate
1. Rank order articles based on reading history of user.
Assume that reader’s future preferences will match past preferences
Defining the Pool of Articles
Defining Relevancy
Exponential Distribution
Evergreen Model
Section,
Desk,
Word Count
...
clicks per day
2. Learn
relationship
between
features
and metric
1. Learn training metric
3. Convert to
interpretable
expiration date
Fit a to each item in training set
Fit:
i
Likelihood function:
Maximum Likelihood Estimate (MLE)
likelihood of data and
parameters
joint pdf of data given
parameter
product of independent pdf’s
Maximum Likelihood Estimate
Given timestamp of every click:
Maximum Likelihood Estimate
???
Maximum Log Likelihood Estimate
Or, use optimization package:
Python: http://cvxopt.org/
Convex Optimization by Stephen Boyd
Learn relationship between article features and
x = [desk, word count, section, ...]
y =
General Linear model:
Performance
Building the Recommender
(http://open.blogs.nytimes.com/2015/08/11/building-the-next-new-york-times-recommendation-engine/)
First Iteration
Keyword-Based model: TF-IDF Vector
N = number of times word appears in document
D = number of documents that word appears in
First Iteration
Keyword-Based model: TF-IDF Vector
[ 0.02, 0.5, 0, 0, … , .01 ]
[ 0.9, 0.01, 0.2, … , .05 ]
fun cat dog scholar
nice
Feedback:
“Recommendations work for me
I have been following the Oscar Pistorius case for over a year now and every time there has been a
relevant story about the case, I have been recommended that story.
Recommendations seem to be working very well for me.”
Feedback:
“No More Brooks recommendations, please
Your constant pushing of David Brooks onto me is like an annoying grandmother who won't believe
you are really allergic to peanuts even though you regularly go into anaphylactic shock at her dinner
table and need to be rushed to the hospital. What can I say… you're killing me. Please stop it.
...
Thanks for your attention to this matter.”
Feedback:
“Dear NY Times,
You seem to have missed the fact that, while I do read the Weddings section, I only (or almost only)
read about the weddings of same sex couples.
Please stop recommending heterosexual weddings articles to me!!”
[ 0.02, 0.5, 0, 0, … , .01 ]
[ 0.9, 0.01, 0.2, … , .05 ]
1 2 3 4
k
LDA-Based model: Topic Vector
Second Iteration:
Example topic,
probabilityweight
cat yarn tree building car money bank paw toy newspaper Spotify
Example topic, :
probabilityweight
cat yarn tree building car money bank paw toy newspaper Spotify
LDA
David Blei
(2003)
Topic Space
How do we learn these parameters?
LDA Definition:
Choose 𝜃 ~ Dirichlet(ɑ)
For each in document:
Choose word topic ~ Mult(𝜃)
Choose word from
Variational Inference
Image borrowed from David Blei (2003)
Variational Inference (cont.)
Variational Inference (cont.)
1. (E-Step):
1. (M-Step):
tractable!!!
Collaborative Topic Modeling (CTM)
Image borrowed from David Blei (2011)
The graphical model for the CTM model we use.
Scaling the algorithm
Training procedure is batch. Do we have time to scale to all our users, in real
time???
Strategy:
1. Iterate until some variables don’t change (article-topics).
1. Scale out, fixing non-changing variables. Update equation for one variable
becomes a closed-form equation.
Algorithm
1. Batch train on training set of users
1. Fix and scale out to all users
Derive scores for users
As seen in:
http://benanne.github.io/2014/08/05/spotify-cnns.html!!
C parameter: the back-off average
Any vector-based algorithm.
1)Deep Network (Spotify’s audio-CNN)
2)Shallow Network (Doc2Vec)
3)Topic Model
4)pLSA
In conclusion
Modeling is fun!
All models are bad, but some can be useful!
Improve by recognizing shortfalls.
Evaluate on KPIs, on customer feedback, on design decisions.
not functional
sub-optimal
flat-lining/degrading

Weitere ähnliche Inhalte

Ähnlich wie DataEngConf: Building the Next New York Times Recommendation Engine

Discovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender SystemsDiscovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender Systems
Gabriel Moreira
 
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Gabriel Moreira
 
CHAPTER 6FunctionsChapter Topics6.1 Introductio.docx
CHAPTER 6FunctionsChapter Topics6.1  Introductio.docxCHAPTER 6FunctionsChapter Topics6.1  Introductio.docx
CHAPTER 6FunctionsChapter Topics6.1 Introductio.docx
robertad6
 
CSIS 100CSIS 100 - Discussion Board Topic #1One of the object.docx
CSIS 100CSIS 100 - Discussion Board Topic #1One of the object.docxCSIS 100CSIS 100 - Discussion Board Topic #1One of the object.docx
CSIS 100CSIS 100 - Discussion Board Topic #1One of the object.docx
mydrynan
 
Sheet1Table 1 temperature and impact energy values for steelTable.docx
Sheet1Table 1 temperature and impact energy values for steelTable.docxSheet1Table 1 temperature and impact energy values for steelTable.docx
Sheet1Table 1 temperature and impact energy values for steelTable.docx
bjohn46
 

Ähnlich wie DataEngConf: Building the Next New York Times Recommendation Engine (20)

Discovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender SystemsDiscovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender Systems
 
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
 
Reasesrty djhjan S - explanation required.pptx
Reasesrty djhjan S - explanation required.pptxReasesrty djhjan S - explanation required.pptx
Reasesrty djhjan S - explanation required.pptx
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text Analysis
 
CHAPTER 6FunctionsChapter Topics6.1 Introductio.docx
CHAPTER 6FunctionsChapter Topics6.1  Introductio.docxCHAPTER 6FunctionsChapter Topics6.1  Introductio.docx
CHAPTER 6FunctionsChapter Topics6.1 Introductio.docx
 
Dataworkz odsc london 2018
Dataworkz odsc london 2018Dataworkz odsc london 2018
Dataworkz odsc london 2018
 
Kdd 2014 tutorial bringing structure to text - chi
Kdd 2014 tutorial   bringing structure to text - chiKdd 2014 tutorial   bringing structure to text - chi
Kdd 2014 tutorial bringing structure to text - chi
 
[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用
 
The Magical Art of Extracting Meaning From Data
The Magical Art of Extracting Meaning From DataThe Magical Art of Extracting Meaning From Data
The Magical Art of Extracting Meaning From Data
 
Wikipedia Document Classification
Wikipedia Document Classification Wikipedia Document Classification
Wikipedia Document Classification
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 
CSIS 100CSIS 100 - Discussion Board Topic #1One of the object.docx
CSIS 100CSIS 100 - Discussion Board Topic #1One of the object.docxCSIS 100CSIS 100 - Discussion Board Topic #1One of the object.docx
CSIS 100CSIS 100 - Discussion Board Topic #1One of the object.docx
 
Sheet1Table 1 temperature and impact energy values for steelTable.docx
Sheet1Table 1 temperature and impact energy values for steelTable.docxSheet1Table 1 temperature and impact energy values for steelTable.docx
Sheet1Table 1 temperature and impact energy values for steelTable.docx
 
Recsys 2016
Recsys 2016Recsys 2016
Recsys 2016
 
Probabilistic modeling in deep learning
Probabilistic modeling in deep learningProbabilistic modeling in deep learning
Probabilistic modeling in deep learning
 
Eric Smidth
Eric SmidthEric Smidth
Eric Smidth
 
Coaching teams in creative problem solving
Coaching teams in creative problem solvingCoaching teams in creative problem solving
Coaching teams in creative problem solving
 
Digital media & society cmst 301 project 4 final exam1. fo
Digital media & society cmst 301 project 4 final exam1. foDigital media & society cmst 301 project 4 final exam1. fo
Digital media & society cmst 301 project 4 final exam1. fo
 

Mehr von Hakka Labs

Mehr von Hakka Labs (20)

Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)
 
DataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchDataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series search
 
DataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data ScienceDataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data Science
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
DataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartDataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at Instacart
 
DataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleDataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scale
 
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor DataDataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale
 
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQDataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
 
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
 
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
 
DataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineeringDataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineering
 
DataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresDataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data Structures
 
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkDataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
 
DataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesDataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with Ourselves
 
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityDataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
 
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
 
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInDataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
 
WSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - Keynote
 
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open SourceWSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
WSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaSWSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaS
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 

DataEngConf: Building the Next New York Times Recommendation Engine