News recommendation is particularly challenging: a large volume of new content is produced every day, and its value to users deteriorates quickly. This demands models and infrastructure that can handle those nuances and serve a newly trained model about 100 times per day. This presentation gives a detailed overview of how the R&D team of Hearst's TV division is combining Google BigQuery, a Kubernetes cluster, and TensorFlow to build a hybrid recommendation system that brings together model-based matrix factorization, content recency, and content semantics through NLP.
11. Data Acquisition
Page views with user's time on page
Google Analytics, Google BigQuery, CMS
Content corpus: title, body, timestamp, metadata (sections, tags, etc.)
Contents
TFRecord/CSV files
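As a rough illustration of the step from page views to CSV files, here is a minimal sketch. It assumes page views arrive as (user_id, content_id, seconds) tuples and that repeated views are summed per user and article. The field names, the summing rule, and the CSV layout are hypothetical; the slide does not specify them.

```python
import csv
import io
from collections import defaultdict

def aggregate_page_views(page_views):
    """Sum time on page (seconds) per (user_id, content_id) pair.

    Summing repeated views is an assumption made for this sketch.
    """
    totals = defaultdict(float)
    for user_id, content_id, seconds in page_views:
        totals[(user_id, content_id)] += seconds
    return totals

def to_csv(totals):
    """Serialize the aggregated totals as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["user_id", "content_id", "time_on_page"])  # hypothetical columns
    for (user_id, content_id), seconds in sorted(totals.items()):
        writer.writerow([user_id, content_id, seconds])
    return buffer.getvalue()

# Toy page views standing in for an export from the analytics store.
page_views = [("u1", "a1", 30.0), ("u1", "a1", 15.0), ("u2", "a2", 60.0)]
csv_text = to_csv(aggregate_page_views(page_views))
print(csv_text)
```

The resulting rows can serve as implicit-feedback input for the factorization model, with time on page acting as the interaction strength.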
19. Natural Language Processing
Concatenate content data (title, body, sections, tags, …)
Remove stop words, symbols and HTML tags
Train word2vec Neural Network
Combine all word-vectors of each article into one (doc2vec)
CMS articles, doc2vec, contents
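The preprocessing and combination steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the stop-word list is a tiny hypothetical one, the word vectors are a hand-made lookup table standing in for a trained word2vec model, and the word vectors are combined by simple averaging, which the slide does not confirm as the method used.

```python
import html
import re

# Hypothetical, tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def preprocess(title, body, sections, tags):
    """Concatenate article fields, strip HTML tags and symbols, drop stop words."""
    text = " ".join([title, body, " ".join(sections), " ".join(tags)])
    text = re.sub(r"<[^>]+>", " ", text)             # remove HTML tags
    text = html.unescape(text)                       # decode entities such as &amp;
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # keep alphanumeric runs only
    return [t for t in tokens if t not in STOP_WORDS]

def document_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens (an assumed combination rule).

    Returns a zero vector when no token has a word vector.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

# Toy lookup table standing in for trained word2vec embeddings.
word_vectors = {"storm": [1.0, 0.0], "flood": [0.0, 1.0]}
tokens = preprocess(
    "Storm warning",
    "<p>The flood &amp; the storm</p>",
    ["weather"],
    ["storm"],
)
doc_vec = document_vector(tokens, word_vectors, dim=2)
print(tokens, doc_vec)
```

Averaging keeps the document vector in the same space as the word vectors, so articles can be compared with the same similarity measure used for words.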
57. Hybrid Matrix Factorization
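• $R \approx U^{*} \times V^{*}$
where:
• $U^{*} = U_{\text{Users} \times \text{KClusters}} \times A_{\text{KClusters} \times \text{Latent\_factors}}$
• $V^{*} = B_{\text{Latent\_factors} \times \text{KClusters}} \times V_{\text{KClusters} \times \text{Items}}$
Note: only A and B are variables to be trained. U and V are constants.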
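To make the shapes and the training constraint concrete, here is a minimal NumPy sketch of the factorization above. Several choices are assumptions not stated on the slide: U and V are built as one-hot cluster memberships, the loss is squared error over observed entries, and training uses plain gradient descent in NumPy rather than TensorFlow. The function name and all hyperparameters are hypothetical.

```python
import numpy as np

def train_hybrid_mf(R, mask, U, V, n_factors=2, lr=0.02, steps=500, seed=0):
    """Fit R ≈ (U @ A) @ (B @ V) by gradient descent, updating only A and B.

    U (users x clusters) and V (clusters x items) stay constant throughout.
    """
    rng = np.random.default_rng(seed)
    k_users, k_items = U.shape[1], V.shape[0]
    A = 0.1 * rng.standard_normal((k_users, n_factors))
    B = 0.1 * rng.standard_normal((n_factors, k_items))
    losses = []
    for _ in range(steps):
        U_star = U @ A                            # users x latent factors
        V_star = B @ V                            # latent factors x items
        err = mask * (U_star @ V_star - R)        # error on observed entries only
        losses.append(0.5 * float(np.sum(err ** 2)))
        grad_A = U.T @ err @ V_star.T             # d(loss)/dA, shape of A
        grad_B = U_star.T @ err @ V.T             # d(loss)/dB, shape of B
        A -= lr * grad_A
        B -= lr * grad_B
    return A, B, losses

# Toy data: 6 users and 5 items, each assigned to one of 3 clusters (assumed).
n_clusters = 3
user_cluster = np.array([0, 0, 1, 1, 2, 2])
item_cluster = np.array([0, 1, 2, 0, 1])
U = np.eye(n_clusters)[user_cluster]              # 6 x 3, constant
V = np.eye(n_clusters)[item_cluster].T            # 3 x 5, constant

rng = np.random.default_rng(42)
R = rng.random((6, 5))                            # synthetic interaction strengths
mask = (rng.random((6, 5)) < 0.8).astype(float)   # 1.0 where an entry is observed

A, B, losses = train_hybrid_mf(R, mask, U, V)
R_hat = (U @ A) @ (B @ V)
print(losses[0], losses[-1])
```

Because only the small matrices A and B are trained, the number of parameters depends on the number of clusters and latent factors rather than on the number of users or items, which may help explain how a model can be retrained many times per day.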