Over the past year, the POLITICO team has developed a recommendation system for our users, which recommends not only news content to read but also news topics to subscribe to. This talk will discuss our development path, including dead-ends and performance trade-offs. In the end, the team produced a system based on search technology (in our case, Elasticsearch) and refined by machine learning techniques to achieve a balance between personalization and serendipity.
2. Business Overview (5 slides)
Business Case (3 slides)
Evaluation (5 slides)
Prototype (6 slides)
Lessons Learned (3 slides)
Production System (7 slides)
4. Core Site Subscription Site
Oregon judge says he’ll block Trump’s abortion rule
Pelosi, Schumer to meet with Trump on infrastructure next week
Trump met with Twitter CEO amid bias complaints
Bob Corker: Primary challenger for Trump would be ‘good thing for our country’
FERC denies groups’ legal fees in pipeline challenge
House Democrats say Wheeler left biofuels client off disclosure
Court sides with EPA in ozone region expansion fight
Virginia uranium case may set nuclear precedent
11. We want to recommend stories
• Because a user may have missed something of interest / importance
• Because a user may not have been aware of an interesting kind of news that we write about
12. (Figure: cluster analysis of ~2000 stories from 2018 by topic, with clusters labeled Defense, Agriculture, New York & New Jersey, Education, and Health Care, overlaid with the content read by a user)
In a case like this, we want to
• Recommend Health Care stories
• Occasionally suggest Defense and Education news
• Stay away from New Jersey
14. We evaluate our system to
• Figure out if the current version of the system is doing better than the previous version
• Identify users for whom the system is doing particularly badly
Version 2
1. Senate Commerce taps Ireland data chief for privacy hearing
2. U.S. Navy drafting new guidelines for reporting UFOs
3. 5G fight among Trump advisers likely to continue
Version 1
1. 5G fight among Trump advisers likely to continue
2. Lockheed Martin net sales jump to $14.3B
3. U.S. tech companies see hope that talks could pry open China’s market
How do we determine if this is interesting?
15. Our situation
No direct feedback
Historically, our users have not interacted with rating systems on our site
Dynamic interests
Reads are driven by big events in the news cycle in addition to a user’s historical behavior
Recommendations strongly tied to time
A news organization publishes new content throughout the day, so we can’t compare a week’s worth of consumption with the recommendations made on Monday.
1 2 3 4 5
(insert popular Presidential tweet)
16. Short-term prediction of news reads
• Sum, over the news you read in the past 7 days, of a per-read score
• A read scores 10 – rank + 1 if the story was in your top 10 recommended news at the time of reading (so the score is discounted by how far down that top 10 list it appears), and 0 otherwise
• Normalized by total possible score
• 100 * (score / (10 * # read))
Stories read this week | Rec. Rank | Score
5G fight among Trump advisers likely to continue | 2 | 9
U.S. tech companies see hope that talks could pry open China’s market | 7 | 4
Senate Commerce taps Ireland data chief for privacy hearing | - | 0
U.S. Navy drafting new guidelines for reporting UFOs | - | 0
5G fight among Trump advisers likely to continue | 3 | 8
Evaluation Score 42
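The scoring rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the team's implementation: the function names and the convention of passing `None` for a read that was not in the top 10 are assumptions made for the example.

```python
from typing import Optional, Sequence

TOP_N = 10  # size of the recommendation list, as on the slide


def read_score(rank: Optional[int]) -> int:
    """Score one read: 10 - rank + 1 if it was recommended, else 0."""
    if rank is None or not 1 <= rank <= TOP_N:
        return 0
    return TOP_N - rank + 1


def evaluation_score(ranks: Sequence[Optional[int]]) -> float:
    """Normalize the summed read scores: 100 * (score / (10 * # read))."""
    if not ranks:
        return 0.0  # assumption: a user with no reads scores 0
    total = sum(read_score(rank) for rank in ranks)
    return 100 * total / (TOP_N * len(ranks))


# Recommendation ranks of the five reads in the table above.
print(evaluation_score([2, 7, None, None, 3]))  # 42.0
```

The per-read scores are 9 + 4 + 0 + 0 + 8 = 21 out of a possible 50, which gives the 42 shown on the slide.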
17. A very low score means our user could be missing news they’ve demonstrated an interest in
Stories read this week | Rec. Rank | Score
Northrop Grumman's sales up 22 percent | - | 0
General Dynamics reports 23 percent jump in revenue | - | 0
Lockheed Martin net sales jump to $14.3 | 3 | 8
5G fight among Trump advisers likely to continue | - | 0
Evaluation Score 20
Recommendations
Inhofe ‘no longer concerned’ about border deployments harming readiness
Supreme Court divided on citizenship question for census
Budget reform gets a reboot as talks on a broader deal begin
18. A very high score indicates our user could be missing news they didn’t know they were interested in
Stories read this week | Rec. Rank | Score
Northrop Grumman's sales up 22 percent | 1 | 10
General Dynamics reports 23 percent jump in revenue | 2 | 9
Lockheed Martin net sales jump to $14.3 | 1 | 10
5G fight among Trump advisers likely to continue | 3 | 8
Evaluation Score 92.5
20. We started with two streams of information
• Published News Documents
• Content Reads (web clicks, email opens)
(Diagram: CMS → Annotation Pipeline → Elasticsearch; User Activity → Transform Pipeline → Redshift; ??? → Recommendations)
21. Content Filtering
You read certain kinds of news
We think you’d like to keep reading those kinds of news
Based on annotations of news that we do in a separate system
• People
• Organizations & Committees
• Taxonomic topics
We do this because the market for old news is very small.
Thus we need to deal with kinds of news.
Cluster Model
22. Cluster Model Training
(Diagram: Elasticsearch (Content id, tags) → Apache Spark cluster maker → Cluster Model)
• K-means clustering
• Normal metrics to choose K
• Used Jaccard distances based on Content Tags
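For readers unfamiliar with the Jaccard distance mentioned above, here is a minimal sketch of how it compares two stories by their tag sets. It shows only the distance itself, not the Spark clustering job; the function name and the example tags are hypothetical.

```python
def jaccard_distance(tags_a, tags_b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of content tags."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two untagged stories count as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical tags: one shared tag out of three distinct tags overall.
story_1 = {"Defense", "Budget"}
story_2 = {"Defense", "Navy"}
print(jaccard_distance(story_1, story_2))
```

Stories with identical tags are at distance 0 and stories with no tags in common are at distance 1, so the measure depends only on tag overlap and not on how many tags a story carries.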
23. Collaborative Filtering
There are people who read the kind of stuff that you do
We think you’d like to read the stuff they’ve been reading
People who read math books like to color turtles.
We see you’ve been reading a bit of math lately…
Recommendation Model
24. Recommendation Model Training
(Diagram of the training pipeline, run on Apache Spark)
Inputs
• Elasticsearch: Content id, tags
• Redshift: Visitor id, Content id, timestamp
Steps
• cluster (using the Cluster Model): Content id, Cluster id
• join: Visitor id, Cluster id, timestamp
• aggregate: Visitor id, Cluster id, # views
• Collaborative filtering: Visitor id, Cluster 0 preference, Cluster 1 preference, …, Cluster N preference
Output
• Recommendation Model
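The join and aggregate steps of this pipeline can be illustrated in plain Python. This is only a sketch of the data shapes involved: the real job ran on Apache Spark, and the function name, the tuple layout, and the sample rows below are assumptions made for the example.

```python
from collections import Counter


def views_per_cluster(reads, content_cluster):
    """Join reads to clusters, then count views per (visitor, cluster).

    reads: iterable of (visitor_id, content_id, timestamp)
    content_cluster: dict mapping content_id -> cluster_id
    """
    counts = Counter()
    for visitor_id, content_id, _timestamp in reads:
        cluster_id = content_cluster.get(content_id)
        if cluster_id is None:
            continue  # assumption: reads of unclustered content are dropped
        counts[(visitor_id, cluster_id)] += 1
    return counts


# Hypothetical sample rows.
reads = [
    ("A", "id-1", "2019-04-25 13:23:47"),
    ("A", "id-2", "2019-04-25 13:38:10"),
    ("B", "id-1", "2019-04-25 14:12:57"),
    ("A", "id-3", "2019-04-25 15:32:54"),
]
content_cluster = {"id-1": 0, "id-2": 0, "id-3": 1}
print(views_per_cluster(reads, content_cluster))
```

The resulting (visitor, cluster, # views) triples are the kind of input a collaborative filtering step would consume to estimate per-cluster preferences.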
27. Performance was good
Able to train a model in a few hours
Evaluation scores were decent
Iteration was hard
We couldn’t give a good explanation for why a recommendation was made
Improving the model felt like guesswork
The system was rather complex
Lots of moving parts
28. The real world intervened
Two months later, our new search system was humming along in production
That gave us time to think about recommendations…
29. We got together and figured out how we’d want to explain/defend a recommendation:
Similar to what you’ve (recently) read?
Something that a lot of people read?
Something that a lot of subscribers read?
Something that a lot of people like you read?
Something that a lot of your colleagues read?
This made it sound like a search problem…
(ironic picture of people getting excited in a meeting)
32. General Reads Search
What is popular amongst all of our readers?
Transform
• We roll up reads by the hour
Search
• All reads within the last 2 days
• Sum aggregation of # reads, grouped by content id
Notes
• Very fast
• Relatively small data footprint
Data Used
Date | Content | # reads
2019-04-25 13:00 | id-1 | 20,000
2019-04-25 13:00 | id-2 | 15,000
2019-04-25 14:00 | id-1 | 3,000
2019-04-25 15:00 | id-2 | 40,000
2019-04-25 15:00 | id-3 | 25,000
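As a rough in-memory equivalent of this search, the sketch below sums the hourly rollups per content id within a two-day window. The production version runs as an Elasticsearch aggregation; the function name and the choice of "now" are assumptions for illustration, while the rows are the ones from the slide.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hourly rollup rows from the slide: (hour, content id, # reads).
ROLLUPS = [
    ("2019-04-25 13:00", "id-1", 20000),
    ("2019-04-25 13:00", "id-2", 15000),
    ("2019-04-25 14:00", "id-1", 3000),
    ("2019-04-25 15:00", "id-2", 40000),
    ("2019-04-25 15:00", "id-3", 25000),
]


def general_reads(rollups, now, window=timedelta(days=2)):
    """Sum reads per content id over the window, most read first."""
    totals = defaultdict(int)
    for hour, content_id, reads in rollups:
        ts = datetime.strptime(hour, "%Y-%m-%d %H:%M")
        if now - window <= ts <= now:
            totals[content_id] += reads
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# "now" is an assumed query time shortly after the last rollup.
print(general_reads(ROLLUPS, now=datetime(2019, 4, 25, 16, 0)))
# [('id-2', 55000), ('id-3', 25000), ('id-1', 23000)]
```

Because reads are pre-aggregated by the hour, the search touches one row per story per hour rather than one row per read, which is consistent with the small data footprint noted on the slide.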
33. Subscriber Reads Search
What is popular amongst our subscribers?
Search
• All reads within the last 2 days
• Count aggregation on content id
Notes
• Very fast
• Larger data footprint
• We determined it’s tolerable for 50k – 100k subscribers
• More than that would call for scaling up Elasticsearch
Data Used
Date | User | Content
2019-04-25 13:23:47 | A | id-1
2019-04-25 13:38:10 | A | id-2
2019-04-25 14:12:57 | B | id-1
2019-04-25 15:00:07 | C | id-2
2019-04-25 15:32:54 | A | id-3
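One way such a search could be phrased is as an Elasticsearch request body with a range query and a terms aggregation. The sketch below only builds the body as a Python dict; the field names `date` and `content_id`, the aggregation name, and the function name are assumptions, since the slides do not show the actual mapping or query.

```python
def subscriber_reads_query(top_n=10, since="now-2d"):
    """Build a request body that counts reads per content id.

    Field names ("date", "content_id") are hypothetical.
    """
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {"range": {"date": {"gte": since}}},
        "aggs": {
            "popular_content": {
                # Each bucket's doc_count is the number of reads of that story.
                "terms": {"field": "content_id", "size": top_n}
            }
        },
    }


body = subscriber_reads_query(top_n=5)
print(body)
```

A terms aggregation returns buckets ordered by document count, highest first, by default, so the top buckets are the most-read stories in the window.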
34. Account Reads Search
What is popular amongst people you work with?
Search
• All reads within the last 2 days
• Term query to restrict to user’s account
• Count aggregation on content id
Notes
• Very fast
• Introduces some serendipity
Data Used
Date | User | Content
2019-04-25 13:23:47 | A | id-1
2019-04-25 13:38:10 | A | id-2
2019-04-25 14:12:57 | B | id-1
2019-04-25 15:00:07 | C | id-2
2019-04-25 15:32:54 | A | id-3
Date | User | Account
2019-04-25 | A | X
2019-04-25 | B | X
2019-04-25 | C | Y
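The account restriction can be sketched in memory using the two tables from the slide. This is an illustration rather than the production query: the function name is hypothetical, the two-day window is omitted because every sample row falls inside it, and whether the user's own reads are counted is an assumption (they are counted here).

```python
from collections import Counter

# Rows from the slide: (timestamp, user, content id).
READS = [
    ("2019-04-25 13:23:47", "A", "id-1"),
    ("2019-04-25 13:38:10", "A", "id-2"),
    ("2019-04-25 14:12:57", "B", "id-1"),
    ("2019-04-25 15:00:07", "C", "id-2"),
    ("2019-04-25 15:32:54", "A", "id-3"),
]
# User-to-account mapping from the slide.
ACCOUNTS = {"A": "X", "B": "X", "C": "Y"}


def account_reads(reads, accounts, user):
    """Count reads per content id among users in the same account."""
    account = accounts[user]
    counts = Counter(
        content_id
        for _timestamp, reader, content_id in reads
        if accounts.get(reader) == account
    )
    return counts.most_common()


print(account_reads(READS, ACCOUNTS, "A"))
# [('id-1', 2), ('id-2', 1), ('id-3', 1)]
```

For user A, only reads by A and B (both in account X) are counted, so C's read of id-2 does not contribute.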
35. Community Reads Search
What are people like you reading?
Search
A series of 3 queries per request
Bucket 1: the 150 most recent news you’ve read in the last 7 days
Bucket 2: the 50 users who have read news in Bucket 1, ranked by clicks/opens
Bucket 3: the 150 most recent news that users in Bucket 2 have read, ranked by how many of them clicked/opened each
Notes
Surprisingly fast
Introduces some serendipity
Data Used
Date | User | Content
2019-04-25 13:23:47 | A | id-1
2019-04-25 13:38:10 | A | id-2
2019-04-25 14:12:57 | B | id-1
2019-04-25 15:00:07 | C | id-2
2019-04-25 15:32:54 | A | id-3
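The three buckets can be sketched as three passes over an in-memory read log. This is a simplified illustration under stated assumptions, not the Elasticsearch queries themselves: the function name, the `since` cutoff parameter, and the last two sample rows are hypothetical, and stories the user has already read are not filtered out of the result.

```python
from collections import Counter


def community_reads(reads, user, since, max_stories=150, max_users=50):
    """reads: iterable of (timestamp, user, content_id), ISO-style timestamps."""
    recent = sorted((r for r in reads if r[0] >= since), reverse=True)  # newest first

    # Bucket 1: the most recent distinct stories this user has read.
    mine = []
    for _ts, reader, content_id in recent:
        if reader == user and content_id not in mine:
            mine.append(content_id)
    bucket1 = set(mine[:max_stories])

    # Bucket 2: other users who read Bucket 1 stories, ranked by clicks/opens.
    overlap = Counter(
        reader for _ts, reader, content_id in recent
        if reader != user and content_id in bucket1
    )
    bucket2 = {reader for reader, _count in overlap.most_common(max_users)}

    # Bucket 3: the most recent distinct stories read by Bucket 2 users,
    # ranked by how many of those users read each one.
    readers_by_story = {}
    for _ts, reader, content_id in recent:
        if reader in bucket2:
            readers_by_story.setdefault(content_id, set()).add(reader)
    candidates = list(readers_by_story)[:max_stories]
    return sorted(candidates, key=lambda c: len(readers_by_story[c]), reverse=True)


reads = [
    ("2019-04-25 13:23:47", "A", "id-1"),
    ("2019-04-25 13:38:10", "A", "id-2"),
    ("2019-04-25 14:12:57", "B", "id-1"),
    ("2019-04-25 15:00:07", "C", "id-2"),
    ("2019-04-25 15:32:54", "A", "id-3"),
    ("2019-04-25 16:05:00", "B", "id-9"),  # hypothetical row, not on the slide
    ("2019-04-25 16:10:00", "C", "id-9"),  # hypothetical row, not on the slide
]
print(community_reads(reads, "A", since="2019-04-18 00:00:00"))
# ['id-9', 'id-2', 'id-1']
```

In this example user A has never read id-9, but both users who overlap with A have, so it ranks first. That is the kind of serendipity the slide refers to.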
36. Similar News Search
What kind of stuff do you usually read?
Search
All news you’ve read in the past 30 days
Count aggregation on annotations
News with at least one annotation in the 30-day bucket
• Boosted by the frequency of the annotations in the user’s reads
Notes
Very fast
Addresses the cold-start problem
But: loses correlations between annotations
A user may like articles about Corn & Boats
Content | Annotations
id-5 | Airplanes
id-6 | Boats, Corn
id-7 | Tables, Walls, Corn
Data Used
Date | User | Content | Annotations
2019-04-25 13:23:47 | A | id-1 | Water, Corn
2019-04-25 13:38:10 | A | id-2 | Corn, Boats
2019-04-25 14:12:57 | A | id-1 | Water, Corn
2019-04-25 15:00:07 | A | id-3 | Tables, Walls
2019-04-25 15:32:54 | A | id-4 | Chairs, Boats
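The annotation-frequency idea can be worked through with the sample data from the slide. The sketch below is an approximation: it scores a candidate by summing the read counts of its annotations, which is an assumed stand-in for whatever boosting Elasticsearch actually applies, and the function name is hypothetical.

```python
from collections import Counter

# Annotations of the five reads by user A, from the slide.
READ_ANNOTATIONS = [
    ["Water", "Corn"],
    ["Corn", "Boats"],
    ["Water", "Corn"],
    ["Tables", "Walls"],
    ["Chairs", "Boats"],
]
# Candidate stories and their annotations, from the slide.
CANDIDATES = {
    "id-5": ["Airplanes"],
    "id-6": ["Boats", "Corn"],
    "id-7": ["Tables", "Walls", "Corn"],
}


def similar_news(read_annotations, candidates):
    """Keep candidates sharing an annotation with the reads, highest score first."""
    frequency = Counter(tag for tags in read_annotations for tag in tags)
    scored = []
    for content_id, tags in candidates.items():
        score = sum(frequency[tag] for tag in tags)  # unseen tags count as 0
        if score > 0:
            scored.append((content_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


print(similar_news(READ_ANNOTATIONS, CANDIDATES))
# [('id-6', 5), ('id-7', 5)]
```

Under this simplified scoring, id-6 (Boats, Corn) and id-7 (Tables, Walls, Corn) tie at 5, even though the user has read Corn and Boats together and never Corn with Tables or Walls. That is one way to see the lost correlation between annotations noted on the slide.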
37. Things we’re happy about
• The system has relatively few moving parts
• We can explain our recommendations (and troubleshoot them)
• Recommendations are available for newly published content immediately
• Our scaling is mostly managed by scaling Elasticsearch
• It’s very easy to add additional constraints
• Ex: If you don’t subscribe to the Energy vertical, we don’t want any of its content affecting your recommendations
38. A few challenges (opportunities) we’ve identified
• It’s weird to use something so different from the standard architecture
• That’s a big reason we want your feedback
• We want to revisit the Similar News Search
• It seems like we should honor the correlations between annotations
• Each recommendation search/component should not be equally weighted
• Some are likely to be more pertinent to some users
• There are obvious dependencies
• If something is generally popular, it’s more likely to be popular for people in your account