Ryan Kohl
POLITICO
 Business Overview (5 slides)
 Business Case (3 slides)
 Evaluation (5 slides)
 Prototype (6 slides)
 Lessons Learned (3 slides)
 Production System (7 slides)
Core Site Subscription Site
Oregon judge says he’ll block Trump’s abortion rule
Pelosi, Schumer to meet with Trump on infrastructure next week
Trump met with Twitter CEO amid bias complaints
Bob Corker: Primary challenger for Trump would be ‘good thing for our country’
FERC denies groups’ legal fees in pipeline challenge
House Democrats say Wheeler left biofuels client off disclosure
Court sides with EPA in ozone region expansion fight
Virginia uranium case may set nuclear precedent
Core Site
Subscriber Site
Agriculture
Budget & Appropriations
Campaigns
Cybersecurity
Defense
Education
eHealth
Employment & Immigration
Energy
Financial Services
Health Care
Tax
Technology
Transportation
California
Florida
New Jersey
New York
Canada
Web Reads ~25% Email Reads ~75%
Most users want to customize
the emails they receive
They do this by selecting
• Topics
• People
• Keywords
of interest
Sometimes news happens
that
• Is not the kind of thing
you usually care about
• But you find very
interesting
We want to recommend stories
• Because a user may have missed
something of interest / importance
• Because a user may not have been
aware of an interesting kind of news
that we write about
[Figure: topic clusters labeled Defense, Agriculture, New York & New Jersey, Education, and Health Care, with a marker for "Content read by a user"]
In a case like this, we
want to
• Recommend Health
Care stories
• Occasionally suggest
Defense and Education
news
• Stay away from New
Jersey
cluster analysis of ~2000
stories from 2018 by topic
 We evaluate our system to
 Figure out if the current version of the system is doing better than the previous version
 Identify users for whom the system is doing particularly badly
Version 2
1. Senate Commerce taps Ireland data chief
for privacy hearing
2. U.S. Navy drafting new guidelines for
reporting UFOs
3. 5G fight among Trump advisers likely to
continue
Version 1
1. 5G fight among Trump advisers likely to
continue
2. Lockheed Martin net sales jump to $14.3B
3. U.S. tech companies see hope that talks
could pry open China’s market
How do we determine if this
is interesting?
Our situation
 No direct feedback
 historically, our users have not interacted with
rating systems on our site
 Dynamic interests
 Reads are driven by big events in the news cycle in
addition to a user’s historical behavior
 Recommendations strongly tied to time
 A news organization publishes new content
throughout the day, so we can’t compare a week’s
worth of consumption with the recommendations
made on Monday.
(insert popular Presidential tweet)
Short-term prediction of news reads
• Sum, over each story
• that you read in the past 7 days
• that was in your top 10 recommended stories at the time of reading
• of a credit discounted by how far down that top 10 list it appeared (10 – rank + 1)
• Normalized by total possible score
• 100 * (score / (10 * # read))
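The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the production code: the function names are hypothetical, a read that was not in the top 10 is represented as `None`, and returning 0 for a user with no reads is an assumption the slides do not address.

```python
from typing import Optional, Sequence

TOP_N = 10  # size of the recommendation list considered


def read_credit(rank: Optional[int]) -> int:
    """Credit for one read: 10 - rank + 1 if it was in the top 10, else 0."""
    if rank is None or not 1 <= rank <= TOP_N:
        return 0
    return TOP_N - rank + 1


def evaluation_score(ranks: Sequence[Optional[int]]) -> float:
    """Score a week of reads; each entry is the story's rank at read time, or None."""
    if not ranks:
        return 0.0  # assumption: a user with no reads scores 0
    total = sum(read_credit(r) for r in ranks)
    return 100 * total / (TOP_N * len(ranks))


# Five reads: ranked 2nd, ranked 7th, two not recommended, ranked 3rd.
print(evaluation_score([2, 7, None, None, 3]))  # 42.0
```

The same function reproduces the other two example weeks in the deck: one hit at rank 3 out of four reads gives 20, and ranks 1, 2, 1, 3 give 92.5.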
Stories read this week | Rec. Rank | Score
5G fight among Trump advisers likely to continue | 2 | 9
U.S. tech companies see hope that talks could pry open China’s market | 7 | 4
Senate Commerce taps Ireland data chief for privacy hearing | - | 0
U.S. Navy drafting new guidelines for reporting UFOs | - | 0
5G fight among Trump advisers likely to continue | 3 | 8
Evaluation Score: 42
A very low score means our user could be
missing news they’ve demonstrated an
interest in
Stories read this week | Rec. Rank | Score
Northrop Grumman's sales up 22 percent | - | 0
General Dynamics reports 23 percent jump in revenue | - | 0
Lockheed Martin net sales jump to $14.3B | 3 | 8
5G fight among Trump advisers likely to continue | - | 0
Evaluation Score: 20
Recommendations
• Inhofe ‘no longer concerned’ about border deployments harming readiness
• Supreme Court divided on citizenship question for census
• Budget reform gets a reboot as talks on a broader deal begin
A very high score indicates our user could be
missing news they didn’t know they were
interested in
Stories read this week | Rec. Rank | Score
Northrop Grumman's sales up 22 percent | 1 | 10
General Dynamics reports 23 percent jump in revenue | 2 | 9
Lockheed Martin net sales jump to $14.3B | 1 | 10
5G fight among Trump advisers likely to continue | 3 | 8
Evaluation Score: 92.5
We started with two streams of
information
• Published News Documents
• Content Reads (web clicks, email opens)
[Architecture diagram: CMS, Annotation Pipeline, User Activity, Transform Pipeline, Redshift, Elasticsearch, and an undetermined "???" component producing Recommendations]
Content Filtering
 You read certain kinds of news
 We think you’d like to keep reading those kinds of news
 Based on annotations of news that we do in a separate
system
 People
 Organizations & Committees
 Taxonomic topics
 We do this because the market for old news is very small.
 Thus we need to deal with kinds of news
Cluster Model Training
[Diagram: Elasticsearch (content id, tags) → Apache Spark "cluster maker" → Cluster Model]
• K-means clustering
• Normal metrics to choose K
• Used Jaccard distances based on Content Tags
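For readers unfamiliar with the measure, here is a minimal sketch of a Jaccard distance between two stories' tag sets. It shows only the distance itself; the slides do not say how it was combined with K-means, and the tag values below are hypothetical.

```python
from typing import AbstractSet


def jaccard_distance(a: AbstractSet[str], b: AbstractSet[str]) -> float:
    """1 - |a ∩ b| / |a ∪ b|; two empty tag sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical tag sets, not taken from the talk.
story_a = {"5G", "Trump", "FCC"}
story_b = {"5G", "FCC", "Huawei"}
story_c = {"Corn", "Tariffs"}

print(jaccard_distance(story_a, story_b))  # 0.5 (2 shared tags out of 4 distinct)
print(jaccard_distance(story_a, story_c))  # 1.0 (no shared tags)
```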
Collaborative Filtering
 There are people who read the kind of stuff that you do
 We think you’d like to read the stuff they’ve been reading
People who read math
books like to color
turtles.
We see you’ve been
reading a bit of math
lately…
Recommendation Model Training
[Diagram, run on Apache Spark:
 Elasticsearch (content id, tags) → cluster with the Cluster Model → (content id, cluster id)
 Redshift (visitor id, content id, timestamp) → join with the above → (visitor id, cluster id, timestamp)
 aggregate → (visitor id, cluster id, # views)
 collaborative filtering → Recommendation Model (visitor id, cluster 0 preference, cluster 1 preference, …, cluster N preference)]
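The join and aggregate steps that feed collaborative filtering can be illustrated as follows. The prototype ran on Apache Spark; this sketch uses plain Python for readability, with hypothetical sample rows, and dropping reads of unclustered content is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Read = Tuple[str, str, str]  # (visitor id, content id, timestamp)


def cluster_views(reads: Iterable[Read], content_cluster: Dict[str, int]) -> Counter:
    """Join reads to clusters on content id, then count views per (visitor, cluster)."""
    views: Counter = Counter()
    for visitor_id, content_id, _timestamp in reads:
        cluster_id = content_cluster.get(content_id)
        if cluster_id is None:
            continue  # assumption: reads of unclustered content are dropped
        views[(visitor_id, cluster_id)] += 1
    return views


# Hypothetical sample rows.
content_cluster = {"id-1": 0, "id-2": 0, "id-3": 1}
reads = [
    ("v1", "id-1", "2019-04-25T13:23:47"),
    ("v1", "id-2", "2019-04-25T13:38:10"),
    ("v1", "id-3", "2019-04-25T15:32:54"),
    ("v2", "id-3", "2019-04-25T15:40:00"),
]
views = cluster_views(reads, content_cluster)
print(views[("v1", 0)], views[("v1", 1)], views[("v2", 1)])  # 2 1 1
```

The resulting (visitor id, cluster id, # views) counts are the kind of input a collaborative filtering step would turn into per-cluster preferences.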
Runtime System
[Diagram: the CMS, Annotation Pipeline, Cluster Model, Recommendation Model, and Recommendation App. Data passed between them: (content id, tags); (content id, cluster id); (visitor id, cluster 0 preference, cluster 1 preference, …, cluster N preference); (visitor id, content id)]
 Performance was good
 Able to train a model in a few hours
 Evaluation scores were decent
 Iteration was hard
 We couldn’t give a good explanation for why a recommendation was made
 Improving the model felt like guesswork
 The system was rather complex
 Lots of moving parts
The real world intervened
Two months later, our new
search system was
humming along in
production
That gave us time to think
about recommendations…
We got together and figured out how we’d want to
explain/defend a recommendation:
 Similar to what you’ve (recently) read?
 Something that a lot of people read?
 Something that a lot of subscribers read?
 Something that a lot of people like you read?
 Something that a lot of your colleagues read?
This made it sound like a search problem…
(ironic picture of people getting
excited in a meeting)
[Architecture diagram: CMS, Annotation Pipeline, User Activity, Transform Pipeline, Redshift, Elasticsearch (shown twice), and the Recommendation App]
We had no idea if
these searches would
• Give good results
• Be fast enough
General Reads Search
What is popular amongst all of our readers?
Transform
• we roll up reads by the hour
Search
• All reads within the last 2 days
• Sum aggregation on content id over # reads
Notes
• Very fast
• Relatively small data footprint
Date Content # reads
2019-04-25 13:00 id-1 20,000
2019-04-25 13:00 id-2 15,000
2019-04-25 14:00 id-1 3,000
2019-04-25 15:00 id-2 40,000
2019-04-25 15:00 id-3 25,000
Data Used
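One way to express this search is a range filter plus a terms aggregation with a sum sub-aggregation in the Elasticsearch query DSL. The sketch below only builds the request body; the field names `date`, `content_id`, and `reads` and the function name are hypothetical, since the slides do not show the index mapping.

```python
import json
from typing import Any, Dict


def general_reads_query(window: str = "now-2d", size: int = 10) -> Dict[str, Any]:
    """Build a search body: keep recent hourly rollups, sum reads per content id."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {"range": {"date": {"gte": window}}},
        "aggs": {
            "by_content": {
                "terms": {
                    "field": "content_id",
                    "size": size,
                    # rank buckets by the summed reads, highest first
                    "order": {"total_reads": "desc"},
                },
                "aggs": {"total_reads": {"sum": {"field": "reads"}}},
            }
        },
    }


body = general_reads_query()
print(json.dumps(body, indent=2))
```

Applied to the five sample rows above, and assuming all of them fall inside the window, the sums would be 55,000 for id-2, 25,000 for id-3, and 23,000 for id-1.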
Subscriber Reads Search
What is popular amongst our subscribers?
Search
• All reads within the last 2 days
• Count aggregation on content id
Notes
• Very fast
• Larger data footprint
• We determined it’s tolerable for 50k – 100k subscribers
• More than that would call for scaling up Elasticsearch
Date User Content
2019-04-25 13:23:47 A id-1
2019-04-25 13:38:10 A id-2
2019-04-25 14:12:57 B id-1
2019-04-25 15:00:07 C id-2
2019-04-25 15:32:54 A id-3
Data Used
Account Reads Search
What is popular amongst people you work with?
Search
• All reads within the last 2 days
• Term query to restrict to user’s account
• Count aggregation on content id
Notes
• Very fast
• Introduces some serendipity
Data Used
Date User Content
2019-04-25 13:23:47 A id-1
2019-04-25 13:38:10 A id-2
2019-04-25 14:12:57 B id-1
2019-04-25 15:00:07 C id-2
2019-04-25 15:32:54 A id-3
Date User Account
2019-04-25 A X
2019-04-25 B X
2019-04-25 C Y
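The effect of the account restriction can be shown in memory with the sample rows above. This is a sketch rather than the Elasticsearch query itself: the function name is hypothetical, and the 2-day time filter is omitted because every sample row falls on the same day.

```python
from collections import Counter
from typing import Dict, List, Tuple

Read = Tuple[str, str, str]  # (timestamp, user, content id)


def account_reads(reads: List[Read], user_account: Dict[str, str],
                  user: str) -> List[Tuple[str, int]]:
    """Count reads per content id, restricted to users in the same account as `user`."""
    account = user_account[user]
    counts = Counter(
        content_id
        for _ts, reader, content_id in reads
        if user_account.get(reader) == account
    )
    return counts.most_common()


reads = [
    ("2019-04-25 13:23:47", "A", "id-1"),
    ("2019-04-25 13:38:10", "A", "id-2"),
    ("2019-04-25 14:12:57", "B", "id-1"),
    ("2019-04-25 15:00:07", "C", "id-2"),
    ("2019-04-25 15:32:54", "A", "id-3"),
]
user_account = {"A": "X", "B": "X", "C": "Y"}

print(account_reads(reads, user_account, "B"))  # [('id-1', 2), ('id-2', 1), ('id-3', 1)]
print(account_reads(reads, user_account, "C"))  # [('id-2', 1)]
```

User B sees id-2 and id-3 only because colleague A read them, which is where the serendipity comes from.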
Community Reads Search
What are people like you reading?
Search
A series of 3 queries per request
 Bucket 1: the 150 news stories you’ve read most recently in the last 7 days
 Bucket 2: the 50 users who have read news in Bucket 1, ranked by clicks/opens
 Bucket 3: the 150 most recent news stories that users in Bucket 2 have read, ranked by how many of them clicked/opened each
Notes
 Surprisingly fast
 Introduces some serendipity
Date User Content
2019-04-25 13:23:47 A id-1
2019-04-25 13:38:10 A id-2
2019-04-25 14:12:57 B id-1
2019-04-25 15:00:07 C id-2
2019-04-25 15:32:54 A id-3
Data Used
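The three buckets can be traced in memory as follows. This is a sketch under stated assumptions, not the production queries: the function name and sample rows are hypothetical, the requesting user is excluded from Bucket 2, and stories the user has already read are not filtered out, since the slides do not specify either point.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Read = Tuple[str, str, str]  # (ISO timestamp, user, content id)


def community_reads(reads: List[Read], user: str, since: str,
                    n_news: int = 150, n_users: int = 50) -> List[Tuple[str, int]]:
    newest_first = sorted(reads, reverse=True)  # ISO timestamps sort chronologically

    # Bucket 1: the user's most recently read stories within the window.
    bucket1: List[str] = []
    for ts, reader, content in newest_first:
        if reader == user and ts >= since and content not in bucket1:
            bucket1.append(content)
    bucket1 = bucket1[:n_news]

    # Bucket 2: other users who read Bucket 1 stories, ranked by number of such reads.
    overlap = Counter(reader for _ts, reader, content in reads
                      if reader != user and content in bucket1)
    bucket2 = {reader for reader, _n in overlap.most_common(n_users)}

    # Bucket 3: the most recent stories read by Bucket 2 users,
    # ranked by how many distinct Bucket 2 users read each.
    recent: List[str] = []
    readers: Dict[str, Set[str]] = {}
    for _ts, reader, content in newest_first:
        if reader in bucket2:
            if content not in readers:
                recent.append(content)
                readers[content] = set()
            readers[content].add(reader)
    ranked = sorted(recent[:n_news], key=lambda c: len(readers[c]), reverse=True)
    return [(c, len(readers[c])) for c in ranked]


# Hypothetical sample rows.
reads = [
    ("2019-04-25T13:23:47", "A", "id-1"),
    ("2019-04-25T13:38:10", "A", "id-2"),
    ("2019-04-25T14:12:57", "B", "id-1"),
    ("2019-04-25T14:30:00", "B", "id-4"),
    ("2019-04-25T15:00:07", "C", "id-2"),
    ("2019-04-25T15:10:00", "C", "id-4"),
    ("2019-04-25T15:20:00", "C", "id-5"),
    ("2019-04-25T15:32:54", "D", "id-6"),
]
result = community_reads(reads, "A", since="2019-04-18T00:00:00")
print(result[0])  # ('id-4', 2)
```

Story id-4 ranks first because both users who overlap with A read it, while id-6 never appears because user D shares no reads with A.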
Similar News Search
What kind of stuff do you usually read?
Search
 All news you’ve read in the past 30 days
 Count aggregation on annotations
 News with at least one annotation in the 30-day bucket
• Boosted by the frequency of the annotations in the user’s reads
Notes
 Very fast
 Addresses the cold-start problem
 But: loses correlations between annotations
 A user may like articles about Corn & Boats together, which per-annotation counts cannot capture
Content Annotations
id-5 Airplanes
id-6 Boats, Corn
id-7 Tables, Walls, Corn
Data Used
Date User Content Annotations
2019-04-25 13:23:47 A id-1 Water, Corn
2019-04-25 13:38:10 A id-2 Corn, Boats
2019-04-25 14:12:57 A id-1 Water, Corn
2019-04-25 15:00:07 A id-3 Tables, Walls
2019-04-25 15:32:54 A id-4 Chairs, Boats
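The sample data above can be run through a small in-memory version of this search. It is a sketch only: the function name is hypothetical, and scoring a candidate by adding up the read frequencies of its annotations is an assumed stand-in for the boosting, which the slides do not spell out.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def similar_news(read_annotations: List[Set[str]],
                 candidates: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Score candidates by summing how often each of their annotations occurs in the reads."""
    freq = Counter(tag for tags in read_annotations for tag in tags)
    scores = {
        content_id: sum(freq[tag] for tag in tags)
        for content_id, tags in candidates.items()
        if any(tag in freq for tag in tags)  # at least one shared annotation
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# User A's five reads (id-1 was read twice) and the three candidate stories.
reads_a = [{"Water", "Corn"}, {"Corn", "Boats"}, {"Water", "Corn"},
           {"Tables", "Walls"}, {"Chairs", "Boats"}]
candidates = {"id-5": {"Airplanes"},
              "id-6": {"Boats", "Corn"},
              "id-7": {"Tables", "Walls", "Corn"}}

print(similar_news(reads_a, candidates))  # [('id-6', 5), ('id-7', 5)]
```

Under this assumed scoring, id-6 and id-7 tie at 5 even though only id-6 carries the Corn and Boats pairing that user A read together, which illustrates the lost correlation between annotations.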
Things we’re happy about
• The system has relatively few moving parts
• We can explain our recommendations (and troubleshoot them)
• Recommendations are available for newly published content immediately
• Our scaling is mostly managed by scaling Elasticsearch
• It’s very easy to add additional constraints
• Example: if you don’t subscribe to the Energy vertical, we don’t want any of its content affecting your recommendations
A few challenges / opportunities we’ve identified
• It’s weird to use something so different from the standard architecture
• That’s a big reason we want your feedback
• We want to revisit the Similar News Search
• It seems like we should honor the correlations between annotations
• Each recommendation search/component should not be equally weighted
• Some are likely to be more pertinent to some users
• There are obvious dependencies
• If something is generally popular, it’s more likely to be popular for people in your account
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl

Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 

Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl

  • 2. Business Overview (5 slides) • Business Case (3 slides) • Evaluation (5 slides) • Prototype (6 slides) • Lessons Learned (3 slides) • Production System (7 slides)
  • 3.
  • 4. Core Site
      • Oregon judge says he’ll block Trump’s abortion rule
      • Pelosi, Schumer to meet with Trump on infrastructure next week
      • Trump met with Twitter CEO amid bias complaints
      • Bob Corker: Primary challenger for Trump would be ‘good thing for our country’
      Subscription Site
      • FERC denies groups’ legal fees in pipeline challenge
      • House Democrats say Wheeler left biofuels client off disclosure
      • Court sides with EPA in ozone region expansion fight
      • Virginia uranium case may set nuclear precedent
  • 6. Agriculture • Budget & Appropriations • Campaigns • Cybersecurity • Defense • Education • eHealth • Employment & Immigration • Energy • Financial Services • Health Care • Tax • Technology • Transportation • California • Florida • New Jersey • New York • Canada
  • 7. Web Reads ~25% • Email Reads ~75%
  • 8. Most users want to customize the emails they receive. They do this by selecting topics, people, and keywords of interest.
  • 9.
  • 10. Sometimes news happens that is not the kind of thing you usually care about, but that you find very interesting.
  • 11. We want to recommend stories
      • Because a user may have missed something of interest / importance
      • Because a user may not have been aware of an interesting kind of news that we write about
  • 12. [Figure: cluster analysis of ~2000 stories from 2018 by topic. Cluster labels: Defense, Agriculture, New York & New Jersey, Education, Health Care. Overlay: content read by a user.] In a case like this, we want to
      • Recommend Health Care stories
      • Occasionally suggest Defense and Education news
      • Stay away from New Jersey
  • 13.
  • 14. We evaluate our system to
      • Figure out if the current version of the system is doing better than the previous version
      • Identify users for whom the system is doing particularly badly
      Version 1
      1. 5G fight among Trump advisers likely to continue
      2. Lockheed Martin net sales jump to $14.3B
      3. U.S. tech companies see hope that talks could pry open China’s market
      Version 2
      1. Senate Commerce taps Ireland data chief for privacy hearing
      2. U.S. Navy drafting new guidelines for reporting UFOs
      3. 5G fight among Trump advisers likely to continue
      How do we determine if this is interesting?
  • 15. Our situation
      • No direct feedback
        • Historically, our users have not interacted with rating systems on our site
      • Dynamic interests
        • Reads are driven by big events in the news cycle in addition to a user’s historical behavior
      • Recommendations strongly tied to time
        • A news organization publishes new content throughout the day, so we can’t compare a week’s worth of consumption with the recommendations made on Monday.
      1 2 3 4 5 (insert popular Presidential tweet)
  • 16. Short-term prediction of news reads
      • Sum of points for each news story over the past 7 days that
        • you read
        • was in your top 10 recommended news at the time of reading
        • with points discounted by how far down that top 10 list it appears (10 - rank + 1)
      • Normalized by total possible score
        • 100 * (score / (10 * # read))
      Stories read this week | Rec. Rank | Score
      5G fight among Trump advisers likely to continue | 2 | 9
      U.S. tech companies see hope that talks could pry open China’s market | 7 | 4
      Senate Commerce taps Ireland data chief for privacy hearing | - | 0
      U.S. Navy drafting new guidelines for reporting UFOs | - | 0
      5G fight among Trump advisers likely to continue | 3 | 8
      Evaluation Score: 42
  • 17. A very low score means our user could be missing news they’ve demonstrated an interest in
      Stories read this week | Rec. Rank | Score
      Northrop Grumman's sales up 22 percent | - | 0
      General Dynamics reports 23 percent jump in revenue | - | 0
      Lockheed Martin net sales jump to $14.3 | 3 | 8
      5G fight among Trump advisers likely to continue | - | 0
      Evaluation Score: 20
      Recommendations
      • Inhofe ‘no longer concerned’ about border deployments harming readiness
      • Supreme Court divided on citizenship question for census
      • Budget reform gets a reboot as talks on a broader deal begin
  • 18. A very high score indicates our user could be missing news they didn’t know they were interested in
      Stories read this week | Rec. Rank | Score
      Northrop Grumman's sales up 22 percent | 1 | 10
      General Dynamics reports 23 percent jump in revenue | 2 | 9
      Lockheed Martin net sales jump to $14.3 | 1 | 10
      5G fight among Trump advisers likely to continue | 3 | 8
      Evaluation Score: 92.5
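The scoring rule from slides 16 to 18 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `evaluation_score`, the use of `None` for "not recommended", and the zero score for a user with no reads are all assumptions.

```python
from typing import Optional, Sequence

TOP_N = 10  # size of the recommendation list, per the slides


def evaluation_score(ranks: Sequence[Optional[int]]) -> float:
    """Score one user's week of reads.

    `ranks` holds, for each story read, its rank in the user's top-10
    recommendations at the time of reading, or None if it was not recommended.
    """
    if not ranks:
        return 0.0  # assumption: a user with no reads scores 0
    points = sum(
        TOP_N - rank + 1
        for rank in ranks
        if rank is not None and 1 <= rank <= TOP_N
    )
    # Normalize by the best possible score: every read ranked first.
    return 100 * points / (TOP_N * len(ranks))


print(evaluation_score([2, 7, None, None, 3]))  # slide 16 example
print(evaluation_score([None, None, 3, None]))  # slide 17 example
print(evaluation_score([1, 2, 1, 3]))           # slide 18 example
```

The three calls reproduce the slide values of 42, 20, and 92.5.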
  • 19.
  • 20. We started with two streams of information
      • Published News Documents
      • Content Reads (web clicks, email opens)
      [Diagram labels: CMS, Annotation Pipeline, User Activity, Transform Pipeline, Redshift, Elasticsearch, ???, Recommendations]
  • 21. Content Filtering (Cluster Model)
      • You read certain kinds of news
      • We think you’d like to keep reading those kinds of news
      • Based on annotations of news that we do in a separate system
        • People
        • Organizations & Committees
        • Taxonomic topics
      • We do this because the market for old news is very small.
        • Thus we need to deal with kinds of news
  • 22. Cluster Model Training
      • K-means clustering
      • Normal metrics to choose K
      • Used Jaccard distances based on Content Tags
      [Diagram labels: Elasticsearch (Content id, tags), Apache Spark, Cluster maker, Cluster Model]
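The Jaccard distance between two stories' tag sets, as mentioned on slide 22, can be sketched as follows. The function name and the example tags are illustrative assumptions. The slides do not say how the distance was combined with K-means (textbook K-means assumes Euclidean means), so this shows only the distance itself.

```python
def jaccard_distance(tags_a: set, tags_b: set) -> float:
    """1 minus (shared tags / all tags across both stories)."""
    union = tags_a | tags_b
    if not union:
        return 0.0  # assumption: two untagged stories count as identical
    return 1.0 - len(tags_a & tags_b) / len(union)


# Hypothetical tag sets: one shared tag out of three distinct tags.
print(jaccard_distance({"Corn", "Boats"}, {"Corn", "Water"}))
```

Stories with identical tags are at distance 0, and stories with no tags in common are at distance 1.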
  • 23. Collaborative Filtering (Recommendation Model)
      • There are people who read the kind of stuff that you do
      • We think you’d like to read the stuff they’ve been reading
      People who read math books like to color turtles. We see you’ve been reading a bit of math lately…
  • 24. Recommendation Model Training
      [Diagram labels: Redshift, Elasticsearch, Cluster Model, join, aggregate, Collaborative filtering, Apache Spark, Recommendation Model]
      Records shown in the diagram:
      • Content id, tags
      • Visitor id, Content id, timestamp
      • Content id, Cluster id
      • Visitor id, Cluster id, timestamp
      • Visitor id, Cluster id, # views
      • Visitor id, Cluster 0 preference, Cluster 1 preference, …, Cluster N preference
  • 25. Runtime System
      [Diagram labels: CMS, Annotation Pipeline, Cluster Model, Recommendation Model, Recommendation App]
      Records shown in the diagram:
      • Visitor id, Cluster 0 preference, Cluster 1 preference, …, Cluster N preference
      • Content id, Cluster id
      • Content id, tags
      • Visitor id, Content id
  • 26.
  • 27. Lessons from the prototype
      • Performance was good
        • Able to train a model in a few hours
        • Evaluation scores were decent
      • Iteration was hard
        • We couldn’t give a good explanation for why a recommendation was made
        • Improving the model felt like guesswork
      • The system was rather complex
        • Lots of moving parts
  • 28. The real world intervened. Two months later, our new search system was humming along in production. That gave us time to think about recommendations…
  • 29. We got together and figured out how we’d want to explain/defend a recommendation:
      • Similar to what you’ve (recently) read?
      • Something that a lot of people read?
      • Something that a lot of subscribers read?
      • Something that a lot of people like you read?
      • Something that a lot of your colleagues read?
      This made it sound like a search problem… (ironic picture of people getting excited in a meeting)
  • 30.
  • 32. General Reads Search: What is popular amongst all of our readers?
      Transform
      • We roll up reads by the hour
      Search
      • All reads within the last 2 days
      • Sum aggregation on content id over # reads
      Notes
      • Very fast
      • Relatively small data footprint
      Data Used
      Date | Content | # reads
      2019-04-25 13:00 | id-1 | 20,000
      2019-04-25 13:00 | id-2 | 15,000
      2019-04-25 14:00 | id-1 | 3,000
      2019-04-25 15:00 | id-2 | 40,000
      2019-04-25 15:00 | id-3 | 25,000
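To make the roll-up and sum aggregation on slide 32 concrete, here is an in-memory Python sketch using the slide's own sample rows. In production this runs as an Elasticsearch aggregation. The function name `general_reads`, the tuple layout, and the chosen `now` value are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def general_reads(rollups, now, window=timedelta(days=2), top_n=10):
    """Sum hourly read counts per content id over the recent window."""
    cutoff = now - window
    totals = Counter()
    for hour, content_id, reads in rollups:
        if hour >= cutoff:
            totals[content_id] += reads
    return totals.most_common(top_n)


# Hourly roll-ups copied from the slide's "Data Used" table.
ROLLUPS = [
    (datetime(2019, 4, 25, 13), "id-1", 20_000),
    (datetime(2019, 4, 25, 13), "id-2", 15_000),
    (datetime(2019, 4, 25, 14), "id-1", 3_000),
    (datetime(2019, 4, 25, 15), "id-2", 40_000),
    (datetime(2019, 4, 25, 15), "id-3", 25_000),
]

print(general_reads(ROLLUPS, now=datetime(2019, 4, 25, 16)))
# [('id-2', 55000), ('id-3', 25000), ('id-1', 23000)]
```

Because reads are pre-aggregated by the hour, the number of rows scanned grows with the number of stories and hours rather than with the number of readers, which fits the slide's note about a small data footprint.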
  • 33. Subscriber Reads Search: What is popular amongst our subscribers?
      Search
      • All reads within the last 2 days
      • Count aggregation on content id
      Notes
      • Very fast
      • Larger data footprint
        • We determined it’s tolerable for 50k – 100k subscribers
        • More than that would call for scaling up Elasticsearch
      Data Used
      Date | User | Content
      2019-04-25 13:23:47 | A | id-1
      2019-04-25 13:38:10 | A | id-2
      2019-04-25 14:12:57 | B | id-1
      2019-04-25 15:00:07 | C | id-2
      2019-04-25 15:32:54 | A | id-3
  • 34. Account Reads Search: What is popular amongst people you work with?
      Search
      • All reads within the last 2 days
      • Term query to restrict to user’s account
      • Count aggregation on content id
      Notes
      • Very fast
      • Introduces some serendipity
      Data Used
      Date | User | Content
      2019-04-25 13:23:47 | A | id-1
      2019-04-25 13:38:10 | A | id-2
      2019-04-25 14:12:57 | B | id-1
      2019-04-25 15:00:07 | C | id-2
      2019-04-25 15:32:54 | A | id-3
      Date | User | Account
      2019-04-25 | A | X
      2019-04-25 | B | X
      2019-04-25 | C | Y
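The search on slide 34 might be expressed as an Elasticsearch request body along these lines. This is a hedged sketch, not the production query: the field names `date`, `account`, and `content` are hypothetical, and it assumes each read document already carries its account id (the slide shows the user-to-account mapping as a separate table, so how the two are combined is not stated).

```python
def account_reads_query(account_id: str, top_n: int = 10) -> dict:
    """Build a request body: recent reads for one account, counted per story."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    # All reads within the last 2 days.
                    {"range": {"date": {"gte": "now-2d"}}},
                    # Term query restricting reads to the user's account.
                    {"term": {"account": account_id}},
                ]
            }
        },
        "aggs": {
            # A terms aggregation counts documents per content id and
            # returns the buckets with the highest counts first.
            "by_content": {"terms": {"field": "content", "size": top_n}}
        },
    }


print(account_reads_query("X"))
```

Dropping the `term` filter gives the shape of the Subscriber Reads Search on the previous slide, which is one way to read the talk's point that adding constraints is easy.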
  • 35. Community Reads Search: What are people like you reading?
      Search: a series of 3 queries per request
      • Bucket 1: the 150 most recent news you’ve read in the last 7 days
      • Bucket 2: the 50 users who have read news in Bucket 1, ranked by clicks/opens
      • Bucket 3: the 150 most recent news that users in Bucket 2 have read, ranked by how many of them clicked/opened each
      Notes
      • Surprisingly fast
      • Introduces some serendipity
      Data Used
      Date | User | Content
      2019-04-25 13:23:47 | A | id-1
      2019-04-25 13:38:10 | A | id-2
      2019-04-25 14:12:57 | B | id-1
      2019-04-25 15:00:07 | C | id-2
      2019-04-25 15:32:54 | A | id-3
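The three-bucket sequence on slide 35 can be sketched in plain Python over a list of read events. This is an in-memory illustration of the logic only, where the real system issues three Elasticsearch queries. The function name, the tuple layout, and the choice to exclude the requesting user from Bucket 2 are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta


def community_reads(reads, user, now, n_news=150, n_users=50, days=7):
    """reads: iterable of (timestamp, user_id, content_id) tuples."""
    newest_first = sorted(reads, key=lambda r: r[0], reverse=True)
    cutoff = now - timedelta(days=days)

    # Bucket 1: the user's most recently read stories within the window.
    bucket1 = []
    for ts, u, c in newest_first:
        if u == user and ts >= cutoff and c not in bucket1:
            bucket1.append(c)
    bucket1 = set(bucket1[:n_news])

    # Bucket 2: other users who read those stories, ranked by read count.
    # Assumption: the requesting user is left out of their own community.
    overlap = Counter(u for _, u, c in newest_first if c in bucket1 and u != user)
    bucket2 = {u for u, _ in overlap.most_common(n_users)}

    # Bucket 3: the most recent stories read by Bucket 2 users,
    # ranked by how many distinct Bucket 2 users read each one.
    readers = {}
    for _, u, c in newest_first:
        if u in bucket2:
            if c not in readers and len(readers) == n_news:
                continue  # story limit reached; skip older, unseen stories
            readers.setdefault(c, set()).add(u)
    ranked = Counter({c: len(users) for c, users in readers.items()})
    return ranked.most_common()
```

The result may include stories the user has already read. Whether those are filtered out afterwards is not stated on the slide.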
  • 36. Similar News Search: What kind of stuff do you usually read?
      Search
      • All news you’ve read in the past 30 days
      • Count aggregation on annotations
      • News with at least one annotation in the 30-day bucket
        • Boosted by the frequency of the annotations in the user’s reads
      Notes
      • Very fast
      • Addresses the cold-start problem
      • But: loses correlations between annotations
        • A user may like articles about Corn & Boats
      Data Used
      Date | User | Content | Annotations
      2019-04-25 13:23:47 | A | id-1 | Water, Corn
      2019-04-25 13:38:10 | A | id-2 | Corn, Boats
      2019-04-25 14:12:57 | A | id-1 | Water, Corn
      2019-04-25 15:00:07 | A | id-3 | Tables, Walls
      2019-04-25 15:32:54 | A | id-4 | Chairs, Boats
      Content | Annotations
      id-5 | Airplanes
      id-6 | Boats, Corn
      id-7 | Tables, Walls, Corn
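The annotation-frequency boost on slide 36 can be illustrated with the slide's own sample data. This sketch assumes the boost is a plain sum of annotation counts, which the slide does not specify, and the function name `similar_news` is hypothetical.

```python
from collections import Counter


def similar_news(user_reads, candidates):
    """user_reads: one annotation list per read in the past 30 days.
    candidates: dict mapping content id to its annotation list."""
    # Count aggregation on annotations across the user's reads.
    freq = Counter(tag for tags in user_reads for tag in tags)
    scores = {}
    for content_id, tags in candidates.items():
        # Assumed boost: sum of the user's counts for each matching annotation.
        score = sum(freq[tag] for tag in set(tags))
        if score > 0:  # keep only news with at least one matching annotation
            scores[content_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Annotations of user A's five reads, copied from the slide.
USER_A_READS = [
    ["Water", "Corn"], ["Corn", "Boats"], ["Water", "Corn"],
    ["Tables", "Walls"], ["Chairs", "Boats"],
]
CANDIDATES = {
    "id-5": ["Airplanes"],
    "id-6": ["Boats", "Corn"],
    "id-7": ["Tables", "Walls", "Corn"],
}

print(similar_news(USER_A_READS, CANDIDATES))
# [('id-6', 5), ('id-7', 5)]
```

Under this assumed scoring, id-6 and id-7 tie at 5 even though only id-6 matches the Corn & Boats pairing the user actually read together, which is the lost-correlation caveat the slide raises.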
  • 37. Things we’re happy about • The system has relatively few moving parts • We can explain our recommendations (and troubleshoot them) • Recommendations are available for newly published content immediately • Our scaling is mostly managed by scaling Elasticsearch • It’s very easy to add additional constraints • Ex/ If you don’t subscribe to the Energy vertical, we don’t want any of its content affecting your recommendations
  • 38. A few challenges (opportunities) we’ve identified • It’s weird to use something so different from the standard architecture • That’s a big reason we want your feedback • We want to revisit the Similar News Search • It seems like we should honor the correlations between annotations • Each recommendation search/component should not be equally weighted • Some are likely to be more pertinent to some users • There are obvious dependencies • If something is generally popular, it’s more likely to be popular for people in your account