Making Reddit Search Relevant and Scalable - Anupama Joshi & Jerry Bao, Reddit
1. Making Reddit Search
Relevant and Scalable
Anupama Joshi
Senior Engineering Manager, Search
Jerry Bao
Senior Software Engineer, Search
2. Agenda
• What is Reddit?
• Search Architecture
• Improving our Relevance
• The History of Search @ Reddit
• Scaling our Infrastructure
• Q&A
3. What is Reddit?
Reddit is a network of communities where
individuals can find experiences built
around their interests, hobbies and
passions
It’s where people converse about the
things that are most important to them
5. Reddit by the numbers
● Alexa Rank (US/World): 5th/18th
● MAU: 400M+
● Communities: 1M+
● Posts per day: 440K+
● Comments per day: 3.5M+
● Votes per day: 82M+
● Searches per day: 68M+
18. Show and Tell: A better subreddit search
Challenge: Redditors are very creative in their subreddit naming (e.g. r/superbowl is about superb owl pictures), which, whilst fun, poses a challenge for discovery.
Answer: faceted search on posts!
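A minimal sketch of the faceting idea as a Solr query in Python (the endpoint, the posts collection, and the title/selftext/subreddit field names are assumptions, not Reddit's actual schema): match posts by content, then facet on subreddit, so communities can be ranked by how much matching content they contain, whatever they happen to be named.

```python
import requests

SOLR = "http://localhost:8983/solr/posts/select"  # hypothetical posts collection

def subreddit_facets(query, limit=10):
    """Search post titles/bodies and facet on the subreddit field, so
    communities can be ranked by how many matching posts they contain."""
    params = {
        "q": query,
        "defType": "edismax",
        "qf": "title selftext",      # assumed field names
        "rows": 0,                   # only the facet counts are needed here
        "facet": "true",
        "facet.field": "subreddit",  # assumed field name
        "facet.limit": limit,
        "wt": "json",
    }
    resp = requests.get(SOLR, params=params)
    resp.raise_for_status()
    counts = resp.json()["facet_counts"]["facet_fields"]["subreddit"]
    # Solr returns a flat [term, count, term, count, ...] list; pair it up.
    return dict(zip(counts[::2], counts[1::2]))

# subreddit_facets("owl") could surface r/superbowl even though "owl"
# never appears in the subreddit's name.
```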
21. Show and Tell: Better Post Search
● Post search with phrase matching of selftext
The challenge: What about images and link posts?
Answer - Comments
● Comments are important but which comments are most relevant to the post?
● How do we separate the signal from the noise?
Answer - HVT
● HVTs are the highest-scoring tf-idf terms from a post’s comment section.
● Index and match on these HVTs along with post selftexts and titles.
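A minimal sketch of HVT extraction using scikit-learn's TfidfVectorizer (a small stand-in for the production pipeline, which ran at far larger scale; the function name and its inputs are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def high_value_tokens(comment_threads, k=10):
    """comment_threads: {post_id: concatenated comment text}. Scores each
    thread's terms with tf-idf against all other threads and keeps the
    top-k terms per post as its HVTs."""
    post_ids = list(comment_threads)
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(comment_threads[p] for p in post_ids)
    terms = vectorizer.get_feature_names_out()
    hvts = {}
    for i, post_id in enumerate(post_ids):
        row = matrix.getrow(i).toarray().ravel()
        top = row.argsort()[::-1][:k]          # indices of highest tf-idf scores
        hvts[post_id] = [terms[j] for j in top if row[j] > 0]
    return hvts
```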
22. Result: Better Post Search
Qualitatively, some users noticed almost immediately when we first introduced HVTs.
For some queries, the difference is quite stark. The following are search results for the query ‘shabooya’. Note how ‘shabooya’ doesn’t appear anywhere in the title or the body of the first three post results, but you can see the phrase show up in the comments.
23. Result: Better Post Search
● Post click-through rate (CTR): +3.15%
● Relevance ranking for navigational searches (MRR): +4.01%
● Improved search experience for navigational searches due to increased recall on posts with poor title or body text
24. Take It to the Next Level: Improve Search Relevance
● Learn from users’ click statistics to automatically generate a relevance model
● Rerank search results using aggregated click-signal weights, boosting the results users click most often for a given query
○ Stream user events into the Solr/Fusion cluster
○ Run Spark jobs to aggregate the click data (sketched below)
○ Use the aggregated signal output to boost the search results
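A hedged PySpark sketch of the aggregation step (the event paths, the clicks schema, and the simple normalized-click weight are assumptions, not the production formula):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("click-signal-aggregation").getOrCreate()

# Hypothetical click events: one row per search click, with the issued
# query and the clicked document id.
clicks = spark.read.parquet("s3://bucket/search-click-events/")

# Count clicks per (query, doc) and normalize within each query, so the
# weight reflects how often users preferred that result for that query.
boosts = (
    clicks.groupBy("query", "doc_id")
    .agg(F.count(F.lit(1)).alias("clicks"))
    .withColumn(
        "weight",
        F.col("clicks") / F.sum("clicks").over(Window.partitionBy("query")),
    )
)

# The output feeds back into the cluster as per-query boost signals.
boosts.write.mode("overwrite").parquet("s3://bucket/click-boosts/")
```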
25. Result: Post search relevance using signals
7.5% increase in CTR, 12.5% increase in MRR
28. Head-Tail Analysis
A tail query like “lot of credit card debit” would be rewritten to produce more relevant results.
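One toy way such a rewrite could work, snapping tail-query tokens to a vocabulary built from high-frequency head queries (the vocabulary, stopword list, and similarity threshold are made up for illustration):

```python
import difflib

# Hypothetical vocabulary drawn from high-frequency ("head") queries.
HEAD_VOCAB = ["credit", "card", "debt", "loans", "score"]
STOPWORDS = frozenset({"a", "an", "of", "lot", "the"})

def rewrite_tail_query(query):
    """Drop low-signal tokens and snap the rest to the nearest head-query
    term, so a tail query inherits the head queries' relevance."""
    kept = []
    for token in query.lower().split():
        if token in STOPWORDS:
            continue
        match = difflib.get_close_matches(token, HEAD_VOCAB, n=1, cutoff=0.8)
        kept.append(match[0] if match else token)
    return " ".join(kept)

print(rewrite_tail_query("lot of credit card debit"))  # -> "credit card debt"
```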
29. Trending Searches
● Reddit can attribute week-over-week DAU growth to external events, like game releases, movie releases, and cultural events
● We see similar upticks in searches based on these events
● We believe that we can increase search engagement and time on site by leveraging these signals to highlight trending queries to users when they search on Reddit
30. NSFW Categorization
● Develop NSFW classification criteria
● Query-time, classification-based content filtering (sketched below)
● Boost or reorder results based on the classification (boost or filter results depending on whether the query has NSFW intent)
● Look at the NSFW results in recall
● Look at the NSFW results people clicked
● Try open-source TensorFlow libraries to auto-detect NSFW content that is not marked NSFW
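A minimal sketch of the query-time filtering idea (the keyword-list classifier is a stand-in for a trained model, and the over_18 field and collection names are assumptions):

```python
import requests

SOLR = "http://localhost:8983/solr/posts/select"  # hypothetical collection

# Stand-in classifier: production would use a trained model rather than
# a keyword list to decide whether a query has NSFW intent.
NSFW_TERMS = {"nsfw"}

def classify_nsfw_intent(query):
    return any(token in NSFW_TERMS for token in query.lower().split())

def search(query):
    params = {"q": query, "defType": "edismax", "wt": "json"}
    if classify_nsfw_intent(query):
        params["bq"] = "over_18:true^2"  # NSFW intent: boost over_18 content
    else:
        params["fq"] = "-over_18:true"   # no NSFW intent: filter it out
    return requests.get(SOLR, params=params).json()
```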
31. Related Searches
● Train a collaborative-filtering matrix-decomposition recommender using Spark ML’s Alternating Least Squares (ALS) to batch-compute query-query similarities (sketched below)
● Related Searches backend based on collaborative filtering and a co-occurrence counting algorithm via temporal proximity
● Collaborative-filtering recommender systems are a popular technique, applied to movie recommendations at Netflix and product recommendations on e-commerce sites like Amazon
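A hedged Spark ML sketch of the ALS step (the input path and the (user_id, query, count) schema are assumptions):

```python
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("related-searches-als").getOrCreate()

# Hypothetical (user_id, query, count) aggregates from search logs.
events = spark.read.parquet("s3://bucket/search-query-counts/")

# ALS needs integer ids, so index users and query strings first.
events = StringIndexer(inputCol="user_id", outputCol="user").fit(events).transform(events)
events = StringIndexer(inputCol="query", outputCol="item").fit(events).transform(events)

als = ALS(
    userCol="user", itemCol="item", ratingCol="count",
    rank=64, implicitPrefs=True,  # search counts are implicit feedback
)
model = als.fit(events)

# model.itemFactors holds one latent vector per query; cosine similarity
# between these vectors yields query-query relatedness for Related Searches.
model.itemFactors.show()
```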
32. Related Searches
● Dynamic temporal buckets as the source of data
● All pairs, irrespective of the number of distinct queries in a session (sketched below)
● Length and temporal-distance metrics help with boosting recommendations
● Intuitive and easily explainable
● Scales extremely well for building pluggable logic and adding more dimensions
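A toy Python sketch of co-occurrence counting with temporal proximity (the window size and inverse-gap weighting are illustrative, not the production choices):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_scores(sessions, window_secs=600):
    """sessions: iterable of [(timestamp, query), ...] lists, one per user
    session. Counts every pair of distinct queries issued within
    window_secs of each other, weighting nearer-in-time pairs more."""
    scores = Counter()
    for session in sessions:
        for (t1, q1), (t2, q2) in combinations(sorted(session), 2):
            gap = t2 - t1
            if q1 != q2 and gap <= window_secs:
                # Closer in time -> stronger evidence of relatedness.
                scores[(q1, q2)] += 1.0 / (1.0 + gap)
    return scores

sessions = [[(0, "ps5 restock"), (30, "ps5 price"), (2000, "weather")]]
print(cooccurrence_scores(sessions).most_common(1))
```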
35. What’s next
● Contextual Query Understanding
○ How context informs query understanding
● Understanding User Intent
○ Classifying the query by its interpretation; the interpretation can then be used to define intent
● Query rewriting and scoping
○ A query rewriting technique that improves precision by matching each query segment to the right attribute
○ Query tagging (a special case of named-entity recognition (NER))
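A toy sketch of query tagging (the vocabulary and matching rule are illustrative; production tagging would use a trained NER-style model):

```python
SUBREDDITS = {"aww", "gaming", "science"}  # hypothetical vocabulary

def tag_query(query):
    """Tag each query segment with the attribute it scopes to, a special
    case of named-entity recognition (NER)."""
    tags = []
    for token in query.lower().split():
        name = token[2:] if token.startswith("r/") else token
        if name in SUBREDDITS:
            tags.append((token, "subreddit"))
        else:
            tags.append((token, "text"))
    return tags

print(tag_query("cat pictures r/aww"))
# -> [('cat', 'text'), ('pictures', 'text'), ('r/aww', 'subreddit')]
```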
37. Reddit Search has an
interesting history...
History of Reddit Search
38. History of Reddit Search
● 2005 - Steve Huffman, cofounder and now CEO, implements postgres tsearch.
● 2006 - Chris Slowe, founding engineer and now CTO, implements pylucene.
○ “we fixed a bug in the search results ordering” - Steve Huffman ‘06
○ “I made a quick fix to search that I hope helps until we get a chance to really fix it.” - Steve ‘07
● 2008 - David King, first employee and former search engineer, implements Solr.
○ “[David]’s been fixing search and hacking mystery projects in Erlang.” - Alexis Ohanian ‘08
○ “I’ve totally replaced the reddit search function.” - David King ‘08
● 2010 - David King replaces Solr with IndexTank.
○ “We launched a new search engine yesterday. Calm down. It’s okay. I know. You’ve been hurt
before.” - David King ‘10
● 2012 - u/kemitche implements CloudSearch after LinkedIn shut down IndexTank.
“Q: Where do you see reddit in 10 years? A: Reddit search might work by then.” - Steve AMA ‘16
39. Redditors told us how
much they loved
Search...
“Reddit Search is great!” - said no redditor ever
40. “This image should honestly replace the 503 error (all servers busy) page.” - u/seven0feleven
41. “Ever since they moved away from scotch tape, I've been able to get irrelevant results in record time.” - u/El_Bandito_Blanquito
42. In 2017, we set out to
rebuild search from the
ground up!
Rebuilding Search
43. Our First Cluster
● Create an AMI with Solr and Fusion packages installed
● Spin up servers with the custom AMI
● SSH into each server
○ Install Fusion and Solr
○ Edit configuration files
○ Increase file descriptor limit
● Configured in AWS US West
44. Our First Cluster
Our new cluster was up
and running well! We
immediately started work
on ingesting data and
relevance tuning.
45. But we ran into a
couple of key issues
when trying to scale
up...
Challenge #1
46. Issues with Scaling our Solr Cluster
● Adding capacity to our cluster or changing instance types took a lot of
effort
● Adding capacity to our cluster meant that we needed to rebalance the cluster so that our replicas were equally distributed across machines
○ Solr 7+ introduced some basic autoscaling features but lacked policies to ensure a cluster was properly balanced
○ The rebalancing process was 100% manual
● Cross-region requests added unnecessary latency
● As a result, our team was very cautious, scaling our cluster only when absolutely needed to reduce the number of times we scaled up
48. Terraform + Puppet
● Together they allow us to programmatically make changes to
infrastructure and server configuration quickly
● We can describe how we want servers to be set up
○ Install Java and Solr
○ Mount drives and add user groups/permissions
○ Set up Solr configuration files
● Modifications to servers and infra are reviewable and revertible
● Roll out changes across our fleet with ease
● “Can you add more servers Jerry??”
○ No problem! A one-line code change.
59. Solr Rebalancing Tool
● Applied balancing rules in order (sketched below)
○ Check each shard’s availability-zone distribution and replica distribution
○ Move replicas so that each collection’s replicas are spread across as many machines as possible
○ Move replicas so that each machine holds as few replicas as possible
● Outputs the list of operations to be performed and confirms each replica move with the user
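A simplified sketch of the last balancing rule, planning moves without executing them (it omits the AZ and per-collection rules, so it is not the actual tool):

```python
from collections import Counter

def plan_rebalance(replica_locations):
    """replica_locations: {replica_id: node}. Greedily plans moves until
    every node holds as few replicas as possible (max-min spread <= 1).
    Only returns the operations; a human confirms each move, as our tool
    did. Assumes every node already holds at least one replica."""
    load = Counter(replica_locations.values())
    ops = []
    while True:
        busiest = max(load, key=load.get)
        idlest = min(load, key=load.get)
        if load[busiest] - load[idlest] <= 1:
            return ops
        # Pick any replica on the busiest node to relocate.
        replica = next(r for r, n in replica_locations.items() if n == busiest)
        replica_locations[replica] = idlest
        load[busiest] -= 1
        load[idlest] += 1
        ops.append({"op": "MOVEREPLICA", "replica": replica, "target": idlest})
```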
63. Our cluster was now
scaling easily, but
reindexing all of our
data took many
weeks...
Challenge #2
64. Indexing Data for Search
● Backfills
○ Pulls data from our data source
○ Transforms it into the schema we need for indexing
○ Used to add/remove/change field indexing
● Streaming
○ Captures real-time updates so up-to-date information can be
reflected in our indices
○ Transforms data the same way as backfills
65. Why are fast backfills important?
● Quickly iterate on document schemas
● Test new ways to analyze document fields
● Create multiple clusters of the same data for testing
● Fix data issues rapidly
67. Hive
● Pulled data from postgres into Hive with Sqoop
● A series of transformations to
○ Join the thing and data tables
○ Rotate the keys into columns (sketched below)
○ Store the final result as Parquet in S3
● Fusion/Spark fetched the S3 files and indexed the data into Solr
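A hedged Spark sketch of the “rotate the keys into columns” step (the table names, thing_id column, and key list are assumptions about the old EAV-style thing/data schema):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("thing-data-pivot").getOrCreate()

# Hypothetical copies of the old EAV-style tables: `thing` has one row per
# object, `data` has one (thing_id, key, value) row per attribute.
thing = spark.table("thing_link")
data = spark.table("data_link")

# "Rotate the keys into columns": pivot each attribute key into its own
# column so every post becomes a single wide row.
wide = (
    data.groupBy("thing_id")
    .pivot("key", ["title", "selftext", "url", "author"])  # assumed keys
    .agg(F.first("value"))
    .join(thing, "thing_id")
)

wide.write.parquet("s3://bucket/posts-parquet/")  # final result as Parquet in S3
```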
69. Issues with v1
● Several weeks to transform data
○ Made us afraid of changing the schema
● Many stages of transformation, making it hard to debug and to figure out how far upstream data issues originated
○ Hard to ensure the end result was correct
70. Thing Service
● Search Service as the transformer and indexer of data
○ Fetches the latest data from the Thing Service
● Special logic in the Thing Service made it easier to handle postgres data
○ Scores of links and comments
○ Converting to actual data types (booleans, fullnames)
● Cut backfill time from multiple weeks to a single week with
parallelization
72. Issues with v2
● Reliant upon a shared production service for what should be an offline
job
○ We’ve pushed the Thing Service too hard with our backfills, affecting other services that rely upon it
● Other initiatives highlighted how slow our ingestion could get
○ HVTs (augmenting links with high value tokens from comments)
○ Attempts to index comment data
73. Spark
● Running our own postgres replicas from wal-e backups in S3
● Spark pulls data directly from postgres and transforms the data (sketched below)
● Can horizontally scale ingestion to be faster
○ Scale postgres to speed up ingestion of data into Spark
○ Scale Spark to speed up transformation and joining of data
● We can adjust ingestion parallelism by repartitioning at the end
● Cut backfill time significantly, from multiple weeks to days
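A hedged PySpark sketch of the parallel JDBC pull (the host, table, and key bounds are placeholders; the Postgres JDBC driver must be on the Spark classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("search-backfill").getOrCreate()

# Read straight from our own postgres replica, split into parallel
# partitions on the primary key so many executors can pull data at once.
posts = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://replica.internal:5432/reddit")  # placeholder host
    .option("dbtable", "posts")               # placeholder table
    .option("partitionColumn", "id")
    .option("lowerBound", 1)
    .option("upperBound", 1_000_000_000)      # placeholder key range
    .option("numPartitions", 256)
    .load()
)

# Repartition at the end to control downstream indexing parallelism.
posts.repartition(64).write.parquet("s3://bucket/backfill/posts/")
```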
76. Redditors Issue Expensive Queries
● High Recall Queries
○ the, would, you, ifs, news, games
● Crazy Queries
○ (AFD+OR+CDU+OR+CSU+OR+FDP+OR+Grünen+OR+SPD+OR+"Die+Linke"+OR+Energiepolitik+OR+Gesetze~+OR+Kabinetts~+OR+Regierungs~+OR+Referentenentwurf)+(Energiehandel~+OR+Energiemanagement~+OR+Energiepreis~+OR+Energiesteuer~)
● These queries would take multiple seconds to complete, blocking a
significant number of CPU cores in the cluster
77. Cutting Queries Off
● Utilize timeAllowed in solrconfig.xml to prevent expensive queries from taking up all of your cluster’s resources (sketched below)
○ NOTE: timeAllowed is not a hard cutoff. From the Solr docs: “As this check is periodically performed, the actual time for which a request can be processed before it is aborted would be marginally greater than or equal to the value of timeAllowed. If the request consumes more time in other stages, e.g., custom components, etc., this parameter is not expected to abort the request.”
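timeAllowed can also be passed per request; a minimal sketch against a hypothetical collection, checking Solr's partialResults flag:

```python
import requests

SOLR = "http://localhost:8983/solr/posts/select"  # hypothetical collection

resp = requests.get(SOLR, params={
    "q": "the OR would OR you",  # an expensive, high-recall query
    "timeAllowed": 1000,         # budget in milliseconds
    "wt": "json",
})
body = resp.json()

# Solr flags aborted queries rather than failing them outright.
if body["responseHeader"].get("partialResults"):
    print("query hit the timeAllowed cutoff; results are partial")
```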
80. Multi-Cluster Solr Environment
● One cluster per collection
● Hardware isolation: one collection’s issues won’t affect other collections
● Scale each collection independently
● Balancing becomes really simple
○ Each machine holds an equal number of replicas
○ Ensure AZ and shard awareness
81. Solr 7.5 Autoscaling
● Solr 7.5 includes new policies that allow us to equally distribute
replicas by
○ Arbitrary properties
○ Collection
○ Cluster
● Turn Solr scaling into a one-step process (sketched below)
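A minimal sketch of setting such a policy through the autoscaling API (assuming Solr 7.5's #EQUAL operand and nodes started with an az system property; the exact rule syntax should be verified against your Solr version):

```python
import requests

# Ask Solr to keep replicas equally distributed across nodes and to cap
# replicas per availability zone for each shard.
policy = {
    "set-cluster-policy": [
        {"replica": "#EQUAL", "node": "#ANY"},                      # balanced cluster
        {"replica": "<2", "shard": "#EACH", "sysprop.az": "#EACH"}  # AZ awareness
    ]
}
resp = requests.post("http://localhost:8983/solr/admin/autoscaling", json=policy)
resp.raise_for_status()
```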