Search at Tumblr
Yufei Pan
Director of Search, Tumblr
16 January 2013
Tumblr - Follow the World’s Creators
Founded
● David Karp
● February 2007

Publishing Platform
● 163 million blogs
● 72 billion posts

Social Network
● Follow, Mention
● Like, Reblog
About search@tumblr
● Most important way to discover great content
○ 50M searches a day

● Limited search for a long time (2007-2012)
○ Tagged page
■ mysql lookup of a single tag id
■ sorted by reverse chronological order
○ Finding blog
■ navigate through curated directories
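The old tagged page described above amounts to a single indexed lookup. A minimal sketch, assuming a hypothetical `post_tags` table and using SQLite in place of MySQL so the example runs on its own (the table, column, and function names are illustrative, not Tumblr's schema):

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post_tags (tag_id INTEGER, post_id INTEGER, created_at INTEGER)"
)
conn.execute("CREATE INDEX idx_tag_time ON post_tags (tag_id, created_at)")
conn.executemany(
    "INSERT INTO post_tags VALUES (?, ?, ?)",
    [(7, 1, 100), (7, 2, 300), (7, 3, 200), (9, 4, 400)],
)

def tagged_page(conn, tag_id, limit=20):
    """Return post ids for one tag id, newest first."""
    rows = conn.execute(
        "SELECT post_id FROM post_tags "
        "WHERE tag_id = ? ORDER BY created_at DESC LIMIT ?",
        (tag_id, limit),
    ).fetchall()
    return [row[0] for row in rows]

page = tagged_page(conn, 7)
```

A lookup like this has no notion of relevance or popularity, which is the limitation the new search features address.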
About search@tumblr
● Search Team
○ July 2012: Jak joined as the first search engineer!
○ Jak, Yufei, Bennett, Beitao, Patrick, Adam

● Features launched in 2013
○ Post search, Blog search, Theme search
○ Typeaheads, Recommendation, Trends
Whole New Search
Post search
● full text search
● top and recent
● post type filtering

Blog search
● name & title
● top tags in posts
● blog highlights

Related search
● term co-occurrence
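Term co-occurrence for related search can be sketched as pair counting over the tags of each post. The slides do not say how the counts are computed or normalized, so the raw counts, the function names, and the sample data below are assumptions:

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(posts):
    """Count how often each pair of terms appears on the same post."""
    pair_counts = Counter()
    for tags in posts:
        # Deduplicate and sort so each unordered pair is counted once per post.
        for a, b in combinations(sorted(set(tags)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def related_terms(pair_counts, term, k=3):
    """Return up to k terms that co-occur most often with `term`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    # Highest count first; ties broken alphabetically for stable output.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ordered[:k]]

# Hypothetical sample data: one tag list per post.
posts = [
    ["cats", "gif", "funny"],
    ["cats", "funny"],
    ["cats", "photography"],
    ["dogs", "funny"],
]
counts = build_cooccurrence(posts)
related = related_terms(counts, "cats")
```

At Tumblr's scale this counting would presumably run as an offline batch job rather than in memory.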
Typeahead Autocompletes
● Interactive guide of tumblr content
● High volume of traffic
● Low latency

[Screenshots: Search Autocomplete, Mention Autocomplete, Tag Suggest]
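One common way to serve typeahead at low latency is a binary search over a sorted term list, with matches ordered by popularity. This is a minimal sketch of that general technique, not the structure Tumblr uses; the class name, weights, and sample terms are hypothetical:

```python
import bisect

class PrefixIndex:
    """Sorted-list prefix lookup with matches ordered by popularity."""

    def __init__(self, weighted_terms):
        # weighted_terms maps term -> popularity weight (hypothetical data).
        self._weights = dict(weighted_terms)
        self._terms = sorted(self._weights)

    def suggest(self, prefix, k=5):
        # Every term starting with `prefix` lies in one contiguous slice.
        lo = bisect.bisect_left(self._terms, prefix)
        hi = bisect.bisect_left(self._terms, prefix + "\uffff")
        matches = self._terms[lo:hi]
        matches.sort(key=lambda t: (-self._weights[t], t))
        return matches[:k]

index = PrefixIndex({"cat": 50, "cats": 120, "catalog": 5, "car": 80, "dog": 200})
top_two = index.suggest("cat", k=2)
```

Sorting the matches per request is acceptable for a sketch; a production index would more likely store the top completions for each prefix ahead of time.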
Recommendations
Personalized Recommendation

Weekly Dashboard Digest
Trends
Trending Tags

Trending Blogs
Theme Search
Search Architecture
[Diagram: three layers labeled Online, Data, and Offline]
● Online
○ Post Search, Blog Search, Typeahead, Related Tags, Blog Recommend
○ Blog Highlights, Blog Top Tags, Trending Tags, Trending Blogs, Trending Posts
○ Search Online Framework
● Data
○ Recent Post Index, Blog Full Index, Theme Index, Blog Top-K Index
○ Follower Counts, Post Notecount, Post Model, Personalized Blog Index
○ Trending Blogs, Trending Posts, Trending Tags, Related Tag Index
○ Blog Global Rank, Blog Model, User Model, Typeahead Indices
○ Top Post Index, Blog Top Posts, Blog Top Tags, Two Degree, Like Root
○ Blog Feedback, In-Blog Tag Index, Global Tag Index
● Offline
○ Search Offline Framework
● Other labels in the diagram: Rediscover, Solr, MySQL, Activity Streams (Fire Geyser), Scribe logs, Sqoop tables (HDFS), Nginx, Linux
Software Stack
● Search Online
○ HAProxy, Nginx, PHP
○ Memcache
○ Icinga, Scribe, OpenTSDB

● Search Data
○ Solr, Redis, MySQL

● Search Offline
○ Sqoop, Hadoop
○ Java, Hive, Pig, Scalding, Python
Search Online Framework
● Search Services
● SearchBase
○ Search Flow Execution, Multi-level Caching, Search Logging
○ Async Execution, Search Editorial
● Interfaces and implementations
○ QueryIF: SimpleQuery, PersonalizedQuery, AdvancedPostQuery, TimeSliceQuery, TrendTagQuery
○ RetrieverIF: SolrPostRetriever, MysqlPostRetriever, SMPostRetriever, TumblelogRetriever, TagTypeaheadReteriever
○ SignalFetcherIF: NotecountFetcher, FollowercountFetcher, TumblelogGlobalRankFetcher, RecommendationSignalFetcher, BlogTopTagFetcher
○ RankerIF: TopPostRanker, TumblelogRanker, RelatedPostRanker, TumblelogMixingRanker
○ DocFetcherIF: PostFetcher, TumblelogFetcher, TagFetcher
○ FilterIF: PostFilter, TumblelogFilter, TagFilter
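The interface names suggest a staged search flow: retrieve candidates, fetch signals, rank, fetch documents, then filter. The sketch below illustrates that reading in Python (the slides list PHP for the online stack); the stage order, class names, method names, and sample data are assumptions, not Tumblr's code:

```python
from dataclasses import dataclass

@dataclass
class Query:
    text: str
    limit: int = 10

class InMemoryRetriever:
    """Plays the role of a RetrieverIF implementation (hypothetical)."""
    def __init__(self, index):
        self._index = index  # term -> list of post ids

    def retrieve(self, query):
        return list(self._index.get(query.text, []))

class NotecountSignalFetcher:
    """Plays the role of a SignalFetcherIF implementation (hypothetical)."""
    def __init__(self, notecounts):
        self._notecounts = notecounts

    def fetch(self, doc_ids):
        return {d: self._notecounts.get(d, 0) for d in doc_ids}

class TopRanker:
    """Orders candidates by signal value, highest first."""
    def rank(self, doc_ids, signals):
        return sorted(doc_ids, key=lambda d: (-signals[d], d))

class DictDocFetcher:
    """Loads full documents only for the ranked ids."""
    def __init__(self, docs):
        self._docs = docs

    def fetch(self, doc_ids):
        return [self._docs[d] for d in doc_ids if d in self._docs]

class HiddenPostFilter:
    """Drops documents flagged as hidden."""
    def keep(self, doc):
        return not doc.get("hidden", False)

def run_search(query, retriever, signal_fetcher, ranker, doc_fetcher, doc_filter):
    candidates = retriever.retrieve(query)
    signals = signal_fetcher.fetch(candidates)
    ranked = ranker.rank(candidates, signals)
    docs = doc_fetcher.fetch(ranked)
    return [d for d in docs if doc_filter.keep(d)][: query.limit]

results = run_search(
    Query("cats"),
    InMemoryRetriever({"cats": [1, 2, 3]}),
    NotecountSignalFetcher({1: 5, 2: 40, 3: 12}),
    TopRanker(),
    DictDocFetcher({1: {"id": 1}, 2: {"id": 2, "hidden": True}, 3: {"id": 3}}),
    HiddenPostFilter(),
)
result_ids = [doc["id"] for doc in results]
```

Because each stage sits behind its own interface, a new search type can be assembled by swapping one implementation, which may be why the slide lists several implementations per interface.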
Search Batch Processing
● Search Data (Redis)
● Search Workflow Engine
○ Workflow Composition, Dependency Resolution, Automatic Versioning
○ Data Verification, Execution Logging, Failure Detection/Alert
● Search Task Base
○ Hive Jobs, Pig Jobs, Streaming Jobs, Scalding Jobs
○ Term Generators, Top-K Indexer, Delta Propagator, Lucille2 Classes
● Scribe Logs, Sqoop Tables (HDFS)
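Dependency resolution in a workflow engine is typically a topological sort of the task graph. A minimal sketch using Python's standard `graphlib` module (Python 3.9+); the task names loosely borrow labels from the slide, and the dependencies between them are invented for illustration:

```python
from graphlib import CycleError, TopologicalSorter

def execution_order(dependencies):
    """Return tasks in an order that runs every dependency first.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())

# Hypothetical workflow: each task lists what must finish before it starts.
tasks = {
    "sqoop_import": set(),
    "term_generator": {"sqoop_import"},
    "top_k_indexer": {"term_generator"},
    "delta_propagator": {"top_k_indexer"},
}
order = execution_order(tasks)

def has_cycle(dependencies):
    """Report whether the workflow definition contains a dependency cycle."""
    try:
        execution_order(dependencies)
    except CycleError:
        return True
    return False
```

Rejecting cyclic definitions before execution is one simple form of the failure detection the slide mentions.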
Indexing
● 3-Tier indices
○ Index all posts
■ 600+ machines
○ Recent (6W) + Popular (4Y) + Existing tag table
■ Down to 40 machines
■ Minor loss in coverage
■ Serve up to 4K qps (non-cached)

● Lean index
○ Separate signals from index
■ Eliminate high volume re-indexing
■ Independent signal engineering from indexing
○ Separate document text from index
■ Dropping the memory footprint
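The two ideas above, tiered indices and a lean index with signals stored separately, can be sketched together. The class, the tier contents, and the `notecounts` store are hypothetical stand-ins for the Solr indices and signal stores in the slides:

```python
class LeanIndex:
    """Inverted index holding only term -> post ids (no text, no signals)."""

    def __init__(self):
        self._postings = {}

    def add(self, post_id, terms):
        for term in terms:
            self._postings.setdefault(term, set()).add(post_id)

    def search(self, term):
        return set(self._postings.get(term, ()))

def tiered_search(term, tiers):
    """Query each tier in order and merge the ids without repeats."""
    seen, results = set(), []
    for index in tiers:
        for post_id in sorted(index.search(term)):
            if post_id not in seen:
                seen.add(post_id)
                results.append(post_id)
    return results

# Two hypothetical tiers: recent posts and popular posts.
recent, popular = LeanIndex(), LeanIndex()
recent.add(101, ["cats"])
recent.add(102, ["cats", "gif"])
popular.add(7, ["cats"])
popular.add(102, ["cats"])

# Signals live outside the index, so they change without re-indexing.
notecounts = {101: 3, 102: 50, 7: 900}

hits = tiered_search("cats", [recent, popular])
notecounts[101] = 1200  # a signal update; neither index is touched
ranked = sorted(hits, key=lambda post_id: -notecounts[post_id])
```

Keeping fast-changing counters out of the index is what removes the high-volume re-indexing noted above.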
Ranking
● Quickly evolving!
● Major ranking signals in production
○ Global popularity
■ likes, reblogs, follows
○ Local popularity
■ popularity projected on <user, query>
■ blog search: aggregated likes on query term
■ blog recommendation: follow counts among friends

○ Textual relevancy
■ how: exact match, query proximity
■ where: name, title, tag, mention, body, etc
○ Recency
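The slides name the ranking signals but not how they are combined. One simple possibility is a weighted sum, sketched here with invented weights, field names, and signal formulas; none of this is Tumblr's actual ranking function:

```python
import math

# Hypothetical weights; a real system would tune or learn these.
DEFAULT_WEIGHTS = {"popularity": 1.0, "relevancy": 2.0, "recency": 0.5}

def score(post, query_terms, now, weights=None):
    """Combine popularity, textual relevancy, and recency into one score."""
    w = weights or DEFAULT_WEIGHTS
    # Global popularity: log-damped so very large counts do not dominate.
    popularity = math.log1p(post["likes"] + post["reblogs"])
    # Textual relevancy: fraction of query terms that exactly match a tag.
    tags = set(post["tags"])
    matched = sum(1 for term in query_terms if term in tags)
    relevancy = matched / len(query_terms) if query_terms else 0.0
    # Recency: decays with the post's age in days.
    age_days = max(0.0, (now - post["created"]) / 86400.0)
    recency = 1.0 / (1.0 + age_days)
    return (
        w["popularity"] * popularity
        + w["relevancy"] * relevancy
        + w["recency"] * recency
    )

now = 1_000_000
base = {"likes": 10, "reblogs": 5, "tags": ["cats"], "created": now - 86400}
popular = dict(base, likes=1000, reblogs=500)
off_topic = dict(base, tags=["dogs"])
fresh = dict(base, created=now)
```

With these definitions, a post ranks higher when it has more notes, matches more query terms, or is newer, all else being equal.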
Duplicate Elimination (DE)
● Index-time DE
○ post signature
■ number of tags > N1
■ md5 hash of normalized tag list

● Search-time DE
○ Media DE
■ posts with same media hashes.
○ Near DE
■ posts with tags > N2
■ mark as near duplicate if diff <= N3 tags
■ older posts selected as original
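The tag-based rules above can be sketched as follows. The normalization (lowercase, trim, deduplicate, sort), the reading of "diff" as the symmetric difference of the two tag sets, and all function and parameter names are assumptions; the slides leave N1, N2, and N3 unspecified, so they appear here as parameters:

```python
import hashlib

def normalize_tags(tags):
    """Lowercase, trim, deduplicate, and sort tags (assumed normalization)."""
    return sorted({t.strip().lower() for t in tags if t.strip()})

def post_signature(tags, min_tags):
    """md5 of the normalized tag list, or None at or below min_tags (N1)."""
    normalized = normalize_tags(tags)
    if len(normalized) <= min_tags:
        return None
    return hashlib.md5("\n".join(normalized).encode("utf-8")).hexdigest()

def is_near_duplicate(tags_a, tags_b, min_tags, max_diff):
    """True if both posts exceed min_tags (N2) and differ by <= max_diff (N3)."""
    a, b = set(normalize_tags(tags_a)), set(normalize_tags(tags_b))
    if len(a) <= min_tags or len(b) <= min_tags:
        return False
    return len(a ^ b) <= max_diff

def mark_near_duplicates(posts, min_tags, max_diff):
    """Flag near duplicates, keeping the oldest post as the original.

    `posts` is a list of (post_id, created_ts, tags) tuples.
    """
    originals, duplicates = [], []
    for post_id, _created, tags in sorted(posts, key=lambda p: p[1]):
        if any(is_near_duplicate(tags, o, min_tags, max_diff) for o in originals):
            duplicates.append(post_id)
        else:
            originals.append(tags)
    return duplicates

sample = [
    (1, 100, ["a", "b", "c", "d"]),
    (2, 200, ["a", "b", "c", "e"]),
    (3, 300, ["x", "y", "z", "w"]),
]
flagged = mark_near_duplicates(sample, min_tags=3, max_diff=2)
```

Identical signatures can be caught cheaply when a post is indexed, while the pairwise near-duplicate check only needs to run over the small candidate set of a single query.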
Search Platform
● A curvy road
○ Started with ElasticSearch
○ Switched to SolrCloud due to reliability
○ Ended up with Solr + Customized Clustering

● Our takes
○ ElasticSearch and SolrCloud have great functionality
■ distributed indexing and search
■ easy cluster management
○ Solr still seems much more reliable under high indexing load and search traffic.
Offline Precomputation
● Benefits
○ Minimize the search online latency
○ More sophisticated/expensive computation

● Limitation
○ Loss of freshness
○ Expensive for longtail query and results

● Precomputed
○ Typeaheads
○ Related search
○ Blog recommendation
○ Top posts of Blog / User
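The trade-off above can be illustrated with "top posts of a blog": an offline job builds a small lookup table, and the online path only reads it. The function names, the notecount signal, and the sample data are hypothetical; in the slides the precomputed data lands in stores such as Redis rather than a Python dict:

```python
import heapq
from collections import defaultdict

def precompute_top_posts(posts, k):
    """Offline step: build blog -> ids of its k highest-notecount posts.

    `posts` is an iterable of (blog, post_id, notecount) tuples.
    """
    by_blog = defaultdict(list)
    for blog, post_id, notecount in posts:
        by_blog[blog].append((notecount, post_id))
    return {
        blog: [post_id for _, post_id in heapq.nlargest(k, entries)]
        for blog, entries in by_blog.items()
    }

def top_posts(table, blog):
    """Online step: one key lookup, with no ranking work at query time."""
    return table.get(blog, [])

sample_posts = [
    ("staff", 1, 10),
    ("staff", 2, 99),
    ("staff", 3, 50),
    ("art", 4, 7),
]
table = precompute_top_posts(sample_posts, k=2)
staff_top = top_posts(table, "staff")
```

The table is only as fresh as the last batch run, which is the loss of freshness listed under limitations.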
What’s Next
● Inblog search
○ full text search on all posts in a blog
○ original posts, reblogs, likes

● Ranking
○ more effective and spam-resilient signals
○ learning to rank

● Topical interest modeling
○ supervised and unsupervised
○ blog content and user activities
○ interest based blog recommendation

● Content discovery
○ trending content in various categories
Q&A
Question: Are you hiring?
Answer: Yeah! Check it out at http://www.tumblr.com/jobs

More questions please, :-)
