Search at Tumblr
Yufei Pan
Director of Search, Tumblr
16 January 2013
Tumblr - Follow the World’s Creators
Founded
● David Karp
● February 2007

Publishing Platform
● 163 million blogs
● 72 billion posts

Social Network
● Follow, Mention
● Like, Reblog
About search@tumblr
● Most important way to discover great content
○ 50M searches a day

● Limited search for a long time (2007-2012)
○ Tagged page
■ mysql lookup of a single tag id
■ sorted by reverse chronological order
○ Finding blog
■ navigate through curated directories
About search@tumblr
● Search Team
○ July 2012: Jak joined as the first search engineer!
○ Jak, Yufei, Bennett, Beitao, Patrick, Adam

● Features launched in 2013
○ Post search, Blog search, Theme search
○ Typeaheads, Recommendation, Trends
Whole New Search
Post search
● full text search
● top and recent
● post type filtering

Blog search
● name & title
● top tags in posts
● blog highlights

Related search
● term co-occurrence
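
The slide only names term co-occurrence as the basis for related search. A minimal sketch of the idea, assuming co-occurrence is counted over the tags of each post; the function names and the sample posts are hypothetical and not taken from the talk:

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(tag_lists):
    """Count how often each pair of terms appears on the same post."""
    pair_counts = defaultdict(Counter)
    for tags in tag_lists:
        # Deduplicate and sort so each unordered pair is counted once per post.
        for a, b in combinations(sorted(set(tags)), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    return pair_counts


def related_terms(pair_counts, term, k=3):
    """Return up to k terms that co-occur most often with `term`."""
    return [t for t, _ in pair_counts[term].most_common(k)]


# Hypothetical sample data.
posts = [
    ["cats", "gif", "funny"],
    ["cats", "gif"],
    ["cats", "photography"],
    ["dogs", "funny"],
]
counts = build_cooccurrence(posts)
print(related_terms(counts, "cats"))
```

A table like this is cheap to build in a batch job and to serve by key lookup, which fits the offline precomputation described later in the deck.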
Typeahead Autocompletes
Search Autocomplete

Mention Autocomplete

Tag Suggest

● Interactive guide of tumblr content
● High volume of traffic
● Low latency
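
One common way to serve prefix lookups with low latency is a sorted term list searched with binary search. The sketch below illustrates that general technique under assumed data; it is not a description of Tumblr's typeahead indices, and `PrefixIndex` and the scores are hypothetical:

```python
import bisect


class PrefixIndex:
    """Sorted term list supporting prefix lookups via binary search."""

    def __init__(self, term_scores):
        # term_scores: dict of term -> popularity score (hypothetical data).
        self._scores = dict(term_scores)
        self._terms = sorted(self._scores)

    def suggest(self, prefix, k=5):
        # Terms sharing a prefix form one contiguous run in sorted order,
        # starting at the first term that is >= the prefix.
        i = bisect.bisect_left(self._terms, prefix)
        matches = []
        while i < len(self._terms) and self._terms[i].startswith(prefix):
            matches.append(self._terms[i])
            i += 1
        # Rank candidates by score, highest first.
        return sorted(matches, key=lambda t: -self._scores[t])[:k]


index = PrefixIndex({"cat": 5, "cats": 9, "catalog": 1, "dog": 7})
print(index.suggest("cat", k=2))
```

Because the term list is built ahead of time, each request costs one binary search plus a scan over the matching run.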
Recommendations
Personalized Recommendation

Weekly Dashboard Digest
Trends
Trending Tags

Trending Blogs
Theme Search
Search Architecture
● Online: Search Online Framework
○ Post Search, Blog Search, Typeahead, Related Tags, Blog Recommend
○ Blog Highlights, Blog Top Tags, Trending Tags, Trending Blogs, Trending Posts

● Data
○ Recent Post Index, Blog Full Index, Theme Index, Blog Top-K Index
○ Follower Counts, Post Notecount, Post Model
○ Personalized, Blog Index, Trending Blogs, Trending Posts, Trending Tags
○ Related Tag Index, Blog Global Rank, Blog Model, User Model, Typeahead Indices
○ Top Post Index, Blog Top Posts, Blog Top Tags
○ Two Degree, Like Root, Blog Feedback, In-Blog Tag Index, Global Tag Index

● Offline: Search Offline Framework
○ Rediscover, Solr, MySQL
○ Activity Streams (Fire Geyser)
○ Scribe logs, Sqoop tables (HDFS)

● Nginx, Linux
Software Stack
● Search Online
○ HAProxy, Nginx, PHP
○ Memcache
○ Icinga, Scribe, OpenTSDB

● Search Data
○ Solr, Redis, MySQL

● Search Offline
○ Sqoop, Hadoop
○ Java, Hive, Pig, Scalding, Python
Search Online Framework
● Search Services

● SearchBase
○ Search Flow Execution
○ Multi-level Caching
○ Search Logging
○ Async Execution
○ Search Editorial

● Interfaces and implementations
○ QueryIF
■ SimpleQuery, PersonalizedQuery, AdvancedPostQuery, TimeSliceQuery, TrendTagQuery
○ RetrieverIF
■ SolrPostRetriever, MysqlPostRetriever, SMPostRetriever, TumblelogRetriever, TagTypeaheadReteriever
○ SignalFetcherIF
■ NotecountFetcher, FollowercountFetcher, TumblelogGlobalRankFetcher, RecommendationSignalFetcher, BlogTopTagFetcher
○ RankerIF
■ TopPostRanker, TumblelogRanker, RelatedPostRanker, TumblelogMixingRanker
○ DocFetcherIF
■ PostFetcher, TumblelogFetcher, TagFetcher
○ FilterIF
■ PostFilter, TumblelogFilter, TagFilter
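
The interface names suggest a staged flow: retrieve candidates, fetch signals, rank, fetch documents, filter. The sketch below chains those stages in that order, which is an assumption read off the left-to-right layout of the slide. The production stack is PHP; Python is used here only for illustration, and `SearchFlow`, `POSTS` and `NOTES` are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SearchFlow:
    """Chains the stages; each stage is a plain callable in this sketch."""
    retrieve: Callable[[str], List[int]]                       # RetrieverIF role
    fetch_signals: Callable[[List[int]], Dict[int, float]]     # SignalFetcherIF role
    rank: Callable[[List[int], Dict[int, float]], List[int]]   # RankerIF role
    fetch_docs: Callable[[List[int]], List[dict]]              # DocFetcherIF role
    keep: Callable[[dict], bool]                               # FilterIF role

    def run(self, query: str, limit: int = 10) -> List[dict]:
        ids = self.retrieve(query)          # candidate ids only, no document text
        signals = self.fetch_signals(ids)   # signals live outside the index
        ranked = self.rank(ids, signals)
        docs = self.fetch_docs(ranked)
        return [d for d in docs if self.keep(d)][:limit]


# Hypothetical in-memory stand-ins for the index, signal store and doc store.
POSTS = {
    1: {"id": 1, "text": "cat gif", "hidden": False},
    2: {"id": 2, "text": "cat photo", "hidden": True},
    3: {"id": 3, "text": "cat meme", "hidden": False},
}
NOTES = {1: 10, 2: 500, 3: 42}

flow = SearchFlow(
    retrieve=lambda q: [pid for pid, p in POSTS.items() if q in p["text"].split()],
    fetch_signals=lambda ids: {pid: NOTES.get(pid, 0) for pid in ids},
    rank=lambda ids, sig: sorted(ids, key=lambda pid: -sig[pid]),
    fetch_docs=lambda ids: [POSTS[pid] for pid in ids],
    keep=lambda doc: not doc["hidden"],
)
results = flow.run("cat")
print([doc["id"] for doc in results])
```

Keeping each stage behind its own interface lets one flow mix, for example, a Solr-backed retriever with a note-count signal fetcher without either knowing about the other.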
Search Batch Processing
● Search Data (Redis)

● Search Workflow Engine
○ Workflow Composition
○ Dependency Resolution
○ Automatic Versioning
○ Data Verification
○ Execution Logging
○ Failure Detection/Alert

● Search Task Base
○ Hive Jobs, Pig Jobs, Scalding Jobs, Streaming Jobs
○ Term Generators, Top-K Indexer, Delta Propagator
○ Lucille2 Classes

● Scribe Logs, Sqoop Tables (HDFS)
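
Dependency resolution in a batch workflow is commonly done with a topological sort. A minimal sketch using Python's standard `graphlib` module (Python 3.9+); the job names are hypothetical and the talk does not say how the workflow engine implements this step:

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """dependencies maps each job to the set of jobs it must wait for."""
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical job graph: each job lists the jobs that must finish first.
jobs = {
    "extract_terms": set(),
    "count_cooccurrence": {"extract_terms"},
    "build_topk_index": {"count_cooccurrence"},
    "propagate_delta": {"build_topk_index"},
}
order = execution_order(jobs)
print(order)

# A cyclic graph cannot be scheduled; graphlib reports it as CycleError.
try:
    execution_order({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
print(cycle_detected)
```

Detecting cycles before any job runs is one simple form of the failure detection listed on the slide.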
Indexing
● 3-Tier indices
○ Indexing all posts
■ 600+ machines
○ Recent (6 weeks) + Popular (4 years) + existing tag table
■ Down to 40 machines
■ Minor loss in coverage
■ Serves up to 4K qps (non-cached)

● Lean index
○ Separate signals from index
■ Eliminates high-volume re-indexing
■ Signal engineering independent of indexing
○ Separate document text from index
■ Reduces the memory footprint
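
The lean-index idea can be sketched with three separate stores: postings that map terms to post ids, a signal store, and a document store. This is an illustrative in-memory model under assumed names (`LeanIndex`, `index_writes`), not the Solr setup from the talk:

```python
from collections import defaultdict


class LeanIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> post ids (the index proper)
        self.signals = {}                 # post id -> note count, kept separately
        self.documents = {}               # post id -> full text, kept separately
        self.index_writes = 0             # counts writes that touch the postings

    def add_post(self, post_id, text):
        for term in text.lower().split():
            self.postings[term].add(post_id)
        self.documents[post_id] = text
        self.signals.setdefault(post_id, 0)
        self.index_writes += 1

    def update_notecount(self, post_id, count):
        # Signal-only update: the postings are untouched, so nothing is re-indexed.
        self.signals[post_id] = count

    def search(self, term, k=10):
        ids = self.postings.get(term.lower(), set())
        top = sorted(ids, key=lambda pid: -self.signals.get(pid, 0))[:k]
        # Document text is fetched only for the few results that are returned.
        return [self.documents[pid] for pid in top]


idx = LeanIndex()
idx.add_post(1, "Cat gif")
idx.add_post(2, "cat photo")
idx.update_notecount(2, 99)
print(idx.search("cat"), idx.index_writes)
```

Note counts change far more often than post text, so keeping them out of the index removes most write traffic from it.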
Ranking
● Quickly evolving!
● Major ranking signals in production
○ Global popularity
■ likes, reblogs, follows
○ Local popularity
■ popularity projected on <user, query>
● blog search: aggregated likes on query term
● blog recommendation: follow counts among friends
○ Textual relevancy
■ how: exact match, query proximity
■ where: name, title, tag, mention, body, etc
○ Recency
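
The slide lists the signal families but not how they are combined. As a purely illustrative sketch, a weighted sum with log-dampened counts and an exponential recency decay could look like the following; the weights, the half-life, and every field name are assumptions, not production values:

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    post_id: int
    likes: int           # global popularity
    reblogs: int         # global popularity
    network_likes: int   # local popularity for this <user, query> (hypothetical field)
    text_match: float    # textual relevancy in [0, 1], computed elsewhere
    age_days: float      # input to the recency term


# Hypothetical weights, chosen only to make the example concrete.
WEIGHTS = {"global": 1.0, "local": 2.0, "text": 5.0, "recency": 3.0}


def score(c, weights=WEIGHTS, half_life_days=30.0):
    global_pop = math.log1p(c.likes + c.reblogs)    # dampen very large counts
    local_pop = math.log1p(c.network_likes)
    recency = 0.5 ** (c.age_days / half_life_days)  # halves every half_life_days
    return (weights["global"] * global_pop
            + weights["local"] * local_pop
            + weights["text"] * c.text_match
            + weights["recency"] * recency)


def rank(candidates):
    return sorted(candidates, key=score, reverse=True)


old = Candidate(1, likes=10, reblogs=5, network_likes=0, text_match=0.8, age_days=400.0)
new = Candidate(2, likes=10, reblogs=5, network_likes=0, text_match=0.8, age_days=1.0)
print([c.post_id for c in rank([old, new])])
```

A hand-tuned linear form like this is the kind of baseline that learning to rank, mentioned under "What's Next", would typically replace.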
Duplicate Elimination (DE)
● Index-time DE
○ post signature
■ number of tags > N1
■ md5 hash of normalized tag list

● Search-time DE
○ Media DE
■ posts with same media hashes.
○ Near DE
■ posts with tags > N2
■ mark as near duplicate if diff <= N3 tags
■ older posts selected as original
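
The two tag-based checks above can be sketched as follows. The thresholds N1, N2 and N3 are not given on the slide, so the values here are placeholders; "normalized" is assumed to mean lowercased, trimmed, deduplicated and sorted, and "diff" is assumed to be the size of the symmetric difference between tag sets:

```python
import hashlib

# Placeholder thresholds; the slide does not state the real values.
N1, N2, N3 = 3, 3, 1


def normalize(tags):
    """Assumed normalization: trim, lowercase, deduplicate, sort."""
    return sorted({t.strip().lower() for t in tags if t.strip()})


def post_signature(tags, n1=N1):
    """Index-time DE: md5 of the normalized tag list when there are more than n1 tags."""
    norm = normalize(tags)
    if len(norm) <= n1:
        return None  # too few tags for a meaningful signature
    return hashlib.md5("\x1f".join(norm).encode("utf-8")).hexdigest()


def tag_diff(a, b):
    """Number of tags present in exactly one of the two lists."""
    return len(set(a) ^ set(b))


def drop_near_duplicates(posts, n2=N2, n3=N3):
    """Search-time near DE. posts: (post_id, timestamp, tags). Returns kept ids."""
    kept = []
    # Visit oldest first so the older post is the one kept as the original.
    for post_id, _, tags in sorted(posts, key=lambda p: p[1]):
        norm = normalize(tags)
        is_dup = len(norm) > n2 and any(
            len(kept_tags) > n2 and tag_diff(norm, kept_tags) <= n3
            for _, kept_tags in kept
        )
        if not is_dup:
            kept.append((post_id, norm))
    return [pid for pid, _ in kept]


results = [
    (1, 100, ["a", "b", "c", "d"]),
    (2, 200, ["A", "b", "c", "d", "e"]),
    (3, 300, ["x", "y"]),
]
print(drop_near_duplicates(results))
```

The pairwise comparison is quadratic in the number of results, which is acceptable for a single result page but would need a different approach at index scale.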
Search Platform
● A curvy road
○ Started with ElasticSearch
○ Switched to SolrCloud due to reliability
○ Ended up with Solr + Customized Clustering

● Our takes
○ ElasticSearch and SolrCloud have great functionality
■ distributed indexing and search
■ easy cluster management
○ Solr still seems much more reliable under high indexing load and search traffic.
Offline Precomputation
● Benefits
○ Minimizes online search latency
○ Allows more sophisticated/expensive computation

● Limitations
○ Loss of freshness
○ Expensive for long-tail queries and results

● Precomputed
○ Typeaheads
○ Related search
○ Blog recommendation
○ Top posts of Blog / User
What’s Next
● In-blog search
○ full text search on all posts in a blog
○ original posts, reblogs, likes

● Ranking
○ more effective and spam-resilient signals
○ learning to rank

● Topical interest modeling
○ supervised and unsupervised
○ blog content and user activities
○ interest based blog recommendation

● Content discovery
○ trending content in various categories
Q&A
Question: Are you hiring?
Answer: Yeah! Check it out at http://www.tumblr.com/jobs

More questions please, :-)

Search at Tumblr (nyc search meetup)
