Search at Tumblr
Yufei Pan
Director of Search, Tumblr
16 January 2013
Tumblr - Follow the World’s Creators
Founded
● David Karp
● February 2007

Publishing Platform
● 163 million blogs
● 72 billion posts

Social Network
● Follow, Mention
● Like, Reblog
About search@tumblr
● Most important way to discover great content
○ 50M searches a day

● Limited search for a long time (2007-2012)
○ Tagged page
■ MySQL lookup of a single tag id
■ sorted in reverse chronological order
○ Finding blogs
■ navigate through curated directories
About search@tumblr
● Search Team
○ July 2012: Jak joined as the first search engineer!
○ Team: Jak, Yufei, Bennett, Beitao, Patrick, Adam

● Features launched in 2013
○ Post search, Blog search, Theme search
○ Typeaheads, Recommendation, Trends
Whole New Search
Post search
● full text search
● top and recent
● post type filtering

Blog search
● name & title
● top tags in posts
● blog highlights

Related search
● term co-occurrence
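The slide names term co-occurrence as the basis for related search but gives no details. A minimal sketch of the idea, assuming co-occurrence is counted over the tags of each post; the function names, the lowercase normalization, and the sample posts are illustrative and not from the talk:

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tag_lists):
    """Count how often each unordered pair of tags appears on the same post."""
    pairs = Counter()
    for tags in tag_lists:
        # Assumed normalization: trim, lowercase, drop empties and duplicates.
        unique = sorted({t.strip().lower() for t in tags if t.strip()})
        for a, b in combinations(unique, 2):
            pairs[(a, b)] += 1
    return pairs


def related_terms(term, pairs, k=3):
    """Return up to k terms that co-occur most often with `term`."""
    scores = Counter()
    for (a, b), n in pairs.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    # Highest count first; ties broken alphabetically for a stable order.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ordered[:k]]


posts = [
    ["cats", "gif", "funny"],
    ["cats", "funny"],
    ["cats", "photography"],
    ["dogs", "funny"],
]
pairs = cooccurrence_counts(posts)
print(related_terms("cats", pairs))  # ['funny', 'gif', 'photography']
```

At production scale this counting would more plausibly run as an offline batch job than in process, which fits the precomputation slide later in the deck.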
Typeahead Autocompletes
Search Autocomplete

Mention Autocomplete

Tag Suggest

● Interactive guide of tumblr content
● High volume of traffic
● Low latency
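One common way to serve prefix lookups at low latency is a sorted term list with binary search. The sketch below illustrates that general technique only; the talk does not say how Tumblr's typeahead indices are built, and the `PrefixIndex` class, its weights, and the sample terms are hypothetical:

```python
import bisect


class PrefixIndex:
    """Sorted-list prefix index; weights stand in for term popularity."""

    def __init__(self, weighted_terms):
        self._weights = {t.lower(): w for t, w in weighted_terms.items()}
        self._terms = sorted(self._weights)

    def suggest(self, prefix, k=5):
        prefix = prefix.lower()
        if not prefix:
            return []
        # Binary search finds the first term >= prefix; all matches follow it.
        lo = bisect.bisect_left(self._terms, prefix)
        hi = lo
        while hi < len(self._terms) and self._terms[hi].startswith(prefix):
            hi += 1
        matches = self._terms[lo:hi]
        # Most popular first; ties broken alphabetically.
        matches.sort(key=lambda t: (-self._weights[t], t))
        return matches[:k]


index = PrefixIndex({"cats": 90, "cat gifs": 120, "catalog": 10, "dogs": 80})
print(index.suggest("cat", k=2))  # ['cat gifs', 'cats']
```

For very hot prefixes, precomputing the top-k list per prefix offline would avoid the per-request sort.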
Recommendations
Personalized Recommendation

Weekly Dashboard Digest
Trends
Trending Tags

Trending Blogs
Theme Search
Search Architecture
(diagram: components listed by layer)
● Online
○ Post Search, Blog Search, Typeahead, Related Tags, Blog Recommend
○ Blog Highlights, Blog Top Tags, Trending Tags, Trending Blogs, Trending Posts
○ Search Online Framework

● Data
○ Recent Post Index, Blog Full Index, Theme Index, Blog Top-K Index
○ Follower Counts, Post Notecount, Post Model
○ Personalized, Blog Index, Trending Blogs, Trending Posts, Trending Tags
○ Related Tag Index, Blog Global Rank, Blog Model, User Model, Typeahead Indices
○ Top Post Index, Blog Top Posts, Blog Top Tags, Two Degree, Like Root
○ Blog Feedback, In-Blog Tag Index, Global Tag Index

● Offline
○ Search Offline Framework
○ MySQL, Activity Streams (Fire Geyser)
○ Scribe logs, Sqoop tables (HDFS)

● Other labels in the diagram: Rediscover, Solr, Nginx, Linux
Software Stack
● Search Online
○ HAProxy, Nginx, PHP
○ Memcache
○ Icinga, Scribe, OpenTSDB

● Search Data
○ Solr, Redis, MySQL

● Search Offline
○ Sqoop, Hadoop
○ Java, Hive, Pig, Scalding, Python
Search Online Framework
● Components
○ Search Services, SearchBase
○ Search Flow Execution, Multi-level Caching, Search Logging
○ Async Execution, Search Editorial

● Interfaces and implementations
○ QueryIF
■ SimpleQuery, PersonalizedQuery, AdvancedPostQuery, TimeSliceQuery, TrendTagQuery
○ RetrieverIF
■ SolrPostRetriever, MysqlPostRetriever, SMPostRetriever, TumblelogRetriever, TagTypeaheadRetriever
○ SignalFetcherIF
■ NotecountFetcher, FollowercountFetcher, TumblelogGlobalRankFetcher, RecommendationSignalFetcher, BlogTopTagFetcher
○ RankerIF
■ TopPostRanker, TumblelogRanker, RelatedPostRanker, TumblelogMixingRanker
○ DocFetcherIF
■ PostFetcher, TumblelogFetcher, TagFetcher
○ FilterIF
■ PostFilter, TumblelogFilter, TagFilter
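The interface names on this slide suggest a pipeline of pluggable stages. The sketch below shows one way such stages could compose, assuming they run in the order the slide lists them (query, retrieve, fetch signals, rank, fetch documents, filter). The production framework is PHP according to the software stack slide; the Python classes, method names, and sample data here are hypothetical stand-ins, not Tumblr's code:

```python
from dataclasses import dataclass


@dataclass
class Query:  # stands in for a QueryIF implementation
    text: str
    limit: int = 10


class InMemoryRetriever:  # RetrieverIF stand-in: term -> candidate doc ids
    def __init__(self, index):
        self.index = index

    def retrieve(self, query):
        return list(self.index.get(query.text.lower(), []))


class NotecountSignals:  # SignalFetcherIF stand-in: signals kept outside the index
    def __init__(self, notecounts):
        self.notecounts = notecounts

    def fetch(self, doc_ids):
        return {d: self.notecounts.get(d, 0) for d in doc_ids}


class SignalRanker:  # RankerIF stand-in: highest signal first, ties by id
    def rank(self, doc_ids, signals):
        return sorted(doc_ids, key=lambda d: (-signals[d], d))


class DictDocFetcher:  # DocFetcherIF stand-in: ids -> full documents
    def __init__(self, docs):
        self.docs = docs

    def fetch(self, doc_ids):
        return [self.docs[d] for d in doc_ids if d in self.docs]


class VisibleFilter:  # FilterIF stand-in: drop documents flagged as hidden
    def apply(self, docs):
        return [d for d in docs if not d.get("hidden", False)]


def run_search(query, retriever, signal_fetcher, ranker, doc_fetcher, result_filter):
    """Run the stages in order and truncate to the requested limit."""
    doc_ids = retriever.retrieve(query)
    signals = signal_fetcher.fetch(doc_ids)
    ranked = ranker.rank(doc_ids, signals)
    docs = doc_fetcher.fetch(ranked)
    return result_filter.apply(docs)[: query.limit]


results = run_search(
    Query("cats"),
    InMemoryRetriever({"cats": [1, 2, 3]}),
    NotecountSignals({1: 5, 2: 50, 3: 20}),
    SignalRanker(),
    DictDocFetcher({1: {"id": 1}, 2: {"id": 2}, 3: {"id": 3, "hidden": True}}),
    VisibleFilter(),
)
print([d["id"] for d in results])  # [2, 1]
```

Keeping each stage behind its own interface lets a new search type be assembled by swapping one implementation, for example a different retriever with the same ranker.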
Search Batch Processing
● Search Data (Redis)

● Search Workflow Engine
○ Workflow Composition
○ Dependency Resolution
○ Automatic Versioning
○ Data Verification
○ Execution Logging
○ Failure Detection/Alert

● Search Task Base
○ Hive Jobs, Pig Jobs, Scalding Jobs, Streaming Jobs
○ Term Generators, Top-K Indexer, Delta Propagator
○ Lucille2 Classes

● Scribe Logs, Sqoop Tables (HDFS)
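Dependency resolution in a workflow engine usually comes down to a topological sort of the task graph. A minimal sketch using Python's standard `graphlib` module (Python 3.9+); the task names are made up for illustration, and nothing here reflects how the Search Workflow Engine is actually implemented:

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """Return tasks in an order where every task follows its prerequisites.

    `dependencies` maps each task name to the set of tasks it depends on.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical batch tasks, each depending on the one before it.
tasks = {
    "import_logs": set(),
    "generate_terms": {"import_logs"},
    "build_topk_index": {"generate_terms"},
    "propagate_delta": {"build_topk_index"},
}
order = execution_order(tasks)
print(order)
# ['import_logs', 'generate_terms', 'build_topk_index', 'propagate_delta']

try:
    execution_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("cycle detected")
```

A real engine would add the other features the slide lists, such as versioned outputs, data verification, and alerting on failure, around this ordering step.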
Indexing
● 3-Tier indices
○ Index all posts
■ 600+ machines
○ Recent (6 weeks) + Popular (4 years) + existing tag table
■ Down to 40 machines
■ Minor loss in coverage
■ Serve up to 4K qps (non-cached)

● Lean index
○ Separate signals from index
■ Eliminate high volume re-indexing
■ Signal engineering independent of indexing
○ Separate document text from index
■ Reduces the memory footprint
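The tiering rule can be pictured as a small routing function. This sketch reads "6W" as six weeks and "4Y" as four years, and assumes that a post outside both windows is reachable only through the existing tag table. The popularity threshold and the tier names are hypothetical; the slides give neither:

```python
from datetime import datetime, timedelta

RECENT_WINDOW = timedelta(weeks=6)        # "Recent (6W)", read as six weeks
POPULAR_WINDOW = timedelta(days=4 * 365)  # "Popular (4Y)", read as four years
POPULAR_MIN_NOTES = 100                   # hypothetical popularity cutoff


def index_tiers(created_at, notecount, now):
    """Return the tiers that would hold a post under the assumptions above."""
    age = now - created_at
    tiers = []
    if age <= RECENT_WINDOW:
        tiers.append("recent")
    if age <= POPULAR_WINDOW and notecount >= POPULAR_MIN_NOTES:
        tiers.append("popular")
    # Anything else falls back to the pre-existing tag lookup table.
    return tiers or ["tag_table"]


now = datetime(2014, 1, 16)
print(index_tiers(datetime(2014, 1, 1), 3, now))     # ['recent']
print(index_tiers(datetime(2012, 6, 1), 5000, now))  # ['popular']
print(index_tiers(datetime(2012, 6, 1), 3, now))     # ['tag_table']
```

The "minor loss in coverage" on the slide would correspond to the last case: old, unpopular posts that are no longer in a full-text index.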
Ranking
● Quickly evolving!
● Major ranking signals in production
○ Global popularity
■ likes, reblogs, follows
○ Local popularity
■ popularity projected on <user, query>
■ blog search: aggregated likes on query term
■ blog recommendation: follow counts among friends

○ Textual relevancy
■ how: exact match, query proximity
■ where: name, title, tag, mention, body, etc
○ Recency
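To make the signal list concrete, here is a sketch of a linear scoring function that blends the four signal families. The slides do not say how the signals are combined, so the weights, the log damping, the exponential recency decay, and the field names are all assumptions chosen for illustration:

```python
import math

# Hypothetical weights; the talk gives no values.
WEIGHTS = {"global": 1.0, "local": 2.0, "text": 3.0, "recency": 1.5}
SECONDS_PER_DAY = 86400.0


def score(doc, now_ts, half_life_days=7.0):
    """Blend global popularity, local popularity, text relevancy and recency."""
    # Log damping keeps very popular posts from dominating every query.
    global_pop = math.log1p(doc["likes"] + doc["reblogs"] + doc["follows"])
    local_pop = math.log1p(doc["local_likes"])
    text = doc["text_match"]  # assumed to be pre-scaled to the range 0..1
    age_days = max(0.0, (now_ts - doc["created_ts"]) / SECONDS_PER_DAY)
    recency = 0.5 ** (age_days / half_life_days)  # halves every half-life
    return (
        WEIGHTS["global"] * global_pop
        + WEIGHTS["local"] * local_pop
        + WEIGHTS["text"] * text
        + WEIGHTS["recency"] * recency
    )


now = 1_400_000_000
fresh = {"likes": 10, "reblogs": 5, "follows": 0, "local_likes": 2,
         "text_match": 0.8, "created_ts": now}
old = dict(fresh, created_ts=now - 70 * 86400)
print(score(fresh, now) > score(old, now))  # True
```

The slide's "learning to rank" item under What's Next would replace hand-set weights like these with a trained model.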
Duplicate Elimination (DE)
● Index-time DE
○ post signature
■ number of tags > N1
■ md5 hash of normalized tag list

● Search-time DE
○ Media DE
■ posts with same media hashes.
○ Near DE
■ posts with tags > N2
■ mark as near duplicate if diff <= N3 tags
■ older posts selected as original
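Both duplicate checks can be sketched in a few lines. The thresholds N1, N2 and N3 are left as parameters because the slide does not give their values. The normalization (trim, lowercase, dedupe, sort), the reading of "diff" as the size of the symmetric difference between tag sets, and the field name `created_ts` are assumptions:

```python
import hashlib


def normalize_tags(tags):
    """Assumed normalization: trim, lowercase, drop empties, dedupe, sort."""
    return sorted({t.strip().lower() for t in tags if t.strip()})


def post_signature(tags, n1):
    """Index-time DE: md5 of the normalized tag list, for posts with > n1 tags."""
    norm = normalize_tags(tags)
    if len(norm) <= n1:
        return None  # too few tags for the signature to be meaningful
    joined = "\x1f".join(norm)  # unit separator cannot appear in a trimmed tag
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def is_near_duplicate(tags_a, tags_b, n2, n3):
    """Search-time near DE: both posts have > n2 tags and differ by <= n3 tags."""
    a, b = set(normalize_tags(tags_a)), set(normalize_tags(tags_b))
    if len(a) <= n2 or len(b) <= n2:
        return False
    return len(a ^ b) <= n3


def pick_original(post_a, post_b):
    """The older post is kept as the original."""
    return post_a if post_a["created_ts"] <= post_b["created_ts"] else post_b


sig_a = post_signature(["Cats", "funny", "gif"], n1=2)
sig_b = post_signature(["gif", " cats", "FUNNY"], n1=2)
print(sig_a == sig_b)  # True
print(is_near_duplicate(["a", "b", "c", "d"], ["a", "b", "c", "e"], n2=3, n3=2))  # True
```

Requiring a minimum tag count in both checks avoids collapsing unrelated posts that merely share one or two common tags.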
Search Platform
● A curvy road
○ Started with ElasticSearch
○ Switched to SolrCloud due to reliability
○ Ended up with Solr + Customized Clustering

● Our takes
○ ElasticSearch and SolrCloud have great functionality
■ distributed indexing and search
■ easy cluster management
○ Solr still seems much more reliable under high indexing load and search traffic.
Offline Precomputation
● Benefits
○ Minimize the search online latency
○ More sophisticated/expensive computation

● Limitations
○ Loss of freshness
○ Expensive for long-tail queries and results

● Precomputed
○ Typeaheads
○ Related search
○ Blog recommendation
○ Top posts of Blog / User
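As an illustration of the trade-off, the sketch below precomputes the top posts per blog in a batch step so that the online path becomes a single key lookup. It assumes notecount is the ranking signal; the field names and sample data are hypothetical, and in production the batch would more plausibly be a Hadoop job writing to a store such as Redis than an in-process dict:

```python
import heapq
from collections import defaultdict


def precompute_top_posts(posts, k=3):
    """Offline step: map each blog to the ids of its k most-noted posts."""
    by_blog = defaultdict(list)
    for post in posts:
        by_blog[post["blog"]].append(post)
    return {
        blog: [
            p["id"]
            # Most notes first; on ties the lower (older) id wins.
            for p in heapq.nlargest(k, items, key=lambda p: (p["notes"], -p["id"]))
        ]
        for blog, items in by_blog.items()
    }


posts = [
    {"id": 1, "blog": "a", "notes": 10},
    {"id": 2, "blog": "a", "notes": 30},
    {"id": 3, "blog": "a", "notes": 20},
    {"id": 4, "blog": "b", "notes": 5},
]
top_posts = precompute_top_posts(posts, k=2)

# Online step: one lookup with no ranking work at request time.
print(top_posts.get("a", []))  # [2, 3]
```

The table is only as fresh as the last batch run, which is the loss of freshness the slide lists as a limitation.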
What’s Next
● In-blog search
○ full text search on all posts in a blog
○ original posts, reblogs, likes

● Ranking
○ more effective and spam-resilient signals
○ learning to rank

● Topical interest modeling
○ supervised and unsupervised
○ blog content and user activities
○ interest based blog recommendation

● Content discovery
○ trending content in various categories
Q&A
Question: Are you hiring?
Answer: Yeah! Check it out at http://www.tumblr.com/jobs

More questions please, :-)

 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 

Search at Tumblr by Yufei Pan

  • 1. Search at Tumblr Yufei Pan Director of Search, Tumblr 16 January 2013
  • 2. Tumblr - Follow the World’s Creators
      Founded
      ● David Karp
      ● February 2007
      Publishing Platform
      ● 163 million blogs
      ● 72 billion posts
      Social Network
      ● Follow, Mention
      ● Like, Reblog
  • 3. About search@tumblr
      ● Most important way to discover great content
        ○ 50M searches a day
      ● Limited search for a long time (2007-2012)
        ○ Tagged page
          ■ MySQL lookup of a single tag id
          ■ sorted in reverse chronological order
        ○ Finding blogs
          ■ navigate through curated directories
  • 4. About search@tumblr
      ● Search Team
        ○ July 2012: Jak joined as first search engineer!
        ○ Jak, Yufei, Bennett, Beitao, Patrick, Adam
      ● Features launched in 2013
        ○ Post search, Blog search, Theme search
        ○ Typeaheads, Recommendation, Trends
  • 5. Whole New Search
      Post search
      ● full text search
      ● top and recent
      ● post type filtering
      Blog search
      ● name & title
      ● top tags in posts
      ● blog highlights
      Related search
      ● term co-occurrence
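A minimal sketch of how related search could be derived from term co-occurrence, assuming tags are the terms and that two tags are related when they appear on the same post. The function names and the sample posts are hypothetical; the slides do not describe the actual computation.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(posts):
    """Count, for every tag, how often each other tag shares a post with it."""
    related = defaultdict(Counter)
    for tags in posts:
        # Assumed normalization: trim, lowercase, drop empties and repeats.
        unique = sorted({t.strip().lower() for t in tags if t.strip()})
        for a, b in combinations(unique, 2):
            related[a][b] += 1
            related[b][a] += 1
    return related


def related_tags(related, tag, k=3):
    """Return up to k tags that co-occur most often with `tag`."""
    counts = related.get(tag.strip().lower(), Counter())
    return [t for t, _ in counts.most_common(k)]


posts = [
    ["cats", "gif", "funny"],
    ["cats", "gif"],
    ["cats", "funny"],
    ["dogs", "gif"],
    ["Cats", "GIF"],
]
cooc = build_cooccurrence(posts)
print(related_tags(cooc, "cats", k=2))
```

Raw counts favor very common tags, so a real system would probably normalize them (for example by tag frequency), which this sketch leaves out.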
  • 6. Typeahead Autocompletes
      Search Autocomplete, Mention Autocomplete, Tag Suggest
      ● Interactive guide of tumblr content
      ● High volume of traffic
      ● Low latency
  • 10. Search Architecture Post Search Blog Search Typeahead Related Tags Blog Recommend Blog Highlights Blog Top Tags Trending Tags Trending Blogs Trending Posts Online Search Online Framework Recent Post Index Blog Full Index Theme Index Blog Top-K Index Follower Counts Post Notecount Post Model Personalized Blog Index Trending Blogs Trending Posts Trending Tags Related Tag Index Blog Global Rank Blog Model User Model Typeahead Indices Data Top Post Index Blog Top Posts Blog Top Tags Two Degree Like Root Blog Feedback In-Blog Tag Index Global Tag Index Search Offline Framework Rediscover Solr Offline MySQL Activity Streams (Fire Geyser) Scribe logs, Sqoop tables (HDFS) Nginx Linux
  • 11. Software Stack
      ● Search Online
        ○ HAProxy, Nginx, PHP
        ○ Memcache
        ○ Icinga, Scribe, OpenTSDB
      ● Search Data
        ○ Solr, Redis, MySQL
      ● Search Offline
        ○ Sqoop, Hadoop
        ○ Java, Hive, Pig, Scalding, Python
  • 12. Search Online Framework Search Services SearchBase Search Flow Execution Multi-level Caching Search Logging Async Execution Search Editorial QueryIF RetrieverIF SignalFetcherIF RankerIF DocFetcherIF FilterIF SimpleQuery SolrPostRetriever NotecountFetcher TopPostRanker PostFetcher PostFilter PersonalizedQuery MysqlPostRetriever FollowercountFetcher TumblelogRanker TumblelogFetcher TumblelogFilter AdvancedPostQuery SMPostRetriever TumblelogGlobalRankFetcher RelatedPostRanker TagFetcher TagFilter RecommendationSignalFetcher TumblelogMixingRanker TimeSliceQuery TrendTagQuery TumblelogRetriever TagTypeaheadRetriever BlogTopTagFetcher
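The interface names on this slide (RetrieverIF, SignalFetcherIF, RankerIF, DocFetcherIF, FilterIF) suggest a staged search flow. Below is a rough sketch of such a flow, assuming the stages run in the order they are listed on the slide; the `SearchFlow` class, its field names, and the in-memory sample data are hypothetical and only illustrate the idea of pluggable stages.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SearchFlow:
    """One search service composed of pluggable stages (assumed ordering)."""
    retrieve: Callable[[str], List[int]]                      # RetrieverIF role
    fetch_signals: Callable[[List[int]], Dict[int, float]]    # SignalFetcherIF role
    rank: Callable[[List[int], Dict[int, float]], List[int]]  # RankerIF role
    fetch_docs: Callable[[List[int]], List[dict]]             # DocFetcherIF role
    keep: Callable[[dict], bool]                              # FilterIF role

    def execute(self, query: str, limit: int = 10) -> List[dict]:
        ids = self.retrieve(query)          # candidate ids from an index
        signals = self.fetch_signals(ids)   # signals stored outside the index
        ranked = self.rank(ids, signals)    # order candidates by signal
        docs = self.fetch_docs(ranked)      # hydrate full documents
        return [d for d in docs if self.keep(d)][:limit]


# Hypothetical in-memory stand-ins for an index, a notecount store, and posts.
INDEX = {"cats": [1, 2, 3]}
NOTECOUNTS = {1: 5.0, 2: 50.0, 3: 20.0}
POSTS = {
    1: {"id": 1, "type": "text"},
    2: {"id": 2, "type": "photo"},
    3: {"id": 3, "type": "photo"},
}

photo_search = SearchFlow(
    retrieve=lambda q: INDEX.get(q, []),
    fetch_signals=lambda ids: {i: NOTECOUNTS.get(i, 0.0) for i in ids},
    rank=lambda ids, s: sorted(ids, key=lambda i: s[i], reverse=True),
    fetch_docs=lambda ids: [POSTS[i] for i in ids],
    keep=lambda d: d["type"] == "photo",
)
results = photo_search.execute("cats")
print([d["id"] for d in results])
```

Swapping one stage (say, a MySQL-backed retriever for a Solr-backed one) leaves the rest of the flow unchanged, which is presumably the point of the interface split.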
  • 13. Search Batch Processing Search Data (Redis) Workflow Composition Dependency Resolution Automatic Versioning Data Verification Execution Logging Failure Detection/Alert Search Workflow Engine Hive Jobs Term Generators Streaming Jobs Pig Jobs Top-K Indexer Delta Propagator Search Task Base Scribe Logs, Sqoop Tables (HDFS) Scalding Jobs Lucille2 Classes
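Dependency resolution and failure detection in a workflow engine can be sketched with a topological sort. This is an illustration only, assuming each job declares the jobs it depends on; the job names and the `run_workflow` helper are hypothetical, and the sketch uses Python's standard `graphlib` rather than whatever the Tumblr engine used.

```python
from graphlib import TopologicalSorter


def run_workflow(deps, tasks):
    """Run tasks in dependency order and return an execution log.

    deps maps a job name to the set of jobs it depends on.
    A job is skipped when any of its dependencies failed or was skipped.
    """
    log = []
    blocked = set()
    for name in TopologicalSorter(deps).static_order():
        if any(d in blocked for d in deps.get(name, ())):
            blocked.add(name)
            log.append((name, "skipped"))
            continue
        try:
            tasks[name]()
            log.append((name, "ok"))
        except Exception as exc:  # failure detection: record and block dependents
            blocked.add(name)
            log.append((name, f"failed: {exc}"))
    return log


def failing_job():
    raise RuntimeError("boom")


# Hypothetical three-step chain: raw counts -> term generation -> Top-K index.
deps = {
    "hive_counts": set(),
    "term_generator": {"hive_counts"},
    "topk_indexer": {"term_generator"},
}
tasks = {
    "hive_counts": lambda: None,
    "term_generator": failing_job,
    "topk_indexer": lambda: None,
}
print(run_workflow(deps, tasks))
```

`TopologicalSorter` raises `graphlib.CycleError` if the declared dependencies contain a cycle, so a malformed workflow fails before any job runs.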
  • 14. Indexing
      ● 3-Tier indices
        ○ Index all posts
          ■ 600+ machines
        ○ Recent (6W) + Popular (4Y) + Existing tag table
          ■ Down to 40 machines
          ■ Minor loss in coverage
          ■ Serve up to 4K qps (non-cached)
      ● Lean index
        ○ Separate signals from index
          ■ Eliminate high volume re-indexing
          ■ Independent signal engineering from indexing
        ○ Separate document text from index
          ■ Dropping the memory footprint
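One way to read the tiering on this slide is as a routing rule applied at indexing time. The sketch below assumes "6W" means six weeks and "4Y" means four years, and that "popular" is decided by a notecount threshold; the threshold value and the function name are hypothetical, since the slides give no cutoff.

```python
from datetime import datetime, timedelta

RECENT_WINDOW = timedelta(weeks=6)         # "Recent (6W)", assumed to mean six weeks
POPULAR_WINDOW = timedelta(days=4 * 365)   # "Popular (4Y)", assumed to mean four years
POPULAR_MIN_NOTES = 100                    # hypothetical popularity cutoff


def tiers_for(posted_at, notecount, now):
    """Return the index tiers a post would be written to.

    An empty list means the post is reachable only through the
    existing tag table, which is where the loss in coverage comes from.
    """
    age = now - posted_at
    tiers = []
    if age <= RECENT_WINDOW:
        tiers.append("recent")
    if age <= POPULAR_WINDOW and notecount >= POPULAR_MIN_NOTES:
        tiers.append("popular")
    return tiers


now = datetime(2013, 1, 16)
print(tiers_for(now - timedelta(weeks=1), 3, now))
print(tiers_for(now - timedelta(days=365), 500, now))
print(tiers_for(now - timedelta(days=365), 3, now))
```

Because notecounts change constantly, keeping them outside the text index (the "lean index" point) means a post does not have to be re-indexed each time it is liked or reblogged.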
  • 15. Ranking
      ● Quickly evolving!
      ● Major ranking signals in production
        ○ Global popularity
          ■ likes, reblogs, follows
        ○ Local popularity
          ■ popularity projected on <user, query>
            ● blog search: aggregated likes on query term
            ● blog recommendation: follow counts among friends
        ○ Textual relevancy
          ■ how: exact match, query proximity
          ■ where: name, title, tag, mention, body, etc.
        ○ Recency
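The slide lists the signals but not how they are combined. As a purely illustrative sketch, the function below blends popularity, textual relevancy, and recency with a weighted sum and an exponential time decay; the formula, the weights, and the half-life are all assumptions, not Tumblr's ranking function.

```python
import math


def score(likes, reblogs, text_match, age_days,
          w_pop=1.0, w_text=2.0, half_life_days=7.0):
    """Combine ranking signals into one score (illustrative formula).

    text_match is assumed to be a relevancy value in [0, 1],
    e.g. 1.0 for an exact match in the title or tags.
    """
    popularity = math.log1p(likes + reblogs)        # dampen very large notecounts
    recency = 0.5 ** (age_days / half_life_days)    # halves every half_life_days
    return (w_pop * popularity + w_text * text_match) * recency


fresh = score(likes=10, reblogs=5, text_match=1.0, age_days=0)
stale = score(likes=10, reblogs=5, text_match=1.0, age_days=14)
print(fresh > stale)
```

Log-scaling the counts keeps a handful of viral posts from drowning out textual relevancy; a learned model, as mentioned under "What's Next", would replace the hand-set weights.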
  • 16. Duplicate Elimination (DE)
      ● Index-time DE
        ○ post signature
          ■ number of tags > N1
          ■ md5 hash of normalized tag list
      ● Search-time DE
        ○ Media DE
          ■ posts with same media hashes
        ○ Near DE
          ■ posts with tags > N2
          ■ mark as near duplicate if diff <= N3 tags
          ■ older posts selected as original
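The tag-based rules above can be sketched as follows. The normalization (trim, lowercase, deduplicate, sort) and the reading of "diff" as the symmetric difference between two tag sets are assumptions; N1, N2, and N3 are passed in as parameters because the slides give no values.

```python
import hashlib


def normalize(tags):
    """Trim, lowercase, and deduplicate tags (an assumed normalization)."""
    return {t.strip().lower() for t in tags if t.strip()}


def post_signature(tags, n1):
    """Index-time DE: md5 of the normalized tag list, only if tag count > n1."""
    norm = sorted(normalize(tags))
    if len(norm) <= n1:
        return None  # too few tags to be a reliable signature
    joined = "\x1f".join(norm)  # unit separator avoids ambiguous joins
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def near_duplicates(posts, n2, n3):
    """Search-time near DE over (post_id, timestamp, tags) tuples.

    Posts are visited oldest first, so the oldest post of a group
    is kept as the original and later ones are marked as duplicates.
    """
    originals, dupes = [], set()
    for post_id, _, tags in sorted(posts, key=lambda p: p[1]):
        norm = normalize(tags)
        if len(norm) <= n2:
            continue  # rule applies only to posts with more than n2 tags
        if any(len(norm ^ seen) <= n3 for seen in originals):
            dupes.add(post_id)
        else:
            originals.append(norm)
    return dupes


posts = [
    (1, 100, ["a", "b", "c", "d"]),
    (2, 200, ["a", "b", "c", "d", "e"]),
    (3, 300, ["w", "x", "y", "z"]),
    (4, 50, ["a", "b"]),
]
print(near_duplicates(posts, n2=3, n3=1))
```

The pairwise comparison is quadratic in the number of candidates, which is acceptable at search time on one page of results but would not scale to a whole index.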
  • 17. Search Platform
      ● A curvy road
        ○ Started with ElasticSearch
        ○ Switched to SolrCloud due to reliability
        ○ Ended up with Solr + Customized Clustering
      ● Our takes
        ○ ElasticSearch and SolrCloud have great functionality
          ■ distributed indexing and search
          ■ easy cluster management
        ○ Solr still seems much more reliable under high indexing load and search traffic
  • 18. Offline Precomputation
      ● Benefits
        ○ Minimize online search latency
        ○ More sophisticated/expensive computation
      ● Limitations
        ○ Loss of freshness
        ○ Expensive for long-tail queries and results
      ● Precomputed
        ○ Typeaheads
        ○ Related search
        ○ Blog recommendation
        ○ Top posts of Blog / User
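To illustrate the trade-off, here is a minimal sketch of a precomputed typeahead: an offline job builds a prefix-to-top-K table so that the online path is a single key lookup. The function name, the prefix length cap, and the sample counts are hypothetical; in the stack described earlier such a table would plausibly live in Redis, but that is an assumption.

```python
import heapq
from collections import defaultdict


def build_typeahead(tag_counts, k=3, max_prefix=5):
    """Offline step: map every prefix (up to max_prefix chars) to its top-k tags."""
    buckets = defaultdict(list)
    for tag, count in tag_counts.items():
        for n in range(1, min(len(tag), max_prefix) + 1):
            buckets[tag[:n]].append((count, tag))
    # Keep only the k most frequent completions per prefix.
    return {
        prefix: [tag for _, tag in heapq.nlargest(k, items)]
        for prefix, items in buckets.items()
    }


# Hypothetical tag frequencies produced by an offline job.
counts = {"cats": 90, "cat gifs": 40, "cars": 70, "dogs": 10}
table = build_typeahead(counts, k=2)

# Online step: one lookup per keystroke, no ranking work at request time.
print(table.get("ca", []))
print(table.get("cat", []))
```

The table only reflects tags seen by the last offline run, which is the loss of freshness noted above, and storing an entry for every prefix is what makes long-tail coverage expensive.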
  • 19. What’s Next
      ● Inblog search
        ○ full text search on all posts in a blog
        ○ original posts, reblogs, likes
      ● Ranking
        ○ more effective and spam-resilient signals
        ○ learning to rank
      ● Topical interest modeling
        ○ supervised and unsupervised
        ○ blog content and user activities
        ○ interest-based blog recommendation
      ● Content discovery
        ○ trending content in various categories
  • 20. Q&A
      Question: Are you hiring?
      Answer: Yeah! Check it out at http://www.tumblr.com/jobs
      More questions please :-)