Structure, Personalization, Scale: A Deep Dive into LinkedIn Search
Presented at QCon New York 2014 in the Applied Science and Machine Learning track
https://qconnewyork.com/presentation/structure-personalization-scale-deep-dive-linkedin-search
Also presented at the NYC Search, Discovery, and Analytics Meetup
http://www.meetup.com/NYC-Search-and-Discovery/events/168265892/
Video: http://new.livestream.com/hugeinc/search-at-linkedin
Abstract
All of us are familiar with search as users. And as software engineers, many of us have worked on search problems in the context of web search, site search, or enterprise search. But search at LinkedIn is different. Our corpus is a richly structured professional graph comprising 300M+ people, 3M+ companies, 2M+ groups, and 1.5M+ publishers. Our members perform billions of searches (over 5.7B in 2012), and each of those searches is highly personalized based on the searcher's identity and relationships with other professional entities in LinkedIn's economic graph. And all this data is in constant flux, as LinkedIn adds more than 2 members every second in over 200 countries (2/3 of our members are outside the United States). As a result, we’ve built a system quite different from those used for other search applications. In this talk, we will discuss some of the unique challenges we've faced as we deliver highly personalized search over semi-structured data at massive scale.
Speaker Bios
Asif Makhani heads Search at LinkedIn. Prior to that, he was a founding member of A9 and led the development and launch of Amazon CloudSearch, a fully managed and elastic search service in the AWS Cloud. Asif has a Masters in Computer Science from Stanford University and a BMath from University of Waterloo. @asifm
Daniel Tunkelang leads LinkedIn's efforts around query understanding. Before that, he led LinkedIn's product data science team. He previously led a local search quality team at Google and was a founding employee of Endeca (acquired by Oracle in 2011). He has written a textbook on faceted search, and is a recognized advocate of human-computer interaction and information retrieval (HCIR). He has a PhD in Computer Science from CMU, as well as BS and MS degrees from MIT. @dtunkelang
10. What’s unique
Personalized
Part of a larger product experience
– Many products
– Big part
Task-centric
– Find a job, hire top talent, find a person, …
21. Challenges
Index rebuilding very difficult
Live updates are at an entity granularity
Scoring is inflexible
Lucene limitations
Fragmentation – too many components, too many stacks
Economic Graph opportunity
23. Life of a Query
[Diagram: user query → query rewriter/planner → fan-out to multiple search shards → results merging → search results]
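The flow above is a scatter-gather: rewrite the query once, fan it out to every shard in parallel, then merge the per-shard results. Here is a minimal sketch of that shape (the function names, shard layout, and scoring are hypothetical, not LinkedIn's actual code):

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def rewrite(user_query):
    """Hypothetical query rewriter/planner stage: normalize the raw query."""
    return user_query.strip().lower()

def search_shard(shard, query, k):
    """Score every matching doc in one shard and return its top-k hits."""
    hits = [(score, doc_id) for doc_id, (text, score) in shard.items()
            if query in text]
    return heapq.nlargest(k, hits)  # already sorted descending

def life_of_a_query(shards, user_query, k=3):
    query = rewrite(user_query)                       # query rewriter/planner
    with ThreadPoolExecutor() as pool:                # fan out to search shards
        per_shard = list(pool.map(lambda s: search_shard(s, query, k), shards))
    merged = heapq.merge(*per_shard, reverse=True)    # results merging
    return [doc_id for _, doc_id in list(merged)[:k]]

# Hypothetical shards: doc_id -> (text, precomputed score).
shards = [
    {"doc1": ("software engineer", 0.9), "doc2": ("sales", 0.4)},
    {"doc3": ("engineer at linkedin", 0.7)},
]
print(life_of_a_query(shards, "  Engineer "))  # ['doc1', 'doc3']
```

The merge step relies on each shard returning its hits already ordered, so the broker only has to interleave sorted streams rather than re-sort everything.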
24. Life of a Query – Within a Search Shard
[Diagram: rewritten query → retrieve a document from the index → score the document → top results from shard]
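Within a shard, the retrieve-then-score loop above is the classic top-k pattern: visit each candidate, score it, and keep only the k best in a bounded min-heap. A sketch with a hypothetical term-overlap scorer:

```python
import heapq

def top_k_from_shard(index, rewritten_query, score_fn, k=10):
    """Retrieve each candidate doc from the shard's index, score it,
    and keep only the k best using a bounded min-heap."""
    heap = []  # holds (score, doc_id); smallest score at heap[0]
    for doc_id, doc in index.items():           # retrieve a document
        score = score_fn(rewritten_query, doc)  # score the document
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)           # top results from shard

# Hypothetical scorer: count query terms that appear in the document.
overlap = lambda q, d: len(set(q.split()) & set(d.split()))
index = {"d1": "data engineer", "d2": "sales lead", "d3": "engineer"}
print(top_k_from_shard(index, "data engineer", overlap, k=2))
# [(2, 'd1'), (1, 'd3')]
```

The heap keeps per-document work constant, so the shard's cost is dominated by retrieval rather than result bookkeeping.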
25. Life of a Query – Within a Rewriter
[Diagram: query → a chain of rewriter modules, each consulting its own data model and shared rewriter state → rewritten query]
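The rewriter chains independent modules, each reading its data model and the shared rewriter state and transforming the query in turn. A sketch under assumed module names (the correction and synonym tables stand in for real data models; the connection clauses mirror the personalized-query example later in the deck):

```python
def spellcheck(query, state):
    # Hypothetical correction table standing in for a data model.
    corrections = {"enginer": "engineer"}
    return " ".join(corrections.get(w, w) for w in query.split())

def expand_synonyms(query, state):
    # Hypothetical synonym data model.
    synonyms = {"swe": "software engineer"}
    return " ".join(synonyms.get(w, w) for w in query.split())

def personalize(query, state):
    # Attach the searcher's connections from rewriter state, as in the
    # connection: clauses used for personalized retrieval.
    clauses = " ".join(f"connection:{c}" for c in state["connections"])
    return f"{query} +({clauses})"

def rewrite(query, modules, state):
    """Run the query through each rewriter module in order."""
    for module in modules:
        query = module(query, state)
    return query

state = {"connections": [35176, 418001]}
print(rewrite("enginer swe", [spellcheck, expand_synonyms, personalize], state))
# engineer software engineer +(connection:35176 connection:418001)
```

Because each module has the same (query, state) → query signature, modules can be reordered or added without touching the others.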
26. Life of Data – Offline
[Diagram: raw data → derived data, transformed through a series of data models → index]
27. Improvements
Regular full index builds using Hadoop
– Easier to reshard, add fields
Improved Relevance
– Offline relevance, query rewriting frameworks
Partial Live Updates Support
– Allows efficient updates of high-frequency fields (no sync)
– Goodbye Content Store, Goodbye Zoie
Early termination
– Ultra low latency for instant results
– Goodbye Cleo
Indexing and searching across graph entities/attributes
Single engine, single stack
30. Lucene
An open-source search library that supports:
Add new documents to index
Delete documents from the index
Construct queries
Search the index using the query
Score the retrieved documents
31. The Search Index
Inverted index: mapping from (search) terms to lists of the documents they are present in
Forward index: mapping from documents to metadata about them
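Both structures fit in a few lines: the inverted index maps each term to a posting list of document IDs, and the forward index maps each document ID back to its metadata. A toy sketch (field names hypothetical):

```python
from collections import defaultdict

def build_indexes(docs):
    """docs: {doc_id: text}. Returns (inverted, forward) indexes."""
    inverted = defaultdict(list)   # term -> posting list of doc_ids
    forward = {}                   # doc_id -> metadata about the doc
    for doc_id in sorted(docs):    # visiting in order keeps postings sorted
        text = docs[doc_id]
        for term in set(text.split()):
            inverted[term].append(doc_id)
        forward[doc_id] = {"length": len(text.split())}
    return dict(inverted), forward

docs = {1: "search at linkedin", 2: "linkedin search quality"}
inverted, forward = build_indexes(docs)
print(inverted["linkedin"])   # [1, 2] - a posting list
print(forward[2])             # {'length': 3}
```

Keeping posting lists sorted by document ID is what makes intersection of two lists (an AND query) a cheap linear merge.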
33. The Search Index
The lists are called posting lists
Up to hundreds of millions of posting lists
Up to hundreds of millions of documents
Posting lists may contain as few as a single hit and as many as tens of millions of hits
Terms can be
– words in the document
– inferred attributes about the document
35. Early termination
We order documents in the index based on a static rank, from most important to least important
An offline relevance algorithm assigns the static rank by which each document is sorted
This allows retrieval to be early-terminated (assuming a strong correlation between static rank and the importance of a result for a specific query)
Also works well with personalized search
– +term:asif +prefix:makh +(connection:35176 connection:418001 connection:1520032)
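Because documents are laid out in static-rank order, retrieval can stop as soon as k hits are found; under the stated correlation assumption, the first k matches are also the k most important. A sketch of that scan (the match predicate is hypothetical):

```python
def early_terminated_search(docs_by_static_rank, matches, k):
    """docs_by_static_rank: doc ids ordered most- to least-important.
    matches: predicate encoding the query. Stop after k hits instead
    of scanning the whole index."""
    results, scanned = [], 0
    for doc in docs_by_static_rank:
        scanned += 1
        if matches(doc):
            results.append(doc)
            if len(results) == k:   # early termination
                break
    return results, scanned

# Hypothetical: 1M docs in static-rank order; query matches every 10th.
docs = range(1_000_000)
hits, scanned = early_terminated_search(docs, lambda d: d % 10 == 0, k=5)
print(hits)     # [0, 10, 20, 30, 40] - the 5 highest-ranked matches
print(scanned)  # 41 documents examined instead of 1,000,000
```

This is why the ordering matters: without it, the scan would have to visit every posting to be sure nothing better remains.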
36. Partial Updates
Lucene segments are “document-partitioned”
We have enhanced Lucene with “term-partitioned” segments
We use 3 term-partitioned segments:
– Base index (never changed)
– Live update buffer
– Snapshot index
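A read for one term then consults all three term-partitioned segments and merges their postings, letting a high-frequency field change in the live update buffer without rewriting the immutable base index. A sketch of that read path (the segment layout and field values are illustrative assumptions, not the actual on-disk format):

```python
def read_postings(term, base, snapshot, live):
    """Merge a term's postings across the three segments.
    Later segments override earlier ones, so a live update to a
    high-frequency field masks the stale base-index entry."""
    merged = {}
    for segment in (base, snapshot, live):     # oldest to newest
        merged.update(segment.get(term, {}))   # doc_id -> field value
    return merged

base     = {"connections": {1: 250, 2: 90}}   # base index (never changed)
snapshot = {"connections": {2: 95}}           # periodic snapshot index
live     = {"connections": {1: 251}}          # live update buffer
print(read_postings("connections", base, snapshot, live))
# {1: 251, 2: 95}
```

Periodically folding the live buffer into a fresh snapshot keeps the buffer small, so the overriding merge stays cheap at query time.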
40. The Search Quality Pipeline
[Diagram: each new document has features extracted, is scored by a machine learning model, and the scores produce ordered lists; pipeline stages include spellcheck, query tagging, vertical intent, and query expansion]
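The per-document flow in the pipeline – extract features, score with a model, emit an ordered list – can be sketched directly, using the example features from slide 50 (title match, location match, network distance) and a hypothetical linear model with made-up weights:

```python
def extract_features(doc, query, searcher):
    """Hypothetical feature extraction, echoing the slide-50 examples."""
    return [
        sum(w in doc["title"] for w in query.split()),        # title matches
        1.0 if doc["location"] == searcher["location"] else 0.0,
        1.0 / doc["network_distance"],                        # closer is better
    ]

def score(features, weights):
    """A stand-in for the machine learning model: a weighted sum."""
    return sum(f * w for f, w in zip(features, weights))

def rank(docs, query, searcher, weights):
    """Features -> model score -> ordered list, once per document."""
    scored = [(score(extract_features(d, query, searcher), weights), d["id"])
              for d in docs]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

searcher = {"location": "NYC"}
docs = [
    {"id": "a", "title": "software engineer", "location": "SF",
     "network_distance": 3},
    {"id": "b", "title": "engineer", "location": "NYC",
     "network_distance": 1},
]
print(rank(docs, "software engineer", searcher, weights=[1.0, 2.0, 1.0]))
# ['b', 'a']
```

Note how the searcher's location and network distance flow into the features, which is what makes the resulting ordering personalized rather than purely document-based.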
50. Examples of Features
Search keywords matching title = 3
Searcher location = Result location
Searcher network distance to result = 2
…
51. Model Training: Traditional Approach
[Diagram: features extracted from training documents, labels from human evaluation, combined to train a machine learning model]
52. Model Training: LinkedIn’s Approach
[Diagram: features extracted from training documents, labels from both human evaluation and search logs, combined to train a machine learning model]
53. Fair Pairs and Easy Negatives
[Diagram: a ranked result list with one adjacent pair flipped, per Radlinski and Joachims, 2006]
• Sample negatives from bottom results
• But watch out for variable-length result sets.
• Compromise, e.g., sample from page 10.
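Both label-harvesting ideas can be sketched: FairPairs randomly flips adjacent result pairs before presentation so that clicks within a flipped pair give debiased preferences, and "easy negatives" are sampled from around a fixed deep page rather than the literal bottom of a variable-length list. The page-10 offset and flip probability below follow the slide; everything else is an illustrative assumption:

```python
import random

def fair_pairs(results):
    """FairPairs (Radlinski and Joachims, 2006): randomly swap adjacent
    pairs before presentation; a click on the lower member of a swapped
    pair gives a debiased pairwise preference."""
    presented = list(results)
    swapped = set()
    for i in range(0, len(presented) - 1, 2):
        if random.random() < 0.5:
            presented[i], presented[i + 1] = presented[i + 1], presented[i]
            swapped.add(i)
    return presented, swapped

def easy_negatives(results, page_size=10, page=10, n=2):
    """Sample 'easy' negatives from around a fixed deep page instead of
    the very bottom, to cope with variable-length result sets."""
    start = min((page - 1) * page_size, max(len(results) - page_size, 0))
    pool = results[start:start + page_size]
    return random.sample(pool, min(n, len(pool)))

results = [f"doc{i}" for i in range(120)]
presented, swapped = fair_pairs(results[:6])
print(easy_negatives(results))  # two docs drawn from page 10 (doc90..doc99)
```

The `min(..., len(results) - page_size)` clamp is the "watch out" from the slide: when a result set is shorter than ten pages, the sample falls back to its deepest available page.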
54. Model Selection
Select model based on user and query features.
– e.g., person name queries, recruiters making skills queries
Resulting model is a tree with logistic regression leaves.
Only one regression model evaluated for each document.
[Diagram: a decision tree splitting on user and query features (e.g., x2 = ?, x10 < 0.1234 ?), with each leaf holding its own regression model of the form b0 + b1·x1 + … + bn·xn over transformed features]
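The tree-of-models idea can be sketched directly: a couple of feature tests route each (user, query) pair to exactly one logistic-regression leaf, so only one regression is evaluated per document. The splits and coefficients below are hypothetical:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaf_score(coeffs, x):
    """One logistic-regression leaf: sigmoid(b0 + b1*x1 + ... + bn*xn)."""
    return logistic(coeffs[0] + sum(b * xi for b, xi in zip(coeffs[1:], x)))

# Hypothetical leaves: [intercept, weight_1, weight_2].
LEAVES = {
    "name_query":  [0.2, 1.5, -0.3],
    "skills_low":  [-0.1, 0.8, 0.4],
    "skills_high": [0.5, 0.2, 1.1],
}

def select_and_score(query_type, x):
    """Walk the tree to pick exactly one leaf model, then evaluate it."""
    if query_type == "name":          # e.g. person-name queries
        return leaf_score(LEAVES["name_query"], x)
    if x[0] < 0.1234:                 # e.g. a numeric-feature split
        return leaf_score(LEAVES["skills_low"], x)
    return leaf_score(LEAVES["skills_high"], x)

print(round(select_and_score("name", [1.0, 0.5]), 3))  # 0.825
```

Splitting first and regressing second keeps per-document cost at one dot product while still letting, say, person-name queries and recruiter skills queries use very different weights.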
55. Summary
What is LinkedIn search and why should you care? LinkedIn search enables the participants in the economic graph to find and be found.
What are our systems challenges? Indexing rich, structured content; retrieving using global and social factors; real-time updates.
What are our relevance challenges? Query understanding, personalized machine-learned ranking models.
56.
Asif Makhani: amakhani@linkedin.com, https://linkedin.com/in/asifmakhani
Daniel Tunkelang: dtunkelang@linkedin.com, https://linkedin.com/in/dtunkelang