We describe the challenges that we faced while building the instant search experience at LinkedIn, and present techniques that we developed to overcome them. We discuss three aspects of instant search – performance, tolerance to user errors, and accuracy of search results.
Fast, Lenient, and Accurate – Building Personalized Instant Search Experience at LinkedIn
1. Fast, Lenient, and Accurate
Building Personalized Instant Search Experience at LinkedIn
Ganesh Venkataraman, Abhi Lad, Lin Guo, Shakti Sinha
LinkedIn
2. Agenda
● LinkedIn
● LinkedIn Search
○ Navigational vs Exploratory searches
○ Typeahead vs SERP
● Big picture and problem statement
● Instant search – Search-as-you-type
○ Query autocomplete
○ Entity-aware suggestions
○ Instant results
● Conclusions & Future work
11. Instant Search – Search-as-you-type
Satisfy navigational searches:
Show instant search results.
Help frame exploratory searches:
Complete the user’s query and show search suggestions.
17. Query Autocomplete – Offline processing
Query logs:
linkedin software engineer
software engineer
big data
data scientist
data engineer
expert systems
…
Entities: [linkedin] [software engineer]
Index: FST – Finite State Transducers
Compact + fast retrieval + fuzzy match (via Levenshtein automata)
18. Query Autocomplete – Online processing
Two step process:
1. Retrieval (Candidate generation)
User’s query: [big data e]
Candidates = C(big data e) ∪ C(data e) ∪ C(e)
= big data engineer,
big data expert systems,
big data entry,
...
(Source: query logs, e.g. linkedin software engineer, software engineer, big data, data scientist, data engineer, expert systems, …)
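The candidate-generation step above can be sketched as follows. This is a minimal illustration, not the production implementation: a sorted list with binary search stands in for the FST, the helper names `completions` and `candidates` are hypothetical, and "data entry" is added to the toy log (an assumption) so that the "big data entry" candidate from the slide can be produced.

```python
from bisect import bisect_left

# Toy query log. A sorted list with binary search stands in for the FST.
# "data entry" is an assumed extra entry, not taken from the slide's log.
QUERY_LOG = sorted([
    "linkedin software engineer",
    "software engineer",
    "big data",
    "data scientist",
    "data engineer",
    "data entry",
    "expert systems",
])


def completions(prefix, log=QUERY_LOG):
    """C(prefix): all logged queries that start with `prefix`."""
    out = []
    # In a sorted list, all strings sharing a prefix are contiguous.
    for query in log[bisect_left(log, prefix):]:
        if not query.startswith(prefix):
            break
        out.append(query)
    return out


def candidates(query):
    """Union of C(.) over every word-level suffix of the user's query.

    Completions found for a shorter suffix are re-attached to the words
    that were dropped, so every candidate extends the full query.
    """
    words = query.split(" ")
    seen, result = set(), []
    for i in range(len(words)):
        head = " ".join(words[:i])
        suffix = " ".join(words[i:])
        for completion in completions(suffix):
            full = f"{head} {completion}".strip()
            if full not in seen:
                seen.add(full)
                result.append(full)
    return result


print(candidates("big data e"))
```

With this toy log, no logged query starts with "big data e", so all candidates come from the shorter suffixes "data e" and "e".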
19. Query Autocomplete – Online processing
Two step process:
2. Scoring (Ranking)
User’s query: [big data e]
Candidate completions: “big data engineer”, “big data expert”, “big data entry”
Score(“big data engineer”):
P(s1, s2, s3, …) ≈ P(s1)·P(s2 | s1)·P(s3 | s2)·… // Bigram language model
Use entities : P([engineer] | [big data])
Fall back to words : P(engineer | data)·P(data | big)
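A rough sketch of this scoring scheme is shown below. The tagged log, the add-one smoothing, and the names `train`, `prob`, and `log_score` are assumptions made for illustration; the slides do not specify how probabilities are estimated or smoothed.

```python
import math
from collections import Counter

# Hypothetical tagged query log: each query is a list of entity segments.
TAGGED_LOG = [
    ["big data", "engineer"],
    ["big data", "engineer"],
    ["big data", "entry"],
    ["data", "engineer"],
    ["software", "engineer"],
]


def train(sequences):
    """Count bigram transitions, with "<s>" marking the start of a query."""
    context, pairs, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] + list(seq)
        vocab.update(seq)
        for prev, cur in zip(padded, padded[1:]):
            context[prev] += 1
            pairs[(prev, cur)] += 1
    return context, pairs, len(vocab)


ENTITY_MODEL = train(TAGGED_LOG)
WORD_MODEL = train([" ".join(seq).split() for seq in TAGGED_LOG])


def prob(model, prev, cur):
    """Add-one smoothed estimate of P(cur | prev) (smoothing is an assumption)."""
    context, pairs, vocab_size = model
    return (pairs[(prev, cur)] + 1) / (context[prev] + vocab_size)


def log_score(segments):
    """Log-probability of a completion under the bigram model.

    Uses the entity-level transition when it was observed in the log,
    and otherwise falls back to word-level transitions.
    """
    entity_pairs = ENTITY_MODEL[1]
    total, prev = 0.0, "<s>"
    for seg in segments:
        if entity_pairs[(prev, seg)] > 0:
            total += math.log(prob(ENTITY_MODEL, prev, seg))
        else:
            # Last word of the previous segment, then the words of this one.
            words = prev.split()[-1:] + seg.split()
            for w_prev, w_cur in zip(words, words[1:]):
                total += math.log(prob(WORD_MODEL, w_prev, w_cur))
        prev = seg
    return total


ranked = sorted(
    [["big data", "entry"], ["big data", "engineer"]],
    key=log_score,
    reverse=True,
)
```

In this toy log "[big data] [engineer]" occurs more often than "[big data] [entry]", so it ranks first.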
21. Instant Results
● Fast retrieval over 450+ million members
● Highly personalized
● Balance personalization & popularity
● Resilient to spelling variations
22. Instant Results – Indexing
● Prefix-based tokenization:
DOCID 4
NAME: richard
PREFIX: r, ri, ric, rich, richa, ...
NAME: branson
PREFIX: b, br, bra, bran, brans, ...
● Inverted Index (maps each token to the posting list of docs that contain it):
NAME:richard => [1, 4, 10, 15, …] // Everyone named “richard”
PREFIX:ri => [1, 2, 4, 7, 10, 15, …] // Everyone whose name starts with “ri”
…
● Retrieval approach
User’s query – richard b
Rewritten query – +NAME:richard +PREFIX:b
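The tokenization, index, and query rewrite on this slide can be sketched in a few lines. The member names other than Richard Branson, the document IDs, and the function names are made up for illustration, and treating every word but the last as a complete NAME term is an assumption about how the rewrite works.

```python
from collections import defaultdict

# Hypothetical members (doc id -> name).
DOCS = {
    2: "Rita Brown",
    4: "Richard Branson",
    10: "Richard Stone",
    15: "Richard Burton",
}


def tokenize(name):
    """Emit a NAME token per word plus a PREFIX token for each of its prefixes."""
    tokens = set()
    for word in name.lower().split():
        tokens.add(f"NAME:{word}")
        for end in range(1, len(word) + 1):
            tokens.add(f"PREFIX:{word[:end]}")
    return tokens


def build_index(docs):
    """Inverted index: token -> sorted posting list of doc ids."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for token in tokenize(docs[doc_id]):
            index[token].append(doc_id)
    return index


def rewrite(query):
    """Completed words become NAME terms; the word being typed becomes a PREFIX term."""
    words = query.lower().split()
    if not words:
        return []
    return [f"NAME:{w}" for w in words[:-1]] + [f"PREFIX:{words[-1]}"]


def search(index, query):
    """All terms are required (+), so intersect their posting lists."""
    terms = rewrite(query)
    if not terms:
        return []
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings))


INDEX = build_index(DOCS)
print(rewrite("richard b"))        # ['NAME:richard', 'PREFIX:b']
print(search(INDEX, "richard b"))  # [4, 15]
```

Precomputing prefixes trades index size for speed: a prefix query becomes an exact posting-list lookup instead of a scan over the term dictionary.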
23. Instant Results – Indexing
● Connections Index:
DOCID 4
CONN: 1, 10, 15
● Inverted Index
CONN:4 => [1, 10, 15] // Everyone connected to Richard Branson
CONN:1 => [4, ...]
CONN:10 => [4, ...]
...
● Retrieval approach
User’s query – richard b
Rewritten query – +NAME:richard +PREFIX:b +CONN:1
(Everyone named richard b… and connected to User:1)
24. Instant Results – Indexing
Early Termination
Problem: A query like [PREFIX:ri] might retrieve too many candidate documents.
How can we retrieve the most promising documents first so that we don’t need to score all of them?
Static Rank: Order documents based on their prior (query independent) likelihood of relevance:
A combination of:
● Profile views
● Spam and security related scores
● Editorial rules (Celebrities, influencers, …)
numToScore: The number of documents to retrieve and score for any query
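A small sketch of early termination under these definitions follows. The static-rank values, the toy scoring function, and the function names are hypothetical. Storing each posting list in descending static-rank order is one plausible way to make "the first numToScore matches" also the most promising ones; the slides do not say how the ordering is realized.

```python
from itertools import islice

# Hypothetical static ranks (higher = more likely relevant a priori).
STATIC_RANK = {1: 0.20, 2: 0.90, 4: 0.95, 7: 0.10, 10: 0.50, 15: 0.70}


def order_postings(doc_ids):
    """Offline: store a posting list in descending static-rank order."""
    return sorted(doc_ids, key=STATIC_RANK.__getitem__, reverse=True)


def search(posting_list, score, num_to_score, top_k):
    """Online: score only the first num_to_score docs, then rank those."""
    candidates = list(islice(posting_list, num_to_score))
    return sorted(candidates, key=score, reverse=True)[:top_k]


postings = order_postings([1, 2, 4, 7, 10, 15])  # e.g. PREFIX:ri
scored = []  # records which docs were actually scored


def toy_score(doc_id):
    """Stand-in for the (expensive) query-dependent scoring function."""
    scored.append(doc_id)
    return {2: 0.8, 4: 0.3, 15: 0.6}.get(doc_id, 0.0)


top = search(postings, toy_score, num_to_score=3, top_k=2)
```

Only three of the six matching documents are scored here; the remaining three are never touched, which is where the latency saving comes from.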
25. Instant Results – Retrieval
Balancing Popularity and Personalization
Query: richard b…
Are you looking for Richard Branson, or a colleague named Richard Burton?
(Assume searcher’s ID = 1)
Rewritten Query:
● +NAME:richard +PREFIX:b +CONN:1 // Too restrictive. Only finds the searcher’s connections.
● +NAME:richard +PREFIX:b ?CONN:1[50%] // Try to retrieve 50% of the results from the searcher’s connections
Custom search operator: “Weighted OR”
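The slides do not spell out how "Weighted OR" is implemented, so the following is only one possible reading: reserve a share of the numToScore budget for documents that also match the optional clause, and fill the rest from the other matches. The function name `weighted_or`, the backfill behavior, and the document IDs are assumptions.

```python
def weighted_or(required_matches, optional_matches, num_to_score, weight=0.5):
    """Pick up to num_to_score docs, aiming for `weight` of them to also
    match the optional clause (e.g. ?CONN:1[50%]).

    required_matches: doc ids matching all required (+) clauses, assumed
                      to be in static-rank order.
    optional_matches: set of doc ids matching the optional (?) clause.
    """
    quota = int(num_to_score * weight)
    preferred = [d for d in required_matches if d in optional_matches]
    others = [d for d in required_matches if d not in optional_matches]

    picked = preferred[:quota]
    picked += others[:num_to_score - len(picked)]
    # If the non-preferred docs ran out, backfill with leftover preferred docs.
    remaining = num_to_score - len(picked)
    picked += preferred[quota:quota + remaining]
    return picked


# Docs matching +NAME:richard +PREFIX:b, in static-rank order (hypothetical ids).
matches = [4, 20, 21, 15, 22, 23]
connections_of_searcher = {15, 30}  # posting list of CONN:1 (hypothetical)

print(weighted_or(matches, connections_of_searcher, num_to_score=4))
# [15, 4, 20, 21]
```

In this reading, the connection (doc 15) is guaranteed a slot even though its static rank would otherwise leave it outside the first four matches, while popular non-connections such as doc 4 are still retrieved.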
27. Name Clusters
Offline process to cluster together similar sounding or similarly spelt names.
Two step process:
1. Coarse clustering (optimized for broad coverage)
Normalization: repeated chars, accented chars, common phonetic variations (c ⇔ k, ph ⇔ f)
Combination of edit distance & double metaphone (sound)
E.g. (dipak = deepak), (wiener = weiner), (catherine = kathryn), (jeff = joff)
2. Fine-grained clustering (optimized for precision)
Split up clusters based on more sophisticated rules
Position and character-aware edit distance
Query reformulation data (q1 → q2 → click)
E.g. (jeff ≠ joff)
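The coarse clustering step can be approximated as below. This sketch covers only normalization plus plain edit distance; double metaphone is not in the Python standard library and is omitted, so phonetic pairs such as (catherine = kathryn) are not caught here. The blanket c ⇒ k rule, the distance threshold of 2, and the function names are assumptions.

```python
import re
import unicodedata


def normalize(name):
    """Lowercase, strip accents, apply crude phonetic rewrites, collapse repeats."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    plain = plain.replace("ph", "f").replace("c", "k")  # crude: rewrites every "c"
    return re.sub(r"(.)\1+", r"\1", plain)              # "deepak" -> "depak"


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def same_coarse_cluster(a, b, max_distance=2):
    """Broad-coverage test: normalized names within a small edit distance."""
    return edit_distance(normalize(a), normalize(b)) <= max_distance


print(same_coarse_cluster("dipak", "deepak"))  # True
print(same_coarse_cluster("jeff", "joff"))     # True
```

The second result shows why the fine-grained step exists: a purely distance-based rule groups jeff with joff, and a later, more precise rule has to split them apart.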
28. Instant Results – Spelling Variations
Indexing time:
NAME: kathryn
CLUSTER: katharine
Query time:
Potential queries:
katherine
kathryn
katharine
catharine
Rewritten queries:
?NAME:katherine ?CLUSTER:katharine
?NAME:kathryn ?CLUSTER:katharine
?NAME:katharine ?CLUSTER:katharine
?NAME:catharine ?CLUSTER:katharine
Either match original query term or match the name cluster
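How the cluster token is used at indexing time and at query time can be sketched as follows. The `CLUSTERS` table is a hand-written stand-in for the output of the offline clustering process, and the function names are hypothetical.

```python
# Stand-in for the offline clustering output: name -> cluster representative.
CLUSTERS = {
    "katherine": "katharine",
    "kathryn": "katharine",
    "katharine": "katharine",
    "catharine": "katharine",
}


def index_tokens(first_name):
    """Indexing time: emit the exact name plus its cluster, if it has one."""
    name = first_name.lower()
    tokens = [f"NAME:{name}"]
    if name in CLUSTERS:
        tokens.append(f"CLUSTER:{CLUSTERS[name]}")
    return tokens


def rewrite(term):
    """Query time: optional (?) clauses for the typed name and its cluster."""
    term = term.lower()
    clauses = [f"?NAME:{term}"]
    if term in CLUSTERS:
        clauses.append(f"?CLUSTER:{CLUSTERS[term]}")
    return clauses


def matches(doc_tokens, clauses):
    """A doc matches if it satisfies at least one of the optional clauses."""
    return any(clause.lstrip("?") in doc_tokens for clause in clauses)


doc = index_tokens("Kathryn")  # ['NAME:kathryn', 'CLUSTER:katharine']
query = rewrite("catharine")   # ['?NAME:catharine', '?CLUSTER:katharine']
print(matches(doc, query))     # True
```

Because the cluster is resolved to a single token on both sides, a fuzzy name match costs one extra posting-list lookup rather than an edit-distance computation at query time.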
29. Instant Results – Scoring
Learning to Rank (Machine-learned ranking)
Training data
● Click data from previous typeahead sessions
● <searcher, query, doc> ⇒ positive/negative
● Clicked result treated as positive (+); all other shown results treated as negative (–).
● Since this is navigational search, we assume there’s only 1 correct result => low presentation bias.
Features / signals
● Textual match against various fields
● Network distance, number of shared connections
● Global popularity
● Compound features
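The labeling rule for training data can be sketched as a small function. The name `label_session` and the example IDs are hypothetical; feature extraction and the ranking model itself are outside this sketch because the slides do not describe them in detail.

```python
def label_session(searcher, query, shown_docs, clicked_doc):
    """Turn one typeahead session into <searcher, query, doc> training examples.

    The clicked result is labeled positive (1); every other result that was
    shown is labeled negative (0).
    """
    return [
        ((searcher, query, doc), 1 if doc == clicked_doc else 0)
        for doc in shown_docs
    ]


# Hypothetical session: searcher 1 typed "richard b", saw four results,
# and clicked doc 15.
examples = label_session(1, "richard b", [4, 15, 20, 21], clicked_doc=15)
for key, label in examples:
    print(key, label)
```

Each session yields exactly one positive and several negatives, which matches the single-correct-result assumption stated for navigational search.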
30. Conclusions
● Instant search experience
○ Directly satisfy navigational search uses in typeahead via Instant Results
○ Help the user frame exploratory search queries via Query Autocomplete & Search Suggestions
● Combination of techniques
○ Query tagger for entity extraction – “Things not Strings”
○ FST-based query completion
○ Inverted index-based instant results + Early termination + Weighted OR
○ Name clusters for fuzzy name matching
31. Future Work
● Personalized query completions
○ m ⇒ machine learning
○ m ⇒ machinist
● Multi-entity query suggestions
○ Now : [linkedin] ⇒ “Find people who work at LinkedIn”
○ Future : [linkedin data scientist] ⇒ “Find data scientists at LinkedIn”
● Better blending
○ Autocomplete + query suggestions + instant results
○ Query features – what does the query mean?
○ Results features – what results come back from each system?