12. The (Old) System
● System too complex to tune the boosts accurately: relevancy
whack-a-mole
● Inventory content changes frequently
● Lacks data-driven input: boosts are assumption-driven, without proper
statistical analysis
“If only there were a way to do this differently…”
14. Job search is both an IR and a match problem
Search / IR (e.g. YouTube): { User } → { Resource }
● Many to many
● Asymmetric
● Unlimited supply
Match (e.g. online chess): { Player } ↔ { Player }
● One to one
● Symmetric
● No extra supply
Job Search: { Job Seekers } ↔ { Job Positions }
● One to many
● Asymmetric and bi-directional
● Limited supply, unlimited “attempts”
15. Hourly Jobs are not ‘Sticky’
● Fragmented
Organized around “shifts”: a worker can be assigned 1 to 30+ hours per
week, and many hold multiple jobs
● Transactional
Workers stay at each job for 6 months on average
● Lightly Skilled
Many hourly jobs require just a high school diploma
https://www.snag.co/employers/wp-content/uploads/2016/07/2016_SOTHW_Report-3.pdf
16. Hourly job search is often a recommendation
● Schedule and location can be more important than the actual duties of the job
● Queries are not explicit (40% have no keywords)
20. Learning to Rank Model
Development Environment
● Relevancy Labels
- Abandonment: 0
- Click: 1
- Apply Intent: 2
● Features
- match scores on job title, employer name, job type, ...
- distance <position, seeker>
- match scores on query location (e.g. zip-code, city)
- match scores on job description
- query string attributes (e.g. length, entity type)
- posting attributes (e.g. position, requirements, industry, semantic representation)
- ...
● Model: LambdaMART
Composability!
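The training data handed to a learner like RankLib is conventionally written in the LETOR/SVMlight format. A minimal sketch, assuming the graded labels above; the helper name, query id, and feature values are illustrative, not the production schema:

```python
# Hypothetical sketch: emit one (query, posting) training example as a
# RankLib/LETOR input line. Grades follow the deck's label scheme
# (0 = abandonment, 1 = click, 2 = apply intent).

def to_letor_row(label, qid, features):
    """Format one example: '<grade> qid:<query id> 1:<f1> 2:<f2> ...'."""
    feats = " ".join(f"{i}:{v:.4f}" for i, v in enumerate(features, start=1))
    return f"{label} qid:{qid} {feats}"

row = to_letor_row(
    label=2,                      # apply intent
    qid=101,                      # one query group per search session
    features=[0.82, 0.10, 0.55],  # e.g. title match, employer match, distance
)
print(row)  # → "2 qid:101 1:0.8200 2:0.1000 3:0.5500"
```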
21. Training Pipeline - esltr plugin 0.x
Development Environment
[Pipeline diagram: posting docs and user events flow from the posting collection and data warehouse through a posting sampler and an event sampler into the training data generator, which combines relevancy scores (from the relevancy label parser), query info, and features backfilled against a training index on a dev search engine (populated via posting ingestion). RankLib’s model generator then trains the ranking model, which is deployed to the prod search engine.]
22. Training Pipeline - esltr plugin 1.0
Development Environment
[Pipeline diagram: user events and live feature logs from the prod search engine land in the data warehouse and flow through an event sampler into the training data generator, which combines relevancy scores (from the relevancy label generator) with features (from the feature parser). The model generator, tuned with HyperOpt, trains the ranking model; live feature logs replace the 0.x dev-index feature backfilling step.]
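A minimal sketch of what the relevancy label generator in this pipeline might do: collapse raw user events into one graded label per (session, posting) pair, keeping the strongest signal. The grades mirror the deck’s scheme; the event names and data shapes are assumptions.

```python
# Hypothetical event grades: impressions without interaction grade 0,
# clicks grade 1, apply intents grade 2 (as in the rest of the deck).
GRADES = {"impression": 0, "click": 1, "apply_intent": 2}

def label_events(events):
    """events: iterable of (session_id, posting_id, event_type) tuples.
    Returns the max grade observed for each (session, posting) pair."""
    labels = {}
    for session_id, posting_id, event_type in events:
        key = (session_id, posting_id)
        labels[key] = max(labels.get(key, 0), GRADES[event_type])
    return labels

labels = label_events([
    ("s1", "p1", "impression"),
    ("s1", "p1", "click"),
    ("s1", "p2", "impression"),
    ("s1", "p1", "apply_intent"),
])
# → {("s1", "p1"): 2, ("s1", "p2"): 0}
```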
23. Offline Validation Pre-Deployment
Development Environment
● Re-ranking historical queries
Gives good directional guidance, but not very accurate in absolute numbers due to 1)
inability to account for new items and 2) contamination from sponsored postings with
artificially high rankings.
● Manual examination of common query patterns
Great for sanity checks. Reveals details beyond relevancy labels. More indicative of
future performance.
● Best of Both Worlds?
Aljadda, Khalifeh & Korayem, Mohammed & Grainger, Trey. (2018). Fully Automated QA
System for Large Scale Search and Recommendation Engines Leveraging Implicit User
Feedback.
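Re-ranking historical queries is typically scored with a graded metric such as NDCG (used as NDCG@10 in the iterations that follow). An illustrative implementation, not the production scorer:

```python
import math

def dcg(labels):
    """Discounted cumulative gain over graded labels in ranked order."""
    return sum((2**rel - 1) / math.log2(i + 2) for i, rel in enumerate(labels))

def ndcg_at_k(ranked_labels, k=10):
    """DCG of the top-k results, normalized by the ideal (sorted) ordering."""
    ideal_dcg = dcg(sorted(ranked_labels, reverse=True)[:k])
    return dcg(ranked_labels[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# a query where the apply-intent posting (grade 2) was ranked third
print(ndcg_at_k([1, 0, 2, 0]))
```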
24. Deployment via A-B testing
Production Environment
Don’t modify the existing system.
25. Deployment via A-B testing
Production Environment
a) Build a parallel system
b) Iterate
c) Test
d) Evaluate
29. Iteration 1 (Q2 2017)
● LTR Features
1. job_title match score
2. job_description match score
3. employer_name match score
4. city_state match score
5. zipcode match score
6. distance <query location, posting>
● Relevancy Labels
Click : 1
Apply Intent: 2
Completed Applications: 3
● Success Criteria
- NDCG@10
● Use Cases
Site: desktop, mobile web
User: registered
Search Type:
- zip-code location only
- zip-code location + keyword
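Feature 6, the distance between the query location and a posting, can be sketched with the haversine great-circle formula; the coordinates below are illustrative.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in miles."""
    r = 3958.8  # mean Earth radius, miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# approximate points in Arlington, VA 22201 and Washington, DC 20007:
# a straight-line distance of a few miles
d = haversine_miles(38.886, -77.095, 38.913, -77.072)
```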
30. Relevancy Performance
Iteration 1
● Pros
Immediate boost of NDCG for zipcode-only searches (~5%)
● Cons
Keyword and location-only searches shared the same feature space, leading to a polarized user experience.
● Todo
Add query-string-related attributes to the list of features
● When things don’t work:
Query:
- keyword: Starbucks
- location: Arlington, VA, 22201
Results:
Rank  Employer    Location
1     Starbucks   Arlington, VA, 22201
2     Starbucks   Arlington, VA, 22203
3     WholeFoods  Arlington, VA, 22201
4     Starbucks   Washington, DC, 20007
31. Iteration 2 (Q3 2017)
● Success Criteria
- NDCG@10
- Application Rate
(# of applications/ # of search sessions)
● Use Cases
Site: desktop, mobile web
User: registered, unregistered
Search Type:
- zip-code location only
- zip-code location + keyword
- text location only
- text location + keyword
● LTR Features
1. job_title match scores
2. job_description match scores
3. employer_name match scores
4. location “match” level
5. distance <seeker, posting>
6. query location level
7. query length
8. platform (e.g. desktop, mobile)
9. job seeker registration status
● Relevancy Labels
Click : 1
Apply Intent: 2
Completed Applications: 3
32. Relevancy Performance
Iteration 2
● Pros
More stable performance across the board.
● Cons
A geo-location resolution rate of only ~95% hurt queries with text locations.
Default text analyzers supplied noisy signals to LTR.
● Todo
Enhance geo-coding logic
Define customized analyzers (e.g. stopwords, synonym filters, keyword markers) for every field used by the ranking model
● When things don’t work:
Query:
- keyword: Part time restaurant
Results:
Rank  Title              Employer
1     Part time server   Chipotle
2     Full time cook     KFC
3     Part time Cashier  Restaurant Depot
4     Cook               District Taco
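The customized-analyzer todo might look like the following Elasticsearch index settings fragment. The filter names, synonyms, stopwords, and protected keywords are illustrative; the `synonym`, `stop`, and `keyword_marker` filter types and the `custom` analyzer structure are standard Elasticsearch analysis settings.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "job_synonyms": {
          "type": "synonym",
          "synonyms": ["pt, part time", "ft, full time"]
        },
        "job_stopwords": {
          "type": "stop",
          "stopwords": ["job", "jobs", "hiring"]
        },
        "job_keywords": {
          "type": "keyword_marker",
          "keywords": ["starbucks"]
        }
      },
      "analyzer": {
        "job_title_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "job_keywords", "job_synonyms",
                     "job_stopwords", "porter_stem"]
        }
      }
    }
  }
}
```

Mapping each LTR-scored field to an analyzer like this keeps the match-score features from being polluted by noise such as unexpanded abbreviations (the “PT” example two slides down) or stemmed brand names.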
33. Iteration 3 (Q4 2017)
● Success Criteria
- NDCG@10
- Application rate
(# of applications / # of search sessions)
- Applicant conversion rate
(# of applicants / # of users)
- Applications per user
(# of applications / # of users)
● Use Cases
Site: desktop, mobile web
User: registered, unregistered
Search Type:
- zip-code location only
- zip-code location + keyword
- text location only
- text location + keyword
- keyword only
● LTR Features
1. job_title match scores
2. job_description match scores
3. employer_name match scores
4. location “match” level
5. distance <seeker, postings>
6. query location level
7. query length
8. platform (e.g. desktop, mobile)
9. job seeker registration status
10. is_faceted flag
● Relevancy Labels
Click : 1
Apply Intent: 2
Completed Applications: 3
34. Relevancy Performance
Iteration 3
● Pros
Location-only searches are 10%+ better than baseline. Keyword searches broke even.
● Cons
Large numbers of tied LTR scores artificially limited user options via presentation bias.
Lack of features about job description contexts meant “click-bait” postings received too much exposure.
● Todo
Randomize the ranking of postings with tied LTR scores on a per-user/session basis
Add query-independent posting-level features
● When things don’t work:
Query:
- keyword: PT (part time)
- location: Arlington, VA
Results:
Rank  Title              Location
1     Part time Cashier  Arlington, VA, 22201
2     Drive Uber PT!     Arlington, VA, 22209
3     Drive Uber PT!     Arlington, VA, 22202
4     Drive Uber PT!     Arlington, VA, 22203
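The tie-randomization todo can be sketched as a deterministic per-session shuffle: postings with equal LTR scores get a stable pseudo-random tiebreaker derived from the session, so each user sees a consistent order but different users see different tied postings first. Names and data shapes are assumptions.

```python
import hashlib

def session_tiebreaker(session_id, posting_id):
    """Stable pseudo-random value for a (session, posting) pair."""
    digest = hashlib.md5(f"{session_id}:{posting_id}".encode()).hexdigest()
    return int(digest, 16)

def rank(postings, scores, session_id):
    """Sort by LTR score descending; break ties per-session rather than
    by a fixed document order, which would always favor the same posting."""
    return sorted(
        postings,
        key=lambda p: (-scores[p], session_tiebreaker(session_id, p)),
    )

scores = {"uber_a": 0.9, "uber_b": 0.9, "cashier": 0.95}
order_s1 = rank(list(scores), scores, "session-1")
order_s2 = rank(list(scores), scores, "session-2")
# "cashier" is always first; the two tied postings may swap between sessions,
# but each session always sees the same order on repeat queries
```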
35. Current Iteration (Q1 2018)
● Success Criteria
- Application rate
(# of applications/ # of search sessions)
- Applicant conversion rate
(# of applicants/ # of users)
- Applications per user
(# of applications / # of users)
- Application diversity
(# of distinct applied postings/ # of applications)
● Use cases
Site: mobile apps, desktop, mobile web
User: registered, unregistered
Search Type:
- zip-code location only
- zip-code location + keyword
- text location only
- text location + keyword
- keyword only
- user coordinates only (a.k.a Jobs near me)
- user coordinates + keyword
● LTR features
1. job_title match scores
2. job_description match scores
3. employer_name match scores
4. location “match” level
5. distance <seeker, postings>
6. query location level
7. query length
8. platform (e.g. desktop, mobile)
9. job seeker registration status
10. is_faceted flag
11. Location confidence level of postings
(proxy for posting quality)
● Relevancy Labels
Click : 1
Apply Intent: 2
Completed Applications: 3
36. Android App Live Performance (April 2018)
Qualitative assessments
● Signal regularisation: no particular field has an outsized impact on relevancy anymore
● Signal coordination: e.g. the interaction between text and location relevancy is more balanced
● Randomized ties => better match: randomization enables well-distributed matchings and better marketplace health, and partially corrects positional bias

Metrics
Metric                     Control (80% of users)  Test (20% of users)  Average % Lift
Application Rate           0.1273 (0.0005)         0.1409 (0.0011)      10.72%
Applicant Conversion Rate  33.86% (0.20%)          36.64% (0.43%)       8.22%
Apply Intent Diversity     0.676 (0.002)           0.759 (0.004)        12.40%
Click Diversity            0.663 (0.002)           0.807 (0.004)        21.62%
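Reading the table above: assuming the parenthesized values are standard errors (the slide does not label them), the lift and a rough z-score for the application-rate row can be computed as follows. The direct ratio differs slightly from the reported average lift, which may have been averaged over days.

```python
import math

def lift_and_z(control_mean, control_se, test_mean, test_se):
    """Relative lift of test over control, and a z-score treating the
    two reported standard errors as independent."""
    lift = (test_mean - control_mean) / control_mean
    z = (test_mean - control_mean) / math.sqrt(control_se**2 + test_se**2)
    return lift, z

# application-rate row of the table
lift, z = lift_and_z(0.1273, 0.0005, 0.1409, 0.0011)
print(f"lift={lift:.2%}, z={z:.1f}")
```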
37. Engineering Challenges
● Latency
- API: window size from 3000 to 1000 to 500
- Igniter (posting ingestion) execution time
● Signal Quality
● Randomization and result consistency
39. Lessons Learned
Model Development
● Relevancy tuning can create feedback loops. Look ahead
Changes in the ranking function sometimes trigger changes in user behavior, which in turn invalidate the ranking function. Treat relevancy tuning as an interactive experiment, not a curve-fitting exercise
● Apply strong model assumptions to correct deficiencies in old ranking functions
Use sound behavioral hypotheses, grounded in data analysis and qualitative user research, to regulate model behavior. Historical data can be noisy. Let A/B tests be the final judge.
● Engineer the relevancy labels as well as the features
Implicit feedback is not an absolute measure of relevancy and should be modeled to account for biases and behavioral assumptions
● Ranking functions are only as expressive as the features you feed them
Any relevancy insight that can’t be encoded as a meaningful difference in the feature space will not be reflected in the search results
40. Lessons Learned
Engineering & Infrastructure
● Prioritize velocity of iteration (avoid analysis paralysis)
● Work backwards from conclusions about system latency
42. Posting and Query Semantics Features
● Contextual information in posting descriptions contributes many relevancy signals
● Back-testing both manually crafted bag-of-words features and machine-learned representations (e.g. via SVD, word2vec) already showed significant lift in reranked NDCG
● Some concerns about query-time performance and over-fitting of long NLP feature vectors

High context: “... hiring individuals to work as part-time Package Handlers... involves continual lifting, lowering and sliding packages that typically weigh 25 - 35 lbs… typically do not work on holidays.... working approximately 17.5 - 20 hours per week… outstanding education assistance of up to $2,625 per semester...”

Low context: “We have a part time opening for a delivery driver position. Must be authorized to work in the US”
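A hand-crafted bag-of-words feature of the kind back-tested above might score how much scheduling, pay, and physical-demand context a posting description carries. The cue lists here are illustrative, not the production vocabulary:

```python
import re

# Hypothetical cue vocabularies for three kinds of posting context
CONTEXT_CUES = {
    "schedule": {"hours", "shift", "shifts", "weekly", "holidays", "week"},
    "pay":      {"pay", "hourly", "wage", "wages", "semester"},
    "physical": {"lifting", "lowering", "lbs", "standing"},
}

def context_score(description):
    """Fraction of cue categories with at least one hit in the text."""
    tokens = set(re.findall(r"[a-z]+", description.lower()))
    hits = sum(1 for cues in CONTEXT_CUES.values() if tokens & cues)
    return hits / len(CONTEXT_CUES)

high = context_score("continual lifting of packages 25-35 lbs, "
                     "17.5-20 hours per week")
low = context_score("part time opening for a delivery driver position")
# the high-context description scores strictly higher than the low-context one
```

Features like this (or SVD/word2vec representations of the same text) give the ranker a handle on the “click-bait” problem from Iteration 3, where postings with thin descriptions received too much exposure.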
43. Click / Relevancy Label Modeling
Model Improvements
● Build multi-stage click models to account for factors that cannot be formulated as query-time LTR features (e.g. rank position, between-session correlations).
● This creates a positive feedback loop that boosts potentially relevant postings with low exposure (and penalizes the reverse)
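One ingredient of such click models is position-bias correction: weight each observed click by the inverse of an assumed examination probability for its rank, so postings with low exposure are not unfairly penalized. A minimal sketch; the propensity curve and its `eta` parameter are illustrative assumptions, not the production model.

```python
def examination_prob(rank, eta=1.0):
    """Assumed probability that a user even examines the result at this
    rank; a simple inverse-rank curve, steeper for larger eta."""
    return (1.0 / rank) ** eta

def debiased_click_weight(rank):
    """Inverse-propensity weight for a click observed at this rank."""
    return 1.0 / examination_prob(rank)

# with eta=1, a click at rank 5 counts five times as much as one at rank 1
weights = [debiased_click_weight(r) for r in (1, 5)]
```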
44. Personalized Matching
Model Improvements
● Incorporate LTR features about matching signals between job seeker preferences/qualifications and job requirements
● (Potentially) an online learning module that dynamically adjusts the rankings shown to each user based on onsite behavior

“I want a part time job near my home! (...that pays >$15 per hour. No night shifts! ...in the retail industry, where I have 5 years of experience. Bonus points if it’s Harris Teeter…)”
45. Engineering Improvements
Engineering & Infrastructure
● Push-button training pipeline
● Automated push button deployment for re-indexing
● Latency and scale improvements
47. References
● Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
● ES Learning to Rank Plugin: http://elasticsearch-learning-to-rank.readthedocs.io/en/latest/
● Relevancy tuning: Turnbull, Doug, and John Berryman. Relevant Search: With Applications for Solr and Elasticsearch. Manning, 2016.
● LambdaMART: C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Technical Report MSR-TR-2010-82, Microsoft Research, 2010.
● RankLib: https://sourceforge.net/p/lemur/wiki/RankLib
● XGBoost: Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, New York, NY, USA, 785-794, 2016.
K. V. Rashmi and Ran Gilad-Bachrach. DART: Dropouts Meet Multiple Additive Regression Trees. April 2015.
● Hyperopt: J. Bergstra, R. Bardenet, Y. Bengio and B. Kégl. Algorithms for Hyper-Parameter Optimization. Proc. Neural Information Processing Systems 24 (NIPS 2011), 2546-2554, 2011.
● Interleaving: O. Chapelle, T. Joachims, F. Radlinski, and Yisong Yue. Large-Scale Validation and Analysis of Interleaved Search Evaluation. ACM Transactions on Information Systems (TOIS), 30(1):6.1-6.41, 2012.
T. Joachims. Evaluating Retrieval Performance Using Clickthrough Data. Proceedings of the SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval, 2002.
● Document & query embeddings: Mitra, Bhaskar, and Craswell, Nick. Neural Models for Information Retrieval. 2017.
Hamed Zamani and W. Bruce Croft. Relevance-Based Word Embedding. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '17). ACM, New York, NY, USA, 505-514, 2017.
● Click models: Chuklin, A., Markov, I., and de Rijke, M. Click Models for Web Search. Synthesis Lectures on Information Concepts, Retrieval, and Services, 7(3), 1-115, 2015. With PyClick: https://github.com/markovi/PyClick
Y. Hu, Y. Koren and C. Volinsky. Collaborative Filtering for Implicit Feedback Datasets. 2008 Eighth IEEE International Conference on Data Mining, Pisa, 2008, pp. 263-272.