CompanyDepot: Employer Name Normalization in the Online Recruitment Industry
In the recruitment domain, the employer name normalization task, which links employer names in job postings or resumes to entities in an employer knowledge base (KB), is important to many business applications. It has several unique challenges: handling employer names from both job postings and resumes, leveraging the corresponding location and url context, as well as handling name variations, irrelevant input data, and noise in the KB. In this talk, we present a system called CompanyDepot which uses machine learning techniques to address these challenges. The proposed system achieves 2.5% to 21.4% higher coverage at the same precision level compared to a legacy system used at CareerBuilder over multiple real-world datasets. After applying it to several applications at CareerBuilder, we faced a new challenge: how to avoid duplicate normalization results when the KB is noisy and contains many duplicate entities. To address this challenge, we extend the CompanyDepot system to normalize employer names not only at the entity level, but also at the cluster level, by mapping a query to the cluster in the KB that best matches it. The proposed system performs an efficient graph-based clustering based on external knowledge from five mapping sources. We also propose a new metric based on success rate and diversity reduction ratio for evaluating cluster-level normalization. Through experiments and applications, we demonstrate a large improvement in normalization quality from entity-level to cluster-level normalization.
8. Architecture of CompanyDepot
Once the index is ready, the system can take normalization requests. Each request consists of an employer name and its location context (part of the location information could be empty). The system then calls the searcher to retrieve a list of N candidate employer entities. The candidate entities are then sent to the reranking step, which generates a feature vector for each entity and uses a machine-learned ranking model to rank them. Finally, the top-ranked entity goes to the validation step, which uses a binary classifier to decide whether it is a correct match for the query. If it says yes, the system returns this entity to the user; otherwise, it outputs NIL.

Mapping Sources

Five mapping sources are used in both our entity-level normalization (to do query expansion) and cluster-level normalization (to do graph-based clustering). Each source contains a set of mappings from surface forms to normalized forms. Table 1 shows the statistics and examples for each source.
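The online flow above can be sketched as a small pipeline. The component signatures below are illustrative assumptions, not the system's actual API:

```python
def normalize(query, retrieve, rerank, validate, top_n=10):
    """Entity-level normalization flow: retrieve, rerank, validate.

    retrieve/rerank/validate stand in for the system's searcher,
    learned ranking model, and binary validation classifier.
    """
    candidates = retrieve(query, top_n)   # top-N candidate entities
    if not candidates:
        return None                       # NIL: nothing retrieved
    best = rerank(query, candidates)[0]   # top-ranked entity
    # Validation decides whether the top entity truly matches the query.
    return best if validate(query, best) else None
```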
Figure: Architecture of CompanyDepot. Offline, the indexing step and the clustering step build a KB index, a mapping index (over mapping sources 1-5), a cluster index, and clusters from the employer knowledge base. Online, a client query goes through the retrieval step, the reranking step (learning to rank), the validation step, and a cluster lookup, and the client receives an entity result and a cluster result.
10. Query Expansion using External Knowledge from 5 Mapping Sources
Table 1: Statistics and examples for mapping sources.
Source     Size  Example
Wikipedia  135K  IBM Corp. → International Business Machines Corporation
Stock      6K    MSFT → Microsoft Corporation
Hierarchy  272K  Amazon Web Services, Inc. → Amazon.com, Inc.
Legacy     26M   bankofamerica → Bank of America Corporation
Provider   10M   pricewaterhouse coopers → PwC
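As a sketch of how these sources feed query expansion, a surface-form lookup can add normalized variants to the query. The entries below come from Table 1; the lookup logic itself is an illustrative assumption:

```python
# A tiny in-memory stand-in for one mapping source (entries from Table 1).
MAPPINGS = {
    "ibm corp.": "International Business Machines Corporation",
    "msft": "Microsoft Corporation",
    "bankofamerica": "Bank of America Corporation",
}

def expand_query(name):
    """Return the query name plus any normalized form the mapping sources know."""
    variants = [name]
    normalized = MAPPINGS.get(name.lower())
    if normalized and normalized not in variants:
        variants.append(normalized)
    return variants
```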
11. Indexing Step
• Using Lucene indexer
Table 2: Index structure.

(a) A document in the KB index.
Field            Value
id               15
normalized form  International Business Machines Corporation
calibrated name  internationalbusinessmachines
domain           ibm.com
json             {"id": "15", "normalized form": "International Business Machines Corporation", …}

(b) A document in the mapping index.
Field            Value
surface form     IBM
normalized form  International Business Machines Corporation
mapping source   wikipedia

(c) A document in the cluster index.
Field                   Value
cluster member key      internationalbusinessmachines
cluster representative  International Business Machines Corporation
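The three document types in Table 2 can be mirrored as plain records. The field names follow the table; the dict-based lookups are an illustrative stand-in for the actual Lucene indexes:

```python
# One document per index, with the fields from Table 2.
kb_doc = {
    "id": "15",
    "normalized form": "International Business Machines Corporation",
    "calibrated name": "internationalbusinessmachines",
    "domain": "ibm.com",
}
mapping_doc = {
    "surface form": "IBM",
    "normalized form": "International Business Machines Corporation",
    "mapping source": "wikipedia",
}
cluster_doc = {
    "cluster member key": "internationalbusinessmachines",
    "cluster representative": "International Business Machines Corporation",
}

# Stand-in lookups: Lucene would index these fields for retrieval.
kb_index = {kb_doc["calibrated name"]: kb_doc}
cluster_index = {cluster_doc["cluster member key"]: cluster_doc}
```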
13. Reranking Step
1. Generate features for each entity:
• query features: query length, if query location/url is specified, etc.
• query-entity features: Lucene score, string similarity, location/url match, etc.
• entity features: entity popularity, # locations, legal word presence, etc.
2. Learn to rank the entities using coordinate ascent in RankLib,
• a list-wise method that can directly optimize any user-specified ranking measure (e.g., P@1).
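A sketch of the three feature groups for one (query, entity) pair. The feature names and the choice of difflib's ratio for string similarity are illustrative assumptions, not the system's actual feature set:

```python
from difflib import SequenceMatcher

def make_features(query, entity, lucene_score):
    """Build one feature vector for a (query, candidate entity) pair."""
    return {
        # query features
        "query_length": len(query["name"]),
        "has_location": int(bool(query.get("location"))),
        # query-entity features
        "lucene_score": lucene_score,
        "name_similarity": SequenceMatcher(
            None, query["name"].lower(), entity["name"].lower()).ratio(),
        "location_match": int(query.get("location") == entity.get("location")),
        # entity features
        "entity_popularity": entity.get("popularity", 0),
        "num_locations": entity.get("num_locations", 0),
    }
```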
16. Graph-Based Clustering using External Knowledge from 5 Mapping Sources
26. Metrics for Cluster-Level Normalization
• Success Rate (SR): how likely the system returns a correct result.
• Diversity Reduction Ratio (DRR): how much result diversity the system reduces correctly via clustering.
• Light-weight labeling: for each query, label whether the result returned by the system is correct.
For each query q, we label whether the result r returned by the system is correct or not. Let QS be the set of successful queries, i.e., the queries which receive a correct result, i.e., QS = {q ∈ Q | fC(q) is a correct result for q}. We define Success Rate (SR) of the system as

    SR = |QS| / |Q|    (1)

To measure the diversity in results returned by a system, we adapted the true diversity metric [14], which is defined based on entropy. As it does not matter how diverse the wrong results are, we only compute the diversity in the correct results. Let QS|r be the set of successful queries that are mapped to the cluster of r, i.e., QS|r = {q ∈ QS | fC(q) = r}. We first compute the entropy of the correct results as

    H = − Σ_{r ∈ R} (|QS|r| / |QS|) · ln(|QS|r| / |QS|)    (2)

The above entropy H ∈ [0, ln(|QS|)] is not linear to |QS|, which makes it a little hard to understand and interpret. So True Diversity [14] is proposed as TD = exp(H). It gives the effective number of correct clusters returned by the system, and is linear to |QS|.
6.3 Systems and Results

Table 5 summarizes the systems compared.

6.3.1 Results of Entity-Level Normalization. On the entity-level normalization datasets, we compare CD-V1, CD-V2-E, Legacy, and … For CD-V2-E, the output contains a confidence score. By varying the threshold on the confidence score, we can draw a precision-coverage curve. For systems where the confidence score is not available, we report a single precision and coverage value. Figure 4 shows the precision…
Based on the above True Diversity, we can compute how much result diversity the system reduces correctly, i.e., Diversity Reduction Ratio (DRR), which is in range [0, 1]:

    DRR = 1 − (exp(H) − 1) / (|QS| − 1)    (3)

Finally, we compute the f-score (or the harmonic mean) of Success Rate and Diversity Reduction Ratio to measure the normalization quality:

    F-score = (2 · SR · DRR) / (SR + DRR)    (4)
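Equations (1)-(4) can be computed directly from the per-query labels. A sketch, assuming at least two successful queries so the denominator of Eq. (3) is nonzero:

```python
import math
from collections import Counter

def cluster_level_metrics(labeled_results):
    """labeled_results: one (result, is_correct) pair per query.

    Returns (SR, DRR, F-score) per Eqs. (1)-(4); assumes at least
    two successful queries so Eq. (3)'s denominator is nonzero.
    """
    correct = [r for r, ok in labeled_results if ok]
    sr = len(correct) / len(labeled_results)                  # Eq. (1)
    # Entropy of the distribution of correct results over clusters.
    h = -sum((c / len(correct)) * math.log(c / len(correct))
             for c in Counter(correct).values())              # Eq. (2)
    drr = 1 - (math.exp(h) - 1) / (len(correct) - 1)          # Eq. (3)
    f = 2 * sr * drr / (sr + drr)                             # Eq. (4)
    return sr, drr, f
```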
The proposed metric has three merits. First, it shows both the correctness and the diversity of the results returned by a cluster-level normalization system. Second, it only requires lightweight labeling effort, i.e., labeling for each (query, result) pair whether the result is correct for the query or not, rather than labeling the correct cluster for each input query; the labels can also be reused and shared across systems. Third, it is consistent with the evaluation indicators of entity-level normalization: Success Rate…
27. Results on Cluster-Level Normalization Datasets
Figure 5: Results on JOBFEED (entity-level normalization).
Table 7: Results on cluster-level normalization datasets.

(a) Resume dataset.
System     Success Rate  Diversity Reduction Ratio  F-score
CD-V2-C    0.963         0.704                      0.814
CD-V1.5-C  0.897         0.688                      0.779
CD-V2-E    0.958         0.416                      0.580

(b) Job dataset.
System     Success Rate  Diversity Reduction Ratio  F-score
CD-V2-C    0.904         0.979                      0.940
CD-V1.5-C  0.778         0.981                      0.868
CD-V2-E    0.905         0.926                      0.915
On the other hand, CD-V2-C has a much higher diversity reduction ratio than CD-V2-E.
The employer name normalization task discussed in this work can be viewed as a general entity linking problem, yet it differs from the traditional entity linking task in three aspects [20]: (1) different data sources; (2) different contexts; (3) different KBs. Moreover, the employer name normalization task has unique challenges such as handling the location and the url context associated with employer names in jobs and resumes, as well as handling noise and duplicate entities in the KB. The system proposed in this paper adapts the three-module framework used in the entity linking systems. We also propose cluster-level normalization to avoid duplicate results, which is not considered in entity linking.

2.2 Domain-Specific Name Normalization

The system described in this paper extends the system in … with the following contributions: (1) performing query expansion based on external mapping sources and supporting using url context to improve normalization quality; (2) supporting normalization at the cluster level; (3) proposing a new metric for evaluating cluster-level normalization. More details will be described in Section …

Our work is also related to a set of domain-specific name normalization applications. For example, within the same recruitment domain, …
From entity-level normalization to cluster-level normalization:
✓ Correctness remained
✓ Diversity reduced
Candidate Search Results Facets
32. Calibrating Employer Names
1. Convert the name to lowercase, and replace 's with s;
2. Convert all the non-alphanumeric characters to space;
3. Remove stop-phrases (e.g., "pvt ltd" and "l l c") and stop-words (e.g., "inc", "corporation", "incorporated", and "the");
4. Expand commonly used abbreviations, e.g., "ctr" → "center", "svc" → "services";
5. Remove all spaces in the name.

Employer name                                After calibration
International Business Machines Corporation  internationalbusinessmachines
Sherman Howard L.L.C.                        shermanhoward
Oxnard Police Dept                           oxnardpolicedepartment
Macy's, Inc.                                 macys
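The five steps above, sketched in Python. The stop-phrase, stop-word, and abbreviation lists here are only the examples shown on the slide; the real dictionaries are larger:

```python
import re

# Illustrative lists: just the examples from the slide, not the full dictionaries.
STOP_PHRASES = ["pvt ltd", "l l c"]
STOP_WORDS = {"inc", "corporation", "incorporated", "the"}
ABBREVIATIONS = {"ctr": "center", "svc": "services", "dept": "department"}

def calibrate(name: str) -> str:
    # 1. Lowercase and replace possessive 's (straight or curly) with s.
    s = name.lower().replace("'s", "s").replace("\u2019s", "s")
    # 2. Convert non-alphanumeric characters to spaces.
    s = re.sub(r"[^a-z0-9]", " ", s)
    # 3. Remove stop-phrases, then stop-words.
    for phrase in STOP_PHRASES:
        s = s.replace(phrase, " ")
    tokens = [t for t in s.split() if t not in STOP_WORDS]
    # 4. Expand commonly used abbreviations.
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    # 5. Remove all spaces.
    return "".join(tokens)
```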