CompanyDepot: Employer Name Normalization in the Online Recruitment Industry
Qiaoling Liu
Sep. 2017
The Employer Name Normalization Task
Links employer names in job postings or resumes to entities in an employer knowledge base (KB)
A domain-specific case of entity linking
• Traditional entity linking: links entity mentions in text to entities in (often) a global KB
• Employer name normalization: links employer names in jobs / resumes to entities in an employer KB
Key Challenges
1. Handle name variations
   • Legacy names, nicknames, acronyms, typos
2. Handle irrelevant or unlinkable input data
   • E.g., “self-employed”, “not specified”
3. Handle employer names from both job postings and resumes
   • Different semi-structured formats
4. Leverage the location/URL context
   • E.g., (Macys.com, San Francisco)
5. Handle duplicates in the KB
   • E.g., {“Enterprise Rent A Car”, “Enterprise Rentacar”, “Enterprise Rent-A-Car Company”}
Unique Challenges!
Common Challenges!
Two Levels of Employer Name Normalization
Entity-level normalization:
• Handle name variations
• Handle irrelevant or unlinkable input data
• Handle employer names from both job postings & resumes
• Leverage the location/URL context
Cluster-level normalization:
• Handle duplicates in the KB
Entity-Level Normalization
-- mapping a query to an entity
(Figure: each query is linked to a single entity.)
Entities: Walmart Pharmacy; Target Pharmacy; Walmart Supercenter; Walmart; Wal-Mart Stores, Inc.; target.com; Target Corporation
Queries: walmart pharmacy; target.com; walmart
Cluster-Level Normalization
-- mapping a query to a cluster of entities
(Figure: each query is linked to a cluster of entities.)
Entities: Walmart Pharmacy; Target Pharmacy; Walmart Supercenter; Walmart; Wal-Mart Stores, Inc.; target.com; Target Corporation
Queries: walmart pharmacy; target.com; walmart
Architecture of CompanyDepot
(Text visible behind the diagram, cropped on the left in the source:) Once the index is ready, it can take normalization requests. Each request consists of an employer name and its location context (part of the location information could be empty). The system then uses a searcher to retrieve a list of N employer entities. The candidate entities are then sent to the reranking step, which generates a feature vector for each entity and uses a machine-learned ranking model to rank them. Finally, the top-ranked entity is sent to the validation step, which decides whether it is a correct result for the query using a binary classifier. If it says yes, the system returns this entity to the user; otherwise, it outputs NIL.
Mapping sources are used in both entity-level normalization (for query expansion) and cluster-level normalization (for graph-based clustering). Each source contains a set of mappings from surface forms to normalized forms.
(Figure: architecture diagram. Components shown:)
• Inputs: Employer Knowledge Base; Mapping Sources 1 to 5
• Offline: Indexing Step (KB Index, Mapping Index); Clustering Step (Clusters, Cluster Index)
• Online: Client, Query, Retrieval Step, Reranking Step, Validation Step, Entity Result, Cluster Lookup, Cluster Result
Learning to Rank
Entity-Level Normalization
Query Expansion using External Knowledge from 5 Mapping Sources
Table 1: Statistics and examples for mapping sources.
Source     Size   Example
Wikipedia  135K   IBM Corp. → International Business Machines Corporation
Stock      6K     MSFT → Microsoft Corporation
Hierarchy  272K   Amazon Web Services, Inc. → Amazon.com, Inc.
Legacy     26M    bankofamerica → Bank of America Corporation
Provider   10M    pricewaterhouse coopers → PwC
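As a rough illustration of how mappings like those in Table 1 could drive query expansion, here is a minimal sketch. The function names `build_mapping_index` and `expand_query`, the case-insensitive lookup, and the in-memory dictionary are all assumptions made for this example; the slides only say that the mappings are indexed with Lucene.

```python
from collections import defaultdict


def build_mapping_index(mappings):
    """Build a lookup from lowercased surface form to normalized forms.

    `mappings` is an iterable of (surface_form, normalized_form, source)
    triples, mirroring the fields of a mapping-index document.
    """
    index = defaultdict(set)
    for surface_form, normalized_form, _source in mappings:
        index[surface_form.lower()].add(normalized_form)
    return index


def expand_query(query, index):
    """Return the original query followed by any mapped normalized forms."""
    variants = [query]
    # Sorting keeps the output deterministic when several sources agree.
    for normalized_form in sorted(index.get(query.lower(), ())):
        if normalized_form not in variants:
            variants.append(normalized_form)
    return variants


# Example rows taken from Table 1.
TABLE_1_EXAMPLES = [
    ("IBM Corp.", "International Business Machines Corporation", "wikipedia"),
    ("MSFT", "Microsoft Corporation", "stock"),
    ("pricewaterhouse coopers", "PwC", "provider"),
]
```

With these rows, expanding "MSFT" yields the original string plus "Microsoft Corporation", and a query with no mapping is returned unchanged.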
Indexing Step
• Using Lucene indexer
Table 2: Index structure.
(a) A document in the KB index.
Field            Value
id               15
normalized form  International Business Machines Corporation
calibrated name  internationalbusinessmachines
domain           ibm.com
json             {“id”: “15”, “normalized form”: “International Business Machines Corporation”, …}
(b) A document in the mapping index.
Field            Value
surface form     IBM
normalized form  International Business Machines Corporation
mapping source   wikipedia
(c) A document in the cluster index.
Field                   Value
cluster member key      internationalbusinessmachines
cluster representative  International Business Machines Corporation
Retrieval Step
1. Get the top 1000 entities using a Lucene aggregated search combining (1) keyword searches; (2) fuzzy searches; (3) phrase searches.
   • Query expansion based on mappings, e.g., MSFT → Microsoft Corporation
2. From these results, get the top N1 entities by Lucene score, the top N2 entities by Levenshtein distance, the top N3 entities by mapping table, and the top N4 entities by URL matching.
3. Return the pool of N = N1 + N2 + N3 + N4 entities (N is about 10 to 20).
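The pooling in steps 2 and 3 can be sketched as follows. This is a simplified illustration, not the system's code: the candidate dictionaries, the `build_pool` name, the default sizes, and the choice to deduplicate the pool (so it may hold fewer than N1 + N2 + N3 + N4 entries) are assumptions. The Lucene search itself is out of scope here; candidates are assumed to arrive with a score already attached.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def build_pool(query, candidates, mapped_names, query_domain=None,
               n1=5, n2=5, n3=5, n4=5):
    """Merge the top candidates under four criteria into one pool.

    Each candidate is a dict with hypothetical keys "name",
    "lucene_score", and "domain". `mapped_names` holds the normalized
    forms that the mapping table returned for the query.
    """
    q = query.lower()
    by_score = sorted(candidates, key=lambda c: -c["lucene_score"])[:n1]
    by_edit = sorted(candidates,
                     key=lambda c: levenshtein(q, c["name"].lower()))[:n2]
    by_mapping = [c for c in candidates if c["name"] in mapped_names][:n3]
    by_url = [c for c in candidates
              if query_domain and c["domain"] == query_domain][:n4]

    pool, seen = [], set()
    for cand in by_score + by_edit + by_mapping + by_url:
        if cand["name"] not in seen:  # keep the first occurrence only
            seen.add(cand["name"])
            pool.append(cand)
    return pool
```

Taking the union of several small top-k lists means an entity that is weak on Lucene score but strong on, say, the mapping table still reaches the reranker.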
Reranking Step
1. Generate features for each entity:
   • query features: query length, whether the query location/URL is specified, etc.
   • query-entity features: Lucene score, string similarity, location/URL match, etc.
   • entity features: entity popularity, # locations, legal word presence, etc.
2. Learn to rank the entities using coordinate ascent in RankLib
   • a list-wise method that can directly optimize any user-specified ranking measure (e.g., P@1).
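To make the idea of directly optimizing P@1 concrete, here is a deliberately simplified coordinate-ascent loop over the weights of a linear scoring function. It is not RankLib's implementation (which, among other things, handles restarts and weight normalization differently); the function names, step sizes, and data layout are assumptions for this sketch.

```python
def precision_at_1(weights, queries):
    """Fraction of queries whose top-scored candidate is labeled correct.

    `queries` is a list of candidate lists; each candidate is a
    (feature_vector, is_correct) pair with is_correct equal to 0 or 1.
    """
    hits = 0
    for candidates in queries:
        top = max(candidates,
                  key=lambda c: sum(w * x for w, x in zip(weights, c[0])))
        hits += top[1]
    return hits / len(queries)


def coordinate_ascent(queries, n_features, steps=(1.0, 0.5, 0.25), n_rounds=5):
    """Adjust one weight at a time, keeping changes that raise P@1."""
    weights = [1.0 / n_features] * n_features
    best = precision_at_1(weights, queries)
    for _ in range(n_rounds):
        improved = False
        for i in range(n_features):
            for step in steps:
                for delta in (step, -step):
                    trial = list(weights)
                    trial[i] += delta
                    score = precision_at_1(trial, queries)
                    if score > best:  # accept strictly better settings only
                        weights, best, improved = trial, score, True
        if not improved:
            break  # no single-coordinate move helps any more
    return weights, best
```

Because the objective is evaluated on whole ranked lists rather than on individual pairs, the measure being optimized can be swapped for another list-level metric by replacing `precision_at_1`.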
Validation Step
1. Generate features for the top-ranked entity
   • all features from the previous step;
   • score of the learning-to-rank method.
2. Classify the top-ranked entity into CORRECT or WRONG
   • binary classification using LibSVM.
Cluster-Level Normalization
Graph-Based Clustering using External Knowledge from 5 Mapping Sources
Create an Undirected Graph
(Figure: an undirected graph over calibrated names, with integer edge weights 4, 1, 3, 2, 3, 2, 1.)
Nodes: walmartcanada, walmartsupercenter, walmart, walmartstores, wamart, targetpharmacy, target, targets, targetstore
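Remove Low-Quality Edges
(Figure: the same graph after edge removal, now split into two groups.)
Walmart group: walmartcanada, walmartsupercenter, walmart, walmartstores, wamart (edge weights 4, 1, 3, 2)
Target group: targetpharmacy, target, targets, targetstore (edge weights 3, 2, 1)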
Find All Connected Components as Clusters
Cluster 1: walmartcanada, walmartsupercenter, walmart, walmartstores, wamart
Cluster 2: targetpharmacy, target, targets, targetstore
Select Cluster Representative Entity
Wal-Mart Stores, Inc.: walmartcanada, walmartsupercenter, walmart, walmartstores
Target Corporation: targetpharmacy, target, targetstore
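The graph steps above (build the graph, drop low-quality edges, take connected components) can be sketched with a small union-find. The slides do not state how an edge is judged low-quality, so the sketch takes that rule as a caller-supplied predicate; the edge list in `EXAMPLE_EDGES`, including which nodes each weight connects, is hypothetical and only loosely modeled on the figure. Choosing the representative entity is not covered.

```python
from collections import defaultdict


def cluster_names(edges, is_low_quality):
    """Return connected components after dropping low-quality edges.

    `edges` is a list of (node_a, node_b, weight) triples.
    `is_low_quality(a, b, weight)` is a hypothetical predicate standing
    in for the system's real edge filter, which the slides do not give.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, weight in edges:
        find(a)  # register both endpoints even if the edge is dropped
        find(b)
        if not is_low_quality(a, b, weight):
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical edges; the real endpoints are not given in the slides.
EXAMPLE_EDGES = [
    ("walmart", "walmartstores", 4),
    ("walmart", "wamart", 1),
    ("walmart", "walmartsupercenter", 3),
    ("walmart", "walmartcanada", 2),
    ("target", "targetpharmacy", 3),
    ("target", "targets", 2),
    ("target", "targetstore", 1),
    ("walmartsupercenter", "targetpharmacy", 1),
]
```

For instance, passing a predicate that rejects only the walmartsupercenter/targetpharmacy edge splits the example into one Walmart cluster and one Target cluster, matching the two components shown on the slide.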
Experiments
Entity-Level Datasets
Table 4: Statistics about the entity-level datasets. %Country (State, URL) means the percentage of queries with country (state, URL) specified. %US means the percentage of queries with country=US when country is specified.
Dataset  #Queries  %Country  %US    %State  %URL
RDB      1098      58.5%     96.4%  50.9%   0%
EDGE     1093      97.3%     45.3%  20.8%   0%
JOB1     1100      100%      100%   99.7%   0%
JOB2     500       100%      98.4%  100%    0%
JOBFEED  453       87.5%     100%   87.5%   100%
Metrics for Entity-Level Normalization
• Ic: number of correct results; Iw: number of wrong results; In: number of null results
• Precision = Ic / (Ic + Iw): percentage of correct results out of all non-null results.
• Coverage = (Ic + Iw) / (Ic + Iw + In): percentage of queries for which a non-null result is returned.
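The two definitions translate directly into code. The function name and the convention of returning 0.0 when a denominator is zero are assumptions of this sketch.

```python
def precision_and_coverage(n_correct, n_wrong, n_null):
    """Compute entity-level precision and coverage from result counts.

    n_correct, n_wrong, and n_null correspond to Ic, Iw, and In.
    A zero denominator yields 0.0 (an assumption made for this sketch).
    """
    returned = n_correct + n_wrong  # queries with a non-null result
    total = returned + n_null       # all queries
    precision = n_correct / returned if returned else 0.0
    coverage = returned / total if total else 0.0
    return precision, coverage
```

For example, 80 correct, 10 wrong, and 10 null results give a precision of 80/90 and a coverage of 90/100.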
Results on Entity-Level Normalization Datasets
(Figure: precision (y-axis) vs. coverage (x-axis) on RDB, EDGE, JOB1, and JOB2 for CD-V2-E, CD-V1, Legacy, and WService.)
Figure 5: Results on JOBFEED (entity-level normalization). Precision vs. coverage for CD-V2-E (using query url), CD-V2-E (ignoring query url), and WService.
Cluster-Level Datasets
• Resume dataset
   • Search for resumes by the 98 most frequent search queries about companies
   • Get the 20 most frequent raw employer names from these resumes
   • Collect 817 unique raw employer names from resumes
• Job dataset
   • Get the top 182 employer entities with the most jobs by a baseline normalizer
   • Get the raw employer names in the jobs posted by these entities
   • Collect 6515 unique raw employer names from job postings
Metrics for Cluster-Level Normalization
• Success Rate (SR):
   • how likely the system returns a correct result.
• Diversity Reduction Ratio (DRR):
   • how much result diversity the system reduces correctly via clustering.
• Light-weight labeling:
   • for each query, label whether the result returned by the system is correct.
(Excerpt from the paper shown on the slide:)
For each query $q$, we label whether the result $r$ returned by the system is correct or not. Let $Q_S$ be the set of successful queries, i.e., the queries which receive a correct result: $Q_S = \{q \in Q \mid f_C(q) \text{ is a correct result for } q\}$. We define the Success Rate (SR) of the system as

$$\mathrm{SR} = \frac{|Q_S|}{|Q|} \tag{1}$$

To measure the diversity in results returned by a system, we adapted the true diversity metric [14], which is defined based on entropy. As it does not matter how diverse the wrong results are, we only compute the diversity in the correct results. Let $Q_{S|r}$ be the set of successful queries that are mapped to the cluster of $r$, i.e., $Q_{S|r} = \{q \in Q_S \mid f_C(q) = r\}$. We first compute the entropy of the correct results as

$$H = -\sum_{r \in R} \frac{|Q_{S|r}|}{|Q_S|} \cdot \ln\left(\frac{|Q_{S|r}|}{|Q_S|}\right) \tag{2}$$

The above entropy $H \in [0, \ln(|Q_S|)]$ is not linear to $|Q_S|$, which makes it a little hard to understand and interpret. So True Diversity [14] is proposed as $\mathrm{TD} = \exp(H)$. It gives the effective number of correct clusters returned by the system, and is linear to $|Q_S|$.
Based on the above True Diversity, we can compute how much result diversity the system reduces correctly, i.e., the Diversity Reduction Ratio (DRR), which is in range $[0, 1]$:

$$\mathrm{DRR} = 1 - \frac{\exp(H) - 1}{|Q_S| - 1} \tag{3}$$

Finally, we compute the F-score (the harmonic mean) of Success Rate and Diversity Reduction Ratio to measure the normalization quality:

$$\text{F-score} = \frac{2 \cdot \mathrm{SR} \cdot \mathrm{DRR}}{\mathrm{SR} + \mathrm{DRR}} \tag{4}$$

Among the merits of the proposed metric: it reflects both the correctness and the diversity of the results of a cluster-level normalization system, and it only needs light-weight labeling effort, i.e., labeling for each (query, result) pair whether the result is correct for the query or not.
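Equations (1) to (4) can be computed from the light-weight labels alone. The sketch below assumes each labeled query is given as a (cluster id, is-correct) pair; the function name and the decision to raise an error when fewer than two queries succeed (where the DRR denominator would be zero) are assumptions, not something the source specifies.

```python
import math
from collections import Counter


def cluster_level_metrics(results):
    """Return (SR, DRR, F-score) for labeled cluster-level results.

    `results` holds one (cluster_id, is_correct) pair per query.
    """
    if not results:
        raise ValueError("results must not be empty")
    correct_clusters = [cluster for cluster, ok in results if ok]
    n_success = len(correct_clusters)  # |Q_S|
    if n_success < 2:
        # DRR divides by |Q_S| - 1, so it is undefined here.
        raise ValueError("DRR needs at least two successful queries")

    sr = n_success / len(results)                                  # Eq. (1)
    shares = [n / n_success for n in Counter(correct_clusters).values()]
    entropy = -sum(p * math.log(p) for p in shares)                # Eq. (2)
    drr = 1.0 - (math.exp(entropy) - 1.0) / (n_success - 1)        # Eq. (3)
    f_score = 2 * sr * drr / (sr + drr) if sr + drr > 0 else 0.0   # Eq. (4)
    return sr, drr, f_score
```

Two boundary cases help with intuition: if every successful query lands in the same cluster, the entropy is 0 and DRR is 1; if every successful query lands in its own cluster, exp(H) equals |Q_S| and DRR is 0.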
Results on Cluster-Level Normalization Datasets
Table 7: Results on cluster-level normalization datasets.
(a) Resume dataset.
System     SuccessRate  DiversityReductionRatio  F-score
CD-V2-C    0.963        0.704                    0.814
CD-V1.5-C  0.897        0.688                    0.779
CD-V2-E    0.958        0.416                    0.580
(b) Job dataset.
System     SuccessRate  DiversityReductionRatio  F-score
CD-V2-C    0.904        0.979                    0.940
CD-V1.5-C  0.778        0.981                    0.868
CD-V2-E    0.905        0.926                    0.915
CD-V2-C has a much higher diversity reduction ratio than CD-V2-E on the resume dataset (0.704 vs. 0.416), with a similar success rate (0.963 vs. 0.958).
(Text from the paper's related-work section visible on the slide, cropped on the right:) The employer name normalization task can be viewed as a general entity linking problem, yet it differs from the traditional entity linking task in three aspects [20]: (1) different data sources; (2) different contexts; (3) different KBs. The task has unique challenges, such as handling the location and URL context associated with employer names in jobs and resumes, and handling duplicate entities in the KB. The proposed system adapts the three-module framework used in entity linking systems, and adds cluster-level normalization to handle duplicate results, which is not considered in entity linking.
2.2 Domain-Specific Name Normalization
The system described in this paper extends an earlier system with the following contributions: (1) performing query expansion based on external mapping sources and supporting URL context to improve normalization quality; (2) supporting normalization at cluster level; (3) proposing a new metric for evaluating cluster-level normalization. The work is also related to a set of domain-specific name normalization applications.
From entity-level normalization to cluster-level normalization:
✓ Correctness remained
✓ Diversity reduced
Candidate Search Results Facets
Conclusion and Future Work
✓ Presented CompanyDepot: supporting employer name normalization at both entity and cluster level
✓ Proposed new metrics for cluster-level normalization
□ Improve clustering, e.g., merge and split
□ Develop more features for entity quality and query segmentation
□ Improve the quality and coverage of the employer KB
Thank you!
Any questions?
qiaoling.liu@careerbuilder.com
Backup
Calibrating Employer Names
1. Convert the name to lowercase, and replace ’s with s;
2. Convert all the non-alphanumeric characters to space;
3. Remove stop-phrases (e.g., “pvt ltd” and “l l c”) and stop-words (e.g., “inc”, “corporation”, “incorporated”, and “the”);
4. Expand commonly used abbreviations, e.g., “ctr” → “center”, “svc” → “services”;
5. Remove all spaces in the name.
Employer name                                After calibration
International Business Machines Corporation  internationalbusinessmachines
Sherman Howard L.L.C.                        shermanhoward
Oxnard Police Dept                           oxnardpolicedepartment
Macy’s, Inc.                                 macys
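The five calibration steps can be sketched as a short function. The stop-phrase, stop-word, and abbreviation lists below contain only the examples from the slide, plus "dept" → "department", which is inferred from the Oxnard example; the real lists are presumably much longer. Treating only ASCII letters and digits as alphanumeric is another assumption of this sketch.

```python
import re

# Only the examples given on the slide; the full lists are not published there.
STOP_PHRASES = ("pvt ltd", "l l c")
STOP_WORDS = {"inc", "corporation", "incorporated", "the"}
# "dept" is inferred from the "Oxnard Police Dept" example.
ABBREVIATIONS = {"ctr": "center", "svc": "services", "dept": "department"}


def calibrate(name):
    """Reduce an employer name to a compact key for matching."""
    # Step 1: lowercase and fold possessive 's (curly or straight quote).
    s = name.lower().replace("’s", "s").replace("'s", "s")
    # Step 2: non-alphanumeric runs become a single space (ASCII only).
    s = re.sub(r"[^a-z0-9]+", " ", s).strip()
    # Step 3a: remove multi-word stop-phrases on word boundaries.
    for phrase in STOP_PHRASES:
        s = re.sub(rf"\b{re.escape(phrase)}\b", " ", s)
    # Step 3b and 4: drop stop-words, expand abbreviations.
    tokens = [ABBREVIATIONS.get(t, t) for t in s.split() if t not in STOP_WORDS]
    # Step 5: remove all spaces.
    return "".join(tokens)
```

Run on the four names in the table, this sketch reproduces the calibrated forms shown there.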
Related Work
• Entity Linking with a Knowledge Base
• Domain-Specific Name Normalization
• Deduplicating Domain-Specific KBs
• Clustering Methods and Evaluation Metrics
Qiaoling Liu, Lead Data Scientist, CareerBuilder at MLconf ATL 2017
June Andrews - The Uncanny Valley of MLJune Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLMLconf
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksMLconf
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...MLconf
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldMLconf
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...MLconf
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...MLconf
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...MLconf
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeMLconf
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...MLconf
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareMLconf
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesMLconf
 

Mehr von MLconf (20)

Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
 
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
 
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushIgor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
 
Josh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceJosh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious Experience
 
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
 
Meghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMeghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the Cheap
 
Noam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data CollectionNoam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data Collection
 
June Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLJune Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of ML
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI World
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to code
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better Software
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime Changes
 

KĂźrzlich hochgeladen

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel AraĂşjo
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

KĂźrzlich hochgeladen (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 

Qiaoling Liu, Lead Data Scientist, CareerBuilder at MLconf ATL 2017

  • 4. Key Challenges: (1) Handle name variations: legacy names, nicknames, acronyms, typos. (2) Handle irrelevant or unlinkable input data, e.g., “self-employed”, “not specified”. (3) Handle employer names from both job postings and resumes, which come in different semi-structured formats. (4) Leverage the location/URL context, e.g., (Macys.com, San Francisco). (5) Handle duplicates in the KB, e.g., {“Enterprise Rent A Car”, “Enterprise Rentacar”, “Enterprise Rent-A-Car Company”}. The slide labels one group of these as “Unique Challenges!” and the other as “Common Challenges!”.
  • 6. Entity-Level Normalization: mapping a query to an entity. [Diagram. Queries: walmart, walmart pharmacy, target.com. Entities: Walmart Pharmacy, Target Pharmacy, Walmart Supercenter, Walmart, Wal-Mart Stores, Inc., target.com, Target Corporation.]
  • 7. Cluster-Level Normalization: mapping a query to a cluster of entities. [Diagram. Queries: walmart, walmart pharmacy, target.com. Entities: Walmart Pharmacy, Target Pharmacy, Walmart Supercenter, Walmart, Wal-Mart Stores, Inc., target.com, Target Corporation.]
  • 8. Architecture of CompanyDepot. [Diagram labels. Offline: Employer Knowledge Base, Mapping Sources 1 to 5, Indexing Step, Clustering Step, Clusters, KB Index, Mapping Index, Cluster Index. Online: Client, Query, Retrieval Step, Reranking Step (Learning to Rank), Validation Step, Entity Result, Cluster Lookup, Cluster Result.]
    Paper excerpt shown on the slide (partly cut off; gaps restored from context): Once the index is ready, it can take normalization requests. Each request consists of an employer name and its location context (part of the location information could be empty). The system then uses a searcher to retrieve a list of N employer entities. The candidate entities are then sent to the reranking step, which generates a feature vector for each entity and uses a machine-learned ranking model to rank them. Finally, the top-ranked entity is sent to the validation step to decide whether it is a correct result for the query using a binary classifier. If it says yes, the system outputs this entity to the user; otherwise, it outputs NIL.
    Mapping Sources: the mapping sources are used in both our entity-level normalization (to do query expansion) and cluster-level normalization (to do graph-based clustering). Each source contains a set of mappings from surface forms to normalized forms. Table 1 shows the statistics and examples for each source.
  • 10. Query Expansion using External Knowledge from 5 Mapping Sources
    Table 1: Statistics and examples for mapping sources.
    Source     Size   Example
    Wikipedia  135K   IBM Corp. → International Business Machines Corporation
    Stock      6K     MSFT → Microsoft Corporation
    Hierarchy  272K   Amazon Web Services, Inc. → Amazon.com, Inc.
    Legacy     26M    bankofamerica → Bank of America Corporation
    Provider   10M    pricewaterhouse coopers → PwC
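As a rough illustration of how a mapping table like Table 1 could drive query expansion, the sketch below looks a query up by its surface form and appends any normalized forms it maps to. The function names, the lowercase lookup key, and the source tags are assumptions made for this example; the slides do not describe how CompanyDepot implements the lookup.

```python
from collections import defaultdict

# Example rows from Table 1: (surface form, normalized form, mapping source).
MAPPINGS = [
    ("IBM Corp.", "International Business Machines Corporation", "wikipedia"),
    ("MSFT", "Microsoft Corporation", "stock"),
    ("Amazon Web Services, Inc.", "Amazon.com, Inc.", "hierarchy"),
    ("bankofamerica", "Bank of America Corporation", "legacy"),
    ("pricewaterhouse coopers", "PwC", "provider"),
]


def build_mapping_index(mappings):
    """Index normalized forms by a case-insensitive surface-form key (assumed keying)."""
    index = defaultdict(list)
    for surface, normalized, source in mappings:
        index[surface.strip().lower()].append((normalized, source))
    return index


def expand_query(query, index):
    """Return the original query followed by any normalized forms it maps to."""
    expansions = [query]
    for normalized, _source in index.get(query.strip().lower(), []):
        if normalized not in expansions:
            expansions.append(normalized)
    return expansions


index = build_mapping_index(MAPPINGS)
print(expand_query("msft", index))
```

An unmapped query simply comes back unchanged, so expansion can only add search terms and never removes the user's original input.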
  • 11. Indexing Step: using the Lucene indexer.
    Table 2: Index structure.
    (a) A document in the KB index.
    Field            Value
    id               15
    normalized form  International Business Machines Corporation
    calibrated name  internationalbusinessmachines
    domain           ibm.com
    json             {“id”: “15”, “normalized form”: “International Business Machines Corporation”, …}
    (b) A document in the mapping index.
    Field            Value
    surface form     IBM
    normalized form  International Business Machines Corporation
    mapping source   wikipedia
    (c) A document in the cluster index.
    Field                   Value
    cluster member key      internationalbusinessmachines
    cluster representative  International Business Machines Corporation
  • 12. Retrieval Step: (1) Get the top 1000 entities using a Lucene aggregated search combining keyword searches, fuzzy searches, and phrase searches, with query expansion based on mappings, e.g., MSFT → Microsoft Corporation. (2) From these results, get the top N1 entities by Lucene score, the top N2 entities by Levenshtein distance, the top N3 entities by mapping table, and the top N4 entities by URL matching. (3) Return the pool of N = N1 + N2 + N3 + N4 entities (N is about 10 to 20).
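The pooling in step (2) can be sketched as follows. This is a minimal illustration, not the CompanyDepot code: the candidate dictionaries with `name`, `domain`, and `score` keys, the `build_pool` function, and the whole-string Levenshtein comparison are all assumptions, and the Lucene search itself is replaced by a hand-written candidate list. Because the same entity can rank highly under several criteria, this sketch deduplicates, so its pool holds at most N1 + N2 + N3 + N4 entities.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_pool(query, query_url, candidates, mapped_names, n1=2, n2=2, n3=2, n4=2):
    """Union of the top entities under four criteria, kept in first-seen order.

    candidates: list of dicts with hypothetical keys "name", "domain", "score".
    mapped_names: normalized forms the mapping table returned for the query.
    """
    by_score = sorted(candidates, key=lambda c: -c["score"])[:n1]
    by_edit = sorted(
        candidates, key=lambda c: levenshtein(query.lower(), c["name"].lower())
    )[:n2]
    by_mapping = [c for c in candidates if c["name"] in mapped_names][:n3]
    by_url = [c for c in candidates if query_url and c["domain"] == query_url][:n4]

    pool, seen = [], set()
    for c in by_score + by_edit + by_mapping + by_url:
        if c["name"] not in seen:
            seen.add(c["name"])
            pool.append(c)
    return pool


# Stand-in for the Lucene result list; scores are made up for illustration.
candidates = [
    {"name": "Wal-Mart Stores, Inc.", "domain": "walmart.com", "score": 2.5},
    {"name": "Walmart", "domain": "walmart.com", "score": 2.0},
    {"name": "Target Corporation", "domain": "target.com", "score": 0.1},
]
pool = build_pool("walmart", None, candidates, {"Wal-Mart Stores, Inc."},
                  n1=1, n2=1, n3=1, n4=1)
print([c["name"] for c in pool])
```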
  • 13. Reranking Step: (1) Generate features for each entity: query features (query length, whether the query location/URL is specified, etc.); query-entity features (Lucene score, string similarity, location/URL match, etc.); entity features (entity popularity, number of locations, legal word presence, etc.). (2) Learn to rank the entities using coordinate ascent in RankLib, a list-wise method that can directly optimize any user-specified ranking measure (e.g., P@1).
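To make the idea of directly optimizing P@1 concrete, here is a deliberately simplified coordinate-ascent sketch over a linear scoring function. It is not RankLib's implementation, which is more elaborate; the step sizes, the starting weights, the two toy features, and the data layout are assumptions chosen for this example.

```python
def precision_at_1(weights, queries):
    """Fraction of queries whose top-scored candidate is labeled correct."""
    hits = 0
    for candidates in queries:
        best = max(
            candidates,
            key=lambda c: sum(w * x for w, x in zip(weights, c["features"])),
        )
        hits += best["correct"]
    return hits / len(queries)


def coordinate_ascent(queries, n_features, steps=(-1.0, -0.5, 0.5, 1.0), rounds=5):
    """Greedy search: change one weight at a time, keep changes that raise P@1."""
    weights = [1.0] * n_features
    best = precision_at_1(weights, queries)
    for _ in range(rounds):
        improved = False
        for i in range(n_features):
            for step in steps:
                trial = list(weights)
                trial[i] += step
                score = precision_at_1(trial, queries)
                if score > best:
                    weights, best, improved = trial, score, True
        if not improved:
            break  # no single-weight change helps any more
    return weights, best


# Toy training data: features are [retrieval score, string similarity] (made up).
queries = [
    [{"features": [0.9, 0.2], "correct": False},
     {"features": [0.4, 0.6], "correct": True}],
    [{"features": [0.3, 0.9], "correct": True},
     {"features": [0.8, 0.1], "correct": False}],
]
weights, best = coordinate_ascent(queries, n_features=2)
print(weights, best)
```

With equal starting weights the first toy query is ranked wrongly; lowering the weight of the first feature fixes it, which is exactly the kind of move coordinate ascent explores one dimension at a time.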
  • 14. Validation Step: (1) Generate features for the top-ranked entity: all features from the previous step, plus the score of the learning-to-rank method. (2) Classify the top-ranked entity as CORRECT or WRONG: binary classification using LibSVM.
  • 16. Graph-Based Clustering using External Knowledge from 5 Mapping Sources. (The slide repeats Table 1, the statistics and examples for the mapping sources shown on slide 10.)
  • 17. Create an Undirected Graph. [Figure: nodes walmartcanada, walmartsupercenter, walmart, walmartstores, wamart, targetpharmacy, target, targets, targetstore, connected by weighted edges; the edge weights shown are 4, 1, 3, 2, 3, 2, 1.]
  • 19. Find All Connected Components as Clusters. [Figure: the graph nodes grouped into clusters: walmartcanada, walmartsupercenter, walmart, walmartstores, wamart; and targetpharmacy, target, targets, targetstore.]
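Finding connected components is a standard graph traversal; a minimal sketch follows. The node names come from the slide, but the edge list is hypothetical, since the figure's actual edges are not recoverable from the transcript, and edge weights are ignored here.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Return the clusters (sorted lists of nodes) of an undirected graph."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters


nodes = ["walmartcanada", "walmartsupercenter", "walmart", "walmartstores",
         "wamart", "targetpharmacy", "target", "targets", "targetstore"]
# Hypothetical edges, e.g. derived from mappings between surface forms.
edges = [("walmartcanada", "walmartstores"), ("walmartsupercenter", "walmart"),
         ("walmart", "walmartstores"), ("wamart", "walmart"),
         ("targetpharmacy", "target"), ("targets", "target"),
         ("targetstore", "target")]
clusters = connected_components(nodes, edges)
for cluster in clusters:
    print(cluster)
```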
  • 20. Select Cluster Representative Entity. [Figure: the cluster of walmartcanada, walmartsupercenter, walmart, walmartstores is represented by Wal-Mart Stores, Inc.; the cluster of targetpharmacy, target, targetstore is represented by Target Corporation.]
  • 22. Entity-Level Datasets
    Table 4: Statistics about the entity-level datasets. %Country (%State, %URL) means the percentage of queries with country (state, URL) specified. %US means the percentage of queries with country=US when country is specified.
    Dataset  #Queries  %Country  %US    %State  %URL
    RDB      1098      58.5%     96.4%  50.9%   0%
    EDGE     1093      97.3%     45.3%  20.8%   0%
    JOB1     1100      100%      100%   99.7%   0%
    JOB2     500       100%      98.4%  100%    0%
    JOBFEED  453       87.5%     100%   87.5%   100%
  • 23. Metrics for Entity-Level Normalization. Let Ic be the number of correct results, Iw the number of wrong results, and In the number of null results. Precision = Ic / (Ic + Iw): the percentage of correct results out of all non-null results. Coverage = (Ic + Iw) / (Ic + Iw + In): the percentage of queries for which a non-null result is returned.
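The two formulas translate directly into code. The sketch below also shows one way a single point on a precision-coverage curve could be obtained by treating low-confidence results as null; that thresholding helper, the function names, and the sample numbers are assumptions for illustration only.

```python
def precision_and_coverage(n_correct, n_wrong, n_null):
    """Precision = Ic / (Ic + Iw); Coverage = (Ic + Iw) / (Ic + Iw + In)."""
    answered = n_correct + n_wrong
    total = answered + n_null
    precision = n_correct / answered if answered else 0.0
    coverage = answered / total if total else 0.0
    return precision, coverage


def curve_point(results, threshold):
    """results: list of (confidence, is_correct); results below threshold count as null."""
    kept = [ok for confidence, ok in results if confidence >= threshold]
    n_correct = sum(kept)
    return precision_and_coverage(
        n_correct, len(kept) - n_correct, len(results) - len(kept)
    )


# Made-up labeled results for four queries.
results = [(0.9, True), (0.8, True), (0.6, False), (0.3, True)]
for threshold in (0.5, 0.7):
    print(threshold, curve_point(results, threshold))
```

Raising the threshold in this toy data trades coverage (0.75 down to 0.5) for precision (2/3 up to 1.0), which is the trade-off a precision-coverage plot visualizes.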
  • 24. Results on Entity-Level Normalization Datasets. [Figure: precision (y-axis) versus coverage (x-axis) for the systems CD-V2-E, CD-V1, Legacy, and WService on the RDB, EDGE, JOB1, and JOB2 datasets.] [Figure 5: Results on JOBFEED (entity-level normalization): precision versus coverage for CD-V2-E (using query url), CD-V2-E (ignoring query url), and WService.]
  • 25. Cluster-Level Datasets. Resume dataset: search for resumes by the 98 most frequent search queries about companies; get the 20 most frequent raw employer names from these resumes; collect 817 unique raw employer names from resumes. Job dataset: get the top 182 employer entities with the most jobs by a baseline normalizer; get the raw employer names in the jobs posted by these entities; collect 6515 unique raw employer names from job postings.
  • 26. Metrics for Cluster-Level Normalization. Success Rate (SR): how likely the system returns a correct result. Diversity Reduction Ratio (DRR): how much result diversity the system reduces correctly via clustering. Light-weight labeling: for each query, label whether the result returned by the system is correct.
    Paper excerpt shown on the slide: For each query $q$, we label whether the result $r$ returned by the system is correct or not. Let $Q_S$ be the set of successful queries, i.e., the queries which receive a correct result: $Q_S = \{q \in Q \mid f_C(q) \text{ is a correct result for } q\}$. We define the Success Rate (SR) of the system as
    $SR = \frac{|Q_S|}{|Q|}$  (1)
    To measure the diversity in results returned by a system, we adapted the true diversity metric [14], which is defined based on entropy. As it does not matter how diverse the wrong results are, we only compute the diversity in the correct results. Let $Q_{S|r}$ be the set of successful queries that are mapped to the cluster of $r$, i.e., $Q_{S|r} = \{q \in Q_S \mid f_C(q) = r\}$. We first compute the entropy of the correct results as
    $H = -\sum_{r \in R} \frac{|Q_{S|r}|}{|Q_S|} \cdot \ln\left(\frac{|Q_{S|r}|}{|Q_S|}\right)$  (2)
    The above entropy $H \in [0, \ln(|Q_S|)]$ is not linear to $|Q_S|$, which makes it a little hard to understand and interpret. So True Diversity [14] is proposed as $TD = \exp(H)$. It gives the effective number of correct clusters returned by the system, and is linear to $|Q_S|$. Based on the above True Diversity, we can compute how much result diversity the system reduces correctly, i.e., the Diversity Reduction Ratio (DRR), which is in the range [0, 1]:
    $DRR = 1 - \frac{\exp(H) - 1}{|Q_S| - 1}$  (3)
    Finally, we compute the F-score (the harmonic mean) of Success Rate and Diversity Reduction Ratio to measure the normalization quality:
    $\text{F-score} = \frac{2 \cdot SR \cdot DRR}{SR + DRR}$  (4)
    The proposed metric has three merits. First, it is [...] showing the correctness and diversity of the results [...] cluster-level normalization system. Second, it only [...] labeling effort, i.e., labeling for each (query, result) [...] the result is correct for the query or not. [...]
    Fragments of the paper's Section 6.3 (Systems and Results) are also visible: Table 5 summarizes the systems. On the entity-level normalization datasets, CD-V1, CD-V2-E, Legacy, and WService are compared. For CD-V2-E, the output contains a confidence score; by varying the threshold on it, a precision-coverage curve is obtained. Where a confidence score is not available, [...] precision and coverage value.
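Equations (1) to (4) can be checked on a small example with the sketch below. The input layout (one `(cluster_id, is_correct)` pair per query) and the function name are assumptions, as is the handling of the degenerate case with fewer than two successful queries, where equation (3) would divide by zero.

```python
import math
from collections import Counter


def cluster_metrics(results):
    """results: non-empty list of (cluster_id, is_correct), one pair per query."""
    successful = [cluster for cluster, ok in results if ok]
    n_q, n_s = len(results), len(successful)
    sr = n_s / n_q                                    # equation (1)

    counts = Counter(successful)                      # |Q_{S|r}| per cluster r
    entropy = -sum((c / n_s) * math.log(c / n_s) for c in counts.values())  # (2)
    true_diversity = math.exp(entropy)                # TD = exp(H)

    # Assumption: with fewer than two successful queries equation (3) is
    # undefined, so this sketch reports a DRR of 0.0 in that case.
    drr = 1 - (true_diversity - 1) / (n_s - 1) if n_s > 1 else 0.0  # (3)
    f_score = 2 * sr * drr / (sr + drr) if sr + drr else 0.0        # (4)
    return sr, drr, f_score


# Four correctly answered queries that fall into two clusters of two.
sr, drr, f_score = cluster_metrics(
    [("walmart", True), ("walmart", True), ("target", True), ("target", True)]
)
print(round(sr, 3), round(drr, 3), round(f_score, 3))
```

In this example H = ln 2, so TD = 2 effective clusters out of 4 successful queries, giving DRR = 1 - 1/3 and an F-score of 0.8. Mapping all four queries to one cluster would give DRR = 1, and giving each its own cluster would give DRR = 0.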
  • 27. Results on Cluster-Level Normalization Datasets
    Figure 5: Results on JOBFEED (entity-level normalization).

    Table 7: Results on cluster-level normalization datasets.

    (a) Resume dataset.
    System       SuccessRate   DiversityReductionRatio   F-score
    CD-V2-C      0.963         0.704                     0.814
    CD-V1.5-C    0.897         0.688                     0.779
    CD-V2-E      0.958         0.416                     0.580

    (b) Job dataset.
    System       SuccessRate   DiversityReductionRatio   F-score
    CD-V2-C      0.904         0.979                     0.940
    CD-V1.5-C    0.778         0.981                     0.868
    CD-V2-E      0.905         0.926                     0.915

    […] On the other hand, CD-V2-C has a much higher diversity reduction […]
  • 28. The employer name normalization task discussed in […] can be viewed as a general entity linking problem, yet it differs from the traditional entity linking task in three aspects [20]: (1) different data sources; (2) different contexts; (3) different KBs. […] the employer name normalization task has unique challenges, such as handling the location and the url context associated with employer names in jobs and resumes, as well as handling […] and duplicate entities in the KB. The system proposed in this paper adapts the three-module framework used in the entity linking systems. We also propose cluster-level normalization […] duplicate results, which is not considered in entity linking.

    2.2 Domain-Specific Name Normalization
    The system described in this paper extends the system in […] the following contributions: (1) performing query expansion […] on external mapping sources and supporting using u[…] to improve normalization quality; (2) supporting normalization at the cluster level; (3) proposing a new metric for evaluating cluster-level normalization. More details will be described in Section […]. Our work is also related to a set of domain-specific name normalization applications. For example, within the same r[…]

    From entity-level normalization to cluster-level normalization:
    ✓ Correctness remained
    ✓ Diversity reduced
    [Screenshot: Candidate Search Results; Facets]
  • 29. Conclusion and Future Work
    ✓ Presented CompanyDepot: supporting employer name normalization at both the entity and the cluster level
    ✓ Proposed new metrics for cluster-level normalization
    ☐ Improve clustering, e.g., merge and split
    ☐ Develop more features for entity quality and query segmentation
    ☐ Improve the quality and coverage of the employer KB
  • 32. Calibrating Employer Names
    1. Convert the name to lowercase, and replace ’s with s;
    2. Convert all the non-alphanumeric characters to space;
    3. Remove stop-phrases (e.g., “pvt ltd” and “l l c”) and stop-words (e.g., “inc”, “corporation”, “incorporated”, and “the”);
    4. Expand commonly used abbreviations, e.g., “ctr” → “center”, “svc” → “services”;
    5. Remove all spaces in the name.

    Employer name                                  After calibration
    International Business Machines Corporation    internationalbusinessmachines
    Sherman Howard L.L.C.                          shermanhoward
    Oxnard Police Dept                             oxnardpolicedepartment
    Macy’s, Inc.                                   macys
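The five calibration steps above can be sketched as a small Python function. This is a minimal illustration under stated assumptions, not the CompanyDepot implementation: the function name `calibrate` is hypothetical, the stop-phrase, stop-word, and abbreviation lists contain only the examples given on the slide (plus “dept” → “department”, which is inferred from the “Oxnard Police Dept” row), and only ASCII letters and digits are treated as alphanumeric.

```python
import re

# Illustrative lists only: the slide gives these as examples ("e.g."),
# so the real lists are presumably longer.
STOP_PHRASES = ("pvt ltd", "l l c")
STOP_WORDS = {"inc", "corporation", "incorporated", "the"}
# "dept" is an assumption inferred from the slide's example table.
ABBREVIATIONS = {"ctr": "center", "svc": "services", "dept": "department"}


def calibrate(name):
    """Return the calibrated form of an employer name (hypothetical helper)."""
    # Step 1: lowercase, and replace 's (curly or straight apostrophe) with s.
    s = name.lower()
    s = re.sub(r"[’']s\b", "s", s)
    # Step 2: non-alphanumeric characters become spaces (ASCII-only assumption).
    s = re.sub(r"[^a-z0-9]", " ", s)
    # Collapse runs of whitespace so multi-word stop-phrases can match.
    s = " ".join(s.split())
    # Step 3a: remove stop-phrases on word boundaries.
    for phrase in STOP_PHRASES:
        s = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", s)
    # Step 3b and 4: drop stop-words, then expand abbreviations per token.
    tokens = [ABBREVIATIONS.get(t, t) for t in s.split() if t not in STOP_WORDS]
    # Step 5: remove all spaces.
    return "".join(tokens)
```

Applied to the four employer names in the table above, this sketch yields the same calibrated strings, e.g. `calibrate("Sherman Howard L.L.C.")` gives `shermanhoward`. Matching stop-phrases after step 2 is what lets “L.L.C.” be caught as “l l c”.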
  • 33. Related Work
    • Entity Linking with a Knowledge Base
    • Domain-Specific Name Normalization
    • Deduplicating Domain-Specific KBs
    • Clustering Methods and Evaluation Metrics