Entity resolution is the task of disambiguating manifestations of real-world entities through linking and grouping, and is often an essential part of the data wrangling process. There are three primary tasks involved in entity resolution: deduplication, record linkage, and canonicalization, each of which serves to improve data quality by reducing irrelevant or repeated data, joining information from disparate records, and providing a single source of information to perform analytics upon. However, due to data quality issues (misspellings or incorrect data), schema variations across sources, or simply different representations, entity resolution is not a straightforward process, and most ER techniques employ machine learning and other stochastic approaches.
2. Workshop Objectives
● Introduce entity resolution theory and tasks
● Similarity scores and similarity vectors
● Pairwise matching with the Fellegi Sunter algorithm
● Clustering and Blocking for deduplication
● Final notes on entity resolution
4. Entity Resolution refers to techniques that
identify, group, and link digital mentions or
manifestations of some object in the real world.
5. In the Data Science Pipeline, ER is generally a wrangling technique.
[Diagram: the data science pipeline — ingestion and normalization feed a computational data store; wrangling sits between storage and computation (feature analysis, model builds, model selection & monitoring, cross validation); users interact through an API with a feedback loop.]
6. Information Quality
- Creation of high quality data sets
- Reduction in the number of instances in machine learning models
- Reduction in the covariance, and therefore the collinearity, of predictor variables
- Simplification of relationships
7. Graph Analysis Simplification and Connection
[Diagram: a graph linking email addresses (ben@ddl.com, ben@gmail.com, tony@ddl.com, tony@gmail.com, selma@gmail.com, allen@acme.com, rebecca@acme.com) to the resolved entities Ben, Tony, Selma, Allen, and Rebecca.]
8. Machine Learning and ER
- Heterogeneous data: unstructured records
- Larger and more varied datasets
- Multi-domain and multi-relational data
- Varied applications (web and mobile)
Parallel, Probabilistic Methods Required*
* Although this is often debated in various related domains.
10. Deduplication
- Primary consideration in ER
- Cluster records that correspond to the same real-world entity, normalizing a schema
- Reduces the number of records in the dataset
- Variant: compute a cluster representative
11. Record Linkage
- Match records from one deduplicated data
store to another (bipartite)
- K-partite linkage links records in multiple
data stores and their various associations
- Generally proposed in relational data stores,
but more frequently applied to unstructured
records from various sources.
12. Referencing
- Also known as entity disambiguation
- Match noisy records to a clean, deduplicated reference table that is already canonicalized
- Generally used to map multiple records to a single primary key and to attach extra information to the record
13. Canonicalization
- Compute representative
- Generally the “most complete” record
- Imputation of missing attributes via merging
- Attribute selection based on the most likely
candidate for downstream matching
14. Notation
- R: set of records
- M: set of matches
- N: set of non-matches
- E: set of entities
- L: set of links
Compare (Mt, Nt, Et, Lt) ⇔ (Mp, Np, Ep, Lp)
- t = true, p = predicted
15. Key Assumptions
- Every entity refers to a real world object (e.g. there are no “fake” instances)
- References or sources (for record linkage) include no duplicates (integrity constraints)
- If two records are identical, they are true matches: (x, x) ∈ Mt
16. Tools for Entity Resolution
- NLTK: natural language toolkit
- Dedupe*: structured deduplication
- Distance: C-implemented distance metrics
- Scikit-Learn: machine learning models
- Fuzzywuzzy: fuzzy string matching
- PyBloom: probabilistic set matching
18. At the heart of any entity resolution task is
the computation of similarity or distance.
19. For two records, x and y, compute a similarity
vector for each component attribute:
[match_score(attr_x, attr_y) for attr_x, attr_y in zip(x, y)]
Where match_score is a per-attribute function that computes either a boolean (match, not match) or a real-valued distance score.
match_score ∈ [0,1]*
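The comprehension above can be sketched concretely. This is a minimal illustration, using the standard library's `difflib.SequenceMatcher` as a stand-in for whatever per-attribute scorer (edit distance, Jaccard, Fuzzywuzzy) an application would actually use; the records and helper names are hypothetical.

```python
from difflib import SequenceMatcher

def match_score(a, b):
    # Real-valued similarity in [0, 1]; SequenceMatcher.ratio() is a
    # stand-in for any per-attribute scorer (edit distance, Jaccard, etc.).
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def similarity_vector(x, y):
    # One score per aligned attribute pair of the two records.
    return [match_score(attr_x, attr_y) for attr_x, attr_y in zip(x, y)]

x = ("Joseph Halifax", "joe.halifax@gmail.com")
y = ("Joe Halifax", "joe.halifax@gmail.com")
vector = similarity_vector(x, y)
```

The identical email attribute scores 1.0, while the name variants score somewhere strictly between 0 and 1.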
24. Pairwise Matching:
Given a vector of attribute match scores for a pair of records (x, y), compute Pmatch(x, y).
25. Weighted Sum + Threshold
Pmatch = sum(weight * score for weight, score in zip(weights, vector))
- weights should sum to one
- determine a weight for each attribute match score
- higher weights for more predictive features
- e.g. email is more predictive than username
- attribute value also contributes to predictability
- If the weighted score > threshold, then match.
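As a minimal sketch of the weighted sum plus threshold rule, assuming a two-attribute similarity vector (name, email) and illustrative weights:

```python
def weighted_match(vector, weights, threshold=0.8):
    # Weighted sum of attribute match scores; weights must sum to one.
    assert abs(sum(weights) - 1.0) < 1e-9
    p_match = sum(w * s for w, s in zip(weights, vector))
    return p_match, p_match > threshold

# Email is weighted as more predictive than name (illustrative weights).
score, is_match = weighted_match([0.7, 1.0], weights=[0.3, 0.7])
```

Here 0.3 * 0.7 + 0.7 * 1.0 = 0.91, which exceeds the 0.8 threshold, so the pair is declared a match.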
26. Rule Based Approach
- Formulate rules about the construction of a match for attribute collections.
if score_name > 0.75 && score_price > 0.6
- Although formulating rules is hard, domain specific rules can be applied, making this a typical approach for many applications.
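The rule above can be expressed directly; the `rule_match` helper and the score dictionary keys are hypothetical names for this sketch:

```python
def rule_match(scores):
    # Domain-specific rule: both name and price must be similar enough.
    return scores["name"] > 0.75 and scores["price"] > 0.6

m1 = rule_match({"name": 0.9, "price": 0.7})  # both thresholds cleared
m2 = rule_match({"name": 0.9, "price": 0.5})  # price too dissimilar
```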
27. Modern record linkage theory was formalized in 1969
by Ivan Fellegi and Alan Sunter who proved that the
probabilistic decision rule they described was optimal
when the comparison attributes were conditionally
independent.
Their pioneering work “A Theory For Record Linkage”
remains the mathematical foundation for many
record linkage applications even today.
Fellegi, Ivan P., and Alan B. Sunter. "A theory for record linkage." Journal
of the American Statistical Association 64.328 (1969): 1183-1210.
28. Record Linkage Model
For two record sets, A and B, and a record pair (a, b) ∈ A × B:
γ(a, b) is the similarity vector, where γ is some match score function for the record set.
M is the match set and U the non-match set.
29. Record Linkage Model
Probabilistic linkage based on the ratio R(r) = m(γ) / u(γ).
Linkage Rule: L(t_l, t_u) with lower and upper thresholds t_l and t_u:
- R(r) > t_u → Match
- t_l ≤ R(r) ≤ t_u → Uncertain
- R(r) < t_l → Non-Match
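A minimal sketch of the three-way decision rule, with illustrative threshold values (real thresholds come from the error-bound optimization described below):

```python
def linkage_decision(r, t_lower, t_upper):
    # Fellegi-Sunter three-way decision on the likelihood ratio R(r).
    if r > t_upper:
        return "match"
    if r < t_lower:
        return "non-match"
    return "uncertain"   # defer to clerical review

d1 = linkage_decision(12.0, t_lower=0.5, t_upper=10.0)
d2 = linkage_decision(2.0, t_lower=0.5, t_upper=10.0)
d3 = linkage_decision(0.1, t_lower=0.5, t_upper=10.0)
```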
30. Linkage Rule Error
- Type I Error: a non-match is called a match.
- Type II Error: a match is called a non-match.
31. Optimizing a Linkage Rule
L*(t*_l, t*_u) is optimal in Γ (the similarity vector space) with error bounds μ and λ if:
- L* bounds the type I and type II errors at μ and λ
- L* has the least conditional probability of not making a decision, e.g. it minimizes the uncertainty range in R(r).
32. L* Discovery
Given N records in Γ (e.g. N similarity vectors):
Sort the records decreasing by R(r) = m(γ) / u(γ).
Select n and n′ such that the ordered vectors partition into:
- γ_1, …, γ_n → Match
- γ_{n+1}, …, γ_{n′-1} → Uncertain
- γ_{n′}, …, γ_N → Non-Match
33. Practical Application of FS
γ is high dimensional: m(γ) and u(γ) computations are inefficient.
Typically a naive Bayes assumption is made about the conditional independence of the features in γ given a match or a non-match.
Computing P(γ | r ∈ M) requires knowledge of matches.
- Supervised machine learning with a training set.
- Expectation Maximization (EM) to train parameters.
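A minimal sketch of the naive Bayes factorization of R(γ), assuming a boolean agreement vector and per-field parameters m_i = P(agree on field i | match) and u_i = P(agree on field i | non-match). The parameter values below are illustrative; in practice they come from labeled data or EM:

```python
def likelihood_ratio(gamma, m, u):
    # R(gamma) under conditional independence: the ratio factors into a
    # product of per-field agreement (m_i / u_i) or disagreement
    # ((1 - m_i) / (1 - u_i)) terms.
    ratio = 1.0
    for agree, m_i, u_i in zip(gamma, m, u):
        ratio *= (m_i / u_i) if agree else ((1 - m_i) / (1 - u_i))
    return ratio

m = [0.95, 0.90]   # agreement probabilities among true matches (illustrative)
u = [0.10, 0.05]   # agreement probabilities among non-matches (illustrative)
r = likelihood_ratio([True, True], m, u)   # agreement on both fields
```

Agreement on both fields yields R = (0.95/0.10) * (0.90/0.05) = 171, well into the match region; disagreement on both pushes R below 1.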
34. Machine Learning Parameters
Supervised Methods
- Decision Trees
- Cochinwala, Munir, et al. "Efficient data reconciliation." Information Sciences 137.1 (2001): 1-15.
- Support Vector Machines
- Bilenko, Mikhail, and Raymond J. Mooney. "Adaptive duplicate detection using learnable string similarity
measures." Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data
mining. ACM, 2003.
- Christen, Peter. "Automatic record linkage using seeded nearest neighbour and support vector machine
classification." Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data
mining. ACM, 2008.
- Ensembles of Classifiers
- Chen, Zhaoqi, Dmitri V. Kalashnikov, and Sharad Mehrotra. "Exploiting context analysis for combining multiple
entity resolution systems." Proceedings of the 2009 ACM SIGMOD International Conference on Management of
data. ACM, 2009.
- Conditional Random Fields
- Gupta, Rahul, and Sunita Sarawagi. "Answering table augmentation queries from unstructured lists on the web."
Proceedings of the VLDB Endowment 2.1 (2009): 289-300.
35. Machine Learning Parameters
Unsupervised Methods
- Expectation Maximization
- Winkler, William E. "Overview of record linkage and current research directions." Bureau of the Census. 2006.
- Herzog, Thomas N., Fritz J. Scheuren, and William E. Winkler. Data quality and record linkage techniques. Springer
Science & Business Media, 2007.
- Hierarchical Clustering
- Ravikumar, Pradeep, and William W. Cohen. "A hierarchical graphical model for record linkage." Proceedings of the
20th conference on Uncertainty in artificial intelligence. AUAI Press, 2004.
Active Learning Methods
- Committee of Classifiers
- Sarawagi, Sunita, and Anuradha Bhamidipaty. "Interactive deduplication using active learning." Proceedings of the
eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2002.
- Tejada, Sheila, Craig A. Knoblock, and Steven Minton. "Learning object identification rules for information
integration." Information Systems 26.8 (2001): 607-633.
36. Implementing Papers
Luckily, all of these models are in Scikit-Learn.
Considerations:
- Building training sets is hard:
  - Most records are easy non-matches
  - Record pairs can be ambiguous
- Class imbalance: more negatives than positives
Machine Learning & Fellegi Sunter is the state of the art.
38. To obtain a supervised training set, start by
using clustering and then add active learning
techniques to propose items to knowledge
engineers for labeling.
39. Advantages to Clusters
- Resolution decisions are not made simply on
pairwise comparisons, but search a larger space.
- Can use a variety of algorithms such that:
- Number of clusters is not known in advance
- There are numerous small, singleton clusters
- Input is a pairwise similarity graph
40. Requirement: Blocking
- Naive approach is |R|² comparisons.
- Consider 100,000 products from 10 online stores: 1,000,000 records means 1,000,000,000,000 comparisons.
- At 1 μs per comparison ≈ 11.6 days
- Most are not going to be matches
- Can we block on product category?
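A minimal sketch of blocking on a key, assuming dictionary records with a category attribute (the record fields below are hypothetical). Pairwise comparison then happens only inside each block:

```python
from collections import defaultdict
from itertools import combinations

def block(records, key):
    # Group records by a blocking key so that pairwise comparisons
    # are only generated within each block, not across the full set.
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    return blocks

products = [
    {"name": "iPhone 12", "category": "phones"},
    {"name": "iPhone 12 64GB", "category": "phones"},
    {"name": "AA Batteries", "category": "batteries"},
]
blocks = block(products, key=lambda r: r["category"])
pairs = [p for b in blocks.values() for p in combinations(b, 2)]
```

Three records naively yield 3 candidate pairs; blocking on category reduces that to 1.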
41. Canopy Clustering
- Often used as a pre-clustering optimization for approaches that must do pairwise comparisons, e.g. K-Means or Hierarchical Clustering
- Can be run in parallel, and is often used in Big Data systems (implementations exist in MapReduce on Hadoop)
- Uses a distance metric on similarity vectors for computation.
42. Canopy Clustering
The algorithm begins with two thresholds T1 and T2, the loose and tight distances respectively, where T1 > T2.
1. Remove a point from the set and start a new “canopy”.
2. For each point in the set, assign it to the new canopy if the distance is less than the loose distance T1.
3. If the distance is less than T2, remove it from the original set completely.
4. Repeat until there are no more data points to cluster.
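The four steps above can be sketched as follows. This is a minimal one-dimensional illustration with a plug-in distance function; a real ER pipeline would run it over similarity vectors with a vector metric:

```python
def canopy_cluster(points, distance, t_loose, t_tight):
    # Canopy clustering: t_loose (T1) > t_tight (T2).
    assert t_loose > t_tight
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)          # 1. start a new canopy
        canopy = [center]
        keep = []
        for p in remaining:
            d = distance(center, p)
            if d < t_loose:                # 2. within loose distance: assign
                canopy.append(p)
            if d >= t_tight:               # 3. within tight distance: remove
                keep.append(p)             #    from the original set for good
        remaining = keep
        canopies.append(canopy)            # 4. repeat until the set is empty
    return canopies

dist = lambda a, b: abs(a - b)
clusters = canopy_cluster([1.0, 1.1, 1.2, 5.0, 5.1], dist,
                          t_loose=1.0, t_tight=0.3)
```

Note that a point within the loose distance of several centers, but never within the tight distance, can legitimately land in more than one canopy.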
50. Canopy Clustering
By setting threshold values relatively permissively, canopies will capture more data.
In practice, most canopies will contain only a single
point, and can be ignored.
Pairwise comparisons are made between the
similarity vectors inside of each canopy.
52. Data Preparation
Good Data Preparation can go a long way in
getting good results, and is most of the work.
- Data Normalization
- Schema Normalization
- Imputation
53. Data Normalization
- convert to all lower case, remove whitespace
- run spell checker to remove known
typographical errors
- expand abbreviations, replace nicknames
- perform lookups in lexicons
- tokenize, stem, or lemmatize words
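A minimal sketch of the first few normalization steps, assuming a tiny hypothetical nickname lexicon; a real pipeline would also add spell checking, abbreviation expansion, and stemming via NLTK:

```python
import re

# Illustrative lexicon; real nickname tables are much larger.
NICKNAMES = {"joe": "joseph", "ben": "benjamin"}

def normalize(text):
    # Lowercase, collapse whitespace, expand known nicknames.
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)
    tokens = [NICKNAMES.get(tok, tok) for tok in text.split(" ")]
    return " ".join(tokens)

name = normalize("  Joe   Halifax ")
```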
54. Schema Normalization
- match attribute names (title → name)
- compound attributes (full name → first, last)
- nested attributes, particularly boolean attributes
- deal with set and list valued attributes
- segment records from raw text
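A minimal sketch of splitting one compound attribute (full name → first, last), with a hypothetical record shape:

```python
def split_full_name(record):
    # Map a compound 'full_name' attribute onto 'first' and 'last'.
    first, _, last = record.pop("full_name").partition(" ")
    record["first"], record["last"] = first, last
    return record

r = split_full_name({"full_name": "Joseph Halifax",
                     "email": "joe@example.com"})
```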
55. Imputation
- How do you deal with missing values?
- Set all to nan or None, remove empty string.
- How do you compare missing values? Omit
from similarity vector?
- Fill in missing values with aggregate (mean) or
with some default value.
56. Canonicalization
Merge information from duplicates into a representative entity that contains maximal information; consider downstream resolution.
Name, Email, Phone, Address
- Joe Halifax, joe.halifax@gmail.com, null, New York, NY
- Joseph Halifax Jr., null, (212) 123-4444, 130 5th Ave Apt 12, New York, NY
→ Joseph Halifax, joe.halifax@gmail.com, (212) 123-4444, 130 5th Ave Apt 12, New York, NY
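A minimal merging sketch, using the longest non-null value as a crude proxy for “most complete” (real canonicalization would pick attributes by likelihood, as noted above); the records are hypothetical:

```python
def canonicalize(records):
    # Merge duplicates into one representative entity, preferring the
    # longest non-null value per attribute as a crude completeness proxy.
    merged = {}
    keys = {k for r in records for k in r}
    for key in keys:
        values = [r.get(key) for r in records if r.get(key)]
        merged[key] = max(values, key=len) if values else None
    return merged

dupes = [
    {"name": "Joe Halifax", "email": "joe.halifax@gmail.com", "phone": None},
    {"name": "Joseph Halifax Jr.", "email": None, "phone": "(212) 123-4444"},
]
entity = canonicalize(dupes)
```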
57. Evaluation
- # of predicted matching pairs, cluster level metrics
- Precision/Recall → F1 score

              Predicted Match   Predicted Miss
Actual Match  True Match        False Miss       |A|
Actual Miss   False Match       True Miss        |B|
              |P(A)|            |P(B)|           total
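The precision/recall → F1 computation from the confusion counts above can be sketched as:

```python
def prf1(true_match, false_match, false_miss):
    # Precision: fraction of predicted matches that are real.
    precision = true_match / (true_match + false_match)
    # Recall: fraction of actual matches that were found.
    recall = true_match / (true_match + false_miss)
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 80 true matches, 20 false matches, 20 false misses.
p, r, f1 = prf1(true_match=80, false_match=20, false_miss=20)
```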