Record Linking
Robert Berry
Twitter: @no0p_
August, 2015
Objectives
1) Problem definition → Identify
– Informal problem definition
– Formal problem statement
2) A somewhat general solution → Implementation
– Logistic Regression Theory
– Mechanics
3) Practical tips → Successful / Painless Implementation
– Various complications and compromises
What is record linking?
● Database Context
– Two or more records, referencing the same entity,
without a key.
● General Statement (Wikipedia)
– “Record linkage refers to the task of finding records in a data set
that refer to the same entity across different data sources. Record
linkage is necessary when joining data sets based on entities that
may or may not share a common identifier.”
https://en.wikipedia.org/wiki/Record_linkage
Common Experience
Concrete Example: Two Tables
One of the oldest problems facing civilization
● 1946 article titled "Record Linkage" published in the
American Journal of Public Health.
● Census as statecraft
● Censuses in Egypt are said to have been taken during
the early Pharaonic period in 3340 BCE and in 3056
BCE
● US Census research
How Problem Emerges for Census
● Counting People
● Varied Data Sources:
– Interviews
– Mailed Surveys
– Birth Certificates
– Marriage Records
– Death Records
The Many-faced Problem
● Typically Entities in data model
– Companies
– Buildings
– People
– Documents
● Typically arises from different data
sources
– User entries
– Transcription errors
– Divergent data sets (split brain)
– Application Developers
● No natural Key
● Candidate keys include a set of
possible values, including
transcription and spelling errors
● Text
Duplicates Case
Name & Conquer
Label the records in table1 as set A, and the records in table2 as set B.
Define the set of possible matches as A × B, such that each pair (a ∈ A, b ∈ B) is a
possible match or non-match.
The set of true pairs is labeled T, where a = b (both records refer to the same entity),
and the set of false pairs is labeled F, where a ≠ b.
Solve the problem by deciding whether a given pair
belongs to T or to F.
T and F partition A × B.
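The set definitions above can be sketched on toy data. The record ids and the hidden "entity" label are hypothetical, added only so the partition can be checked; in a real problem the entity label is exactly what is unknown.

```python
from itertools import product

# Hypothetical toy records: (record_id, entity). The entity column stands in
# for the ground truth that a real linking problem does not have.
table1 = [("a1", "acme"), ("a2", "globex")]
table2 = [("b1", "acme"), ("b2", "initech"), ("b3", "globex")]

# A x B: every record in A paired with every record in B.
pairs = list(product(table1, table2))

# T holds the pairs that refer to the same entity, F holds all the others.
T = [(a, b) for a, b in pairs if a[1] == b[1]]
F = [(a, b) for a, b in pairs if a[1] != b[1]]

# Every pair lands in exactly one of the two sets.
print(len(pairs), len(T), len(F))
```

With |A| = 2 and |B| = 3 there are 6 candidate pairs, which is why the cross product becomes the bottleneck on large tables.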
Concrete Pairs
Binary Classification
● The problem is now stated as a binary classification
problem which is well understood
● A solution can take the form of a function which returns a
probability that a given pair is a true pair.
P(true pair) = f(a, b)
Need an Automated Solution
P(true pair) = f(a, b) ≠
Simple Solution: Discriminant Metric
● String distance metrics
– Levenshtein (contrib/fuzzystrmatch)
– Tri-gram (contrib/pg_trgm)
– Jaro-Winkler
● Returns value between 0 and 1
● Developed for '95 Census
https://en.wikipedia.org/wiki/Jaro–Winkler_distance
Jaro-Winkler Example
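A minimal pure-Python sketch of the metric, written from the standard textbook definition and not taken from any PostgreSQL extension. It returns a similarity (1.0 means identical strings); the prefix weight p = 0.1 and the four-character prefix cap are the usual defaults, assumed here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the shared prefix (max 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


score = jaro_winkler("MARTHA", "MARHTA")  # about 0.961
```

The prefix boost is what makes the metric forgiving of typos late in a name, which suits person and company names where the first characters are usually right.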
pg_trgm Example
● What is the optimal threshold?
● Only leverages information present in text field
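To make the threshold question concrete, here is a rough Python imitation of trigram similarity as pg_trgm documents it: lowercase the text, pad each word with two leading spaces and one trailing space, and divide the number of shared trigrams by the number of distinct trigrams overall. It is an approximation for illustration, not the extension's code. The default cutoff of 0.3 mirrors pg_trgm's default similarity threshold; `is_match` is a hypothetical helper.

```python
import re


def trigrams(text: str) -> set:
    """Set of padded, lowercased trigrams for every word in the text."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0 to 1)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_match(a: str, b: str, threshold: float = 0.3) -> bool:
    """Hypothetical decision rule: a single cutoff on one text field."""
    return trigram_similarity(a, b) >= threshold


print(trigram_similarity("word", "two words"))  # 4 shared of 11 distinct
```

Whatever cutoff is chosen, the rule sees only one text field, which is the limitation the next slides address with a multi-feature model.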
Black box Logistic Regression
P(true pair) = f(a, b)
P(true pair) = f(x_1, x_2, x_3, …, x_n)
x_1 = Jaro-Winkler distance of name
x_2 = geospatial distance of addresses
x_3 = shared middle initial
Logistic Regression
g(t) = \frac{1}{1 + e^{-t}}
t = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3
● Binary Classification
● t is a linear function of
independent variables
● Returns a probability (number
between 0 and 1)
● Widely Implemented
● Available inside PostgreSQL through the MADlib extension.
● http://madlib.net/
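The two formulas above translate directly into a few lines of Python. The coefficients below are hypothetical values chosen for illustration; in practice they come out of training.

```python
import math


def logistic(t: float) -> float:
    """g(t) = 1 / (1 + e^-t), written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def predict_proba(features, betas) -> float:
    """betas[0] is the intercept; betas[1:] line up with the features."""
    t = betas[0] + sum(b * x for b, x in zip(betas[1:], features))
    return logistic(t)


# Hypothetical coefficients for (name similarity, address distance in km,
# shared middle initial). These are made up, not fitted to any data.
betas = [-4.0, 6.0, -0.5, 1.5]

likely_pair = predict_proba([0.95, 0.1, 1], betas)
unlikely_pair = predict_proba([0.30, 5.0, 0], betas)
print(likely_pair, unlikely_pair)
```

The sign of each coefficient reads naturally: higher name similarity raises the probability, larger address distance lowers it.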
Feature Selection
● Choose a number of independent variables (for pairs)
– x_1, x_2, ...
● Some useful string comparison features
– Jaro-Winkler Distance: 0-1
– Levenshtein Distance: Z
– length(Longest Common Substring): Z
– Shared Middle Initial: 0/1
– Shared n-gram features (Suffix/Prefix): R
– Word Shape: 0/1
– Binned length delta: Z
Pair Features From Relations
● Same Company or Organization: 1/0
● Date of Activity: Z
● Average Topic Distance: R
● Geospatial Distance: Whatever ST_Distance returns
● Same Phone Number: 1/0
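A small sketch of turning one record pair into a feature vector, mixing a string feature with a relational one. The record layout (`name`, `phone` keys) and the choice of three features are assumptions made for illustration.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance: insertions, deletions and substitutions, each cost 1."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def pair_features(a: dict, b: dict) -> list:
    """Hypothetical feature vector [x_1, x_2, x_3] for one candidate pair."""
    name_a, name_b = a["name"].lower(), b["name"].lower()
    return [
        levenshtein(name_a, name_b),            # integer (Z)
        1 if a["phone"] == b["phone"] else 0,   # flag (1/0)
        abs(len(name_a) - len(name_b)),         # integer (Z)
    ]


features = pair_features(
    {"name": "Acme Corp", "phone": "555-0100"},
    {"name": "Acme Corp.", "phone": "555-0100"},
)
```

Each new feature is just another entry in the list, so adding a geospatial or topic distance later does not change the shape of the rest of the pipeline.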
Installing MADlib
– pgxnclient install madlib
– Source:
– https://github.com/madlib/madlib
– Configure; make
– madpack -p postgres -c user@0.0.0.0/db install
MADlib Modules
● Regression Models
– Linear
– Logistic
● Tree Models
● Conditional Random Field
● ARIMA
● Latent Dirichlet Allocation
● K-means clustering
● Descriptive Statistics
● Inference
● SVM
● Cardinality Estimators
● Utility Functions
– Linear algebra operations for arrays as vectors
Create Training and Test Sets
● Create a utility to manually add records
● If available, it may be possible to match pairs based on a foreign key.
– E.g. 20% of company_a and company_b records have an IRS TID you can link on, but
still contain representative typo and alternative-spelling examples.
How To Train Your Model
How To Train Your Model
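A minimal sketch of what training involves, assuming plain batch gradient descent on the log loss. This is a stand-in written for illustration: it is not MADlib's API or optimizer, and the six labeled pairs are made-up toy data with two features (name similarity, same phone number).

```python
import math


def sigmoid(t: float) -> float:
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def predict(betas, x) -> float:
    """Probability that the pair with features x is a true pair."""
    return sigmoid(betas[0] + sum(b * xi for b, xi in zip(betas[1:], x)))


def train(X, y, lr: float = 0.5, epochs: int = 5000):
    """Fit intercept and coefficients by batch gradient descent."""
    betas = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(betas)
        for x, label in zip(X, y):
            err = predict(betas, x) - label   # derivative of the log loss
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        betas = [b - lr * g / n for b, g in zip(betas, grad)]
    return betas


# Hypothetical training pairs: [name similarity, same phone], label 1 = true pair.
X = [[0.95, 1], [0.90, 1], [0.85, 0], [0.40, 0], [0.30, 0], [0.20, 1]]
y = [1, 1, 1, 0, 0, 0]

betas = train(X, y)
probabilities = [predict(betas, x) for x in X]
```

The last toy pair shares a phone number but has a very different name, so the fitted model has to lean on name similarity rather than on the phone flag alone.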
Making Predictions
Prediction Examples
Evaluating the Model
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\mathrm{Precision} = \frac{TP}{TP + FP}
\mathrm{Recall} = \frac{TP}{TP + FN}
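The three metrics can be computed from true and predicted labels as follows. The label lists are hypothetical and unrelated to the results reported in these slides.

```python
def evaluate(y_true, y_pred) -> dict:
    """Accuracy, precision and recall from 0/1 labels and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical held-out pairs: 3 TP, 3 TN, 1 FP, 1 FN.
metrics = evaluate([1, 1, 1, 0, 0, 0, 0, 1],
                   [1, 1, 0, 0, 0, 1, 0, 1])
```

Because true pairs are rare among all candidate pairs, accuracy alone can look excellent while recall is poor, so all three numbers are worth reporting together.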
Results for “banks” with name and address
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = 0.995
\mathrm{Precision} = \frac{TP}{TP + FP} = 0.985
\mathrm{Recall} = \frac{TP}{TP + FN} = 0.992
Results Points
Results Distribution
Generic Company Predictions
Less Good Generic Company Predictions
Results for Generic Company Model with Names Only
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = 0.99
\mathrm{Precision} = \frac{TP}{TP + FP} = 0.914
\mathrm{Recall} = \frac{TP}{TP + FN} = 0.662
Reviewing Model
Storing Link Information
● Does not require application changes
● Preserves Original Information
● Separates Speculation From Source Data
– Easy to update
● “Probabilistic Foreign Key” Sounds Cool
● Reasonably Painless Access Pattern
● Duplication and Linking Case
● Pluggable Solutions
● Manual Review
What about that cross product?
● M × N ≈ N^2 pair comparisons to find the best match
● Compare only against best candidates
– Indexable value that can exclude candidates
● trigram distance
● Geographic restriction
● Kmeans cluster id
● LSH bucket
● Probably OK to have a fairly loose comparison set
– With 100,000,000 records, comparing 100,000,000 × 100 pairs (10^10) is far cheaper than
100,000,000 × 100,000,000 pairs (10^16).
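A sketch of the candidate-restriction idea, often called blocking. The three-character prefix key is a deliberately crude, hypothetical stand-in for the indexable values listed above (trigram distance, geographic restriction, cluster id, LSH bucket).

```python
from collections import defaultdict


def block_key(name: str) -> str:
    """Hypothetical blocking key: first three alphanumeric characters."""
    return "".join(ch for ch in name.lower() if ch.isalnum())[:3]


def candidate_pairs(A, B):
    """Yield only the pairs whose records share a blocking key."""
    index = defaultdict(list)
    for b in B:
        index[block_key(b)].append(b)
    for a in A:
        for b in index.get(block_key(a), []):
            yield (a, b)


A = ["Acme Corp", "Globex Inc", "Initech"]
B = ["ACME Corporation", "Globex", "Initrode", "Umbrella"]

candidates = list(candidate_pairs(A, B))
print(len(candidates), "of", len(A) * len(B), "possible pairs")
```

A loose key like this one lets false candidates through (Initech and Initrode), which is acceptable: the classifier scores them afterwards, and the key only has to avoid excluding true pairs.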
Practical Tip: like-like comparisons
● Consider linking company pairs.
● Results are better when comparing a specific subset of companies
with targeted features.
– Banks
– Blood Banks
● It may be worthwhile to classify records first, because it is often
feasible to distinguish between classes easily.
Practical Tip: Features > 1
● Odds ratios can be difficult to reason about when features
range from 0 to 1.
● Multiplying such features by 10 seems to work
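One reading of this tip, sketched with a hypothetical coefficient: the odds ratio e^β describes a one-unit increase, and for a 0 to 1 feature one unit is the entire range. Rescaling the feature by 10 divides the coefficient by 10, so the odds ratio then describes a 0.1 step in the original feature, while predictions stay the same.

```python
import math

# Hypothetical coefficient for a 0-to-1 similarity feature.
beta = 6.0
odds_ratio_full_range = math.exp(beta)      # odds multiplier for 0 -> 1

# Same model after multiplying the feature by 10.
beta_scaled = beta / 10
odds_ratio_per_step = math.exp(beta_scaled)  # odds multiplier per 0.1 step

# The linear term, and therefore the predicted probability, is unchanged.
x = 0.8
original_term = beta * x
scaled_term = beta_scaled * (10 * x)

print(round(odds_ratio_full_range, 1), round(odds_ratio_per_step, 2))
```

An odds multiplier per 0.1 step of similarity is easier to explain than a multiplier for the jump from "nothing in common" to "identical".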
Practical Tip: Organizing
● Leverage PostgreSQL's data organizing features to
organize your training and test data
● Schemas for isolating model tables and training tables
● Materialized views for partitioning examples into training
and test sets
– TABLESAMPLE
● Pair surrogate keys allow new features to be generated
from the data relatively painlessly.
Practical Tip: Probabilistic Foreign Key
● Manual review of close calls to minimize human effort
● where real_humans_pr < 0.55 order by real_humans_pr
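The same filter and ordering, sketched in Python over hypothetical link rows. The tuple layout (left id, right id, probability) is an assumption; the 0.55 cutoff comes from the bullet above.

```python
# Hypothetical stored links: (left_id, right_id, probability of a true pair).
links = [
    (1, 101, 0.98),
    (2, 102, 0.52),
    (3, 103, 0.07),
    (4, 104, 0.41),
]


def review_queue(links, cutoff: float = 0.55):
    """Links below the cutoff, ordered by probability, for a human to check."""
    return sorted((link for link in links if link[2] < cutoff),
                  key=lambda link: link[2])


queue = review_queue(links)
for left_id, right_id, pr in queue:
    print(left_id, right_id, pr)
```

A variation, not shown on the slide, is to sort by abs(pr - 0.5) so that the closest calls come first and reviewers spend their time where the model is least sure.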
Summary
● Identify the problem as multiple records for the same entity
without a key
● Set of true pairs in A × B
– Binary classification of pairs
● Solve binary classification in PostgreSQL with the MADlib
logistic regression module
– String metric features & features from relations
● Create probabilistic foreign keys
– Manual review to taste
Final Thoughts
● We should strive for correct data with referential integrity
● Be comfortable in a world of imperfect databases
Slides at http://no0p.github.io/record_linking.pdf

Machine Learning for Record Linking Using Logistic Regression
