SlideShare ist ein Scribd-Unternehmen logo
1 von 46
An Open-source Similar-name Finder Dallan Quass  [email_address]
What's the problem?
People can't spell unusual names Maybe a piece of mail addressed to Solverg Quast? Solverg Quast 5934 Phoenix Ave. Shoreview, MN 55126 Johnston Bros. 1256 Bristol St. Mapleton, MN 55126 Should be:  Solveig Quass
People use nicknames John Johnny Jack
Transcribers make typos Jhon
Most of our ancestors didn't know how to read or write  signature
What does it matter?
How do you find records? Johnny Snith John Smith
How do you match people? John Smith Johnny Smithe
Not a new problem
Lots of solutions Soundex Nysiis Double Metaphone Refined Soundex Daitch-Mokotoff Caverphone Levenstein Jaro Winkler Monge Elkan Needleman Wunch Smith Waterman
No Bullseye
Why is this so hard?
How similar are two names? We’re neighbors John Jonny Joe I don’t know those guys
First approach: Coders Soundex Nysiis Double Metaphone Refined Soundex Daitch-Mokotoff Caverphone ,[object Object],[object Object],[object Object],[object Object]
First approach: Coders Jim John Jane Johan Johannes
Second approach: Distance functions Levenstein Jaro Winkler Monge Elkan Needleman Wunch Smith Waterman ,[object Object],[object Object],[object Object],[object Object]
Second approach: Distance functions Jim John Jane Johan Johannes Better results, but Doesn't scale well
Can we do better?
Warning: Machine learning ahead!
Thank you Ancestry! ,[object Object],[object Object],[object Object]
A closer look at Levenstein Jon John Bohn -1 -1
Maximize your expectations ,[object Object],[object Object],[object Object],[object Object],Jon John Bohn high cost low cost Weighted  Edit Distance
Learn to classify ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Wait, i sn't this just another distance function? Distance functions don't scale, right?
Right
Back to the basics x  f(x) -5  -1 -3  4.5 0  7 2  3.5 4  2
Long tail
Long tail 200,000 Surnames  70,000 Given names ≀   1/5,000,000 names
Long tail Use distance function with table here Use coder here
Result: Table initialized by a function Dallan:  Dalana Daleen Dalen Dalin 
 Talan Tallon Ryan:  Aaran Aran Arrin 
 Rian Riana ...
A nice thing about tables... Dallan:  Dalana Daleen Dalen Dalin 
 Talan Tallon Ryan:  Aaran Aran Arrin 
 Rian Riana ...
Add to the table Nicknames BehindTheName.com The New American Dictionary of Baby Names by Leslie Dunking and William Gosling A Dictionary of Surnames by Patrick Hanks and Flavia Hodges WeRelate community
Thank you BehindTheName.com! Fascinating  Family Trees for given names
Result Soundex Our approach Precision  Recall 28% decrease in false negatives Given names Soundex Our approach Precision  Recall 28% decrease in false negatives Surnames 97 65 97 74 89 68 89 77
Who is using it?
WeRelate.org
Continuous improvement
Continuous improvement
Community oversight
How do I use it? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Roadmap ,[object Object],[object Object],[object Object],[object Object]
Future work
Future work ,[object Object],[object Object],[object Object],[object Object],[object Object],Remove “chaff” variants from common names
Conclusion Images appearing on these slides are copyrighted by the contributors to  http://commons.wikimedia.org and are used under license ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
 

Weitere Àhnliche Inhalte

Ähnlich wie An Open-source Similar-name Finder

The Evolution of Speech Segmentation: A Computer Simulation
The Evolution of Speech Segmentation: A Computer SimulationThe Evolution of Speech Segmentation: A Computer Simulation
The Evolution of Speech Segmentation: A Computer SimulationRichard Littauer
 
Visualizing Words and Topics with Scattertext
Visualizing Words and Topics with ScattertextVisualizing Words and Topics with Scattertext
Visualizing Words and Topics with ScattertextJason Kessler
 
Redevelop 2019 - Debugging our biases and intuition in software development
Redevelop 2019 - Debugging our biases and intuition in software developmentRedevelop 2019 - Debugging our biases and intuition in software development
Redevelop 2019 - Debugging our biases and intuition in software developmentDave Hulbert
 
Aac and mowat wilson syndrome 0617
Aac and mowat wilson syndrome 0617Aac and mowat wilson syndrome 0617
Aac and mowat wilson syndrome 0617Kim Singleton
 
Subword tokenizers
Subword tokenizersSubword tokenizers
Subword tokenizersHa Loc Do
 
The system sound and listening
The system sound and listeningThe system sound and listening
The system sound and listeningOscar Ramirez Lozano
 
Articulation Chapter From Previous Book
Articulation Chapter From Previous BookArticulation Chapter From Previous Book
Articulation Chapter From Previous Bookguest2dd347
 
Literacy 2.0
Literacy 2.0Literacy 2.0
Literacy 2.0nmangum
 
Language
LanguageLanguage
LanguageWu Heping
 

Ähnlich wie An Open-source Similar-name Finder (10)

The Evolution of Speech Segmentation: A Computer Simulation
The Evolution of Speech Segmentation: A Computer SimulationThe Evolution of Speech Segmentation: A Computer Simulation
The Evolution of Speech Segmentation: A Computer Simulation
 
Visualizing Words and Topics with Scattertext
Visualizing Words and Topics with ScattertextVisualizing Words and Topics with Scattertext
Visualizing Words and Topics with Scattertext
 
Redevelop 2019 - Debugging our biases and intuition in software development
Redevelop 2019 - Debugging our biases and intuition in software developmentRedevelop 2019 - Debugging our biases and intuition in software development
Redevelop 2019 - Debugging our biases and intuition in software development
 
Aac and mowat wilson syndrome 0617
Aac and mowat wilson syndrome 0617Aac and mowat wilson syndrome 0617
Aac and mowat wilson syndrome 0617
 
Subword tokenizers
Subword tokenizersSubword tokenizers
Subword tokenizers
 
The system sound and listening
The system sound and listeningThe system sound and listening
The system sound and listening
 
Articulation Chapter From Previous Book
Articulation Chapter From Previous BookArticulation Chapter From Previous Book
Articulation Chapter From Previous Book
 
Class14
Class14Class14
Class14
 
Literacy 2.0
Literacy 2.0Literacy 2.0
Literacy 2.0
 
Language
LanguageLanguage
Language
 

Mehr von Dallan Quass

FamilySearch Javascript SDK
FamilySearch Javascript SDKFamilySearch Javascript SDK
FamilySearch Javascript SDKDallan Quass
 
FamilySearch Reference Client
FamilySearch Reference ClientFamilySearch Reference Client
FamilySearch Reference ClientDallan Quass
 
Using WeRelate.org (2009)
Using WeRelate.org (2009)Using WeRelate.org (2009)
Using WeRelate.org (2009)Dallan Quass
 
WeRelate.org flyer (2010)
WeRelate.org flyer (2010)WeRelate.org flyer (2010)
WeRelate.org flyer (2010)Dallan Quass
 
Why share your genealogy content on WeRelate.org (2009)
Why share your genealogy content on WeRelate.org (2009)Why share your genealogy content on WeRelate.org (2009)
Why share your genealogy content on WeRelate.org (2009)Dallan Quass
 
An Open-source Place-finder for Genealogy
An Open-source Place-finder for GenealogyAn Open-source Place-finder for Genealogy
An Open-source Place-finder for GenealogyDallan Quass
 
A Robust Open-source GEDCOM Parser
A Robust Open-source GEDCOM ParserA Robust Open-source GEDCOM Parser
A Robust Open-source GEDCOM ParserDallan Quass
 

Mehr von Dallan Quass (7)

FamilySearch Javascript SDK
FamilySearch Javascript SDKFamilySearch Javascript SDK
FamilySearch Javascript SDK
 
FamilySearch Reference Client
FamilySearch Reference ClientFamilySearch Reference Client
FamilySearch Reference Client
 
Using WeRelate.org (2009)
Using WeRelate.org (2009)Using WeRelate.org (2009)
Using WeRelate.org (2009)
 
WeRelate.org flyer (2010)
WeRelate.org flyer (2010)WeRelate.org flyer (2010)
WeRelate.org flyer (2010)
 
Why share your genealogy content on WeRelate.org (2009)
Why share your genealogy content on WeRelate.org (2009)Why share your genealogy content on WeRelate.org (2009)
Why share your genealogy content on WeRelate.org (2009)
 
An Open-source Place-finder for Genealogy
An Open-source Place-finder for GenealogyAn Open-source Place-finder for Genealogy
An Open-source Place-finder for Genealogy
 
A Robust Open-source GEDCOM Parser
A Robust Open-source GEDCOM ParserA Robust Open-source GEDCOM Parser
A Robust Open-source GEDCOM Parser
 

KĂŒrzlich hochgeladen

Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Mcleodganj Call Girls đŸ„° 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls đŸ„° 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls đŸ„° 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls đŸ„° 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 

KĂŒrzlich hochgeladen (20)

Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Mcleodganj Call Girls đŸ„° 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls đŸ„° 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls đŸ„° 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls đŸ„° 8617370543 Service Offer VIP Hot Model
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 

An Open-source Similar-name Finder