Approximate string comparators

Lars Marius Garshol

A quick overview of some common approximate string comparators used in record linkage.

TvungenOne, 2012-06-15
Lars Marius Garshol, <larsga@bouvet.no>
http://twitter.com/larsga

Approximate string comparators?

• Basically, measures of the similarity between
two strings
• Useful in situations where exact match is
insufficient
– record linkage
– search
– ...
• Many of these are slow: O(n²) in the length of the strings

Levenshtein

• Also known as edit distance
• Measures the number of edit operations
necessary to turn s1 into s2
• Edit operations are
– insert a character
– remove a character
– substitute a character

Levenshtein example

• Levenshtein -> Löwenstein
– Levenstein (remove ‘h’)
– Lövenstein (substitute ‘ö’ for ‘e’)
– Löwenstein (substitute ‘w’ for ‘v’)
• Edit distance = 3
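
The edit distance above can be computed with the usual dynamic-programming table. The following is a minimal Python sketch, not taken from the slides; the function name levenshtein is an illustrative choice, and only two rows of the table are kept to save memory.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Count the inserts, removals and substitutions needed to turn s1 into s2."""
    # previous[j] is the distance between the processed prefix of s1
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning i characters into the empty string costs i removals
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # remove c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Levenshtein", "Löwenstein"))  # 3
```

The two nested loops visit every pair of positions, which is where the quadratic running time comes from.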

Weighted Levenshtein

• Not all edit operations are equal
• Substituting “i” for “e” is a smaller edit than
substituting “o” for “k”
• Weighted Levenshtein assigns each edit
operation a cost between 0.0 and 1.0
• Difficult to implement
– weights are also language-dependent
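
One way to picture the weighting is to pass the substitution cost in as a function. This is only a sketch under stated assumptions: the names weighted_levenshtein, SIMILAR and example_cost are hypothetical, inserts and removals are kept at cost 1.0, and the weights in SIMILAR are invented for illustration rather than taken from any real language model.

```python
def weighted_levenshtein(s1, s2, substitution_cost):
    """Edit distance where substituting a for b costs substitution_cost(a, b)."""
    previous = [float(j) for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        current = [float(i)]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1.0,                            # remove c1
                current[j - 1] + 1.0,                         # insert c2
                previous[j - 1] + substitution_cost(c1, c2),  # substitute
            ))
        previous = current
    return previous[-1]


# Hypothetical weights: pairs of letters that are assumed to be easily confused.
SIMILAR = {frozenset("ie"): 0.3, frozenset("vw"): 0.2}


def example_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return SIMILAR.get(frozenset((a, b)), 1.0)  # unrelated letters cost a full edit


print(weighted_levenshtein("Axil", "Axel", example_cost))  # 0.3
print(weighted_levenshtein("Axol", "Axkl", example_cost))  # 1.0
```

The hard part in practice is choosing the table of weights, which is what makes the approach language-dependent.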

Jaro-Winkler

• Developed at the US Bureau of the Census
• For name comparisons
– not well suited to long strings
– best if given name/surname are separated
• Exists in a few variants
– originally proposed by Jaro
– then modified by Winkler
– a few different versions of the modifications exist

Jaro-Winkler definition

• Formula:
– m = number of matching characters
– t = number of transposed characters
• A character from string s1 matches s2 if the
same character is found in s2 less than half the
length of the string away
• Levenshtein ~ Löwenstein = 0.8
• Axel ~ Aksel = 0.783
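
A minimal Python sketch of one common formulation follows; it is an assumption, not necessarily the variant behind the slides. Here the match window is half the longer string's length minus one, t counts half of the matched characters that appear in a different order, the Jaro score is (m/|s1| + m/|s2| + (m − t)/m) / 3, and Winkler's adjustment adds a bonus for a shared prefix of up to four characters with scaling factor p = 0.1. Under these assumptions the plain Jaro score for Axel ~ Aksel comes out at about 0.783; other variants give somewhat different numbers.

```python
def jaro(s1: str, s2: str) -> float:
    if not s1 or not s2:
        return 1.0 if s1 == s2 else 0.0
    # Characters match only if equal and at most `window` positions apart.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t = half the number of matched characters that are out of order.
    a = [c for c, hit in zip(s1, matched1) if hit]
    b = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(x != y for x, y in zip(a, b)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    sim = jaro(s1, s2)
    prefix = 0
    for x, y in zip(s1[:max_prefix], s2[:max_prefix]):
        if x != y:
            break
        prefix += 1
    # Boost the score in proportion to the length of the shared prefix.
    return sim + prefix * p * (1 - sim)


print(round(jaro("Axel", "Aksel"), 3))             # 0.783
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```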

Soundex

• A coarse scheme for matching names by sound
– produces a key from the name
– names match if key is the same
• In common use in many places
– Nav’s person register uses it for search
– built-in in many databases
– ...

Examples

• soundex(“Axel”) = ‘A240’
• soundex(“Aksel”) = ‘A240’
• soundex(“Levenshtein”) = ‘L152’
• soundex(“Löwenstein”) = ‘L523’
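
A minimal Python sketch of how such keys can be produced, assuming the commonly described American Soundex rules: keep the first letter, map the remaining consonants to digits, collapse adjacent letters with the same digit, let H and W be transparent, and pad or truncate to four characters. Stripping diacritics before encoding is an extra assumption of this sketch; real implementations differ in how they treat non-ASCII letters.

```python
import unicodedata

# Digit assigned to each consonant group; vowels, Y, H and W have no digit.
CODES = {}
for digit, group in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in group:
        CODES[letter] = str(digit)


def soundex(name: str) -> str:
    # Assumption: decompose accented letters and keep only A-Z (ö becomes o).
    decomposed = unicodedata.normalize("NFKD", name.upper())
    letters = [c for c in decomposed if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same digit
        code = CODES.get(c, "")
        if code and code != previous:
            key += code
        previous = code  # a vowel resets this, so a repeated digit counts again
    return (key + "000")[:4]


print(soundex("Axel"), soundex("Aksel"))  # A240 A240
```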

Metaphone

• Developed by Lawrence Philips
• Similar to Soundex, but much more complex
– both more accurate and more sensitive
• Developed further into Double Metaphone
• Metaphone 3.0 also exists, but is only available
commercially

Metaphone examples

• metaphone(“Axel”) = ‘AKSL’
• metaphone(“Aksel”) = ‘AKSL’
• metaphone(“Levenshtein”) = ‘LFNX’
• metaphone(“Löwenstein”) = ‘LWNS’

Dice coefficient

• A similarity measure for sets
– set can be tokens in a string
– or characters in a string
• Formula: Dice(A, B) = 2·|A ∩ B| / (|A| + |B|) for sets A and B
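
A minimal Python sketch of the Dice coefficient, applied once to token sets and once to character sets. The function name dice and the convention that two empty sets score 1.0 are assumptions of this sketch.

```python
def dice(a: set, b: set) -> float:
    """Twice the size of the overlap divided by the combined size of both sets."""
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return 2 * len(a & b) / (len(a) + len(b))


# Sets of tokens: two of the tokens are shared.
tokens1 = set("lars marius garshol".split())
tokens2 = set("garshol lars".split())
print(dice(tokens1, tokens2))  # 2*2 / (3+2) = 0.8

# Sets of characters: a, e and l are shared.
print(round(dice(set("axel"), set("aksel")), 3))  # 2*3 / (4+5) = 0.667
```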

TFIDF

• Compares strings as sets of tokens
– a la Dice coefficient
• However, takes frequency of tokens in corpus
into account
– this matches how we evaluate matches mentally
• Has done well in evaluations
– however, can be difficult to evaluate
– results will change as corpus changes
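
The effect of corpus frequency can be sketched in a few lines of Python. Everything here is an assumption for illustration: the helper names build_idf and tfidf_similarity, whitespace tokenization, the smoothed weight log(1 + N/df), cosine similarity as the comparison, and the toy corpus of company names.

```python
import math
from collections import Counter


def build_idf(corpus):
    """Weight each token by how rare it is across the corpus."""
    n = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc.lower().split()))
    return {tok: math.log(1 + n / count) for tok, count in df.items()}


def tfidf_similarity(s1, s2, idf):
    """Cosine similarity between the TF-IDF vectors of two strings."""
    default = max(idf.values(), default=1.0)  # unseen tokens count as rare

    def vector(s):
        tf = Counter(s.lower().split())
        return {tok: count * idf.get(tok, default) for tok, count in tf.items()}

    v1, v2 = vector(s1), vector(s2)
    dot = sum(w * v2.get(tok, 0.0) for tok, w in v1.items())
    norm = math.sqrt(sum(w * w for w in v1.values())) * \
        math.sqrt(sum(w * w for w in v2.values()))
    return dot / norm if norm else 0.0


corpus = ["Bouvet AS", "Schibsted AS", "Hafslund AS", "Bouvet Norge"]
idf = build_idf(corpus)

# Sharing the rarer token "Bouvet" counts for more than sharing the common "AS".
rare = tfidf_similarity("Bouvet AS", "Bouvet Norge", idf)
common = tfidf_similarity("Bouvet AS", "Schibsted AS", idf)
print(rare > common)  # True
```

Because the weights come from the corpus, adding or removing records changes the scores of pairs that were not touched.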

More comparators

• Smith-Waterman
– originated in DNA sequencing
• Q-grams distance
– breaks string into sets of pieces of q characters
– then does set similarity comparison
• Monge-Elkan
– similar to Smith-Waterman, but with affine gap distances
– has done very well in evaluations
– costly to evaluate
• Many, many more
– ...
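
As an illustration of the q-grams idea, here is a minimal Python sketch. The slide only says that a set similarity comparison follows; using the Jaccard index for that step, leaving the strings unpadded, and the names qgrams and qgram_similarity are assumptions of this sketch.

```python
def qgrams(s: str, q: int = 2) -> set:
    """All substrings of length q; empty if the string is shorter than q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(s1: str, s2: str, q: int = 2) -> float:
    a, b = qgrams(s1, q), qgrams(s2, q)
    if not a and not b:
        return 1.0 if s1 == s2 else 0.0
    return len(a & b) / len(a | b)  # Jaccard index of the two q-gram sets


print(sorted(qgrams("axel")))             # ['ax', 'el', 'xe']
print(qgram_similarity("axel", "aksel"))  # only 'el' is shared: 1/6
```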
