Phonetic Matching with Apache Solr
Markus Günther
Freelance Software Engineer / Architect
mail@mguenther.net | mguenther.net | @markus_guenther
Phonetic matching is concerned with searching for spelling variations in large databases.
Age-old problem
Algorithmic solutions date back to the pre-computer era
Soundex was developed by Robert C. Russell and Margaret King Odell and patented in 1918 and 1922
Compute a phonetic value for a given name
Names that sound the same share the same phonetic value
Variation: American Soundex
Variation: Daitch-Mokotoff (DM) Soundex
Soundex tends to generate many false hits, which lowers precision
© 2022 Markus Günther IT-Beratung
American Soundex is a reasonably simple algorithm.
Rules
1. Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
2. Replace consonants with digits as suggested by the mapping table.
3. Retain only the first letter for two or more adjacent letters mapped to the same number.
4. Retain only the first letter for two letters mapped to the same number that are separated by h, w, or y.
5. Trim the encoded digits to a total of three. Pad with 0 if there are fewer than three.
American Soundex is a reasonably simple algorithm.
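    import java.util.Locale;

    public class AmericanSoundex {

        // Soundex digit for each letter A..Z; '0' marks letters that are not encoded.
        private static final String MAPPING = "01230120022455012623010202";

        public static String encode(final String term) {
            final String name = term.toUpperCase(Locale.ROOT);
            // The first letter is retained; the three digit slots default to '0' (padding).
            final char[] code = { name.charAt(0), '0', '0', '0' };
            char previousDigit = digitOf(code[0]);
            int count = 1;
            for (int i = 1; i < name.length() && count < code.length; i++) {
                final char ch = name.charAt(i);
                // H, W and Y do not separate letters that share a digit (rule 4).
                if (ch == 'H' || ch == 'W' || ch == 'Y') continue;
                final char digit = digitOf(ch);
                if (digit != '0' && digit != previousDigit) {
                    code[count++] = digit;
                }
                previousDigit = digit;
            }
            return String.valueOf(code);
        }

        // Looks up the digit for a letter; anything outside A..Z counts as not encoded.
        private static char digitOf(final char ch) {
            return (ch >= 'A' && ch <= 'Z') ? MAPPING.charAt(ch - 'A') : '0';
        }
    }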
Let's take a look at a couple of examples.
Name Phonetic value
Robert R163
Rupert R163
Rubin R150
Ashcraft A261
Ashcroft A261
American Soundex is not optimized for Eastern European names.
Name Phonetic value
Schwarzenegger S625
Shwarzenegger S625
Schwartsenegger S632
A search application would not be able to find a match with that misspelling.
Daitch-Mokotoff Soundex has a solution for this.
Name Phonetic values
Schwarzenegger 474659, 479465
Shwarzenegger 474659, 479465
Schwartsenegger 479465
Given a pair of names, we have a phonetic match if at least one of their codes matches.
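The matching rule above amounts to a set intersection check. The following is a minimal sketch, assuming the Daitch-Mokotoff codes have already been computed elsewhere; the class and method names are made up for illustration, and the codes are copied from the table rather than calculated.

```java
import java.util.Collections;
import java.util.Set;

final class DaitchMokotoffMatch {

    // True if the two code sets share at least one phonetic code.
    static boolean matches(Set<String> codesA, Set<String> codesB) {
        return !Collections.disjoint(codesA, codesB);
    }

    public static void main(String[] args) {
        // Codes copied from the table above, not computed here.
        Set<String> schwarzenegger = Set.of("474659", "479465");
        Set<String> schwartsenegger = Set.of("479465");
        System.out.println(matches(schwarzenegger, schwartsenegger)); // true
    }
}
```

Because a name can carry several codes, a single shared code is enough for the misspelling to be found.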
For short names, Soundex leans heavily on the initial sound (anlaut), which leads to false positives.
Phonetic value Names
S300 Scott, Seth, Sadie, Satoya, ...
C500 Connie, Cheyenne, Conway, ...
T200 Tasha, Tessa, Tekia, ...
This isn't always the case, though.
Phonetic value Names
M622 Marcus, Marcos, Marques, Markus, Marquice, Marquisa, ...
F652 Frank, Francisco, Francis, Franklin, Francois, ...
C150 Chevonne, Chavon, Chavonne, Chivon, Cobin, ...
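Tables like the two above can be reproduced by grouping names by their phonetic value. The sketch below is one way to do that, assuming a compact re-implementation of the American Soundex rules from the earlier slides; the class name SoundexBuckets and its methods are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class SoundexBuckets {

    // Soundex digit for each letter A..Z; '0' marks letters that are not encoded.
    private static final String MAPPING = "01230120022455012623010202";

    static String soundex(String name) {
        String s = name.toUpperCase(Locale.ROOT);
        char[] code = { s.charAt(0), '0', '0', '0' };
        char previous = digitOf(code[0]);
        int count = 1;
        for (int i = 1; i < s.length() && count < code.length; i++) {
            char ch = s.charAt(i);
            if (ch == 'H' || ch == 'W' || ch == 'Y') continue; // transparent letters
            char digit = digitOf(ch);
            if (digit != '0' && digit != previous) code[count++] = digit;
            previous = digit;
        }
        return String.valueOf(code);
    }

    private static char digitOf(char ch) {
        return (ch >= 'A' && ch <= 'Z') ? MAPPING.charAt(ch - 'A') : '0';
    }

    // Groups names by phonetic value; each bucket is one potential result set.
    static Map<String, List<String>> bucket(List<String> names) {
        Map<String, List<String>> buckets = new TreeMap<>();
        for (String name : names) {
            buckets.computeIfAbsent(soundex(name), k -> new ArrayList<>()).add(name);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // prints {M622=[Marcus, Markus, Marques], S300=[Scott, Seth, Sadie]}
        System.out.println(bucket(List.of("Scott", "Seth", "Sadie", "Marcus", "Markus", "Marques")));
    }
}
```

Short names end up in large buckets because little more than the first letter and one digit distinguishes them, whereas longer names fill all three digit slots.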
Beider-Morse Phonetic Matching
Instead of focusing on spelling, Beider-Morse factors in linguistic properties of a language.
Of limited interest for common nouns, adjectives, adverbs and verbs
Good strategy for proper nouns (i.e., names)
History: Started off primarily for matching surnames of Ashkenazic Jews
Example: Consider variations of Schwarz (standard German spelling)
Schwartz (alternate German spelling)
Shwartz, Shvartz, Shvarts (Anglicized spelling)
Szwarc (Polish), Szwartz (blended German-Polish)
Svarc (Hungarian), Chvartz (blended French-German)
Step 1: Identifying the language
BMPM includes about 200 rules for determining the language
Some are general, some need context
Examples                         Inferred language(s)
tsch, final mann or witz         German
final and initial cs or zs       Hungarian
cz, cy, initial rz or wl, ...    Polish
ö and ü                          German, Hungarian
The language can also be specified explicitly
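To make the idea of spelling-based language hints concrete, here is a toy sketch that encodes only the four example rules from the table. It is not the BMPM rule set (which comprises about 200 rules), the class and enum names are hypothetical, and simply collecting hints into a set is a simplification of how the real algorithm narrows down candidate languages.

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

final class LanguageHints {

    enum Language { GERMAN, HUNGARIAN, POLISH }

    // Collects the languages hinted at by the example rules; an empty set means "no hint".
    static Set<Language> guess(String name) {
        String n = name.toLowerCase(Locale.ROOT);
        Set<Language> hints = EnumSet.noneOf(Language.class);
        if (n.contains("tsch") || n.endsWith("mann") || n.endsWith("witz")) {
            hints.add(Language.GERMAN);
        }
        if (n.startsWith("cs") || n.endsWith("cs") || n.startsWith("zs") || n.endsWith("zs")) {
            hints.add(Language.HUNGARIAN);
        }
        if (n.contains("cz") || n.contains("cy") || n.startsWith("rz") || n.startsWith("wl")) {
            hints.add(Language.POLISH);
        }
        // \u00f6 is o-umlaut, \u00fc is u-umlaut
        if (n.contains("\u00f6") || n.contains("\u00fc")) {
            hints.add(Language.GERMAN);
            hints.add(Language.HUNGARIAN);
        }
        return hints;
    }

    public static void main(String[] args) {
        System.out.println(guess("Mustermann"));   // [GERMAN]
        System.out.println(guess("M\u00fcller"));  // [GERMAN, HUNGARIAN]
    }
}
```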
Step 2: Calculating the exact phonetic value
Forms of surnames used by women differ in some languages
Affects Slavic languages (e.g. Polish, Russian) as well as Lithuanian and Latvian
Masculine endings Feminine endings
Suchy Sucha
Novikov Novikova
BMPM replaces feminine endings with masculine ones
Step 2: Calculating the exact phonetic value
1. Replace feminine endings with masculine ones.
2. Identify the exact phonetic value of all letters.
   1. Transcribe letters into a phonetic alphabet.
      Applies language-specific rule set in case of one possible language.
      Applies generic rule set in case of multiple possible languages.
   2. Apply phonetic rules that are common to many languages.
      e.g. final devoicing, regressive assimilation
3. At the end, the algorithm yields the exact phonetic value.
Step 2: What do language-specific rules look like?
BMPM applies roughly 80 mapping rules for German
sch maps to S
s at the start and s between two vowels map to z
w maps to v
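The three example rules can be written as plain string rewrites. This is a toy sketch covering only the rules listed above, not the roughly 80 German rules BMPM applies; the class name, the rule order, and the treatment of a, e, i, o, u as the only vowels are assumptions made for illustration.

```java
import java.util.Locale;

final class GermanRulesSketch {

    // Applies the three example rules in a fixed order to a lower-cased name.
    static String transcribe(String name) {
        String s = name.toLowerCase(Locale.ROOT);
        s = s.replace("sch", "S");                          // sch maps to S
        s = s.replaceAll("^s", "z");                        // s at the start maps to z
        s = s.replaceAll("(?<=[aeiou])s(?=[aeiou])", "z");  // s between two vowels maps to z
        s = s.replace("w", "v");                            // w maps to v
        return s;
    }

    public static void main(String[] args) {
        System.out.println(transcribe("Schwarz")); // Svarz
        System.out.println(transcribe("Rose"));    // roze
    }
}
```

Rewriting sch first matters: the resulting upper-case S is a phonetic symbol and is no longer touched by the rules for a plain s.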
Step 2: What do language-agnostic rules look like?
BMPM uses more than 300 generic rules
a final tz maps to ts
Some generic rules might be applicable to specific languages only
step 1 rules out certain languages
rule is applied if it complies with the remaining possible languages
Step 3: Calculating the approximate phonetic value
Some sounds can be interchangeable in specific contexts
beginning / end of word
previous / next letter
Language   Example                            Sounds alike
Russian    unstressed o is pronounced as a    Mostov, Mastov
German     n before b is close to m           Grinberg, Grimberg
Spanish    phonetic equivalence of n and m    Grinberg, Grimberg
Rules can be language-agnostic or language-specific
Step 4: Searching for matches
1. BMPM generates the exact and approximate phonetic value for a given name.
2. We have an exact match if two names match on their exact phonetic value.
This might be too aggressive for your use-case.
3. We have an approximate match if two names match on their approximate phonetic value.
Matches done by BMPM are not necessarily commutative.
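The two-tier comparison can be sketched as follows. Both value functions are stand-ins: the "exact" value is just the lower-cased name, and the "approximate" value additionally applies the single n-before-b rule from the previous slide. Real BMPM values come from the full rule sets, and unlike BMPM matches this toy comparison is symmetric; all names are hypothetical.

```java
import java.util.Locale;

final class MatchSketch {

    enum Match { EXACT, APPROXIMATE, NONE }

    // Stand-in for the exact phonetic value: here just the lower-cased name.
    static String exactValue(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    // Stand-in for the approximate value: additionally treats n before b as m.
    static String approxValue(String name) {
        return exactValue(name).replace("nb", "mb");
    }

    // Checks the stricter exact value first, then falls back to the approximate one.
    static Match compare(String a, String b) {
        if (exactValue(a).equals(exactValue(b))) return Match.EXACT;
        if (approxValue(a).equals(approxValue(b))) return Match.APPROXIMATE;
        return Match.NONE;
    }

    public static void main(String[] args) {
        System.out.println(compare("Grinberg", "Grimberg")); // APPROXIMATE
    }
}
```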
Integration with Apache Solr
Apache Solr supports a variety of phonetic matching algorithms.
Beider-Morse Phonetic Matching
Daitch-Mokotoff Soundex
Double Metaphone
Metaphone
Soundex
Refined Soundex
Caverphone
Cologne Phonetic
NYSIIS
Add a field type that works with the phonetic matching algorithm.
Admissible values for ruleType are: APPROX and EXACT
They correspond to approximate and exact matches, respectively
<fieldType name="phonetic_names" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"></tokenizer>
    <filter class="solr.BeiderMorseFilterFactory"
            nameType="GENERIC"
            ruleType="APPROX"
            concat="true"
            languageSet="auto"></filter>
  </analyzer>
</fieldType>
Add an index field that uses this field type.
You probably already have a name field of sorts for basic name searches.
Use a copyField directive to source name_phonetic from that field.
<field name="name_phonetic"
       type="phonetic_names"
       indexed="true"
       stored="false"
       multiValued="false"></field>
<copyField source="name" dest="name_phonetic"></copyField>
Execute queries against that field.
Query for mustermann
(name_phonetic:mustermann)
Evaluation
Let's do a couple of experiments with different parameters for BMPM.
Dataset: Large enterprise naming directory, approx. 340k individual persons
Naive implementation using phonetic matching incl. wildcards and n-gram-backed fields yields approx. 3k results for a popular surname
Queries:
Large result set: q=(name_phonetic:meier)
Small result set: q=(name_phonetic:<some-unique-name>)
Experiment 1: Querying for a popular name
Variant   ruleType   languageSet       Results for q=(name_phonetic:meier)
Naive     -          -                 2997
1         APPROX     auto              1279
2         EXACT      auto              1228
3         APPROX     german,english    1261
4         EXACT      german,english    1216
Restricting the language set to the languages that predominate in the corpus removes non-intuitive matches
Almost no noticeable difference between APPROX and EXACT with regard to result quality
Few ordering issues: meier almost always ranks before its phonetic variations
Experiment 2: Querying for a unique name with spelling variations
Variant   ruleType   languageSet       Correct         Var. 1      Var. 2
Naive     -          -                 7               0           30 (non-intuitive)
1         APPROX     auto              1               5           14 (no match, not intuitive)
2         EXACT      auto              1               0           1 (no match)
3         APPROX     german,english    7 (top match)   5 (match)   25 (no match, intuitive)
4         EXACT      german,english    1               0           1 (no match)
Variant 3: Precision is good, recall could be better (e.g. through one-off corrections)
Adding one-off corrections using Damerau-Levenshtein distance complements BMPM.
Prerequisites
name index field that stores <first name> <middle-initial> <surname>
name index field uses n-grams
Refine the query
Can be applied within phrases as well to allow for displacements
"Mustermann Max" should yield the same results as "Max Mustermann"
(name_phonetic:mustermann) OR (name:mustermann~1)
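To illustrate what a distance of 1 covers, the sketch below computes the restricted Damerau-Levenshtein distance (optimal string alignment), where an insertion, deletion, substitution, or swap of two adjacent characters each costs 1. It is a stand-alone illustration with a hypothetical class name, not the code Solr runs for the ~1 operator.

```java
final class EditDistance {

    // Optimal string alignment distance: insert, delete, substitute and
    // transposition of two adjacent characters each cost 1.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                // Adjacent transposition, e.g. "an" versus "na".
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("mustermann", "mustermnan")); // 1
    }
}
```

A swapped letter pair such as mustermnan is one edit away from mustermann, so a query with ~1 can catch a typo that phonetic matching alone may miss.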
Adding a boost on first name and surname for direct matches.
Influence ordering a bit to always prefer direct matches before phonetic variations.
Prerequisites:
firstname index field that stores <first name> (non-analyzed, lowercased)
surname index field that stores <surname> (non-analyzed, lowercased)
Refine the query
bq=firstname:("mustermann") surname:("mustermann")
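Putting the refined query and the boost together, a request URL could be assembled as sketched below. Several parts are assumptions not stated in the slides: the host and port, the collection name people, and the use of defType=edismax so that the bq parameter is honored. The search term is inserted as-is, so a real application would also need to escape query syntax characters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class NameQueryUrl {

    // Hypothetical base URL: host, port and collection name are assumptions.
    private static final String BASE = "http://localhost:8983/solr/people/select";

    // Builds a select URL with the phonetic/fuzzy query and the direct-match boost.
    static String build(String term) {
        String q = "(name_phonetic:" + term + ") OR (name:" + term + "~1)";
        String bq = "firstname:(\"" + term + "\") surname:(\"" + term + "\")";
        return BASE
                + "?defType=edismax" // assumption: bq requires a (e)dismax query parser
                + "&q=" + URLEncoder.encode(q, StandardCharsets.UTF_8)
                + "&bq=" + URLEncoder.encode(bq, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(build("mustermann"));
    }
}
```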
Tuning BMPM using additional mechanisms yields well-grounded phonetic matches.
What have we done?
Test the effect of BMPM parameterizations on your dataset
Add one-off-corrections to mitigate spelling mistakes that phonetics won't catch
Allow for displacement of max. two terms within a phrase
Boost on first and surname separately to influence relevance sorting
Tuning BMPM using additional mechanisms yields well-grounded phonetic matches.
Achievements
Good trade-off between precision and recall
usually top match on search for unique names
Result sets are explainable
Relevance ordering feels natural
direct matches, phonetic variations, one-off corrections
Questions?
Phonetic Matching with Apache Solr

Weitere ähnliche Inhalte

Was ist angesagt?

String In C Language
String In C Language String In C Language
String In C Language Simplilearn
 
Inheritance in oops
Inheritance in oopsInheritance in oops
Inheritance in oopsHirra Sultan
 
ONTOLOGY BASED DATA ACCESS
ONTOLOGY BASED DATA ACCESSONTOLOGY BASED DATA ACCESS
ONTOLOGY BASED DATA ACCESSKishan Patel
 
Configuring Apache Solr for Thai Text Search
Configuring Apache Solr for Thai Text SearchConfiguring Apache Solr for Thai Text Search
Configuring Apache Solr for Thai Text Searchsagarote
 
Multilingualism in Information Retrieval System
Multilingualism in Information Retrieval SystemMultilingualism in Information Retrieval System
Multilingualism in Information Retrieval SystemAriel Hess
 
Perl programming language
Perl programming languagePerl programming language
Perl programming languageElie Obeid
 
Inheritance and its types In Java
Inheritance and its types In JavaInheritance and its types In Java
Inheritance and its types In JavaMD SALEEM QAISAR
 
HTML and CSS crash course!
HTML and CSS crash course!HTML and CSS crash course!
HTML and CSS crash course!Ana Cidre
 
Introduction to CSS
Introduction to CSSIntroduction to CSS
Introduction to CSSLarry King
 
Eye catching HTML BASICS tips: Learn easily
Eye catching HTML BASICS tips: Learn easilyEye catching HTML BASICS tips: Learn easily
Eye catching HTML BASICS tips: Learn easilyshabab shihan
 
Parts of speech tagger
Parts of speech taggerParts of speech tagger
Parts of speech taggersadakpramodh
 
Web Development Course: PHP lecture 1
Web Development Course: PHP lecture 1Web Development Course: PHP lecture 1
Web Development Course: PHP lecture 1Gheyath M. Othman
 
Tracking.js: um framework open source de visão computacional
Tracking.js: um framework open source de visão computacional Tracking.js: um framework open source de visão computacional
Tracking.js: um framework open source de visão computacional João Gabriel Lima
 
Linguistic variable
Linguistic variable Linguistic variable
Linguistic variable Math-Circle
 
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Edureka!
 
Academic writing
Academic writingAcademic writing
Academic writingMiann91
 

Was ist angesagt? (20)

String In C Language
String In C Language String In C Language
String In C Language
 
Inheritance in oops
Inheritance in oopsInheritance in oops
Inheritance in oops
 
ONTOLOGY BASED DATA ACCESS
ONTOLOGY BASED DATA ACCESSONTOLOGY BASED DATA ACCESS
ONTOLOGY BASED DATA ACCESS
 
Configuring Apache Solr for Thai Text Search
Configuring Apache Solr for Thai Text SearchConfiguring Apache Solr for Thai Text Search
Configuring Apache Solr for Thai Text Search
 
Semantic search
Semantic searchSemantic search
Semantic search
 
Multilingualism in Information Retrieval System
Multilingualism in Information Retrieval SystemMultilingualism in Information Retrieval System
Multilingualism in Information Retrieval System
 
Perl programming language
Perl programming languagePerl programming language
Perl programming language
 
Inheritance and its types In Java
Inheritance and its types In JavaInheritance and its types In Java
Inheritance and its types In Java
 
HTML and CSS crash course!
HTML and CSS crash course!HTML and CSS crash course!
HTML and CSS crash course!
 
Exception handling
Exception handlingException handling
Exception handling
 
Introduction to CSS
Introduction to CSSIntroduction to CSS
Introduction to CSS
 
Eye catching HTML BASICS tips: Learn easily
Eye catching HTML BASICS tips: Learn easilyEye catching HTML BASICS tips: Learn easily
Eye catching HTML BASICS tips: Learn easily
 
Parts of speech tagger
Parts of speech taggerParts of speech tagger
Parts of speech tagger
 
NLP_KASHK:POS Tagging
NLP_KASHK:POS TaggingNLP_KASHK:POS Tagging
NLP_KASHK:POS Tagging
 
Pop operation
Pop operationPop operation
Pop operation
 
Web Development Course: PHP lecture 1
Web Development Course: PHP lecture 1Web Development Course: PHP lecture 1
Web Development Course: PHP lecture 1
 
Tracking.js: um framework open source de visão computacional
Tracking.js: um framework open source de visão computacional Tracking.js: um framework open source de visão computacional
Tracking.js: um framework open source de visão computacional
 
Linguistic variable
Linguistic variable Linguistic variable
Linguistic variable
 
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
 
Academic writing
Academic writingAcademic writing
Academic writing
 

Kürzlich hochgeladen

Transform your Corporate Strategy Office - Harness OnePlan’s Strategic Portfo...
Transform your Corporate Strategy Office - Harness OnePlan’s Strategic Portfo...Transform your Corporate Strategy Office - Harness OnePlan’s Strategic Portfo...
Transform your Corporate Strategy Office - Harness OnePlan’s Strategic Portfo...OnePlan Solutions
 
Business Analyzopedia - Your Pocket Gita for Business Analysis
Business Analyzopedia - Your Pocket Gita for Business AnalysisBusiness Analyzopedia - Your Pocket Gita for Business Analysis
Business Analyzopedia - Your Pocket Gita for Business AnalysisDEEPRAJ PATHAK
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics
 
Effort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software ProjectsEffort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software ProjectsDEEPRAJ PATHAK
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...kalichargn70th171
 
AmsterdamJUG April 2024 - Going serverless with Quarkus GraalVM native images...
AmsterdamJUG April 2024 - Going serverless with Quarkus GraalVM native images...AmsterdamJUG April 2024 - Going serverless with Quarkus GraalVM native images...
AmsterdamJUG April 2024 - Going serverless with Quarkus GraalVM native images...Bert Jan Schrijver
 
full course of software engineering mid term.pdf
full course of software engineering mid term.pdffull course of software engineering mid term.pdf
full course of software engineering mid term.pdfAbdul salam
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jNeo4j
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
Explore the Three Main Types of Logistics - Inbound Logistics, Outbound Logis...
Explore the Three Main Types of Logistics - Inbound Logistics, Outbound Logis...Explore the Three Main Types of Logistics - Inbound Logistics, Outbound Logis...
Explore the Three Main Types of Logistics - Inbound Logistics, Outbound Logis...Piyovi
 
oracle 23c new features for developer and dba
oracle 23c new features for developer and dbaoracle 23c new features for developer and dba
oracle 23c new features for developer and dbaRemote DBA Services
 
What is Mendix and the concept of low-code development.docx
What is Mendix and the concept of low-code development.docxWhat is Mendix and the concept of low-code development.docx
What is Mendix and the concept of low-code development.docxTechnogeeks
 
OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024OpenMetadata
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxRTS corp
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxSasikiranMarri
 
logical backup of Oracle Datapump-detailed.pptx
logical backup of Oracle Datapump-detailed.pptxlogical backup of Oracle Datapump-detailed.pptx
logical backup of Oracle Datapump-detailed.pptxRemote DBA Services
 
Key Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapKey Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapIshara Amarasekera
 

Kürzlich hochgeladen (20)

Transform your Corporate Strategy Office - Harness OnePlan’s Strategic Portfo...
Transform your Corporate Strategy Office - Harness OnePlan’s Strategic Portfo...Transform your Corporate Strategy Office - Harness OnePlan’s Strategic Portfo...
Transform your Corporate Strategy Office - Harness OnePlan’s Strategic Portfo...
 
Business Analyzopedia - Your Pocket Gita for Business Analysis
Business Analyzopedia - Your Pocket Gita for Business AnalysisBusiness Analyzopedia - Your Pocket Gita for Business Analysis
Business Analyzopedia - Your Pocket Gita for Business Analysis
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
Effort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software ProjectsEffort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software Projects
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
 
AmsterdamJUG April 2024 - Going serverless with Quarkus GraalVM native images...
AmsterdamJUG April 2024 - Going serverless with Quarkus GraalVM native images...AmsterdamJUG April 2024 - Going serverless with Quarkus GraalVM native images...
AmsterdamJUG April 2024 - Going serverless with Quarkus GraalVM native images...
 
full course of software engineering mid term.pdf
full course of software engineering mid term.pdffull course of software engineering mid term.pdf
full course of software engineering mid term.pdf
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Explore the Three Main Types of Logistics - Inbound Logistics, Outbound Logis...
Explore the Three Main Types of Logistics - Inbound Logistics, Outbound Logis...Explore the Three Main Types of Logistics - Inbound Logistics, Outbound Logis...
Explore the Three Main Types of Logistics - Inbound Logistics, Outbound Logis...
 
oracle 23c new features for developer and dba
oracle 23c new features for developer and dbaoracle 23c new features for developer and dba
oracle 23c new features for developer and dba
 
What is Mendix and the concept of low-code development.docx
What is Mendix and the concept of low-code development.docxWhat is Mendix and the concept of low-code development.docx
What is Mendix and the concept of low-code development.docx
 
OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptx
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
 
logical backup of Oracle Datapump-detailed.pptx
logical backup of Oracle Datapump-detailed.pptxlogical backup of Oracle Datapump-detailed.pptx
logical backup of Oracle Datapump-detailed.pptx
 
Key Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapKey Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery Roadmap
 

Phonetic Matching with Apache Solr

  • 1. with Apache Solr Markus Günther Freelance Software Engineer / Architect | | Phonetic Matching mail@mguenther.net mguenther.net @markus_guenther
  • 2. Phonetic matching is concerned with searching for spelling variations in large databases. Age-old problem Algorithmic solutions date back to the pre-computer era Soundex was invented by Russell and Odell in 1912 Compute a phonetic value for a given name Names that sound the same share the same phonetic value Variation: American Soundex Variation: Daitch-Mokotoff (DM) Soundex Soundex tends to generate many false hits which lowers precision © 2022 Markus Günther IT-Beratung
  • 3. American Soundex is a reasonably simple algorithm. Rules 1. Replace the first letter of the name and drop all occurences of a, e, i, o, u ,y, h, w. 2. Replace consonants with digits as suggested by mapping table. 3. Retain only the first letter for two or more adjacent letters mapped to the same number. 4. Retain only the first letter for two letters mapped to the same number that are separated by h, w, or y. 5. Trim the encoded numbers to a total of three. Pad with 0 if there are less than three. © 2022 Markus Günther IT-Beratung
  • 4. American Soundex is a reasonably simple algorithm. public class AmericanSoundex { private static final String MAPPING = "01230120022455012623010202"; public static String encode(final String term) { char code[] = { term.charAt(0), '0', '0', '0'}; char previousDigit = encode(code[0]); int count = 1; for (int i = 1; i < term.length() && count < code.length; i++) { final char ch = term.charAt(i); if (ch == 'H' || ch == 'W' || ch == 'Y') continue; final char digit = encode(ch); if (digit != '0' && digit != previousDigit) { code[count++] = digit; } previousDigit = digit; } return String.valueOf(code); } } © 2022 Markus Günther IT-Beratung
  • 5. Let's take a look at a couple of examples. Name Phonetic value Robert R163 Rupert R163 Rubin R150 Ashcraft A261 Ashcroft A261 © 2022 Markus Günther IT-Beratung
  • 6. American Soundex is not optimized for Eastern European names. Name Phonetic value Schwarzenegger S625 Shwarzenegger S625 Schwartsenegger S632 A search application would not be able to find a match with that misspelling. © 2022 Markus Günther IT-Beratung
  • 7. Daitch-Mokotoff Soundex has a solution for this. Name Phonetic values Schwarzenegger 474659, 479465 Shwarzenegger 474659, 479465 Schwartsenegger 479465 Given a pair of names, we have a phonetic match if at least one of their codes match. © 2022 Markus Günther IT-Beratung
  • 8. Soundex suffers from a focus on the anlaut for short names leading to false-positives. Phonetic value Names S300 Scott, Seth, Sadie, Satoya, ... C500 Connie, Cheyenne, Conway, ... T200 Tasha, Tessa, Tekia, ... © 2022 Markus Günther IT-Beratung
  • 9. This isn't always the case, though. Phonetic value Names M622 Marcus, Marcos, Marques, Markus, Marquice, Marquisa, ... F652 Frank, Francisco, Francis, Franklin, Francois, ... C150 Chevonne, Chavon, Chavonne, Chivon, Cobin, ... © 2022 Markus Günther IT-Beratung
  • 11. Instead of focusing on spelling, Beider-Morse factors in linguistic properties of a language. Of limited interest for common nouns, adjectives, adverbs and verbs Good strategy for proper nouns (i.e., names) History: Started off primarily for matching surnames of Ashkenazic Jews Example: Consider variations of Schwarz (standard German spelling) Schwartz (alternate German spelling) Shwartz, Shvartz, Shvarts (Anglicized spelling) Szwarc (Polish), Szwartz (blended German-Polish) Svarc (Hungarian), Chvartz (blended French-German) © 2022 Markus Günther IT-Beratung
  • 12. Step 1: Identifying the language BMPM includes about 200 rules for determining the language Some are general, some need context Examples Inferred Language(s) tsch, final mann or witz German final and initial cs or zs Hungarian cz, cy, initial rz or wl, ... Polish ö and ü German, Hungarian Allows to specify a language explicitly © 2022 Markus Günther IT-Beratung
  • 13. Step 2: Calculating the exact phonetic value Forms of surnames used by women differ in some languages Affects Slavic languages, Polish, Russian, Lithuanian, Latvian Masculine endings Feminine endings Suchy Sucha Novikov Novikova BMPM replaces feminine endings with masculine ones © 2022 Markus Günther IT-Beratung
  • 14. Step 2: Calculating the exact phonetic value 1. Replace feminine endings with masculine ones. 2. Identify the exact phonetic value of all letters. 1. Transcribe letters into a phonetic alphabet. Applies language-specific rule set in case of one possible language. Applies generic rule set in case of multiple possible languages. 2. Apply phonetic rules that are common to many languages. e.g. final devoicing, regressive assimilation 3. At the end, the algorithm yields the exact phonetic value. © 2022 Markus Günther IT-Beratung
  • 15. Step 2: What do language-specific rules look like? BMPM applies roughly 80 mapping rules for German sch maps to S s at the start and s between two vowels maps to z w maps to v © 2022 Markus Günther IT-Beratung
  • 16. Step 2: What do language-agnostic rules look like? BMPM uses more than 300 generic rules a final tz maps to ts Some generic rules might be applicable to specific languages only step 1 rules out certain languages rule is applied if it complies with the remaining possible languages © 2022 Markus Günther IT-Beratung
  • 17. Step 3: Calculating the approximate phonetic value Some sounds can be interchangeable in specific contexts beginning / end of word previous next / letter Language Example Sounds alike Russian unstressed o is pronounced as a Mostov, Mastov German n before b is close to m Grinberg, Grimberg Spanish phonetic equivalence of n and m Grinberg, Grimberg Rules can be language-agnostic or -specific © 2022 Markus Günther IT-Beratung
  • 18. Step 4: Searching for matches 1. BMPM generates the exact and approximate phonetic value for a given name. 2. We have an exact match if two names match on their exact phonetic value. This might be too aggressive for your use-case. 3. We have an approximate match if two names match on their approximate phonetic value. Matches done by BMPM are not necessarily commutative. © 2022 Markus Günther IT-Beratung
  • 20. Apache Solr supports a variety of phonetic matching algorithms. Beider-Morse Phonetic Matching Daitch-Mokotoff Soundex Double Metaphone Metaphone Soundex © 2022 Markus Günther IT-Beratung
  • 22. Add a field type that works with the phonetic matching algorithm.
      Admissible values for ruleType are APPROX and EXACT.
      They correspond to approximate and exact matching, respectively.
      <fieldType name="phonetic_names" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
        <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"></tokenizer>
          <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"></filter>
        </analyzer>
      </fieldType>
  • 23. Add an index field that uses this field type.
      You probably already have a name field of sorts for basic name searches.
      Use a copyField directive to source name_phonetic from that field.
      <field name="name_phonetic" type="phonetic_names" indexed="true" stored="false" multiValued="false"></field>
      <copyField source="name" dest="name_phonetic"></copyField>
  • 24. Execute queries against that field.
      Query for mustermann:
      (name_phonetic:mustermann)
  • 26. Let's do a couple of experiments with different parameters for BMPM.
      Dataset:
      - Large enterprise naming directory, approx. 340k individual persons
      - Naive implementation using phonetic matching incl. wildcards and n-gram backed fields yields approx. 3k results for a popular surname
      Queries:
      - Large result set: q=(name_phonetic:meier)
      - Small result set: q=(name_phonetic:<some-unique-name>)
  • 27. Experiment 1: Querying for a popular name
      Variant   ruleType   languageSet      Results for q=(name_phonetic:meier)
      Naive     -          -                2997
      1         APPROX     auto             1279
      2         EXACT      auto             1228
      3         APPROX     german,english   1261
      4         EXACT      german,english   1216
      - Restricting languages to the predominant ones of the corpus removes non-intuitive matches.
      - Almost no noticeable difference between APPROX and EXACT with regard to result quality.
  • 28. Few ordering issues: meier ranks before its phonetic variations almost every time.
  • 29. Experiment 2: Querying for a unique name with spelling variations
      Variant   ruleType   languageSet      Correct         Var. 1      Var. 2
      Naive     -          -                7               0           30 (non-intuitive)
      1         APPROX     auto             1               5           14 (no match, not intuitive)
      2         EXACT      auto             1               0           1 (no match)
      3         APPROX     german,english   7 (top match)   5 (match)   25 (no match, intuitive)
      4         EXACT      german,english   1               0           1 (no match)
      Variant 3: Precision is good, recall could be better (i.e. one-off corrections).
  • 30. Adding one-off corrections using Damerau-Levenshtein distance complements BMPM.
      Prerequisites
      - name index field that stores <first name> <middle-initial> <surname>
      - name index field uses n-grams
      Refine the query
      - Can be applied within phrases as well to allow for displacements
      - "Mustermann Max" should yield the same results as "Max Mustermann"
      (name_phonetic:mustermann) OR (name:mustermann~1)
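To show what a distance of 1 in name:mustermann~1 covers, here is a minimal sketch of the restricted Damerau-Levenshtein distance (the optimal string alignment variant). It counts insertions, deletions, substitutions, and transpositions of adjacent characters. The class `EditDistanceSketch` is illustrative and makes no claim about how Solr computes fuzzy matches internally.

```java
/** Restricted Damerau-Levenshtein (optimal string alignment) distance. */
final class EditDistanceSketch {

    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        // Turning a prefix into the empty string costs one deletion per character.
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;

        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1,      // deletion
                                 d[i][j - 1] + 1),     // insertion
                        d[i - 1][j - 1] + cost);       // substitution or match
                // Transposition of two adjacent characters.
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("mustermann", "mustermnan")); // 1 (swapped letters)
        System.out.println(distance("mustermann", "musterman"));  // 1 (missing letter)
        System.out.println(distance("mustermann", "mastermenn")); // 2
    }
}
```

Typos such as swapped or dropped letters stay within distance 1, which is the class of mistakes that phonetic encoding alone may miss.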
  • 31. Adding a boost on first name and surname for direct matches.
      Influence the ordering a bit to always prefer direct matches over phonetic variations.
      Prerequisites:
      - firstname index field that stores <first name> (non-analyzed, lowercased)
      - surname index field that stores <surname> (non-analyzed, lowercased)
      Refine the query
      bq=firstname:("mustermann") surname:("mustermann")
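The refined query and the boost could be assembled into request parameters roughly as follows. This is a sketch under assumptions: the class `BoostedNameQuerySketch` is hypothetical, `defType=edismax` is assumed because bq is a parameter of the DisMax family of query parsers, and the search term is assumed to be a single lowercase token that needs no query-syntax escaping.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Builds the request parameters for a phonetic query with a boost on direct matches. */
final class BoostedNameQuerySketch {

    static Map<String, String> params(String term) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("defType", "edismax"); // assumption: bq requires a (e)dismax parser
        // Phonetic match, complemented by a one-off correction on the name field.
        p.put("q", "(name_phonetic:" + term + ") OR (name:" + term + "~1)");
        // Boost documents whose first name or surname matches the term directly.
        p.put("bq", "firstname:(\"" + term + "\") surname:(\"" + term + "\")");
        return p;
    }

    /** URL-encodes the parameters into a query string (without the leading '?'). */
    static String toQueryString(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        System.out.println(toQueryString(params("mustermann")));
    }
}
```

The resulting string can be appended to the select handler URL of the collection; user input containing spaces or query operators would need escaping first.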
  • 32. Tuning BMPM using additional mechanisms yields well-grounded phonetic matches.
      What have we done?
      - Tested the effect of BMPM parameterizations on the dataset
      - Added one-off corrections to mitigate spelling mistakes that phonetics won't catch
      - Allowed for displacement of max. two terms within a phrase
      - Boosted on first name and surname separately to influence relevance sorting
  • 33. Tuning BMPM using additional mechanisms yields well-grounded phonetic matches.
      Achievements
      - Good trade-off between precision and recall
        - usually the top match when searching for unique names
      - Result sets are explainable
      - Relevance ordering feels natural
        - direct matches, then phonetic variations, then one-off corrections