Phonetic Matching with Apache Solr
Markus Günther
Freelance Software Engineer / Architect
mail@mguenther.net | mguenther.net | @markus_guenther
Phonetic matching is concerned with searching for spelling variations in large databases.
Age-old problem
Algorithmic solutions date back to the pre-computer era
Soundex was developed by Robert C. Russell and Margaret King Odell (patented in 1918 and 1922)
Compute a phonetic value for a given name
Names that sound the same share the same phonetic value
Variation: American Soundex
Variation: Daitch-Mokotoff (DM) Soundex
Soundex tends to generate many false hits which lowers precision
© 2022 Markus Günther IT-Beratung
American Soundex is a reasonably simple algorithm.
Rules
1. Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
2. Replace consonants with digits as suggested by the mapping table.
3. Retain only the first letter for two or more adjacent letters mapped to the same number.
4. Retain only the first letter for two letters mapped to the same number that are separated by h, w, or y.
5. Trim the encoded numbers to a total of three. Pad with 0 if there are fewer than three.
American Soundex is a reasonably simple algorithm.
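public class AmericanSoundex {

    // Soundex digit for each letter A..Z; '0' marks letters that are not coded.
    private static final String MAPPING = "01230120022455012623010202";

    public static String encode(final String term) {
        final String name = term.toUpperCase();
        final char[] code = { name.charAt(0), '0', '0', '0' };
        char previousDigit = digitOf(code[0]);
        int count = 1;
        for (int i = 1; i < name.length() && count < code.length; i++) {
            final char ch = name.charAt(i);
            // H, W and Y do not separate letters that share a digit.
            if (ch == 'H' || ch == 'W' || ch == 'Y') continue;
            final char digit = digitOf(ch);
            if (digit != '0' && digit != previousDigit) {
                code[count++] = digit;
            }
            previousDigit = digit;
        }
        return String.valueOf(code);
    }

    // Looks up the digit of an upper-case letter; anything else counts as uncoded.
    private static char digitOf(final char ch) {
        return (ch >= 'A' && ch <= 'Z') ? MAPPING.charAt(ch - 'A') : '0';
    }
}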
Let's take a look at a couple of examples.
Name       Phonetic value
Robert     R163
Rupert     R163
Rubin      R150
Ashcraft   A261
Ashcroft   A261
American Soundex is not optimized for Eastern European names.
Name              Phonetic value
Schwarzenegger    S625
Shwarzenegger     S625
Schwartsenegger   S632
A search application would not be able to find a match with that misspelling.
Daitch-Mokotoff Soundex has a solution for this.
Name              Phonetic values
Schwarzenegger    474659, 479465
Shwarzenegger     474659, 479465
Schwartsenegger   479465
Given a pair of names, we have a phonetic match if at least one of their codes matches.
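The matching rule above can be sketched as a set intersection. This is a minimal illustration, not a Daitch-Mokotoff encoder: the codes are copied from the table, and the class and method names are made up for this example.

```java
import java.util.Set;

final class DaitchMokotoffMatch {

    // Two names match phonetically if their code sets share at least one code.
    static boolean isPhoneticMatch(Set<String> codesA, Set<String> codesB) {
        return codesA.stream().anyMatch(codesB::contains);
    }

    public static void main(String[] args) {
        // Codes taken from the table above, not computed here.
        Set<String> schwarzenegger = Set.of("474659", "479465");
        Set<String> schwartsenegger = Set.of("479465");
        System.out.println(isPhoneticMatch(schwarzenegger, schwartsenegger)); // prints "true"
    }
}
```

Because a name may carry several codes, the misspelling still matches through the shared code 479465.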
Soundex focuses on the initial sound (anlaut), which leads to false positives for short names.
Phonetic value   Names
S300             Scott, Seth, Sadie, Satoya, ...
C500             Connie, Cheyenne, Conway, ...
T200             Tasha, Tessa, Tekia, ...
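To see how such buckets come about, the following sketch groups a few of the names above by their American Soundex code. It bundles a compact encoder that follows the rules listed earlier in this deck; the class name SoundexBuckets and the bucket method are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SoundexBuckets {

    // Soundex digit for each letter A..Z; '0' marks letters that are not coded.
    private static final String MAPPING = "01230120022455012623010202";

    private static char digitOf(char ch) {
        return (ch >= 'A' && ch <= 'Z') ? MAPPING.charAt(ch - 'A') : '0';
    }

    // Compact American Soundex: first letter plus up to three digits, padded with '0'.
    static String soundex(String term) {
        String name = term.toUpperCase();
        char[] code = { name.charAt(0), '0', '0', '0' };
        char previous = digitOf(code[0]);
        int count = 1;
        for (int i = 1; i < name.length() && count < code.length; i++) {
            char ch = name.charAt(i);
            if (ch == 'H' || ch == 'W' || ch == 'Y') continue;
            char digit = digitOf(ch);
            if (digit != '0' && digit != previous) code[count++] = digit;
            previous = digit;
        }
        return String.valueOf(code);
    }

    // Groups names by code; every bucket is what a Soundex lookup would return together.
    static Map<String, List<String>> bucket(List<String> names) {
        Map<String, List<String>> buckets = new TreeMap<>();
        for (String name : names) {
            buckets.computeIfAbsent(soundex(name), k -> new ArrayList<>()).add(name);
        }
        return buckets;
    }

    public static void main(String[] args) {
        System.out.println(bucket(List.of("Scott", "Seth", "Sadie", "Satoya", "Tasha", "Tessa")));
    }
}
```

With only one or two coded consonants after the first letter, short names collapse into the same bucket even though they sound quite different.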
This isn't always the case, though.
Phonetic value   Names
M622             Marcus, Marcos, Marques, Markus, Marquice, Marquisa, ...
F652             Frank, Francisco, Francis, Franklin, Francois, ...
C150             Chevonne, Chavon, Chavonne, Chivon, Cobin, ...
Beider-Morse Phonetic Matching
Instead of focusing on spelling, Beider-Morse factors in linguistic properties of a language.
Of limited interest for common nouns, adjectives, adverbs and verbs
Good strategy for proper nouns (i.e., names)
History: Started off primarily for matching surnames of Ashkenazic Jews
Example: Consider variations of Schwarz (standard German spelling)
Schwartz (alternate German spelling)
Shwartz, Shvartz, Shvarts (Anglicized spelling)
Szwarc (Polish), Szwartz (blended German-Polish)
Svarc (Hungarian), Chvartz (blended French-German)
Step 1: Identifying the language
BMPM includes about 200 rules for determining the language
Some are general, some need context
Examples                        Inferred language(s)
tsch, final mann or witz        German
final and initial cs or zs      Hungarian
cz, cy, initial rz or wl, ...   Polish
ö and ü                         German, Hungarian
The language can also be specified explicitly
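A rough sketch of how such rules could narrow down the candidate languages. It only encodes the four example rows from the table, starts from a made-up three-language universe, and intersects the candidates with every rule that fires. The real BMPM rule set is much larger and its rule semantics are richer, so treat the class LanguageGuesser and its behaviour as assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

final class LanguageGuesser {

    // Assumed universe of languages for this sketch.
    static final Set<String> ALL = Set.of("german", "hungarian", "polish");

    // Letter pattern -> languages it points to (rows of the table above).
    private static final Map<Pattern, Set<String>> RULES = new LinkedHashMap<>();
    static {
        RULES.put(Pattern.compile("tsch|mann$|witz$"), Set.of("german"));
        RULES.put(Pattern.compile("^cs|cs$|^zs|zs$"), Set.of("hungarian"));
        RULES.put(Pattern.compile("cz|cy|^rz|^wl"), Set.of("polish"));
        // o-umlaut and u-umlaut, written as escapes to keep the source ASCII-only
        RULES.put(Pattern.compile("[\u00f6\u00fc]"), Set.of("german", "hungarian"));
    }

    // Starts with all languages and keeps only those compatible with every matching rule.
    static Set<String> guess(String name) {
        String lower = name.toLowerCase();
        Set<String> candidates = new TreeSet<>(ALL);
        for (Map.Entry<Pattern, Set<String>> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(lower).find()) {
                candidates.retainAll(rule.getValue());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        System.out.println(guess("Hoffmann"));
        System.out.println(guess("Czerny"));
        System.out.println(guess("Smith"));
    }
}
```

A name that triggers no rule keeps all candidates, which corresponds to the case where the generic rule set has to be used later on.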
Step 2: Calculating the exact phonetic value
Forms of surnames used by women differ in some languages
Affects Slavic languages (e.g. Polish, Russian) as well as Lithuanian and Latvian
Masculine endings   Feminine endings
Suchy               Sucha
Novikov             Novikova
BMPM replaces feminine endings with masculine ones
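The normalization can be pictured as a suffix replacement. The two endings below are derived only from the examples on this slide and are purely illustrative; they are not the BMPM rule tables, and the class SurnameGender is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SurnameGender {

    // Illustrative feminine -> masculine endings, checked in insertion order (longest first).
    private static final Map<String, String> ENDINGS = new LinkedHashMap<>();
    static {
        ENDINGS.put("ova", "ov"); // Novikova -> Novikov
        ENDINGS.put("a", "y");    // Sucha -> Suchy
    }

    // Replaces the first matching feminine ending; names without one are returned unchanged.
    static String toMasculine(String surname) {
        for (Map.Entry<String, String> ending : ENDINGS.entrySet()) {
            String feminine = ending.getKey();
            if (surname.endsWith(feminine)) {
                return surname.substring(0, surname.length() - feminine.length()) + ending.getValue();
            }
        }
        return surname;
    }

    public static void main(String[] args) {
        System.out.println(toMasculine("Novikova")); // prints "Novikov"
        System.out.println(toMasculine("Sucha"));    // prints "Suchy"
    }
}
```

A blunt rule like "a becomes y" would misfire on many names, which is one reason the real algorithm ties such rules to the language identified in step 1.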
Step 2: Calculating the exact phonetic value
1. Replace feminine endings with masculine ones.
2. Identify the exact phonetic value of all letters.
   1. Transcribe letters into a phonetic alphabet.
      Applies language-specific rule set in case of one possible language.
      Applies generic rule set in case of multiple possible languages.
   2. Apply phonetic rules that are common to many languages,
      e.g. final devoicing, regressive assimilation.
3. At the end, the algorithm yields the exact phonetic value.
Step 2: What do language-specific rules look like?
BMPM applies roughly 80 mapping rules for German
sch maps to S
s at the start and s between two vowels maps to z
w maps to v
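As a small illustration, the three German rules above can be applied in a single left-to-right pass. This sketch covers only these three rules out of the roughly 80, treats a, e, i, o, u as the vowels, and uses the hypothetical class name GermanRulesSketch.

```java
final class GermanRulesSketch {

    private static boolean isVowel(char ch) {
        return "aeiou".indexOf(ch) >= 0;
    }

    // Applies: sch -> S, s at the start or between two vowels -> z, w -> v.
    static String transcribe(String name) {
        String s = name.toLowerCase();
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("sch", i)) {
                // The trigraph wins over the single-letter rule for s.
                out.append('S');
                i += 3;
                continue;
            }
            char ch = s.charAt(i);
            boolean atStart = i == 0;
            boolean betweenVowels = i > 0 && i + 1 < s.length()
                    && isVowel(s.charAt(i - 1)) && isVowel(s.charAt(i + 1));
            if (ch == 's' && (atStart || betweenVowels)) {
                out.append('z');
            } else if (ch == 'w') {
                out.append('v');
            } else {
                out.append(ch);
            }
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transcribe("Schwarz")); // prints "Svarz"
        System.out.println(transcribe("Rose"));    // prints "roze"
    }
}
```

The rules are context-sensitive: the s in "Wasser" stays an s because it is not surrounded by vowels.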
Step 2: What do language-agnostic rules look like?
BMPM uses more than 300 generic rules
a final tz maps to ts
Some generic rules might be applicable to specific languages only
step 1 rules out certain languages
rule is applied if it complies with the remaining possible languages
Step 3: Calculating the approximate phonetic value
Some sounds can be interchangeable in specific contexts
beginning / end of word
previous / next letter
Language   Example                            Sounds alike
Russian    unstressed o is pronounced as a    Mostov, Mastov
German     n before b is close to m           Grinberg, Grimberg
Spanish    phonetic equivalence of n and m    Grinberg, Grimberg
Rules can be language-agnostic or -specific
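A toy version of two of these approximations, to show why the name pairs in the table end up with the same value. It is a deliberate over-simplification: stress is unknown to the code, so every o is folded into a, and the class ApproxKeySketch is a made-up name rather than part of BMPM.

```java
final class ApproxKeySketch {

    // Folds sounds that the table above treats as interchangeable into one representative.
    static String approxKey(String exactValue) {
        return exactValue.toLowerCase()
                .replace("nb", "mb")  // n before b is close to m
                .replace('o', 'a');   // assumption: treat every o as unstressed
    }

    public static void main(String[] args) {
        System.out.println(approxKey("Grinberg").equals(approxKey("Grimberg"))); // prints "true"
        System.out.println(approxKey("Mostov").equals(approxKey("Mastov")));     // prints "true"
    }
}
```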
Step 4: Searching for matches
1. BMPM generates the exact and approximate phonetic value for a given name.
2. We have an exact match if two names match on their exact phonetic value.
This might be too aggressive for your use-case.
3. We have an approximate match if two names match on their approximate phonetic value.
Matches done by BMPM are not necessarily commutative.
Integration with Apache Solr
Apache Solr supports a variety of phonetic matching algorithms.
Beider-Morse Phonetic Matching
Daitch-Mokotoff Soundex
Double Metaphone
Metaphone
Soundex
Refined Soundex
Caverphone
Cologne Phonetic
NYSIIS
Add a field type that works with the phonetic matching algorithm.
Admissible values for ruleType are APPROX and EXACT
They correspond to approximate and exact matches, respectively
<fieldType name="phonetic_names" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"></tokenizer>
    <filter class="solr.BeiderMorseFilterFactory"
            nameType="GENERIC"
            ruleType="APPROX"
            concat="true"
            languageSet="auto"></filter>
  </analyzer>
</fieldType>
Add an index field that uses this field type.
You probably already have a name field of sorts for basic name searches.
Use a copyField directive to populate name_phonetic from that field.
<field name="name_phonetic"
       type="phonetic_names"
       indexed="true"
       stored="false"
       multiValued="false"></field>
<copyField source="name" dest="name_phonetic"></copyField>
Execute queries against that field.
Query for mustermann
(name_phonetic:mustermann)
Evaluation
Let's do a couple of experiments with different parameters for BMPM.
Dataset: Large enterprise naming directory, approx. 340k individual persons
Naive implementation using phonetic matching incl. wildcards and n-gram-backed fields yields approx. 3k results for a popular surname
Queries:
Large result set: q=(name_phonetic:meier)
Small result set: q=(name_phonetic:<some-unique-name>)
Experiment 1: Querying for a popular name
Variant   ruleType   languageSet      Results for q=(name_phonetic:meier)
Naive     -          -                2997
1         APPROX     auto             1279
2         EXACT      auto             1228
3         APPROX     german,english   1261
4         EXACT      german,english   1216
Restricting languageSet to the languages that predominate in the corpus removes non-intuitive matches
Almost no noticeable difference between APPROX and EXACT with regard to result quality
Few ordering issues: meier almost always ranks before its phonetic variations
Experiment 2: Querying for a unique name with spelling variations
Variant   ruleType   languageSet      Correct         Var. 1      Var. 2
Naive     -          -                7               0           30 (non-intuitive)
1         APPROX     auto             1               5           14 (no match, not intuitive)
2         EXACT      auto             1               0           1 (no match)
3         APPROX     german,english   7 (top match)   5 (match)   25 (no match, intuitive)
4         EXACT      german,english   1               0           1 (no match)
Variant 3: Precision is good, recall could be better (e.g. through one-off corrections)
Adding one-off corrections using Damerau-Levenshtein distance complements BMPM.
Prerequisites
name index field that stores <first name> <middle-initial> <surname>
name index field uses n-grams
Refine the query
Can be applied within phrases as well to allow for displacements
"Mustermann Max" should yield the same results as "Max Mustermann"
(name_phonetic:mustermann) OR (name:mustermann~1)
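To make the "~1" part of the query tangible, here is a minimal sketch of the Damerau-Levenshtein distance in its optimal string alignment variant, which counts insertions, deletions, substitutions and transpositions of adjacent characters. It is meant for illustration only and is not how Solr evaluates fuzzy queries internally; the class name EditDistance is an assumption.

```java
final class EditDistance {

    // Optimal string alignment distance: edits needed to turn a into b,
    // where swapping two adjacent characters counts as a single edit.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                boolean transposed = i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1);
                if (transposed) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("mustermann", "mustermnan")); // prints "1"
        System.out.println(distance("mustermann", "musterman"));  // prints "1"
    }
}
```

Both misspellings are one edit away from the indexed term, so a distance of 1 would catch typos like these that a phonetic encoding may miss.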
Adding a boost on first name and surname for direct matches.
Influence ordering a bit to always prefer direct matches before phonetic variations.
Prerequisites:
firstname index field that stores <first name> (non-analyzed, lowercased)
surname index field that stores <surname> (non-analyzed, lowercased)
Refine the query
bq=firstname:("mustermann") surname:("mustermann")
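Putting the pieces together, the request parameters could be assembled as in the sketch below. It assumes the eDisMax query parser (bq is a parameter of the DisMax family of parsers), which the slides do not state explicitly, and it only builds the URL-encoded parameter string; the class NameQueryBuilder is hypothetical and no Solr client library is involved.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class NameQueryBuilder {

    // Builds the parameter string for a single search term.
    // Note: the term is not escaped for Solr query syntax; real input would need that.
    static String queryString(String term) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "edismax"); // assumption: eDisMax, so that bq is honoured
        params.put("q", "(name_phonetic:" + term + ") OR (name:" + term + "~1)");
        params.put("bq", "firstname:(\"" + term + "\") surname:(\"" + term + "\")");
        return params.entrySet().stream()
                .map(e -> e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        System.out.println(queryString("mustermann"));
    }
}
```

The result can be appended to the select handler of whichever core holds the name fields.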
Tuning BMPM using additional mechanisms yields well-grounded phonetic matches.
What have we done?
Test the effect of BMPM parameterizations on your dataset
Add one-off-corrections to mitigate spelling mistakes that phonetics won't catch
Allow for displacement of max. two terms within a phrase
Boost on first and surname separately to influence relevance sorting
Tuning BMPM using additional mechanisms yields well-grounded phonetic matches.
Achievements
Good trade-off between precision and recall
usually top match on search for unique names
Result sets are explainable
Relevance ordering feels natural
direct matches, phonetic variations, one-off corrections
Questions?