
Phonetic Matching with Apache Solr


In this session, I'll talk about phonetic matching with Apache Solr. We start off with an explanation of various Soundex algorithms and their shortcomings, then move on to the characteristics of the Beider-Morse algorithm for phonetic matching. I'll demonstrate how to integrate Beider-Morse into your Solr setup and discuss its benefits, based on an evaluation performed on real-world data.


  1. Phonetic Matching with Apache Solr

     Markus Günther, Freelance Software Engineer / Architect

     mail@mguenther.net | mguenther.net | @markus_guenther
  2. Phonetic matching is concerned with searching for spelling variations in large databases.
     - Age-old problem
     - Algorithmic solutions date back to the pre-computer era
     - Soundex was invented by Russell and Odell in 1912
     - Compute a phonetic value for a given name
     - Names that sound the same share the same phonetic value
     - Variation: American Soundex
     - Variation: Daitch-Mokotoff (DM) Soundex
     - Soundex tends to generate many false hits, which lowers precision

     © 2022 Markus Günther IT-Beratung
  3. American Soundex is a reasonably simple algorithm. Rules:
     1. Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
     2. Replace consonants with digits as given by the mapping table.
     3. Retain only the first letter for two or more adjacent letters mapped to the same number.
     4. Retain only the first letter for two letters mapped to the same number that are separated by h, w, or y.
     5. Trim the encoded numbers to a total of three. Pad with 0 if there are fewer than three.
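  4. American Soundex is a reasonably simple algorithm.

     ```java
     import java.util.Locale;

     public class AmericanSoundex {

         // Soundex digit for each letter A..Z; '0' marks letters that are not coded.
         private static final String MAPPING = "01230120022455012623010202";

         public static String encode(final String term) {
             // The comparisons below expect upper-case letters.
             final String name = term.toUpperCase(Locale.ROOT);
             // The first letter is retained; the three digits default to the padding '0'.
             final char[] code = { name.charAt(0), '0', '0', '0' };
             char previousDigit = encode(code[0]);
             int count = 1;
             for (int i = 1; i < name.length() && count < code.length; i++) {
                 final char ch = name.charAt(i);
                 // H, W and Y are skipped without resetting previousDigit (rule 4).
                 if (ch == 'H' || ch == 'W' || ch == 'Y') continue;
                 final char digit = encode(ch);
                 if (digit != '0' && digit != previousDigit) {
                     code[count++] = digit;
                 }
                 previousDigit = digit;
             }
             return String.valueOf(code);
         }

         // Looks up the digit for a single letter; anything outside A..Z is not coded.
         private static char encode(final char ch) {
             return (ch >= 'A' && ch <= 'Z') ? MAPPING.charAt(ch - 'A') : '0';
         }
     }
     ```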
  5. Let's take a look at a couple of examples.

     | Name | Phonetic value |
     |------|----------------|
     | Robert | R163 |
     | Rupert | R163 |
     | Rubin | R150 |
     | Ashcraft | A261 |
     | Ashcroft | A261 |
  6. American Soundex is not optimized for Eastern European names.

     | Name | Phonetic value |
     |------|----------------|
     | Schwarzenegger | S625 |
     | Shwarzenegger | S625 |
     | Schwartsenegger | S632 |

     A search application would not be able to match the misspelling Schwartsenegger, because its phonetic value differs from that of the other two spellings.
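To see where the third spelling diverges, the rules from the earlier slides can be traced in a few lines. The following is a minimal sketch, not the speaker's code: the class name `SoundexComparison` and its helpers are made up for this illustration, and it assumes plain A..Z input.

```java
import java.util.Locale;

/** Illustrative American Soundex encoder; class and method names are hypothetical. */
final class SoundexComparison {

    // Digit for each letter A..Z; '0' means the letter is not coded.
    private static final String MAPPING = "01230120022455012623010202";

    static String soundex(String name) {
        String upper = name.toUpperCase(Locale.ROOT);
        StringBuilder code = new StringBuilder().append(upper.charAt(0));
        char previous = digitOf(upper.charAt(0));
        for (int i = 1; i < upper.length() && code.length() < 4; i++) {
            char ch = upper.charAt(i);
            if (ch == 'H' || ch == 'W' || ch == 'Y') {
                continue; // skipped without resetting the previous digit
            }
            char digit = digitOf(ch);
            if (digit != '0' && digit != previous) {
                code.append(digit);
            }
            previous = digit;
        }
        while (code.length() < 4) {
            code.append('0'); // pad short codes with zeros
        }
        return code.toString();
    }

    private static char digitOf(char ch) {
        return (ch >= 'A' && ch <= 'Z') ? MAPPING.charAt(ch - 'A') : '0';
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Schwarzenegger", "Shwarzenegger", "Schwartsenegger"}) {
            System.out.println(name + " -> " + soundex(name));
        }
    }
}
```

In the first two spellings the letters after R are Z (2) and N (5), giving S625. In Schwartsenegger the T (3) takes the second digit slot and the following S (2) fills the third, so the code becomes S632 before the N is ever reached.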
  7. Daitch-Mokotoff Soundex has a solution for this.

     | Name | Phonetic values |
     |------|-----------------|
     | Schwarzenegger | 474659, 479465 |
     | Shwarzenegger | 474659, 479465 |
     | Schwartsenegger | 479465 |

     Given a pair of names, we have a phonetic match if at least one of their codes matches.
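The "at least one shared code" criterion boils down to a set intersection test. A minimal sketch, assuming the codes have already been computed elsewhere (the values are copied from the table on the slide; `MultiCodeMatch` is a name invented for this example):

```java
import java.util.Collections;
import java.util.Set;

/** Illustrative matcher for names that carry more than one phonetic code. */
final class MultiCodeMatch {

    /** Two names match if their code sets share at least one element. */
    static boolean matches(Set<String> codesA, Set<String> codesB) {
        return !Collections.disjoint(codesA, codesB);
    }

    public static void main(String[] args) {
        // Codes taken from the slide; they are not computed here.
        Set<String> schwarzenegger = Set.of("474659", "479465");
        Set<String> schwartsenegger = Set.of("479465");
        System.out.println(matches(schwarzenegger, schwartsenegger)); // prints "true"
    }
}
```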
  8. For short names, Soundex relies heavily on the initial sound (anlaut), which leads to false positives.

     | Phonetic value | Names |
     |----------------|-------|
     | S300 | Scott, Seth, Sadie, Satoya, ... |
     | C500 | Connie, Cheyenne, Conway, ... |
     | T200 | Tasha, Tessa, Tekia, ... |
  9. This isn't always the case, though.

     | Phonetic value | Names |
     |----------------|-------|
     | M622 | Marcus, Marcos, Marques, Markus, Marquice, Marquisa, ... |
     | F652 | Frank, Francisco, Francis, Franklin, Francois, ... |
     | C150 | Chevonne, Chavon, Chavonne, Chivon, Cobin, ... |
  10. Beider-Morse Phonetic Matching
  11. Instead of focusing on spelling, Beider-Morse factors in linguistic properties of a language.
     - Of limited interest for common nouns, adjectives, adverbs and verbs
     - Good strategy for proper nouns (i.e., names)
     - History: Started off primarily for matching surnames of Ashkenazic Jews
     - Example: Consider variations of
       - Schwarz (standard German spelling)
       - Schwartz (alternate German spelling)
       - Shwartz, Shvartz, Shvarts (Anglicized spelling)
       - Szwarc (Polish), Szwartz (blended German-Polish)
       - Svarc (Hungarian), Chvartz (blended French-German)
  12. Step 1: Identifying the language
     - BMPM includes about 200 rules for determining the language
     - Some are general, some need context

     | Examples | Inferred language(s) |
     |----------|----------------------|
     | tsch, final mann or witz | German |
     | final and initial cs or zs | Hungarian |
     | cz, cy, initial rz or wl, ... | Polish |
     | ö and ü | German, Hungarian |

     - The language can also be specified explicitly
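The rule table above can be read as a narrowing process: start with every language as a candidate and let each matching rule restrict the set. The sketch below is a toy illustration of that idea using only the four example rules from the slide. It is not BMPM's actual rule set or API, and `LanguageGuessSketch`, `Rule` and the three-language enum are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

/** Toy language guesser built from the four example rules; all names are hypothetical. */
final class LanguageGuessSketch {

    enum Language { GERMAN, HUNGARIAN, POLISH }

    /** A letter pattern plus the languages it is compatible with. */
    private static final class Rule {
        final Pattern pattern;
        final Set<Language> languages;

        Rule(String regex, Language... languages) {
            this.pattern = Pattern.compile(regex);
            this.languages = EnumSet.copyOf(Arrays.asList(languages));
        }
    }

    private static final List<Rule> RULES = List.of(
            new Rule("tsch|mann$|witz$", Language.GERMAN),
            new Rule("^cs|^zs|cs$|zs$", Language.HUNGARIAN),
            new Rule("cz|cy|^rz|^wl", Language.POLISH),
            // \u00F6 and \u00FC are the umlauts o and u
            new Rule("[\u00F6\u00FC]", Language.GERMAN, Language.HUNGARIAN));

    /** Returns the languages still possible after applying every matching rule. */
    static Set<Language> guess(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        Set<Language> candidates = EnumSet.allOf(Language.class);
        for (Rule rule : RULES) {
            if (rule.pattern.matcher(lower).find()) {
                candidates.retainAll(rule.languages);
            }
        }
        // Conflicting rules leave the set empty; this sketch does not resolve that case.
        return candidates;
    }
}
```

With these rules, `guess("Tschernowitz")` narrows to German, `guess("Czerny")` to Polish, and a name that triggers no rule keeps all three candidates.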
  13. Step 2: Calculating the exact phonetic value
     - Forms of surnames used by women differ in some languages
     - Affects Slavic languages (e.g., Polish, Russian) as well as Lithuanian and Latvian

     | Masculine endings | Feminine endings |
     |-------------------|------------------|
     | Suchy | Sucha |
     | Novikov | Novikova |

     - BMPM replaces feminine endings with masculine ones
  14. Step 2: Calculating the exact phonetic value
     1. Replace feminine endings with masculine ones.
     2. Identify the exact phonetic value of all letters.
        1. Transcribe letters into a phonetic alphabet.
           - Applies the language-specific rule set if only one language is possible.
           - Applies the generic rule set if multiple languages are possible.
        2. Apply phonetic rules that are common to many languages (e.g., final devoicing, regressive assimilation).
     3. At the end, the algorithm yields the exact phonetic value.
  15. Step 2: What do language-specific rules look like?
     - BMPM applies roughly 80 mapping rules for German
       - sch maps to S
       - s at the start and s between two vowels map to z
       - w maps to v
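To make the three German example rules concrete, they can be applied as ordered string rewrites. This is only a sketch under simplifying assumptions: it covers just the three rules named on the slide (out of roughly 80), treats only a, e, i, o, u as vowels, and `GermanRulesSketch` is a name invented here rather than part of any BMPM implementation.

```java
import java.util.Locale;

/** Applies the three example rules for German in a fixed order; illustrative only. */
final class GermanRulesSketch {

    static String transcribe(String name) {
        String s = name.toLowerCase(Locale.ROOT);
        s = s.replace("sch", "S");                          // sch -> S (upper-case S is the phonetic symbol)
        s = s.replaceAll("^s", "z");                         // s at the start -> z
        s = s.replaceAll("(?<=[aeiou])s(?=[aeiou])", "z");   // s between two vowels -> z
        s = s.replace("w", "v");                            // w -> v
        return s;
    }

    public static void main(String[] args) {
        System.out.println(transcribe("Schwarz")); // prints "Svarz"
        System.out.println(transcribe("Rose"));    // prints "roze"
    }
}
```

The order matters: rewriting sch first keeps the start-of-word rule from turning the s of "sch" into z.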
  16. Step 2: What do language-agnostic rules look like?
     - BMPM uses more than 300 generic rules
       - a final tz maps to ts
     - Some generic rules might be applicable to specific languages only
       - Step 1 rules out certain languages
       - A rule is applied if it complies with the remaining possible languages
  17. Step 3: Calculating the approximate phonetic value
     - Some sounds can be interchangeable in specific contexts
       - beginning / end of word
       - previous / next letter

     | Language | Example | Sounds alike |
     |----------|---------|--------------|
     | Russian | unstressed o is pronounced as a | Mostov, Mastov |
     | German | n before b is close to m | Grinberg, Grimberg |
     | Spanish | phonetic equivalence of n and m | Grinberg, Grimberg |

     - Rules can be language-agnostic or language-specific
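The German row of the table can be pictured as one extra normalization applied on top of the exact value. The sketch below is a deliberately tiny stand-in: it works on the spelled name rather than on a real BMPM phonetic value, implements only the "n before b" rule, and `ApproxRuleSketch` is a hypothetical name.

```java
import java.util.Locale;

/** Folds n before b into m so that both spellings share one approximate value. */
final class ApproxRuleSketch {

    static String approximate(String exactValue) {
        return exactValue.toLowerCase(Locale.ROOT).replaceAll("n(?=b)", "m");
    }

    public static void main(String[] args) {
        String a = approximate("Grinberg");
        String b = approximate("Grimberg");
        System.out.println(a + " / " + b + " -> " + a.equals(b)); // prints "grimberg / grimberg -> true"
    }
}
```

The two names differ on their exact form but coincide after the approximate rule, which is the distinction Step 4 builds on.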
  18. Step 4: Searching for matches
     1. BMPM generates the exact and approximate phonetic value for a given name.
     2. We have an exact match if two names match on their exact phonetic value.
        - This might be too aggressive for your use case.
     3. We have an approximate match if two names match on their approximate phonetic value.

     Matches done by BMPM are not necessarily commutative.
  19. Integration with Apache Solr
  20. Apache Solr supports a variety of phonetic matching algorithms.
     - Beider-Morse Phonetic Matching
     - Daitch-Mokotoff Soundex
     - Double Metaphone
     - Metaphone
     - Soundex
  21. (continued)
     - Refined Soundex
     - Caverphone
     - Cologne Phonetic
     - NYSIIS
  22. Add a field type that works with the phonetic matching algorithm.
     - Admissible values for ruleType are APPROX and EXACT
     - They map to the semantics of approximate and exact matches, respectively

     ```xml
     <fieldType name="phonetic_names"
                class="solr.TextField"
                positionIncrementGap="100"
                autoGeneratePhraseQueries="true">
       <analyzer>
         <tokenizer class="solr.WhitespaceTokenizerFactory"></tokenizer>
         <filter class="solr.BeiderMorseFilterFactory"
                 nameType="GENERIC"
                 ruleType="APPROX"
                 concat="true"
                 languageSet="auto"></filter>
       </analyzer>
     </fieldType>
     ```
  23. Add an index field that uses this field type.
     - You probably already have a name field of sorts for basic name searches.
     - Use a copyField directive to source name_phonetic from that field.

     ```xml
     <field name="name_phonetic"
            type="phonetic_names"
            indexed="true"
            stored="false"
            multiValued="false"></field>
     <copyField source="name" dest="name_phonetic"></copyField>
     ```
  24. Execute queries against that field.
     - Query for mustermann: (name_phonetic:mustermann)
  25. Evaluation
  26. Let's do a couple of experiments with different parameters for BMPM.
     - Dataset: Large enterprise naming directory, approx. 340k individual persons
     - A naive implementation using phonetic matching incl. wildcards and n-gram-backed fields yields approx. 3k results for a popular surname
     - Queries:
       - Large result set: q=(name_phonetic:meier)
       - Small result set: q=(name_phonetic:<some-unique-name>)
  27. Experiment 1: Querying for a popular name

     | Variant | ruleType | languageSet | Results for q=(name_phonetic:meier) |
     |---------|----------|-------------|-------------------------------------|
     | Naive | - | - | 2997 |
     | 1 | APPROX | auto | 1279 |
     | 2 | EXACT | auto | 1228 |
     | 3 | APPROX | german,english | 1261 |
     | 4 | EXACT | german,english | 1216 |

     - Restricting languages to those that predominate in the corpus removes non-intuitive matches
     - Almost no noticeable difference between APPROX and EXACT with regard to result quality
  28. Few ordering issues: meier is ranked before its phonetic variations almost every time
  29. Experiment 2: Querying for a unique name with spelling variations

     | Variant | ruleType | languageSet | Correct | Var. 1 | Var. 2 |
     |---------|----------|-------------|---------|--------|--------|
     | Naive | - | - | 7 | 0 | 30 (non-intuitive) |
     | 1 | APPROX | auto | 1 | 5 | 14 (no match, not intuitive) |
     | 2 | EXACT | auto | 1 | 0 | 1 (no match) |
     | 3 | APPROX | german,english | 7 (top match) | 5 (match) | 25 (no match, intuitive) |
     | 4 | EXACT | german,english | 1 | 0 | 1 (no match) |

     - Variant 3: Precision is good, recall could be better (e.g., by adding one-off corrections)
  30. Adding one-off corrections using Damerau-Levenshtein distance complements BMPM.
     - Prerequisites
       - name index field that stores <first name> <middle-initial> <surname>
       - name index field uses n-grams
     - Refine the query
       - (name_phonetic:mustermann) OR (name:mustermann~1)
     - Can be applied within phrases as well to allow for displacements
       - "Mustermann Max" should yield the same results as "Max Mustermann"
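What "one-off" means in terms of edit distance can be shown with a small implementation. The sketch below computes the restricted Damerau-Levenshtein distance (also called optimal string alignment), which counts insertions, deletions, substitutions and swaps of adjacent characters. It is meant to illustrate which spellings fall within distance 1, not to describe how Solr evaluates `~1` internally; `OneOffDistance` is a name chosen for this example.

```java
/** Restricted Damerau-Levenshtein (optimal string alignment) distance; illustrative only. */
final class OneOffDistance {

    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all of a's first i characters
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all of b's first j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), // deletion, insertion
                        d[i - 1][j - 1] + cost);                    // match or substitution
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // adjacent transposition
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("mustermann", "mustermnan")); // prints 1 (swapped letters)
        System.out.println(distance("mustermann", "musterman"));  // prints 1 (dropped letter)
    }
}
```

Both misspellings sit at distance 1, so a fuzzy clause with a maximum distance of 1 would admit them even where the phonetic value does not line up.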
  31. Adding a boost on first name and surname for direct matches.
     - Influence the ordering a bit so that direct matches always rank before phonetic variations.
     - Prerequisites:
       - firstname index field that stores <first name> (non-analyzed, lowercased)
       - surname index field that stores <surname> (non-analyzed, lowercased)
     - Refine the query
       - bq=firstname:("mustermann") surname:("mustermann")
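Put together, the refined query and the boost could be sent as one request. The sketch below only assembles and URL-encodes the parameters. It rests on several assumptions that are not on the slides: that `q` from the previous slide and `bq` are combined in a single request, that `defType=edismax` is set (`bq` is a parameter of the DisMax family of query parsers), and that the search term needs no further escaping. `BoostedQuerySketch` is a hypothetical helper, not part of any Solr client library.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Builds the query string for a phonetic search with a direct-match boost; illustrative only. */
final class BoostedQuerySketch {

    static String queryString(String term) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "edismax"); // assumption: not stated on the slides
        params.put("q", "(name_phonetic:" + term + ") OR (name:" + term + "~1)");
        params.put("bq", "firstname:(\"" + term + "\") surname:(\"" + term + "\")");
        return params.entrySet().stream()
                .map(e -> e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // Append the result to the select handler URL of your own core or collection.
        System.out.println(queryString("mustermann"));
    }
}
```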
  32. Tuning BMPM using additional mechanisms yields well-grounded phonetic matches. What have we done?
     - Test the effect of BMPM parameterizations on your dataset
     - Add one-off corrections to mitigate spelling mistakes that phonetics won't catch
     - Allow for displacement of max. two terms within a phrase
     - Boost on first name and surname separately to influence relevance sorting
  33. Tuning BMPM using additional mechanisms yields well-grounded phonetic matches. Achievements:
     - Good trade-off between precision and recall
       - usually top match on search for unique names
     - Result sets are explainable
     - Relevance ordering feels natural
       - direct matches, phonetic variations, one-off corrections
  34. Questions?
