A Rose By Any Other Name.pdf

Rosette Name Indexer applies machine learning and artificial intelligence to the problem of matching names, with specific techniques for Hebrew names. Rosette is a leader in applying NLP and computational linguistics to text analytics.

  1. A וֶרֶד By Any Other Name: Matching Names in Hebrew and Other Languages. Gil Irizarry, VP Engineering; Fiona Hasanaj, Senior Software Engineer
  2. BASIS TECHNOLOGY. Speakers: Gil Irizarry, VP Engineering; Fiona Hasanaj, Senior Software Engineer
  3. What is a name? "What's Montague? it is nor hand, nor foot, Nor arm, nor face, nor any other part Belonging to a man. O, be some other name! What's in a name? that which we call a rose By any other name would smell as sweet;" (Romeo and Juliet, Act 2 Scene 2). Public domain image: https://commons.wikimedia.org/wiki/File:Romeo_and_Juliet_(detail)_by_Frank_Dicksee.png
  4. A rose by any other name: rose, ורד, وردة, 장미, 一朵玫瑰, バラ, роза, trëndafil. Image (Creative Commons license): https://commons.wikimedia.org/wiki/File:Rose_on_a_table.jpg
  5. John by any other name: John, Jan, Johan, Johann, Johannes, Hannes, Hans, Gjon, Gjin, ዮሐንስ (Yoḥännǝs), يحيى (Yaḥyā, Qurʾānic), يوحنا (Yūḥannā, Biblical) or حنّا (Henna or Hanna), ܝܘܚܢܢ (Yuḥanon), ܚܢܐ (Henna or Hanna), ܐܝܘܢ (Ewan), Chuan, Հովհաննես (Hovhannes), Օհաննես (Ohannes), Յովհաննէս (Hovhannēs), Xuan, Manez, Ganix, Joanes, Iban, Ян (Yan), Янка (Yanka), Янэк (Yanek), Ясь (Yas'), Іван (Ivan), ইয়াহিয়া (Iyahiya), য়াহয়া (Yahya), Ivan, Jahija, Yann, Yannig, Иван (Ivan), Йоан (Yoan), Янко (Yanko), Яне (Yane), Joan, 约翰, 約翰, Yuēhàn, ⲓⲱϩⲁⲛⲛⲏⲥ (Iohannes), ⲓⲱⲁ (Ioa), Jowan, Ghjuvanni, Ivo, Ive, Ivica, Ivano, Ivanko, Janko, Ivek, Honza, Hanuš, Jens, Yohanes, Han, Hannes, Jannes, Wannes, Sjeng, Guiàn, Zvan, Ian, Johnny, Jack, Shawn, Sean, Shaun, Shane, Shani, Jaan, Juhan, Juho, Janno, Jukk, Jaanus, Hannes, Johano, Huan, Jann, Janus, Jenis, Jóannes, Jónar, Jógvan, Hannis, Hanus, Jone, Ioane, Juan, Hannes, Hannu, Jani, Janne, Joni, Juha, Juho, Juhani, Jonne, Juntti (archaic), Jean, Jehan (outdated), Xoán, Xan, იოანე (Ioane), ივანე (Ivane), იოვანე (Iovane), ვანო (Vano), ივა (Iva), Hannes, Ιωάννης (Ioannis), Γιάννης (Yiannis, sometimes Giannis), Huã, Keoni, ʻIoane, יוחנן (Yôḥānān) Johanan, János, Jancsi (moniker), Hannes, Yohana, Yuhanna, Ayan, యోహాను (Yohanu), Iwan, Yahya, Yan, Yaya, Yuan, Luan, Eóin, Gianni, Giannino, Gionino, Giovanni, Ivano, Ivo, Vanni, Nino, Vannino, ヨハネ (Yohane), ジョハン (Johan), Жақия (Zhaqiya, Yahya), Шоқан (Shoqan), Жакыя (Jakyya, Yahya), Жакан (Jakan), 요한 (Yohan), Juang, Yohanis, Iohannes, Ioannes, Jānis, Janis, Jancis, Janka, Jans, Jāns, Jānuss, Jonass, Žans, Žanis, Džons, Džonijs, Džanni, Džovanni, Ians, Džeks, Šeins, Johans, Hanss, Ansis, Johaness, Johanness, Johanāns, Haness, Hanness, Ivans, Aivans, Aivens, Aiens, Jonas, Giuàn, Јован (Jovan), Јованче (Jovanče), Иван (Ivan), Јане (Jane), യോഹന്നാൻ (Yōhannān), ഉലഹന്നാൻ (Ulahannan), ലോനപ്പൻ (Lonappan), നയിനാ൯ (Nainan, Ninan), Ġwanni, Hōne, Jon, (Yohannan), یحیی (Yahya), Gioann, Janek, João, Ivo, Ivã, Ioan, Ionuț, Ionel, Ionică, Nelu, Iancu, Иван (Ivan), Иоанн (Ioann, Hebrew form), Ян (Yan), Ioane, Juons, Giuanni, Jock, Iain, Eòin, Seathan, Euan/Ewan, Јован (Jovan), Иван (Ivan), Јанко (Janko), Јовица (Jovica), Ивица (Ivica), Ивко (Ivko), Giuvanni, Giuanni, Juwam, Yohan, Janez, Ivo, Janko, Anže, Anžej, Jon, Nuño, Hannes, য়াহয়া (Yahya), ܝܘܚܢܢ (Yuḥanon), ܚܢܐ (Ḥanna), ܐܝܘܢ (Ewan), யோவான் (Yovaan), Sione, Yahya, Yuhanna, Іван (Ivan), Іванко (Ivanko), Ян (Jan), Dương, Giăng, Gioan, Evan, Ianto, Ieuan, Ifan, Ioan, Siôn. Source: https://en.wikipedia.org/wiki/John_(given_name)
  6. An easy name-matching challenge
  7. A harder name-matching challenge
  8. Overcoming the name-matching challenge
  9. Hidden Markov Models. Public domain image: https://en.wikipedia.org/wiki/Hidden_Markov_model#/media/File:HMMGraph.svg
  10. Other name-matching challenges
  11. Using Vector Similarity for Name Matching
  12. Matching Hebrew Names
  13. Hebrew String Normalization
      ● Keep these characters:
        ○ All letters, digits, split characters (hyphens, periods, commas, etc.), whitespace, and symbols
        ○ U+05B0 through U+05BB (Hebrew vowels)
        ○ U+05BC, U+05BF, U+05C1, and U+05C2 (Hebrew consonant modifiers)
        ○ U+05F3 (geresh, a punctuation mark also used as a consonant modifier)
      ● Map some Hebrew punctuation to common ASCII fallbacks:
        ○ U+05BE (HEBREW PUNCTUATION MAQAF) to -
        ○ U+05F3 (HEBREW PUNCTUATION GERESH) to '
        ○ U+2019 (RIGHT SINGLE QUOTATION MARK) to '
      ● We do not normalize Hebrew final letters:
        ○ ג'ף - Jef (word-final pe written in its final form)
        ○ קאמפ - Kamp (word-final pe written in its non-final form)
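The keep-and-map rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the function name `normalize_hebrew` is hypothetical, "split characters" is assumed to mean Unicode punctuation, and geresh is assumed to be mapped to an apostrophe rather than kept verbatim. Final letters pass through unchanged, as the slide states.

```python
import unicodedata

# Hebrew vowel points U+05B0..U+05BB plus the consonant modifiers from the slide.
KEEP_MARKS = set(range(0x05B0, 0x05BC)) | {0x05BC, 0x05BF, 0x05C1, 0x05C2}

# Hebrew punctuation mapped to ASCII fallbacks, as listed on the slide.
PUNCT_MAP = {"\u05BE": "-", "\u05F3": "'", "\u2019": "'"}


def normalize_hebrew(text):
    """Hypothetical helper: keep or map characters per the slide's rules."""
    out = []
    for ch in text:
        if ch in PUNCT_MAP:
            out.append(PUNCT_MAP[ch])
            continue
        major = unicodedata.category(ch)[0]
        # Assumption: letters (L), digits (N), punctuation (P), symbols (S)
        # and whitespace are kept; other marks (e.g. cantillation) are dropped.
        if major in "LNPS" or ch.isspace() or ord(ch) in KEEP_MARKS:
            out.append(ch)
    return "".join(out)
```

For example, a maqaf between two name parts becomes an ASCII hyphen, while a final pe (U+05E3) is left as it is.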
  14. Hebrew Vocalization
      ● Process of vocalization:
        ○ Dictionary Lookup
        ○ Statistical Model Vocalization
        ○ Rule-based Vocalization Checker
      [Diagram labels: Input, Vocalization Dictionary, Statistical Model Vocalizer, Vocalization Checker, Vocalized Output, Output]
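One way to read the three stages is as a fallback chain: try the dictionary first, otherwise ask the statistical model, then validate its guess with rules. The sketch below assumes exactly that ordering. The function `vocalize`, its parameters, and the choice to return the unvocalized input when the checker rejects a candidate are all hypothetical, since the slide does not say what happens on rejection.

```python
def vocalize(word, dictionary, model, checker):
    """Hypothetical staged vocalizer.

    dictionary: mapping from unvocalized to vocalized forms
    model:      callable proposing a vocalized form for an unknown word
    checker:    callable returning True if a vocalized form passes the rules
    """
    # Stage 1: a dictionary hit is returned directly.
    if word in dictionary:
        return dictionary[word]
    # Stage 2: the statistical model proposes a candidate.
    candidate = model(word)
    # Stage 3: the rule-based checker accepts or rejects it.
    # Assumption: on rejection we fall back to the unvocalized input.
    return candidate if checker(candidate) else word
```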
  15. Hebrew Transliteration
      ● FOLK transliteration scheme example:
        ○ עדי שמיר ⟹ Adi Shamir
      ● ISO 259-2-1994 transliteration scheme example:
        ○ עדי שמיר ⟹ ʿadiy Šamiyr
      ● ICU (UNGEGN - United Nations Group of Experts on Geographical Names) transliteration scheme example:
        ○ עדי שמיר ⟹ ʻàdiy Şá̌miyr
      ● Statistical model example for names of foreign origin:
        ○ רוזלינד פרנקלין ⟹ Rosalind Franklin
  16. Hebrew Transliteration - FOLK
      ● Many-to-one mapping from the source table.
      ● Gathered valid onsets and valid codas; allowed "נג" (ng) as a coda, which occurs in English loanwords.
      ● Bet, kaf, and pe each have two pronunciations, a stop and a fricative, and are romanized accordingly.
      ● Complexity with shva. (continued on next two slides)
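The stop/fricative point for bet, kaf, and pe can be illustrated on vocalized text, where a dagesh (U+05BC) marks the stop. The sketch below is an assumption-laden illustration: the romanizations b/v, k/kh, p/f follow a common popular convention rather than the slide's source table (which is not reproduced here), final kaf and final pe are treated as fricatives, and the function name `bkp_romanizations` is hypothetical.

```python
import unicodedata

DAGESH = "\u05BC"
# Assumed popular romanizations; the deck's actual mapping table is not shown.
STOP = {"\u05D1": "b", "\u05DB": "k", "\u05E4": "p"}
FRICATIVE = {"\u05D1": "v", "\u05DB": "kh", "\u05E4": "f",
             "\u05DA": "kh", "\u05E3": "f"}  # final kaf, final pe


def bkp_romanizations(vocalized):
    """Return the romanization chosen for each bet/kaf/pe in vocalized text."""
    result = []
    i, n = 0, len(vocalized)
    while i < n:
        ch = vocalized[i]
        # Collect the combining marks attached to this base letter,
        # in whatever order they appear (vowel point before or after dagesh).
        j = i + 1
        while j < n and unicodedata.category(vocalized[j]) == "Mn":
            j += 1
        marks = vocalized[i + 1:j]
        if ch in FRICATIVE:
            if ch in STOP and DAGESH in marks:
                result.append(STOP[ch])
            else:
                result.append(FRICATIVE[ch])
        i = j
    return result
```

Scanning all marks on a letter, rather than only the next character, matters because Unicode normalization can place a vowel point before the dagesh.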
  17. Hebrew Transliteration - FOLK
  18. Hebrew Transliteration - FOLK
  19. Hebrew Name Matching
      ● Statistical model trained on Hebrew-English PERSON names (used for matching all entity types)
      ● Levenshtein edit distance
      ● Initial and initialism matching
      ● Embedding matching and entity resolution for ORGs
      ● Overrides matching
      ● Gender Model
      ● Frequency Model
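Two of the signals listed above, Levenshtein edit distance and initial matching, are simple enough to sketch. The code below is a generic illustration, not the product's scorer: `tokens_match` and its `max_dist` threshold are hypothetical, and a real matcher would combine these signals with the statistical, embedding, gender, and frequency models rather than use a hard cutoff.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def is_initial(token):
    """True for a single letter, optionally followed by a period."""
    core = token.rstrip(".")
    return len(core) == 1 and core.isalpha()


def tokens_match(a, b, max_dist=1):
    """Hypothetical token comparison: initials match on first letter,
    full tokens match when within max_dist edits (an assumed threshold)."""
    a, b = a.casefold(), b.casefold()
    if not a or not b:
        return a == b
    if is_initial(a) or is_initial(b):
        return a[0] == b[0]
    return levenshtein(a, b) <= max_dist
```

With these definitions, "J." is accepted against "John", and "Shamir" against the one-edit variant "Shamyr".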
  20. Hebrew Name Matching - Statistical Model
  21. Hebrew Name Matching - Embedding Model
  22. Hebrew Name Matching - Entity Resolution
  23. Hebrew Name Matching - Frequency Model
  24. Hebrew Name Matching - Gender Model
  25. Hebrew Name Matching - Overrides
  26. Hebrew Name Matching - Stopwords
