Towards Identity Resolution: The Challenge of Name Matching

  1. Towards Identity Resolution: The Challenge of Name Matching
  2. About Me: Gil Irizarry, Director of Engineering for name and identity resolution technology at Basis Technology, a leading provider of software solutions for extracting meaningful intelligence from multilingual text and digital devices
  3. Names: "What's in a name? That which we call a rose / By any other name would smell as sweet; / So Romeo would, were he not Romeo call'd, / Retain that dear perfection which he owes / Without that title." (Romeo and Juliet, Act 2, Scene 2)
  4. First, An Exercise... Ask your neighbor for his/her name
  5. The Challenge: 문재인 = 文在寅?
  6. Lies We Believe About Names
     ● People have exactly one canonical full name.
     ● People have exactly one full name which they go by.
     ● People have, at this point in time, exactly one canonical full name.
     ● People have, at this point in time, one full name which they go by.
     ● People have exactly N names, for any value of N.
     ● People’s names fit within a certain defined amount of space.
     ● People’s names do not change.
     ● People’s names change, but only at a certain enumerated set of events.
     ● People’s names are written in ASCII.
     ● People’s names are written in any single character set.
     ● People’s names are all mapped in Unicode code points.
     ● People’s names are case sensitive.
     ● People’s names are case insensitive.
     ● People’s names sometimes have prefixes or suffixes, but you can safely ignore those.
     ● People’s names do not contain numbers.
     ● People’s names are not written in ALL CAPS.
     ● People’s names are not written in all lower case letters.
     ● People’s names have an order to them. Picking any ordering scheme will automatically result in consistent ordering among all systems, as long as both use the same ordering scheme for the same name.
     ● People’s first names and last names are, by necessity, different.
     ● ...and more...
     https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/
  7. A True Story: Mícheál MacDonncha vs. Ardmhéara Micheál MacDonncha (vs. Micheál Mac Donncha) https://www.irishpost.com/news/misspelling-irish-lord-mayors-name-leads-mishap-trip-israel-153193
  8. Imagine That You Need To Enter Your Name https://www.basistech.com/case-study/nyu-names-search/
  9. Imagine we have a backend datastore of identities...
     ● This identity directory is stored in Elasticsearch. (Why Elasticsearch? Because it has fuzzy search capability.)
     ● To access data, you have to enter a name.
     ● The terminal asks for Last Name, First Name (and the system stores your name as Smith, John Armstrong).
     ● However, the user enters:
       ○ John Smith
       ○ John A. Smith
       ○ John Armstrong Smith
  10. Matching Terms
  11. Now imagine a user calls into a call center...
     ● An operator hears the user’s name and transcribes it.
     ● This is prone to errors.
     ● The operator enters:
       ○ Jon Smyth
       ○ John ArmstrongSmith
       ○ John Armstrong-Smith
  12. At The Call Center
  13. Suppose Other Phenomena Are Entered
     ● Nicknames
       ○ Johnny Smith
     ● Gender mistakes
       ○ Joan Smith
       ○ Joanie Smith
     ● Differences in relative name frequencies
       ○ Shaun Smith
     ● Initials, especially for organization names
       ○ IBM vs. International Business Machines
  14. Scoring The Different Phenomena
  15. Scoring The Different Phenomena
  16. If Our Users Cross National Borders: 문재인 vs. Moon Jae-in, or 安倍 晋三 vs. Shinzo Abe
  17. Multi-lingual Matching
  18. Multi-lingual Matching
  19. A Challenge Fulfilled: 문재인 = 文在寅. 문재인 is Moon Jae-in in Hangul (the Korean alphabet); 文在寅 is Moon Jae-in in Hanja (Chinese characters with Korean pronunciation).
  20. Conclusions: 1. Matching text strings is straightforward; matching names is not. 2. Text strings composed of different characters may have the same social or cultural meaning.
  21. Conclusions: 3. The situation gets more complex when combining a name with other information, such as date of birth or address. These types of data also have multiple formats and are prone to transcription errors.
  22. Finally... What is your neighbor’s name?
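
The data-entry and call-center slides list several variants of the stored name "Smith, John Armstrong": reordered tokens, a middle initial, a dropped middle name, misspellings, and a hyphenated or run-together surname. The following is a minimal sketch of how such variants might be scored, using only Python's standard library. It is not the method used by Elasticsearch or by any Basis Technology product; the function names (`tokens`, `token_score`, `name_score`) and the 0.8 weight for an initial are assumptions made purely for illustration.

```python
from difflib import SequenceMatcher


def tokens(name: str) -> list[str]:
    """Lowercase a name and split it, treating commas, periods and hyphens as separators."""
    for ch in ",.-":
        name = name.replace(ch, " ")
    return name.lower().split()


def token_score(a: str, b: str) -> float:
    """Score two name tokens between 0 and 1."""
    if a == b:
        return 1.0
    # A single letter that matches the other token's first letter is treated as an initial.
    if (len(a) == 1 or len(b) == 1) and a[0] == b[0]:
        return 0.8  # arbitrary illustrative weight, not a tuned value
    # Otherwise fall back to a character-level similarity ratio.
    return SequenceMatcher(None, a, b).ratio()


def name_score(query: str, stored: str) -> float:
    """Greedily pair each query token with its best unused stored token, ignoring order."""
    q, s = tokens(query), tokens(stored)
    if not q or not s:
        return 0.0
    remaining = list(s)
    total = 0.0
    for qt in q:
        if not remaining:
            break
        best = max(remaining, key=lambda st: token_score(qt, st))
        total += token_score(qt, best)
        remaining.remove(best)
    # Dividing by the longer token count means missing tokens lower the score.
    return total / max(len(q), len(s))


if __name__ == "__main__":
    stored = "Smith, John Armstrong"
    for entered in ["John Armstrong Smith", "John A. Smith", "John Smith",
                    "Jon Smyth", "John ArmstrongSmith", "John Armstrong-Smith"]:
        print(f"{entered!r}: {name_score(entered, stored):.2f}")
```

Under these assumptions, reordered and hyphenated forms score as exact matches, an initial scores slightly lower, and a missing middle name or a misspelling lowers the score further. A greedy pairing is the simplest choice here; an optimal assignment between tokens would be more robust for longer names.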

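
The "A True Story" slide contrasts three spellings of one person's name that differ in an accent, a space, and a leading title. As a rough sketch, the standard `unicodedata` module can fold away accents and case so that these spellings compare equal. The `TITLES` set and the `fold` and `name_key` helpers are hypothetical, introduced only for this example; "Ardmhéara" is the Irish title for "Lord Mayor". Folding of this kind is lossy, and it does nothing for cross-script pairs such as 문재인 and 文在寅, which need transliteration or a knowledge base rather than character normalization.

```python
import unicodedata

# Hypothetical title list, already folded; a real system would need a much larger one.
TITLES = {"ardmheara"}


def fold(text: str) -> str:
    """Casefold the text and strip combining accents (for example, é becomes e)."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def name_key(name: str) -> str:
    """Build a comparison key: fold, drop known titles, and join the tokens without spaces."""
    kept = [t for t in fold(name).split() if t not in TITLES]
    return "".join(kept)


if __name__ == "__main__":
    for spelling in ["Mícheál MacDonncha",
                     "Ardmhéara Micheál MacDonncha",
                     "Micheál Mac Donncha"]:
        print(spelling, "->", name_key(spelling))
```

All three spellings reduce to the same key under these assumptions, while the Hangul and Hanja forms of Moon Jae-in still produce different keys.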