2. Approximate string comparators?
⢠Basically, measures of the similarity between
two strings
⢠Useful in situations where exact match is
insufficient
â record linkage
â search
â ...
⢠Many of these are slow: O(n2)
3. Levenshtein
⢠Also known as edit distance
⢠Measures the number of edit operations
necessary to turn s1 into s2
⢠Edit operations are
â insert a character
â remove a character
â substitute a character
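The three edit operations above can be sketched with the usual dynamic-programming table, kept here to two rows. This is a minimal illustration, not a tuned implementation; the function name `levenshtein` is chosen for this example.

```python
def levenshtein(s1, s2):
    """Count the inserts, removals and substitutions needed to turn s1 into s2."""
    # previous[j] = distance between the first i-1 characters of s1
    # and the first j characters of s2
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning i characters into "" takes i removals
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # remove c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The nested loop over both strings is where the O(n²) running time comes from.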
5. Weighted Levenshtein
⢠Not all edit operations are equal
⢠Substituting âiâ for âeâ is a smaller edit than
substituting âoâ for âkâ
⢠Weighted Levenshtein evaluates each edit
operation as a number 0.0-1.0
⢠Difficult to implement
â weights are also language-dependent
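A minimal sketch of the idea, assuming the weights are supplied as a function. The cost function `example_cost` and its values (0.3 for vowel-to-vowel substitutions, 1.0 otherwise) are hypothetical and only illustrate the shape of the problem; realistic weights would have to be tuned per language.

```python
def weighted_levenshtein(s1, s2, substitution_cost, indel_cost=1.0):
    """Edit distance where each substitution has its own cost."""
    previous = [j * indel_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        current = [i * indel_cost]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + indel_cost,                     # remove c1
                current[j - 1] + indel_cost,                  # insert c2
                previous[j - 1] + substitution_cost(c1, c2),  # substitute
            ))
        previous = current
    return previous[-1]


VOWELS = set("aeiouy")


def example_cost(c1, c2):
    """Hypothetical weights: swapping one vowel for another is a small edit."""
    if c1 == c2:
        return 0.0
    if c1 in VOWELS and c2 in VOWELS:
        return 0.3
    return 1.0


print(weighted_levenshtein("lin", "len", example_cost))  # 0.3
print(weighted_levenshtein("lon", "lkn", example_cost))  # 1.0
```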
6. Jaro-Winkler
⢠Developed at the US Bureau of the Census
⢠For name comparisons
â not well suited to long strings
â best if given name/surname are separated
⢠Exists in a few variants
â originally proposed by Winkler
â then modified by Jaro
â a few different versions of modifications etc
7. Jaro-Winkler definition
⢠Formula:
â m = number of matching characters
â t = number of transposed characters
⢠A character from string s1 matches s2 if the
same character is found in s2 less then half the
length of the string away
⢠Levenshtein ~ LÜwenstein = 0.8
⢠Axel ~ Aksel = 0.783
9. Soundex
⢠A coarse schema for matching names by sound
â produces a key from the name
â names match if key is the same
⢠In common use in many places
â Navâs person register uses it for search
â built-in in many databases
â ...
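A minimal sketch of key generation, assuming the common American Soundex rules: keep the first letter, encode the remaining consonants as digits, collapse adjacent letters with the same digit (also across 'h' and 'w'), and pad or truncate to four characters. Database implementations may differ in details.

```python
# Digit for each consonant group; vowels, h, w and y have no digit.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name):
    """Return the four-character Soundex key for a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "hw":  # h and w do not separate equal codes
            prev = code
    return (key + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Ashcraft"))                   # A261
```

Two names match when their keys are equal, so Robert and Rupert are treated as the same name here.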
12. Metaphone
⢠Developed by Lawrence Philips
⢠Similar to Soundex, but much more complex
â both more accurate and more sensitive
⢠Developed further into Double Metaphone
⢠Metaphone 3.0 also exists, but only available
commercially
14. Dice coefficient
⢠A similarity measure for sets
â set can be tokens in a string
â or characters in a string
⢠Formula:
15. TFIDF
⢠Compares strings as sets of tokens
â a la Dice coefficient
⢠However, takes frequency of tokens in corpus
into account
â this matches how we evaluate matches mentally
⢠Has done well in evaluations
â however, can be difficult to evaluate
â results will change as corpus changes
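A minimal sketch of the idea, assuming whitespace tokenization, a smoothed IDF of log(N/df) + 1, and cosine similarity between the weighted token vectors. The function names and the toy corpus are hypothetical; real implementations vary in weighting and normalization.

```python
import math
from collections import Counter


def build_idf(corpus):
    """Map each token to its inverse document frequency in the corpus."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc.lower().split()))
    return {tok: math.log(len(corpus) / n) + 1.0 for tok, n in df.items()}


def tfidf_vector(text, idf):
    # Assumption: tokens unseen in the corpus get the highest known weight.
    default = max(idf.values(), default=1.0)
    counts = Counter(text.lower().split())
    return {tok: n * idf.get(tok, default) for tok, n in counts.items()}


def tfidf_similarity(s1, s2, idf):
    """Cosine similarity between the TF-IDF vectors of two strings."""
    a, b = tfidf_vector(s1, idf), tfidf_vector(s2, idf)
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * \
        math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


corpus = ["acme inc", "globex inc", "initech inc", "acme corp"]
idf = build_idf(corpus)
# Sharing the rarer token "acme" counts for more than sharing the common "inc".
print(tfidf_similarity("acme inc", "acme corp", idf) >
      tfidf_similarity("acme inc", "globex inc", idf))  # True
```

Because the weights come from `corpus`, adding or removing records changes the scores, which is the instability noted above.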
16. More comparators
⢠Smith-Waterman
â originated in DNA sequencing
⢠Q-grams distance
â breaks string into sets of pieces of q characters
â then does set similarity comparison
⢠Monge-Elkan
â similar to Smith-Waterman, but with affine gap distances
â has done very well in evaluations
â costly to evaluate
⢠Many, many more
â ...
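As an illustration of the q-grams approach, here is a minimal sketch. The slide does not say which set similarity is used, so the Jaccard overlap (shared pieces divided by all distinct pieces) is an assumption, as are the function names and the default q = 2.

```python
def qgrams(s, q=2):
    """Return the set of all substrings of length q in s."""
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(s1, s2, q=2):
    """Jaccard overlap of the q-gram sets (assumed set comparison)."""
    a, b = qgrams(s1, q), qgrams(s2, q)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(sorted(qgrams("night")))            # ['gh', 'ht', 'ig', 'ni']
print(qgram_similarity("night", "nacht"))  # 1 shared piece ("ht") out of 7
```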