Fuzzy Matching on Apache Spark with Jennifer Shin
2. Agenda
• Intro to fuzzy matching: what you need to know
• Use Case: a fuzzy solution for surveys
• Fuzzy implementations: real-world considerations
© 2017 8 Path Solutions LLC. All Rights Reserved.
4. Fuzzy Matching
(aka Approximate String Matching)
• The process of finding strings that approximately match a given pattern
• The closeness of a match is measured in terms of an edit distance, i.e. the number of operations necessary to convert the string into an exact match.
5. Fuzzy Matching
The edit distance is the number of primitive operations
necessary to convert the string into an exact match.
Examples of primitive operations are:
insertion: cot → coat
deletion: coat → cot
substitution: coat → cost
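The three primitive operations can be sketched with plain string slicing. This is only an illustration; the helper names `insert_char`, `delete_char` and `substitute_char` are made up for this example and do not come from the talk.

```python
def insert_char(s, index, ch):
    """Insert ch so that it ends up at position index."""
    return s[:index] + ch + s[index:]


def delete_char(s, index):
    """Remove the character at position index."""
    return s[:index] + s[index + 1:]


def substitute_char(s, index, ch):
    """Replace the character at position index with ch."""
    return s[:index] + ch + s[index + 1:]


print(insert_char("cot", 2, "a"))       # coat
print(delete_char("coat", 2))           # cot
print(substitute_char("coat", 2, "s"))  # cost
```

Each call applies exactly one primitive operation, so each pair of strings is at edit distance 1.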
6. What is fuzzy matching?
• A fuzzy matching program returns a list of results that are not an exact match for the search term
– search cab argument words
– spellings may not exactly match.
7. Why use fuzzy matching?
• Not all data is clean
• Not all formatting is consistent
• Not all databases are structured
• Not all text is correct
• People are not perfect
11. When can we use fuzzy matching?
• Case by case basis
• Data cleaning
• Entity/Name matching
• Recommendations
• Predictive text
13. Data: Survey
Comprehensive survey about attitudes, usage, purchases
6,000 products
20,000 variables
26 feed categories
14. Problem Description
A: Dental Floss: Light Users: 0-2 Times/Last 7 Days: Total Category
B: Dental Floss: Times/Last 7 Days: Light (0-2)
How similar is A to B?
A + B = how many new questions?
15. Word Based Comparison Model (WCM)
Old label: Anxiety/Panic Used a branded prescription remedy
New label: Ailments/Remedies: : Anxiety/Panic: In last 12 months: Used a branded prescription remedy
16. Word Based Comparison Model (WCM)
Old label: Anxiety/Panic Used a branded prescription remedy
New label: Ailments/Remedies: : Anxiety/Panic: In last 12 months: Used a branded prescription remedy
Score: 6
Good match
Then set a threshold: a match with a score above 5 is a good match.
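The deck does not spell out how the WCM score is computed. A minimal sketch, assuming the score is simply the number of distinct words the two labels share after stripping punctuation, reproduces the score of 6 for this pair. The tokenization and the names `words` and `wcm_score` are assumptions made for illustration.

```python
def words(label):
    """Split a label on whitespace, trim punctuation, lowercase, drop empties."""
    tokens = (tok.strip(":()-").lower() for tok in label.split())
    return {tok for tok in tokens if tok}


def wcm_score(old_label, new_label):
    """Assumed score: number of distinct words the two labels share."""
    return len(words(old_label) & words(new_label))


old = "Anxiety/Panic Used a branded prescription remedy"
new = ("Ailments/Remedies: : Anxiety/Panic: In last 12 months: "
       "Used a branded prescription remedy")

score = wcm_score(old, new)
print(score, "good match" if score > 5 else "no match")  # 6 good match
```

The threshold of 5 is the one stated on the slide; other tokenization choices may give different scores for other label pairs.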
18. Word Based Comparison Model (WCM)
Label 1: Any air conditioner Amount spent : total :in last 12 months: $1000+
Label 2: Shoes - Amount Spent in Total: any Nike air: In last 12 months: $1000+
By cells:
Label 1: anyairconditioneramountspent | total | inlast12months | $1000+
Label 2: shoesamountspentintotal | anynikeair | inlast12months | $1000+
Two cells have no match, even though most of the words do have matches.
19. Why does Word-based Comparison Model (WCM) perform so poorly?
Wrong matches due to changes of brand names:
Tires: Total Users: Bought in Last 12 Months: Hankook
Batteries: Total Users: Bought in Last 6 Months: Kodak
Score: 7
Prescription Brands - Used: : Evista (men only): In last 12 months
Prescription Brands - Used: : Avodart (men only): In last 12 months
Score: 9
A match with a score above 5 can be a wrong match!
20. Why does Word-based Comparison Model (WCM) perform so poorly?
Wrong matches due to different numbers:
Athletic Shoes - Amount Spent in Total: : Baseball/Softball shoes: In last 12 months: $75 - $149
Athletic Shoes - Amount Spent in Total: Baseball/Softball shoes: In last 12 months: $50 - $74
Score: 12
Athletic Shoes - Number of pairs bought: : Baseball/Softball shoes: In last 12 months: 2+
Athletic Shoes - Number of pairs bought: Baseball/Softball shoes: In last 12 months: 2
Score: 11
Hair Tonic Or Dressing (Men): Heavy Users: 8+ Times/Last 7 Days: Total Category
Hair Tonic Or Dressing (Men): Heavy Users: 3+ Times/Last 7 Days: Total Category
Score: 12
A match with a score above 5 can be a wrong match!
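To see why word overlap misleads here, it helps to look at the words the two labels do not share. The sketch below uses an assumed whitespace tokenization (the deck does not specify one) and hypothetical helper names.

```python
def words(label):
    """Split a label on whitespace, trim punctuation, lowercase, drop empties."""
    tokens = (tok.strip(":()-").lower() for tok in label.split())
    return {tok for tok in tokens if tok}


a = "Hair Tonic Or Dressing (Men): Heavy Users: 8+ Times/Last 7 Days: Total Category"
b = "Hair Tonic Or Dressing (Men): Heavy Users: 3+ Times/Last 7 Days: Total Category"

shared = words(a) & words(b)
different = words(a) ^ words(b)  # symmetric difference: words in only one label

print(len(shared))        # 12
print(sorted(different))  # ['3+', '8+']
```

Twelve shared words easily clear a threshold of 5, yet the only words that differ are the ones that define the usage bracket, so the two labels describe different questions.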
21. New Approach Proposed by Gan Song
Criteria:
• Check if one cell is a subset of another cell.
• If all the cells in the shorter label can find their counterparts, a match is found.
Fuzzy Matching: Levenshtein distance
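The second criterion can be sketched as a subset test on cells. This simplified version treats two cells as counterparts only when they are identical, whereas the approach in the talk compares cells with Levenshtein distance; the function name `labels_match` and the short two-cell label are hypothetical.

```python
def labels_match(cells_a, cells_b):
    """True if every cell of the shorter label also appears in the longer one."""
    shorter, longer = sorted((set(cells_a), set(cells_b)), key=len)
    return shorter <= longer  # set inclusion


label_1 = ["anyairconditioneramountspent", "total", "inlast12months", "$1000+"]
label_2 = ["shoesamountspentintotal", "anynikeair", "inlast12months", "$1000+"]
short_label = ["inlast12months", "$1000+"]  # hypothetical shorter label

print(labels_match(label_1, label_2))      # False
print(labels_match(short_label, label_2))  # True
```

Because the test works on sets, the order of the cells inside a label does not matter.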
22. Levenshtein Distance
• Levenshtein distance is a string metric for measuring the difference between two sequences.
• Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
24. Levenshtein Distance
smtchgy → smmtchg
Path 1 (2 edits): smtchgy → smmtchgy → smmtchg
  insert 'm', delete 'y'
Path 2 (5 edits): smtchgy → smmchgy → smmthgy → smmtcgy → smmtchy → smmtchg
  change 't' to 'm', change 'c' to 't', change 'h' to 'c', change 'g' to 'h', change 'y' to 'g'
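Several edit paths can lead from one string to the other, and the Levenshtein distance is the length of the shortest one. A direct recursive sketch of that definition, memoized with `functools.lru_cache`, is shown below; the function name is chosen for this example only.

```python
from functools import lru_cache


def levenshtein_recursive(a, b):
    """Minimum number of single-character edits turning a into b."""

    @lru_cache(maxsize=None)
    def dist(i, j):
        # Distance between the suffixes a[i:] and b[j:].
        if i == len(a):
            return len(b) - j          # insert what is left of b
        if j == len(b):
            return len(a) - i          # delete what is left of a
        if a[i] == b[j]:
            return dist(i + 1, j + 1)  # characters agree, no edit needed
        return 1 + min(
            dist(i + 1, j),      # delete a[i]
            dist(i, j + 1),      # insert b[j]
            dist(i + 1, j + 1),  # substitute a[i] with b[j]
        )

    return dist(0, 0)


print(levenshtein_recursive("smtchgy", "smmtchg"))  # 2
```

The result is 2, the length of the insert-then-delete path, not the five substitutions of the longer path.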
25. Shuffle!
[Figure: the 24 orderings of the four letters H, O, A, N, listed twice, above the pair "H O A N   H O A N".]
26. Find a match!
[Figure: the same two lists of the 24 orderings of H, O, A, N, above the pair "H O A N   H O A N".]
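The shuffle idea can be sketched with `itertools.permutations`, assuming each letter stands for one cell of a label: try every ordering of one label's cells until one lines up with the other label. The function name `find_matching_order` is hypothetical.

```python
from itertools import permutations


def find_matching_order(cells_a, cells_b):
    """Return an ordering of cells_a equal to cells_b, or None if there is none."""
    target = tuple(cells_b)
    for candidate in permutations(cells_a):
        if candidate == target:
            return candidate
    return None


print(len(list(permutations("HOAN"))))  # 24
print(find_matching_order(["N", "A", "H", "O"], ["H", "O", "A", "N"]))
# ('H', 'O', 'A', 'N')

# Same verdict without enumerating every ordering: compare the sorted cells.
print(sorted(["N", "A", "H", "O"]) == sorted(["H", "O", "A", "N"]))  # True
```

Enumerating orderings grows factorially with the number of cells, so for exact cell comparisons the sorted comparison is the cheaper way to get the same answer.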
27. Cell-based Comparison Model (CCM)
Social Networking – LinkedIn How important to you: : Not at all Important :: Keep in touch with family/friends
['socialnetworkinglinkedinhowimportanttoyou',
'notatallimportant',
'keepintouchwithfamilyfriends']
Social Networking – LinkedIn.com How important to you: : Keep in touch with family/friends: : Not at all Important
['socialnetworkinglinkedincomhowimportanttoyou',
'keepintouchwithfamilyfriends',
'notatallimportant']
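The cell lists above can be reproduced with a small preprocessing step. The rules below (split on colons, lowercase, keep only letters, digits, `$` and `+`, drop empty cells) are inferred from the examples in the deck rather than stated in it, and `to_cells` is a hypothetical name.

```python
import re


def to_cells(label):
    """Split a label on ':' and normalize each non-empty part into a cell."""
    cells = []
    for part in label.split(":"):
        cell = re.sub(r"[^a-z0-9$+]", "", part.lower())
        if cell:
            cells.append(cell)
    return cells


label_a = ("Social Networking – LinkedIn How important to you: : "
           "Not at all Important :: Keep in touch with family/friends")
label_b = ("Social Networking – LinkedIn.com How important to you: : "
           "Keep in touch with family/friends: : Not at all Important")

print(to_cells(label_a))
print(to_cells(label_b))
```

After this step the two labels share two identical cells in a different order, and their first cells differ only by the three characters of "com".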
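28. Levenshtein
Edit operations needed to turn each cell of one label into each cell of the other:
'socialnetworkinglinkedinhowimportanttoyou'
  → 'socialnetworkinglinkedincomhowimportanttoyou': {'insert': 3, 'replace': 0, 'delete': 0}
  → 'keepintouchwithfamilyfriends': {'insert': 0, 'replace': 21, 'delete': 13}
  → 'notatallimportant': {'insert': 0, 'replace': 4, 'delete': 24}
'notatallimportant'
  → 'socialnetworkinglinkedincomhowimportanttoyou': {'insert': 27, 'replace': 4, 'delete': 0}
  → 'keepintouchwithfamilyfriends': {'insert': 11, 'replace': 11, 'delete': 0}
  → 'notatallimportant': {'insert': 0, 'replace': 0, 'delete': 0}
'keepintouchwithfamilyfriends'
  → 'socialnetworkinglinkedincomhowimportanttoyou': {'insert': 16, 'replace': 20, 'delete': 0}
  → 'keepintouchwithfamilyfriends': {'insert': 0, 'replace': 0, 'delete': 0}
  → 'notatallimportant': {'insert': 0, 'replace': 11, 'delete': 11}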
Old
New
Only a small number of insertions or deletions is accepted.
Any other combination of operations is rejected as a match.
['socialnetworkinglinkedincomhowimportanttoyou',
'keepintouchwithfamilyfriends',
'notatallimportant']
['socialnetworkinglinkedinhowimportanttoyou',
'notatallimportant',
'keepintouchwithfamilyfriends']
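A sketch of how such operation counts and the acceptance rule could be computed. The dynamic program below tracks the counts along one minimal edit path; the cut-off `max_edits=3` and the function names are assumptions, since the deck only says "small amount".

```python
def edit_op_counts(src, dst):
    """Count inserts, replaces and deletes on one minimal edit path src -> dst."""
    # Each entry is (total, inserts, replaces, deletes) for a pair of prefixes.
    prev = [(j, j, 0, 0) for j in range(len(dst) + 1)]
    for i, a in enumerate(src, start=1):
        cur = [(i, 0, 0, i)]
        for j, b in enumerate(dst, start=1):
            total, ins, rep, dele = prev[j - 1]
            diag = (total, ins, rep, dele) if a == b else (total + 1, ins, rep + 1, dele)
            total, ins, rep, dele = prev[j]
            up = (total + 1, ins, rep, dele + 1)    # delete a from src
            total, ins, rep, dele = cur[j - 1]
            left = (total + 1, ins + 1, rep, dele)  # insert b into src
            cur.append(min(diag, up, left))         # tuples compare by total first
        prev = cur
    _, ins, rep, dele = prev[-1]
    return {"insert": ins, "replace": rep, "delete": dele}


def cells_match(old_cell, new_cell, max_edits=3):
    """Accept only a few pure insertions or a few pure deletions (assumed cut-off)."""
    c = edit_op_counts(old_cell, new_cell)
    pure = c["replace"] == 0 and (c["insert"] == 0 or c["delete"] == 0)
    return pure and c["insert"] + c["delete"] <= max_edits


a = "socialnetworkinglinkedinhowimportanttoyou"
b = "socialnetworkinglinkedincomhowimportanttoyou"
print(edit_op_counts(a, b))  # {'insert': 3, 'replace': 0, 'delete': 0}
print(cells_match(a, b))     # True
print(cells_match("notatallimportant", "keepintouchwithfamilyfriends"))  # False
```

When several minimal paths exist, the split between operation types can differ from the numbers on the slide even though the total distance is the same.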
30. Process
1. Preprocess the labels
2. Remove duplicates
3. Compare the labels by using CCM
4. Find out good matches
5. Output the ‘old not in new’ and ‘new not in old’
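The five steps can be strung together in a few lines. This is a simplified sketch, not the pipeline from the talk: cells are compared for exact equality instead of with Levenshtein distance, the preprocessing rules are inferred from the deck's examples, and all function names and the sample label lists are hypothetical.

```python
import re


def to_cells(label):
    """Step 1: split on ':' and normalize each part into a cell."""
    parts = (re.sub(r"[^a-z0-9$+]", "", p.lower()) for p in label.split(":"))
    return frozenset(p for p in parts if p)


def is_match(cells_a, cells_b):
    """Step 3: every cell of the shorter label appears in the longer one."""
    shorter, longer = sorted((cells_a, cells_b), key=len)
    return shorter <= longer


def compare_label_sets(old_labels, new_labels):
    # Step 2: keying by cells drops labels that normalize to the same cells.
    old = {to_cells(label): label for label in old_labels}
    new = {to_cells(label): label for label in new_labels}
    matched_old, matched_new = set(), set()
    for old_cells in old:          # Step 4: collect the good matches
        for new_cells in new:
            if is_match(old_cells, new_cells):
                matched_old.add(old_cells)
                matched_new.add(new_cells)
    # Step 5: report what is left over on each side.
    old_not_in_new = [old[c] for c in old if c not in matched_old]
    new_not_in_old = [new[c] for c in new if c not in matched_new]
    return old_not_in_new, new_not_in_old


old_labels = [
    "Social Networking – LinkedIn.com How important to you: : Keep in touch with family/friends: : Not at all Important",
    "Tires: Total Users: Bought in Last 12 Months: Hankook",
]
new_labels = [
    "Social Networking – LinkedIn.com How important to you: : Not at all Important :: Keep in touch with family/friends",
    "Batteries: Total Users: Bought in Last 6 Months: Kodak",
]
print(compare_label_sets(old_labels, new_labels))
```

The reordered LinkedIn.com labels are matched, while the tires and batteries labels end up in "old not in new" and "new not in old" respectively.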
32. Implementation Considerations
1. Data Suitability
2. Process Design
3. Validation Methodology
4. Computing Resources
33. Python
def levenshtein(s1, s2):
    # Make s1 the longer string so each row has len(s2) + 1 entries
    if len(s1) < len(s2):
        return levenshtein(s2, s1)
    # Turning s1 into an empty string takes len(s1) deletions
    if len(s2) == 0:
        return len(s1)
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            # j + 1 instead of j: the rows are one entry longer than s2
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]
34. Scala
def levenshtein(str1: String, str2: String): Int = {
  val lenStr1 = str1.length
  val lenStr2 = str2.length
  // d(i)(j) holds the distance between the first i chars of str1 and the first j chars of str2
  val d: Array[Array[Int]] = Array.ofDim[Int](lenStr1 + 1, lenStr2 + 1)
  for (i <- 0 to lenStr1) d(i)(0) = i
  for (j <- 0 to lenStr2) d(0)(j) = j
  for (i <- 1 to lenStr1; j <- 1 to lenStr2) {
    val cost = if (str1(i - 1) == str2(j - 1)) 0 else 1
    d(i)(j) = min(
      d(i - 1)(j) + 1,       // deletion
      d(i)(j - 1) + 1,       // insertion
      d(i - 1)(j - 1) + cost // substitution
    )
  }
  d(lenStr1)(lenStr2)
}
def min(nums: Int*): Int = nums.min
36. Example: kitinmy vs. sitting
[('replace', 0, 0), ('insert', 2, 2), ('delete', 5, 6), ('replace', 6, 6)]
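The triples in this example read as (operation, position in the source string, position in the target string). One way to obtain such a list is to fill the full distance matrix and walk back from its last cell. The sketch below is an illustration under that reading, with a hypothetical function name and one particular tie-breaking order (substitution, then deletion, then insertion).

```python
def edit_ops(src, dst):
    """List one minimal sequence of (operation, source_pos, target_pos) edits."""
    m, n = len(src), len(dst)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = src[i - 1] != dst[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)

    ops = []
    i, j = m, n
    while i > 0 or j > 0:  # walk back from the bottom-right corner
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (src[i - 1] != dst[j - 1]):
            if src[i - 1] != dst[j - 1]:
                ops.append(("replace", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", i - 1, j))
            i -= 1
        else:
            ops.append(("insert", i, j - 1))
            j -= 1
    ops.reverse()
    return ops


print(edit_ops("kitinmy", "sitting"))
# [('replace', 0, 0), ('insert', 2, 2), ('delete', 5, 6), ('replace', 6, 6)]
```

The list has four entries, so the Levenshtein distance between the two strings is 4. Other tie-breaking orders can produce a different but equally short list.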