Jennifer Shin
Founder, 8 Path Solutions LLC
Lecturer, UC Berkeley
Fuzzy Matching on
Apache Spark
Agenda
• Intro to fuzzy matching:
what you need to know
• Use Case:
a fuzzy solution for surveys
• Fuzzy implementations:
real world considerations
© 2017 8 Path Solutions LLC. All Rights Reserved.
Intro to Fuzzy Matching
What You Need To Know
Fuzzy Matching
(aka Approximate String Matching)
• process of finding strings that approximately match a given
pattern
• closeness of a match is measured in terms of an edit
distance, i.e. the number of operations necessary to convert
the string into an exact match.
Fuzzy Matching
The edit distance is the number of primitive operations
necessary to convert the string into an exact match.
Examples of primitive operations are:
insertion: cot → coat
deletion: coat → cot
substitution: coat → cost
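The three primitive operations can be written as small string helpers. This is only a minimal sketch: the names `insert_at`, `delete_at`, and `substitute_at` are illustrative and do not come from the talk.

```python
def insert_at(s: str, i: int, ch: str) -> str:
    """Insert character ch before position i."""
    return s[:i] + ch + s[i:]


def delete_at(s: str, i: int) -> str:
    """Delete the character at position i."""
    return s[:i] + s[i + 1:]


def substitute_at(s: str, i: int, ch: str) -> str:
    """Replace the character at position i with ch."""
    return s[:i] + ch + s[i + 1:]


print(insert_at("cot", 2, "a"))       # coat
print(delete_at("coat", 2))           # cot
print(substitute_at("coat", 2, "s"))  # cost
```

Each call is one primitive operation, so each of these pairs is at edit distance 1.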
What is fuzzy matching?
• A fuzzy matching program is used to return a list of results that
are not an exact match for the term being searched
– search cab argument words
– spellings may not exactly match.
Why use fuzzy matching?
• Not all data is clean
• Not all formatting is consistent
• Not all databases are structured
• Not all text is correct
• People are not perfect
When can we use fuzzy matching?
• Case by case basis
• Data cleaning
• Entity/Name matching
• Recommendations
• Predictive text
Use Case
A Fuzzy Solution For Surveys
Data: Survey
• Comprehensive survey about attitudes, usage, purchases
• 6,000 products
• 20,000 variables
• 26 feed categories
Problem Description
A: Dental Floss: Light Users: 0-2 Times/Last 7 Days: Total Category
B: Dental Floss: Times/Last 7 Days: Light (0-2)
How similar is A to B?
A + B = How many new questions?
Word Based Comparison Model (WCM)
Old label: Anxiety/Panic Used a branded prescription remedy
New label: Ailments/Remedies: : Anxiety/Panic: In last 12 months: Used a branded prescription remedy
Score: 6
Good match
Then set a threshold: a match with a score above 5 is a good match
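The talk does not show how the WCM score is computed. One reading that reproduces the score of 6 for this pair is to count the distinct words the two labels share. The sketch below assumes exactly that; the tokenizer and the names `words`, `wcm_score`, and `THRESHOLD` are hypothetical.

```python
import re


def words(label: str) -> set:
    # Assumed tokenizer: lowercase, then keep runs of letters, digits, "/", "$", "+"
    return set(re.findall(r"[a-z0-9/$+]+", label.lower()))


def wcm_score(old_label: str, new_label: str) -> int:
    # Assumed score: number of distinct words shared by the two labels
    return len(words(old_label) & words(new_label))


old = "Anxiety/Panic Used a branded prescription remedy"
new = ("Ailments/Remedies: : Anxiety/Panic: In last 12 months: "
       "Used a branded prescription remedy")

THRESHOLD = 5  # from the slide: scores above 5 count as a good match
score = wcm_score(old, new)
print(score, score > THRESHOLD)  # 6 True
```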
Word Based Comparison Model (WCM)
Any air conditioner Amount spent : total :in last 12 months: $1000+
Shoes - Amount Spent in Total: any Nike air: In last 12 months: $1000+
By cells:
anyairconditioneramountspent | total | inlast12months | $1000+
shoesamountspentintotal | anynikeair | inlast12months | $1000+
Two cells have no match, even though most of the words do have matches.
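Splitting a label into cells can be sketched as follows. The rule used here (split on ":", lowercase, keep only letters, digits, "$" and "+") is an assumption inferred from the cell strings shown on the slides, and `to_cells` is a hypothetical name.

```python
import re


def to_cells(label: str) -> list:
    cells = []
    for part in label.split(":"):
        # Lowercase and drop everything except letters, digits, "$" and "+"
        cell = re.sub(r"[^a-z0-9$+]", "", part.lower())
        if cell:  # skip the empty cells produced by ": :"
            cells.append(cell)
    return cells


print(to_cells("Any air conditioner Amount spent : total :in last 12 months: $1000+"))
# ['anyairconditioneramountspent', 'total', 'inlast12months', '$1000+']
print(to_cells("Shoes - Amount Spent in Total: any Nike air: In last 12 months: $1000+"))
# ['shoesamountspentintotal', 'anynikeair', 'inlast12months', '$1000+']
```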

Word Based Comparison Model (WCM)
Wrong matches due to changes of brand names:
Tires: Total Users: Bought in Last 12 Months: Hankook
Batteries: Total Users: Bought in Last 6 Months: Kodak
Score: 7
Prescription Brands - Used: : Evista (men only): In last 12 months
Prescription Brands - Used: : Avodart (men only): In last 12 months
Score: 9
Match with scores above 5 can be a wrong match!
Why does Word-based Comparison Model (WCM) perform so poorly?
Wrong matches due to different numbers:
Athletic Shoes - Amount Spent in Total: : Baseball/Softball shoes: In last 12 months: $75 - $149
Athletic Shoes - Amount Spent in Total: Baseball/Softball shoes: In last 12 months: $50 - $74
Score: 12
Athletic Shoes - Number of pairs bought: : Baseball/Softball shoes: In last 12 months: 2+
Athletic Shoes - Number of pairs bought: Baseball/Softball shoes: In last 12 months: 2
Score: 11
Hair Tonic Or Dressing (Men): Heavy Users: 8+ Times/Last 7 Days: Total Category
Hair Tonic Or Dressing (Men): Heavy Users: 3+ Times/Last 7 Days: Total Category
Score: 12
Match with scores above 5 can be a wrong match!
Criteria:
• Check if one cell is a subset of another cell.
• If all the cells in the shorter label can find their counterparts, a
match is found.
Fuzzy Matching: Levenshtein distance
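The label-level criterion can be sketched as below. This is one possible reading, not the talk's implementation: `labels_match` is a hypothetical name, and the cell comparison is left as a parameter that defaults to exact equality so a fuzzy comparison can be swapped in.

```python
import operator
from typing import Callable, List


def labels_match(cells_a: List[str], cells_b: List[str],
                 cell_match: Callable[[str, str], bool] = operator.eq) -> bool:
    # Every cell of the shorter label must find a counterpart in the longer one
    shorter, longer = sorted((cells_a, cells_b), key=len)
    return all(any(cell_match(c, o) for o in longer) for c in shorter)


air = ['anyairconditioneramountspent', 'total', 'inlast12months', '$1000+']
shoes = ['shoesamountspentintotal', 'anynikeair', 'inlast12months', '$1000+']

print(labels_match(air, shoes))                 # False: two cells have no counterpart
print(labels_match(air, list(reversed(air))))   # True: cell order does not matter
```

Because each cell is looked up independently, reordering cells inside a label does not affect the result.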
New Approach Proposed by Gan Song
• Levenshtein distance is a string metric for measuring the
difference between two sequences.
• Informally, the Levenshtein distance between two words is the
minimum number of single-character edits
(i.e. insertions, deletions or substitutions) required to change
one word into the other.
Levenshtein Distance
smtchgy → smmtchg
2 edits: smtchgy → smmtchgy → smmtchg
(insert 'm', delete 'y')
5 edits: smtchgy → smmchgy → smmthgy → smmtcgy → smmtchy → smmtchg
(change 't' to 'm', change 'c' to 't', change 'h' to 'c', change 'g' to 'h', change 'y' to 'g')
Levenshtein Distance
[Figure: all 24 orderings of the letters H, O, A, N, listed for each of two copies of "H O A N"]
Shuffle!
[Figure: the same 24 orderings of H, O, A, N for both copies of "H O A N"]
Find a match!
Cell-based Comparison Model (CCM)
Social Networking – LinkedIn How important to you: : Not at all Important :: Keep in touch with family/friends
['socialnetworkinglinkedinhowimportanttoyou',
'notatallimportant',
'keepintouchwithfamilyfriends']
Social Networking – LinkedIn.com How important to you: : Keep in touch with family/friends: : Not at all Important
['socialnetworkinglinkedincomhowimportanttoyou',
'keepintouchwithfamilyfriends',
'notatallimportant']
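Levenshtein
Edit operations between each cell of one label and each cell of the other (old vs. new), written as source → target:
'socialnetworkinglinkedinhowimportanttoyou' → 'socialnetworkinglinkedincomhowimportanttoyou': {'insert': 3, 'replace': 0, 'delete': 0}
'socialnetworkinglinkedinhowimportanttoyou' → 'keepintouchwithfamilyfriends': {'insert': 0, 'replace': 21, 'delete': 13}
'socialnetworkinglinkedinhowimportanttoyou' → 'notatallimportant': {'insert': 0, 'replace': 4, 'delete': 24}
'notatallimportant' → 'socialnetworkinglinkedincomhowimportanttoyou': {'insert': 27, 'replace': 4, 'delete': 0}
'notatallimportant' → 'keepintouchwithfamilyfriends': {'insert': 11, 'replace': 11, 'delete': 0}
'notatallimportant' → 'notatallimportant': {'insert': 0, 'replace': 0, 'delete': 0}
'keepintouchwithfamilyfriends' → 'socialnetworkinglinkedincomhowimportanttoyou': {'insert': 16, 'replace': 20, 'delete': 0}
'keepintouchwithfamilyfriends' → 'keepintouchwithfamilyfriends': {'insert': 0, 'replace': 0, 'delete': 0}
'keepintouchwithfamilyfriends' → 'notatallimportant': {'insert': 0, 'replace': 11, 'delete': 11}
Only a small number of insertions or deletions is accepted.
Any other combination of operations is rejected as a match.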
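The per-cell operation counts and the acceptance rule can be sketched in plain Python. This is an illustration under assumptions, not the code used in the talk: `edit_op_counts` and `is_cell_match` are hypothetical names, the cutoff `max_indels=3` is an assumed value for "small", and when several optimal edit scripts exist the counts may differ from those a library would report.

```python
def edit_op_counts(src: str, dst: str) -> dict:
    """Count the operations in one optimal edit script turning src into dst."""
    n, m = len(src), len(dst)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = src[i - 1] != dst[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    # Walk back from the bottom-right corner, recording which operation was used
    counts = {'insert': 0, 'replace': 0, 'delete': 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i - 1] == dst[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1                      # characters match, no edit
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            counts['replace'] += 1
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            counts['insert'] += 1
            j -= 1
        else:
            counts['delete'] += 1
            i -= 1
    return counts


def is_cell_match(src: str, dst: str, max_indels: int = 3) -> bool:
    # Assumed rule: no replacements, only insertions or only deletions, and few of them
    c = edit_op_counts(src, dst)
    return (c['replace'] == 0
            and min(c['insert'], c['delete']) == 0
            and c['insert'] + c['delete'] <= max_indels)


a = 'socialnetworkinglinkedinhowimportanttoyou'
b = 'socialnetworkinglinkedincomhowimportanttoyou'
print(edit_op_counts(a, b))  # {'insert': 3, 'replace': 0, 'delete': 0}
print(is_cell_match(a, b))   # True
print(is_cell_match('notatallimportant', 'keepintouchwithfamilyfriends'))  # False
```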
Process
1. Preprocess the labels
2. Remove duplicates
3. Compare the labels by using CCM
4. Find out good matches
5. Output the ‘old not in new’ and ‘new not in old’
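The five steps can be wired together roughly as follows. This is a sketch under assumptions: `compare_surveys` is a hypothetical name, the label comparison is passed in as `is_match`, and the example uses a case-insensitive equality check purely as a stand-in for CCM.

```python
def compare_surveys(old_labels, new_labels, is_match, preprocess=str.strip):
    # Steps 1-2: preprocess, then remove duplicates while keeping order
    old = list(dict.fromkeys(preprocess(label) for label in old_labels))
    new = list(dict.fromkeys(preprocess(label) for label in new_labels))
    # Steps 3-4: compare every old label with every new label, keep the matches
    matched_old, matched_new = set(), set()
    for o in old:
        for n in new:
            if is_match(o, n):
                matched_old.add(o)
                matched_new.add(n)
    # Step 5: report what is left over on each side
    old_not_in_new = [o for o in old if o not in matched_old]
    new_not_in_old = [n for n in new if n not in matched_new]
    return old_not_in_new, new_not_in_old


old_labels = ["Dental Floss: Times/Last 7 Days: Light (0-2)",
              "Dental Floss: Times/Last 7 Days: Light (0-2) ",
              "Tires: Total Users: Bought in Last 12 Months: Hankook"]
new_labels = ["dental floss: times/last 7 days: light (0-2)",
              "Batteries: Total Users: Bought in Last 6 Months: Kodak"]

# Placeholder matcher: case-insensitive equality instead of CCM
result = compare_surveys(old_labels, new_labels,
                         is_match=lambda a, b: a.lower() == b.lower())
print(result)
```

Comparing every old label with every new label is quadratic in the number of labels, which is one reason computing resources matter for this kind of job.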
Fuzzy Implementations
Real World Considerations
Implementation Considerations
1. Data Suitability
2. Process Design
3. Validation Methodology
4. Computing Resources
Python
def levenshtein(s1, s2):
    # Make s1 the longer string so each row has len(s2) + 1 entries
    if len(s1) < len(s2):
        return levenshtein(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            # Rows are one entry longer than s2, hence j + 1
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]
Scala
def levenshtein(str1: String, str2: String): Int = {
  val lenStr1 = str1.length
  val lenStr2 = str2.length
  val d: Array[Array[Int]] = Array.ofDim(lenStr1 + 1, lenStr2 + 1)
  // First column and first row: distance from the empty string
  for (i <- 0 to lenStr1) d(i)(0) = i
  for (j <- 0 to lenStr2) d(0)(j) = j
  for (i <- 1 to lenStr1; j <- 1 to lenStr2) {
    val cost = if (str1(i - 1) == str2(j - 1)) 0 else 1
    d(i)(j) = min(
      d(i - 1)(j) + 1,        // deletion
      d(i)(j - 1) + 1,        // insertion
      d(i - 1)(j - 1) + cost  // substitution
    )
  }
  d(lenStr1)(lenStr2)
}
def min(nums: Int*): Int = nums.min
Spark
pyspark.sql.functions.levenshtein(left, right)
Computes the Levenshtein distance of the two given strings.
from pyspark.sql.functions import levenshtein
df = spark.createDataFrame([(<word 1>, <word 2>,)], ['l', 'r'])
df.select(levenshtein('l', 'r').alias('d')).collect()
Example: kitinmy vs. sitting
[('replace', 0, 0), ('insert', 2, 2), ('delete', 5, 6), ('replace', 6, 6)]
Example: Kitten vs Sitting
Example: Kitten vs Sitten
Jennifer Shin
jshin@8pathsolutions.com
Thank You.

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid Cloud
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud Environments
 
ETL VS ELT.pdf
ETL VS ELT.pdfETL VS ELT.pdf
ETL VS ELT.pdf
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Tutorial on SPARQL: SPARQL Protocol and RDF Query Language
Tutorial on SPARQL: SPARQL Protocol and RDF Query Language Tutorial on SPARQL: SPARQL Protocol and RDF Query Language
Tutorial on SPARQL: SPARQL Protocol and RDF Query Language
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 

Ähnlich wie Fuzzy Matching on Apache Spark with Jennifer Shin

Ähnlich wie Fuzzy Matching on Apache Spark with Jennifer Shin (20)

Fuzzy Matching to the Rescue
Fuzzy Matching to the RescueFuzzy Matching to the Rescue
Fuzzy Matching to the Rescue
 
GPSBUS206_Best Practices for Building a Partner Database Practice on AWS
GPSBUS206_Best Practices for Building a Partner Database Practice on AWSGPSBUS206_Best Practices for Building a Partner Database Practice on AWS
GPSBUS206_Best Practices for Building a Partner Database Practice on AWS
 
AWS reInvent 2017 Recap Webinar
AWS reInvent 2017 Recap WebinarAWS reInvent 2017 Recap Webinar
AWS reInvent 2017 Recap Webinar
 
AWS Migration - General
AWS Migration - GeneralAWS Migration - General
AWS Migration - General
 
The Enterprise Fast Lane - What Your Competition Doesn't Want You to Know abo...
The Enterprise Fast Lane - What Your Competition Doesn't Want You to Know abo...The Enterprise Fast Lane - What Your Competition Doesn't Want You to Know abo...
The Enterprise Fast Lane - What Your Competition Doesn't Want You to Know abo...
 
ALX401-Advanced Alexa Skill Building Conversation and Memory
ALX401-Advanced Alexa Skill Building Conversation and MemoryALX401-Advanced Alexa Skill Building Conversation and Memory
ALX401-Advanced Alexa Skill Building Conversation and Memory
 
Keynote: What Transformation Really Means for the Enterprise - AWS Transforma...
Keynote: What Transformation Really Means for the Enterprise - AWS Transforma...Keynote: What Transformation Really Means for the Enterprise - AWS Transforma...
Keynote: What Transformation Really Means for the Enterprise - AWS Transforma...
 
Conversation and Memory - ALX401-R - re:Invent 2017
Conversation and Memory - ALX401-R - re:Invent 2017Conversation and Memory - ALX401-R - re:Invent 2017
Conversation and Memory - ALX401-R - re:Invent 2017
 
AWS reInvent Recap 線上研討會
AWS reInvent Recap 線上研討會AWS reInvent Recap 線上研討會
AWS reInvent Recap 線上研討會
 
AWS Migration - As-Is Tool
AWS Migration - As-Is ToolAWS Migration - As-Is Tool
AWS Migration - As-Is Tool
 
Making Your User Stories "Ready" to Get to "Done"
Making Your User Stories "Ready" to Get to "Done" Making Your User Stories "Ready" to Get to "Done"
Making Your User Stories "Ready" to Get to "Done"
 
MCL301_Building a Voice-Enabled Customer Service Chatbot Using Amazon Lex and...
MCL301_Building a Voice-Enabled Customer Service Chatbot Using Amazon Lex and...MCL301_Building a Voice-Enabled Customer Service Chatbot Using Amazon Lex and...
MCL301_Building a Voice-Enabled Customer Service Chatbot Using Amazon Lex and...
 
Do More of This and Less of That (Sam Yen at Enterprise UX 2017)
Do More of This and Less of That (Sam Yen at Enterprise UX 2017)Do More of This and Less of That (Sam Yen at Enterprise UX 2017)
Do More of This and Less of That (Sam Yen at Enterprise UX 2017)
 
GAM311-How Linden Lab Built a Virtual World on the AWS Cloud.pdf
GAM311-How Linden Lab Built a Virtual World on the AWS Cloud.pdfGAM311-How Linden Lab Built a Virtual World on the AWS Cloud.pdf
GAM311-How Linden Lab Built a Virtual World on the AWS Cloud.pdf
 
Latam virtual event_keynote-pt-br_americo
Latam virtual event_keynote-pt-br_americoLatam virtual event_keynote-pt-br_americo
Latam virtual event_keynote-pt-br_americo
 
Quarterly Planning Deck
Quarterly Planning DeckQuarterly Planning Deck
Quarterly Planning Deck
 
GPSBUS216-GPS Applying AI-ML to Find Security Needles in the Haystack
GPSBUS216-GPS Applying AI-ML to Find Security Needles in the HaystackGPSBUS216-GPS Applying AI-ML to Find Security Needles in the Haystack
GPSBUS216-GPS Applying AI-ML to Find Security Needles in the Haystack
 
CityWallet - From Mount Augustus to Los Roques Archipelago
CityWallet - From Mount Augustus to Los Roques ArchipelagoCityWallet - From Mount Augustus to Los Roques Archipelago
CityWallet - From Mount Augustus to Los Roques Archipelago
 
AWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLiftAWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLift
 
AWS上でのオンラインゲームリリースガイド
AWS上でのオンラインゲームリリースガイドAWS上でのオンラインゲームリリースガイド
AWS上でのオンラインゲームリリースガイド
 

Mehr von Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Kürzlich hochgeladen

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 

Kürzlich hochgeladen (20)

Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Smarteg dropshipping via API with DroFx.pptx

Fuzzy Matching on Apache Spark with Jennifer Shin

  • 1. Jennifer Shin Founder, 8 Path Solutions LLC Lecturer, UC Berkeley Fuzzy Matching on Apache Spark
  • 2. Agenda • Intro to fuzzy matching: what you need to know • Use Case: a fuzzy solution for surveys • Fuzzy implementations: real world considerations © 2017 8 Path Solutions LLC. All Rights Reserved.
  • 3. Intro to Fuzzy Matching What You Need To Know
  • 4. Fuzzy Matching (aka Approximate String Matching) • process of finding strings that approximately match a given pattern • closeness of a match is measured in terms of an edit distance, i.e. the number of operations necessary to convert the string into an exact match.
  • 5. Fuzzy Matching The edit distance is the number of primitive operations necessary to convert the string into an exact match. Examples of primitive operations are: insertion: cot → coat; deletion: coat → cot; substitution: coat → cost.
  • 6. What is fuzzy matching? • A fuzzy matching program returns a list of results that are not an exact match for the term being searched – search cab argument words – spellings may not exactly match.
  • 7. Why use fuzzy matching? • Not all data is clean • Not all formatting is consistent • Not all databases are structured • Not all text is correct • People are not perfect
  • 8. When can we use fuzzy matching? • Case by case basis • Data cleaning
  • 9. When can we use fuzzy matching? • Case by case basis • Data cleaning • Entity/Name matching
  • 10. When can we use fuzzy matching? • Case by case basis • Data cleaning • Entity/Name matching • Recommendations
  • 11. When can we use fuzzy matching? • Case by case basis • Data cleaning • Entity/Name matching • Recommendations • Predictive text
  • 12. Use Case A Fuzzy Solution For Surveys
  • 13. Data: Survey • Comprehensive survey about attitudes, usage, purchases • 6,000 products • 20,000 variables • 26 feed categories
  • 14. Problem Description A: Dental Floss: Light Users: 0-2 Times/Last 7 Days: Total Category B: Dental Floss: Times/Last 7 Days: Light (0-2) How similar is A to B? A + B = How many new questions?
  • 15. Word Based Comparison Model (WCM) Old label: Anxiety/Panic Used a branded prescription remedy New label: Ailments/Remedies: : Anxiety/Panic: In last 12 months: Used a branded prescription remedy
  • 16. Word Based Comparison Model (WCM) Old label: Anxiety/Panic Used a branded prescription remedy New label: Ailments/Remedies: : Anxiety/Panic: In last 12 months: Used a branded prescription remedy Score: 6 (good match). Then set a threshold: a match with a score above 5 is a good match.
  • 18. Word Based Comparison Model (WCM) Label 1: Any air conditioner Amount spent : total :in last 12 months: $1000+ Label 2: Shoes - Amount Spent in Total: any Nike air: In last 12 months: $1000+ By cells: [anyairconditioneramountspent, total, inlast12months, $1000+] and [shoesamountspentintotal, anynikeair, inlast12months, $1000+] Two cells have no match, even though most of the words do have matches.
  • 19. Why does the Word-based Comparison Model (WCM) perform so poorly? Wrong matches due to changes of brand names: (a) Tires: Total Users: Bought in Last 12 Months: Hankook vs. Batteries: Total Users: Bought in Last 6 Months: Kodak (b) Prescription Brands - Used: : Evista (men only): In last 12 months vs. Prescription Brands - Used: : Avodart (men only): In last 12 months Scores: 7 and 9. A match with a score above 5 can be a wrong match!
  • 20. Why does the Word-based Comparison Model (WCM) perform so poorly? Wrong matches due to different numbers: (a) Athletic Shoes - Amount Spent in Total: : Baseball /Softball shoes: In last 12 months: $75 - $149 vs. Athletic Shoes - Amount Spent in Total: Baseball /Softball shoes: In last 12 months: $50 - $74 (b) Athletic Shoes - Number of pairs bought: : Baseball/Softball shoes: In last 12 months: 2+ vs. Athletic Shoes - Number of pairs bought: Baseball/Softball shoes: In last 12 months: 2 (c) Hair Tonic Or Dressing (Men): Heavy Users: 8+ Times/Last 7 Days: Total Category vs. Hair Tonic Or Dressing (Men): Heavy Users: 3+ Times/Last 7 Days: Total Category Scores: 12, 11 and 12. A match with a score above 5 can be a wrong match!
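The slides do not say how the WCM score is computed. The Python sketch below assumes it is the number of distinct words the two labels share, after lowercasing, splitting on whitespace and stripping colons. The names `tokenize` and `wcm_score` and the tokenization rules are assumptions made for illustration, not the author's implementation. Under that assumption the slide 16 pair scores 6, and a pair that differs only in a number still scores far above the threshold, which is the weakness slides 19 and 20 point out.

```python
def tokenize(label: str) -> set:
    """Assumed tokenization: lowercase, split on whitespace, strip colons."""
    tokens = (token.strip(":") for token in label.lower().split())
    return {token for token in tokens if token}


def wcm_score(old_label: str, new_label: str) -> int:
    """Assumed WCM score: number of distinct words shared by both labels."""
    return len(tokenize(old_label) & tokenize(new_label))


THRESHOLD = 5  # from the slides: a score above 5 counts as a good match

# Same question, relabeled (slide 16).
good = wcm_score(
    "Anxiety/Panic Used a branded prescription remedy",
    "Ailments/Remedies: : Anxiety/Panic: In last 12 months: "
    "Used a branded prescription remedy",
)

# Different questions: only the usage frequency changes (slide 20).
wrong = wcm_score(
    "Hair Tonic Or Dressing (Men): Heavy Users: 8+ Times/Last 7 Days: Total Category",
    "Hair Tonic Or Dressing (Men): Heavy Users: 3+ Times/Last 7 Days: Total Category",
)

print(good, good > THRESHOLD)    # accepted, correctly
print(wrong, wrong > THRESHOLD)  # also accepted, although the labels differ
```

Because every shared word counts equally, one changed brand name or number barely lowers the score, so a single threshold cannot separate these cases.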
  • 21. New Approach Proposed by Gan Song Criteria: • Check if one cell is a subset of another cell. • If all the cells in the shorter label can find their counterparts, a match is found. Fuzzy Matching: Levenshtein distance
  • 22. Levenshtein Distance • Levenshtein distance is a string metric for measuring the difference between two sequences. • Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
  • 23. Levenshtein Distance smtchgy smmtchg
  • 24. Levenshtein Distance smtchgy smmtchg Path 1 (2 edits): smtchgy → smmtchgy (insert 'm') → smmtchg (delete 'y') Path 2 (5 edits): smtchgy → smmchgy (change 't' to 'm') → smmthgy (change 'c' to 't') → smmtcgy (change 'h' to 'c') → smmtchy (change 'g' to 'h') → smmtchg (change 'y' to 'g')
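The two edit paths on slide 24 can be replayed step by step. The sketch below uses a hypothetical helper, `apply_edits`, with an assumed `(operation, index, character)` encoding for each edit; it only checks that both paths reach the same target and counts their steps.

```python
def apply_edits(text, edits):
    """Apply (operation, index, character) edits in order and return the result."""
    chars = list(text)
    for operation, index, char in edits:
        if operation == "insert":
            chars.insert(index, char)
        elif operation == "delete":
            del chars[index]
        elif operation == "replace":
            chars[index] = char
        else:
            raise ValueError(f"unknown operation: {operation}")
    return "".join(chars)


# Path with one insertion and one deletion.
short_path = [("insert", 1, "m"), ("delete", 7, None)]

# Path that only uses substitutions.
long_path = [
    ("replace", 2, "m"),
    ("replace", 3, "t"),
    ("replace", 4, "c"),
    ("replace", 5, "h"),
    ("replace", 6, "g"),
]

for path in (short_path, long_path):
    print(len(path), "edits ->", apply_edits("smtchgy", path))
```

Both paths end at `smmtchg`. The Levenshtein distance is the length of the shortest such path, which is 2 here: the strings have equal length and differ in five positions, so no single edit can do it.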
  • 25. [Figure: grid of permutations of the letters H, O, A, N] Shuffle!
  • 26. [Figure: the same grid of permutations of the letters H, O, A, N] Find a match!
  • 27. Cell-based Comparison Model (CCM) Label 1: Social Networking – LinkedIn How important to you: : Not at all Important :: Keep in touch with family/friends Cells: ['socialnetworkinglinkedinhowimportanttoyou', 'notatallimportant', 'keepintouchwithfamilyfriends'] Label 2: Social Networking – LinkedIn.com How important to you: : Keep in touch with family/friends: : Not at all Important Cells: ['socialnetworkinglinkedincomhowimportanttoyou', 'keepintouchwithfamilyfriends', 'notatallimportant']
  • 28. Levenshtein edit operations between cells. The table axes are labeled Old and New on the slide; each entry lists the operations that turn the row cell into the column cell.
      Columns: [1] 'socialnetworkinglinkedincomhowimportanttoyou' [2] 'keepintouchwithfamilyfriends' [3] 'notatallimportant'
      Row 'socialnetworkinglinkedinhowimportanttoyou': [1] {'insert': 3, 'replace': 0, 'delete': 0} [2] {'insert': 0, 'replace': 21, 'delete': 13} [3] {'insert': 0, 'replace': 4, 'delete': 24}
      Row 'notatallimportant': [1] {'insert': 27, 'replace': 4, 'delete': 0} [2] {'insert': 11, 'replace': 11, 'delete': 0} [3] {'insert': 0, 'replace': 0, 'delete': 0}
      Row 'keepintouchwithfamilyfriends': [1] {'insert': 16, 'replace': 20, 'delete': 0} [2] {'insert': 0, 'replace': 0, 'delete': 0} [3] {'insert': 0, 'replace': 11, 'delete': 11}
      Only a small number of insertions or deletions is accepted. Any other combination of operations is rejected as a match.
      Cell lists: ['socialnetworkinglinkedincomhowimportanttoyou', 'keepintouchwithfamilyfriends', 'notatallimportant'] and ['socialnetworkinglinkedinhowimportanttoyou', 'notatallimportant', 'keepintouchwithfamilyfriends']
  • 30. Process 1. Preprocess the labels 2. Remove duplicates 3. Compare the labels by using CCM 4. Find out good matches 5. Output the ‘old not in new’ and ‘new not in old’
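The five process steps can be sketched end to end in plain Python. Everything below is an illustrative assumption rather than the implementation behind the slides: the function names, the normalisation rule (lowercase, keep only letters, digits, `$` and `+`), the `max_edits=3` cutoff, and the greedy cell pairing are all hypothetical. "Only insertions or only deletions" is tested by checking that the Levenshtein distance equals the length difference, which rules out substitutions.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Two-row dynamic-programming Levenshtein distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,               # delete ca
                               current[j - 1] + 1,            # insert cb
                               previous[j - 1] + (ca != cb)))  # substitute
        previous = current
    return previous[-1]


def to_cells(label: str) -> list:
    """Step 1: split a label on ':' and normalise each cell (assumed rule)."""
    cells = (re.sub(r"[^a-z0-9$+]", "", part.lower()) for part in label.split(":"))
    return [cell for cell in cells if cell]


def cells_match(a: str, b: str, max_edits: int = 3) -> bool:
    """Accept a pair only if a few pure insertions or pure deletions suffice."""
    distance = levenshtein(a, b)
    return distance <= max_edits and distance == abs(len(a) - len(b))


def labels_match(old: str, new: str) -> bool:
    """Steps 3-4: every cell of the shorter label needs a counterpart, in any order."""
    shorter, longer = sorted((to_cells(old), to_cells(new)), key=len)
    remaining = list(longer)
    for cell in shorter:
        hit = next((other for other in remaining if cells_match(cell, other)), None)
        if hit is None:
            return False
        remaining.remove(hit)  # greedy pairing: each cell is used at most once
    return True


def compare(old_labels, new_labels):
    """Steps 2 and 5: deduplicate, then report the labels left unmatched."""
    old_unique = list(dict.fromkeys(old_labels))
    new_unique = list(dict.fromkeys(new_labels))
    old_not_in_new = [o for o in old_unique
                      if not any(labels_match(o, n) for n in new_unique)]
    new_not_in_old = [n for n in new_unique
                      if not any(labels_match(o, n) for o in old_unique)]
    return old_not_in_new, new_not_in_old


linkedin_a = ("Social Networking – LinkedIn How important to you: : "
              "Not at all Important :: Keep in touch with family/friends")
linkedin_b = ("Social Networking – LinkedIn.com How important to you: : "
              "Keep in touch with family/friends: : Not at all Important")
hair_8 = "Hair Tonic Or Dressing (Men): Heavy Users: 8+ Times/Last 7 Days: Total Category"
hair_3 = "Hair Tonic Or Dressing (Men): Heavy Users: 3+ Times/Last 7 Days: Total Category"

print(compare([linkedin_a, linkedin_a, hair_8], [linkedin_b, hair_3]))
```

With these inputs the two LinkedIn labels pair up despite the reordered cells and the added "com", while the two hair tonic labels are reported as unmatched because `8+` to `3+` needs a substitution. A production version would likely replace the greedy pairing with a proper assignment step and tune the cutoff against validated matches.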
  • 32. Implementation Considerations 1. Data Suitability 2. Process Design 3. Validation Methodology 4. Computing Resources
  • 33. Python
        def levenshtein(s1, s2):
            # Keep s1 as the longer string so each row has len(s2) + 1 entries
            if len(s1) < len(s2):
                return levenshtein(s2, s1)
            # The distance to an empty string is the length of the other string
            if len(s2) == 0:
                return len(s1)
            previous_row = range(len(s2) + 1)
            for i, c1 in enumerate(s1):
                current_row = [i + 1]
                for j, c2 in enumerate(s2):
                    # j + 1 instead of j: the rows are one element longer than s2
                    insertions = previous_row[j + 1] + 1
                    deletions = current_row[j] + 1
                    substitutions = previous_row[j] + (c1 != c2)
                    current_row.append(min(insertions, deletions, substitutions))
                previous_row = current_row
            return previous_row[-1]
  • 34. Scala
        def levenshtein(str1: String, str2: String): Int = {
          val lenStr1 = str1.length
          val lenStr2 = str2.length
          val d: Array[Array[Int]] = Array.ofDim(lenStr1 + 1, lenStr2 + 1)
          for (i <- 0 to lenStr1) d(i)(0) = i
          for (j <- 0 to lenStr2) d(0)(j) = j
          for (i <- 1 to lenStr1; j <- 1 to lenStr2) {
            val cost = if (str1(i - 1) == str2(j - 1)) 0 else 1
            d(i)(j) = min(
              d(i - 1)(j) + 1,        // deletion
              d(i)(j - 1) + 1,        // insertion
              d(i - 1)(j - 1) + cost  // substitution
            )
          }
          d(lenStr1)(lenStr2)
        }
        def min(nums: Int*): Int = nums.min
  • 35. Spark pyspark.sql.functions.levenshtein(left, right): computes the Levenshtein distance of the two given strings.
        from pyspark.sql.functions import levenshtein
        df = spark.createDataFrame([(<word 1>, <word 2>,)], ['l', 'r'])
        df.select(levenshtein('l', 'r').alias('d')).collect()
  • 36. Example: kitinmy vs. sitting [('replace', 0, 0), ('insert', 2, 2), ('delete', 5, 6), ('replace', 6, 6)]
  • 37. Example: Kitten vs Sitting
  • 38. Example: Kitten vs Sitten
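The images on slides 37 and 38 are not part of this transcript. As a rough stand-in, the sketch below builds the full dynamic-programming table for each example pair. The function name `levenshtein_matrix` is hypothetical, and the words are lowercased here on the assumption that the capital letters in the slide titles are only presentational.

```python
def levenshtein_matrix(source: str, target: str):
    """Return the full DP table; cell [i][j] is the distance between
    the first i characters of source and the first j characters of target."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of the source prefix
    for j in range(cols):
        d[0][j] = j  # insert all j characters of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            cost = int(source[i - 1] != target[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d


for source, target in [("kitinmy", "sitting"), ("kitten", "sitting"), ("kitten", "sitten")]:
    matrix = levenshtein_matrix(source, target)
    print(f"{source} vs {target}: distance {matrix[-1][-1]}")
    for row in matrix:
        print("   ", row)
```

The bottom-right cell of each table is the distance. For `kitinmy` vs `sitting` it equals the number of edit operations listed on slide 36.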