Name Matching at Scale:
CPU, GPU or SPARK?
Wendell Kuling and Chris Broeren
ING Wholesale Banking Advanced Analytics Team
Chris Broeren,
Data Scientist
Wendell Kuling,
Data Scientist
Overview
• Introduction to problem
• Methods to solve problem
• Brute Force approach
• Metric tree approach
• Tokenised approach
• Current status
Introduction
Wholesale bank = dealing with companies
Interested in different data sets about companies
To join multiple data sets together, we need a common key: company name
However, one company may be known by several names:
McDonalds Corporation, McDonalds, McDonald’s Corp, etc…
Therefore we need to match approximately similar company names together
Introduction
Define an existing list of company names as the ground truth (G)
Aim: match new sets of names (S1, S2, S3, … ) with G:
Without loss of generality, let’s assume we’re going to match one set of names, S with G for this talk
Source 1 (S1)    Ground Truth (G)     Source 2 (S2)    Source 3 (S3)
ABN Amro Bank    ABN Amro N.V         ABN Amro N.V     ABN Amro N.V
RBS Bank         RBS LLC              RBS LLC          RBS LLC
Rabobank         Rabobank NV          Rabobank N.V     RABOBANK NV
JP Morgan        JPM USA              JPM USA          JPM USA
ING Groep        ING Groep N.V.       ING Groep        ING N.V.
ASN Bank         ASN                  ASN              ASN
Chase Bank       Chase                Chase            Chase Bank
BINCK Bank       BINCK N.V            BINCK N.V        BINCK N.V
HSBC Bank        HSBC                 HSBC             HSBC
Westpac Bank     Westpac Australia    Westpac          Westpac Aus
Goldman Sachs    GS Global            GS Global        GS Global
Introduction
Many ways to look at problem:
• Approximate string match problem
• Nearest Neighbour Search problem
• Pattern matching
• etc…
We need to find the “closest” name in G to match to every name in S
Reality
In our first case:
• G has 12 million names
• S ranges in size from 3,000 to 5 million names
To make matters worse:
• On average, a name is 31 characters long, containing ~4 words
• The world isn’t UTF-8 compliant; we have over 160 different characters
• Although there are limited duplicates in G, some companies have similar names and hierarchical structures, which must be respected
Brute Force Method
Define a function that measures how close two names are
The closer the names are to each other, the more similar they are
Calculate the closeness of a name to every candidate and choose the closest
Ensemble different functions to get better results
Brute Force Method
There are many word similarity functions. An example is the Levenshtein distance.
Levenshtein distance is the minimum number of single-character edits (substitutions, insertions or deletions) it takes to make two strings equal.
Example: levenshtein(“ABN Amro Bank”, “RBS Bank”)
• ABN Amro Bank -> RBN Amro Bank (replace A with R: 1 edit)
• RBN Amro Bank -> RBN Bank (remove “Amro” and one space: 5 edits)
• RBN Bank -> RBS Bank (replace N with S: 1 edit)
Therefore levenshtein(“ABN Amro Bank”, “RBS Bank”) = 1 + 5 + 1 = 7
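A minimal sketch of the distance and of the exhaustive search it implies, using the standard dynamic-programming formulation of Levenshtein distance. The function names `levenshtein` and `closest_name` are illustrative and do not come from the slides; a production setup would more likely use an optimised package than pure Python.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character substitutions, insertions and deletions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def closest_name(name, candidates):
    """Brute force: score every candidate and keep the smallest distance."""
    scored = ((candidate, levenshtein(name, candidate)) for candidate in candidates)
    return min(scored, key=lambda pair: pair[1])


G = ["ABN Amro N.V", "RBS LLC", "Rabobank NV", "JPM USA", "ING Groep N.V.", "ASN",
     "Chase", "BINCK N.V", "HSBC", "Westpac Australia", "GS Global"]

print(levenshtein("ABN Amro Bank", "RBS Bank"))  # 7: the removed "Amro" takes a space with it
print(closest_name("ABN Amro Bank", G))          # ('ABN Amro N.V', 4)
```

The comparison is case-sensitive, so “r” and “R” count as different characters; lower-casing during cleaning would change the distances.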
Brute Force Method
Every name in S is compared against every name in G:
• “ABN Amro Bank” vs {“ABN Amro N.V”, … , “GS Global”}
• “RBS Bank” vs {“ABN Amro N.V”, … , “GS Global”}
• …
• “Goldman Sachs” vs {“ABN Amro N.V”, … , “GS Global”}
Brute Force Method
• Problem: 12 million names in G, 5 million names in S
• This is 60,000,000,000,000 similarity calculations
• The Levenshtein algorithm has a time complexity of O(mn), where m and n are the lengths of the two strings
• If we could do 10 similarity calculations a second, we would be here for ~190,000 years
• In parallel on 10,000 cores: still ~19 years
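The arithmetic behind these numbers can be reproduced in a few lines. The rate of 10 calculations per second and the 10,000 cores are taken from the slide; perfect parallel scaling and a 365-day year are simplifying assumptions.

```python
names_in_g = 12_000_000
names_in_s = 5_000_000
comparisons = names_in_g * names_in_s        # every name in S against every name in G

rate_per_core = 10                           # similarity calculations per second (from the slide)
seconds_per_year = 365 * 24 * 3600

years_single_core = comparisons / rate_per_core / seconds_per_year
years_10k_cores = years_single_core / 10_000  # assumes perfect parallel scaling

print(f"{comparisons:,} comparisons")
print(f"{years_single_core:,.0f} years on one core")
print(f"{years_10k_cores:.1f} years on 10,000 cores")
```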
Know which package to use for edit-based
distances
Fuzzywuzzy: string matching like a boss… but for
smaller sets only
Metric Tree Method
We can think of names as points in some topological space
We don’t necessarily need to know the absolute location of a word in that space, just the relative distance between points
Therefore we still use a distance function (as per brute force), but define it so that it satisfies some mathematical properties:
1. d(x,y) = 0 <-> x = y
2. d(x,y) = d(y,x)
3. d(x,z) <= d(x,y) + d(y,z)
Such a function is known as a metric. We can save ourselves time by organising the words into a tree structure that preserves the metric distances between words
Metric Tree Method
Once we create this metric tree, we can query the nearest neighbour by
traversing the tree, blocking out “known far away words” - effectively
reducing the search space
Figure: example metric tree; edge labels are edit distances to the parent word
Book
  1: Hook
    1: Cook
    2: Boek
  2: Bowl
    1: Bow
  4: Head
    1: Dead
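One metric tree that produces exactly this kind of picture is a BK-tree, where each child hangs off its parent on an edge labelled with their distance. The slides do not name the structure they used, so the sketch below is an assumption; the class `BKTree` and its methods are illustrative names. The triangle inequality is what allows whole subtrees to be skipped during a query.

```python
def levenshtein(a, b):
    """Edit distance via the standard dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


class BKTree:
    """Each node is a pair (word, {edge_distance: child_node})."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return                      # word is already in the tree
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})     # new leaf on the edge labelled d
                return
            node = child                    # descend along the edge labelled d

    def query(self, word, radius):
        """Return all (stored_word, distance) pairs with distance <= radius."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            node_word, children = stack.pop()
            d = self.distance(word, node_word)
            if d <= radius:
                results.append((node_word, d))
            for edge, child in children.items():
                # Triangle inequality: matches can only sit on edges in [d - radius, d + radius].
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(results, key=lambda pair: (pair[1], pair[0]))


tree = BKTree(levenshtein)
for w in ["Book", "Hook", "Bowl", "Head", "Cook", "Boek", "Bow", "Dead"]:
    tree.add(w)

print(tree.query("Boot", 1))  # [('Book', 1)]
```

In the query for “Boot” with radius 1, the “Head” branch (edge 4) is never visited, which is the “blocking out known far away words” effect.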
Metric Tree Method
Building the tree is feasible with ~2.7 mln different words: O(n log(n))
Typically, all words within a distance of 1 are found in ~1 sec
Build + query time is still years’ worth of calculation
• Added problem of building a tree in parallel
• Lots of space required
• Worst-case performance is actually bad
Tokenised Method
Break name up into components (tokenising)
Many different types of tokens available: words, grams
Do this for all names in both G and S (this creates two matrices [names x tokens])
Example: indicator-function word tokeniser:
                 ABN  RBS  BANK  Rabobank  NV
ABN Amro Bank     1    0    1       0       0
RBS Bank          0    1    1       0       0
Rabobank NV       0    0    0       1       1
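A small sketch of how such a [names x tokens] indicator matrix can be built. It assumes lower-cased, whitespace-separated word tokens, and the helper names are illustrative. Unlike the table on the slide, it also creates a column for “amro”, because every token that occurs gets a column.

```python
def word_tokens(name):
    """Lower-case the name and split it on whitespace."""
    return name.lower().split()


def indicator_matrix(names):
    """Return (vocabulary, rows); rows[i][j] is 1 if name i contains token j."""
    vocabulary = {}
    for name in names:
        for token in word_tokens(name):
            vocabulary.setdefault(token, len(vocabulary))  # column index in order of first appearance
    rows = []
    for name in names:
        row = [0] * len(vocabulary)
        for token in word_tokens(name):
            row[vocabulary[token]] = 1
        rows.append(row)
    return list(vocabulary), rows


columns, matrix = indicator_matrix(["ABN Amro Bank", "RBS Bank", "Rabobank NV"])
print(columns)
for row in matrix:
    print(row)
```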
Tokenised Method
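• For a given number of tokens d:
• $G \in \mathbb{R}^{n_G \times d}$: matrix of names in G
• $S \in \mathbb{R}^{n_S \times d}$: matrix of names in S
• The dot product of $G$ and $S^T$ yields $C \in \mathbb{R}^{n_G \times n_S}$
• Row i, column j of $C$ corresponds to the inner product of the tokens of the i-th name in G and the j-th name in S
$$C = G \cdot S^T$$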
Tokenised Method
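• Why the dot product?
• The elements of the product $C = G \cdot S^T$ look somewhat familiar to us:
$$C_{ij} = g_i \cdot s_j = \lVert g_i \rVert_2 \, \lVert s_j \rVert_2 \cos\theta_{ij}$$
• Here $g_i$ is the i-th row of G and $s_j$ the j-th row of S: each element is the cosine similarity of the two name-token vectors multiplied by their L2 norms
• If we normalise the token vectors on creation, we end up calculating the cosine similarity directly!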
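To illustrate the point, the sketch below normalises each token vector to unit L2 norm and then takes plain dot products, which gives the cosine similarities. It uses small dense lists for readability; the rows of `G` follow the indicator example, the two rows of `S` are made up for illustration, and the function names are hypothetical.

```python
import math


def normalise(vector):
    """Scale a vector to unit L2 norm (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def similarity_matrix(G, S):
    """Entry [i][j] is the cosine similarity between row i of G and row j of S."""
    G_unit = [normalise(row) for row in G]
    S_unit = [normalise(row) for row in S]
    return [[sum(g * s for g, s in zip(g_row, s_row)) for s_row in S_unit]
            for g_row in G_unit]


# Columns: ABN, RBS, BANK, Rabobank, NV
G = [[1, 0, 1, 0, 0],   # ABN Amro Bank
     [0, 1, 1, 0, 0],   # RBS Bank
     [0, 0, 0, 1, 1]]   # Rabobank NV
S = [[1, 0, 1, 0, 0],   # illustrative name with tokens ABN, BANK
     [0, 0, 0, 0, 1]]   # illustrative name with token NV only

C = similarity_matrix(G, S)
for row in C:
    print([round(value, 3) for value in row])
```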
Tokenised Method
• Same number of total comparisons as brute-force
• But inner-products are cheap to calculate
• Tokenised matrices can be computed offline cheaply
• Tokenised methods allow for vectorisation and allow for increased memory
and CPU efficiency
• We can even compute this on a GPU cluster
Current status
Preprocessing steps turn out to be relatively cheap (fast), whereas the calculation is expensive
Preprocessing:
• Read data (Hive): <5 mins
• Clean data: <5 mins
• Build ‘G’ TFIDF matrix: <5 mins
• Build ‘S’ TFIDF matrix: <5 mins
Calculate: xxx hours
Things you wish you had known before (1/4): Read data (Hive)
Runs out of memory
(or use Python 3.x ;))
Things you wish you had known before (2/4): Clean data
tokenize(‘McDonaldś’)
Things you wish you had known before (3/4): Build ‘G’ TFIDF matrix
Standard token_pattern (‘(?u)\b\w\w+\b’) ignores single letters:
‘Taxibedrijf M. van Seben’ -> [‘Taxibedrijf’, ‘van’, ‘Seben’]
Use token_pattern (‘(?u)\b\w+\b’) for ‘full’ tokenization
(token_pattern = u’(?u)\S’, ngram_range=(3, 3)) gives 3-gram matching
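Assuming the patterns are the regular expressions `(?u)\b\w\w+\b`, `(?u)\b\w+\b` and `(?u)\S`, their effect can be checked with the standard `re` module, without building a vectorizer. The helper names are illustrative, and joining n-gram tokens with a space is meant to mimic how word n-grams are usually formed; casing is left untouched here.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # words of two or more characters
FULL_PATTERN = r"(?u)\b\w+\b"       # also keeps single-character words
CHAR_PATTERN = r"(?u)\S"            # every non-whitespace character is a token


def tokenize(text, pattern):
    return re.findall(pattern, text)


def ngrams(tokens, n):
    """Consecutive runs of n tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


name = "Taxibedrijf M. van Seben"
print(tokenize(name, DEFAULT_PATTERN))              # ['Taxibedrijf', 'van', 'Seben']
print(tokenize(name, FULL_PATTERN))                 # ['Taxibedrijf', 'M', 'van', 'Seben']
print(ngrams(tokenize("ING NV", CHAR_PATTERN), 3))  # ['I N G', 'N G N', 'G N V']
```

With the default pattern the initial “M” disappears, so two companies that differ only in a single-letter initial end up with identical token sets.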
Things you wish you had known before (4/4): Build ‘S’ TFIDF matrix
The standard ‘transform’ function of the Sklearn TfidfVectorizer ignores unseen tokens
-> either transform using a customized function, or tokenise on the combination of G and S
match(‘JonasTheMan Nederland’) -> 100% match with ‘Nederland Nederland’?
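The effect can be reproduced without any library. The sketch below uses raw token counts instead of TF-IDF weights (a simplification) and a vocabulary that is either built from G alone or from G and S together; all function names are illustrative.

```python
import math
from collections import Counter


def tokens(name):
    return name.lower().split()


def vectorise(name, vocabulary):
    """Token counts restricted to the vocabulary: unseen tokens are silently dropped."""
    return dict(Counter(t for t in tokens(name) if t in vocabulary))


def cosine(u, v):
    dot = sum(weight * v.get(token, 0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


g_name = "Nederland Nederland"
s_name = "JonasTheMan Nederland"

vocab_g_only = set(tokens(g_name))                  # vocabulary fitted on G only
vocab_combined = vocab_g_only | set(tokens(s_name))  # vocabulary fitted on G and S

score_g_only = cosine(vectorise(s_name, vocab_g_only), vectorise(g_name, vocab_g_only))
score_combined = cosine(vectorise(s_name, vocab_combined), vectorise(g_name, vocab_combined))

print(round(score_g_only, 3))    # 1.0: 'jonastheman' was dropped, so the match looks perfect
print(round(score_combined, 3))  # 0.707: the unseen token now lowers the score
```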
Calculation of cosine similarity: matrix multiplication using Numpy/Scipy
Using Numpy and Scipy: fast matrix multiplication of sparse matrices. Suggested format: CSR.
G (# company names x # tokens) multiplied by S.Transpose (# tokens x # company names):
$$\begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & .7 & 0 & 0 & .7 \\ 0 & 0 & .6 & .6 & .6 \end{pmatrix} \times \begin{pmatrix} .7 \\ 0 \\ 0 \\ 0 \\ .7 \end{pmatrix} = \begin{pmatrix} .7 \\ .49 \\ .42 \end{pmatrix}$$
Argmax = best match
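The numbers on this slide can be checked with a hand-written CSR (compressed sparse row) product. The three arrays `data`, `indices` and `indptr` mirror the attributes that SciPy's CSR matrices expose, but the sketch is plain Python and is only meant to show what the format stores, not how SciPy implements the multiplication.

```python
def csr_matvec(data, indices, indptr, dense_vector):
    """Multiply a CSR matrix by a dense vector, touching only the stored non-zeros."""
    scores = []
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]  # slice of non-zeros belonging to this row
        scores.append(sum(data[k] * dense_vector[indices[k]] for k in range(start, end)))
    return scores


# G from the slide: 3 company names x 5 tokens, stored row by row.
data = [1.0, 0.7, 0.7, 0.6, 0.6, 0.6]  # non-zero values
indices = [0, 1, 4, 2, 3, 4]           # column of each non-zero value
indptr = [0, 1, 3, 6]                  # row r owns data[indptr[r]:indptr[r + 1]]

s = [0.7, 0.0, 0.0, 0.0, 0.7]          # one name from S as a dense token vector

scores = csr_matvec(data, indices, indptr, s)
best = max(range(len(scores)), key=scores.__getitem__)  # argmax = best match

print([round(x, 2) for x in scores])  # [0.7, 0.49, 0.42]
print(best)                           # 0
```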
Look at 0.01% of the ‘G’ matrix:
what do you notice?
Input:
Sparsity: ~0.0001%
(~3 tokens per 2.6 mln columns)
Storage required: ~2 GB
Output:
Sparsity: ~0.5%
Storage required: ~10 TB
Depending on resolution, distance and eye-sight:
white dots can be seen for non-zero entries
Introducing the three contestants for the calculation part…
• Cruncher: 48 cores, 512 GB RAM
• Tesla: GPUs with 3x2496 threads, 3x12 GB
• Spark cluster: 150 cores, 2.5 TB of memory
Numpy matrix multiplication:
first ~100 extra slices are cheap
Scipy/Numpy sparse matrix multiplication:
most expensive and highly-optimized function
Effectively using 1 core, 100 rows / iteration: ~140 matches per second
(additional memory usage: ~1 GB)
Tesla - GPU multiplication:
PyCuda is flexible, but requires deep C++ knowledge
Current custom kernel works with
Sparse Matrix x Dense Vector
(slice = 1)
Didn’t distribute the data across the GPU up-front
Using single GPU at the moment
…so, in short, further optimizations are possible!
Using 1 GPU, slice of 1 and Sparse x Dense multiplication:
~50 matches per second
Spark cluster: broadcast both sparse matrices, use RDD with just the row-indices to work on
Step 1: the driver pushes matrices G and S to the workers (broadcast variable); every worker node holds a broadcast copy of G and S.T
Step 2: the driver distributes an RDD with ‘chunks’ of row-indices: map ‘multiply & argmax’
• Worker node: work on rows 0-9, return argmax(G.dot(S.T)) for 0-9
• Worker node: work on rows 10-19, return argmax(G.dot(S.T)) for 10-19
• etc.
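The pattern can be sketched locally without a cluster: a worker function receives only a chunk of row indices and looks up the shared matrices itself. Here sparse rows are plain dicts and the chunks are processed in an ordinary loop; the comments show where Spark's `sc.broadcast`, `sc.parallelize` and `flatMap` would presumably take over. Slicing over the rows of S, and all function names, are assumptions, since the slides do not show the code.

```python
def dot(u, v):
    """Inner product of two sparse rows stored as {token_index: weight}."""
    return sum(weight * v.get(index, 0.0) for index, weight in u.items())


def chunks(n_rows, size):
    """Split the row indices 0..n_rows-1 into consecutive chunks."""
    return [range(start, min(start + size, n_rows)) for start in range(0, n_rows, size)]


def make_worker(G_rows, S_rows):
    # On Spark, G_rows and S_rows would be broadcast variables (sc.broadcast(...))
    # and the worker would read them through their .value attribute.
    def best_matches(row_indices):
        results = []
        for i in row_indices:
            scores = [dot(S_rows[i], g_row) for g_row in G_rows]
            j = max(range(len(scores)), key=scores.__getitem__)  # argmax over G
            results.append((i, j, scores[j]))
        return results
    return best_matches


G_rows = [{0: 1.0}, {1: 0.7, 4: 0.7}, {2: 0.6, 3: 0.6, 4: 0.6}]
S_rows = [{0: 0.7, 4: 0.7}, {1: 1.0}, {3: 1.0}]

worker = make_worker(G_rows, S_rows)

# Local stand-in for: sc.parallelize(chunks(len(S_rows), 2)).flatMap(worker).collect()
matches = [match for chunk in chunks(len(S_rows), 2) for match in worker(chunk)]
print([(i, j) for i, j, _ in matches])  # [(0, 0), (1, 1), (2, 2)]
```

Only the small chunks of indices travel with each task; the large matrices are shipped to every worker once.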
Using the standard TFIDF implementation from Spark MLlib:
vector-by-vector multiplication (scalable, but slow) + hashing
Spark cluster: scales with only small modifications
to original Python code
612,630 matches in 12 containers, 12 cores/container, chunks
of 20 rows in ~5 min: 2000 matches / sec
Concluding for name-matching using Python

Falcon Invoice Discounting: Aviate Your Cash Flow Challenges
 
Falcon Invoice Discounting: Empowering Your Business Growth
Falcon Invoice Discounting: Empowering Your Business GrowthFalcon Invoice Discounting: Empowering Your Business Growth
Falcon Invoice Discounting: Empowering Your Business Growth
 
Mckinsey foundation level Handbook for Viewing
Mckinsey foundation level Handbook for ViewingMckinsey foundation level Handbook for Viewing
Mckinsey foundation level Handbook for Viewing
 
BeMetals Investor Presentation_May 3, 2024.pdf
BeMetals Investor Presentation_May 3, 2024.pdfBeMetals Investor Presentation_May 3, 2024.pdf
BeMetals Investor Presentation_May 3, 2024.pdf
 
HomeRoots Pitch Deck | Investor Insights | April 2024
HomeRoots Pitch Deck | Investor Insights | April 2024HomeRoots Pitch Deck | Investor Insights | April 2024
HomeRoots Pitch Deck | Investor Insights | April 2024
 

PyData Amsterdam - Name Matching at Scale

  • 1. Name Matching at Scale: CPU, GPU or SPARK? Wendell Kuling and Chris Broeren ING Wholesale Banking Advanced Analytics Team
  • 2. Chris Broeren, Data Scientist Wendell Kuling, Data Scientist
  • 3. Overview • Introduction to problem • Methods to solve problem • Brute Force approach • Metric tree approach • Tokenised approach • Current status
  • 4. Introduction Wholesale bank = dealing with companies. Interested in different data sets about companies. To join multiple data sets together, we need a common key: company name. However, one company may go by different names: McDonalds Corporation, McDonalds, McDonald’s Corp, etc. Therefore we need to match approximately similar company names together.
  • 5. Introduction Define an existing list of company names as the ground truth (G). Aim: match new sets of names (S1, S2, S3, …) with G. Without loss of generality, assume for this talk that we match one set of names, S, with G. Example name lists on the slide (columns labelled Ground Truth, Source 1, Source 2 and Source 3, i.e. G, S1, S2, S3): [ABN Amro Bank, RBS Bank, Rabobank, JP Morgan, ING Groep, ASN Bank, Chase Bank, BINCK Bank, HSBC Bank, Westpac Bank, Goldman Sachs]; [ABN Amro N.V, RBS LLC, Rabobank NV, JPM USA, ING Groep N.V., ASN, Chase, BINCK N.V, HSBC, Westpac Australia, GS Global]; [ABN Amro N.V, RBS LLC, Rabobank N.V, JPM USA, ING Groep, ASN, Chase, BINCK N.V, HSBC, Westpac, GS Global]; [ABN Amro N.V, RBS LLC, RABOBANK NV, JPM USA, ING N.V., ASN, Chase Bank, BINCK N.V, HSBC, Westpac Aus, GS Global]
  • 6. Introduction Many ways to look at the problem: • Approximate string matching problem • Nearest Neighbour Search problem • Pattern matching • etc. We need to find the “closest” name in G for every name in S
  • 7. Reality In our first case: • G has 12 million names • S ranges in length between 3000 and 5 mln names To make matters worse: • On average, a name is 31 characters long, containing ~4 words • The world isn’t UTF8 compliant, we have over 160 characters • Although there are limited duplicates in G, some companies have similar names and have hierarchical structures which must be observed
  • 8. Overview • Introduction to problem • Methods to solve problem • Brute Force approach • Metric tree approach • Tokenised approach • Current status
  • 9. Brute Force Method Define a function to measure word closeness: The closer the names are to each other, the more similar they are Calculate closeness for each word and choose the closest Ensemble with different functions to get better results
  • 10. Brute Force Method There are many word similarity functions. An example is the Levenshtein distance, which calculates the minimum number of single-character edits (substitutions, insertions or deletions) it takes to make two strings equal. Example: levenshtein(“ABN Amro Bank”, “RBS Bank”) • ABN Amro Bank -> RBN Amro Bank (replace A with R) • RBN Amro Bank -> RBN Bank (remove “Amro” and one of the two spaces around it) • RBN Bank -> RBS Bank (replace N with S) Therefore levenshtein(“ABN Amro Bank”, “RBS Bank”) = 1 + 5 + 1 = 7
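The edit-distance definition on this slide can be sketched in a few lines of plain Python. This is a textbook dynamic-programming version written for illustration, not the package the speakers used; the function name `levenshtein` simply mirrors the slide. Because the count is per character, the space next to “Amro” is an edit too, so the sketch returns 7 for the slide's example pair.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character substitutions, insertions
    and deletions needed to turn string a into string b."""
    # Only the previous row of the dynamic-programming table is kept.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(levenshtein("ABN Amro Bank", "RBS Bank"))  # 7
```

The two nested loops are the O(mn) cost per pair that makes the brute-force approach expensive at scale.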
  • 11. Brute Force Method • “ABN Amro Bank” vs {“ABN Amro N.V”, … , “GS Global”} Name lists on the slide, labelled S and G: [ABN Amro Bank, RBS Bank, Rabobank, JP Morgan, ING Groep, ASN Bank, Chase Bank, BINCK Bank, HSBC Bank, Westpac Bank, Goldman Sachs]; [ABN Amro N.V, RBS LLC, Rabobank NV, JPM USA, ING Groep N.V., ASN, Chase, BINCK N.V, HSBC, Westpac Australia, GS Global]
  • 12. Brute Force Method • “RBS Bank” vs {“ABN Amro N.V”, … , “GS Global”} Same S and G name lists as on slide 11
  • 13. Brute Force Method • “Goldman Sachs” vs {“ABN Amro N.V”, … , “GS Global”} Same S and G name lists as on slide 11
  • 14. Brute Force Method • Problem: 12 million names in G, 5 million names in S • This is 60,000,000,000,000 similarity calculations • The Levenshtein algorithm has a time complexity of O(mn), where m and n are the lengths of the two strings • If we could do 10 similarity calculations a second, we would be here for ~190,000 years • In parallel on 10,000 cores: ~19 years
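The back-of-the-envelope numbers on this slide can be reproduced with a short calculation. The sketch assumes a 365.25-day year and perfect scaling across cores (each core doing 10 calculations a second); the helper name `years_needed` is made up for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

names_in_g = 12_000_000
names_in_s = 5_000_000
comparisons = names_in_g * names_in_s  # every name in S against every name in G


def years_needed(rate_per_core: float, cores: int = 1) -> float:
    """Wall-clock years for all comparisons, assuming perfect parallel scaling."""
    return comparisons / (rate_per_core * cores) / SECONDS_PER_YEAR


print(f"{comparisons:,} comparisons")
print(f"1 core:       {years_needed(10):,.0f} years")
print(f"10,000 cores: {years_needed(10, 10_000):,.1f} years")
```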
  • 15. Know which package to use for edit-based distances
  • 16. Fuzzywuzzy: string matching like a boss… but for smaller sets only
  • 17. Overview • Introduction to problem • Methods to solve problem • Brute Force approach • Metric tree approach • Tokenised approach • Current status
  • 18. Metric Tree Method We can think of names as points in some topological space. We don’t necessarily need to know the absolute location of a word in that space, just the relative distance between points. Therefore we still use a distance function (as per brute force), but define it so that it satisfies some mathematical properties: 1. d(x,y) = 0 <-> x = y 2. d(x,y) = d(y,x) 3. d(x,z) <= d(x,y) + d(y,z) Such a function is known as a metric. We can save ourselves time by organising the words into a tree structure that preserves metric distances between words
  • 19. Metric Tree Method Once we create this metric tree, we can query the nearest neighbour by traversing the tree, blocking out “known far away words” and so effectively reducing the search space. (Diagram: example tree over the words Book, Bowl, Hook, Head, Cook, Boek, Bow and Dead, with edges labelled by the distances 1, 2, 4, 1, 2, 1, 1)
  • 20. Metric Tree Method Building the tree is quite feasible with ~2.7 mln different words: O(n log(n)). Typically, all words with a distance of 1 are determined in ~1 sec. Build + query time is still years’ worth of calculation • Added problem of building a tree in parallel • Lots of space required • Worst-case performance is actually bad
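One common metric tree for edit distances is the BK-tree. The slides do not name the exact structure used, so the sketch below is an assumption chosen for illustration; the class names `Node` and `BKTree` are hypothetical. It shows how the triangle inequality lets a query skip whole subtrees of “known far away words”.

```python
from typing import Callable, Dict, List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (a metric on strings)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j - 1] + (char_a != char_b),
                               current[j - 1] + 1,
                               previous[j] + 1))
        previous = current
    return previous[-1]


class Node:
    def __init__(self, word: str) -> None:
        self.word = word
        self.children: Dict[int, "Node"] = {}  # edge label = distance to this node


class BKTree:
    def __init__(self, distance: Callable[[str, str], int]) -> None:
        self.distance = distance
        self.root: Optional[Node] = None

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = Node(word)
            return
        node = self.root
        while True:
            d = self.distance(word, node.word)
            if d == 0:
                return  # word is already in the tree
            if d not in node.children:
                node.children[d] = Node(word)
                return
            node = node.children[d]  # descend along the edge labelled d

    def query(self, word: str, max_distance: int) -> List[Tuple[int, str]]:
        """All stored words within max_distance of word, sorted by distance."""
        results: List[Tuple[int, str]] = []
        stack = [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = self.distance(word, node.word)
            if d <= max_distance:
                results.append((d, node.word))
            # Triangle inequality: a subtree behind edge label e can only hold
            # matches if d - max_distance <= e <= d + max_distance.
            for edge, child in node.children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results)


tree = BKTree(levenshtein)
for w in ["Book", "Bowl", "Hook", "Head", "Cook", "Boek", "Bow", "Dead"]:
    tree.add(w)

print(tree.query("Boak", 1))  # [(1, 'Boek'), (1, 'Book')]
```

For the query above, the subtree under “Head” (edge label 4 from the root) is never visited, which is the pruning the slide describes.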
  • 21. Overview • Introduction to problem • Methods to solve problem • Brute Force approach • Metric tree approach • Tokenised approach • Current status
  • 22. Tokenised Method Break each name up into components (tokenising). Many different types of tokens are available: words, n-grams. Do this for all names in both G and S (this creates two matrices [names x tokens]). Example: indicator-function word tokeniser over the token columns (ABN, RBS, BANK, Rabobank, NV): ABN Amro Bank -> [1, 0, 1, 0, 0]; RBS Bank -> [0, 1, 1, 0, 0]; Rabobank NV -> [0, 0, 0, 1, 1]
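  • 23. Tokenised Method • For a given token length d: • a [names x d] matrix of the names in G • a [names x d] matrix of the names in S • The dot product of the G matrix and the transposed S matrix yields a [names in G x names in S] matrix • Row i, column j of that product corresponds to the inner product of the tokens of the i-th name in G and the j-th name in S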
  • 24. Tokenised Method • Why the dot product? • The elements of the product matrix look somewhat familiar to us: • each element is the cosine similarity of the two name-token vectors multiplied by their L2 norms • If we normalise the token vectors on creation, we end up calculating the cosine-similarity measure directly!
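To make the link between the dot product and cosine similarity concrete, here is a small dependency-free sketch. It assumes a whitespace word tokeniser with upper-casing and indicator (0/1) weights; the helper names and the three-name toy lists are illustrative only and not taken from the speakers' code.

```python
import math
from typing import List


def tokenise(name: str) -> List[str]:
    # Assumption: upper-case and split on whitespace; real cleaning is more involved.
    return name.upper().split()


def normalised_vector(name: str, vocabulary: List[str]) -> List[float]:
    """Indicator vector over the vocabulary, scaled to unit L2 norm."""
    tokens = set(tokenise(name))
    vector = [1.0 if token in tokens else 0.0 for token in vocabulary]
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def dot(u: List[float], v: List[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


g_names = ["ABN Amro N.V", "RBS LLC", "Rabobank NV"]
s_names = ["ABN Amro Bank", "RBS Bank", "Rabobank NV"]

vocabulary = sorted({t for name in g_names + s_names for t in tokenise(name)})
G = [normalised_vector(name, vocabulary) for name in g_names]
S = [normalised_vector(name, vocabulary) for name in s_names]

# Row i, column j: cosine similarity between the i-th name in G and the j-th name in S.
similarity = [[dot(g, s) for s in S] for g in G]

for name, row in zip(g_names, similarity):
    print(name, [round(value, 2) for value in row])
```

Because every row already has unit length, no division by the norms is needed at comparison time; the plain dot product is the cosine similarity.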
  • 25. Tokenised Method • Same number of total comparisons as brute-force • But inner-products are cheap to calculate • Tokenised matrices can be computed offline cheaply • Tokenised methods allow for vectorisation and allow for increased memory and CPU efficiency • We can even compute this on a GPU cluster
  • 26. Overview • Introduction to problem • Methods to solve problem • Brute Force approach • Metric tree approach • Tokenised approach • Current status
  • 27. Preprocessing steps turn out to be relatively cheap (fast), whereas the calculation is expensive. Preprocessing: Read data (Hive) (<5 mins) -> Clean data (<5 mins) -> Build ‘G’ TFIDF matrix (<5 mins) -> Build ‘S’ TFIDF matrix (<5 mins). Calculate: xxx hours
  • 28. Things you wish you had known before (1/4)… Read data (Hive): runs out of memory
  • 29. Things you wish you had known before (2/4)… Clean data: tokenize(‘McDonaldś’) (or use Python 3.x ;))
  • 30. Things you wish you had known before (3/4)… Build ‘G’ TFIDF matrix: the standard token_pattern (‘(?u)\b\w\w+\b’) ignores single letters: ‘Taxibedrijf M. van Seben’ -> [‘Taxibedrijf’, ‘van’, ‘Seben’]. Use token_pattern (‘(?u)\b\w+\b’) for ‘full’ tokenization. (token_pattern = u’(?u)\S', ngram_range=(3, 3)) gives 3-gram matching
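The effect of the token patterns can be checked with the standard `re` module alone. The sketch assumes the patterns are the usual scikit-learn ones with their backslashes in place (`(?u)\b\w\w+\b` is the library default); `char_ngrams` is a hypothetical helper that only imitates what a one-character token pattern combined with `ngram_range=(3, 3)` would produce.

```python
import re
from typing import List

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn default: tokens of 2+ word characters
FULL_PATTERN = r"(?u)\b\w+\b"       # also keeps single-letter tokens
CHAR_PATTERN = r"(?u)\S"            # every non-whitespace character is a token

name = "Taxibedrijf M. van Seben"

print(re.findall(DEFAULT_PATTERN, name))  # ['Taxibedrijf', 'van', 'Seben']
print(re.findall(FULL_PATTERN, name))     # ['Taxibedrijf', 'M', 'van', 'Seben']


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Join n consecutive one-character tokens with spaces, as word n-grams are joined."""
    chars = re.findall(CHAR_PATTERN, text)
    return [" ".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


print(char_ngrams("ING NV"))  # ['I N G', 'N G N', 'G N V']
```

Note that the vectoriser lower-cases input by default, which this regex-only sketch does not do.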
  • 31. Things you wish you had known before (4/4)… Build ‘S’ TFIDF matrix: the standard ‘transform’ function of the scikit-learn TfidfVectorizer ignores unseen tokens -> either transform using a customized function, or tokenise on the combination of G and S. match(‘JonasTheMan Nederland’) -> 100% match with ‘Nederland Nederland’?
  • 32. Calculation of cosine similarity: matrix multiplication using Numpy/Scipy. Numpy and Scipy offer fast matrix multiplication of sparse matrices. Suggested format: CSR. Example: G (# company names x # tokens) = [[1, 0, 0, 0, 0], [0, .7, 0, 0, .7], [0, 0, .6, .6, .6]]; S.Transpose (# tokens, transposed) = [.7, 0, 0, 0, .7]; G x S.Transpose = [.7, .49, .42]. Argmax = best match
  • 33. Look at 0.01% of the ‘G’ matrix: what do you notice? Input: Sparsity: ~0.0001% (~3 tokens per 2.6 mln columns) Storage required: ~2 GB Output: Sparsity: ~0.5% Storage required: ~10 TB Depending on resolution, distance and eye-sight: white dots can be seen for non-zero entries
  • 34. Introducing the three contestants for the calculation part… Cruncher: 48 cores, 512 GB RAM. Tesla: GPUs, 3x2496 threads, 3x12 GB. Spark cluster: 150 cores, 2.5 TB of memory
  • 35. Numpy matrix multiplication: first ~100 extra slices are cheap
  • 36. Scipy/Numpy sparse matrix multiplication: most expensive and highly-optimized function Effectively using 1 core, 100 rows / iteration: ~140 matches per second (additional memory usage: ~1 GB)
  • 37. Tesla - GPU multiplication: PyCUDA is flexible, but requires deep C++ knowledge • Current custom kernel works with Sparse Matrix x Dense Vector (slice = 1) • Didn’t distribute the data across the GPU up-front • Using a single GPU at the moment • …so, in short, further optimizations are possible! Using 1 GPU, a slice of 1 and Sparse x Dense multiplication: ~50 matches per second
  • 38. Spark cluster: broadcast both sparse matrices, use an RDD with just the row-indices to work on. Step 1: the driver pushes matrices G and S to the workers as a broadcast variable (each worker node receives G and S.T). Step 2: the driver distributes an RDD with ‘chunks’ of row-indices and maps ‘multiply & argmax’ over them: one worker node works on rows 0-9 and returns argmax(G.dot(S.T)) for 0-9, another works on rows 10-19 and returns argmax(G.dot(S.T)) for 10-19, etc. Using the standard TFIDF implementation from Spark MLlib: vector-by-vector multiplication (scalable, but slow) + hashing
  • 39. Spark cluster: scales with only small modifications to original Python code 612,630 matches in 12 containers, 12 cores/container, chunks of 20 rows in ~5 min: 2000 matches / sec
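The “chunks of row-indices, then multiply & argmax” pattern from the calculation slides can be sketched without any cluster. The version below is a plain-Python stand-in under stated assumptions: sparse rows are dictionaries instead of CSR matrices, the toy weights are illustrative, each chunk holds rows of S that are scored against all of G, and the names `sparse_dot`, `best_matches` and `chunks` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

SparseRow = Dict[int, float]  # token index -> weight, zeros are simply absent

# Toy data: three names in G, two names in S, five token columns.
G: List[SparseRow] = [{0: 1.0}, {1: 0.7, 4: 0.7}, {2: 0.6, 3: 0.6, 4: 0.6}]
S: List[SparseRow] = [{0: 0.7, 4: 0.7}, {2: 1.0}]


def sparse_dot(u: SparseRow, v: SparseRow) -> float:
    """Inner product that only touches the non-zero entries of the shorter row."""
    if len(u) > len(v):
        u, v = v, u
    return sum(weight * v.get(token, 0.0) for token, weight in u.items())


def best_matches(row_indices: Sequence[int],
                 g_rows: List[SparseRow],
                 s_rows: List[SparseRow]) -> List[Tuple[int, int, float]]:
    """For each S row in the chunk: (S index, best G index, similarity)."""
    results = []
    for j in row_indices:
        scores = [sparse_dot(g, s_rows[j]) for g in g_rows]
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        results.append((j, best, scores[best]))
    return results


def chunks(n_rows: int, size: int) -> List[range]:
    """Split the row indices 0..n_rows-1 into consecutive chunks."""
    return [range(start, min(start + size, n_rows))
            for start in range(0, n_rows, size)]


# Each chunk is independent, so this loop is the part that can be farmed out.
matches = [m for chunk in chunks(len(S), 1) for m in best_matches(chunk, G, S)]
print(matches)
```

Only the chunk of indices travels with each task, while both matrices are shared read-only; on a cluster the final comprehension would become a map over an RDD of chunks with G and S as broadcast variables, as the slides describe.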