Vectorise all the things!
Speeding up your code with basic linear algebra
What we’re going to cover
● Linear algebra basics
● Distance metrics and kNN
● Unoptimised solution
● Optimising distance metrics
● Optimising kNN
Linear algebra basics
What is a vector?
Major axis length    Minor axis length
653.7                392.3
427.8                247.0
229.2                162.4
Vector spaces
What is a matrix?
Major axis length    Minor axis length
653.7                392.3
427.8                247.0
229.2                162.4
Representations in NumPy
● Vectors and matrices are stored in arrays
● Arrays can be n-dimensional
○ Vectors in 1D arrays
○ Matrices in 2D arrays
○ Tensors in 3D (or nD) arrays
● Shape gives size along each dimension
○ 1D array: (2,)
○ 2D array: (2,2)
○ 3D array: (1,2,2)
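A minimal sketch of the three shapes listed above, reusing the axis-length values from the earlier table. The variable names are illustrative only and not taken from the talk.

```python
import numpy as np

# Two features (major and minor axis length) for a single bean: a vector.
vector = np.array([653.7, 392.3])

# Two beans with two features each: a matrix.
matrix = np.array([[653.7, 392.3],
                   [427.8, 247.0]])

# The same matrix with an extra leading axis of length 1: a 3D array.
tensor = matrix[np.newaxis, :, :]

print(vector.shape)  # (2,)
print(matrix.shape)  # (2, 2)
print(tensor.shape)  # (1, 2, 2)
```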
K-Nearest Neighbours
Distances in vector spaces
● Objects that are similar are closer (“less distant”) in vector space
● They have similar values along every dimension of the vector space
● Cali beans:
○ Have similar feature values to each other
○ Have distinct values compared to other beans
Manhattan distance
Manhattan distance between points a and b:
D = |4 - 2| + |3 - 1| = 4
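The worked example above is one instance of the general definition. Written for two n-dimensional vectors a and b (the index i and the dimension n are notation introduced here, not taken from the slides), it reads:

```latex
D(a, b) = \sum_{i=1}^{n} \lvert a_i - b_i \rvert
```

For the two points (4, 3) and (2, 1) this gives |4 - 2| + |3 - 1| = 2 + 2 = 4, matching the slide.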
k-nearest neighbours
● Data are divided into train and test sets
● The distance between the test point and every training point is measured
● The k nearest points are retained
● The labels of the k nearest points are counted
● The test point is assigned the majority label
[Figure: points labelled Cali, Cali, Cali, Seker, Dermason, Cali]
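The steps listed above can be sketched in plain Python as follows. This is only an illustration of the retain, count and vote steps: the function name `majority_label` and the distance values are hypothetical, and only the bean labels come from the slide.

```python
from collections import Counter

def majority_label(distances, labels, k):
    """Return the most common label among the k smallest distances."""
    # Pair each distance with its label, sort by distance, keep the k nearest.
    nearest = sorted(zip(distances, labels), key=lambda pair: pair[0])[:k]
    # Count the labels of the k nearest points and take the most frequent.
    counts = Counter(label for _, label in nearest)
    return counts.most_common(1)[0][0]

# Hypothetical distances from one test point to six training points.
distances = [0.4, 1.1, 0.7, 2.5, 3.0, 0.9]
labels = ["Cali", "Cali", "Cali", "Seker", "Dermason", "Cali"]
print(majority_label(distances, labels, k=3))  # Cali
```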
First pass with loops
Our first code improvement
def calculate_manhattan_distance(a: list, b: list, p: int) -> float:
    """Calculates the Manhattan distance between two vectors, a and b."""
    # p is not used in this function; it is kept so existing callers still work.
    diffs = []
    # Collect the absolute difference at each position.
    for element in range(len(a)):
        diffs.append(abs(a[element] - b[element]))
    return sum(diffs)
Vectorising the Manhattan distance
Vector and matrix subtraction
● Vectors and matrices of the same size can be subtracted
○ E.g., 4 x 1 vectors
○ E.g., 3 x 2 matrices
● Subtractions are performed elementwise
● Result is vector or matrix of the same size
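As a small illustration of elementwise subtraction in NumPy, the sketch below subtracts two 4-element vectors and two 3 x 2 matrices. The array values are made up for the example, apart from the axis lengths reused from the earlier table.

```python
import numpy as np

# Two vectors with 4 elements each (stored as 1D arrays of shape (4,)).
u = np.array([4.0, 3.0, 2.0, 1.0])
v = np.array([1.0, 1.0, 1.0, 1.0])
print(u - v)  # [3. 2. 1. 0.]

# Two 3 x 2 matrices: each element of B is subtracted from the
# element of A in the same position.
A = np.array([[653.7, 392.3],
              [427.8, 247.0],
              [229.2, 162.4]])
B = np.ones((3, 2))
diff = A - B
print(diff.shape)  # (3, 2)
```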
Operations on elements of vectors
● Elements of vectors can have an operation performed on them:
○ Scalar multiplication
○ Other functions such as absolute value
● Result is vector or matrix of the same size
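Combining elementwise subtraction with an elementwise absolute value removes the explicit loop from the distance calculation. The function below is a minimal sketch of that idea, not necessarily the version used in the talk. The name `manhattan_distance` is an assumption.

```python
import numpy as np

def manhattan_distance(a, b):
    """Manhattan distance between two equal-length 1D arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Subtract elementwise, take absolute values elementwise, then sum.
    return float(np.abs(a - b).sum())

print(manhattan_distance([4, 3], [2, 1]))  # 4.0
```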
Times after vector subtraction
[Chart: timing comparison; bar labels 1.3x, 1.2x, 3.2x, 1.8x, 1.8x, 4.3x]
Nested for loops can get expensive
● Nested loops compound issues with single loops
● Sequential processing means time scales as the product of the lengths of each list:
○ Small dataset = 3000 x 1000 = 3 million
○ Medium and large = 20000 x 7000 = 140 million
Our second code improvement
def apply_manhattan_distance(vectors_1: list, vectors_2: list, p: int) -> list:
    """Calculates the pairwise Manhattan distances between two lists of vectors."""
    distances = []
    for train_obs in vectors_1:
        tmp_distances = []
        for test_obs in vectors_2:
            tmp_distances.append(calculate_manhattan_distance(train_obs, test_obs, p))
        distances.append(tmp_distances)
    # Transpose so that each row holds the distances for one test observation.
    return [list(x) for x in zip(*distances)]
Getting rid of the nested for loop
Doing the matrix subtraction in one pass
(1,3,4) (3,3,4)
Broadcasting
● A memory-efficient way for NumPy to make arrays a compatible size for operations
● For an operation, NumPy compares each dimension (starting from the last) and checks:
○ Are the dimensions the same size?
○ If not, is one of the dimensions of size 1?
● Dimensions of size 1 are “stretched” (virtually replicated, without copying) to match the other array
○ E.g., subtraction between a 3 x 4 matrix and a 1 x 4 vector
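The 3 x 4 minus 1 x 4 case mentioned above can be tried directly. This sketch uses made-up values; `np.broadcast_shapes` (available in NumPy 1.20 and later) is used only to show the shape NumPy will produce.

```python
import numpy as np

matrix = np.arange(12.0).reshape(3, 4)    # shape (3, 4)
row = np.array([[1.0, 2.0, 3.0, 4.0]])    # shape (1, 4)

# The size-1 first dimension of `row` is stretched to 3, so the row is
# subtracted from every row of `matrix` without being copied in memory.
result = matrix - row
print(result.shape)  # (3, 4)

# np.broadcast_shapes reports the resulting shape, or raises ValueError
# if the shapes are incompatible.
print(np.broadcast_shapes((3, 4), (1, 4)))  # (3, 4)
```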
Broadcasting
(1,3,4)
(3,1,4)
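With shapes like the (1,3,4) and (3,1,4) pair shown above, broadcasting can replace the nested loop: one array gets a new axis in the first position, the other in the second, and the subtraction produces every pairwise difference at once. The function below is a sketch under that assumption. The name `pairwise_manhattan` and the sample data are hypothetical.

```python
import numpy as np

def pairwise_manhattan(test, train):
    """Manhattan distance from every test row to every training row."""
    test = np.asarray(test, dtype=float)    # shape (n_test, n_features)
    train = np.asarray(train, dtype=float)  # shape (n_train, n_features)
    # (n_test, 1, f) - (1, n_train, f) broadcasts to (n_test, n_train, f).
    diffs = test[:, np.newaxis, :] - train[np.newaxis, :, :]
    # Sum the absolute differences over the feature axis.
    return np.abs(diffs).sum(axis=-1)

train = np.array([[0, 0, 0, 0], [1, 1, 1, 1], [2, 2, 2, 2]])
test = np.array([[0, 0, 0, 0], [1, 0, 1, 0], [2, 2, 2, 3]])
distances = pairwise_manhattan(test, train)
print(distances.shape)  # (3, 3)
```

Each row of the result holds the distances for one test observation. Note that the intermediate array has n_test x n_train x n_features elements, so very large datasets may need to be processed in chunks.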
Times after broadcasting
[Chart: timing comparison; bar labels 1.3x, 1.2x, 1.8x, 1.8x, 4.3x, 10x, 11x, 13x]
Vectorising kNN
Our final code improvements
from operator import itemgetter

def calculate_nearest_neighbour(distances: list, labels: list, k: int) -> str:
    """
    Finds the majority label among the k nearest neighbours of a test point.
    """
    # Sort (distance, label) pairs by distance; [1:] drops the closest pair.
    sorted_distances = sorted(zip(distances, labels), key=itemgetter(0))[1:]
    top_n_labels = [label for dist, label in sorted_distances][:k]
    # Return the label that occurs most often among the k retained labels.
    return max(set(top_n_labels), key=top_n_labels.count)
The problems with this function
● Sorting:
○ Python’s sort and sorted are locked to Timsort
○ NumPy sort methods default to quicksort
○ NumPy’s stable sort picks its algorithm based on dtype
● List comprehension:
○ For loop in disguise
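One way to address both points with NumPy is sketched below. It is an assumption about how the function could be vectorised, not the talk's final implementation: the name `nearest_neighbour_label` is hypothetical, it uses `np.argpartition` rather than a full sort, and unlike the original it does not drop the closest entry.

```python
import numpy as np

def nearest_neighbour_label(distances, labels, k):
    """Majority label among the k training points closest to one test point."""
    distances = np.asarray(distances)
    labels = np.asarray(labels)
    # argpartition puts the indices of the k smallest distances first,
    # without fully sorting the array (requires k <= len(distances)).
    nearest = np.argpartition(distances, k - 1)[:k]
    # Count each distinct label among the k nearest and keep the most common.
    values, counts = np.unique(labels[nearest], return_counts=True)
    return str(values[np.argmax(counts)])

# Hypothetical distances from one test point to six training points.
distances = [0.4, 1.1, 0.7, 2.5, 3.0, 0.9]
labels = ["Cali", "Cali", "Cali", "Seker", "Dermason", "Cali"]
print(nearest_neighbour_label(distances, labels, k=3))  # Cali
```

Because `np.unique` returns labels in sorted order and `np.argmax` picks the first maximum, ties go to the alphabetically first label in this sketch.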
Final timings
[Chart: timing comparison; bar labels 1.3x, 1.2x, 1.8x, 1.8x, 4.3x, 10x, 11x, 13x, 1150x, 50x, 15x]
