Have you found that your code works beautifully on a few dozen examples, but leaves you wondering how to spend the next couple of hours after you start looping through all of your data? Are you only familiar with Python, and wish there were a way to speed things up without subjecting yourself to learning C?
In this talk, you'll see some simple tricks, borrowed from linear algebra, which can give you significant performance gains in your data science code, and how you can implement these in NumPy. We'll start by exploring an inefficient implementation of a machine learning algorithm that relies heavily on loops and lists. Throughout the talk, we'll iteratively replace bottlenecks with NumPy vectorized operations and learn the linear algebra that makes these methods work. You'll see how straightforward it can be to make your code many times faster, all without losing readability or needing to understand complex coding concepts.
7. What is a matrix?
Major axis length    Minor axis length
653.7                392.3
427.8                247.0
229.2                162.4
8. Representations in NumPy
● Vectors and matrices are stored in arrays
● Arrays can be n-dimensional
○ Vectors in 1D arrays
○ Matrices in 2D arrays
○ Tensors in 3D (or nD) arrays
● Shape gives size along each dimension
1D array: shape (2,)
2D array: shape (2,2)
3D array: shape (1,2,2)
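As a minimal sketch of the three shapes listed above, the arrays below reuse values from the "What is a matrix?" slide purely as filler; the variable names are illustrative and not from the talk.

```python
import numpy as np

# A vector stored in a 1D array
vector = np.array([653.7, 392.3])

# A matrix stored in a 2D array (2 rows, 2 columns)
matrix = np.array([[653.7, 392.3],
                   [427.8, 247.0]])

# A tensor stored in a 3D array: the same matrix with one extra leading axis
tensor = matrix[np.newaxis, :, :]

print(vector.shape)  # (2,)
print(matrix.shape)  # (2, 2)
print(tensor.shape)  # (1, 2, 2)
```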
10. Distances in vector spaces
● Objects that are similar are closer (“less distant”) in vector space
● They have similar values along every dimension in the vector space
● Cali beans:
○ Similar values on features to each other
○ Distinct values compared to other beans
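To make "closer in vector space" concrete, here is a small sketch that measures distances between the three rows from the "What is a matrix?" slide. It assumes the Manhattan distance used later in the talk; the source does not say which bean types these rows belong to, so no labels are attached, and the helper name `manhattan` is hypothetical.

```python
import numpy as np

# Rows are (major axis length, minor axis length) from the earlier slide
beans = np.array([[653.7, 392.3],
                  [427.8, 247.0],
                  [229.2, 162.4]])

def manhattan(u: np.ndarray, v: np.ndarray) -> float:
    """Sum of absolute differences along every dimension."""
    return float(np.abs(u - v).sum())

d01 = manhattan(beans[0], beans[1])
d12 = manhattan(beans[1], beans[2])
d02 = manhattan(beans[0], beans[2])

# The smaller the distance, the more similar the two observations
print(d01, d12, d02)
```

Under this measure the second and third rows are the closest pair, and the first and third rows are the most distant.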
13. k-nearest neighbours
● Data are divided into train and test sets
● Distance between test point and training points measured
● k nearest points are retained
● Labels of k nearest points counted
● Test point assigned majority label
Example (figure): labels of the nearest points are Cali, Cali, Cali, Seker, Dermason, Cali; the majority label is Cali
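The counting and majority-vote steps can be sketched with the six labels from the example figure. Using `collections.Counter` is an assumption made for illustration; it is not the approach shown in the talk's own code.

```python
from collections import Counter

# Labels of the nearest points, as shown in the example figure
nearest_labels = ["Cali", "Cali", "Cali", "Seker", "Dermason", "Cali"]

# Count how often each label occurs among the neighbours
label_counts = Counter(nearest_labels)

# The test point is assigned the most common label
predicted_label, votes = label_counts.most_common(1)[0]
print(predicted_label, votes)  # Cali 4
```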
15. Our first code improvement
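def calculate_manhattan_distance(a: list, b: list, p: int) -> float:
    """Calculates the Manhattan distance between two vectors, a and b.

    p is not used in this function; the signature is kept as on the slide.
    """
    i = len(a)
    diffs = []
    # Loop over every position and collect the absolute elementwise differences.
    for element in range(0, i):
        diffs.append(abs(a[element] - b[element]))
    return sum(diffs)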
18. Vector and matrix subtraction
● Vectors and matrices of the same size can be subtracted
○ E.g., 4 x 1 vectors
○ E.g., 3 x 2 matrices
● Subtractions are performed elementwise
● Result is vector or matrix of the same size
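A short sketch of elementwise subtraction in NumPy, using made-up values. Note that a 4-element vector is stored here as a 1D array of shape (4,), which is an assumption about how the slide's "4 x 1" vector would usually be represented.

```python
import numpy as np

# Two vectors of the same size
a = np.array([5, 7, 9, 11])
b = np.array([1, 2, 3, 4])
vec_diff = a - b            # elementwise: [4, 5, 6, 7]

# Two 3 x 2 matrices
m1 = np.array([[10, 20], [30, 40], [50, 60]])
m2 = np.array([[1, 2], [3, 4], [5, 6]])
mat_diff = m1 - m2          # result keeps the 3 x 2 shape

print(vec_diff)
print(mat_diff.shape)       # (3, 2)
```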
19. Operations on elements of vectors
● Elements of vectors can have an operation performed on them:
○ Scalar multiplication
○ Other functions such as absolute value
● Result is vector or matrix of the same size
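Combining elementwise subtraction with an elementwise absolute value gives a loop-free Manhattan distance. The following is one possible sketch; the function name `manhattan_distance_np` is hypothetical and the input values are made up.

```python
import numpy as np

def manhattan_distance_np(a: np.ndarray, b: np.ndarray) -> float:
    """Manhattan distance without an explicit Python loop."""
    # a - b subtracts elementwise, np.abs works elementwise, sum adds up
    return float(np.abs(a - b).sum())

a = np.array([1.0, -2.0, 3.0])
b = np.array([4.0, 0.0, -1.0])

scaled = 3 * a              # scalar multiplication, same size as a
abs_diff = np.abs(a - b)    # absolute value applied to every element

print(manhattan_distance_np(a, b))  # 9.0
```

The result matches what the list-based `calculate_manhattan_distance` would return for the same values, but the per-element work happens inside NumPy rather than in a Python `for` loop.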
21. Nested for loops can get expensive
● Nested loops compound the issues with single loops
● Sequential processing means time scales as the product of the lengths of the lists:
○ Small dataset = 3000 x 1000 = 3 million iterations
○ Medium and large = 20000 x 7000 = 140 million iterations
22. Our second code improvement
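def apply_manhattan_distance(vectors_1: list, vectors_2: list, p: int
                             ) -> list:
    """Calculates the pairwise Manhattan distances between two lists of vectors."""
    distances = []
    # Outer loop: one pass per vector in vectors_1.
    for train_obs in vectors_1:
        tmp_distances = []
        # Inner loop: distance from this vector to every vector in vectors_2.
        for test_obs in vectors_2:
            tmp_distances.append(calculate_manhattan_distance(train_obs, test_obs, p))
        distances.append(tmp_distances)
    # Transpose so each row holds one vectors_2 entry's distances to all of vectors_1.
    return [list(x) for x in zip(*distances)]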
27. Broadcasting
● A memory-efficient way for NumPy to transform arrays to a compatible size for operations
● For an operation, NumPy compares each dimension and checks:
○ Are the dimensions the same size?
○ If not, is one of the dimensions size = 1?
● Replicates or “stretches” incompatible dimensions to be the same size
○ E.g., subtraction between a 3 x 4 matrix and 1 x 4 vector
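The sketch below first shows the 3 x 4 minus 1 x 4 case, then uses the same rule to compute all pairwise Manhattan distances without nested loops. The function name `pairwise_manhattan` and the sample data are assumptions for illustration; this is one way to apply broadcasting, not necessarily the implementation shown in the talk.

```python
import numpy as np

# 3 x 4 matrix minus 1 x 4 vector: the size-1 dimension is stretched to 3
matrix = np.arange(12).reshape(3, 4)
row = np.array([[1, 1, 1, 1]])
result = matrix - row                      # shape (3, 4)

def pairwise_manhattan(train: np.ndarray, test: np.ndarray) -> np.ndarray:
    """Distances from every test row to every train row."""
    # test[:, np.newaxis, :]  has shape (n_test, 1, n_features)
    # train[np.newaxis, :, :] has shape (1, n_train, n_features)
    # Broadcasting stretches both to (n_test, n_train, n_features)
    diffs = test[:, np.newaxis, :] - train[np.newaxis, :, :]
    return np.abs(diffs).sum(axis=2)       # shape (n_test, n_train)

train = np.array([[0, 0], [1, 1], [2, 2]])
test = np.array([[0, 0], [2, 0]])
distances = pairwise_manhattan(train, test)
print(distances.shape)  # (2, 3)
```

The output has one row per test point, matching the orientation returned by the list-based `apply_manhattan_distance`. One caveat: although broadcasting itself does not copy the inputs, the intermediate `diffs` array is fully materialized, so very large train and test sets may need to be processed in batches.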
31. Our final code improvements
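from operator import itemgetter

def calculate_nearest_neighbour(distances: list, labels: list, k: int
                                ) -> str:
    """
    Predicts the label of a test point by majority vote
    among its k nearest neighbours.
    """
    # Sort (distance, label) pairs by distance; [1:] drops the closest pair, as on the slide.
    sorted_distances = sorted(zip(distances, labels), key=itemgetter(0))[1:]
    # Keep only the labels of the k nearest remaining points.
    top_n_labels = [label for dist, label in sorted_distances][:k]
    # Return the most frequent label among them.
    return max(set(top_n_labels), key=top_n_labels.count)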
33. The problems with this function
● Sorting:
○ list.sort() and sorted() are locked to Timsort
○ NumPy sort methods default to quicksort
○ NumPy's stable sort methods adjust to the dtype
● List comprehension:
○ A for loop in disguise
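One way to address both points is to sort indices with NumPy and count labels with `np.unique`. This is a sketch under assumptions: the function name `nearest_neighbour_np` is hypothetical, inputs are assumed to be NumPy arrays, and it deliberately does not reproduce the `[1:]` slice from the list-based version.

```python
import numpy as np

def nearest_neighbour_np(distances: np.ndarray, labels: np.ndarray, k: int) -> str:
    """Majority label among the k smallest distances."""
    # argsort returns the indices that would sort the distances;
    # kind="stable" lets NumPy pick a stable algorithm suited to the dtype
    nearest_idx = np.argsort(distances, kind="stable")[:k]
    # Count each distinct label among the k nearest points
    values, counts = np.unique(labels[nearest_idx], return_counts=True)
    # On a tied vote, argmax returns the alphabetically first label
    return str(values[np.argmax(counts)])

distances = np.array([0.5, 2.0, 0.1, 3.0, 0.7])
labels = np.array(["Cali", "Seker", "Cali", "Dermason", "Seker"])
prediction = nearest_neighbour_np(distances, labels, k=3)
print(prediction)  # Cali
```

When only the k smallest distances are needed and their order does not matter, `np.argpartition` is an alternative that avoids a full sort.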