In this post, we will optimize our kNN implementation from the previous post using NumPy and Numba.
You can find the previous post here:
K-Nearest Neighbors Algorithm using Python and Scikit-Learn
In [1]:
import math
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
import numba
%matplotlib inline
# making results reproducible
np.random.seed(42)
Teaser¶
Let us first look at a small example. We will compute the sum of inverse square roots of the integers from 1 up to n = 10000, first with a pure Python loop and then with NumPy's vectorized operations.
In [2]:
n = 10000
In [3]:
%timeit -n 100 sum([1./math.sqrt(i) for i in range(1,n)])
In [4]:
%timeit -n 100 np.sum(1./np.sqrt(np.arange(1,n)))
In [85]:
'Numpy vectorized calculation is {:.0f} times faster than pure python in this example'.format(1.87*1000/52.7)
Out[85]:
'Numpy vectorized calculation is 35 times faster than pure python in this example'
(1.87 ms and 52.7 µs are the per-loop timings reported by the two %timeit cells above.)
Using the wine dataset for kNN¶
In [7]:
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None, sep=',')
df.columns = ['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY', 'MAGNESIUM', 'PHENOLS',
              'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS', 'COLOR_INTENSITY',
              'HUE', 'OD280/OD315_DILUTED', 'PROLINE']
# Let us use only two features : 'ALCOHOL_LEVEL', 'MALIC_ACID' for this problem
df = df[['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID']]
df.head()
Out[7]:
In [8]:
# we use 90% of the data for training and the remaining 10% for testing
train_sample_idx = np.random.choice(df.index, size=int(df.shape[0]*0.9), replace=False)
train_data, test_data = df.iloc[train_sample_idx], df.drop(train_sample_idx)
# get features and label from train/test data
X_train, y_train = train_data.drop('CLASS', axis=1), train_data['CLASS']
X_test, y_test = test_data.drop('CLASS', axis=1), test_data['CLASS']
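As a quick check (these shapes are implied by the 178-row wine dataset; the original post does not print them), the split leaves 160 rows for training and 18 for testing:
X_train.shape, X_test.shape   # ((160, 2), (18, 2))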
1. kNN using Python from scratch¶
In [9]:
def euclidean_distance(vector1, vector2):
    '''calculate the euclidean distance, core python method
    input: numpy.arrays or lists
    return: euclidean distance
    '''
    dist = [(a - b)**2 for a, b in zip(vector1, vector2)]
    dist = math.sqrt(sum(dist))
    return dist
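As a quick sanity check, the distance between (1, 2) and (4, 6) is the hypotenuse of a 3-4-5 right triangle:
euclidean_distance([1, 2], [4, 6])   # 5.0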
In [10]:
def predict_instance(inputs, labels, test_instance, k):
    inputs['distance'] = inputs.apply(euclidean_distance, vector2=test_instance, axis=1)
    # concatenate inputs and labels before sorting the distances
    inputs = pd.concat([inputs, labels], axis=1)
    # sort based on distance
    inputs = inputs.sort_values('distance', ascending=True)
    # pick k neighbors
    neighbors = inputs.head(k)
    # get list from dataframe column
    classes = neighbors['CLASS'].tolist()
    # create counter of labels
    majority_count = Counter(classes)
    return majority_count.most_common(1).pop()[0]

def knn(X_train, y_train, X_test, k):
    """
    Calculate k-nn for given k.
    """
    predictions = np.zeros(X_test.shape[0])
    X_test.reset_index(drop=True, inplace=True)
    for index, row in X_test.iterrows():
        predictions[index] = predict_instance(X_train.copy(), y_train.copy(), row, k)
    return predictions
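The majority vote relies on Counter.most_common(1), which returns a one-element list of (label, count) pairs; .pop()[0] then extracts the winning label. For example:
Counter([1, 1, 2]).most_common(1)           # [(1, 2)]
Counter([1, 1, 2]).most_common(1).pop()[0]  # 1, the majority class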
In [11]:
predictions = knn(X_train, y_train, X_test, 3)
In [12]:
true_values = y_test.to_numpy()
accuracy = np.mean(predictions == true_values)
accuracy
Out[12]:
a. %timeit magic command to observe execution time of the pure Python implementation¶
In [13]:
# -n 10 runs the call 10 times per timing loop
%timeit -n 10 knn(X_train, y_train, X_test, 3)
b. Line profiling to see which functions/call has higher contribution on execution time¶
In [14]:
%load_ext line_profiler
In [15]:
# We are profiling the function knn; we supply its name using -f
%lprun -f knn knn(X_train, y_train, X_test, 3)
Timer unit: 1e-06 s
Total time: 0.312758 s
File: <ipython-input-115-bd6cbd244598>
Function: knn at line 18
Line # Hits Time Per Hit % Time Line Contents
==============================================================
18 def knn(X_train, y_train, X_test, k):
19 """
20 Calculate k-nn for given k.
21
22 """
23 1 55.0 55.0 0.0 predictions = np.zeros(X_test.shape[0])
24 1 586.0 586.0 0.2 X_test.reset_index(drop=True, inplace=True)
25 19 3188.0 167.8 1.0 for index, row in X_test.iterrows():
26 18 308928.0 17162.7 98.8 predictions[index] = predict_instance(X_train.copy(), y_train.copy(), row, k)
27 1 1.0 1.0 0.0 return predictions
NOTE:
- Times are in microseconds, as indicated by the timer unit at the top of the output.
- (Hits) shows the number of times that particular line was executed.
- (Time) is the total number of microseconds spent executing that line.
- (Per Hit) = (Time)/(Hits), the average time for a single execution.
- (% Time) is the fraction of time spent on that line relative to the total time.
We can see that line 26 is the expensive one: nearly all of the time is spent inside predict_instance. We will improve it using NumPy below.
2. Improving kNN with numpy¶
In [54]:
def predict_instance_numpy(inputs, labels, test_instance, k):
    # calculate L2 norm between all training points and given test_point
    inputs['distance'] = np.linalg.norm(inputs.values - test_instance.values, axis=1)
    # concatenate inputs and labels before sorting the distances
    inputs = pd.concat([inputs, labels], axis=1)
    # sort based on distance
    inputs = inputs.sort_values('distance', ascending=True)
    # pick k neighbors
    neighbors = inputs.head(k)
    # get list from dataframe column
    classes = neighbors['CLASS'].tolist()
    # create counter of labels
    majority_count = Counter(classes)
    return majority_count.most_common(1).pop()[0]

def knn_numpy(X_train, y_train, X_test, k):
    """
    Calculate k-nn for given k.
    """
    predictions = np.zeros(X_test.shape[0])
    X_test.reset_index(drop=True, inplace=True)
    for index, row in X_test.iterrows():
        predictions[index] = predict_instance_numpy(X_train.copy(), y_train.copy(), row, k)
    return predictions
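The only change from the pure Python version is the distance computation: np.linalg.norm with axis=1 computes the distances to all training points in a single vectorized call. A small illustration:
points = np.array([[1., 2.], [4., 6.]])
test = np.array([1., 2.])
np.linalg.norm(points - test, axis=1)   # array([0., 5.])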
In [55]:
# knn improved with distance calculation using np.linalg.norm
predictions = knn_numpy(X_train, y_train, X_test, 3)
true_values = y_test.to_numpy()
accuracy = np.mean(predictions == true_values)
accuracy
Out[55]:
Observing execution time and line profiling after optimizing with numpy¶
In [56]:
# -n 10 runs the call 10 times per timing loop
%timeit -n 10 knn_numpy(X_train, y_train, X_test, 3)
The total execution time has dropped from 135 ms to 41.2 ms¶
In [57]:
# We are profiling the function knn_numpy; we supply its name using -f
%lprun -f knn_numpy knn_numpy(X_train, y_train, X_test, 3)
Timer unit: 1e-06 s
Total time: 0.094862 s
File: <ipython-input-121-a0da67624daf>
Function: knn_numpy at line 22
Line # Hits Time Per Hit % Time Line Contents
==============================================================
22 def knn_numpy(X_train, y_train, X_test, k):
23 """
24 Calculate k-nn for given k.
25
26 """
27 1 65.0 65.0 0.1 predictions = np.zeros(X_test.shape[0])
28 1 258.0 258.0 0.3 X_test.reset_index(drop=True, inplace=True)
29 19 3696.0 194.5 3.9 for index, row in X_test.iterrows():
30 18 90843.0 5046.8 95.8 predictions[index] = predict_instance_numpy(X_train.copy(), y_train.copy(), row, k)
31 1 0.0 0.0 0.0 return predictions
The per-hit time has dropped from 17162.7 µs to 5046.8 µs¶
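The profile shows that the Python-level loop over the test rows is now the main remaining cost. As an aside (a sketch, not part of the original benchmarks), broadcasting could remove that loop entirely by computing all test-train distances at once; knn_fully_vectorized is a hypothetical name:
def knn_fully_vectorized(X_train, y_train, X_test, k):
    # pairwise differences via broadcasting: shape (n_test, n_train, n_features)
    diff = X_test.values[:, None, :] - X_train.values[None, :, :]
    # euclidean distance matrix: shape (n_test, n_train)
    distances = np.sqrt((diff ** 2).sum(axis=2))
    # indices of the k nearest training points for each test row
    nearest = np.argsort(distances, axis=1)[:, :k]
    neighbor_labels = y_train.values[nearest]
    # majority vote per test row (assumes non-negative integer class labels)
    return np.array([np.bincount(labels).argmax() for labels in neighbor_labels])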
3. Improving kNN with numba¶
In [79]:
import numba

@numba.jit(nopython=True)
def euclidean_distance_numba(vector1, vector2):
    '''calculate the euclidean distance
    '''
    dist = np.linalg.norm(vector1 - vector2)
    return dist

@numba.jit(nopython=True)
def predict_instance_numba(inputs, labels, test_instance, k):
    distances = np.zeros((inputs.shape[0], 1))
    # calculate distance between test point and training points
    for i in np.arange(inputs.shape[0]):
        distances[i] = euclidean_distance_numba(inputs[i], test_instance)
    labels = labels.reshape((labels.shape[0], 1))
    # add labels column
    inputs = np.hstack((inputs, labels))
    # add distance column
    inputs = np.hstack((inputs, distances))
    # sort based on the distance column (index 3)
    inputs = inputs[inputs[:, 3].argsort()]
    # column index 2 contains the classes; select the first k values
    neighbor_classes = inputs[:, 2][:k]
    # count labels manually, since collections.Counter is not supported
    # in nopython mode, and track the majority label as we go
    counter = {}
    majority_label = neighbor_classes[0]
    majority_count = 0
    for item in neighbor_classes:
        if item in counter:
            counter[item] += 1
        else:
            counter[item] = 1
        if counter[item] > majority_count:
            majority_count = counter[item]
            majority_label = item
    return majority_label

# @numba.jit(nopython=True)
def knn_numba(X_train, y_train, X_test, k):
    """
    Calculate k-nn for given k.
    """
    predictions = np.zeros(X_test.shape[0])
    for i in np.arange(X_test.shape[0]):
        predictions[i] = predict_instance_numba(X_train.copy(), y_train.copy(), X_test[i], k)
    return predictions
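predict_instance_numba is compiled in nopython mode, which understands plain NumPy arrays and scalars but not pandas objects; that is why the next cell passes the underlying arrays via .values. After the first call, numba caches one compiled version per argument-type signature, which the dispatcher exposes (a small aside, not in the original post):
predict_instance_numba.signatures   # list of type signatures compiled so far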
In [80]:
# kNN with numba-compiled distance and prediction routines
predictions = knn_numba(X_train.values, y_train.values, X_test.values, 3)
true_values = y_test.to_numpy()
accuracy = np.mean(predictions == true_values)
accuracy
Out[80]:
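One caveat when timing numba code: the first call pays the one-off JIT compilation cost, while subsequent calls reuse the cached machine code. Since the cell above has already triggered compilation, the %timeit below measures steady-state performance. A rough way to observe the difference on a fresh kernel (a sketch, not from the original post):
import time
start = time.time()
knn_numba(X_train.values, y_train.values, X_test.values, 3)   # first call: includes compilation
print('first call took {:.3f} s'.format(time.time() - start))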
Observing execution time and line profiling after optimizing with numba¶
In [81]:
# -n 10 runs the call 10 times per timing loop
%timeit -n 10 knn_numba(X_train.values, y_train.values, X_test.values, 3)
The total execution time has dropped from 41.2 ms to 1.65 ms¶
In [84]:
# We are profiling the function knn_numba; we supply its name using -f
%lprun -f knn_numba knn_numba(X_train.values, y_train.values, X_test.values, 3)
Timer unit: 1e-06 s
Total time: 0.002649 s
File: <ipython-input-79-89228ad76421>
Function: knn_numba at line 36
Line # Hits Time Per Hit % Time Line Contents
==============================================================
36 def knn_numba(X_train, y_train, X_test, k):
37 """
38 Calculate k-nn for given k.
39
40 """
41 1 11.0 11.0 0.4 predictions = np.zeros(X_test.shape[0])
42 19 44.0 2.3 1.7 for i in np.arange(X_test.shape[0]):
43 18 2593.0 144.1 97.9 predictions[i] = predict_instance_numba(X_train.copy(), y_train.copy(), X_test[i], k)
44 1 1.0 1.0 0.0 return predictions
The per-hit time has dropped from 5046.8 µs to 144.1 µs¶