Optimizing the k-Nearest Neighbors (kNN) algorithm in Python

In this post, we will optimize our kNN implementation from the previous post using NumPy and Numba.

For the previous posts, you can follow:

How kNN works?

K-Nearest Neighbors Algorithm using Python and Scikit-Learn

Out-of-sample accuracy estimation using CV in kNN

Tuning hyperparameters in kNN

In [1]:
import math
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

import numba

%matplotlib inline

# making results reproducible
np.random.seed(42)

Teaser

Let us first look at a small example. We will calculate the sum of inverse square roots of the integers from 1 to 9,999 (i.e., range(1, 10000)), first with a pure Python loop and then with NumPy.

In [2]:
n = 10000
In [3]:
%timeit -n 100 sum([1./math.sqrt(i) for i in range(1,n)])
1.87 ms ± 48.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]:
%timeit -n 100 np.sum(1./np.sqrt(np.arange(1,n)))
52.7 µs ± 8.09 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [85]:
'Numpy vectorized calculation is {:.0f} times faster than pure python in this example'.format(1.87*1000/52.7)
Out[85]:
'Numpy vectorized calculation is 35 times faster than pure python in this example'
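
Numba can accelerate the same loop by compiling it to machine code. A minimal sketch (the function name inv_sqrt_sum is ours, not from the original post; the first call triggers JIT compilation, so time a later call):

@numba.jit(nopython=True)
def inv_sqrt_sum(n):
    # the same pure-Python loop, JIT-compiled by Numba
    total = 0.0
    for i in range(1, n):
        total += 1.0 / math.sqrt(i)
    return total

inv_sqrt_sum(n)                    # warm-up call compiles the function
%timeit -n 100 inv_sqrt_sum(n)     # subsequent calls run the compiled code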

Using the wine dataset for kNN

In [7]:
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None, sep=',')

df.columns = ['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY','MAGNESIUM', 'PHENOLS', 
              'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS', 'COLOR_INTENSITY', 
              'HUE', 'OD280/OD315_DILUTED','PROLINE']

# Let us use only two features : 'ALCOHOL_LEVEL', 'MALIC_ACID' for this problem
df = df[['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID']]
df.head()
Out[7]:
   CLASS  ALCOHOL_LEVEL  MALIC_ACID
0      1          14.23        1.71
1      1          13.20        1.78
2      1          13.16        2.36
3      1          14.37        1.95
4      1          13.24        2.59
In [8]:
# hold out 10% of the data for testing; train on the remaining 90%

train_sample_idx = np.random.choice(df.index, size=int(df.shape[0]*0.9), replace=False)
train_data, test_data = df.iloc[train_sample_idx], df.drop(train_sample_idx)

# get features and label from train/test data
X_train, y_train = train_data.drop('CLASS', axis=1), train_data['CLASS']
X_test, y_test = test_data.drop('CLASS', axis=1), test_data['CLASS']

1. kNN using Python from scratch

In [9]:
def euclidean_distance(vector1, vector2):
    '''calculate the euclidean distance, core python method
    input: numpy.arrays or lists
    return: euclidean distance
    '''
    dist = [(a - b)**2 for a, b in zip(vector1, vector2)]
    dist = math.sqrt(sum(dist))
    return dist
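
A quick sanity check of the distance function on a hand-verifiable 3-4-5 triangle:

euclidean_distance([0, 0], [3, 4])   # expected: 5.0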
In [10]:
def predict_instance(inputs, labels, test_instance, k):
    # distance from every training row to the test instance (pure-Python loop per row)
    inputs['distance'] = inputs.apply(euclidean_distance, vector2=test_instance, axis=1)
    
    # concatenate inputs and labels before sorting the distances
    inputs = pd.concat([inputs, labels], axis=1)
    # sort based on distance
    inputs = inputs.sort_values('distance', ascending=True)
    # pick k neighbors
    neighbors = inputs.head(k)

    # get list from dataframe column
    classes = neighbors['CLASS'].tolist()

    # create counter of labels
    majority_count = Counter(classes)
    return majority_count.most_common(1).pop()[0]

def knn(X_train, y_train, X_test, k):
    """
    Calculate k-nn for given k.
    
    """
    predictions = np.zeros(X_test.shape[0])
    X_test.reset_index(drop=True, inplace=True)
    for index, row in X_test.iterrows():
        predictions[index] = predict_instance(X_train.copy(), y_train.copy(), row, k)
    return predictions
In [11]:
predictions = knn(X_train, y_train, X_test, 3)
In [12]:
true_values = y_test.to_numpy()
accuracy = np.mean(predictions == true_values)
accuracy
Out[12]:
0.7222222222222222
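
As a cross-check, the same split can be scored with scikit-learn's KNeighborsClassifier from the earlier post (assuming scikit-learn is installed); it should report a comparable accuracy:

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3)   # same k as above
clf.fit(X_train, y_train)
clf.score(X_test, y_test)                   # mean accuracy on the held-out 10%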

a. %timeit magic command to observe the execution time of the pure Python implementation

In [13]:
# -n 10: run the statement 10 times per timing loop
%timeit -n 10 knn(X_train, y_train, X_test, 3)
144 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

b. Line profiling to see which calls contribute most to the execution time

In [14]:
%load_ext line_profiler
In [15]:
# We are profiling function knn, we supply the name using -f
%lprun -f knn knn(X_train, y_train, X_test, 3)
Timer unit: 1e-06 s

Total time: 0.312758 s
File: <ipython-input-115-bd6cbd244598>
Function: knn at line 18

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    18                                           def knn(X_train, y_train, X_test, k):
    19                                               """
    20                                               Calculate k-nn for given k.
    21                                               
    22                                               """
    23         1         55.0     55.0      0.0      predictions = np.zeros(X_test.shape[0])
    24         1        586.0    586.0      0.2      X_test.reset_index(drop=True, inplace=True)
    25        19       3188.0    167.8      1.0      for index, row in X_test.iterrows():
    26        18     308928.0  17162.7     98.8          predictions[index] = predict_instance(X_train.copy(), y_train.copy(), row, k)
    27         1          1.0      1.0      0.0      return predictions

NOTE:

  1. The timer unit is microseconds, as noted at the top of the output.
  2. Hits is the number of times that particular line was executed.
  3. Time is the total microseconds spent executing that line.
  4. Per Hit = Time / Hits, the average time for a single execution.
  5. % Time is the fraction of time spent on that line relative to the total.

We can see that line 26 is the expensive one: each call to predict_instance computes every distance in a Python loop. We will improve it using NumPy below.

2. Improving kNN with NumPy

In [54]:
def predict_instance_numpy(inputs, labels, test_instance, k):
    # calculate L2 norm between all training points and given test_point
    inputs['distance'] = np.linalg.norm(inputs.values-test_instance.values, axis=1)

    # concatenate inputs and labels before sorting the distances
    inputs = pd.concat([inputs, labels], axis=1)
    # sort based on distance
    inputs = inputs.sort_values('distance', ascending=True)
    
    # pick k neighbors
    neighbors = inputs.head(k)

    # get list from dataframe column
    classes = neighbors['CLASS'].tolist()

    # create counter of labels
    majority_count = Counter(classes)
    return majority_count.most_common(1).pop()[0]

def knn_numpy(X_train, y_train, X_test, k):
    """
    Calculate k-nn for given k.
    
    """
    predictions = np.zeros(X_test.shape[0])
    X_test.reset_index(drop=True, inplace=True)
    for index, row in X_test.iterrows():
        predictions[index] = predict_instance_numpy(X_train.copy(), y_train.copy(), row, k)
    return predictions
In [55]:
# knn improved with distance calculation using np.linalg.norm
predictions = knn_numpy(X_train, y_train, X_test, 3)
true_values = y_test.to_numpy()
accuracy = np.mean(predictions == true_values)
accuracy
Out[55]:
0.7222222222222222
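
The speedup comes from replacing the per-row Python loop with a single vectorized call: np.linalg.norm(inputs.values - test_instance.values, axis=1) computes all training-to-test distances at once through broadcasting. A small illustration with toy values (not from the dataset):

A = np.array([[14.23, 1.71], [13.20, 1.78]])   # two training points
b = np.array([13.0, 2.0])                      # one test point
np.linalg.norm(A - b, axis=1)                  # both distances in one call
np.sqrt(((A - b) ** 2).sum(axis=1))            # equivalent explicit form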

Observing execution time and line profiling after optimizing with NumPy

In [56]:
# -n 10: run the statement 10 times per timing loop
%timeit -n 10 knn_numpy(X_train, y_train, X_test, 3)
40.6 ms ± 2.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The total execution time has dropped from 144 ms to 40.6 ms

In [57]:
# We are profiling function knn, we supply the name using -f
%lprun -f knn_numpy knn_numpy(X_train, y_train, X_test, 3)
Timer unit: 1e-06 s

Total time: 0.094862 s
File: <ipython-input-121-a0da67624daf>
Function: knn_numpy at line 22

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    22                                           def knn_numpy(X_train, y_train, X_test, k):
    23                                               """
    24                                               Calculate k-nn for given k.
    25                                               
    26                                               """
    27         1         65.0     65.0      0.1      predictions = np.zeros(X_test.shape[0])
    28         1        258.0    258.0      0.3      X_test.reset_index(drop=True, inplace=True)
    29        19       3696.0    194.5      3.9      for index, row in X_test.iterrows():
    30        18      90843.0   5046.8     95.8          predictions[index] = predict_instance_numpy(X_train.copy(), y_train.copy(), row, k)
    31         1          0.0      0.0      0.0      return predictions

The Per Hit time has been reduced from 17162.7 µs to 5046.8 µs

3. Improving kNN with Numba

Numba does its best work when we operate on NumPy arrays. For the implementation below, I have converted the pandas DataFrames to NumPy arrays and performed the relevant operations on those.

For more information about Numba, see its documentation.

In [79]:
import numba

@numba.jit(nopython=True)
def euclidean_distance_numba(vector1, vector2):
    '''Calculate the Euclidean distance (JIT-compiled by Numba).
    '''
    dist = np.linalg.norm(vector1-vector2)
    return dist

@numba.jit(nopython=True)
def predict_instance_numba(inputs, labels, test_instance, k):
    distances = np.zeros((inputs.shape[0], 1))
    # calculate distance between test point and training points
    for i in np.arange(inputs.shape[0]):
        distances[i] = euclidean_distance_numba(inputs[i], test_instance)
    labels = labels.reshape((labels.shape[0],1))
    
    # add labels column
    inputs = np.hstack((inputs, labels))
    
    # add distance column
    inputs = np.hstack((inputs, distances))
    
    # sort based on distance column
    inputs = inputs[inputs[:,3].argsort()]
    # column 2 holds the class labels; take the first k after sorting by distance
    neighbor_classes = inputs[:, 2][:k]
    counter = {}
    for item in neighbor_classes:
        if item in counter:
            counter[item] = counter.get(item) + 1
        else:
            counter[item] = 1
    # pick the class with the highest count (majority vote)
    majority_class, majority_count = -1.0, 0
    for item in counter:
        if counter[item] > majority_count:
            majority_class, majority_count = item, counter[item]
    return majority_class

# @numba.jit(nopython=True)
def knn_numba(X_train, y_train, X_test, k):
    """
    Calculate k-nn for given k.
    
    """
    predictions = np.zeros(X_test.shape[0])
    for i in np.arange(X_test.shape[0]):
        predictions[i] = predict_instance_numba(X_train.copy(), y_train.copy(), X_test[i], k)
    return predictions
In [80]:
# kNN with the per-instance prediction JIT-compiled by Numba
predictions = knn_numba(X_train.values, y_train.values, X_test.values, 3)
true_values = y_test.to_numpy()
accuracy = np.mean(predictions == true_values)
accuracy
Out[80]:
0.7222222222222222
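
One caveat when benchmarking Numba: the first call to a @numba.jit function includes compilation time. The call above already compiled predict_instance_numba, so the timings below reflect the cached machine code; in general, warm the function up first:

# warm-up call: triggers JIT compilation so later timings measure only execution
_ = knn_numba(X_train.values, y_train.values, X_test.values, 3)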

Observing execution time and line profiling after optimizing with Numba

In [81]:
# -n 10: run the statement 10 times per timing loop
%timeit -n 10 knn_numba(X_train.values, y_train.values, X_test.values, 3)
1.65 ms ± 362 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The total execution time has dropped from 40.6 ms to 1.65 ms

In [84]:
# We are profiling function knn_numba, we supply the name using -f
%lprun -f knn_numba knn_numba(X_train.values, y_train.values, X_test.values, 3)
Timer unit: 1e-06 s

Total time: 0.002649 s
File: <ipython-input-79-89228ad76421>
Function: knn_numba at line 36

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    36                                           def knn_numba(X_train, y_train, X_test, k):
    37                                               """
    38                                               Calculate k-nn for given k.
    39                                               
    40                                               """
    41         1         11.0     11.0      0.4      predictions = np.zeros(X_test.shape[0])
    42        19         44.0      2.3      1.7      for i in np.arange(X_test.shape[0]):
    43        18       2593.0    144.1     97.9          predictions[i] = predict_instance_numba(X_train.copy(), y_train.copy(), X_test[i], k)
    44         1          1.0      1.0      0.0      return predictions

The Per Hit time has been reduced from 5046.8 µs to 144.1 µs
