Optimizing k-Nearest Neighbors (kNN) algorithm in Python

In this post, we will optimize our kNN implementation from previous post using Numpy and Numba.

For previous post, you can follow:

How kNN works ?

K-Nearest Neighbors Algorithm using Python and Scikit-Learn?

Out of sample accuracy estimation using cv in knn

Tuning Hyperparameter in kNN

In [1]:
import math
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

import numba

%matplotlib inline

# making results reproducible
np.random.seed(42)

Teaser

Let us first see a small example. We will calculate the sum of inverse square root between 1 to 10000, using basic pure python method and using Numpy.

In [2]:
n = 10000
In [3]:
%timeit -n 100 sum([1./math.sqrt(i) for i in range(1,n)])
1.87 ms ± 48.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]:
%timeit -n 100 np.sum(1./np.sqrt(np.arange(1,n)))
52.7 µs ± 8.09 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [85]:
'Numpy vectorized calculation is {:.0f} times faster than pure python in this example'.format(1.87*1000/52.7)
Out[85]:
'Numpy vectorized calculation is 35 times faster than pure python in this example'

Using wine dataset for kNN

In [7]:
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None, sep=',')

df.columns = ['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY','MAGNESIUM', 'PHENOLS', 
              'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS', 'COLOR_INTENSITY', 
              'HUE', 'OD280/OD315_DILUTED','PROLINE']

# Let us use only two features : 'ALCOHOL_LEVEL', 'MALIC_ACID' for this problem
df = df[['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID']]
df.head()
Out[7]:
CLASS ALCOHOL_LEVEL MALIC_ACID
0 1 14.23 1.71
1 1 13.20 1.78
2 1 13.16 2.36
3 1 14.37 1.95
4 1 13.24 2.59
In [8]:
# we are using 10% of the data for the testing purpose

train_sample_idx = np.random.choice(df.index, size=int(df.shape[0]*0.9), replace=False)
train_data, test_data = df.iloc[train_sample_idx], df.drop(train_sample_idx)

# get features and label from train/test data
X_train, y_train = train_data.drop('CLASS', axis=1), train_data['CLASS']
X_test, y_test = test_data.drop('CLASS', axis=1), test_data['CLASS']

1. kNN using Python from scratch

In [9]:
def euclidean_distance(vector1, vector2):
    '''calculate the euclidean distance, core python method
    input: numpy.arrays or lists
    return: euclidean distance
    '''
    dist = [(a - b)**2 for a, b in zip(vector1, vector2)]
    dist = math.sqrt(sum(dist))
    return dist
In [10]:
def predict_instance(inputs, labels, test_instance, k):
    inputs['distance'] = inputs.apply(euclidean_distance, vector2=test_instance, axis=1)
    
    # concatenate inputs and labels before sorting the distances
    inputs = pd.concat([inputs, labels], axis=1)
    # sort based on distance
    inputs = inputs.sort_values('distance', ascending=True)
    # pick k neighbors
    neighbors = inputs.head(k)

    # get list from dataframe column
    classes = neighbors['CLASS'].tolist()

    # create counter of labels
    majority_count = Counter(classes)
    return majority_count.most_common(1).pop()[0]

def knn(X_train, y_train, X_test, k):
    """
    Calculate k-nn for given k.
    
    """
    predictions = np.zeros(X_test.shape[0])
    X_test.reset_index(drop=True, inplace=True)
    for index, row in X_test.iterrows():
        predictions[index] = predict_instance(X_train.copy(), y_train.copy(), row, k)
    return predictions
In [11]:
# knn = KNN(3)
predictions = knn(X_train, y_train, X_test, 3)
In [12]:
true_values = y_test.to_numpy()
accuracy = np.mean(predictions == true_values)
accuracy
Out[12]:
0.7222222222222222

a. %timeit magic command to observe execution time for core python functions

In [13]:
# %%timeit
# -n execute the function 10 times in a loop
%timeit -n 10 knn(X_train, y_train, X_test, 3)
144 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

b. Line profiling to see which functions/call has higher contribution on execution time

In [14]:
%load_ext line_profiler
In [15]:
# We are profiling function knn, we supply the name using -f
%lprun -f knn knn(X_train, y_train, X_test, 3)
Timer unit: 1e-06 s

Total time: 0.312758 s
File: <ipython-input-115-bd6cbd244598>
Function: knn at line 18

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    18                                           def knn(X_train, y_train, X_test, k):
    19                                               """
    20                                               Calculate k-nn for given k.
    21                                               
    22                                               """
    23         1         55.0     55.0      0.0      predictions = np.zeros(X_test.shape[0])
    24         1        586.0    586.0      0.2      X_test.reset_index(drop=True, inplace=True)
    25        19       3188.0    167.8      1.0      for index, row in X_test.iterrows():
    26        18     308928.0  17162.7     98.8          predictions[index] = predict_instance(X_train.copy(), y_train.copy(), row, k)
    27         1          1.0      1.0      0.0      return predictions

NOTE:

  1. Time is in microseconds at mentioned at the top of the cell.
  2. (Hits) shows number of times that particular line in code executed.
  3. (Time) total microseconds for executing that line. (total amount of time)
  4. (per Hit) = (Time)/(Hits), average time for a single execution.
  5. (% Time) fraction of time spent on that line relative to total amount of time.

We can see the line number 26 is expensive one. We will improve it using numpy below.

2. Improving kNN with numpy

In [54]:
def predict_instance_numpy(inputs, labels, test_instance, k):
    # calculate L2 norm between all training points and given test_point
    inputs['distance'] = np.linalg.norm(inputs.values-test_instance.values, axis=1)

    # concatenate inputs and labels before sorting the distances
    inputs = pd.concat([inputs, labels], axis=1)
    # sort based on distance
    inputs = inputs.sort_values('distance', ascending=True)
    
    # pick k neighbors
    neighbors = inputs.head(k)

    # get list from dataframe column
    classes = neighbors['CLASS'].tolist()

    # create counter of labels
    majority_count = Counter(classes)
    return majority_count.most_common(1).pop()[0]

def knn_numpy(X_train, y_train, X_test, k):
    """
    Calculate k-nn for given k.
    
    """
    predictions = np.zeros(X_test.shape[0])
    X_test.reset_index(drop=True, inplace=True)
    for index, row in X_test.iterrows():
        predictions[index] = predict_instance_numpy(X_train.copy(), y_train.copy(), row, k)
    return predictions
In [55]:
# knn improved with distance calculation using np.linalg.norm
predictions = knn_numpy(X_train, y_train, X_test, 3)
true_values = y_test.to_numpy()
accuracy = np.mean(predictions == true_values)
accuracy
Out[55]:
0.7222222222222222

Observing execution time and line profiling after optimizing with numpy

In [56]:
# %%timeit
# -n execute the function 10 times in a loop
%timeit -n 10 knn_numpy(X_train, y_train, X_test, 3)
40.6 ms ± 2.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The total execution has reduced from 135 ms to 41.2 ms

In [57]:
# We are profiling function knn, we supply the name using -f
%lprun -f knn_numpy knn_numpy(X_train, y_train, X_test, 3)
Timer unit: 1e-06 s

Total time: 0.094862 s
File: <ipython-input-121-a0da67624daf>
Function: knn_numpy at line 22

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    22                                           def knn_numpy(X_train, y_train, X_test, k):
    23                                               """
    24                                               Calculate k-nn for given k.
    25                                               
    26                                               """
    27         1         65.0     65.0      0.1      predictions = np.zeros(X_test.shape[0])
    28         1        258.0    258.0      0.3      X_test.reset_index(drop=True, inplace=True)
    29        19       3696.0    194.5      3.9      for index, row in X_test.iterrows():
    30        18      90843.0   5046.8     95.8          predictions[index] = predict_instance_numpy(X_train.copy(), y_train.copy(), row, k)
    31         1          0.0      0.0      0.0      return predictions

The per Hit time has reduced from 17162.7 µs to 5046.8 µs

3. Improving kNN with numba

Numba does it's best work when we do operations over numpy arrays. For the implemention below, I have converted pandas dataframe to numpy array and perform relevant operations.

For more information about numba

In [79]:
import operator
import numba

@numba.jit(nopython=True)
def euclidean_distance_numba(vector1, vector2):
    '''calculate the euclidean distance,
    '''
    dist = np.linalg.norm(vector1-vector2)
    return dist

@numba.jit(nopython=True)
def predict_instance_numba(inputs, labels, test_instance, k):
    distances = np.zeros((inputs.shape[0], 1))
    # calculate distance between test point and training points
    for i in np.arange(inputs.shape[0]):
        distances[i] = euclidean_distance_numba(inputs[i], test_instance)
    labels = labels.reshape((labels.shape[0],1))
    
    # add labels column
    inputs = np.hstack((inputs, labels))
    
    # add distance column
    inputs = np.hstack((inputs, distances))
    
    # sort based on distance column
    inputs = inputs[inputs[:,3].argsort()]
    # 2nd columns contains classes, select first k values
    neighbor_classes = inputs[:, 2][:k]
    counter = {}
    for item in neighbor_classes:
        if item in counter:
            counter[item] = counter.get(item) + 1
        else:
            counter[item] = 1
    counter_sorted = sorted(counter)
    return counter_sorted[0]

# @numba.jit(nopython=True)
def knn_numba(X_train, y_train, X_test, k):
    """
    Calculate k-nn for given k.
    
    """
    predictions = np.zeros(X_test.shape[0])
    for i in np.arange(X_test.shape[0]):
        predictions[i] = predict_instance_numba(X_train.copy(), y_train.copy(), X_test[i], k)
    return predictions
In [80]:
# knn improved with distance calculation using np.linalg.norm
predictions = knn_numba(X_train.values, y_train.values, X_test.values, 3)
true_values = y_test.to_numpy()
accuracy = np.mean(predictions == true_values)
accuracy
Out[80]:
0.7222222222222222

Observing execution time and line profiling after optimizing with numba

In [81]:
# %%timeit
# -n execute the function 10 times in a loop
%timeit -n 10 knn_numba(X_train.values, y_train.values, X_test.values, 3)
1.65 ms ± 362 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The total execution has reduced from 41.2 ms to 1.65 ms

In [84]:
# We are profiling function knn_numba, we supply the name using -f
%lprun -f knn_numba knn_numba(X_train.values, y_train.values, X_test.values, 3)
Timer unit: 1e-06 s

Total time: 0.002649 s
File: <ipython-input-79-89228ad76421>
Function: knn_numba at line 36

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    36                                           def knn_numba(X_train, y_train, X_test, k):
    37                                               """
    38                                               Calculate k-nn for given k.
    39                                               
    40                                               """
    41         1         11.0     11.0      0.4      predictions = np.zeros(X_test.shape[0])
    42        19         44.0      2.3      1.7      for i in np.arange(X_test.shape[0]):
    43        18       2593.0    144.1     97.9          predictions[i] = predict_instance_numba(X_train.copy(), y_train.copy(), X_test[i], k)
    44         1          1.0      1.0      0.0      return predictions

The per Hit time has reduced from 5046.8 µs 144.1 µs

In [ ]:
 

How to tune hyperparameter in K Nearest Neighbors Classifier?

In this post, we will go through an approach to get optimal/tuned model for the final prediction. First, we will see how to select best 'k' in kNN using simple python example. We will then jump to using sklearn apis to explore different options for hyperparameter tuning.

For previous post, you can follow:

How kNN works ?

K-Nearest Neighbors Algorithm using Python and Scikit-Learn?

Out of sample accuracy estimation using cv in knn

In [1]:
import os
import math
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

# making results reproducible
np.random.seed(42)
In [2]:
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None, sep=',')

df.columns = ['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY','MAGNESIUM', 'PHENOLS', 
              'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS', 'COLOR_INTENSITY', 
              'HUE', 'OD280/OD315_DILUTED','PROLINE']

# Let us use only two features : 'ALCOHOL_LEVEL', 'MALIC_ACID' for this problem
df = df[['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID']]
df.head()
Out[2]:
CLASS ALCOHOL_LEVEL MALIC_ACID
0 1 14.23 1.71
1 1 13.20 1.78
2 1 13.16 2.36
3 1 14.37 1.95
4 1 13.24 2.59

1. kNN and Cross validation using Python from Scratch

In [3]:
class KNN:
    def __init__(self, K):
        self.K = K
        self.X_train = None
        self.y_train = None
        
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train
    
    def predict_instance(self, test_instance):
        inputs = self.X_train.copy()
        # calculate L2 norm between all training points and given test_point
        inputs['distance'] = np.linalg.norm(inputs.values-test_instance.values, axis=1)
        
        # concatenate inputs and labels before sorting the distances
        inputs = pd.concat([inputs, self.y_train], axis=1)

        # sort based on distance
        inputs = inputs.sort_values('distance', ascending=True)

        # pick k neighbors
        neighbors = inputs.head(self.K)

        # get list from dataframe column
        classes = neighbors['CLASS'].tolist()

        # create counter of labels
        majority_count = Counter(classes)
        
        return majority_count.most_common(1).pop()[0]
        
        
    def predict(self, X_test):
        predictions = np.zeros(X_test.shape[0])
        # we want out index to be start from 0
        X_test.reset_index(drop=True, inplace=True)
        for index, row in X_test.iterrows():
            predictions[index] = self.predict_instance(row)
        return predictions

def cross_validation(n, k, data, n_neighbors):
    """
    n : # iterations
    k : k-fold size
    data: training data
    n_neighbors: k in knn
    """
    accuracies = []
    
    for _ in range(0, n):
        # data shuffle
        data.sample(frac=1)
        
        fold=int(data.shape[0]/k)
        
        for j in range(k):
            test = data[j*fold:j*fold+fold]
            train = data[~data.index.isin(test.index)]
            X_train, y_train = train.drop('CLASS', axis=1), train['CLASS']
            X_test, y_test = test.drop('CLASS', axis=1), test['CLASS']
            
            knn = KNN(n_neighbors)
            knn.fit(X_train, y_train)
            
            predictions = knn.predict(X_test)
            true_values = y_test.to_numpy()
            accuracy = np.mean(predictions == true_values)
            
            accuracies.append(accuracy)
    return np.array(accuracies).mean()
In [4]:
# We will be using following settings for all the cases below
k_values = np.arange(1, 16)
cross_validation_fold = 10
accuracies = []

2. Finding optimal k value for kNN

In [5]:
for k in k_values:
    # run cross-validation with given neighbor size k
    accuracy = cross_validation(1, cross_validation_fold, df, k)
    accuracies.append(accuracy)
print(accuracies)
[0.6411764705882353, 0.6647058823529411, 0.7470588235294118, 0.7529411764705881, 0.7588235294117647, 0.7529411764705882, 0.7529411764705881, 0.7294117647058822, 0.7823529411764707, 0.7647058823529412, 0.7705882352941177, 0.7764705882352941, 0.7705882352941176, 0.7529411764705881, 0.7411764705882352]
In [6]:
fig = plt.figure()
plt.plot(k_values, accuracies)
plt.xlabel('k in kNN')
plt.ylabel('CV-Accuracy')
fig.suptitle('kNN hyperparameter (k) tuning with python alone', fontsize=20)
Out[6]:
Text(0.5, 0.98, 'kNN hyperparameter (k) tuning with python alone')

We can see that k=9 seems a good choice for our dataset.

Sijan Bhandari on #kNN,

Out-of-sample accuracy estimation using Cross validation in python and scikit-learn

In this post, we will be continuing from our previous post:

K-Nearest Neighbors Algorithm using Python and Scikit-Learn?

Before starting with the implementation, let's discuss few important points in cross validation.

  1. Using Cross validation (CV), we splits our dataset into k folds (k generally setup by developer)
  2. Once you created k folds, you use each of the folds as test set during run and all remaining folds as train set.
  3. With cross validation, one can assess the average model performance (this post) or also for the hyperparameters selection (for example : selecting optimal neighbors size(k) in kNN) or selecting good feature combinations from given data features.
In [1]:
import math
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

# making results reproducible
np.random.seed(42)
In [2]:
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None, sep=',')

df.columns = ['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY','MAGNESIUM', 'PHENOLS', 
              'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS', 'COLOR_INTENSITY', 
              'HUE', 'OD280/OD315_DILUTED','PROLINE']

# Let us use only two features : 'ALCOHOL_LEVEL', 'MALIC_ACID' for this problem
df = df[['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID']]
df.head()
Out[2]:
CLASS ALCOHOL_LEVEL MALIC_ACID
0 1 14.23 1.71
1 1 13.20 1.78
2 1 13.16 2.36
3 1 14.37 1.95
4 1 13.24 2.59

1. Cross validation using Python from Scratch

K-Nearest Neighbors Algorithm using Python and Scikit-Learn

K nearest Neighbors (kNN) works based on calculating distance between given test data point and all the training samples. We, then, collect first K closest points from training set and the majority vote gives you the predicted class for a given test data point.

For more intuitive explanation, please follow previous post :

How kNN works ?

In [1]:
import math
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

# making results reproducible
np.random.seed(42)
In [2]:
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None, sep=',')

df.columns = ['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY','MAGNESIUM', 'PHENOLS', 
              'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS', 'COLOR_INTENSITY', 
              'HUE', 'OD280/OD315_DILUTED','PROLINE']

# Let us use only two features : 'ALCOHOL_LEVEL', 'MALIC_ACID' for this problem
df = df[['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID']]
df.head()
Out[2]:
CLASS ALCOHOL_LEVEL MALIC_ACID
0 1 14.23 1.71
1 1 13.20 1.78
2 1 13.16 2.36
3 1 14.37 1.95
4 1 13.24 2.59
In [3]:
# class distribution looks okay; not so imbalanced.
df['CLASS'].value_counts().plot(kind="bar")
plt.show()

How do I know Principal Component Analysis (PCA) is preserving information from my data ?

Principal Component Analysis is used for dimension reduction of high-dimensional data. We also refer PCA as feature extraction technique, where new features take the linear combinations from original features.

More on PCA : principal-component-analysis-pca-for-visualization-using-python

In this post, we will investigate 'how reliable the information is as preserved by PCA'. In order to do that, we will use labelled data and evaluate the trained model to see the final performance. (In side note, PCA is unsupervised learning algorithm. But, in our case, we are using in supervised fashion to assess the model performance)

To make it more interesting, we will also see Random Forests as 'feature selection' and compare the result with PCA.

1. PCA as Feature Extraction

In [13]:
%matplotlib inline

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd 

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
In [14]:
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None, sep=',')

df.columns = ['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY','MAGNESIUM', 'PHENOLS', 
              'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS', 'COLOR_INTENSITY', 
              'HUE', 'OD280/OD315_DILUTED','PROLINE']
df.head()
Out[14]:
CLASS ALCOHOL_LEVEL MALIC_ACID ASH ALCALINITY MAGNESIUM PHENOLS FLAVANOIDS NON_FLAVANOID_PHENOL PROANTHOCYANINS COLOR_INTENSITY HUE OD280/OD315_DILUTED PROLINE
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
In [15]:
features = ['ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY','MAGNESIUM', 'PHENOLS', 
              'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS', 'COLOR_INTENSITY', 
              'HUE', 'OD280/OD315_DILUTED','PROLINE']
label = 'CLASS'

X = df[features]
y = df[label]

# train test split with 70% for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
a. Preparing Projected data using PCA
In [16]:
# prepare correlation matrix
# standar scaler for normalization
N, _ = df.shape
scaler = StandardScaler()
Z = scaler.fit_transform(X)
# Correlation estimation
R = np.dot(Z.T, Z) / N

# eigendecomposition
eigen_values, eigen_vectors = np.linalg.eig(R)

# prepare projection matrix
value_idx = eigen_values.argsort()[::-1]
eigen_vectors_sorted = eigen_vectors[:, value_idx]

# Projection matrix with 3 PCs ( 3 PCs cover 65% variance in the data)
# more on : https://sijanb.com.np/posts/principal-component-analysis-pca-for-visualization-using-python/
M = np.hstack((eigen_vectors_sorted[0][:, np.newaxis],
               eigen_vectors_sorted[1][:, np.newaxis],
               eigen_vectors_sorted[2][:, np.newaxis]))

# projected data
projected_data = np.asmatrix(Z) * np.asmatrix(M)
b. Using Projected data for the training and prediction using Decision Tree
In [17]:
# train test split for training
Xpc_train, Xpc_test, ypc_train, ypc_test = train_test_split( projected_data, y, test_size=0.3, random_state=0)

tree_pca = DecisionTreeClassifier(max_depth=6, random_state=0)
tree_pca.fit(Xpc_train, ypc_train)

ypca_pred = tree_pca.predict(Xpc_test)
print('Test accuracy using Decision tree on PCA projected data: %.2f' % accuracy_score(ypc_test, ypca_pred))
Test accuracy using Decision tree on PCA projected data: 0.76

Principal Component Analysis (PCA) for Visualization using Python

1. Basic Setup

Principal Component Analysis (PCA) is being used to reduce the dimensionality of data whilst retaining as much of information as possible. The general idea of PCA works as follows:

 a. Find the principal components from your original data
 b. Project your original data into the space spanned by principal components from (a)


Let's use $ \textbf{X} $ as our data matrix and $ \sum $ as our covariance matrix of $ \textbf{X} $. We will get eigenvectors ( $ \bf{v_1}, {v_2}, .....{v_k} $) and eigenvalues (${\lambda_1},{\lambda_2},....,{\lambda_k}$) from the covariance matrix $ \sum $, such that:

$ \lambda_1 \geq \lambda_2 \geq ...... \lambda_k $

NOTE* : Elements of the vector ($ \bf{v_1} $ ) represents the coefficients of principal components.

Our goal is to maximize the variance of projection along each of principal components. This can be written as:

$ \bf{var(y_i)} = \bf{var}(v_{i1} * X_1 + v_{i2} * X_2 + ...... + v_{ik} * X_k ) $

You can see that, we are projecting the original data into our new vector space given by PCs.

NOTE* : $ \bf{var(y_i)} = \lambda_i $ and principal components are uncorrelated i.e $ cov(y_i, y_j) $ = 0

2. Principal Component Analysis Algorithm (Pseudocode)

a. $ \textbf{X} \gets $ design data matrix with dimension ( N*k )

b. $ \textbf{X} \gets $ subtract mean from each column vector of $ \bf{X} $

c. $ \sum \gets $ compute covariance matrix of $ \bf{X} $

d. Calculate eigenvectors and eigenvalues from $ \sum $

e. Principal Components (PCs) $ \gets $ the first M eigenvectors with largest eigenvalues.

3. Basic Data Analysis

Sijan Bhandari on

Linear Regression with Maximum Likelihood (MSE) and Bayesian Learning Approach from scratch

1. Setup

Given the input dataset $ \textbf{D} = \{(x_i, t_i)\}_{i=1}^{N} $, our goal is to learn the parameters that model this data and use those parameters (w) for the prediction new data point.

We often define the features (basis functions) as : $ \{ \phi_1(x),......,\phi_m(x) \} $ and the linear regression model is defined as :

$ y(x,w) = \sum_{i=1}^m w_i \phi_i (x) + \varepsilon_i $

$ \varepsilon_i $ suggesting that we can not perfectly model the data generation process and there will be certain noise in our designed model(paramters). Usually, we assume that the noise is Gaussian. $ \varepsilon_i \sim Normal(0, \beta^{-1} )$

2. Objective Functions

a. Maximum Likelihood objective (MLE) .i.e . ( Mean Squared Error )

J(w) = $ \frac{1}{2} \sum_i^N \{t_i - w^T \phi(x_i) \}^2 $

b. Regularized Linear Regression

J(w) = $ \frac{1}{2} \sum_i^N \{t_i - w^T \phi(x_i) \}^2 $ + $ \frac{\lambda}{2} w^T w $

3. Closed Form Solutions

a. For MSE

$ w_{MLE} = ( \phi^T \phi )^{-1} \phi^T t $

$ ( \phi^T \phi )^{-1} $ is called Moore-Penrose inverse.

b. For Regularized MSE

$ w_{MLE} = ( \lambda I + \phi^T \phi )^{-1} \phi^T t $

4. Bayesian Learning

By using Bayes rule, we have :

$ p(w | t) = \frac{p(t|w)p(w)}{p(t)} $

$ p(w|t) $ - Posterior distribution $ p(t|w) $ - Likelihood of data given parameters $ p(w) $ - Prior distribution over paramters (w)

a. Pior on w :

Understanding Regularization for Support Vector Machines (SVMs)

I would like you to go through Intuition Behind SVM before exploring about Regularization.

SVM has an objective To find the optimal linearly speparating hyperplane which maximizes the margin. But, we know that Hard-Margin SVM can work well when data is completely lineary separable (without any noise or outliers). What if our data is not perfectly separable? We have two options for non-separable data:

 a. Using Hard-Margin SVM with feature transformations
 b. Using Soft-Margin SVM

If you want a good generalization on your result, we should tolerate some errors. If we force our model to be perfect, it will be just an attempt to overfit the data!.

Let's talk about Soft-Margin SVM and it helps us to understand Regularization. If the training data is not linearly separable, we allow our hyperplane to make few mistakes on outliers or say noisy data. Mistakes means those outliers/noise data can be inside the margin or on the wrong side of the margin.

But, we will have a mechanism to pay a cost for each of those misclassified example. That cost will depend on how far the data point is from the margin. This cost is represented by slack variables ($ξ_i$)

objective function : $ \frac{1}{2} ||w||^2 + C \sum_{i=1}^{n} ξ_i $

In the above equation, the parameter C defines the strength of regularization. We can discuss three different cases based on values of C:

Sijan Bhandari on

Implementation of K means clustering algorithm in Python

For K means clustering algorithm, I will be using Credit Cards Dataset for Clustering from Kaggle.

In [135]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

%matplotlib inline

Data preprocessing

In [136]:
credit_data = pd.read_csv('../data/CC GENERAL.csv')
In [137]:
credit_data.head()
Out[137]:
CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
0 C10001 40.900749 0.818182 95.40 0.00 95.4 0.000000 0.166667 0.000000 0.083333 0.000000 0 2 1000.0 201.802084 139.509787 0.000000 12
1 C10002 3202.467416 0.909091 0.00 0.00 0.0 6442.945483 0.000000 0.000000 0.000000 0.250000 4 0 7000.0 4103.032597 1072.340217 0.222222 12
2 C10003 2495.148862 1.000000 773.17 773.17 0.0 0.000000 1.000000 1.000000 0.000000 0.000000 0 12 7500.0 622.066742 627.284787 0.000000 12
3 C10004 1666.670542 0.636364 1499.00 1499.00 0.0 205.788017 0.083333 0.083333 0.000000 0.083333 1 1 7500.0 0.000000 NaN 0.000000 12
4 C10005 817.714335 1.000000 16.00 16.00 0.0 0.000000 0.083333 0.083333 0.000000 0.000000 0 1 1200.0 678.334763 244.791237 0.000000 12
A. Check for missing data
In [138]:
credit_data.isna().sum()
Out[138]:
CUST_ID                               0
BALANCE                               0
BALANCE_FREQUENCY                     0
PURCHASES                             0
ONEOFF_PURCHASES                      0
INSTALLMENTS_PURCHASES                0
CASH_ADVANCE                          0
PURCHASES_FREQUENCY                   0
ONEOFF_PURCHASES_FREQUENCY            0
PURCHASES_INSTALLMENTS_FREQUENCY      0
CASH_ADVANCE_FREQUENCY                0
CASH_ADVANCE_TRX                      0
PURCHASES_TRX                         0
CREDIT_LIMIT                          1
PAYMENTS                              0
MINIMUM_PAYMENTS                    313
PRC_FULL_PAYMENT                      0
TENURE                                0
dtype: int64

We can see that some missing values in column MINIMUM_PAYMENTS column. Since we are focusing on algorithm aspect in this tutorial, I will simply remove entries having 'NaN' value.

B. Remove 'NaN' entries
In [139]:
credit_data = credit_data.dropna(how='any')
C. Remove nonrelevant column/feature
In [140]:
# Customer ID does not bear any meaning to build cluster. So, let's remove it.
credit_data.drop("CUST_ID", axis=1, inplace=True)
Sijan Bhandari on