How Does k-Nearest Neighbors Work?

k-Nearest Neighbors (k-NN) is one of the classification algorithms in machine learning. Since k-NN simply memorizes, or stores, the training data in memory rather than learning an explicit model from it, we can also say that it does not follow the usual machine learning paradigm.

Before moving on to the implementation, let us establish some standard notation so that it is easier to follow the code in the implementation section (next post):

          X_training = the given training data
          Y = labels for the given training data
          X_test = new data (for which we need to predict labels)


The whole algorithm can be divided into three major steps:

  1. Find the k nearest neighbors (the most similar data instances) in your training data for a given test instance (X_test).

     Let's say we have 20 training instances in total, and 4 of them turn out to be the nearest neighbors of
     one of your test instances.
  2. Once you have the nearest neighbors from your training data, collect the labels of those selected training instances.

     In our case, we have 4 training instances as nearest neighbors, so we collect the labels of all
     4 of them.
  3. Finally, predict the label of your test instance (X_test) by majority vote.

     In our case, suppose 3 out of the 4 neighbors share the same label. The majority vote then
     assigns that label to the new data point.

The parameter k defines the number of neighbors that the algorithm collects when making the prediction for your new data, as the sketch below illustrates.
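
To make the three steps concrete, here is a minimal sketch of a k-NN classifier in plain Python. The function name knn_predict is a hypothetical choice for illustration, and Euclidean distance is assumed as the similarity measure (neither is fixed by the algorithm itself):

import math
from collections import Counter

def knn_predict(X_training, Y, x_test, k=3):
    # Step 1: compute the Euclidean distance from x_test to every training instance
    distances = []
    for x, label in zip(X_training, Y):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_test)))
        distances.append((d, label))
    # Step 2: collect the labels of the k nearest neighbors
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Step 3: predict by majority vote over the collected labels
    return Counter(nearest_labels).most_common(1)[0][0]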

Example

Suppose we have a data set of students with two features, height and weight, and the class labels "Asian", "European", and "American", as follows:

In [32]:
from IPython.display import HTML, display
import tabulate

table = [["Height(cm)", "Weight(kg)", "Class"],
         [169, 60, 'Asian'],
         [171, 59, 'Asian'],
         [172, 70, 'European'],
         [179, 69, 'European'],
         [170, 75, 'Asian'],
         [175, 80, 'American'],
         [176, 79, 'American'],
         [180, 71, 'European'],
         [171, 76, 'American']]
display(HTML(tabulate.tabulate(table, tablefmt='html')))
Height(cm)  Weight(kg)  Class
169         60          Asian
171         59          Asian
172         70          European
179         69          European
170         75          Asian
175         80          American
176         79          American
180         71          European
171         76          American

Now, suppose a new student enters the class with the features:

     Height : 168
     Weight : 65

Which class does this new student belong to?

Now, let us compute a distance measure between this new point and every training data point. Using Euclidean distance, the distance to the first training instance is:

sqrt((169-168)^2 + (60-65)^2) = 5.09

In [24]:
import math

def distance(p1, p2):
    # Euclidean distance between two 2-D points, rounded up to an integer,
    # which is why the 5.09 above appears as 6 in the result table below
    sq_distance = (p1[0] - p2[0])**2 + (p1[1] - p2[1])**2
    return math.ceil(sq_distance**(1/2))
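
As a quick check, the helper reproduces the hand-computed value above (5.09, rounded up):

distance([169, 60], [168, 65])   # returns 6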
In [33]:
table = [["Height(cm)", "Weight(kg)", "Class", "Distance"],
         [169, 60, 'Asian', distance([169, 60], [168, 65])],
         [171, 59, 'Asian', distance([171, 59], [168, 65])],
         [172, 70, 'European', distance([172, 70], [168, 65])],
         [179, 69, 'European', distance([179, 69], [168, 65])],
         [170, 75, 'Asian', distance([170, 75], [168, 65])],
         [175, 80, 'American', distance([175, 80], [168, 65])],
         [176, 79, 'American', distance([176, 79], [168, 65])],
         [180, 71, 'European', distance([180, 71], [168, 65])],
         [171, 76, 'American', distance([171, 76], [168, 65])]]
display(HTML(tabulate.tabulate(table, tablefmt='html')))
display(HTML(tabulate.tabulate(table, tablefmt='html')))
Height(cm)  Weight(kg)  Class     Distance
169         60          Asian     6
171         59          Asian     7
172         70          European  7
179         69          European  12
170         75          Asian     11
175         80          American  17
176         79          American  17
180         71          European  14
171         76          American  12

From the result table, let's select the k = 3 neighbors with the smallest distances (the nearest points):

 a. 169  60  Asian     6
 b. 171  59  Asian     7
 c. 172  70  European  7

So the labels of the selected neighbors are: ['Asian', 'Asian', 'European']

With a majority vote of 2 to 1, we can classify the new student as 'Asian'. The same example can also be reproduced with scikit-learn, as sketched below.
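
As a sanity check (not part of the original walkthrough), here is a sketch of the same example using scikit-learn's KNeighborsClassifier, assuming scikit-learn is installed. Note that it uses exact Euclidean distances rather than the rounded ones above:

from sklearn.neighbors import KNeighborsClassifier

# Training data from the table above: [height, weight]
X_training = [[169, 60], [171, 59], [172, 70], [179, 69], [170, 75],
              [175, 80], [176, 79], [180, 71], [171, 76]]
Y = ['Asian', 'Asian', 'European', 'European', 'Asian',
     'American', 'American', 'European', 'American']

clf = KNeighborsClassifier(n_neighbors=3)  # k = 3; Euclidean distance by default
clf.fit(X_training, Y)
print(clf.predict([[168, 65]]))  # expected output: ['Asian']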

