The Nearest centroid classifier (NCC) says that we should classify a data point to a class whose centroid is closest to this data point.
The algorithms follows:
Suppose $ c_l $ represents the set of indices which belong to class l
. And n = |$c_l$|
1.Training step :¶
We compute the centroids(CTs) for each of the classes as:
$CT_l$ = $ \frac{1}{n} \sum_{i \in c_l} x_i $
2.Prediction step :¶
a. Given a new data point $x_{new}$, compute the distance between $x_{new}$ and each centroids as
distance
: $|| x_{new} - $CT_l$||_2$ (Euclidean distance)
b. Assign the class to this new point which has minimum distance
value.
Let us taken an example. We have to classify fruits into two classes : Apple and Orange, based on their height and width.
Our inputs (x) are :
x1=[5,6], x2=[5,7], x3=[4,3], x4=[5,7], x5=[6,4]
and corrresponding labels (y) are
𝑦1='AP' 𝑦2='AP' 𝑦3='AP' 𝑦4='ORG' 𝑦5='ORG'
Here $x_i$ = [width, height] , 'AP' = 'Apple', 'ORG' = 'Orange'.
Now, centroids for two classes are :
$ CT_{AP} $ = $ \frac{1}{3} (5 + 5 + 4, 6 + 7 + 3) $ = ($\frac{14}{3}, \frac{16}{3} $)
$ CT_{ORG} $ = $ \frac{1}{2} (5 + 6, 7 + 4) $ = ($\frac{11}{2}, \frac{11}{2} $)
Suppose, you got a new test data point : (3, 7) i.e $x_{new}$, and you want to classify this point. We can calculate the distance between new point and our centroids as:
$ ||x_{new} - CT_{AP} || $ = || (3,7) - ($\frac{14}{3}, \frac{16}{3} $) || = 2.357
$ ||x_{new} - CT_{ORG} || $ = || (3,7) - ($\frac{11}{2}, \frac{11}{2} $) || = 2.915
Here, the new data point is classified as 'Apple' as the new data point is closest to the centroid of data points that belong to class 'Apple'