Support vector machine (SVM) is a supervised machine learning algorithm which is considered effective tool for both classification and regression problem.

In a simple word, SVM tries to find a linearly separable hyperplane in order to separate members of one class from another. If SVM can not find the hyperplane for a given data set, it applies non-linear mapping to the training data and transform them to higher dimension where it searches for the optimal hyperplane. The SVM algorithm uses support vectors and margins in order to draw these hyperplanes in the training data.

Since it has ability to understand the complex relation in input data by applying nonlinear mapping, it has high accuracy compare to other supervised classification algorithms (kNN, NCC..)

People have been using SVM for different applications like : text data classification, image data(handwritten) recognition and more.

Intuition:

Let us see a linearly separable problem as shown in the diagram below.

In the given diagram, we can draw infinite lines that can separate data into two different classes.

Can you decide Which of the lines (pink, yellow, green) better suits our problem? If you look closely, the green line is close to red dots and any minor change in data point may caues this point to fall into other class. On the other hand, Pink line is close to blue dots and similarly, a minor twist on data are prone to change the class of new point to other side and may misclassified.

But, the yellow line looks more stable, in the sense, it is far from data points on either side and not susceptible to the small changes in data observations.

Support vector machines help us to make a decision on this problem. If you think 'yellow' line in the figure is the optimal hyperplane for this problem, we can take this as an intuition that an optimal choice from the number of choices of hyperplanes is the one which is farthest from our training data points.

We can compute the perpendicular distances from all our training points to all possible hyperplanes and choose the one which has largest minimum distance to the training examples. The minimum distance between hyperplane and the observation point is called margin.

In summary, SVMs maximize the margin around our separating hyperplane. And, The idea is to draw a margin of some width around the separating hyperplane, up to the nearest point.

The training data that falls exactly on the boundaries of the margin are called

the support vectors. If these points shifted on any direction, they also influence the hyperplanes passing through the margin points and they also shifted. It is important to notice that support vectors are critical data points on our training set since they are the one defining margin boundaries.

It is important to note that the complexity of SVM is characterized by the number of support vectors, rather than the dimension of the feature space. That is the reason SVM has a comparatively less tendency to overfit.

#### Support vectors "support the plane" and are hence called Support Vectors.¶

Upto now, we are trying to find the hard-margin for our hyperplane. But, as shown in the diagram below, a single point can have high influence on our boundary. So hard margin SVM are sensitive to outliers, giving overfitted model.

What about we relax our hard margin condition? where, a small amount of data points are allowed to cross the margins/boundaries .i.e data points can be on incorrect side as well. This approach is called soft-margin. Soft in the sense that few data points can violate the margin condition. The softness is measured by slack variable; which control the position of training data points relative to the margins and our hyperplane. The ultimate goal is, our SVM tries to maximize the soft margin.

In [ ]:


Sijan Bhandari on

The Nearest centroid classifier (NCC) says that we should classify a data point to a class whose centroid is closest to this data point.

The algorithms follows:

Suppose $c_l$ represents the set of indices which belong to class l. And n = |$c_l$|

##### 1.Training step :¶

We compute the centroids(CTs) for each of the classes as:

$CT_l$ = $\frac{1}{n} \sum_{i \in c_l} x_i$

##### 2.Prediction step :¶

a. Given a new data point $x_{new}$, compute the distance between $x_{new}$ and each centroids as

distance : $|| x_{new} -$CT_l$||_2$ (Euclidean distance)

b. Assign the class to this new point which has minimum distance value.

Let us taken an example. We have to classify fruits into two classes : Apple and Orange, based on their height and width.

Our inputs (x) are :

x1=[5,6], x2=[5,7], x3=[4,3], x4=[5,7], x5=[6,4] and corrresponding labels (y) are

𝑦1='AP' 𝑦2='AP' 𝑦3='AP' 𝑦4='ORG' 𝑦5='ORG'

Here $x_i$ = [width, height] , 'AP' = 'Apple', 'ORG' = 'Orange'.

Now, centroids for two classes are :

$CT_{AP}$ = $\frac{1}{3} (5 + 5 + 4, 6 + 7 + 3)$ = ($\frac{14}{3}, \frac{16}{3}$)

$CT_{ORG}$ = $\frac{1}{2} (5 + 6, 7 + 4)$ = ($\frac{11}{2}, \frac{11}{2}$)

Suppose, you got a new test data point : (3, 7) i.e $x_{new}$, and you want to classify this point. We can calculate the distance between new point and our centroids as:

$||x_{new} - CT_{AP} ||$ = || (3,7) - ($\frac{14}{3}, \frac{16}{3}$) || = 2.357

$||x_{new} - CT_{ORG} ||$ = || (3,7) - ($\frac{11}{2}, \frac{11}{2}$) || = 2.915

Here, the new data point is classified as 'Apple' as the new data point is closest to the centroid of data points that belong to class 'Apple'

In [ ]:


Sijan Bhandari on

K-Nearest Neighbors (k-NN) is one of the classification algorithms in Machine learing. Since, K-NN simply memories or stores the rules in memory, we can also say that it does not follow actual machine learning paradigm.

Before going to implementation, let us build some standard notation so that it is easier to follow the code in the implementation section (next post):

          X_training = given training data
Y = labels for your given training data
X_test = new data (For which we need to predict the labels)



The whole algorithm can be divided into three major steps:

1. Finding most nearest neighbors (similar data instances) from your training data for a given test data (X_test)

Let's say we have total 20 training data, and you find out 4 training instances as nearest neighbors for
one of your test data
2. Once you get the nearest neighbors from your training data, collect all the labels of your selected training data

In our case, we have 4 training instances as nearest neighbors. So, we will collect labels for all
these 4 training data. 
3. Finally, predict the label of your test data (X_test) based on majority count.

      In our case, suppose 3 out of 4 training instances have the same label. Then, the majority count
will assign the label to new data point

The K defines the number of neighbors that the algorithm will collect for making the prediction of your new data.

###### Example¶

Suppose, we have a data set of a students with features : weight and height and labels as "Asian, European, American" as follows:

In [32]:
from IPython.display import HTML, display
import tabulate
table = [["Height(cm)","Weight(kg)","Class"],
[169, 60, 'Asian'],
[171, 59, 'Asian'],
[172, 70, 'European'],
[179, 69, 'European'],
[170, 75, 'Asian'],
[175, 80, 'American'],
[176, 79, 'American'],
[180, 71, 'European'],
[171, 76, 'American'],

]
display(HTML(tabulate.tabulate(table, tablefmt='html')))

 Height(cm) Weight(kg) Class 169 60 Asian 171 59 Asian 172 70 European 179 69 European 170 75 Asian 175 80 American 176 79 American 180 71 European 171 76 American

Now, suppose new student enters the class with feature

     Height : 168
Weight : 65



Which class this new student belongs to?

Now, let us use some distance measure between this new point and all the training data points as follows:

sqrt((169-168)^2 + (60-65)^2) = 5.09

In [24]:
import math
def distance(p1, p2):
sq_distance = (p1[0] - p2[0])**2 + (p1[1] - p2[1])**2
return math.ceil(sq_distance**(1/2))

In [33]:
table = [["Height(cm)","Weight(kg)","Class", "Distance"],
[169, 60, 'Asian', distance([169, 60], [168, 65])],
[171, 59, 'Asian', distance([171, 59], [168, 65])],
[172, 70, 'European', distance([172, 70], [168, 65])],
[179, 69, 'European', distance([179, 69], [168, 65])],
[170, 75, 'Asian', distance([170, 72], [168, 65])],
[175, 80, 'American', distance([175, 80], [168, 65])],
[176, 79, 'American', distance([176, 79], [168, 65])],
[180, 71, 'European', distance([180, 71], [168, 65])],
[171, 76, 'American', distance([171, 76], [168, 65])],

]
display(HTML(tabulate.tabulate(table, tablefmt='html')))

 Height(cm) Weight(kg) Class Distance 169 60 Asian 6 171 59 Asian 7 172 70 European 7 179 69 European 12 170 75 Asian 8 175 80 American 17 176 79 American 17 180 71 European 14 171 76 American 12

From the result table, let's select k=3, neighbors having minimal distance(nearest points)

 a. 169     60  Asian         6
b. 171     59  Asian         7
c. 172     70  European        7 

So all possible labels in our final result is : ['Asian', 'Asian', 'European']

With majority vote, we can classify the new student as 'Asian'

In [ ]:


Sijan Bhandari on

Let me start with an example. Suppose you have built a ML model under supervised learning which can detect faces from your family members. You simply feed all the family photos to your model. And, at the end, you would like to know how much your model learnt from your training process.

After training your model, you will find a function that maps input (image) to output labels(family members). If we represent your ML model with a function 'f',we can write this process as:

  y = f(x),
where  x = input  (your family photos)
y = output (model prediction) (predictiong whether the photo is your father, mother or sister)



Suppose, you use 50 images for your training process and use the same images to see whether the model can predict the label correctly or not. If you found that 10 out of 50 images are not predicted well, you can say : 20% training error. If you think about the intuition behind this calculation, it is saying that 'the model is not doing well with the training data itself'. That means it underfits the data. It does not want to carry all the information available in your data.

*NOTE: If your model 'f' is a linear model than it will be really hard for your model to classify the photos using linear function.

Let us consider the different scenario. You found that there is only 2% training error after training your model. And, you are happy that only 1 image out of 50 is being wrongly predicted. But, you are still suspicious about your model and apply different strategy this time. You captured new 10 photos of your family and gave it to the model. And, at this time all the 10 photos are wrongly predicted by your model. What we can see is that 'your model learn everything from the training data without considering the generalization of data' (increased variance). We can also say that model overfits your training data. Obviously, learning everything from training data will reduce the biasness but it increases the variance in your data.

In [ ]:


Sijan Bhandari on
In [2]:
# -*- coding: utf-8 -*-
# @Author: Sijan
# @Date:   2019-04-01

from random import randint

def step_function(result):
"""
Simple linear function which will be activated if the value is greater than 0.
"""
if result > 0:
return 1
return 0

class Perceptron:
"""
Perceptron class defines a neuron with attributes : weights, bias and learning rate.
"""

def __init__(self, input_size):
self.learning_rate = 0.5
self.bias = randint(0, 1)
self.weights = [randint(0, 1) for _ in range(input_size)]

def feedforward(perceptron, node_input):
"""
Implements product between input and weights
"""
node_sum = 0
node_sum += perceptron.bias

for index, item in enumerate(node_input):
# print('input node is', item)
node_sum += item * perceptron.weights[index]

return step_function(node_sum)

"""
Adjust weightage based on error. It simply scales input values towards right direction.

"""
for index, item in enumerate(node_input):
perceptron.weights[index] += item * error * perceptron.learning_rate

perceptron.bias += error * perceptron.learning_rate

def train(perceptron, inputs, outputs):
"""
Trains perceptron for given inputs.
"""
for training_input, training_output in zip(inputs, outputs):
actual_output = feedforward(perceptron, training_input)
desired_output = training_output
error = desired_output - actual_output

def predict(perceptron, test_input, test_output):
"""
Predicts new inputs.
"""
prediction = feedforward(perceptron, test_input)

# if test_input[1] == test_output:
print('input :%s gives output :%s' % (test_input, prediction))
print('input :%s has true output :%s' % (test_input, test_output))

if __name__ == '__main__':

train_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
train_outputs = [0, 0, 0, 1]

# train perceptron
perceptron = Perceptron(2)
epochs = 10

for _ in range(epochs):
train(perceptron, train_inputs, train_outputs)

# test perceptron
test_input = (1,1)
test_output = 1
print('...................................')
print('...................................')
predict(perceptron, test_input, test_output)

weight after adjustment [0.0, 0.0]
...................................
...................................
input :(1, 1) gives output :1
input :(1, 1) has true output :1

In [ ]:


Sijan Bhandari on

Randomness is not a property of a phenomenon. It is simply an unpredectability of occurence of events around you. It occurs in different scenarios of our life. For example, while roaming around street, you found a coin; now you would certainly look for other coin around that spot. But, there will be not any certainty or possibility or pattern of finding one. Other examples are - tossing coin / dice, fluctuating market prices for common goods.

In the field of Mathematics and probability, we assign some numerical value for identifying each of this random outcome. i.e. we use probability to quantify randomness. And, probability of certain event is calculated by the relative frequency of that event in the experiment.

In probability, the current occurrence / selection you do for your experiment is an event. For example,fliping a coin is an event.And, the act of tossing the coin is called independent trail. If you do number of trails, it is called an experiment. And, all the possible outcomes of an experiment is called sample space. So, we can say that an event is also a subset of sample space.

Another example : Suppose you need to choose a point from an interval (10, 100). Your selection E = (12, 34) is an event.

#### Disjoint and Independent Event¶

Disjoint event means if an event occurs than other events can't occur simultaneously. That means joint probabilit will be zero P(A and B) = 0. We can say that disjoint events are very dependent. That is : when an event occurs with certain probability, then other event will have zero chance of occurrence. For example: when tossing a coin, the result can either be heads or tails but cannot be both.

On the other hand, independent events are those where occurrence of one does not affect the occurence of another event. For example: when tossing two coins, the result of one flip does not affect the result of the other.

       conditions for independence
P(A and B) = P(A)×P(B)
P(A, given that B occurs) = P(A)

#### Joint and Marginal Probability¶

Joint probability measure the probability that two events will occur simultaneously and marginal probability is the probability of single event.

Joint probability is expressed as :P(A1,A2 ...,An) ; where A1, A2...An are the events.

Let us take an example:

P(life expectancy=70, nationality=Nepal) = 0.5 means there is 0.5 chances of a person, picked from a population, is a Nepali and has the life expectancy of 70 years.

#### Conditional Probability¶

Formally, conditional probability can be defined as the probability of an event (A); given the probability of another event (B).

Mathematically, it is denoted as P(A | B).

   P(A | B) = P(A and B) / P(B)   ; P(A and B) - joint probability of A and B.

Let us take an example :

Consider that a student has an 70% chance of being accepted in a university, only 40% of all of the accepted students will get the student residence offer. Then the chance of student getting accepted and receiving student residence offer is defined by

P(Accepted and Student resi.) = P(Student resi.|Accepted)P(Accepted)
= (0.40)*(0.70) = 0.28 

#### Bayes' Law¶

Let us suppose that we have a prior knowledge of a probability for a disease D to occurs when there is a symptom S P(S|D); and the probability of having disease D P(D); the probability of having symptom S P(S). If we have all these three information; we can calculate the probability of occurrence of disease D; given that the person has symptom S i.e P(D|S).

By using joint probability,

     P(A and B) = P(A|B) * P(A) = P(B|A) * P(B)

P(A|B) =  P(A)   *   P(B|A) / P(B)
(prior prob.)   (Posterior prob.)

or

P(B|A) =P(A|B)P(B) / [P(A|B)P(B) + P(A|B′)P(B′)]

P(B′) is the probability of B not occurring.

Example:

Suppose it has been observed empirically that the word “Congratulations” occurs in 1 out of 10 spam emails, but that “Congratulations” only occurs in 1 out of 1000 non-spam emails. Suppose it has also been observed empirically that about 4 out of 10 emails are spam. Suppose we get a new email that contains “Congratulations”.Then, what is the probability of this email being a spam?

Ans : Let C be the event representing emails having word 'Congratulations' and S be the event saying the email is spam.

Now, we need to calculate : P ( S | C ) ?

Here

 P(C|S) ~= 1/10
P(C|S') ~= 1/1000
P(S) ~= 4/10
P(S') ~= 6/10



By Bayes’ Theorem: P(S|C) = P(C|S )P(S) / [P(C|S )P(S) + P(C|S')P(S')]

                      = (1/10) (4/10) / [ 1/10 * 4/10 + 1/1000 * 6/10]
= 0.985
In [ ]:


Sijan Bhandari on

In this post, we are going to devise a measurement tool (perceptron model) in order to classify : whether a person is infected by a diseases or not.

In binary terms, the output will be

       {
1   if infected
0   not infected
}



To build inputs for our neural network, we take readings from the patients and we will treat readings as follows :

  body temperature = {
1   if body temperator > 99'F
-1   if body temperator = 99'F
}

heart rate = {
1   if heart rate > 60 to 100
-1   if heart rate = 60 to 100
}

blood pressure = {
1   if heart rate > 120/80
-1   if heart rate = 120/80
}



So, input from each patient will be represented as a three dimensional vector:

  input = (body temperatur, heart rate, blood pressure)


So, a person can now be represented as :

(1, -1, 1)
i.e (body temperator > 99'F, heart rate = 60 to 100, heart rate > 120/80)

Let us create two inputs with desired output value

      x1 = (1, 1, 1), d1 = 1 (infected)
x2 = (-1, -1, -1), d2 = 0 (not infected)


Let us take initial values for weights and biases: weights, w0 = (-1, 0.5, 0) bias, b0 = 0.5

And, activation function:

         A(S)   = {
1 if S >=0
0 otherwise
}

##### STEP 1¶

Feed x1 = (1, 1, 1) into the network.

weighted_sum:

S = (-1, 0.5, 0) * (1, 1, 1)^T + 0
= -1 + 0.5 + 0 + 0
= -0.5



When passed through activation function A(-0.5) = 0 = y1 We passed an infected input vector, but our perceptron classified it as not infected. Let's calculate the error term:

             e = d1 - y1 = 1 - 0 = 1


Update weight as:

             w1 = w0 + e * x1 = (-1, 0.5, 0) + 1 * (1, 1, 1) = (0, 1.5, 1)


And, update bias as:

             b1 = b0 + e = 1
##### STEP 2¶

Now, we feed second input (-1, -1, -1) into our network.

weighted_sum :

S = w1 * x2^T + b1
= (0, 1.5, 1) * (-1, -1, -1)^T + 1
= -1.5 - 1 + 1
= -1.5


When passed through activation function A(-1.5) = 0 = y2 We passed an not infected input vector, and our perceptron successfully classified it as not infected.

##### STEP 3¶

Since, our first input is mis-classified, so we will go for it.

weighted_sum :

S = w1 * x1^T + b1
= (0, 1.5, 1) * (1, 1, 1)^T + 1
= 1.5 + 1 + 1
= 3.5


When passed through activation function A(3.5) = 1 = y3 We passed an infected input vector, and our perceptron successfully classified it as infected.

Here, both input vectors are correctly classified. i.e algorithm is converged to a solution point.

In [ ]:



Perceptron is simply an artificial neuron capable of solving linear classification problems. It is made up of single layer feed-forward neural network.

A percentron can only takes binary input values and signals binary output for decision making. The output decision (either0 or 1), is based on the value of weighted sum of inputs and weights.

Mathematically perceptron can be defined as :

output O(n)=
{    0 if ∑wixi + $\theta$ <= 0
1 if ∑wixi + $\theta$ > 0
}

$\theta$ = threshold / bias

##### Perceptron Learning Algorithm¶

Perceptron learning is basically done by adjusting the weights and bias in training process.

Initially, we will have training set of input vector

[x_1(n), x_2(n),....x_m(n) ]



And, weight vetor

[w_1(n), w_2(n),....w_m(n)]



And, bias = b

For convenience let w0=b(n) and x0(n) = 1

i.e input vector = [1, x_1(n),x_2(n),.....x_m(n)]
and weight vector = [b(n), w_1(n), w_2(n),....,w_m(n)]

y(n) = actual output in training
d(n) = desired output
η = the learning rate

And, suppose, M and N are two different classes. Where output +1 belongs to M and -1 belongs N.

learning steps:

##### 1. Initialization¶

We will set initial value for weights : w(0) = 0

##### a. Activation¶

We will supply (input vector, desired output) = [x(n), d(n)] to perceptron.

##### b. Actual Response¶

For each input vector, we will calculate the actual output based on

      y(n) = sgn[w(n)x(n)]

Where sgn(.) represents signum function as following:

sgn(x) = {  +1 if x>=0
-1 if x<0 }



We will adjust the weight vector as follows:

calculate error :

         e = d(n) - y(n)


    w(n+1) = w(n) + eta * e * x(n)



where

      d(n) = { +1   if x(n) classified as M
-1   if x(n) classified as N
}
In [ ]:



Deep learning, in simpler version is a learning mechanisms for Neural networks. And, Neural networks are computational model mimicing human nervous system which are capable of learning. Like interconnected neurons in human brains, the neural network is also connected by different nodes. It receives signals as a set of inputs, perform calcuations and signals output based on some activation value.

Here are some list of problems, that deep learning can solve

1. Classification : object and speech recongnistion, classify sentiments from text
2. Clustering : Fraud detection

Elements of Neural Networks

1. Weights Biological neurons has synaptic strengths to define the importance of particular inputs. In the similar fashion, inputs to NNs have associated relative weights. These weights ultimately defines the connection intensity of the input to any neuron. These weights are adaptive in nature since they will be modified in the process of training.

2. Summation This is the first step in NN, where each input is multiplied by its corresponding weight and weighted sum is computed.

3. Transfer In the transfer process, we simply compare the summation output with threshold value and decide the final neural output.

An example would be,

inputs
(x1) = 2
(x2) = 1

weights
(w1) = 0.7
(w2) = 0.8

threshold = 2



Summation value :

 x1w1 + x2w2 = (2 x 0.7) + (1 x 0.8) = 2.2



Since summation value is greather than threshold, neuron will be fired.

In [ ]: