Linear Regression with Maximum Likelihood (MSE) and Bayesian Learning Approach from scratch

1. Setup

Given the input dataset $ \textbf{D} = \{(x_i, t_i)\}_{i=1}^{N} $, our goal is to learn the parameters $ w $ that model this data and to use those parameters to predict the target for new data points.

We define the features (basis functions) as $ \{ \phi_1(x), \ldots, \phi_m(x) \} $, and the linear regression model is:

$ t_i = y(x_i, w) + \varepsilon_i = \sum_{j=1}^m w_j \phi_j (x_i) + \varepsilon_i $

The noise term $ \varepsilon_i $ reflects the fact that we cannot perfectly model the data-generating process; there will always be some noise that our model (parameters) does not explain. Usually, we assume that the noise is Gaussian: $ \varepsilon_i \sim Normal(0, \beta^{-1} ) $, where $ \beta $ is the noise precision (inverse variance).
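As a concrete illustration, the assumed generating process can be simulated directly. This is a minimal sketch: the polynomial basis, the sample size, the noise precision, and the "true" weights below are arbitrary choices made for the example, not part of the derivation.

import numpy as np

rng = np.random.default_rng(0)

# polynomial basis phi_j(x) = x**j, j = 0, ..., m-1  (an arbitrary choice of basis)
def design_matrix(x, m=4):
    return np.vstack([x**j for j in range(m)]).T          # shape (N, m)

N, beta = 50, 25.0                                         # beta = noise precision
true_w = np.array([0.5, -1.0, 0.3, 0.1])                   # hypothetical "true" weights
x = rng.uniform(-1, 1, size=N)
PHI = design_matrix(x)
t = PHI @ true_w + rng.normal(0.0, 1.0 / np.sqrt(beta), size=N)   # t_i = y(x_i, w) + eps_i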

2. Objective Functions

a. Maximum Likelihood objective (MLE), i.e. Mean Squared Error

$ J(w) = \frac{1}{2} \sum_{i=1}^N \{t_i - w^T \phi(x_i) \}^2 $

b. Regularized Linear Regression

$ J(w) = \frac{1}{2} \sum_{i=1}^N \{t_i - w^T \phi(x_i) \}^2 + \frac{\lambda}{2} w^T w $
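Both objectives are easy to evaluate for any candidate weight vector; a minimal sketch (reusing the PHI and t from the simulation above, with an arbitrary lambda):

def mse_objective(w, PHI, t):
    # J(w) = 1/2 * sum_i (t_i - w^T phi(x_i))^2
    residual = t - PHI @ w
    return 0.5 * residual @ residual

def regularized_objective(w, PHI, t, lambda_param=0.5):
    # adds the ridge penalty lambda/2 * w^T w
    return mse_objective(w, PHI, t) + 0.5 * lambda_param * w @ w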

3. Closed Form Solutions

a. For MSE

$ w_{MLE} = ( \Phi^T \Phi )^{-1} \Phi^T t $, where $ \Phi $ is the $ N \times m $ design matrix whose $ i $-th row is $ \phi(x_i)^T $.

$ \Phi^{\dagger} = ( \Phi^T \Phi )^{-1} \Phi^T $ is called the Moore-Penrose pseudo-inverse of $ \Phi $.

b. For Regularized MSE

$ w_{reg} = ( \lambda I + \Phi^T \Phi )^{-1} \Phi^T t $
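Both closed-form solutions are a couple of lines of NumPy (again a sketch using the PHI and t defined above; lambda is an arbitrary value):

# MLE / least-squares solution via the Moore-Penrose pseudo-inverse
w_mle = np.linalg.pinv(PHI) @ t

# regularized (ridge) solution: (lambda*I + PHI^T PHI)^{-1} PHI^T t
lambda_param = 0.5
w_reg = np.linalg.solve(lambda_param * np.eye(PHI.shape[1]) + PHI.T @ PHI, PHI.T @ t)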

4. Bayesian Learning

Using Bayes' rule, we have:

$ p(w | t) = \frac{p(t|w)p(w)}{p(t)} $

$ p(w|t) $ - Posterior distribution over the parameters
$ p(t|w) $ - Likelihood of the data given the parameters
$ p(w) $ - Prior distribution over the parameters $ w $

a. Prior on w :

$ p(w) = Normal( w \mid m_0, S_0) $

For convenience, we will use $ m_0 = 0 $ and $ S_0 = \alpha^{-1} I $, i.e. $ p(w) = Normal( w \mid 0, \alpha^{-1} I) $.

b. Likelihood of data :

As mentioned before, we assume Gaussian noise with precision parameter $ \beta $, so the likelihood will also be Gaussian.

$ p(t|w) = Normal (t|y(x,w), \beta^{-1}) $

And, because the targets $ t_i $ are conditionally independent given $ w $, the likelihood of the whole dataset becomes:

$ p(t_1, t_2, \ldots, t_N \mid w, X, \beta) = \prod_{i=1}^N Normal(t_i \mid w^T \phi(x_i), \beta^{-1}) $
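Because the likelihood factorizes over data points, its log is a sum of per-point Gaussian log-densities; a minimal sketch of the log-likelihood (reusing PHI and t from above):

def log_likelihood(w, PHI, t, beta):
    # sum_i log N(t_i | w^T phi(x_i), beta^{-1})
    residual = t - PHI @ w
    N = len(t)
    return 0.5 * N * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * residual @ residual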

c. Posterior on w :

Using Bayes' rule, we get:

$ p(w \mid X, t) \propto \prod_{i=1}^N Normal(t_i \mid w^T \phi(x_i), \beta^{-1}) \times Normal( w \mid 0, \alpha^{-1} I) $

And, our posterior will be :

$ p(w | t) $ = $ Normal( w| m_{post}, S_{post}) $

where the mean of the posterior, $ m_{post} $, is given by: $ m_{post} = S_{post} (S_0^{-1} m_0 + \beta \Phi^T t ) $

We are assuming $ m_0 = 0 $, so $ m_{post} = \beta S_{post} \Phi^T t $

And the covariance of the posterior, $ S_{post} $, is determined through its inverse:

$ S_{post}^{-1} = S_0^{-1} + \beta \Phi^T \Phi $

We have assumed $ S_0 = \alpha^{-1} I $, therefore:

$ S_{post}^{-1} = (\alpha^{-1} I )^{-1} + \beta \Phi^T \Phi = \alpha I + \beta \Phi^T \Phi $
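In code, the posterior covariance is obtained by inverting $ S_{post}^{-1} $; a minimal sketch with assumed values for $ \alpha $ and $ \beta $ (again reusing PHI and t from the earlier sketches):

alpha, beta = 1.0, 25.0                                   # assumed precision hyperparameters
S_post_inv = alpha * np.eye(PHI.shape[1]) + beta * PHI.T @ PHI
S_post = np.linalg.inv(S_post_inv)
m_post = beta * S_post @ PHI.T @ t                        # posterior mean (with m_0 = 0)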

d. Predictive Distribution :

For a new test input $ x $, we want to predict the corresponding target $ t' $:

Using the sum rule of probability, we can write:

$ p(t') = \int p(t',w) dw $ = $ \int p(t' | w) p(w) dw $

And, conditioning on the observed targets $ \textbf{t} $, we get:

$ p(t' | \textbf{t} ) $ = $ \int p(t' | w) p(w | \textbf{t} ) dw $

Making the prior and noise precision parameters $ \alpha, \beta $ explicit, we have:

$ p(t' | \textbf{t}, \alpha, \beta ) $ = $ \int p(t' | w, \beta) p(w | \textbf{t}, \alpha, \beta ) dw $

On the right side, inside the integral, the first term is the distribution of the target given the parameters, which is Gaussian with precision $ \beta $, and the second term is our Gaussian posterior. Integrating this product of Gaussians over $ w $ again gives a Gaussian:

$ p(t' | \textbf{t}, \alpha, \beta ) = Normal(t' \mid m_{post}^T \phi(x), \sigma^2(x)) $, where the predictive variance is $ \sigma^2(x) = \beta^{-1} + \phi(x)^T S_{post} \phi(x) $.

For a point prediction we only need the mean $ m_{post}^T \phi(x) $; the variance $ \sigma^2(x) $ quantifies how uncertain that prediction is.
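A minimal sketch of the predictive mean and variance for a new input (reusing design_matrix, m_post, S_post, and beta from the sketches above; x_new is just an example point):

x_new = 0.3                                               # hypothetical test input
phi_new = design_matrix(np.array([x_new]))[0]             # feature vector phi(x_new)
pred_mean = m_post @ phi_new                              # m_post^T phi(x)
pred_var = 1.0 / beta + phi_new @ S_post @ phi_new        # beta^{-1} + phi(x)^T S_post phi(x)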

5. Linear and Regularized Regression Implementation in Python

In [1]:
%matplotlib notebook
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
In [2]:
class LinearRegression():

    def __init__(self, is_regularized=False, lambda_param=0.5):
        self.lambda_param = lambda_param
        self.is_regularized = is_regularized
        self.weights = None

    def fit(self, X, y):
        PHI = np.asmatrix(X.values)
        t = np.asmatrix(y.values)
        if self.is_regularized:
            # ridge solution: inv(lambda*I + PHI^T PHI) * PHI^T * t
            expr = (PHI.T * PHI) + self.lambda_param * np.eye(PHI.shape[1])
            weights = np.linalg.inv(expr) * PHI.T * t.T
        else:
            # np.linalg.pinv computes the Moore-Penrose pseudo-inverse, i.e. inv(PHI.T * PHI) * PHI.T
            weights = np.linalg.pinv(PHI) * t.T

        self.weights = weights

    def rmse(self, X, y):
        N, _ = X.shape
       
        PHI = np.asmatrix(X.values)
        t_predict = PHI * self.weights
        t = np.asmatrix(y.values)
        
        sq_error = np.square(t_predict - t.T)
        return np.sqrt(np.sum(sq_error) / N)

    def predict(self, X):
        x = np.asmatrix(X.values)
        return x * self.weights

6. Bayesian Linear Regression Implementation in Python

In [3]:
class BayesianLinearRegression():

    def __init__(self, alpha=1.0, beta=1.0):
        """
        alpha: Prior has precision paramter alpha
        beta : We assume Gaussian Noise with mean=0 and precision = Beta (also called inverse variance)
        """
        self.alpha = alpha
        self.beta = beta
        self.posterior_covar = None
        self.posterior_mean = None

    def fit(self, X, y):
        PHI = np.asmatrix(X.values)
        t = np.asmatrix(y.values)

        prior_covar_inv = self.alpha * np.eye(PHI.shape[1])
        posterior_covar_inv =  prior_covar_inv + self.beta * PHI.T * PHI
        self.posterior_covar = np.linalg.inv(posterior_covar_inv)
        self.posterior_mean = self.beta * self.posterior_covar * PHI.T * t.T

    def rmse(self, X, y):
        N, _ = X.shape
        PHI = np.asmatrix(X.values)
        t_predict = PHI * self.posterior_mean
        t = np.asmatrix(y.values)
        sq_error = np.square(t_predict - t.T)
        return np.sqrt(np.sum(sq_error) / N)

    def predict(self, X):
        x = np.asmatrix(X.values)
        return x * self.posterior_mean
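The class above returns only the predictive mean. If the predictive uncertainty is also wanted, a small helper along the following lines could be added; this is a sketch, not part of the class above, and predict_var is a name chosen here for illustration:

def predict_var(model, X):
    # predictive variance sigma^2(x) = 1/beta + phi(x)^T S_post phi(x), one value per row of X
    PHI = np.asarray(X.values)
    S = np.asarray(model.posterior_covar)
    data_term = np.sum((PHI @ S) * PHI, axis=1)
    return 1.0 / model.beta + data_term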

7. Linear Regression and Bayesian Linear Regression on Housing Data

In [5]:
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/'
    'housing/housing.data',
    header=None,
    sep=r'\s+')

df.columns = [
    'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
    'PTRATIO', 'B', 'LSTAT', 'MEDV'
]
In [6]:
df.head()
Out[6]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
In [7]:
features = [
    'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
    'PTRATIO', 'B', 'LSTAT'
]
label = 'MEDV'
In [9]:
X = df[features]
y = df[label]
# 70% of the data for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
In [14]:
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

linear_train_score = float(linear_reg.rmse(X_train, y_train))
linear_test_score = float(linear_reg.rmse(X_test, y_test))
linear_predictions = linear_reg.predict(X_test)

plt.figure(1, figsize=(9, 6))
# only take few samples for the plot
label_linear = 'Linear regression : test score ({:.2f})'.format(linear_test_score)

samples = min(100, len(y_test))
x = range(samples)
plt.plot(x, y_test.iloc[0:samples], label='Ground Truth')
plt.plot(x, linear_predictions[0:samples], label=label_linear)
plt.xlabel("Index")
plt.ylabel("Values")
plt.legend(loc='upper right', prop={'size': 7})
plt.grid()
plt.show()
In [15]:
linear_regular_reg = LinearRegression(is_regularized=True)
linear_regular_reg.fit(X_train, y_train)

linear_regular_train_score = float(linear_regular_reg.rmse(X_train, y_train))
linear_regular_test_score = float(linear_regular_reg.rmse(X_test, y_test))
linear_regular_predictions = linear_regular_reg.predict(X_test)

# only take few samples for the plot
plt.figure(2, figsize=(8, 6))

label_regular = 'Regularized Linear regression : test score ({:.2f})'.format(linear_regular_test_score)

samples = min(100, len(y_test))
x = range(samples)
plt.plot(x, y_test.iloc[0:samples], label='Ground Truth')
# plt.plot(x, linear_predictions[0:samples], label=label_linear)
plt.plot(x, linear_regular_predictions[0:samples], label=label_regular)
plt.xlabel("Index")
plt.ylabel("Values")
plt.legend(loc='upper right', prop={'size': 7})
plt.grid()
plt.show()
In [16]:
linear_bayes_reg = BayesianLinearRegression()
linear_bayes_reg.fit(X_train, y_train)

linear_bayes_train_score = float(linear_bayes_reg.rmse(X_train, y_train))
linear_bayes_test_score = float(linear_bayes_reg.rmse(X_test, y_test))
linear_bayes_predictions = linear_bayes_reg.predict(X_test)

plt.figure(3, figsize=(9, 6))
# only take few samples for the plot
label_linear_bayes = 'Bayesian Linear regression : test score ({:.2f})'.format(linear_bayes_test_score)

samples = min(100, len(y_test))
x = range(samples)
plt.plot(x, y_test.iloc[0:samples], label='Ground Truth')
plt.plot(x, linear_bayes_predictions[0:samples], label=label_linear_bayes)
plt.xlabel("Index")
plt.ylabel("Values")
plt.legend(loc='upper right', prop={'size': 7})
plt.grid()
plt.show()