Linear Regression with Maximum Likelihood (MSE) and Bayesian Learning Approach from scratch

1. Setup

Given the input dataset $ \textbf{D} = \{(x_i, t_i)\}_{i=1}^{N} $, our goal is to learn the parameters $ w $ that model this data and to use those parameters to predict the target for new data points.

We define the features (basis functions) as $ \{ \phi_1(x), \ldots, \phi_m(x) \} $, and the linear regression model is:

$ t_i = y(x_i, w) + \varepsilon_i = \sum_{j=1}^m w_j \phi_j (x_i) + \varepsilon_i $

The noise term $ \varepsilon_i $ reflects the fact that we cannot perfectly model the data-generating process; there will always be some noise that our model (parameters) does not explain. Usually, we assume that the noise is Gaussian: $ \varepsilon_i \sim Normal(0, \beta^{-1} ) $, where $ \beta $ is the noise precision (inverse variance).
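As a concrete illustration, the assumed generating process can be simulated directly. This is a minimal sketch: the polynomial basis, the sample size, the noise precision, and the "true" weights below are arbitrary choices made for the example, not part of the derivation.

import numpy as np

rng = np.random.default_rng(0)

# polynomial basis phi_j(x) = x**j, j = 0, ..., m-1  (an arbitrary choice of basis)
def design_matrix(x, m=4):
    return np.vstack([x**j for j in range(m)]).T          # shape (N, m)

N, beta = 50, 25.0                                         # beta = noise precision
true_w = np.array([0.5, -1.0, 0.3, 0.1])                   # hypothetical "true" weights
x = rng.uniform(-1, 1, size=N)
PHI = design_matrix(x)
t = PHI @ true_w + rng.normal(0.0, 1.0 / np.sqrt(beta), size=N)   # t_i = y(x_i, w) + eps_i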

2. Objective Functions

a. Maximum Likelihood objective (MLE), i.e. Mean Squared Error

$ J(w) = \frac{1}{2} \sum_{i=1}^N \{t_i - w^T \phi(x_i) \}^2 $

b. Regularized Linear Regression

$ J(w) = \frac{1}{2} \sum_{i=1}^N \{t_i - w^T \phi(x_i) \}^2 + \frac{\lambda}{2} w^T w $
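Both objectives are easy to evaluate for any candidate weight vector; a minimal sketch (reusing the PHI and t from the simulation above, with an arbitrary lambda):

def mse_objective(w, PHI, t):
    # J(w) = 1/2 * sum_i (t_i - w^T phi(x_i))^2
    residual = t - PHI @ w
    return 0.5 * residual @ residual

def regularized_objective(w, PHI, t, lambda_param=0.5):
    # adds the ridge penalty lambda/2 * w^T w
    return mse_objective(w, PHI, t) + 0.5 * lambda_param * w @ w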

3. Closed Form Solutions

a. For MSE

$ w_{MLE} = ( \Phi^T \Phi )^{-1} \Phi^T t $, where $ \Phi $ is the $ N \times m $ design matrix whose $ i $-th row is $ \phi(x_i)^T $.

$ \Phi^{\dagger} = ( \Phi^T \Phi )^{-1} \Phi^T $ is called the Moore-Penrose pseudo-inverse of $ \Phi $.

b. For Regularized MSE

$ w_{reg} = ( \lambda I + \Phi^T \Phi )^{-1} \Phi^T t $
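Both closed-form solutions are a couple of lines of NumPy (again a sketch using the PHI and t defined above; lambda is an arbitrary value):

# MLE / least-squares solution via the Moore-Penrose pseudo-inverse
w_mle = np.linalg.pinv(PHI) @ t

# regularized (ridge) solution: (lambda*I + PHI^T PHI)^{-1} PHI^T t
lambda_param = 0.5
w_reg = np.linalg.solve(lambda_param * np.eye(PHI.shape[1]) + PHI.T @ PHI, PHI.T @ t)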

4. Bayesian Learning

Using Bayes' rule, we have:

$ p(w | t) = \frac{p(t|w)p(w)}{p(t)} $

$ p(w|t) $ - Posterior distribution over the parameters
$ p(t|w) $ - Likelihood of the data given the parameters
$ p(w) $ - Prior distribution over the parameters $ w $

a. Prior on w :

$ p(w) = Normal( w \mid m_0, S_0) $

For convenience, we will use $ m_0 = 0 $ and $ S_0 = \alpha^{-1} I $, i.e. $ p(w) = Normal( w \mid 0, \alpha^{-1} I) $.

b. Likelihood of data :

As mentioned before, we assume Gaussian noise with precision parameter $ \beta $, so the likelihood will also be Gaussian.

$ p(t|w) = Normal (t|y(x,w), \beta^{-1}) $

And, because the targets $ t_i $ are conditionally independent given $ w $, the likelihood of the whole dataset becomes:

$ p(t_1, t_2, \ldots, t_N \mid w, X, \beta) = \prod_{i=1}^N Normal(t_i \mid w^T \phi(x_i), \beta^{-1}) $
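Because the likelihood factorizes over data points, its log is a sum of per-point Gaussian log-densities; a minimal sketch of the log-likelihood (reusing PHI and t from above):

def log_likelihood(w, PHI, t, beta):
    # sum_i log N(t_i | w^T phi(x_i), beta^{-1})
    residual = t - PHI @ w
    N = len(t)
    return 0.5 * N * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * residual @ residual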

c. Posterior on w :

Using Bayes' rule, we get:

$ p(w \mid X, t) \propto \prod_{i=1}^N Normal(t_i \mid w^T \phi(x_i), \beta^{-1}) \times Normal( w \mid 0, \alpha^{-1} I) $

And, our posterior will be :

$ p(w | t) $ = $ Normal( w| m_{post}, S_{post}) $

where the mean of the posterior, $ m_{post} $, is given by: $ m_{post} = S_{post} (S_0^{-1} m_0 + \beta \Phi^T t ) $

We are assuming $ m_0 = 0 $, so $ m_{post} = \beta S_{post} \Phi^T t $

And the covariance of the posterior, $ S_{post} $, is determined through its inverse:

$ S_{post}^{-1} = S_0^{-1} + \beta \Phi^T \Phi $

We have assumed $ S_0 = \alpha^{-1} I $, therefore:

$ S_{post}^{-1} = (\alpha^{-1} I )^{-1} + \beta \Phi^T \Phi = \alpha I + \beta \Phi^T \Phi $
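In code, the posterior covariance is obtained by inverting $ S_{post}^{-1} $; a minimal sketch with assumed values for $ \alpha $ and $ \beta $ (again reusing PHI and t from the earlier sketches):

alpha, beta = 1.0, 25.0                                   # assumed precision hyperparameters
S_post_inv = alpha * np.eye(PHI.shape[1]) + beta * PHI.T @ PHI
S_post = np.linalg.inv(S_post_inv)
m_post = beta * S_post @ PHI.T @ t                        # posterior mean (with m_0 = 0)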

d. Predictive Distribution :

For a new test input $ x $, we want to predict the corresponding target $ t' $:

Using the sum rule of probability, we can write:

$ p(t') = \int p(t',w) dw $ = $ \int p(t' | w) p(w) dw $

And, conditioning on the observed targets $ \textbf{t} $, we get:

$ p(t' | \textbf{t} ) $ = $ \int p(t' | w) p(w | \textbf{t} ) dw $

Making the prior and noise precision parameters $ \alpha, \beta $ explicit, we have:

$ p(t' | \textbf{t}, \alpha, \beta ) $ = $ \int p(t' | w, \beta) p(w | \textbf{t}, \alpha, \beta ) dw $

On the right side, inside the integral, the first term is the distribution of the target given the parameters, which is Gaussian with precision $ \beta $, and the second term is our Gaussian posterior. Integrating this product of Gaussians over $ w $ again gives a Gaussian:

$ p(t' | \textbf{t}, \alpha, \beta ) = Normal(t' \mid m_{post}^T \phi(x), \sigma^2(x)) $, where the predictive variance is $ \sigma^2(x) = \beta^{-1} + \phi(x)^T S_{post} \phi(x) $.

For a point prediction we only need the mean $ m_{post}^T \phi(x) $; the variance $ \sigma^2(x) $ quantifies how uncertain that prediction is.
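A minimal sketch of the predictive mean and variance for a new input (reusing design_matrix, m_post, S_post, and beta from the sketches above; x_new is just an example point):

x_new = 0.3                                               # hypothetical test input
phi_new = design_matrix(np.array([x_new]))[0]             # feature vector phi(x_new)
pred_mean = m_post @ phi_new                              # m_post^T phi(x)
pred_var = 1.0 / beta + phi_new @ S_post @ phi_new        # beta^{-1} + phi(x)^T S_post phi(x)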

5. Linear and Regularized Regression Implementation in Python

In [1]:
%matplotlib notebook
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
In [2]:
class LinearRegression():

    def __init__(self, is_regularized=False, lambda_param=0.5):
        self.lambda_param = lambda_param
        self.is_regularized = is_regularized
        self.weights = None

    def fit(self, X, y):
        PHI = np.asmatrix(X.values)
        t = np.asmatrix(y.values)
        if self.is_regularized:
            # ridge solution: inv(lambda*I + PHI^T PHI) * PHI^T * t
            expr = (PHI.T * PHI) + self.lambda_param * np.eye(PHI.shape[1])
            weights = np.linalg.inv(expr) * PHI.T * t.T
        else:
            # np.linalg.pinv computes the Moore-Penrose pseudo-inverse, i.e. inv(PHI.T * PHI) * PHI.T
            weights = np.linalg.pinv(PHI) * t.T

        self.weights = weights

    def rmse(self, X, y):
        N, _ = X.shape
       
        PHI = np.asmatrix(X.values)
        t_predict = PHI * self.weights
        t = np.asmatrix(y.values)
        
        sq_error = np.square(t_predict - t.T)
        return np.sqrt(np.sum(sq_error) / N)

    def predict(self, X):
        x = np.asmatrix(X.values)
        return x * self.weights

6. Bayesian Linear Regression Implementation in Python

In [3]:
class BayesianLinearRegression():

    def __init__(self, alpha=1.0, beta=1.0):
        """
        alpha: Prior has precision paramter alpha
        beta : We assume Gaussian Noise with mean=0 and precision = Beta (also called inverse variance)
        """
        self.alpha = alpha
        self.beta = beta
        self.posterior_covar = None
        self.posterior_mean = None

    def fit(self, X, y):
        PHI = np.asmatrix(X.values)
        t = np.asmatrix(y.values)

        prior_covar_inv = self.alpha * np.eye(PHI.shape[1])
        posterior_covar_inv =  prior_covar_inv + self.beta * PHI.T * PHI
        self.posterior_covar = np.linalg.inv(posterior_covar_inv)
        self.posterior_mean = self.beta * self.posterior_covar * PHI.T * t.T

    def rmse(self, X, y):
        N, _ = X.shape
        PHI = np.asmatrix(X.values)
        t_predict = PHI * self.posterior_mean
        t = np.asmatrix(y.values)
        sq_error = np.square(t_predict - t.T)
        return np.sqrt(np.sum(sq_error) / N)

    def predict(self, X):
        x = np.asmatrix(X.values)
        return x * self.posterior_mean
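The class above returns only the predictive mean. If the predictive uncertainty is also wanted, a small helper along the following lines could be added; this is a sketch, not part of the class above, and predict_var is a name chosen here for illustration:

def predict_var(model, X):
    # predictive variance sigma^2(x) = 1/beta + phi(x)^T S_post phi(x), one value per row of X
    PHI = np.asarray(X.values)
    S = np.asarray(model.posterior_covar)
    data_term = np.sum((PHI @ S) * PHI, axis=1)
    return 1.0 / model.beta + data_term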

7. Linear Regression and Bayesian Linear Regression on Housing Data

In [5]:
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/'
    'housing/housing.data',
    header=None,
    sep=r'\s+')

df.columns = [
    'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
    'PTRATIO', 'B', 'LSTAT', 'MEDV'
]
In [6]:
df.head()
Out[6]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
In [7]:
features = [
    'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
    'PTRATIO', 'B', 'LSTAT'
]
label = 'MEDV'
In [9]:
X = df[features]
y = df[label]
# 70% of the data for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
In [14]:
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

linear_train_score = float(linear_reg.rmse(X_train, y_train))
linear_test_score = float(linear_reg.rmse(X_test, y_test))
linear_predictions = linear_reg.predict(X_test)

plt.figure(1, figsize=(9, 6))
# only take few samples for the plot
label_linear = 'Linear regression : test score ({:.2f})'.format(linear_test_score)

samples = min(100, len(y_test))
x = range(samples)
plt.plot(x, y_test.iloc[0:samples], label='Ground Truth')
plt.plot(x, linear_predictions[0:samples], label=label_linear)
plt.xlabel("Index")
plt.ylabel("Values")
plt.legend(loc='upper right', prop={'size': 7})
plt.grid()
plt.show()
In [15]:
linear_regular_reg = LinearRegression(is_regularized=True)
linear_regular_reg.fit(X_train, y_train)

linear_regular_train_score = float(linear_regular_reg.rmse(X_train, y_train))
linear_regular_test_score = float(linear_regular_reg.rmse(X_test, y_test))
linear_regular_predictions = linear_regular_reg.predict(X_test)

# only take few samples for the plot
plt.figure(2, figsize=(8, 6))

label_regular = 'Regularized Linear regression : test score ({:.2f})'.format(linear_regular_test_score)

samples = min(100, len(y_test))
x = range(samples)
plt.plot(x, y_test.iloc[0:samples], label='Ground Truth')
# plt.plot(x, linear_predictions[0:samples], label=label_linear)
plt.plot(x, linear_regular_predictions[0:samples], label=label_regular)
plt.xlabel("Index")
plt.ylabel("Values")
plt.legend(loc='upper right', prop={'size': 7})
plt.grid()
plt.show()
In [16]:
linear_bayes_reg = BayesianLinearRegression()
linear_bayes_reg.fit(X_train, y_train)

linear_bayes_train_score = float(linear_bayes_reg.rmse(X_train, y_train))
linear_bayes_test_score = float(linear_bayes_reg.rmse(X_test, y_test))
linear_bayes_predictions = linear_bayes_reg.predict(X_test)

plt.figure(3, figsize=(9, 6))
# only take few samples for the plot
label_linear_bayes = 'Bayesian Linear regression : test score ({:.2f})'.format(linear_bayes_test_score)

samples = min(100, len(y_test))
x = range(samples)
plt.plot(x, y_test.iloc[0:samples], label='Ground Truth')
plt.plot(x, linear_bayes_predictions[0:samples], label=label_linear_bayes)
plt.xlabel("Index")
plt.ylabel("Values")
plt.legend(loc='upper right', prop={'size': 7})
plt.grid()
plt.show()