#### 1. Setup¶

Given the input dataset $\textbf{D} = \{(x_i, t_i)\}_{i=1}^{N}$, our goal is to learn the parameters that model this data and use those parameters (w) for the prediction new data point.

We often define the features (basis functions) as : $\{ \phi_1(x),......,\phi_m(x) \}$ and the linear regression model is defined as :

$y(x,w) = \sum_{i=1}^m w_i \phi_i (x) + \varepsilon_i$

$\varepsilon_i$ suggesting that we can not perfectly model the data generation process and there will be certain noise in our designed model(paramters). Usually, we assume that the noise is Gaussian. $\varepsilon_i \sim Normal(0, \beta^{-1} )$

#### 2. Objective Functions¶

##### a. Maximum Likelihood objective (MLE) .i.e . ( Mean Squared Error )¶

J(w) = $\frac{1}{2} \sum_i^N \{t_i - w^T \phi(x_i) \}^2$

##### b. Regularized Linear Regression¶

J(w) = $\frac{1}{2} \sum_i^N \{t_i - w^T \phi(x_i) \}^2$ + $\frac{\lambda}{2} w^T w$

#### 3. Closed Form Solutions¶

##### a. For MSE¶

$w_{MLE} = ( \phi^T \phi )^{-1} \phi^T t$

$( \phi^T \phi )^{-1}$ is called Moore-Penrose inverse.

##### b. For Regularized MSE¶

$w_{MLE} = ( \lambda I + \phi^T \phi )^{-1} \phi^T t$

#### 4. Bayesian Learning¶

By using Bayes rule, we have :

$p(w | t) = \frac{p(t|w)p(w)}{p(t)}$

$p(w|t)$ - Posterior distribution $p(t|w)$ - Likelihood of data given parameters $p(w)$ - Prior distribution over paramters (w)

##### a. Pior on w :¶

$p(w) \sim Normal( w | m_o, S_o)$

And, For convenience, we will use $m_o = 0$ and $S_o = \alpha^{-1} I$ i.e. $p(w) \sim Normal( 0, \alpha^{-1} I)$

##### b. Likelihood of data :¶

As we mentioned before, we are assuming Gaussian Noise with precision paramter $\beta$, the likelihood will be also Gaussian.

$p(t|w) = Normal (t|y(x,w), \beta^{-1})$

And, because we have conditional independence of $t_i$ given w, our likelihood becomes:

$p(t_1, t_2,....t_N|w, X, \beta)$ = $\prod_{i=1}^N Normal(t_i | w^T \phi(x_i), \beta^{-1})$

##### c. Posterior on w :¶

Using Bayes, we get:

$p(w |X, t)$ $\propto$ $\prod_{i=1}^N Normal(t_i | w^T \phi(x_i), \beta^{-1}) * Normal( 0, \alpha^{-1} I)$

And, our posterior will be :

$p(w | t)$ = $Normal( w| m_{post}, S_{post})$

Where mean of posterior ($m_{post}$) is given by : $m_{post} = S_{post} (S_o^{-1} m_o + \beta \phi^T t )$

We are assuming $m_o = 0$, so $m_{post} = \beta S_{post} \phi^T t$

And, covariance of the posterior ($S_{post})$ is given by:

$S_{post} = S_o^{-1} + \beta \phi^T \phi$

We have assumed $S_o = \alpha^{-1} * I$, therefore:

$S_{post} = (\alpha^{-1} * I )^{-1} + \beta \phi^T \phi$ = $\alpha * I + \beta \phi^T \phi$

##### d. Predictive Distribution :¶

For a new test input data x, we need to make a prediction for t':

Using Sum rule of Probability, we can write :

$p(t') = \int p(t',w) dw$ = $\int p(t' | w) p(w) dw$

And, applying input data $\textbf{t}$, we get :

$p(t' | \textbf{t} )$ = $\int p(t' | w) p(w | \textbf{t} ) dw$

using both prior and noise precision parameters $\alpha, \beta$, we will have :

$p(t' | \textbf{t}, \alpha, \beta )$ = $\int p(t' | w, \beta) p(w | \textbf{t}, \alpha, \beta ) dw$

On the right side, inside integral, the first term is condition on the target for a given paramter, so it will normal with precision $\beta$ and the second term is our posterior distribution. So, the final distribution will be Gaussian :

$p(t' | \textbf{t}, \alpha, \beta )$ = $Normal(t'| m_{post} \phi(x), \sigma^2(x))$

We only need mean $m_{post} \phi(x)$ for the prediction.

#### 5. Linear and Regularized Regression Implementation in Python¶

In [1]:
%matplotlib notebook
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
class LinearRegression():

def __init__(self, is_regularized=False, lambda_param=0.5):
self.lambda_param = lambda_param
self.is_regularized = is_regularized
self.weights = None

def fit(self, X, y):
PHI = np.asmatrix(X.values)
t = np.asmatrix(y.values)
if self.is_regularized:
ex = PHI.T * PHI
expr = (PHI.T * PHI) + self.lambda_param * np.eye(PHI.shape[1])
weights = np.linalg.inv(expr) * PHI.T * t.T
else:
# calculating normal inverse(X.T * X)*X.T is equivalent to Moore-Penrose inverse
weights = np.linalg.pinv(PHI) * t.T

self.weights = weights

def rmse(self, X, y):
N, _ = X.shape

PHI = np.asmatrix(X.values)
t_predict = PHI * self.weights
t = np.asmatrix(y.values)

sq_error = np.square(t_predict - t.T)
return np.sqrt(sum(sq_error) / N)

def predict(self, X):
x = np.asmatrix(X.values)
return x * self.weights


#### 6. Bayesian Linear Regression Implementation in Python¶

In [3]:
class BayesianLinearRegression():

def __init__(self, alpha=1.0, beta=1.0):
"""
alpha: Prior has precision paramter alpha
beta : We assume Gaussian Noise with mean=0 and precision = Beta (also called inverse variance)
"""
self.alpha = alpha
self.beta = beta
self.posterior_covar = None
self.posterior_mean = None

def fit(self, X, y):
PHI = np.asmatrix(X.values)
t = np.asmatrix(y.values)

prior_covar_inv = self.alpha * np.eye(PHI.shape[1])
posterior_covar_inv =  prior_covar_inv + self.beta * PHI.T * PHI
self.posterior_covar = np.linalg.inv(posterior_covar_inv)
self.posterior_mean = self.beta * self.posterior_covar * PHI.T * t.T

def rmse(self, X, y):
N, _ = X.shape
PHI = np.asmatrix(X.values)
t_predict = PHI * self.posterior_mean
t = np.asmatrix(y.values)
sq_error = np.square(t_predict - t.T)
return np.sqrt(sum(sq_error) / N)

def predict(self, X):
x = np.asmatrix(X.values)
return x * self.posterior_mean


#### 7. Linear Regression and Bayesian Linear Regression on Housing Data¶

In [5]:
df = pd.read_csv(
'https://archive.ics.uci.edu/ml/machine-learning-databases/'
'housing/housing.data',
sep='\s+')

df.columns = [
'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'MEDV'
]

In [6]:
df.head()

Out[6]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
In [7]:
features = [
'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT'
]
label = 'MEDV'

In [9]:
X = df[features]
y = df[label]
# 70% for train and 30% for test also define stratified on target value
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.7,random_state = 0)

In [14]:
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

linear_train_score = float(linear_reg.rmse(X_train, y_train))
linear_test_score = float(linear_reg.rmse(X_test, y_test))
linear_predictions = linear_reg.predict(X_test)

plt.figure(1, figsize=(9, 6))
# only take few samples for the plot
label_linear = 'Linear regression : test score ({:.2f})'.format(linear_test_score)

samples = min(100, len(y_test))
x = range(samples)
plt.plot(x, y_test.iloc[0:samples], label='Ground Truth')
plt.plot(x, linear_predictions[0:samples], label=label_linear)
plt.xlabel("Index")
plt.ylabel("Values")
plt.legend(loc='upper right', prop={'size': 7})
plt.grid()
plt.show()

In [15]:
linear_regular_reg = LinearRegression(is_regularized=True)
linear_regular_reg.fit(X_train, y_train)

linear_regular_train_score = float(linear_regular_reg.rmse(X_train, y_train))
linear_regular_test_score = float(linear_regular_reg.rmse(X_test, y_test))
linear_regular_predictions = linear_regular_reg.predict(X_test)

# only take few samples for the plot
plt.figure(2, figsize=(8, 6))

label_regular = 'Regularized Linear regression : test score ({:.2f})'.format(linear_regular_test_score)

samples = min(100, len(y_test))
x = range(samples)
plt.plot(x, y_test.iloc[0:samples], label='Ground Truth')
# plt.plot(x, linear_predictions[0:samples], label=label_linear)
plt.plot(x, linear_regular_predictions[0:samples], label=label_regular)
plt.xlabel("Index")
plt.ylabel("Values")
plt.legend(loc='upper right', prop={'size': 7})
plt.grid()
plt.show()

In [16]:
linear_bayes_reg = BayesianLinearRegression()
linear_bayes_reg.fit(X_train, y_train)

linear_bayes_train_score = float(linear_bayes_reg.rmse(X_train, y_train))
linear_bayes_test_score = float(linear_bayes_reg.rmse(X_test, y_test))
linear_bayes_predictions = linear_bayes_reg.predict(X_test)

plt.figure(3, figsize=(9, 6))
# only take few samples for the plot
label_linear_bayes = 'Bayesian Linear regression : test score ({:.2f})'.format(linear_test_score)

samples = min(100, len(y_test))
x = range(samples)
plt.plot(x, y_test.iloc[0:samples], label='Ground Truth')
plt.plot(x, linear_bayes_predictions[0:samples], label=label_linear_bayes)
plt.xlabel("Index")
plt.ylabel("Values")
plt.legend(loc='upper right', prop={'size': 7})
plt.grid()
plt.show()

In [ ]: