1. Setup¶
Given the input dataset $ \textbf{D} = \{(x_i, t_i)\}_{i=1}^{N} $, our goal is to learn the parameters $ w $ that model this data and to use those parameters to predict the target for a new data point.
We define the features (basis functions) as $ \{ \phi_1(x), \dots, \phi_m(x) \} $, and the linear regression model is defined as:
$ t_i = y(x_i, w) + \varepsilon_i = \sum_{j=1}^m w_j \phi_j(x_i) + \varepsilon_i $
The noise term $ \varepsilon_i $ reflects that we cannot perfectly model the data-generating process; there will always be some noise that the model (parameters) does not explain. Usually we assume the noise is Gaussian: $ \varepsilon_i \sim Normal(0, \beta^{-1}) $, where $ \beta $ is the noise precision (inverse variance).
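As a quick illustration, here is a minimal sketch (with made-up polynomial basis functions and a made-up noise precision, not values used later in this notebook) of how targets could be generated from this model:
import numpy as np

# hypothetical basis: phi_j(x) = x^j for j = 0..3
def phi(x):
    return np.array([x ** j for j in range(4)])

true_w = np.array([1.0, -2.0, 0.5, 0.1])   # assumed "true" parameters
beta = 25.0                                 # assumed noise precision
x_points = np.linspace(-1, 1, 20)
# t_i = w^T phi(x_i) + eps_i, with eps_i ~ Normal(0, 1/beta)
t_points = np.array([true_w @ phi(x) for x in x_points])
t_points += np.random.normal(0.0, np.sqrt(1.0 / beta), size=x_points.shape)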
2. Objective Functions¶
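The objectives behind the closed-form solutions and the Python implementation below are the standard sum-of-squares error and its $ L_2 $-regularized (ridge) version:
$ E(w) = \frac{1}{2} \sum_{i=1}^N (t_i - w^T \phi(x_i))^2 \qquad E_{reg}(w) = \frac{1}{2} \sum_{i=1}^N (t_i - w^T \phi(x_i))^2 + \frac{\lambda}{2} w^T w $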
3. Closed Form Solutions¶
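Setting the gradient of each objective to zero gives the closed-form solutions used in the implementation below, with $ \Phi $ the design matrix whose rows are $ \phi(x_i)^T $:
$ w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t = \Phi^{\dagger} t \qquad w_{ridge} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T t $
where $ \Phi^{\dagger} $ is the Moore-Penrose pseudo-inverse (computed with np.linalg.pinv in the code).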
4. Bayesian Learning¶
Using Bayes' rule, we have:
$ p(w | t) = \frac{p(t|w)p(w)}{p(t)} $
where:
- $ p(w|t) $ : posterior distribution over the parameters
- $ p(t|w) $ : likelihood of the data given the parameters
- $ p(w) $ : prior distribution over the parameters $ w $
a. Prior on w :¶
$ p(w) = Normal(w | m_0, S_0) $
For convenience, we will use $ m_0 = 0 $ and $ S_0 = \alpha^{-1} I $, i.e. $ p(w) = Normal(w | 0, \alpha^{-1} I) $.
b. Likelihood of data :¶
As mentioned before, we are assuming Gaussian noise with precision parameter $ \beta $, so the likelihood will also be Gaussian.
$ p(t|w) = Normal (t|y(x,w), \beta^{-1}) $
And, because we have conditional independence of $ t_i $ given w, our likelihood becomes:
$ p(t_1, t_2, \dots, t_N | w, X, \beta) = \prod_{i=1}^N Normal(t_i | w^T \phi(x_i), \beta^{-1}) $
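Taking the logarithm of this product makes the connection to the sum-of-squares error explicit:
$ \ln p(t_1, \dots, t_N | w, X, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \frac{\beta}{2} \sum_{i=1}^N (t_i - w^T \phi(x_i))^2 $
so maximizing the likelihood with respect to $ w $ is equivalent to minimizing the sum-of-squares error.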
c. Posterior on w :¶
Using Bayes, we get:
$ p(w | X, t) \propto \prod_{i=1}^N Normal(t_i | w^T \phi(x_i), \beta^{-1}) \times Normal(w | 0, \alpha^{-1} I) $
And, our posterior will be :
$ p(w | t) $ = $ Normal( w| m_{post}, S_{post}) $
Where the posterior mean ($ m_{post} $) is given by: $ m_{post} = S_{post} (S_0^{-1} m_0 + \beta \Phi^T t) $, with $ \Phi $ the $ N \times m $ design matrix whose $ i $-th row is $ \phi(x_i)^T $.
We are assuming $ m_0 = 0 $, so $ m_{post} = \beta S_{post} \Phi^T t $.
And the posterior covariance ($ S_{post} $) is given through its inverse:
$ S_{post}^{-1} = S_0^{-1} + \beta \Phi^T \Phi $
We have assumed $ S_0 = \alpha^{-1} I $, therefore:
$ S_{post}^{-1} = (\alpha^{-1} I)^{-1} + \beta \Phi^T \Phi = \alpha I + \beta \Phi^T \Phi $
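Substituting these assumptions, the posterior mean can be written as
$ m_{post} = \beta (\alpha I + \beta \Phi^T \Phi)^{-1} \Phi^T t = (\tfrac{\alpha}{\beta} I + \Phi^T \Phi)^{-1} \Phi^T t $
which is exactly the regularized (ridge) least-squares solution with $ \lambda = \alpha / \beta $.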
d. Predictive Distribution :¶
For a new test input x, we want to make a prediction for the target t'.
Using the sum rule of probability, we can write:
$ p(t') = \int p(t',w) dw $ = $ \int p(t' | w) p(w) dw $
Conditioning on the observed targets $ \textbf{t} $, we get:
$ p(t' | \textbf{t} ) $ = $ \int p(t' | w) p(w | \textbf{t} ) dw $
Making the prior precision $ \alpha $ and noise precision $ \beta $ explicit, we have:
$ p(t' | \textbf{t}, \alpha, \beta ) $ = $ \int p(t' | w, \beta) p(w | \textbf{t}, \alpha, \beta ) dw $
Inside the integral on the right-hand side, the first term is the distribution of the target given the parameters, which is Gaussian with precision $ \beta $, and the second term is our posterior distribution. Integrating a Gaussian likelihood against a Gaussian posterior again gives a Gaussian, so the predictive distribution is:
$ p(t' | \textbf{t}, \alpha, \beta ) = Normal(t' | m_{post}^T \phi(x), \sigma^2(x)) $
where $ \sigma^2(x) = \beta^{-1} + \phi(x)^T S_{post} \phi(x) $ combines the observation noise and the uncertainty about $ w $. For a point prediction we only need the mean $ m_{post}^T \phi(x) $, which is what the implementation below uses.
5. Linear and Regularized Regression Implementation in Python¶
%matplotlib notebook
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
class LinearRegression():
    def __init__(self, is_regularized=False, lambda_param=0.5):
        self.lambda_param = lambda_param
        self.is_regularized = is_regularized
        self.weights = None

    def fit(self, X, y):
        PHI = np.asmatrix(X.values)
        t = np.asmatrix(y.values)
        if self.is_regularized:
            # regularized (ridge) solution: (PHI^T PHI + lambda I)^-1 PHI^T t
            expr = (PHI.T * PHI) + self.lambda_param * np.eye(PHI.shape[1])
            weights = np.linalg.inv(expr) * PHI.T * t.T
        else:
            # the Moore-Penrose pseudo-inverse pinv(PHI) is equivalent to inv(PHI^T PHI) PHI^T
            weights = np.linalg.pinv(PHI) * t.T
        self.weights = weights

    def rmse(self, X, y):
        N, _ = X.shape
        PHI = np.asmatrix(X.values)
        t_predict = PHI * self.weights
        t = np.asmatrix(y.values)
        sq_error = np.square(t_predict - t.T)
        return np.sqrt(np.sum(sq_error) / N)

    def predict(self, X):
        x = np.asmatrix(X.values)
        return x * self.weights
6. Bayesian Linear Regression Implementation in Python¶
class BayesianLinearRegression():
    def __init__(self, alpha=1.0, beta=1.0):
        """
        alpha: precision parameter of the Gaussian prior over the weights
        beta : precision (inverse variance) of the zero-mean Gaussian noise
        """
        self.alpha = alpha
        self.beta = beta
        self.posterior_covar = None
        self.posterior_mean = None

    def fit(self, X, y):
        PHI = np.asmatrix(X.values)
        t = np.asmatrix(y.values)
        # S_post^-1 = alpha * I + beta * PHI^T PHI
        prior_covar_inv = self.alpha * np.eye(PHI.shape[1])
        posterior_covar_inv = prior_covar_inv + self.beta * PHI.T * PHI
        self.posterior_covar = np.linalg.inv(posterior_covar_inv)
        # m_post = beta * S_post * PHI^T t
        self.posterior_mean = self.beta * self.posterior_covar * PHI.T * t.T

    def rmse(self, X, y):
        N, _ = X.shape
        PHI = np.asmatrix(X.values)
        t_predict = PHI * self.posterior_mean
        t = np.asmatrix(y.values)
        sq_error = np.square(t_predict - t.T)
        return np.sqrt(np.sum(sq_error) / N)

    def predict(self, X):
        # predictive mean: PHI * m_post
        x = np.asmatrix(X.values)
        return x * self.posterior_mean
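The class above uses only the posterior mean for predictions. As a minimal sketch (the helper function below is an illustration, not part of the original class), the per-point predictive variance $ \beta^{-1} + \phi(x)^T S_{post} \phi(x) $ from the predictive-distribution section above could be computed from a fitted model as:
def predictive_variance(model, X):
    """Predictive variance for each row of X: 1/beta + phi(x)^T S_post phi(x).

    `model` is assumed to be a fitted BayesianLinearRegression instance.
    """
    PHI = np.asmatrix(X.values)
    variances = []
    for i in range(PHI.shape[0]):
        phi_x = PHI[i, :].T                      # column vector phi(x_i)
        var = 1.0 / model.beta + float(phi_x.T * model.posterior_covar * phi_x)
        variances.append(var)
    return np.array(variances)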
7. Linear Regression and Bayesian Linear Regression on Housing Data¶
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/'
    'housing/housing.data',
    header=None,
    sep=r'\s+')
df.columns = [
'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'MEDV'
]
df.head()
features = [
'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT'
]
label = 'MEDV'
X = df[features]
y = df[label]
# 70% of the data for training and 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
linear_train_score = float(linear_reg.rmse(X_train, y_train))
linear_test_score = float(linear_reg.rmse(X_test, y_test))
linear_predictions = linear_reg.predict(X_test)
plt.figure(1, figsize=(9, 6))
# only take a few samples for the plot
label_linear = 'Linear regression : test score ({:.2f})'.format(linear_test_score)
samples = min(100, len(y_test))
x = range(samples)
plt.plot(x, y_test.iloc[0:samples], label='Ground Truth')
plt.plot(x, linear_predictions[0:samples], label=label_linear)
plt.xlabel("Index")
plt.ylabel("Values")
plt.legend(loc='upper right', prop={'size': 7})
plt.grid()
plt.show()
linear_regular_reg = LinearRegression(is_regularized=True)
linear_regular_reg.fit(X_train, y_train)
linear_regular_train_score = float(linear_regular_reg.rmse(X_train, y_train))
linear_regular_test_score = float(linear_regular_reg.rmse(X_test, y_test))
linear_regular_predictions = linear_regular_reg.predict(X_test)
# only take a few samples for the plot
plt.figure(2, figsize=(8, 6))
label_regular = 'Regularized Linear regression : test score ({:.2f})'.format(linear_regular_test_score)
samples = min(100, len(y_test))
x = range(samples)
plt.plot(x, y_test.iloc[0:samples], label='Ground Truth')
# plt.plot(x, linear_predictions[0:samples], label=label_linear)
plt.plot(x, linear_regular_predictions[0:samples], label=label_regular)
plt.xlabel("Index")
plt.ylabel("Values")
plt.legend(loc='upper right', prop={'size': 7})
plt.grid()
plt.show()
linear_bayes_reg = BayesianLinearRegression()
linear_bayes_reg.fit(X_train, y_train)
linear_bayes_train_score = float(linear_bayes_reg.rmse(X_train, y_train))
linear_bayes_test_score = float(linear_bayes_reg.rmse(X_test, y_test))
linear_bayes_predictions = linear_bayes_reg.predict(X_test)
plt.figure(3, figsize=(9, 6))
# only take a few samples for the plot
label_linear_bayes = 'Bayesian Linear regression : test score ({:.2f})'.format(linear_bayes_test_score)
samples = min(100, len(y_test))
x = range(samples)
plt.plot(x, y_test.iloc[0:samples], label='Ground Truth')
plt.plot(x, linear_bayes_predictions[0:samples], label=label_linear_bayes)
plt.xlabel("Index")
plt.ylabel("Values")
plt.legend(loc='upper right', prop={'size': 7})
plt.grid()
plt.show()
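To compare the three models side by side, the train and test RMSE values computed above can be printed together:
print('Linear regression          - train RMSE: {:.2f}, test RMSE: {:.2f}'.format(
    linear_train_score, linear_test_score))
print('Regularized linear reg.    - train RMSE: {:.2f}, test RMSE: {:.2f}'.format(
    linear_regular_train_score, linear_regular_test_score))
print('Bayesian linear regression - train RMSE: {:.2f}, test RMSE: {:.2f}'.format(
    linear_bayes_train_score, linear_bayes_test_score))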