1. Basic Setup¶
Principal Component Analysis (PCA) is used to reduce the dimensionality of data while retaining as much information as possible. The general idea of PCA works as follows:
a. Find the principal components from your original data
b. Project your original data into the space spanned by the principal components from (a)
Let's use $ \textbf{X} $ as our data matrix and $ \Sigma $ as the covariance matrix of $ \textbf{X} $. We will get eigenvectors ($ \mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_k $) and eigenvalues ($ \lambda_1, \lambda_2, \dots, \lambda_k $) from the covariance matrix $ \Sigma $, such that:
$ \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_k $
NOTE: The elements of $ \mathbf{v}_i $ are the coefficients (loadings) of the $ i $-th principal component.
Our goal is to maximize the variance of the projection along each principal component. This can be written as:
$ \mathrm{var}(y_i) = \mathrm{var}(v_{i1} X_1 + v_{i2} X_2 + \dots + v_{ik} X_k) $
You can see that we are projecting the original data into the new vector space spanned by the PCs.
NOTE: $ \mathrm{var}(y_i) = \lambda_i $, and the principal components are uncorrelated, i.e. $ \mathrm{cov}(y_i, y_j) = 0 $ for $ i \neq j $.
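As a quick numerical sanity check (a minimal sketch on synthetic data, not part of the wine analysis below), we can verify that the variance of each projected component equals the corresponding eigenvalue and that the components are uncorrelated:
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))                 # synthetic data: 500 samples, 3 features
Xc = X_demo - X_demo.mean(axis=0)                  # center each column
lam, V = np.linalg.eigh(np.cov(Xc, rowvar=False))  # eigenvalues (ascending) and eigenvectors
lam, V = lam[::-1], V[:, ::-1]                     # reorder so lambda_1 >= lambda_2 >= ...
Y = Xc @ V                                         # project the data onto the eigenvectors
print(np.allclose(np.var(Y, axis=0, ddof=1), lam))          # var(y_i) == lambda_i
print(np.allclose(np.cov(Y, rowvar=False), np.diag(lam)))   # cov(y_i, y_j) == 0 for i != j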
2. Principal Component Analysis Algorithm (Pseudocode)¶
a. $ \textbf{X} \gets $ data (design) matrix with dimensions $ N \times k $
b. $ \textbf{X} \gets $ subtract the mean from each column of $ \textbf{X} $
c. $ \Sigma \gets $ compute the covariance matrix of $ \textbf{X} $
d. Calculate the eigenvectors and eigenvalues of $ \Sigma $
e. Principal Components (PCs) $ \gets $ the first M eigenvectors with the largest eigenvalues
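A minimal NumPy sketch of these five steps, wrapped in a hypothetical `pca` helper (the rest of this notebook performs the same steps explicitly on the wine data, using the correlation matrix instead of the covariance matrix):
import numpy as np

def pca(X, M):
    """Return the first M principal components and the data projected onto them."""
    Xc = X - X.mean(axis=0)                      # (b) subtract the mean of each column
    cov = np.cov(Xc, rowvar=False)               # (c) covariance matrix of X
    eig_vals, eig_vecs = np.linalg.eigh(cov)     # (d) eigenvalues/eigenvectors (symmetric matrix)
    order = np.argsort(eig_vals)[::-1]           # sort eigenvalues in decreasing order
    components = eig_vecs[:, order[:M]]          # (e) the first M eigenvectors
    return components, Xc @ components           # project the centered data onto the PCs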
3. Basic Data Analysis¶
%matplotlib inline
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv(
'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None, sep=',')
df.columns = ['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY','MAGNESIUM', 'PHENOLS',
'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS', 'COLOR_INTENSITY',
'HUE', 'OD280/OD315_DILUTED','PROLINE']
df.head()
features = ['ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY','MAGNESIUM', 'PHENOLS',
'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS', 'COLOR_INTENSITY',
'HUE', 'OD280/OD315_DILUTED','PROLINE']
label = 'CLASS'
X = df[features]
y = df[label]
df.columns[1:]
sns.set(style='white')
sns.pairplot(df, vars=X.columns, hue='CLASS', height=3)
plt.tight_layout()
plt.show()
We can observe only one clear linear relationship, between 'PHENOLS' and 'FLAVANOIDS'. Even with just 13 variables, it is difficult to spot patterns in the data from pairwise plots alone, so we reduce the dimensionality of the data to observe the patterns better.
4. Correlation Matrix¶
PCA is not invariant to scaling, so the solution obtained from the covariance matrix will differ from the one obtained from the correlation matrix. Correlation measures the linear relationship between two variables after they have been normalized, and can be written as:
$ \mathrm{Corr}(X_1, X_2) = \frac{\mathrm{Cov}(X_1, X_2)}{\sqrt{\mathrm{Var}(X_1)\,\mathrm{Var}(X_2)}} $
We can estimate the correlation matrix as $ \hat{R} = \frac{1}{N} Z^T Z $, where $ Z $ is the normalized data, i.e. each column is standardized as $ z_j = \frac{X_j - \mu_j}{\sigma_j} $.
# standard scaler for normalization
N, _ = df.shape
scaler = StandardScaler()
Z = scaler.fit_transform(X)
# Correlation estimation
R = np.dot(Z.T, Z) / N
df.shape[0], N
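As a quick sanity check, the estimate $ \hat{R} $ should agree with NumPy's built-in correlation (the $ 1/N $ vs. $ 1/(N-1) $ factor cancels out in the correlation):
# R should match NumPy's correlation matrix to numerical precision
print(np.allclose(R, np.corrcoef(Z, rowvar=False)))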
5. Eigendecomposition of Correlation Matrix¶
eigen_values, eigen_vectors = np.linalg.eig(R)
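Since $ \hat{R} $ is symmetric, `np.linalg.eigh` is an alternative worth noting: it guarantees real eigenvalues and returns them in ascending order, which would remove the need for the `np.abs` calls below. The rest of the notebook sticks with the `np.linalg.eig` result above:
# alternative for symmetric matrices (not used below): real eigenvalues in ascending order
eigen_values_h, eigen_vectors_h = np.linalg.eigh(R)
eigen_values_h, eigen_vectors_h = eigen_values_h[::-1], eigen_vectors_h[:, ::-1]  # descending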
6. Principal Components Selection¶
We use the variance explained ratio ( $ \frac{\lambda_i}{\sum_{j=1}^{k} \lambda_j} $ ) to select the components that capture most of the variance of the original data.
total_var = np.sum(np.abs(eigen_values))
var_explained = [(i / total_var) for i in sorted(np.abs(eigen_values), reverse=True)]
cum_var_explained = np.cumsum(var_explained)
plt.bar(range(1, eigen_values.size + 1), var_explained)
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.show()
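The cumulative ratios computed above (but not plotted) can be printed to make the next observation concrete:
# cumulative explained variance for the leading components
for i, cum in enumerate(cum_var_explained[:3], start=1):
    print(f'First {i} PC(s): {cum:.1%} of total variance')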
We can see that the first two principal components explain almost 55% of the variance in the data. Let's select these two components for visualization.
7. Projection Matrix¶
We select the 2 PCs with the largest eigenvalues and build our projection matrix.
value_idx = eigen_values.argsort()[::-1]
eigen_vectors_sorted = eigen_vectors[:, value_idx]
# select the first two eigenvector columns (np.newaxis keeps them as column vectors)
M = np.hstack((eigen_vectors_sorted[:, 0][:, np.newaxis],
               eigen_vectors_sorted[:, 1][:, np.newaxis]))
M.shape
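Since the eigenvector columns are already sorted, the same projection matrix can also be built by slicing; this is just a stylistic alternative:
# equivalent, more concise construction of the projection matrix
M_alt = eigen_vectors_sorted[:, :2]
print(np.allclose(M, M_alt))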
8. Project Original Data onto the PC Space¶
projected_data = Z @ M   # (N, 13) @ (13, 2) -> (N, 2)
projected_data.shape
colors = ['r', 'b', 'g']
for label, color in zip(np.unique(y.values), colors):
    idx = df[df['CLASS'] == label].index.values.astype(int)
    x_axis_vals = projected_data[idx, 0]
    y_axis_vals = projected_data[idx, 1]
    plt.scatter(x_axis_vals, y_axis_vals, c=color, label=label)
plt.title('Projection')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
We can see that class 1 is clustered on the left side, class 2 (blue) in the middle, and class 3 (green) on the right side of the projection plot. A simple linear classifier might already give moderate results on this projection; for more accurate results, we could use polynomial/kernel-based classification methods.
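To put a rough number on that intuition, here is a small sketch (an addition using scikit-learn's `SVC` and `cross_val_score`; the exact scores depend on the kernel parameters) comparing a linear and an RBF-kernel SVM on the 2-component projection:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 5-fold cross-validated accuracy of a linear vs. an RBF-kernel SVM on the 2 PCs
for kernel in ('linear', 'rbf'):
    scores = cross_val_score(SVC(kernel=kernel), np.asarray(projected_data), y, cv=5)
    print(kernel, round(scores.mean(), 3))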
9. Projection Matrix with 3 PCs¶
# select the first three eigenvector columns (np.newaxis keeps them as column vectors)
M2 = np.hstack((eigen_vectors_sorted[:, 0][:, np.newaxis],
                eigen_vectors_sorted[:, 1][:, np.newaxis],
                eigen_vectors_sorted[:, 2][:, np.newaxis]))
M2.shape
projected_data2 = Z @ M2   # (N, 13) @ (13, 3) -> (N, 3)
projected_data2.shape
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection='3d')
colors = ['r', 'b', 'g']
for label, color in zip(np.unique(y.values), colors):
    idx = df[df['CLASS'] == label].index.values.astype(int)
    x_axis_vals = projected_data2[idx, 0]
    y_axis_vals = projected_data2[idx, 1]
    z_axis_vals = projected_data2[idx, 2]
    ax.scatter(x_axis_vals, y_axis_vals, z_axis_vals, c=color, label=label, s=30)
plt.title('Projection')
ax.set_xlabel('PC 1')
ax.set_ylabel('PC 2')
ax.set_zlabel('PC 3')
ax.legend(loc='upper left')
plt.show()
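As a final cross-check (an addition, using `sklearn.decomposition.PCA`), the same projection can be recovered with scikit-learn by fitting it on the standardized data `Z`; individual components may differ by a sign flip:
from sklearn.decomposition import PCA

# sklearn's PCA on Z should reproduce the explained-variance ratios and,
# up to a possible sign flip per component, the projected coordinates
pca_sk = PCA(n_components=3)
projected_sk = pca_sk.fit_transform(Z)
print(pca_sk.explained_variance_ratio_)
print(var_explained[:3])
print(np.allclose(np.abs(projected_sk), np.abs(projected_data2)))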