Principal Component Analysis (PCA) for Visualization using Python

1. Basic Setup

Principal Component Analysis (PCA) reduces the dimensionality of data while retaining as much of the information (variance) in the data as possible. The general idea is as follows:

 a. Find the principal components of your original data
 b. Project your original data onto the space spanned by the principal components from (a)


Let $ \textbf{X} $ be our data matrix and $ \Sigma $ the covariance matrix of $ \textbf{X} $. From the covariance matrix $ \Sigma $ we obtain eigenvectors ($ \mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_k $) and eigenvalues ($ \lambda_1, \lambda_2, \dots, \lambda_k $), ordered such that:

$ \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_k $

NOTE: The elements of each eigenvector $ \mathbf{v}_i $ are the coefficients of the $i$-th principal component.

Our goal is to maximize the variance of the projection along each principal component. For the $i$-th component, this variance can be written as:

$ \mathrm{var}(y_i) = \mathrm{var}(v_{i1} X_1 + v_{i2} X_2 + \dots + v_{ik} X_k) $

Here $ y_i $ is the projection of the original data onto the $i$-th principal component, so together the $ y_i $ express the original data in the new vector space spanned by the PCs.

NOTE: $ \mathrm{var}(y_i) = \lambda_i $, and the principal components are uncorrelated, i.e. $ \mathrm{cov}(y_i, y_j) = 0 $ for $ i \neq j $.
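
Both properties in the note above can be checked numerically. The following is a minimal sketch on synthetic data. The random data, the mixing matrix and all variable names are assumptions made for illustration and are not part of the wine analysis. It relies on NumPy's `np.cov` and `np.linalg.eigh`.

```python
import numpy as np

# Synthetic, correlated data (an assumption for illustration; not the wine data).
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.3, 0.2]])
X_demo = rng.normal(size=(500, 3)) @ mixing

# Center the columns and compute the covariance matrix (features are in columns).
X_centered = X_demo - X_demo.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending
# order, so reorder to get lambda_1 >= lambda_2 >= ... >= lambda_k.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the data: column i of Y holds y_i = v_i1*X_1 + ... + v_ik*X_k.
Y = X_centered @ eigvecs
cov_Y = np.cov(Y, rowvar=False)

# Diagonal of cov_Y: var(y_i), which should match lambda_i.
print(np.allclose(np.diag(cov_Y), eigvals))
# Whole matrix: off-diagonal covariances should vanish up to rounding error.
print(np.allclose(cov_Y, np.diag(eigvals)))
```

The sign of each eigenvector is arbitrary, so a projection may come out mirrored relative to another implementation. The variances and covariances are unaffected by this.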

2. Principal Component Analysis Algorithm (Pseudocode)

a. $ \textbf{X} \gets $ design (data) matrix with dimensions $ N \times k $

b. $ \textbf{X} \gets $ subtract the column mean from each column vector of $ \textbf{X} $

c. $ \Sigma \gets $ covariance matrix of $ \textbf{X} $

d. Compute the eigenvectors and eigenvalues of $ \Sigma $

e. Principal Components (PCs) $ \gets $ the $ M $ eigenvectors with the largest eigenvalues ($ M \leq k $)
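
Steps (a) to (e) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reference implementation: the function name `pca_sketch` and the parameter `n_components` (playing the role of M) are hypothetical, and the covariance is computed with `np.cov`, which divides by N - 1.

```python
import numpy as np

def pca_sketch(X, n_components):
    """Follow steps (a)-(e); return (components, eigenvalues, projected data)."""
    X = np.asarray(X, dtype=float)                      # (a) N x k data matrix
    X_centered = X - X.mean(axis=0)                     # (b) subtract column means
    cov = np.cov(X_centered, rowvar=False)              # (c) k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)              # (d) eigen-decomposition
    order = np.argsort(eigvals)[::-1][:n_components]    # (e) indices of the M largest
    components = eigvecs[:, order]                      # k x M, one PC per column
    return components, eigvals[order], X_centered @ components

# Usage on random data (an assumption for illustration only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 5))
components, eigvals, projected = pca_sketch(X_demo, n_components=2)
print(components.shape, projected.shape)  # (5, 2) (100, 2)
```

`np.linalg.eigh` is used instead of `np.linalg.eig` because the covariance matrix is symmetric. It returns real eigenvalues in ascending order, which is why the indices are reversed before the first M are kept.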

3. Basic Data Analysis

In [159]:
%matplotlib inline

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd 

from sklearn.preprocessing import StandardScaler
In [160]:
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None, sep=',')

df.columns = ['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY','MAGNESIUM', 'PHENOLS', 
              'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS', 'COLOR_INTENSITY', 
              'HUE', 'OD280/OD315_DILUTED','PROLINE']
df.head()
Out[160]:
   CLASS  ALCOHOL_LEVEL  MALIC_ACID   ASH  ALCALINITY  MAGNESIUM  PHENOLS  FLAVANOIDS  NON_FLAVANOID_PHENOL  PROANTHOCYANINS  COLOR_INTENSITY   HUE  OD280/OD315_DILUTED  PROLINE
0      1          14.23        1.71  2.43        15.6        127     2.80        3.06                  0.28             2.29             5.64  1.04               3.92     1065
1      1          13.20        1.78  2.14        11.2        100     2.65        2.76                  0.26             1.28             4.38  1.05               3.40     1050
2      1          13.16        2.36  2.67        18.6        101     2.80        3.24                  0.30             2.81             5.68  1.03               3.17     1185
3      1          14.37        1.95  2.50        16.8        113     3.85        3.49                  0.24             2.18             7.80  0.86               3.45     1480
4      1          13.24        2.59  2.87        21.0        118     2.80        2.69                  0.39             1.82             4.32  1.04               2.93      735
In [161]:
features = ['ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY', 'MAGNESIUM', 'PHENOLS',
            'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS', 'COLOR_INTENSITY',
            'HUE', 'OD280/OD315_DILUTED', 'PROLINE']
label = 'CLASS'

X = df[features]
y = df[label]
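
The features above are measured on very different scales (compare HUE with PROLINE in the table output), and a covariance-based PCA is sensitive to scale. The sketch below illustrates this on three columns of the five rows printed by `df.head()`. It uses plain NumPy so that it stands alone; the array name `sample` is hypothetical. The `StandardScaler` imported earlier applies the same per-column transformation (zero mean, unit variance, population standard deviation).

```python
import numpy as np

# ALCOHOL_LEVEL, HUE and PROLINE for the first five rows shown by df.head().
sample = np.array([
    [14.23, 1.04, 1065.0],
    [13.20, 1.05, 1050.0],
    [13.16, 1.03, 1185.0],
    [14.37, 0.86, 1480.0],
    [13.24, 1.04,  735.0],
])

# Raw variances differ by several orders of magnitude, so PROLINE would
# dominate the covariance matrix if the data were left unscaled.
print(sample.var(axis=0))

# Standardize each column: subtract its mean and divide by its standard
# deviation (population std, ddof=0).
standardized = (sample - sample.mean(axis=0)) / sample.std(axis=0)
print(standardized.mean(axis=0).round(6), standardized.std(axis=0).round(6))
```

After standardization every column contributes on an equal footing, which is usually what is wanted when the features have unrelated units.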
In [162]:
df.columns[1:]
Out[162]:
Index(['ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY', 'MAGNESIUM',
       'PHENOLS', 'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS',
       'COLOR_INTENSITY', 'HUE', 'OD280/OD315_DILUTED', 'PROLINE'],
      dtype='object')
In [163]:
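# Use a plain white background for all plots
sns.set(style='white')

# Scatter-plot matrix of every pair of features, colored by wine class
sns.pairplot(df, vars=X.columns, hue='CLASS', height=3)
plt.tight_layout()
plt.show()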