## How do I know Principal Component Analysis (PCA) is preserving information from my data ?

Principal Component Analysis is used for dimension reduction of high-dimensional data. We also refer PCA as feature extraction technique, where new features take the linear combinations from original features.

More on PCA : principal-component-analysis-pca-for-visualization-using-python

In this post, we will investigate 'how reliable the information is as preserved by PCA'. In order to do that, we will use labelled data and evaluate the trained model to see the final performance. (In side note, PCA is unsupervised learning algorithm. But, in our case, we are using in supervised fashion to assess the model performance)

To make it more interesting, we will also see Random Forests as 'feature selection' and compare the result with PCA.

#### 1. PCA as Feature Extraction¶

```
%matplotlib inline
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
```

```
df = pd.read_csv(
'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None, sep=',')
df.columns = ['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY','MAGNESIUM', 'PHENOLS',
'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS', 'COLOR_INTENSITY',
'HUE', 'OD280/OD315_DILUTED','PROLINE']
df.head()
```

```
features = ['ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY','MAGNESIUM', 'PHENOLS',
'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS', 'COLOR_INTENSITY',
'HUE', 'OD280/OD315_DILUTED','PROLINE']
label = 'CLASS'
X = df[features]
y = df[label]
# train test split with 70% for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
```

##### a. Preparing Projected data using PCA¶

```
# prepare correlation matrix
# standar scaler for normalization
N, _ = df.shape
scaler = StandardScaler()
Z = scaler.fit_transform(X)
# Correlation estimation
R = np.dot(Z.T, Z) / N
# eigendecomposition
eigen_values, eigen_vectors = np.linalg.eig(R)
# prepare projection matrix
value_idx = eigen_values.argsort()[::-1]
eigen_vectors_sorted = eigen_vectors[:, value_idx]
# Projection matrix with 3 PCs ( 3 PCs cover 65% variance in the data)
# more on : https://sijanb.com.np/posts/principal-component-analysis-pca-for-visualization-using-python/
M = np.hstack((eigen_vectors_sorted[0][:, np.newaxis],
eigen_vectors_sorted[1][:, np.newaxis],
eigen_vectors_sorted[2][:, np.newaxis]))
# projected data
projected_data = np.asmatrix(Z) * np.asmatrix(M)
```

##### b. Using Projected data for the training and prediction using Decision Tree¶

```
# train test split for training
Xpc_train, Xpc_test, ypc_train, ypc_test = train_test_split( projected_data, y, test_size=0.3, random_state=0)
tree_pca = DecisionTreeClassifier(max_depth=6, random_state=0)
tree_pca.fit(Xpc_train, ypc_train)
ypca_pred = tree_pca.predict(Xpc_test)
print('Test accuracy using Decision tree on PCA projected data: %.2f' % accuracy_score(ypc_test, ypca_pred))
```