Principal Component Analysis is an unsupervised dimensionality reduction technique, which is extensively used in machine learning.

It helps us alleviate the curse of dimensionality by reducing the number of dimensions in the data.

PCA reduces the dimension of the data by projecting it onto a lower-dimensional subspace.

It takes a d-dimensional data matrix and reduces it to a k-dimensional data matrix, where k <= d.

Let’s say I have a dataset with two features x1 and x2.

Now, if I want to reduce this two-dimensional data into one-dimensional data, then I have to find the direction in which the data is most spread out.

The direction along which the spread is maximum forms the first principal component.

This first principal component explains most of the variance since it captures most of the information about the data.

The second principal component is orthogonal to the first principal component and tries to capture the remaining variance in the data.

The number of principal components equals the number of dimensions in the data.

Now, how can we calculate the direction of the maximum variance mathematically?

To calculate the direction of the maximum variance, we use eigenvalues and eigenvectors.

**EIGENVALUES AND EIGENVECTORS:**

Each eigenvalue has a corresponding eigenvector. The eigenvector gives the direction, and the proportion of the variance captured in that direction is given by the corresponding eigenvalue.

The number of eigenvalues equals the dimension of the data. If we have an n-dimensional data matrix, then we get n eigenvalues.

The eigenvector with the highest eigenvalue becomes our first principal component, and so on.

While reducing the dimension of the data, we discard the eigenvectors that have near-zero eigenvalues, since the variance they explain is very low.
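As a small illustration (using hypothetical correlated 2-D data, not the article's dataset), the eigendecomposition of a covariance matrix shows one dominant direction:

```python
import numpy as np
from numpy.linalg import eig

# Hypothetical 2-D dataset: x2 is strongly correlated with x1,
# so most of the variance lies along a single direction.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.3, size=200)
data = np.column_stack((x1, x2))

# Covariance matrix of the zero-centered data.
cov = np.cov((data - data.mean(axis=0)).T)

eig_vals, eig_vecs = eig(cov)
order = eig_vals.argsort()[::-1]

# The largest eigenvalue dominates: its eigenvector (a column of
# eig_vecs) is the direction of maximum spread, i.e. the first
# principal component. The second eigenvalue is near zero.
print(eig_vals[order] / eig_vals.sum())
```

Here the first ratio comes out well above 0.9, so the second eigenvector could be dropped with little loss of information.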

**Steps to calculate Principal Component Analysis:**

- Subtract mean from all the observations, so our data is zero centered.
- Calculate the covariance matrix.
- Next, we calculate the eigenvalues and eigenvectors of the covariance matrix.
- Now use the top k eigenvectors (sorted by descending order of eigenvalues) to construct a projection matrix.
- Multiply the original data with the projection matrix to get the new k-dimensional subspace.

Now that we’ve gone over what goes on in PCA, let’s implement Principal Component Analysis in Python from scratch using NumPy.

First, we import the iris dataset.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
X = pd.DataFrame(iris['data'], columns=iris['feature_names'])
```

The first step is to subtract the mean from all the observations so that our data will be zero centered.

```python
import numpy as np

M = np.mean(X.T, axis=1)
X_scaled = X - M
```

Next, we have to construct the covariance matrix and calculate its eigenvalues and eigenvectors.

```python
from numpy.linalg import eig

S = np.cov(X_scaled.T)
eig_vals, eig_vecs = eig(S)
```

Now we have to sort the eigenpairs in descending order of eigenvalues so that the top eigenvalues will explain the maximum variance.

```python
sort_index = eig_vals.argsort()[::-1]
values_sorted = eig_vals[sort_index]
vectors_sorted = eig_vecs[:, sort_index]
```

Let’s print the variance explained by each of the eigenvalues.

```python
explained_variance_ratio = values_sorted / values_sorted.sum()
print(explained_variance_ratio)
```

```
OUTPUT:
[0.92461621 0.05301557 0.01718514 0.00518309]
```

These ratios indicate that the first two principal components explain more than 95% of the variance in the data.

Since the first two principal components account for most of the variance in our data, we will use the first two principal components to construct the projection matrix (p).

```python
# Eigenvectors are the columns of vectors_sorted, so we stack the
# first two columns (not rows) to build the projection matrix.
p = np.hstack((vectors_sorted[:, 0][:, np.newaxis],
               vectors_sorted[:, 1][:, np.newaxis]))
```

To transform the original data onto the lower-dimensional subspace, we need to multiply the original data with the projection matrix.

```python
X_pca = X_scaled.dot(p)
print("Original size of the data:", X_scaled.shape)
print("After applying PCA:", X_pca.shape)
```

```
OUTPUT:
Original size of the data: (150, 4)
After applying PCA: (150, 2)
```

As you can see, the size of the data is reduced from (150,4) to (150,2) while preserving most of the information about the data.
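As a sanity check (a sketch, assuming scikit-learn is installed), the same reduction can be reproduced with `sklearn.decomposition.PCA`. Its components may differ from our projection only by sign, since an eigenvector's direction is defined up to a factor of ±1:

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = iris['data']

# PCA centers the data internally, so no manual mean subtraction needed.
pca = PCA(n_components=2)
X_sk = pca.fit_transform(X)

print(X_sk.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # matches our first two ratios
```

The explained variance ratios printed here should agree with the [0.9246..., 0.0530...] values computed from scratch above.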

**SUMMARY:**

Dimensionality reduction techniques allow us to make data easier to use. In this post, I have discussed one kind of dimensionality reduction technique called Principal Component Analysis (PCA).

PCA is an unsupervised dimensionality reduction method.

It is vital to make sure that we don’t lose too much information while reducing the dimension of the data.

PCA preserves most of the information in the data by choosing the directions of maximum variance as its principal components.