In the era of big data, massive datasets are increasingly common in many disciplines and are often difficult to interpret. In this article, we’ll discuss principal component analysis (PCA), which is widely used as a dimensionality reduction technique, and look at different types of PCA.

PCA is a classical tool commonly used to explore and visualize high-dimensional datasets. It is also used as a technique to alleviate the curse of dimensionality.

Before we start discussing different types of PCA, let’s first understand what PCA is.

**What is PCA?**

PCA is an unsupervised dimensionality reduction technique which is widely used in machine learning.

It reduces the dimension of the data by projecting it onto a lower-dimensional subspace.

However, while reducing the dimension of the data, we need to preserve as much information about the original data as possible.

PCA preserves most of the variance of the original data by transforming to a new set of variables called principal components, which are linear combinations of the variables in the original data.

These principal components are uncorrelated and are ordered in such a way that the first few principal components retain most of the variance of the original data.

Let’s say I have a dataset with two features x1 and x2.

Now, if I want to reduce this two-dimensional data into one-dimensional data, then I have to find the direction in which the data is most spread out.

This direction of maximum spread forms the first principal component.

This first principal component explains most of the variance since it captures most of the information about the data.

The second principal component is orthogonal to the first principal component and attempts to capture the maximum variance from what remains after the first principal component.

We can find the principal components mathematically by solving the eigenvalue/eigenvector problem.
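As a rough illustration of that eigenvalue route, here is a minimal NumPy sketch on a synthetic two-feature dataset (the data itself is made up for illustration, not the MNIST example used below):

```python
import numpy as np

# Synthetic 2-D dataset with two correlated features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]

# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition: eigenvectors give the principal directions
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort by eigenvalue, largest first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Project onto the first principal component
X_1d = X_centered @ eigvecs[:, :1]
```

The eigenvector with the largest eigenvalue is the direction of maximum spread, and the projection `X_1d` is the one-dimensional representation described above; the variance of `X_1d` equals the largest eigenvalue.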

We’ll use scikit-learn to apply PCA on the MNIST dataset. If you want to learn how to implement PCA from scratch you can read the article PCA from scratch.

Let’s start by importing the required packages.

```python
import seaborn as sns
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
```

Now we can load the data and standardize it using StandardScaler.

```python
mnist = fetch_openml('mnist_784')
X = mnist.data
y = mnist.target

# Standardize the data
sc = StandardScaler()
X = sc.fit_transform(X)
```

Let’s create a function to display a scatterplot of the first and second principal components.

```python
import pandas as pd
import matplotlib.pyplot as plt

def scatter_plot(X_trans, y):
    # Put the first two components into a DataFrame for seaborn
    X_p = pd.DataFrame(data=X_trans, columns=['PC1', 'PC2'])
    X_p['Label'] = y
    sns.lmplot(x="PC1", y="PC2", hue="Label", data=X_p, fit_reg=False)
    plt.show()
```

Now we can apply PCA on the dataset and plot the first two principal components.

```python
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
scatter_plot(X_pca, y)
```
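To quantify how much information the two components retain, scikit-learn exposes `explained_variance_ratio_`. A small self-contained sketch on synthetic stand-in data (the array `X_demo` and its shape are illustrative assumptions, not values from this walkthrough):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data; in the article this would be the scaled MNIST X
rng = np.random.RandomState(0)
X_demo = rng.randn(300, 10)

pca_demo = PCA(n_components=2).fit(X_demo)

# Fraction of the total variance captured by each component, largest first
print(pca_demo.explained_variance_ratio_)
```

A low ratio signals that two components may be too few to summarize the data for the task at hand.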

**SPARSE PCA:**

One of the key shortcomings of PCA is that in most cases the principal components are dense, i.e., most of the loadings are non-zero.

Each principal component is a linear combination of all the original variables, making the model difficult to interpret.

In applications such as gene analytics, however, each axis might correspond to a specific gene.

In such cases, if most of the entries in the loadings are zeros, we can easily interpret the model and understand the physical meaning of the loading as well as the principal components.

Sparse PCA is a variant of PCA which attempts to produce easily interpretable models through sparse loadings.

In Sparse PCA each principal component is a linear combination of a subset of the original variables.

```python
from sklearn.decomposition import SparsePCA

spca = SparsePCA(n_components=2, alpha=0.0001)
X_spca = spca.fit_transform(X)
scatter_plot(X_spca, y)
```
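To see the sparsity directly, you can compare the fraction of exactly-zero loadings in `components_` between ordinary PCA and Sparse PCA. A sketch on synthetic data (the data, sizes, and the `alpha` value are illustrative assumptions, not values from the article):

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.RandomState(0)
X_demo = rng.randn(100, 10)

dense = PCA(n_components=2).fit(X_demo)
sparse = SparsePCA(n_components=2, alpha=0.5, random_state=0).fit(X_demo)

# Fraction of exactly-zero loadings in each model's components
dense_zeros = np.mean(dense.components_ == 0)
sparse_zeros = np.mean(sparse.components_ == 0)
print(f"zero loadings: PCA={dense_zeros:.2f}, SparsePCA={sparse_zeros:.2f}")
```

Larger `alpha` drives more loadings to zero, trading explained variance for interpretability.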

**RANDOMIZED PCA:**

Classical PCA uses a low-rank matrix approximation to estimate the principal components. However, this method becomes costly for large datasets and makes the whole process difficult to scale.

By randomizing how the singular value decomposition of the dataset is computed, we can approximate the first K principal components more quickly than with classical PCA.

```python
from sklearn.decomposition import PCA

rpca = PCA(n_components=2, svd_solver='randomized')
X_rpca = rpca.fit_transform(X)
scatter_plot(X_rpca, y)
```
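On data with genuine low-rank structure, the randomized solver recovers essentially the same result as the exact one. A hedged sketch comparing the two solvers (the synthetic rank-5 matrix is an assumption for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 500 x 100 matrix of rank 5
rng = np.random.RandomState(0)
X_demo = rng.randn(500, 5) @ rng.randn(5, 100)

full = PCA(n_components=5, svd_solver='full').fit(X_demo)
rand = PCA(n_components=5, svd_solver='randomized', random_state=0).fit(X_demo)

# The variance explained by the top components should closely agree
print(full.explained_variance_ratio_)
print(rand.explained_variance_ratio_)
```

The speedup matters most when the number of requested components is much smaller than the data dimensions, as with two components on the 784-dimensional MNIST data.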

**INCREMENTAL PCA:**

The methods discussed above require the whole training dataset to fit in memory.

Incremental PCA (IPCA) can be used when the dataset is too large to fit in memory.

Here we split the dataset into mini-batches, where each batch fits in memory, and feed the mini-batches to the IPCA algorithm one at a time.

```python
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=2, batch_size=200)
X_ipca = ipca.fit_transform(X)
scatter_plot(X_ipca, y)
```
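When the data truly cannot be loaded at once, you can also call `partial_fit` on each mini-batch yourself. A sketch with in-memory batches standing in for chunks streamed from disk (the batch sizes and dimensions are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Five batches of 200 samples stand in for chunks read from disk
rng = np.random.RandomState(0)
batches = [rng.randn(200, 50) for _ in range(5)]

ipca = IncrementalPCA(n_components=2)
for batch in batches:
    ipca.partial_fit(batch)  # update the components with each mini-batch

# Transform can likewise be applied batch by batch
X_low = np.vstack([ipca.transform(batch) for batch in batches])
print(X_low.shape)  # (1000, 2)
```

This keeps peak memory proportional to one batch rather than the whole dataset.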

**KERNEL PCA:**

PCA is a linear method. It works great for linearly separable datasets. However, if the dataset has non-linear relationships, then it produces undesirable results.

Kernel PCA is a technique which uses the so-called kernel trick and projects the linearly inseparable data into a higher dimension where it is linearly separable.

There are various kernels that are popularly used; some of them are linear, polynomial, RBF, and sigmoid.

Let’s create a dataset using sklearn’s make_circles which is not linearly separable.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=500, factor=.1, noise=0.02, random_state=47)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
```

Let’s first apply PCA on this dataset and see how it performs.

```python
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.title("PCA")
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()
```

As you can see from the preceding diagram, PCA fails to do the required job: the principal components are unable to distinguish between the two classes.

Now let’s apply Kernel PCA on this dataset, using the radial basis function (RBF) kernel with a gamma (kernel coefficient) value of 1.

```python
kpca = KernelPCA(kernel='rbf', gamma=1)
X_kpca = kpca.fit_transform(X)

plt.title("Kernel PCA")
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=y)
plt.show()
```

You can see in the above diagram that the points are linearly separable in the kernel space.
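You can check that claim numerically: a plain linear classifier fitted on the kernel components should separate the two circles almost perfectly. A quick sanity-check sketch (the logistic-regression step is an addition for verification, not part of the original walkthrough):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression

X_c, y_c = make_circles(n_samples=500, factor=.1, noise=0.02, random_state=47)

kpca = KernelPCA(n_components=2, kernel='rbf', gamma=1)
X_k = kpca.fit_transform(X_c)

# A linear model succeeds only if the classes are linearly separable here
clf = LogisticRegression().fit(X_k, y_c)
print(f"training accuracy: {clf.score(X_k, y_c):.2f}")
```

Fitting the same classifier on the raw coordinates, by contrast, should do little better than chance.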

**CONCLUSION:**

In this tutorial, we discussed different types of PCA.

The problem with classical PCA is that it produces principal components which are dense. Sparse PCA overcomes this shortcoming by introducing some degree of sparsity.

To handle non-linear datasets, we discussed kernel PCA, which uses kernel methods to project the linearly inseparable data into a higher dimension where it is linearly separable.

For large datasets which don’t fit in memory, we discussed IPCA, where we feed the data in mini-batches that fit in memory.