**INTRODUCTION TO t-SNE:**

t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear dimensionality reduction technique used to visualize high-dimensional data in two or three dimensions.

Unlike PCA, which preserves only the global structure of the data, t-SNE preserves both the local and the global structure.

It uses the local relationships between points to map the high-dimensional data into two dimensions.

Since t-SNE is an extension of Stochastic Neighbor Embedding (SNE), let's first try to understand what SNE is.

**STOCHASTIC NEIGHBOR EMBEDDING:**

Stochastic Neighbor Embedding is a probabilistic approach to visualizing high-dimensional data. It converts the high-dimensional Euclidean distances between points into conditional probabilities that represent similarities.

SNE models the similarities in both the high- and low-dimensional spaces with Gaussian distributions centered on each data point.

The conditional probability p_{j|i} in the high dimension that a data point x_{i} would pick x_{j} as its neighbor is given by

p_{j|i} = exp(−||x_i − x_j||² / 2σ_i²) / Σ_{k≠i} exp(−||x_i − x_k||² / 2σ_i²)

The conditional probability q_{j|i} for the low-dimensional counterparts y_{i} and y_{j} is given by

q_{j|i} = exp(−||y_i − y_j||²) / Σ_{k≠i} exp(−||y_i − y_k||²)

where the variance of the low-dimensional Gaussian is fixed to 1/√2.
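To make these two definitions concrete, here is a minimal NumPy sketch; `conditional_probabilities` is a hypothetical helper of my own, not a library function, and q_{j|i} follows from the same code with σ_i fixed to 1/√2.

```python
import numpy as np

def conditional_probabilities(dist_sq_to_i, i, sigma_i):
    """Compute p_{j|i} from the squared Euclidean distances ||x_i - x_j||^2.

    dist_sq_to_i: 1-D array of squared distances from x_i to every point,
    including itself. Illustrative helper, not a library function.
    """
    # Gaussian affinity of every point to x_i.
    affinities = np.exp(-dist_sq_to_i / (2.0 * sigma_i ** 2))
    affinities[i] = 0.0  # a point never picks itself as a neighbor
    # Normalize so that the probabilities over all j sum to 1.
    return affinities / affinities.sum()
```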

The variance σ_{i} is tied to a user-specified parameter called perplexity, which can be interpreted as a smooth measure of the effective number of neighbors. SNE performs a binary search for the value of σ_{i} that produces a distribution P_{i} with the perplexity fixed by the user.

Perplexity is defined as

Perp(P_{i}) = 2^{H(P_{i})}

where H(P_{i}) is the Shannon entropy of P_{i} measured in bits:

H(P_{i}) = −Σ_{j} p_{j|i} log₂ p_{j|i}
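Because perplexity increases monotonically with σ_{i}, plain bisection on a wide bracket is enough to find it. Here is a minimal, self-contained sketch (`find_sigma` and the bracket values are my own, for illustration only):

```python
import numpy as np

def find_sigma(dist_sq_to_i, i, target_perplexity, n_steps=50):
    """Bisection search for the sigma_i whose P_i has the target perplexity."""
    lo, hi = 1e-10, 1e4  # assumed bracket, wide enough for typical data
    sigma = 1.0
    for _ in range(n_steps):
        sigma = 0.5 * (lo + hi)
        # p_{j|i} for the current sigma (same formula as above).
        aff = np.exp(-dist_sq_to_i / (2.0 * sigma ** 2))
        aff[i] = 0.0
        p = aff / aff.sum()
        p = p[p > 0]                                   # avoid log2(0)
        perplexity = 2.0 ** (-np.sum(p * np.log2(p)))  # Perp = 2^{H(P_i)}
        if perplexity > target_perplexity:
            hi = sigma  # too many effective neighbors: shrink sigma
        else:
            lo = sigma  # too few effective neighbors: grow sigma
    return sigma
```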

Since we only care about pairwise similarities, the probability that a point picks itself as a neighbor is set to zero; that is, p_{i|i} = 0 and q_{i|i} = 0.

Once the probabilities are constructed in both the high and low dimensions, the algorithm aims to reduce the differences between p_{j|i} and q_{j|i}, so that the low-dimensional mapping preserves the similarity structure of the high-dimensional data.

This is achieved by minimizing a cost function which is a sum of Kullback-Leibler divergences between the conditional distributions. The cost function C is given by

C = Σ_{i} KL(P_{i} || Q_{i}) = Σ_{i} Σ_{j} p_{j|i} log(p_{j|i} / q_{j|i})

The cost function is minimized using gradient descent.
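A sketch of this cost in NumPy (`sne_cost` is my own illustrative helper, not a library function):

```python
import numpy as np

def sne_cost(P, Q, eps=1e-12):
    """Sum of KL divergences KL(P_i || Q_i) over all points i.

    P and Q are (n, n) arrays whose row i holds p_{j|i} and q_{j|i};
    eps guards against taking log(0).
    """
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))
```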

Since the KL divergence is not symmetric, different kinds of mapping error are penalized unequally, which makes the cost emphasize preserving local structure.

It preserves the local structure by penalizing heavily when points that are close in the high dimension (large p_{j|i}) are mapped far apart in the low dimension (small q_{j|i}), while the penalty for mapping widely separated points close together is small.
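A quick numeric illustration with made-up probabilities (an individual term can be negative, but the KL divergence summed over a whole distribution is always non-negative; what matters here is the magnitude):

```python
import numpy as np

# Close in high dimension, mapped far apart: large p, small q -> heavy penalty.
print(0.8 * np.log(0.8 / 0.01))   # ≈ 3.5
# Far apart in high dimension, mapped close: small p, large q -> tiny contribution.
print(0.01 * np.log(0.01 / 0.8))  # ≈ -0.04
```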

**DRAWBACKS OF SNE:**

There are two significant drawbacks to Stochastic Neighbor Embedding.

1. The cost function used is difficult to optimize.

2. The crowding problem, where moderately distant data points and genuinely nearby points get clumped together, because there is not enough room in the 2-dimensional space to represent all the moderate distances faithfully.

**t-SNE:**

As the cost function used in SNE is difficult to optimize, t-SNE uses a symmetrized cost function built from joint probabilities rather than conditionals, i.e., p_{ij} = p_{ji} and q_{ij} = q_{ji}.

The cost function is given by

C = KL(P || Q) = Σ_{i} Σ_{j} p_{ij} log(p_{ij} / q_{ij})

with p_{ii} = q_{ii} = 0

Here, the pairwise similarities in the high-dimensional space are defined by symmetrizing the conditionals:

p_{ij} = (p_{j|i} + p_{i|j}) / 2n

where n is the number of data points; this guarantees that every point, even an outlier, makes a significant contribution to the cost.
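In code, this symmetrization is essentially one line; a minimal sketch (`joint_probabilities` is my own name for the helper):

```python
import numpy as np

def joint_probabilities(P_conditional):
    """Symmetrize an (n, n) matrix whose row i holds p_{j|i} into p_{ij}."""
    n = P_conditional.shape[0]
    return (P_conditional + P_conditional.T) / (2.0 * n)
```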

For the low-dimensional mapping, instead of a Gaussian, t-SNE uses a heavy-tailed Student's t-distribution with one degree of freedom (equivalently, a Cauchy distribution).

This heavy-tailed distribution helps alleviate the crowding problem: a moderate distance in the high-dimensional space can be modeled by a much larger distance in the map, so moderately dissimilar points are no longer crushed together.

Using this distribution, the low-dimensional similarities are given by

q_{ij} = (1 + ||y_i − y_j||²)^{−1} / Σ_{k≠l} (1 + ||y_k − y_l||²)^{−1}
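A minimal NumPy sketch of these Student-t similarities (`low_dim_affinities` is my own illustrative helper):

```python
import numpy as np

def low_dim_affinities(Y):
    """Compute q_{ij} from an (n, d) array Y of low-dimensional map points."""
    sum_sq = np.sum(Y ** 2, axis=1)
    # Squared pairwise Euclidean distances via the Gram matrix.
    dist_sq = sum_sq[:, None] + sum_sq[None, :] - 2.0 * (Y @ Y.T)
    # Student-t kernel with one degree of freedom.
    inv = 1.0 / (1.0 + dist_sq)
    np.fill_diagonal(inv, 0.0)  # q_{ii} = 0
    return inv / inv.sum()
```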

Let's implement t-SNE using scikit-learn:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.manifold import TSNE

# Import the digits dataset
digits = datasets.load_digits()
X = digits.data
y = digits.target

# Project the 64-dimensional data down to 2 dimensions
tsne = TSNE(n_components=2, random_state=0)
tsne_data = tsne.fit_transform(X)

# Create a new dataframe for plotting
tsne_data = np.vstack((tsne_data.T, y)).T
tsne_df = pd.DataFrame(data=tsne_data, columns=("Dim_1", "Dim_2", "label"))
tsne_df["label"] = tsne_df["label"].astype(int)  # integer labels for a clean legend

# Plot the result
sns.FacetGrid(tsne_df, hue="label", height=6).map(plt.scatter, "Dim_1", "Dim_2").add_legend()
plt.show()
```

As you can see in the resulting plot, t-SNE constructs an excellent visualization: the class clusters are well separated, with few overlaps.
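One practical note: scikit-learn's TSNE exposes the perplexity discussed above as a parameter (default 30.0), and the layout can change noticeably with it, so it is worth re-running the plot with a few different values:

```python
# Perplexity is the main knob to tune; try a few values and compare the plots.
tsne = TSNE(n_components=2, perplexity=40, random_state=0)
tsne_data = tsne.fit_transform(X)
```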

**SUMMARY:**

In this post, I have discussed t-SNE, a powerful dimensionality reduction technique with a wide range of applications.

In the beginning, we discussed Stochastic Neighbor Embedding (SNE), the predecessor of t-SNE.

We also discussed the drawbacks of SNE and how t-SNE overcomes them. Finally, we implemented t-SNE using scikit-learn.

I hope this post has given you a better understanding of how t-SNE works. For those who want to delve deeper into this topic, I suggest reading the original paper and this blog post.