In the era of big data, massive datasets(not only in the number of samples being collected but also in the number of features) are increasingly prevalent in many disciplines and are often difficult to interpret.
In this article, we’ll discuss the curse of dimensionality a situation that arises in high dimensional spaces which leads to certain anomalies that do not occur in lower dimensions.
Before venturing into the curse of dimensionality, let’s first understand what a dimension is?
A machine-learning algorithm with d features operates in a d-dimensional space.
For instance, let’s say we have to predict whether a person is male or female.
If we have features like hair length, height, and weight for each person, each feature corresponds to a dimension and we are working in a 3-dimensional space.
CURSE OF DIMENSIONALITY:
It was R.Bellman who first coined the term “curse of dimensionality” in his book Adaptive Control Process a Guided Tour.
Curse of dimensionality indicates that with each additional dimension, the number of samples needed grows exponentially, in order to achieve high accuracy.
Increase in dimensionality without increasing the number of samples will increase the sparsity in the training instances, which makes it difficult for any learning method to produce reliable results.
For example, take a look at the figure below.
If you see the above figure from left to right, you can see that we increase the dimension by 1 for each drawing. 25 random points have been generated in the range [0,1] along three dimensions.
The first plot on the left shows the data projected onto one dimension. By observing, we can see that the points are close together.
In the second plot, we increased the dimension by one and projected the points onto two dimensions.
Now the points are spread across the two dimensions, and the gaps can be seen, creating sparsity in the training instances.
Finally, in the last plot by adding another dimension, we can see that the points are further spread apart and become sparse creating a lot of empty cells and it requires a large number of samples for any learning method, to achieve a desired level of performance.
Cure for the Curse:
One solution to the curse of dimensionality is to add more training samples.
However, adding more training samples is not always possible as the number of points required to fill the space increases exponentially with each dimension.
Another solution is to reduce the dimension of the data.
One of the most popular techniques to reduce the dimension of the data is to use Principal Component Analysis.
To learn more about PCA, you can read the article Introduction to Principal Component Analysis.
Now let’s do some coding. We’ll create a synthetic dataset with 1500 dimensions and try to reduce the dimensions with PCA.
We’ll first apply the logistic regression with 1500 dimensions and print the accuracy and the time taken for the calculation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from time import time
X, y = make_classification(n_samples=3000, n_features=1500, n_classes=2, n_informative=2, random_state=43)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=43)
start = time()
reg = LogisticRegression()
y_pred = reg.predict(X_test)
print(time() - start)
Now we’ll apply PCA on the dataset and reduce it to 100 dimensions
pca = PCA(n_components=100).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
start = time()
y_pred1 = reg.predict(X_test_pca)
print(f'Time:',time() - start)
As you can see from the results, by reducing the dimensions to 100 we increase the calculation time and also get an accuracy which is close to what we got with all the dimensions of the data set.
In this article, we discussed the curse of dimensionality.
Curse of dimensionality indicates that with each additional dimension, the number of samples needed grows exponentially, to achieve high accuracy.
It creates sparsity in the data making it difficult for any learning method to produce reliable results.
One way to overcome this curse is to add more training samples. However, practically, it is not possible as the number of points required increases exponentially with each dimension.
Another solution we discussed is to use dimensionality reduction techniques like PCA to reduce the dimension of the data.
Finally, we created a synthetic dataset with 1500 dimensions and apply PCA on it.