Share on facebook
Share on twitter
Share on linkedin
Share on pinterest

Introduction to K-means

K-means clustering is one of the simplest unsupervised learning algorithms that solve the well known clustering problem.

Before we venture into K-means, let’s first understand what clustering is?

What is clustering?

The idea behind clustering is straightforward. Clustering algorithms group together data points based on the similarities between them.

The data points are clustered in such a way that the data points belong to the same cluster are more similar to each other, and data points belong to a different cluster are very different from each other.

They are unsupervised, and they do not require any class label.

There are various clustering algorithms. In this post, we discuss the most popular clustering algorithm K-means.

K-means:

K-means is one of the common techniques for clustering where we iteratively assign points to different clusters.

Here each data point is assigned to only one cluster, which is also known as hard clustering.

The k in the title is a hyperparameter specifying the exact number of clusters. It should be defined beforehand.

The primary objective of the algorithm is to minimize the intra-cluster distance. There’ll be a centroid for each cluster and initially, these centroids are selected randomly.

K-means assigns each data point to a centroid that it is closest to. The metric which is used to measure the closeness is Euclidean distance.

If you want to learn more about distance measures I’ve written an article discussing various distance measures used in machine learning with implementation in python. You can read it here.

Since the primary objective of the algorithm is to minimize the intra-cluster distance, it groups data points into a cluster where the distance from the point to the centroid of the cluster is minimum.

The standard and most commonly used algorithm for K-means in Lloyd’s algorithm.

Let’s see the actual steps of the algorithm:

  1. The first step is to choose k points randomly from the dataset as the centroids of the clusters.
  2. Once we choose the centroids the next step is to assign data points to centroids closest to it.
  3. Then recompute the centroid so that it is closest to all the data points allocated to that cluster.
  4. Repeat step 2 and 3 until the algorithm converges. These two steps are repeated until the cluster centroids no longer change.

CHOOSING THE VALUE OF K:

Choosing a proper value of the hyperparameter k is essential as it can enhance the performance of the model as well as degrades it if wrongly selected.

You can choose k to be that number if you know how many clusters you are looking for.

Otherwise, you need to experiment with different values of k.

I have written an article explaining various supervised and unsupervised methods to determine the right value of k. You can read the article here.

Now, we’ll discuss two popular methods the elbow method and the silhouette coefficient to determine the ideal value for k.

ELBOW METHOD:

In this method we’ll try to minimize the within-sum-of-squares(WSS). The WSS measures the sum of squared difference from the cluster center.

The following diagram illustrates the idea behind the elbow method

elbow method

We’ll plot WSS versus the number of clusters.

Then we select the value of k after which the WSS score doesn’t decrease significantly.

In the diagram, we choose the value of k where we identify the elbow-like inflection. Hence, the name elbow method.

SILHOUETTE SCORE:

It measures how similar observation is to the assigned cluster and how dissimilar to the observation of nearby cluster.

The silhouette score range from -1 to 1. The better it is if the score is near to 1.

Let’s implement K-means using sklearn

We’ll use sklearn’s make_blobs to generate a sample dataset

Let’s visualize our dataset

K-means

Next we can use silhouette score and elbow method to find the optimal number of k

First we use Silhouette score to find the optimal k

As you can see from the results, k value of 4 has the highest score

Let’s plot the score against k

silhouette score

Now, let’s use the elbow method to find the k value

K-means

You can see an elbow forming at k=4. That is the optimal k value.

We used both the elbow method and the silhouette score to find the optimal k value. As a result, we find out that the optimal value of k is 4.

Let’s implement the K-means algorithm with k=4

Now let’s visualize the clusters produced by the K-means algorithm

The below diagram shows the resulting clusters.

K-means

As you can see, the algorithm recognized the four distinct clusters.

The black stars are the centroids of the clusters.

SUMMARY:

Cluster analysis is used in nearly every sector where a wide variety of transactions occurs.

It can help identify the natural grouping of customers, products etcetera.

One such example is market segmentation where customers are categorized based on their similarities.

In this post, we discussed one such clustering technique k-means.

We discussed Lloyd’s algorithm, which is used to implement k-means.

We also discussed why it is crucial to pick a proper k value as it can impact the performance of the model and discussed two popular methods the silhouette score and the elbow method to pick a k value.

Finally, we implement the K-means clustering algorithm using scikit-learn.

Love What you Read. Subscribe to our Newsletter.

Stay up to date! We’ll send the content straight to your inbox, once a week. We promise not to spam you.

Leave a Reply

Your email address will not be published. Required fields are marked *

Subscribe Now! We'll keep you updated.