A major challenge in the K-means algorithm is choosing the optimal value of k; however, selecting the right value of k is quite tricky and is also crucial as it can impact the performance of the model.

However, If you don’t know how many clusters you have in advance, then how do you select the ideal value of k?

There are several methods to determine the optimal k in K-means.

We’ll discuss various supervised and unsupervised methods to determine the right value for k.

In this post, we concentrate on the k-means algorithm; however, most of the methods can be applied to other clustering algorithms also.

**SUPERVISED METHODS:**

These methods can be used when we have external information about the data, i.e., if we know the actual ground truth.

**HOMOGENEITY:**

It estimates how many of the clusters predicted contain only members of a single class.

Homogeneity is bounded between 0 and 1, higher the score better it is.

It is determined by,

h=1-\frac{H(C | K)}{H(C)}

H(C|K) is the conditional entropy, which measures the uncertainty in determining the right class after having the clustered dataset.

Where C is the number of classes, and K is the number of clusters.

We’ll normalize the value with the entropy of the class H(C) to have a homogeneity score.

1 2 3 4 |
from sklearn.metrics import homogeneity_score y_true = [0, 0, 1, 1] y_pred = [1, 1, 0, 0] print(homogeneity_score(y_true, y_pred)) |

1 2 |
OUTPUT: 1.0 |

**COMPLETENESS:**

Completeness measures whether all members of a given class are assigned to the same cluster.

It is bounded between 0 and 1.

Completeness is symmetrical to homogeneity.

Completeness is determined using the conditional entropy H(C|K), which measures the uncertainty in determining the right cluster given the information of the class.

c=1-\frac{H(K | C)}{H(K)}

The value is normalized with the entropy of the clusters H(K).

1 2 3 4 |
from sklearn.metrics import completeness_score y_true = [0, 0, 1, 1] y_pred = [1, 1, 0, 0] print(completeness_score(y_true, y_pred)) |

1 2 |
OUTPUT: 1.0 |

**VALIDITY MEASURE(V-MEASURE):**

V-measure is the harmonic mean of homogeneity and completeness.

v=2 \cdot \frac{h \cdot c}{h+c}

By using this value, we can monitor both the homogeneity and completeness.

1 2 3 4 |
from sklearn.metrics import v_measure_score y_true = [0, 0, 1, 1] y_pred = [1, 1, 0, 0] print(v_measure_score(y_true, y_pred)) |

1 2 |
OUTPUT: 1.0 |

Now let’s see how we can use these three metrics to find the optimal number of k.

First, We’ll create a dataset with 3 clusters using the sklearn’s make_blobs.

1 2 3 4 5 6 7 8 9 |
import numpy as np import matplotlib.pyplot as plt from sklearn import metrics from sklearn.cluster import KMeans from sklearn.datasets.samples_generator import make_blobs X, y = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=43) plt.scatter(X[:, 0], X[:, 1], s=50) plt.show() |

Now, we can compute the homogeneity, completeness, and v-measure using sklearn for different values of k.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
k = [2, 3, 4, 5, 6, 7, 8] homo_score = [] comp_score = [] vm_score = [] for n_cluster in k: y_pred = KMeans(n_clusters = n_cluster, max_iter=1000, random_state=47).fit_predict(X) homo = metrics.homogeneity_score(y, y_pred) comp = metrics.completeness_score(y, y_pred) vm = metrics.v_measure_score(y, y_pred) homo_score.append(homo) comp_score.append(comp) vm_score.append(vm) |

Let’s plot the scores against different values of k.

1 2 3 4 5 6 7 |
plt.plot(k, homo_score, 'r', label='Homogeneity') plt.plot(k, comp_score, 'b', label='Completeness') plt.plot(k, vm_score, 'y', label='V- Measure') plt.xlabel('Value of K') plt.ylabel('homogeneity_completeness_v_measure') plt.legend(loc=4) plt.show() |

As you can see in the preceding plot all of the three metrics peaks at k=3. That is the ideal value of k.

sklearn also provides a way to calculate all of these three scores at once using the method homogeneity_completeness_v_measure(y_true, y_pred)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
k = [2, 3, 4, 5, 6, 7, 8] scores = [] for n_cluster in k: y_pred = KMeans(n_clusters = n_cluster, max_iter=1000, random_state=47).fit_predict(X) score = metrics.homogeneity_completeness_v_measure(y, y_pred) scores.append(score) # plotting the scores against the value of k plt.plot(k, [s[0] for s in scores], 'r', label='Homogeneity') plt.plot(k, [s[1] for s in scores], 'b', label='Completeness') plt.plot(k, [s[2] for s in scores], 'y', label='V-Measure') plt.xlabel('Value of K') plt.ylabel('homogeneity_completeness_v_measure') plt.legend(loc=4) plt.show() |

**RAND INDEX:**

The Rand index is a simple metric based on the agreements and disagreements between pairs of data points in the clusterings.

It measures how similar the induced clusters are to the actual ground truth.

It is defined as,

R=\frac{a+b}{\left(\begin{array}{c}{M} \\ {2}\end{array}\right)}

a represents the number of pair of observations belonging to the same cluster across two different clusterings

b represents the number of pair of observations belonging to different cluster across two different clusterings

m is the number of observations.

Rand index ranges from 0 to 1. Where 1 indicates that the clusterings are exactly the same, and 0 indicates that the clusterings are maximally dissimilar.

However, the complaint with the Rand index is that the expected value of the Rand index between two random partitions is not a constant(say zero), i.e., even if we cluster points randomly, we will get a value greater than 0.

**ADJUSTED RAND INDEX:**

To counter the problems faced by Rand Index, Hubert and Arabie suggest an Adjusted Rand index.

ARI has zero expected value in the case of random partition.

It is defined as,

A R I=\frac{\text { Index }-\text { Expected Index }}{\text { Maximum Index -Expected Index }}

This ARI is bounded between -1 and 1. Where 1 indicates that the clusters are identical.

1 2 3 4 5 6 7 8 9 10 11 |
k = [2, 3, 4, 5, 6, 7, 8] scores = [] for n_cluster in k: y_pred = KMeans(n_clusters = n_cluster, max_iter=1000, random_state=47).fit_predict(X) score = metrics.adjusted_rand_score(y, y_pred) scores.append(score) plt.plot(k, scores, 'o-') plt.title('Adjusted Rand Index') plt.show() |

Visit this link to learn more about Rand index and this link to learn more about the Adjusted Rand Index.

We can use all the above-discussed metrics only if we have prior information about the data or if we have the actual ground truth.

In clustering, this is rarely the case, as clustering in most of the time is applied as unsupervised.

**UNSUPERVISED METHODS:**

These methods are based on the intrinsic parameters of the clustering algorithm and don’t need any external information about the data.

**ELBOW METHOD:**

The first method we are going to see in this section is the elbow method.

The elbow method plots the value of inertia produced by different values of k.

The value of inertia will decline as k increases.

The idea here is to choose the value of k after which the inertia doesn’t decrease significantly anymore.

1 2 3 4 5 6 7 8 9 10 11 |
k = [2, 3, 4, 5, 6, 7, 8] inertias = [] for i in k: km = KMeans(n_clusters=i, max_iter=1000, random_state=47) km.fit(X) inertias.append(km.inertia_) plt.plot(k, inertias, 'o-') plt.xlabel("Value of k") plt.ylabel("WSS") plt.title('Elbow Method') plt.show() |

If you eyeball the preceding diagram, you can see an elbow like inflection at k=3.

That is the value of k we have to choose.

**SILHOUETTE COEFFICIENT:**

Another unsupervised metric we are going to discuss is the silhouette coefficient.

The silhouette coefficient combines the idea of cluster cohesion and cluster separation.

Let’s take a step back and understand what cohesion and separation are

**COHESION:** It measures how similar observation is to the assigned cluster.

It is measured by the within-cluster sum of squares(WSS).

WSS=\sum \sum_{x \in C_{i}}\left(x-c_{i}\right)^{2}

**SEPARATION:** It measures how dissimilar an observation is to the observation of nearby cluster. In simple words, it measures how well a cluster is separated from other clusters.

It is measured by between cluster sum of squares(BSS).

BSS=\sum_{i}\left|C_{i}\right|\left(c-c_{i}\right)^{2}

C_{i} is the number of observations in each group.

The silhouette coefficient is determined by,

S C=\frac{b-a}{\max (a, b)}

a: mean distance to all other points in its cluster

b: mean distance to all other points in the next nearest cluster

The silhouette score ranges from -1 to 1. The better it is if the score is near to 1.

1 2 3 4 5 6 7 8 9 10 11 |
k = [2, 3, 4, 5, 6, 7, 8] score=[] for n_cluster in k: kmeans = KMeans(n_clusters=n_cluster).fit(X) score.append(metrics.silhouette_score(X,kmeans.labels_)) plt.plot(k, score, 'o-') plt.xlabel("Value for k") plt.ylabel("Silhouette score") plt.title('Silhouette Method') plt.show() |

**DAVIES-BOULDIN INDEX(DB-INDEX):**

The Davies-Bouldin index is based on the ratio between compactness to separation.

The DB-index is calculated using the formula,

D B=\frac{1}{n} \sum_{i=1}^{n} \max _{j \neq i}\left(\frac{\sigma_{i}+\sigma_{j}}{d\left(c_{i}, c_{j}\right)}\right)

Where n is the number of clusters, c_{i} is the centroid of cluster i, σ_{i} is the average distance of all observations in cluster i, and d(c_{i},c_{j}) is the distance between clusters i and j.

It is bounded between 0 to 1, lower is better.

1 2 3 4 5 6 7 8 9 10 11 12 |
k = [2, 3, 4, 5, 6, 7, 8] scores = [] for i in k: y_pred = KMeans(n_clusters = i, max_iter=1000, random_state = 43).fit_predict(X) score = metrics.davies_bouldin_score(X, y_pred) scores.append(score) print(score) plt.plot(k, scores, 'o-') plt.title('DAVIES-BOULDIN') plt.show() |

The optimum number of cluster corresponds to the minimum value of DB-index. In our case, it is 3.

**CALINSKI-HARABAZ INDEX(CH-INDEX):**

The CH-index is another metric which can be used to find the best value of k using with-cluster-sum-of-squares(WSS) and between-cluster-sum-of-squares(BSS).

WSS measures the compactness of the cluster, while the BSS is a measure of separation.

CH-index is defined as,

\mathrm{CH}(K)=\frac{B(K) /(K-1)}{W(K) /(n-K)}

Where B is between-sum-of-squares, W is within-sum-of-square, n is the number of data points and k is the number of clusters.

A higher score indicates that the clusters are well separated.

1 2 3 4 5 6 7 8 9 10 11 12 |
k = [2, 3, 4, 5, 6, 7, 8] scores = [] for i in k: y_pred = KMeans(n_clusters = i, max_iter=1000, random_state = 43).fit_predict(X) score = metrics.calinski_harabaz_score(X, y_pred) scores.append(score) print(score) plt.plot(k, scores, 'o-') plt.title('CALINSKI-HARABASZ') plt.show() |

**GAP STATISTIC:**

The idea behind the gap statistic is to compare the WSS value for different values of k on our actual dataset versus a null reference distribution of the data.

The reference distribution is usually a random uniform distribution.

To find the ideal number of clusters, we choose a value of k that gives us the maximum value of the Gap statistic.

The gap statistic for a given k is defined as follows,

\operatorname{Gap}(k)=E\left(\log \left(W_{k}\right)\right)-\log \left(W_{k}\right)

Where E\left(\log \left(W_{k}\right)\right) indicates the expected value of \log \left(W_{k}\right) under the null distribution.

Unfortunately, sklearn doesn’t have an implementation of gap statistic. We need to install an additional package gap_stat.

You may either use this link or this to install the package.

1 2 3 4 5 6 7 8 9 10 11 12 |
from gap_statistic import OptimalK optimalK = OptimalK(parallel_backend='None') n_clusters = optimalK(X, cluster_array=np.arange(1, 10)) plt.plot(optimalK.gap_df.n_clusters, optimalK.gap_df.gap_value, linewidth=3) plt.scatter(optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].n_clusters, optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].gap_value) plt.xlabel('Cluster Count') plt.ylabel('Gap Value') plt.title('Gap Values by Cluster Count') plt.show() |

**SUMMARY:**

In this post, we have discussed various methods to find the optimal value of k in the K-means algorithm.

I hope this article gives some insight on choosing the ideal value of k.

If you want to learn more metrics, Milligan and Cooper have presented a comparison study over thirty validity indexes. You can read it here.