Hierarchical clustering is, after K-means, one of the most widely used clustering techniques.

Recall that in K-means we need to specify the number of clusters beforehand. In hierarchical clustering, we don’t have to.

There are two categories of hierarchical clustering:

- Agglomerative Hierarchical clustering
- Divisive Hierarchical clustering

**AGGLOMERATIVE:** This is the bottom-up approach: it starts with each point as an individual cluster and sequentially merges the two closest clusters until all the points are in a single cluster.

**DIVISIVE:** This works the other way around. It proceeds top-down, starting with all points in a single cluster and sequentially splitting them until every point is in its own cluster.

Hierarchical clustering generates clusters that are organized into a hierarchical structure. This structure can be visualized using a tree-like diagram called a dendrogram.

A dendrogram records the sequence of merges (agglomerative) or splits (divisive). Every time we merge or split clusters, it records the distance between the two clusters.

In this article, we’ll concentrate on agglomerative hierarchical clustering and discuss it in detail.

**AGGLOMERATIVE HIERARCHICAL CLUSTERING:**

As we saw earlier, it builds the clusters bottom-up, starting with each element as a single cluster. The distance between each cluster and every other cluster is computed, and the closest pair of clusters is merged; this repeats until only one cluster remains.

Agglomerative clustering performs merges based on the distance between the clusters. So, we need to choose a distance or similarity metric and construct a distance matrix.

Euclidean distance is a common choice, but you can also use other metrics such as Manhattan or cosine distance.

You can read my article Distance Measures in Machine Learning if you want to learn more about some of the commonly used distance metrics.

Now, once we have our distance matrix, we need a strategy or linkage criterion to merge the clusters.

The linkage criterion determines which pair of clusters should be merged at each step.

Several types of linkage methods are used. Some of them are discussed below.

**SINGLE LINKAGE:**

In single linkage, the distance between two clusters is defined as the minimum distance between a point in one cluster and a point in the other cluster.

This method is also known as the nearest neighbor method.

D(X, Y)=\min _{x \in X, y \in Y} d(x, y)

**COMPLETE LINKAGE:**

In complete linkage, the distance between two clusters is defined as the maximum distance between a point in one cluster and a point in the other cluster.

This method is also known as the furthest neighbor method.

D(X, Y)=\max _{x \in X, y \in Y} d(x, y)

**AVERAGE LINKAGE:**

The distance between two clusters is the average distance between each point in one cluster and every point in the other cluster.

This method is also known as the unweighted pair group method with arithmetic mean (UPGMA).

D(X, Y)=\frac{1}{|X| \cdot|Y|} \sum_{x \in X} \sum_{y \in Y} d(x, y)

**CENTROID LINKAGE:**

The distance between two clusters is the distance between the cluster centroids.

D(X, Y)=\left\|\mathbf{c}_{X}-\mathbf{c}_{Y}\right\|

**WARD’S LINKAGE:**

Ward’s linkage is based on minimizing the total within-cluster sum of squares: at each step it merges the pair of clusters whose merger produces the smallest increase in that quantity.

The increase incurred by merging clusters X and Y can be written in terms of their centroids \mathbf{c}_{X} and \mathbf{c}_{Y}:

\Delta_{Ward}(X, Y)=\frac{|X| \cdot|Y|}{|X|+|Y|}\left\|\mathbf{c}_{X}-\mathbf{c}_{Y}\right\|^{2}

Each of these linkage methods has its own advantages and disadvantages.

Single linkage imposes no constraints on the shape of the clusters and often produces unbalanced, irregularly shaped clusters.

Between single and complete linkage, the latter is usually preferred, as it tends to produce more compact clusters of similar diameter.

Neither single nor complete linkage takes the cluster structure into account.

Average linkage, on the other hand, does take the cluster structure into account and is less susceptible to noise and outliers.

In most cases, Ward’s linkage is preferred, as it usually produces better cluster hierarchies.
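To make the linkage formulas concrete, here is a small sketch (my own illustration; the points are arbitrary) that computes the single, complete, average, and centroid linkage distances between two toy clusters using SciPy’s `cdist`:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small toy clusters (arbitrary points for illustration)
X = np.array([[1.0, 1.0], [1.5, 1.5]])
Y = np.array([[3.0, 4.0], [3.0, 3.5]])

D = cdist(X, Y)  # pairwise Euclidean distances between the clusters

single   = D.min()   # nearest pair of points
complete = D.max()   # furthest pair of points
average  = D.mean()  # mean over all |X|*|Y| pairs (UPGMA)
centroid = np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))

print(single, complete, average, centroid)
```

Note that the four criteria already disagree on these four points, which is why the choice of linkage changes the resulting hierarchy.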

**STEPS IN AGGLOMERATIVE HIERARCHICAL CLUSTERING:**

- Let each data point be a single cluster
- Compute the distance matrix
- Merge the two closest clusters according to the linkage criterion
- Update the distance matrix
- Repeat steps three and four until all the points are in a single cluster
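The steps above can be sketched directly in Python (a naive O(n³) implementation for clarity, not efficiency; the function and variable names are my own):

```python
import numpy as np
from scipy.spatial.distance import cdist

def naive_agglomerative(points, linkage="complete"):
    """Repeatedly merge the two closest clusters; return the merge log."""
    clusters = [[i] for i in range(len(points))]  # step 1: every point is its own cluster
    D = cdist(points, points)                     # step 2: distance matrix
    merges = []
    while len(clusters) > 1:
        best = (0, 1, np.inf)
        for i in range(len(clusters)):            # step 3: closest pair under the linkage
            for j in range(i + 1, len(clusters)):
                d = D[np.ix_(clusters[i], clusters[j])]
                dist = d.max() if linkage == "complete" else d.min()
                if dist < best[2]:
                    best = (i, j, dist)
        i, j, dist = best
        merges.append((clusters[i], clusters[j], dist))
        clusters[i] = clusters[i] + clusters[j]   # step 4: merge and drop the old cluster
        del clusters[j]
    return merges

pts = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]])
for a, b, d in naive_agglomerative(pts):
    print(a, b, round(d, 3))
```

Running this on the six example points used below reproduces the merge order derived by hand: DF at 0.5, then AB, DFE, DFEC, and a final merge joining everything.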

Let’s now see a simple example. The following are the data points.

A(1, 1), B(1.5, 1.5), C(5, 5), D(3, 4), E(4, 4), F(3, 3.5)

We’ll use complete linkage as our linkage criterion.

Let’s first visualize the data points.

First, let’s construct the distance matrix using Euclidean distance.

The Euclidean distance is given by d=\sqrt{\left(x_{2}-x_{1}\right)^{2}+\left(y_{2}-y_{1}\right)^{2}}

Let’s say we want to find the Euclidean distance between points A(1, 1) and B(1.5, 1.5),

\begin{array}{l}{d=\sqrt{(1.5-1)^{2}+(1.5-1)^{2}}} \\ \ \ \ {=\sqrt{(0.5)^{2}+(0.5)^{2}}} \\ \ \ \ {=\sqrt{0.25+0.25}} \\ \ \ \ {=\sqrt{0.5}} \\ \ \ \ {=0.707107}\end{array}
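The same arithmetic can be checked with Python’s standard library:

```python
import math

# Euclidean distance between A(1, 1) and B(1.5, 1.5)
d = math.dist((1, 1), (1.5, 1.5))
print(round(d, 6))  # 0.707107
```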

Similarly, if we find the distance for all the points, we’ll obtain the following distance matrix

\begin{array}{|c|c|c|c|c|c|c|}\hline & {A} & {B} & {C} & {D} & {E} & {F} \\ \hline A & {0} & {0.7} & {5.65} & {3.6} & {4.24} & {3.2} \\ \hline B & {0.7} & {0} & {4.94} & {2.91} & {3.53} & {2.5} \\ \hline C & {5.65} & {4.94} & {0} & {2.23} & {1.41} & {2.5} \\ \hline D & {3.6} & {2.91} & {2.23} & {0} & {1} & {0.5} \\ \hline E & {4.24} & {3.53} & {1.41} & {1} & {0} & {1.11} \\ \hline F & {3.2} & {2.5} & {2.5} & {0.5} & {1.11} & {0} \\ \hline\end{array}

Next, we find the smallest non-zero element in the matrix and merge the corresponding pair of clusters.

Here, 0.5 is the smallest element. So we merge D and F.

Every time we merge two clusters, we link them in the dendrogram at the corresponding merge distance.

Now we have to update the distances for the new cluster DF.

\begin{array}{|c|c|c|c|c|c|}\hline & {A} & {B} & {C} & {D, F} & {E} \\ \hline A & {0} & {0.7} & {5.65} & {?} & {4.24} \\ \hline B & {0.7} & {0} & {4.94} & {?} & {3.53} \\ \hline C & {5.65} & {4.94} & {0} & {?} & {1.41} \\ \hline D, F & {?} & {?} & {?} & {0} & {?} \\ \hline E & {4.24} & {3.53} & {1.41} & {?} & {0} \\ \hline\end{array}

Since we are using complete linkage, the distance between the new cluster DF and A is the larger of the two point-to-point distances:

\begin{aligned} D(DF, A) &=\max (d(D, A), d(F, A)) \\ &=\max (3.6,3.2) \\ &=3.6 \end{aligned}

Likewise, if we update all the values, we’ll get the following distance matrix.

\begin{array}{|c|c|c|c|c|c|}\hline & {A} & {B} & {C} & {D, F} & {E} \\ \hline A & {0} & {0.7} & {5.65} & {3.6} & {4.24} \\ \hline B & {0.7} & {0} & {4.94} & {2.91} & {3.53} \\ \hline C & {5.65} & {4.94} & {0} & {2.5} & {1.41} \\ \hline D, F & {3.6} & {2.91} & {2.5} & {0} & {1.11} \\ \hline E & {4.24} & {3.53} & {1.41} & {1.11} & {0} \\ \hline\end{array}

Now again, we have to follow the same steps.

Find the smallest entry, merge the cluster, and update the distance matrix.

The smallest element is now 0.7, so we merge A and B and update the distances.

\begin{array}{|c|c|c|c|c|}\hline & {A, B} & {C} & {D, F} & {E} \\ \hline A, B & {0} & {5.65} & {3.6} & 4.24\\ \hline C & {5.65} & {0} & {2.5} & 1.41 \\ \hline D, F & {3.6} & {2.5} & {0} & 1.11 \\ \hline E & {4.24} & {1.41} & {1.11} & {0} \\ \hline\end{array}

Remember, we have to repeat this process until only one cluster is left. The smallest element is 1.11, so merge clusters DF and E.

\begin{array}{|c|c|c|c|}\hline & {A, B} & {C} & {D, F, E} \\ \hline A, B & {0} & {5.65} & {4.24} \\ \hline C & {5.65} & {0} & {2.5} \\ \hline D, F, E & {4.24} & {2.5} & {0} \\ \hline\end{array}

Now, the smallest element is 2.5, so merge the clusters DFE and C and update the distances in the matrix.

\begin{array}{|c|c|c|}\hline & {A, B} & {D, F, E, C} \\ \hline A, B & {0} & {5.65} \\ \hline D, F, E, C & {5.65} & {0} \\ \hline\end{array}

Finally, we merge the clusters AB and DFEC, which gives us the complete dendrogram as the output.
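The hand computation can be cross-checked with SciPy: running complete linkage on the six points should reproduce the merge distances derived above (0.5, 0.71, 1.12, 2.5, 5.66):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

pts = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]])  # A..F
Z = linkage(pts, method='complete')

# The third column of the linkage matrix holds the merge distances, in merge order
print(np.round(Z[:, 2], 2))
```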

**CHOOSING NUMBER OF CLUSTERS USING DENDROGRAM:**

In order to decide the number of clusters, we can use the dendrogram. Recall that a dendrogram only gives the hierarchy of the clusters.

However, by inspecting the dendrogram and cutting it at a certain height we can decide the appropriate number of clusters for our dataset.

The height of each vertical line indicates the distance at which the two clusters were merged.

Cutting a dendrogram at different heights yields different clusterings. Usually, we cut the tree where it crosses the tallest vertical line.

The number of clusters equals the number of vertical lines intersected by the horizontal line drawn at the chosen cut-off value.

By choosing a cut-off value of 4, we get two clusters.
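With SciPy, the cut can be performed with `fcluster`; for the worked example, cutting the complete-linkage tree at height 4 leaves clusters {A, B} and {C, D, E, F}:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

pts = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]])  # A..F
Z = linkage(pts, method='complete')

labels = fcluster(Z, t=4, criterion='distance')  # cut the dendrogram at height 4
print(labels)
```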

Let’s do some coding.

We’ll start by importing the necessary packages:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
```

Next, we’ll create a dataset with 5 clusters using sklearn’s make_blobs:

```python
X, Y = make_blobs(n_samples=250, n_features=7, centers=5,
                  cluster_std=1.0, random_state=47)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
```

Now we can compute the linkage matrix and visualize the resulting dendrogram, using scipy’s linkage and dendrogram functions from the scipy.cluster.hierarchy module.

Ward’s method is used as the linkage criterion; however, you can experiment with different methods.

```python
Z = linkage(X, method='ward')
dendrogram(Z)
plt.show()
```

That’s it: two lines of code, and scipy does all the work for us, presenting us with the following dendrogram.

The bottom level of the dendrogram is highly congested. We can truncate the diagram to show only the last p merges.

The following code visualizes the dendrogram for the last 20 merges.

```python
Z = linkage(X, method='ward')
dendrogram(Z, truncate_mode='lastp', p=20)
plt.show()
```

Remember, the height of each vertical line indicates the distance at which the two clusters were merged.

If you eyeball the preceding diagram, you can see a massive jump in distance for the last four merges. This means we are trying to merge clusters that are far apart.

Let’s print the distances at which the clusters are merged for the last 10 merges.

```python
print(Z[-10:, 2])
```

OUTPUT:

```
[  9.21426221   9.28031088   9.28499777   9.74114631  10.35996514
  11.47695208 122.33910337 152.29478626 173.15315068 191.87563329]
```

As you can see from the output, there is a huge jump in distance, from about 11.5 to 122, indicating that we are trying to merge two clusters that are highly dissimilar.

If we choose a cut-off value of 50, it will yield five clusters.
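The same cut can be made programmatically with SciPy’s `fcluster` (the dataset and linkage matrix are recomputed here so the snippet stands alone):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=250, n_features=7, centers=5,
                  cluster_std=1.0, random_state=47)
Z = linkage(X, method='ward')

labels = fcluster(Z, t=50, criterion='distance')  # cut the dendrogram at height 50
print(np.unique(labels).size)  # number of clusters the cut yields
```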

Since we have now identified the number of clusters, we can use scikit-learn to implement AHC.

```python
from sklearn.cluster import AgglomerativeClustering

# Ward linkage always uses Euclidean distance
ahc = AgglomerativeClustering(n_clusters=5, linkage='ward')
ahc.fit_predict(X)
```

Now we can visualize the clusters:

```python
plt.scatter(X[:, 0], X[:, 1], c=ahc.labels_)
plt.show()
```

**CONCLUSION:**

In this article, we discussed hierarchical clustering, focusing on one of its two types: agglomerative hierarchical clustering.

Hierarchical clustering is more often used for descriptive than for predictive modeling.

AHC is a bottom-up approach: it starts with each element as a single cluster and sequentially merges the closest pair of clusters until all the points are in a single cluster.

We discussed different linkage methods that are used to merge the clusters and reviewed some of the pros and cons of each of the methods.

Then we discussed the dendrogram, which shows the hierarchy of clusters, and how it can be used to decide the appropriate number of clusters.

Finally, we used scipy to visualize the dendrogram and sklearn to implement agglomerative hierarchical clustering.