7.3 Hierarchical Methods

7.3.1 Introduction


In K-means, we need to specify the number of clusters, \(K\), in advance.

What if we don’t know it?

Moreover, there is no guarantee that a \((K+1)\)-partition is “similar” to the corresponding \(K\)-partition.

Hierarchical methods, on the other hand, output a whole hierarchy of mutually nested partitions, which increases the interpretability of the results.

A \(K\)-partition for any \(K\) can be extracted later.


Here we are interested in agglomerative algorithms.

At the lowest level of the hierarchy, each point belongs to its own cluster (there are \(n\) singletons).

At the highest level of the hierarchy, there is one cluster containing all the points.

Moving from the \(i\)-th to the \((i+1)\)-th level, we select a pair of clusters to be merged.

7.3.2 Example in R
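The printout below indicates Euclidean distances between 150 objects and complete linkage; the following is a sketch of calls that could have produced it (the dataset itself is an assumption here, namely the four numeric columns of iris):

X <- as.matrix(iris[, 1:4])   # assumed dataset: 150 observations, 4 numeric variables
D <- dist(X)                  # Euclidean distances (the default metric)
h <- hclust(D)                # complete linkage (the default method)
print(h)
cutree(h, k = 3)              # extract a 3-partition from the hierarchy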


## 
## Call:
## hclust(d = D)
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 150

##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [30] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
##  [59] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [88] 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 2 3 3 2 2 2
## [117] 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 2 2 3 3
## [146] 2 2 2 3 2

Different cuts of the hierarchy:

(Figure: different cuts of the hierarchy)
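A sketch of how such cuts can be obtained and visualised (assuming the h object and the iris data from the example above; the choice of the two plotted variables is arbitrary):

for (k in 2:4) {                       # extract the 2-, 3-, and 4-partitions
    labels <- cutree(h, k = k)
    plot(iris[["Petal.Length"]], iris[["Petal.Width"]],
        col = labels, pch = 16, main = sprintf("k = %d", k))
}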


A dendrogram depicts the distances (as defined by the linkage function \(d^*\); see below) between the clusters merged at every stage. This can provide us with insight into the underlying data structure.

(Figure: the corresponding dendrogram)
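Such a dendrogram can be drawn as follows (again a sketch based on the h object from above); rect.hclust() highlights the clusters obtained by cutting the tree into a given number of groups:

plot(h, labels = FALSE)   # draw the dendrogram
rect.hclust(h, k = 3)     # mark the clusters of the 3-partition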

7.3.3 Agglomerative Hierarchical Clustering


Initially, \(\mathcal{C}^{(0)}=\{\{1\},\dots,\{n\}\}\), i.e., each point is a member of its own cluster.

As the algorithm proceeds, at the \(k\)-th step of the procedure there are \(n-k\) clusters, \(\mathcal{C}^{(k)}=\{C_1^{(k)},\dots,C_{n-k}^{(k)}\}\).

When proceeding from step \(k\) to \(k+1\), we determine \(C_u^{(k)}\) and \(C_v^{(k)}\), \(u<v\), to be merged together so that we get: \[ \mathcal{C}^{(k+1)} = \left\{ C_1^{(k)},\dots,C_{u-1}^{(k)}, C_u^{(k)}{\cup C_v^{(k)}}, C_{u+1}^{(k)},\dots,C_{v-1}^{(k)}, C_{v+1}^{(k)},\dots,C_{n-k}^{(k)} \right\}. \]

Thus, \((\mathcal{C}^{(0)}, \mathcal{C}^{(1)}, \dots, \mathcal{C}^{(n-1)})\) is a sequence of nested partitions of \(\{1,2,\dots,n\}\) with \(\mathcal{C}^{(n-1)}=\left\{ \{1,2,\dots,n\} \right\}\).
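For illustration, here is a naive sketch of this scheme written from scratch (it is not how hclust() is actually implemented; complete linkage is assumed, see the next subsection for other choices of the between-cluster distance):

naive_agglomerative <- function(X) {
    D <- as.matrix(dist(X))                 # pointwise Euclidean distances
    clusters <- as.list(seq_len(nrow(X)))   # C^(0): n singleton clusters
    merges <- list()                        # a record of the performed merges
    while (length(clusters) > 1) {
        best_uv <- c(NA, NA)
        best_d <- Inf
        for (u in seq_along(clusters)) for (v in seq_along(clusters)) if (u < v) {
            d_uv <- max(D[clusters[[u]], clusters[[v]]])   # complete linkage
            if (d_uv < best_d) { best_d <- d_uv; best_uv <- c(u, v) }
        }
        u <- best_uv[1]
        v <- best_uv[2]
        merges[[length(merges) + 1]] <- list(
            members = c(clusters[[u]], clusters[[v]]), height = best_d)
        clusters[[u]] <- c(clusters[[u]], clusters[[v]])   # merge C_v into C_u
        clusters[[v]] <- NULL                              # ... and drop C_v
    }
    merges   # the nested partitions C^(1), ..., C^(n-1) can be read off this list
}
str(naive_agglomerative(iris[1:10, 1:4]))   # try it on a small subset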

7.3.4 Linkage Functions


A pair of clusters \(C_u^{(k)}\) and \(C_v^{(k)}\) to be merged with each other is determined by: \[ \mathrm{arg}\min_{u < v} d^*(C_u^{(k)}, C_v^{(k)}), \] where \(d^*(A,B)\) is the distance between two clusters \(A\) and \(B\).

Note that, ordinarily, we only define distances between single points, not between sets of points.

\(d^*\) is a suitable extension of a pointwise distance \(d\) (usually the Euclidean metric) to whole sets.

We will assume that \(d^*(\{\mathbf{x}_{i,\cdot}\},\{\mathbf{x}_{j,\cdot}\})= d(\mathbf{x}_{i,\cdot},\mathbf{x}_{j,\cdot})\), i.e., the distance between singleton clusters is the same as the distance between the points themselves.


There are many popular choices of \(d^*\) (which, in the context of hierarchical clustering, we call a linkage function); three classical ones are listed below, and a small R sketch computing them follows the list:

  • Single linkage: \[ d_S^*(A,B) = \min_{\mathbf{a}\in A, \mathbf{b}\in B} d(\mathbf{a},\mathbf{b}) \]
  • Complete linkage: \[ d_C^*(A,B) = \max_{\mathbf{a}\in A, \mathbf{b}\in B} d(\mathbf{a},\mathbf{b}) \]
  • Average linkage: \[ d_A^*(A,B) = \frac{1}{|A| |B|} \sum_{\mathbf{a}\in A}\sum_{\mathbf{b}\in B} d(\mathbf{a},\mathbf{b}) \]
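A minimal sketch computing all three linkages by hand (the Euclidean pointwise distance is assumed; A and B are matrices whose rows are the points of the two clusters):

linkages <- function(A, B) {
    # all pairwise distances between the points of A and the points of B
    d <- as.matrix(dist(rbind(A, B)))[seq_len(nrow(A)), nrow(A) + seq_len(nrow(B))]
    c(single = min(d), complete = max(d), average = mean(d))
}
A <- matrix(c(0, 0,   1, 0), ncol = 2, byrow = TRUE)   # two example points
B <- matrix(c(4, 0,   5, 1), ncol = 2, byrow = TRUE)   # two other example points
linkages(A, B)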

Computing linkages – an illustration:

(Figure: an illustration of the linkage computations)


Complete linkage (again):

(Figures: complete linkage results)


Average linkage:

(Figures: average linkage results)


Single linkage (note its typical “chaining” behaviour!):

(Figures: single linkage results)
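To experiment with these behaviours, the linkage is selected via the method argument of hclust(); a sketch, reusing the D distance matrix assumed in the earlier example:

for (m in c("single", "complete", "average")) {
    h_m <- hclust(D, method = m)        # same distances, different linkage
    plot(h_m, main = m, labels = FALSE)
}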