7.3 Hierarchical Methods

7.3.1 Introduction

In K-means, we need to specify the number of clusters, $$K$$, in advance.

What if we don’t know it?

There is no guarantee that a $$(K+1)$$-partition is “similar” to the $$K$$-one.

Hierarchical methods, on the other hand, output a whole hierarchy of mutually nested partitions, which increase the interpretability of the results.

A $$K$$-partition for any $$K$$ can be extracted later.

Here we are interested in agglomerative algorithms.

At the lowest level of the hierarchy, each point belongs to its own cluster (there are $$n$$ singletons).

At the highest level of the hierarchy, there is one cluster containing all the points.

Moving from the $$i$$-th to the $$(i+1)$$-th level, we select a pair of clusters to be merged.

7.3.2 Example in R

# Distances between all pairs of points:
D <- dist(X)
# Apply Complete Linkage (the default, see below):
h <- hclust(D) # method="complete"
print(h)
##
## Call:
## hclust(d = D)
##
## Cluster method   : complete
## Distance         : euclidean
## Number of objects: 150

cutree(h, k=3) # extract the 3-partition
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [30] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
##  [59] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [88] 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 2 3 3 2 2 2
## [117] 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 2 2 3 3
## [146] 2 2 2 3 2

Different cuts of the hierarchy:

A dendrogram depicts the distances (* as defined by the linkage function) between the clusters merged at every stage. This can provide us with the insight into the underlying data structure.

plot(h)

7.3.3 Agglomerative Hierarchical Clustering

Initially, $$\mathcal{C}^{(0)}=\{\{1\},\dots,\{n\}\}$$, i.e., each point is a member of its own cluster.

While a hierarchical agglomerative clustering algorithm is being computed, there are $$n-k$$ clusters at the $$k$$-th step of the procedure, $$\mathcal{C}^{(k)}=\{C_1^{(k)},\dots,C_{n-k}^{(k)}\}$$.

When proceeding from step $$k$$ to $$k+1$$, we determine $$C_u^{(k)}$$ and $$C_v^{(k)}$$, $$u<v$$, to be merged together so that we get: $\mathcal{C}^{(k+1)} = \left\{ C_1^{(k)},\dots,C_{u-1}^{(k)}, C_u^{(k)}{\cup C_v^{(k)}}, C_{u+1}^{(k)},\dots,C_{v-1}^{(k)}, C_{v+1}^{(k)},\dots,C_{n-k}^{(k)} \right\}.$

Thus, $$(\mathcal{C}^{(0)}, \mathcal{C}^{(1)}, \dots, \mathcal{C}^{(n-1)})$$ is a sequence of nested partitions of $$\{1,2,\dots,n\}$$ with $$\mathcal{C}^{(n-1)}=\left\{ \{1,2,\dots,n\} \right\}$$.

A pair of clusters $$C_u^{(k)}$$ and $$C_v^{(k)}$$ to be merged with each other is determined by: $\mathrm{arg}\min_{u < v} d^*(C_u^{(k)}, C_v^{(k)}),$ where $$d^*(A,B)$$ is the distance between two clusters $$A$$ and $$B$$.

Note that we usually only consider the distances between single points, not sets of points.

$$d^*$$ is a suitable extension of a pointwise distance $$d$$ (usually the Euclidean metric) to whole sets.

We will assume that $$d^*(\{\mathbf{x}_{i,\cdot}\},\{\mathbf{x}_{j,\cdot}\})= d(\mathbf{x}_{i,\cdot},\mathbf{x}_{j,\cdot})$$, i.e., the distance between singleton clusters is the same as the distance between the points themselves.

There are many popular choices of $$d^*$$ (which in the context of hierarchical clustering we call a linkage function)

• Single linkage: $d_S^*(A,B) = \min_{\mathbf{a}\in A, \mathbf{b}\in B} d(\mathbf{a},\mathbf{b})$
• Complete linkage: $d_C^*(A,B) = \max_{\mathbf{a}\in A, \mathbf{b}\in B} d(\mathbf{a},\mathbf{b})$
• Average linkage: $d_A^*(A,B) = \frac{1}{|A| |B|} \sum_{\mathbf{a}\in A}\sum_{\mathbf{b}\in B} d(\mathbf{a},\mathbf{b})$