7.4 Outro

7.4.1 Remarks


The aim of K-means is to find \(K\) clusters based on the notion of the points’ closeness to the cluster centres.

Remember that \(K\) must be set in advance.
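
For illustration, here is a minimal sketch using the built-in stats::kmeans() function; the use of the iris data and the choice \(K = 3\) are arbitrary:

    X <- as.matrix(iris[, 1:4])            # an example numeric data matrix
    K <- 3                                 # the number of clusters, fixed in advance
    km <- kmeans(X, centers=K, nstart=10)  # nstart random restarts
    km$centers                             # the K cluster centres
    table(km$cluster)                      # cluster membership counts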

By definition (via the method's relation to Voronoi diagrams), all clusters will be convex in shape.

However, we may try applying \(K'\)-means for \(K' \gg K\) to obtain a “fine-grained” data compression and then combine the (sub)clusters into more meaningful groups using other methods, as sketched below.
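
Continuing the sketch above, such a two-stage procedure might look as follows; the choice \(K' = 50\) and the use of complete linkage for the merging step are arbitrary:

    # stage 1: fine-grained data compression via K'-means, K' >> K
    Kprime <- 50
    km2 <- kmeans(X, centers=Kprime, iter.max=100)
    # stage 2: merge the K' subcluster centres into K=3 meaningful groups,
    # here with hierarchical clustering on the centres
    h <- hclust(dist(km2$centers), method="complete")
    merged <- cutree(h, k=3)
    final_labels <- merged[km2$cluster]  # map each point to its merged group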

Iterative K-means algorithms are very fast even for large data sets, but they may fail to find a desirable solution; see the next chapter for a discussion.


On the other hand, hierarchical methods output a whole family of mutually nested partitions, which may provide us with insight into the underlying data structure.

Unfortunately, there is no easy way to assign new points to existing clusters.

The linkage scheme must be chosen with care.

Hierarchical methods are generally slow, with \(O(n^2)\) to \(O(n^3)\) time complexity.
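
To illustrate the nestedness, a dendrogram computed with stats::hclust() can be cut at different levels via cutree(); average linkage is just one possible choice here:

    d <- dist(X)                       # O(n^2) pairwise distances
    h <- hclust(d, method="average")   # the linkage scheme matters!
    cutree(h, k=2)                     # a coarse 2-partition...
    cutree(h, k=5)                     # ...and a finer 5-partition nested within it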

Note that the fastcluster package provides more efficient implementations of some of the methods available via a call to hclust(). See also the genie package for a robust algorithm based on the minimum spanning tree, which can be computed quickly.
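
For instance, assuming both packages are installed (and assuming the interfaces as I recall them), fastcluster::hclust() is a drop-in replacement for stats::hclust(), while genie exports hclust2():

    h_fast <- fastcluster::hclust(d, method="complete")  # same interface, faster
    h_genie <- genie::hclust2(d)       # the Genie algorithm (MST-based)
    cutree(h_genie, k=3)               # cutree() works as usual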


Unsupervised learning is often performed during the data pre-processing and exploration stage.

Assessing the quality of clustering is particularly challenging as, unlike in a supervised setting, we have no access to “ground truth” information.

Moreover, some unsupervised methods do not easily generalise to unobserved data.

Moreover, many clustering methods are based on the notion of pairwise distances, but these tend to behave counterintuitively in high-dimensional spaces (the so-called curse of dimensionality).
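
A quick empirical illustration of this phenomenon on random uniform data (the exact figures will vary from run to run):

    # as the dimensionality grows, the farthest-to-nearest distance ratio
    # approaches 1, i.e., all points become almost equidistant
    set.seed(123)
    for (p in c(2, 10, 100, 1000)) {
        Z <- matrix(runif(100*p), nrow=100)
        D <- as.matrix(dist(Z)); diag(D) <- NA
        cat(sprintf("p=%4d  max/min distance ratio: %.2f\n",
            p, max(D, na.rm=TRUE)/min(D, na.rm=TRUE)))
    }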

However, clustering methods can aid supervised tasks: instead of fitting a single “large” model, it might be beneficial to fit a separate model to each cluster.
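
A sketch of this idea on the iris data (the response and predictor chosen here are purely illustrative):

    # fit a separate simple linear model within each discovered cluster
    cl <- kmeans(as.matrix(iris[, 1:4]), centers=3)$cluster
    models <- lapply(split(iris, cl), function(df)
        lm(Sepal.Length ~ Petal.Length, data=df))
    sapply(models, function(m) coef(m)[2])   # per-cluster slopes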

7.4.2 Other Noteworthy Clustering Algorithms


Other noteworthy clustering approaches:

  • Genie (see R package genie) (Gagolewski et al. 2016, Cena & Gagolewski 2020)
  • ITM (Müller et al. 2012)
  • DBSCAN, HDBSCAN* (Ling 1973, Ester et al. 1996, Campello et al. 2015)
  • K-medoids, K-medians
  • Fuzzy C-means (a.k.a. weighted K-means) (Bezdek et al. 1984)
  • Spectral clustering; e.g., (Ng et al. 2001)
  • BIRCH (Zhang et al. 1996)

7.4.3 Further Reading

Recommended further reading:

  • (James et al. 2017: Section 10.3)

Other:

  • (Hastie et al. 2017: Section 14.3)