7.1 Unsupervised Learning

7.1.1 Introduction


In unsupervised learning (learning without a teacher), the input data points \(\mathbf{x}_{1,\cdot},\dots,\mathbf{x}_{n,\cdot}\) are not assigned any reference labels.


Our aim now is to discover the underlying structure in the data.

7.1.2 Main Types of Unsupervised Learning Problems


In dimensionality reduction, we seek a meaningful low-dimensional projection of data in a high-dimensional space (think: many variables/columns).


This might enable us to plot high-dimensional data or understand its structure better.


Example methods:

  • Multidimensional scaling (MDS)
  • Principal component analysis (PCA)
  • Kernel PCA
  • t-SNE
  • Autoencoders (deep learning)
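
For instance, PCA is readily available in base R via prcomp(); below is a minimal sketch on the built-in iris data (chosen here purely for illustration, not a data set used elsewhere in this section), projecting its four numeric columns onto the first two principal components:

    X <- as.matrix(iris[, 1:4])                      # 150 observations, 4 variables
    pca <- prcomp(X, center = TRUE, scale. = TRUE)   # PCA on standardised columns
    summary(pca)                                     # proportion of variance explained
    X2 <- pca$x[, 1:2]                               # the 2-dimensional projection
    plot(X2, xlab = "PC1", ylab = "PC2")             # now the data can be plotted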

In anomaly detection, our task is to identify rare, suspicious, abnormal, or outstanding items.


For example, these can be cars on walkways in a park’s security camera footage.
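
As a toy illustration only (a naive heuristic of our own, not a reference method), one can score each point by its distance to the data's centroid and flag the farthest ones:

    set.seed(123)
    X <- rbind(matrix(rnorm(200), ncol = 2),          # 100 "ordinary" points
               matrix(rnorm(6, mean = 6), ncol = 2))  # 3 injected anomalies
    d <- sqrt(rowSums(scale(X, center = TRUE, scale = FALSE)^2))  # distances to centroid
    which(d > quantile(d, 0.97))                      # indices of the most outlying points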


The aim of clustering is to automatically discover some naturally occurring subgroups in the data set.


For example, these may be customers having different shopping patterns (such as “young parents”, “students”, “boomers”).

7.1.3 Clustering


More formally, given \(K\ge 2\), clustering aims to find a special kind of \(K\)-partition of the input data set \(\mathbf{X}\).

\(\mathcal{C}=\{C_1,\dots,C_K\}\) is a \(K\)-partition of a data set \(\mathbf{X}\) comprising \(n\) points whenever:

  • \(C_k\neq\emptyset\) for all \(k\) (each set is nonempty),
  • \(C_k\cap C_l=\emptyset\) for all \(k\neq l\) (sets are pairwise disjoint),
  • \(\bigcup_{k=1}^K C_k=\{1,\dots,n\}\) (no point is neglected).

This can also be thought of as assigning each point a unique label in \(\{1,\dots,K\}\) (think: a colouring of the points, where each label number has its own colour).

\(\mathbf{x}_{i,\cdot}\) is labelled \(j\) iff it belongs to cluster \(C_j\), i.e., \(i\in C_j\).
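
In R, such a labelling is typically stored as an integer vector with one element per point; the tiny example below (the labels are made up for illustration) shows that the label vector and the partition carry exactly the same information:

    labels <- c(1, 2, 1, 3, 2, 1)      # cluster membership of points 1..6 (K = 3)
    split(seq_along(labels), labels)   # the index sets: C_1 = {1,3,6}, C_2 = {2,5}, C_3 = {4}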


Example applications of clustering:

  • taxonomization: e.g., partitioning consumers into more “uniform” groups to better understand who they are and what they need,
  • image processing: e.g., object detection, like tumour tissues on medical images,
  • complex network analysis: e.g., detecting communities in friendship, retweet, and other networks,
  • fine-tuning supervised learning algorithms: e.g., recommender systems indicating content that was rated highly by users from the same group, or learning multiple manifolds in a dimensionality reduction task.

The number of possible \(K\)-partitions of a set with \(n\) elements is given by the Stirling number of the second kind:

\[ \left\{{n \atop K}\right\} = \frac{1}{K!}\sum_{j=0}^{K}(-1)^{K-j}\binom{K}{j}\,j^{n}; \]

e.g., already \(\left\{{n \atop 2}\right\}=2^{n-1}-1\) and \(\left\{{n \atop 3}\right\}=O(3^n)\) – that is a lot.
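
The formula is easy to evaluate numerically; here is a quick sketch (the function name is ours, and it relies on floating-point arithmetic, so results for large \(n\) are only approximate):

    stirling2 <- function(n, K) {   # Stirling number of the second kind
        j <- 0:K
        sum((-1)^(K - j) * choose(K, j) * j^n) / factorial(K)
    }
    stirling2(10, 2)   # 511, i.e., 2^9 - 1, as claimed above
    stirling2(25, 3)   # 141197991025, already huge for small n and K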

Certainly, we are not just interested in “any” partition.

However, even one of the most famous textbooks provides us with only a vague hint:

Clustering concerns “segmenting a collection of objects into subsets so that those within each cluster are more closely related to one another than objects assigned to different clusters” (Hastie et al. 2017).


There are two main types of clustering algorithms:

  • parametric (model-based):
    • find clusters of specific shapes, or following specific multidimensional probability distributions,
    • e.g., \(K\)-means, expectation-maximization for Gaussian mixtures (EM), average linkage agglomerative clustering;
  • nonparametric (model-free):
    • identify high-density or well-separable regions, perhaps in the presence of noise points,
    • e.g., single linkage agglomerative clustering, Genie, (H)DBSCAN, BIRCH.

In this chapter we’ll take a look at two clustering approaches (see the code sketch after this list):

  • \(K\)-means clustering, which looks for a specific number of clusters;
  • (agglomerative) hierarchical clustering, which outputs a whole hierarchy of nested data partitions.
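
Both are available out of the box in base R; the sketch below (again on the illustrative iris columns, with the arbitrary choice \(K=3\)) shows the typical calls:

    X <- as.matrix(iris[, 1:4])                # an example numeric data matrix

    # K-means: ask for K = 3 clusters directly
    km <- kmeans(X, centers = 3, nstart = 10)
    km$cluster                                 # a label in {1, 2, 3} for each point

    # Agglomerative hierarchical clustering: build the whole hierarchy first...
    h <- hclust(dist(X), method = "average")   # average linkage on Euclidean distances
    plot(h)                                    # ...inspect the dendrogram...
    cutree(h, k = 3)                           # ...then cut it into 3 clusters

Note that kmeans() returns the label vector directly, whereas hclust() builds the full hierarchy of nested partitions, which we then cut at the desired number of clusters.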