7.2 K-means Clustering

7.2.1 Example in R


Let us apply \(K\)-means clustering to find \(K=3\) groups in the famous Fisher’s iris data set (based on the Sepal.Width and Petal.Length variables only).
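The code producing the output below is not reproduced here; a call along the following lines should yield a similar result (the variable names X and km, as well as the random seed, are of our own choosing):

X <- iris[, c("Petal.Length", "Sepal.Width")]   # the two variables of interest
set.seed(123)                                   # k-means relies on random initialisation
km <- kmeans(X, centers=3, nstart=10)           # seek K=3 clusters, 10 random restarts
print(km$cluster)                               # the detected cluster memberships

Note that the particular cluster labels (1, 2, 3) are arbitrary and may differ between runs.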

##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [30] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1
##  [59] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1
##  [88] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2
## [117] 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
## [146] 2 1 2 2 2

Later we’ll see that nstart is responsible for randomly restarting the (local) optimisation procedure, just like we did in the previous chapter.


[Figure: scatter plot of the two variables; colours mark the detected clusters, plotting characters the true species]

The colours indicate the detected clusters, while the plotting characters mark the true iris species.

Note that the species labels were not used during the clustering procedure!
(we’re dealing with unsupervised learning)
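A figure like the one above could be drawn, e.g., as follows (a sketch relying on the X and km variables defined in the snippet above):

plot(X[, 1], X[, 2], col=km$cluster, pch=as.integer(iris$Species),
    xlab="Petal.Length", ylab="Sepal.Width")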


A contingency table for detected vs. true clusters:
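One way to obtain it (a sketch; the final value below is presumably the fraction of points falling into their species’ dominant cluster):

(tab <- table(km$cluster, iris$Species))  # rows: detected clusters, columns: true species
sum(apply(tab, 2, max)) / sum(tab)        # fraction of points in each species' dominant cluster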

##    
##     setosa versicolor virginica
##   1      0         48         9
##   2      0          2        41
##   3     50          0         0
## [1] 0.9266667

The discovered partition matches the true species very well (roughly 93% of the points end up in the “right” group).

7.2.2 Problem Statement


The aim of \(K\)-means clustering is to find \(K\) “good” cluster centres \(\boldsymbol\mu_{1,\cdot}, \dots, \boldsymbol\mu_{K,\cdot}\).

Then, a point \(\mathbf{x}_{i,\cdot}\) will be assigned to the cluster represented by the closest centre.

Closest == w.r.t. the squared Euclidean distance.

Assuming all the points are in a \(p\)-dimensional space, \(\mathbb{R}^p\), \[ d(\mathbf{x}_{i,\cdot}, \boldsymbol\mu_{k,\cdot}) = \| \mathbf{x}_{i,\cdot} - \boldsymbol\mu_{k,\cdot} \|^2 = \sum_{j=1}^p \left(x_{i,j}-\mu_{k,j}\right)^2 \]


The \(i\)-th point’s cluster is determined by: \[ \mathrm{C}(i) = \mathrm{arg}\min_{k=1,\dots,K} d(\mathbf{x}_{i,\cdot}, \boldsymbol\mu_{k,\cdot}), \] where \(\mathrm{arg}\min\) == argument minimum == the index \(k\) that minimises the given expression.
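For example, to classify a single point by hand (a sketch; x is an arbitrary point and km$centers holds the fitted centres):

x <- c(2.0, 3.0)                                        # some point in R^2
d <- apply(km$centers, 1, function(mu) sum((x-mu)^2))   # squared distances to all centres
which.min(d)                                            # C(i): index of the closest centre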


In the previous example, we have:
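These are the centres stored in the fitted model, presumably printed via something like:

km$centers   # the K=3 cluster centres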

##   Petal.Length Sepal.Width
## 1     4.328070    2.750877
## 2     5.672093    3.032558
## 3     1.462000    3.428000

[Figure: the data points together with the three cluster centres marked]


Here is the partition of the whole \(\mathbb{R}^2\) space into clusters based on the closeness to the three cluster centres:

[Figure: the partition of the plane into three regions induced by the cluster centres]

(*) For the interested reader, see “Voronoi diagrams”.
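(*) Such a picture can be generated by classifying a dense grid of points by their nearest centre; a rough sketch (grid resolution and axis ranges are arbitrary; the pdist package used here is introduced below):

grid <- expand.grid(
    Petal.Length=seq(0, 8, length.out=250),
    Sepal.Width=seq(1.5, 5, length.out=250))
D <- as.matrix(pdist::pdist(as.matrix(grid), km$centers))  # grid point-to-centre distances
memb <- apply(D, 1, which.min)                             # nearest centre for each grid point
plot(grid[, 1], grid[, 2], col=memb, pch=".",
    xlab="Petal.Length", ylab="Sepal.Width")
points(km$centers, pch=16, cex=2)                          # mark the three centres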


To compute the pairwise distances, we may call pdist::pdist():
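For instance (a sketch; X and km are as in the snippets above, and only the first few rows of the result are shown):

library("pdist")
D <- as.matrix(pdist(as.matrix(X), km$centers))  # 150x3 matrix of point-to-centre distances
head(D)

Note that pdist() returns the (non-squared) Euclidean distances; as the square root is monotone, the closest centre is the same whether we minimise the distances or their squares.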

##          [,1]     [,2]       [,3]
## [1,] 3.022380 4.297590 0.09501583
## [2,] 2.938649 4.272217 0.43246734
## [3,] 3.061196 4.375298 0.27969268
## [4,] 2.849538 4.172638 0.33019394
## [5,] 3.048705 4.309613 0.18283319
## [6,] 2.868316 4.065708 0.52860963


Therefore, the cluster memberships (\(\mathrm{arg}\min\)s) can be determined by:
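e.g. (a sketch; the final TRUE below presumably confirms that the result agrees with the labels returned by kmeans()):

memb <- apply(D, 1, which.min)   # for each point, the index of its nearest centre
print(memb)
all(memb == km$cluster)          # identical to the kmeans() labelling?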

##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [30] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1
##  [59] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1
##  [88] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2
## [117] 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
## [146] 2 1 2 2 2
## [1] TRUE

7.2.3 Algorithms for the K-means Problem


How to find “good” cluster centres?

In \(K\)-means clustering, we wish to find \(\boldsymbol\mu_{1,\cdot}, \dots, \boldsymbol\mu_{K,\cdot}\) that minimise the total within-cluster distances (the distances from each point to its own cluster centre): \[ \min_{\boldsymbol\mu_{1,\cdot}, \dots, \boldsymbol\mu_{K,\cdot} \in \mathbb{R}^p} \sum_{i=1}^n d(\mathbf{x}_{i,\cdot}, \boldsymbol\mu_{C(i),\cdot}). \] Note that the \(\boldsymbol\mu\)s are also “hidden” inside the point-to-cluster membership mapping, \(\mathrm{C}\).

Expanding the above yields: \[ \min_{\boldsymbol\mu_{1,\cdot}, \dots, \boldsymbol\mu_{K,\cdot} \in \mathbb{R}^p} \sum_{i=1}^n \left( \min_{k=1,\dots,K} \sum_{j=1}^p \left(x_{i,j}-\mu_{k,j}\right)^2 \right). \] Unfortunately, the inner \(\min\) operator makes the objective function nondifferentiable, so this optimisation problem cannot be tackled with the methods discussed in the previous chapter.
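For illustration, the objective function itself is easy to evaluate for any candidate set of centres (a sketch; objective() is a name of our own choosing):

objective <- function(M, X) {                         # M: a K x p matrix of centres
    D <- as.matrix(pdist::pdist(as.matrix(X), M))^2   # squared point-to-centre distances
    sum(apply(D, 1, min))                             # each point contributes its nearest one
}
objective(km$centers, X)   # should equal km$tot.withinss (up to rounding)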


The above problem is hard to solve.

(*) More precisely, it is an NP-hard problem.

Therefore, in practice we use various heuristics to solve it.

kmeans() in R implements the Hartigan-Wong, Lloyd, Forgy and MacQueen algorithms.

(*) Technically, there is no such thing as “the K-means algorithm” – all the above are particular heuristic approaches to solving the K-means clustering problem as specified by the aforementioned optimisation task.

By setting nstart = 10 above, we ask the (Hartigan-Wong) algorithm to generate 10 candidate solutions, each based on a different random initial clustering, and to return the best one (with respect to the total within-cluster distance). This does not guarantee finding the globally optimal solution, especially in high-dimensional spaces, but it increases the chances of doing so.
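To get a feeling for this, we could compare a few single-restart runs against a multi-restart one (a sketch; the objective values attained may or may not differ on such an easy data set):

set.seed(42)
replicate(5, kmeans(X, 3, nstart=1)$tot.withinss)   # five single-start runs
kmeans(X, 3, nstart=10)$tot.withinss                # the best of 10 restarts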


Lloyd’s algorithm (1957) is sometimes referred to as “the” K-means algorithm:

  1. Start with random cluster centres \(\boldsymbol\mu_{1,\cdot}, \dots, \boldsymbol\mu_{K,\cdot}\).

  2. For each point \(\mathbf{x}_{i,\cdot}\), determine its closest centre \(C(i)\in\{1,\dots,K\}\).

  3. For each cluster \(k\in\{1,\dots,K\}\), compute the new cluster centre \(\boldsymbol\mu_{k,\cdot}\) as the componentwise arithmetic mean of the coordinates of all the points \(\mathbf{x}_{i,\cdot}\) such that \(C(i)=k\).

  4. If the cluster centres have changed since the last iteration, go to step 2; otherwise, stop and return the result.


(*) Example implementation:
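The exact code is not reproduced here; a minimal sketch of Lloyd’s algorithm could look as follows (the three outputs below presumably correspond to the random initial centres, the centres found by such a function, and kmeans()’s own centres, respectively):

kmeans_lloyd <- function(X, K, maxiter=100) {
    X <- as.matrix(X)
    # 1. random initial centres, drawn uniformly within each variable's range
    M <- apply(X, 2, function(x) runif(K, min(x), max(x)))
    print(M)  # show the starting point
    for (iter in 1:maxiter) {
        # 2. assign each point to its closest centre
        D <- as.matrix(pdist::pdist(X, M))
        C <- apply(D, 1, which.min)
        # 3. recompute each centre as the mean of the points assigned to it
        #    (empty clusters are not handled in this sketch)
        M2 <- t(sapply(1:K, function(k) colMeans(X[C == k, , drop=FALSE])))
        # 4. stop as soon as the centres no longer change
        if (all(M2 == M)) break
        M <- M2
    }
    M
}

set.seed(1234)        # an arbitrary seed; the original one is unknown
kmeans_lloyd(X, 3)    # centres found by the sketch above
km$centers            # centres found by kmeans(), for comparison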

##      Petal.Length Sepal.Width
## [1,]     1.106172    2.994400
## [2,]     1.394794    3.321308
## [3,]     6.754820    3.808716


##   Petal.Length Sepal.Width
## 1     1.462000    3.428000
## 2     4.328070    2.750877
## 3     5.672093    3.032558
##   Petal.Length Sepal.Width
## 1     4.328070    2.750877
## 2     5.672093    3.032558
## 3     1.462000    3.428000