## 7.2 K-means Clustering

### 7.2.1 Example in R

Let us apply $$K$$-means clustering to find $$K=3$$ groups in the famous Fisher’s iris data set (based on the Sepal.Width and Petal.Length variables only).

```r
X <- as.matrix(iris[,c(3,2)])
# never forget to set nstart>>1!
km <- kmeans(X, centers=3, nstart=10)
```
```r
km$cluster # assigned labels
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [30] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1
##  [59] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1
##  [88] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2
## [117] 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
## [146] 2 1 2 2 2
```

Later we will see that nstart is responsible for restarting the (local) optimisation procedure from multiple random initial points, just like we did in the previous chapter.

```r
plot(X, col=km$cluster, pch=as.numeric(iris$Species), las=1)
```

The colours indicate the detected clusters, while the plotting characters mark the true iris species. Note that the species were not used during the clustering procedure! (We are dealing with unsupervised learning.)

A contingency table for the detected vs. the true clusters:

```r
(C <- table(km$cluster, iris$Species))
##
##     setosa versicolor virginica
##   1      0         48         9
##   2      0          2        41
##   3     50          0         0
sum(apply(C, 1, max))/sum(C) # accuracy
## [1] 0.9266667
```

The discovered partition matches the original one very well.

### 7.2.2 Problem Statement

The aim of $$K$$-means clustering is to find $$K$$ “good” cluster centres, $$\boldsymbol\mu_{1,\cdot}, \dots, \boldsymbol\mu_{K,\cdot}$$. A point $$\mathbf{x}_{i,\cdot}$$ is then assigned to the cluster represented by the closest centre, where “closest” is meant with respect to the squared Euclidean distance. Assuming all the points lie in a $$p$$-dimensional space, $$\mathbb{R}^p$$, we have:

$d(\mathbf{x}_{i,\cdot}, \boldsymbol\mu_{k,\cdot}) = \| \mathbf{x}_{i,\cdot} - \boldsymbol\mu_{k,\cdot} \|^2 = \sum_{j=1}^p \left(x_{i,j}-\mu_{k,j}\right)^2.$

The $$i$$-th point’s cluster is thus determined by:

$\mathrm{C}(i) = \mathrm{arg}\min_{k=1,\dots,K} d(\mathbf{x}_{i,\cdot}, \boldsymbol\mu_{k,\cdot}),$

where $$\mathrm{arg}\min$$ denotes the argument minimum, i.e., the index $$k$$ that minimises the given expression.

In the previous example, we have:

```r
km$centers
##   Petal.Length Sepal.Width
## 1     4.328070    2.750877
## 2     5.672093    3.032558
## 3     1.462000    3.428000
```
```r
plot(X, col=km$cluster, las=1, asp=1) # asp=1 gives the same scale on both axes
points(km$centers, cex=2, col=4, pch=16)
```

The whole of the $$\mathbb{R}^2$$ space can also be partitioned into clusters based on the closeness to the three cluster centres; a way to depict this partition is sketched below.

(*) For the interested, see “Voronoi diagrams”.
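For illustration, here is one way to draw an approximation of such a partition: evaluate the nearest-centre rule at every point of a dense grid and colour-code the results (a minimal sketch; the grid resolution, margins, and plotting parameters are arbitrary choices of ours).

```r
# a dense grid spanning the range of the data (with a small margin):
gx <- seq(min(X[,1])-0.5, max(X[,1])+0.5, length.out=200)
gy <- seq(min(X[,2])-0.5, max(X[,2])+0.5, length.out=200)
G <- as.matrix(expand.grid(Petal.Length=gx, Sepal.Width=gy))
# assign each grid point to its closest centre (squared Euclidean distance):
gidx <- apply(G, 1, function(g)
    which.min(colSums((t(km$centers)-g)^2)))
plot(G, col=gidx, pch=".", las=1, asp=1)
points(X, col=km$cluster)
points(km$centers, cex=2, col=4, pch=16)
```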

To compute the pairwise distances, we may call pdist::pdist():

library("pdist")
D <- as.matrix(pdist(X, km$centers)) head(D) # D[i,j] - distance between x[i,] and µ[j,] ## [,1] [,2] [,3] ## [1,] 3.022380 4.297590 0.09501583 ## [2,] 2.938649 4.272217 0.43246734 ## [3,] 3.061196 4.375298 0.27969268 ## [4,] 2.849538 4.172638 0.33019394 ## [5,] 3.048705 4.309613 0.18283319 ## [6,] 2.868316 4.065708 0.52860963 Therefore, the cluster memberships ($$\mathrm{arg}\min$$s) can be determined by: (idx <- apply(D, 1, which.min)) ## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 ## [30] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 ## [59] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 ## [88] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 ## [117] 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 ## [146] 2 1 2 2 2 all(km$cluster == idx) # sanity check
## [1] TRUE
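By the way, pdist is merely a convenience here; the same memberships can be obtained in base R, for instance as follows (an unoptimised sketch; note that we may equivalently work with the squared distances, as squaring does not change which centre is the closest one):

```r
# squared Euclidean distances between all the points and all the centres:
D2 <- t(apply(X, 1, function(x)
    apply(km$centers, 1, function(mu) sum((x-mu)^2))))
all(apply(D2, 1, which.min) == idx) # the same memberships
```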

### 7.2.3 Algorithms for the K-means Problem

How to find “good” cluster centres?

In $$K$$-means clustering, we wish to find the centres $$\boldsymbol\mu_{1,\cdot}, \dots, \boldsymbol\mu_{K,\cdot}$$ that minimise the total of the within-cluster distances (the distance from each point to its own cluster centre):

$\min_{\boldsymbol\mu_{1,\cdot}, \dots, \boldsymbol\mu_{K,\cdot} \in \mathbb{R}^p} \sum_{i=1}^n d(\mathbf{x}_{i,\cdot}, \boldsymbol\mu_{\mathrm{C}(i),\cdot}).$

Note that the $$\boldsymbol\mu$$s are also “hidden” inside the point-to-cluster belongingness mapping, $$\mathrm{C}$$.
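To make this objective concrete, we can evaluate it for the clustering obtained earlier; the result should coincide with the tot.withinss field returned by kmeans():

```r
# total within-cluster sum of squared distances for our fitted model:
sum((X - km$centers[km$cluster,])^2)
# kmeans() reports the same quantity:
km$tot.withinss
```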

Expanding the objective function yields:

$\min_{\boldsymbol\mu_{1,\cdot}, \dots, \boldsymbol\mu_{K,\cdot} \in \mathbb{R}^p} \sum_{i=1}^n \left( \min_{k=1,\dots,K} \sum_{j=1}^p \left(x_{i,j}-\mu_{k,j}\right)^2 \right).$

Unfortunately, the inner $$\min$$ operator makes the objective function non-smooth, and hence this optimisation problem is not tractable with the methods discussed in the previous chapter.

The above problem is hard to solve.

(*) More precisely, it is an NP-hard problem.

Therefore, in practice we use various heuristics to solve it.

kmeans() in R implements the Hartigan-Wong, Lloyd, Forgy and MacQueen algorithms.

(*) Technically, there is no such thing as “the K-means algorithm” – all the above are particular heuristic approaches to solving the K-means clustering problem as specified by the aforementioned optimisation task.
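For instance, a particular heuristic can be chosen by means of kmeans()’s algorithm argument (a brief illustration):

```r
# the same task solved with Lloyd's heuristic
# (iter.max increased, as it may need more iterations to converge):
km_lloyd <- kmeans(X, centers=3, nstart=10, iter.max=100, algorithm="Lloyd")
km_lloyd$tot.withinss # objective value attained; compare with km$tot.withinss
```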

By setting nstart = 10 above, we ask the (Hartigan-Wong) algorithm to generate 10 candidate solutions, each obtained by starting the (local) optimisation procedure from a different random initial clustering, and to return the best of them (with respect to the total of the within-cluster distances). This does not guarantee finding the globally optimal solution, especially in high-dimensional spaces, but it increases the likelihood thereof.
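We can see why such restarts matter by calling the procedure a few times with nstart=1: the attained objective values will typically differ from run to run (a small experiment; the number of repetitions is, of course, arbitrary):

```r
# objective values attained in 10 independent single-start runs:
replicate(10, kmeans(X, centers=3, nstart=1)$tot.withinss)
```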

Lloyd’s algorithm (1957) is sometimes referred to as “the” K-means algorithm:

1. Start with random cluster centres $$\boldsymbol\mu_{1,\cdot}, \dots, \boldsymbol\mu_{K,\cdot}$$.

2. For each point $$\mathbf{x}_{i,\cdot}$$, determine its closest centre $$C(i)\in\{1,\dots,K\}$$.

3. For each cluster $$k\in\{1,\dots,K\}$$, compute the new cluster centre $$\boldsymbol\mu_{k,\cdot}$$ as the componentwise arithmetic mean of all the points $$\mathbf{x}_{i,\cdot}$$ such that $$\mathrm{C}(i)=k$$.

4. If the cluster centres have changed since the last iteration, go to step 2; otherwise, stop and return the result.

(*) Example implementation:

```r
K <- 3

# Random initial cluster centres:
M <- jitter(X[sample(1:nrow(X), K),])
M
##      Petal.Length Sepal.Width
## [1,]     1.106172    2.994400
## [2,]     1.394794    3.321308
## [3,]     6.754820    3.808716
# Let D[i,k] be the distance between the i-th point and the k-th centre:
D <- as.matrix(pdist(X, M))
# Let idx[i] be the centre closest to the i-th point:
idx <- apply(D, 1, which.min)
```

```r
repeat {
    # Previous cluster belongingness:
    old_idx <- idx
    # Split X into a list of K data frames based on old_idx info:
    X_split <- split(as.data.frame(X), old_idx)
    # Compute the componentwise arithmetic means of each data frame - new centres:
    M <- t(sapply(X_split, colMeans))
    # Recompute the cluster belongingness:
    D <- as.matrix(pdist(X, M))
    idx <- apply(D, 1, which.min)
    # Stop as soon as the memberships no longer change
    # (the centres have then converged):
    if (all(old_idx == idx)) break
}
```
```r
M # our result
##   Petal.Length Sepal.Width
## 1     1.462000    3.428000
## 2     4.328070    2.750877
## 3     5.672093    3.032558
km$centers # result of kmeans()
##   Petal.Length Sepal.Width
## 1     4.328070    2.750877
## 2     5.672093    3.032558
## 3     1.462000    3.428000
```

The two sets of centres are identical; they merely appear in a different order, as the cluster labels themselves are arbitrary.
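To verify that the two partitions themselves agree, we can cross-tabulate the memberships; each of our labels should correspond to exactly one label generated by kmeans():

```r
# rows: our labels, columns: kmeans()'s labels
table(idx, km$cluster)
```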