## 3.2 K-nearest Neighbour Classifier

### 3.2.1 Introduction

“If you don’t know what to do in a situation, just act like the people around you”

For some integer $$K\ge 1$$, the K-Nearest Neighbour (K-NN) Classifier proceeds as follows.

To classify a new point $$\mathbf{x}'$$:

1. find the $$K$$ nearest neighbours of $$\mathbf{x}'$$ amongst the points in the train set, denoted $$\mathbf{x}_{i_1,\cdot}, \dots, \mathbf{x}_{i_K,\cdot}$$:
    1. compute the Euclidean distances between $$\mathbf{x}'$$ and each $$\mathbf{x}_{i,\cdot}$$ from the train set, $d_i = \|\mathbf{x}'-\mathbf{x}_{i,\cdot}\|$,
    2. sort the $$d_i$$s in increasing order, $$d_{i_1} \le d_{i_2} \le \dots \le d_{i_n}$$,
    3. pick the first $$K$$ indices (these point to the nearest neighbours),
2. fetch the corresponding reference labels $$y_{i_1}, \dots, y_{i_K}$$
3. return their mode as a result, i.e., the most frequently occurring label (a.k.a. majority vote)
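The steps above can be sketched from scratch as follows. This is a minimal illustration only; `knn_predict` is our own hypothetical helper, and in practice we would rely on an optimised implementation such as the one in package FNN.

```r
knn_predict <- function(X_train, Y_train, x_new, K = 5) {
    # Euclidean distances between x_new and every point in the train set
    d <- sqrt(rowSums(sweep(X_train, 2, x_new)^2))
    # indices of the K nearest neighbours
    nn <- order(d)[1:K]
    # majority vote amongst the corresponding reference labels
    votes <- table(Y_train[nn])
    names(votes)[which.max(votes)]
}
```

Note that ties (both in distances and in the vote) are resolved here simply by taking the first candidate; dedicated implementations handle them more carefully.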

### 3.2.2 Example in R

We shall be calling the knn() function from package FNN to classify the points from the test sample extracted from the wines dataset:

```r
library("FNN")
```

Let us make predictions using the 5-nn classifier:

```r
Y_knn5 <- knn(X_train, X_test, Y_train, k=5)
head(Y_test, 28) # True Ys
##  [1] 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
head(Y_knn5, 28) # Predicted Ys
##  [1] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
mean(Y_test == Y_knn5) # accuracy
## [1] 0.7896055
```

10-nn classifier:

```r
Y_knn10 <- knn(X_train, X_test, Y_train, k=10)
head(Y_test, 28) # True Ys
##  [1] 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
head(Y_knn10, 28) # Predicted Ys
##  [1] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
mean(Y_test == Y_knn10) # accuracy
## [1] 0.8008766
```

### 3.2.3 Different Metrics (*)

The Euclidean distance is just one particular example of many possible metrics.

Mathematically, we say that $$d$$ is a metric on a set $$X$$ (e.g., $$\mathbb{R}^p$$) if it is a function $$d:X\times X\to [0,\infty)$$ such that for all $$x,x',x''\in X$$:

• $$d(x, x') = 0$$ if and only if $$x=x'$$,
• $$d(x, x') = d(x', x)$$ (it is symmetric),
• $$d(x, x'') \le d(x, x') + d(x', x'')$$ (it fulfils the triangle inequality).

(*) Not all the properties are required in all the applications; sometimes we might need a few additional ones.

We can easily generalise the way we introduced the K-NN method to have a classifier that is based on a point’s neighbourhood with respect to any metric.
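For instance, a neighbour search parameterised by an arbitrary distance function could be sketched as below; `nearest_indices` and its `dist_fun` argument are our own hypothetical names, not part of any package.

```r
# indices of the K nearest neighbours of x_new with respect to
# an arbitrary, user-supplied distance function dist_fun
nearest_indices <- function(X_train, x_new, K, dist_fun) {
    d <- apply(X_train, 1, function(r) dist_fun(r, x_new))
    order(d)[1:K]
}
# e.g., with the Euclidean distance:
# nearest_indices(X_train, x_new, 5, function(a, b) sqrt(sum((a - b)^2)))
```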

Example metrics on $$\mathbb{R}^p$$:

• Euclidean $d_2(\mathbf{x}, \mathbf{x}') = \| \mathbf{x}-\mathbf{x}' \| = \| \mathbf{x}-\mathbf{x}' \|_2 = \sqrt{ \sum_{i=1}^p (x_i-x_i')^2 }$
• Manhattan (taxicab) $d_1(\mathbf{x}, \mathbf{x}') = \| \mathbf{x}-\mathbf{x}' \|_1 = { \sum_{i=1}^p |x_i-x_i'| }$
• Chebyshev (maximum) $d_\infty(\mathbf{x}, \mathbf{x}') = \| \mathbf{x}-\mathbf{x}' \|_\infty = \max_{i=1,\dots,p} |x_i-x_i'|$
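All three are one-liners in base R; here is a quick check on a small toy example:

```r
x <- c(1, 5, -2)
y <- c(0, 3, 1)
sqrt(sum((x - y)^2))  # Euclidean: sqrt(1 + 4 + 9) = sqrt(14)
sum(abs(x - y))       # Manhattan: 1 + 2 + 3 = 6
max(abs(x - y))       # Chebyshev: 3
```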

We can define metrics on different spaces too.

For example, the Levenshtein distance is a popular choice for comparing character strings (as well as DNA sequences, etc.).

It is an edit distance: it measures the minimal number of single-character insertions, deletions, or substitutions needed to change one string into another.

For instance:

```r
adist("happy", "nap")
##      [,1]
## [1,]    3
```

This is because we need 1 substitution and 2 deletions,

happy → nappy → napp → nap.

Other noteworthy metrics include:

• the Hamming distance for categorical vectors (or strings of equal lengths),
• the Jaccard distance for sets,
• the Kendall tau rank distance for rankings.
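As a quick illustration, the Hamming distance is trivial to express in base R; `hamming` below is our own helper, not a library function.

```r
# number of positions at which two equal-length vectors differ
hamming <- function(a, b) sum(a != b)
hamming(c("a", "b", "c", "d"), c("a", "x", "c", "y"))  # 2 positions differ
```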

Moreover, R package stringdist includes implementations of numerous string metrics.

### 3.2.4 Standardisation of Independent Variables

Note that the Euclidean distance that we used above implicitly assumes that every feature (independent variable) is on the same scale.

However, when dealing with, e.g., physical quantities, we often perform conversions of units of measurement (kg → g, feet → m etc.).

Transforming a single feature may drastically change the metric structure of the dataset and therefore highly affect the obtained predictions.

To “bring data to the same scale”, we often apply a trick called standardisation.

Computing the so-called Z-scores of the $$j$$-th feature, $$\mathbf{x}_{\cdot,j}$$, is done by subtracting from each observation the sample mean and dividing the result by the sample standard deviation:

$z_{i,j} = \frac{x_{i,j}-\bar{x}_{\cdot,j}}{s_{x_{\cdot,j}}}$

This yields a new feature $$\mathbf{z}_{\cdot,j}$$ that always has a mean of 0 and a standard deviation of 1.

Moreover, it is unit-less (e.g., we divide a value in kg by a value in kg, so the units cancel out).

Z-scores are easy to interpret, e.g., 0.5 denotes an observation that is 0.5 standard deviations above the mean.
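As a sanity check, we can verify these properties on a small sample (the data below are arbitrary toy values):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
z <- (x - mean(x)) / sd(x)  # Z-scores
mean(z)  # 0 (up to numerical error)
sd(z)    # 1
```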

Let us compute Z_train and Z_test, the standardised versions of X_train and X_test, respectively:

```r
means <- apply(X, 2, mean) # column means
sds   <- apply(X, 2, sd)   # column standard deviations
Z_train <- X_train # copies; overwritten column by column below
Z_test  <- X_test
for (j in 1:ncol(X)) {
    Z_train[,j] <- (Z_train[,j]-means[j])/sds[j]
    Z_test[,j]  <- (Z_test[,j] -means[j])/sds[j]
}
```

Alternatively:

```r
Z_train <- t(apply(X_train, 1, function(r) (r-means)/sds))
Z_test  <- t(apply(X_test,  1, function(r) (r-means)/sds))
```
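Base R also provides `scale()`, which accepts precomputed centres and scales, so both samples can be standardised with the same statistics. The sketch below illustrates this on a small synthetic matrix (`X_demo` is toy data, not the wines dataset):

```r
X_demo <- matrix(c(1, 10, 2, 20, 3, 30), ncol = 2, byrow = TRUE)
m <- apply(X_demo, 2, mean)  # column means
s <- apply(X_demo, 2, sd)    # column standard deviations
Z_demo <- scale(X_demo, center = m, scale = s)
Z_demo[, 1]  # -1 0 1
```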

Let us compute the accuracy of K-NN classifiers acting on standardised data.

```r
Y_knn5s <- knn(Z_train, Z_test, Y_train, k=5)
mean(Y_test == Y_knn5s) # accuracy
## [1] 0.8215404
Y_knn10s <- knn(Z_train, Z_test, Y_train, k=10)
mean(Y_test == Y_knn10s) # accuracy
## [1] 0.8334377
```