## 3.3 Implementing a K-NN Classifier (*)

### 3.3.1 Main Routine (*)

Let us implement a K-NN classifier ourselves, using a top-down approach: we will start with the main routine and then arrange the processing of data into conveniently manageable chunks.

The function’s declaration will look like:

```r
our_knn <- function(X_train, X_test, Y_train, k=1) {
    # k=1 denotes a parameter with a default value
    # ...
}
```

First, we should specify the type and form of the arguments we’re expecting:

```r
# this is the body of our_knn() - part 1
stopifnot(is.numeric(X_train), is.matrix(X_train))
stopifnot(is.numeric(X_test), is.matrix(X_test))
stopifnot(is.factor(Y_train))
stopifnot(ncol(X_train) == ncol(X_test))
stopifnot(nrow(X_train) == length(Y_train))
stopifnot(k >= 1)
n_train <- nrow(X_train)
n_test  <- nrow(X_test)
p <- ncol(X_train)
M <- length(levels(Y_train))
```

Therefore,

$$\mathtt{X\_train}\in\mathbb{R}^{\mathtt{n\_train}\times \mathtt{p}},\quad \mathtt{X\_test}\in\mathbb{R}^{\mathtt{n\_test}\times \mathtt{p}},\quad \mathtt{Y\_train}\in\{1,\dots,M\}^{\mathtt{n\_train}}.$$

Recall that R factor objects are internally encoded as integer vectors.
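For instance (the species names below serve purely as an illustration):

```r
f <- factor(c("versicolor", "setosa", "setosa", "virginica"))
levels(f)      # "setosa" "versicolor" "virginica" (sorted alphabetically)
as.integer(f)  # 2 1 1 3 - the codes are positions in levels(f)
```

The integer codes always refer to positions in `levels(f)`; this is exactly the encoding we rely on below.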

Next, we will call the (to-be-done) function our_get_knnx(), which seeks the nearest neighbours of all the points:

```r
# our_get_knnx returns a matrix nn_indices of size n_test*k,
# where nn_indices[i,j] denotes the index of
# X_test[i,]'s j-th nearest neighbour in X_train.
# (It is the point X_train[nn_indices[i,j],]).
nn_indices <- our_get_knnx(X_train, X_test, k)
```

Then, for each point in X_test, we fetch the labels corresponding to its nearest neighbours and compute their mode:

```r
Y_pred <- numeric(n_test) # vector of length n_test
# For now we will operate on the integer labels in {1,...,M}
Y_train_int <- as.integer(Y_train)
for (i in 1:n_test) {
    # Get the labels of the NNs of the i-th point:
    nn_labels_i <- Y_train_int[nn_indices[i, ]]
    # Compute the mode (majority vote):
    Y_pred[i] <- our_mode(nn_labels_i) # in {1,...,M}
}
```

Finally, we should convert the resulting integer vector to an object of type factor:

```r
# Convert Y_pred to factor; the levels are given explicitly,
# as some classes might be absent from Y_pred:
return(factor(Y_pred, levels=1:M, labels=levels(Y_train)))
```

### 3.3.2 Mode

To implement the mode, we can use the tabulate() function.

Read the function’s man page, see ?tabulate.

For example:

```r
tabulate(c(1, 2, 1, 1, 1, 5, 2))
## [1] 4 2 0 0 1
```

There might be multiple modes – in such a case, we should pick one at random.

For that, we can use the sample() function.

Read the function’s man page, see ?sample. Note that its behaviour is different when its first argument is a vector of length 1.
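To illustrate the pitfall: `sample(x, 1)` with a length-one numeric `x` draws from `1:x` rather than returning `x` itself, which matters if there happens to be a single candidate to choose from:

```r
sample(c(2, 7), 1)  # draws either 2 or 7
sample(7, 1)        # caution: draws one element from 1:7, not necessarily 7
```

This is why the implementation below treats the single-candidate case separately.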

An example implementation:

```r
our_mode <- function(Y) {
    # tabulate() will take care of
    # checking the correctness of Y
    t <- tabulate(Y)
    mode_candidates <- which(t == max(t))
    if (length(mode_candidates) == 1) return(mode_candidates)
    else return(sample(mode_candidates, 1))
}
```

```r
our_mode(c(1, 1, 1, 1))
## [1] 1
our_mode(c(2, 2, 2, 2))
## [1] 2
our_mode(c(3, 1, 3, 3))
## [1] 3
our_mode(c(1, 1, 3, 3, 2))
## [1] 3
our_mode(c(1, 1, 3, 3, 2))
## [1] 1
```
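The last two calls return different answers because ties are broken at random. If reproducibility is needed (e.g., when testing the classifier), the random number generator's seed can be fixed beforehand; a quick sketch, restating our_mode() so the chunk is self-contained:

```r
# our_mode() as defined above:
our_mode <- function(Y) {
    t <- tabulate(Y)
    mode_candidates <- which(t == max(t))
    if (length(mode_candidates) == 1) mode_candidates
    else sample(mode_candidates, 1)
}

set.seed(42)                # fix the RNG state
our_mode(c(1, 1, 3, 3, 2))  # now deterministic for a given seed
```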

### 3.3.3 NN Search Routines (*)

Last but not least, we should implement the our_get_knnx() function.

It is the function responsible for seeking the indices of nearest neighbours.

It turns out that this function constitutes the K-NN classifier’s performance bottleneck in the case of large data samples.

```r
# our_get_knnx returns a matrix nn_indices of size n_test*k,
# where nn_indices[i,j] denotes the index of
# X_test[i,]'s j-th nearest neighbour in X_train.
# (It is the point X_train[nn_indices[i,j],]).
our_get_knnx <- function(X_train, X_test, k) {
    # ...
}
```

A naive approach to our_get_knnx() relies on computing all pairwise distances and sorting them.

```r
our_get_knnx <- function(X_train, X_test, k) {
    n_test <- nrow(X_test)
    nn_indices <- matrix(NA_integer_, nrow=n_test, ncol=k)
    for (i in 1:n_test) {
        d <- apply(X_train, 1, function(x)
            sqrt(sum((x - X_test[i, ])^2)))
        # now d[j] is the distance
        # between X_train[j,] and X_test[i,]
        nn_indices[i, ] <- order(d)[1:k]
    }
    nn_indices
}
```
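All the pieces are now in place. As a smoke test, here is an end-to-end run on a tiny synthetic data set; the three functions are restated in compact form (without the argument checks) so that this chunk can be run on its own:

```r
# Compact restatement of the functions defined above:
our_mode <- function(Y) {
    t <- tabulate(Y)
    mode_candidates <- which(t == max(t))
    if (length(mode_candidates) == 1) mode_candidates
    else sample(mode_candidates, 1)
}
our_get_knnx <- function(X_train, X_test, k) {
    n_test <- nrow(X_test)
    nn_indices <- matrix(NA_integer_, nrow=n_test, ncol=k)
    for (i in 1:n_test)
        nn_indices[i, ] <- order(apply(X_train, 1,
            function(x) sqrt(sum((x - X_test[i, ])^2))))[1:k]
    nn_indices
}
our_knn <- function(X_train, X_test, Y_train, k=1) {
    nn_indices <- our_get_knnx(X_train, X_test, k)
    Y_train_int <- as.integer(Y_train)
    Y_pred <- sapply(1:nrow(X_test),
        function(i) our_mode(Y_train_int[nn_indices[i, ]]))
    factor(Y_pred, levels=seq_along(levels(Y_train)),
        labels=levels(Y_train))
}

# Toy example: two well-separated 2-dimensional classes
set.seed(123)
X_train <- rbind(matrix(rnorm(20, mean=0), ncol=2),
                 matrix(rnorm(20, mean=5), ncol=2))
Y_train <- factor(rep(c("a", "b"), each=10))
X_test  <- rbind(c(0, 0), c(5, 5))  # one point near each class centre
our_knn(X_train, X_test, Y_train, k=3)
## [1] a b
## Levels: a b
```

Each test point is assigned the label of the cluster it sits in, as expected.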

A comparison with FNN::knn():

```r
system.time(Ya <- knn(X_train, X_test, Y_train, k=5))
##    user  system elapsed
##   0.251   0.000   0.251
system.time(Yb <- our_knn(X_train, X_test, Y_train, k=5))
##    user  system elapsed
##  44.054   0.038  45.748
mean(Ya == Yb) # 1.0 on perfect match
## [1] 1
```

Both functions return identical results but our implementation is “slightly” slower.

FNN::knn() is efficiently written in C++, which is a compiled programming language.

R, on the other hand (just like Python and Matlab), is interpreted; therefore, as a rule of thumb, we should consider it an order of magnitude slower (see, however, the Julia language).
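(*) Even before reaching for a compiled language, much of the interpreter overhead in our naive our_get_knnx() can be removed within R itself by vectorising the distance computations, e.g., via the identity $$\|u-v\|^2=\|u\|^2+\|v\|^2-2 u^T v$$ (squared distances suffice for ranking the neighbours). A sketch; the function name is ours:

```r
our_get_knnx_vec <- function(X_train, X_test, k) {
    # all pairwise squared distances at once (a n_test x n_train matrix):
    D2 <- outer(rowSums(X_test^2), rowSums(X_train^2), "+") -
        2 * tcrossprod(X_test, X_train)
    # for each row, the indices of the k smallest entries:
    matrix(apply(D2, 1, function(d) order(d)[1:k]), ncol=k, byrow=TRUE)
}
```

This trades memory (the full distance matrix) for speed; note that the identity may introduce tiny floating-point discrepancies, which could reorder exact ties.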

Let us substitute our naive implementation with an equivalent one written in C++ (available in the FNN package).

(*) Note that we could write a C++ implementation ourselves, see the Rcpp package for seamless R and C++ integration.

```r
our_get_knnx <- function(X_train, X_test, k) {
    # this is used by our_knn()
    FNN::get.knnx(X_train, X_test, k, algorithm="brute")$nn.index
}
```

```r
system.time(Ya <- knn(X_train, X_test, Y_train, k=5))
##    user  system elapsed
##   0.309   0.000   0.308
system.time(Yb <- our_knn(X_train, X_test, Y_train, k=5))
##    user  system elapsed
##   0.141   0.000   0.142
mean(Ya == Yb) # 1.0 on perfect match
## [1] 1
```

Note that our solution requires $$c\cdot n_\text{test}\cdot n_\text{train}\cdot p$$ arithmetic operations for some $$c>1$$. The overall cost of sorting is at least $$d\cdot n_\text{test}\cdot n_\text{train}\cdot\log n_\text{train}$$ for some $$d>1$$. This does not scale well with both $$n_\text{test}$$ and $$n_\text{train}$$ (think – big data).

It turns out that there are special spatial data structures – such as metric trees – that aim to speed up searching for nearest neighbours in low-dimensional spaces (for small $$p$$).

(*) Searching in high-dimensional spaces is hard due to the so-called curse of dimensionality.

For example, FNN::get.knnx() also implements the so-called kd-trees.

```r
library("microbenchmark")
test_speed <- function(n, p, k) {
    A <- matrix(runif(n*p), nrow=n, ncol=p)
    s <- summary(microbenchmark(
        brute=FNN::get.knnx(A, A, k, algorithm="brute"),
        kd_tree=FNN::get.knnx(A, A, k, algorithm="kd_tree"),
        times=3
    ), unit="s")
    # minima of 3 time measurements:
    structure(s$min, names=as.character(s$expr))
}
```

```r
test_speed(10000, 2, 5)
##      brute    kd_tree
## 0.82655346 0.02878045
test_speed(10000, 5, 5)
##     brute   kd_tree
## 1.2348245 0.1320792
test_speed(10000, 10, 5)
##    brute  kd_tree
## 4.589286 1.719830
test_speed(10000, 20, 5)
##    brute  kd_tree
## 10.20877 38.18202
```