3.3 Implementing a K-NN Classifier (*)

3.3.1 Main Routine (*)


Let us implement a K-NN classifier ourselves by using a top-down approach.

We will start with a general description of the admissible inputs and the expected output.

Then we will arrange the processing of data into conveniently manageable chunks.

The function’s declaration will look like:
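For instance (a sketch; the function’s name, our_knn(), and the default number of neighbours are merely our assumptions):

```r
our_knn <- function(X_train, Y_train, X_test, k = 1)
{
    # ... (the body is developed step by step below)
}
```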


First, we should specify the type and form of the arguments we’re expecting: X_train and X_test should be numeric (real-valued) matrices with the same number of columns, p, and Y_train should be a factor of length equal to the number of rows in X_train, giving each training point’s class label (one of M classes).

Therefore,

\(\mathtt{X\_train}\in\mathbb{R}^{\mathtt{n\_train}\times \mathtt{p}}\), \(\mathtt{X\_test}\in\mathbb{R}^{\mathtt{n\_test}\times \mathtt{p}}\) and \(\mathtt{Y\_train}\in\{1,\dots,M\}^{\mathtt{n\_train}}\)

Recall that R factor objects are internally encoded as integer vectors.
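For example, these requirements can be verified at the beginning of the function’s body (a sketch; the exact set of checks is a matter of taste):

```r
stopifnot(is.matrix(X_train), is.numeric(X_train))
stopifnot(is.matrix(X_test), is.numeric(X_test))
stopifnot(ncol(X_train) == ncol(X_test))
stopifnot(is.factor(Y_train), length(Y_train) == nrow(X_train))
stopifnot(k >= 1, k <= nrow(X_train))
```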


Next, we will call the (yet-to-be-implemented) function our_get_knnx(), which seeks the nearest neighbours of all the points:
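For example (a sketch; we assume that our_get_knnx() returns an n_test-by-k matrix whose (i, j)-th entry gives the index, i.e., the row number in X_train, of the j-th nearest neighbour of the i-th test point):

```r
nn_indices <- our_get_knnx(X_train, X_test, k)
```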


Then, for each point in X_test, we fetch the labels corresponding to its nearest neighbours and compute their mode:
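A sketch, assuming a helper function our_mode() (developed in the next subsection) that returns the most frequent value in a vector of small natural numbers:

```r
Y_train_int <- as.integer(Y_train)  # the factor's underlying integer codes
Y_pred <- apply(nn_indices, 1, function(idx) our_mode(Y_train_int[idx]))
```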

Finally, we should convert the resulting integer vector to an object of type factor:
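For example (relying on the fact that the integer codes 1, ..., M correspond to consecutive levels of Y_train):

```r
factor(Y_pred, levels = seq_along(levels(Y_train)), labels = levels(Y_train))
```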

3.3.2 Mode


To implement the mode, we can use the tabulate() function.

Read the function’s man page, see ?tabulate.

For example:
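A minimal illustration (the input vector below is merely an example chosen to match the printed counts):

```r
tabulate(c(1, 2, 1, 5, 1, 2, 1))  # four 1s, two 2s, no 3s nor 4s, one 5
```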

## [1] 4 2 0 0 1

There might be multiple modes – in such a case, we should pick one at random.

For that, we can use the sample() function.

Read the function’s man page, see ?sample. Note that its behaviour is different when its first argument is a vector of length 1.
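For instance (an illustration of this quirk):

```r
sample(c(4, 7), 1)  # picks either 4 or 7
sample(7, 1)        # picks a number from 1:7, not necessarily 7!
```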


An example implementation:
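A sketch (the name our_mode() is our assumption):

```r
our_mode <- function(Y)
{
    counts <- tabulate(Y)                  # counts of the values 1, 2, ..., max(Y)
    modes <- which(counts == max(counts))  # there might be more than one mode
    if (length(modes) > 1)
        modes <- sample(modes, 1)          # break ties at random
    modes
}
```

For instance, calling it repeatedly on a vector with several modes, such as c(1, 2, 3, 1, 2, 3), may give a different answer each time: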


## [1] 1
## [1] 2
## [1] 3
## [1] 3
## [1] 1

3.3.3 NN Search Routines (*)


Last but not least, we should implement the our_get_knnx() function.

It is the function responsible for seeking the indices of nearest neighbours.

It turns out that this function constitutes the K-NN classifier’s main performance bottleneck in the case of big data samples.


A naive approach to our_get_knnx() relies on computing all pairwise distances and sorting them.
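A sketch of such a brute-force routine, assuming the squared Euclidean distance (which yields the same ordering of neighbours as the Euclidean one):

```r
our_get_knnx <- function(X_train, X_test, k)
{
    nn_indices <- matrix(NA_integer_, nrow = nrow(X_test), ncol = k)
    for (i in seq_len(nrow(X_test))) {
        # squared Euclidean distances between the i-th test point
        # and all the points in X_train
        d <- rowSums(t(t(X_train) - X_test[i, ])^2)
        nn_indices[i, ] <- order(d)[1:k]  # indices of the k smallest distances
    }
    nn_indices
}
```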


A comparison with FNN::knn():

##    user  system elapsed 
##   0.251   0.000   0.251
##    user  system elapsed 
##  44.054   0.038  45.748
## [1] 1

Both functions return identical results, but our implementation is “slightly” slower.


FNN::knn() is efficiently implemented in C++, which is a compiled programming language.

R, on the other hand (just like Python and Matlab), is interpreted; therefore, as a rule of thumb, we should consider it an order of magnitude slower (see, however, the Julia language).

Let us replace our naive implementation with an equivalent one written in C++ (available in the FNN package).
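For example (a sketch based on FNN::get.knnx(), which returns, amongst other things, the nn.index matrix of neighbour indices):

```r
our_get_knnx <- function(X_train, X_test, k)
    FNN::get.knnx(X_train, X_test, k)$nn.index
```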

(*) Note that we could write a C++ implementation ourselves, see the Rcpp package for seamless R and C++ integration.


##    user  system elapsed 
##   0.309   0.000   0.308
##    user  system elapsed 
##   0.141   0.000   0.142
## [1] 1

Note that our solution requires \(c\cdot n_\text{test}\cdot n_\text{train}\cdot p\) arithmetic operations for some \(c>1\). The overall cost of sorting is at least \(d\cdot n_\text{test}\cdot n_\text{train}\cdot\log n_\text{train}\) for some \(d>1\).

This scales badly as \(n_\text{test}\) and \(n_\text{train}\) grow (think: big data).

It turns out that there are special spatial data structures – such as metric trees – that aim to speed up searching for nearest neighbours in low-dimensional spaces (for small \(p\)).

(*) Searching in high-dimensional spaces is hard due to the so-called curse of dimensionality.

For example, FNN::get.knnx() also implements the so-called kd-trees.
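For instance, the elapsed times of the two search methods can be compared along the following lines (a sketch):

```r
sapply(c("brute", "kd_tree"), function(algo)
    system.time(FNN::get.knnx(X_train, X_test, k, algorithm = algo))[["elapsed"]]
)
```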



##      brute    kd_tree 
## 0.82655346 0.02878045
##     brute   kd_tree 
## 1.2348245 0.1320792
##    brute  kd_tree 
## 4.589286 1.719830
##    brute  kd_tree 
## 10.20877 38.18202