3.1 Introduction

3.1.1 Classification Task


Let \(\mathbf{X}\in\mathbb{R}^{n\times p}\) be an input matrix that consists of \(n\) points in a \(p\)-dimensional space.

In other words, we have a database on \(n\) objects, each of which is described by means of \(p\) numerical features.

\[ \mathbf{X}= \left[ \begin{array}{cccc} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \\ \end{array} \right] \]


Recall that in supervised learning, apart from \(\mathbf{X}\), we are also given the corresponding \(\mathbf{y}\).

With each input point \(\mathbf{x}_{i,\cdot}\) we associate the desired output \(y_i\).

In this chapter we are interested in classification tasks; we assume that each \(y_i\) is a label (e.g., a character string).


Most commonly, we are faced with binary classification tasks where there are only two possible distinct labels.

We traditionally denote them with \(0\)s and \(1\)s.

For example:

0        1
-------  -------
no       yes
false    true
failure  success
healthy  ill

On the other hand, in multiclass classification, we assume that the \(y_i\)s can take more than two distinct values.



3.1.2 Factor Data Type


On a side note (see the Appendix for more details), the factor type in R is a very convenient means of storing categorical data and, hence, of encoding the labels in \(\mathbf{y}\).
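
For illustration only, assume \(\mathbf{y}\) consists of the five labels printed below; a factor can then be created and summarised as follows (a sketch reconstructing the code from its output):

f <- factor(c("yes", "no", "no", "yes", "no"))  # hypothetical label vector
print(f)   # displays the values and the list of levels
table(f)   # counts how many times each level occurs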

## [1] yes no  no  yes no 
## Levels: no yes
## f
##  no yes 
##   3   2

Internally, objects of type factor are represented as integer vectors with elements in \(\{1,\dots,M\}\), where \(M\) is the number of possible levels.

Labels, used to “decipher” the numeric codes, are stored separately.
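
Continuing the hypothetical example above, the integer codes and the labels of f can be inspected, and the labels replaced, as follows (again, a sketch matching the output shown below):

as.integer(f)                         # the underlying integer codes
levels(f)                             # the labels of the consecutive codes
levels(f) <- c("failure", "success")  # relabel the levels
print(f)                              # same codes, new labels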

## [1] 2 1 1 2 1
## [1] "no"  "yes"
## [1] success failure failure success failure
## Levels: failure success

3.1.3 Data


For illustration, let’s consider the wines dataset.
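
A minimal sketch of how such a data frame could be obtained (the file name below is hypothetical; any routine that yields a data frame named wines will do):

wines <- read.csv("winequality.csv")  # hypothetical file name
nrow(wines)                           # number of observations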

## [1] 5320

The input matrix \(\mathbf{X}\in\mathbb{R}^{n\times p}\) consists of all the numeric variables:
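
For instance, assuming the 11 numeric features occupy the first 11 columns of wines (a sketch, not necessarily the original code):

X <- as.matrix(wines[, 1:11])  # keep only the numeric feature columns
dim(X)                         # n and p
head(X, 2)                     # the first two observations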

## [1] 5320   11
##      fixed.acidity volatile.acidity citric.acid residual.sugar
## [1,]           7.4             0.70           0            1.9
## [2,]           7.8             0.88           0            2.6
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density
## [1,]     0.076                  11                   34  0.9978
## [2,]     0.098                  25                   67  0.9968
##        pH sulphates alcohol
## [1,] 3.51      0.56     9.4
## [2,] 3.20      0.68     9.8

The response variable is an ordinal one, giving each wine’s rating as assigned by a sommelier.

Here: 0 == a very bad wine, 10 == a very good one.

We will convert this dependent variable to a binary one:

  • 0 == response < 5 == bad
  • 1 == response >= 5 == good
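
A sketch of this conversion, assuming the wine ratings are stored in a column named response (the actual column name may differ):

Y <- factor(as.numeric(wines$response >= 5))  # 0 == bad, 1 == good
table(Y)                                      # counts of both classes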
## Y
##    0    1 
## 4311 1009

Now \((\mathbf{X},\mathbf{y})\) is a basis for an interesting binary classification task.

3.1.4 Training and Test Sets


Recall that we are genuinely interested in constructing supervised learning models for the following two purposes:

  • description – to explain a given dataset in simpler terms,
  • prediction – to forecast the values of the dependent variable for inputs that are yet to be observed.

In the latter case:

  • we want our models to generalise well to new data,
  • we don’t want our models to overfit to current data.

One way to assess if a model has sufficient predictive power is based on a random train-test split of the original dataset:

  • training sample (usually 60-80% of the observations) – used to construct a model,
  • test sample (the remaining 20-40%) – used to assess the model’s predictive performance.

The test sample must not be used in the training phase! (No cheating!)


70/30% train-test split in R:
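
A sketch of how this can be done (the random seed and the variable names are assumptions; the indices printed below come from the original, unrecorded random draw):

set.seed(123)                           # hypothetical seed, for reproducibility
n <- nrow(X)
train_idx <- sample(n, floor(0.7 * n))  # indices of the training observations
X_train <- X[ train_idx, ]; Y_train <- Y[ train_idx]   # training sample
X_test  <- X[-train_idx, ]; Y_test  <- Y[-train_idx]   # test sample
head(train_idx)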

## [1] 2463 2511 2227  526 4291 2986

3.1.5 Discussed Methods


We will discuss three simple and educational (yet practically useful) classification algorithms:

  • K-nearest neighbour scheme – this chapter,
  • Decision trees – the next chapter,
  • Logistic regression – the next chapter.