4.1 Introduction

4.1.1 Classification Task


Let \(\mathbf{X}\in\mathbb{R}^{n\times p}\) be an input matrix that consists of \(n\) points in a \(p\)-dimensional space.

In other words, we have a database of \(n\) objects, each of which is described by means of \(p\) numerical features.

\[ \mathbf{X}= \left[ \begin{array}{cccc} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \\ \end{array} \right] \]


Recall that in supervised learning, apart from \(\mathbf{X}\), we are also given the corresponding \(\mathbf{y}\).

With each input point \(\mathbf{x}_{i,\cdot}\) we associate the desired output \(y_i\).

In this chapter we are still interested in classification tasks; we assume that each \(y_i\) is a descriptive label.


In this part we assume that we are faced with binary classification tasks.

Hence, there are only two possible labels that we traditionally denote with \(0\)s and \(1\)s.

For example:

  0         1
  --------  --------
  no        yes
  false     true
  failure   success
  healthy   ill


4.1.2 Data


For illustration, let’s consider the wines dataset again.
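
The number of observations reported below was presumably obtained along these lines (a sketch; it assumes the data frame is available under the name wines, as in the earlier chapters):

nrow(wines)  # the number of rows, i.e., observations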

## [1] 5320

The input matrix \(\mathbf{X}\in\mathbb{R}^{n\times p}\) consists of all the numeric variables:
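
For instance (a sketch; the assumption that the 11 numeric features occupy the first 11 columns, as well as the object names, are ours):

X <- as.matrix(wines[, 1:11])  # assuming the numeric features are the first 11 columns
dim(X)       # number of observations and number of features
head(X, 2)   # preview of the first two observations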

## [1] 5320   11
##      fixed.acidity volatile.acidity citric.acid residual.sugar
## [1,]           7.4             0.70           0            1.9
## [2,]           7.8             0.88           0            2.6
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density
## [1,]     0.076                  11                   34  0.9978
## [2,]     0.098                  25                   67  0.9968
##        pH sulphates alcohol
## [1,] 3.51      0.56     9.4
## [2,] 3.20      0.68     9.8

The response variable is an ordinal one, giving each wine’s rating as assigned by a sommelier.

Here, 0 denotes a very bad wine and 10 a very good one.

We will convert this dependent variable to a binary one (a code sketch follows the list below):

  • 0 (bad) if response < 5,
  • 1 (good) if response >= 5.
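
A sketch of such a conversion (the column name response and the use of factor() and table() are our assumptions, based on the notation above):

Y <- factor(as.numeric(wines$response >= 5))  # 1 == good, 0 == bad
table(Y)  # class counts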
## Y
##    0    1 
## 4311 1009

Now \((\mathbf{X},\mathbf{y})\) is a basis for an interesting binary classification task.


Let's now perform a 70/30% train-test split:
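
One way to obtain such a split (a sketch; the seed value is an assumption, and the particular indices printed below depend on the seed actually used):

set.seed(123)  # assumed seed, for reproducibility only
n <- nrow(X)
idx_train <- sample(n, size=round(0.7*n))  # 70% of the observations form the training sample
X_train <- X[idx_train, ];  Y_train <- Y[idx_train]
X_test  <- X[-idx_train, ]; Y_test  <- Y[-idx_train]
head(idx_train)  # preview of a few selected indices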

## [1] 2463 2511 2227  526 4291 2986

Let’s also compute Z_train and Z_test, the standardised versions of X_train and X_test, respectively.
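
For example (a sketch; note that the test sample is standardised using the means and standard deviations computed on the training sample, so that no information from the test set leaks into the preprocessing):

Z_train <- scale(X_train)  # centre and scale each column (mean 0, standard deviation 1)
means <- attr(Z_train, "scaled:center")  # per-column means of the training sample
sds   <- attr(Z_train, "scaled:scale")   # per-column standard deviations
Z_test <- scale(X_test, center=means, scale=sds)  # reuse the training parameters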

4.1.3 Discussed Methods


We are soon going to discuss the following simple and educational (yet practically useful) classification algorithms:

  • decision trees,
  • logistic regression.

Before that happens, let’s go back to the K-NN algorithm.
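
The four accuracies reported below can be computed along the following lines (a sketch; the particular values of K and the use of class::knn() are our assumptions):

library("class")  # provides the knn() function
Ks <- c(1, 3, 5, 9)  # hypothetical values of K (the ones actually used may differ)
sapply(Ks, function(K) {
    Y_pred <- knn(Z_train, Z_test, Y_train, k=K)  # classify the test sample
    mean(Y_pred == Y_test)  # accuracy on the test sample
})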

## [1] 0.7896055 0.8008766 0.8215404 0.8334377

First, however, we should answer the question regarding the optimal choice of \(K\).

How should we do that?