## 4.1 Introduction

Let $$\mathbf{X}\in\mathbb{R}^{n\times p}$$ be an input matrix that consists of $$n$$ points in a $$p$$-dimensional space.

In other words, we have a database of $$n$$ objects, each of which is described by $$p$$ numerical features.

$$\mathbf{X}= \left[ \begin{array}{cccc} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \\ \end{array} \right]$$

Recall that in supervised learning, apart from $$\mathbf{X}$$, we are also given the corresponding $$\mathbf{y}$$.

With each input point $$\mathbf{x}_{i,\cdot}$$ we associate the desired output $$y_i$$.

In this chapter we are still interested in classification tasks; we assume that each $$y_i$$ is a descriptive label.

In this part we assume that we are faced with binary classification tasks.

Hence, there are only two possible labels that we traditionally denote with $$0$$s and $$1$$s.

For example:

| 0       | 1       |
|---------|---------|
| no      | yes     |
| false   | true    |
| failure | success |
| healthy | ill     |

### 4.1.2 Data

For illustration, let’s consider the wines dataset again.

wines <- read.csv("datasets/winequality-all.csv", comment="#")
(n <- nrow(wines)) # number of samples
##  5320

The input matrix $$\mathbf{X}\in\mathbb{R}^{n\times p}$$ consists of all the numeric variables:

X <- as.matrix(wines[,1:11])
dim(X)
##  5320   11
head(X, 2) # first two rows
##      fixed.acidity volatile.acidity citric.acid residual.sugar
## [1,]           7.4             0.70           0            1.9
## [2,]           7.8             0.88           0            2.6
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density
## [1,]     0.076                  11                   34  0.9978
## [2,]     0.098                  25                   67  0.9968
##        pH sulphates alcohol
## [1,] 3.51      0.56     9.4
## [2,] 3.20      0.68     9.8

The response variable is an ordinal one, giving each wine’s rating as assigned by a sommelier.

Here, 0 denotes a very bad wine and 10 a very good one.

We will convert this dependent variable to a binary one:

• 0 == response < 5 == bad
• 1 == response >= 5 == good

# recall that TRUE == 1
Y <- as.numeric(wines$response >= 5)
table(Y)
## Y
##    0    1
## 4311 1009
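The classes are clearly imbalanced, so before fitting any model it is worth noting the accuracy of a trivial majority-class classifier as a baseline (a quick sketch based on the counts above):

```r
# proportions of the two classes:
round(prop.table(table(Y)), 4)
# accuracy of a classifier that always predicts the majority class:
max(table(Y))/length(Y)  # ca. 0.81 here
```

Any classifier we construct should be expected to beat this baseline.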

Now $$(\mathbf{X},\mathbf{y})$$ is a basis for an interesting binary classification task.

Let’s now perform a 70/30% train-test split:

set.seed(123) # reproducibility matters
random_indices <- sample(n)
head(random_indices) # preview
##  2463 2511 2227  526 4291 2986
# first 70% of the indices (they are arranged randomly)
# will constitute the train sample:
train_indices <- random_indices[1:floor(n*0.7)]
X_train <- X[train_indices,]
Y_train <- Y[train_indices]
# the remaining indices (30%) go to the test sample:
X_test  <- X[-train_indices,]
Y_test  <- Y[-train_indices]
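It never hurts to verify that the two samples are of the expected sizes and together cover the whole dataset (a quick sanity check):

```r
length(train_indices)  # floor(0.7*n) == 3724
nrow(X_test)           # the remaining n - 3724 == 1596 observations
stopifnot(length(train_indices) + nrow(X_test) == n)
```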

Let’s also compute Z_train and Z_test, the standardised versions of X_train and X_test, respectively.

means <- apply(X, 2, mean) # column means
sds   <- apply(X, 2, sd)   # column standard deviations
# (strictly speaking, to avoid any information leakage from the test
# sample, the above statistics should be estimated on X_train only)
Z_train <- scale(X_train, center=means, scale=sds)
Z_test  <- scale(X_test,  center=means, scale=sds)
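We can double-check that the standardisation behaves as expected, e.g., on the training sample (the column means and standard deviations below should be roughly 0 and 1, respectively, not exactly so, because the statistics were estimated on the whole of $$\mathbf{X}$$):

```r
# column means of Z_train -- roughly 0:
round(apply(Z_train, 2, mean), 2)
# column standard deviations of Z_train -- roughly 1:
round(apply(Z_train, 2, sd), 2)
```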

### 4.1.3 Discussed Methods

We are soon going to discuss the following simple and educational (yet practically useful) classification algorithms:

• decision trees,
• logistic regression.

Before that happens, let’s go back to the K-NN algorithm.

library("FNN") # provides a fast K-nearest-neighbours classifier
# note that knn() returns a factor with levels "0" and "1"
Y_knn5   <- knn(X_train, X_test, Y_train, k=5)
Y_knn10  <- knn(X_train, X_test, Y_train, k=10)
Y_knn5s  <- knn(Z_train, Z_test, Y_train, k=5)  # standardised inputs
Y_knn10s <- knn(Z_train, Z_test, Y_train, k=10) # standardised inputs
c(
mean(Y_test == Y_knn5),
mean(Y_test == Y_knn10),
mean(Y_test == Y_knn5s),
mean(Y_test == Y_knn10s)
)
##  0.7896055 0.8008766 0.8215404 0.8334377
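Accuracy alone can be misleading on imbalanced data such as ours, so it may be instructive to inspect the full confusion matrix of, say, the best of the four variants (a quick sketch):

```r
# cross-tabulate the true vs. the predicted labels:
table(Y_test, Y_knn10s)
```

This reveals how the errors are distributed between the two classes, i.e., how many bad wines are labelled as good and vice versa.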

First, however, we should answer the question regarding the optimal choice of $$K$$.

How should we do that?