## 3.1 Introduction

Let $$\mathbf{X}\in\mathbb{R}^{n\times p}$$ be an input matrix that consists of $$n$$ points in a $$p$$-dimensional space.

In other words, we have a database on $$n$$ objects, each of which being described by means of $$p$$ numerical features.

$\mathbf{X}= \left[ \begin{array}{cccc} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \\ \end{array} \right]$

Recall that in supervised learning, apart from $$\mathbf{X}$$, we are also given the corresponding $$\mathbf{y}$$.

With each input point $$\mathbf{x}_{i,\cdot}$$ we associate the desired output $$y_i$$.

In this chapter we are interested in classification tasks; we assume that each $$y_i$$ is a label (e.g., a character string).

Most commonly, we are faced with binary classification tasks where there are only two possible distinct labels.

We traditionally denote them with $$0$$s and $$1$$s.

For example:

0 1
no yes
false true
failure success
healthy ill

On the other hand, in multiclass classification, we assume that each $$y_i$$ takes more than two possible values.

### 3.1.2 Factor Data Type

On a side note (see the Appendix for more details), factor type in R is a very convenient means to store categorical data, and hence, to encode the data in $$\mathbf{y}$$.

x <- c("yes", "no", "no", "yes", "no")
f <- factor(x, levels=c("no", "yes"))
f
## [1] yes no  no  yes no
## Levels: no yes
table(f) # counts
## f
##  no yes
##   3   2

Internally, objects of type factor are represented as integer vectors with elements in $$\{1,\dots,M\}$$, where $$M$$ is the number of possible levels.

Labels, used to “decipher” the numeric codes, are stored separately.

as.numeric(f) # 2nd label, 1st label, 1st label etc.
## [1] 2 1 1 2 1
levels(f)
## [1] "no"  "yes"
levels(f) <- c("failure", "success") # re-encode
f
## [1] success failure failure success failure
## Levels: failure success

### 3.1.3 Data

For illustration, let’s consider the wines dataset.

wines <- read.csv("datasets/winequality-all.csv", comment="#")
(n <- nrow(wines)) # number of samples
## [1] 5320

The input matrix $$\mathbf{X}\in\mathbb{R}^{n\times p}$$ consists of all the numeric variables:

X <- as.matrix(wines[,1:11])
dim(X)
## [1] 5320   11
head(X, 2) # first two rows
##      fixed.acidity volatile.acidity citric.acid residual.sugar
## [1,]           7.4             0.70           0            1.9
## [2,]           7.8             0.88           0            2.6
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density
## [1,]     0.076                  11                   34  0.9978
## [2,]     0.098                  25                   67  0.9968
##        pH sulphates alcohol
## [1,] 3.51      0.56     9.4
## [2,] 3.20      0.68     9.8

The response variable is an ordinal one, giving each wine’s rating as assigned by a sommelier.

Here: 0 == a very bad wine, 10 == a very good one.

We will convert this dependent variable to a binary one:

• 0 == response < 5 == bad
• 1 == response >= 5 == good
# recall that TRUE == 1
Y <- factor(as.numeric(wines\$response >= 5))
table(Y)
## Y
##    0    1
## 4311 1009

Now $$(\mathbf{X},\mathbf{y})$$ is a basis for an interesting binary classification task.

### 3.1.4 Training and Test Sets

Recall that we are genuinely interested in the construction of supervised learning models for the two following purposes:

• description – to explain a given dataset in simpler terms,
• prediction – to forecast the values of the dependent variable for inputs that are yet to be observed.

In the latter case:

• we want our models to generalise well to new data,
• we don’t want our models to overfit to current data.

One way to assess if a model has sufficient predictive power is based on a random train-test split of the original dataset:

• training sample (usually 60-80% of the observations) – used to construct a model,
• test sample (remaining 40-20%) – used to assess the goodness of fit.

Test sample must not be used in the training phase! (No cheating!)

70/30% train-test split in R:

set.seed(123) # reproducibility matters
random_indices <- sample(n)
head(random_indices) # preview
## [1] 2463 2511 2227  526 4291 2986
# first 70% of the indices (they are arranged randomly)
# will constitute the train sample:
train_indices <- random_indices[1:floor(n*0.7)]
X_train <- X[train_indices,]
Y_train <- Y[train_indices]
# the remaining indices (30%) go to the test sample:
X_test  <- X[-train_indices,]
Y_test  <- Y[-train_indices]

### 3.1.5 Discussed Methods

We will discuss 3 simple and educational (yet practically useful) classification algorithms:

• K-nearest neighbour scheme – this chapter,
• Decision trees – the next chapter,
• Logistic regression – the next chapter.