4.2 Model Assessment and Selection

4.2.1 Performance Metrics


Recall that \(y_i\) denotes the true label associated with the \(i\)-th observation.

Let \(\hat{y}_i\) denote the classifier’s output for a given \(\mathbf{x}_{i,\cdot}\).

Ideally, we’d like \(\hat{y}_i=y_i\).

Sadly, in practice we will make errors.

Here are the 4 possible situations (true vs. predicted label):

                     \(y_i=0\)                          \(y_i=1\)
  \(\hat{y}_i=0\)    True Negative                      False Negative (Type II error)
  \(\hat{y}_i=1\)    False Positive (Type I error)      True Positive

Note that the terms positive and negative refer to the classifier’s output, i.e., they correspond to \(\hat{y}_i=1\) and \(\hat{y}_i=0\), respectively.


A confusion matrix is used to summarise the correctness of predictions for the whole sample:
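
One way to obtain it in R, assuming Y_pred stores the classifier’s outputs on the test set and Y_test the corresponding true labels (a sketch; the original code may have differed):

table(Y_pred, Y_test)   # rows: predicted labels, columns: true labels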

##       Y_test
## Y_pred    0    1
##      0 1220  187
##      1   79  111

For example, we can extract individual counts from this matrix.
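
A minimal sketch, with C being an assumed name for the matrix above (the original code may have differed):

C <- table(Y_pred, Y_test)   # the confusion matrix shown above
C["0", "0"]                  # true negatives: predicted 0, truly 0
C["1", "0"]                  # false positives: predicted 1, truly 0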

## [1] 1220
## [1] 79

Accuracy is the ratio of the correctly classified instances to all the instances.

In other words, it is the probability of making a correct prediction.

\[ \text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN} = \frac{1}{n} \sum_{i=1}^n \mathbb{I}\left( y_i = \hat{y}_i \right) \] where \(\mathbb{I}\) is the indicator function, \(\mathbb{I}(l)=1\) if logical condition \(l\) is true and \(0\) otherwise.
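
For our test sample, accuracy can be computed in (at least) two equivalent ways; a sketch, assuming Y_pred, Y_test and the confusion matrix C as above:

mean(Y_pred == Y_test)    # fraction of correct predictions
sum(diag(C)) / sum(C)     # (TP+TN)/(TP+TN+FP+FN)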

## [1] 0.8334377
## [1] 0.8334377

In many applications we are dealing with unbalanced problems, where the case \(y_i=1\) is relatively rare, yet predicting it correctly is much more important than being accurate with respect to class \(0\).

Think of medical applications, e.g., HIV testing or tumour diagnosis.

In such a case, accuracy as a metric fails to quantify what we are aiming for.

If only 1% of the cases have true \(y_i=1\), then a dummy classifier that always outputs \(\hat{y}_i=0\) has 99% accuracy.

Metrics such as precision and recall (and their aggregated version, F-measure) aim to address this problem.


Precision

\[ \text{Precision} = \frac{TP}{TP+FP} \]

If the classifier outputs \(1\), what is the probability that the true label is indeed \(1\)?
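
In terms of the confusion matrix C introduced above (a sketch):

C["1", "1"] / sum(C["1", ])   # TP / (TP + FP)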

## [1] 0.5842105

Recall (a.k.a. sensitivity, hit rate or true positive rate)

\[ \text{Recall} = \frac{TP}{TP+FN} \]

If the true class is \(1\), what is the probability that the classifier will detect it?
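
Again in terms of the confusion matrix C (a sketch):

C["1", "1"] / sum(C[, "1"])   # TP / (TP + FN)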

## [1] 0.3724832

Precision or recall? It depends on the application. Think of medical diagnosis, medical screening, plagiarism detection, etc.: which measure is more important in each of these settings?

As a compromise, we can use the F-measure (a.k.a. \(F_1\)-measure), which is the harmonic mean of precision and recall:

\[ \text{F} = \frac{1}{ \frac{ \frac{1}{\text{Precision}}+\frac{1}{\text{Recall}} }{2} } = \left( \frac{1}{2} \left( \text{Precision}^{-1}+\text{Recall}^{-1} \right) \right)^{-1} = \frac{TP}{TP + \frac{FP + FN}{2}} \]

Show that the above equality holds.
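
For our test sample (a sketch based on the confusion matrix C from above):

TP <- C["1", "1"]; FP <- C["1", "0"]; FN <- C["0", "1"]
TP / (TP + (FP + FN) / 2)     # the F-measure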

## [1] 0.454918

4.2.2 How to Choose K for K-NN Classification?


We haven’t yet considered the question of which \(K\) yields the best classifier.

Best == one that has the highest predictive power.

Best == with respect to some chosen metric (accuracy, recall, precision, F-measure, …)

Let us study how the metrics on the test set change as functions of the number of nearest neighbours considered, \(K\).


Auxiliary function:
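
A possible implementation is sketched below; the name knn_metrics and the use of knn() from the class package are assumptions made for illustration, and the original notes may define this function differently:

knn_metrics <- function(K, X_train, X_test, Y_train, Y_test)
{
    # predict the test labels with the K-nearest-neighbour classifier
    Y_pred <- as.numeric(as.character(
        class::knn(X_train, X_test, factor(Y_train), k=K)))
    TN <- sum(Y_pred == 0 & Y_test == 0)
    FN <- sum(Y_pred == 0 & Y_test == 1)
    FP <- sum(Y_pred == 1 & Y_test == 0)
    TP <- sum(Y_pred == 1 & Y_test == 1)
    c(Acc  = (TP+TN)/(TP+TN+FP+FN),   # accuracy
      Prec = TP/(TP+FP),              # precision
      Rec  = TP/(TP+FN),              # recall
      F    = TP/(TP+(FP+FN)/2),       # F-measure
      TN = TN, FN = FN, FP = FP, TP = TP)
}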

For example:
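
The following call uses the hypothetical knn_metrics() sketched above with \(K=5\) and the train/test sets considered throughout this chapter:

knn_metrics(5, X_train, X_test, Y_train, Y_test)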

##          Acc         Prec          Rec            F 
##    0.8215404    0.5234657    0.4865772    0.5043478 
##           TN           FN           FP           TP 
## 1167.0000000  153.0000000  132.0000000  145.0000000

Example call to evaluate metrics as a function of different \(K\)s:
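
A sketch, again relying on the hypothetical knn_metrics() from above:

Ks <- 1:10
results <- sapply(Ks, knn_metrics, X_train, X_test, Y_train, Y_test)
results <- data.frame(K = Ks, round(t(results), 2))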

Note that sapply(X, f, arg1, arg2, ...) computes the list Y with Y[[i]] = f(X[i], arg1, arg2, ...), which is then simplified to a matrix; here, each Y[[i]] becomes a separate column.

We transpose this result with t() so that each metric ends up in a separate column of the final result.

As usual, if you keep wondering, e.g., why t(), play with the code yourself – it’s fun fun fun.


Example results:

##     K  Acc Prec  Rec    F   TN  FN  FP  TP
## 1   1 0.80 0.47 0.54 0.50 1115 137 184 161
## 2   2 0.83 0.59 0.36 0.45 1226 191  73 107
## 3   3 0.81 0.48 0.49 0.49 1142 151 157 147
## 4   4 0.83 0.55 0.37 0.44 1209 189  90 109
## 5   5 0.82 0.52 0.49 0.50 1167 153 132 145
## 6   6 0.83 0.57 0.38 0.45 1213 186  86 112
## 7   7 0.83 0.56 0.49 0.52 1182 151 117 147
## 8   8 0.83 0.59 0.39 0.47 1217 182  82 116
## 9   9 0.83 0.55 0.46 0.50 1187 161 112 137
## 10 10 0.83 0.58 0.37 0.45 1220 187  79 111

A picture is worth a thousand tables though (see ?matplot in R).
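
For instance (a sketch based on the results data frame constructed above; colours, line types and legend placement are arbitrary choices):

metrics <- c("Acc", "Prec", "Rec", "F")
matplot(results[["K"]], as.matrix(results[metrics]),
    type="l", lty=1, col=1:4, xlab="K", ylab="metric value")
legend("topright", legend=metrics, lty=1, col=1:4)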


(Figures: the metrics from the table above depicted as functions of \(K\).)

4.2.3 Training, Validation and Test sets


In the \(K\)-NN classification task, there are many hyperparameters to tune:

  • Which \(K\) should we choose?

  • Should we standardise the dataset?

  • Which variables should be taken into account when computing the Euclidean metric?

  • Which metric should be used?

If we select the best hyperparameter set based on test sample error, we will run into the trap of overfitting again.

This time we’ll be overfitting to the test set — the model that is optimal for a given test sample doesn’t have to generalise well to other test samples (!).


In order to overcome this problem, we can perform a random train-validation-test split of the original dataset:

  • training sample (e.g., 60%) – used to construct the models
  • validation sample (e.g., 20%) – used to tune the hyperparameters of the classifier
  • test sample (e.g., 20%) – used to assess the goodness of fit

(*) If our dataset is too small, we can use various cross-validation techniques instead of a train-validation-test split.


An example way to perform a train-validation-test split:
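
A sketch, assuming the feature matrix is stored in X and the corresponding labels in a vector Y (names chosen here for illustration):

set.seed(123)               # for reproducibility of the random split
n <- nrow(X)
idx <- sample(n)            # a random permutation of 1, ..., n
n_train <- floor(0.6*n)     # 60% of the observations for training
n_valid <- floor(0.2*n)     # 20% for validation; the rest for testing
idx_train <- idx[1:n_train]
idx_valid <- idx[(n_train+1):(n_train+n_valid)]
idx_test  <- idx[(n_train+n_valid+1):n]
X_train <- X[idx_train, ];  Y_train <- Y[idx_train]
X_valid <- X[idx_valid, ];  Y_valid <- Y[idx_valid]
X_test  <- X[idx_test, ];   Y_test  <- Y[idx_test]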