1.2 Supervised Learning

1.2.1 Formalism


Let \(\mathbf{X}=\{\mathfrak{X}_1,\dots,\mathfrak{X}_n\}\) be an input sample (“a database”) that consists of \(n\) objects.

Most often we assume that each object \(\mathfrak{X}_i\) is represented using \(p\) numbers for some \(p\).

We denote this fact as \(\mathfrak{X}_i\in \mathbb{R}^p\) (it is a \(p\)-dimensional real vector or a sequence of \(p\) numbers or a point in a \(p\)-dimensional real space or an element of a real \(p\)-space etc.).

Of course, this setting is abstract in the sense that there might be different realities hidden behind these symbols.

This is what maths is for – creating abstractions or models of complex entities/phenomena so that they can be much more easily manipulated or understood.

This is very powerful – let’s spend a moment contemplating how many real-world situations fit into this framework.


If we have “complex” objects as inputs, we can try representing them as feature vectors (e.g., come up with numeric attributes that best describe them for the task at hand).

How would you represent a patient in a clinic?

How would you represent a car in an insurance company’s database?

How would you represent a student in a university?

This also includes image/video data, e.g., a 1920×1080 pixel image can be “unwound” to a “flat” vector of length 2,073,600.
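
A minimal R sketch of this unwinding, assuming a greyscale image stored as a matrix with one number per pixel (the variable names and the random pixel values are purely illustrative):

```r
# Hypothetical greyscale image: 1080 rows by 1920 columns of pixel intensities
set.seed(123)
img <- matrix(runif(1080 * 1920), nrow = 1080, ncol = 1920)

# "Unwind" the matrix into a flat numeric vector (column by column)
x <- as.vector(img)
length(x)  # 1080 * 1920 = 2073600
```

A colour image with three channels per pixel would yield a vector three times as long.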

(*) Some algorithms, such as Multidimensional Scaling, Locally Linear Embedding, and Isomap, can compute such numeric representations automatically.


In cases such as this, we say that we deal with structured (tabular) data:
\(\mathbf{X}\) can be written as an (\(n\times p\))-matrix: \[ \mathbf{X}= \left[ \begin{array}{cccc} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \\ \end{array} \right] \] Mathematically, we denote this as \(\mathbf{X}\in\mathbb{R}^{n\times p}\).

Structured data == think: Excel/Calc spreadsheets, SQL tables etc.


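Example: The famous Fisher’s Iris flower dataset, see ?iris in R and https://en.wikipedia.org/wiki/Iris_flower_data_set. Below are its first six rows and four numeric columns, followed by the dimensions of this excerpt and of the whole dataset: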

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4
## [1] 6 4
## [1] 150   5
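
The listing above is consistent with a call sequence along the following lines. The name X and the choice of the first six rows and four numeric columns are assumptions inferred from the printed output, not something the text states:

```r
# iris ships with base R (package datasets), so nothing needs to be loaded
X <- iris[1:6, 1:4]  # assumed: first 6 rows, the 4 numeric columns
print(X)
dim(X)     # rows and columns of the excerpt: 6 4
dim(iris)  # the full dataset, for comparison: 150 5
```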

\(x_{i,j}\in\mathbb{R}\) represents the \(j\)-th feature of the \(i\)-th observation, \(j=1,\dots,p\), \(i=1,\dots,n\).

For instance, \(x_{3,2}\), the sepal width of the third flower, is:

## [1] 3.2

The third observation (data point, row in \(\mathbf{X}\)) consists of items \((x_{3,1}, \dots, x_{3,p})\). It is shown below as a table row and as a plain numeric vector, followed by its length \(p\):

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3          4.7         3.2          1.3         0.2
## [1] 4.7 3.2 1.3 0.2
## [1] 4

Moreover, the second feature/variable/column consists of \((x_{1,2}, x_{2,2}, \dots, x_{n,2})\); its values and its length \(n\) are:

## [1] 3.5 3.0 3.2 3.1 3.6 3.9
## [1] 6
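
Assuming that X holds the first six rows and four numeric columns of iris (an assumption inferred from the printed values), the single element, the row, and the column discussed above could be extracted roughly as follows:

```r
X <- iris[1:6, 1:4]  # assumed excerpt of the built-in iris data

X[3, 2]                          # x_{3,2}: 3rd observation, 2nd feature

row3 <- X[3, ]                   # 3rd row, all columns (a one-row data frame)
unlist(row3, use.names = FALSE)  # the same values as a plain numeric vector
length(row3)                     # p = 4 features

col2 <- X[, 2]                   # all rows, 2nd column (a numeric vector)
length(col2)                     # n = 6 observations
```

Leaving an index empty selects everything along that dimension, which mirrors the dot in \(\mathbf{x}_{i,\cdot}\) and \(\mathbf{x}_{\cdot,j}\).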

We will sometimes use the following notation to emphasise that the \(\mathbf{X}\) matrix consists of \(n\) rows or \(p\) columns:

\[ \mathbf{X}=\left[ \begin{array}{c} \mathbf{x}_{1,\cdot} \\ \mathbf{x}_{2,\cdot} \\ \vdots\\ \mathbf{x}_{n,\cdot} \\ \end{array} \right] = \left[ \begin{array}{cccc} \mathbf{x}_{\cdot,1} & \mathbf{x}_{\cdot,2} & \cdots & \mathbf{x}_{\cdot,p} \\ \end{array} \right]. \]


Here, \(\mathbf{x}_{i,\cdot}\) is a row vector of length \(p\), i.e., a \((1\times p)\)-matrix:

\[ \mathbf{x}_{i,\cdot} = \left[ \begin{array}{cccc} x_{i,1} & x_{i,2} & \cdots & x_{i,p} \\ \end{array} \right]. \]

Moreover, \(\mathbf{x}_{\cdot,j}\) is a column vector of length \(n\), i.e., an \((n\times 1)\)-matrix:

\[ \mathbf{x}_{\cdot,j} = \left[ \begin{array}{cccc} x_{1,j} & x_{2,j} & \cdots & x_{n,j} \\ \end{array} \right]^T=\left[ \begin{array}{c} {x}_{1,j} \\ {x}_{2,j} \\ \vdots\\ {x}_{n,j} \\ \end{array} \right], \]

where \(\cdot^T\) denotes the transpose of a given matrix, i.e., the matrix flipped over its main diagonal so that rows become columns and vice versa; it allows us to introduce a set of “vertically stacked” objects using a single inline formula.
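
As a small sketch of the transpose in R (the values are those of the third iris observation shown earlier; the variable names are illustrative), t() turns a \((1\times p)\) row vector into a \((p\times 1)\) column vector:

```r
# A row vector stored as a 1 x 4 matrix
x_row <- matrix(c(4.7, 3.2, 1.3, 0.2), nrow = 1)
dim(x_row)  # 1 4

# Transposition swaps rows and columns
x_col <- t(x_row)
dim(x_col)  # 4 1

# Transposing twice recovers the original matrix
identical(t(x_col), x_row)
```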

1.2.2 Desired Outputs


In supervised learning, apart from the inputs we are also given the corresponding reference/desired outputs.

The aim of supervised learning is to try to create an “algorithm” that, given an input point, generates an output that is as close as possible to the desired one. The given data sample will be used to “train” this “model”.

Usually the reference outputs are encoded as individual numbers (scalars) or textual labels.

In other words, with each input \(\mathbf{x}_{i,\cdot}\) we associate the desired output \(y_i\):

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 14           4.3         3.0          1.1         0.1    setosa
## 50           5.0         3.3          1.4         0.2    setosa
## 118          7.7         3.8          6.7         2.2 virginica

Hence, our dataset is \([\mathbf{X}\ \mathbf{y}]\) – each object is represented as a row vector \([\mathbf{x}_{i,\cdot}\ y_i]\), \(i=1,\dots,n\):

\[ [\mathbf{X}\ \mathbf{y}]= \left[ \begin{array}{ccccc} x_{1,1} & x_{1,2} & \cdots & x_{1,p} & y_1\\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} & y_2\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} & y_n\\ \end{array} \right], \]

where:

\[ \mathbf{y} = \left[ \begin{array}{cccc} y_{1} & y_{2} & \cdots & y_{n} \\ \end{array} \right]^T=\left[ \begin{array}{c} {y}_{1} \\ {y}_{2} \\ \vdots\\ {y}_{n} \\ \end{array} \right]. \]
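
For the iris data, one possible way of splitting the dataset into \(\mathbf{X}\) and \(\mathbf{y}\) is sketched below. Treating Species as the desired output is an assumption suggested by the listing above, and the variable names are illustrative:

```r
X <- as.matrix(iris[, 1:4])  # inputs: an n x p numeric matrix
y <- iris$Species            # desired outputs: one label per row of X

dim(X)     # n = 150, p = 4
length(y)  # n = 150

# The i-th object is the pair (x_{i,.}, y_i), e.g., for i = 118:
X[118, ]
y[118]

# Rows 14, 50 and 118 of the combined dataset [X y]
iris[c(14, 50, 118), ]
```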

1.2.3 Types of Supervised Learning Problems


Depending on the type of the elements in \(\mathbf{y}\) (the domain of \(\mathbf{y}\)), supervised learning problems are usually classified as:

  • regression – each \(y_i\) is a real number

    e.g., \(y_i=\) a future stock market price with \(\mathbf{x}_{i,\cdot}=\) prices from \(p\) previous days

  • classification – each \(y_i\) is a discrete label

    e.g., \(y_i=\) healthy (0) or ill (1) with \(\mathbf{x}_{i,\cdot}=\) a patient’s health record

  • ordinal regression (a.k.a. ordinal classification) – each \(y_i\) is a rank

    e.g., \(y_i=\) rating of a product on the scale 1–5 with \(\mathbf{x}_{i,\cdot}=\) ratings of \(p\) most similar products
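
In R, the three cases roughly correspond to different types of \(\mathbf{y}\). The sketch below uses short made-up vectors purely to show the types; the helper name problem_type is hypothetical and the rule it applies is only a heuristic:

```r
# Hypothetical helper: guess the problem type from the type of y
problem_type <- function(y) {
  if (is.ordered(y)) {
    "ordinal regression"  # ordered factor: ranks
  } else if (is.factor(y) || is.character(y) || is.logical(y)) {
    "classification"      # unordered discrete labels
  } else if (is.numeric(y)) {
    "regression"          # real numbers
  } else {
    stop("unsupported type of y")
  }
}

y_price  <- c(101.5, 99.8, 103.2)                           # illustrative prices
y_health <- factor(c("healthy", "ill", "healthy"))          # illustrative labels
y_rating <- factor(c(3, 5, 1), levels = 1:5, ordered = TRUE)  # illustrative ranks

problem_type(y_price)   # "regression"
problem_type(y_health)  # "classification"
problem_type(y_rating)  # "ordinal regression"
```

Labels coded as integers (such as 0/1) would be reported as regression by this heuristic, so the storage type alone is not decisive; what matters is how the values are meant to be interpreted.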


Example Problems – Discussion:

Which of the following are instances of classification problems? Which of them are regression tasks?

What kind of data should you gather in order to tackle them?

  • Detect email spam
  • Predict a stock market price
  • Predict the likeability of a new ad
  • Assess credit risk
  • Detect tumour tissues in medical images
  • Predict time-to-recovery of cancer patients
  • Recognise smiling faces on photographs
  • Detect unattended luggage in airport security camera footage
  • Turn on emergency braking to avoid a collision with pedestrians

A single dataset can become an instance of many different ML problems.
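
To illustrate with the built-in iris data (the choice of output columns and the variable names are assumptions made for this sketch), the same table can be set up as a regression task or as a classification task simply by choosing a different column as \(\mathbf{y}\):

```r
# Regression: predict a real-valued column from the remaining measurements
X_reg <- as.matrix(iris[, 1:3])
y_reg <- iris$Petal.Width  # numeric output

# Classification: predict the species label from all four measurements
X_cls <- as.matrix(iris[, 1:4])
y_cls <- iris$Species      # factor with 3 levels

is.numeric(y_reg)  # TRUE
nlevels(y_cls)     # 3
```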

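Examples – the wines dataset. The first record is shown below, followed by summaries of three columns that could each play the role of the output: a numeric one (the range is consistent with alcohol), the two-valued color, and the rank-valued response: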

##   fixed.acidity volatile.acidity citric.acid residual.sugar
## 1           7.4              0.7           0            1.9
##   chlorides free.sulfur.dioxide total.sulfur.dioxide density
## 1     0.076                  11                   34  0.9978
##     pH sulphates alcohol response color
## 1 3.51      0.56     9.4        3   red

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.55   11.40   14.90



## 
##   red white 
##  1359  3961



## 
##    1    2    3    4    5    6    7 
##   30  206 1752 2323  856  148    5
