## 5.1 Introduction

### 5.1.1 Binary Logistic Regression: Recap

Let $$\mathbf{X}\in\mathbb{R}^{n\times p}$$ be an input matrix that consists of $$n$$ points in a $$p$$-dimensional space.

In other words, we have a database of $$n$$ objects, each of which is described by means of $$p$$ numerical features.

$\mathbf{X}= \left[ \begin{array}{cccc} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \\ \end{array} \right]$

With each input $$\mathbf{x}_{i,\cdot}$$ we associate the desired output $$y_i$$, which is a categorical label – hence we will again be dealing with classification tasks.

In binary logistic regression we were modelling the probabilities that a given input belongs to either of the two classes:

$\begin{array}{ll} \Pr(Y=1|\mathbf{X},\boldsymbol\beta)=&\phantom{1-}\phi(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)\\ \Pr(Y=0|\mathbf{X},\boldsymbol\beta)=&1-\phi(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)\\ \end{array}$ where $$\phi(z) = \frac{1}{1+e^{-z}}$$ is the logistic sigmoid function.

It holds: $\begin{array}{ll} \Pr(Y=1|\mathbf{X},\boldsymbol\beta)=&\displaystyle\frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}}\\ \Pr(Y=0|\mathbf{X},\boldsymbol\beta)=&\displaystyle\frac{e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}}{1+e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}}\\ \end{array}$
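To make the two formulas concrete, here is a minimal numerical sketch; the coefficient vector `beta` and the input `x` below are made-up example values, not taken from the text:

```r
# a numerical sketch of the two class probabilities above;
# `beta` and `x` are purely illustrative
phi <- function(z) 1/(1+exp(-z))  # the logistic sigmoid

beta <- c(-0.5, 2.0, 1.0)  # (beta_0, beta_1, beta_2)
x    <- c(0.7, -0.3)       # a single input with p=2 features

z  <- beta[1] + sum(beta[-1]*x)  # the linear combination
p1 <- phi(z)                     # Pr(Y=1|x, beta)
p0 <- exp(-z)/(1+exp(-z))        # Pr(Y=0|x, beta)
print(c(p1, p0, p1+p0))          # note that p0 == 1-phi(z) and p1+p0 == 1
```

Observe that the second formula is just $$1-\phi(z)$$ written over a common denominator.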

The fitting of the model was performed by minimising the cross-entropy (log-loss): $\min_{\boldsymbol\beta\in\mathbb{R}^{p+1}} -\frac{1}{n} \sum_{i=1}^n \left(y_i\log \Pr(Y=1|\mathbf{x}_{i,\cdot},\boldsymbol\beta) + (1-y_i)\log \Pr(Y=0|\mathbf{x}_{i,\cdot},\boldsymbol\beta)\right).$

Note that for each $$i$$, either the left or the right term (in the bracketed expression) vanishes.

Hence, we may also write the above as: $\min_{\boldsymbol\beta\in\mathbb{R}^{p+1}} -\frac{1}{n} \sum_{i=1}^n \log \Pr(Y=y_i|\mathbf{x}_{i,\cdot},\boldsymbol\beta).$
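The objective function can be implemented directly. Below is a sketch evaluated on a small randomly generated dataset (all names and data are illustrative only); it also confirms numerically that the two forms of the loss above coincide:

```r
# a sketch of the cross-entropy (log-loss) objective;
# the data below are randomly generated for illustration only
phi <- function(z) 1/(1+exp(-z))

log_loss <- function(beta, X, y) {
    p1 <- phi(beta[1] + X %*% beta[-1])  # Pr(Y=1|x_i, beta) for each i
    -mean(y*log(p1) + (1-y)*log(1-p1))
}

set.seed(123)
X <- matrix(rnorm(10*2), ncol=2)    # n=10 points, p=2 features
y <- sample(0:1, 10, replace=TRUE)  # binary labels
beta <- c(0.0, 1.0, -1.0)           # example coefficients

# the single-logarithm form gives the same value,
# because for each i only one of the two terms is non-zero:
p1 <- phi(beta[1] + X %*% beta[-1])
print(c(log_loss(beta, X, y), -mean(log(ifelse(y==1, p1, 1-p1)))))
```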

In this chapter we will generalise the binary logistic regression model:

• First we will consider the case of multiclass classification.

• Then we will note that multinomial logistic regression is a special case of a feed-forward neural network.

### 5.1.2 Data

We will study the famous classic – the MNIST image classification dataset.

MNIST stands for the Modified National Institute of Standards and Technology database; see http://yann.lecun.com/exdb/mnist/.

It consists of 28×28 pixel images of handwritten digits:

• train: 60,000 training images,
• t10k: 10,000 testing images.

There are 10 unique digits, so this is a multiclass classification problem.

The dataset is already “too easy” for benchmarking state-of-the-art classifiers (see the notes below), but it makes a great educational example.

(Figure: a few image instances from each class.)

Accessing MNIST via the keras package (which we will use throughout this chapter anyway) is easy:

```r
library("keras")
mnist <- dataset_mnist()
X_train <- mnist$train$x
Y_train <- mnist$train$y
X_test  <- mnist$test$x
Y_test  <- mnist$test$y
```

X_train and X_test consist of 28×28 pixel images.

```r
dim(X_train)
## [1] 60000    28    28
dim(X_test)
## [1] 10000    28    28
```

X_train and X_test are 3-dimensional arrays; think of them as collections of 60,000 and 10,000 matrices of size 28×28, respectively.

These are greyscale images, with 0 = black, …, 255 = white:

```r
range(X_train)
## [1]   0 255
```

It is better to rescale the colour values so that 0.0 = black, …, 1.0 = white:

```r
X_train <- X_train/255
X_test  <- X_test/255
```
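Recall from the recap that a model such as logistic regression expects each input as a flat vector of $$p$$ features. One way to reshape the image array into an $$n\times 784$$ matrix (a sketch on a small synthetic array standing in for X_train; the names below are illustrative) relies on R's column-major filling of `matrix()`:

```r
# sketch: flattening an n x 28 x 28 image array into an n x 784 matrix,
# so that each row holds one image's pixels; demonstrated on synthetic
# data (with the real dataset, `A` would be X_train)
n <- 3
A <- array(runif(n*28*28), dim=c(n, 28, 28))
A_flat <- matrix(A, nrow=n)  # column-major fill keeps rows aligned with images
print(dim(A_flat))           # n x 784
# row i of A_flat contains exactly the pixels of image A[i,,]:
print(all(A_flat[2, ] == as.numeric(A[2,,])))
```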

Y_train and Y_test are the corresponding integer labels:

```r
length(Y_train)
## [1] 60000
length(Y_test)
## [1] 10000
table(Y_train) # label distribution in train sample
## Y_train
##    0    1    2    3    4    5    6    7    8    9
## 5923 6742 5958 6131 5842 5421 5918 6265 5851 5949
table(Y_test)  # label distribution in test sample
## Y_test
##    0    1    2    3    4    5    6    7    8    9
##  980 1135 1032 1010  982  892  958 1028  974 1009
```

Let us display one of the images together with its reference label:

```r
id <- 123 # which image to show
image(z=t(X_train[id,,]), col=grey.colors(256, 0, 1),
    axes=FALSE, asp=1, ylim=c(1, 0))  # ylim reversed so the image is not upside down
legend("topleft", bg="white",
    legend=sprintf("True label=%d", Y_train[id]))
```