5.1 Introduction

5.1.1 Binary Logistic Regression: Recap


Let \(\mathbf{X}\in\mathbb{R}^{n\times p}\) be an input matrix that consists of \(n\) points in a \(p\)-dimensional space.

In other words, we have a database of \(n\) objects, each of which is described by means of \(p\) numerical features.

\[ \mathbf{X}= \left[ \begin{array}{cccc} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \\ \end{array} \right] \]

With each input \(\mathbf{x}_{i,\cdot}\) we associate the desired output \(y_i\), which is a categorical label – hence we will be dealing with classification tasks again.


In binary logistic regression we were modelling the probabilities that a given input belongs to either of the two classes:

\[ \begin{array}{ll} \Pr(Y=1|\mathbf{X},\boldsymbol\beta)=&\phantom{1-}\phi(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)\\ \Pr(Y=0|\mathbf{X},\boldsymbol\beta)=&1-\phi(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)\\ \end{array} \] where \(\phi(z) = \frac{1}{1+e^{-z}}\) is the logistic sigmoid function.

It holds: \[ \begin{array}{ll} \Pr(Y=1|\mathbf{X},\boldsymbol\beta)=&\displaystyle\frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}}\\ \Pr(Y=0|\mathbf{X},\boldsymbol\beta)=&\displaystyle\frac{e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}}{1+e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}}\\ \end{array} \]
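To make these formulas concrete, here is a minimal R sketch; the coefficient vector and the single input below are made up purely for illustration:

phi <- function(z) 1/(1+exp(-z))       # the logistic sigmoid
beta <- c(-0.5, 1.25, 2.0)             # made-up coefficients: beta_0, beta_1, beta_2
x <- c(0.7, -0.3)                      # a made-up input with p=2 features
p1 <- phi(beta[1] + sum(beta[-1]*x))   # Pr(Y=1|x, beta)
c(p1, 1-p1)                            # the two class probabilities sum to 1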


The fitting of the model was performed by minimising the cross-entropy (log-loss): \[ \min_{\boldsymbol\beta\in\mathbb{R}^{p+1}} -\frac{1}{n} \sum_{i=1}^n \left(y_i\log \Pr(Y=1|\mathbf{x}_{i,\cdot},\boldsymbol\beta) + (1-y_i)\log \Pr(Y=0|\mathbf{x}_{i,\cdot},\boldsymbol\beta)\right). \]

Note that for each \(i\), either the left or the right term (in the bracketed expression) vanishes.

Hence, we may also write the above as: \[ \min_{\boldsymbol\beta\in\mathbb{R}^{p+1}} -\frac{1}{n} \sum_{i=1}^n \log \Pr(Y=y_i|\mathbf{x}_{i,\cdot},\boldsymbol\beta). \]
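As a quick illustration of this criterion, here is a short R sketch with made-up labels and predicted probabilities:

y <- c(1, 0, 1, 1, 0)                      # made-up true labels y_i
p_hat <- c(0.9, 0.2, 0.7, 0.6, 0.1)        # made-up Pr(Y=1|x_i, beta)
-mean(y*log(p_hat) + (1-y)*log(1-p_hat))   # the cross-entropy (log-loss)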


In this chapter we will generalise the binary logistic regression model:

  • First we will consider the case of multiclass classification.

  • Then we will note that multinomial logistic regression is a special case of a feed-forward neural network.

5.1.2 Data


We will study the famous classic – the MNIST image classification dataset.

(MNIST stands for the Modified National Institute of Standards and Technology database; see http://yann.lecun.com/exdb/mnist/.)

It consists of 28×28 pixel images of handwritten digits:

  • train: 60,000 training images,
  • t10k: 10,000 testing images.

There are 10 unique digits, so this is a multiclass classification problem.

The dataset is already “too easy” for testing state-of-the-art classifiers (see the notes below), but it makes a great educational example.


A few image instances from each class:

(Figure: a few image instances from each of the ten digit classes.)


Accessing MNIST via the keras package (which we will use throughout this chapter anyway) is easy:
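For instance (a minimal sketch assuming that keras and its backend are already installed), the built-in dataset_mnist() loader can be called and its result unpacked into the variables used below:

library("keras")
mnist <- dataset_mnist()    # downloads the data on first use
X_train <- mnist$train$x    # 60,000 images, 28×28 pixels each
Y_train <- mnist$train$y    # 60,000 integer labels (0, 1, ..., 9)
X_test <- mnist$test$x      # 10,000 images
Y_test <- mnist$test$y      # 10,000 integer labels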


X_train and X_test consist of 28×28 pixel images.
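For example, their dimensions can be queried with dim():

dim(X_train)
dim(X_test)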

## [1] 60000    28    28
## [1] 10000    28    28

X_train and X_test are 3-dimensional arrays; think of them as vectors of 60,000 and 10,000 matrices of size 28×28, respectively.

These are greyscale images, with 0 = black, …, 255 = white:
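For example, the extreme pixel values can be determined with range():

range(X_train)    # the smallest and the largest pixel value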

## [1]   0 255

It is better to convert the colour values to 0.0 = black, …, 1.0 = white:
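For example (a simple sketch; this overwrites the original arrays in place):

X_train <- X_train/255    # now 0.0 = black, ..., 1.0 = white
X_test <- X_test/255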


Y_train and Y_test are the corresponding integer labels:
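For example, their lengths can be checked with length():

length(Y_train)
length(Y_test)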

## [1] 60000
## [1] 10000
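The class counts can be tabulated with table():

table(Y_train)    # class counts in the training set
table(Y_test)     # class counts in the test set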
## Y_train
##    0    1    2    3    4    5    6    7    8    9 
## 5923 6742 5958 6131 5842 5421 5918 6265 5851 5949
## Y_test
##    0    1    2    3    4    5    6    7    8    9 
##  980 1135 1032 1010  982  892  958 1028  974 1009
