5.2 Multinomial Logistic Regression

5.2.1 A Note on Data Representation


So… you may now be wondering: “How do we construct an image classifier? This seems so complicated!”

For a computer, (almost) everything is just numbers.

Instead of playing with \(n\) matrices, each of size 28×28, we may “flatten” the images so as to get \(n\) “long” vectors of length \(p=784\).

The classifiers studied here do not take the “spatial” positioning of the pixels into account anyway.

(*) See, however, convolutional neural networks (CNNs), e.g., in (Goodfellow et al. 2016).

Hence, now we’re back to our “comfort zone”.
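For example, if the images are stored in a three-dimensional array (a sketch; the name X_train and the c(n, 28, 28) layout are assumptions about how the data were loaded):

n <- dim(X_train)[1]                  # number of images
X_train <- matrix(X_train, nrow=n)    # reshape to an n x 784 matrix; row i is the i-th flattened image
dim(X_train)                          # should now be c(n, 784)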

5.2.2 Extending Logistic Regression


Let us generalise the binary logistic regression model to a 10-class one (or, more generally, a \(K\)-class one).

This time we will be modelling ten probabilities, with \(\Pr(Y=k|\mathbf{X},\mathbf{B})\) denoting the confidence that a given image \(\mathbf{X}\) in fact depicts the digit \(k\):

\[ \begin{array}{lcl} \Pr(Y=0|\mathbf{X},\mathbf{B})&=&\dots\\ \Pr(Y=1|\mathbf{X},\mathbf{B})&=&\dots\\ &\vdots&\\ \Pr(Y=9|\mathbf{X},\mathbf{B})&=&\dots\\ \end{array} \]

where \(\mathbf{B}\) is the set of underlying model parameters (to be determined soon).


In binary logistic regression, the class probabilities are obtained by “cleverly normalising” the outputs of a linear model (so that we obtain a value in \([0,1]\)).

In the multinomial case, we can use a separate linear model for each digit so that \(\Pr(Y=k|\mathbf{X},\mathbf{B})\) is given as a function of \[\beta_{0,k} + \beta_{1,k} X_{1} + \dots + \beta_{p,k} X_{p}.\]

Therefore, instead of a parameter vector of length \((p+1)\), we now need a parameter matrix of size \((p+1)\times 10\) to define the model.

Side note: the upper-case version of \(\beta\) is \(B\); hence the symbol \(\mathbf{B}\) for the parameter matrix.

Then, these 10 numbers will have to be normalised so that they are positive and sum to \(1\).


To maintain the spirit of the original model, we can take \(e^{-(\beta_{0,k} + \beta_{1,k} X_{1} + \dots + \beta_{p,k} X_{p})}\), which is guaranteed to be positive, because the range of the exponential function \(t\mapsto e^t\) is \((0,\infty)\).

Then, dividing each output by the sum of all the outputs will guarantee that the total sum equals 1.

This leads to: \[ \begin{array}{lcl} \Pr(Y=0|\mathbf{X},\mathbf{B})&=&\displaystyle\frac{e^{-(\beta_{0,0} + \beta_{1,0} X_{1} + \dots + \beta_{p,0} X_{p})}}{\sum_{k=0}^9 e^{-(\beta_{0,k} + \beta_{1,k} X_{1} + \dots + \beta_{p,k} X_p)}},\\ \Pr(Y=1|\mathbf{X},\mathbf{B})&=&\displaystyle\frac{e^{-(\beta_{0,1} + \beta_{1,1} X_{1} + \dots + \beta_{p,1} X_{p})}}{\sum_{k=0}^9 e^{-(\beta_{0,k} + \beta_{1,k} X_{1} + \dots + \beta_{p,k} X_{p})}},\\ &\vdots&\\ \Pr(Y=9|\mathbf{X},\mathbf{B})&=&\displaystyle\frac{e^{-(\beta_{0,9} + \beta_{1,9} X_{1} + \dots + \beta_{p,9} X_{p})}}{\sum_{k=0}^9 e^{-(\beta_{0,k} + \beta_{1,k} X_{1} + \dots + \beta_{p,k} X_{p})}}.\\ \end{array} \]

Note that we recover binary logistic regression if we consider only the two classes \(0\) and \(1\) and fix \(\beta_{0,0}=\beta_{1,0}=\dots=\beta_{p,0}=0\), so that the numerator for class \(0\) equals \(e^0=1\).
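Indeed, with only these two classes and the class-0 parameters all equal to zero, the denominator becomes \(1+e^{-(\beta_{0,1} + \beta_{1,1} X_{1} + \dots + \beta_{p,1} X_{p})}\) and we obtain:

\[ \Pr(Y=1|\mathbf{X},\mathbf{B})=\displaystyle\frac{e^{-(\beta_{0,1} + \beta_{1,1} X_{1} + \dots + \beta_{p,1} X_{p})}}{1+e^{-(\beta_{0,1} + \beta_{1,1} X_{1} + \dots + \beta_{p,1} X_{p})}}, \qquad \Pr(Y=0|\mathbf{X},\mathbf{B})=1-\Pr(Y=1|\mathbf{X},\mathbf{B}), \]

i.e., a model governed by a single parameter vector, just like in the binary case.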

5.2.3 Softmax Function


The above transformation (that maps 10 arbitrary real numbers to positive ones that sum to 1) is called the softmax function (or softargmax).

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0 1.00    0    0    0 0.00    0    0  0.00
## [2,]    0    0 0.50    0    0    0 0.50    0    0  0.00
## [3,]    0    0 0.73    0    0    0 0.27    0    0  0.00
## [4,]    0    0 0.67    0    0    0 0.24    0    0  0.09
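A vectorised implementation that applies this transformation to every row of a matrix of linear scores can be sketched as follows (the function name and the numerical-stability trick are our own choices):

softmax <- function(Z) {
    # Z -- an n x K matrix of linear combinations ("scores"), one row per observation;
    # subtracting the row-wise maximum does not change the result but prevents overflow
    E <- exp(Z - apply(Z, 1, max))
    E / rowSums(E)  # divide each row by its sum so that every row sums to 1
}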

5.2.4 One-Hot Encoding and Decoding


The ten class-belongingness-degrees can be decoded to obtain a single label by simply choosing the class that is assigned the highest probability.

##  [1] 0.00 0.00 0.67 0.00 0.00 0.00 0.24 0.00 0.00 0.09
## [1] 2

which.max(y) returns an index k such that y[k]==max(y) (recall that in R the first element in a vector is at index 1). Mathematically, we denote this operation as \(\mathrm{arg}\max_{k=1,\dots,K} y_k\).
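For example (a sketch; we subtract 1 because the digit \(k\) is stored at index \(k+1\)):

y <- c(0, 0, 0.67, 0, 0, 0, 0.24, 0, 0, 0.09)
which.max(y) - 1  # index 3 corresponds to the digit 2
## [1] 2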


To make processing the outputs of a logistic regression model more convenient, we will apply one-hot encoding to the labels.

Here, each label will be represented as a 0-1 probability vector – with probability 1 corresponding to the true class only.

For example:

##  [1] 0 0 1 0 0 0 0 0 0 0

To one-hot encode the reference outputs in R, we start with a matrix of size \(n\times 10\) populated with “0”s:

Next, for every \(i\), we insert a “1” in the \(i\)-th row and the (Y_train[\(i\)]+1)-th column:

In R, indexing a matrix A with a 2-column matrix B, i.e., A[B], allows for easy access to the elements A[B[1,1], B[1,2]], A[B[2,1], B[2,2]], A[B[3,1], B[3,2]], …
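A sketch of both steps (the name Y_train2 for the encoded matrix is our own):

n <- length(Y_train)
Y_train2 <- matrix(0, nrow=n, ncol=10)    # n x 10 matrix of zeros
Y_train2[cbind(1:n, Y_train + 1)] <- 1    # a 1 in row i, column Y_train[i]+1, for every i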


Sanity check:

## [1] 5 0 4 1 9 2
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    0    0    0    1    0    0    0     0
## [2,]    1    0    0    0    0    0    0    0    0     0
## [3,]    0    0    0    0    1    0    0    0    0     0
## [4,]    0    1    0    0    0    0    0    0    0     0
## [5,]    0    0    0    0    0    0    0    0    0     1
## [6,]    0    0    1    0    0    0    0    0    0     0

Let us generalise the above idea and write a function that can one-hot-encode any vector of integer labels:
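For example (the function's name and signature are our own):

one_hot_encode <- function(Y, K=max(Y)+1) {
    # Y -- a vector of integer labels in {0, 1, ..., K-1}
    stopifnot(Y >= 0, Y < K)
    Y2 <- matrix(0, nrow=length(Y), ncol=K)
    Y2[cbind(seq_along(Y), Y + 1)] <- 1   # a 1 at (i, Y[i]+1) for every i
    Y2
}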

Encode Y_train and Y_test:
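For instance (again, the names of the encoded matrices are our own):

Y_train2 <- one_hot_encode(Y_train, 10)
Y_test2  <- one_hot_encode(Y_test, 10)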

5.2.5 Cross-entropy Revisited


In essence, we will be comparing the probability vectors as generated by a classifier, \(\hat{Y}\):

##  [1] 0.00 0.00 0.67 0.00 0.00 0.00 0.24 0.00 0.00 0.09

with the one-hot-encoded true probabilities, \(Y\):

##  [1] 0 0 1 0 0 0 0 0 0 0

It turns out that one of the definitions of cross-entropy introduced above already handles the case of multiclass classification: \[ E(\mathbf{B}) = -\frac{1}{n} \sum_{i=1}^n \log \Pr(Y=y_i|\mathbf{x}_{i,\cdot},\mathbf{B}). \] The smaller the probability that the classifier assigns to the ground-truth class, the higher the penalty:

Figure: the penalty, \(-\log p\), as a function of the probability \(p\) assigned to the true class.


To sum up, we will be solving the optimisation problem: \[ \min_{\mathbf{B}\in\mathbb{R}^{(p+1)\times 10}} -\frac{1}{n} \sum_{i=1}^n \log \Pr(Y=y_i|\mathbf{x}_{i,\cdot},\mathbf{B}). \] This has no analytical solution, but can be solved using iterative methods (see the chapter on optimisation).


(*) Side note: A single term in the above formula, \[ \log \Pr(Y=y_i|\mathbf{x}_{i,\cdot},\mathbf{B}) \] given:

  • y_pred – a vector of 10 probabilities generated by the model: \[ \left[\Pr(Y=0|\mathbf{x}_{i,\cdot},\mathbf{B})\ \Pr(Y=1|\mathbf{x}_{i,\cdot},\mathbf{B})\ \cdots\ \Pr(Y=9|\mathbf{x}_{i,\cdot},\mathbf{B})\right] \]
  • y2 – a one-hot-encoded version of the true label, \(y_i\), of the form \[ \left[0\ 0\ \cdots\ 0\ 1\ 0\ \cdots\ 0\right] \]

can be computed as:
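For example, via an elementwise product (a sketch; we assume that y_pred stores the unrounded probabilities, which are always strictly positive, so that log() is well defined):

sum(y2 * log(y_pred))  # y2 is one-hot, so only the log-probability of the true class remains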

## [1] -0.4078174
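Consequently, the whole cost function can be sketched as follows (Y_pred denotes an \(n\times 10\) matrix of predicted probabilities and Y2 the one-hot-encoded true labels; both names are ours):

cross_entropy <- function(Y_pred, Y2) {
    # Y2 * log(Y_pred) keeps, in each row, only the log-probability of the true class;
    # average these over the n observations and negate
    -mean(rowSums(Y2 * log(Y_pred)))
}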

5.2.6 Problem Formulation in Matrix Form (**)


The definition of a multinomial logistic regression model for a multiclass classification task involving classes \(\{1,2,\dots,K\}\) is slightly bloated.

Assuming that \(\mathbf{X}\in\mathbb{R}^{n\times p}\) is the input matrix, to compute the \(K\) predicted probabilities for the \(i\)-th input, \[ \left[ \hat{y}_{i,1}\ \hat{y}_{i,2}\ \cdots\ \hat{y}_{i,K} \right], \] given a parameter matrix \(\mathbf{B}\in\mathbb{R}^{(p+1)\times K}\), we apply: \[ \begin{array}{lcl} \hat{y}_{i,1}=\Pr(Y=1|\mathbf{x}_{i,\cdot},\mathbf{B})&=&\displaystyle\frac{e^{\beta_{0,1} + \beta_{1,1} x_{i,1} + \dots + \beta_{p,1} x_{i,p}}}{\sum_{k=1}^K e^{\beta_{0,k} + \beta_{1,k} x_{i,1} + \dots + \beta_{p,k} x_{i,p}}},\\ &\vdots&\\ \hat{y}_{i,K}=\Pr(Y=K|\mathbf{x}_{i,\cdot},\mathbf{B})&=&\displaystyle\frac{e^{\beta_{0,K} + \beta_{1,K} x_{i,1} + \dots + \beta_{p,K} x_{i,p}}}{\sum_{k=1}^K e^{\beta_{0,k} + \beta_{1,k} x_{i,1} + \dots + \beta_{p,k} x_{i,p}}}.\\ \end{array} \]

We have dropped the minus sign in the exponents for brevity of notation. Note that we can always substitute \(\beta_{j,k}'=-\beta_{j,k}\).

It turns out we can make use of matrix notation to tidy the above formulas.


Denote the linear combinations prior to computing the softmax function with: \[ \begin{array}{lcl} t_{i,1}&=&\beta_{0,1} + \beta_{1,1} x_{i,1} + \dots + \beta_{p,1} x_{i,p}\\ &\vdots&\\ t_{i,K}&=&\beta_{0,K} + \beta_{1,K} x_{i,1} + \dots + \beta_{p,K} x_{i,p}\\ \end{array} \]

We have:

  • \(x_{i,j}\)\(i\)-th observation, \(j\)-th feature;
  • \(\hat{y}_{i,k}\)\(i\)-th observation, \(k\)-th class probability;
  • \(\beta_{j,k}\) – coefficient for the \(j\)-th feature when computing the \(k\)-th class.

Note that by augmenting \(\mathbf{\dot{X}}=[\boldsymbol{1}\ \mathbf{X}]\in\mathbb{R}^{n\times (p+1)}\), where \(\dot{x}_{i,0}=1\) and \(\dot{x}_{i,j}=x_{i,j}\) for all \(j\ge 1\) and all \(i\), we can write the above as: \[ \begin{array}{lclcl} t_{i,1}&=&\sum_{j=0}^p \dot{x}_{i,j}\, \beta_{j,1} &=& \dot{\mathbf{x}}_{i,\cdot}\, \boldsymbol\beta_{\cdot,1}\\ &\vdots&\\ t_{i,K}&=&\sum_{j=0}^p \dot{x}_{i,j}\, \beta_{j,K} &=& \dot{\mathbf{x}}_{i,\cdot}\, \boldsymbol\beta_{\cdot,K}\\ \end{array} \]


We can get the \(K\) linear combinations all at once in the form of a row vector by writing: \[ \left[ t_{i,1}\ t_{i,2}\ \cdots\ t_{i,K} \right] = \dot{\mathbf{x}}_{i,\cdot}\, \mathbf{B} \]

Moreover, we can do that for all the \(n\) inputs by writing: \[ \mathbf{T}=\dot{\mathbf{X}}\,\mathbf{B} \] Yes, this is a single matrix multiplication; we have \(\mathbf{T}\in\mathbb{R}^{n\times K}\).

To obtain \(\hat{\mathbf{Y}}\), we have to apply the softmax function on every row of \(\mathbf{T}\): \[ \hat{\mathbf{Y}}=\mathrm{softmax}\left( \dot{\mathbf{X}}\,\mathbf{B} \right). \]
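In R, this amounts to very little code (a sketch reusing the softmax() function defined earlier; X_train is assumed to be the flattened \(n\times p\) input matrix and B some \((p+1)\times 10\) parameter matrix, e.g., a random one before the optimisation starts):

X_dot <- cbind(1, X_train)                      # augment with a column of ones (intercepts)
B <- matrix(rnorm(ncol(X_dot) * 10), ncol=10)   # example (random) parameters
Y_hat <- softmax(X_dot %*% B)                   # n x 10 matrix of predicted class probabilities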

That’s it. Take some time to appreciate the elegance of this notation.

Methods for minimising cross-entropy expressed in matrix form will be discussed in the next chapter.