## 5.2 Multinomial Logistic Regression

### 5.2.1 A Note on Data Representation

So… you may now be wondering: “How do we construct an image classifier? This all seems so complicated!”

For a computer, (almost) everything is just numbers.

Instead of playing with $$n$$ matrices, each of size 28×28, we may “flatten” the images so as to get $$n$$ “long” vectors of length $$p=784$$.

X_train2 <- matrix(X_train, ncol=28*28)  # one flattened image per row
X_test2  <- matrix(X_test, ncol=28*28)   # likewise for the test set
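
A quick sanity check: matrix() fills its output column-wise, so (assuming X_train is an $$n\times 28\times 28$$ array, as above) each flattened row can be reshaped back into the original image:

dim(X_train2)  # n x 784
stopifnot(all(matrix(X_train2[1, ], nrow=28) == X_train[1, , ]))  # image 1 survives the round trip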

The classifiers studied here do not take the “spatial” positioning of the pixels into account anyway.

(*) See, however, convolutional neural networks (CNNs), e.g., in (Goodfellow et al. 2016).

Hence, now we’re back to our “comfort zone”.

### 5.2.2 Extending Logistic Regression

Let us generalise the binary logistic regression model to a 10-class one (or, more generally, $$K$$-class one).

This time we will be modelling ten probabilities, with $$\Pr(Y=k|\mathbf{X},\mathbf{B})$$ denoting the confidence that a given image $$\mathbf{X}$$ is in fact the $$k$$-th digit:

$\begin{array}{lcl} \Pr(Y=0|\mathbf{X},\mathbf{B})&=&\dots\\ \Pr(Y=1|\mathbf{X},\mathbf{B})&=&\dots\\ &\vdots&\\ \Pr(Y=9|\mathbf{X},\mathbf{B})&=&\dots\\ \end{array}$

where $$\mathbf{B}$$ is the set of underlying model parameters (to be determined soon).

In binary logistic regression, the class probabilities are obtained by “cleverly normalising” the outputs of a linear model (so that we obtain a value in $$[0,1]$$).

In the multinomial case, we can use a separate linear model for each digit so that $$\Pr(Y=k|\mathbf{X},\mathbf{B})$$ is given as a function of $\beta_{0,k} + \beta_{1,k} X_{1} + \dots + \beta_{p,k} X_{p}.$

Therefore, instead of a parameter vector of length $$(p+1)$$, we will need a parameter matrix of size $$(p+1)\times 10$$ to define the model.

Side note: the upper-case version of $$\beta$$ is $$B$$; hence the symbol $$\mathbf{B}$$ for the whole parameter matrix.

Then, these 10 numbers will have to be normalised so that they are all positive and sum to $$1$$.

To maintain the spirit of the original model, we can take $$e^{-(\beta_{0,k} + \beta_{1,k} X_{1} + \dots + \beta_{p,k} X_{p})}$$ to get a positive value, because the co-domain of the exponential function $$t\mapsto e^t$$ is $$(0,\infty)$$.

Then, dividing each output by the sum of all the outputs will guarantee that the total sum equals 1.

This leads to: $\begin{array}{lcl} \Pr(Y=0|\mathbf{X},\mathbf{B})&=&\displaystyle\frac{e^{-(\beta_{0,0} + \beta_{1,0} X_{1} + \dots + \beta_{p,0} X_{p})}}{\sum_{k=0}^9 e^{-(\beta_{0,k} + \beta_{1,k} X_{1} + \dots + \beta_{p,k} X_p)}},\\ \Pr(Y=1|\mathbf{X},\mathbf{B})&=&\displaystyle\frac{e^{-(\beta_{0,1} + \beta_{1,1} X_{1} + \dots + \beta_{p,1} X_{p})}}{\sum_{k=0}^9 e^{-(\beta_{0,k} + \beta_{1,k} X_{1} + \dots + \beta_{p,k} X_{p})}},\\ &\vdots&\\ \Pr(Y=9|\mathbf{X},\mathbf{B})&=&\displaystyle\frac{e^{-(\beta_{0,9} + \beta_{1,9} X_{1} + \dots + \beta_{p,9} X_{p})}}{\sum_{k=0}^9 e^{-(\beta_{0,k} + \beta_{1,k} X_{1} + \dots + \beta_{p,k} X_{p})}}.\\ \end{array}$

Note that we recover binary logistic regression if we fix $$\beta_{0,0}=\beta_{1,0}=\dots=\beta_{p,0}=0$$ (as $$e^0=1$$) and consider only the classes $$0$$ and $$1$$.
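
To make the above formulas concrete, here is a minimal sketch that evaluates them for a single flattened image. The parameter matrix B below is purely hypothetical (filled with random numbers); finding the best parameters is a matter we leave for later.

set.seed(123)  # for reproducibility of this illustration
B <- matrix(rnorm((28*28+1)*10, sd=0.001), ncol=10)  # hypothetical (p+1) x 10 parameter matrix
x <- c(1, X_train2[1, ])  # the first image, prepended with a 1 corresponding to the intercept
u <- exp(-(x %*% B))      # 10 positive values (note the minus sign)
p <- u/sum(u)             # normalise so that they sum to 1
sum(p)                    # equals 1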

### 5.2.3 Softmax Function

The above transformation (that maps 10 arbitrary real numbers to positive ones that sum to 1) is called the softmax function (or softargmax).

softmax <- function(T) {
    T2 <- exp(T)  # we ignore the minus sign used in the formulas above
    T2/sum(T2)
}
round(rbind(
    softmax(c(0, 0, 10, 0, 0, 0, 0,  0, 0, 0)),
    softmax(c(0, 0, 10, 0, 0, 0, 10, 0, 0, 0)),
    softmax(c(0, 0, 10, 0, 0, 0, 9,  0, 0, 0)),
    softmax(c(0, 0, 10, 0, 0, 0, 9,  0, 0, 8))), 2)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0 1.00    0    0    0 0.00    0    0  0.00
## [2,]    0    0 0.50    0    0    0 0.50    0    0  0.00
## [3,]    0    0 0.73    0    0    0 0.27    0    0  0.00
## [4,]    0    0 0.67    0    0    0 0.24    0    0  0.09
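
Side note: exp() overflows to Inf for arguments greater than ca. 710, in which case the above implementation would return NaNs. A numerically safer variant subtracts the maximum input first; the result is unchanged, because the constant factor cancels out in the ratio:

softmax_safe <- function(T) {
    T2 <- exp(T-max(T))  # shift so that the largest exponent is 0
    T2/sum(T2)
}
softmax_safe(c(1000, 1000, 0))  # plain softmax() would yield NaNs here
## [1] 0.5 0.5 0.0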

### 5.2.4 One-Hot Encoding and Decoding

The ten class-belongingness-degrees can be decoded to obtain a single label by simply choosing the class that is assigned the highest probability.

y_pred <- softmax(c(0, 0, 10, 0, 0, 0, 9,  0, 0, 8))
round(y_pred, 2) # probabilities of class 0, 1, 2, ..., 9
##  [1] 0.00 0.00 0.67 0.00 0.00 0.00 0.24 0.00 0.00 0.09
which.max(y_pred)-1 # 1..10 -> 0..9
## [1] 2

which.max(y) returns the (first) index k such that y[k]==max(y) (recall that in R, the first element of a vector is at index 1). Mathematically, we denote this operation by $$\mathrm{arg}\max_{k=1,\dots,K} y_k$$.

To make processing the outputs of a logistic regression model more convenient, we will apply the one-hot-encoding of the labels.

Here, each label will be represented as a 0–1 probability vector, with probability 1 at the position corresponding to the true class.

For example:

y <- 2 # true class (example)
y2 <- rep(0, 10)
y2[y+1] <- 1 # +1 because we need 0..9 -> 1..10
y2  # one-hot-encoded y
##  [1] 0 0 1 0 0 0 0 0 0 0

To one-hot encode the reference outputs in R, we start with a matrix of size $$n\times 10$$ populated with “0”s:

Y_train2 <- matrix(0, nrow=length(Y_train), ncol=10)

Next, for every $$i$$, we insert a “1” in the $$i$$-th row and the (Y_train[$$i$$]+1)-th column:

# Note the "+1" 0..9 -> 1..10
Y_train2[cbind(1:length(Y_train), Y_train+1)] <- 1

In R, indexing a matrix A with a two-column matrix B, i.e., A[B], gives easy access to the elements A[B[1,1], B[1,2]], A[B[2,1], B[2,2]], A[B[3,1], B[3,2]], etc.
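
For instance:

A <- matrix(1:6, nrow=2)    # a 2x3 matrix, filled column-wise
A[rbind(c(1, 2), c(2, 3))]  # extracts A[1,2] and A[2,3]
## [1] 3 6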

Sanity check:

head(Y_train)
## [1] 5 0 4 1 9 2
head(Y_train2)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    0    0    0    1    0    0    0     0
## [2,]    1    0    0    0    0    0    0    0    0     0
## [3,]    0    0    0    0    1    0    0    0    0     0
## [4,]    0    1    0    0    0    0    0    0    0     0
## [5,]    0    0    0    0    0    0    0    0    0     1
## [6,]    0    0    1    0    0    0    0    0    0     0

Let us generalise the above idea and write a function that can one-hot-encode any vector of integer labels:

one_hot_encode <- function(Y) {
    stopifnot(is.numeric(Y))
    c1 <- min(Y)  # first class label
    cK <- max(Y)  # last class label
    K <- cK-c1+1  # number of classes (assuming consecutive integer labels)

    Y2 <- matrix(0, nrow=length(Y), ncol=K)
    Y2[cbind(1:length(Y), Y-c1+1)] <- 1  # set the "1"s, one per row
    Y2
}
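
A quick test on a toy vector of labels:

one_hot_encode(c(0, 2, 1))
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    0    1
## [3,]    0    1    0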

Encode Y_train and Y_test:

Y_train2 <- one_hot_encode(Y_train)
Y_test2 <- one_hot_encode(Y_test)
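
The inverse operation (one-hot decoding) reduces to a row-wise which.max(), as discussed above; a minimal sketch of a hypothetical helper, assuming the class labels are 0, 1, …, 9:

one_hot_decode <- function(Y2) {
    apply(Y2, 1, which.max)-1  # column indices 1..10 -> labels 0..9
}
stopifnot(all(one_hot_decode(Y_train2) == Y_train))  # the round trip recovers the labels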

### 5.2.5 Cross-entropy Revisited

In essence, we will be comparing the probability vectors as generated by a classifier, $$\hat{Y}$$:

round(y_pred, 2)
##  [1] 0.00 0.00 0.67 0.00 0.00 0.00 0.24 0.00 0.00 0.09

with the one-hot-encoded true probabilities, $$Y$$:

y2
##  [1] 0 0 1 0 0 0 0 0 0 0

It turns out that one of the definitions of cross-entropy introduced above already handles the case of multiclass classification: $E(\mathbf{B}) = -\frac{1}{n} \sum_{i=1}^n \log \Pr(Y=y_i|\mathbf{x}_{i,\cdot},\mathbf{B}).$ The smaller the probability that the classifier assigns to the ground-truth class, the higher the penalty.

To sum up, we will be solving the optimisation problem: $\min_{\mathbf{B}\in\mathbb{R}^{(p+1)\times 10}} -\frac{1}{n} \sum_{i=1}^n \log \Pr(Y=y_i|\mathbf{x}_{i,\cdot},\mathbf{B}).$ This has no analytical solution, but can be solved using iterative methods (see the chapter on optimisation).

(*) Side note: A single term in the above formula, $\log \Pr(Y=y_i|\mathbf{x}_{i,\cdot},\mathbf{B})$, given:

• y_pred – a vector of 10 probabilities generated by the model: $\left[\Pr(Y=0|\mathbf{x}_{i,\cdot},\mathbf{B})\ \Pr(Y=1|\mathbf{x}_{i,\cdot},\mathbf{B})\ \cdots\ \Pr(Y=9|\mathbf{x}_{i,\cdot},\mathbf{B})\right]$
• y2 – a one-hot-encoded version of the true label, $$y_i$$, of the form $\left[0\ 0\ \cdots\ 0\ 1\ 0\ \cdots\ 0\right]$

can be computed as:

sum(y2*log(y_pred))  # y2 singles out the log-probability of the true class
## [1] -0.4078174
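
The whole cost function can be computed in the same vein; a minimal sketch, where Y_pred denotes a hypothetical $$n\times 10$$ matrix of predicted probabilities (one row per observation) and Y2 its one-hot-encoded true counterpart:

cross_entropy <- function(Y2, Y_pred) {
    # -1/n * the sum of the log-probabilities assigned to the true classes;
    # assumes that all entries of Y_pred are strictly positive (softmax guarantees this)
    -mean(rowSums(Y2*log(Y_pred)))
}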

### 5.2.6 Problem Formulation in Matrix Form (**)

The definition of a multinomial logistic regression model for a multiclass classification task involving classes $$\{1,2,\dots,K\}$$ is slightly bloated.

Assuming that $$\mathbf{X}\in\mathbb{R}^{n\times p}$$ is the input matrix, to compute the $$K$$ predicted probabilities for the $$i$$-th input, $\left[ \hat{y}_{i,1}\ \hat{y}_{i,2}\ \cdots\ \hat{y}_{i,K} \right],$ given a parameter matrix $$\mathbf{B}\in\mathbb{R}^{(p+1)\times K}$$, we apply: $\begin{array}{lcl} \hat{y}_{i,1}=\Pr(Y=1|\mathbf{x}_{i,\cdot},\mathbf{B})&=&\displaystyle\frac{e^{\beta_{0,1} + \beta_{1,1} x_{i,1} + \dots + \beta_{p,1} x_{i,p}}}{\sum_{k=1}^K e^{\beta_{0,k} + \beta_{1,k} x_{i,1} + \dots + \beta_{p,k} x_{i,p}}},\\ &\vdots&\\ \hat{y}_{i,K}=\Pr(Y=K|\mathbf{x}_{i,\cdot},\mathbf{B})&=&\displaystyle\frac{e^{\beta_{0,K} + \beta_{1,K} x_{i,1} + \dots + \beta_{p,K} x_{i,p}}}{\sum_{k=1}^K e^{\beta_{0,k} + \beta_{1,k} x_{i,1} + \dots + \beta_{p,k} x_{i,p}}}.\\ \end{array}$

We have dropped the minus sign in the exponentiation for brevity of notation. Note that we can always substitute $$\beta_{j,k}'=-\beta_{j,k}$$.

It turns out we can make use of matrix notation to tidy the above formulas.

Denote the linear combinations prior to computing the softmax function with: $\begin{array}{lcl} t_{i,1}&=&\beta_{0,1} + \beta_{1,1} x_{i,1} + \dots + \beta_{p,1} x_{i,p}\\ &\vdots&\\ t_{i,K}&=&\beta_{0,K} + \beta_{1,K} x_{i,1} + \dots + \beta_{p,K} x_{i,p}\\ \end{array}$

We have:

• $$x_{i,j}$$$$i$$-th observation, $$j$$-th feature;
• $$\hat{y}_{i,k}$$$$i$$-th observation, $$k$$-th class probability;
• $$\beta_{j,k}$$ – the coefficient of the $$j$$-th feature in the linear model for the $$k$$-th class.

Note that by augmenting $$\mathbf{\dot{X}}=[\boldsymbol{1}\ \mathbf{X}]\in\mathbb{R}^{n\times (p+1)}$$, where $$\dot{x}_{i,0}=1$$ and $$\dot{x}_{i,j}=x_{i,j}$$ for all $$j\ge 1$$ and all $$i$$, we can write the above as: $\begin{array}{lclcl} t_{i,1}&=&\sum_{j=0}^p \dot{x}_{i,j}\, \beta_{j,1} &=& \dot{\mathbf{x}}_{i,\cdot}\, \boldsymbol\beta_{\cdot,1}\\ &\vdots&\\ t_{i,K}&=&\sum_{j=0}^p \dot{x}_{i,j}\, \beta_{j,K} &=& \dot{\mathbf{x}}_{i,\cdot}\, \boldsymbol\beta_{\cdot,K}\\ \end{array}$

We can get the $$K$$ linear combinations all at once in the form of a row vector by writing: $\left[ t_{i,1}\ t_{i,2}\ \cdots\ t_{i,K} \right] = \dot{\mathbf{x}}_{i,\cdot}\, \mathbf{B}$

Moreover, we can do that for all the $$n$$ inputs at once by writing: $\mathbf{T}=\dot{\mathbf{X}}\,\mathbf{B}$ Yes, this is a single matrix multiplication; we have $$\mathbf{T}\in\mathbb{R}^{n\times K}$$.

To obtain $$\hat{\mathbf{Y}}$$, we have to apply the softmax function to every row of $$\mathbf{T}$$: $\hat{\mathbf{Y}}=\mathrm{softmax}\left( \dot{\mathbf{X}}\,\mathbf{B} \right).$

That’s it. Take some time to appreciate the elegance of this notation.
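
As a sketch of how this translates to R (reusing the hypothetical random B and the softmax() function from above; recall that the minus sign has been dropped here), we may write:

X_dot <- cbind(1, X_train2)        # the augmented input matrix [1 X]
T <- X_dot %*% B                   # an n x 10 matrix of linear combinations
Y_pred <- t(apply(T, 1, softmax))  # apply(..., 1, ...) works row-wise; transpose back to n x 10
stopifnot(all(abs(rowSums(Y_pred)-1) < 1e-9))  # every row sums to 1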

Methods for minimising the cross-entropy expressed in matrix form will be discussed in the next chapter.