4.4 Binary Logistic Regression

4.4.1 Motivation


Recall that for a regression task, we fitted a very simple family of models – the linear ones – by minimising the sum of squared residuals.

This approach was pretty effective.

In principle, we could treat the class labels as numeric \(0\)s and \(1\)s and apply regression models in a binary classification task. Below is a summary of the real-valued predictions, \(\hat{Y}\), obtained from a linear model fitted to such 0/1-coded labels:


##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.76785  0.03625  0.18985  0.20120  0.34916  0.95385

The predicted outputs, \(\hat{Y}\), are arbitrary real numbers, but we can convert them to binary ones by checking if, e.g., \(\hat{Y}>0.5\).

##          Acc         Prec          Rec            F 
##    0.8303068    0.6194690    0.2348993    0.3406326 
##           TN           FN           FP           TP 
## 1256.0000000  228.0000000   43.0000000   70.0000000
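
For concreteness, a minimal sketch of such a pipeline is given below. It assumes a training sample XY_train and a test sample XY_test whose 0/1-coded label column is Y; the names XY_test, m_lm and Y_hat, as well as the exact model formula, are our own choices here, since the code that produced the above output is not shown.

# fit an ordinary linear model to the 0/1-coded labels
m_lm <- lm(Y ~ ., data = XY_train)

# real-valued predictions on the test sample
Y_hat <- predict(m_lm, XY_test)
summary(Y_hat)

# convert to binary predictions by thresholding at T = 0.5
Y_pred <- as.numeric(Y_hat > 0.5)

# confusion matrix entries and a few performance measures
TP <- sum(Y_pred == 1 & XY_test$Y == 1)
TN <- sum(Y_pred == 0 & XY_test$Y == 0)
FP <- sum(Y_pred == 1 & XY_test$Y == 0)
FN <- sum(Y_pred == 0 & XY_test$Y == 1)
c(Acc  = (TP + TN) / (TP + TN + FP + FN),
  Prec = TP / (TP + FP),
  Rec  = TP / (TP + FN),
  F    = 2 * TP / (2 * TP + FP + FN))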

The threshold \(T=0.5\) could even be treated as a free parameter that we optimise (with respect to a chosen metric computed on a validation sample).
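
Purely as a sketch, one could scan a grid of candidate thresholds on a validation sample and keep the value that maximises, say, the F-measure; the validation set XY_valid and the helper F_measure below are hypothetical.

# hypothetical helper: F-measure of 0/1 predictions vs. true 0/1 labels
F_measure <- function(y_pred, y_true) {
    TP <- sum(y_pred == 1 & y_true == 1)
    FP <- sum(y_pred == 1 & y_true == 0)
    FN <- sum(y_pred == 0 & y_true == 1)
    2 * TP / (2 * TP + FP + FN)
}

# predictions of the linear model on the validation sample
Y_hat_valid <- predict(m_lm, XY_valid)

# evaluate a grid of thresholds and pick the best-performing one
Ts <- seq(0.05, 0.95, by = 0.01)
Fs <- sapply(Ts, function(T) F_measure(as.numeric(Y_hat_valid > T), XY_valid$Y))
T_best <- Ts[which.max(Fs)]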


4.4.2 Logistic Model


Inspired by this idea, we could try modelling the probability that a given point belongs to class \(1\).

This would also provide us with a measure of confidence in our predictions.

A probability is a number in \([0,1]\), but the output of our linear model is an arbitrary real number, \(Y\in \mathbb{R}\).

However, we could transform the real-valued outputs by means of some function \(\phi:\mathbb{R}\to[0,1]\) (preferably an S-shaped one, i.e., a sigmoid), so as to get:

\(\Pr(Y=1|\mathbf{X},\boldsymbol\beta)=\phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)\)

The above reads as “Probability that \(Y\) is from class 1 given \(\mathbf{X}\) and \(\boldsymbol\beta\)”.


A popular choice is the logistic sigmoid function,

\[ \phi(y) = \frac{1}{1+e^{-y}} = \frac{e^y}{1+e^y} \]
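
This function is straightforward to define and inspect in R; a quick sketch:

# the logistic sigmoid: maps any real number to a value in (0, 1)
phi <- function(y) 1 / (1 + exp(-y))

phi(c(-5, 0, 5))   # ca. 0.0067, 0.5000, 0.9933
curve(phi, from = -6, to = 6, xlab = "y", ylab = "phi(y)")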



Hence our model becomes:

\[ Y=\frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)}} \]

We call it a generalised linear model (glm).

4.4.3 Example in R


Let us first fit a simple (i.e., \(p=1\)) logistic regression model using the alcohol variable.

In the call below, “logit” refers to the link function, i.e., the inverse of the logistic sigmoid.
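
Such a model can be fitted with R's built-in glm() function; the call below matches the one reported in the summary that follows (only the name model1 is our own choice).

# binary logistic regression of Y as a function of alcohol
model1 <- glm(Y ~ alcohol, data = XY_train, family = binomial("logit"))
summary(model1)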

## 
## Call:
## glm(formula = Y ~ alcohol, family = binomial("logit"), data = XY_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8703  -0.5994  -0.3984  -0.2770   2.7463  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -11.94770    0.46819  -25.52   <2e-16 ***
## alcohol       0.96469    0.04166   23.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3630.9  on 3722  degrees of freedom
## Residual deviance: 2963.4  on 3721  degrees of freedom
## AIC: 2967.4
## 
## Number of Fisher Scoring iterations: 5



Some predicted probabilities:
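
Predicted probabilities can be obtained with predict() and type = "response", which returns values on the probability scale; the test sample XY_test below is an assumed name, as the data set actually used here is not shown.

# probabilities of belonging to class 1 for the test observations
Y_prob <- predict(model1, XY_test, type = "response")
head(Y_prob)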

##          1          2          3          4          5 
## 0.07629709 0.07629709 0.05316971 0.05824087 0.13961713 
##          6 
## 0.13961713

We classify \(Y\) as 1 if the corresponding membership probability is greater than \(0.5\).
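
In code, this decision rule is a one-liner (continuing the hypothetical objects introduced above; the helper that produced the summary below is not shown):

# classify as 1 whenever the predicted membership probability exceeds 0.5
Y_pred <- as.numeric(Y_prob > 0.5)

# cross-tabulate predictions against the true labels
table(Y_pred, XY_test$Y)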

##          Acc         Prec          Rec            F 
##    0.8127740    0.4971751    0.2953020    0.3705263 
##           TN           FN           FP           TP 
## 1210.0000000  210.0000000   89.0000000   88.0000000

And now a fit based on all the input variables:
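
A sketch of such a fit is given below; the formula Y ~ . (regress Y on all remaining columns of XY_train) and the object names are our assumptions.

# logistic regression using all available inputs
model2 <- glm(Y ~ ., data = XY_train, family = binomial("logit"))

# predicted probabilities and binary predictions on the test sample
Y_prob2 <- predict(model2, XY_test, type = "response")
Y_pred2 <- as.numeric(Y_prob2 > 0.5)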

## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
##          Acc         Prec          Rec            F 
##    0.1872260    0.1863237    0.9966443    0.3139535 
##           TN           FN           FP           TP 
##    2.0000000    1.0000000 1297.0000000  297.0000000

4.4.4 Loss Function


The fitting of the model can be written as an optimisation task:

\[ \min_{\beta_0, \beta_1,\dots, \beta_p\in\mathbb{R}} \frac{1}{n} \sum_{i=1}^n e\left(\frac{1}{1+e^{-(\beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p})}}, y_i \right) \]

where \(e(\hat{y}, y)\) denotes the penalty that measures the “difference” between the true \(y\) and its predicted version \(\hat{y}\).

In ordinary regression, we used the squared residual, \(e(\hat{y}, y) = (\hat{y}-y)^2\).

In logistic regression (the kind of classifier we are interested in here), we use the cross-entropy (a.k.a. log-loss),

\[ e(\hat{y},y) = - \left(y\log \hat{y} + (1-y)\log(1-\hat{y})\right) \]

The corresponding loss function has not only nice statistical properties (**) but also an intuitive interpretation.


Note that the predicted \(\hat{y}\) is in \((0,1)\), while the true \(y\) equals either 0 or 1.

Recall also that \(\log t\in(-\infty, 0)\) for \(t\in (0,1)\).

\[ e(\hat{y},y) = - \left(y\log \hat{y} + (1-y)\log(1-\hat{y})\right) \]

  • if true \(y=1\), then the penalty becomes \(e(\hat{y}, 1) = -\log(\hat{y})\)

    • \(\hat{y}\) is the probability that the classified input is indeed from class \(1\)
    • we’d be happy if the classifier outputted \(\hat{y}\simeq 1\) in this case; this is not penalised as \(-\log(t)\to 0\) as \(t\to 1\)
    • however, if the classifier is totally wrong, i.e., it thinks that \(\hat{y}\simeq 0\), then the penalty will be very high, as \(-\log(t)\to+\infty\) as \(t\to 0\)
  • if true \(y=0\), then the penalty becomes \(e(\hat{y}, 0) = -\log(1-\hat{y})\)

    • \(1-\hat{y}\) is the predicted probability that the input is from class \(0\)
    • we penalise heavily the case where \(1-\hat{y}\) is small (we’d be happy if the classifier was sure that \(1-\hat{y}\simeq 1\), because this is the ground-truth)
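
To make the two cases above tangible, here is a small numeric sketch (our own illustration) evaluating the penalty for a few predicted probabilities:

# cross-entropy (log-loss) for a single observation
cross_entropy <- function(y_hat, y) -(y * log(y_hat) + (1 - y) * log(1 - y_hat))

cross_entropy(0.99, 1)   # confident and correct -> ca. 0.01
cross_entropy(0.50, 1)   # undecided             -> ca. 0.69
cross_entropy(0.01, 1)   # confident but wrong   -> ca. 4.61
cross_entropy(0.01, 0)   # confident and correct -> ca. 0.01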

(*) Interestingly, there is no analytical formula for the optimal set of parameters (\(\beta_0,\beta_1,\dots,\beta_p\)) minimising the log-loss.

In the chapter on optimisation, we shall see that the logistic regression problem can be solved numerically by means of quite simple iterative algorithms.
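
As a preview, and purely as an illustrative sketch on synthetic data (not the data set used in this chapter), we can minimise the average log-loss directly with a general-purpose optimiser such as optim() and verify that the result agrees with glm():

set.seed(123)

# synthetic data: one input, true coefficients beta0 = -1, beta1 = 2
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, 1 / (1 + exp(-(-1 + 2 * x))))

# average cross-entropy as a function of the parameters (beta0, beta1)
log_loss <- function(beta) {
    y_hat <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))
    y_hat <- pmin(pmax(y_hat, 1e-12), 1 - 1e-12)  # guard against log(0)
    -mean(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}

# minimise numerically, starting from (0, 0)
optim(c(0, 0), log_loss, method = "BFGS")$par

# compare with the coefficients reported by glm()
coef(glm(y ~ x, family = binomial("logit")))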