## 4.4 Binary Logistic Regression

### 4.4.1 Motivation

Recall that for a regression task, we fitted a very simple family of models – the linear ones – by minimising the sum of squared residuals.

This approach was pretty effective.

Theoretically, we could treat the class labels as numeric $$0$$s and $$1$$s and apply regression models in a binary classification task.

XY_train_r <- as.data.frame(cbind(X_train,
    Y=as.numeric(as.character(Y_train))  # 0.0 or 1.0
))
f <- lm(Y~., data=XY_train_r)

Y_pred <- predict(f, as.data.frame(X_test))
summary(Y_pred)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -0.76785  0.03625  0.18985  0.20120  0.34916  0.95385

The predicted outputs, $$\hat{Y}$$, are arbitrary real numbers, but we can convert them to binary ones by checking if, e.g., $$\hat{Y}>0.5$$.

Y_pred <- as.numeric(Y_pred>0.5)
get_metrics(Y_test, Y_pred)
##          Acc         Prec          Rec            F
##    0.8303068    0.6194690    0.2348993    0.3406326
##           TN           FN           FP           TP
## 1256.0000000  228.0000000   43.0000000   70.0000000

The threshold $$T=0.5$$ could even be treated as a free parameter we optimise for (w.r.t. different metrics over the validation sample).
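For illustration, a minimal sketch of such a tuning loop is given below; it assumes we have set aside a separate (hypothetical) validation sample, X_valid and Y_valid, and reuses the get_metrics function and the fitted model f from above.

Y_raw <- predict(f, as.data.frame(X_valid))         # real-valued outputs on the validation sample
Ts <- seq(0.05, 0.95, by=0.05)                      # candidate thresholds
Fs <- sapply(Ts, function(T)
    get_metrics(Y_valid, as.numeric(Y_raw>T))["F"]  # F-measure at threshold T
)
Ts[which.max(Fs)]                                   # the threshold with the highest F-measure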

### 4.4.2 Logistic Model

Inspired by this idea, we could try modelling the probability that a given point belongs to class $$1$$.

This would also provide us with a measure of confidence in our predictions.

A probability is a number in $$[0,1]$$, whereas the output of a linear model can be an arbitrary real number.

However, we could transform the real-valued outputs by means of some function $$\phi:\mathbb{R}\to[0,1]$$ (preferably an S-shaped, i.e., sigmoid, one), so as to get:

$$\Pr(Y=1|\mathbf{X},\boldsymbol\beta)=\phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)$$

The above reads as “Probability that $$Y$$ is from class 1 given $$\mathbf{X}$$ and $$\boldsymbol\beta$$”.

A popular choice is the logistic sigmoid function,

$$\phi(y) = \frac{1}{1+e^{-y}} = \frac{e^y}{1+e^y}$$
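To get a feel for this function, here is a tiny, purely illustrative snippet in R (it is not needed for model fitting):

phi <- function(y) 1/(1+exp(-y))  # the logistic sigmoid
phi(c(-10, 0, 10))                # close to 0, exactly 0.5, close to 1
curve(phi, from=-6, to=6)         # an S-shaped curve rising from 0 to 1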

Hence our model becomes:

$$\Pr(Y=1|\mathbf{X},\boldsymbol\beta)=\frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)}}$$

We call it a generalised linear model (glm).

### 4.4.3 Example in R

Let us first fit a simple (i.e., $$p=1$$) logistic regression model using the alcohol variable.

“logit” below denotes the link function, i.e., the inverse of the logistic sigmoid, $$\mathrm{logit}(p)=\log\frac{p}{1-p}$$.
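As a quick, purely illustrative sanity check, we can verify that the logit indeed undoes the sigmoid:

logit <- function(p) log(p/(1-p))  # maps probabilities in (0,1) back onto the real line
logit(1/(1+exp(-3)))               # recovers 3, i.e., logit(phi(3)) == 3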

f <- glm(Y~alcohol, data=XY_train, family=binomial("logit"))
summary(f)
##
## Call:
## glm(formula = Y ~ alcohol, family = binomial("logit"), data = XY_train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.8703  -0.5994  -0.3984  -0.2770   2.7463
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.94770    0.46819  -25.52   <2e-16 ***
## alcohol       0.96469    0.04166   23.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 3630.9  on 3722  degrees of freedom
## Residual deviance: 2963.4  on 3721  degrees of freedom
## AIC: 2967.4
##
## Number of Fisher Scoring iterations: 5

Some predicted probabilities:

head(predict(f, XY_test, type="response"))
##          1          2          3          4          5
## 0.07629709 0.07629709 0.05316971 0.05824087 0.13961713
##          6
## 0.13961713
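As a cross-check, these values can be reproduced by hand from the estimated coefficients (assuming, as in the call above, that XY_test stores the alcohol variable):

b <- coef(f)                                     # the estimated (Intercept) and alcohol coefficients
head(1/(1+exp(-(b[1] + b[2]*XY_test$alcohol))))  # the same as predict(..., type="response")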

We classify $$Y$$ as 1 if the corresponding membership probability is greater than $$0.5$$.

Y_pred <- as.numeric(predict(f, XY_test, type="response")>0.5)
get_metrics(Y_test, Y_pred)
##          Acc         Prec          Rec            F
##    0.8127740    0.4971751    0.2953020    0.3705263
##           TN           FN           FP           TP
## 1210.0000000  210.0000000   89.0000000   88.0000000

And now a fit based on all the input variables:

f <- glm(Y~., data=XY_train, family=binomial("logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
Y_pred <- as.numeric(predict(f, XY_test, type="response")>0.5)
get_metrics(Y_test, Y_pred)
##          Acc         Prec          Rec            F
##    0.1872260    0.1863237    0.9966443    0.3139535
##           TN           FN           FP           TP
##    2.0000000    1.0000000 1297.0000000  297.0000000

### 4.4.4 Loss Function

The fitting of the model can be written as an optimisation task:

$\min_{\beta_0, \beta_1,\dots, \beta_p\in\mathbb{R}} \frac{1}{n} \sum_{i=1}^n e\left(\frac{1}{1+e^{-(\beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p})}}, y_i \right)$

where $$e(\hat{y}, y)$$ denotes the penalty that measures the “difference” between the true $$y$$ and its predicted version $$\hat{y}$$.

In ordinary regression, we use the squared residual $$e(\hat{y}, y) = (\hat{y}-y)^2$$.

In logistic regression (the kind of classifier we are interested in right now), we use the cross-entropy (a.k.a. the log-loss),

$$e(\hat{y},y) = - \left(y\log \hat{y} + (1-y)\log(1-\hat{y})\right)$$

The corresponding loss function has not only nice statistical properties (**) but also an intuitive interpretation.

Note that the predicted $$\hat{y}$$ is in $$(0,1)$$ and the true $$y$$ equals either 0 or 1.

Recall also that $$\log t\in(-\infty, 0)$$ for $$t\in (0,1)$$.

$$e(\hat{y},y) = - \left(y\log \hat{y} + (1-y)\log(1-\hat{y})\right)$$

- if true $$y=1$$, then the penalty becomes $$e(\hat{y}, 1) = -\log(\hat{y})$$
    - $$\hat{y}$$ is the probability that the classified input is indeed from class $$1$$
    - we’d be happy if the classifier outputted $$\hat{y}\simeq 1$$ in this case; this is not penalised, as $$-\log(t)\to 0$$ as $$t\to 1$$
    - however, if the classifier is totally wrong, i.e., it thinks that $$\hat{y}\simeq 0$$, then the penalty will be very high, as $$-\log(t)\to+\infty$$ as $$t\to 0$$
- if true $$y=0$$, then the penalty becomes $$e(\hat{y}, 0) = -\log(1-\hat{y})$$
    - $$1-\hat{y}$$ is the predicted probability that the input is from class $$0$$
    - we penalise heavily the case where $$1-\hat{y}$$ is small (we’d be happy if the classifier was sure that $$1-\hat{y}\simeq 1$$, because this is the ground truth)
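To make this behaviour concrete, here is a small, illustrative implementation of the (mean) log-loss in R, evaluated on two toy prediction vectors:

log_loss <- function(y_pred, y_true) {  # mean cross-entropy (log-loss)
    -mean(y_true*log(y_pred) + (1-y_true)*log(1-y_pred))
}
log_loss(c(0.9, 0.2), c(1, 0))  # confident and correct: small penalty
log_loss(c(0.1, 0.8), c(1, 0))  # confident but wrong: large penalty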

(*) Interestingly, there is no analytical formula for the optimal set of parameters ($$\beta_0,\beta_1,\dots,\beta_p$$) minimising the log-loss.

In the chapter on optimisation, we shall see that the logistic regression problem can be solved numerically by means of quite simple iterative algorithms.
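As a teaser, here is a rough sketch of such a fit for the alcohol-only model, using R’s general-purpose optimiser optim() (it assumes, as above, that XY_train stores the alcohol column and the factor Y); the resulting coefficients should be close to those reported by glm() earlier.

nll <- function(beta) {  # mean log-loss as a function of (beta_0, beta_1)
    p <- 1/(1+exp(-(beta[1] + beta[2]*XY_train$alcohol)))
    p <- pmin(pmax(p, 1e-12), 1-1e-12)         # guard against log(0)
    y <- as.numeric(as.character(XY_train$Y))  # factor labels to 0/1
    -mean(y*log(p) + (1-y)*log(1-p))
}
optim(c(0, 0), nll, control=list(maxit=10000))$par  # compare with the glm() coefficients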