## 2.4 Outro

### 2.4.1 Remarks

Multiple regression is simple, fast to apply and interpretable.

Linear models go beyond fitting of straight lines and other hyperplanes!

A complex model may overfit and hence generalise poorly to unobserved inputs.

Note that the SSR criterion makes the models sensitive to outliers.

Remember:

good models $=$ better understanding of the modelled reality $$+$$ better predictions $=$ more revenue, your boss’ happiness, your startup’s growth etc.

### 2.4.2 Other Methods for Regression

Other example approaches to regression:

• ridge regression,
• lasso regression,
• least absolute deviations (LAD) regression,
• multiadaptive regression splines (MARS),
• K-nearest neighbour (K-NN) regression, see FNN::knn.reg() in R,
• regression trees,
• support-vector regression (SVR),
• neural networks (also deep) for regression.

### 2.4.3 Derivation of the Solution (**)

We would like to find an analytical solution to the problem of minimising of the sum of squared residuals:

$\min_{\beta_0, \beta_1,\dots, \beta_p\in\mathbb{R}} E(\beta_0, \beta_1, \dots, \beta_p)= \sum_{i=1}^n \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)^2$

This requires computing the $$p+1$$ partial derivatives $${\partial E}/{\partial \beta_j}$$ for $$j=0,\dots,p$$.

The partial derivatives are very similar to each other; $$\frac{\partial E}{\partial \beta_0}$$ is given by: $\frac{\partial E}{\partial \beta_0}(\beta_0,\beta_1,\dots,\beta_p)= 2 \sum_{i=1}^n \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)$ and $$\frac{\partial E}{\partial \beta_j}$$ for $$j>0$$ is equal to: $\frac{\partial E}{\partial \beta_j}(\beta_0,\beta_1,\dots,\beta_p)= 2 \sum_{i=1}^n x_{i,j} \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)$

Then all we need to do is to solve the system of linear equations:

$\left\{ \begin{array}{rcl} \frac{\partial E}{\partial \beta_0}(\beta_0,\beta_1,\dots,\beta_p)&=&0 \\ \frac{\partial E}{\partial \beta_1}(\beta_0,\beta_1,\dots,\beta_p)&=&0 \\ \vdots\\ \frac{\partial E}{\partial \beta_p}(\beta_0,\beta_1,\dots,\beta_p)&=&0 \\ \end{array} \right.$

The above system of $$p+1$$ linear equations, which we are supposed to solve for $$\beta_0,\beta_1,\dots,\beta_p$$: $\left\{ \begin{array}{rcl} 2 \sum_{i=1}^n \phantom{x_{i,0}}\left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)&=&0 \\ 2 \sum_{i=1}^n x_{i,1} \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)&=&0 \\ \vdots\\ 2 \sum_{i=1}^n x_{i,p} \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)&=&0 \\ \end{array} \right.$ can be rewritten as: $\left\{ \begin{array}{rcl} \sum_{i=1}^n \phantom{x_{i,0}}\left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p}\right)&=& \sum_{i=1}^n \phantom{x_{i,0}} y_i \\ \sum_{i=1}^n x_{i,1} \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p}\right)&=&\sum_{i=1}^n x_{i,1} y_i \\ \vdots\\ \sum_{i=1}^n x_{i,p} \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p}\right)&=&\sum_{i=1}^n x_{i,p} y_i \\ \end{array} \right.$

and further as: $\left\{ \begin{array}{rcl} \beta_0\ n\phantom{\sum_{i=1}^n x} + \beta_1\sum_{i=1}^n \phantom{x_{i,0}} x_{i,1}+\dots+\beta_p \sum_{i=1}^n \phantom{x_{i,0}} x_{i,p} &=&\sum_{i=1}^n\phantom{x_{i,0}} y_i \\ \beta_0 \sum_{i=1}^n x_{i,1} + \beta_1\sum_{i=1}^n x_{i,1} x_{i,1}+\dots+\beta_p \sum_{i=1}^n x_{i,1} x_{i,p} &=&\sum_{i=1}^n x_{i,1} y_i \\ \vdots\\ \beta_0 \sum_{i=1}^n x_{i,p} + \beta_1\sum_{i=1}^n x_{i,p} x_{i,1}+\dots+\beta_p \sum_{i=1}^n x_{i,p} x_{i,p} &=&\sum_{i=1}^n x_{i,p} y_i \\ \end{array} \right.$ Note that the terms involving $$x_{i,j}$$ and $$y_i$$ (the sums) are all constant – these are some fixed real numbers. We have learned how to solve such problems in high school.

Try deriving the analytical solution and implementing it for $$p=2$$. Recall that in the previous chapter we solved the special case of $$p=1$$.

### 2.4.4 Solution in Matrix Form (***)

Assume that $$\mathbf{X}\in\mathbb{R}^{n\times p}$$ (a matrix with inputs), $$\mathbf{y}\in\mathbb{R}^{n\times 1}$$ (a column vector of reference outputs) and $$\boldsymbol{\beta}\in\mathbb{R}^{(p+1)\times 1}$$ (a column vector of parameters).

Firstly, note that a linear model of the form: $f_{\boldsymbol\beta}(\mathbf{x})=\beta_0+\beta_1 x_1+\dots+\beta_p x_p$ can be rewritten as: $f_{\boldsymbol\beta}(\mathbf{x})=\beta_0 1+\beta_1 x_1+\dots+\beta_p x_p =\mathbf{\dot{x}}\boldsymbol\beta,$ where $$\mathbf{\dot{x}}=[1\ x_1\ x_2\ \cdots\ x_p]$$.

Similarly, if we assume that $$\mathbf{\dot{X}}=[\boldsymbol{1}\ \mathbf{X}]\in\mathbb{R}^{n\times (p+1)}$$ is the input matrix with a prepended column of $$1$$s, i.e., $$\boldsymbol{1}=[1\ 1\ \cdots\ 1]^T$$ and $$\dot{x}_{i,0}=1$$ (for brevity of notation the columns added will have index $$0$$), $$\dot{x}_{i,j}=x_{i,j}$$ for all $$j\ge 1$$ and all $$i$$, then: $\mathbf{\hat{y}} = \mathbf{\dot{X}} \boldsymbol\beta$ gives the vector of predicted outputs for every input point.

This way, the sum of squared residuals $E(\beta_0, \beta_1, \dots, \beta_p)= \sum_{i=1}^n \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)^2$ can be rewritten as: $E(\boldsymbol\beta)=\| \mathbf{\dot{X}} \boldsymbol\beta - \mathbf{y} \|^2,$ where as usual $$\|\cdot\|^2$$ denotes the squared Euclidean norm.

Recall that this can be re-expressed as: $E(\boldsymbol\beta)= (\mathbf{\dot{X}} \boldsymbol\beta - \mathbf{y})^T (\mathbf{\dot{X}} \boldsymbol\beta - \mathbf{y}).$

In order to find the minimum of $$E$$ w.r.t. $$\boldsymbol\beta$$, we need to find the parameters that make the partial derivatives vanish, i.e.:

$\left\{ \begin{array}{rcl} \frac{\partial E}{\partial \beta_0}(\boldsymbol\beta)&=&0 \\ \frac{\partial E}{\partial \beta_1}(\boldsymbol\beta)&=&0 \\ \vdots\\ \frac{\partial E}{\partial \beta_p}(\boldsymbol\beta)&=&0 \\ \end{array} \right.$

(***) Interestingly, the above can also be expressed in matrix form, using the special notation: $\nabla E(\boldsymbol\beta) = \boldsymbol{0}$ Here, $$\nabla E$$ (nabla symbol = differential operator) denotes the function gradient, i.e., the vector of all partial derivatives. This is nothing more than syntactic sugar for this quite commonly applied operator.

Anyway, the system of linear equations we have derived above: $\left\{ \begin{array}{rcl} \beta_0\ n\phantom{\sum_{i=1}^n x} + \beta_1\sum_{i=1}^n \phantom{x_{i,0}} x_{i,1}+\dots+\beta_p \sum_{i=1}^n \phantom{x_{i,0}} x_{i,p} &=&\sum_{i=1}^n\phantom{x_{i,0}} y_i \\ \beta_0 \sum_{i=1}^n x_{i,1} + \beta_1\sum_{i=1}^n x_{i,1} x_{i,1}+\dots+\beta_p \sum_{i=1}^n x_{i,1} x_{i,p} &=&\sum_{i=1}^n x_{i,1} y_i \\ \vdots\\ \beta_0 \sum_{i=1}^n x_{i,p} + \beta_1\sum_{i=1}^n x_{i,p} x_{i,1}+\dots+\beta_p \sum_{i=1}^n x_{i,p} x_{i,p} &=&\sum_{i=1}^n x_{i,p} y_i \\ \end{array} \right.$ can be rewritten in matrix terms as: $\left\{ \begin{array}{rcl} \beta_0 \mathbf{\dot{x}}_{\cdot,0}^T \mathbf{\dot{x}}_{\cdot,0} + \beta_1 \mathbf{\dot{x}}_{\cdot,0}^T \mathbf{\dot{x}}_{\cdot,1}+\dots+\beta_p \mathbf{\dot{x}}_{\cdot,0}^T \mathbf{\dot{x}}_{\cdot,p} &=& \mathbf{\dot{x}}_{\cdot,0}^T \mathbf{y} \\ \beta_0 \mathbf{\dot{x}}_{\cdot,1}^T \mathbf{\dot{x}}_{\cdot,0} + \beta_1 \mathbf{\dot{x}}_{\cdot,1}^T \mathbf{\dot{x}}_{\cdot,1}+\dots+\beta_p \mathbf{\dot{x}}_{\cdot,1}^T \mathbf{\dot{x}}_{\cdot,p} &=& \mathbf{\dot{x}}_{\cdot,1}^T \mathbf{y} \\ \vdots\\ \beta_0 \mathbf{\dot{x}}_{\cdot,p}^T \mathbf{\dot{x}}_{\cdot,0} + \beta_1 \mathbf{\dot{x}}_{\cdot,p}^T \mathbf{\dot{x}}_{\cdot,1}+\dots+\beta_p \mathbf{\dot{x}}_{\cdot,p}^T \mathbf{\dot{x}}_{\cdot,p} &=& \mathbf{\dot{x}}_{\cdot,p}^T \mathbf{y}\\ \end{array} \right.$

This can be restated as: $\left\{ \begin{array}{rcl} \left(\mathbf{\dot{x}}_{\cdot,0}^T \mathbf{\dot{X}}\right)\, \boldsymbol\beta &=& \mathbf{\dot{x}}_{\cdot,0}^T \mathbf{y} \\ \left(\mathbf{\dot{x}}_{\cdot,1}^T \mathbf{\dot{X}}\right)\, \boldsymbol\beta &=& \mathbf{\dot{x}}_{\cdot,1}^T \mathbf{y} \\ \vdots\\ \left(\mathbf{\dot{x}}_{\cdot,p}^T \mathbf{\dot{X}}\right)\, \boldsymbol\beta &=& \mathbf{\dot{x}}_{\cdot,p}^T \mathbf{y}\\ \end{array} \right.$ which in turn is equivalent to: $\left(\mathbf{\dot{X}}^T\mathbf{X}\right)\,\boldsymbol\beta = \mathbf{\dot{X}}^T\mathbf{y}.$

Such a system of linear equations in matrix form can be solved numerically using, amongst others, the solve() function.

(***) In practice, we’d rather rely on QR or SVD decompositions of matrices for efficiency and numerical accuracy reasons.

Numeric example – solution via lm():

X1 <- as.numeric(Credit$Balance[Credit$Balance>0])
X2 <- as.numeric(Credit$Income[Credit$Balance>0])
Y  <- as.numeric(Credit$Rating[Credit$Balance>0])
lm(Y~X1+X2)$coefficients ## (Intercept) X1 X2 ## 172.5586670 0.1828011 2.1976461 Recalling that $$\mathbf{A}^T \mathbf{B}$$ can be computed by calling t(A) %*% B or – even faster – by calling crossprod(A, B), we can also use solve() to obtain the same result: X_dot <- cbind(1, X1, X2) solve( crossprod(X_dot, X_dot), crossprod(X_dot, Y) ) ## [,1] ## 172.5586670 ## X1 0.1828011 ## X2 2.1976461 ### 2.4.5 Pearson’s r in Matrix Form (**) Recall the Pearson linear correlation coefficient: $r(\boldsymbol{x},\boldsymbol{y}) = \frac{ \sum_{i=1}^n (x_i-\bar{x}) (y_i-\bar{y}) }{ \sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}\ \sqrt{\sum_{i=1}^n (y_i-\bar{y})^2} }$ Denote with $$\boldsymbol{x}^\circ$$ and $$\boldsymbol{y}^\circ$$ the centred versions of $$\boldsymbol{x}$$ and $$\boldsymbol{y}$$, respectively, i.e., $$x_i^\circ=x_i-\bar{x}$$ and $$y_i^\circ=y_i-\bar{y}$$. Rewriting the above yields: $r(\boldsymbol{x},\boldsymbol{y}) = \frac{ \sum_{i=1}^n x_i^\circ y_i^\circ }{ \sqrt{\sum_{i=1}^n ({x_i^\circ})^2}\ \sqrt{\sum_{i=1}^n ({y_i^\circ})^2} }$ which is exactly: $r(\boldsymbol{x},\boldsymbol{y}) = \frac{ \boldsymbol{x}^\circ\cdot \boldsymbol{y}^\circ }{ \| \boldsymbol{x}^\circ \|\ \| \boldsymbol{y}^\circ \| }$ i.e., the normalised dot product of the centred versions of the two vectors. This is the cosine of the angle between the two vectors (in $$n$$-dimensional spaces)! (**) Recalling from the previous chapter that $$\mathbf{A}^T \mathbf{A}$$ gives the dot product between all the pairs of columns in a matrix $$\mathbf{A}$$, we can implement an equivalent version of cor(C) as follows: C <- Credit[Credit$Balance>0,
c("Rating", "Limit", "Income", "Age",
"Education", "Balance")]
C_centred <- apply(C, 2, function(c) c-mean(c))
C_normalised <- apply(C_centred, 2, function(c)
c/sqrt(sum(c^2)))
round(t(C_normalised) %*% C_normalised, 3)
##           Rating  Limit Income   Age Education Balance
## Rating     1.000  0.996  0.831 0.167    -0.040   0.798
## Limit      0.996  1.000  0.834 0.164    -0.032   0.796
## Income     0.831  0.834  1.000 0.227    -0.033   0.414
## Age        0.167  0.164  0.227 1.000     0.024   0.008
## Education -0.040 -0.032 -0.033 0.024     1.000   0.001
## Balance    0.798  0.796  0.414 0.008     0.001   1.000

### 2.4.6 Further Reading

Recommended further reading:

• (James et al. 2017: Chapters 1, 2 and 3)

Other:

• (Hastie et al. 2017: Chapter 1, Sections 3.2 and 3.3)