2 Multiple Regression

2.1 Introduction

2.1.1 Formalism

Let \(\mathbf{X}\in\mathbb{R}^{n\times p}\) be an input matrix that consists of \(n\) points in a \(p\)-dimensional space.

In other words, we have a database on \(n\) objects, each of which is described by means of \(p\) numerical features.

\[ \mathbf{X}= \left[ \begin{array}{cccc} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \\ \end{array} \right] \]

Recall that in supervised learning, apart from \(\mathbf{X}\), we are also given the corresponding \(\mathbf{y}\); with each input point \(\mathbf{x}_{i,\cdot}\) we associate the desired output \(y_i\).

In this chapter we are still interested in regression tasks; hence, we assume that each \(y_i\) is a real number, i.e., \(y_i\in\mathbb{R}\).

Hence, our dataset is \([\mathbf{X}\ \mathbf{y}]\) – where each object is represented as a row vector \([\mathbf{x}_{i,\cdot}\ y_i]\), \(i=1,\dots,n\):

\[ [\mathbf{X}\ \mathbf{y}]= \left[ \begin{array}{ccccc} x_{1,1} & x_{1,2} & \cdots & x_{1,p} & y_1\\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} & y_2\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} & y_n\\ \end{array} \right]. \]

2.1.2 Simple Linear Regression – Recap

In a simple regression task, we have assumed that \(p=1\) – there is only one independent variable, denoted \(x_i=x_{i,1}\).

We restricted ourselves to linear models of the form \(Y=f(X)=a+bX\) that minimised the sum of squared residuals (SSR), i.e.,

\[ \min_{a,b\in\mathbb{R}} \sum_{i=1}^n \left( a+bx_i-y_i \right)^2. \]

The solution is:

\[ \left\{ \begin{array}{rl} b = & \dfrac{ n \displaystyle\sum_{i=1}^n x_i y_i - \displaystyle\sum_{i=1}^n y_i \displaystyle\sum_{i=1}^n x_i }{ n \displaystyle\sum_{i=1}^n x_i x_i - \displaystyle\sum_{i=1}^n x_i\displaystyle\sum_{i=1}^n x_i }\\ a = & \dfrac{1}{n}\displaystyle\sum_{i=1}^n y_i - b \dfrac{1}{n} \displaystyle\sum_{i=1}^n x_i \\ \end{array} \right. \]

Fitting in R can be performed by calling the lm() function:
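
A minimal call could look as follows (a sketch: we guess that Y and X hold the Rating and Balance columns of the Credit dataset, as suggested by the output below):

f <- lm(Y ~ X)   # least squares fit of Y = a + b*X
print(f)         # the estimated intercept and slope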

## 
## Call:
## lm(formula = Y ~ X)
## 
## Coefficients:
## (Intercept)            X  
##    226.4711       0.2661

Figure 1 gives the scatter plot of Y vs. X together with the fitted simple linear model.

Figure 1: Fitted regression line for the Credit dataset

2.2 Multiple Linear Regression

2.2.1 Problem Formulation

Let’s now generalise the above to the case of many variables \(X_1, \dots, X_p\).

We wish to model the dependent variable as a function of \(p\) independent variables. \[ Y = f(X_1,\dots,X_p) \qquad (+\varepsilon) \]

Restricting ourselves to the class of linear models, we have \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p. \]

Above we studied the case where \(p=1\), i.e., \(Y=a+bX_1\) with \(\beta_0=a\) and \(\beta_1=b\).

The above equation defines:

  • \(p=1\) — a line (see Figure 1),
  • \(p=2\) — a plane (see Figure 2),
  • \(p\ge 3\) — a hyperplane (well, most people find it difficult to imagine objects in high dimensions, but we are lucky to have this thing called maths).

Figure 2: Fitted regression plane for the Credit dataset

2.2.2 Fitting a Linear Model in R

lm() accepts a formula of the form Y~X1+X2+...+Xp.

It finds the least squares fit, i.e., solves \[ \min_{\beta_0, \beta_1,\dots, \beta_p\in\mathbb{R}} \sum_{i=1}^n \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_i \right) ^2 \]
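
For instance (a sketch; X1 and X2 are assumed to store the two predictors, e.g., Balance and Income):

f12 <- lm(Y ~ X1 + X2)   # least squares fit of Y = beta0 + beta1*X1 + beta2*X2
coef(f12)                # the estimated coefficients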

## (Intercept)          X1          X2 
## 172.5586670   0.1828011   2.1976461

By the way, the 3D scatter plot in Figure 2 was generated by calling:
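
A sketch of such a call, assuming the scatterplot3d package (the exact graphical parameters used for Figure 2 are our guess):

library("scatterplot3d")
s3d <- scatterplot3d(X1, X2, Y)   # 3D scatter plot; returns a list of helper functions
s3d$plane3d(f12)                  # add the plane determined by the fitted model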

(s3d is an R list; one of its elements, named plane3d, is a function object – this is perfectly legal)

2.3 Finding the Best Model

2.3.1 Model Diagnostics

Here is Rating (\(Y\)) as a function of Balance (\(X_1\), left-hand side of Figure 3) and Income (\(X_2\), right-hand side of Figure 3).

Figure 3: Scatter plots of \(Y\) vs. \(X_1\) and \(X_2\)

Moreover, Figure 4 depicts (in a hopefully readable manner) both \(X_1\) and \(X_2\) with Rating \(Y\) encoded with a colour (low ratings are green, high ratings are red; some rating values are explicitly printed out within the plot).

Figure 4: A heatmap for Rating as a function of Balance and Income; greens represent low credit ratings, whereas reds – high ones

Consider the three following models.

Formula Equation
Rating ~ Balance + Income \(Y=\beta_0 + \beta_1 X_1 + \beta_2 X_2\)
Rating ~ Balance \(Y=a + b X_1\) (\(\beta_0=a, \beta_1=b, \beta_2=0\))
Rating ~ Income \(Y=a + b X_2\) (\(\beta_0=a, \beta_1=0, \beta_2=b\))
## (Intercept)          X1          X2 
## 172.5586670   0.1828011   2.1976461
## (Intercept)          X1 
## 226.4711446   0.2661459
## (Intercept)          X2 
##  253.851416    3.025286

Which of the three models is the best? Of course, by using the word “best”, we commit ourselves to answering the question: best, but with respect to what kind of measure?

So far we have been fitting w.r.t. the SSR. As the multiple regression model generalises the two simple ones, the former must yield an SSR that is no worse. This is because, in the case of \(Y=\beta_0 + \beta_1 X_1 + \beta_2 X_2\), setting \(\beta_1=0\) (just one of uncountably many possible \(\beta_1\)s; if it happens to be the best choice, good for us) gives \(Y=a + b X_2\), whereas setting \(\beta_2=0\) gives \(Y=a + b X_1\).

## [1] 358260.6
## [1] 2132108
## [1] 1823473

We get that, in terms of the SSR, \(f_{12}\) is better than \(f_{2}\), which in turn is better than \(f_{1}\). However, these error values per se (sheer numbers) are not very meaningful on their own.

Remark.

Interpretability in ML has always been an important issue – think of the EU General Data Protection Regulation (GDPR), amongst others.

2.3.1.1 SSR, MSE, RMSE and MAE

The quality of fit can be assessed by performing some descriptive statistical analysis of the residuals, \(\hat{y}_i-y_i\), for \(i=1,\dots,n\).

I know how to summarise data on the residuals! Of course I should compute their arithmetic mean and I’m done with that task! Interestingly, the mean of the residuals in the least squares fit (this can be shown analytically) is always equal to \(0\): \[ \frac{1}{n} \sum_{i=1}^n (\hat{y}_i-y_i)=0. \] Therefore, we need a different metric.

Exercise.

(*) A proof of this fact is left as an exercise to the curious; assume \(p=1\) just as in the previous chapter and note that \(\hat{y}_i=a+b x_i\).
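
This can also be verified empirically (a sketch, assuming f12 is the lm object fitted above):

mean(residuals(f12))                # numerically zero (up to round-off error)
all.equal(mean(residuals(f12)), 0)  # TRUE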

## [1] -2.086704e-16
## [1] TRUE

We noted that the sum of squared residuals (SSR) is not interpretable, but the mean of squared residuals (MSR) – also called the mean squared error (MSE) regression loss – is a little better. Recall that the mean is defined as the sum divided by the number of samples.

\[ \mathrm{MSE}(f) = \frac{1}{n} \sum_{i=1}^n (f(\mathbf{x}_{i,\cdot})-y_i)^2. \]
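
For example (a sketch; f12, f1 and f2 denote the three fitted lm objects, in the order introduced above, and mse() is a helper of our own):

mse <- function(f) mean(residuals(f)^2)   # mean of the squared residuals
mse(f12); mse(f1); mse(f2)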

## [1] 1155.679
## [1] 6877.768
## [1] 5882.171

This tells us how much we err per sample, so at least this measure no longer depends on \(n\). However, if the original \(Y\)s are, say, in metres \([\mathrm{m}]\), the MSE is expressed in metres squared \([\mathrm{m}^2]\).

To account for that, we may consider the root mean squared error (RMSE): \[ \mathrm{RMSE}(f) = \sqrt{\frac{1}{n} \sum_{i=1}^n (f(\mathbf{x}_{i,\cdot})-y_i)^2}. \] This is just like with the sample variance vs. standard deviation – recall the latter is defined as the square root of the former.
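
For example (a sketch, reusing the three lm objects):

rmse <- function(f) sqrt(mean(residuals(f)^2))   # square root of the MSE
rmse(f12); rmse(f1); rmse(f2)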

## [1] 33.99528
## [1] 82.93231
## [1] 76.69531

The interpretation of the RMSE is rather quirky; it is some-sort-of-averaged deviance from the true rating (which is on the scale 0–1000, hence we see that the first model is not that bad). Recall that the square function is sensitive to large observations, hence, it penalises notable deviations more heavily.

As still we have a problem with finding something easily interpretable (your non-technical boss or client may ask you: but what do these numbers mean??), we suggest here that the mean absolute error (MAE; also called mean absolute deviations, MAD) might be a better idea than the above: \[ \mathrm{MAE}(f) = \frac{1}{n} \sum_{i=1}^n |f(\mathbf{x}_{i,\cdot})-y_i|. \]
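
For example (a sketch):

mae <- function(f) mean(abs(residuals(f)))   # mean of the absolute residuals
mae(f12); mae(f1); mae(f2)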

## [1] 22.86342
## [1] 61.48892
## [1] 64.1506

With the above we may say “On average, the predicted rating differs from the observed one by…”. That is good enough.

Remark.

(*) You may ask why we fit models so as to minimise the RMSE rather than the MAE (note that minimising the RMSE is the same as minimising the SSR – one is a strictly monotone transformation of the other, so this does not affect the solution). Well, fitting w.r.t. the MAE is possible. It turns out, however, that minimising the MAE is more computationally expensive and the solution may be numerically unstable. So it’s rarely an analyst’s first choice (assuming they are well-educated enough to know about the MAD regression task). However, it may be worth trying it out sometimes.

Sometimes we might prefer MAD regression to the classic one if our data is heavily contaminated by outliers. But in such cases it is worth checking if proper data cleansing does the trick.

2.3.1.2 Graphical Summaries of Residuals

If we are not happy with single numerical aggregates of the residuals or of their absolute values, we can (and should) always compute a whole bunch of descriptive statistics:
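
For instance (a sketch):

summary(residuals(f12))
summary(residuals(f1))
summary(residuals(f2))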

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -108.100   -1.940    7.812    0.000   20.249   50.623
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -226.75  -48.30  -10.08    0.00   42.58  268.74
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -195.156  -57.341   -1.284    0.000   64.013  175.344

The outputs generated by summary() include:

  • Min. – sample minimum
  • 1st Qu. – 1st quartile == 25th percentile == quantile of order 0.25
  • Median – median == 50th percentile == quantile of order 0.5
  • 3rd Qu. – 3rd quartile = 75th percentile == quantile of order 0.75
  • Max. – sample maximum

For example, the 1st quartile is a value \(q\) such that 25% of the observations are \(\le q\) and 75% are \(\ge q\); see ?quantile in R.

Graphically, it is nice to summarise the empirical distribution of the residuals on a box and whisker plot. Here is the key to decipher Figure 5:

  • IQR == Interquartile range == Q3\(-\)Q1 (box width)
  • The box contains 50% of the “most typical” observations
  • Box and whiskers altogether have width \(\le\) 4 IQR
  • Outliers == observations potentially worth inspecting (is it a bug or a feature?)

Figure 5: An example boxplot

Figure 6 is worth a thousand words:
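
A plot like this one could be generated with, e.g. (a sketch; the horizontal orientation is our choice):

boxplot(list(f12 = residuals(f12), f1 = residuals(f1), f2 = residuals(f2)),
    horizontal = TRUE, xlab = "residuals")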

Figure 6: Box plots of the residuals for the three models studied

Figure 7 gives a violin plot – a blend of a box plot and a (kernel) density estimator (histogram-like):

Figure 7: Violin plots of the residuals for the three models studied

We can also take a look at the absolute values of the residuals. Here are some descriptive statistics:

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   0.06457   6.46397  14.07055  22.86342  26.41772 108.09995
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.5056  19.6640  45.0716  61.4889  80.1239 268.7377
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.6545  29.8540  59.6756  64.1506  95.7384 195.1557

Figure 8 is worth $1000:

Figure 8: Box plots of the modules of the residuals for the three models studied

2.3.1.3 Coefficient of Determination (R-squared)

If we didn’t know the range of the dependent variable (in our case we do know that the credit rating is on the scale 0–1000), the RMSE or MAE would be hard to interpret.

It turns out that there is a popular normalised (unit-less) measure that is relatively easy to interpret with no domain-specific knowledge of the modelled problem. Namely, the (unadjusted) \(R^2\) score (the coefficient of determination) is given by:

\[ R^2(f) = 1 - \frac{\sum_{i=1}^{n} \left(y_i-f(\mathbf{x}_{i,\cdot})\right)^2}{\sum_{i=1}^{n} \left(y_i-\bar{y}\right)^2}, \] where \(\bar{y}\) is the arithmetic mean \(\frac{1}{n}\sum_{i=1}^n y_i\).
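
It can be computed from the definition or extracted from the summary of an lm object (a sketch; r2() is a helper of our own):

r2 <- function(f, y) 1 - sum(residuals(f)^2) / sum((y - mean(y))^2)
r2(f12, Y)               # from the definition
summary(f12)$r.squared   # the value reported by R – should agree
r2(f1, Y); r2(f2, Y)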

## [1] 0.9390901
## [1] 0.9390901
## [1] 0.6375085
## [1] 0.6899812

The coefficient of determination gives the proportion of variance of the dependent variable explained by independent variables in the model; \(R^2(f)\simeq 1\) indicates a perfect fit. The first model is a very good one, the simple models are “more or less okay”.

Unfortunately, \(R^2\) tends to automatically increase as the number of independent variables increases (recall that the more variables in the model, the better the SSR must be). To correct for this phenomenon, we sometimes consider the adjusted \(R^2\):

\[ \bar{R}^2(f) = 1 - (1-{R}^2(f))\frac{n-1}{n-p-1} \]
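
Again, a sketch (reusing the r2() helper defined above; r2adj() is another helper of our own):

r2adj <- function(f, y, p) 1 - (1 - r2(f, y)) * (length(y) - 1) / (length(y) - p - 1)
r2adj(f12, Y, 2)             # from the definition
summary(f12)$adj.r.squared   # the value reported by R
r2adj(f1, Y, 1); r2adj(f2, Y, 1)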

## [1] 0.9386933
## [1] 0.9386933
## [1] 0.6363316
## [1] 0.6889747

In other words, the adjusted \(R^2\) penalises for more complex models.

Remark.

(*) Side note – results of some statistical tests (e.g., significance of coefficients) are reported by calling summary(f12) etc.; refer to a more advanced source for more information. These, however, require the verification of some assumptions regarding the input data and the residuals.

## 
## Call:
## lm(formula = Y ~ X1 + X2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -108.100   -1.940    7.812   20.249   50.623 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.726e+02  3.950e+00   43.69   <2e-16 ***
## X1          1.828e-01  5.159e-03   35.43   <2e-16 ***
## X2          2.198e+00  5.637e-02   38.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.16 on 307 degrees of freedom
## Multiple R-squared:  0.9391, Adjusted R-squared:  0.9387 
## F-statistic:  2367 on 2 and 307 DF,  p-value: < 2.2e-16

2.3.1.4 Residuals vs. Fitted Plot

We can also create scatter plots of the residuals (predicted \(\hat{y}_i\) minus true \(y_i\)) as a function of the predicted \(\hat{y}_i=f(\mathbf{x}_{i,\cdot})\), see Figure 9.
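
Such a plot can be drawn with, e.g. (a sketch for the first model only; the graphical details are our choice):

plot(fitted(f12), residuals(f12), xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = "dashed")   # a reference line at 0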

Figure 9: Residuals vs. fitted outputs for the three regression models

Ideally (provided that the hypothesis that the dependent variable is indeed a linear function of the independent variable(s) is true), we would expect to see a point cloud that spreads around \(0\) in a rather disorderly fashion.

2.3.2 Variable Selection

Okay, up to now we’ve been considering the problem of modelling the Rating variable as a function of Balance and/or Income. However, in the Credit data set there are other variables possibly worth inspecting.

Consider all quantitative (numeric-continuous) variables in the Credit data set.

##   Rating Limit  Income Age Education Balance
## 1    283  3606  14.891  34        11     333
## 2    483  6645 106.025  82        15     903
## 3    514  7075 104.593  71        11     580
## 4    681  9504 148.924  36        11     964
## 5    357  4897  55.882  68        16     331
## 6    569  8047  80.180  77        10    1151

Obviously there are many possible combinations of the variables upon which regression models can be constructed (precisely, for \(p\) variables there are \(2^p\) such models). How do we choose the best set of inputs?

Remark.

We should already be suspicious at this point: wait… best requires some sort of criterion, right?

First, however, let’s draw a matrix of scatter plots for every pair of variables – so as to get an impression of how individual variables interact with each other, see Figure 10.

Figure 10: Scatter plot matrix for the Credit dataset

It seems like Rating depends on Limit almost linearly… We have a tool to actually quantify the degree of linear dependence between a pair of variables – Pearson’s \(r\) – the linear correlation coefficient:

\[ r(\boldsymbol{x},\boldsymbol{y}) = \frac{ \sum_{i=1}^n (x_i-\bar{x}) (y_i-\bar{y}) }{ \sqrt{\sum_{i=1}^n (x_i-\bar{x})^2} \sqrt{\sum_{i=1}^n (y_i-\bar{y})^2} }. \]

It holds \(r\in[-1,1]\), where:

  • \(r=1\) – positive linear dependence (\(y\) increases as \(x\) increases)
  • \(r=-1\) – negative linear dependence (\(y\) decreases as \(x\) increases)
  • \(r\simeq 0\) – uncorrelated or non-linearly dependent

Figure 11 gives an illustration of the above.

Figure 11: Different datasets and the corresponding Pearson’s \(r\) coefficients

To compute Pearson’s \(r\) between all pairs of variables, we call:
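
(a sketch; C is assumed to be the data frame with the six numeric columns previewed above)

round(cor(C), 3)   # Pearson's r between all pairs of columns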

##           Rating  Limit Income   Age Education Balance
## Rating     1.000  0.996  0.831 0.167    -0.040   0.798
## Limit      0.996  1.000  0.834 0.164    -0.032   0.796
## Income     0.831  0.834  1.000 0.227    -0.033   0.414
## Age        0.167  0.164  0.227 1.000     0.024   0.008
## Education -0.040 -0.032 -0.033 0.024     1.000   0.001
## Balance    0.798  0.796  0.414 0.008     0.001   1.000

Rating and Limit are almost perfectly linearly correlated, and both seem to describe the same thing.

For practical purposes, we’d rather model Rating as a function of the remaining variables (i.e., excluding Limit). For simple linear regression models, we’d choose either Income or Balance. How about multiple regression though?

The best model:

  • has high predictive power,
  • is simple.

These two criteria are often in conflict with each other.

Which variables should be included in the optimal model?

Again, the definition of the “best” object needs a fitness function.

For fitting a single model to data, we use the SSR.

We need a criterion that takes the number of independent variables into account.

Remark.

(*) Unfortunately, the adjusted \(R^2\), despite its interpretability, is not really suitable for this task. It does not penalise complex models heavily enough to be really useful.

Here we’ll be using the Akaike Information Criterion (AIC).

For a model \(f\) with \(p'\) independent variables: \[ \mathrm{AIC}(f) = 2(p'+1)+n\log(\mathrm{SSR}(f))-n\log n \]
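
For linear models, this is the quantity reported by extractAIC(); we can also compute it directly from the definition above (a sketch; aic() is a helper of our own):

aic <- function(f) {
    n <- length(residuals(f))
    p_prime <- length(coef(f)) - 1   # the number of independent variables
    2 * (p_prime + 1) + n * log(sum(residuals(f)^2)) - n * log(n)
}
aic(f12)             # from the definition above
extractAIC(f12)[2]   # the value used by step() below – should agree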

Our task is to find the combination of independent variables that minimises the AIC.

Remark.

(**) Note that this is a bi-level optimisation problem – for every considered combination of variables (which we look for), we must solve another problem of finding the best model involving these variables – the one that minimises the SSR. \[ \min_{s_1,s_2,\dots,s_p\in\{0, 1\}} \left( \begin{array}{l} 2\left(\displaystyle\sum_{j=1}^p s_j +1\right)+\\ n\log\left( \displaystyle\min_{\beta_0,\beta_1,\dots,\beta_p\in\mathbb{R}} \sum_{i=1}^n \left( \beta_0 + s_1\beta_1 x_{i,1} + \dots + s_p\beta_p x_{i,p} -y_i \right)^2 \right) \end{array} \right) \] We dropped the \(n\log n\) term, because it is always constant and hence doesn’t affect the solution. If \(s_j=0\), then the \(s_j\beta_j x_{i,j}\) term is equal to \(0\), and hence is not considered in the model. This plays the role of including (\(s_j=1\)) or omitting (\(s_j=0\)) the \(j\)-th variable in the model building exercise.

For \(p\) variables, the number of their possible combinations is equal to \(2^p\) (it grows exponentially with \(p\)). For large \(p\) (think big data), an exhaustive search is impractical (in our case we could get away with this though – left as an exercise to a slightly more advanced reader). Therefore, to find the variable combination minimising the AIC, we often rely on one of the two following greedy heuristics:

  • forward selection:

    1. start with an empty model
    2. find an independent variable whose addition to the current model would yield the highest decrease in the AIC and add it to the model
    3. repeat step 2 as long as the AIC decreases
  • backward elimination:

    1. start with the full model
    2. find an independent variable whose removal from the current model would decrease the AIC the most and eliminate it from the model
    3. repeat step 2 as long as the AIC decreases
Remark.

(**) The above bi-level optimisation problem can be solved by implementing a genetic algorithm – see a further chapter for more details.

Remark.

(*) There are of course many other methods which also perform some form of variable selection, e.g., lasso regression. But these minimise a different objective.

First, a forward selection example. We need a data sample to work with:

Then, a formula that represents a model with no variables (model from which we’ll start our search):

## Rating ~ 1

Last, we need a model that includes all the variables. We’re too lazy to list all of them manually, therefore, we can use the model.frame() function to generate a corresponding formula:

## Rating ~ Income + Age + Education + Balance

Now we are ready.
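
The search can be performed with the step() function (a sketch; model_empty and model_full denote the two formulas printed above, and C the data sample prepared earlier):

model_empty <- Rating ~ 1                                    # no independent variables
model_full  <- Rating ~ Income + Age + Education + Balance   # all of them
f_forward <- step(lm(model_empty, data = C),
    scope = model_full, direction = "forward")
print(f_forward)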

## Start:  AIC=3055.75
## Rating ~ 1
## 
##             Df Sum of Sq     RSS    AIC
## + Income     1   4058342 1823473 2694.7
## + Balance    1   3749707 2132108 2743.2
## + Age        1    164567 5717248 3048.9
## <none>                   5881815 3055.8
## + Education  1      9631 5872184 3057.2
## 
## Step:  AIC=2694.7
## Rating ~ Income
## 
##             Df Sum of Sq     RSS    AIC
## + Balance    1   1465212  358261 2192.3
## <none>                   1823473 2694.7
## + Age        1      2836 1820637 2696.2
## + Education  1      1063 1822410 2696.5
## 
## Step:  AIC=2192.26
## Rating ~ Income + Balance
## 
##             Df Sum of Sq    RSS    AIC
## + Age        1    4119.1 354141 2190.7
## + Education  1    2692.1 355568 2191.9
## <none>                   358261 2192.3
## 
## Step:  AIC=2190.67
## Rating ~ Income + Balance + Age
## 
##             Df Sum of Sq    RSS    AIC
## + Education  1    2925.7 351216 2190.1
## <none>                   354141 2190.7
## 
## Step:  AIC=2190.1
## Rating ~ Income + Balance + Age + Education
## 
## Call:
## lm(formula = Rating ~ Income + Balance + Age + Education, data = C)
## 
## Coefficients:
## (Intercept)       Income      Balance          Age  
##    173.8300       2.1668       0.1839       0.2234  
##   Education  
##     -0.9601
## Rating ~ Income + Age + Education + Balance

The full model has been selected.

And now for something completely different – a backward elimination example:
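
A sketch (this time we start from the full model and let step() drop variables):

f_backward <- step(lm(model_full, data = C), direction = "backward")
print(f_backward)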

## Start:  AIC=2190.1
## Rating ~ Income + Age + Education + Balance
## 
##             Df Sum of Sq     RSS    AIC
## <none>                    351216 2190.1
## - Education  1      2926  354141 2190.7
## - Age        1      4353  355568 2191.9
## - Balance    1   1468466 1819682 2698.1
## - Income     1   1617191 1968406 2722.4
## 
## Call:
## lm(formula = Rating ~ Income + Age + Education + Balance, data = C)
## 
## Coefficients:
## (Intercept)       Income          Age    Education  
##    173.8300       2.1668       0.2234      -0.9601  
##     Balance  
##      0.1839

The full model is considered the best again.

Forward selection example – full dataset:

## Start:  AIC=4034.31
## Rating ~ 1
## 
##             Df Sum of Sq     RSS    AIC
## + Balance    1   7124258 2427627 3488.4
## + Income     1   5982140 3569744 3642.6
## + Age        1    101661 9450224 4032.0
## <none>                   9551885 4034.3
## + Education  1      8675 9543210 4036.0
## 
## Step:  AIC=3488.38
## Rating ~ Balance
## 
##             Df Sum of Sq     RSS    AIC
## + Income     1   1859749  567878 2909.3
## + Age        1     98562 2329065 3473.8
## <none>                   2427627 3488.4
## + Education  1      5130 2422497 3489.5
## 
## Step:  AIC=2909.28
## Rating ~ Balance + Income
## 
##             Df Sum of Sq    RSS    AIC
## <none>                   567878 2909.3
## + Age        1    2142.4 565735 2909.8
## + Education  1    1208.6 566669 2910.4
## 
## Call:
## lm(formula = Rating ~ Balance + Income, data = C)
## 
## Coefficients:
## (Intercept)      Balance       Income  
##    145.3506       0.2129       2.1863

This procedure suggests including only the Balance and Income variables.

Backward elimination example – full dataset:

## Start:  AIC=2910.89
## Rating ~ Income + Age + Education + Balance
## 
##             Df Sum of Sq     RSS    AIC
## - Education  1      1238  565735 2909.8
## - Age        1      2172  566669 2910.4
## <none>                    564497 2910.9
## - Income     1   1759273 2323770 3474.9
## - Balance    1   2992164 3556661 3645.1
## 
## Step:  AIC=2909.77
## Rating ~ Income + Age + Balance
## 
##           Df Sum of Sq     RSS    AIC
## - Age      1      2142  567878 2909.3
## <none>                  565735 2909.8
## - Income   1   1763329 2329065 3473.8
## - Balance  1   2991523 3557259 3643.2
## 
## Step:  AIC=2909.28
## Rating ~ Income + Balance
## 
##           Df Sum of Sq     RSS    AIC
## <none>                  567878 2909.3
## - Income   1   1859749 2427627 3488.4
## - Balance  1   3001866 3569744 3642.6
## 
## Call:
## lm(formula = Rating ~ Income + Balance, data = C)
## 
## Coefficients:
## (Intercept)       Income      Balance  
##    145.3506       2.1863       0.2129

This procedure gives the same results as forward selection (however, for other data sets this might not necessarily be the case).

2.3.3 Variable Transformation

So far we have been fitting linear models of the form: \[ Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p. \]

What about some non-linear models such as polynomials etc.? For example: \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_1^3 + \beta_4 X_2. \]

Solution: pre-process inputs by setting \(X_1' := X_1\), \(X_2' := X_1^2\), \(X_3' := X_1^3\), \(X_4' := X_2\) and fit a linear model:

\[ Y = \beta_0 + \beta_1 X_1' + \beta_2 X_2' + \beta_3 X_3' + \beta_4 X_4'. \]

This trick works for every model of the form \(Y=\sum_{i=1}^k \sum_{j=1}^p \varphi_{i,j}(X_j)\) for any \(k\) and any univariate functions \(\varphi_{i,j}\).

Also, with a little creativity (and maths), we might be able to transform a few other models to a linear one, e.g.,

\[ Y = b e^{aX} \qquad \to \qquad \log Y = \log b + aX \qquad\to\qquad Y'=aX+b' \]

This is an example of a model’s linearisation. However, not every model can be linearised – in particular, one that involves functions that are not invertible cannot.

For example, here’s a series of simple (\(p=1\)) degree-\(d\) polynomial regression models of the form: \[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \dots + \beta_d X^d. \]

Such models can be fitted with the lm() function based on the formula of the form Y~poly(X, d, raw=TRUE) or Y~X+I(X^2)+I(X^3)+...
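
For example (a sketch; X and Y again denote Balance and Rating, and the object names are our choice – f1_10 is referred to below):

f1_1  <- lm(Y ~ poly(X, 1, raw = TRUE))    # a straight line
f1_3  <- lm(Y ~ poly(X, 3, raw = TRUE))    # a cubic polynomial
f1_10 <- lm(Y ~ poly(X, 10, raw = TRUE))   # a degree-10 polynomial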

Above we have fitted the polynomials of degrees 1, 3 and 10. Note that a polynomial of degree 1 is just a line.

Let us depict the three models:

Figure 12: Polynomials of different degrees fitted to the Credit dataset

From Figure 12 we see that there’s clearly a problem with the degree-10 polynomial.

2.3.4 Predictive vs. Descriptive Power

The above high-degree polynomial model (f1_10) is a typical instance of a phenomenon called overfitting.

Clearly (based on our expert knowledge), the Rating shouldn’t decrease as Balance increases.

In other words, f1_10 gives a better fit to data actually observed, but fails to produce good results for the points that are yet to come.

We say that it generalises poorly to unseen data.

Assume our true model is of the form \(Y=3X^3+5\).

Let’s generate the following random sample from this model (with \(Y\) subject to error), see Figure 13:
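
A sketch of how such a sample could be generated (the sample size, the domain and the noise level are our guesses):

set.seed(123)                          # for reproducibility
n <- 25
x <- runif(n, -1, 1)                   # random points in [-1, 1]
y <- 3 * x^3 + 5 + rnorm(n, 0, 0.2)    # the true model Y = 3X^3 + 5, plus Gaussian noise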

Figure 13: Synthetic data generated by means of the formula \(Y=3x^3+5\) (\(+\) noise)

Let’s fit polynomials of different degrees, see Figure 14.

Figure 14: Polynomials fitted to our synthetic dataset

Some of the polynomials are fitted too well!

Remark.

(*) The oscillation of the high-degree polynomials at the domain boundaries is known as the Runge phenomenon.

Compare the mean squared error (MSE) for the observed vs. future data points, see Figure 15.

Figure 15: MSE on the dataset used to construct the model vs. MSE on a whole range of points as function of the polynomial degree

Note the logarithmic scale on the \(y\) axis.

This is a very typical behaviour!

  • A model’s fit to observed data improves as the model’s complexity increases.

  • A model’s generalisation to unseen data initially improves, but then becomes worse.

  • In the above example, the sweet spot is at a polynomial of degree 3, which is exactly our true underlying model.

Hence, most often we should be interested in the accuracy of the predictions made in the case of unobserved data.

If we have a data set of a considerable size, we can divide it (randomly) into two parts:

  • training sample (say, 60% or 80%) – used to fit a model
  • test sample (the remaining 40% or 20%) – used to assess its quality (e.g., using MSE)
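
For instance, a random 60/40 split could be obtained as follows (a sketch; XY stands for any data frame with the inputs and outputs stored row-wise):

set.seed(321)
n <- nrow(XY)
idx_train <- sample(n, size = round(0.6 * n))   # indices of the training sample
XY_train <- XY[idx_train, ]
XY_test  <- XY[-idx_train, ]                    # the remaining 40%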

More on this issue in the chapter on Classification.

Remark.

(*) We shall see that sometimes a train-test-validate split will be necessary, e.g., 60-20-20%.

2.4 Exercises in R

2.4.1 Anscombe’s Quartet Revisited

Consider the anscombe database once again:

##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

Recall that in the previous chapter we split the above data into four data frames ans1, …, ans4, each with columns x and y.

Exercise.

In ans1, fit a regression line to the data set as-is.

Click here for a solution.

We’ve done that already, see Figure 16. What a wonderful exercise, thank you – effective learning is often done by repeating stuff.

Figure 16: Fitted regression line for ans1

Exercise.

In ans2, fit a quadratic model (\(y=a + bx + cx^2\)).

Click here for a solution.

How to fit a polynomial model is explained above.
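
A sketch (the object name is our choice):

f_ans2 <- lm(y ~ poly(x, 2, raw = TRUE), data = ans2)   # y = a + b*x + c*x^2
print(f_ans2)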

Figure 17: Fitted quadratic model for ans2

Comment: From Figure 17 we see that it’s an almost-perfect fit! Clearly, the second Anscombe dataset isn’t a case of linearly dependent variables.

Exercise.

In ans3, remove the obvious outlier from data and fit a regression line.

Click here for a solution.

Let’s plot the data set first, see Figure 18.

Figure 18: Scatter plot for ans3

Indeed, the observation at \(x\simeq 13\) is an obvious outlier. Perhaps the easiest way to remove it is to call:
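
One possibility (a sketch – whether this was the original approach is a guess; ans3b is a name of our choosing):

ans3b <- ans3[-which.max(ans3$y), ]   # drop the observation with the largest y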

We could also use the condition y < max(y), amongst others.

Now let’s fit the linear model:

Figure 19: Scatter plot for ans3 with the outlier removed and the fitted linear model

Comment: Now Figure 19 is what we call linearly correlated data. By the way, Pearson’s coefficient now equals 0.9999966.

2.4.2 Countries of the World – Simple models involving the GDP per capita

Let’s consider the World Factbook 2020 dataset (see this book’s datasets folder). It consists of country names, their population, area, GDP, mortality rates etc. We have scraped it from the CIA website at https://www.cia.gov/library/publications/the-world-factbook/docs/rankorderguide.html and compiled it into a single file on 3 April 2020.

Here is a preview of a few features for 3 selected countries (see help("%in%")):

##           country    area population gdp_per_capita_ppp
## 15      Australia 7741220   25466459              50400
## 169   New Zealand  268838    4925477              39000
## 247 United States 9833517  332639102              59800
Exercise.

List the 10 countries with the highest GDP per capita.

Click here for a solution.

To recall, to generate a vector of indexes that produces an ordered version of a numeric vector, we need to call the order() function.
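
A sketch (assuming the dataset has been read into a data frame named factbook):

idx <- order(factbook$gdp_per_capita_ppp, na.last = NA)   # ascending order; drop missing values
tail(factbook[idx, c("country", "gdp_per_capita_ppp")], 10)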

##           country gdp_per_capita_ppp
## 113       Ireland              73200
## 35         Brunei              78900
## 114   Isle of Man              84600
## 211     Singapore              94100
## 26        Bermuda              99400
## 141    Luxembourg             105100
## 157        Monaco             115700
## 142         Macau             122000
## 192         Qatar             124100
## 139 Liechtenstein             139100

By the way, the reported values are in USD.

Question: Which of these countries are tax havens?

Exercise.

Find the 5 most positively and the 5 most negatively correlated variables with the gdp_per_capita_ppp feature (of course, with respect to the Pearson coefficient).

Click here for a solution.

This can be solved via a call to cor(). Note that we need to make sure that missing values are omitted from the computations. A quick glimpse at the manual page (?cor) reveals that computing the correlation between a column and all the other ones (of course, except country, which is non-numeric) can be performed as follows.
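
A sketch (the handling of missing values via the use argument is our guess, so the exact values might differ slightly):

num <- factbook[, sapply(factbook, is.numeric)]   # numeric columns only
r <- cor(num$gdp_per_capita_ppp, num, use = "complete.obs")[1, ]
r <- sort(r[names(r) != "gdp_per_capita_ppp"])    # exclude the variable itself
head(r, 5)   # the 5 most negatively correlated variables
tail(r, 5)   # the 5 most positively correlated variables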

##   infant_mortality_rate maternal_mortality_rate 
##              -0.7465825              -0.6700486 
##              birth_rate              death_rate 
##              -0.6082201              -0.5721644 
##    total_fertility_rate 
##              -0.5672550
##        natural_gas_production         gross_national_saving 
##                     0.5689764                     0.6113255 
##                    median_age obesity_adult_prevalence_rate 
##                     0.6208990                     0.6368120 
##      life_expectancy_at_birth 
##                     0.7546062

Comment: “Live long and prosper” just gained a new meaning. Richer countries have lower infant and maternal mortality rates, lower birth rates, but higher life expectancy and obesity prevalence. Note, however, that correlation is not causation: we are unlikely to increase the GDP by asking people to put on weight.

Exercise.

Fit simple regression models where the per capita GDP explains its four most correlated variables (four individual models). Draw them on a scatter plot. Compute the root mean squared errors (RMSE), mean absolute errors (MAE) and the coefficients of determination (\(R^2\)).

Click here for a solution.

The four most correlated variables (we should look at the absolute value of the correlation coefficient now – recall that a correlation of 0 means no linear dependence, whereas values close to 1 or -1 indicate a strong linear association between a pair of variables) are:

## [1] "obesity_adult_prevalence_rate"
## [2] "maternal_mortality_rate"      
## [3] "infant_mortality_rate"        
## [4] "life_expectancy_at_birth"

We could take the above column names and construct four formulas manually, e.g., by writing gdp_per_capita_ppp~life_expectancy_at_birth, but we are lazy. Being lazy when it comes to computer programming is often a virtue, not a flaw in one’s character.

Instead, we will run a for loop that extracts the pairs of interesting columns and constructs a formula based on two vectors (lm(Y~X)), see Figure 20.
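
A sketch of such a loop (best4 holds the four column names listed above; the plotting commands are omitted here):

best4 <- c("obesity_adult_prevalence_rate", "maternal_mortality_rate",
    "infant_mortality_rate", "life_expectancy_at_birth")
for (v in best4) {
    print(v)
    x <- factbook[["gdp_per_capita_ppp"]]   # the explanatory variable
    y <- factbook[[v]]                      # the variable being explained
    f <- lm(y ~ x)                          # rows with missing values are dropped automatically
    e <- residuals(f)
    print(cbind(RMSE = sqrt(mean(e^2)), MAE = mean(abs(e)), R2 = summary(f)$r.squared))
}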

## [1] "obesity_adult_prevalence_rate"
##          RMSE      MAE         R2
## [1,] 11.04095 8.158911 0.06219638
## [1] "maternal_mortality_rate"
##          RMSE      MAE        R2
## [1,] 204.9272 146.5324 0.2148087
## [1] "infant_mortality_rate"
##          RMSE     MAE        R2
## [1,] 15.74646 12.1665 0.3005046
## [1] "life_expectancy_at_birth"
##          RMSE      MAE        R2
## [1,] 5.429182 4.372718 0.4309554

Figure 20: A scatter plot matrix and regression lines for the 4 variables most correlated with the per capita GDP

Recall that the root mean squared error is the square root of the arithmetic mean of the squared residuals. Mean absolute error is the average of the absolute values of the residuals. The coefficient of determination is given by: \(R^2(f) = 1 - \frac{\sum_{i=1}^{n} \left(y_i-f(\mathbf{x}_{i,\cdot})\right)^2}{\sum_{i=1}^{n} \left(y_i-\bar{y}\right)^2}\).

Comment: Unfortunately, we were misled by the high correlation coefficients between the \(X\)s and \(Y\)s: the low actual \(R^2\) scores indicate that these models should not be deemed trustworthy. Note that 3 of the plots are evidently L-shaped.

Fun fact: (*) Interestingly, it can be shown that \(R^2\) (in the case of the linear models fitted by minimising the SSR) is the square of the correlation between the true \(Y\)s and the predicted \(Y\)s:
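
A sketch (f is the last model fitted in the loop above, i.e., life expectancy explained by the GDP per capita):

yy <- model.frame(f)[[1]]   # the observed outputs actually used in the fit
cor(yy, fitted(f))^2        # squared correlation between the true and predicted outputs
summary(f)$r.squared        # the same value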

## [1] 0.4309554
## [1] 0.4309554

Side note: Do note that RMSE and MAE are interpretable: for instance, average error of life expectancy prediction based on the GDP is 4-5 years. Recall that you can find the information on the variables’ units of measure at https://www.cia.gov/library/publications/the-world-factbook/docs/rankorderguide.html.

2.4.3 Countries of the World – Most correlated variables (*)

Let’s get back to the World Factbook 2020 dataset (world_factbook_2020.csv).

Exercise.

Create a data frame C with three columns named col1, col2 and r and \(p(p-1)/2\) rows, where \(p\) is the number of numeric features in factbook. Every row should represent a unique pair of column names in factbook (we do not distinguish between (a,b) and (b,a)) together with the correlation coefficient between the corresponding variables.

Click here for a solution.

First we will solve this exercise considering only 4 numeric features in our dataset, so that we can keep track of how the R expressions we evaluate actually work.

Let us compute the Pearson coefficients between chosen pairs of variables.

##                   area  median_age birth_rate    exports
## area        1.00000000  0.04452395 -0.0319946  0.4925857
## median_age  0.04452395  1.00000000 -0.9215918  0.2997270
## birth_rate -0.03199460 -0.92159182  1.0000000 -0.2429553
## exports     0.49258568  0.29972695 -0.2429553  1.0000000

Note that the R matrix has 1.0 on the diagonal (where each entry represents a correlation between a variable and itself). Moreover, it is symmetric around the diagonal – R[i,j] == R[j,i], because it is the correlation between the same pair of variables. Hence, from now on we may be interested in the elements below the diagonal. We can get access to them by using lower.tri() (“lower triangle”).

## [1]  0.04452395 -0.03199460  0.49258568 -0.92159182  0.29972695
## [6] -0.24295529

This is already the 3rd column of the data frame we are asked to generate, which should look like:

##         col1       col2           r
## 1 median_age       area  0.04452395
## 2 birth_rate       area -0.03199460
## 3    exports       area  0.49258568
## 4 birth_rate median_age -0.92159182
## 5    exports median_age  0.29972695
## 6    exports birth_rate -0.24295529

How to generate col1 and col2? One idea is to take the “lower triangles” of the following matrices:

##      [,1]         [,2]         [,3]         [,4]        
## [1,] "area"       "area"       "area"       "area"      
## [2,] "median_age" "median_age" "median_age" "median_age"
## [3,] "birth_rate" "birth_rate" "birth_rate" "birth_rate"
## [4,] "exports"    "exports"    "exports"    "exports"

and:

##      [,1]   [,2]         [,3]         [,4]     
## [1,] "area" "median_age" "birth_rate" "exports"
## [2,] "area" "median_age" "birth_rate" "exports"
## [3,] "area" "median_age" "birth_rate" "exports"
## [4,] "area" "median_age" "birth_rate" "exports"

Here is a complete solution for all the features in factbook:
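
A sketch of one way to do it (the treatment of missing values is our guess; the original solution might differ in details):

num <- factbook[, sapply(factbook, is.numeric)]    # numeric features only
R <- cor(num, use = "pairwise.complete.obs")       # all pairwise correlation coefficients
p <- ncol(R)
nam1 <- matrix(rownames(R), nrow = p, ncol = p)                 # like the first matrix above
nam2 <- matrix(colnames(R), nrow = p, ncol = p, byrow = TRUE)   # like the second one
C <- data.frame(
    col1 = nam1[lower.tri(R)],
    col2 = nam2[lower.tri(R)],
    r    = R[lower.tri(R)]
)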

Comment: In “classical” programming languages we would perhaps have used a double (nested) for loop here (a less readable solution).

Exercise.

Find the 5 most correlated pairs of variables.

Click here for a solution.

This can be done by ordering the rows of C in decreasing order of absolute values of C$r, and then choosing the first 5 rows.
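
A sketch (the name C_top is also used in the discussion below):

C_top <- head(C[order(abs(C$r), decreasing = TRUE), ], 5)
C_top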

col1 col2 r
1687 electricity_installed_generating_capacity electricity_production 0.9994159
1684 electricity_consumption electricity_production 0.9992063
88 labor_force population 0.9986190
1718 electricity_installed_generating_capacity electricity_consumption 0.9981540
1300 telephones_mobile_cellular labor_force 0.9979274

Comment: The most correlated pairs of features are not really “mind-blowing”…

Exercise.

Fit simple regression models for the most correlated pair of variables.

Click here for a solution.

There is a degree of ambiguity here: should col1 or rather col2 be treated as the dependent variable in our model? Let’s just pick one of the two.

To learn something new, which is exactly why we are all here, we will create the formulas programmatically, by first concatenating (joining) appropriate strings (note that in order to input a double quote character inside a string, we need to precede it with a backslash), and then calling the formula() function.
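
A sketch of this approach (treating col2 as the dependent variable is an arbitrary choice of ours):

form <- formula(paste(C_top[1, "col2"], "~", C_top[1, "col1"]))
f <- lm(form, data = factbook)
print(f)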

## 
## Call:
## lm(formula = form, data = factbook)
## 
## Coefficients:
##                               (Intercept)  
##                                 795225381  
## electricity_installed_generating_capacity  
##                                      3625

Figure 21: Most correlated pair of variables and the invisible regression line

Figure 21 is just great.

Wait, hold on a second! The fitted line is not present on the plot!

Exercise.

There is a serious, extremely hard-to-detect bug in the above code. The reader is kindly asked to try to identify it.

Click here for a solution.

The author wishes to apologise, but this is where he will get slightly… emotional. We are all programmers here. So let’s speak like a programmer to a programmer.

Aaaaaaaaaaaaaaaargh!!!!!!!!!!! I’ve spent 3 hours trying to locate it!!!

So in the Appendix related to data frame wrangling I have explicitly asked everyone (including myself, because I tend to forget that!!!) to globally set the stringsAsFactors=FALSE option. But we didn’t do so. The problem arises where we create the C data frame using the data.frame() function – C$col1 and C$col2 are objects of type:

## [1] "factor"
## [1] "factor"

Is this a problem? Well, it is, because when indexing an object x with a factor f, x[f] means x[as.numeric(f)] and not x[as.character(f)].

Therefore, factbook[,C_top[1,1]] refers to:

## [1] "gdp_per_capita_ppp"

and not:

## [1] "electricity_installed_generating_capacity"

Here is a corrected version, see Figure 22.
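
A sketch of the fix – convert the factors to character strings before using them to index the data frame (the plotting details are our guess):

x_name <- as.character(C_top[1, "col1"])
y_name <- as.character(C_top[1, "col2"])
plot(factbook[[x_name]], factbook[[y_name]], xlab = x_name, ylab = y_name)
abline(f, col = "red")   # the fitted model itself was fine all along
print(f)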

## 
## Call:
## lm(formula = form, data = factbook)
## 
## Coefficients:
##                               (Intercept)  
##                                 795225381  
## electricity_installed_generating_capacity  
##                                      3625

Figure 22: A bugfixed version of Figure 21

Seriously, I really like the R language, but this stringsAsFactors thing is its weakest spot. Just fire up options(stringsAsFactors=FALSE) next time! (Which I of course won’t.) End of story.

2.4.4 Countries of the World – A non-linear model based on the GDP per capita

Let’s revisit the World Factbook 2020 dataset (world_factbook_2020.csv).

Exercise.

Draw a histogram of the empirical distribution of the GDP per capita. Moreover, draw a histogram of the logarithm of the GDP/person.

Click here for a solution.
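
A sketch (placing the two histograms side by side is our choice):

par(mfrow = c(1, 2))   # two plots next to each other
hist(factbook$gdp_per_capita_ppp, main = "", xlab = "GDP per capita")
hist(log(factbook$gdp_per_capita_ppp), main = "", xlab = "log(GDP per capita)")
par(mfrow = c(1, 1))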

Figure 23: Histograms of the empirical distribution of the GDP per capita with linear (left) and log (right) scale on the X axis

Comment: In Figure 23 we see that distribution of the GDP is right-skewed: most countries have small GDP. However, few of them (those in the “right tail” of the distribution) are very very rich (hey, how about taxing the richest countries?!). There is the famous observation made by V. Pareto stating that most assets are in the hands of the “wealthy minority” (compare: power law, rich-get-richer rule, preferential attachment in complex networks). Interestingly, many real-world-phenomena are distributed similarly (e.g., the popularity of web pages, the number of followers of Instagram profiles). It is frequently the case that the logarithm of the aforementioned variable looks more “normal” (is bell-shaped).

Side note: “The” logarithm most often refers to the logarithm base \(e\), \(\log x = \log_e x\), where \(e\simeq 2.72\) is the Euler constant, see exp(1) in R. Note that you can only compute logarithms of positive real numbers.

A non-technical audience might be confused when asked to contemplate the distribution of the logarithm of a variable. Let’s make it more user-friendly (on the other hand, we could’ve asked them to harden up…) by nicely re-labelling the X axis, see Figure 24.

Figure 24: Histogram of the empirical distribution of the GDP per capita now with human-readable X axis labels (not the logarithmic scale)

Comment: This is still a plot of the logarithm of the distribution of the per capita GDP, but it’s somehow “hidden” behind the human-readable axis labels. Nice.

Exercise.

Fit a simple linear model of life_expectancy_at_birth as a function of gdp_per_capita_ppp.

Click here for a solution.

Easy. We have already done that in one of the previous exercises. Yet, to learn something new, let’s note that the plot() function accepts formulas as well.
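
A sketch (the graphical details are our choice):

f <- lm(life_expectancy_at_birth ~ gdp_per_capita_ppp, data = factbook)
summary(f)$r.squared
plot(life_expectancy_at_birth ~ gdp_per_capita_ppp, data = factbook)   # the formula interface
abline(f, col = "red")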

## [1] 0.4309554

Figure 25: Linear model fitted for life expectancy vs. GDP/person

Comment: From Figure 25 we see that this is not a good model.

Exercise.

Draw a scatter plot of life_expectancy_at_birth as a function of gdp_per_capita_ppp, with the X axis being logarithmic. Compute the correlation coefficient between log(gdp_per_capita_ppp) and life_expectancy_at_birth.

Click here for a solution.

We could apply the log()-transformation manually and generate fancy X axis labels ourselves. However, the plot() function has the log argument (see ?plot.default) which provides us with all we need, see Figure 26.
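
A sketch:

plot(life_expectancy_at_birth ~ gdp_per_capita_ppp, data = factbook,
    log = "x")   # logarithmic scale on the X axis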

Figure 26: Scatter plot of life expectancy vs. GDP/person with log scale on the X axis

Here is the linear correlation coefficient between the logarithm of the GDP/person and the life expectancy.
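
A sketch (the handling of missing values is our guess):

cor(log(factbook$gdp_per_capita_ppp), factbook$life_expectancy_at_birth,
    use = "complete.obs")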

## [1] 0.8066505

The correlation is quite high, hence the following task.

Exercise.

Fit a model predicting life_expectancy_at_birth by means of log(gdp_per_capita_ppp).

Click here for a solution.

We would like to fit a model of the form \(Y=a\log X+b\). The formula life_expectancy_at_birth~log(gdp_per_capita_ppp) should do the trick here.
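
A sketch (flog is a name of our choosing):

flog <- lm(life_expectancy_at_birth ~ log(gdp_per_capita_ppp), data = factbook)
coef(flog)
summary(flog)$r.squared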

##             (Intercept) log(gdp_per_capita_ppp) 
##               28.306385                4.817806
## [1] 0.650685

Figure 27: Linear model fitted for life expectancy vs. the logarithm of GDP/person

Comment: That is an okay model (in terms of the coefficient of determination), see Figure 27.

Exercise.

Draw the fitted logarithmic model on a scatter plot with a standard, non-logarithmic X axis.

Click here for a solution.

The model fitted above is of the form \(Y\simeq4.82 \log X+28.31\). To depict it on a plot with linear (non-logarithmic) axes, we can compute this formula on multiple points by hand, see Figure 28.
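
A sketch (the grid of GDP values is our choice):

plot(life_expectancy_at_birth ~ gdp_per_capita_ppp, data = factbook)   # linear axes
xs <- seq(1000, 150000, length.out = 101)                              # a grid of GDP values
lines(xs, coef(flog)[1] + coef(flog)[2] * log(xs), col = "red", lwd = 2)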

Figure 28: Logarithmic model fitted for life expectancy vs. GDP/person

Comment: Well, people are not immortal… The original (linear) model didn’t really take that into account. Also, recall that correlation is not causation. Moreover, there is a lot of variability at an individual level. Being born in a less wealthy country (e.g., not in a tax haven) doesn’t mean you don’t have your whole life ahead of you. Do the cool stuff, do something for the others. Life’s not about money.

2.4.5 Countries of the World – A multiple regression model for the per capita GDP

Let’s play with the World Factbook 2020 (world_factbook_2020.csv) once again. The world is an interesting place, so we’re far from being bored with this dataset.

Let’s restrict ourselves to the following columns, mostly related to imports and exports:

Let’s compute the per capita versions of the above, by dividing all values by each country’s population:

We are going to build a few multiple regression models using the step() function, which is not too fond of missing values, therefore they should be removed first:

## [1] 261 157
Exercise.

Build a model for gdp_purchasing_power_parity as a function of imports and exports (all per capita).

Click here for a solution.

Let’s first take a look at how the aforementioned variables are related to each other, see Figure 29.

##                             gdp_purchasing_power_parity
## gdp_purchasing_power_parity                   1.0000000
## imports                                       0.8289144
## exports                                       0.8189854
##                               imports   exports
## gdp_purchasing_power_parity 0.8289144 0.8189854
## imports                     1.0000000 0.9424084
## exports                     0.9424084 1.0000000

Figure 29: Scatter plot matrix for GDP, imports and exports

They are nicely correlated. Moreover, they are on a similar scale (“tens of thousands of USD per capita”).

Fitting the requested model yields:
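
A sketch (f1 and factbookn follow the naming used in this section):

f1 <- lm(gdp_purchasing_power_parity ~ imports + exports, data = factbookn)
coef(f1)
summary(f1)$adj.r.squared   # the adjusted R-squared, cf. the comment below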

##  (Intercept)      imports      exports 
## 9852.5381318    1.4419398    0.7806693
## [1] 0.6959805
Exercise.

Use forward selection to come up with a model for gdp_purchasing_power_parity per capita.

Click here for a solution.
## gdp_purchasing_power_parity ~ 1
## gdp_purchasing_power_parity ~ imports + exports + electricity_exports + 
##     electricity_imports + military_expenditures + crude_oil_exports + 
##     crude_oil_imports + natural_gas_exports + natural_gas_imports + 
##     reserves_of_foreign_exchange_and_gold
## 
## Call:
## lm(formula = gdp_purchasing_power_parity ~ imports + crude_oil_exports + 
##     crude_oil_imports + electricity_imports + natural_gas_imports, 
##     data = factbookn)
## 
## Coefficients:
##         (Intercept)              imports    crude_oil_exports  
##            7603.236                1.775           128472.216  
##   crude_oil_imports  electricity_imports  natural_gas_imports  
##          100781.638                1.621                3.131
## [1] 0.7865036

Comment: Interestingly, it’s mostly the import-related variables that contribute to the GDP per capita. However, the model is not perfect, so we should refrain from building a brand new economic theory around this “discovery”. On the other hand, you know what they say: all models are wrong, but some might be useful. Note that we used the adjusted \(R^2\) coefficient to correct for the number of variables in the model so as to make it more comparable with the coefficient corresponding to the f1 model.

Exercise.

Use backward elimination to construct a model for gdp_purchasing_power_parity per capita.

Click here for a solution.
## 
## Call:
## lm(formula = gdp_purchasing_power_parity ~ imports + electricity_imports + 
##     crude_oil_exports + crude_oil_imports + natural_gas_imports, 
##     data = factbookn)
## 
## Coefficients:
##         (Intercept)              imports  electricity_imports  
##            7603.236                1.775                1.621  
##   crude_oil_exports    crude_oil_imports  natural_gas_imports  
##          128472.216           100781.638                3.131
## [1] 0.7865036

Comment: This is the same model as the one found by forward selection, i.e., f2.

2.5 Outro

2.5.1 Remarks

Multiple regression is simple, fast to apply and interpretable.

Linear models go beyond fitting of straight lines and other hyperplanes!

A complex model may overfit and hence generalise poorly to unobserved inputs.

Note that the SSR criterion makes the models sensitive to outliers.

Remember:

good models \(=\) better understanding of the modelled reality \(+\) better predictions \(=\) more revenue, your boss’ happiness, your startup’s growth, etc.

2.5.2 Other Methods for Regression

Other example approaches to regression:

  • ridge regression,
  • lasso regression,
  • least absolute deviations (LAD) regression,
  • multivariate adaptive regression splines (MARS),
  • K-nearest neighbour (K-NN) regression, see FNN::knn.reg() in R,
  • regression trees,
  • support-vector regression (SVR),
  • neural networks (also deep) for regression.

2.5.3 Derivation of the Solution (**)

We would like to find an analytical solution to the problem of minimising of the sum of squared residuals:

\[ \min_{\beta_0, \beta_1,\dots, \beta_p\in\mathbb{R}} E(\beta_0, \beta_1, \dots, \beta_p)= \sum_{i=1}^n \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)^2 \]

This requires computing the \(p+1\) partial derivatives \({\partial E}/{\partial \beta_j}\) for \(j=0,\dots,p\).

The partial derivatives are very similar to each other; \(\frac{\partial E}{\partial \beta_0}\) is given by: \[ \frac{\partial E}{\partial \beta_0}(\beta_0,\beta_1,\dots,\beta_p)= 2 \sum_{i=1}^n \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right) \] and \(\frac{\partial E}{\partial \beta_j}\) for \(j>0\) is equal to: \[ \frac{\partial E}{\partial \beta_j}(\beta_0,\beta_1,\dots,\beta_p)= 2 \sum_{i=1}^n x_{i,j} \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right) \]

Then all we need to do is to solve the system of linear equations:

\[ \left\{ \begin{array}{rcl} \frac{\partial E}{\partial \beta_0}(\beta_0,\beta_1,\dots,\beta_p)&=&0 \\ \frac{\partial E}{\partial \beta_1}(\beta_0,\beta_1,\dots,\beta_p)&=&0 \\ \vdots\\ \frac{\partial E}{\partial \beta_p}(\beta_0,\beta_1,\dots,\beta_p)&=&0 \\ \end{array} \right. \]

The above system of \(p+1\) linear equations, which we are supposed to solve for \(\beta_0,\beta_1,\dots,\beta_p\): \[ \left\{ \begin{array}{rcl} 2 \sum_{i=1}^n \phantom{x_{i,0}}\left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)&=&0 \\ 2 \sum_{i=1}^n x_{i,1} \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)&=&0 \\ \vdots\\ 2 \sum_{i=1}^n x_{i,p} \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)&=&0 \\ \end{array} \right. \] can be rewritten as: \[ \left\{ \begin{array}{rcl} \sum_{i=1}^n \phantom{x_{i,0}}\left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p}\right)&=& \sum_{i=1}^n \phantom{x_{i,0}} y_i \\ \sum_{i=1}^n x_{i,1} \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p}\right)&=&\sum_{i=1}^n x_{i,1} y_i \\ \vdots\\ \sum_{i=1}^n x_{i,p} \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p}\right)&=&\sum_{i=1}^n x_{i,p} y_i \\ \end{array} \right. \]

and further as: \[ \left\{ \begin{array}{rcl} \beta_0\ n\phantom{\sum_{i=1}^n x} + \beta_1\sum_{i=1}^n \phantom{x_{i,0}} x_{i,1}+\dots+\beta_p \sum_{i=1}^n \phantom{x_{i,0}} x_{i,p} &=&\sum_{i=1}^n\phantom{x_{i,0}} y_i \\ \beta_0 \sum_{i=1}^n x_{i,1} + \beta_1\sum_{i=1}^n x_{i,1} x_{i,1}+\dots+\beta_p \sum_{i=1}^n x_{i,1} x_{i,p} &=&\sum_{i=1}^n x_{i,1} y_i \\ \vdots\\ \beta_0 \sum_{i=1}^n x_{i,p} + \beta_1\sum_{i=1}^n x_{i,p} x_{i,1}+\dots+\beta_p \sum_{i=1}^n x_{i,p} x_{i,p} &=&\sum_{i=1}^n x_{i,p} y_i \\ \end{array} \right. \] Note that the terms involving \(x_{i,j}\) and \(y_i\) (the sums) are all constant – these are some fixed real numbers. We have learned how to solve such problems in high school.

Exercise.

Try deriving the analytical solution and implementing it for \(p=2\). Recall that in the previous chapter we solved the special case of \(p=1\).

2.5.4 Solution in Matrix Form (***)

Assume that \(\mathbf{X}\in\mathbb{R}^{n\times p}\) (a matrix with inputs), \(\mathbf{y}\in\mathbb{R}^{n\times 1}\) (a column vector of reference outputs) and \(\boldsymbol{\beta}\in\mathbb{R}^{(p+1)\times 1}\) (a column vector of parameters).

Firstly, note that a linear model of the form: \[ f_{\boldsymbol\beta}(\mathbf{x})=\beta_0+\beta_1 x_1+\dots+\beta_p x_p \] can be rewritten as: \[ f_{\boldsymbol\beta}(\mathbf{x})=\beta_0 1+\beta_1 x_1+\dots+\beta_p x_p =\mathbf{\dot{x}}\boldsymbol\beta, \] where \(\mathbf{\dot{x}}=[1\ x_1\ x_2\ \cdots\ x_p]\).

Similarly, if we assume that \(\mathbf{\dot{X}}=[\boldsymbol{1}\ \mathbf{X}]\in\mathbb{R}^{n\times (p+1)}\) is the input matrix with a prepended column of \(1\)s, i.e., \(\boldsymbol{1}=[1\ 1\ \cdots\ 1]^T\) and \(\dot{x}_{i,0}=1\) (for brevity of notation the columns added will have index \(0\)), \(\dot{x}_{i,j}=x_{i,j}\) for all \(j\ge 1\) and all \(i\), then: \[ \mathbf{\hat{y}} = \mathbf{\dot{X}} \boldsymbol\beta \] gives the vector of predicted outputs for every input point.

This way, the sum of squared residuals \[ E(\beta_0, \beta_1, \dots, \beta_p)= \sum_{i=1}^n \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)^2 \] can be rewritten as: \[ E(\boldsymbol\beta)=\| \mathbf{\dot{X}} \boldsymbol\beta - \mathbf{y} \|^2, \] where as usual \(\|\cdot\|^2\) denotes the squared Euclidean norm.

Recall that this can be re-expressed as: \[ E(\boldsymbol\beta)= (\mathbf{\dot{X}} \boldsymbol\beta - \mathbf{y})^T (\mathbf{\dot{X}} \boldsymbol\beta - \mathbf{y}). \]

In order to find the minimum of \(E\) w.r.t. \(\boldsymbol\beta\), we need to find the parameters that make the partial derivatives vanish, i.e.:

\[ \left\{ \begin{array}{rcl} \frac{\partial E}{\partial \beta_0}(\boldsymbol\beta)&=&0 \\ \frac{\partial E}{\partial \beta_1}(\boldsymbol\beta)&=&0 \\ \vdots\\ \frac{\partial E}{\partial \beta_p}(\boldsymbol\beta)&=&0 \\ \end{array} \right. \]

Remark.

(***) Interestingly, the above can also be expressed in matrix form, using the special notation: \[ \nabla E(\boldsymbol\beta) = \boldsymbol{0} \] Here, \(\nabla E\) (nabla symbol = differential operator) denotes the function gradient, i.e., the vector of all partial derivatives. This is nothing more than syntactic sugar for this quite commonly applied operator.

Anyway, the system of linear equations we have derived above: \[ \left\{ \begin{array}{rcl} \beta_0\ n\phantom{\sum_{i=1}^n x} + \beta_1\sum_{i=1}^n \phantom{x_{i,0}} x_{i,1}+\dots+\beta_p \sum_{i=1}^n \phantom{x_{i,0}} x_{i,p} &=&\sum_{i=1}^n\phantom{x_{i,0}} y_i \\ \beta_0 \sum_{i=1}^n x_{i,1} + \beta_1\sum_{i=1}^n x_{i,1} x_{i,1}+\dots+\beta_p \sum_{i=1}^n x_{i,1} x_{i,p} &=&\sum_{i=1}^n x_{i,1} y_i \\ \vdots\\ \beta_0 \sum_{i=1}^n x_{i,p} + \beta_1\sum_{i=1}^n x_{i,p} x_{i,1}+\dots+\beta_p \sum_{i=1}^n x_{i,p} x_{i,p} &=&\sum_{i=1}^n x_{i,p} y_i \\ \end{array} \right. \] can be rewritten in matrix terms as: \[ \left\{ \begin{array}{rcl} \beta_0 \mathbf{\dot{x}}_{\cdot,0}^T \mathbf{\dot{x}}_{\cdot,0} + \beta_1 \mathbf{\dot{x}}_{\cdot,0}^T \mathbf{\dot{x}}_{\cdot,1}+\dots+\beta_p \mathbf{\dot{x}}_{\cdot,0}^T \mathbf{\dot{x}}_{\cdot,p} &=& \mathbf{\dot{x}}_{\cdot,0}^T \mathbf{y} \\ \beta_0 \mathbf{\dot{x}}_{\cdot,1}^T \mathbf{\dot{x}}_{\cdot,0} + \beta_1 \mathbf{\dot{x}}_{\cdot,1}^T \mathbf{\dot{x}}_{\cdot,1}+\dots+\beta_p \mathbf{\dot{x}}_{\cdot,1}^T \mathbf{\dot{x}}_{\cdot,p} &=& \mathbf{\dot{x}}_{\cdot,1}^T \mathbf{y} \\ \vdots\\ \beta_0 \mathbf{\dot{x}}_{\cdot,p}^T \mathbf{\dot{x}}_{\cdot,0} + \beta_1 \mathbf{\dot{x}}_{\cdot,p}^T \mathbf{\dot{x}}_{\cdot,1}+\dots+\beta_p \mathbf{\dot{x}}_{\cdot,p}^T \mathbf{\dot{x}}_{\cdot,p} &=& \mathbf{\dot{x}}_{\cdot,p}^T \mathbf{y}\\ \end{array} \right. \]

This can be restated as: \[ \left\{ \begin{array}{rcl} \left(\mathbf{\dot{x}}_{\cdot,0}^T \mathbf{\dot{X}}\right)\, \boldsymbol\beta &=& \mathbf{\dot{x}}_{\cdot,0}^T \mathbf{y} \\ \left(\mathbf{\dot{x}}_{\cdot,1}^T \mathbf{\dot{X}}\right)\, \boldsymbol\beta &=& \mathbf{\dot{x}}_{\cdot,1}^T \mathbf{y} \\ \vdots\\ \left(\mathbf{\dot{x}}_{\cdot,p}^T \mathbf{\dot{X}}\right)\, \boldsymbol\beta &=& \mathbf{\dot{x}}_{\cdot,p}^T \mathbf{y}\\ \end{array} \right. \] which in turn is equivalent to: \[ \left(\mathbf{\dot{X}}^T\mathbf{\dot{X}}\right)\,\boldsymbol\beta = \mathbf{\dot{X}}^T\mathbf{y}. \]

Such a system of linear equations in matrix form can be solved numerically using, amongst others, the solve() function.

Remark.

(***) In practice, we’d rather rely on QR or SVD decompositions of matrices for efficiency and numerical accuracy reasons.

Numeric example – solution via lm():

## (Intercept)          X1          X2 
## 172.5586670   0.1828011   2.1976461

Recalling that \(\mathbf{A}^T \mathbf{B}\) can be computed by calling t(A) %*% B or – even faster – by calling crossprod(A, B), we can also use solve() to obtain the same result:
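
A sketch (X1, X2 and Y are the same vectors as above):

Xdot <- cbind(1, X1, X2)                    # the input matrix with a prepended column of 1s
solve(crossprod(Xdot), crossprod(Xdot, Y))  # solves (Xdot^T Xdot) beta = Xdot^T y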

##           [,1]
##    172.5586670
## X1   0.1828011
## X2   2.1976461

2.5.5 Pearson’s r in Matrix Form (**)

Recall the Pearson linear correlation coefficient: \[ r(\boldsymbol{x},\boldsymbol{y}) = \frac{ \sum_{i=1}^n (x_i-\bar{x}) (y_i-\bar{y}) }{ \sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}\ \sqrt{\sum_{i=1}^n (y_i-\bar{y})^2} } \]

Denote with \(\boldsymbol{x}^\circ\) and \(\boldsymbol{y}^\circ\) the centred versions of \(\boldsymbol{x}\) and \(\boldsymbol{y}\), respectively, i.e., \(x_i^\circ=x_i-\bar{x}\) and \(y_i^\circ=y_i-\bar{y}\).

Rewriting the above yields: \[ r(\boldsymbol{x},\boldsymbol{y}) = \frac{ \sum_{i=1}^n x_i^\circ y_i^\circ }{ \sqrt{\sum_{i=1}^n ({x_i^\circ})^2}\ \sqrt{\sum_{i=1}^n ({y_i^\circ})^2} } \] which is exactly: \[ r(\boldsymbol{x},\boldsymbol{y}) = \frac{ \boldsymbol{x}^\circ\cdot \boldsymbol{y}^\circ }{ \| \boldsymbol{x}^\circ \|\ \| \boldsymbol{y}^\circ \| } \] i.e., the normalised dot product of the centred versions of the two vectors.

This is the cosine of the angle between the two vectors (in \(n\)-dimensional spaces)!

(**) Recalling from the previous chapter that \(\mathbf{A}^T \mathbf{A}\) gives the dot product between all the pairs of columns in a matrix \(\mathbf{A}\), we can implement an equivalent version of cor(C) as follows:
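
A sketch (my_cor() is a helper of our own):

my_cor <- function(C) {
    Cc <- apply(C, 2, function(x) x - mean(x))            # centre each column
    Cn <- apply(Cc, 2, function(x) x / sqrt(sum(x^2)))    # normalise each column to unit length
    crossprod(Cn)   # dot products between all the pairs of columns
}
round(my_cor(C), 3)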

##           Rating  Limit Income   Age Education Balance
## Rating     1.000  0.996  0.831 0.167    -0.040   0.798
## Limit      0.996  1.000  0.834 0.164    -0.032   0.796
## Income     0.831  0.834  1.000 0.227    -0.033   0.414
## Age        0.167  0.164  0.227 1.000     0.024   0.008
## Education -0.040 -0.032 -0.033 0.024     1.000   0.001
## Balance    0.798  0.796  0.414 0.008     0.001   1.000

2.5.6 Further Reading

Recommended further reading: (James et al. 2017: Chapters 1, 2 and 3)

Other: (Hastie et al. 2017: Chapter 1, Sections 3.2 and 3.3)