1.4 Simple Linear Regression

1.4.1 Introduction

If the family of admissible models \(\mathcal{F}\) consists only of all linear functions of one variable, we deal with a simple linear regression.

Our problem becomes:

\[ \min_{a,b\in\mathbb{R}} \sum_{i=1}^n \left( ax_i+b-y_i \right)^2 \]

In other words, we seek best fitting line in terms of the squared residuals.

This is the method of least squares.

This is particularly nice, because our search space is just \(\mathbb{R}^2\) – easy to handle both analytically and numerically.

plot of chunk unnamed-chunk-18

Which of these is the least squares solution?

1.4.2 Solution in R

Let’s fit the linear model minimising the SSR in R. For convenience, let us store both \(\mathbf{x}\) and \(\mathbf{y}\) in a data frame:

##     X   Y
## 1 333 283
## 2 903 483
## 3 580 514

The lm() function (linear models) has a convenient formula-based interface.

In R, the expression “Y~X” denotes a formula, which we read as: variable Y is a function of X. Note that the dependent variable is on the left side of the formula.

Note that here X and Y refer to column names in the XY data frame.

## Call:
## lm(formula = Y ~ X, data = XY)
## Coefficients:
## (Intercept)            X  
##    226.4711       0.2661

Hence, the fitted model is: \[ Y = f(X) = 0.2661459X+226.4711446 \qquad (+ \varepsilon) \]

Coefficient \(a\) (slope):

##         X 
## 0.2661459

Coefficient \(b\) (intercept):

## (Intercept) 
##    226.4711


## [1] 2132108
## [1] 2132108

To make a prediction:

## [1] 625.6900 758.7630 891.8359


##        1        2        3 
## 625.6900 758.7630 891.8359


##        1 
## 1557.201

This is more than the highest possible rating – we have been extrapolating way beyond the observable data range.


plot of chunk unnamed-chunk-28

Note that our \(Y=aX+b\) model is interpretable and well-behaving:

(not all machine learning models will have this feature, think: deep neural networks, which we rather conceive as black boxes)

  • we know that by increasing \(X\) by a small amount, \(Y\) will also increase (positive correlation),

  • the model is continuous – small change in \(X\) doesn’t yield any drastic change in \(Y\),

  • we know what will happen if we increase or decrease \(X\) by, say, \(100\),

  • the function is invertible – if we want Rating of \(500\), we can compute the associated preferred Balance that should yield it (provided that the model is valid).

1.4.3 Analytic Solution

It may be shown (which we actually do below) that the solution is:

\[ \left\{ \begin{array}{rl} a = & \dfrac{ n \displaystyle\sum_{i=1}^n x_i y_i - \displaystyle\sum_{i=1}^n y_i \displaystyle\sum_{i=1}^n x_i }{ n \displaystyle\sum_{i=1}^n x_i x_i - \displaystyle\sum_{i=1}^n x_i\displaystyle\sum_{i=1}^n x_i }\\ b = & \dfrac{1}{n}\displaystyle\sum_{i=1}^n y_i - a \dfrac{1}{n} \displaystyle\sum_{i=1}^n x_i \\ \end{array} \right. \]

Which can be implemented in R as follows:

## [1]   0.2661459 226.4711446

1.4.4 Derivation of the Solution (**)

(You can safely skip this part if you are yet to know how to search for a minimum of a function of many variables and what are partial derivatives)

Denote with:

\[ E(a,b)=\mathrm{SSR}(x\mapsto ax+b|\mathbf{x},\mathbf{y}) =\sum_{i=1}^n \left( ax_i+b - y_i \right) ^2. \]

We seek the minimum of \(E\) w.r.t. both \(a,b\).


If \(E\) has a (local) minimum at \((a^*,b^*)\), then its partial derivatives vanish therein, i.e., \(\partial E/\partial a(a^*, b^*) = 0\) and \(\partial E/\partial b(a^*, b^*)=0\).

We have:

\[ E(a,b) = \displaystyle\sum_{i=1}^n \left( ax_i+b - y_i \right) ^2. \]

We need to compute the partial derivatives \(\partial E/\partial a\) (derivative of \(E\) w.r.t. variable \(a\) – all other terms treated as constants) and \(\partial E/\partial b\) (w.r.t. \(b\)).

Useful rules – derivatives w.r.t. \(a\) (denote \(f'(a)=(f(a))'\)):

  • \((f(a)+g(a))'=f'(a)+g'(a)\) (derivative of sum is sum of derivatives)
  • \((f(a) g(a))' = f'(a)g(a) + f(a)g'(a)\) (derivative of product)
  • \((f(g(a)))' = f'(g(a)) g'(a)\) (chain rule)
  • \((c)' = 0\) for any constant \(c\) (expression not involving \(a\))
  • \((a^p)' = pa^{p-1}\) for any \(p\)
  • in particular: \((c a^2+d)'=2ca\), \((ca)'=c\), \(((ca+d)^2)'=2(ca+d)c\) (application of the above rules)

We seek \(a,b\) such that \(\frac{\partial E}{\partial a}(a,b) = 0\) and \(\frac{\partial E}{\partial b}(a,b)=0\).

\[ \left\{ \begin{array}{rl} \frac{\partial E}{\partial a}(a,b)=&2\displaystyle\sum_{i=1}^n \left( ax_i+b - y_i \right) x_i = 0 \\ \frac{\partial E}{\partial b}(a,b)=&2\displaystyle\sum_{i=1}^n \left( ax_i+b - y_i \right) = 0\\ \end{array} \right. \]

This is a system of 2 linear equations. Easy.

Rearranging like back in the school days:

\[ \left\{ \begin{array}{rl} a \displaystyle\sum_{i=1}^n x_i x_i + b \displaystyle\sum_{i=1}^n x_i = & \displaystyle\sum_{i=1}^n x_i y_i \\ a \displaystyle\sum_{i=1}^n x_i+ b n = & \displaystyle\sum_{i=1}^n y_i \\ \end{array} \right. \]

It is left as an exercise to show that the solution is:

\[ \left\{ \begin{array}{rl} a^* = & \dfrac{ n \displaystyle\sum_{i=1}^n x_i y_i - \displaystyle\sum_{i=1}^n y_i \displaystyle\sum_{i=1}^n x_i }{ n \displaystyle\sum_{i=1}^n x_i x_i - \displaystyle\sum_{i=1}^n x_i\displaystyle\sum_{i=1}^n x_i }\\ b^* = & \dfrac{1}{n}\displaystyle\sum_{i=1}^n y_i - a^* \dfrac{1}{n} \displaystyle\sum_{i=1}^n x_i \\ \end{array} \right. \]

(we should additionally perform the second derivative test to assure that this is the minimum of \(E\) – which is exactly the case though)

(**) In the next chapter, we will introduce the notion of Pearson’s linear coefficient, \(r\) (see cor() in R). It might be shown that \(a\) and \(b\) can also be rewritten as:

##           [,1]
## [1,] 0.2661459
##          [,1]
## [1,] 226.4711