## 1.4 Simple Linear Regression

### 1.4.1 Introduction

If the family of admissible models $$\mathcal{F}$$ consists only of linear functions of one variable, we are dealing with simple linear regression.

Our problem becomes:

$\min_{a,b\in\mathbb{R}} \sum_{i=1}^n \left( ax_i+b-y_i \right)^2$

In other words, we seek the best-fitting line in terms of the sum of squared residuals.

This is the method of least squares.

This is particularly nice, because our search space is just $$\mathbb{R}^2$$ – easy to handle both analytically and numerically.
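Since the search space is just $$\mathbb{R}^2$$, the problem can also be attacked with a general-purpose numerical optimiser such as optim(). Below is a minimal sketch on made-up synthetic data (the "true" line $$y=0.25x+230$$ and the noise level are assumptions for illustration only):

```r
set.seed(123)                              # for reproducibility
x <- runif(100, 0, 2000)                   # synthetic predictor (made up)
y <- 0.25*x + 230 + rnorm(100, sd=50)      # assumed "true" line plus noise

SSR <- function(par) sum((par[1]*x + par[2] - y)^2)  # E(a, b)
res <- optim(c(0, 0), SSR, method="BFGS")  # minimise the SSR over (a, b)
res$par                                    # estimates of a and b
```

The returned pair should land close to the data-generating coefficients; of course, for this particular problem an exact solution is available in closed form, as we derive later in this section.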

### 1.4.2 Solution in R

Let’s fit the linear model minimising the SSR in R. For convenience, let us store both $$\mathbf{x}$$ and $$\mathbf{y}$$ in a data frame:

XY <- data.frame(X=X, Y=Y)
head(XY, 3)
##     X   Y
## 1 333 283
## 2 903 483
## 3 580 514

The lm() function (linear models) has a convenient formula-based interface.

In R, the expression “Y~X” denotes a formula, which we read as: variable Y is a function of X. Note that the dependent variable is on the left side of the formula.
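A formula is an ordinary R object that can be created and inspected on its own, independently of any data; for instance:

```r
form <- Y ~ X        # dependent variable on the left, predictor on the right
class(form)          # "formula"
all.vars(form)       # the variable names it refers to: "Y" "X"
```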

f <- lm(Y~X, data=XY)

Note that here X and Y refer to column names in the XY data frame.

print(f)
##
## Call:
## lm(formula = Y ~ X, data = XY)
##
## Coefficients:
## (Intercept)            X
##    226.4711       0.2661

Hence, the fitted model is: $Y = f(X) = 0.2661459X+226.4711446 \qquad (+ \varepsilon)$

Coefficient $$a$$ (slope):

f$coefficients["X"]
##         X
## 0.2661459

Coefficient $$b$$ (intercept):

f$coefficients["(Intercept)"]
## (Intercept)
##    226.4711

SSR:

sum(f$residuals^2)
##  2132108
sum((f$coefficients["X"]*X+f$coefficients["(Intercept)"]-Y)^2) # equivalent
##  2132108

To make a prediction:

Xnew <- data.frame(X=c(1500, 2000, 2500))
f$coefficients["X"]*Xnew$X + f$coefficients["(Intercept)"]
##  625.6900 758.7630 891.8359

or:

predict(f, Xnew)
##        1        2        3
## 625.6900 758.7630 891.8359

However:

predict(f, data.frame(X=c(5000)))
##        1
## 1557.201

This is more than the highest possible rating – we have been extrapolating way beyond the observable data range.
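If we want to guard against such silent extrapolation, we must compare the new inputs against the training range ourselves – predict() will not do it for us. Below is a sketch with a hypothetical helper (the toy data and the name predict_checked are made up; we rely on the fact that lm() keeps the model frame in $model by default):

```r
d <- data.frame(X=c(100, 500, 900), Y=c(250, 360, 470))  # toy data (made up)
m <- lm(Y~X, data=d)

predict_checked <- function(model, newdata) {
    rng <- range(model$model$X)   # observed range of the predictor
    if (any(newdata$X < rng[1] | newdata$X > rng[2]))
        warning("extrapolating beyond the observed range of X")
    predict(model, newdata)
}

predict_checked(m, data.frame(X=5000))  # still returns a value, but warns
```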

Plotting:

plot(X, Y, col="#000000aa", las=1)
abline(f, col=2, lwd=3)

Note that our $$Y=aX+b$$ model is interpretable and well-behaved:

(not all machine learning models have this property – think of deep neural networks, which we rather conceive of as black boxes)

• we know that if we increase $$X$$ by a small amount, $$Y$$ will also increase (positive correlation),

• the model is continuous – a small change in $$X$$ does not yield a drastic change in $$Y$$,

• we know what will happen if we increase or decrease $$X$$ by, say, $$100$$,

• the function is invertible – if we want a Rating of $$500$$, we can compute the preferred Balance that should yield it (provided that the model is valid).

### 1.4.3 Analytic Solution

It may be shown (which we actually do below) that the solution is:

$\left\{ \begin{array}{rl} a = & \dfrac{ n \displaystyle\sum_{i=1}^n x_i y_i - \displaystyle\sum_{i=1}^n y_i \displaystyle\sum_{i=1}^n x_i }{ n \displaystyle\sum_{i=1}^n x_i x_i - \displaystyle\sum_{i=1}^n x_i\displaystyle\sum_{i=1}^n x_i }\\ b = & \dfrac{1}{n}\displaystyle\sum_{i=1}^n y_i - a \dfrac{1}{n} \displaystyle\sum_{i=1}^n x_i \\ \end{array} \right.$

Which can be implemented in R as follows:

n <- length(X)
a <- (n*sum(X*Y)-sum(X)*sum(Y))/(n*sum(X*X)-sum(X)^2)
b <- mean(Y)-a*mean(X)
c(a, b) # the same as f$coefficients
##    0.2661459 226.4711446

### 1.4.4 Derivation of the Solution (**)

(You can safely skip this part if you do not yet know how to find the minimum of a function of several variables or what partial derivatives are.)

Denote with:

$E(a,b)=\mathrm{SSR}(x\mapsto ax+b|\mathbf{x},\mathbf{y}) =\sum_{i=1}^n \left( ax_i+b - y_i \right) ^2.$

We seek the minimum of $$E$$ w.r.t. both $$a,b$$.

Theorem.

If $$E$$ has a (local) minimum at $$(a^*,b^*)$$, then its partial derivatives vanish therein, i.e., $$\partial E/\partial a(a^*, b^*) = 0$$ and $$\partial E/\partial b(a^*, b^*)=0$$.

We have:

$E(a,b) = \displaystyle\sum_{i=1}^n \left( ax_i+b - y_i \right) ^2.$

We need to compute the partial derivatives $$\partial E/\partial a$$ (derivative of $$E$$ w.r.t. variable $$a$$ – all other terms treated as constants) and $$\partial E/\partial b$$ (w.r.t. $$b$$).

Useful rules – derivatives w.r.t. $$a$$ (denote $$f'(a)=(f(a))'$$):

• $$(f(a)+g(a))'=f'(a)+g'(a)$$ (derivative of sum is sum of derivatives)
• $$(f(a) g(a))' = f'(a)g(a) + f(a)g'(a)$$ (derivative of product)
• $$(f(g(a)))' = f'(g(a)) g'(a)$$ (chain rule)
• $$(c)' = 0$$ for any constant $$c$$ (expression not involving $$a$$)
• $$(a^p)' = pa^{p-1}$$ for any $$p$$
• in particular: $$(c a^2+d)'=2ca$$, $$(ca)'=c$$, $$((ca+d)^2)'=2(ca+d)c$$ (application of the above rules)

We seek $$a,b$$ such that $$\frac{\partial E}{\partial a}(a,b) = 0$$ and $$\frac{\partial E}{\partial b}(a,b)=0$$.

$\left\{ \begin{array}{rl} \frac{\partial E}{\partial a}(a,b)=&2\displaystyle\sum_{i=1}^n \left( ax_i+b - y_i \right) x_i = 0 \\ \frac{\partial E}{\partial b}(a,b)=&2\displaystyle\sum_{i=1}^n \left( ax_i+b - y_i \right) = 0\\ \end{array} \right.$

This is a system of 2 linear equations. Easy.
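Such hand-computed partial derivatives are easy to get wrong, so it is worth sanity-checking them against central finite differences; a sketch on made-up toy data:

```r
x <- c(1, 2, 3); y <- c(2, 4, 5)               # toy data (made up)
E    <- function(a, b) sum((a*x + b - y)^2)    # the SSR as a function of (a, b)
dEda <- function(a, b) 2*sum((a*x + b - y)*x)  # analytic partial derivative w.r.t. a
dEdb <- function(a, b) 2*sum(a*x + b - y)      # analytic partial derivative w.r.t. b

h <- 1e-6
(E(1+h, 0.5) - E(1-h, 0.5)) / (2*h)  # central difference w.r.t. a
dEda(1, 0.5)                         # should agree up to rounding
(E(1, 0.5+h) - E(1, 0.5-h)) / (2*h)  # central difference w.r.t. b
dEdb(1, 0.5)
```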

Rearranging like back in the school days:

$\left\{ \begin{array}{rl} a \displaystyle\sum_{i=1}^n x_i x_i + b \displaystyle\sum_{i=1}^n x_i = & \displaystyle\sum_{i=1}^n x_i y_i \\ a \displaystyle\sum_{i=1}^n x_i+ b n = & \displaystyle\sum_{i=1}^n y_i \\ \end{array} \right.$
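Rather than rearranging by hand, this $$2\times 2$$ linear system can also be solved mechanically with R's solve(); a sketch on made-up toy data:

```r
x <- c(1, 2, 3, 4); y <- c(2.1, 3.9, 6.2, 7.8)  # toy data (made up)
n <- length(x)
A <- rbind(c(sum(x*x), sum(x)),   # coefficient matrix of the system
           c(sum(x),   n))
rhs <- c(sum(x*y), sum(y))        # right-hand sides
solve(A, rhs)                     # (a, b)
coef(lm(y~x))[c(2, 1)]            # the same, via lm()
```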

It is left as an exercise to show that the solution is:

$\left\{ \begin{array}{rl} a^* = & \dfrac{ n \displaystyle\sum_{i=1}^n x_i y_i - \displaystyle\sum_{i=1}^n y_i \displaystyle\sum_{i=1}^n x_i }{ n \displaystyle\sum_{i=1}^n x_i x_i - \displaystyle\sum_{i=1}^n x_i\displaystyle\sum_{i=1}^n x_i }\\ b^* = & \dfrac{1}{n}\displaystyle\sum_{i=1}^n y_i - a^* \dfrac{1}{n} \displaystyle\sum_{i=1}^n x_i \\ \end{array} \right.$

(we should additionally perform the second derivative test to ensure that this point is indeed the minimum of $$E$$ – which turns out to be the case)
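Here the second derivative test is straightforward: the Hessian of $$E$$ does not depend on $$(a,b)$$ at all – it equals $$2\left(\begin{array}{cc}\sum x_i^2 & \sum x_i\\ \sum x_i & n\end{array}\right)$$ – and it is positive definite whenever the $$x_i$$ are not all equal. We can confirm this numerically on made-up toy data:

```r
x <- c(1, 2, 3, 4, 5)                # toy data (made up)
H <- 2*rbind(c(sum(x*x), sum(x)),    # the constant Hessian of E
             c(sum(x),   length(x)))
eigen(H)$values  # both positive => E is strictly convex, hence a global minimum
```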

(**) In the next chapter, we will introduce the notion of Pearson’s linear correlation coefficient, $$r$$ (see cor() in R). It may be shown that $$a$$ and $$b$$ can also be rewritten as:

(a <- cor(X,Y)*sd(Y)/sd(X))
##           [,1]
## [1,] 0.2661459
(b <- mean(Y)-a*mean(X))
##          [,1]
## [1,] 226.4711
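As a quick cross-check on plain numeric vectors (toy data, made up), this formula agrees with the slope and intercept reported by lm():

```r
x <- c(1, 3, 4, 6, 8); y <- c(2, 5, 7, 9, 12)  # toy data (made up)
a <- cor(x, y)*sd(y)/sd(x)
b <- mean(y) - a*mean(x)
c(a, b)
coef(lm(y~x))[c(2, 1)]  # the same
```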