## 1.3 Simple Regression

### 1.3.1 Introduction

Simple regression is the easiest setting to start with – let’s assume $$p=1$$, i.e., all inputs are 1-dimensional. Denote $$x_i=x_{i,1}$$.

We will use it to build many intuitions; for example, all the concepts will be easy to illustrate graphically.

library("ISLR") # Credit dataset
plot(Credit$Balance, Credit$Rating, las=1) # scatter plot

In what follows we will be modelling the Credit Rating ($$Y$$) as a function of the average Credit Card Balance ($$X$$) in USD for customers with positive Balance only.

X <- as.matrix(as.numeric(Credit$Balance[Credit$Balance>0])) # predictor: positive Balances only
Y <- as.matrix(as.numeric(Credit$Rating[Credit$Balance>0]))  # response: the corresponding Ratings

plot(X, Y, las=1)

Our aim is to construct a function $$f$$ that models Rating as a function of Balance, $$f(X)=Y$$.

We are equipped with $$n=310$$ reference (observed) Ratings $$\mathbf{y}=[y_1\ \cdots\ y_n]^T$$ for particular Balances $$\mathbf{x}=[x_1\ \cdots\ x_n]^T$$.
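
We can double-check the sample size directly, since $$n$$ is just the number of rows of the matrix X:

nrow(X) # n = 310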

Note the following naming conventions:

• Variable types:

    • $$X$$ – independent/explanatory/predictor variable

    • $$Y$$ – dependent/response/predicted variable

• Also note that:

    • $$Y$$ – idealisation (any possible Rating)

    • $$\mathbf{y}=[y_1\ \cdots\ y_n]^T$$ – values actually observed

The model will not be ideal, but it might be usable:

• We will be able to predict the rating of any new client.

What should be the rating of a client with Balance of $1500? What should be the rating of a client with Balance of $2500?

• We will be able to describe (understand) this reality using a single mathematical formula, and hence infer that there is an association between $$X$$ and $$Y$$.

Think of “data compression” and laws of physics, e.g., $$E=mc^2$$.

(*) Mathematically, we will assume that there is some “true” function that models the data (true relationship between $$Y$$ and $$X$$), but the observed outputs are subject to additive error: $Y=f(X)+\varepsilon.$

$$\varepsilon$$ is a random error term; classically, we assume that the errors are independent of each other, have expected value $$0$$ (there is no systematic error, i.e., the model is unbiased), and follow a normal distribution.

(*) We denote this as $$\varepsilon\sim\mathcal{N}(0, \sigma)$$ (read: random variable $$\varepsilon$$ follows a normal distribution with expected value of $$0$$ and standard deviation of $$\sigma$$ for some $$\sigma\ge 0$$).
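
(*) To build intuition, we can easily simulate such a data generating process ourselves. Below is a minimal sketch; the "true" $$f$$, the sample size, and $$\sigma=25$$ are all arbitrary choices made purely for illustration:

set.seed(123) # reproducibility
n <- 100
x <- runif(n, min=0, max=2000)            # hypothetical Balance-like inputs
f_true <- function(x) 0.2*x + 250         # a made-up "true" relationship
y <- f_true(x) + rnorm(n, mean=0, sd=25)  # observed outputs = truth + N(0, 25) noise
plot(x, y, las=1)
curve(f_true, from=0, to=2000, add=TRUE, lwd=2) # overlay the true f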

$$\sigma$$ controls the amount of noise (and hence, uncertainty). Here is a plot of the probability density functions (PDFs, densities) of $$\mathcal{N}(0, \sigma)$$ for a few different $$\sigma$$s:
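
(Such a figure can be recreated with dnorm(); the particular $$\sigma$$ values below are an arbitrary choice.)

z <- seq(-10, 10, length.out=501)
plot(z, dnorm(z, 0, 1), type="l", lty=1, las=1, xlab="", ylab="density")
lines(z, dnorm(z, 0, 2), lty=2)
lines(z, dnorm(z, 0, 4), lty=3)
legend("topright", legend=expression(sigma==1, sigma==2, sigma==4), lty=1:3)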

### 1.3.2 Search Space and Objective

There are many different functions that can be fitted to the observed data $$(\mathbf{x},\mathbf{y})$$.

Thus, we need a model selection criterion.

Usually, we will be interested in a model that minimises the appropriately aggregated residuals $$f(x_i)-y_i$$, i.e., the predicted outputs minus the observed ones, often denoted $$\hat{y}_i-y_i$$, for $$i=1,\dots,n$$.

Top choice: sum of squared residuals: $\begin{array}{rl} \mathrm{SSR}(f|\mathbf{x},\mathbf{y}) & = \left( f(x_1)-y_1 \right)^2 + \dots + \left( f(x_n)-y_n \right)^2 \\ & = \displaystyle\sum_{i=1}^n \left( f(x_i)-y_i \right)^2 \end{array}$

Read “$$\sum_{i=1}^n z_i$$” as “the sum of $$z_i$$ for $$i$$ from $$1$$ to $$n$$”; this is just a shorthand for $$z_1+z_2+\dots+z_n$$.

The notation $$\mathrm{SSR}(f|\mathbf{x},\mathbf{y})$$ means that it is the error measure corresponding to the model $$f$$ given our data.
We could’ve denoted it with $$\mathrm{SSR}_{\mathbf{x},\mathbf{y}}(f)$$ or even $$\mathrm{SSR}(f)$$ to emphasise that $$\mathbf{x},\mathbf{y}$$ are just fixed values and we are not interested in changing them at all (they are “global variables”).
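
For concreteness, the SSR is a one-liner in R; here is a minimal sketch (the helper's name, SSR, is our own choice, and f stands for whatever candidate model is currently under consideration):

SSR <- function(f, x, y) sum((f(x) - y)^2) # sum of squared residuals for model f on data (x, y)
SSR(function(x) 0.25*x + 200, X, Y) # e.g., an ad-hoc linear model (the coefficients are made up)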

We enjoy the SSR because (amongst other things):

• larger errors are penalised much more than smaller ones

(this can be considered a drawback as well)

• (**) statistically speaking, this has a clear underlying interpretation

(assuming errors are normally distributed, finding a model minimising the SSR is equivalent to maximum likelihood estimation)

• the models minimising the SSR can often be found easily

(the corresponding optimisation tasks have an analytic solution – studied already by Gauss in the late 18th century; see the sketch below)
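
(*) For instance, for the family of linear functions $$x\mapsto ax+b$$ (one of the example model spaces considered below), it is a well-known result that the SSR-minimising coefficients are $$a=\mathrm{cov}(\mathbf{x},\mathbf{y})/\mathrm{var}(\mathbf{x})$$ and $$b=\bar{y}-a\bar{x}$$; in R:

a <- as.numeric(cov(X, Y)/var(X)) # slope minimising the SSR
b <- mean(Y) - a*mean(X)          # intercept minimising the SSR
c(a=a, b=b)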

(**) Other choices:

• regularised SSR, e.g., lasso or ridge regression (in the case of multiple input variables)
• sum or median of absolute values (robust regression; see the sketch below)
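
(**) These alternative aggregates are equally easy to express; here is a sketch mirroring the SSR helper above (the names SAR and MAR are our own, not standard):

SAR <- function(f, x, y) sum(abs(f(x) - y))    # sum of absolute residuals
MAR <- function(f, x, y) median(abs(f(x) - y)) # median of absolute residuals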

Fitting a model to data can be written as an optimisation problem:

$\min_{f\in\mathcal{F}} \mathrm{SSR}(f|\mathbf{x},\mathbf{y}),$

i.e., find $$f$$ minimising the SSR (seek “best” $$f$$)
amongst the set of admissible models $$\mathcal{F}$$.
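
(*) For many choices of $$\mathcal{F}$$, this minimisation can be carried out numerically. Here is a sketch using optim(), with $$\mathcal{F}$$ restricted to the linear functions $$x\mapsto ax+b$$ (one of the example spaces listed below) and an arbitrarily chosen starting point:

# minimise the SSR over the family f(x) = a*x + b, parametrised by par = (a, b)
res <- optim(c(0, 0), function(par) sum((par[1]*X + par[2] - Y)^2))
res$par # (a, b) approximately minimising the SSR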

Example $$\mathcal{F}$$s:

• $$\mathcal{F}=\{\text{All possible functions of one variable}\}$$ – if there are no repeated $$x_i$$’s, this corresponds to data interpolation; note that there are many functions that give SSR of $$0$$.

• $$\mathcal{F}=\{ x\mapsto x^2, x\mapsto \cos(x), x\mapsto \exp(2x+7)-9 \}$$ – obviously an ad-hoc choice, but you can easily choose the best amongst the three by computing the three sums of squares (see the code sketch after this list).

• $$\mathcal{F}=\{ x\mapsto ax+b : a, b\in\mathbb{R} \}$$ – the space of linear functions of one variable.

• etc.

(e.g., $$x\mapsto x^2$$ is read “$$x$$ maps to $$x^2$$” and is an elegant way to define an inline function $$f$$ such that $$f(x)=x^2$$)
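
Back to the second example: choosing the best of the three candidates boils down to computing three sums of squares on our Credit data, e.g., with the SSR() helper defined earlier, and picking the smallest one:

F3 <- list(function(x) x^2, function(x) cos(x), function(x) exp(2*x+7)-9)
(ssrs <- sapply(F3, SSR, x=X, y=Y)) # the three sums of squared residuals
which.min(ssrs) # the "best" (or rather: least bad) of the three models

(Note that $$\exp(2x+7)-9$$ numerically overflows for large Balances, so its SSR is reported as infinite.)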