1.3 Simple Regression

1.3.1 Introduction

Simple regression is the easiest setting to start with – let’s assume \(p=1\), i.e., all inputs are 1-dimensional. Denote \(x_i=x_{i,1}\).

We will use it to build many intuitions; for example, it will be easy to illustrate all the concepts graphically.

In what follows we will be modelling the Credit Rating (\(Y\)) as a function of the average Credit Card Balance (\(X\)) in USD for customers with positive Balance only.

Our aim is to construct a function \(f\) that models Rating as a function of Balance, so that \(f(X)\approx Y\).

We are equipped with \(n=310\) reference (observed) Ratings \(\mathbf{y}=[y_1\ \cdots\ y_n]^T\) for particular Balances \(\mathbf{x}=[x_1\ \cdots\ x_n]^T\).

Note the following naming conventions:

  • Variable types:

    • \(X\) – independent/explanatory/predictor variable

    • \(Y\) – dependent/response/predicted variable

  • Also note that:

    • \(Y\) – idealisation (any possible Rating)

    • \(\mathbf{y}=[y_1\ \cdots\ y_n]^T\) – values actually observed

The model will not be ideal, but it might be usable:

  • We will be able to predict the rating of any new client.

    What should be the rating of a client with Balance of $1500?
    What should be the rating of a client with Balance of $2500?

  • We will be able to describe (understand) this reality using a single mathematical formula, and thereby infer that there is an association between \(X\) and \(Y\).

    Think of “data compression” and laws of physics, e.g., \(E=mc^2\).

(*) Mathematically, we will assume that there is some “true” function that models the data (true relationship between \(Y\) and \(X\)), but the observed outputs are subject to additive error: \[Y=f(X)+\varepsilon.\]

\(\varepsilon\) is a random term. Classically, we assume that the errors are independent of each other, have an expected value of \(0\) (no systematic error, i.e., they are unbiased), and follow a normal distribution.

(*) We denote this as \(\varepsilon\sim\mathcal{N}(0, \sigma)\) (read: random variable \(\varepsilon\) follows a normal distribution with expected value of \(0\) and standard deviation of \(\sigma\) for some \(\sigma\ge 0\)).
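
To make the additive error model more tangible, here is a minimal Python sketch that generates noisy outputs \(y_i=f(x_i)+\varepsilon_i\). The "true" function `f_true`, its coefficients, the inputs `xs`, and the value \(\sigma=40\) are all made up for illustration; they are not estimated from the Credit data.

```python
import random
import statistics

def simulate(f, xs, sigma, seed=123):
    """Return noisy outputs y_i = f(x_i) + eps_i with eps_i ~ N(0, sigma)."""
    rng = random.Random(seed)
    # rng.gauss(mu, sigma) takes the standard deviation as its second
    # argument, which matches the N(0, sigma) convention used in the text
    return [f(x) + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical "true" relationship, chosen only for illustration
def f_true(x):
    return 230.0 + 0.27 * x

xs = [100.0 * i for i in range(1, 21)]   # made-up inputs
ys = simulate(f_true, xs, sigma=40.0)

# The realised errors should be centred around 0 with spread close to sigma
errors = [y - f_true(x) for x, y in zip(xs, ys)]
print(statistics.mean(errors), statistics.stdev(errors))
```

With only 20 points, the sample mean and standard deviation of the errors will merely be close to \(0\) and \(\sigma\); the agreement improves as the sample grows.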

\(\sigma\) controls the amount of noise (and hence, uncertainty). The figure below shows the probability density functions (PDFs) of \(\mathcal{N}(0, \sigma)\) for different values of \(\sigma\):

[Figure: densities of \(\mathcal{N}(0, \sigma)\) for several values of \(\sigma\)]
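
The effect of \(\sigma\) can also be checked numerically. The sketch below evaluates the standard formula for the normal density, \(\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{x^2}{2\sigma^2}\right)\), at two points. The helper name `normal_pdf` and the three \(\sigma\) values are our own choices, and the formula requires \(\sigma>0\).

```python
import math

def normal_pdf(x, sigma, mu=0.0):
    """Density of N(mu, sigma) at x; sigma is the standard deviation (> 0)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Arbitrary illustrative values of sigma
for sigma in (0.5, 1.0, 2.0):
    # density at the centre and at a point further out
    print(sigma, normal_pdf(0.0, sigma), normal_pdf(2.0, sigma))
```

Smaller \(\sigma\) gives a taller, narrower peak around \(0\); larger \(\sigma\) puts more probability mass far from \(0\), i.e., large errors become more likely.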

1.3.2 Search Space and Objective

There are many different functions that can be fitted to the observed \((\mathbf{x},\mathbf{y})\).

Thus, we need a model selection criterion.

Usually, we will be interested in a model that minimises appropriately aggregated residuals \(f(x_i)-y_i\), i.e., predicted outputs minus observed outputs, often denoted with \(\hat{y}_i-y_i\), for \(i=1,\dots,n\).

Top choice: sum of squared residuals: \[ \begin{array}{rl} \mathrm{SSR}(f|\mathbf{x},\mathbf{y}) & = \left( f(x_1)-y_1 \right)^2 + \dots + \left( f(x_n)-y_n \right)^2 \\ & = \displaystyle\sum_{i=1}^n \left( f(x_i)-y_i \right)^2 \end{array} \]

Read “\(\sum_{i=1}^n z_i\)” as “the sum of \(z_i\) for \(i\) from \(1\) to \(n\)”; this is just a shorthand for \(z_1+z_2+\dots+z_n\).

The notation \(\mathrm{SSR}(f|\mathbf{x},\mathbf{y})\) means that it is the error measure corresponding to the model \((f)\) given our data.
We could’ve denoted it with \(\mathrm{SSR}_{\mathbf{x},\mathbf{y}}(f)\) or even \(\mathrm{SSR}(f)\) to emphasise that \(\mathbf{x},\mathbf{y}\) are just fixed values and we are not interested in changing them at all (they are “global variables”).
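
The definition of the SSR translates almost literally into code. Below is a minimal Python sketch; the function name `ssr`, the toy data, and the two candidate models are made up for illustration and have nothing to do with the Credit dataset.

```python
def ssr(f, x, y):
    """Sum of squared residuals of model f on paired observations (x, y)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # residuals are predicted minus observed outputs, as in the text
    return sum((f(xi) - yi) ** 2 for xi, yi in zip(x, y))

# Made-up toy data
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.5, 5.5, 8.0]

print(ssr(lambda t: 2.0 * t, x, y))   # residuals 0, -0.5, 0.5, 0    -> 0.5
print(ssr(lambda t: t + 1.0, x, y))   # residuals 0, -1.5, -1.5, -3  -> 13.5
```

Here `x` and `y` play the role of the fixed data, and only `f` varies between the two calls, which is exactly what the notation \(\mathrm{SSR}(f|\mathbf{x},\mathbf{y})\) expresses.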

We enjoy SSR because (amongst other reasons):

  • larger errors are penalised much more than smaller ones

    (this can be considered a drawback as well)

  • (**) statistically speaking, this has a clear underlying interpretation

    (assuming errors are normally distributed, finding a model minimising the SSR is equivalent to maximum likelihood estimation)

  • the models minimising the SSR can often be found easily

    (corresponding optimisation tasks have an analytic solution – studied already by Gauss in the late 18th century)
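
(**) The maximum likelihood interpretation can be sketched in two lines, under the assumptions stated earlier: \(y_i=f(x_i)+\varepsilon_i\) with independent \(\varepsilon_i\sim\mathcal{N}(0,\sigma)\) and some fixed \(\sigma>0\). The symbol \(L\) for the likelihood is our own notation. We have \[ L(f) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{\left(f(x_i)-y_i\right)^2}{2\sigma^2}\right), \] and hence \[ \log L(f) = -n\log\left(\sigma\sqrt{2\pi}\right) - \frac{1}{2\sigma^2}\,\mathrm{SSR}(f|\mathbf{x},\mathbf{y}). \]

The first term does not depend on \(f\) and the SSR enters with a negative coefficient, so maximising \(\log L(f)\) over \(f\) amounts to minimising the SSR.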

(**) Other choices:

  • regularised SSR, e.g., lasso or ridge regression (in the case of multiple input variables)
  • sum or median of absolute residuals (robust regression)
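
To see why squaring penalises large errors so heavily, and why absolute residuals are considered more robust, here is a small Python sketch comparing the two criteria on made-up data with and without a single outlier. The names `ssr` and `sar`, the model `f`, and the data are illustrative assumptions.

```python
def residuals(f, x, y):
    return [f(xi) - yi for xi, yi in zip(x, y)]

def ssr(f, x, y):
    """Sum of squared residuals."""
    return sum(r ** 2 for r in residuals(f, x, y))

def sar(f, x, y):
    """Sum of absolute residuals (one of the robust alternatives)."""
    return sum(abs(r) for r in residuals(f, x, y))

def f(t):  # hypothetical candidate model
    return 2.0 * t

x = [1.0, 2.0, 3.0, 4.0]
y_clean = [2.0, 4.5, 5.5, 8.0]
y_outlier = [2.0, 4.5, 5.5, 18.0]  # last observation shifted by +10

print(ssr(f, x, y_clean), sar(f, x, y_clean))      # 0.5 1.0
print(ssr(f, x, y_outlier), sar(f, x, y_outlier))  # 100.5 11.0
```

A single corrupted observation raises the SSR by 100 but the sum of absolute residuals only by 10, so a model chosen by minimising the SSR is pulled towards outliers much more strongly.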

Fitting a model to data can be written as an optimisation problem:

\[ \min_{f\in\mathcal{F}} \mathrm{SSR}(f|\mathbf{x},\mathbf{y}), \]

i.e., find \(f\) minimising the SSR (seek “best” \(f\)) amongst the set of admissible models \(\mathcal{F}\).

Example \(\mathcal{F}\)s:

  • \(\mathcal{F}=\{\text{All possible functions of one variable}\}\) – if there are no repeated \(x_i\)’s, this corresponds to data interpolation; note that there are many functions that give SSR of \(0\).

  • \(\mathcal{F}=\{ x\mapsto x^2, x\mapsto \cos(x), x\mapsto \exp(2x+7)-9 \}\) – obviously an ad-hoc choice but you can easily choose the best amongst the 3 by computing 3 sums of squares.

  • \(\mathcal{F}=\{ x\mapsto ax+b : a,b\in\mathbb{R} \}\) – the space of linear functions of one variable

  • etc.

(e.g., \(x\mapsto x^2\) is read “\(x\) maps to \(x^2\)” and is an elegant way to define an inline function \(f\) such that \(f(x)=x^2\))
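
For a finite \(\mathcal{F}\), such as the three-element family above, the optimisation problem can be solved by brute force: compute the SSR of every candidate and keep the smallest. A minimal Python sketch follows; the toy data are made up, and the dictionary keys are just labels.

```python
import math

def ssr(f, x, y):
    """Sum of squared residuals of model f on (x, y)."""
    return sum((f(xi) - yi) ** 2 for xi, yi in zip(x, y))

# The three-element family F from the example above
candidates = {
    "x^2": lambda x: x ** 2,
    "cos(x)": lambda x: math.cos(x),
    "exp(2x+7)-9": lambda x: math.exp(2 * x + 7) - 9,
}

# Made-up toy data
x = [-1.0, 0.0, 1.0, 2.0]
y = [1.2, 0.1, 0.9, 4.3]

# Evaluate the objective for each admissible model and pick the minimiser
scores = {name: ssr(f, x, y) for name, f in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # x^2
```

This exhaustive approach is not available for infinite families such as the linear functions \(x\mapsto ax+b\), where the search runs over all real \(a\) and \(b\).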