## 6.1 Introduction

### 6.1.1 Optimisation Problem

Mathematical optimisation (a.k.a. mathematical programming) studies algorithms for selecting the best element from a set of available alternatives.

Most frequently, “best” is expressed in terms of an error or goodness-of-fit measure $f:D\to\mathbb{R}$ called an objective function.

$$D$$ is the search space (problem domain, feasible set) – it defines the set of possible solution candidates.

An optimisation task deals with finding an element $${x}\in D$$ that minimises or maximises $$f$$:

$\min_{{x}\in D} f({x}) \quad\text{or}\quad\max_{{x}\in D} f({x}),$

In this chapter, we will deal with unconstrained continuous optimisation, i.e., we will assume the search space is $$D=\mathbb{R}^p$$ for some $$p$$.
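As a warm-up, here is a minimal sketch of unconstrained continuous optimisation in R using the built-in `optim()` function; the objective and the starting point below are illustrative assumptions, not taken from the text.

```r
# Minimise f(x) = (x1 - 1)^2 + (x2 + 2)^2 over R^2;
# optim() uses the Nelder-Mead method by default.
f <- function(x) (x[1] - 1)^2 + (x[2] + 2)^2
res <- optim(c(0, 0), f)  # c(0, 0) is an arbitrary starting point
res$par    # approximately c(1, -2) -- the location of the minimum
res$value  # approximately 0 -- the value of f at the minimum
```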

### 6.1.2 Example Optimisation Problems in Machine Learning

In multiple linear regression we were minimising the sum of squared residuals $\min_{\beta_0, \beta_1,\dots, \beta_p\in\mathbb{R}} \sum_{i=1}^n \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_i \right)^2.$
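To make this concrete, the sum of squared residuals can be minimised numerically and the result compared with `lm()`; the synthetic one-predictor data below are an assumption for illustration only.

```r
# Fit simple linear regression by directly minimising the SSR.
set.seed(123)
x <- runif(100)
y <- 3 + 2 * x + rnorm(100, sd=0.1)  # true intercept 3, slope 2
ssr <- function(beta) sum((beta[1] + beta[2] * x - y)^2)
res <- optim(c(0, 0), ssr)
res$par          # close to c(3, 2)
coef(lm(y ~ x))  # reference solution computed by lm()
```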

In binary logistic regression we were minimising the cross-entropy: $\min_{(\beta_0, \beta_1,\dots, \beta_p)\in\mathbb{R}^{p+1}} -\frac{1}{n} \sum_{i=1}^n \left( y_i \log \frac{1}{1+e^{-(\beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p})}} + (1-y_i)\log \frac{e^{-(\beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p})}}{1+e^{-(\beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p})}}\right).$
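Similarly, the cross-entropy can be minimised directly and the estimates compared with `glm()`; the synthetic single-predictor data below are assumed for illustration.

```r
# Binary logistic regression via direct minimisation of the cross-entropy.
set.seed(42)
x <- rnorm(200)
p <- 1 / (1 + exp(-(1 + 2 * x)))  # true intercept 1, slope 2
y <- rbinom(200, 1, p)
xent <- function(beta) {
    pr <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))  # predicted probabilities
    -mean(y * log(pr) + (1 - y) * log(1 - pr))
}
res <- optim(c(0, 0), xent)
res$par                                     # close to the glm() estimates
coef(glm(y ~ x, family=binomial("logit")))  # reference solution
```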

### 6.1.3 Types of Minima and Maxima

Note that minimising $$f$$ is the same as maximising $$\bar{f}=-f$$.

In other words, $$\min_{{x}\in D} f({x})$$ and $$\max_{{x}\in D} -f({x})$$ represent the same optimisation problem (and hence have identical solutions).
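A quick numerical check (with an assumed example function, $$f(x)=(x-3)^2$$): minimising $$f$$ and maximising $$-f$$ locate the same point.

```r
# The minimiser of f and the maximiser of -f coincide.
f <- function(x) (x - 3)^2
loc_min <- optimise(f, c(-10, 10))$minimum                        # ~3
loc_max <- optimise(function(x) -f(x), c(-10, 10), maximum=TRUE)$maximum  # ~3
c(loc_min, loc_max)
```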

A minimum of $$f$$ is a point $${x}^*$$ such that $$f({x}^*)\le f({x})$$ for all $${x}\in D$$.

A maximum of $$f$$ is a point $${x}^*$$ such that $$f({x}^*)\ge f({x})$$ for all $${x}\in D$$.

Assuming that $$D=\mathbb{R}$$, here is an example objective function, $$f:D\to\mathbb{R}$$, that has a minimum at $${x}^*=1$$ with $$f(x^*)=-2$$.

$$\min_{x\in \mathbb{R}} f(x)=-2$$ (value of $$f$$ at the minimum)

$$\mathrm{arg}\min_{x\in \mathbb{R}} f(x)=1$$ (location of the minimum)
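The distinction between the two quantities can be verified numerically; $$f(x)=(x-1)^2-2$$ below is an assumed reconstruction of the example (its minimum is $$-2$$, attained at $$x^*=1$$), since the original figure is not reproduced here.

```r
# optimise() reports both the location and the value of the minimum.
f <- function(x) (x - 1)^2 - 2
res <- optimise(f, c(-10, 10))
res$objective  # min f(x)    -> approximately -2
res$minimum    # argmin f(x) -> approximately  1
```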

By definition, a minimum/maximum is not necessarily unique; this depends on the problem.

Assuming that $$D=\mathbb{R}$$, here is an example objective function, $$f:D\to\mathbb{R}$$, that has multiple minima; every $${x}^*\in[1-\sqrt{2},1+\sqrt{2}]$$ yields $$f(x^*)=0$$.

If this were the case in some machine learning problem, it would mean that we could have many equally well-performing models, and hence many equivalent explanations of the same phenomenon.

Moreover, it may happen that a function has multiple local minima.

We say that $$f$$ has a local minimum at $$\mathbf{x}^+\in D$$, if for some neighbourhood $$B(\mathbf{x}^+)$$ of $$\mathbf{x}^+$$ it holds $$f(\mathbf{x}^+) \le f(\mathbf{x})$$ for each $$\mathbf{x}\in B(\mathbf{x}^+)$$.

If $$D=\mathbb{R}$$, by neighbourhood $$B(x)$$ of $$x$$ we mean an open interval centred at $$x$$ of width $$2r$$ for some small $$r>0$$, i.e., $$(x-r, x+r)$$.

(*) If $$D=\mathbb{R}^p$$ (for any $$p\ge 1$$), by neighbourhood $$B(\mathbf{x})$$ of $$\mathbf{x}$$ we mean an open ball centred at $$\mathbf{x}$$ of some small radius $$r>0$$, i.e., $$\{\mathbf{y}: \|\mathbf{x}-\mathbf{y}\|<r\}$$ (read: the set of all the points with Euclidean distances to $$\mathbf{x}$$ less than $$r$$).

To avoid ambiguity, the “true” minimum (a point $${x}^*$$ such that $$f({x}^*)\le f({x})$$ for all $${x}\in D$$) is sometimes also referred to as a global minimum.

Of course, the global minimum is also a function’s local minimum.

The existence of local minima is problematic, as most optimisation methods might get stuck there and fail to return the global one.

Moreover, we often cannot be sure if the result returned by an algorithm is indeed a global minimum. Maybe there exists a better solution that hasn’t been considered yet?
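One common mitigation (a sketch, not a method from the text) is a random-restart (“multistart”) strategy: run a local optimiser from many random initial points and keep the best result. Himmelblau’s function below, with four global minima all attaining $$0$$, is an assumed test case.

```r
# Multistart: restart optim() from random points, keep the best result.
himmelblau <- function(x) (x[1]^2 + x[2] - 11)^2 + (x[1] + x[2]^2 - 7)^2
set.seed(1)
best <- NULL
for (i in 1:50) {
    res <- optim(runif(2, -5, 5), himmelblau)  # random starting point
    if (is.null(best) || res$value < best$value) best <- res
}
best$value  # close to 0 (each of Himmelblau's four minima attains 0)
```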

(Figure: a comparison of smooth vs non-smooth vs noisy objectives.)

### 6.1.4 Example Objective over a 2D Domain

Of course, our objective function does not necessarily have to be defined over a one-dimensional domain.

For example, consider the following function: $g(x_1,x_2)=\log\left((x_1^{2}+x_2-5)^{2}+(x_1+x_2^{2}-3)^{2}+x_1^2-1.60644\dots\right)$

```r
g  <- function(x1, x2)
    log((x1^2+x2-5)^2 + (x1+x2^2-3)^2 + x1^2 - 1.60644366086443841)
x1 <- seq(-5, 5, length.out=100)
x2 <- seq(-5, 5, length.out=100)
# outer() expands two vectors to form a 2D grid
# and applies a given function on each point
y  <- outer(x1, x2, g)
```

There are four local minima:

| $x_1$     | $x_2$      | $g(x_1,x_2)$ |
|----------:|-----------:|-------------:|
|  2.278005 | -0.6134279 |    1.3564152 |
| -2.612316 | -2.3454621 |    1.7050788 |
|  1.798788 |  1.1987929 |    0.6954984 |
| -1.542256 |  2.1564053 |    0.0000000 |
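A local optimiser started near each of these points should recover the corresponding minimum; the starting points below are illustrative choices (and `g` is repeated so the snippet is self-contained).

```r
g  <- function(x1, x2)  # as defined earlier
    log((x1^2+x2-5)^2 + (x1+x2^2-3)^2 + x1^2 - 1.60644366086443841)
g2 <- function(x) g(x[1], x[2])  # wrapper taking a single vector argument
res1 <- optim(c(2, -1), g2)
res2 <- optim(c(-1.5, 2), g2)
res1$par  # should approach ( 2.278, -0.613)
res2$par  # should approach (-1.542,  2.156) -- the global minimum
```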

The global minimum is at $$(x_1^*, x_2^*)\simeq(-1.542256, 2.156405)$$, where $$g$$ attains the value $$0$$:

```r
g(-1.542255693195422641930153, 2.156405289793087261832605)
## [1] 0
```

Let’s explore various ways of depicting $$g$$.

A contour plot:

```r
contour(x1, x2, y, las=1, nlevels=25)
points(-1.54226, 2.15641, col=2, pch=3)
```

A filled contour plot (a heatmap):

```r
filled.contour(x1, x2, y, las=1)
```

Alternatively:

```r
image(x1, x2, y, las=1)
contour(x1, x2, y, add=TRUE)
```

A perspective plot:

```r
par(mfrow=c(1, 2))  # two plots side by side
persp(x1, x2, y, las=1, phi=30, theta=-5, shade=2, border=NA)
persp(x1, x2, y, las=1, phi=30, theta=75, shade=2, border=NA)
```

As usual, depicting functions that are defined over high-dimensional (3D and higher) domains is… difficult. Usually 1D or 2D projections can give us some neat intuitions though.
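For instance, one illustrative 1D projection of our 2D example: fix $$x_2$$ at the global minimiser’s second coordinate and inspect $$g$$ along $$x_1$$ only (`g` repeated for self-containedness).

```r
g <- function(x1, x2)  # as defined earlier
    log((x1^2+x2-5)^2 + (x1+x2^2-3)^2 + x1^2 - 1.60644366086443841)
x1 <- seq(-5, 5, length.out=201)
plot(x1, g(x1, 2.15641), type="l", las=1, ylab="g(x1, 2.15641)")
abline(v=-1.54226, col=2, lty=2)  # x1-coordinate of the global minimum
```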