6 Continuous Optimisation with Iterative Algorithms (*)
These lecture notes are distributed in the hope that they will be useful. Any bug reports are appreciated.
6.1 Introduction
6.1.1 Optimisation Problems
Mathematical optimisation (a.k.a. mathematical programming) deals with the study of algorithms to solve problems related to selecting the best element amongst the set of available alternatives.
Most frequently “best” is expressed in terms of an error or goodness of fit measure: \[ f:\mathbb{D}\to\mathbb{R} \] called an objective function, where \(\mathbb{D}\) is the search space (problem domain, feasible set).
An optimisation task deals with finding an element \(\mathbf{x}\in \mathbb{D}\) amongst the set of possible candidate solutions, that minimises or maximises \(f\):
\[ \min_{\mathbf{x}\in \mathbb{D}} f(\mathbf{x}) \quad\text{or}\quad\max_{\mathbf{x}\in \mathbb{D}} f(\mathbf{x}), \]
In this chapter we will deal with unconstrained continuous optimisation, i.e., we will assume the search space is \(\mathbb{D}=\mathbb{R}^p\) for some \(p\) – we’ll be optimising over \(p\) realvalued parameters.
6.1.2 Example Optimisation Problems in Machine Learning
In multiple linear regression we were minimising the sum of squared residuals \[ \min_{\boldsymbol\beta\in\mathbb{R}^p} \sum_{i=1}^n \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p}  y_i \right)^2. \]
In binary logistic regression we were minimising the crossentropy: \[ \min_{\boldsymbol\beta\in\mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \left(\begin{array}{r} y_i \log \left(\frac{1}{1+e^{(\beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p}}} \right)\\ + (1y_i)\log \left(\frac{e^{(\beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p}}}{1+e^{(\beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p})}}\right) \end{array}\right). \]
6.1.3 Types of Minima and Maxima
Note that minimising \(f\) is the same as maximising \(\bar{f}=f\).
In other words, \(\min_{\mathbf{x}\in \mathbb{D}} f(\mathbf{x})\) and \(\max_{\mathbf{x}\in \mathbb{D}} f(\mathbf{x})\) represent the same optimisation problems (and hence have identical solutions).
 Definition.

A minimum of \(f\) is a point \(\mathbf{x}^*\) such that \(f(\mathbf{x}^*)\le f(\mathbf{x})\) for all \(\mathbf{x}\in \mathbb{D}\). On the other hand, a maximum of \(f\) is a point \(\mathbf{x}^*\) such that \(f(\mathbf{x}^*)\ge f(\mathbf{x})\) for all \(\mathbf{x}\in \mathbb{D}\).
Assuming that \(\mathbb{D}=\mathbb{R}\), Figure 6.1 shows an example objective function, \(f:\mathbb{D}\to\mathbb{R}\), that has a minimum at \({x}^*=1\) with \(f(x^*)=2\).
 Remark.

We can denote these two facts as follows:
 \((\min_{x\in \mathbb{R}} f(x))=2\) (value of \(f\) at the minimum is \(2\)),
 \((\mathrm{arg}\min_{x\in \mathbb{R}} f(x))=1\) (location of the minimum, i.e., argument minimum, is \(1\)).
By definition, a minimum/maximum might not necessarily be unique. This depends on a problem.
Assuming that \(\mathbb{D}=\mathbb{R}\), Figure 6.2 gives an example objective function, \(f:\mathbb{D}\to\mathbb{R}\), that has multiple minima; every \({x}^*\in[1\sqrt{2},1+\sqrt{2}]\) yields \(f(x^*)=0\).
 Remark.

If this was the case of some machine learning problem, it would mean that we could have many equally wellperforming models, and hence many equivalent explanations of the same phenomenon.
Moreover, it may happen that a function has multiple local minima, compare Figure 6.3.
 Definition.

We say that \(f\) has a local minimum at \(\mathbf{x}^+\in \mathbb{D}\), if for some neighbourhood \(B(\mathbf{x}^+)\) of \(\mathbf{x}^+\) it holds \(f(\mathbf{x}^+) \le f(\mathbf{x})\) for each \(\mathbf{x}\in B(\mathbf{x}^+)\).

If \(\mathbb{D}=\mathbb{R}\), by neighbourhood \(B(x)\) of \(x\) we mean an open interval centred at \(x\) of width \(2r\) for some small \(r>0\), i.e., \((xr, x+r)\)
 Definition.

(*) If \(\mathbb{D}=\mathbb{R}^p\) (for any \(p\ge 1\)), by neighbourhood \(B(\mathbf{x})\) of \(\mathbf{x}\) we mean an open ball centred at \(\mathbf{x}^+\) of some small radius \(r>0\), i.e., \(\{\mathbf{y}: \\mathbf{x}\mathbf{y}\<r\}\) (read: the set of all the points with Euclidean distances to \(\mathbf{x}\) less than \(r\)).
To avoid ambiguity, the “true” minimum (a point \(\mathbf{x}^*\) such that \(f(\mathbf{x}^*)\le f({x})\) for all \(\mathbf{x}\in \mathbb{D}\)) is sometimes also referred to as a global minimum.
 Remark.

Of course, the global minimum is also a function’s local minimum.
The existence of local minima is problematic as most of the optimisation methods might get stuck there and fail to return the global one.
Moreover, we cannot often be sure if the result returned by an algorithm is indeed a global minimum. Maybe there exists a better solution that hasn’t been considered yet? Or maybe the function is very noisy (see Figure 6.4)?
6.1.4 Example Objective over a 2D Domain
Of course, our objective function does not necessarily have to be defined over a onedimensional domain.
For example, consider the following function: \[ g(x_1,x_2)=\log\left((x_1^{2}+x_25)^{2}+(x_1+x_2^{2}3)^{2}+x_1^21.60644\dots\right) \]
< function(x1, x2)
g log((x1^2+x25)^2+(x1+x2^23)^2+x1^21.60644366086443841)
< seq(5, 5, length.out=100)
x1 < seq(5, 5, length.out=100)
x2 # outer() expands two vectors to form a 2D grid
# and applies a given function on each point
< outer(x1, x2, g) y
There are four local minima:
x1  x2  f(x1,x2) 

2.2780  0.61343  1.3564 
2.6123  2.34546  1.7051 
1.7988  1.19879  0.6955 
1.5423  2.15641  0.0000 
The global minimum is at \(\mathbf{x}^*=(x_1^*, x_2^*)\) as below:
g(1.542255693195422641930153, 2.156405289793087261832605)
## [1] 0
Let’s explore various ways of depicting \(f\) first. A contour plot and a heat map are given in Figure 6.5.
par(mfrow=c(1,2)) # 2 in 1
# lefthand plot:
contour(x1, x2, y, nlevels=25)
points(1.54226, 2.15641, col=2, pch=3)
# righthand plot:
image(x1, x2, y)
contour(x1, x2, y, add=TRUE)
Two perspective plots (views from different angles) are given in Figure 6.6.
par(mfrow=c(1,2)) # 2 in 1
persp(x1, x2, y, phi=30, theta=5, shade=2, border=NA)
persp(x1, x2, y, phi=30, theta=75, shade=2, border=NA)
 Remark.

As usual, depicting functions that are defined over highdimensional (3D and higher) domains is… difficult. Usually 1D or 2D projections can give us some neat intuitions though.
6.2 Iterative Methods
6.2.1 Introduction
Many optimisation algorithms are built around the following scheme:
Starting from a random point, perform a walk, in each step deciding where to go based on the idea of where the location of the minimum might be.
 Example.

Imagine we’re to cycle from Deakin University’s Burwood Campus to the CBD not knowing the route and with GPS disabled – we’ll have to ask many people along the way, but we’ll eventually (because most people are good) get to some CBD (say, in Perth).
More formally, we are interested in iterative algorithms that operate in a greedylike manner:
\(\mathbf{x}^{(0)}\) – initial guess (e.g., generated at random)
for \(i=1,...,M\):
 \(\mathbf{x}^{(i)} = \mathbf{x}^{(i1)}+\text{[guessed direction]}\)
 if \(f(\mathbf{x}^{(i)})f(\mathbf{x}^{(i1)}) < \varepsilon\) break
return \(\mathbf{x}^{(i)}\) as result
Note that there are two stopping criteria, based on:
 \(M\) = maximum number of iterations,
 \(\varepsilon\) = tolerance, e.g, \(10^{8}\).
6.2.2 Example in R
R has a builtin function, optim()
, that provides an implementation
of (amongst others) the BFGS method
(proposed by Broyden, Fletcher, Goldfarb and Shanno in 1970).
 Remark.

(*) BFGS uses the assumption that the objective function is smooth – the [guessed direction] is determined by computing the (partial) derivatives (or their finitedifference approximations). However, they might work well even if this is not the case. We’ll be able to derive similar algorithms (called quasiNewton ones) ourselves once we learn about Taylor series approximation by reading a book/taking a course on calculus.
Here, we shall use the BFGS as a blackbox continuous optimisation method, i.e., without going into how it has been defined (in terms of our assumed math skills, it might be too early for this). Despite that, will still be able to point out a few interesting patterns.
optim(par, fn, method="BFGS")
where:
par
– an initial guess (a numeric vector of length \(p\))fn
– an objective function to minimise (takes a vector of length \(p\) on input, returns a single number)
Let us minimise the \(g\) function defined above (the one with the 2D domain):
# g needs to be rewritten to accept a 2ary vector
< function(x12) g(x12[1], x12[2])
g_vectorised # random starting point with coordinates in [5, 5]
< runif(2, 5, 5)) (x12_init
## [1] 2.1242 2.8831
< optim(x12_init, g_vectorised, method="BFGS")) (res
## $par
## [1] 1.5423 2.1564
##
## $value
## [1] 1.4131e12
##
## $counts
## function gradient
## 101 21
##
## $convergence
## [1] 0
##
## $message
## NULL
Note that:
par
gives the location of the local minimum found,value
gives the value of \(g\) atpar
,convergence
of 0 is a successful one (we were able to satisfy the \(f(\mathbf{x}^{(i)})f(\mathbf{x}^{(i1)}) < \varepsilon\) condition).
We can even depict the points that the algorithm is “visiting”, see Figure 6.7.
 Remark.

(*) Technically, the algorithm needs to evaluate a few more points in order to make the decision on where to go next (BFGS approximates the gradient and the Hessian matrix).
< function(x12) {
g_vectorised_plot points(x12[1], x12[2], col=2, pch=3) # draw
g(x12[1], x12[2]) # return value
}contour(x1, x2, y, nlevels=25)
< optim(x12_init, g_vectorised_plot, method="BFGS") res
6.2.3 Convergence to Local Optima
We were lucky, because the local minimum that the algorithm has found coincides with the global minimum.
Let’s see where does the BFGS algorithm converge if seek the minimum of the above \(g\) starting from many randomly chosen points uniformly distributed over the square \([5,5]\times[5,5]\):
< replicate(1000, {
res_value # this will be iterated 100 times
< runif(2, 5, 5)
x12_init < optim(x12_init, g_vectorised, method="BFGS")
res $value # return value from each iteration
res
})table(round(res_value,3))
##
## 0 0.695 1.356 1.705
## 273 352 156 219
Unfortunately, we find the global minimum only in \(\sim 25\%\) cases, compare Figure 6.8.
hist(res_value, col="white", breaks=100, main=NA); box()
Figure 6.9 depicts all the random starting points and where do we converge from them.
6.2.4 Random Restarts
A kind of “remedy” for the above limitation could be provided by repeated local search: in order to robustify an optimisation procedure it is often advised to consider multiple random initial points and pick the best solution amongst the identified local optima.
# N  number of restarts
# par_generator  a function generating initial guesses
# ...  further arguments to optim()
< function(par_generator, ..., N=10) {
optim_with_restarts < list(value=Inf) # cannot be worse than this
res_best for (i in 1:N) {
< optim(par_generator(), ...)
res if (res$value < res_best$value)
< res # a better candidate found
res_best
}
res_best }
optim_with_restarts(function() runif(2, 5, 5),
method="BFGS", N=10) g_vectorised,
## $par
## [1] 1.5423 2.1564
##
## $value
## [1] 3.9702e13
##
## $counts
## function gradient
## 48 17
##
## $convergence
## [1] 0
##
## $message
## NULL
Food for thought: Can we really really guarantee that the global minimum will be found within \(N\) tries?
Click here for a solution.
Absolutely not.
6.3 Gradient Descent
6.3.1 Function Gradient (*)
How to choose the [guessed direction] in our iterative optimisation algorithm?
If we are minimising a smooth function, the simplest possible choice is to use the information included in the objective’s gradient, which provides us with the information about the direction where the function decreases the fastest.
 Definition.

(*) Gradient of \(f:\mathbb{R}^p\to\mathbb{R}\), denoted \(\nabla f:\mathbb{R}^p\to\mathbb{R}^p\), is the vector of all its partial derivatives, (\(\nabla\) – nabla symbol = differential operator) \[ \nabla f(\mathbf{x}) = \left[ \begin{array}{c} \displaystyle\frac{\partial f}{\partial x_1}(\mathbf{x})\\ \vdots\\ \displaystyle\frac{\partial f}{\partial x_p}(\mathbf{x}) \end{array} \right] \] If we have a function \(f(x_1,...,x_p)\), the partial derivative w.r.t. the \(i\)th variable, denoted \(\frac{\partial f}{\partial x_i}\) is like an ordinary derivative w.r.t. \(x_i\) where \(x_1,...,x_{i1},x_{i+1},...,x_p\) are assumed constant.
 Remark.

Function differentiation is an important concept – see how it’s referred to in, e.g., the
keras
package manual at https://keras.rstudio.com/reference/fit.html.
Recall our \(g\) function defined above: \[ g(x_1,x_2)=\log\left((x_1^{2}+x_25)^{2}+(x_1+x_2^{2}3)^{2}+x_1^21.60644\dots\right) \]
It can be shown (*) that: \[ \begin{array}{ll} \frac{\partial g}{\partial x_1}(x_1,x_2)=& \displaystyle\frac{ 4x_1(x_1^{2}+x_25)+2(x_1+x_2^{2}3)+2x_1 }{(x_1^{2}+x_25)^{2}+(x_1+x_2^{2}3)^{2}+x_1^21.60644\dots} \\ \frac{\partial g}{\partial x_2}(x_1,x_2)=& \displaystyle\frac{ 2(x_1^{2}+x_25)+4x_2(x_1+x_2^{2}3) }{(x_1^{2}+x_25)^{2}+(x_1+x_2^{2}3)^{2}+x_1^21.60644\dots} \end{array} \]
< function(x) {
grad_g_vectorised c(
4*x[1]*(x[1]^2+x[2]5)+2*(x[1]+x[2]^23)+2*x[1],
2*(x[1]^2+x[2]5)+4*x[2]*(x[1]+x[2]^23)
/(
)1]^2+x[2]5)^2+(x[1]+x[2]^23)^2+x[1]^21.60644366086443841
(x[
) }
6.3.2 Three Facts on the Gradient
For now, we should emphasise three important facts:
 Fact 1.

If we are incapable of deriving the gradient analytically, we can rely on its finite differences approximation. Each partial derivative can be estimated by means of: \[ \frac{\partial f}{\partial x_i}(x_1,\dots,x_p) \simeq \frac{ f(x_1,...,x_i+\delta,...,x_p)f(x_1,...,x_i,...,x_p) }{ \delta } \] for some small \(\delta>0\), say, \(\delta=10^{6}\).
 Remark.

(*) Actually, a function’s partial derivative, by definition, is the limit of the above as \(\delta\to 0\).
Example implementation:
# gradient of f at x=c(x[1],...,x[p])
< function(f, x, delta=1e6) {
grad_approx < length(x)
p < numeric(p) # vector of length p
gf for (i in 1:p) {
< x
xi < xi[i]+delta
xi[i] < f(xi)
gf[i]
}f(x))/delta
(gf }
 Remark.

(*) Interestingly, some modern vector/matrix algebra frameworks like TensorFlow (upon which
keras
is built) or PyTorch, feature methods to “derive” the gradient algorithmically (autodiff; automatic differentiation).
Sanity check:
grad_approx(g_vectorised, c(2, 2))
## [1] 3.1865 1.3656
grad_g_vectorised(c(2, 2))
## [1] 3.1865 1.3656
grad_approx(g_vectorised, c(1.542255693, 2.15640528979))
## [1] 1.0588e05 1.9817e05
grad_g_vectorised(c(1.542255693, 2.15640528979))
## [1] 4.1292e09 3.5771e10
By the way, there is also the grad()
function in package numDeriv
that might be a little more accurate (uses a different approximation
formula).
 Fact 2.

The gradient of \(f\) at \(\mathbf{x}\), \(\nabla f(\mathbf{x})\), is a vector that points in the direction of the steepest slope. On the other hand, minus gradient, \(\nabla f(\mathbf{x})\), is the direction where the function decreases the fastest.
 Remark.

(*) This can be shown by considering a function’s firstorder Taylor series approximation.
Each gradient is a vector, therefore it can be depicted as an arrow. Figure 6.10 illustrates a few scaled gradients of the \(g\) function at different points – each arrow connects a point \(\mathbf{x}\) to \(\mathbf{x}\pm 0.1\nabla f(\mathbf{x})\).
Note that the blue arrows point more or less in the direction of the local minimum. Therefore, in our iterative algorithm, we may try taking the direction of the minus gradient! How far should we go in that direction? Well, a bit. We will refer to the desired step size as the learning rate, \(\eta\).
This will be called the gradient descent method (GD; Cauchy, 1847).
 Fact 3.

If a function \(f\) has a local minimum at \(\mathbf{x}^*\), then its gradient vanishes there, i.e., \(\nabla {f}(\mathbf{x}^*)=[0,\dots,0]\).
Note that the above condition is a necessary, not sufficient one. For example, the gradient also vanishes at a maximum or at a saddle point. In fact, we have what follows.
 Theorem.

(***) More generally, a twicedifferentiable function has a local minimum at \(\mathbf{x}^*\) if and only if its gradient vanishes there and \(\nabla^2 {f}(\mathbf{x}^*)\) (Hessian matrix = matrix of all secondorder derivatives) is positivedefinite.
6.3.3 Gradient Descent Algorithm (GD)
Taking the above into account, we arrive at the gradient descent algorithm:
\(\mathbf{x}^{(0)}\) – initial guess (e.g., generated at random)
for \(i=1,...,M\):
 \(\mathbf{x}^{(i)} = \mathbf{x}^{(i1)}\eta \nabla f(\mathbf{x}^{(i1)})\)
 if \(f(\mathbf{x}^{(i)})f(\mathbf{x}^{(i1)}) < \varepsilon\) break
return \(\mathbf{x}^{(i)}\) as result
where \(\eta>0\) is a step size frequently referred to as the learning rate, because that’s much more cool. We usually set \(\eta\) of small order of magnitude, say \(0.01\) or \(0.1\).
An implementation of the gradient descent algorithm is straightforward.
In essence, it’s the par < par  eta*grad_g_vectorised(par)
expression
run in a loop, until convergence.
# par  initial guess
# fn  a function to be minimised
# gr  a function to return the gradient of fn
# eta  learning rate
# maxit  maximum number of iterations
# tol  convergence tolerance
< function(par, fn, gr, eta=0.01,
optim_gd maxit=1000, tol=1e8) {
< fn(par)
f_last for (i in 1:maxit) {
< par  eta*grad_g_vectorised(par) # update step
par < fn(par)
f_cur if (abs(f_curf_last) < tol) break
< f_cur
f_last
}list( # see ?optim, section `Value`
par=par,
value=g_vectorised(par),
counts=i,
convergence=as.integer(i==maxit)
) }
Tests of the \(g\) function. First, let’s try \(\eta=0.01\). Figure 6.11 zooms in the contour plot so that we can see the actual path the algorithm has taken.
< 0.01
eta < optim_gd(c(3, 1), g_vectorised, grad_g_vectorised, eta=eta)
res str(res)
## List of 4
## $ par : num [1:2] 1.54 2.16
## $ value : num 1.33e08
## $ counts : int 135
## $ convergence: int 0
Now let’s try \(\eta=0.05\).
< 0.05
eta < optim_gd(c(3, 1), g_vectorised, grad_g_vectorised, eta=eta)
res str(res)
## List of 4
## $ par : num [1:2] 1.54 2.15
## $ value : num 0.000203
## $ counts : int 417
## $ convergence: int 0
With an increased step size, the algorithm needed many more iterations (3 times as many), see Figure 6.12 for the path taken.
And now for something completely different: \(\eta=0.1\), see Figure 6.13.
< 0.1
eta < optim_gd(c(3, 1), g_vectorised, grad_g_vectorised, eta=eta)
res str(res)
## List of 4
## $ par : num [1:2] 1.52 2.33
## $ value : num 0.507
## $ counts : int 1000
## $ convergence: int 1
The algorithm failed to converge.
If the learning rate \(\eta\) is too small, the convergence might be too slow or we might get stuck at a plateau. On the other hand, if \(\eta\) is too large, we might be overshooting and end up bouncing around the minimum.
This is why many optimisation libraries (including keras
/TensorFlow)
implement some
of the following ideas:
learning rate decay – start with large \(\eta\), decreasing it in every iteration, say, by some percent;
line search – determine optimal \(\eta\) in every step by solving a 1dimensional optimisation problem w.r.t. \(\eta\in[0,\eta_{\max}]\);
momentum – the update step is based on a combination of the gradient direction and the previous change of the parameters, \(\Delta\mathbf{x}\); can be used to accelerate search in the relevant direction and minimise oscillations.
Try implementing at least the first of the above
heuristics yourself. You can set eta < eta*0.95
in every iteration
of the gradient descent procedure.
6.3.4 Example: MNIST (*)
In the previous chapter we’ve studied the MNIST dataset. Let us go back to the task of fitting a multiclass logistic regression model.
library("keras")
< dataset_mnist() mnist
## Loaded Tensorflow version 2.9.1
# get train/test images in greyscale
< mnist$train$x/255 # to [0,1]
X_train < mnist$test$x/255 # to [0,1]
X_test
# get the corresponding labels in {0,1,...,9}:
< mnist$train$y
Y_train < mnist$test$y Y_test
The labels need to be onehot encoded:
< function(Y) {
one_hot_encode stopifnot(is.numeric(Y))
< min(Y) # first class label
c1 < max(Y) # last class label
cK < cKc1+1 # number of classes
K < matrix(0, nrow=length(Y), ncol=K)
Y2 cbind(1:length(Y), Yc1+1)] < 1
Y2[
Y2
}
< one_hot_encode(Y_train)
Y_train2 < one_hot_encode(Y_test) Y_test2
Our task is to find the parameters \(\mathbf{B}\) that minimise cross entropy \(E(\mathbf{B})\) over the training set:
\[ \min_{\mathbf{B}\in\mathbb{R}^{785\times 10}} \frac{1}{n^\text{train}} \sum_{i=1}^{n^\text{train}} \log \Pr(Y=y_i^\text{train}\mathbf{x}_{i,\cdot}^\text{train},\mathbf{B}). \]
In the previous chapter, we’ve relied on the methods implemented
in the keras
package. Let’s do that all by ourselves now.
In order to come up with a working version of the gradient
descent procedure for classifying of MNIST digits,
we will need to derive and implement grad_cross_entropy()
.
We do that below using matrix notation.
 Remark.

In the first reading, you can jump to the Safe landing zone below with no much loss in what we’re trying to convey here (you will then treat
grad_cross_entropy()
as a blackbox function). Nevertheless, keep in mind that this is the kind of maths you will need to master anyway sooner than later – this is inevitable. Perhaps you should go back to, e.g., the appendix on Matrix Computations with R or the chapter on Linear Regression? Learning maths is not a linear, stepbystep process. Everyone is different and will have a different path to success. The material needs to be frequently revisited, it will “click” someday, don’t you worry; good stuff isn’t built in a day or seven.
Recall that the output of the logistic regression model (1layer neural network with softmax) can be written in the matrix form as: \[ \hat{\mathbf{Y}}=\mathrm{softmax}\left( \mathbf{\dot{X}}\,\mathbf{B} \right), \] where \(\mathbf{\dot{X}}\in\mathbb{R}^{n\times 785}\) is a matrix representing \(n\) images of size \(28\times 28\), augmented with a column of \(1\)s, and \(\mathbf{B}\in\mathbb{R}^{785\times 10}\) is the coefficients matrix and \(\mathrm{softmax}\) is applied on each matrix row separately.
Of course, by the definition of matrix multiplication, \(\hat{\mathbf{Y}}\) will be a matrix of size \(n\times 10\), where \(\hat{y}_{i,k}\) represents the predicted probability that the \(i\)th image depicts the \(k\)th digit.
# convert to matrices of size n*784
# and add a column of 1s
< cbind(1.0, matrix(X_train, ncol=28*28))
X_train1 < cbind(1.0, matrix(X_test, ncol=28*28)) X_test1
The nn_predict()
function implements the above formula
for \(\hat{\mathbf{Y}}\):
< function(T) {
softmax < exp(T)
T /rowSums(T)
T
}
< function(B, X) {
nn_predict softmax(X %*% B)
}
Let’s define the functions to compute the crossentropy (which we shall minimise) and accuracy (which we shall report to a user):
< function(Y_true, Y_pred) {
cross_entropy sum(Y_true*log(Y_pred))/nrow(Y_true)
}
< function(Y_true, Y_pred) {
accuracy # both arguments are onehot encoded
< apply(Y_true, 1, which.max)
Y_true_decoded < apply(Y_pred, 1, which.max)
Y_pred_decoded # proportion of equal corresponding pairs:
mean(Y_true_decoded == Y_pred_decoded)
}
It may be shown (**) that the gradient of crossentropy (with respect to the parameter matrix \(\mathbf{B}\)) can be expressed in the matrix form as:
\[ \frac{1}{n} \mathbf{\dot{X}}^T\, (\mathbf{\hat{Y}}\mathbf{Y}) \]
< function(X, Y_true, Y_pred) {
grad_cross_entropy t(X) %*% (Y_predY_true)/nrow(Y_true)
}
Of course, we could always substitute the gradient with the finite difference approximation. Yet, this would be much slower).
The more mathematically inclined reader will sure notice that by expanding the formulas given in the previous chapter, we can write crossentropy in the nonmatrix form (\(n\) – number of samples, \(K\) – number of classes, \(p+1\) – number of model parameters; in our case \(K=10\) and \(p=784\)) as:
\[ \begin{array}{rcl} E(\mathbf{B}) &=& \displaystyle\frac{1}{n} \displaystyle\sum_{i=1}^n \displaystyle\sum_{k=1}^K y_{i,k} \log\left( \frac{ \exp\left( \displaystyle\sum_{j=0}^p \dot{x}_{i,j} \beta_{j,k} \right) }{ \displaystyle\sum_{c=1}^K \exp\left( \displaystyle\sum_{j=0}^p \dot{x}_{i,j} \beta_{j,c} \right) } \right)\\ &=& \displaystyle\frac{1}{n} \displaystyle\sum_{i=1}^n \left( \log \left(\displaystyle\sum_{k=1}^K \exp\left( \displaystyle\sum_{j=0}^p \dot{x}_{i,j} \beta_{j,k} \right)\right)  \displaystyle\sum_{k=1}^K y_{i,k} \displaystyle\sum_{j=0}^p \dot{x}_{i,j} \beta_{j,k} \right). \end{array} \]
Partial derivatives of crossentropy w.r.t. \(\beta_{a,b}\) in nonmatrix form can be derived (**) so as to get:
\[ \begin{array}{rcl} \displaystyle\frac{\partial E}{\partial \beta_{a,b}}(\mathbf{B}) &=& \displaystyle\frac{1}{n} \displaystyle\sum_{i=1}^n \dot{x}_{i,a} \left( \frac{ \exp\left( \displaystyle\sum_{j=0}^p \dot{x}_{i,j} \beta_{j,b} \right) }{ \displaystyle\sum_{k=1}^K \exp\left( \displaystyle\sum_{j=0}^p \dot{x}_{i,j} \beta_{j,k} \right) }  y_{i,b} \right)\\ &=& \displaystyle\frac{1}{n} \displaystyle\sum_{i=1}^n \dot{x}_{i,a} \left( \hat{y}_{i,b}  y_{i,b} \right). \end{array} \]
 Safe landing zone.

In case you’re lost with the above, continue from here. However, in the near future, harden up and revisit the skipped material to get the most out of our discussion.
We now have all the building blocks to
implement the gradient descent method.
The algorithm below follows exactly the same scheme as the one in the
\(g\) function example. This time, however, we play with a parameter matrix
\(\mathbf{B}\) (not a parameter vector \([x_1, x_2]\)) and we compute
the gradient of crossentropy (by means of grad_cross_entropy()
),
not the gradient of \(g\).
Note that a call to system.time(expr)
measures the time (in seconds)
spent evaluating an expression expr
.
# random matrix of size 785x10  initial guess
< matrix(rnorm(ncol(X_train1)*ncol(Y_train2)),
B nrow=ncol(X_train1))
< 0.1 # learning rate
eta < 100 # number of GD iterations
maxit system.time({ # measure time spent
# for simplicity, we stop only when we reach maxit
for (i in 1:maxit) {
< B  eta*grad_cross_entropy(
B nn_predict(B, X_train1))
X_train1, Y_train2,
}# `user`  processing time in seconds: })
## user system elapsed
## 90.573 39.037 32.149
Unfortunately, the method’s convergence is really slow (we are optimising over \(7850\) parameters…) and the results after 100 iterations are disappointing:
accuracy(Y_train2, nn_predict(B, X_train1))
## [1] 0.46462
accuracy(Y_test2, nn_predict(B, X_test1))
## [1] 0.4735
Recall that in the previous chapter
we obtained much better classification accuracy by using the keras
package.
What are we doing wrong then? Maybe keras
implements some SuperFancy
Hyper Optimisation Framework (TM) (R) that we could get access to for
only $19.99 per month?
6.3.5 Stochastic Gradient Descent (SGD) (*)
In turns out that there’s a very simple cure for the slow convergence of our method.
It might be shocking for some, but sometimes the true global minimum of crossentropy for the whole training set is not exactly what we really want. In our predictive modelling task, we are minimising train error, but what we actually desire is to minimise the test error (which we cannot refer to while training = no cheating!).
It is therefore rational to assume that both the train and the test set consist of random digits independently sampled from the set of “all the possible digits out there in the world”.
Looking at the original objective (crossentropy):
\[ E(\mathbf{B}) = \frac{1}{n^\text{train}} \sum_{i=1}^{n^\text{train}} \log \Pr(Y=y_i^\text{train}\mathbf{x}_{i,\cdot}^\text{train},\mathbf{B}). \]
How about we try fitting to different random samples of the train set in each iteration of the gradient descent method instead of fitting to the whole train set?
\[ E(\mathbf{B}) \simeq \frac{1}{b} \sum_{i=1}^b \log \Pr(Y=y_{\text{random\_index}_i}^\text{train}\mathbf{x}_{\text{random\_index}_i,\cdot}^\text{train},\mathbf{B}), \]
where \(b\) is some fixed batch size. Such an approach is often called stochastic gradient descent.
 Remark.

This scheme is sometimes referred to as minibatch gradient descent in the literature; some researchers reserve the term “stochastic” only for batches of size 1.
Stochastic gradient descent can be implemented very easily:
< matrix(rnorm(ncol(X_train1)*ncol(Y_train2)),
B nrow=ncol(X_train1))
< 0.1
eta < 100
maxit < 32
batch_size system.time({
for (i in 1:maxit) {
< sample(nrow(X_train1), size=batch_size)
wh < B  eta*grad_cross_entropy(
B
X_train1[wh,], Y_train2[wh,],nn_predict(B, X_train1[wh,])
)
} })
## user system elapsed
## 0.078 0.005 0.082
accuracy(Y_train2, nn_predict(B, X_train1))
## [1] 0.46198
accuracy(Y_test2, nn_predict(B, X_test1))
## [1] 0.4693
The errors are much worse but at least we got the (useless) solution very quickly. That’s the “fail fast” rule in practice.
However, why don’t we increase the number of iterations and see what happens? We’ve allowed the classic gradient descent to scrabble around the MNIST dataset for almost 2 minutes.
< matrix(rnorm(ncol(X_train1)*ncol(Y_train2)),
B nrow=ncol(X_train1))
< 0.1
eta < 10000
maxit < 32
batch_size system.time({
for (i in 1:maxit) {
< sample(nrow(X_train1), size=batch_size)
wh < B  eta*grad_cross_entropy(
B
X_train1[wh,], Y_train2[wh,],nn_predict(B, X_train1[wh,])
)
} })
## user system elapsed
## 7.312 0.140 7.453
accuracy(Y_train2, nn_predict(B, X_train1))
## [1] 0.89222
accuracy(Y_test2, nn_predict(B, X_test1))
## [1] 0.8935
Bingo! Let’s take a closer look at how the train/test error
behaves in each iteration for different batch sizes.
Figures 6.14 and 6.15 depict
the cases of batch_size
of 32 and 128, respectively.
The time needed to go through 10000 iterations with batch size of 32 is:
## user system elapsed
## 82.636 25.538 32.599
What’s more, batch size of 128 takes:
## user system elapsed
## 198.066 93.369 56.396
Draw conclusions.
6.4 A Note on Convex Optimisation (*)
Are there any cases where we are sure that a local minimum is the global minimum? It turns out that the answer to this is positive; for example, when we minimise objective functions that fulfil a special property defined below.
First let’s note that given two points \(\mathbf{x}_{1},\mathbf{x}_{2}\in \mathbb{R}^p\), by taking any \(\theta\in[0,1]\), the point defined as:
\[ \mathbf{t} = \theta \mathbf{x}_{1}+(1\theta)\mathbf{x}_{2} \]
lies on a (straight) line segment between \(\mathbf{x}_1\) and \(\mathbf{x}_2\).
 Definition.

We say that a function \(f:\mathbb{R}^p\to\mathbb{R}\) is convex, whenever: \[ (\forall \mathbf{x}_{1},\mathbf{x}_{2}\in \mathbb{R}^p) (\forall \theta\in [0,1])\quad f(\theta \mathbf{x}_{1}+(1\theta)\mathbf{x}_{2})\leq \theta f(\mathbf{x}_{1})+(1\theta)f(\mathbf{x}_{2}) \]
In other words, the function’s value at any convex combination of two points is not greater than that combination of the function values at these two points. See Figure 6.16 for a graphical illustration of the above.
The following result addresses the question we posed at the beginning of this section.
 Theorem.

For any convex function \(f\), if \(f\) has a local minimum at \(\mathbf{x}^+\) then \(\mathbf{x}^+\) is also its global minimum.
Convex functions are ubiquitous in machine learning, but of course not every objective function we are going to deal with will fulfil this property. Here are some basic examples of convex functions and how they come into being, see, e.g., (Boyd & Vandenberghe 2004) for more:
 the functions mapping \(x\) to \(x, x^2, x, e^x\) are all convex,
 \(f(x)=x^p\) is convex for all \(p\ge 1\),
 if \(f\) is convex, then \(f\) is concave,
 if \(f_1\) and \(f_2\) are convex, then \(w_1 f_1 + w_2 f_2\) are convex for any \(w_1,w_2\ge 0\),
 if \(f_1\) and \(f_2\) are convex, then \(\max\{f_1, f_2\}\) is convex,
 if \(f\) and \(g\) are convex and \(g\) is nondecreasing, then \(g(f(x))\) is convex.
The above feature the building blocks of our error measures in supervised learning problems! In particular, sum of squared residuals in linear regression is a convex function of the underlying parameters. Also, crossentropy in logistic regression is a convex function of the underlying parameters.
 Theorem.

(***) If a function is twice differentiable, then its convexity can be judged based on the positivedefiniteness of its Hessian matrix.
Note that optimising convex functions is relatively easy, especially if they are differentiable. This is because they are quite wellbehaving. However, it doesn’t mean that we an analytic solution to the problem of their minimisation. Methods such as gradient descent or BFGS should work well (unless there are vast regions where a function is constant or the function’s is defined over a large number of parameters).
 Remark.

(**) There is a special class of constrained optimisation problems called linear and, more generally, quadratic programming that involves convex functions. Moreover, the Karush–Kuhn–Tucker (KKT) conditions address the more general problem of minimisation with constraints (i.e., not over the whole \(\mathbb{R}^p\) set); see (Nocedal & Wright 2006, Fletcher 2008) for more details.
 Remark.

Not only functions, but also sets can be said to be convex. We say that \(C\subseteq \mathbb{R}^p\) is a convex set, whenever the line segment joining any two points in \(C\) is fully included in \(C\). More formally, for every \(\mathbf{x}_1\in C\) and \(\mathbf{x}_2\in C\), it holds \(\theta \mathbf{x}_{1}+(1\theta)\mathbf{x}_{2} \in C\) for all \(\theta\in [0,1]\); see Figure 6.17 for an illustration.
6.5 Outro
6.5.1 Remarks
Solving continuous problems with many variables (e.g., deep neural networks) is time consuming – the more variables to optimise over (e.g., model parameters, think the number of interconnections between all the neurons), the slower the optimisation process.
Moreover, it might be the case that the sole objective function takes long to compute (think of image classification with large training samples).
 Remark.

(*) Although theoretically possible, good luck fitting a logistic regression model to MNIST with
optim()
’s BFGS – there are 7850 variables!
Training deep neural networks with SGD is even slower (more parameters),
but there is a trick to propagate weight updates layer by layer,
called backpropagation (actually used in every neural network library),
see, e.g., (Sarle et al. 2002) and (Goodfellow et al. 2016).
Moreover, keras
and similar libraries implement automatic differentiation
procedures that make its user’s life much easier (swiping some of the tedious
math under the comfy carpet).
keras
implements various optimisers that
we can refer to in the compile()
function,
see
https://keras.rstudio.com/reference/compile.html
and
https://keras.io/optimizers/:
SGD
– stochastic gradient descent supporting momentum and learning rate decay,RMSprop
– divides the gradient by a running average of its recent magnitude,Adam
– adaptive momentum,
and so on. These are all noncomplicated variations of the pure stochastic GD. Some of them are just tricks that work well in some examples and destroy the convergence on many other ones. You can get into their details in a dedicated book/course aimed at covering neural networks (see, e.g., (Sarle et al. 2002), (Goodfellow et al. 2016)), but we have already developed some good intuitions here.
Keep in mind that with methods such as GD or SGD, there is no guarantee we reach a minimum, but an approximate solution is better than no solution at all. Also sometimes (especially in ML applications) we don’t really need the actual minimum (e.g., when optimising the error with respect to the train set). Those “mathematically pure” will find that a bit… unaesthetic, but here we are. Maybe the solution makes your boss or client happy, maybe it generates revenue. Maybe it helps solve some other problem. Some claim that a solution is better than no solution at all, remember? But… is it really always the case though?
6.5.2 Further Reading
Recommended further reading: (Nocedal & Wright 2006), (Boyd & Vandenberghe 2004), (Fletcher 2008).