## 6.2 Iterative Methods

### 6.2.1 Introduction

Many optimisation algorithms are built around the following scheme:

starting from some initial point, perform a walk: in each step, decide where to go next based on a guess of where the minimum seems to be located.

Example: cycling from Deakin University’s Burwood Campus to the CBD without knowing the route and with GPS disabled – you will have to ask many people along the way but, because most people are good, you will eventually get to some CBD (say, in Perth).

More formally, we are interested in iterative algorithms that operate in a greedy-like fashion:

1. $$\mathbf{x}^{(0)}$$ – initial guess (e.g., generated at random)

2. for $$i=1,\dots,M$$:

    1. $$\mathbf{x}^{(i)} = \mathbf{x}^{(i-1)}+\text{[guessed direction]}$$
    2. if $$|f(\mathbf{x}^{(i)})-f(\mathbf{x}^{(i-1)})| < \varepsilon$$, break

3. return $$\mathbf{x}^{(i)}$$ as the result

Note that there are two stopping criteria, based on:

• $$M$$ = maximum number of iterations
• $$\varepsilon$$ = tolerance, e.g., $$10^{-8}$$
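The generic scheme above can be sketched in R. Here the [guessed direction] is minus a finite-difference approximation of the gradient (a simple gradient-descent rule); the function name `descend` and the step size `eta` and difference step `h` are illustrative choices, not something defined in this chapter:

```r
# a minimal sketch of the iterative scheme: start at x, repeatedly step
# in a guessed direction, stop after M iterations or when the change
# in the objective value drops below eps
descend <- function(f, x, M=1000, eps=1e-8, eta=0.1, h=1e-6) {
    for (i in 1:M) {
        # approximate the partial derivatives of f at x (forward differences)
        grad <- sapply(seq_along(x), function(j) {
            xh <- x
            xh[j] <- xh[j] + h
            (f(xh) - f(x)) / h
        })
        x_new <- x - eta*grad             # step in the guessed direction
        if (abs(f(x_new) - f(x)) < eps)   # tolerance-based stopping rule
            return(x_new)
        x <- x_new
    }
    x  # iteration limit M reached
}

descend(function(x) sum((x - 1)^2), c(-2, 3))  # true minimum is at (1, 1)
```

For this smooth, convex test function the walk homes in on $$(1, 1)$$ after a few dozen steps; for the multimodal functions discussed below, where it ends up depends on where it starts.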

### 6.2.2 Example in R

R has a built-in optim() function that provides an implementation of (amongst others) the BFGS method (proposed by Broyden, Fletcher, Goldfarb and Shanno in 1970).

(*) BFGS uses the assumption that the objective function is smooth – the [guessed direction] is determined by computing the (partial) derivatives (or their finite-difference approximations). However, it might work well even if this is not the case. You will be able to derive similar algorithms (called quasi-Newton ones) yourself once you learn about Taylor series approximations, e.g., by taking a course on calculus.

Here, we shall use the BFGS as a black-box continuous optimisation method, i.e., without going into how it has been defined (it’s too early for this). However, this will still enable us to identify a few interesting behavioural patterns.

optim(par, fn, method="BFGS")

where:

• par – an initial guess (a numeric vector of length $$p$$)
• fn – an objective function to minimise (takes a vector of length $$p$$ on input, returns a single number)

Let us minimise the $$g$$ function defined above (the one with the 2D domain):

# g needs to be rewritten to accept a single vector of length 2
g_vectorised <- function(x12) g(x12[1], x12[2])
# random starting point with coordinates in [-5, 5]
(x12_init <- runif(2, -5, 5))
## [1] -2.124225  2.883051
res <- optim(x12_init, g_vectorised, method="BFGS")

res
## $par
## [1] -1.542255  2.156405
##
## $value
## [1] 1.413092e-12
##
## $counts
## function gradient
##      101       21
##
## $convergence
## [1] 0
##
## $message
## NULL

par gives the location of the local minimum found.

value gives the value of $$g$$ at par.

We can even depict the points that the algorithm is “visiting”:

(*) Technically, the algorithm needs to evaluate a few more points in order to make the decision on where to go next (BFGS approximates the Hessian matrix).

g_vectorised_plot <- function(x12) {
    points(x12[1], x12[2], col=2, pch=3)  # draw
    g(x12[1], x12[2])  # return value
}
contour(x1, x2, y, las=1, nlevels=25)
res <- optim(x12_init, g_vectorised_plot, method="BFGS")

### 6.2.3 Convergence to Local Optima

We were lucky, because the local minimum that the algorithm has found coincides with the global minimum. Let’s see where the algorithm converges if we start it from many randomly chosen points uniformly distributed over the square $$[-5,5]\times[-5,5]$$:

res_value <- replicate(1000, {  # this will be iterated 1000 times
    x12_init <- runif(2, -5, 5)
    res <- optim(x12_init, g_vectorised, method="BFGS")
    res$value  # the value to return from each iteration
})
table(round(res_value, 3))
##
##     0 0.695 1.356 1.705
##   273   352   156   219

We find the global minimum only in $$\sim 27\%$$ of cases! :(

hist(res_value, las=1, breaks=100)
box()

Here is a depiction of all the random starting points and where we converge from them:

### 6.2.4 Random Restarts

A “remedy”: repeated local search

In order to robustify an optimisation procedure it is often advised to consider multiple random initial points and pick the best solution amongst the identified local optima.

# N             - number of restarts
# par_generator - a function generating initial guesses
# ...           - further arguments to optim()
optim_with_restarts <- function(par_generator, ..., N=10) {
    res_best <- list(value=Inf)  # cannot be worse than this
    for (i in 1:N) {
        res <- optim(par_generator(), ...)
        if (res$value < res_best$value)
            res_best <- res  # a better candidate found
    }
    res_best
}

optim_with_restarts(function() runif(2, -5, 5),
    g_vectorised, method="BFGS", N=10)
## $par
## [1] -1.542256  2.156405
##
## $value
## [1] 3.970158e-13
##
## $counts
## function gradient
##       48       17
##
## $convergence
## [1] 0
##
## $message
## NULL

Can we guarantee that the global minimum will be found within $$N$$ tries? No – all $$N$$ starting points may happen to fall into the basins of attraction of non-global local minima. Increasing $$N$$ only increases the probability of success.
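To see why restarts help only probabilistically: suppose a single random start finds the global minimum with some probability $$p$$ – roughly $$0.27$$ if we take the earlier simulation as an estimate (an assumption, not an exact value). The restarts are independent, so at least one of $$N$$ of them succeeds with probability $$1-(1-p)^N$$:

```r
# probability that at least one of N independent restarts succeeds,
# assuming each restart succeeds with probability p = 0.27
# (an estimate borrowed from the simulation above)
p <- 0.27
N <- c(1, 2, 5, 10, 20)
round(1 - (1 - p)^N, 3)
## [1] 0.270 0.467 0.793 0.957 0.998
```

The probability approaches 1 as $$N$$ grows, but it never reaches it.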