6.2 Iterative Methods

6.2.1 Introduction


Many optimisation algorithms are built around the following scheme:

starting from some initial point, perform a walk: at each step, decide where to go next based on a guess of where the minimum might be located.

Example: cycling from Deakin University’s Burwood Campus to the CBD without knowing the route and with GPS disabled – you’ll have to ask many people along the way, but (because most people are good) you’ll eventually get to some CBD (say, in Perth).


More formally, we are interested in iterative algorithms that operate in a greedy-like fashion:

  1. \(\mathbf{x}^{(0)}\) – initial guess (e.g., generated at random)

  2. for \(i=1,...,M\):
    1. \(\mathbf{x}^{(i)} = \mathbf{x}^{(i-1)}+\text{[guessed direction]}\)
    2. if \(|f(\mathbf{x}^{(i)})-f(\mathbf{x}^{(i-1)})| < \varepsilon\) break
  3. return \(\mathbf{x}^{(i)}\) as result

 

Note that there are two stopping criteria, based on:

  • \(M\) = maximum number of iterations
  • \(\varepsilon\) = tolerance, e.g., \(10^{-8}\)
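The scheme above can be sketched in R. Here the [guessed direction] is taken to be the negative gradient scaled by a small step size (plain gradient descent) – just one of many possible choices, used for illustration only:

```r
# a sketch of the generic iterative scheme; the "[guessed direction]" below is
# the negative gradient times a small step size (gradient descent) -- only one
# possible choice, assumed here for illustration
iterative_minimise <- function(f, grad_f, x0, M = 1000, eps = 1e-8, step = 0.1) {
    x <- x0
    for (i in 1:M) {
        x_new <- x - step * grad_f(x)         # step in the guessed direction
        if (abs(f(x_new) - f(x)) < eps) {     # tolerance-based stopping criterion
            x <- x_new
            break
        }
        x <- x_new
    }
    x  # best guess after at most M iterations
}

# example: f(x) = (x1-1)^2 + (x2+2)^2, whose minimum is at (1, -2)
f <- function(x) sum((x - c(1, -2))^2)
grad_f <- function(x) 2 * (x - c(1, -2))
iterative_minimise(f, grad_f, c(0, 0))  # close to (1, -2)
```

Note that both stopping criteria from the pseudocode appear: the loop runs for at most \(M\) iterations, and breaks early once successive objective values differ by less than \(\varepsilon\).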

6.2.2 Example in R


R has a built-in optim() function that provides an implementation of (amongst others) the BFGS method (proposed by Broyden, Fletcher, Goldfarb and Shanno in 1970).

(*) BFGS assumes that the objective function is smooth – the [guessed direction] is determined by computing the (partial) derivatives (or their finite-difference approximations). Still, it might work well even if this is not the case. You will be able to derive similar (so-called quasi-Newton) algorithms yourself once you learn about Taylor series approximations in a calculus course.


Here, we shall use the BFGS as a black-box continuous optimisation method, i.e., without going into how it has been defined (it’s too early for this). However, this will still enable us to identify a few interesting behavioural patterns.

The basic call is of the form optim(par, fn, method="BFGS"), where:

  • par – an initial guess (a numeric vector of length \(p\))
  • fn – an objective function to minimise (takes a vector of length \(p\) on input, returns a single number)
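As a sketch of the interface (using a simple convex stand-in objective, not the \(g\) function studied next), a call might look like:

```r
# a minimal call to optim() with method="BFGS"; the objective here is a
# simple convex quadratic (an illustrative stand-in), minimised at (3, -1)
res <- optim(
    par = c(0, 0),                                 # initial guess
    fn = function(x) (x[1] - 3)^2 + (x[2] + 1)^2,  # objective to minimise
    method = "BFGS"
)
print(res$par)    # location of the minimum found; close to (3, -1)
print(res$value)  # objective value there; close to 0
```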

Let us minimise the \(g\) function defined above (the one with the 2D domain). First, a randomly generated initial guess:

## [1] -2.124225  2.883051

Starting optim() from this point yields:

## $par
## [1] -1.542255  2.156405
## 
## $value
## [1] 1.413092e-12
## 
## $counts
## function gradient 
##      101       21 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

  • par – the location of the local minimum found
  • value – the value of \(g\) at par
  • counts – the number of calls to the objective function and to its gradient
  • convergence – 0 indicates successful completion


We can even depict the points that the algorithm is “visiting”:

(*) Technically, the algorithm needs to evaluate the objective at a few more points in order to decide where to go next (BFGS approximates the Hessian matrix).


(Figure: the points visited by the algorithm while minimising \(g\).)

6.2.3 Convergence to Local Optima


We were lucky: the local minimum that the algorithm found coincides with the global minimum.

Let’s see where the algorithm converges when started from many points chosen uniformly at random from the square \([-5,5]\times[-5,5]\):

## 
##     0 0.695 1.356 1.705 
##   273   352   156   219

We find the global minimum in only \(\sim 27\%\) of the cases (273 out of 1000)! :(
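An experiment of this kind can be sketched as follows. Since \(g\) is defined earlier in the chapter, the code below substitutes the Rastrigin function, a classic multimodal test objective (an assumption made purely for illustration):

```r
set.seed(123)  # reproducibility

# a stand-in multimodal objective (the Rastrigin function); the chapter's g
# is defined earlier, but any function with many local minima behaves similarly
rastrigin <- function(x) 20 + sum(x^2 - 10 * cos(2 * pi * x))

# run BFGS from 1000 starting points drawn uniformly from [-5,5] x [-5,5]
values <- replicate(1000, {
    x0 <- runif(2, min = -5, max = 5)
    optim(par = x0, fn = rastrigin, method = "BFGS")$value
})

# tabulate the (rounded) objective values at the local minima found
table(round(values, 3))
```

Each distinct value in the table corresponds to a different local minimum; only the runs reaching the smallest value found the global one.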




Here is a depiction of all the random starting points and where we converge from each of them:

(Figure: the random starting points and the local minima to which the algorithm converges from them.)

6.2.4 Random Restarts


A “remedy”: repeated local search

In order to robustify an optimisation procedure, it is often advised to try multiple random initial points and pick the best solution amongst the identified local optima.


## $par
## [1] -1.542256  2.156405
## 
## $value
## [1] 3.970158e-13
## 
## $counts
## function gradient 
##       48       17 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
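A sketch of this repeated local search, again with a hypothetical stand-in objective in place of \(g\) (here, a double-well function whose two local minima have different depths):

```r
set.seed(321)

# a stand-in objective with two local minima of different depths (the actual
# g is defined earlier in the chapter); the deeper well lies near x1 = -1
f <- function(x) (x[1]^2 - 1)^2 + 0.3 * x[1] + (x[2]^2 - 1)^2

N <- 100  # number of random restarts
best <- NULL
for (i in 1:N) {
    x0 <- runif(2, min = -5, max = 5)               # random initial guess
    res <- optim(par = x0, fn = f, method = "BFGS")
    if (is.null(best) || res$value < best$value)    # keep the best local optimum
        best <- res
}
print(best$par)    # near the deeper well, x1 about -1.04, x2 about +/-1
print(best$value)  # about -0.305
```

With so many restarts, it is overwhelmingly likely that at least one run lands in the basin of the deeper minimum, so the shallow one (with value about +0.29) is discarded.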

Can we guarantee that the global minimum will be found within \(N\) tries? No.
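However, the chances improve quickly with \(N\): if each independent try finds the global minimum with probability \(p\) (about 0.27 in the experiment above), then at least one of \(N\) tries succeeds with probability \(1-(1-p)^N\):

```r
# probability that at least one of N independent restarts finds the global
# minimum, assuming each single try succeeds with probability p = 0.27
# (the empirical rate from the 1000-start experiment above)
p <- 0.27
N <- c(1, 5, 10, 20)
round(1 - (1 - p)^N, 3)
## [1] 0.270 0.793 0.957 0.998
```

High, but never exactly 1 – hence, no guarantee.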