6.4 Outro

6.4.1 Remarks

Solving continuous problems with many variables (e.g., deep neural networks) is time consuming – the more variables to optimise over (e.g., model parameters, think the number of interconnections between all the neurons), the slower the optimisation process.

(*) Good luck fitting a logistic regression model to MNIST with optim()’s BFGS – there are 7850 variables.

Training deep neural networks with SGD is slow too, but there is a trick to propagate weight updates layer by layer, called backpropagation (actually used in every neural network library), see, e.g., (Goodfellow et al. 2016).


With methods such as GD or SGD, there is no guarantee we reach a minimum, but an approximate solution is better than no solution at all.

Also sometimes (especially in ML applications) we don’t really need the actual minimum (with respect to the train set).

6.4.2 Optimisers in Keras

Keras implements various optimisers that we can refer to in the compile() function, see https://keras.rstudio.com/reference/compile.html and https://keras.io/optimizers/

  • SGD – stochastic gradient descent supporting momentum and learning rate decay,

  • RMSprop – divides the gradient by a running average of its recent magnitude,

  • Adam – adaptive momentum

and so on.

These are all fancy variations of the pure stochastic GD.

Some of them are just tricks that work well in some examples and destroy the convergence on other ones.

You will get into their details in a dedicated course covering deep neural networks in more detail (see, e.g., (Goodfellow et al. 2016)), but you already have developed some good intuitions!

6.4.3 Note on Search Spaces

Most often, the choice of the search space \(D\) in an continuous optimisation problem can be:

  • \(D=\mathbb{R}^p\) – continuous unconstrained (typical in ML)

  • \(D=[a_1,b_1]\times\dots\times[a_n,b_n]\) – continuous with box constraints

    see method="L-BFGS-B" in optim()

  • constrained with \(k\) linear inequality constraints

    \(a_{1,1} x_1 + \dots + a_{1,p} x_p \le b_1\), …, \(a_{k,1} x_1 + \dots + a_{k,p} x_p \le b_k\)

    (*) supported in linear and quadratic programming solvers, where the objective function is from a very specific class

6.4.4 Further Reading

Recommended further reading:

  • (Nocedal & Wright 2006)
  • (Fletcher 2008)