6 Optimisation for Machine Learning
Once a modelling problem has been stated and a loss has been chosen, a new question takes over:
How do we actually find a model that performs well?
That is the job of optimisation. In machine learning, optimisation is the bridge between a mathematical objective and an implemented learning rule. We do not merely say what counts as a good model. We search for one.
This chapter connects the calculus and numerical methods of the earlier series to the practical logic of fitting modern models. The central ideas are not new: rates of change, local approximation, descent, iteration, numerical stability, and constrained trade-offs. What is new is the scale and role they acquire in learning systems.
- Treat “training” as minimising an objective under finite computation.
- Understand step size, curvature, and noise as the core numerical story.
- See why stochastic gradients behave differently from full-batch gradients.
- Connect regularisation to geometry (penalties reshape the landscape).
6.1 Guiding examples
- Least squares as an objective you can see and optimise
- Logistic loss as “fit a probability model, then pay for surprise”
This chapter’s job in the smoke story is to make “training” feel like a real numerical process with choices and failure modes:
- Try MAE vs MSE for PM2.5 and notice how peaks are weighted.
- Try an asymmetric loss if missing a smoke peak is worse than a false alarm.
- Compare full-batch vs stochastic gradients (noisy but fast updates).
- Read regularisation as “do not chase measurement noise and rare quirks”.
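The first comparison above can be made concrete with a small sketch. The PM2.5 readings here are made-up illustrative values (not data from the book), and the "model" is a deliberately flat prediction that ignores the smoke peak:

```python
import numpy as np

# Hypothetical PM2.5 readings with one smoke peak, and a flat prediction
y_true = np.array([8.0, 10.0, 9.0, 120.0, 11.0])   # illustrative values only
y_pred = np.full_like(y_true, 12.0)                # a model that ignores the peak

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors**2)        # mean squared error

# MSE is dominated by the single missed peak; MAE weights it far less
print(f"MAE = {mae:.1f}, MSE = {mse:.1f}")
print(f"peak's share of MSE: {errors[3]**2 / np.sum(errors**2):.1%}")
```

Squaring makes the one missed peak account for almost all of the MSE, which is exactly why a squared loss "chases" peaks harder than an absolute loss does.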
6.2 Prerequisite anchors
The strongest backward links are:
- vol-05/02-differential-calculus.qmd for gradients, tangent ideas, and local change
- vol-07/numerics/01-numerical-methods.qmd for iteration and approximation
- vol-07/optimisation/01-linear-programming.qmd for objective-driven reasoning
- vol-08/07-nonlinear-optimisation.qmd for more general optimisation landscapes
6.3 Learning as minimisation
Many learning problems can be written in the form
\text{choose } \theta \text{ to minimise } J(\theta),
where \theta denotes the model parameters and J is an objective function.
Often the objective is built from a loss over many examples:
J(\theta) = \frac{1}{n}\sum_{i=1}^n L\big(y_i, f(x_i;\theta)\big).
This compact expression carries a great deal of meaning. The model family f(x;\theta) has already been chosen. The data (x_i, y_i) have already been gathered. The loss L has already encoded what kind of error matters. The remaining question is how to move through parameter space in a way that improves the objective.
So training is not mysterious. It is repeated optimisation under imperfect information and finite computation.
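A minimal sketch makes the formula concrete. Here the model family is a hypothetical one-parameter line f(x; \theta) = \theta x, the loss is squared error, and the data are toy values invented for illustration:

```python
import numpy as np

def objective(theta, x, y):
    """J(theta) = (1/n) * sum of squared losses for f(x; theta) = theta * x."""
    preds = theta * x
    return np.mean((y - preds) ** 2)

# Toy data roughly following y = 2x (illustrative, not from the text)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Evaluating J at several parameter settings previews the landscape idea:
for theta in (0.0, 1.0, 2.0, 3.0):
    print(f"J({theta}) = {objective(theta, x, y):.3f}")
```

Each call to `objective` reports the height of the landscape at one parameter setting; training is the search for a low point.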
6.4 The geometry of a loss landscape
To optimise well, it helps to picture the objective as a landscape. Each possible parameter setting corresponds to a point in parameter space, and the objective value tells us the height of the landscape at that point.
This image is approximate, since real parameter spaces may have thousands or millions of dimensions. Still, it captures the right logic:
- high regions correspond to poor fit
- low regions correspond to better fit
- slopes indicate directions of improvement
- flat regions and narrow valleys affect numerical behaviour
The visual thread of objective landscapes belongs at the centre of this book because it ties together calculus, numerics, and learning. When a model is trained, it is not merely “updated”. It is moved through a landscape shaped by the data and the modelling choices.
6.5 Gradients and descent
In one dimension, the derivative tells us how a function changes locally. In many dimensions, the gradient plays the same role. For an objective J(\theta), the gradient \nabla J(\theta) points in the direction of steepest increase. If we want to reduce the objective, we move in the opposite direction.
This gives the basic update rule for gradient descent:
\theta_{k+1} = \theta_k - \eta \nabla J(\theta_k),
where \eta is the step size or learning rate.
The formula is simple enough to memorise, but its meaning is richer than it may first appear:
- the gradient is local information, not global knowledge
- the step size controls how strongly we trust that local information
- the update is iterative, so errors in one step can accumulate or be corrected
This is why optimisation in practice is never only about derivatives. It is also about judgment concerning scale, stability, and stopping.
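The update rule itself fits in a few lines. This sketch applies it to the simple convex function J(\theta) = (\theta - 3)^2, chosen here purely for illustration, whose gradient is 2(\theta - 3) and whose minimum sits at \theta = 3:

```python
# Gradient of J(theta) = (theta - 3)^2
def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0    # starting point
eta = 0.1      # learning rate
for k in range(50):
    # theta_{k+1} = theta_k - eta * grad J(theta_k)
    theta = theta - eta * grad_J(theta)

print(f"theta after 50 steps: {theta:.4f}")   # close to 3
```

Each step uses only the local slope at the current point; the iterate creeps toward the minimum without ever "seeing" the whole landscape.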
6.6 Step size is a modelling decision in disguise
If the step size is too small, learning is slow. If it is too large, the updates may overshoot good regions, oscillate, or diverge.
This is one reason machine learning inherits so much from numerical analysis. Iterative methods behave differently depending on the geometry of the problem. The same learning rate that works well in one model may be disastrous in another.
It is tempting to treat step-size tuning as a technical nuisance, but it reveals something more fundamental: optimisation is part of the model’s behaviour. A loss function may describe the destination, but the optimisation rule affects which destination is actually reached in finite time.
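The three regimes are easy to observe on a quadratic. This sketch runs plain gradient descent on J(\theta) = \theta^2 (gradient 2\theta) with three illustrative step sizes:

```python
def run_gd(eta, steps=20, theta0=1.0):
    """Plain gradient descent on J(theta) = theta^2, whose gradient is 2*theta."""
    theta = theta0
    for _ in range(steps):
        theta -= eta * 2.0 * theta
    return theta

print(run_gd(0.01))   # too small: after 20 steps still far from 0
print(run_gd(0.4))    # well chosen: essentially at the minimum
print(run_gd(1.1))    # too large: |theta| grows each step -- divergence
```

Each step multiplies \theta by (1 - 2\eta), so convergence requires |1 - 2\eta| < 1; outside that band the iterates overshoot more on every step.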
6.7 Convex and nonconvex problems
Some optimisation problems are especially well behaved. If the objective is convex, then any local minimum is also a global minimum. This gives strong mathematical reassurance. Linear least squares and many regularised models have this flavour.
But modern machine learning frequently leaves the convex world. Neural networks and other expressive model families produce nonconvex objectives with many valleys, ridges, plateaus, and saddle-like regions.
This does not make optimisation hopeless. It does mean we must think differently. We are no longer promised a unique best point found by clean global logic. Instead we work with iterative methods that often find useful solutions without providing perfect guarantees.
The important intellectual move is not to panic at the loss of certainty. It is to understand which guarantees have disappeared and which practical regularities remain.
6.8 Local minima, saddles, and flat regions
Students often hear that machine learning fails because of local minima. That can happen, but in high-dimensional problems other difficulties are often just as important.
A saddle point is a place where the gradient is near zero even though the point is not a minimum. Some directions curve upward and others curve downward. A flat region may produce tiny gradients and painfully slow progress. A narrow curved valley may require very careful step sizes to make progress without bouncing from side to side.
So the real challenge is not simply “avoid bad minima”. It is to navigate landscapes whose local geometry varies from place to place.
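The standard toy example of a saddle is f(x, y) = x^2 - y^2, used here only for illustration: the gradient vanishes at the origin even though the origin is not a minimum.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle at the origin: zero gradient there,
# positive curvature along x, negative curvature along y.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

print(grad((0.0, 0.0)))   # zero gradient, yet not a minimum
print(grad((0.0, 0.1)))   # a small nudge in y gives a descent direction
```

A gradient-based method started exactly at the origin would stall; started slightly off in the y-direction, it slides away down the negatively curved axis.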
6.9 Conditioning and sensitivity
Conditioning measures how sensitive a problem is to perturbations. In optimisation, poor conditioning often shows up when one direction in parameter space changes the objective much more sharply than another.
Geometrically, this can make level sets elongated rather than round. Numerically it means gradient descent may zig-zag inefficiently, making progress in some directions while struggling in others.
This is a familiar theme from linear algebra and numerical methods: not all problems are equally stable, even when the formulas look innocent. The same is true in machine learning. When training is unstable or painfully slow, the issue is often not lack of data alone but awkward geometry in the objective.
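For a quadratic objective this sensitivity can be read off directly from the Hessian. Taking J(w) = w_1^2 + 4 w_2^2 as an illustrative example (the same poorly conditioned bowl used in the interactive later in this chapter), the ratio of its largest to smallest curvature is its condition number:

```python
import numpy as np

# Hessian of J(w) = w1^2 + 4*w2^2
A = np.array([[2.0, 0.0],
              [0.0, 8.0]])

eigvals = np.linalg.eigvalsh(A)     # curvatures along the principal axes
cond = eigvals[-1] / eigvals[0]     # condition number of the Hessian
print(f"curvatures: {eigvals}, condition number: {cond}")
```

Stability in the steep direction forces the step size below 2/8, but that same step size makes progress in the shallow direction slow, which is precisely the zig-zag inefficiency described above.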
6.10 Stochastic optimisation and minibatches
For large datasets, computing the full gradient over every example at every step may be too expensive. A standard compromise is stochastic optimisation.
Instead of evaluating the objective on all n examples, we estimate the gradient from a smaller subset, often called a minibatch. The update becomes noisier, but much cheaper.
This is one of the beautiful compromises of modern ML:
- exactness is relaxed
- iteration becomes faster
- many more updates become possible
The resulting path through parameter space is no longer a smooth deterministic descent. It jitters. Yet that noise can sometimes be helpful, nudging the algorithm away from shallow traps or narrow pathological regions.
So stochasticity here is not mere sloppiness. It is a computational strategy that changes the character of the optimisation process.
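A minimal sketch of minibatch stochastic gradient descent, on a synthetic one-parameter least-squares problem invented for illustration (the true slope is 3):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3*x + noise
n = 1000
x = rng.normal(size=n)
y = 3.0 * x + 0.1 * rng.normal(size=n)

theta = 0.0
eta = 0.05
batch_size = 32
for step in range(300):
    idx = rng.integers(0, n, size=batch_size)       # sample a minibatch
    xb, yb = x[idx], y[idx]
    grad = np.mean(-2.0 * xb * (yb - theta * xb))   # minibatch gradient of MSE
    theta -= eta * grad

print(f"theta after 300 noisy steps: {theta:.3f}")  # jitters near the true slope
```

Each update sees only 32 of the 1000 examples, so individual steps are noisy, yet the trajectory still settles near the true slope at a fraction of the full-gradient cost.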
6.11 Regularisation changes the landscape
Earlier we described regularisation as disciplined simplification. From the optimisation point of view, regularisation also reshapes the objective.
If we penalise large parameter magnitudes, for example, the new objective may look like
J_{\text{reg}}(\theta) = J(\theta) + \lambda R(\theta),
where R(\theta) is a penalty term and \lambda controls its strength.
This changes what counts as a desirable solution. A parameter setting that fits the training data slightly better may nevertheless be rejected if it is too complex or unstable. In geometric language, regularisation alters the landscape. In modelling language, it encodes caution.
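The reshaping is visible even in one parameter. For the illustrative ridge-style objective J_{\text{reg}}(\theta) = \frac{1}{n}\sum_i (y_i - \theta x_i)^2 + \lambda \theta^2, setting the derivative to zero gives the closed form \theta^\ast = \sum_i x_i y_i / (\sum_i x_i^2 + n\lambda), so we can watch the minimiser shrink as \lambda grows (toy data, invented here):

```python
import numpy as np

# Toy data roughly following y = 2x
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.1, 5.9])

def minimiser(lam):
    """Closed-form minimiser of mean((y - theta*x)^2) + lam * theta^2."""
    n = len(x)
    return np.sum(x * y) / (np.sum(x**2) + n * lam)

for lam in (0.0, 0.1, 1.0, 10.0):
    print(f"lambda = {lam:5.1f}  ->  theta* = {minimiser(lam):.3f}")
```

Larger \lambda pulls the minimum of the reshaped landscape toward zero: the penalty literally moves where the low point sits, not just how we search for it.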
6.12 Optimisation is not the same as generalisation
It is possible to optimise the training objective very well and still obtain a poor model.
This distinction is essential. Optimisation asks whether we can find parameters that make the chosen objective small on the data we are using. Generalisation asks whether the resulting model will behave well on new data.
The two ideas are related but not identical. An expressive model with enough capacity can sometimes drive training error extremely low. That tells us the optimiser has succeeded at its local task. It does not by itself tell us that the model has learned something durable.
This is why machine learning cannot be reduced to optimisation alone. The objective matters. The data split matters. The model family matters. The evaluation rule matters.
6.13 A worked example: fitting a linear predictor
Suppose we model housing price by
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2,
where x_1 is floor area and x_2 is lot size. If we use mean squared error, the optimisation problem becomes
J(\beta_0,\beta_1,\beta_2) = \frac{1}{n}\sum_{i=1}^n \left(y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2}\right)^2.
This objective is smooth and well behaved. The gradient can be computed, and the parameters can be improved iteratively. Because the model is relatively simple, the optimisation is usually not the hardest part of the problem. Data quality, feature choice, and model adequacy often matter more.
That observation is healthy. It reminds us that optimisation serves modelling, not the other way round.
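Because this objective is smooth and convex, the whole fit can be done directly with a least-squares solve. The sketch below uses synthetic housing data invented for illustration (true coefficients 40, 2, and 0.5):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic housing data: price from floor area (x1) and lot size (x2)
n = 200
x1 = rng.uniform(50, 200, n)      # floor area
x2 = rng.uniform(100, 1000, n)    # lot size
y = 40.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(0, 5.0, n)

# Design matrix with an intercept column; lstsq finds the global
# minimiser of the mean-squared-error objective in one solve.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"beta0={beta[0]:.2f}, beta1={beta[1]:.3f}, beta2={beta[2]:.3f}")
```

The recovered coefficients land close to the values that generated the data, which illustrates the point: for a convex problem like this, the optimisation step is routine, and the harder questions live elsewhere.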
6.14 A second example: training a layered model
Now imagine the model is a neural network with many hidden units and nonlinear activations. The same basic optimisation language survives:
- define an objective
- compute gradients
- update parameters iteratively
But the terrain becomes much more complicated. The number of parameters grows. The geometry becomes strongly nonconvex. Minibatch methods become the practical default. Conditioning and scaling become more important.
The mathematics has not been abandoned. It has become more demanding.
6.15 Interactive: gradient descent on a 2D loss landscape
Change LEARNING_RATE to see how step size affects the descent path — try values between 0.05 and 0.95. Increase N_STEPS to watch longer trajectories. Set MOMENTUM to 0.85 to see how momentum smooths the path and accelerates convergence in poorly conditioned landscapes.
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import io, base64

# --- Try changing these parameters ---
LEARNING_RATE = 0.25   # step size eta (try 0.05, 0.25, 0.85)
N_STEPS = 30           # number of gradient descent iterations
MOMENTUM = 0.0         # momentum coefficient (0 = plain GD; try 0.85)

# Loss: L(w1, w2) = w1^2 + 4*w2^2 (convex bowl, poorly conditioned)
def loss(w):
    return w[0]**2 + 4 * w[1]**2

def grad(w):
    return np.array([2 * w[0], 8 * w[1]])

# Run gradient descent with optional momentum
w = np.array([2.8, 2.2])
velocity = np.zeros(2)
traj = [w.copy()]
losses = [loss(w)]
for _ in range(N_STEPS):
    g = grad(w)
    velocity = MOMENTUM * velocity - LEARNING_RATE * g
    w = w + velocity
    traj.append(w.copy())
    losses.append(loss(w))
traj = np.array(traj)

# --- Plot ---
fig, axes = plt.subplots(1, 2, figsize=(9, 4))

# Left: contour + path
w1v = np.linspace(-3.5, 3.5, 300)
w2v = np.linspace(-3.5, 3.5, 300)
W1, W2 = np.meshgrid(w1v, w2v)
L = W1**2 + 4 * W2**2
ax = axes[0]
levels = np.linspace(0.2, 16, 22)
ax.contour(W1, W2, L, levels=levels, colors='#4e8ac4', linewidths=0.7, alpha=0.8)
ax.plot(traj[:, 0], traj[:, 1], color='#c44e4e', lw=1.5, zorder=3)
ax.scatter(traj[0, 0], traj[0, 1], s=55, color='#c44e4e', zorder=5, marker='o', label='Start')
ax.scatter([0], [0], s=65, color='#2a6099', zorder=5, marker='*', label='Minimum')
ax.set_title(f'Descent path ($\\eta={LEARNING_RATE}$, mom={MOMENTUM})', fontsize=10)
ax.set_xlabel('$w_1$'); ax.set_ylabel('$w_2$')
ax.legend(fontsize=8); ax.set_xlim(-3.5, 3.5); ax.set_ylim(-3.5, 3.5)

# Right: loss vs iteration
ax2 = axes[1]
ax2.plot(range(N_STEPS + 1), losses, color='#1a3a5c', lw=2.0)
ax2.set_xlabel('Iteration'); ax2.set_ylabel('Loss $J(w)$')
ax2.set_title('Loss vs iteration', fontsize=10)
ax2.set_xlim(0, N_STEPS)

fig.tight_layout(pad=2.0)
buf = io.BytesIO()
fig.savefig(buf, format='png', dpi=96, bbox_inches='tight')
buf.seek(0)
img_b64 = base64.b64encode(buf.read()).decode()
print(f'<img src="data:image/png;base64,{img_b64}" style="max-width:100%">')

6.16 Looking ahead
The next chapter turns toward linear algebra and representation. This is a natural continuation, because optimisation acts on parameter spaces, while representation theory asks what kinds of spaces the data themselves inhabit.
For now, the key ideas to carry forward are:
- training is an optimisation problem
- gradients provide local directional information
- landscape geometry affects computational behaviour
- stochastic methods trade exactness for scalable iteration
- a well-optimised model is not automatically a well-generalising model
6.17 Exercises and prompts
- Explain in words what the update rule \theta_{k+1} = \theta_k - \eta \nabla J(\theta_k) means. What role is played by the gradient, and what role is played by the learning rate?
- Give an example of a situation where a model might achieve a very low training error and still be untrustworthy.
- Why can poor conditioning slow down gradient descent even when the objective is smooth?
- Describe one advantage and one disadvantage of using minibatches instead of full-dataset gradients.