15  Exercise Answers

These are model answers — not the only valid approaches. The goal is to show one clear path to each answer, with enough working to make the reasoning legible.

15.1 Chapter 1: Mathematical Modelling for Data and Systems

15.1.1 Exercise 1

Choose a real system you know well. Identify possible features, a target, and at least one hidden variable.

Consider predicting daily electricity demand for a residential building. Plausible features include outdoor temperature, hour of the day, day of the week, and whether the day is a public holiday. The target is total energy consumption in kilowatt-hours over the next 24 hours.

Several hidden variables complicate the picture. Occupancy is one: the number of people inside the building changes hour to hour and is rarely measured directly. Appliance state is another: whether specific high-draw devices such as ovens or heating systems are running affects demand, but individual appliance logs are seldom available. A building’s thermal mass also plays a role — heat stored in the fabric of the building from the previous day creates a carry-over effect that cannot be read from outdoor sensors alone.

This example illustrates that features describe the observable inputs, while hidden variables represent latent structure that shapes the system. Good models must cope with the fact that what matters and what can be measured are not the same list.

15.1.2 Exercise 2

Give an example where squared error would be a poor loss function.

Squared error is a poor choice when the cost of making a large error is disproportionately severe, when error in one direction is far more damaging than error in the other, or when the target distribution has heavy tails that inflate the squared loss without reflecting real-world consequence.

A concrete example: predicting whether a river will exceed flood stage in the next 12 hours. Here an underestimate — predicting low flow when a flood is actually coming — may lead to catastrophic failure to issue warnings. An overestimate causes unnecessary inconvenience but not disaster. Squared error treats both directions symmetrically, so a forecaster minimising squared error faces no mathematical incentive to skew conservative on the dangerous side.

An asymmetric loss such as one that penalises underestimates at several times the rate of overestimates would be more aligned with the real stakes. More generally, whenever missing in one direction is qualitatively worse than missing in the other, the standard squared error is not a neutral device — its symmetry imposes its own set of values on the model.
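
As an illustration, here is a minimal sketch comparing squared error with an asymmetric loss on a toy flood-forecast number; the penalty factor of 5 for underestimates is an assumed value, chosen only to make the asymmetry visible.

```python
import numpy as np

def squared_error(y_true, y_pred):
    return (y_true - y_pred) ** 2

def asymmetric_error(y_true, y_pred, under_weight=5.0):
    # Penalise underestimates (y_pred < y_true) more heavily than overestimates.
    err = y_true - y_pred
    return np.where(err > 0, under_weight * err ** 2, err ** 2)

y_true = 100.0                      # actual peak flow (illustrative units)
for y_pred in (90.0, 110.0):        # one underestimate, one overestimate of equal size
    print(y_pred, squared_error(y_true, y_pred), asymmetric_error(y_true, y_pred))
# Squared error scores the two forecasts identically; the asymmetric loss
# scores the dangerous underestimate five times worse.
```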

15.1.3 Exercise 3

Describe a modelling task that is unsupervised rather than supervised. What mathematical object is being sought if there is no explicit target?

Consider a public health authority with a large dataset of anonymised hospital admission records, each described by a vector of diagnosis codes, demographic variables, and service-use patterns. There is no explicit target — no column saying “this patient belongs to group A”. The authority wants to discover whether there are meaningfully distinct patient subgroups that differ in how they use services or in their clinical trajectories.

This is an unsupervised task. The mathematical object being sought is a partition of the data — a set of clusters — together with the latent structure that organises the partition. In cluster analysis, we typically want centroids and an assignment rule, or a density model that explains where probability mass concentrates. More generally, unsupervised tasks may seek a low-dimensional representation (dimensionality reduction), an anomaly score, or a latent generative model.

The key difference from supervised learning is that there is no prediction function to be evaluated against known labels. Success must be measured differently: by cluster cohesion, interpretability, stability under perturbation, or downstream usefulness — none of which is a simple single number handed to us by the data.

15.1.4 Exercise 4

Explain why training and testing on the same data produces a misleading sense of success.

Suppose we fit a model to all available data and then report how well it explains those same data. The model has already had the opportunity to adapt, implicitly or explicitly, to every pattern, including noise and chance regularities that appear in this sample but will not appear in the next one. The measured error therefore reflects how well the model memorises, not how well it generalises.

A concrete analogy: a student who memorises past exam papers and is then tested on those same papers will score very well. That score reveals nothing about their understanding of the subject, because the test is not independent of the preparation. A fresh exam, drawn from the same syllabus but unseen, provides an honest assessment.

In forecasting this matters especially, because errors compound over time. A model that appears accurate on its training period may fail precisely in future periods when conditions shift slightly — and it has no mechanism for detecting that gap, having never encountered genuinely novel data during evaluation. Honest evaluation requires a set of examples the model has never influenced.


15.2 Chapter 2: Statistical Learning

15.2.1 Exercise 1

Show analytically that \mathbb{E}[(y - \hat{y})^2] = \text{Bias}^2 + \text{Variance} + \sigma^2.

Write y = f(x) + \varepsilon where \varepsilon is mean-zero with variance \sigma^2, independent of \hat{f}. Let \bar{f} = \mathbb{E}[\hat{f}(x)] denote the average prediction over random training sets.

Add and subtract \bar{f} inside the square:

(y - \hat{f})^2 = \bigl((f + \varepsilon) - \hat{f}\bigr)^2 = \bigl((f - \bar{f}) + (\bar{f} - \hat{f}) + \varepsilon\bigr)^2.

Take expectations. Three cross-terms appear. The cross-term between (f - \bar{f}) and (\bar{f} - \hat{f}) vanishes because (f - \bar{f}) is a constant and \mathbb{E}[\bar{f} - \hat{f}] = 0 by the definition of \bar{f}. The cross-term between (f - \bar{f}) and \varepsilon vanishes because \varepsilon has mean zero. The cross-term between (\bar{f} - \hat{f}) and \varepsilon vanishes because \varepsilon is independent of \hat{f} and has mean zero. What remains is:

\mathbb{E}[(y - \hat{f})^2] = \underbrace{(f - \bar{f})^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f} - \bar{f})^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible}}.

The bias term is the squared difference between the average prediction and the true value — it is zero when the average fitted model \bar{f} equals the true function value f(x). The variance term measures how much the fitted model varies with the training data. The irreducible term arises from the noise in y and cannot be removed by any model.
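
The decomposition can also be checked numerically. The sketch below repeatedly fits a deliberately simple model (a straight line) to fresh training sets drawn from an assumed sine-plus-noise process, then compares the averaged squared error at one test point with bias² + variance + σ².

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)    # assumed true function
sigma = 0.3                            # noise standard deviation
x0 = 0.25                              # test point at which the error is decomposed

preds = []
for _ in range(5000):                  # many independent training sets
    x = rng.uniform(0, 1, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    b1, b0 = np.polyfit(x, y, 1)       # deliberately too simple: a straight line
    preds.append(b0 + b1 * x0)

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
y_new = f(x0) + rng.normal(0, sigma, preds.size)   # fresh noisy observations at x0

print("bias^2 + variance + sigma^2 :", round(bias2 + variance + sigma ** 2, 4))
print("empirical E[(y - yhat)^2]   :", round(np.mean((y_new - preds) ** 2), 4))
# The two numbers should agree up to simulation error.
```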

15.2.2 Exercise 2

For a synthetic dataset, split 70/30 and compute validation MSE for polynomial degrees 1 through 8.

The procedure is: generate the data, split it 70/30 into training and validation sets, fit each polynomial degree to the training portion by ordinary least squares, evaluate mean squared error on the held-out validation set, and plot validation MSE against degree.

For a smooth function with moderate noise, a typical result is: validation MSE decreases sharply from degree 1 (underfitting) to degree 3 or 4 (the approximate sweet spot), then rises gradually as the model starts fitting noise. The minimum of the curve marks the bias-variance balance: below it, the model lacks the flexibility to follow the signal; above it, it tracks the sample noise instead.

If the noise level is doubled, the optimal degree typically shifts left (toward simpler models), because the same signal is now buried in noisier observations. With higher noise, a complex polynomial's variance term inflates faster than any gain from reduced bias, so the optimum moves toward lower degree — in effect, stronger regularisation. Conversely, with a large clean dataset, the optimal degree can increase, because a larger sample keeps the variance under control.
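
A minimal sketch of the procedure described above, under assumed choices (a sine signal, Gaussian noise with standard deviation 0.3, a fixed seed); the exact optimal degree will vary with the seed and the noise level.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x) + rng.normal(0, 0.3, n)          # assumed signal plus noise

idx = rng.permutation(n)                            # 70/30 train/validation split
train, val = idx[:140], idx[140:]

for degree in range(1, 9):
    coefs = np.polyfit(x[train], y[train], degree)  # ordinary least squares fit
    val_mse = np.mean((np.polyval(coefs, x[val]) - y[val]) ** 2)
    print(f"degree {degree}: validation MSE = {val_mse:.4f}")
# Typically the MSE drops sharply up to degree 3-4, then creeps upward.
```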

15.2.3 Exercise 3

Derive the ridge estimator \hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y.

The ridge objective is

J(\beta) = \|y - X\beta\|^2 + \lambda\|\beta\|^2.

Expand:

J(\beta) = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta + \lambda \beta^\top \beta.

Differentiate with respect to \beta:

\frac{\partial J}{\partial \beta} = -2X^\top y + 2X^\top X\beta + 2\lambda\beta.

Setting to zero:

(X^\top X + \lambda I)\hat{\beta} = X^\top y \implies \hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y.

Invertibility. The matrix X^\top X is positive semi-definite: for any vector v, v^\top X^\top X v = \|Xv\|^2 \geq 0. Adding \lambda I with \lambda > 0 shifts every eigenvalue of X^\top X upward by \lambda, making the smallest eigenvalue at least \lambda > 0. The matrix is therefore positive definite and hence invertible.

When features are nearly collinear, X^\top X has eigenvalues close to zero and ordinary least squares becomes numerically unstable — small perturbations in the data cause large swings in \hat{\beta}. The ridge penalty lifts those near-zero eigenvalues away from the origin, producing a well-conditioned system with a unique, stable solution.
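
A small numerical check of the closed form is sketched below, on an assumed design matrix with two nearly collinear columns; it also shows the condition number of X^\top X + \lambda I improving as \lambda grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 2
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 1e-3 * rng.normal(size=n)])   # nearly collinear columns
y = X @ np.array([1.0, -1.0]) + rng.normal(0, 0.1, n)

for lam in (0.0, 0.1, 10.0):
    A = X.T @ X + lam * np.eye(p)
    beta_hat = np.linalg.solve(A, X.T @ y)    # (X'X + lam I)^{-1} X'y, without forming the inverse
    print(f"lambda={lam:5.1f}  cond(A)={np.linalg.cond(A):.3g}  beta_hat={np.round(beta_hat, 3)}")
# As lambda grows the system becomes well conditioned and the coefficients
# stabilise and shrink (here towards each other, since the columns are nearly identical).
```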

15.2.4 Exercise 4

Describe the regularisation path as \lambda varies for ridge and lasso.

Ridge. As \lambda \to 0, the penalty vanishes and the ridge estimator converges to the ordinary least squares solution. As \lambda \to \infty, the penalty dominates and \hat{\beta} \to 0: all coefficients are shrunk toward zero. The path is smooth — coefficients decrease continuously and none ever reach exactly zero for finite \lambda.

Lasso. The path has the same endpoints: \hat{\beta} converges to the OLS solution as \lambda \to 0, and to zero as \lambda \to \infty. The critical difference is sparsity. As \lambda increases from zero, lasso coefficients do not shrink smoothly in unison. Instead, the coefficients with the weakest signal — those least correlated with the residuals — are the first to reach exactly zero and leave the active set. The lasso therefore performs continuous feature selection: at each value of \lambda, some coefficients are exactly zero while the rest are shrunk but nonzero.

This sparsity emerges because the L1 constraint set is a diamond with corners on the coordinate axes. When the elliptical contours of the squared-error loss first touch the diamond, the point of contact often lies at a corner or along an edge, where one or more coordinates are exactly zero. The L2 ball has no corners, so ridge solutions generally touch it at points where every coordinate is nonzero.
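
The ridge side of the path can be traced directly from the closed form, as sketched below on an assumed synthetic dataset; the lasso path has no closed form and would need an iterative solver such as coordinate descent, so it is only described in the comment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.5])   # two genuinely inactive features
y = X @ beta_true + rng.normal(0, 1.0, n)

for lam in np.logspace(-2, 4, 7):
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(f"lambda={lam:10.2f}  coefficients={np.round(beta, 3)}")
# Ridge coefficients shrink smoothly towards zero as lambda grows but never
# become exactly zero; a lasso path on the same data would set the weak
# coefficients exactly to zero first while keeping the strong ones active.
```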


15.3 Chapter 3: Optimisation for Machine Learning

15.3.1 Exercise 1

Explain in words what the update rule \theta_{k+1} = \theta_k - \eta \nabla J(\theta_k) means.

The update rule says: take the current parameter setting \theta_k, compute the gradient of the objective at that point, and move the parameters a small step in the direction opposite to the gradient.

The gradient \nabla J(\theta_k) is a vector that points in the direction of steepest increase of J at \theta_k. Moving opposite to it therefore reduces the objective, at least locally. The learning rate \eta controls how large a step to take. A small \eta means cautious, incremental moves; a large \eta means bold steps that may overshoot the minimum and cause the objective to increase or oscillate.

The key insight is that the gradient is local information. It describes the slope at the current point, not the global shape of the landscape. This is why the rule must be iterated many times: each step takes a new local measurement and moves accordingly. The path to a minimum is built up from many locally informed decisions, not from a single global view.
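
A minimal sketch of the rule in action on a simple quadratic bowl; the objective, starting point, and learning rate are illustrative choices.

```python
import numpy as np

def J(theta):
    return 0.5 * np.sum(theta ** 2)         # a simple bowl with its minimum at the origin

def grad_J(theta):
    return theta                            # gradient of the bowl

theta = np.array([4.0, -3.0])               # assumed starting point
eta = 0.1                                   # learning rate
for k in range(50):
    theta = theta - eta * grad_J(theta)     # theta_{k+1} = theta_k - eta * grad J(theta_k)

print("final parameters:", theta, " objective:", J(theta))   # close to the minimum at (0, 0)
```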

15.3.2 Exercise 2

Give an example where a model might achieve a very low training error and still be untrustworthy.

Suppose we fit a polynomial of degree 19 to 20 training observations. The model has enough free parameters to pass through every training point, so training MSE approaches zero. But between the training points, the polynomial oscillates wildly. On a validation set drawn from the same distribution, the error is enormous.

The model has not learned the underlying function. It has memorised the specific sample. When a new observation arrives, the polynomial has no guarantee of producing anything meaningful because its behaviour between training points was never constrained by data.

A different illustration: a classifier trained to detect financial fraud on historical data from a single institution may achieve near-zero training error. But if the fraud patterns it learned were particular to one period or geography, it will fail badly on new cases. Low training error guarantees only that the model fits the examples it saw — not that it has discovered durable structure.

15.3.3 Exercise 3

Why can poor conditioning slow down gradient descent even when the objective is smooth?

Conditioning describes how uniformly the objective changes in different directions through parameter space. A well-conditioned problem has a roughly circular bowl shape: every direction is about as steep as every other. A poorly conditioned problem has an elongated, elliptical bowl: some directions are very steep and others are nearly flat.

Gradient descent takes steps proportional to the gradient magnitude. In an elongated landscape, the steep directions generate large gradient components and the flat directions generate small ones. If the step size is chosen to be safe in the steep direction, it is too small to make meaningful progress in the flat direction. If it is chosen to make fast progress in the flat direction, it overshoots in the steep direction, causing oscillation.

The result is the characteristic zig-zag path: the algorithm bounces back and forth across the narrow valley rather than travelling along its floor. The objective does decrease, but far more slowly than the number of steps suggests. Poor conditioning is therefore a geometric obstacle to convergence, present even when the function is perfectly differentiable.
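
The effect can be seen in a small sketch that runs gradient descent on quadratics with assumed condition numbers, using in each case a step size just inside the stability limit set by the steepest direction.

```python
import numpy as np

def steps_to_converge(kappa, tol=1e-6, max_iter=1_000_000):
    # Quadratic J(theta) = 0.5 * theta' H theta with eigenvalues 1 and kappa.
    H = np.diag([1.0, kappa])
    eta = 1.9 / kappa                       # step size kept safe for the steepest direction
    theta = np.array([1.0, 1.0])
    for k in range(max_iter):
        theta = theta - eta * (H @ theta)   # gradient of the quadratic is H theta
        if np.linalg.norm(theta) < tol:
            return k + 1
    return max_iter

for kappa in (1.0, 100.0, 1000.0):
    print(f"condition number {kappa:6.0f}: {steps_to_converge(kappa)} steps")
# For large condition numbers the iteration count grows roughly in proportion
# to kappa: progress along the flat direction is limited by the step size the
# steep direction can tolerate.
```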

15.3.4 Exercise 4

Describe one advantage and one disadvantage of using minibatches instead of full-dataset gradients.

Advantage. Each gradient estimate is much cheaper to compute because it uses only a small subset of the training data. This means many more parameter updates can be performed in a given time budget. Even if each update is noisier than a full-batch update, the sheer number of steps taken can outpace the slower but exact approach. For large datasets — millions of examples — full-gradient descent would be practically impossible, while minibatch descent scales naturally.

Disadvantage. The gradient estimate carries noise from the random selection of examples. This means the path through parameter space is no longer a smooth, deterministic descent. The algorithm may wander in the neighbourhood of a minimum rather than converging to it cleanly. Choosing the minibatch size and learning rate schedule requires care: too small a batch produces so much noise that useful information is drowned out; too large a batch wastes the computational advantage. The noisiness also makes it harder to diagnose convergence — the loss curve jitters even after the model is well trained.
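
A minimal sketch of the trade-off on an assumed least-squares problem: the batch size, learning rate, and step count are illustrative, and the point is only that many cheap, noisy steps can get close to the solution for roughly the cost of one full-gradient pass.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100_000, 20
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(0, 0.5, n)

theta = np.zeros(p)
eta, batch = 0.1, 64                               # assumed learning rate and minibatch size
for step in range(2000):
    idx = rng.integers(0, n, batch)                # random minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ theta - yb) / batch    # cheap but noisy gradient estimate
    theta -= eta * grad

print("distance from true coefficients:", round(np.linalg.norm(theta - beta_true), 3))
# 2000 minibatch steps touch about 128k rows in total -- roughly the cost of a
# single full-gradient pass over the 100k-row dataset -- yet theta is already
# close to the solution.
```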


15.4 Chapter 4: Linear Algebra for Representations

15.4.1 Exercise 1

Give an example of a real object or system that could be represented by a vector. What would the coordinates mean?

A hospital patient at the time of admission can be represented as a vector in \mathbb{R}^p where each coordinate captures one measured attribute: age, systolic blood pressure, blood glucose, haemoglobin level, whether they arrived by ambulance (0 or 1), number of previous admissions in the past year, and so on.

The vector is not “the patient” — it is a selected, quantified description of the patient at a particular instant. The choice of which features to include already embeds a modelling decision. Coordinates on a continuous scale (blood pressure) are treated differently from binary indicators (ambulance arrival) and count variables (previous admissions), so the representation may involve scaling and encoding steps before the vector is suitable for geometric operations such as distance measurement or projection.

Once in vector form, patients can be compared (are these two patients similar?), clustered (are there subgroups with distinct risk profiles?), and used as inputs to classifiers or regression models. The vector representation makes mathematical tools available that were not available when the records existed only as narrative clinical notes.

15.4.2 Exercise 2

Explain in words why least squares can be described as a projection problem.

When we fit a linear model \hat{y} = X\beta to a response vector y, the set of all possible predictions is the column space of X: every vector that can be written as a linear combination of the columns of the design matrix. If the system X\beta = y has no exact solution — which is typical when n > p — then no choice of \beta puts \hat{y} exactly at y.

Least squares finds the \hat{y} in the column space of X that is closest to y in Euclidean distance. That is exactly the orthogonal projection of y onto the column space of X. The residual y - \hat{y} is then perpendicular to every column of X, which is why the normal equations X^\top(y - X\hat{\beta}) = 0 are the geometric statement that the residual is orthogonal to the fit space.

So least squares is not an arbitrary formula. It is the answer to the geometric question: if we cannot hit y exactly with our linear model, where should we aim? The answer is the closest point, and the closest point is found by projection.
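
A small numerical check on assumed random data: fit by least squares, confirm that the residual is orthogonal to every column of X, and confirm that the projection is closer to y than an arbitrary point in the column space.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                        # in general y is not in the column space of X

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat                          # orthogonal projection of y onto col(X)
residual = y - y_hat

print("X^T residual (should be ~0):", np.round(X.T @ residual, 10))
other = X @ rng.normal(size=p)                # some other point in the column space
print("projection is closer to y:", np.linalg.norm(residual) <= np.linalg.norm(y - other))
```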

15.4.3 Exercise 3

Why might a lower-dimensional representation preserve useful structure even when some information is discarded?

Real data are rarely spread uniformly across the full space of their nominal dimensionality. A dataset of images, medical records, or audio clips typically lies near a much lower-dimensional structure because the underlying generating process has far fewer degrees of freedom than the number of pixels, variables, or frequency bins.

Principal component analysis makes this concrete. When we compute the top k principal components and project the data onto them, we discard the directions of smallest variation. If those directions carry mostly noise — random fluctuations unrelated to the signal of interest — then the projection loses very little useful information while eliminating noise. The signal-to-noise ratio of the compressed representation may actually be higher than that of the original.

The information that is discarded is, by construction, the information that varies the least across the dataset. If two objects are genuinely similar, they will remain similar in the compressed space; their proximity is mostly explained by the directions of large variation that were retained. This is why compression can sometimes improve downstream task performance: it acts as a form of regularisation that prevents the model from fitting to irrelevant variation.
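
A minimal sketch using the singular value decomposition on assumed data that is genuinely two-dimensional plus noise; it reports how much variance the top two components retain.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 500, 20, 2
latent = rng.normal(size=(n, k))                      # true 2-dimensional structure
mixing = rng.normal(size=(k, p))
X = latent @ mixing + 0.1 * rng.normal(size=(n, p))   # high-dimensional, noisy observations

Xc = X - X.mean(axis=0)                               # centre before PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = (s ** 2) / np.sum(s ** 2)

print("variance explained by the first two components:", round(explained[:2].sum(), 4))
Z = Xc @ Vt[:2].T                                     # the 2-dimensional representation
print("compressed shape:", Z.shape)                   # (500, 2) -- little useful signal lost
```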

15.4.4 Exercise 4

Describe a setting where the choice of representation could change the quality of a learning system even if the optimisation method stayed the same.

Consider classifying land-cover types from satellite imagery. The raw representation is pixel intensity values in several spectral bands. One could add derived features: the Normalised Difference Vegetation Index (NDVI), which combines near-infrared and red band values into a single index known to separate vegetated from non-vegetated areas very cleanly. A classifier trained with NDVI included will typically outperform one trained only on raw bands, because the feature encodes domain knowledge about what separates the classes.

The optimisation algorithm — gradient descent, say — is unchanged. The data split and model family are unchanged. Only the representation is different: the same raw numbers have been transformed into a more informative coordinate system before learning begins. The resulting boundary in the new space corresponds to a much more complex, nonlinear boundary in the original pixel space — one that a linear model in the raw space could never have found.

This demonstrates that representation is not a neutral preprocessing step. It encodes prior knowledge about the structure of the problem and can determine whether a task is easy or hard for a given model family.
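
A minimal sketch of the representation change, assuming two raw band values per pixel; the small constant in the denominator is a guard against division by zero, not part of the standard definition.

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    # Normalised Difference Vegetation Index: (NIR - Red) / (NIR + Red).
    nir, red = np.asarray(nir, dtype=float), np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)

# Two illustrative pixels: dense vegetation reflects strongly in near-infrared,
# while bare soil reflects the two bands more evenly.
print(np.round(ndvi([0.50, 0.30], [0.08, 0.25]), 3))   # high for vegetation, low for soil
```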


15.5 Chapter 5: Probability, Bayesian Models, and Uncertainty

15.5.1 Exercise 1

Give an example where a point prediction would be less useful than a predictive distribution.

Consider a reservoir manager deciding whether to release water through a spillway ahead of a forecast rainstorm. A deterministic model predicts that inflow tomorrow will be 420\ \mathrm{m}^3\mathrm{s}^{-1}. But the distribution around that prediction matters enormously. If the 90th percentile of the forecast distribution is 510\ \mathrm{m}^3\mathrm{s}^{-1}, the manager may decide to pre-release water to create buffer capacity. If the 90th percentile is only 435\ \mathrm{m}^3\mathrm{s}^{-1}, the risk is lower and no action may be needed.

The point estimate 420 is the same in both scenarios. The decision is different because the spread of the predictive distribution differs. A manager armed only with the point prediction cannot make a risk-informed decision — they cannot know whether the forecast is precise or deeply uncertain.

In general, whenever downstream actions are asymmetric, extreme outcomes have high costs, or the decision threshold falls in a region of meaningful probability, the full distribution is needed. A point prediction hides exactly the information that makes uncertainty actionable.

15.5.2 Exercise 2

Explain in words the roles of prior, likelihood, and posterior in Bayesian reasoning.

The prior represents knowledge or belief about a parameter or hypothesis before the current data are examined. It may be informative — based on theory, previous studies, or expert judgment — or relatively uninformative, expressing that many values are plausible. The prior is not a subjective whim; it is a formal commitment about what was reasonable to believe before the evidence arrived.

The likelihood describes how plausible the observed data would be for each possible value of the parameter. It is not a probability over parameters but a function of parameters for fixed data. Parameters under which the data would be common have high likelihood; parameters under which the data would be extremely rare have low likelihood.

The posterior combines prior and likelihood using Bayes’ theorem:

p(\theta \mid \text{data}) \propto p(\text{data} \mid \theta)\, p(\theta).

It is the updated belief about the parameter after taking the data into account. The posterior inherits structure from both sources: the prior pulls the estimate toward previously credible values, while the likelihood pulls it toward values that explain the observed data well. As more data accumulate, the likelihood dominates and the posterior concentrates near the true value regardless of the prior — unless the prior assigns it zero probability.
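
A minimal conjugate example makes the three roles concrete: with a Beta prior on a success probability and binomial data, the posterior is again a Beta distribution, so the update can be written in two lines. The prior parameters and the observed counts below are illustrative.

```python
# Beta-binomial update: prior Beta(a, b), data = k successes in n trials,
# posterior = Beta(a + k, b + n - k). Chosen because the posterior has a closed
# form, so the three roles are easy to see. The numbers are illustrative.

a, b = 2.0, 2.0          # prior: mildly favours values near 0.5
k, n = 17, 20            # data: 17 successes in 20 trials

a_post, b_post = a + k, b + (n - k)

print("prior mean      :", a / (a + b))                 # 0.5
print("likelihood peak :", k / n)                       # 0.85
print("posterior mean  :", a_post / (a_post + b_post))  # ~0.79, pulled from prior towards data
```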

15.5.3 Exercise 3

What is the difference between having probabilistic predictions and having well-calibrated probabilistic predictions?

A model can output probabilities without those probabilities being reliable. For example, a classifier might output confidence values that consistently overstate certainty: it says 0.95 for many events, but only 60% of them actually occur. The model is probabilistic in form but poorly calibrated in practice.

Well-calibrated probabilistic predictions satisfy the following: among all events to which the model assigns probability p, approximately the fraction p of them should actually occur. A weather forecast is well calibrated if, on all the days it says “70% chance of rain”, roughly 70% of those days actually see rain.

The distinction matters for decision-making. A decision-maker who trusts the output of a poorly calibrated model will over-invest in actions triggered by high-confidence predictions that do not deserve that confidence. Good calibration means the probability numbers can be used directly as inputs to expected-value calculations and risk assessments, not merely as rankings.
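
A minimal reliability-check sketch: bin the predicted probabilities and compare each bin's average prediction with the observed frequency. The overconfident model here is simulated under assumed parameters purely to show what miscalibration looks like.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
true_p = rng.uniform(0.05, 0.95, n)               # the real event probabilities
outcomes = rng.random(n) < true_p                 # events occur at rate true_p

# A deliberately overconfident "model": pushes probabilities towards 0 or 1.
predicted = np.clip(0.5 + 1.8 * (true_p - 0.5), 0.01, 0.99)

bins = np.linspace(0, 1, 11)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (predicted >= lo) & (predicted < hi)
    if mask.any():
        print(f"predicted in [{lo:.1f}, {hi:.1f}): "
              f"mean prediction {predicted[mask].mean():.2f}, "
              f"observed frequency {outcomes[mask].mean():.2f}")
# A well-calibrated model would show the two columns matching; here the high
# predictions systematically overstate the observed frequency.
```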

15.5.4 Exercise 4

Describe a source of uncertainty in a real-world forecasting problem that would remain even with more data.

In wildfire spread modelling, aleatoric uncertainty arises from wind gusts and micro-topographic effects. Wind speed at a given location varies on timescales of seconds to minutes due to turbulent eddies, and the spatial resolution of any practical wind measurement network cannot resolve all relevant variation. Even with enormous amounts of historical data, this fine-scale variability cannot be predicted precisely because it reflects the inherently chaotic nature of turbulent fluid flow.

More data improve the statistical description of wind distributions and reduce uncertainty about mean conditions. But the specific sequence of gusts during any given fire event remains fundamentally uncertain. This is aleatoric uncertainty: it is intrinsic to the system, not a consequence of insufficient observation.

The practical implication is that wildfire forecasts should always communicate distributional spread — the range of plausible fire perimeters, not just a single expected boundary — because a key driver of the outcome cannot be reduced to a deterministic prediction regardless of the quantity of training data available.


15.6 Chapter 6: Signals, Sequences, and State Estimation

15.6.1 Exercise 1

Give an example where random shuffling of observations would destroy important structure.

Consider a daily river-flow time series used to train a model that predicts tomorrow’s flow from the past week’s flow. The autocorrelation structure — the fact that today’s flow is closely related to yesterday’s — is the very information the model must learn. If the observations are randomly shuffled before the training set is formed, this lagged relationship is destroyed. A sequence of days that was, in reality, a gradual flood recession becomes a random mixture of high-flow and low-flow readings with no temporal coherence.

A model trained on the shuffled data would learn nothing useful about carry-over storage or routing lag. Worse, the shuffled validation set would also be incoherent, so training error might appear low even though the model has learned no real structure. The ordering is not incidental to a time series; it is the data.
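
A minimal sketch on a simulated persistent series (an AR(1) process with an assumed coefficient of 0.9, standing in for river flow) shows the lag-1 autocorrelation before and after shuffling.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
flow = np.zeros(n)
for t in range(1, n):
    flow[t] = 0.9 * flow[t - 1] + rng.normal()    # persistent, flood-recession-like dynamics

def lag1_autocorr(x):
    x = x - x.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

shuffled = rng.permutation(flow)
print("lag-1 autocorrelation, original :", round(lag1_autocorr(flow), 3))      # ~0.9
print("lag-1 autocorrelation, shuffled :", round(lag1_autocorr(shuffled), 3))  # ~0.0
# The shuffled series contains exactly the same values but none of the
# temporal structure a forecaster needs to learn.
```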

15.6.2 Exercise 2

Explain the difference between forecasting a future quantity and estimating a current hidden state.

Forecasting is directed forward in time: given the present and past, what will happen at a future moment? The task produces a prediction for a quantity that has not yet been observed. The accuracy of the forecast can be evaluated retrospectively once the future arrives.

Estimating a hidden state is directed inward: given a history of noisy or partial observations, what is the best estimate of the current (or past) underlying condition of the system? The state may never be directly observed — it must be inferred from the observations that it produces. A classic example is estimating a vehicle’s true position from noisy GPS readings: the true position is the hidden state, and the GPS values are imperfect observations of it.

The two tasks can occur together. A flood-forecasting system might first estimate the current hidden catchment moisture state from recent rainfall and streamflow measurements (filtering), and then use that state estimate as the starting point for a 48-hour ahead prediction (forecasting). Filtering cleans up the present; forecasting extrapolates from it.
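
A minimal scalar sketch of the two tasks in sequence: a one-dimensional Kalman filter estimates the current hidden state from noisy observations, and the final state estimate is then propagated forward as a forecast. All dynamics and noise parameters are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a, q, r = 0.95, 0.1, 1.0           # assumed state persistence, process and observation noise
T = 100

# Simulate a hidden state and noisy observations of it.
x = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.normal(0, np.sqrt(q))
z = x + rng.normal(0, np.sqrt(r), T)

# Filtering: estimate the current hidden state from the observations so far.
m, P = 0.0, 1.0                    # state mean and variance
for t in range(T):
    m, P = a * m, a * a * P + q                # predict one step forward
    K = P / (P + r)                            # Kalman gain
    m, P = m + K * (z[t] - m), (1 - K) * P     # update with the new observation

print("true current state:", round(x[-1], 3), " filtered estimate:", round(m, 3))

# Forecasting: extrapolate the filtered state 5 steps ahead (no new data arrives).
m_fc, P_fc = m, P
for _ in range(5):
    m_fc, P_fc = a * m_fc, a * a * P_fc + q
print("5-step-ahead forecast mean:", round(m_fc, 3), " forecast variance:", round(P_fc, 3))
```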

15.6.3 Exercise 3

Why is a latent variable not the same thing as a fictitious one?

A latent variable is unobserved, but it plays a causal or structural role in generating the observations. It represents real state or structure that exists in the world but cannot be directly measured. A room’s true thermal state is latent when we measure only a single thermostat: the spatial temperature distribution is real, it affects energy use, occupant comfort, and control responses, but we do not have full access to it.

A fictitious variable would be one invented for mathematical convenience without any correspondence to real structure — introduced purely as an artefact and not representing anything in the world. The distinction matters because latent variables justify a particular model structure: if the variable is real, then inferring it gives us genuine information about the system. If it were merely fictional, the inferred values would be uninterpretable.

In practice, the boundary between latent and fictitious can require scientific judgment. But the point is that calling a variable latent carries a commitment: we believe it represents something the world actually contains, even if our instruments cannot directly access it.

15.6.4 Exercise 4

Describe a real system in which memory of earlier inputs affects present output.

A room heated by a radiator provides a clear example. If the radiator is turned on at full power for six hours during the morning, the walls, floor, and furniture absorb heat. Even after the radiator is turned off, the room remains warm for several hours because the thermal mass releases stored energy slowly. A model that predicts room temperature only from the current radiator setting and current outdoor temperature would miss this memory effect.

The minimal state summarising relevant memory might include the average temperature of the building fabric — a measure of how much thermal energy is currently stored. This hidden state is not directly measured by a single air-temperature sensor, but it determines how the room responds to future heating inputs. A state-space model captures this naturally: the state transition equation describes how the thermal mass absorbs and releases heat, and the observation equation connects the measured air temperature to that hidden state.
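
A minimal simulation sketch of this memory effect, with a hidden fabric-temperature state that stores heat; the coefficients are assumed for illustration, not calibrated to any real building.

```python
import numpy as np

# Hidden state: the temperature of the building fabric (walls, floor, furniture).
# Observation: the air temperature, which exchanges heat with the fabric and
# loses heat to outdoors. All coefficients are illustrative, not calibrated.

radiator = np.array([1.0] * 6 + [0.0] * 18)    # full power for the first 6 hours, then off
outdoor = 5.0
fabric, air = 15.0, 15.0

for h in range(24):
    fabric += 0.05 * (air - fabric) + 0.8 * radiator[h]      # fabric stores heat
    air += 0.3 * (fabric - air) + 0.05 * (outdoor - air)     # air follows fabric, leaks outdoors
    if h in (5, 11, 17, 23):
        print(f"hour {h + 1:2d}: radiator={radiator[h]:.0f}  air={air:5.2f}  fabric={fabric:5.2f}")
# Long after the radiator switches off, the air stays warm -- and at first even
# keeps warming -- because the fabric releases the heat it stored in the morning.
# A model conditioned only on the current radiator setting would miss this.
```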


15.7 Chapter 7: Information Theory and Learning Objectives

15.7.1 Exercise 1

Why does an unlikely observed event carry more information than a very likely one?

Information theory formalises a straightforward intuition. An event that was almost certain to occur tells us very little once it does: we expected it, so nothing about our model of the world needs updating. An event that was improbable forces a larger revision.

Mathematically, the information content of an event with probability p is defined as -\log p. As p \to 1, -\log p \to 0: certain events carry zero information. As p \to 0, -\log p \to \infty: very improbable events carry very large information. The logarithm is chosen because it gives information the additivity property: two independent events together carry as much information as the sum of their individual information values.

For a learning system, this means that being surprised is epistemically meaningful. A model that is frequently surprised by what actually happens is a model that has failed to capture durable structure. The cross-entropy loss penalises this failure in a principled way that is directly tied to this information-theoretic account of surprise.

15.7.2 Exercise 2

Explain in words why cross-entropy is a natural loss for probabilistic classification.

When a classifier assigns probabilities to each possible class, we want to reward it for placing high probability on the class that actually occurs. Cross-entropy does exactly this. For a single example where the true class is k, the cross-entropy loss is -\log p_k, where p_k is the probability the model assigned to class k.

This expression is large when p_k is small — when the model judged the true class unlikely — and small when p_k is large — when the model was confident and correct. The logarithm ensures that the penalty grows sharply as p_k approaches zero: being nearly certain of the wrong answer is catastrophically bad, not merely wrong by a constant amount.

Cross-entropy can also be derived from the principle of maximum likelihood. If the model’s class probabilities are treated as the parameters of a categorical distribution, maximising the likelihood of the observed labels across a training set is mathematically equivalent to minimising the average cross-entropy. This gives cross-entropy a grounding not only in intuitive penalty design but in formal probabilistic inference.
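
A minimal sketch of the per-example loss -\log p_k and its equivalence to an average negative log-likelihood, on an assumed batch of three examples.

```python
import numpy as np

# Predicted class probabilities for 3 examples over 3 classes (rows sum to 1).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])                           # the true class of each example

per_example = -np.log(probs[np.arange(3), labels])     # -log p_k for the true class k
print("per-example cross-entropy:", np.round(per_example, 3))
print("mean cross-entropy       :", round(per_example.mean(), 3))

# The same number viewed as maximum likelihood: the average negative
# log-likelihood of the observed labels under the model's probabilities.
nll = -np.sum(np.log(probs[np.arange(3), labels])) / 3
print("negative log-likelihood/n:", round(nll, 3))
```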

15.7.3 Exercise 3

Why is a confident wrong prediction penalised heavily by log loss?

The log loss for a true event assigned probability p is -\log p. This function is steeply curved near zero. When p = 0.9 (the model is confident and mostly correct), the loss is about 0.105. When p = 0.1 (the model is confidently wrong), the loss is about 2.303 — more than twenty times larger.

The heavy penalty for confidence is intentional and principled. A model that says “I am 90% sure this is class A” and is wrong has not merely made an error: it has made a falsely confident error. It was given the opportunity to hedge and chose not to. From the perspective of decision-making, a confident wrong prediction can be far more damaging than a hesitant wrong prediction, because downstream users will treat the confident output as reliable.

Log loss therefore rewards honest uncertainty. A model that says “60% class A, 40% class B” when the evidence is mixed is penalised less than one that says “99% class A” and is wrong. The mathematics enforces epistemic humility in a way that simpler losses like accuracy or hinge loss do not.

15.7.4 Exercise 4

Give an example of how prediction and compression might be related in practice.

A language model predicts the probability of each next word given previous context. At each step, the model assigns probabilities to the entire vocabulary. An efficient lossless compression scheme for text would assign shorter binary codes to common words in common contexts and longer codes to rare words or surprising continuations. The average code length per word is minimised by using -\log_2 p bits to encode a word that the model assigns probability p.

A model that predicts well — one that assigns high probability to what actually occurs — therefore enables highly efficient compression. Conversely, a good compression scheme implies a good predictive model. The two are faces of the same mathematics: efficient description of data and accurate prediction of what comes next are, up to the choice of coding scheme, the same problem.
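
A minimal arithmetic sketch of the link: the average ideal code length, in bits per symbol, under a good and a poor predictive model of the same assumed three-symbol sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.7, 0.2, 0.1])                 # symbol 0 is most common
sequence = rng.choice(3, size=10_000, p=true_probs)    # a toy "text" over 3 symbols

def bits_per_symbol(model_probs, seq):
    # Ideal code length: -log2 p(symbol) bits for each observed symbol.
    return np.mean(-np.log2(model_probs[seq]))

good_model = np.array([0.7, 0.2, 0.1])     # matches the data-generating distribution
poor_model = np.array([1/3, 1/3, 1/3])     # predicts nothing about relative frequency

print("good predictive model:", round(bits_per_symbol(good_model, sequence), 3), "bits/symbol")
print("poor predictive model:", round(bits_per_symbol(poor_model, sequence), 3), "bits/symbol")
# The better predictor yields the shorter average description of the same data.
```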

This connection is practically relevant. Large language models are sometimes evaluated using compression metrics, and modern general-purpose compressors incorporate learned statistical models that are essentially predictive models over sequences of bytes. The information-theoretic bridge between prediction and compression is not a metaphor but a mathematical identity.


15.8 Chapter 8: Neural Networks and Deep Learning Mathematics

15.8.1 Exercise 1

Why does a stack of affine maps without nonlinear activations fail to produce genuinely deep expressive behaviour?

An affine map has the form x \mapsto Wx + b. If two such maps are composed:

x \mapsto W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2).

The result is still an affine map with weight matrix W_2 W_1 and bias W_2 b_1 + b_2. By induction, any stack of affine layers collapses into a single affine map. No matter how many layers are added, the composed transformation remains linear-plus-translation.

A single affine layer can represent any such transformation. Adding more layers without nonlinearities does not expand the function class — it only adds parameters that are redundant. The network may be wider but it is not more expressive.

Nonlinear activations break this collapse. A function such as ReLU (\max(0, z)) cannot be expressed as an affine map. When an affine layer is followed by a nonlinearity, the composition is genuinely nonlinear, and composing multiple such blocks allows the network to represent increasingly complex nonlinear functions. Depth becomes meaningful precisely because the nonlinearities prevent the layers from collapsing into each other.
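
A small numerical check on assumed random weights: two stacked affine layers match a single collapsed affine layer exactly, while inserting a ReLU between them breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two affine layers without a nonlinearity...
two_layers = W2 @ (W1 @ x + b1) + b2
# ...collapse into a single affine layer with combined weights and bias.
W, b = W2 @ W1, W2 @ b1 + b2
print("collapse holds:", np.allclose(two_layers, W @ x + b))                # True

# With a ReLU between the layers, the collapsed affine map no longer agrees.
relu = lambda z: np.maximum(0.0, z)
with_relu = W2 @ relu(W1 @ x + b1) + b2
print("same as collapsed affine map:", np.allclose(with_relu, W @ x + b))   # generally False
```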

15.8.2 Exercise 2

Explain in words why backpropagation is best understood as an application of the chain rule.

A neural network computes a function by passing input through a sequence of layers. The final output depends on the last layer’s parameters. But the last layer’s input came from the previous layer, which itself depends on its parameters and the layer before it, and so on back to the first layer. Every parameter in every layer influences the final loss indirectly through the chain of computations.

The chain rule says exactly how to handle this. If L depends on z and z in turn depends on \theta, then \partial L / \partial \theta = (\partial L / \partial z)(\partial z / \partial \theta). For a deep network, this telescopes: the gradient of the loss with respect to an early-layer parameter is a product of Jacobians, one for each layer between that parameter and the output.

Backpropagation is simply the efficient implementation of this calculation. It computes the gradient from the output backward, reusing intermediate values at each layer rather than recomputing them. The mathematics is identical to applying the chain rule on a computational graph; backpropagation is the bookkeeping strategy that makes the chain rule tractable at scale.
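
A minimal sketch of the chain rule applied by hand to a two-parameter scalar network, checked against a finite-difference approximation; the architecture and numbers are illustrative.

```python
import numpy as np

# Tiny network: z = w1 * x, a = tanh(z), yhat = w2 * a, loss L = (y - yhat)^2.
x, y = 1.5, 2.0
w1, w2 = 0.8, -0.3

# Forward pass, keeping the intermediate values the backward pass will reuse.
z = w1 * x
a = np.tanh(z)
yhat = w2 * a
L = (y - yhat) ** 2

# Backward pass: apply the chain rule one factor at a time, from the loss inward.
dL_dyhat = -2 * (y - yhat)
dL_dw2 = dL_dyhat * a                      # dL/dw2 = dL/dyhat * dyhat/dw2
dL_da = dL_dyhat * w2
dL_dz = dL_da * (1 - np.tanh(z) ** 2)      # derivative of tanh
dL_dw1 = dL_dz * x                         # dL/dw1 = dL/dz * dz/dw1

# Finite-difference check of dL/dw1.
eps = 1e-6
L_plus = (y - w2 * np.tanh((w1 + eps) * x)) ** 2
print("chain rule:", round(dL_dw1, 6), " finite difference:", round((L_plus - L) / eps, 6))
```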

15.8.3 Exercise 3

Give an example of a structural assumption encoded by a convolutional or sequence model architecture.

A convolutional network applied to images encodes the assumption that useful patterns are localised and translation-invariant. A filter that detects a diagonal edge in the top-left corner of an image is applied at every spatial position. This means the network assumes that what constitutes an “edge” is the same regardless of where in the image it appears — a strong structural claim that is well justified for many natural image tasks.

The assumption is baked into the weight sharing: the same filter weights are used at every location, so the number of parameters does not grow with the size of the image. This is both a computational advantage and a regularisation: the network cannot learn that features in the top-left are structurally different from those in the bottom-right. If the task genuinely has this invariance property, the assumption is helpful and the inductive bias accelerates learning. If the task does not — for example, if the position in the image is semantically significant — then the convolutional structure imposes an incorrect prior.
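
A minimal one-dimensional sketch of the weight sharing: the same three-tap edge filter is slid across a signal, so a step is detected wherever it occurs; the filter values are assumed for illustration.

```python
import numpy as np

signal = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0], dtype=float)   # a step up, then a step down
edge_filter = np.array([-1.0, 0.0, 1.0])                      # one shared set of weights

# Slide the *same* filter across every position (cross-correlation, as
# convolutional layers actually compute).
response = np.array([signal[i:i + 3] @ edge_filter for i in range(len(signal) - 2)])
print(response)
# Strong positive response where the signal steps up and strong negative
# response where it steps down -- both detected with the same three parameters,
# regardless of where in the signal they occur.
```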

15.8.4 Exercise 4

Why is it misleading to judge a neural network only by whether it can fit the training data well?

A sufficiently large network can interpolate through any finite training set. Given enough parameters and sufficient training, the network can achieve near-zero training loss for virtually any labelled dataset, including one with random labels. This means low training error is not evidence that the model has learned anything useful. It is evidence only that the model had enough capacity and was optimised long enough.

The relevant question is always generalisation: does the model perform well on data it has not seen? Fitting the training data is a necessary condition for a good model — a model that cannot even fit what it was given has clearly failed — but it is nowhere near sufficient. A network that has merely memorised its training labels has no guarantee of generalising.

Moreover, a network might fit training data well by exploiting spurious correlations that exist in the training set but not in deployment. For example, if all “cat” images in the training set happen to be taken indoors, the network may learn to associate indoor backgrounds with the cat label. Training accuracy can be high while the model is learning something fragile and misleading. This is why evaluation on a genuinely held-out test set, and ideally on data from different sources or time periods, remains the honest judge of whether a neural network has learned durable structure.