7  Linear Algebra for Representations

Linear algebra becomes newly vivid in data science because it gives us a language for representation. Data are gathered into vectors, collections of examples are arranged into matrices, correlations become geometric alignments, and useful structure often appears as a lower-dimensional shape hidden inside a larger space.

This chapter makes that geometry central. The point is not to rehearse matrix manipulation for its own sake. The point is to see how learning systems use spaces, projections, factorisations, eigendirections, and embeddings to organise information.

Note: In this chapter
  • Treat a dataset as a matrix and a model as a subspace or mapping.
  • Read least squares as projection (approximation in a chosen representation).
  • Use PCA/SVD as a disciplined way to compress without erasing structure.
  • Interpret embeddings as geometric representations you can measure and compare.

7.1 Guiding examples

  • Dimensionality reduction that makes clusters visible
  • Low-rank structure as “signal + redundancy” rather than magic compression
Tip: Running example: Alberta wildfire smoke and station PM2.5

This chapter’s job in the smoke story is to handle many correlated inputs without drowning in them:

  • Treat multivariate weather as a matrix and use PCA/SVD to find “regime” directions.
  • Build a lower-dimensional representation you can plot and reason about.
  • Use projection geometry to understand what your model can and cannot express.

7.2 Prerequisite anchors

The strongest backward links are:

  • vol-04/03-systems-matrices.qmd
  • vol-07/linear-algebra/01-matrices-systems.qmd
  • vol-07/linear-algebra/02-eigenvalues.qmd
  • vol-07/numerics/02-numerical-linear-algebra.qmd

7.3 Data matrices and feature spaces

When a dataset has $n$ examples and $p$ measured features, it is natural to write it as a matrix $X \in \mathbb{R}^{n \times p}$. Each row is a case. Each column is a feature. This is already more than bookkeeping. It invites us to ask geometric questions:

  • Which columns are nearly redundant?
  • Which directions in feature space carry most of the variation?
  • Which combinations of variables matter more than the variables alone?

This point of view is one of the clearest bridges between classical linear algebra and modern machine learning.
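
As a concrete illustration, here is a minimal NumPy sketch. The data are synthetic and every variable name is an assumption made for illustration; the point is only to show how the first two questions above become computations on the matrix.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data matrix: n = 200 cases, p = 4 features, where the fourth
# column is nearly a copy of the first (deliberate redundancy).
n = 200
X3 = rng.standard_normal((n, 3))
X = np.column_stack([X3, X3[:, 0] + 0.05 * rng.standard_normal(n)])

# Which columns are nearly redundant?  Pairwise correlations between features.
print(np.round(np.corrcoef(X, rowvar=False), 2))

# Which directions carry most of the variation?  Singular values of the
# centred matrix measure spread along orthogonal directions in feature space.
Xc = X - X.mean(axis=0)
print(np.round(np.linalg.svd(Xc, compute_uv=False), 2))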

7.4 Vectors as representations

A vector does not have to mean a physical arrow in ordinary space. In data science it often means an organised list of measurements or attributes. A person can be represented by age, income, and distance from work. A sound clip can be represented by frequencies and amplitudes. A document can be represented by word counts or learned semantic coordinates.

Once an object is represented by a vector, geometry becomes available. We can measure length, angle, distance, similarity, and direction. These are not merely visual metaphors. They are operational ways of comparing cases and organising information.
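
A small example makes this concrete. The numbers below are invented purely for illustration: three people encoded as vectors of (age, income, distance to work). Once the encoding is fixed, distance and angle are ordinary computations.

import numpy as np

# Hypothetical people as vectors: (age in years, income in $1000s, km to work)
alice = np.array([34.0, 72.0, 12.0])
bob   = np.array([36.0, 70.0, 10.0])
carol = np.array([59.0, 31.0, 45.0])

def euclidean(u, v):
    return float(np.linalg.norm(u - v))

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Distances and angles compare cases; note that the answers depend on the
# units chosen for each coordinate (a point taken up again in Section 7.11).
print(euclidean(alice, bob), euclidean(alice, carol))
print(cosine_similarity(alice, bob), cosine_similarity(alice, carol))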

7.5 Projection and least squares

One of the deepest ideas in applied mathematics is projection. If we cannot hit the target exactly, we may instead look for the closest point inside a smaller allowable space.

Least squares is the canonical example. In regression, we often cannot match all observed outputs exactly, so we project the target data onto the space spanned by the chosen predictors. The residual is what remains outside that space.

This is why least squares is not just an algebraic trick. It is a geometric act of approximation. The model says, in effect, “among all outputs that can be expressed using this representation, choose the closest one.”
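
The following minimal sketch, using synthetic data and plain NumPy, makes the geometric claim checkable: the fitted values are the projection of the target onto the column space of the predictors, so the residual is orthogonal to every predictor column.

import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression problem: the target cannot be matched exactly.
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])  # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, -0.5]) + rng.standard_normal(n)     # noisy target

# Least squares coefficients and fitted values
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta          # projection of y onto the column space of X
residual = y - y_hat      # the part of y outside that space

# The residual is orthogonal to every column of X (up to rounding error)
print(np.round(X.T @ residual, 10))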

Figure 7.1

7.6 Basis, coordinates, and change of viewpoint

A basis provides coordinates for a space. Different bases describe the same space in different ways, and some descriptions reveal structure better than others.

This matters in machine learning because a good representation can make a hard problem easier. A raw set of measurements may look tangled, while a transformed coordinate system reveals a simpler pattern. This is one reason feature engineering, orthogonalisation, and learned representations are so powerful.

The world does not arrive labelled with its best coordinates. Part of modelling is choosing or learning them.
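
A small sketch of this idea, using synthetic points near a tilted line: in the raw coordinates the two features look entangled, while a rotated basis puts almost all of the variation into a single coordinate.

import numpy as np

rng = np.random.default_rng(2)

# Points lying near a line tilted at 30 degrees in the plane
t = rng.standard_normal(200)
theta = np.deg2rad(30)
X = np.column_stack([t * np.cos(theta), t * np.sin(theta)])
X = X + 0.05 * rng.standard_normal((200, 2))

# In the raw basis the two coordinates are strongly correlated
print(np.round(np.corrcoef(X, rowvar=False), 2))

# Change of basis: rotate by -30 degrees so the line aligns with the first axis
R = np.array([[np.cos(-theta), -np.sin(-theta)],
              [np.sin(-theta),  np.cos(-theta)]])
X_new = X @ R.T

# In the new coordinates almost all of the spread sits in the first coordinate
print(np.round(X_new.std(axis=0), 3))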

7.7 PCA and low-rank structure

Many datasets live near a lower-dimensional structure even when they are recorded in a high-dimensional space. Principal component analysis makes this idea concrete by finding directions of maximal variation and using them to build a lower-dimensional summary.

The key questions are:

  • which directions carry the most variation?
  • how many directions are worth keeping?
  • what is lost when we compress the representation?

PCA matters not because every dataset should be reduced, but because it teaches a general lesson: representations can be changed, compressed, and reoriented in ways that make structure easier to see.
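
As a sketch of the "how many directions are worth keeping" question, the following synthetic example hides a 2-dimensional plane inside 6 recorded dimensions and reads the explained variance off the singular values of the centred data matrix.

import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: 300 points near a 2-dimensional plane inside 6 dimensions
B = rng.standard_normal((6, 2))        # hidden basis for the plane
Z = rng.standard_normal((300, 2))      # latent coordinates
X = Z @ B.T + 0.1 * rng.standard_normal((300, 6))

# Principal directions come from the SVD of the centred matrix
Xc = X - X.mean(axis=0)
S = np.linalg.svd(Xc, compute_uv=False)
explained = S**2 / np.sum(S**2)

# The first two components should dominate; the rest is mostly noise
print(np.round(explained, 3))
print("cumulative:", np.round(np.cumsum(explained), 3))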

Figure 7.2

7.8 Eigenvectors, modes, and structure

Eigenvectors enter naturally when a transformation stretches some directions more than others while preserving those directions themselves. In data analysis this appears in covariance structure, principal axes, dynamical modes, and repeated patterns.

The older linear-algebra idea of an eigendirection becomes, in applied context, a statement about stable or dominant modes of behaviour. That is why eigenvalues and eigenvectors continue to matter in recommender systems, signal processing, network analysis, and dimensionality reduction.
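
The link between covariance structure and principal axes can be checked directly. In the sketch below, on synthetic data, the leading eigenvector of the sample covariance matrix agrees, up to sign, with the first right singular vector of the centred data matrix.

import numpy as np

rng = np.random.default_rng(4)

# Correlated 3-dimensional data
X = rng.multivariate_normal(mean=np.zeros(3),
                            cov=[[3.0, 1.0, 0.5],
                                 [1.0, 2.0, 0.3],
                                 [0.5, 0.3, 1.0]],
                            size=500)
Xc = X - X.mean(axis=0)

# Eigen-decomposition of the sample covariance matrix (ascending eigenvalues)
C = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)

# The dominant eigenvector matches the first right singular vector of Xc
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.round(eigvals[::-1], 3))                    # eigenvalues, largest first
print(np.round(np.abs(eigvecs[:, -1] @ Vt[0]), 4))   # ~1: same direction up to sign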

7.9 Factorisation as hidden structure

A matrix factorisation rewrites a complicated data matrix as a product of simpler pieces. This is valuable because it can expose latent structure.

If a user-item ratings matrix can be approximated by lower-rank factors, then the factors may encode hidden taste dimensions. If a document-term matrix can be factorised, the factors may reveal thematic structure. If a sensor matrix can be compressed, the factors may identify a smaller set of governing patterns.

Factorisation is therefore not only a computational device. It is a structural hypothesis about the world.
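
A minimal sketch of this hypothesis uses a synthetic ratings-style matrix generated from two hidden "taste" dimensions: a rank-2 truncated SVD reproduces the matrix almost exactly, which is what latent low-rank structure looks like numerically.

import numpy as np

rng = np.random.default_rng(5)

# Synthetic "ratings": 20 users x 12 items driven by 2 hidden taste dimensions
user_tastes = rng.standard_normal((20, 2))
item_traits = rng.standard_normal((12, 2))
R = user_tastes @ item_traits.T + 0.1 * rng.standard_normal((20, 12))

# Truncated SVD: keep only the two strongest factors
U, S, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]

# A rank-2 product reproduces the matrix almost exactly, so the data are
# well described by a small number of latent dimensions
print("relative error:", np.linalg.norm(R - R_hat) / np.linalg.norm(R))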

Figure 7.3

7.10 Embeddings and latent representations

In contemporary AI, the word embedding often refers to a learned vector representation that places similar objects near one another in a latent space. Words, images, users, products, and documents can all be mapped into such spaces.

The terminology may sound modern, but the mathematical instinct is old. We are still building coordinates for meaningful structure. The difference is that the coordinates are now often learned rather than hand-designed.

This is one reason linear algebra remains central even in systems built from deep neural networks. However complex the learning process becomes, useful internal representations are still organised in spaces whose geometry matters.
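
A toy illustration follows. The vectors below are written by hand, not learned, and serve only to show the geometry: once objects live in a shared latent space, similarity becomes a cosine between coordinates.

import numpy as np

# Hand-made "embeddings" purely for illustration; real embeddings are learned
# by a model and typically have hundreds of dimensions.
embeddings = {
    "wildfire": np.array([0.9, 0.1, 0.3]),
    "smoke":    np.array([0.8, 0.2, 0.4]),
    "rainfall": np.array([0.1, 0.9, 0.2]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Nearby vectors stand for related objects in this latent space
print(cosine(embeddings["wildfire"], embeddings["smoke"]))     # high
print(cosine(embeddings["wildfire"], embeddings["rainfall"]))  # lower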

Figure 7.4

7.11 Representation is a modelling choice

There is no neutral representation. The choice of variables, scaling, basis, and transformation influences what patterns become visible and what relationships are easy to learn.

Two models trained on the same raw world can behave very differently if one uses a poor representation and the other uses one aligned with the task. This is why representation learning is not a decorative add-on. It is one of the central problems of AI.
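
One concrete, if simplified, way to see this: rescaling a single feature, say recording a distance in metres rather than kilometres, changes which direction PCA reports as dominant, even though the underlying world is identical. The sketch below uses synthetic data chosen only to make the contrast visible.

import numpy as np

rng = np.random.default_rng(6)

# Two features: a distance and an unrelated signal with larger spread
distance_km = rng.standard_normal(200)
other = 2.0 * rng.standard_normal(200)
X_km = np.column_stack([distance_km, other])          # distance in kilometres
X_m  = np.column_stack([distance_km * 1000, other])   # same world, metres instead

def leading_direction(X):
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return np.round(Vt[0], 3)

# In kilometres the "other" feature dominates the leading direction;
# in metres the distance column swamps everything else.
print(leading_direction(X_km))
print(leading_direction(X_m))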

7.12 Interactive: PCA reconstruction

Tip: What to try

Increase N_COMPONENTS from 1 to 5 and watch the reconstruction error drop. Set NOISE_LEVEL to 2.0 and notice how the reconstruction quality degrades at low rank. At what number of components does the reconstructed cloud start to match the original shape convincingly?

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import io, base64

# --- Try changing these parameters ---
N_COMPONENTS = 2    # number of PCs to keep (1–5)
NOISE_LEVEL  = 0.6  # noise added to the synthetic data

rng = np.random.default_rng(42)

# Synthetic dataset: 150 points near a 2D structure embedded in 5D
n = 150
true_dim = 2
A = rng.standard_normal((5, true_dim))   # hidden 5D basis
t = rng.standard_normal((n, true_dim))   # latent coordinates
X_clean = t @ A.T                        # noiseless 5D points
X = X_clean + rng.standard_normal((n, 5)) * NOISE_LEVEL

# Centre columns (subtract the mean of each feature)
X_mean = X.mean(axis=0)
Xc = X - X_mean

# PCA via SVD
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
n_keep = max(1, min(N_COMPONENTS, 5))

# Encode then decode
Z = Xc @ Vt[:n_keep].T            # n x n_keep
X_recon = Z @ Vt[:n_keep] + X_mean  # reconstruct in original space

# Reconstruction error (mean squared)
error = np.mean((X - X_recon)**2)

# Variance explained
var_explained = (S[:n_keep]**2).sum() / (S**2).sum() * 100

# --- Plot: first two original dims vs reconstructed ---
fig, axes = plt.subplots(1, 2, figsize=(9, 4))

ax = axes[0]
ax.scatter(Xc[:, 0], Xc[:, 1], s=15, color='#4e8ac4', alpha=0.7)
ax.set_title('Original data (dims 1 & 2)', fontsize=10)
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')

ax2 = axes[1]
Xr_c = X_recon - X_mean
ax2.scatter(Xr_c[:, 0], Xr_c[:, 1], s=15, color='#c44e4e', alpha=0.7)
ax2.set_title(
    f'Reconstruction ({n_keep} PC{"s" if n_keep>1 else ""})\n'
    f'Var explained: {var_explained:.1f}%   MSE: {error:.3f}',
    fontsize=10
)
ax2.set_xlabel('$x_1$'); ax2.set_ylabel('$x_2$')

fig.tight_layout(pad=2.0)

buf = io.BytesIO()
fig.savefig(buf, format='png', dpi=96, bbox_inches='tight')
buf.seek(0)
img_b64 = base64.b64encode(buf.read()).decode()
print(f'<img src="data:image/png;base64,{img_b64}" style="max-width:100%">')

7.13 Looking ahead

The next chapter turns from geometry to uncertainty. This is a natural step, because a representation may reveal structure without telling us how confident we should be in the inferences drawn from it. Probability fills that gap.

For now, the main ideas to carry forward are:

  • data can be organised as vectors and matrices
  • projection gives a geometric account of approximation
  • low-rank structure explains why compression and denoising can work
  • embeddings are learned coordinate systems for meaningful similarity
  • representation is not passive description but active modelling

7.14 Exercises and prompts

  1. Give an example of a real object or system that could be represented by a vector. What would the coordinates mean?
  2. Explain in words why least squares can be described as a projection problem.
  3. Why might a lower-dimensional representation preserve useful structure even when some information is discarded?
  4. Describe a setting where the choice of representation could change the quality of a learning system even if the optimisation method stayed the same.