12 Mini-Projects and Pathways
The most reliable way to make heavy topics feel less heavy is to give them a home. A mini-project provides that home: one dataset, one set of decisions, one evaluation story, and one written explanation that can be revised as your understanding grows.
This chapter offers two things:
- A set of mini-project prompts tied to the book’s chapters.
- A “degree-shaped” pathway that reflects how programs like MRU’s Data Science major and engineering data-analytics courses tend to braid mathematics, data handling, and communication.
12.1 The degree-shaped pathway (conceptual, not administrative)
Many programs move through a pattern like:
- Year 1: calculus, linear algebra, statistics, and programming basics
- Year 2: databases, data acquisition/processing, and modelling foundations
- Year 3: visualization, regression/time series, and domain context
- Year 4: machine learning + capstone work (often with workplace or community constraints)
This book is deliberately concentrated on the mathematics that becomes most important once you reach years 2–4.
12.2 Mini-project prompts by chapter
Each prompt is designed to be answerable with “small data” you can obtain ethically. Use synthetic data when real data are unavailable. The goal is not a portfolio piece; the goal is a legible learning loop.
12.2.1 Chapter 1: Modelling language
- Choose a system you care about (water use, attendance, transit delay, energy, inventory, sensor drift).
- Write a one-page modelling brief:
- features, target, hidden variables
- what “error” means in your context
- what constraints must be respected (cost, fairness, safety, compute)
12.2.2 Chapter 2: Learning and generalisation
- Build two predictive rules for the same task:
- one simple baseline
- one more flexible model family
- Compare them using a train/test split and at least one alternative (e.g., cross-validation or a time-aware split).
- Write a short memo explaining which model you would trust and why.
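A minimal sketch of the comparison, using synthetic data and NumPy only. The mean-predictor baseline, the cubic polynomial as the "flexible" family, and all dataset values are illustrative assumptions, not prescriptions from the chapter:

```python
# Compare a mean-predictor baseline against a cubic polynomial fit
# using (a) a single train/test split and (b) 5-fold cross-validation.
# All data here are synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 120)
y = np.sin(x) + rng.normal(0, 0.3, x.size)   # noisy nonlinear target

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# --- single train/test split ---
idx = rng.permutation(x.size)
tr, te = idx[:90], idx[90:]
mse_base_split = mse(y[te], np.full(te.size, y[tr].mean()))   # baseline: predict the mean
coefs = np.polyfit(x[tr], y[tr], 3)                            # flexible: cubic polynomial
mse_poly_split = mse(y[te], np.polyval(coefs, x[te]))
print("split  baseline MSE:", mse_base_split)
print("split  cubic    MSE:", mse_poly_split)

# --- 5-fold cross-validation ---
folds = np.array_split(rng.permutation(x.size), 5)
cv_base, cv_poly = [], []
for k in range(5):
    te_k = folds[k]
    tr_k = np.concatenate([folds[j] for j in range(5) if j != k])
    cv_base.append(mse(y[te_k], np.full(te_k.size, y[tr_k].mean())))
    c = np.polyfit(x[tr_k], y[tr_k], 3)
    cv_poly.append(mse(y[te_k], np.polyval(c, x[te_k])))
print("5-fold baseline MSE:", np.mean(cv_base))
print("5-fold cubic    MSE:", np.mean(cv_poly))
```

The memo-worthy observation is that cross-validation gives you five estimates instead of one, so you can see how much the split itself moves the numbers.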
12.2.3 Chapter 3: Optimisation
- Pick one objective you can write down explicitly (least squares, logistic regression loss, ridge regression).
- Implement gradient descent with two step-size choices:
- one that converges smoothly
- one that oscillates or diverges
- Explain what you learn about “training” as an algorithm, not just as a library call.
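The two step-size regimes can be seen in a few lines. This sketch uses a one-parameter least-squares objective; the particular step sizes (0.1 and 1.2) are illustrative assumptions chosen to produce one smooth run and one divergent run on this data:

```python
# Gradient descent on f(w) = mean((w*x - y)^2) with two step sizes:
# one that converges smoothly, one that diverges. Synthetic data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(0, 0.1, 200)        # true slope is 3

def grad(w):
    # d/dw of mean((w*x - y)^2)
    return float(np.mean(2 * (w * x - y) * x))

def run(step, iters=50):
    w, path = 0.0, []
    for _ in range(iters):
        w -= step * grad(w)
        path.append(w)
    return path

smooth = run(step=0.1)    # small step: settles near 3
wild   = run(step=1.2)    # too-large step: overshoots and diverges
print("smooth final w:", smooth[-1])
print("wild   final w:", wild[-1])
```

Plotting both paths against the iteration count makes the "training as an algorithm" point vivid: the loss surface did not change between runs, only the update rule's step size did.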
12.2.4 Chapter 4: Representations and dimensionality reduction
- Start with a dataset that has at least 10 features (or construct one).
- Use a dimensionality reduction method (PCA is enough) to create a 2D or 3D representation.
- Explain what structure becomes visible in the reduced space and what is lost.
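A minimal PCA sketch via the SVD, on a synthetic 10-feature dataset whose true structure is 2-dimensional by construction (the latent/mixing setup is an illustrative assumption):

```python
# PCA via SVD: 300 points, 10 observed features, 2 latent dimensions.
import numpy as np

rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 2))            # hidden 2D structure
mixing = rng.normal(size=(2, 10))             # spread across 10 features
X = latent @ mixing + rng.normal(0, 0.05, (300, 10))

Xc = X - X.mean(axis=0)                       # centre before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                             # 2D representation
explained = (S ** 2) / np.sum(S ** 2)
print("variance explained by 2 components:", explained[:2].sum())
```

Because the data were built from two latent directions, the first two components capture almost everything; on real data the "what is lost" part of the prompt is where the interesting writing happens.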
12.2.5 Chapter 5: Uncertainty
- Choose a prediction task where uncertainty matters (weather proxy, demand, risk score, sensor measurement).
- Produce:
- a point prediction
- an uncertainty statement (interval, distribution, or calibrated probability)
- Write down at least two reasons your uncertainty could be miscalibrated.
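One concrete way to pair a point prediction with an uncertainty statement is a bootstrap interval. The sensor-reading setup below is an illustrative assumption:

```python
# Point prediction plus a 95% bootstrap interval for a mean,
# using synthetic "sensor" readings.
import numpy as np

rng = np.random.default_rng(3)
readings = 20.0 + rng.normal(0, 1.5, 60)      # e.g. repeated noisy samples

point = readings.mean()                        # point prediction
boots = [rng.choice(readings, readings.size).mean() for _ in range(2000)]
lo, hi = np.percentile(boots, [2.5, 97.5])     # 95% bootstrap interval
print(f"point: {point:.2f}, 95% interval: [{lo:.2f}, {hi:.2f}]")
```

The bootstrap itself illustrates one source of miscalibration for your write-up: it resamples the data you have, so it cannot see biases shared by every reading (drift, a miscalibrated instrument).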
12.2.6 Chapter 6: Sequences and latent state
- Take a time series (or simulate one) with noise and a slowly changing trend.
- Compare:
- a smoothing/filtering approach
- a model-based state estimation approach (Kalman-style, even in 1D)
- Interpret the difference between “noise removal” and “state inference”.
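A minimal version of the comparison, assuming a random-walk state and Gaussian observation noise; the window size, noise variances, and filter gains below are illustrative choices:

```python
# Moving average (smoothing) vs a 1D Kalman-style filter (state inference)
# on a noisy series with a slowly drifting true state. Synthetic data.
import numpy as np

rng = np.random.default_rng(4)
n = 200
state = np.cumsum(rng.normal(0, 0.05, n))      # slowly drifting true state
obs = state + rng.normal(0, 0.5, n)            # noisy observations

# moving average: pure noise removal, no model of the state
window = 10
ma = np.convolve(obs, np.ones(window) / window, mode="same")

# 1D Kalman filter: explicit state estimate with its own uncertainty
q, r = 0.05 ** 2, 0.5 ** 2                     # process and observation variances
x_hat, p = 0.0, 1.0
kf = []
for z in obs:
    p += q                                     # predict: uncertainty grows
    k = p / (p + r)                            # Kalman gain
    x_hat += k * (z - x_hat)                   # update toward the observation
    p *= (1 - k)                               # uncertainty shrinks after update
    kf.append(x_hat)
kf = np.array(kf)

rmse_ma = float(np.sqrt(np.mean((ma - state) ** 2)))
rmse_kf = float(np.sqrt(np.mean((kf - state) ** 2)))
print("moving-average RMSE:", rmse_ma)
print("Kalman filter  RMSE:", rmse_kf)
```

The interpretive point: the moving average only ever averages observations, while the filter carries a belief about the state (`x_hat`, `p`) and weighs each new observation against it.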
12.2.7 Chapter 7: Information and objectives
- For a classification task, compute and interpret:
- accuracy
- log loss / cross-entropy
- Construct a scenario where accuracy looks good but cross-entropy reveals a problem (overconfident wrong predictions are the classic example).
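The classic scenario takes only a few lines to construct. The labels and probabilities below are toy values chosen so that both classifiers have identical accuracy but very different log loss:

```python
# Same accuracy, different cross-entropy: overconfidence is punished.
import numpy as np

y = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])    # true labels

# both make the same class calls (9/10 correct), differing only in confidence
p_calibrated    = np.array([.8, .7, .9, .8, .2, .3, .1, .2, .4, .2])
p_overconfident = np.array([.99, .99, .99, .99, .01, .01, .01, .01, .01, .01])

def accuracy(y, p):
    return float(np.mean((p >= 0.5) == y))

def log_loss(y, p):
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

print("calibrated    acc:", accuracy(y, p_calibrated),
      " log loss:", log_loss(y, p_calibrated))
print("overconfident acc:", accuracy(y, p_overconfident),
      " log loss:", log_loss(y, p_overconfident))
```

One confidently wrong prediction (`.01` on a true positive) dominates the overconfident model's cross-entropy, even though accuracy cannot tell the two models apart.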
12.2.8 Chapter 8: Deep learning as composition
- Train a small neural network on a simple problem (even synthetic):
- XOR-like classification
- a nonlinear regression curve
- Track training and validation behaviour.
- Explain overfitting in terms of representation capacity and optimisation.
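A minimal sketch of the training part, using plain NumPy on the XOR variant of the prompt. The architecture (one hidden layer of 8 tanh units), learning rate, and iteration count are illustrative assumptions:

```python
# A tiny two-layer network trained on XOR with full-batch gradient descent.
import numpy as np

rng = np.random.default_rng(5)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])          # XOR targets

# one hidden layer of 8 tanh units, sigmoid output
W1 = rng.normal(0, 1.0, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1.0, (8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(5000):
    h = np.tanh(X @ W1 + b1)                    # forward pass
    p = sigmoid(h @ W2 + b2)
    # backprop of binary cross-entropy through sigmoid and tanh
    dz2 = (p - y) / len(X)
    dW2 = h.T @ dz2; db2 = dz2.sum(0)
    dh = (dz2 @ W2.T) * (1 - h ** 2)
    dW1 = X.T @ dh;  db1 = dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

pred = sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2)
print("predictions:", pred.ravel().round(2))
```

For the full prompt, hold out a validation set (meaningless for 4-point XOR, essential for the regression-curve variant) and log both losses per step; the gap between the two curves is the overfitting story the chapter asks you to tell.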
12.3 A process-data variant (for labs and experiments)
If you have access to laboratory-style data (a setting closer to engineering courses such as CH E 358), adapt any prompt by adding:
- a measurement story (what sensors measure, what they miss)
- a data-quality story (drift, missingness, outliers)
- a design-of-experiments story (what you can control, what you can randomise)
Doing that turns “machine learning” into “learning under constraints”, which is the more honest and transferable skill.
12.4 Next step: practise on free data
If you want a concrete real-world practice track that combines physical and human geography with long-running time series and event structure, continue to the Further Enquiry Playbook. It proposes an Alberta-first, station-based wildfire smoke project (PM2.5) and several follow-on variants.