12 Mini-Projects and Pathways
The most reliable way to make heavy topics feel less heavy is to give them a home. A mini-project provides that home: one dataset, one set of decisions, one evaluation story, and one written explanation that can be revised as your understanding grows.
This chapter offers two things:
- A set of mini-project prompts tied to the book’s chapters.
- A “degree-shaped” pathway that reflects how programs like MRU’s Data Science major and engineering data-analytics courses tend to braid mathematics, data handling, and communication.
12.1 The degree-shaped pathway (conceptual, not administrative)
Many programs move through a pattern like:
- Year 1: calculus, linear algebra, statistics, and programming basics
- Year 2: databases, data acquisition/processing, and modelling foundations
- Year 3: visualization, regression/time series, and domain context
- Year 4: machine learning + capstone work (often with workplace or community constraints)
This book is deliberately concentrated on the mathematics that becomes most important once you reach years 2–4.
12.2 Mini-project prompts by chapter
Each prompt is designed to be answerable with “small data” you can obtain ethically. Use synthetic data when real data are unavailable. The goal is not a portfolio piece; the goal is a legible learning loop.
12.2.1 Chapter 1: Modelling language
- Choose a system you care about (water use, attendance, transit delay, energy, inventory, sensor drift).
- Write a one-page modelling brief:
- features, target, hidden variables
- what “error” means in your context
- what constraints must be respected (cost, fairness, safety, compute)
12.2.2 Chapter 2: Learning and generalisation
- Build two predictive rules for the same task:
- one simple baseline
- one more flexible model family
- Compare them using a train/test split and at least one alternative (e.g., cross-validation or a time-aware split).
- Write a short memo explaining which model you would trust and why.
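A minimal sketch of the comparison, using synthetic data and NumPy only. The mean-predictor baseline, the cubic polynomial as the "flexible" family, and all dataset values are illustrative assumptions, not prescriptions from the chapter:

```python
# Compare a mean-predictor baseline against a cubic polynomial fit
# using (a) a single train/test split and (b) 5-fold cross-validation.
# All data here are synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 120)
y = np.sin(x) + rng.normal(0, 0.3, x.size)   # noisy nonlinear target

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# --- single train/test split ---
idx = rng.permutation(x.size)
tr, te = idx[:90], idx[90:]
mse_base_split = mse(y[te], np.full(te.size, y[tr].mean()))   # baseline: predict the mean
coefs = np.polyfit(x[tr], y[tr], 3)                            # flexible: cubic polynomial
mse_poly_split = mse(y[te], np.polyval(coefs, x[te]))
print("split  baseline MSE:", mse_base_split)
print("split  cubic    MSE:", mse_poly_split)

# --- 5-fold cross-validation ---
folds = np.array_split(rng.permutation(x.size), 5)
cv_base, cv_poly = [], []
for k in range(5):
    te_k = folds[k]
    tr_k = np.concatenate([folds[j] for j in range(5) if j != k])
    cv_base.append(mse(y[te_k], np.full(te_k.size, y[tr_k].mean())))
    c = np.polyfit(x[tr_k], y[tr_k], 3)
    cv_poly.append(mse(y[te_k], np.polyval(c, x[te_k])))
print("5-fold baseline MSE:", np.mean(cv_base))
print("5-fold cubic    MSE:", np.mean(cv_poly))
```

The memo-worthy observation is that cross-validation gives you five estimates instead of one, so you can see how much the split itself moves the numbers.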
12.2.3 Chapter 3: Optimisation
- Pick one objective you can write down explicitly (least squares, logistic regression loss, ridge regression).
- Implement gradient descent with two step-size choices:
- one that converges smoothly
- one that oscillates or diverges
- Explain what you learn about “training” as an algorithm, not just as a library call.
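The two step-size regimes can be seen in a few lines. This sketch uses a one-parameter least-squares objective; the particular step sizes (0.1 and 1.2) are illustrative assumptions chosen to produce one smooth run and one divergent run on this data:

```python
# Gradient descent on f(w) = mean((w*x - y)^2) with two step sizes:
# one that converges smoothly, one that diverges. Synthetic data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(0, 0.1, 200)        # true slope is 3

def grad(w):
    # d/dw of mean((w*x - y)^2)
    return float(np.mean(2 * (w * x - y) * x))

def run(step, iters=50):
    w, path = 0.0, []
    for _ in range(iters):
        w -= step * grad(w)
        path.append(w)
    return path

smooth = run(step=0.1)    # small step: settles near 3
wild   = run(step=1.2)    # too-large step: overshoots and diverges
print("smooth final w:", smooth[-1])
print("wild   final w:", wild[-1])
```

Plotting both paths against the iteration count makes the "training as an algorithm" point vivid: the loss surface did not change between runs, only the update rule's step size did.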
12.2.4 Chapter 4: Representations and dimensionality reduction
- Start with a dataset that has at least 10 features (or construct one).
- Use a dimensionality reduction method (PCA is enough) to create a 2D or 3D representation.
- Explain what structure becomes visible in the reduced space and what is lost.
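A minimal PCA sketch via the SVD, on a synthetic 10-feature dataset whose true structure is 2-dimensional by construction (the latent/mixing setup is an illustrative assumption):

```python
# PCA via SVD: 300 points, 10 observed features, 2 latent dimensions.
import numpy as np

rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 2))            # hidden 2D structure
mixing = rng.normal(size=(2, 10))             # spread across 10 features
X = latent @ mixing + rng.normal(0, 0.05, (300, 10))

Xc = X - X.mean(axis=0)                       # centre before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                             # 2D representation
explained = (S ** 2) / np.sum(S ** 2)
print("variance explained by 2 components:", explained[:2].sum())
```

Because the data were built from two latent directions, the first two components capture almost everything; on real data the "what is lost" part of the prompt is where the interesting writing happens.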
12.2.5 Chapter 5: Uncertainty
- Choose a prediction task where uncertainty matters (weather proxy, demand, risk score, sensor measurement).
- Produce:
- a point prediction
- an uncertainty statement (interval, distribution, or calibrated probability)
- Write down at least two reasons your uncertainty could be miscalibrated.
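One concrete way to pair a point prediction with an uncertainty statement is a bootstrap interval. The sensor-reading setup below is an illustrative assumption:

```python
# Point prediction plus a 95% bootstrap interval for a mean,
# using synthetic "sensor" readings.
import numpy as np

rng = np.random.default_rng(3)
readings = 20.0 + rng.normal(0, 1.5, 60)      # e.g. repeated noisy samples

point = readings.mean()                        # point prediction
boots = [rng.choice(readings, readings.size).mean() for _ in range(2000)]
lo, hi = np.percentile(boots, [2.5, 97.5])     # 95% bootstrap interval
print(f"point: {point:.2f}, 95% interval: [{lo:.2f}, {hi:.2f}]")
```

The bootstrap itself illustrates one source of miscalibration for your write-up: it resamples the data you have, so it cannot see biases shared by every reading (drift, a miscalibrated instrument).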
12.2.6 Chapter 6: Sequences and latent state
- Take a time series (or simulate one) with noise and a slowly changing trend.
- Compare:
- a smoothing/filtering approach
- a model-based state estimation approach (Kalman-style, even in 1D)
- Interpret the difference between “noise removal” and “state inference”.
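A minimal version of the comparison, assuming a random-walk state and Gaussian observation noise; the window size, noise variances, and filter gains below are illustrative choices:

```python
# Moving average (smoothing) vs a 1D Kalman-style filter (state inference)
# on a noisy series with a slowly drifting true state. Synthetic data.
import numpy as np

rng = np.random.default_rng(4)
n = 200
state = np.cumsum(rng.normal(0, 0.05, n))      # slowly drifting true state
obs = state + rng.normal(0, 0.5, n)            # noisy observations

# moving average: pure noise removal, no model of the state
window = 10
ma = np.convolve(obs, np.ones(window) / window, mode="same")

# 1D Kalman filter: explicit state estimate with its own uncertainty
q, r = 0.05 ** 2, 0.5 ** 2                     # process and observation variances
x_hat, p = 0.0, 1.0
kf = []
for z in obs:
    p += q                                     # predict: uncertainty grows
    k = p / (p + r)                            # Kalman gain
    x_hat += k * (z - x_hat)                   # update toward the observation
    p *= (1 - k)                               # uncertainty shrinks after update
    kf.append(x_hat)
kf = np.array(kf)

rmse_ma = float(np.sqrt(np.mean((ma - state) ** 2)))
rmse_kf = float(np.sqrt(np.mean((kf - state) ** 2)))
print("moving-average RMSE:", rmse_ma)
print("Kalman filter  RMSE:", rmse_kf)
```

The interpretive point: the moving average only ever averages observations, while the filter carries a belief about the state (`x_hat`, `p`) and weighs each new observation against it.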
12.2.7 Chapter 7: Information and objectives
- For a classification task, compute and interpret:
- accuracy
- log loss / cross-entropy
- Construct a scenario where accuracy looks good but cross-entropy reveals a problem (overconfident wrong predictions are the classic example).
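The classic scenario takes only a few lines to construct. The labels and probabilities below are toy values chosen so that both classifiers have identical accuracy but very different log loss:

```python
# Same accuracy, different cross-entropy: overconfidence is punished.
import numpy as np

y = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])    # true labels

# both make the same class calls (9/10 correct), differing only in confidence
p_calibrated    = np.array([.8, .7, .9, .8, .2, .3, .1, .2, .4, .2])
p_overconfident = np.array([.99, .99, .99, .99, .01, .01, .01, .01, .01, .01])

def accuracy(y, p):
    return float(np.mean((p >= 0.5) == y))

def log_loss(y, p):
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

print("calibrated    acc:", accuracy(y, p_calibrated),
      " log loss:", log_loss(y, p_calibrated))
print("overconfident acc:", accuracy(y, p_overconfident),
      " log loss:", log_loss(y, p_overconfident))
```

One confidently wrong prediction (`.01` on a true positive) dominates the overconfident model's cross-entropy, even though accuracy cannot tell the two models apart.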
12.2.8 Chapter 8: Deep learning as composition
- Train a small neural network on a simple problem (even synthetic):
- XOR-like classification
- a nonlinear regression curve
- Track training and validation behaviour.
- Explain overfitting in terms of representation capacity and optimisation.
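A minimal sketch of the training part, using plain NumPy on the XOR variant of the prompt. The architecture (one hidden layer of 8 tanh units), learning rate, and iteration count are illustrative assumptions:

```python
# A tiny two-layer network trained on XOR with full-batch gradient descent.
import numpy as np

rng = np.random.default_rng(5)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])          # XOR targets

# one hidden layer of 8 tanh units, sigmoid output
W1 = rng.normal(0, 1.0, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1.0, (8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(5000):
    h = np.tanh(X @ W1 + b1)                    # forward pass
    p = sigmoid(h @ W2 + b2)
    # backprop of binary cross-entropy through sigmoid and tanh
    dz2 = (p - y) / len(X)
    dW2 = h.T @ dz2; db2 = dz2.sum(0)
    dh = (dz2 @ W2.T) * (1 - h ** 2)
    dW1 = X.T @ dh;  db1 = dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

pred = sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2)
print("predictions:", pred.ravel().round(2))
```

For the full prompt, hold out a validation set (meaningless for 4-point XOR, essential for the regression-curve variant) and log both losses per step; the gap between the two curves is the overfitting story the chapter asks you to tell.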
12.3 A process-data variant (for labs and experiments)
If you have access to laboratory-style data (a setting closer to engineering courses such as CH E 358), adapt any prompt by adding:
- a measurement story (what sensors measure, what they miss)
- a data-quality story (drift, missingness, outliers)
- a design-of-experiments story (what you can control, what you can randomise)
Doing that turns “machine learning” into “learning under constraints”, which is the more honest and transferable skill.
12.4 Next step: practise on free data
If you want a concrete real-world practice track that combines physical and human geography with long-running time series and event structure, continue to the Further Enquiry Playbook. It proposes an Alberta-first, station-based wildfire smoke project (PM2.5) and several follow-on variants.