I work in close proximity to a lot of AI/ML/big data engineers and have friends who are of the same variety. I somewhat understand the basic concepts of a neural network and I have some knowledge of vectors etc. but I’m tired of being the one left confused when having conversations… So, my plan is to get stuck in a little.. at least to train something and to work my way into a problem enough that I get confused.. and then end up with more of an understanding/appreciation as a result of it.
I had a quick look at some tutorials, the MNIST handwritten digit recognition being a common one. BUT, I didn’t like the amount of hand-holding, e.g.:
from sklearn.datasets import load_digits; digits = load_digits()
(x_train, y_train), (x_test, y_test) = mnist.load_data()
So I’m going to work through several different tutorials:
The goal is to write down my version, with no stones left unturned.
I found I used AI, ML, and other phrases arbitrarily to just talk about the whole thing. But, breaking this down so I’m a bit more precise:
I’d heard a lot about various tools, but didn’t know exactly what they were, so:
So I wanted a little bit of each of these layers, so I decided to run Jupyter Lab and Airflow inside docker.
I wanted to expose Airflow to Jupyter, so my DAGs could be written/triggered from the notebook.
I’m always conscious about installing anything on my MacBook.. I can count the number of applications installed via Homebrew on one hand. I generally use devcontainers for absolutely everything - but this doesn’t really work for interacting with the MacBook’s Metal GPU. After some thought (and discussion), I decided to use common AI/ML tooling on top of Docker on another host.
This will be more similar to “real life” (or at least from my point of view within BigCorp) since everything is hosted in the cloud.
Setting up a basic docker-compose looks like:
version: "3.9"
services:
  postgres:
    image: postgres:15
    container_name: airflow_postgres
    restart: always
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
  airflow:
    image: apache/airflow:2.8.3-python3.11
    container_name: airflow
    restart: always
    environment:
      # Set UID to align with Jupyter
      AIRFLOW_UID: 50000
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__CORE__LOAD_EXAMPLES: False
      AIRFLOW__CORE__DAGS_FOLDER: /opt/airflow/workspace/dags
      # AIRFLOW__WEBSERVER__AUTH_BACKEND: "airflow.api.auth.backend.basic_auth"
      AIRFLOW__WEBSERVER__SECRET_KEY: "supersecret"
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
    user: "${AIRFLOW_UID:-50000}"
    volumes:
      - ./workspace:/opt/airflow/workspace  # Shared DAGs/workspace
      - airflow_logs:/opt/airflow/logs      # Airflow logs persisted
    ports:
      - "8080:8080"  # Airflow web UI
    command: >
      bash -c "airflow db init &&
      airflow scheduler &
      airflow webserver"
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           capabilities: [gpu]  # Optional GPU passthrough
  jupyter:
    image: jupyter/datascience-notebook:latest
    container_name: jupyterlab
    restart: always
    environment:
      # Align UID with Airflow, just to help with file permissions
      NB_UID: 50000
      JUPYTER_ENABLE_LAB: "yes"
    volumes:
      - ./workspace:/home/jovyan/work  # Shared workspace
    ports:
      - "8887:8888"
    command: start-notebook.sh --NotebookApp.token='' --NotebookApp.password=''
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           capabilities: [gpu]  # Optional GPU passthrough
volumes:
  airflow_logs:
  postgres_data:
The first thing that we need to do is look at the problem we’re trying to solve. I’ll do this only briefly, because I first need to learn the workflow before I can understand how to analyse the problem properly (meaning this would be more of a phase in “day two” projects).
But, the brief idea is that we have hand-drawn images, and we will manipulate each image a bit (downsample to a lower resolution, presumably convert to 1-bit colour pixels). I imagine each of the pixels will end up being an “input” of the model. We’ll have some hidden-layer magic and then the output will be a translation of the interpreted number.
Therefore we can provide the model with an image and it will tell us the number.
The raw source of the data is here: http://yann.lecun.com/exdb/mnist/, but appears to now be empty. From archive.org, we can see 4 files:
These files are stored in IDX format:
0x08 -> unsigned byte
0x09 -> signed byte
0x0B -> short (2 bytes)
0x0C -> int (4 bytes)
0x0D -> float (4 bytes)
0x0E -> double (8 bytes)
The two types of files are described as:
TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
[offset] [type] [value] [description]
0000 32 bit integer 0x00000801(2049) magic number (MSB first)
0004 32 bit integer 60000 number of items
0008 unsigned byte ?? label
0009 unsigned byte ?? label
........
xxxx unsigned byte ?? label
The labels values are 0 to 9.
TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
[offset] [type] [value] [description]
0000 32 bit integer 0x00000803(2051) magic number
0004 32 bit integer 60000 number of images
0008 32 bit integer 28 number of rows
0012 32 bit integer 28 number of columns
0016 unsigned byte ?? pixel
0017 unsigned byte ?? pixel
........
xxxx unsigned byte ?? pixel
Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
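One detail the excerpt doesn’t spell out (but the IDX spec does): the magic number itself encodes the payload - the third byte is the type code from the table above, and the fourth byte is the number of dimensions. A tiny helper (hypothetical, for illustration) makes this visible:

```python
def decode_magic(magic):
    """Split an IDX magic number into (element type code, number of dimensions)."""
    type_code = (magic >> 8) & 0xFF  # third byte: element type (0x08 = unsigned byte)
    ndim = magic & 0xFF              # fourth byte: dimensionality
    return type_code, ndim

print(decode_magic(0x00000801))  # labels file: (0x08, 1) -> ubyte, 1-D
print(decode_magic(0x00000803))  # images file: (0x08, 3) -> ubyte, 3-D
```

So the label file’s 0x00000801 says “unsigned bytes, one dimension”, and the image file’s 0x00000803 says “unsigned bytes, three dimensions”.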
Meaning that for image files, we can read the header, obtain the number of images, the rows and columns and then data for each pixel. For label files, we extract the number of labels and then the data for each label.
Some code to do this would look like:
import numpy as np
import struct

def load_mnist_images(filename):
    with open(filename, 'rb') as f:
        magic, num, rows, cols = struct.unpack(">IIII", f.read(16))
        data = np.frombuffer(f.read(), dtype=np.uint8)
        return data.reshape(num, rows, cols)

def load_mnist_labels(filename):
    with open(filename, 'rb') as f:
        magic, num = struct.unpack(">II", f.read(8))
        return np.frombuffer(f.read(), dtype=np.uint8)

images = load_mnist_images("train-images-idx3-ubyte")
labels = load_mnist_labels("train-labels-idx1-ubyte")
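Before trusting this parsing on the real files, the same logic can be sanity-checked on a tiny hand-built IDX payload (a hypothetical 2-image, 3x3 example; all names here are invented for illustration):

```python
import struct
import numpy as np

# Build a fake image file: header (magic, count, rows, cols) + pixel bytes
num, rows, cols = 2, 3, 3
pixels = bytes(range(num * rows * cols))  # 18 pixels, values 0..17
payload = struct.pack(">IIII", 0x00000803, num, rows, cols) + pixels

# Parse it the same way load_mnist_images does, minus the file handle
magic, n, r, c = struct.unpack(">IIII", payload[:16])
parsed = np.frombuffer(payload[16:], dtype=np.uint8).reshape(n, r, c)

print(parsed.shape)     # (2, 3, 3)
print(parsed[1, 0, 0])  # 9 - the first pixel of the second image
```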
I was able to find a copy of the files here: https://github.com/hamlinzheng/mnist/tree/master/dataset
So, within Jupyter, I started a terminal and cloned this into the home directory (outside of work).
At least for large datasets, extracting and converting the data into a format that’s more readily readable by the ML libraries is beneficial; it makes all development (and training runs) more efficient.
There’s a couple of different formats:
- Raw IDX files
- NumPy (.npy / .npz) - loadable with np.load, fast for small datasets
- PyTorch (.pt / .pth)
- HDF5 (.h5) - via h5py, slightly more complex API
- TFRecord (TensorFlow)
- WebDataset / LMDB
- Parquet / Arrow
After looking through these, HDF5 felt like the best middle ground, used in production environments, but not tied to a particular framework (especially since we’ll be switching later!).
So, to tie this together, we’ll create a notebook that imports the dataset, inspects it and saves it as HDF5. We’ll then create a DAG to download the dataset, and then convert it.
We create a simple notebook to interpret and dump the data:
import os
import gzip
import numpy as np
import struct
import h5py

def load_mnist_images(filename):
    with gzip.open(filename, 'rb') as f:
        _, num, rows, cols = struct.unpack(">IIII", f.read(16))
        data = np.frombuffer(f.read(), dtype=np.uint8)
        return data.reshape(num, rows, cols)

def load_mnist_labels(filename):
    with gzip.open(filename, 'rb') as f:
        _, num = struct.unpack(">II", f.read(8))
        return np.frombuffer(f.read(), dtype=np.uint8)

# Load images
images = load_mnist_images(os.path.join(SOURCE_DATA_DIRECTORY, "train-images-idx3-ubyte.gz"))
# Load labels
labels = load_mnist_labels(os.path.join(SOURCE_DATA_DIRECTORY, "train-labels-idx1-ubyte.gz"))

# Save as HDF5
with h5py.File("mnist.h5", "w") as f:
    f.create_dataset("images", data=images, compression="gzip")
    f.create_dataset("labels", data=labels, compression="gzip")
Let’s break this down:
magic, num, rows, cols = struct.unpack(">IIII", f.read(16)): struct.unpack takes binary data and interprets it according to a format string. We’re providing a format of >IIII, meaning “big-endian” (as per the data format spec) followed by four unsigned integers (reference). We pass it the first 16 bytes from the file handle, then unpack the returned tuple into one ignored value (the magic number) and the number of images, rows and columns.
np.frombuffer(f.read(), dtype=np.uint8): np.frombuffer takes binary and returns an ‘ndarray’, which is an array representing a “multi-dimensional, homogeneous array of fixed-size items”. At this point, we’ve passed it a load of data and given it a data type, so realistically all it has done is split the buffer into one big flat array of uint8s.
You can see that for labels, this is where we stop, because the labels are just a flat 1-D array. But for the image data, we run data.reshape(num, rows, cols), which provides the ndarray with the context of the data’s structure: we create dimensions for the number of images and the rows and columns of pixels, and all of the data is now indexable via these new dimensions.
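To make that concrete, here’s a toy run (hypothetical 2x2 “images”) showing how the flat uint8 buffer becomes indexable by image, row and column:

```python
import numpy as np

flat = np.frombuffer(bytes([1, 2, 3, 4, 5, 6, 7, 8]), dtype=np.uint8)
print(flat)           # [1 2 3 4 5 6 7 8] - one long 1-D array

imgs = flat.reshape(2, 2, 2)  # 2 images of 2 rows x 2 columns
print(imgs[0])        # first "image": [[1 2] [3 4]]
print(imgs[1, 0, 1])  # image 1, row 0, column 1 -> 6
```

Note that reshape doesn’t copy anything; it just reinterprets the same buffer with new dimensions.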
Next, let’s take a look at what our data looks like…
This seems to be where Jupyter notebooks sort of shine… now we have images and labels, we should take a look to see what they’re actually made up of. We can add a very basic:
print(images)
print(labels)
and modify this to interact with them however we wish without having to reread/process the files.
 
So, giving that a go: Let’s take a quick look at labels:
import matplotlib.pyplot as plt
plt.figure(figsize=(8,5))
plt.hist(labels, bins=30, color='skyblue', edgecolor='black')
plt.title("Distribution of Label data")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
Nice and easy: 
We can see a relatively even distribution for the data for each number.
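The histogram can also be cross-checked numerically: np.bincount counts the occurrences of each value directly (shown here on a small made-up label array, since the real one lives in the notebook):

```python
import numpy as np

toy_labels = np.array([0, 1, 1, 2, 2, 2, 9], dtype=np.uint8)

# One count per digit 0-9; minlength guarantees all ten buckets appear
counts = np.bincount(toy_labels, minlength=10)
print(counts)  # [1 2 3 0 0 0 0 0 0 1]
```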
How about inspecting some of the images:
import matplotlib.pyplot as plt
idx = 0 # first image
image = images[idx]
label = labels[idx]
# Display the image
plt.figure(figsize=(4,4))
plt.imshow(image, cmap='gray') # grayscale colormap
plt.title(f"Label: {label}")
plt.axis('off') # turn off axis
plt.show()

At this point, I’m not entirely sure if using Airflow here will be overkill… but I want to understand how this fits into a more “real” setup, not just a notebook, so let’s give it a go.
We’ll start with a small DAG pipeline that can process what we have done so far:
We’ll create a slightly more dynamic script for performing the conversion:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
import os
import urllib.request
import subprocess
import struct
import numpy as np
import gzip
import h5py

RAW_DIR = "/opt/airflow/dags/data/raw"
PROCESSED_DIR = "/opt/airflow/dags/data/processed"

MNIST_URLS = {
    "train-images-idx3-ubyte.gz": "https://github.com/hamlinzheng/mnist/raw/refs/heads/master/dataset/train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz": "https://github.com/hamlinzheng/mnist/raw/refs/heads/master/dataset/train-labels-idx1-ubyte.gz",
}

def download_data():
    os.makedirs(RAW_DIR, exist_ok=True)
    for filename, url in MNIST_URLS.items():
        filepath = os.path.join(RAW_DIR, filename)
        if not os.path.exists(filepath):
            print(f"Downloading {filename}...")
            urllib.request.urlretrieve(url, filepath)
        else:
            print(f"{filename} already exists, skipping.")

def load_images(path):
    with gzip.open(path, 'rb') as f:
        _, num, rows, cols = struct.unpack(">IIII", f.read(16))
        data = np.frombuffer(f.read(), dtype=np.uint8)
        return data.reshape(num, rows, cols)

def load_labels(path):
    with gzip.open(path, 'rb') as f:
        _, num = struct.unpack(">II", f.read(8))
        return np.frombuffer(f.read(), dtype=np.uint8)

def convert_data():
    os.makedirs(PROCESSED_DIR, exist_ok=True)
    images = load_images(os.path.join(RAW_DIR, "train-images-idx3-ubyte.gz"))
    labels = load_labels(os.path.join(RAW_DIR, "train-labels-idx1-ubyte.gz"))
    output_path = os.path.join(PROCESSED_DIR, "mnist.h5")
    with h5py.File(output_path, "w") as f:
        f.create_dataset("images", data=images, compression="gzip")
        f.create_dataset("labels", data=labels, compression="gzip")
    print(f"Saved dataset to {output_path}")

with DAG(
    dag_id="mnist_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
    tags=["ml", "mnist"],
) as dag:
    download_task = PythonOperator(
        task_id="download_mnist",
        python_callable=download_data,
    )
    convert_task = PythonOperator(
        task_id="convert_to_hdf5",
        python_callable=convert_data,
    )
    download_task >> convert_task
Then execute using the following:
import requests
import json

AIRFLOW_URL = "http://airflow:8080/api/v1"
DAG_ID = "mnist_pipeline"
USERNAME = "airflow"
PASSWORD = "airflow"

payload = {
    "conf": {}
}

response = requests.post(
    f"{AIRFLOW_URL}/dags/{DAG_ID}/dagRuns",
    auth=(USERNAME, PASSWORD),
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload)
)
print(response.status_code, response.json())

Now we have data in ./data/processed!
notes
Whilst trying to run this, I saw three issues (all fixed in the above docker-compose):
But also: the airflow image didn’t contain the required packages (h5py) and errored during startup because of the DAG code. To combat this, a simple build of a custom Docker image that just performed a pip install appeared to help… but then I ran into:
airflow | ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
Whilst unfortunate, installing h5py with --no-deps did fix it. I wouldn’t recommend it though.. probably the safest approach would have been to dump the installed packages, add h5py and reinstall the lot - at least that way pre-installed packages would have been taken into consideration as dependencies rather than steamrolled all over. So, a little Dockerfile.airflow later:
FROM apache/airflow:2.8.3-python3.11
# Install h5py
RUN pip install --no-deps h5py
and a little:
@@ -14,7 +14,9 @@
     ports:
       - "5432:5432"
   airflow:
-    image: apache/airflow:2.8.3-python3.11
+    build:
+      context: .
+      dockerfile: Dockerfile.airflow
     container_name: airflow
     restart: always
     environment:
and all starts with no errors.
And, honestly, I couldn’t be bothered to get authentication working, so just run:
docker exec airflow airflow users create --role Admin --username airflow --email airflow --firstname airflow --lastname airflow --password airflow
Blame PEBKAC or airflow docs, but :shrug: it works.
Next we’ll take a look at the basis of a neural network - my basic understanding is currently just:
Input -> Hidden Layers -> Output
We have our inputs (rows * columns * colour depth) of pixels to provide the image to the model.
Let’s first read in our data using the new format and validate it:
import h5py
import numpy as np

with h5py.File("./data/processed/mnist.h5", "r") as f:
    images = f["images"][:]
    labels = f["labels"][:]
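If you want to be sure the HDF5 round-trip is lossless, the write/read pair can first be exercised on small dummy arrays (a self-contained sketch using a temporary file, not the real mnist.h5; the toy_* names are invented here):

```python
import os
import tempfile
import numpy as np
import h5py

toy_images = np.arange(2 * 4 * 4, dtype=np.uint8).reshape(2, 4, 4)
toy_labels = np.array([3, 7], dtype=np.uint8)

# Write with the same settings used for the real dataset
path = os.path.join(tempfile.mkdtemp(), "mini.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("images", data=toy_images, compression="gzip")
    f.create_dataset("labels", data=toy_labels, compression="gzip")

# Read back and compare bit-for-bit
with h5py.File(path, "r") as f:
    loaded_images = f["images"][:]
    loaded_labels = f["labels"][:]

print((loaded_images == toy_images).all())  # True - gzip compression is lossless
```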

To be able to work with the default Jupyter image, I had to install tensorflow. The options are either to install in a terminal or to build a custom image that installs it on top. On top of this, due to cross-dependencies, I had various issues, so pinning tensorflow and numpy worked for me:
pip install tensorflow==2.13 h5py==3.10 numpy==1.24.3
Let’s first gear our data to be suitable for the inputs - we no longer care about the magical 2 dimensions of images, so we’ll have a set of 1-dimensional pixel values. Since the inputs to neurons are typically floats between 0 and 1, we’ll need to scale our pixel values into that range:
x_shaped = images.reshape(images.shape[0], -1)
X = x_shaped.astype("float32") / 255.0
y = labels.astype("int64")
The thing to note here is that images.reshape takes the outer dimension of our data (the number of images, so 60000), and then -1 tells it to “go figure the rest out”, so it effectively flattens the x/y dimensions into a single dimension of pixels. Realistically, no different to using x_shaped = images.reshape(images.shape[0], 28*28). We then convert all of the values into floats (for the input data) and divide by the maximum pixel value (255, since each pixel is one byte), translating it from 0-255 -> 0-1.
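As a quick sanity check of the flatten-and-scale step on a toy array (a hypothetical single 2x2 “image”; the toy_* names are invented here):

```python
import numpy as np

toy = np.array([[[0, 51], [102, 255]]], dtype=np.uint8)  # one 2x2 "image"

toy_flat = toy.reshape(toy.shape[0], -1)   # -1 flattens rows x cols -> (1, 4)
toy_X = toy_flat.astype("float32") / 255.0  # scale 0-255 into 0-1

print(toy_flat.shape)  # (1, 4)
print(toy_X)           # values 0.0, 0.2, 0.4, 1.0
```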
Inputs
Hidden Layers
Outputs
Bias
Weight