Neural Networks Foundation — From Theory to Practice
1 Introduction
For any machine learning or deep learning model to improve, it must learn from its own performance—specifically, by iteratively adjusting its parameters (weights) until predictions align with reality.
Stochastic Gradient Descent (SGD) is the cornerstone of this optimization process and remains one of the most widely used optimization algorithms in modern AI.
In this article, I’ll:
Build an intuitive understanding of SGD.
Walk through a simple example of gradient descent.
Apply these concepts by creating a learner from scratch and training it on the MNIST dataset.
2 The Intuition Behind Gradient Descent
Consider the case where we have a set of 2-D points p(x, y) emulating a natural phenomenon, and we want to model their behavior using a quadratic equation:
$y = a x^2 + b x + c$
To find the best values for a, b, and c, we usually would:
Choose initial (random) values for a, b, and c.
Make predictions \hat{y} for all x using these weights.
Compare predictions to the true y values and compute the error.
Figure out whether to increase or decrease a, b, and c to reduce the error.
Repeat this process until the error is as small as possible.
But how do we give the model the ability to figure out how to adjust these weights correctly?
The answer lies in gradient descent.
The idea behind gradient descent is to compute the derivative of a function (in our case, the error function) at some point (here, the current values of the weights). This tells us how steeply the error changes with respect to each parameter.
Since we are trying to minimize this error, we take iterative steps in the opposite direction of the gradient, because that is the direction of steepest descent.
$w = w - \eta \cdot \nabla L$
Where:
$\eta$ is the learning rate
$\nabla L$ is the gradient of the loss function
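To make this concrete, here is a minimal sketch that applies exactly this update rule with PyTorch to fit a, b, and c on synthetic data (the true values a = 3, b = -2, c = 1 and the learning rate are just illustrative assumptions):

```python
import torch

# Synthetic 2-D points emulating a quadratic phenomenon (illustrative values)
x = torch.linspace(-2, 2, steps=100)
y = 3 * x**2 - 2 * x + 1 + torch.randn(100) * 0.1  # true a=3, b=-2, c=1 plus noise

# 1. Start with random weights
params = torch.randn(3, requires_grad=True)  # a, b, c

lr = 0.01  # learning rate (eta)
for step in range(1000):
    a, b, c = params
    y_hat = a * x**2 + b * x + c       # 2. predictions
    loss = ((y_hat - y)**2).mean()     # 3. error (mean squared error)
    loss.backward()                    # 4. gradients of the loss w.r.t. a, b, c
    with torch.no_grad():
        params -= lr * params.grad     # 5. w = w - eta * grad(L)
        params.grad.zero_()

print(params.detach(), loss.item())    # parameters approach (3, -2, 1)
```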
2.1 Example
Let’s say our loss function is as follows:
$f(w) = (w - 5)^2$
At first glance, it is obvious that the value of w that would minimize f(w) is 5.
But let’s verify that with gradient descent.
Start with $w_0 = 0$ and a learning rate $\eta = 0.1$.
Compute the gradient: $\nabla f(w) = 2(w - 5)$.
Iteration 1: $w_1 = w_0 - \eta \cdot \nabla f(w_0) = 0 - 0.1 \times 2(0 - 5) = 1$
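The remaining iterations follow the same pattern; a few lines of Python make this easy to check:

```python
# Gradient descent on f(w) = (w - 5)**2, starting from w = 0
w, lr = 0.0, 0.1
for i in range(5):
    grad = 2 * (w - 5)   # derivative of the loss at the current w
    w = w - lr * grad    # update step: w = w - eta * grad
    print(f"w_{i+1} = {w:.3f}")
# w_1 = 1.000, w_2 = 1.800, w_3 = 2.440, w_4 = 2.952, w_5 = 3.362
```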
As we can see, each step moves us closer to the solution w = 5.
2.2 Why “Stochastic”?
In Stochastic Gradient Descent (SGD), instead of computing the gradient over the entire dataset at once, which would be computationally expensive, we use just one data point or a small random mini-batch. This makes each update much cheaper and, thanks to the added randomness, also helps the model escape local minima.
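As a quick sketch, we can compare the gradient computed over a whole (synthetic) dataset with the gradient estimated on a random mini-batch of 64 points; the hypothetical linear model and data below are only for illustration:

```python
import torch

# Hypothetical linear model y = w * x on a synthetic dataset
torch.manual_seed(0)
x = torch.randn(10_000)
y = 4 * x + torch.randn(10_000) * 0.5  # true w = 4 plus noise

w = torch.tensor(0.0)

def grad_mse(xb, yb, w):
    # Gradient of the mean squared error w.r.t. w, computed analytically
    return (2 * (w * xb - yb) * xb).mean()

full_grad = grad_mse(x, y, w)             # uses the entire dataset
idx = torch.randint(0, len(x), (64,))     # random mini-batch of 64 points
mini_grad = grad_mse(x[idx], y[idx], w)   # noisy but much cheaper estimate

print(full_grad.item(), mini_grad.item())  # similar values, very different cost
```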
```python
!pip install graphviz
from graphviz import Digraph

dot = Digraph(comment='Gradient Descent Process')
dot.edge('init', 'predict')
dot.edge('predict', 'loss')
dot.edge('loss', 'gradient')
dot.edge('gradient', 'adjust')
dot.edge('adjust', 'stop')
dot.edge('adjust', 'predict', label='repeat')

# Render to a file and open it
dot.render('gradient_descent', format='png', view=True)
# To just display in a Jupyter notebook:
dot
```
3 From Theory to Practice: Creating a Learner from scratch on MNIST
Using the MNIST dataset, a collection of handwritten digit images, I’ll apply SGD to train a neural network.
3.2 Dataset Preparation
In chapter 4 of the book, only a sample of the MNIST dataset was used, but in my case I need data for all ten digits. To find it, I checked the available URLs on the fastai documentation page.
help(untar_data)
Help on function untar_data in module fastai.data.external:
untar_data(url: 'str', archive: 'Path' = None, data: 'Path' = None, c_key: 'str' = 'data', force_download: 'bool' = False, base: 'str' = None) -> 'Path'
Download `url` using `FastDownload.get`
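For example, assuming URLs.MNIST is the entry for the full dataset (as listed in the fastai documentation), the download looks like this:

```python
from fastai.vision.all import *

# Download and extract the full MNIST dataset (all ten digits)
path = untar_data(URLs.MNIST)
path.ls()  # inspect the extracted folder structure (e.g. the training and test splits)
```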
To prepare the training set, the following steps are necessary:
Collect all training images.
Normalize pixel values to the range [0, 1].
Reshape the images into a stack of flattened vectors with shape (num_images, 28 * 28) in float32 format.
Prepare the corresponding labels (y).
3.2.2 Multi-Class Classification
Since there are multiple categories, this is no longer a binary classification problem. Two intuitive approaches arise:
Binary Classifiers per Label
Train an independent binary classifier for each label.
Issue: Fails to account for mutual exclusivity (an image could erroneously receive multiple labels).
One-Hot Encoding
Represent labels as sparse vectors (e.g., [0, 1, 0, …, 0] for label 1).
Drawback: Memory-inefficient (stores n_classes values per sample) and less intuitive for direct indexing.
Integer Labels
A simpler and more efficient approach is to use integer encoding (e.g., y = 2 for class 2). Advantages include:
Memory efficiency: Stores a single integer per sample instead of a full vector.
Debugging simplicity: Easier to interpret and index during model evaluation.
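A quick sketch of the difference between the two encodings; note that PyTorch's CrossEntropyLoss expects integer class indices directly:

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([2, 0, 9])             # integer encoding: one value per sample
one_hot = F.one_hot(labels, num_classes=10)  # sparse vectors: 10 values per sample

print(labels.shape)    # torch.Size([3])
print(one_hot.shape)   # torch.Size([3, 10])
print(one_hot[0])      # tensor([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
```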
```python
from PIL import Image
import torch
from torchvision import transforms

def get_set(path, set_type="training"):
    """Returns a set (x, y) of tensors, where x is the independent variable
    and y the dependent variable.

    Args:
        path (Path): full path to the root folder containing the images.
        set_type (str): Desired set. Values: [training, test].

    Returns:
        x(n_images, 28 * 28), y(n_images).
    """
    imgs = []
    labels = []
    for idx, digit_path in enumerate((path/set_type).ls().sorted()):
        img_paths = digit_path.ls().sorted()
        # Convert each image to a tensor
        # and scale its pixel values into the [0, 1] range
        imgs += [transforms.ToTensor()(Image.open(im_path)) for im_path in img_paths]
        # Add one label per image
        labels += [idx for _ in range(len(img_paths))]
    # Stack into tensors
    return torch.stack(imgs).float().view(-1, 28*28), torch.tensor(labels, dtype=torch.long)
```
```python
# train_x/train_y and test_x/test_y are built with get_set() defined above
train_dset = list(zip(train_x, train_y))
test_dset = list(zip(test_x, test_y))
x, y = train_dset[0]
x.shape, y
```
(torch.Size([784]), tensor(0))
3.2.3 Core Components
Now that we have built both our training and test sets into tensors, it’s time to create the necessary elements for learning:
Dataloader
A class that shuffles the dataset and creates mini-batches from it.
A mini-batch is a random selection of a few data items from the dataset, over which the loss is estimated. Using mini-batches instead of the whole dataset, or a single data item, is a compromise between computation time and model performance.
The DataLoader also provides an iterator over these mini-batches.
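As a minimal sketch of this behaviour, here is PyTorch's torch.utils.data.DataLoader iterating over a tiny toy dataset (the DataLoader used later plays the same role for our MNIST tensors):

```python
import torch
from torch.utils.data import DataLoader

# A tiny dataset of (x, y) pairs, analogous to our list(zip(train_x, train_y))
dset = [(torch.tensor([float(i)]), i % 2) for i in range(8)]

dl = DataLoader(dset, batch_size=3, shuffle=True)
for xb, yb in dl:
    # Each iteration yields one shuffled mini-batch, collated into tensors
    print(xb.shape, yb)
```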
Model
The neural network we will use for prediction. It consists of a small stack of linear layers with non-linear (ReLU) activations in between.
Optimizer
A class that handles the stochastic gradient descent step. In essence, it takes the gradients computed during the backward pass and adjusts each parameter by a step scaled by the learning rate.
Loss function
It is important to choose a loss function that reflects the impact of small changes in the weights. In the book's chapter, the mnist_loss function was used; it measures the distance between predictions and targets for binary classification. In my case, however, I am dealing with multi-class classification and need a loss function that is meaningful for it.
The standard loss function for multi-class classification is cross-entropy.
Cross-entropy measures the difference between the predicted and actual probability distributions over the set of labels in our problem. A perfect cross-entropy value is 0.
In multi-class classification, a softmax activation is combined with the cross-entropy loss: the softmax turns the model's raw outputs into a vector of predicted probabilities over the classes.
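To illustrate, here is a small sketch that computes cross-entropy by hand from raw model outputs (logits) and checks it against PyTorch's built-in loss; the logit values are arbitrary:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw model outputs for 3 classes
target = torch.tensor([0])                  # the true class index

probs = torch.softmax(logits, dim=1)        # predicted probability distribution
manual = -torch.log(probs[0, target[0]])    # cross-entropy is -log(p_true_class)

builtin = F.cross_entropy(logits, target)   # softmax + log + NLL in one call
print(probs, manual.item(), builtin.item()) # manual and built-in values match
```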
Metric
While the loss function is chosen to drive automated learning, the metric is chosen to ease human understanding.
An intuitive metric for model performance is accuracy: the number of correct predictions divided by the total number of predictions.
```python
import torch.nn as nn

# Set the seed so that the results are reproducible

# 1. DataLoader
train_dl = DataLoader(train_dset, batch_size=16, shuffle=True)
test_dl = DataLoader(test_dset, batch_size=len(test_dset))

# 2. Model
pred_model = nn.Sequential(
    nn.Linear(28*28, 30),  # The total mix of pixels can output 30 features
    nn.ReLU(),
    nn.Linear(30, 10)      # Takes 30 features as input and outputs 10 predictions
)

# 3. Optimizer
class Optimizer:
    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr

    def step(self, *args, **kwargs):
        for p in self.params:
            p.data -= p.grad.data * self.lr

    def zero_grad(self, *args, **kwargs):
        for p in self.params:
            p.grad.zero_()

opt = Optimizer(pred_model.parameters(), lr=0.01)

# 4. Loss function -- provided by PyTorch
loss_func = nn.CrossEntropyLoss()

# 5. Metric -- Accuracy
def batch_accuracy(y_pred, target):
    probs = torch.softmax(y_pred, dim=1)
    preds = probs.argmax(dim=1)
    correct = (preds == target)
    return correct.sum().float() / float(target.size(0))
```
Putting it all together:
```python
def calc_grads(x, y, model, loss_func):
    preds = model(x)
    loss = loss_func(preds, y)
    loss.backward()
    return loss

def train_epoch(model, dl, loss_func, lr):
    opt = Optimizer(model.parameters(), lr=lr)
    for x, y in dl:
        loss_value = calc_grads(x, y, model, loss_func)
        opt.step()
        opt.zero_grad()
    return round(loss_value.item(), 4)

def validate_epoch(model, dl):
    with torch.no_grad():
        accs = [batch_accuracy(model(xb), yb) for xb, yb in dl]
    return round(torch.stack(accs).mean().item(), 4)

def train_model(model, train_dl, test_dl, loss_func, lr=0.0001, epochs=4):
    for i in range(epochs):
        loss_value = train_epoch(model, train_dl, loss_func, lr)
        accuracy = validate_epoch(model, test_dl)
        print(f"| epoch {i+1:<2} | error: {loss_value:<7.4f} | accuracy: {accuracy:<7.4f} |")
```
```python
train_model(pred_model, train_dl, test_dl, loss_func, lr=0.01, epochs=20)
```
These steps can be replicated with the fastai modules as follows:
```python
dls = DataLoaders(train_dl, test_dl)
pred_model = nn.Sequential(
    nn.Linear(28*28, 30),  # The total mix of pixels can output 30 features
    nn.ReLU(),
    nn.Linear(30, 30),
    nn.ReLU(),
    nn.Linear(30, 10)      # Takes 30 features as input and outputs 10 predictions
)
learn = Learner(dls, pred_model, opt_func=SGD, loss_func=F.cross_entropy, metrics=accuracy)
learn.fit(15, 0.01)
```
| epoch | train_loss | valid_loss | accuracy | time |
|-------|------------|------------|----------|-------|
| 0 | 0.333567 | 0.342215 | 0.897300 | 00:11 |
| 1 | 0.283738 | 0.268222 | 0.922000 | 00:12 |
| 2 | 0.213385 | 0.225471 | 0.933200 | 00:17 |
| 3 | 0.184340 | 0.203401 | 0.937900 | 00:12 |
| 4 | 0.235486 | 0.183110 | 0.947000 | 00:12 |
| 5 | 0.169792 | 0.171133 | 0.951400 | 00:10 |
| 6 | 0.153068 | 0.152203 | 0.955200 | 00:09 |
| 7 | 0.158674 | 0.143634 | 0.957900 | 00:10 |
| 8 | 0.139212 | 0.135299 | 0.959600 | 00:10 |
| 9 | 0.107892 | 0.128524 | 0.961600 | 00:10 |
| 10 | 0.101543 | 0.125986 | 0.961300 | 00:09 |
| 11 | 0.081692 | 0.129580 | 0.960900 | 00:11 |
| 12 | 0.134035 | 0.114556 | 0.965000 | 00:11 |
| 13 | 0.085697 | 0.108402 | 0.967500 | 00:10 |
| 14 | 0.087118 | 0.114101 | 0.964700 | 00:10 |
4 Recap
The universal approximation theorem states that even a shallow neural network with a single hidden layer and a non-linear activation can approximate any continuous function as closely as needed. However, in practice, deeper models are often chosen for better efficiency, generalization, and training performance, as shown in our final example.
In this exercise, we walked through the complete workflow of building a learner from scratch: preparing the dataset, batching data, defining the model, optimizing with stochastic gradient descent, and evaluating performance.