Training Deep Neural Networks on a GPU with PyTorch
Part 4 of “PyTorch: Zero to GANs”
This post is the fourth in a series of tutorials on building deep learning models with PyTorch, an open source neural networks library. Check out the full series:
- PyTorch Basics: Tensors & Gradients
- Linear Regression & Gradient Descent
- Classification using Logistic Regression
- Feedforward Neural Networks & Training on GPUs (this post)
- Coming soon… (CNNs, RNNs, transfer learning, GANs, etc.)
In the previous tutorial, we trained a logistic regression model to identify handwritten digits from the MNIST dataset with an accuracy of around 86%.
However, we also noticed that it’s quite difficult to improve the accuracy beyond 87%, due to the limited power of the model. In this post, we’ll try to improve upon it using a feedforward neural network. Many parts of this tutorial are inspired by the FastAI development notebooks by Jeremy Howard.
System Setup
If you want to follow along and run the code as you read, a fully reproducible Jupyter notebook for this tutorial can be found here on Jovian:
You can clone this notebook, install the required dependencies using conda, and start Jupyter by running the following commands on the terminal:
```
pip install jovian --upgrade                    # Install the jovian library
jovian clone fdaae0bf32cf4917a931ac415a5c31b0   # Download notebook
cd 04-feedforward-nn                            # Enter the created directory
jovian install                                  # Install the dependencies
conda activate 04-feedforward-nn                # Activate virtual env
jupyter notebook                                # Start Jupyter
```
On older versions of conda, you might need to run `source activate 04-feedforward-nn` to activate the virtual environment. For a more detailed explanation of the above steps, check out the system setup section in the first notebook.
Preparing the Data
The data preparation is identical to the previous tutorial. We begin by importing the required modules & classes.
We download the data and create a PyTorch dataset using the `MNIST` class from `torchvision.datasets`.
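If you’re not following along in the notebook, here’s a minimal sketch of what this step might look like:

```python
import torchvision.transforms as transforms
from torchvision.datasets import MNIST

# Download the data (if needed) and create a PyTorch dataset
dataset = MNIST(root='data/', download=True, transform=transforms.ToTensor())
```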
Next, we define and use a function `split_indices` to pick a random 20% fraction of the images for the validation set.
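The full definition lives in the notebook; one possible implementation looks something like this:

```python
import numpy as np

def split_indices(n, val_pct):
    # Size of the validation set
    n_val = int(val_pct * n)
    # Random permutation of 0..n-1
    idxs = np.random.permutation(n)
    # First n_val shuffled indices form the validation set, the rest train
    return idxs[n_val:], idxs[:n_val]

train_indices, val_indices = split_indices(len(dataset), val_pct=0.2)
```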
We can now create PyTorch data loaders for each of the subsets using a `SubsetRandomSampler`, which samples elements randomly from a given list of indices while creating batches of data.
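Sketched below, reusing `dataset`, `train_indices` and `val_indices` from above (a `batch_size` of 100 matches the batches used later):

```python
from torch.utils.data import DataLoader
from torch.utils.data.sampler import SubsetRandomSampler

batch_size = 100

# Each sampler draws elements randomly from its list of indices
train_sampler = SubsetRandomSampler(train_indices)
train_dl = DataLoader(dataset, batch_size, sampler=train_sampler)

valid_sampler = SubsetRandomSampler(val_indices)
valid_dl = DataLoader(dataset, batch_size, sampler=valid_sampler)
```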
Model
To improve upon logistic regression, we’ll create a neural network with one hidden layer. Here’s what this means:
- Instead of using a single `nn.Linear` object to transform a batch of inputs (pixel intensities) into a batch of outputs (class probabilities), we'll use two `nn.Linear` objects. Each of these is called a layer, and the model itself is called a network.
- The first layer (also known as the hidden layer) will transform the input matrix of shape `batch_size x 784` into an intermediate output matrix of shape `batch_size x hidden_size`, where `hidden_size` is a preconfigured parameter (e.g. 32 or 64).
- The intermediate outputs are then passed into a non-linear activation function, which operates on individual elements of the output matrix.
- The result of the activation function, which is also of size `batch_size x hidden_size`, is passed into the second layer (also known as the output layer), which transforms it into a matrix of size `batch_size x 10`, identical to the output of the logistic regression model.
Introducing a hidden layer and activation function allows the model to learn more complex, multi-layered and non-linear relationships between the inputs and the targets. Here’s what it looks like visually (the blue boxes represent layer outputs for a single input image):
The activation function we’ll use here is called a Rectified Linear Unit or ReLU, and it has a really simple formula: `relu(x) = max(0, x)`, i.e. if an element is negative, we replace it by 0, otherwise we leave it unchanged.
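A quick illustration of how ReLU operates element-wise, using a toy tensor (just for demonstration):

```python
import torch
import torch.nn.functional as F

t = torch.tensor([[-1.5, 0.0, 2.0],
                  [3.0, -0.5, 1.0]])
F.relu(t)
# tensor([[0., 0., 2.],
#         [3., 0., 1.]])
```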
To define the model, we extend the `nn.Module` class, just as we did with logistic regression.
We’ll create a model that contains a hidden layer with 32 activations.
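Here’s a sketch of how such a model might be defined (the class name `MnistModel` and the layer attribute names are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class MnistModel(nn.Module):
    """Feedforward neural network with one hidden layer"""
    def __init__(self, in_size, hidden_size, out_size):
        super().__init__()
        self.linear1 = nn.Linear(in_size, hidden_size)   # hidden layer
        self.linear2 = nn.Linear(hidden_size, out_size)  # output layer

    def forward(self, xb):
        # Flatten the image tensors into vectors of size 784
        xb = xb.view(xb.size(0), -1)
        # Hidden layer followed by the ReLU activation
        out = F.relu(self.linear1(xb))
        # Output layer produces one score per class
        return self.linear2(out)

input_size = 784
num_classes = 10
model = MnistModel(input_size, hidden_size=32, out_size=num_classes)
```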
Let’s take a look at the model’s parameters. We expect to see one weight and bias matrix for each of the layers.
Let’s try and generate some outputs using our model. We’ll take the first batch of 100 images from our dataset, and pass them into our model.
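A quick check, assuming the `model` and `train_dl` sketched above:

```python
# Expect one weight and one bias tensor per layer
for t in model.parameters():
    print(t.shape)

# Pass the first batch of 100 images through the model
for images, labels in train_dl:
    outputs = model(images)
    break
print('outputs.shape:', outputs.shape)  # torch.Size([100, 10])
```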
Using a GPU
As the sizes of our models and datasets increase, we need to use GPUs (graphics processing units, also known as graphics cards) to train our models within a reasonable amount of time. GPUs contain hundreds of cores that are optimized for performing expensive matrix operations on floating point numbers in a short time, which makes them ideal for training deep neural networks with many layers. You can use GPUs for free on Kaggle kernels or Google Colab, or rent GPU-powered machines on services like Google Cloud Platform, Amazon Web Services or Paperspace.
We can check if a GPU is available and the required NVIDIA drivers and CUDA libraries are installed using `torch.cuda.is_available`.
Let’s define a helper function to select a GPU as the target device if one is available, and default to using the CPU otherwise.
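One possible version of this helper:

```python
import torch

def get_default_device():
    """Pick the GPU if available, else the CPU"""
    if torch.cuda.is_available():
        return torch.device('cuda')
    return torch.device('cpu')

device = get_default_device()
```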
Next, let’s define a function that can move data to a chosen device.
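A recursive helper along these lines works for individual tensors as well as lists or tuples of tensors:

```python
def to_device(data, device):
    """Move tensor(s) to the chosen device"""
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)
```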
Finally, we define a `DeviceDataLoader` class (inspired by fastai) to wrap our existing data loaders and move batches of data to the selected device as they are accessed. Interestingly, we don’t need to extend an existing class to create a PyTorch data loader. All we need is an `__iter__` method to retrieve batches of data and a `__len__` method to get the number of batches.
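A minimal sketch of such a class, building on the `to_device` helper above:

```python
class DeviceDataLoader():
    """Wrap a data loader to move batches to a device as they are accessed"""
    def __init__(self, dl, device):
        self.dl = dl
        self.device = device

    def __iter__(self):
        """Yield each batch after moving it to the device"""
        for b in self.dl:
            yield to_device(b, self.device)

    def __len__(self):
        """Number of batches"""
        return len(self.dl)
```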
We can now wrap our data loaders using `DeviceDataLoader`.
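For example, reusing the loader names from the earlier sketches:

```python
train_dl = DeviceDataLoader(train_dl, device)
valid_dl = DeviceDataLoader(valid_dl, device)
```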
Tensors that have been moved to the GPU’s RAM have a `device` property which includes the word `cuda`. Let’s verify this by looking at a batch of data from `valid_dl`.
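For example:

```python
for xb, yb in valid_dl:
    print('xb.device:', xb.device)  # e.g. cuda:0 on a GPU machine
    break
```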
Training the Model
As with logistic regression, we can use cross entropy as the loss function and accuracy as the evaluation metric for our model. The training loop is also identical, so we can reuse the `loss_batch`, `evaluate` and `fit` functions from the previous tutorial.
The `loss_batch` function calculates the loss and metric value for a batch of data, and optionally performs gradient descent if an optimizer is provided.
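For reference, it looks roughly like this:

```python
def loss_batch(model, loss_func, xb, yb, opt=None, metric=None):
    # Calculate the loss for this batch
    preds = model(xb)
    loss = loss_func(preds, yb)

    if opt is not None:
        # Perform one step of gradient descent
        loss.backward()
        opt.step()
        opt.zero_grad()

    metric_result = None
    if metric is not None:
        # Compute the metric (e.g. accuracy)
        metric_result = metric(preds, yb)

    return loss.item(), len(xb), metric_result
```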
The `evaluate` function calculates the overall loss (and a metric, if provided) for the validation set.
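Roughly, using numpy to average the per-batch results:

```python
import numpy as np
import torch

def evaluate(model, loss_func, valid_dl, metric=None):
    # No gradients are needed during evaluation
    with torch.no_grad():
        results = [loss_batch(model, loss_func, xb, yb, metric=metric)
                   for xb, yb in valid_dl]
        losses, nums, metrics = zip(*results)
        total = np.sum(nums)
        # Average the per-batch losses, weighted by batch size
        avg_loss = np.sum(np.multiply(losses, nums)) / total
        avg_metric = None
        if metric is not None:
            avg_metric = np.sum(np.multiply(metrics, nums)) / total
    return avg_loss, total, avg_metric
```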
The `fit` function contains the actual training loop, as defined in the previous tutorials. We’ll make a couple of enhancements to the `fit` function (sketched after this list):
- Instead of defining the optimizer manually, we’ll pass in the learning rate and create an optimizer inside the function. This allows us to train the model with different learning rates, if required.
- We’ll record the validation loss and accuracy at the end of every epoch, and return the history as the output of the `fit` function.
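Putting both enhancements together, `fit` might look roughly like this:

```python
def fit(epochs, lr, model, loss_func, train_dl, valid_dl,
        metric=None, opt_fn=None):
    losses, metrics = [], []

    # Create the optimizer inside the function (plain SGD by default)
    if opt_fn is None:
        opt_fn = torch.optim.SGD
    opt = opt_fn(model.parameters(), lr=lr)

    for epoch in range(epochs):
        # Training: one step of gradient descent per batch
        for xb, yb in train_dl:
            loss, _, _ = loss_batch(model, loss_func, xb, yb, opt)

        # Evaluation: record validation loss & metric for this epoch
        val_loss, total, val_metric = evaluate(model, loss_func,
                                               valid_dl, metric)
        losses.append(val_loss)
        metrics.append(val_metric)

        if metric is None:
            print('Epoch [{}/{}], Loss: {:.4f}'
                  .format(epoch + 1, epochs, val_loss))
        else:
            print('Epoch [{}/{}], Loss: {:.4f}, {}: {:.4f}'
                  .format(epoch + 1, epochs, val_loss,
                          metric.__name__, val_metric))

    return losses, metrics
```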
We also define an `accuracy` function which calculates the overall accuracy of the model on an entire batch of outputs, so we can use it as a metric in `fit`.
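For example:

```python
def accuracy(outputs, labels):
    # The class with the highest score is the predicted label
    _, preds = torch.max(outputs, dim=1)
    return torch.sum(preds == labels).item() / len(preds)
```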
Before we train the model, we need to ensure that the data and the model’s parameters (weights and biases) are on the same device (CPU or GPU). We can reuse the `to_device` function to move the model’s parameters to the right device.
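Since `nn.Module` objects also support `.to`, the same helper works here:

```python
# Move the model's parameters to the same device as the data
to_device(model, device)
```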
Let’s see how the model performs on the validation set with the initial set of weights and biases.
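Assuming the `evaluate` and `accuracy` sketches above, with `F.cross_entropy` as the loss function:

```python
val_loss, total, val_acc = evaluate(model, F.cross_entropy,
                                    valid_dl, metric=accuracy)
print('Loss: {:.4f}, Accuracy: {:.4f}'.format(val_loss, val_acc))
```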
The initial accuracy is around 10%, which is what one might expect from a randomly initialized model (it has a 1 in 10 chance of getting a label right).
We are now ready to train the model. Let’s train for 5 epochs and look at the results. We’ll use a relatively high learning rate of 0.5.
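Using the `fit` sketch from above:

```python
losses1, metrics1 = fit(5, 0.5, model, F.cross_entropy,
                        train_dl, valid_dl, accuracy)
```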
95% is pretty good! Let’s train the model for 5 more epochs at a lower learning rate of 0.1, to further improve the accuracy.
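For example:

```python
losses2, metrics2 = fit(5, 0.1, model, F.cross_entropy,
                        train_dl, valid_dl, accuracy)
```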
We can now plot the accuracies to study how the model improves over time.
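For example, with matplotlib (assuming the history lists returned by the two training runs above):

```python
import matplotlib.pyplot as plt

# Validation accuracy after each epoch, starting from the initial value
accuracies = [val_acc] + metrics1 + metrics2
plt.plot(accuracies, '-x')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.title('Accuracy vs. No. of epochs')
plt.show()
```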
Our current model outperforms the logistic regression model (which could only reach around 86% accuracy) by a huge margin! It quickly reaches an accuracy of 96%, but doesn’t improve much beyond this. To improve the accuracy further, we need to make the model more powerful. As you can probably guess, this can be achieved by increasing the size of the hidden layer, or adding more hidden layers.
Commit and upload the notebook
As a final step, we can save and commit our work using the jovian library.
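If you’re running the notebook, something like this should work:

```python
import jovian
jovian.commit()
```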
Jovian uploads the notebook to www.jvn.io, captures the Python environment and creates a shareable link for the notebook. You can use this link to share your work and let anyone reproduce it easily with the `jovian clone` command. Jovian also includes a powerful commenting interface, so you (and others) can discuss & comment on specific parts of your notebook.
Summary and Further Reading
Here is a summary of the topics covered in this tutorial:
- We created a neural network with one hidden layer to improve upon the logistic regression model from the previous tutorial.
- We used the ReLU activation function to introduce non-linearity into the model, allowing it to learn more complex relationships between the inputs and outputs.
- We defined some utilities like `get_default_device`, `to_device` and `DeviceDataLoader` to leverage a GPU if available, by moving the input data and model parameters to the appropriate device.
- We were able to use the exact same training loop: the `fit` function we had defined earlier, to train our model and evaluate it using the validation dataset.
There’s a lot of scope to experiment here, and I encourage you to use the interactive nature of Jupyter to play around with the various parameters. Here are a few ideas:
- Try changing the size of the hidden layer, or adding more hidden layers, and see if you can achieve a higher accuracy.
- Try changing the batch size and learning rate to see if you can achieve the same accuracy in fewer epochs.
- Compare the training times on a CPU vs. GPU. Do you see a significant difference? How does it vary with the size of the dataset and the size of the model (no. of weights and parameters)?
- Try building a model for a different dataset, such as the CIFAR10 or CIFAR100 datasets.
Finally, here are some really great resources for further reading:
- A visual proof that neural networks can compute any function, also known as the Universal Approximation Theorem.
- But what is a neural network? — A visual and intuitive introduction to what neural networks are and what the intermediate layers represent
- Stanford CS229 Lecture notes on backpropagation — for a more mathematical treatment of how gradients are calculated and weights are updated for neural networks with multiple layers.
- Video lecture on activation functions — by Andrew Ng on Coursera