Practical 2: Convolutional Neural Networks

Authors: Phillip Lippe

In this practical, we will implement variants of modern CNN architectures. Many different architectures have been proposed over the past few years. Some of the most impactful ones, and still relevant today, are the following: the GoogleNet/Inception architecture (winner of ILSVRC 2014), ResNet (winner of ILSVRC 2015), and DenseNet (best paper award at CVPR 2017). All of them were state-of-the-art models when they were proposed, and the core ideas of these networks are the foundations of most current state-of-the-art architectures. Thus, it is important to understand these architectures and learn how to implement some of them.

Let’s start with importing our standard libraries here.

[1]:
## Standard libraries
import os
import json
import math
import numpy as np
import copy

## Imports for plotting
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline
import seaborn as sns
sns.set()

## Progress bar
from tqdm.notebook import tqdm

## PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data
import torch.optim as optim

## PyTorch Torchvision
import torchvision
from torchvision.datasets import CIFAR10
from torchvision import transforms

We will use the same path variables DATASET_PATH and CHECKPOINT_PATH as before. Adjust the paths if necessary.

[2]:
# Path to the folder where the datasets are/should be downloaded (e.g. CIFAR10)
DATASET_PATH = "../data"
# Path to the folder where the pretrained models are saved
CHECKPOINT_PATH = "../saved_models/practical2"
os.makedirs(CHECKPOINT_PATH, exist_ok=True)

# Ensure that all operations are deterministic on GPU (if used) for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

As in the last practical, we use the CIFAR10 dataset and load it below:

[3]:
# Dataset statistics for normalizing the input values to zero mean and one std
DATA_MEANS = [0.491, 0.482, 0.447]
DATA_STD = [0.247, 0.243, 0.261]

test_transform = transforms.Compose([transforms.ToTensor(),
                                     transforms.Normalize(DATA_MEANS, DATA_STD)
                                     ])
# For training, we add some augmentation. Networks are too powerful and would overfit.
train_transform = transforms.Compose([transforms.RandomHorizontalFlip(),
                                      transforms.RandomResizedCrop((32,32),scale=(0.8,1.0),ratio=(0.9,1.1)),
                                      transforms.ToTensor(),
                                      transforms.Normalize(DATA_MEANS, DATA_STD)
                                     ])
# Loading the training dataset. We need to split it into a training and validation part
# We need to do a little trick because the validation set should not use the augmentation.
train_dataset = CIFAR10(root=DATASET_PATH, train=True, transform=train_transform, download=True)
val_dataset = CIFAR10(root=DATASET_PATH, train=True, transform=test_transform, download=True)
train_set, _ = torch.utils.data.random_split(train_dataset, [45000, 5000], generator=torch.Generator().manual_seed(42))
_, val_set = torch.utils.data.random_split(val_dataset, [45000, 5000], generator=torch.Generator().manual_seed(42))

# Loading the test set
test_set = CIFAR10(root=DATASET_PATH, train=False, transform=test_transform, download=True)

# Create data loaders for later. Adjust batch size if you have a smaller GPU
train_loader = data.DataLoader(train_set, batch_size=128, shuffle=True, drop_last=True, pin_memory=True, num_workers=3)
val_loader = data.DataLoader(val_set, batch_size=128, shuffle=False, drop_last=False, num_workers=3)
test_loader = data.DataLoader(test_set, batch_size=128, shuffle=False, drop_last=False, num_workers=3)
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified

PyTorch Lightning

In this practical and in many following ones, we will make use of the library PyTorch Lightning. PyTorch Lightning is a framework that simplifies the code needed to train, evaluate, and test a model in PyTorch. It also handles logging to TensorBoard, a visualization toolkit for ML experiments, and saves model checkpoints automatically with minimal code overhead on our side. This is extremely helpful for us as we want to focus on implementing different model architectures and spend as little time as possible on the surrounding boilerplate. Note that at the time of writing/teaching, the framework was at version 1.6. Future versions might have a slightly changed interface and thus might not work perfectly with the code (we will try to keep it up-to-date as much as possible).

Now, we will take the first step in PyTorch Lightning, and continue to explore the framework in our other tutorials. First, we import the library:

[4]:
# PyTorch Lightning
try:
    import pytorch_lightning as pl
except ModuleNotFoundError: # Google Colab does not have PyTorch Lightning installed by default. Hence, we do it here if necessary
    !pip install --quiet "pytorch-lightning>=1.6"
    import pytorch_lightning as pl

PyTorch Lightning comes with a lot of useful functions, such as one for setting the seed:

[5]:
# Setting the seed
pl.seed_everything(42)
Global seed set to 42
[5]:
42

Thus, in the future, we don’t have to define our own set_seed function anymore.

In PyTorch Lightning, we define pl.LightningModule’s (inheriting from torch.nn.Module) that organize our code into 5 main sections:

  1. Initialization (__init__), where we create all necessary parameters/models

  2. Optimizers (configure_optimizers) where we create the optimizers, learning rate scheduler, etc.

  3. Training loop (training_step) where we only have to define the loss calculation for a single batch (the loop of optimizer.zero_grad(), loss.backward() and optimizer.step(), as well as any logging/saving operation, is done in the background)

  4. Validation loop (validation_step) where similarly to the training, we only have to define what should happen per step

  5. Test loop (test_step) which is the same as validation, only on a test set.

Therefore, we don’t abstract the PyTorch code, but rather organize it and define some default operations that are commonly used. If you need to change something else in your training/validation/test loop, there are many possible functions you can override (see the docs for details).

Now we can look at an example of what a Lightning Module for training a CNN looks like:

[6]:
class CIFARModule(pl.LightningModule):

    def __init__(self, model_hparams, optimizer_hparams):
        """
        Inputs:
            model_hparams - Hyperparameters for the model, as dictionary.
            optimizer_hparams - Hyperparameters for the optimizer, as dictionary. This includes learning rate, weight decay, etc.
        """
        super().__init__()
        # Exports the hyperparameters to a YAML file, and creates the "self.hparams" namespace
        self.save_hyperparameters()
        # Create model
        self.create_model()
        # Example input for visualizing the graph in Tensorboard
        self.example_input_array = torch.zeros((1, 3, 32, 32), dtype=torch.float32)

    def create_model(self):
        # No need to fill this yet, we will do it later in the notebook
        # Currently this function is a placeholder to create our model
        raise NotImplementedError

    def forward(self, imgs):
        # Forward function that is run when visualizing the graph
        return self.model(imgs)

    def configure_optimizers(self):
        # Create optimizer
        optimizer = optim.SGD(self.parameters(), **self.hparams.optimizer_hparams)
        # We will reduce the learning rate by 0.1 after 70 and 90 epochs
        scheduler = optim.lr_scheduler.MultiStepLR(
            optimizer, milestones=[70, 90], gamma=0.1)
        return [optimizer], [scheduler]

    def training_step(self, batch, batch_idx):
        # "batch" is the output of the training data loader.
        imgs, labels = batch
        preds = self.model(imgs)
        loss = F.cross_entropy(preds, labels)
        acc = (preds.argmax(dim=-1) == labels).float().mean()

        # Logs the accuracy per epoch to tensorboard (weighted average over batches)
        self.log('train_acc', acc, on_step=False, on_epoch=True)
        self.log('train_loss', loss)
        return loss  # Return tensor to call ".backward" on

    def validation_step(self, batch, batch_idx):
        imgs, labels = batch
        preds = self.model(imgs).argmax(dim=-1)
        acc = (labels == preds).float().mean()
        # By default logs it per epoch (weighted average over batches)
        self.log('val_acc', acc)

    def test_step(self, batch, batch_idx):
        imgs, labels = batch
        preds = self.model(imgs).argmax(dim=-1)
        acc = (labels == preds).float().mean()
        # By default logs it per epoch (weighted average over batches), and returns it afterwards
        self.log('test_acc', acc)

We see that the code is organized and clear, which helps if someone else tries to understand your code.

Another important part of PyTorch Lightning is the concept of callbacks. Callbacks are self-contained functions that contain the non-essential logic of your Lightning Module. They are usually called after finishing a training epoch, but can also influence other parts of your training loop. For instance, we will use the following two pre-defined callbacks: LearningRateMonitor and ModelCheckpoint. The learning rate monitor logs the current learning rate to TensorBoard, which helps to verify that our learning rate scheduler works correctly. The model checkpoint callback allows you to customize the saving routine of your checkpoints, for instance how many checkpoints to keep, when to save, which metric to monitor, etc. We import them below:

[7]:
# Callbacks
from pytorch_lightning.callbacks import LearningRateMonitor, ModelCheckpoint
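
For illustration, a configuration of these callbacks could look roughly as follows (a minimal sketch with example settings; the exact configuration we use is part of the training function further below):

# Sketch: keep only the best checkpoint, selected by the highest validation accuracy
checkpoint_callback = ModelCheckpoint(save_weights_only=True,  # Store only the model weights, not the optimizer state
                                      mode="max",              # "val_acc" should be maximized
                                      monitor="val_acc",       # Metric that decides which checkpoint is best
                                      save_top_k=1)            # Number of best checkpoints to keep
lr_monitor = LearningRateMonitor(logging_interval="epoch")     # Log the learning rate once per epoch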

Besides the Lightning module, the second most important module in PyTorch Lightning is the Trainer. The trainer is responsible for executing the training steps defined in the Lightning module and completes the framework. Similar to the Lightning module, you can override any key part that you don’t want to be automated, but the default settings often reflect best practice. For a full overview, see the documentation. The most important functions we use below are:

  • trainer.fit: Takes as input a lightning module, a training dataset, and an (optional) validation dataset. This function trains the given module on the training dataset with occasional validation (default once per epoch, can be changed)

  • trainer.test: Takes as input a model and a dataset on which we want to test. It returns the test metric on the dataset.

For training and testing, we don’t have to worry about things like setting the model to eval mode (model.eval()) as this is all done automatically. See below how we define a training function for our models:

[8]:
def train_model(save_name, data_loaders, max_epochs=100, **kwargs):
    """
    Inputs:
        save_name - Name under which the checkpoints and logs of this run are stored.
        data_loaders - Dictionary with the keys "train", "val", and "test", containing the corresponding data loaders.
        max_epochs - Maximum number of epochs to train for.
        kwargs - Additional keyword arguments passed to CIFARModule (i.e. model_hparams and optimizer_hparams).
    """
    # Create a PyTorch Lightning trainer with the generation callback
    trainer = pl.Trainer(default_root_dir=os.path.join(CHECKPOINT_PATH, save_name),                          # Where to save models
                         gpus=1 if str(device)=="cuda:0" else 0,                                             # We run on a single GPU (if possible)
                         max_epochs=max_epochs,                                                              # How many epochs to train for if no patience is set
                         callbacks=[ModelCheckpoint(save_weights_only=True, mode="max", monitor="val_acc"),  # Save the best checkpoint based on the maximum val_acc recorded. Saves only weights and not optimizer
                                    LearningRateMonitor("epoch")],                                           # Log learning rate every epoch
                         check_val_every_n_epoch=10)                                                        # Frequency with which we evaluate the model
    trainer.logger._log_graph = True         # If True, we plot the computation graph in tensorboard
    trainer.logger._default_hp_metric = None # Optional logging argument that we don't need

    # Check whether pretrained model exists. If yes, load it and skip training
    pretrained_filename = os.path.join(CHECKPOINT_PATH, save_name + ".ckpt")
    if os.path.isfile(pretrained_filename):
        print(f"Found pretrained model at {pretrained_filename}, loading...")
        model = CIFARModule.load_from_checkpoint(pretrained_filename) # Automatically loads the model with the saved hyperparameters
    else:
        pl.seed_everything(42) # To be reproducible
        model = CIFARModule(**kwargs)
        trainer.fit(model, data_loaders['train'], data_loaders['val'])
        model = CIFARModule.load_from_checkpoint(trainer.checkpoint_callback.best_model_path) # Load best checkpoint after training

    # Test best model on validation and test set
    val_result = trainer.test(model, data_loaders['val'], verbose=False)
    test_result = trainer.test(model, data_loaders['test'], verbose=False)
    result = {"test": test_result[0]["test_acc"], "val": val_result[0]["test_acc"]}

    return model, result

This setup simplifies our training a lot, and we recommend using PyTorch Lightning for future practicals as well.

Part 1: ResNet

In the first part of the practical, you will implement a small ResNet. The ResNet paper is one of the most cited AI papers and has been the foundation for neural networks with more than 1,000 layers. Despite its simplicity, the idea of residual connections is highly effective, as it supports stable gradient propagation through the network. Instead of modeling \(x_{l+1}=F(x_{l})\), we model \(x_{l+1}=x_{l}+F(x_{l})\), where \(F\) is a non-linear mapping (usually a sequence of NN modules like convolutions, activation functions, and normalizations). If we do backpropagation on such residual connections, we obtain:

\[\frac{\partial x_{l+1}}{\partial x_{l}} = \mathbf{I} + \frac{\partial F(x_{l})}{\partial x_{l}}\]

The bias towards the identity matrix guarantees stable gradient propagation that is less affected by \(F\) itself. Many variants of ResNet have been proposed, which mostly concern the function \(F\) or the operations applied on the sum. In this practical, we look at two of them: the original ResNet block and the Pre-Activation ResNet block. We visually compare the blocks below (figure credit - He et al.):

(Figure: the original ResNet block compared to the pre-activation ResNet block)

The original ResNet block applies a non-linear activation function, usually ReLU, after the skip connection. In contrast, the pre-activation ResNet block applies the non-linearity at the beginning of \(F\). Both have their advantages and disadvantages. For very deep networks, however, the pre-activation ResNet has been shown to perform better, as the gradient flow is guaranteed to contain the identity matrix as calculated above and is not harmed by any non-linear activation applied on top of it. In this practical, we implement a shallow version of the pre-activation ResNet block.
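
In case the figure does not render here, the two orderings can also be written out as a small sketch (using torch, nn, and F as imported at the top of the notebook), assuming a block that keeps the input shape so that the shortcut is the identity; the channel count c and the layer choices are just example values:

c = 16  # Example number of channels
conv1, conv2 = nn.Conv2d(c, c, kernel_size=3, padding=1), nn.Conv2d(c, c, kernel_size=3, padding=1)
bn1, bn2 = nn.BatchNorm2d(c), nn.BatchNorm2d(c)
x = torch.randn(4, c, 32, 32)

# Original ResNet block: the activation is applied after adding the skip connection
out_original = F.relu(bn2(conv2(F.relu(bn1(conv1(x))))) + x)

# Pre-activation ResNet block: normalization and activation come first inside F
out_preact = conv2(F.relu(bn2(conv1(F.relu(bn1(x)))))) + x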

The visualization above already shows which layers are included in \(F\). One special case we have to handle is when we want to reduce the image dimensions in terms of width and height. The ResNet block requires \(F(x_{l})\) to have the same shape as \(x_{l}\). Thus, we also need to change the dimensionality of \(x_{l}\) before adding it to \(F(x_{l})\). The original implementation used an identity mapping with stride 2 and padded additional feature dimensions with 0. However, the more common implementation is a 1x1 convolution with stride 2, as it allows us to change the feature dimensionality while being efficient in parameter and computation cost. Let's try to implement the ResNet block below.
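
As a hint for the subsampling shortcut only (the channel numbers here are arbitrary examples; the full block, including \(F\), is left for you to implement):

# A 1x1 convolution with stride 2 halves the spatial resolution and changes the number of channels
c_in, c_out = 16, 32
downsample = nn.Conv2d(c_in, c_out, kernel_size=1, stride=2)

x = torch.randn(4, c_in, 32, 32)
print(downsample(x).shape)  # torch.Size([4, 32, 16, 16])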

[ ]:
class PreActResNetBlock(nn.Module):

    def __init__(self, c_in, subsample=False, c_out=-1):
        """
        Inputs:
            c_in - Number of input features
            subsample - If True, we want to apply a stride inside the block and reduce the output shape by 2 in height and width
            c_out - Number of output features. Note that this is only relevant if subsample is True, as otherwise, c_out = c_in
        """
        super().__init__()
        # TODO: Implement the network of the pre-activation ResNet block
        raise NotImplementedError

    def forward(self, x):
        # TODO: Implement the forward pass of the Pre-Activation ResNet block
        raise NotImplementedError
[ ]:
# Testing the ResNet block
c_in = np.random.randint(low=16, high=64)
module = PreActResNetBlock(c_in=c_in, subsample=False)
module.to(device)
img = torch.randn(4, c_in, 32, 32, device=device)
out = module(img)
for i in range(len(img.shape)):
    assert out.shape[i] == img.shape[i], f'Disagreement in shapes: output={out.shape}, input={img.shape}'

c_out = np.random.randint(low=16, high=64)
module = PreActResNetBlock(c_in=c_in, subsample=True, c_out=c_out)
module.to(device)
img = torch.randn(4, c_in, 32, 32, device=device)
out = module(img)
assert out.shape[1] == c_out
assert out.shape[2] == img.shape[2]//2
assert out.shape[3] == img.shape[3]//2

Now that we have the ResNet block, let's implement the full ResNet. The overall ResNet architecture consists of stacking multiple ResNet blocks, some of which downsample the input. When talking about ResNet blocks in the whole network, we usually group them by the same output shape. Hence, if we say the ResNet has [3,3,3] blocks, it means that we have three groups of 3 ResNet blocks each, where subsampling takes place in the fourth and seventh block. The ResNet with [3,3,3] blocks on CIFAR10 is visualized below.

(Figure: ResNet with [3,3,3] blocks on CIFAR10)

The three groups operate on the resolutions \(32\times32\), \(16\times16\) and \(8\times8\), respectively. In the figure, the blocks in orange denote ResNet blocks with downsampling. In addition to these blocks, we have one initial convolution that maps the 3 color channels to the initial hidden channels, and a final linear layer after an average pooling over all features. The subsampling pattern is sketched below; with that, let's implement the network.
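
The following small sketch only prints which block in which group subsamples, under the convention that the first block of every group except the first one downsamples (how you structure _create_network is up to you):

num_blocks = [3, 3, 3]
block_counter = 0
for group_idx, group_size in enumerate(num_blocks):
    for block_idx in range(group_size):
        block_counter += 1
        # Only the first block of the second and third group reduces the resolution
        subsample = (block_idx == 0 and group_idx > 0)
        print(f"Block {block_counter} (group {group_idx}): subsample={subsample}")
# => Blocks 4 and 7 subsample, matching the description above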

[ ]:
class ResNet(nn.Module):

    def __init__(self, num_classes=10, num_blocks=[3,3,3], c_hidden=[16,24,32]):
        """
        Inputs:
            num_classes - Number of classification outputs (10 for CIFAR10)
            num_blocks - List with the number of ResNet blocks per group. The first block of each group, except the first group, uses downsampling.
            c_hidden - List with the hidden dimensionalities in the different blocks.
        """
        super().__init__()
        self.num_classes = num_classes
        self.num_blocks = num_blocks
        self.c_hidden = c_hidden
        self._create_network()
        self._init_params()

    def _create_network(self):
        # TODO: Create network with stacking blocks
        raise NotImplementedError

    def _init_params(self):
        # Fan-out focuses on the gradient distribution, and is commonly used in ResNets
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        # TODO: Implement forward of ResNet
        raise NotImplementedError
[ ]:
# Testing the ResNet
num_classes = np.random.randint(low=5, high=20)
model = ResNet(num_classes=num_classes)
model.to(device)
img = torch.randn(4, 3, 32, 32, device=device)
out = model(img)
assert len(out.shape) == 2
assert out.shape[0] == img.shape[0]
assert out.shape[1] == num_classes

Now that we have everything together, let's train the model. First, we need to define the create_model function in the CIFARModule:

[ ]:
def create_resnet(self):
    self.model = ResNet(num_classes=self.hparams.model_hparams['num_classes'],
                        num_blocks=self.hparams.model_hparams['num_blocks'],
                        c_hidden=self.hparams.model_hparams['c_hidden'])

CIFARModule.create_model = create_resnet

We call the training function below to start the training. We provide default parameters that should allow you to quickly train the model, but feel free to optimize the hyperparameters. For a final run in your report, please use 100 epochs.

[ ]:
resnet_model, resnet_results = train_model(save_name='ResNet',
                                           max_epochs=10,
                                           model_hparams={"num_classes": 10,
                                                          "c_hidden": [16,24,32],
                                                          "num_blocks": [2,2,3]},
                                           optimizer_hparams={"lr": 0.1,
                                                              "momentum": 0.9,
                                                              "weight_decay": 1e-4},
                                           data_loaders={"train": train_loader,
                                                         "val": val_loader,
                                                         "test": test_loader})
[ ]:
print(resnet_results)

Instead of just seeing the test result, it is recommended to also take a look at the TensorBoard log:

[ ]:
%load_ext tensorboard
%tensorboard --logdir ../saved_models/practical2/ResNet/

Part 2: Rotational Invariance

A common argument for CNNs over MLPs is that they have the inductive bias of being (approximately) shift invariant: if we move the image by one pixel to the left, most features remain unchanged and simply shift by one position as well. However, what about more complex transformations, like rotations? Your task in this part of the practical is to take the trained ResNet and investigate how its accuracy changes when you rotate the images. Create a plot of the validation accuracy over rotation angles (0 to 360 degrees in steps of 10 degrees). A small sketch of how to rotate a batch of images follows below.
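
Rotating a batch of image tensors can be done with torchvision's functional API, roughly as follows (a minimal sketch with dummy data; integrate it into your evaluation loop as you see fit):

from torchvision.transforms import functional as TF

imgs = torch.randn(8, 3, 32, 32)       # Dummy batch of (normalized) images
rotated = TF.rotate(imgs, angle=30.0)  # Rotate every image in the batch by 30 degrees
print(rotated.shape)                   # torch.Size([8, 3, 32, 32])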

[ ]:
@torch.no_grad()
def test_model_rotated(model, rotation_angle=0.0):
    # TODO: Evaluate model on images that have been rotated.
    # You might want to use torchvision.transforms.functional.rotate
    raise NotImplementedError
[ ]:
# TODO: Determine results
raise NotImplementedError
[ ]:
# TODO: Plot results
raise NotImplementedError

What do these results indicate? Add this plot in your report and discuss it.

Part 3: Pixel shuffling

In the first practical, we investigated how an MLP reacts to shuffling the pixels of all images with a fixed, random permutation. Now, let's repeat this experiment for CNNs. Does the CNN exploit the structural information differently than the MLP? A sketch of such a permutation transform is given below.
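
A fixed pixel permutation can, for instance, be applied as an extra transform after ToTensor (a sketch assuming 32x32 images; adapt it to your own data pipeline and remember to use the same permutation for the training, validation, and test sets):

# One fixed, random permutation of the 32*32 pixel positions, shared by all images
permutation = torch.randperm(32 * 32)

def shuffle_pixels(img):
    # img is a [C, H, W] tensor (output of transforms.ToTensor)
    c, h, w = img.shape
    return img.reshape(c, h * w)[:, permutation].reshape(c, h, w)

shuffle_transform = transforms.Compose([transforms.ToTensor(),
                                        transforms.Normalize(DATA_MEANS, DATA_STD),
                                        shuffle_pixels])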

The first step is to create datasets with a new shuffling of pixels:

[ ]:
# TODO: Create datasets and data loaders with a fixed, random shuffle of pixels
raise NotImplementedError

Next, we can start training the model. You can limit the training to a smaller number of epochs, e.g. 10.

[ ]:
# TODO: Train the model on the pixel-shuffled dataset
raise NotImplementedError
[ ]:
# TODO: Print the results and look at your tensorboard
raise NotImplementedError

Add your observations to your report and discuss what this implies for CNNs.

Conclusion

In this practical, you gained some hands-on experience with CNNs for image classification in computer vision. Your experiments should give you a good indication of why CNNs are a very popular choice for computer vision problems, but also of what they can and cannot do.